


International Journal of Computer Applications (0975 – 8887)
Volume 66 – No. 12, March 2013

A Study of Bio-inspired Algorithm to Data Clustering using Different Distance Measures

O.A. Mohamed Jafar
Research Scholar, Department of Computer Science, A.V.V.M. Sri Pushpam College (Autonomous), Poondi, Tamil Nadu, India

R. Sivakumar
Associate Professor, Department of Computer Science, A.V.V.M. Sri Pushpam College (Autonomous), Poondi, Tamil Nadu, India

ABSTRACT
Data mining is the process of extracting previously unknown and valid information from large databases. Clustering is an important data analysis and data mining method. It is the unsupervised classification of objects into clusters such that objects in the same cluster are similar and objects in different clusters are dissimilar. Data clustering is a difficult unsupervised learning problem because many factors, such as distance measures, criterion functions, and initial conditions, come into play. Many algorithms have been proposed in the literature. However, some traditional algorithms have drawbacks such as sensitivity to initialization and easy entrapment in local optima. Recently, bio-inspired algorithms such as ant colony optimization (ACO) and particle swarm optimization (PSO) have found success in solving clustering problems. These algorithms have also been used in several other real-life applications. They are global optimization techniques. Distance-based algorithms have been studied for clustering problems. This paper provides a study of the particle swarm optimization algorithm applied to data clustering using different distance measures, including Euclidean, Manhattan and Chebyshev, on well-known real-life benchmark medical data sets and an artificially generated data set. The PSO-based clustering algorithm using the Chebyshev distance measure yields better fitness values than those using the Euclidean and Manhattan distance measures.

Keywords
Data Mining, Data Clustering, Bio-inspired Algorithm, Particle Swarm Optimization, Distance Measures.

1. INTRODUCTION
Recent developments in information and computing technologies have resulted in the computerization of many applications in various business areas. Data has played a vital role in many organizations. Every organization is accumulating a large volume of data and storing it in databases. This large amount of stored data contains valuable hidden knowledge, which could be used to improve the efficiency of business performance. Discovery of knowledge from this huge volume of data is indeed a challenge. There is a need for accessing the data, sharing the data and extracting useful information and knowledge from the data. The area of knowledge discovery in databases (KDD) has arisen over the last decade to address this challenge. Data mining is one of the steps in the process of knowledge discovery. It refers to extracting or mining useful information from large data sets. It is the process of extracting previously unknown, valid and potentially useful information from large databases. There are several data mining tasks including classification, regression, time series analysis, clustering, summarization, association rules and sequence discovery [1].

Data clustering is one of the important research areas in data mining. It is a popular unsupervised classification technique which partitions an unlabeled data set into groups of similar objects. The main aim of clustering is to group sets of objects into classes such that similar objects are placed in the same cluster while dissimilar objects are in separate clusters. It is an NP-hard problem of finding groups in data sets by minimizing some measure of dissimilarity. There are a variety of clustering algorithms, including K-means [2], K-medoids [3], BIRCH [4], DBSCAN [5], CURE [6], CHAMELEON [7], CACTUS [8], CLARANS [9], and K-Harmonic means [10]. Some of these algorithms have shortcomings such as initial point sensitivity, local optimal convergence, and the inability to find global solutions of large problems with a reasonable amount of computational effort. In order to overcome these problems, many methods have been proposed.

Over the last decade, bio-inspired algorithms like PSO and ACO have found success in solving clustering problems [11]. In recent years, they have received special attention from the research community. Self-organization, cooperation, communication, and flexibility are some of the important characteristics of bio-inspired algorithms. These algorithms have been found to be robust in solving continuous optimization problems. In this paper, we study the performance of the PSO algorithm for data clustering using different distance measures, namely Euclidean, Manhattan and Chebyshev. The algorithm is tested on four well-known real-world benchmark medical data sets and an artificially generated data set.

The rest of this paper is organized as follows: In section 2, data clustering and the mathematical model of the clustering problem are described. Section 3 introduces bio-inspired algorithms and the fundamental principles of PSO. Data clustering using the PSO algorithm is presented in section 4. A brief discussion of the distance measures employed in the PSO algorithm is given in section 5. The methodology is described in section 6. In section 7, the experimental results for the data sets used in this study are presented. Finally, section 8 concludes the paper and outlines future work.


2. DATA CLUSTERING

2.1 Basic Concepts of Clustering
Clustering techniques partition the elements into clusters, so that elements within a cluster are similar to one another and dissimilar to elements in other clusters. Clustering is an example of unsupervised classification. The desirable properties of clustering algorithms are the ability to deal with different data types, discovery of clusters with arbitrary shape, scalability, the ability to deal with noise and outliers, insensitivity to the order of input records, minimal requirements for domain knowledge to determine input parameters, incorporation of user-specified constraints, interpretability and usability [1].

Clustering is an important tool for a variety of applications in artificial intelligence, bioinformatics, biology, computer graphics, computer vision, data mining, data compression, earthquake studies, image processing, image segmentation, information retrieval, machine learning, marketing, medicine, network partitioning, object recognition, pattern recognition, spatial database analysis, routing, statistics, scheduling, vector quantization and web mining [12].

2.2 Major Clustering Methods
Clustering methods can be classified into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, hard clustering methods and fuzzy clustering methods [1][12][13][14][15].

- Partitioning methods: Partitional clustering techniques partition the database into a predefined number of clusters. Given a database of 'n' objects, they attempt to determine 'k' groups which satisfy the following requirements: (1) each object must belong to exactly one group, and (2) each group must contain at least one object. Popular algorithms of this type are K-means, K-modes, K-medoids, PAM, CLARA, and CLARANS.
- Hierarchical methods: Hierarchical clustering methods create a hierarchical decomposition of the objects. They can be either agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms start with each object forming a separate group and successively merge the objects or groups that are close to one another, until all the groups are merged into one or a termination condition holds. Divisive algorithms begin with all objects in a single cluster and then split the cluster into smaller groups until each object is in its own cluster or a termination condition holds [16]. BIRCH, CURE, ROCK, and CHAMELEON are examples of these methods.
- Density-based methods: These algorithms cluster objects based on density rather than distance between objects. The data set can be divided into several subsets according to the density of the data set points, where density is defined as the number of objects in a particular neighborhood of a data object. DBSCAN, DENCLUE and OPTICS are typical density-based methods.
- Grid-based methods: These methods quantize the object space into a finite number of cells that form a grid structure, using a multi-resolution grid data structure. The best-known algorithms of this kind are STING and CLIQUE.
- Model-based methods: These algorithms hypothesize a model for each of the clusters and find the best fit of the data to the given model. They can be either partitional or hierarchical depending on the structure of the model, and they follow statistical modeling or neural-network based approaches. EM and SOM are typical model-based methods.
- Hard and fuzzy clustering methods: In hard clustering methods, each data point belongs to only one cluster. In fuzzy clustering methods, a data point can belong to more than one cluster, and associated with each point are membership grades. Membership degrees between zero and one are used; the degree of membership in a cluster depends on the closeness of the data object to the cluster center. The fuzzy c-means algorithm is the typical algorithm of this kind.

2.3 Mathematical Model of the Clustering Problem
Given 'N' objects in R^m, allocate each object to one of 'K' clusters such that the sum of squared Euclidean distances between each object and the center of its cluster is minimized. The clustering problem can be mathematically described as follows [17]:

    Minimize J(W,C) = \sum_{i=1}^{N} \sum_{j=1}^{K} w_{ij} \| x_i - c_j \|^2        (1)

    where \sum_{j=1}^{K} w_{ij} = 1                                                 (2)

    and c_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i                                  (3)

K is the number of clusters;
N is the number of objects;
m is the number of object attributes;
x_i is the location of the i-th object;
c_j is the center of the j-th cluster;
C = { C_1, … , C_K } denotes the set of K clusters;
W = [w_{ij}] denotes the N x K 0-1 assignment matrix; and
n_j denotes the number of objects belonging to cluster C_j.
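To make the objective concrete, the following is a minimal Java sketch of Eqs. (1)-(3): it recomputes each cluster center as the mean of its assigned objects and then sums the squared Euclidean distances of every object to its own cluster center. The data values and the hard assignment array are illustrative placeholders, not values taken from the paper.

    import java.util.Arrays;

    public class ClusteringObjective {
        // Squared Euclidean distance between an object and a cluster center.
        static double squaredDistance(double[] x, double[] c) {
            double sum = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - c[d];
                sum += diff * diff;
            }
            return sum;
        }

        // Eq. (3): each center is the mean of the objects assigned to it.
        static double[][] centers(double[][] data, int[] assign, int k) {
            int m = data[0].length;
            double[][] c = new double[k][m];
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                count[assign[i]]++;
                for (int d = 0; d < m; d++) c[assign[i]][d] += data[i][d];
            }
            for (int j = 0; j < k; j++)
                for (int d = 0; d < m; d++)
                    if (count[j] > 0) c[j][d] /= count[j];
            return c;
        }

        // Eqs. (1)-(2): each object belongs to exactly one cluster, so J is the
        // sum of squared distances from every object to its cluster center.
        static double objective(double[][] data, int[] assign, double[][] c) {
            double j = 0.0;
            for (int i = 0; i < data.length; i++) j += squaredDistance(data[i], c[assign[i]]);
            return j;
        }

        public static void main(String[] args) {
            double[][] data = { {1, 1}, {1.2, 0.8}, {5, 5}, {5.1, 4.9} };  // toy objects (N = 4, m = 2)
            int[] assign = { 0, 0, 1, 1 };                                  // hard 0-1 assignment (W)
            double[][] c = centers(data, assign, 2);
            System.out.println("Centers: " + Arrays.deepToString(c));
            System.out.println("J(W,C) = " + objective(data, assign, c));
        }
    }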


3. BIO-INSPIRED ALGORITHM
Bio-inspired algorithms are meta-heuristics that mimic the way nature works in order to solve optimization problems. The field of bio-inspired computation covers many algorithmic approaches inspired by processes observed in nature, such as the evolution of species, the emergent behavior of biological societies, and the functioning of the vertebrate immune system. It includes well-known techniques such as ant colony optimization (ACO) [18][19], artificial immune systems [20], artificial bee colony [21], evolutionary approaches [22], genetic algorithms [23][24], particle swarm optimization (PSO) [25], tabu search [26][27] and other methods for solving real-world problems. They overcome many limitations of traditional algorithms and have been widely accepted in science, engineering and business.

Recently, bio-inspired algorithms have attracted many researchers. They have found numerous applications in problems from data mining [28], data clustering [29], classification [30], the economic emissions load dispatch problem [31], the travelling salesman problem [32] and others.

3.1 Fundamental Principles of PSO
PSO is a population-based stochastic optimization technique, which was proposed by Kennedy and Eberhart in 1995. It is inspired by the social behavior of animals such as a flock of birds, a school of fish, or a swarm of bees [33]. It is a global search procedure where the individuals, referred to as particles, are grouped into a swarm or population.

The working principle of PSO may be described as follows. In PSO systems, the behaviors of animals are imitated by particles with certain positions and velocities in a multidimensional search space. Starting with a randomly initialized population, each particle in PSO flies through the search space and remembers the best solution it has seen. Members of a swarm communicate good positions to each other and dynamically adjust their own position and velocity based on these good positions. The velocity adjustment is based upon the experiences and historical behaviors of the particles themselves as well as their neighbors. The performance of each particle is measured according to a predefined fitness function. The particles tend to fly towards better search areas over the course of the search process. The positions of the particles are distinguished as personal best and global best.

For a D-dimensional search space, the position of the i-th particle is represented as x_i = (x_i1, x_i2, …, x_iD). A particle in a swarm is moving and has a velocity. The velocity of the i-th particle can be written as V_i = (v_i1, v_i2, …, v_iD). At each iteration, the velocity and the position of a particle are updated based on its own previous best position and the global best position in the swarm. Each particle maintains a memory of its previous best position P_i = (p_i1, p_i2, …, p_iD), called pbest. The best position among all particles in the population is represented as P_g = (p_g1, p_g2, …, p_gD), called gbest. Positions and velocities are adjusted, and the fitness function is evaluated with the new coordinates at each time step. The velocity and the position of particle i are calculated using (4) and (5) respectively [25][34]:

    v_{id}(t+1) = \omega v_{id}(t) + c_1 r_1 (p_{id} - x_{id}(t)) + c_2 r_2 (p_{gd} - x_{id}(t))        (4)

    x_{id}(t+1) = x_{id}(t) + v_{id}(t+1)                                                               (5)

where
d - dimension index, d ∈ {1, 2, …, D}
i - particle index, i ∈ {1, 2, …, n}
n - population size, i.e., the number of particles in the swarm
c_1 - cognitive component
c_2 - social component
r_1, r_2 - uniformly generated random values in the range [0,1]
v_{id} - velocity of particle i on dimension d
x_{id} - current position of particle i on dimension d
p_{id} - personal best (pbest) position of particle i on dimension d
p_{gd} - global best (gbest) position of the swarm on dimension d
ω - inertia weight
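As an illustration of Eqs. (4) and (5), the following is a minimal Java sketch of one velocity-and-position update for a single particle. The parameter values (ω = 0.9, c1 = c2 = 2) follow the settings reported later in this paper, while the position, velocity and best-position arrays are made-up placeholders.

    import java.util.Random;

    public class PsoUpdateSketch {
        public static void main(String[] args) {
            Random rng = new Random(42);
            double omega = 0.9, c1 = 2.0, c2 = 2.0;        // inertia weight and acceleration parameters
            double[] x     = { 0.5, 1.5, -0.3 };           // current position of particle i (D = 3)
            double[] v     = { 0.1, -0.2, 0.05 };          // current velocity
            double[] pBest = { 0.6, 1.0, 0.0 };            // personal best position of particle i
            double[] gBest = { 1.0, 0.8, 0.2 };            // global best position of the swarm

            for (int d = 0; d < x.length; d++) {
                double r1 = rng.nextDouble(), r2 = rng.nextDouble();   // fresh random values per dimension
                v[d] = omega * v[d]
                     + c1 * r1 * (pBest[d] - x[d])                     // cognitive component of Eq. (4)
                     + c2 * r2 * (gBest[d] - x[d]);                    // social component of Eq. (4)
                x[d] = x[d] + v[d];                                    // position update, Eq. (5)
                System.out.printf("d=%d  v=%.4f  x=%.4f%n", d, v[d], x[d]);
            }
        }
    }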
The working principle of the PSO may be described as
4. DATA CLUSTERING USING PSO ALGORITHM
Data clustering using the PSO algorithm was first introduced by Omran et al. in 2002 [35]. This algorithm uses a fixed number of clusters and uses PSO to search for the optimal centroids of these clusters. Van der Merwe and Engelbrecht [29] proposed a new approach to PSO-based data clustering. In this approach, all clusters are continuously updated and moved toward the best cluster centroids. Esmin, Pereira and de Araujo [36] proposed a different approach to clustering data using PSO.

In the context of clustering, a single particle represents the Nc cluster centroid vectors. That is, each particle x_i is constructed as follows:

    x_i = ( m_{i1}, m_{i2}, …, m_{ij}, …, m_{iNc} )

where Nc is the number of clusters to be formed and m_{ij} refers to the j-th cluster centroid vector of the i-th particle, associated with cluster C_{ij}. Therefore, a swarm represents a number of candidate clusterings for the current data vectors.

Each particle is evaluated using the following equation:

    J_e = \frac{1}{N_c} \sum_{j=1}^{N_c} \left[ \sum_{z_p \in C_{ij}} \frac{d(z_p, m_j)}{|C_{ij}|} \right]        (6)

where z_p denotes the p-th data vector, |C_{ij}| is the number of data vectors belonging to cluster C_{ij}, and d(z_p, m_j) is the distance between z_p and m_j.
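Since each particle is simply a flat vector of Nc concatenated centroids, a small decoding step is needed before distances can be computed. The following Java sketch (an illustration, not code from the paper) shows one way to slice a particle position of length Nc x m into Nc centroid vectors of dimension m.

    public class ParticleDecodeSketch {
        // Split a flat particle position (length numClusters * dim) into centroid vectors.
        static double[][] decode(double[] position, int numClusters, int dim) {
            double[][] centroids = new double[numClusters][dim];
            for (int j = 0; j < numClusters; j++)
                System.arraycopy(position, j * dim, centroids[j], 0, dim);
            return centroids;
        }

        public static void main(String[] args) {
            // Example: Nc = 2 clusters in a 3-dimensional feature space.
            double[] position = { 1.0, 2.0, 3.0, 10.0, 11.0, 12.0 };
            double[][] c = decode(position, 2, 3);
            System.out.println("Centroid 0: " + java.util.Arrays.toString(c[0]));
            System.out.println("Centroid 1: " + java.util.Arrays.toString(c[1]));
        }
    }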


Using the standard gbest PSO, data vectors can be clustered as follows:

1. Initialize each particle to contain Nc randomly selected cluster centroids.
2. For t = 1 to t_max do
   a) For each particle i do
   b) For each data vector z_p
      i) Calculate the distance d(z_p, m_{ij}) to all cluster centroids C_{ij}
      ii) Assign z_p to the cluster C_{ij} such that d(z_p, m_{ij}) = min_{c=1,…,Nc} { d(z_p, m_{ic}) }
      iii) Calculate the fitness using equation (6)
      iv) Update the global best and local best positions
      v) Update the cluster centroids using equations (4) and (5)

where t_max is the maximum number of iterations.
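Steps (i)-(iii) above amount to a nearest-centroid assignment followed by the quantization error of Eq. (6). The Java sketch below illustrates those two steps for one particle's centroids under the Euclidean distance; the data vectors and centroids are placeholder values rather than any of the data sets used in this study.

    public class AssignAndFitnessSketch {
        static double euclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        // Steps (i)-(ii): assign each data vector to its nearest centroid.
        static int[] assign(double[][] data, double[][] centroids) {
            int[] labels = new int[data.length];
            for (int p = 0; p < data.length; p++) {
                double best = Double.MAX_VALUE;
                for (int j = 0; j < centroids.length; j++) {
                    double d = euclidean(data[p], centroids[j]);
                    if (d < best) { best = d; labels[p] = j; }
                }
            }
            return labels;
        }

        // Step (iii): quantization error of Eq. (6), averaged over non-empty clusters.
        static double quantizationError(double[][] data, double[][] centroids, int[] labels) {
            int nc = centroids.length;
            double[] sum = new double[nc];
            int[] count = new int[nc];
            for (int p = 0; p < data.length; p++) {
                sum[labels[p]] += euclidean(data[p], centroids[labels[p]]);
                count[labels[p]]++;
            }
            double je = 0.0;
            for (int j = 0; j < nc; j++) if (count[j] > 0) je += sum[j] / count[j];
            return je / nc;
        }

        public static void main(String[] args) {
            double[][] data = { {1, 1}, {1.5, 2}, {8, 8}, {8, 9} };
            double[][] centroids = { {1, 1.5}, {8, 8.5} };
            int[] labels = assign(data, centroids);
            System.out.println("Labels: " + java.util.Arrays.toString(labels));
            System.out.println("Je = " + quantizationError(data, centroids, labels));
        }
    }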


5. DISTANCE MEASURES
Cluster analysis assigns a set of 'n' objects to clusters on the basis of measurements of dissimilarity between the various objects. An important component of a clustering algorithm is the distance measure between data points. The distance measure determines how the similarity of two elements is calculated. This influences the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. The following are the important properties of distance measures [37][38][39][40][41][42]:

Let d(a,b) denote the distance between points a and b.
1. Symmetry. The distance from a to b is the same as the distance from b to a. That is, d(a,b) = d(b,a).
2. Non-negativity. Distance is measured as a non-negative quantity. That is, d(a,b) ≥ 0, for every a and b.
3. Identification. The distance between a and a is zero. That is, d(a,a) = 0, for every a.
4. Definiteness. If the distance between a and b is zero, then a and b are the same. That is, d(a,b) = 0 only if a = b.
5. Triangle inequality. The length of one side of the triangle formed by any three points cannot be greater than the total length of the other two sides. That is, d(a,c) ≤ d(a,b) + d(b,c).

5.1 Euclidean Distance
The Euclidean distance between two points is the shortest possible distance between the two points. It is also called the Pythagorean metric since it is derived from the Pythagorean theorem. It is the most commonly used distance measure. It is invariant under orthogonal transformations of the variables. Many clustering algorithms have involved the use of Euclidean distances. One problem with the Euclidean distance measure is that it does not take the correlation between variables into account. Another drawback is that it does not work well for categorical variables [43].

It is also known as the L2 distance. It is the square root of the sum of squared differences between the coordinates of a pair of objects. If a = (x_1, x_2) and b = (y_1, y_2) are two points, then the Euclidean distance between a and b is given by Eq. (7).

    d(a,b) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}        (7)

If the points have 'n' dimensions, such as a = (x_1, x_2, …, x_n) and b = (y_1, y_2, …, y_n), then the Euclidean distance between a and b is given by Eq. (8).

    d(a,b) = \sqrt{(x_1 - y_1)^2 + … + (x_n - y_n)^2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}        (8)

5.2 Manhattan Distance
The Manhattan distance is also called the city block distance, taxicab norm, rectilinear distance or L1 distance. It computes the sum of absolute differences between the coordinates of a pair of objects [44]. If a = (x_1, x_2) and b = (y_1, y_2) are two points, then the Manhattan distance between a and b is given by Eq. (9).

    d(a,b) = |x_1 - y_1| + |x_2 - y_2|        (9)

If the points have 'n' dimensions, such as a = (x_1, x_2, …, x_n) and b = (y_1, y_2, …, y_n), then the Manhattan distance between a and b is given by Eq. (10).

    d(a,b) = |x_1 - y_1| + |x_2 - y_2| + … + |x_n - y_n| = \sum_{i=1}^{n} |x_i - y_i|        (10)

5.3 Chebyshev Distance
The Chebyshev distance is also called the L∞ norm, minimax approximation, Chebyshev norm or chessboard distance. It is named after Pafnuty Lvovich Chebyshev. It computes the maximum absolute difference between the coordinates of a pair of objects [45]. If a = (x_1, x_2) and b = (y_1, y_2) are two points, then the Chebyshev distance between a and b is given by Eq. (11).

    d(a,b) = max { |x_1 - y_1| , |x_2 - y_2| }        (11)

If the points have 'n' dimensions, such as a = (x_1, x_2, …, x_n) and b = (y_1, y_2, …, y_n), then the Chebyshev distance between a and b is given by Eq. (12).

    d(a,b) = max { |x_1 - y_1| , |x_2 - y_2| , … , |x_n - y_n| } = max_{i=1,2,…,n} |x_i - y_i|        (12)
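The three measures differ only in how coordinate differences are aggregated, which the following self-contained Java sketch makes explicit (the two sample points are arbitrary).

    public class DistanceMeasures {
        // Eq. (8): square root of the sum of squared coordinate differences.
        static double euclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        // Eq. (10): sum of absolute coordinate differences.
        static double manhattan(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
            return s;
        }

        // Eq. (12): maximum absolute coordinate difference.
        static double chebyshev(double[] a, double[] b) {
            double m = 0.0;
            for (int i = 0; i < a.length; i++) m = Math.max(m, Math.abs(a[i] - b[i]));
            return m;
        }

        public static void main(String[] args) {
            double[] a = { 1.0, 2.0, 3.0 };
            double[] b = { 4.0, 0.0, 3.5 };
            System.out.println("Euclidean: " + euclidean(a, b));   // sqrt(9 + 4 + 0.25)
            System.out.println("Manhattan: " + manhattan(a, b));   // 3 + 2 + 0.5
            System.out.println("Chebyshev: " + chebyshev(a, b));   // max(3, 2, 0.5)
        }
    }

Note that for any pair of points the Chebyshev distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance.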


6. METHODOLOGY
The methodology used in this paper for data clustering is the PSO algorithm studied under different distance measures, namely Euclidean, Manhattan and Chebyshev.

Case 1, Euclidean distance: The distance between the data vector x and centroid c is calculated by Eq. (13).

    d(x,c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}        (13)

Case 2, Manhattan distance: The distance between the data vector x and centroid c is calculated by Eq. (14).

    d(x,c) = \sum_{i=1}^{n} |x_i - c_i|        (14)

Case 3, Chebyshev distance: The distance between the data vector x and centroid c is calculated by Eq. (15).

    d(x,c) = max_{i=1,2,…,n} |x_i - c_i|        (15)

The quality of the PSO algorithm is measured according to the following criteria:

- The fitness of the particles is measured as the quantization error

      J_e = \frac{1}{N_c} \sum_{j=1}^{N_c} \left[ \sum_{z_p \in C_{ij}} \frac{d(z_p, m_j)}{|C_{ij}|} \right]

  where d(z_p, m_j) is the distance between data vector z_p and centroid m_j, |C_{ij}| is the number of data vectors belonging to cluster C_{ij}, and N_c is the number of clusters.

- The maximum average distance (MAD) [46] is defined in Eq. (16):

      d_max(Z_i, x_i) = max_{c=1,…,N_c} \left\{ \sum_{z_p \in C_{ij}} \frac{d(z_p, m_{ic})}{|C_{ij}|} \right\}        (16)
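The MAD criterion of Eq. (16) is simply the largest per-cluster average distance, so it can be computed from the same per-cluster sums used for the quantization error. The following is a small illustrative Java sketch; the labels, centroids and data are placeholders, and empty clusters are skipped.

    public class MadSketch {
        static double euclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        // Eq. (16): maximum over clusters of the average distance to the cluster centroid.
        static double maxAverageDistance(double[][] data, double[][] centroids, int[] labels) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int p = 0; p < data.length; p++) {
                sum[labels[p]] += euclidean(data[p], centroids[labels[p]]);
                count[labels[p]]++;
            }
            double mad = 0.0;
            for (int j = 0; j < centroids.length; j++)
                if (count[j] > 0) mad = Math.max(mad, sum[j] / count[j]);
            return mad;
        }

        public static void main(String[] args) {
            double[][] data = { {0, 0}, {0, 2}, {10, 10}, {10, 14} };
            double[][] centroids = { {0, 1}, {10, 12} };
            int[] labels = { 0, 0, 1, 1 };
            System.out.println("MAD = " + maxAverageDistance(data, centroids, labels));  // max(1.0, 2.0) = 2.0
        }
    }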
7. EXPERIMENTAL RESULTS
The main objective of this study is to assess the performance of the PSO algorithm for data clustering using different distance measures, namely Euclidean, Manhattan and Chebyshev, on well-known real-world benchmark data sets and an artificially generated data set. The performance is measured by the fitness value and the maximum average distance (MAD).

A. Parameter settings

Based on the experimental results, the algorithm performs best under the following settings. The acceleration parameters (c_1 and c_2) are set to 2. The number of particles (p) is set to 10. V_max is set to (X_max - X_min) and V_min is set to -(X_max - X_min) [25]. The inertia weight (ω) decreases from 0.9 to 0.4; that is, ω decreases linearly from 0.9 to 0.4 throughout the search process [47]. ω is set according to Eq. (17):

    ω = ω_max - \frac{ω_max - ω_min}{I_max} \cdot I        (17)

where ω_max and ω_min are the initial and final values of the weighting coefficient respectively, I_max is the maximum number of iterations, I is the current iteration number, and r_1 and r_2 are random values in the range [0,1].
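For illustration, the linearly decreasing inertia weight of Eq. (17), with the ω_max = 0.9, ω_min = 0.4 and I_max = 100 values used in this study, can be sketched in Java as follows.

    public class InertiaWeightSchedule {
        // Eq. (17): omega decreases linearly from omegaMax to omegaMin over maxIter iterations.
        static double inertia(double omegaMax, double omegaMin, int iter, int maxIter) {
            return omegaMax - (omegaMax - omegaMin) / maxIter * iter;
        }

        public static void main(String[] args) {
            int maxIter = 100;
            for (int i = 0; i <= maxIter; i += 25) {
                System.out.printf("iteration %3d: omega = %.3f%n", i, inertia(0.9, 0.4, i, maxIter));
            }
            // Prints 0.900, 0.775, 0.650, 0.525, 0.400.
        }
    }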
B. Data sets

For evaluating the PSO-based clustering algorithm, four well-known real-world benchmark data sets from the UCI machine learning repository and an artificially generated data set have been taken:

Data set 1: Blood transfusion data set, which consists of 748 instances and 2 different types characterized by 4 features. The features are recency (months since last donation), frequency (total number of donations), monetary (total blood donated in c.c.), and time (months since first donation).

Data set 2: Pima Indians diabetes data set. This data set is used to recognize diabetic patients. It consists of 768 instances which are classified into two classes consisting of 500 and 268 instances respectively. Each instance in this data set has 8 features: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (weight in kg / (height in m)^2), diabetes pedigree function, and age.


Data set 3: Liver disorders data set, which consists of 345 objects and 2 different types characterized by 6 features: mcv (mean corpuscular volume), alkphos (alkaline phosphatase), sgpt (alamine aminotransferase), sgot (aspartate aminotransferase), gammagt (gamma-glutamyl transpeptidase), and drinks (number of half-pint equivalents of alcoholic beverages drunk per day).

Data set 4: Statlog (Heart) data set, which consists of 270 objects. This data set contains 13 features: age, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results, maximum heart rate achieved, exercise induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, and thal.

Data set 5: Artificial data set. In this experimental data set there are five classes, and each class has 50 samples described by three features. Each feature of a class is distributed according to
Class 1 ~ Uniform(80,100),
Class 2 ~ Uniform(60,80),
Class 3 ~ Uniform(40,60),
Class 4 ~ Uniform(20,40), and
Class 5 ~ Uniform(1,20).

Table 1 summarizes these five data sets. For each data set, it lists the number of instances, the number of attributes and the number of clusters.
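For reproducibility, the artificial data set described above (five classes, 50 samples each, three features drawn uniformly from the stated per-class ranges) could be generated with a sketch like the following; the random seed is an arbitrary choice, not one reported in the paper.

    import java.util.Random;

    public class ArtificialDataSetSketch {
        public static void main(String[] args) {
            double[][] range = { {80, 100}, {60, 80}, {40, 60}, {20, 40}, {1, 20} };  // per-class feature ranges
            int samplesPerClass = 50, features = 3;
            double[][] data = new double[range.length * samplesPerClass][features];
            int[] labels = new int[data.length];
            Random rng = new Random(7);

            int row = 0;
            for (int c = 0; c < range.length; c++) {
                for (int s = 0; s < samplesPerClass; s++, row++) {
                    labels[row] = c;
                    for (int f = 0; f < features; f++) {
                        double lo = range[c][0], hi = range[c][1];
                        data[row][f] = lo + (hi - lo) * rng.nextDouble();   // Uniform(lo, hi)
                    }
                }
            }
            System.out.println("Generated " + data.length + " samples in " + range.length + " classes.");
        }
    }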

C. Results

The algorithm tries to minimize the fitness value. The main purpose is to compare the performance of the PSO algorithm with three different distance measures. The algorithm is implemented in Java. For our experimental tests, we used a PC with a Pentium IV CPU (3.06 GHz) and 1.97 GB RAM, with the above parameter values, a maximum of 100 iterations, 10 particles and 10 independent test runs. The best, worst, average and standard deviation of the global fitness values and the MAD are reported.

As shown in Tables 2-6 and Figs. 1-5, the PSO-based data clustering algorithm using the Chebyshev distance measure achieves better fitness values, standard deviations and MAD on all the data sets than those using the Euclidean and Manhattan distance measures. The experimental results in Table 7 and Fig. 6 also summarize the effect of varying the number of clusters for the different distance measures on the artificially generated data set. The fitness value decreases as the number of clusters increases. It is also observed that the PSO-based data clustering algorithm using the Chebyshev distance measure shows the minimum fitness value for every number of clusters compared with the other distance measures.

8. CONCLUSION
Clustering is the act of partitioning an unlabeled data set into groups of similar objects. Some traditional algorithms are sensitive to initialization and are easily trapped in local optima. On the other hand, the particle swarm optimization algorithm performs a globalized search in the entire solution space. Data clustering is a difficult problem because many factors, such as distance measures, criterion functions and initial conditions, come into play. In this paper, the particle swarm optimization algorithm is experimented with on four well-known data sets (Blood transfusion, Diabetes, Liver disorders, Statlog (Heart)) and an artificially generated data set, using different distance measures, namely Euclidean, Manhattan and Chebyshev. The algorithm achieves better fitness values with the Chebyshev distance measure than with the Euclidean and Manhattan distance measures for the data sets selected for our study. It is also observed that the Manhattan distance measure performed very poorly on all the data sets; this distance measure is poor at handling high-dimensional data.


Table 1 Description of data set used

Data set          | No. of instances | No. of features | No. of clusters
Blood Transfusion | 748              | 4               | 2
Diabetes          | 768              | 8               | 2
Liver Disorders   | 345              | 6               | 2
Statlog (Heart)   | 270              | 13              | 2
Artificial        | 250              | 3               | 5

Table 2 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of PSO algorithm on diabetes data set

Data set | Distance measure | Best     | Worst    | Average  | Std     | MAD
Diabetes | Euclidean        | 69.0695  | 109.4461 | 73.5667  | 6.2060  | 109.8465
Diabetes | Manhattan        | 116.6343 | 209.0815 | 120.8809 | 12.0974 | 163.2181
Diabetes | Chebyshev        | 58.4621  | 95.0772  | 61.7756  | 5.9170  | 88.9208

Table 3 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of PSO algorithm on blood transfusion data set

Data set          | Distance measure | Best     | Worst    | Average  | Std     | MAD
Blood Transfusion | Euclidean        | 529.9936 | 788.9989 | 536.2758 | 27.1652 | 1045.9083
Blood Transfusion | Manhattan        | 545.0133 | 676.7063 | 561.9357 | 21.5307 | 1063.4851
Blood Transfusion | Chebyshev        | 520.4634 | 583.9255 | 526.0786 | 13.8409 | 1030.1036

Table 4 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of PSO algorithm on liver disorders data set

Data set        | Distance measure | Best    | Worst    | Average | Std     | MAD
Liver Disorders | Euclidean        | 32.2174 | 66.0195  | 38.0796 | 7.1726  | 36.8059
Liver Disorders | Manhattan        | 59.7594 | 119.1086 | 84.8941 | 21.5288 | 78.5063
Liver Disorders | Chebyshev        | 25.4356 | 43.7139  | 27.3727 | 3.2734  | 32.8472


Table 5 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of PSO algorithm on statlog (heart) data set

Data set        | Distance measure | Best    | Worst    | Average | Std    | MAD
Statlog (Heart) | Euclidean        | 40.3209 | 63.0872  | 42.9776 | 4.2062 | 44.3863
Statlog (Heart) | Manhattan        | 71.9974 | 119.8488 | 75.5640 | 6.6570 | 73.7982
Statlog (Heart) | Chebyshev        | 32.9703 | 59.8969  | 33.5847 | 2.7279 | 36.7961

Table 6 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of PSO algorithm on artificial data set

Data set   | Distance measure | Best    | Worst   | Average | Std    | MAD
Artificial | Euclidean        | 9.1511  | 16.2816 | 9.7812  | 1.2507 | 26.6418
Artificial | Manhattan        | 14.9483 | 24.4094 | 15.2170 | 1.2476 | 44.4797
Artificial | Chebyshev        | 6.7890  | 10.2467 | 7.1944  | 0.6805 | 18.1727

Table 7 Fitness value for different numbers of clusters (artificial data set)

Data set   | Distance measure | 2 clusters | 3 clusters | 4 clusters | 5 clusters
Artificial | Euclidean        | 21.6930    | 17.7273    | 11.4437    | 9.1511
Artificial | Manhattan        | 37.4178    | 25.2146    | 18.8481    | 14.9483
Artificial | Chebyshev        | 14.5203    | 9.6800     | 7.2602     | 6.7890

Fig. 1 Comparison of global best fitness for 10 test runs in diabetes data set


Fig. 2 Comparison of global best fitness for 10 test runs in blood transfusion data set

Fig. 3 Comparison of global best fitness for 10 test runs in liver disorder data set


Fig. 4 Comparison of global best fitness for 10 test runs in statlog (heart) data set

Fig. 5 Comparison of global best fitness for 10 test runs in artificial data set


Fig. 6 Effect of the different number of clusters on the fitness value

9. REFERENCES
[1] Han, J., and Kamber, M. 2001. "Data mining: concepts and techniques", Morgan Kaufmann, San Francisco.
[2] MacQueen, J. 1967. "Some methods for classification and analysis of multivariate observations", In 5th Berkeley symposium on mathematics, statistics and probability, pp. 281-296.
[3] Kaufman, L., and Rousseeuw, P. 1990. "Finding groups in data: an introduction to cluster analysis", New York: John Wiley & Sons.
[4] Zhang, T., Ramakrishnan, R., and Livny, M. 1996. "BIRCH: an efficient data clustering method for very large databases", In Proceedings ACM SIGMOD international conference on management of data, pp. 103-114.
[5] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. "A density based algorithm for discovering clusters in large spatial databases with noise", In Simoudis, E., Han, J., and Fayyad, U. (Eds.), Second international conference on knowledge discovery and data mining, pp. 226-231, AAAI Press, Portland.
[6] Guha, S., Rastogi, R., and Shim, K. 1998. "CURE: an efficient clustering algorithm for large databases", In Proceedings ACM SIGMOD international conference on management of data, pp. 73-84, Seattle, USA.
[7] Karypis, G., Han, E.-H., and Kumar, V. 1999. "CHAMELEON: a hierarchical clustering algorithm using dynamic modeling", Computer, 32, pp. 32-68.
[8] Ganti, V., Gehrke, J., and Ramakrishnan, R. 1999. "CACTUS – clustering categorical data using summaries", In International conference on knowledge discovery and data mining, pp. 73-83, San Diego, USA.
[9] Ng, R., and Han, J. 2002. "CLARANS: a method for clustering objects for spatial data mining", IEEE Transactions on Knowledge and Data Engineering, 14(5), pp. 1003-1016.
[10] Gungor, Z., and Unler, A. 2007. "K-harmonic means data clustering with simulated annealing heuristic", Applied Mathematics and Computation, 184(2), pp. 199-209.
[11] Bin, W., and Zhongzhi, S. 2001. "A clustering algorithm based on swarm intelligence", In Proceedings of the international conference on Info-tech and Info-net, Beijing, China, pp. 58-66.
[12] Jain, A., Murty, M., and Flynn, P. 1999. "Data clustering: a review", ACM Computing Surveys, 31(3), pp. 264-323.
[13] Jain, A., and Dubes, R. 1998. "Algorithms for clustering data", Prentice Hall, New Jersey.
[14] Berkhin, P. 2002. "Survey of clustering data mining techniques", Technical report, Accrue Software, San Jose, California.
[15] Xu, R., and Wunsch II, D. 2005. "Survey of clustering algorithms", IEEE Transactions on Neural Networks, 16(3), pp. 645-678.
[16] Ding, C., and He, X. 2002. "Cluster merging and splitting in hierarchical clustering algorithms", IEEE international conference, pp. 139-146.
[17] Yongguo Liu, Jun Peng, Kefei Chen, and Yi Zhang. 2006. "An improved hybrid genetic clustering algorithm", SETN 2006, LNAI 3955, pp. 192-202.
[18] Bonabeau, E., Dorigo, M., and Theraulaz, G. 1999. "Swarm intelligence: from natural to artificial systems", Oxford University Press, Inc., New York.
[19] Dorigo, M., and Stutzle, T. 2004. "Ant colony optimization", MIT Press, Cambridge, Massachusetts, London, England.
[20] de Castro, L.N., and Timmis, J. 2002. "Artificial immune systems: a new computational intelligence approach", Springer, Heidelberg.
[21] Zhang, C., Ouyang, D., and Ning, J. 2010. "An artificial bee colony approach for clustering", Expert Systems with Applications, 37, pp. 4761-4767.
[22] Paterlini, S., and Minerva, T. 2003. "Evolutionary approaches for cluster analysis", In Bonarini, A., Masulli, F., and Pasi, G. (Eds.), Soft Computing Applications, Springer-Verlag, Berlin, pp. 167-178.
[23] Goldberg, D.E. 1975. "Genetic algorithms in search, optimization and machine learning", Addison-Wesley, Reading, MA.
[24] Falkenauer, E. 1998. "Genetic algorithms and grouping problems", John Wiley and Sons, Chichester.
[25] Kennedy, J., and Eberhart, R.C. 1995. "Particle swarm optimization", In Proceedings of the IEEE international joint conference on neural networks (IJCNN 95), Piscataway, IEEE Press, pp. 1942-1948.
[26] Al-Sultan, K.S. 1995. "A tabu search approach to the clustering problem", Pattern Recognition, 28, pp. 1443-1451.
[27] Gendreau, M. 2003. "An introduction to tabu search", In Kochenberger, G., and Glover, F. (Eds.), Handbook of Metaheuristics, Dordrecht, Kluwer Academic Publishers.
[28] Sousa, T., Neves, A., and Silva, A. 2003. "Swarm optimization as a new tool for data mining", In Proceedings of the 17th international symposium on parallel and distributed processing (IPDPS'03), pp. 48-53.
[29] Van der Merwe, D., and Engelbrecht, A. 2003. "Data clustering using particle swarm optimization", In Proceedings of the IEEE congress on evolutionary computation (CEC 2003), Canberra, Australia, pp. 215-220.
[30] Liping Yan and Jianchao Zeng. 2006. "Using particle swarm optimization and genetic programming to evolve classification rules", In Sixth world congress on intelligent control and automation (WCICA 2006), pp. 3415-3419.
[31] Apostolopoulos, T., and Vlachos, A. 2011. "Application of the firefly algorithm for solving the economic emissions load dispatch problem", International Journal of Combinatorics, 2011, pp. 1-23.
[32] Mustafa Servet Kiran, Hazim Iscan and Mesut Gunduz. 2012. "The analysis of discrete artificial bee colony algorithm with neighborhood operator on travelling salesman problem", Neural Computing and Applications.
[33] Poli, R., Kennedy, J., and Blackwell, T. 2007. "Particle swarm optimization – an overview", Swarm Intelligence, 1(1), pp. 33-57.
[34] Shi, Y., and Eberhart, R.C. 1998. "A modified particle swarm optimizer", In Proceedings of the IEEE congress on evolutionary computation (CEC 1998), Piscataway, NJ, pp. 69-73.
[35] Omran, M., Salman, A., and Engelbrecht, A. 2002. "Image classification using particle swarm optimization", In Wang, L., Tan, K.C., Furuhashi, T., Kim, J.-H., and Yao, X. (Eds.), Proceedings of the fourth Asia-Pacific conference on simulated evolution and learning (SEAL'02), IEEE Press, Piscataway, pp. 370-374.
[36] Esmin, A.A.A., Pereira, D.L., and de Araujo, F. 2008. "Study of different approach to clustering data by using particle swarm optimization algorithm", In IEEE congress on evolutionary computation (CEC 2008), pp. 1817-1822.
[37] Sokal, R.R. 1977. "Clustering and classification: background and current directions", In Classification and Clustering, Academic Press, pp. 155-172.
[38] Mardia, K.V., Kent, J.T., and Bibby, J.M. 1979. "Multivariate analysis", Academic Press.
[39] Seber, G.A.F. 1984. "Multivariate observations", Wiley.
[40] Mielke, P.W. 1985. "Geometric concerns pertaining to applications of statistical tests in the atmospheric sciences", Journal of Atmospheric Sciences, 42, pp. 1209-1212.
[41] Krzanowski, W.J. 1988. "Principles of multivariate analysis: a user's perspective", Oxford Science Publications.
[42] Mimmack, Gillian M., Mason, Simon J., and Galpin, Jacquelin S. 2001. "Choice of distance matrices in cluster analysis: defining regions", Journal of Climate, 14(12), pp. 2790-2797.
[43] Ertoz, L., Steinbach, M., and Kumar, V. 2003. "Finding clusters of different sizes, shapes, densities in noisy high dimensional data", In Proceedings of the third SIAM international conference on data mining (SDM 2003), volume 112, Proceedings in Applied Mathematics, Society for Industrial and Applied Mathematics.
[44] Berry, M.J.A., and Linoff, G.S. 2009. "Data mining techniques: for marketing, sales and customer relationship management", Second edition, Wiley.
[45] Bock, R.K., and Krischer, W. 1998. "The data analysis briefbook", New York: Springer-Verlag.
[46] Omran, M.G.H. 2005. "A PSO-based clustering algorithm with application to unsupervised classification", University of Pretoria ETD.
[47] Shi, Y., and Eberhart, R.C. 1999. "Empirical study of particle swarm optimization", In Proceedings of the IEEE congress on evolutionary computation (CEC 1999), Washington, D.C., pp. 1945-1949.


