A Study of Bio-Inspired Algorithm To Data Clustering Using Different Distance Measures
International Journal of Computer Applications (0975 – 8887), Volume 66, No. 12, March 2013
they attempt to determine 'k' groups, which satisfy the following requirements: (1) each object must belong to exactly one group, and (2) each group must contain at least one object. In terms of the membership indicator wij of object xi in group j, requirement (1) means

Σ_j wij = 1   (2)

and the centroid of each group Cj is computed as

cj = (1/nj) Σ_{xi ∈ Cj} xi   (3)

where nj is the number of objects in Cj. The popular algorithms of this type are K-means, K-mode, K-medoids, PAM, CLARA, and CLARANS.
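As a concrete illustration of the centroid computation in Eq. (3), the following is a minimal NumPy sketch; the function and variable names are illustrative assumptions rather than code from the paper.

    import numpy as np

    def compute_centroids(points, labels, k):
        # points: (n, d) data matrix; labels[i] is the cluster index assigned to points[i].
        # Returns a (k, d) array whose j-th row is cj = (1/nj) * sum of the xi in Cj, as in Eq. (3).
        centroids = np.zeros((k, points.shape[1]))
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
        return centroids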
Hierarchical methods: Hierarchical clustering
the evolution of species, emergent behavior of biological societies, and the functioning of the vertebrate immune system. It includes well-known techniques such as ant colony optimization (ACO) [18] [19], artificial immune systems [20], artificial bee colony [21], evolutionary approaches [22], genetic algorithms [23] [24], particle swarm optimization (PSO) [25], tabu search [26] [27] and other methods for solving real-world problems. They overcome many limitations of traditional algorithms and have been widely accepted in science, engineering and business.

Recently, bio-inspired algorithms have attracted many researchers. They have found numerous applications in solving problems from data mining [28], data clustering [29], classification [30], the economic emissions load dispatch problem [31], the travelling salesman problem [32] and others.

3.1 Fundamental Principles of PSO
PSO is a population-based stochastic optimization technique proposed by Kennedy and Eberhart in 1995. It is inspired by the social behavior of animals such as a flock of birds, a school of fish, or a swarm of bees [33]. It is a global search procedure in which the individuals, referred to as particles, are grouped into a swarm or population.

The working principle of PSO may be described as follows. In PSO systems, the behaviors of animals are imitated by particles with certain positions and velocities in a multidimensional search space. Starting with a randomly initialized population, each particle flies through the search space and remembers the best solution it has seen. Members of a swarm communicate good positions to each other and dynamically adjust their own position and velocity based on these good positions. The velocity adjustment is based upon the experiences and historical behaviors of the particles themselves as well as their neighbors. The performance of each particle is measured according to a predefined fitness function. The particles tend to fly towards better search areas over the course of the search. The positions of the particles are distinguished as personal best and global best.

For a D-dimensional search space, the position of the i-th particle is represented as xi = (xi1, xi2, …, xiD). A particle in a swarm moves and has a velocity. The velocity of the i-th particle can be written as Vi = (vi1, vi2, …, viD). At each iteration, the velocity and the position of a particle are updated based on its own previous best position and the global best position in the swarm. Each particle maintains a memory of its previous best position Pi = (pi1, pi2, …, piD), called pbest. The best position among all the particles in the population is represented as Pg = (pg1, pg2, …, pgD), called gbest. Positions and velocities are adjusted, and the fitness function is evaluated with the new coordinates at each time step. The velocity and the position of particle i are calculated using Eqs. (4) and (5) respectively [25] [34].

vid(t+1) = w vid(t) + c1 r1 (pid – xid(t)) + c2 r2 (pgd – xid(t))   (4)

xid(t+1) = xid(t) + vid(t+1)   (5)

where
d - dimension index, d ∈ {1, 2, …, D}
i - particle index, i ∈ {1, 2, …, n}
n - population size, i.e. the number of particles in the swarm
c1 - cognitive component
c2 - social component
r1, r2 - uniformly generated random values in the range [0,1]
vid - velocity of particle i on dimension d
xid - current position of particle i on dimension d
pid - personal best (pbest) position of particle i on dimension d
pgd - global best (gbest) position of the swarm on dimension d
w - inertia weight
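As an illustration of Eqs. (4) and (5), a minimal NumPy sketch of one velocity and position update is given below. The coefficient values shown are common defaults rather than settings reported in the paper.

    import numpy as np

    def pso_update(x, v, pbest, gbest, w=0.72, c1=1.49, c2=1.49, rng=np.random):
        # x, v, pbest: arrays of identical shape holding each particle's current position,
        # velocity and personal best position; gbest: the swarm's best position (broadcastable against x).
        r1 = rng.uniform(0.0, 1.0, size=x.shape)
        r2 = rng.uniform(0.0, 1.0, size=x.shape)
        v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Eq. (4)
        x_new = x + v_new                                               # Eq. (5)
        return x_new, v_new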
4. DATA CLUSTERING USING PSO ALGORITHM
Data clustering using the PSO algorithm was first introduced by Omran et al. in 2002 [35]. This algorithm uses a fixed number of clusters and uses PSO to search for the optimal centroids of these clusters. Van der Merwe and Engelbrecht [29] proposed a new approach to PSO-based data clustering. In this approach, all clusters are continuously updated and moved toward the best cluster centroid. Esmin, Pereira and de Araujo [36] proposed a different approach to clustering data using PSO.

In the context of clustering, a single particle represents the Nc cluster centroid vectors. That is, each particle xi is constructed as follows:

xi = (mi1, mi2, …, mij, …, miNc)

where Nc is the number of clusters to be formed and mij refers to the j-th cluster centroid vector of the i-th particle in cluster Cij. Therefore, a swarm represents a number of candidate clusterings for the current data vectors.

Each particle is evaluated using the following equation:

Je = (1/Nc) Σ_{j=1}^{Nc} [ Σ_{zp ∈ Cij} d(zp, mij) / |Cij| ]   (6)

where zp denotes the p-th data vector, |Cij| is the number of data vectors belonging to the cluster Cij, and d is the distance between zp and mij.
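A minimal Python sketch of the fitness in Eq. (6) is given below; the function names and the nearest-centroid assignment used to form the clusters Cij are illustrative assumptions.

    import numpy as np

    def quantization_error(particle, data, distance):
        # particle: the Nc centroid vectors (mi1, ..., miNc) carried by one particle.
        # data: (n, d) array of data vectors zp; distance: a function d(zp, mij).
        Nc = len(particle)
        clusters = [[] for _ in range(Nc)]
        for zp in data:
            j = min(range(Nc), key=lambda c: distance(zp, particle[c]))
            clusters[j].append(zp)
        total = 0.0
        for j in range(Nc):
            if clusters[j]:
                total += sum(distance(zp, particle[j]) for zp in clusters[j]) / len(clusters[j])
        return total / Nc   # Eq. (6)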
Using the standard gbest PSO, data vectors can be clustered as follows:
1. Initialize each particle to contain Nc randomly selected cluster centroids.
2. For t = 1 to tmax do
   a) For each particle i do
   b) For each data vector zp
      i) Calculate the distance d(zp, mij) to all cluster centroids Cij
      ii) Assign zp to the cluster Cij such that d(zp, mij) = min{ d(zp, mic) : c = 1, …, Nc }
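A condensed sketch of this gbest PSO clustering loop is given below. It reuses the pso_update and quantization_error helpers sketched earlier; the swarm size, iteration count and the use of Euclidean distance are illustrative choices rather than the paper's experimental settings.

    import numpy as np

    def pso_cluster(data, Nc, n_particles=10, t_max=100, rng=np.random.default_rng(0)):
        # Each particle encodes Nc candidate centroids, initialized from randomly
        # selected data vectors (step 1); the swarm then follows steps 2(a)-(b).
        n, d = data.shape
        x = np.stack([data[rng.choice(n, Nc, replace=False)] for _ in range(n_particles)])
        v = np.zeros_like(x)
        pbest, pbest_cost = x.copy(), np.full(n_particles, np.inf)
        gbest, gbest_cost = x[0].copy(), np.inf
        euclid = lambda a, b: np.linalg.norm(a - b)
        for t in range(t_max):
            for i in range(n_particles):
                cost = quantization_error(x[i], data, euclid)   # fitness, Eq. (6)
                if cost < pbest_cost[i]:
                    pbest_cost[i], pbest[i] = cost, x[i].copy()
                if cost < gbest_cost:
                    gbest_cost, gbest = cost, x[i].copy()
            x, v = pso_update(x, v, pbest, gbest)               # Eqs. (4) and (5)
        return gbest, gbest_cost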
5. DISTANCE MEASURES
Cluster analysis assigns a set of 'n' objects to clusters on the basis of measurements of dissimilarity between the various objects. An important component of a clustering algorithm is the distance measure between data points. The distance measure determines how the similarity of two elements is calculated. This influences the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. The following are the important properties of distance measures [37] [38] [39] [40] [41] [42]:

Let d(a,b) denote the distance between points a and b.
1. Symmetry. The distance from a to b is the same as the distance from b to a. That is, d(a,b) = d(b,a).
2. Non-negativity. Distance is measured as a non-negative quantity. That is, d(a,b) ≥ 0, for every a and b.
3. Identification. The distance between a point and itself is zero. That is, d(a,a) = 0.

5.1 Euclidean Distance
Euclidean distance is also known as the L2 distance. It calculates the root of the squared differences between the coordinates of a pair of objects. If a = (x1, x2) and b = (y1, y2) are two points, then the Euclidean distance between a and b is given by Eq. (7).

d(a,b) = √[ (x1 – y1)^2 + (x2 – y2)^2 ]   (7)

If the points have 'n' dimensions such as a = (x1, x2, …, xn) and b = (y1, y2, …, yn), then the Euclidean distance between a and b is given by Eq. (8).

d(a,b) = √[ Σ_{i=1}^{n} (xi – yi)^2 ]   (8)

One drawback of the Euclidean distance is that it does not take the correlation between variables into account. Another drawback is that it does not work well for categorical variables [43].

5.2 Manhattan Distance
Manhattan distance is also called city block distance, taxicab norm, rectilinear distance or L1 distance. It computes the sum of the absolute differences between the coordinates of a pair of objects [44]. If a = (x1, x2) and b = (y1, y2) are two points, then the Manhattan distance between a and b is given by Eq. (9).

d(a,b) = |x1 – y1| + |x2 – y2|   (9)

If the points have 'n' dimensions such as a = (x1, x2, …, xn) and b = (y1, y2, …, yn), then the Manhattan distance between a and b is given by Eq. (10).

d(a,b) = |x1 – y1| + |x2 – y2| + … + |xn – yn| = Σ_{i=1}^{n} |xi – yi|   (10)
5.3 Chebyshev Distance

d(a,b) = Max_{i = 1, 2, …, n} |xi – yi|   (12)
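For reference, minimal NumPy versions of the three distance measures used in the experiments, corresponding to Eqs. (8), (10) and (12); these are standard formulations rather than code taken from the paper.

    import numpy as np

    def euclidean(a, b):
        # Eq. (8): square root of the sum of squared coordinate differences (L2 distance).
        return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

    def manhattan(a, b):
        # Eq. (10): sum of absolute coordinate differences (L1 / city block distance).
        return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

    def chebyshev(a, b):
        # Eq. (12): maximum absolute coordinate difference (L-infinity distance).
        return np.max(np.abs(np.asarray(a) - np.asarray(b)))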
The clustering performance is measured by the fitness value and the maximum average distance (MAD).
aminotransferase, gammagt (gamma-glutamyl transpeptidase), and drinks (number of half-pint equivalents of alcoholic beverages drunk per day).

Data set 4: Statlog (Heart) data set, which consists of 270 objects. This data set contains 13 features, which are: age, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results, maximum heart rate achieved, exercise induced angina, old peak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, and thal.

Data set 5: Artificial data set. In this experimental data set, there are five classes and each class has 50 samples consisting of three features. Each feature of the class is distributed according to:
Class1 ~ Uniform(80,100),
Class2 ~ Uniform(60,80),
Class3 ~ Uniform(40,60),
Class4 ~ Uniform(20,40), and
Class5 ~ Uniform(1,20).
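One way to reproduce this artificial data set is sketched below; the function name and random seed are illustrative assumptions.

    import numpy as np

    def make_artificial_data(samples_per_class=50, n_features=3, rng=np.random.default_rng(0)):
        # Five classes, 50 samples each, three features per sample; every feature of
        # class k is drawn from the uniform range given above for that class.
        ranges = [(80, 100), (60, 80), (40, 60), (20, 40), (1, 20)]
        data, labels = [], []
        for k, (lo, hi) in enumerate(ranges):
            data.append(rng.uniform(lo, hi, size=(samples_per_class, n_features)))
            labels.append(np.full(samples_per_class, k))
        return np.vstack(data), np.concatenate(labels)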
Table 1 summarizes these five data sets. For each data set, it lists the number of instances, the number of attributes and the number of clusters.

C. Results
As shown in Tables 2 – 6 and Figs. 1 – 5, the PSO-based data clustering algorithm with the Chebyshev distance measure gives better fitness values, standard deviations and MAD for all the data sets than the Euclidean and Manhattan distance measures. The experimental results in Table 7 and Fig. 6 also summarize the effect of varying the number of clusters for the different distance measures on the artificially generated data set. The fitness value should decrease as the number of clusters increases. It is also observed that the PSO-based data clustering algorithm with the Chebyshev distance measure shows the minimum fitness value for varying numbers of clusters compared with the other distance measures.

8. CONCLUSION
Clustering is the act of partitioning an unlabeled data set into groups of similar objects. Some traditional algorithms are sensitive to initialization and are easily trapped in local optima. On the other hand, the particle swarm optimization algorithm performs a globalized search in the entire solution space. Data clustering is a difficult problem because many factors such as distance measures, criterion functions and initial conditions come into play. In this paper, the particle swarm optimization algorithm is experimented with on four well-known data sets (Blood Transfusion, Diabetes, Liver Disorders and Statlog (Heart)) and an artificially generated data set, using different distance measures: Euclidean, Manhattan and Chebyshev. The algorithm achieves better fitness values with the Chebyshev distance measure than with the Euclidean and Manhattan distance measures for the data sets selected for our study. It is also observed that the Manhattan distance measure performed very poorly on all the data sets. This distance measure is poor in handling high-dimensional data.
Table 1. Summary of the data sets
Data set      Instances   Attributes   Clusters
Diabetes      768         8            2
Artificial    250         3            5
Table 2 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of
PSO algorithm on diabetes data set
Table 3 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of
PSO algorithm on blood transfusion data set
Table 4 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of
PSO algorithm on liver disorders data set
Table 5 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of
PSO algorithm on statlog (heart) data set
Table 6 Best, worst, average, standard deviation of global fitness and maximum average distance (MAD) for 10 test runs of
PSO algorithm on artificial data set
Fig. 1 Comparison of global best fitness for 10 test runs in diabetes data set
Fig. 2 Comparison of global best fitness for 10 test runs in blood transfusion data set
Fig. 3 Comparison of global best fitness for 10 test runs in liver disorder data set
Fig. 4 Comparison of global best fitness for 10 test runs in statlog (heart) data set
Fig. 5 Comparison of global best fitness for 10 test runs in artificial data set
9. REFERENCES
[3] Kaufman, L., and Rousseeuw, P. 1990. "Finding groups in data: an introduction to cluster analysis", New York: John Wiley & Sons.
[4] Zhang, T., Ramakrishnan, R., and Livny, M. 1996. "BIRCH: an efficient data clustering method for very large databases", In Proceedings of the ACM SIGMOD international conference on the management of data, pp. 103-114.
[5] Ester, M., Kriegel, H-P., Sander, J., and Xu, X. 1996. "A density based algorithm for discovering clusters in large spatial databases with noise", In Simoudis, E., Han, J., and Fayyad, U. (Eds.), Second international conference on knowledge discovery and data mining, pp. 226-231, AAAI Press, Portland.
[6] Guha, S., Rastogi, R., and Shim, K. 1998. "CURE: an efficient clustering algorithm for large databases", In Proceedings of the ACM SIGMOD international conference on the management of data, pp. 73-84, Seattle, USA.
[7] Karypis, G., Han, E-H., and Kumar, V. 1999. "CHAMELEON: a hierarchical clustering algorithm using dynamic modeling", Computer, 32, pp. 32-68.
[10] Gungor, Z., and Unler, A. 2007. "K-harmonic means data clustering with simulated annealing heuristic", Applied Mathematics and Computation, 184(2), pp. 199-209.
[11] Bin, W., and Zhongzhi, S. 2001. "A clustering algorithm based on swarm intelligence", In Proceedings of the international conference on Info-tech and Info-net, Beijing, China, pp. 58-66.
[12] Jain, A., Murty, M., and Flynn, P. 1999. "Data clustering: a review", ACM Computing Surveys, 31(3), pp. 264-323.
[13] Jain, A., and Dubes, R. 1998. "Algorithms for clustering data", Prentice Hall, New Jersey.
[14] Berkhin, P. 2002. "Survey of clustering data mining techniques", Technical report, Accrue Software, San Jose, California.
[15] Xu, R., and Wunsch II, D. 2005. "Survey of clustering algorithms", IEEE Transactions on Neural Networks, 16(3), pp. 645-678.
[16] Ding, C., and He, X. 2002. "Cluster merging and splitting in hierarchical clustering algorithms", IEEE international conference, pp. 139-146.
[17] Yongguo Liu, Jun Peng, Kefei Chen, and Yi Zhang. 2006. "An improved hybrid genetic clustering algorithm", SETN 2006, LNAI 3955, pp. 192-202.
[18] Bonabeau, E., Dorigo, M., and Theraulaz, G. 1999. "Swarm intelligence: from natural to artificial systems", Oxford University Press, Inc., New York.
[19] Dorigo, M., and Stutzle, T. 2004. "Ant colony optimization", MIT Press, Cambridge, Massachusetts, London, England.
[20] de Castro, L.N., and Timmis, J. 2002. "Artificial Immune Systems: a new computational intelligence approach", Springer, Heidelberg.
[21] Zhang, C., Ouyang, D., and Ning, J. 2010. "An artificial bee colony approach for clustering", Expert Systems with Applications, 37, pp. 4761-4767.
[22] Paterlini, S., and Minerva, T. 2003. "Evolutionary approaches for cluster analysis", In Bonarini, A., Masulli, F., and Pasi, G. (Eds.), Soft Computing Applications, Springer-Verlag, Berlin, pp. 167-178.
[23] Goldberg, D.E. 1975. "Genetic algorithms in search, optimization and machine learning", Addison-Wesley, Reading, MA.
[24] Falkenauer, E. 1998. "Genetic algorithms and grouping problems", John Wiley and Sons, Chichester.
[25] Kennedy, J., and Eberhart, R.C. 1995. "Particle swarm optimization", In Proceedings of the IEEE international joint conference on neural networks (IJCNN 95), Piscataway, IEEE Press, pp. 1942-1948.
[26] Al-Sultan, K.S. 1995. "A tabu search approach to the clustering problem", Pattern Recognition, 28, pp. 1443-1451.
[27] Gendreau, M. 2003. "An introduction to tabu search", In Kochenberger, G., and Glover, F. (Eds.), Handbook of Metaheuristics, Dordrecht, Kluwer Academic Publishers.
[28] Sousa, T., Neves, A., and Silva, A. 2003. "Swarm optimization as a new tool for data mining", In Proceedings of the 17th international symposium on parallel and distributed processing (IPDPS'03), pp. 48-53.
[29] Van der Merwe, D., and Engelbrecht, A. 2003. "Data clustering using particle swarm optimization", In Proceedings of the IEEE congress on evolutionary computation (CEC 2003), Canberra, Australia, pp. 215-220.
[30] Liping Yan and Jianchao Zeng. 2006. "Using particle swarm optimization and genetic programming to evolve classification rules", In Sixth world congress on intelligent control and automation (WCICA 2006), pp. 3415-3419.
[31] Apostolopoulos, T., and Vlachos, A. 2011. "Application of the firefly algorithm for solving the economic emissions load dispatch problem", International Journal of Combinatorics, 2011, pp. 1-23.
[32] Mustafa Servet Kiran, Hazim Iscan, and Mesut Gunduz. 2012. "The analysis of discrete artificial bee colony algorithm with neighborhood operator on travelling salesman problem", Neural Computing and Applications.
[33] Poli, R., Kennedy, J., and Blackwell, T. 2007. "Particle swarm optimization – an overview", Swarm Intelligence, 1(1), pp. 33-57.
[34] Shi, Y., and Eberhart, R.C. 1998. "A modified particle swarm optimizer", In Proceedings of the IEEE congress on evolutionary computation (CEC 1998), Piscataway, NJ, pp. 69-73.
[35] Omran, M., Salman, A., and Engelbrecht, A. 2002. "Image classification using particle swarm optimization", In Wang, L., Tan, K.C., Furuhashi, T., Kim, J-H., and Yao, X. (Eds.), Proceedings of the fourth Asia-Pacific conference on simulated evolution and learning (SEAL'02), IEEE Press, Piscataway, pp. 370-374.
[36] Esmin, A.A.A., Pereira, D.L., and de Araujo, F. 2008. "Study of different approach to clustering data by using particle swarm optimization algorithm", In IEEE congress on evolutionary computation (CEC 2008), pp. 1817-1822.
[37] Sokal, R.R. 1977. "Clustering and classification: Background and current directions", In Classification and Clustering, Academic Press, pp. 155-172.
[38] Mardia, K.V., Kent, J.T., and Bibby, J.M. 1979. "Multivariate analysis", Academic Press.
[39] Seber, G.A.F. 1984. "Multivariate observations", Wiley.
[40] Mielke, P.W. 1985. "Geometric concerns pertaining to applications of statistical tests in the atmospheric sciences", Journal of the Atmospheric Sciences, 42, pp. 1209-1212.
[41] Krzanowski, W.J. 1988. "Principles of multivariate analysis: A user's perspective", Oxford Science Publications.
[42] Mimmack, Gillian M., Mason, Simon J., and Galpin, Jacquelin S. 2001. "Choice of distance matrices in cluster analysis: Defining regions", Journal of Climate, 14(12), pp. 2790-2797.
[43] Ertoz, L., Steinbach, M., and Kumar, V. 2003. "Finding clusters of different sizes, shapes, densities in noisy high dimensional data", In Proceedings of the third SIAM international conference on data mining (SDM 2003), volume 112 of Proceedings in Applied Mathematics, Society for Industrial and Applied Mathematics.
[44] Berry, M.J.A., and Linoff, G.S. 2009. "Data mining techniques: For marketing, sales and customer relationship management", Second edition, Wiley.
[45] Bock, R.K., and Krischer, W. 1998. "The data analysis brief book", New York: Springer-Verlag.
[46] Omran, M.G.H. 2005. "A PSO-based clustering algorithm with application to unsupervised classification", University of Pretoria etd.
[47] Shi, Y., and Eberhart, R.C. 1999. "Empirical study of particle swarm optimization", In Proceedings of the IEEE congress on evolutionary computation (CEC 1999), Washington D.C., pp. 1945-1949.