Clustering
Sridhar S
Department of IST
Anna University
Clustering
Definition
Clustering is the process of organizing objects into groups whose members are similar in some way.
A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
Clustering is a data mining technique also known as class discovery.
Why Clustering?
Learning from data
Unsupervised learning
What does it do?
Pattern detection
Simplification
Concept construction
Supervised vs. Unsupervised Classification
Supervised: the set of classes (clusters) is given; assign each new pattern (point) to the proper cluster and label it with the label of its cluster.
Examples: classify bacteria to select the proper antibiotic, assign a signature to a book and place it on the proper shelf.
Unsupervised: for a given set of patterns (training set), discover a set of clusters and assign additional patterns to the proper cluster.
Examples: buying behavior, stock groups, closed groups of researchers citing each other, and more.
Examples
Clustering of documents or research groups by citation index, evolution of papers.
Problem: find a minimal set of papers describing the essential ideas.
Clustering of items from excavations according to cultural epochs.
Clustering of tooth fragments in anthropology.
Carl von Linné: Systema Naturae, 1735; botany, later zoology.
etc.
The Use of Clustering
Data Mining
Information Retrieval
Text Mining
Web Analysis
Marketing
Medical Diagnosis
Image Analysis
Bioinformatics
Image Analysis: MST Clustering
[Figure: an image before and after color clustering using HEMST (hierarchical EMST clustering algorithm) and SEMST (standard EMST clustering algorithm).]
Overview of Clustering
Feature Selection
identifying the most effective subset of the original features to use in clustering
Feature Extraction
transformations of the input features to produce new salient features
Inter-pattern Similarity
measured by a distance function defined on pairs of patterns
Grouping
methods to group similar patterns in the same cluster
Conducting Cluster Analysis
Formulate the Problem
Select a Distance Measure
Select a Clustering Procedure
Decide on the Number of Clusters
Interpret and Profile Clusters
Assess the Validity of Clustering
Formulating the Problem
Most important is selecting the variables on
which the clustering is based.
Inclusion of even one or two irrelevant
variables may distort a clustering solution.
Select a Similarity Measure
The similarity measure can be based on correlations or distances.
The most commonly used measure of similarity is the Euclidean distance; the city-block distance is also used.
If the variables are measured in vastly different units, the data must be standardized. Outliers should also be eliminated.
Use of different similarity/distance measures may lead to different clustering results. Hence, it is advisable to use different measures and compare the results.
Distance Measures
The Euclidean distance between p = (x, y) and q = (s, t) is defined as:
De(p, q) = [(x − s)² + (y − t)²]^(1/2)
D4 Distance
The D4 distance (also called city-block distance) between p and q is defined as:
D4(p, q) = |x − s| + |y − t|
D8 Distance
The D8 distance (also called chessboard distance) between p and q is defined as:
D8(p, q) = max(|x − s|, |y − t|)
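The three distances can be written directly as small Python functions (a minimal sketch; the function names are illustrative, not from the slides):

```python
# Minimal sketch of the three distance measures defined above.
def euclidean(p, q):          # D_e
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def city_block(p, q):         # D_4
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def chessboard(p, q):         # D_8
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

p, q = (4, 4), (8, 7)
print(euclidean(p, q))   # 5.0
print(city_block(p, q))  # 7
print(chessboard(p, q))  # 4
```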
Example
Metric Distances
What properties should a similarity distance have?
D(A, B) = D(B, A)            (Symmetry)
D(A, A) = 0                  (Constancy of self-similarity)
D(A, B) ≥ 0                  (Positivity)
D(A, B) ≤ D(A, C) + D(B, C)  (Triangle inequality)
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: example dendrogram, with merge heights on the vertical axis.]
Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative:
Start with the points as individual clusters.
At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive:
Start with one, all-inclusive cluster.
At each step, split a cluster until each cluster contains a single point (or there are k clusters).
Single Linkage Algorithm
Distance between two clusters
The single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj.
The distance is defined by the two most similar objects:
Dsl(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
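For reference, the whole single-link run on the five example pixels used in the tables that follow can be reproduced with SciPy. This is a sketch assuming SciPy is available; changing method='single' to 'complete', 'average' or 'ward' reproduces the later variants of this example.

```python
# Sketch: single-link clustering of the five example pixels with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

pixels = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)])  # objects 1..5
Z = linkage(pixels, method='single', metric='euclidean')
print(Z)          # each row: indices of the two merged clusters, merge distance, cluster size
# dendrogram(Z)   # draws the tree shown on the Dendrogram slide
```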
Initial Table
Pixel 1: (4, 4)
Pixel 2: (8, 4)
Pixel 3: (15, 8)
Pixel 4: (24, 4)
Pixel 5: (24, 12)
First Iteration
      1     2     3     4     5
1     0   4.0  11.7  20.0  21.5
2   4.0     0   8.1  16.0  17.9
3  11.7   8.1     0   9.8   9.8
4  20.0  16.0   9.8     0   8.0
5  21.5  17.9   9.8   8.0     0
Objects 1 and 2 should be clustered.
Second Iteration
        {1,2}     3     4     5
{1,2}      0    8.1  16.0  17.9
3        8.1      0   9.8   9.8
4       16.0    9.8     0   8.0
5       17.9    9.8   8.0     0
Objects 4 and 5 should be clustered.
Third Iteration
        {1,2}     3  {4,5}
{1,2}      0    8.1   16.0
3        8.1      0    9.8
{4,5}   16.0    9.8      0
Objects {1,2} and {3} are clustered.
Dendrogram
Complete Linkage Algorithm
Initial Table
Pixel 1: (4, 4)
Pixel 2: (8, 4)
Pixel 3: (15, 8)
Pixel 4: (24, 4)
Pixel 5: (24, 12)
First Iteration
      1     2     3     4     5
1     0   4.0  11.7  20.0  21.5
2   4.0     0   8.1  16.0  17.9
3  11.7   8.1     0   9.8   9.8
4  20.0  16.0   9.8     0   8.0
5  21.5  17.9   9.8   8.0     0
Objects 1 and 2 should be clustered.
Second Iteration
        {1,2}     3     4     5
{1,2}      0   11.7  20.0  21.5
3       11.7      0   9.8   9.8
4       20.0    9.8     0   8.0
5       21.5    9.8   8.0     0
Objects 4 and 5 should be clustered.
Third Iteration
        {1,2}     3  {4,5}
{1,2}      0   11.7   21.5
3       11.7      0    9.8
{4,5}   21.5    9.8      0
Objects {3} and {4,5} are clustered.
Dendrogram
Average Linkage Algorithm
Distance between two clusters
The group-average distance between clusters Ci and Cj is the average distance between any object in Ci and any object in Cj:
Davg(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ_{x ∈ Ci, y ∈ Cj} d(x, y)
Initial Table
Pixel 1: (4, 4)
Pixel 2: (8, 4)
Pixel 3: (15, 8)
Pixel 4: (24, 4)
Pixel 5: (24, 12)
First Iteration
      1     2     3     4     5
1     0   4.0  11.7  20.0  21.5
2   4.0     0   8.1  16.0  17.9
3  11.7   8.1     0   9.8   9.8
4  20.0  16.0   9.8     0   8.0
5  21.5  17.9   9.8   8.0     0
Objects 1 and 2 should be clustered.
Second Iteration
        {1,2}     3     4     5
{1,2}      0    9.9  18.0  19.7
3        9.9      0   9.8   9.8
4       18.0    9.8     0   8.0
5       19.7    9.8   8.0     0
Objects 4 and 5 should be clustered.
Third Iteration
        {1,2}     3  {4,5}
{1,2}      0    9.9   18.9
3        9.9      0    9.8
{4,5}   18.9    9.8      0
Objects {3} and {4,5} are clustered.
Dendrogram
Hierarchical Clustering: Group Average
Compromise between single and complete link.
Strengths:
Less susceptible to noise and outliers.
Limitations:
Biased towards globular clusters.
Ward's Algorithm
Cluster Similarity: Ward's Method
Based on minimum variance.
Example: the mean vector of (4,4) and (8,4) is ((4+8)/2, (4+4)/2) = (6,4).
The squared error of merging objects 1 and 2 is
(4−6)² + (8−6)² + (4−4)² + (4−4)² = 8
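As a quick check of this merge cost, a small helper (purely illustrative, not from the slides) computes the squared error of any candidate cluster around its mean; its output matches the entries in the tables that follow.

```python
# Sketch of the merge cost used by Ward's method: squared error of a
# candidate cluster around its mean (pure Python, names illustrative).
def squared_error(points):
    n = len(points)
    mean = [sum(c) / n for c in zip(*points)]
    return sum((x - mx) ** 2 for p in points for x, mx in zip(p, mean))

print(squared_error([(4, 4), (8, 4)]))   # 8.0, the {1,2} entry below
print(squared_error([(4, 4), (15, 8)]))  # 68.5, the {1,3} entry below
```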
First Iteration for 4 Clusters
Clusters               Squared Error
{1,2},{3},{4},{5}        8.0
{1,3},{2},{4},{5}       68.5
{1,4},{2},{3},{5}      200
{1,5},{2},{3},{4}      232
{2,3},{1},{4},{5}       32.5
{2,4},{1},{3},{5}      128
{2,5},{1},{3},{4}      160
{3,4},{1},{2},{5}       48.5
{3,5},{1},{2},{4}       48.5
{4,5},{1},{2},{3}       32.0
First Iteration for 3 Clusters
Clusters             Squared Error
{1,2,3},{4},{5}        72.7
{1,2,4},{3},{5}       224
{1,2,5},{3},{4}       266.7
{1,2},{3,4},{5}        56.5
{1,2},{3,5},{4}        56.5
{1,2},{4,5},{3}        40
First Iteration for 2 Clusters
Clusters           Squared Error
{1,2,3},{4,5}        104.7
{1,2,4,5},{3}        380.0
{1,2},{3,4,5}         94
Dendrogram
Hierarchical Clustering: Time and Space Requirements
O(N²) space, since it uses the proximity matrix (N is the number of points).
O(N³) time in many cases: there are N steps, and at each step the size-N² proximity matrix must be updated and searched.
Complexity can be reduced to O(N² log N) time for some approaches.
Overall Assessment
Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Difficulty handling different-sized clusters and convex shapes
Breaking large clusters
Divisive Hierarchical Algorithms
Spanning Tree
A spanning tree of a graph is a subgraph that is a tree and contains all the vertices.
A graph may have many spanning trees; for example, the complete graph on four vertices has sixteen spanning trees:
Spanning Tree cont.
Minimum Spanning Trees
Suppose that the edges of the graph have weights or lengths. The weight of a tree is the sum of the weights of its edges.
As the example shows, different spanning trees have different total lengths.
The question is: how do we find the minimum-length spanning tree?
Minimum Spanning Trees
The question can be solved by many different algorithms; two classical minimum-spanning-tree algorithms are:
Kruskal's Algorithm
Prim's Algorithm
Kruskal's Algorithm
Joseph Bernard Kruskal, Jr.
Kruskal's approach: select the minimum-weight edge that does not form a cycle.
Kruskal's Algorithm:
  sort the edges of G in increasing order by length
  keep a subgraph S of G, initially empty
  for each edge e in sorted order:
    if the endpoints of e are disconnected in S:
      add e to S
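A runnable Python sketch of the same idea is given below; the union-find helper and the edge-list format are assumptions made for this example, not part of the slides.

```python
# Runnable sketch of Kruskal's algorithm with a simple union-find.
def kruskal(n, edges):                 # edges: list of (weight, u, v)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):      # edges in increasing order of length
        ru, rv = find(u), find(v)
        if ru != rv:                   # endpoints disconnected in S
            parent[ru] = rv            # add e to S
            mst.append((u, v, w))
    return mst

print(kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]))
# [(0, 1, 1), (1, 2, 2), (2, 3, 4)]
```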
Kruskal's Algorithm - Example
Prim's Algorithm
Robert Clay Prim
Prim's approach:
  Choose an arbitrary start node v.
  At any point in time, we have a connected component N containing v and the other nodes V − N.
  Choose the minimum-weight edge from N to V − N.
Prim's Algorithm:
  let T be a single vertex x
  while (T has fewer than n vertices) {
    find the smallest edge connecting T to G − T
    add it to T
  }
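Below is a Python sketch of Prim's algorithm using a binary heap; the adjacency-list format and the names are illustrative assumptions, not from the slides.

```python
# Sketch of Prim's algorithm; graph is {node: [(weight, neighbour), ...]}.
import heapq

def prim(graph, start):
    visited = {start}
    heap = list(graph[start])          # candidate edges leaving T
    heapq.heapify(heap)
    mst = []
    while heap and len(visited) < len(graph):
        w, v = heapq.heappop(heap)     # smallest edge from T to G-T
        if v in visited:
            continue
        visited.add(v)
        mst.append((v, w))             # node added to T and the edge weight used
        for edge in graph[v]:
            heapq.heappush(heap, edge)
    return mst

g = {0: [(1, 1), (4, 2)], 1: [(1, 0), (2, 2)], 2: [(4, 0), (2, 1)]}
print(prim(g, 0))                      # [(1, 1), (2, 2)]
```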
Prim's Algorithm - Example
[Figure: step-by-step construction of the minimum spanning tree on a weighted graph with edge weights 1, 2, 8, 9, 10, 11, 12, 13, 14, 20, 40, 50; one edge is added at each step.]
MST: Divisive Hierarchical Clustering
Build the MST (Minimum Spanning Tree):
  Start with a tree that consists of any point.
  In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not.
  Add q to the tree and put an edge between p and q.
MST: Divisive Hierarchical Clustering
Use the MST for constructing a hierarchy of clusters; a sketch of this idea follows.
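One simple way to realize this (a sketch assuming SciPy is available; the function name mst_clusters is illustrative) is to build the MST and then delete the k − 1 longest edges, so that the remaining connected components form the clusters.

```python
# Sketch of MST-based divisive clustering: build the MST, then delete
# the k-1 longest edges; the remaining connected pieces are the clusters.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import squareform, pdist

def mst_clusters(points, k):
    dist = squareform(pdist(points))               # full distance matrix
    mst = minimum_spanning_tree(dist).toarray()
    cut = np.sort(mst[mst > 0])[-(k - 1):].min()   # weight of the (k-1)-th longest edge
    mst[mst >= cut] = 0                            # remove the longest edges
    return connected_components(mst, directed=False)[1]

points = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)])
print(mst_clusters(points, 2))   # cluster label for each of the five pixels
```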
Partitional Clustering
Introduction
Partitional Clustering
  Forgy's Algorithm
  k-means Algorithm
  ISODATA Algorithm
Partitional Clustering
Forgy's Algorithm
k samples are chosen as seed points, where k is the number of clusters to be constructed.
Partitional Clustering
Forgy's Algorithm
1. Initialize the cluster centroids to the seed points.
2. For each sample, find the cluster centroid nearest it. Put the sample in the cluster identified with this nearest cluster centroid.
3. If no sample changed clusters in step 2, stop.
4. Compute the centroids of the resulting clusters and go to step 2.
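A minimal sketch of these four steps on the five example samples, using NumPy (the function name forgy and the seed choice are illustrative):

```python
# Sketch of Forgy's algorithm with the first two samples as seed points.
import numpy as np

def forgy(samples, seeds):
    centroids = np.array(seeds, dtype=float)
    assignment = None
    while True:
        # step 2: assign every sample to its nearest centroid
        dists = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            return centroids, new_assignment        # step 3: nothing changed, stop
        assignment = new_assignment
        # step 4: recompute centroids and repeat
        centroids = np.array([samples[assignment == j].mean(axis=0)
                              for j in range(len(centroids))])

samples = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)
print(forgy(samples, [(4, 4), (8, 4)]))   # converges to centroids (6,4) and (21,8)
```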
Partitional Clustering
Forgy's Algorithm
Initialization: the seed points (4,4) and (8,4) are used as the initial cluster centroids.
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     –
(24,4)     –
(24,12)    –
Partitional Clustering
Forgy's Algorithm
First iteration
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     (8,4)
(24,4)     (8,4)
(24,12)    (8,4)
Partitional Clustering
Forgy's Algorithm
Second iteration
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (4,4)
(15,8)     (17.75, 7)
(24,4)     (17.75, 7)
(24,12)    (17.75, 7)
Partitional Clustering
Forgy's Algorithm
Third iteration
Sample     Nearest Cluster Centroid
(4,4)      (6,4)
(8,4)      (6,4)
(15,8)     (21, 8)
(24,4)     (21, 8)
(24,12)    (21, 8)
Partitional Clustering
Forgy's Algorithm
It has been proved that Forgy's algorithm terminates: eventually no sample changes clusters.
If the number of samples is large, it may take the algorithm considerable time to produce stable clusters.
k-means Algorithm
Partitional Clustering
  Forgy's Algorithm
  k-means Algorithm
  ISODATA Algorithm
k-means Algorithm
Similar to Forgy's algorithm; k is the number of clusters to be constructed.
Differences:
  The centroids of the clusters are recomputed as soon as a sample joins a cluster.
  It makes only two passes through the data set, whereas Forgy's algorithm is iterative.
k-means Algorithm
Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n − k samples, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.
Go through the data a second time. For each sample, find the centroid nearest it. Put the sample in the cluster identified with the nearest centroid. (During this step, do not recompute any centroid.)
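A sketch of this two-pass procedure (often associated with MacQueen's variant of k-means) is shown below; the function name and the data ordering are assumptions made for illustration.

```python
# Sketch of the two-pass k-means described above: centroids are updated
# as each sample is added in pass one, then frozen for pass two.
import numpy as np

def two_pass_kmeans(samples, k):
    samples = np.asarray(samples, dtype=float)
    centroids = [samples[i].copy() for i in range(k)]   # first k samples as clusters
    members = [[samples[i]] for i in range(k)]
    for x in samples[k:]:                               # pass 1: assign and update
        j = int(np.argmin([np.linalg.norm(x - c) for c in centroids]))
        members[j].append(x)
        centroids[j] = np.mean(members[j], axis=0)      # recompute immediately
    # pass 2: final assignment with frozen centroids
    labels = [int(np.argmin([np.linalg.norm(x - c) for c in centroids]))
              for x in samples]
    return np.array(centroids), labels

data = [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]     # ordering from the example
print(two_pass_kmeans(data, 2))   # centroids near (9, 5.3) and (24, 8)
```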
k-means Algorithm
Set k = 2 and assume that the data are ordered so that the first two samples are (8,4) and (24,4).
Distances for use by the k-means algorithm:
Sample     Distance to Centroid (9, 5.3)    Distance to Centroid (24, 8)
(8,4)       1.6                              16.5
(24,4)     15.1                               4.0
(15,8)      6.6                               9.0
(4,4)       5.2                              20.4
(24,12)    16.4                               4.0
ISODATA Algorithm
Partitional Clustering
  Forgy's Algorithm
  k-means Algorithm
  ISODATA Algorithm
ISODATA Algorithm
An enhancement of the approach taken by Forgy's algorithm and the k-means algorithm:
  k is allowed to range over an interval.
  Discard clusters with too few elements.
  Split clusters that are too large or contain very dissimilar samples.
  Merge clusters that are too close together.
Model-based Clustering
Assume the data are generated from k probability distributions.
Goal: find the distribution parameters.
Algorithm: Expectation Maximization (EM).
Output: distribution parameters and a soft assignment of points to clusters.
Model-based Clustering
Assume k probability distributions with parameters (θ1, …, θk).
Given data X, compute (θ1, …, θk) such that Pr(X | θ1, …, θk) [likelihood] or ln Pr(X | θ1, …, θk) [log-likelihood] is maximized.
Every point x ∈ X need not be generated by a single distribution; it can be generated by multiple distributions with some probability [soft clustering].
EM Algorithm
Initialize the k distribution parameters (θ1, …, θk); each distribution parameter corresponds to a cluster center.
Iterate between two steps:
  Expectation step: (probabilistically) assign points to clusters.
  Maximization step: estimate model parameters that maximize the likelihood for the given assignment of points.
EM Algorithm
Initialize k cluster centers.
Iterate between two steps:
  Expectation step: assign points to clusters
    Pr(xi ∈ Ck) = wk Pr(xi | Ck) / Σj wj Pr(xi | Cj)
  Maximization step: estimate model parameters
    rk = (1/n) Σ_{i=1..n} Pr(xi ∈ Ck) / Σj Pr(xi ∈ Cj)
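For concreteness, a compact EM sketch for a Gaussian mixture with spherical, unit-variance components is given below; it follows the E/M pattern above but simplifies the model, so it should be read as an illustration rather than the exact update rules on this slide.

```python
# Compact EM sketch for a k-component spherical Gaussian mixture.
import numpy as np

def em(X, k, iters=50):
    n, d = X.shape
    rng = np.random.default_rng(0)
    mu = X[rng.choice(n, k, replace=False)]        # initialise k centres from the data
    w = np.full(k, 1.0 / k)                        # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logp = np.log(w) - 0.5 * dist2             # log of w_k * exp(-0.5 * dist2)
        logp -= logp.max(axis=1, keepdims=True)    # stabilise before exponentiating
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and means from the soft assignment
        w = resp.mean(axis=0)
        mu = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return w, mu, resp

X = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)
print(em(X, 2)[1])                                 # the two fitted component means
```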
Validity of Clusters
Why validity of clusters?
  Given some data, any clustering algorithm generates clusters.
  So we need to make sure the clustering results are valid and meaningful.
Measuring the validity of clustering results usually involves:
  Optimality of clusters
  Verification of the biological meaning of clusters
Optimality of Clusters
Optimal clusters should
  minimize the distance within clusters (intracluster)
  maximize the distance between clusters (intercluster)
Example of an intracluster measure: the squared error se
  se = Σ_{i=1..k} Σ_{p ∈ ci} ||p − mi||²
where mi is the mean of all instances in cluster ci.
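A small helper (illustrative names, NumPy assumed) that evaluates this squared-error measure for a given labelling of the points:

```python
# Sketch of the intracluster squared-error measure defined above.
import numpy as np

def squared_error(X, labels):
    X, labels = np.asarray(X, float), np.asarray(labels)
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

X = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
print(squared_error(X, [0, 0, 1, 1, 1]))   # 94.0, cf. the {1,2},{3,4,5} value in the Ward example
```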