
Ch 6. Cluster Analysis
 Cluster analysis classifies observations into a small number of
homogeneous groups (called clusters) based on observed variables.
- Cluster analysis partitions observations into mutually exclusive groups.

 Purpose of cluster analysis: organize multivariate data into groups and
describe similarities and differences between the groups.
- Ex 1) The marketing department of a company is interested in grouping
potential customers of their product.
- Ex 2) In psychiatry, classifying the mental status of patients helps
identify causes and leads to improved methods of therapy.

 Cluster analysis should be judged largely on its usefulness.
- Alternative clusterings may exist for the same set of observations.
- Some classifications will be more useful than others.
Ch 6. Cluster Analysis

[Figure: a set of observations partitioned into two clusters; Cluster A contains observations 1, 2, 4, … and Cluster B contains observations 3, 5, 6, ….]
Ch 6. Cluster Analysis

 The simplest way to identify groups is to examine the data using graphs.
- Graphs can be made from the raw data or from the results of a principal
components analysis.
- Graphical techniques are often useful for searching for clusters or for
providing evidence to justify a clustering result.

 Cluster analysis classifies previously unclassified material.
- Assume that the number and composition of the clusters are unknown.
6.2. Agglomerative Hierarchical
Clustering Techniques
 Hierarchical clustering
- Hierarchical clustering consists of a series of partitions that proceed
from a single ‘cluster’ containing all observations to n clusters, each
containing a single observation.

 Agglomerative hierarchical clustering techniques produce partitions by
a series of successive fusions of the n observations into groups.
- Fusions are irreversible.
- Once an agglomerative hierarchical clustering fuses two observations
into the same group, they cannot subsequently be divided into different
groups.
- The user needs to determine an appropriate number of clusters.

 Hierarchical classifications can be represented by a dendrogram.
- A dendrogram illustrates the fusions made at each stage of the analysis.
Example of Dendrogram

 Agglomerative hierarchical clustering

[Dendrogram figure, stages 1-5: A and D fuse, C and E fuse, B joins (C, E), and finally (A, D) and (B, C, E) fuse into the single cluster (A, B, C, D, E). Reading the tree in this direction is the agglomerative direction.]
6.2. Agglomerative Hierarchical
Clustering Techniques
 An agglomerative hierarchical clustering procedure produces a series of
partitions of the data, Pn, Pn-1, …, P1.
- The first, Pn, consists of n single member clusters.
- The last, P1, consists of a single group containing all n observations.

 Basic operation of agglomerative hierarchical clustering
- START with clusters C1, C2, …, Cn, each containing a single observation.
(1) Find the nearest pair of distinct clusters, say Ci and Cj, merge Ci and Cj, and
decrease the number of clusters by one.
(2) If the number of clusters equals one, then stop; else return to (1).
- At each stage, the method fuses the observations or clusters that are closest (or most similar).
- One has to decide how to define the distance (or similarity) between an observation
and a group of observations, or between two groups of observations.
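A minimal sketch of this basic loop (in Python, assuming a precomputed symmetric distance matrix; single linkage is used as the inter-cluster distance purely for illustration, and the function name agglomerate is not from any particular package):

```python
import numpy as np

def agglomerate(D, n_clusters=1):
    """Basic agglomerative loop: start with one cluster per observation and
    repeatedly fuse the nearest pair of clusters until n_clusters remain.
    Single linkage is used here to measure inter-cluster distance."""
    D = np.asarray(D, dtype=float)
    clusters = [[i] for i in range(D.shape[0])]       # START: singleton clusters
    merges = []                                       # record of the fusions
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):                # (1) find the nearest pair
            for b in range(a + 1, len(clusters)):
                d_ab = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d_ab < best[2]:
                    best = (a, b, d_ab)
        a, b, d_ab = best
        merges.append((clusters[a], clusters[b], d_ab))
        clusters[a] = clusters[a] + clusters[b]       # fuse the pair (irreversible)
        del clusters[b]                               # one cluster fewer
    return clusters, merges                           # (2) stop at n_clusters
```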
6.2.1. Measuring inter-cluster dissimilarity

 Agglomerative hierarchical clustering techniques differ in how they
measure the distances between or similarity of two clusters.

 Single linkage clustering
- Use an inter-group measure
$$d_{AB} = \min_{i \in A,\; j \in B} d_{ij},$$
where $d_{AB}$ is the distance between two clusters A and B,
and $d_{ij}$ is the distance between observations i and j.

 Complete linkage clustering
- Use an inter-group measure
$$d_{AB} = \max_{i \in A,\; j \in B} d_{ij}.$$

 Both single and complete linkage clustering are invariant to monotone
transformations of the original inter-observation distances.
6.2.1. Measuring inter-cluster dissimilarity

 Group average clustering
- Measure inter-cluster distance or similarity by
$$d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij},$$
where $n_A$ and $n_B$ are the numbers of observations in clusters A and B.
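As a small illustration of these three measures, the sketch below (plain Python with NumPy; the function name is illustrative, not from a library) computes d_AB from a matrix of pairwise observation distances, for clusters given as lists of 0-based observation indices.

```python
import numpy as np

def inter_cluster_distance(D, A, B, method="single"):
    """Distance between clusters A and B (lists of observation indices),
    given the matrix D of pairwise observation distances d_ij."""
    pair_dists = np.array([[D[i, j] for j in B] for i in A])
    if method == "single":      # d_AB = min over i in A, j in B of d_ij
        return pair_dists.min()
    if method == "complete":    # d_AB = max over i in A, j in B of d_ij
        return pair_dists.max()
    if method == "average":     # d_AB = (1 / (n_A n_B)) * sum of d_ij
        return pair_dists.mean()
    raise ValueError("unknown method")
```

For the five-individual dissimilarity matrix introduced below (stored as a NumPy array), inter_cluster_distance(D, [0, 1], [3, 4], "single") returns 8, the value d(12)(45) used in the single linkage example of Section 6.2.2.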

 Illustrative example
- A dissimilarity matrix for five individuals (rows and columns labelled 1-5):

$$D = \begin{pmatrix} 0 & & & & \\ 2 & 0 & & & \\ 6 & 5 & 0 & & \\ 10 & 9 & 4 & 0 & \\ 9 & 8 & 5 & 3 & 0 \end{pmatrix}$$
6.2.1. Measuring inter-cluster dissimilarity

 Figure (Everitt, 2010): illustration of the single linkage, complete linkage, and group average distances between two clusters.
6.2.2. Illustrative example of the application
of single linkage
 Stage 1: Since d12 is the smallest entry in the matrix D, individuals 1 and 2
are merged to form a cluster.
 Stage 2.
- The distances between this group and the three remaining individuals, 3, 4,
and 5 are obtained as follows:
d(12)3 = min(d13, d23) = d23 = 5,
d(12)4 = min(d14, d24) = d24 = 9,
d(12)5 = min(d15, d25) = d25 = 8.
- Form a new distance matrix D1 (rows and columns labelled (12), 3, 4, 5):

$$D_1 = \begin{pmatrix} 0 & & & \\ 5 & 0 & & \\ 9 & 4 & 0 & \\ 8 & 5 & 3 & 0 \end{pmatrix}$$
- Since d45 is the smallest entry in D1, individuals 4 and 5 are merged to form
a second cluster.
6.2.2. Illustrative example of the application
of single linkage
 Stage 3.
- The distances between (45) and the remaining (12) and 3 are obtained as
follows:
d(12)(45) = min(d14, d15, d24, d25) = d25 = 8,
d3(45) = min(d34, d35) = d34 = 4.
- Form a new distance matrix D2 (rows and columns labelled (12), 3, (45)):

$$D_2 = \begin{pmatrix} 0 & & \\ 5 & 0 & \\ 8 & 4 & 0 \end{pmatrix}$$

- Since d3(45) is the smallest entry in D2, individual 3 is merged with the (45)
cluster.
 Stage 4.
- Fusion of the two remaining groups takes place to form a single group
containing all five individuals.
6.2.2. Illustrative example of the application
of single linkage
 The partitions produced at each stage are:
Stage Groups
P5 [1],[2],[3],[4],[5]
P4 [12],[3],[4],[5]
P3 [12],[3],[45]
P2 [12],[345]
P1 [12345]

- The single linkage dendrogram (fusion levels 2, 3, 4, 5): [figure]
6.2.2. Illustrative example of the application
of complete linkage
 Stage 1: Since d12 is the smallest entry in the matrix D, individuals 1 and 2
are merged to form a cluster.
 Stage 2.
- The distances between this group and the three remaining individuals, 3, 4,
and 5 are obtained as follows:
d(12)3 = max(d13, d23) = d13 = 6,
d(12)4 = max(d14, d24) = d14 = 10,
d(12)5 = max(d15, d25) = d15 = 9.
- Form a new distance matrix D1 (rows and columns labelled (12), 3, 4, 5):

$$D_1 = \begin{pmatrix} 0 & & & \\ 6 & 0 & & \\ 10 & 4 & 0 & \\ 9 & 5 & 3 & 0 \end{pmatrix}$$

- Since d45 is the smallest entry in D1, individuals 4 and 5 are merged to form
a second cluster.
6.2.2. Illustrative example of the application
of complete linkage
 Stage 3.
- The distances between (45) and the remaining (12) and 3 are obtained as
follows:
d(12)(45) = max(d14, d15, d24, d25) = d14 = 10,
d3(45) = max(d34, d35) = d35 = 5.
- Form a new distance matrix D2 (rows and columns labelled (12), 3, (45)):

$$D_2 = \begin{pmatrix} 0 & & \\ 6 & 0 & \\ 10 & 5 & 0 \end{pmatrix}$$
- Since d3(45) is the smallest entry in D2, individual 3 is merged with the (45)
cluster.
 Stage 4.
- Fusion of the two remaining groups takes place to form a single group
containing all five individuals.
6.2.2. Illustrative example of the application
of complete linkage
 The partitions produced at each stage are:
Stage Groups
P5 [1],[2],[3],[4],[5]
P4 [12],[3],[4],[5]
P3 [12],[3],[45]
P2 [12],[345]
P1 [12345]

- The complete linkage dendrogram (fusion levels 2, 3, 5, 10): [figure]
- Note that the partitions are the same as with single linkage, but the
distances used to merge groups are different.
6.2.2. Illustrative example of the application
of group average
 Stage 1: Since d12 is the smallest entry in the matrix D, individuals 1 and 2
are merged to form a cluster.
 Stage 2.
- The distances between this group and the three remaining individuals, 3, 4,
and 5 are obtained as follows:
d(12)3 = (d13 + d23)/2 = 5.5,
d(12)4 = (d14 + d24)/2 = 9.5,
d(12)5 = (d15 + d25)/2 = 8.5.
- Form a new distance matrix D1 (rows and columns labelled (12), 3, 4, 5):

$$D_1 = \begin{pmatrix} 0 & & & \\ 5.5 & 0 & & \\ 9.5 & 4 & 0 & \\ 8.5 & 5 & 3 & 0 \end{pmatrix}$$

- Since d45 is the smallest entry in D1, individuals 4 and 5 are merged to form
a second cluster.
6.2.2. Illustrative example of the application
of group average
 Stage 3.
- The distances between (45) and the remaining (12) and 3 are obtained as
follows:
d(12)(45) = (d14 + d15 + d24 + d25)/4 = 9,
d3(45) = (d34 + d35)/2 = 4.5.
- Form a new distance matrix D2 (rows and columns labelled (12), 3, (45)):

$$D_2 = \begin{pmatrix} 0 & & \\ 5.5 & 0 & \\ 9 & 4.5 & 0 \end{pmatrix}$$

- Since d3(45) is the smallest entry in D2, individual 3 is merged with the (45)
cluster.
 Stage 4.
- Fusion of the two remaining groups takes place to form a single group
containing all five individuals.
6.2.2. Illustrative example of the application
of group average
 The partitions produced at each stage are:
Stage Groups
P5 [1],[2],[3],[4],[5]
P4 [12],[3],[4],[5]
P3 [12],[3],[45]
P2 [12],[345]
P1 [12345]

- The group average dendrogram (fusion levels 2, 3, 4.5, 47/6 ≈ 7.83): [figure]
- Note that the partitions are the same as with single linkage and
complete linkage, but the distances used to merge groups are different.
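The three worked examples can be checked with SciPy's standard hierarchical clustering routines (a sketch, assuming SciPy is installed; squareform converts the symmetric matrix to the condensed form that linkage expects):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# Dissimilarity matrix D for the five individuals, in symmetric form.
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  5,  9,  8],
              [ 6,  5,  0,  4,  5],
              [10,  9,  4,  0,  3],
              [ 9,  8,  5,  3,  0]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(squareform(D), method=method)
    print(method, Z[:, 2])   # fusion levels at each stage
    # scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding tree.

# Expected fusion levels:
#   single   -> 2, 3, 4,   5
#   complete -> 2, 3, 5,   10
#   average  -> 2, 3, 4.5, 47/6 (approx. 7.83)
```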
6.2.3. Some properties of agglomerative
hierarchical clustering techniques

 Single linkage has a tendency to incorporate intermediate points into an
existing cluster rather than initiating a new one (called chaining).
- It leads to the formation of long ‘straggly’ clusters.
- Ex) Everitt (2010)
6.2.3. Some properties of agglomerative
hierarchical clustering techniques

 Hierarchical techniques using complete linkage and group average tend to
produce solutions in which the clusters are ‘spherical’, even when the data
contain relatively well-separated clusters of other shapes.

- Ex) Everitt (2010)


6.2.3. Some properties of agglomerative
hierarchical clustering techniques

 Empirical investigations indicate that no single method can be claimed
superior for all types of data.
- The presence of outliers led to very poor performance by group average
but left single linkage virtually unaffected.
- When the data contained a true cluster structure masked by the addition of
‘noise’, single linkage gave poor results.
6.2.5. Partitions from a hierarchy:
the number-of-groups problem

 We are not interested in the complete hierarchy but only in one or two
partitions obtained from it.
- Determining an appropriate number of groups is not straightforward.
- In hierarchical clustering, partitions are obtained by ‘cutting’ a dendrogram.

 One informal approach
- Examine the sizes of the differences between successive fusion levels
in the dendrogram.
- A large change (jump) in fusion level may indicate a particular number
of clusters.
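With a linkage matrix in hand, this informal approach is easy to apply: the fusion levels are stored in the third column, and fcluster extracts the partition obtained by cutting the tree. A sketch (assuming SciPy and the five-individual matrix D from the sketch in Section 6.2.2):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(squareform(D), method="complete")   # complete linkage on the toy example

heights = Z[:, 2]                         # fusion levels: 2, 3, 5, 10
gaps = np.diff(heights)                   # differences between successive fusion levels
k = len(heights) - int(np.argmax(gaps))   # cut just below the largest jump
labels = fcluster(Z, t=k, criterion="maxclust")
print(k, labels)                          # k = 2: clusters {1, 2} and {3, 4, 5}
```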
6.2.6. Examples of the application of
agglomerative hierarchical clustering techniques

 Example 1. Chest, waist, and hip measurements for 20 individuals (Everitt, 2010).
6.2.6. Examples of the application of
agglomerative hierarchical clustering techniques

 Example 2. Skull data (Everitt, 2010)
- 32 skulls found in the south-western and eastern districts of Tibet.
- It was known that the data could be divided into two groups:
observations #1-17 vs. observations #18-32.
- Five variables are measured as follows:
Greatest length of skull (X1),
Greatest horizontal breadth of skull (X2),
Height of skull (X3),
Upper face height (X4),
Face breadth (X5)
- Ignore the a priori grouping and investigate how a cluster analysis solution
reproduces the prior partition (1-17 vs. 18-32).
6.3. Optimization Methods

 Clustering techniques based on optimization methods
- For a chosen number of groups, these methods produce a partition of the
observations by optimizing a numerical criterion.

 With a single variable, the partition may be chosen by minimizing the
within-group sum of squares of the variable.
6.3. Optimization Methods

 For multivariate data, the commonly used criteria consider three matrices
that can be calculated for each partition of the data into g groups:

$$T = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}})(\mathbf{x}_{ij} - \bar{\mathbf{x}})',$$

$$W = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)',$$

$$B = \sum_{i=1}^{g} n_i (\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})',$$

where $\mathbf{x}_{ij}$ is the vector of variable values for the jth observation in the ith group,
$\bar{\mathbf{x}}$ is the mean vector of all n observations,
$\bar{\mathbf{x}}_i$ is the mean vector of the observations in group i,
and $n_i$ is the number of observations in group i.
- Note that T = W + B,
where T represents total dispersion, W represents within-group dispersion, and
B represents between-group dispersion.
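A small numerical sketch (using NumPy; the helper name scatter_matrices is illustrative) that builds T, W, and B for a partitioned data set and confirms the identity T = W + B. The data are the four individuals used in the K-means example later in this section, split into the groups (1) and (234).

```python
import numpy as np

def scatter_matrices(X, labels):
    """Total (T), within-group (W), and between-group (B) dispersion matrices
    for data X (n x p) partitioned according to `labels`."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    xbar = X.mean(axis=0)                              # overall mean vector
    T = (X - xbar).T @ (X - xbar)
    W = np.zeros_like(T)
    B = np.zeros_like(T)
    for g in np.unique(labels):
        Xg = X[labels == g]
        xbar_g = Xg.mean(axis=0)                       # group mean vector
        W += (Xg - xbar_g).T @ (Xg - xbar_g)
        B += len(Xg) * np.outer(xbar_g - xbar, xbar_g - xbar)
    return T, W, B

X = np.array([[5, 3], [-1, 1], [1, -2], [-3, -2]], dtype=float)
T, W, B = scatter_matrices(X, [0, 1, 1, 1])            # partition (1) vs (234)
print(np.allclose(T, W + B))                           # True: T = W + B
```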
6.3. Optimization Methods

 For p = 1, the identity T = W + B represents the separation of the total
sum of squares of a variable into the within-group and between-group sums
of squares (as in ANOVA).
- A natural criterion is to choose the partition corresponding to the minimum
value of the within-group sum of squares (or equivalently, the maximum
value of the between-group sum of squares).

 For p > 1, a number of criteria have been suggested. Among them:
(1) Minimization of tr(W): minimize the sum, over variables, of the
within-group sums of squares.
(2) Minimization of det(W): minimize the determinant of the within-group
dispersion matrix W.
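Continuing the sketch above (and reusing its scatter_matrices helper and the 4 x 2 data X), the two criteria can be compared across candidate partitions; note that they need not favour the same partition.

```python
import numpy as np

# Candidate 2-group partitions of the four individuals, as label vectors.
candidates = {"(12)(34)": [0, 0, 1, 1],
              "(1)(234)": [0, 1, 1, 1],
              "(134)(2)": [0, 1, 0, 0]}
for name, labels in candidates.items():
    _, W, _ = scatter_matrices(X, labels)
    print(name, "tr(W) =", np.trace(W), " det(W) =", round(np.linalg.det(W), 2))
# For these data, tr(W) is smallest for (1)(234), which is also the K-means
# solution found later in this section; det(W) happens to be smallest for
# (12)(34), showing that the two criteria can disagree.
```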
6.3. Optimization Methods

$$W = \sum_{i=1}^{g}\sum_{j=1}^{n_i} (\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)'
= \sum_{i=1}^{g}\sum_{j=1}^{n_i}
\begin{pmatrix} x_{ij1}-\bar{x}_{i1} \\ x_{ij2}-\bar{x}_{i2} \\ \vdots \\ x_{ijp}-\bar{x}_{ip} \end{pmatrix}
\begin{pmatrix} x_{ij1}-\bar{x}_{i1}, & x_{ij2}-\bar{x}_{i2}, & \ldots, & x_{ijp}-\bar{x}_{ip} \end{pmatrix}$$

$$= \begin{pmatrix}
\sum_{i=1}^{g}\sum_{j=1}^{n_i}(x_{ij1}-\bar{x}_{i1})^2 & \sum_{i=1}^{g}\sum_{j=1}^{n_i}(x_{ij1}-\bar{x}_{i1})(x_{ij2}-\bar{x}_{i2}) & \cdots & \sum_{i=1}^{g}\sum_{j=1}^{n_i}(x_{ij1}-\bar{x}_{i1})(x_{ijp}-\bar{x}_{ip}) \\
 & \sum_{i=1}^{g}\sum_{j=1}^{n_i}(x_{ij2}-\bar{x}_{i2})^2 & \cdots & \sum_{i=1}^{g}\sum_{j=1}^{n_i}(x_{ij2}-\bar{x}_{i2})(x_{ijp}-\bar{x}_{ip}) \\
 & & \ddots & \vdots \\
 & & & \sum_{i=1}^{g}\sum_{j=1}^{n_i}(x_{ijp}-\bar{x}_{ip})^2
\end{pmatrix},$$

so the diagonal elements of W are the within-group sums of squares of the p variables, and tr(W) is their total.
6.3. Optimization Methods
 The K-means method is based on the criterion of minimizing tr(W).

 K-means method
(1) Partition the individuals into K initial clusters.
(2) Proceed through the list of individuals, assigning an individual to the
cluster whose centroid (mean) is nearest.
- Distance is usually computed using the Euclidean distance with either
standardized or unstandardized observations.
(3) Recalculate the centroid for the cluster receiving the new individual and
for the cluster losing the individual.
- If an individual is moved from the initial configuration, the cluster
centroid (means) must be updated before proceeding.
(4) Repeat Steps (2) and (3) until no more reassignments take place.

 Note that the Euclidean distance between two arbitrary points P and Q
with coordinates P = (x1, x2, …, xp) and Q = (y1, y2, …, yp) is
$$d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}.$$
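A minimal Python sketch of Steps (2)-(4), assuming the data are stored in a NumPy array and an initial labelling from Step (1) is supplied; kmeans_reassign is an illustrative name, not a library routine.

```python
import numpy as np

def kmeans_reassign(X, labels, max_pass=100):
    """Pass through the individuals, assigning each to the cluster with the
    nearest centroid (squared Euclidean distance) and recomputing the affected
    centroids immediately, until a full pass makes no reassignment."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).copy()
    for _ in range(max_pass):
        changed = False
        for i, x in enumerate(X):
            # current centroid of every cluster (refreshed after each move)
            centroids = {g: X[labels == g].mean(axis=0) for g in np.unique(labels)}
            dists = {g: ((x - c) ** 2).sum() for g, c in centroids.items()}
            nearest = min(dists, key=dists.get)
            if nearest != labels[i]:
                labels[i] = nearest        # Steps (2)-(3): reassign and update
                changed = True
        if not changed:                    # Step (4): stop when nothing moves
            break
    return labels
```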
6.3. Optimization Methods
 Example. K-means method with K = 2
Individual   X1   X2
1             5    3
2            -1    1
3             1   -2
4            -3   -2
(1) Divide the four individuals into 2 clusters.
- Arbitrarily partition the individuals into two clusters, such as (12) and (34).
- Compute the coordinates of the cluster centroid:
Coordinates of centroid
Cluster X1 X2
(12) {5+(-1)}/2 = 2 (3+1)/2 = 2
(34) {1+(-3)}/2 = -1 {-2+(-2)}/2 = -2
6.3. Optimization Methods
 Example. K-means method with K = 2 (continued)
(2) Compute the squared distances:
d1(12)² = (5-2)² + (3-2)² = 10,
d1(34)² = (5-(-1))² + (3-(-2))² = 61.
- Since individual 1 is closer to cluster (12) than to cluster (34),
it is not reassigned.
d2(12)² = (-1-2)² + (1-2)² = 10,
d2(34)² = (-1-(-1))² + (1-(-2))² = 9.
- Since individual 2 is closer to cluster (34) than to cluster (12),
it is reassigned to cluster (34).

- If an item is moved from the initial configuration, the cluster
centroids (means) must be updated before proceeding.
- This gives clusters (1) and (234); the squared distances of individuals
3 and 4 are not checked until the centroids have been updated.
6.3. Optimization Methods
 Example. K-means method with K = 2 [figure]
6.3. Optimization Methods
 Example. K-means method with K = 2 (continued)
(3) Recalculate the coordinates of the cluster centroids:
Cluster   X1                      X2
(1)       5                       3
(234)     {(-1)+1+(-3)}/3 = -1    {1+(-2)+(-2)}/3 = -1

(4) Check each individual for reassignment.
Squared distances to group centroids:
Cluster   Individual: 1    2    3    4
(1)                   0    40   41   89
(234)                 52   4    5    5
- Each individual is assigned to the cluster with the nearest centroid, and the
process stops.
- The final K = 2 clusters are 1 and (234).
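Applying the kmeans_reassign sketch given earlier in this section to these four individuals, starting from the arbitrary partition (12), (34), reproduces this result:

```python
import numpy as np

X = np.array([[5, 3], [-1, 1], [1, -2], [-3, -2]], dtype=float)
labels = kmeans_reassign(X, [0, 0, 1, 1])      # initial clusters (12) and (34)
print(labels)                                  # [0 1 1 1]  -> clusters (1) and (234)
for g in np.unique(labels):
    print(g, X[labels == g].mean(axis=0))      # centroids (5, 3) and (-1, -1)
```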
6.3. Optimization Methods
 Example. K-means method with K = 2 [figure]
6.3. Optimization Methods

 Example. Tibetan skull data using the K-means method (revisited)
- 32 skulls found in the south-western and eastern districts of Tibet.
- It was known that the data could be divided into two groups:
Observations #1-17 vs. observations #18-32.
- Five measurements:
Greatest length of skull (X1),
Greatest horizontal breadth of skull (X2),
Height of skull (X3),
Upper face height (X4),
Face breadth (X5)
- Ignore the a priori grouping and investigate how a cluster analysis
solution reproduces the prior partition (1-17 vs. 18-32).
6.3. Optimization Methods

 K-means method
- Rather than starting with a partition of all individuals into K preliminary
groups in Step (1), one could specify K initial centroids and then proceed to
Step (2).
- The final assignment of individuals to clusters could depend on the initial
partition or the initial selection of centroids.
- To avoid biases, randomly partition the individuals into initial groups or
randomly select initial centroids.
- To check the stability of the clustering, it is desirable to rerun the algorithm
with a new initial partition.
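One convenient way to carry out such reruns is scikit-learn's KMeans, which itself restarts the algorithm from n_init random initial configurations; a sketch (assuming scikit-learn is installed, and reusing the four-individual toy data purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[5, 3], [-1, 1], [1, -2], [-3, -2]], dtype=float)

# Rerun K-means from several different random initializations and compare.
for seed in range(3):
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X)
    print(seed, km.labels_, round(km.inertia_, 2))   # inertia_ equals tr(W)
# Identical partitions (up to relabelling) across reruns suggest a stable clustering.
```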
6.3. Optimization Methods

 K-means method
- If two or more initial centroids inadvertently lie within a single cluster,
their resulting clusters will be poorly differentiated.
- The existence of an outlier might produce at least one group with very
dispersed individuals.
- Even if the population is known to consist of K groups, data from the rarest
group may not appear in the sample. Forcing the data into K groups would
lead to nonsensical clusters.
- It is recommended not to fix the number of clusters, K, in advance.
- In cases where a single run of the algorithm requires the user to specify K,
it is always a good idea to rerun the algorithm for several choices.
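A sketch of such reruns (again with scikit-learn and the toy data X from the previous sketch; in practice one would use the full data set): the within-group criterion tr(W), reported by scikit-learn as inertia_, can be compared across choices of K.

```python
from sklearn.cluster import KMeans

for k in range(1, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
# tr(W) always decreases as K grows; a pronounced flattening after some K
# is informal evidence in favour of that number of clusters.
```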
