Lecture week 10
Unsupervised Learning: Clustering
Unsupervised Learning
Unsupervised learning differs from supervised learning in that no target variable exists.
All variables are treated as inputs, and therefore unsupervised learning is
used to find patterns in the data. This allows the data to be reduced and
segmented into its representative classes.
Unsupervised Learning
Clustering: Grouping similar data points together. For example, clustering
customers into different segments based on their purchasing behavior.
Dimensionality reduction: Reducing the number of variables in the data
while retaining essential information. Techniques like Principal Component
Analysis (PCA) are used for this purpose.
Anomaly detection: Identifying unusual or rare data points that do not
conform to the general pattern of the data.
Clustering
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics found in the data
and grouping similar data objects into clusters
Cluster Analysis
• Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
Intra-cluster distances are minimized; inter-cluster distances are maximized.
Clustering
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Applications of Clustering
Customer Segmentation in Marketing (helps marketers discover distinct groups in
their customer base, and then use this knowledge to develop targeted marketing
programs)
Image Compression
Recommendation Systems
Document Classification
Anomaly Detection in Fraud Detection
Clustering Applications
Land use
Identification of areas of similar
land use in an earth observation
database
Insurance
Identifying groups of motor
insurance policy holders with a
high average claim cost
Clustering Applications
City-planning
Identifying groups of houses
according to their house type,
value, and geographical location
Earthquake studies
Observed earthquake epicenters
should be clustered along
continent faults
Notion of a Cluster Can Be Ambiguous
Figure: the same set of points can plausibly be grouped into two, four, or six clusters; how many clusters there are is ambiguous.
Quality: What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns
Types of Clustering
Types of Clustering
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree
Partitional Clustering
Figure: the original points and a partitional clustering of them.
Hierarchical Clustering
Figure: a traditional hierarchical clustering of points p1 to p4 with its dendrogram, and a non-traditional hierarchical clustering of the same points with its dendrogram.
Hierarchical clustering
Figure: points A to E grouped into a hierarchy of nested clusters.
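A minimal sketch of building such a nested clustering and its dendrogram in Python, assuming SciPy and Matplotlib are available (the five points and their A to E labels are illustrative, not the lecture's exact data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five illustrative 2-D points; only the A-E labels echo the figure above,
# the coordinates themselves are made up.
X = np.array([[0.0, 0.0], [0.4, 0.3], [2.0, 1.0], [2.3, 1.2], [5.0, 4.0]])

Z = linkage(X, method='single')               # build the nested cluster tree
dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E'])
plt.title('Dendrogram of nested clusters')
plt.show()
```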
Measure the Quality of Clustering
Dissimilarity/Similarity metric: Similarity is expressed in terms
of a distance function, typically metric: d(i, j)
There is a separate “quality” function that measures the
“goodness” of a cluster.
The definitions of distance functions are usually very different
for interval-scaled, boolean, categorical, ordinal, and ratio variables.
Weights should be associated with different variables based on
applications and data semantics.
It is hard to define “similar enough” or “good enough”
the answer is typically highly subjective.
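For interval-scaled attributes, the distance function d(i, j) is typically Euclidean or Manhattan distance; a small illustrative computation (the two points are made up):

```python
import numpy as np

# Two data objects described by interval-scaled attributes (values are illustrative)
i = np.array([1.0, 1.5])
j = np.array([3.0, 6.0])

d_euclidean = np.linalg.norm(i - j)   # sqrt((1-3)^2 + (1.5-6)^2) ~= 4.92
d_manhattan = np.sum(np.abs(i - j))   # |1-3| + |1.5-6| = 6.5
print(d_euclidean, d_manhattan)
```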
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in
four steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the
clusters of the current partition (the centroid is
the center, i.e., mean point, of the cluster)
Assign each object to the cluster with the nearest seed point
Go back to Step 2; stop when there are no new assignments
The K-Means Clustering Method (Cont'd)
Example (K = 2):
Figure: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects and update the means again, repeating until the assignments no longer change.
The K-Means Algorithm
1. Choose a value for K, the total number of clusters to be
determined.
2. Choose K instances within the dataset at random. These are
the initial cluster centers.
3. Use simple Euclidean distance to assign the remaining
instances to their closest cluster center.
4. Use the instances in each cluster to calculate a new mean for each cluster.
5. If the new mean values are identical to the mean values of the previous iteration, the process terminates.
6. Otherwise, use the new means as cluster centers and repeat
steps 3-5.
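A minimal NumPy sketch of these six steps (the function and variable names are illustrative, and the code is not an optimized implementation):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=None):
    """Follow steps 1-6 above: random initial centers, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: choose K instances from the data set at random as initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every instance to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each cluster's mean (keep the old center if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 5: stop when the means no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers                 # Step 6: repeat with the new means
    return centers, labels

# Example: two clusters on a small made-up data set
centers, labels = k_means(np.array([[1.0, 1.5], [1, 4.5], [2, 1.5],
                                    [2, 3.5], [3, 2.5], [3, 6.0]]), k=2, seed=0)
print(centers, labels)
```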
How to Initialize Centroids
Random Points in Feature Space
Random Points From Data Set
Look For Dense Regions of Space
Space them uniformly around the feature space
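As an illustration, scikit-learn's KMeans exposes random selection from the data set and the k-means++ scheme (which spreads the initial centroids apart) through its init parameter; the data below is a random placeholder:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))   # placeholder data

# init='random' picks K observations from the data set at random;
# init='k-means++' spreads the initial centroids apart before the first iteration.
km_random = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(X)
km_plus = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)
print(km_random.inertia_, km_plus.inertia_)
```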
Working of the K-Means Algorithm

Instance #:   1     2     3     4     5     6
X:            1     1     2     2     3     3
Y:            1.5   4.5   1.5   3.5   2.5   6.0

Let's pick Instances #1 and #3 as the initial centroids.

              Distance to Centroid 1   Distance to Centroid 2
Instance #2   3.00                     3.16
Instance #4   2.24                     2.00
Instance #5   2.24                     1.41
Instance #6   4.92                     4.61

New centroids are (1, 3) and (2.5, 3.375).
Centroid Selection
Solution (1st iteration): K1 => instance #1 (1, 1.5), K2 => instance #3 (2, 1.5)

#   X   Y     Distance to C1   Distance to C2   Cluster Assigned
1   1   1.5   0                -                C1
2   1   4.5   3.00             3.16             C1
3   2   1.5   -                0                C2
4   2   3.5   2.24             2.00             C2
5   3   2.5   2.24             1.41             C2
6   3   6.0   4.92             4.61             C2

Clusters: C1 = {1, 2}, C2 = {3, 4, 5, 6}

Mean calculation for C1: X = (1 + 1)/2 = 1, Y = (1.5 + 4.5)/2 = 3, so new C1 = (1, 3)
Mean calculation for C2: X = (2 + 2 + 3 + 3)/4 = 2.5, Y = (1.5 + 3.5 + 2.5 + 6)/4 = 3.375, so new C2 = (2.5, 3.375)
Centroid Selection
Solution (2nd iteration): C1 = (1, 3), C2 = (2.5, 3.375)

#   X   Y     Distance to C1   Distance to C2   Cluster Assigned
1   1   1.5   1.50             2.40             C1
2   1   4.5   1.50             1.875            C1
3   2   1.5   1.802            1.94             C1
4   2   3.5   1.118            0.51             C2
5   3   2.5   2.06             1.007            C2
6   3   6.0   3.60             2.67             C2

Clusters: C1 = {1, 2, 3}, C2 = {4, 5, 6}

Mean calculation: new C1 = (1.33, 2.5), new C2 = (2.667, 4)
Centroid Selection
Solution (3rd iteration): C1 = (1.33, 2.5), C2 = (2.667, 4)

#   X   Y     Distance to C1   Distance to C2   Cluster Assigned
1   1   1.5   1.05             3.00             C1
2   1   4.5   2.02             1.74             C2
3   2   1.5   1.203            2.587            C1
4   2   3.5   1.203            0.833            C2
5   3   2.5   1.67             1.536            C2
6   3   6.0   3.87             2.02             C2

Clusters: C1 = {1, 3}, C2 = {2, 4, 5, 6}

Mean calculation: new C1 = (1.5, 1.5), new C2 = (2.25, 4.125)
Practice Exercise
Data points (X, Y): (1, 2), (2, 3), (3, 4), (8, 7), (9, 8), (10, 9)
K = 2; take point #1 (1, 2) as centroid 1 and point #4 (8, 7) as centroid 2.

#   X    Y   Distance to C1   Distance to C2   Cluster Assigned
1   1    2
2   2    3
3   3    4
4   8    7
5   9    8
6   10   9

Complete the table, iterate until the centroids no longer change, and compute the SSE of the final clustering.
For k = 3, apply the k-means algorithm to the following data:
Instance X-Coordinate Y-Coordinate
1 4 7
2 -1 3
3 2 -2
4 5 1
5 7 6
6 3 8
7 -5 -2
8 3 6
9 0 -3
10 -3 -1
11 4 2
12 6 0
K-Means Clustering – Details
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, Cosine similarity,
Correlation, etc.
K-means will converge for common similarity measures mentioned
above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to ‘Until relatively few points change
clusters’
Outlier removal and feature normalization are important data pre-
processing steps before applying K-Means.
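A small sketch of normalizing features before clustering, assuming scikit-learn is available (the data is made up to show features on very different scales):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Made-up data with features on very different scales (e.g. age vs. income)
X = np.array([[25, 40_000], [32, 52_000], [47, 150_000],
              [51, 160_000], [23, 38_000], [45, 145_000]])

# Normalize features so that no single feature dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)
```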
Two Different K-means Clusterings
Figure: the same set of original points clustered two ways by K-means: an optimal clustering and a sub-optimal clustering.
Importance of Choosing Initial Centroids
Figure: the cluster assignments produced by K-means over iterations 1 to 6 for one choice of initial centroids.
Importance of Choosing Initial Centroids …
Figure: the cluster assignments over iterations 1 to 5.
Importance of Choosing Initial Centroids …
Figure: a different choice of initial centroids, shown over iterations 1 to 5, leads K-means to a different final clustering.
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
For each point, the error is the distance to the centroid of its cluster; to get the SSE, we square these errors and sum them.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)
where x is a data point in cluster C_i and m_i is the representative point for cluster C_i; one can show that m_i corresponds to the center (mean) of the cluster.
Given two clusters, we can choose the one with the smallest error
One easy way to reduce SSE is to increase K, the number of clusters
A good clustering with smaller K can have a lower SSE than a poor clustering with
higher K
SSE
SSE is the total sum of the squared Euclidean distances between each data
point and the centroid of the cluster it is assigned to.
It quantifies how well the data points fit their respective clusters.
A lower SSE indicates that the points are closer to their centroids, suggesting
that the clustering is better.
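A small sketch of computing the SSE with scikit-learn, which reports it as the inertia_ attribute (reusing the six points from the earlier worked example):

```python
import numpy as np
from sklearn.cluster import KMeans

# The six points from the earlier worked example
X = np.array([[1, 1.5], [1, 4.5], [2, 1.5], [2, 3.5], [3, 2.5], [3, 6.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# scikit-learn reports the SSE as `inertia_`; recompute it by hand to check
sse = sum(np.sum((X[km.labels_ == j] - center) ** 2)
          for j, center in enumerate(km.cluster_centers_))
print(km.inertia_, sse)   # the two numbers should agree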
Determining the Right K
Elbow Method:
Plot a line chart of the SSE for a range of values of k. If the line chart looks like an
arm, then the "elbow" of the arm marks the best value of k.
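A sketch of the elbow method using scikit-learn and Matplotlib (the data here is a random placeholder; with real data the curve normally bends at a meaningful k):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('SSE')
plt.title('Elbow method')
plt.show()   # pick the k where the curve bends and flattens out (the "elbow")
```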
Limitations of K-means
K-means has problems when clusters are of differing
Sizes
Densities
Non-globular shapes
K-means has problems when the data contains outliers.
Limitations of K-means: Non-globular Shapes
Figure: the original points (two non-globular clusters) versus the result of K-means with 2 clusters.
Overcoming K-means Limitations
Figure: the original points versus the K-means clusters.
Differing Sizes
Overcoming Differing Size
Differing Density
Overcoming Differing Density
Summary K-Means Method
Strengths of K-Means:
Fast and Scalable: K-means can handle large datasets efficiently, especially with optimizations like Mini-Batch K-
means.
Easy to Understand and Implement: The algorithm’s concept is straightforward, making it a good starting point for
clustering tasks.
Works Well for Simple Data: It performs well when clusters are well-separated, have similar sizes, and exhibit
spherical shapes.
Key Limitations:
Choosing the Right k: You need to specify the number of clusters in advance, which can be tricky without domain
knowledge or appropriate techniques like the elbow method or silhouette score.
Sensitive to Initial Centroids: Poor initialization of centroids can lead to suboptimal clustering. The K-means++
initialization method can mitigate this issue.
Sensitive to Outliers: K-means can be easily influenced by outliers because it relies on the mean to define cluster
centroids. K-medoids or other robust algorithms may be better for handling outliers.
Assumes Equal Cluster Sizes, Densities, and Spherical Shapes: K-means assumes that clusters are spherical, of equal
size, and have similar densities. It may struggle with complex, non-spherical, or unevenly sized clusters.
When to Use K-Means:
When clusters are roughly spherical and have similar sizes and densities.
When you have a good idea of the number of clusters (k) or can estimate it
using methods like the elbow method.
When speed and scalability are important, especially with large datasets.
Pre-processing and Post-processing
Pre-processing
Normalize data
Eliminate outliers
Post-processing
Eliminate small clusters that may represent outliers
Split ‘loose’ clusters, i.e., clusters with relatively high SSE
Merge clusters that are ‘close’ and that have relatively low SSE
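A small sketch of one such post-processing check, computing each cluster's size and SSE to spot very small clusters (possible outliers) and "loose" clusters with a high SSE (the data is a random placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Very small clusters may just be outliers worth eliminating; clusters with a
# disproportionately large SSE are 'loose' candidates for splitting.
for j, center in enumerate(km.cluster_centers_):
    members = X[km.labels_ == j]
    print(f"cluster {j}: size={len(members)}, SSE={np.sum((members - center) ** 2):.2f}")
```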