
Lecture week 10

Unsupervised Learning: Clustering


Unsupervised Learning
 Unsupervised learning differs from supervised learning in that no target variable exists.
 All variables are treated as inputs, and therefore unsupervised learning is
used to find patterns in the data. This allows the data to be reduced and
segmented into its representative classes.
Unsupervised Learning

 Clustering: Grouping similar data points together. For example, clustering
customers into different segments based on their purchasing behavior.

 Dimensionality reduction: Reducing the number of variables in the data while
retaining essential information. Techniques like Principal Component Analysis
(PCA) are used for this purpose.

 Anomaly detection: Identifying unusual or rare data points that do not conform
to the general pattern of the data.
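As a minimal sketch of these three tasks (assuming scikit-learn is available; the synthetic data and parameter choices below are illustrative assumptions, not part of the lecture):

```python
# Illustrative sketch of the three unsupervised tasks on assumed toy data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)

# Clustering: group similar points (e.g., customer segments)
segment = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Dimensionality reduction: compress 5 features to 2 while keeping most variance
X_2d = PCA(n_components=2).fit_transform(X)

# Anomaly detection: flag points that do not fit the general pattern (-1 = anomaly)
is_outlier = IsolationForest(random_state=42).fit_predict(X)
```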
Clustering
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters

 Cluster analysis
 Finding similarities between data according to the characteristics found in the data
and grouping similar data objects into clusters
Cluster Analysis
• Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
Intra-cluster distances are minimized; inter-cluster distances are maximized.
Clustering

 Unsupervised learning: no predefined classes


 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Applications of Clustering

 Customer Segmentation in Marketing (helps marketers discover distinct groups in
their customer bases, and then use this knowledge to develop targeted marketing
programs)
 Image Compression
 Recommendation Systems
 Document Classification
 Anomaly Detection in Fraud Detection
Clustering Applications

 Land use
 Identification of areas of similar
land use in an earth observation
database

 Insurance
 Identifying groups of motor
insurance policy holders with a
high average claim cost
Clustering Applications

 City-planning
 Identifying groups of houses
according to their house type,
value, and geographical location

 Earthquake studies
 Observed earthquake epicenters should be clustered along continent faults
Notion of a Cluster can be Ambiguous

(Figure: "How many clusters?" The same set of points can be grouped into two clusters, four clusters, or six clusters.)
Quality: What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns
Types of Clustering

 Partitional Clustering
 A division of data objects into non-overlapping subsets (clusters) such that each
data object is in exactly one subset

 Hierarchical Clustering
 A set of nested clusters organized as a hierarchical tree
Partitional Clustering

(Figure: the original points and a partitional clustering of them.)

Hierarchical Clustering

(Figure: a traditional hierarchical clustering of points p1-p4 with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram.)
Hierarchical Clustering (example)

(Figure: nested clusters over five points A, B, C, D, E.)
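As a sketch of how such a hierarchy can be computed in practice (assuming SciPy and matplotlib; the five 2-D points below are made-up stand-ins for A-E):

```python
# Sketch: build a hierarchy of nested clusters and draw it as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1.0, 1.0], [1.5, 1.2], [4.0, 4.0], [4.2, 3.9], [8.0, 8.5]])  # A..E (assumed)
Z = linkage(points, method="single")          # merge tree of nested clusters
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.title("Hierarchical clustering of A-E")
plt.show()
```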
Measure the Quality of Clustering

 Dissimilarity/Similarity metric: similarity is expressed in terms of a distance
function, typically a metric d(i, j)
 There is a separate "quality" function that measures the "goodness" of a cluster.
 The definitions of distance functions are usually very different for interval-scaled,
boolean, categorical, ordinal, and ratio variables.
 Weights should be associated with different variables based on applications and
data semantics.
 It is hard to define "similar enough" or "good enough"
 the answer is typically highly subjective.
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the current partition
(the centroid is the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest seed point
 Go back to Step 2; stop when there are no more new assignments
The K-Means Clustering Method (Cont'd)

 Example (K = 2): arbitrarily choose K objects as the initial cluster centers; assign
each object to the cluster with the most similar center; update the cluster means;
reassign the objects; repeat updating and reassigning until no assignment changes.

(Figure: a sequence of scatter plots illustrating the assign / update / reassign steps.)
The K-Means Algorithm

1. Choose a value for K, the total number of clusters to be determined.
2. Choose K instances within the dataset at random. These are the initial cluster
centers.
3. Use simple Euclidean distance to assign the remaining instances to their closest
cluster center.
4. Use the instances in each cluster to calculate a new mean for each cluster.
5. If the new mean values are identical to the mean values of the previous iteration,
the process terminates.
6. Otherwise, use the new means as cluster centers and repeat steps 3-5.
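A compact NumPy sketch of these six steps (an illustrative implementation, not the lecture's reference code; function and variable names are assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means following the six steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # step 2
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each cluster mean (keep the old center if a cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: terminate when the means stop changing
        if np.allclose(new_centers, centers):
            break
        centers = new_centers                                        # step 6: repeat
    return centers, labels
```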
How to Initialize Centroids
 Random Points in Feature Space
 Random Points From Data Set
 Look For Dense Regions of Space
 Space them uniformly around the feature space

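Two of these strategies are easy to sketch (assuming a NumPy data matrix X; the other two, seeding from dense regions or spacing centroids uniformly, need extra machinery such as density estimates):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_random_in_space(X, k):
    # Random points anywhere inside the bounding box of the feature space
    lo, hi = X.min(axis=0), X.max(axis=0)
    return rng.uniform(lo, hi, size=(k, X.shape[1]))

def init_random_from_data(X, k):
    # Random points drawn from the data set itself
    return X[rng.choice(len(X), size=k, replace=False)]
```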
Working of the K-Means Algorithm

 Instance #:   1     2     3     4     5     6
 X:            1     1     2     2     3     3
 Y:            1.5   4.5   1.5   3.5   2.5   6.0

 Let's pick Instances #1 and #3 as the initial centroids.

                Distance with Centroid 1   Distance with Centroid 2
 Instance #2           3.00                       3.16
 Instance #4           2.24                       2.00
 Instance #5           2.24                       1.41
 Instance #6           4.92                       4.61

 New centroids are (1, 3) and (2.5, 3.375)
Centroid Selection
 K1 => Instance #1: (1, 1.5)
 K2 => Instance #3: (2, 1.5)

Solution

 #   X   Y     Distance with K1   Distance with K2   Cluster Assigned
 1   1   1.5        0                   -
 2   1   4.5        3                   3.16
 3   2   1.5        -                   0
 4   2   3.5        2.24                2
 5   3   2.5        2.24                1.41
 6   3   6.0        4.92                4.61


Centroid Selection: Solution (1st iteration)
 K1 => Instance #1: (1, 1.5)
 K2 => Instance #3: (2, 1.5)

 #   X   Y     Distance with C1   Distance with C2   Cluster Assigned
 1   1   1.5        0                   -                  C1
 2   1   4.5        3                   3.16               C1
 3   2   1.5        -                   0                  C2
 4   2   3.5        2.24                2                  C2
 5   3   2.5        2.24                1.41               C2
 6   3   6.0        4.92                4.61               C2

 C1 = {1, 2}                            C2 = {3, 4, 5, 6}

 Mean calculation for C1:               Mean calculation for C2:
 X = (1 + 1) / 2 = 1                    X = (2 + 2 + 3 + 3) / 4 = 2.5
 Y = (1.5 + 4.5) / 2 = 3                Y = (1.5 + 3.5 + 2.5 + 6) / 4 = 3.375
 New C1 = (1, 3)                        New C2 = (2.5, 3.375)
Centroid Selection: Solution (2nd iteration)
 New C1 = (1, 3)
 New C2 = (2.5, 3.375)

 #   X   Y     Distance with C1   Distance with C2   Cluster Assigned
 1   1   1.5        1.5                 2.40               C1
 2   1   4.5        1.5                 1.875              C1
 3   2   1.5        1.802               1.94               C1
 4   2   3.5        1.118               0.51               C2
 5   3   2.5        2.06                1.007              C2
 6   3   6.0        3.60                2.67               C2

 C1 = {1, 2, 3}                         C2 = {4, 5, 6}

 Mean calculation for C1:               Mean calculation for C2:
 New C1 = (1.33, 2.5)                   New C2 = (2.667, 4)
Centroid Selection: Solution (3rd iteration)
 New C1 = (1.33, 2.5)
 New C2 = (2.667, 4)

 #   X   Y     Distance with C1   Distance with C2   Cluster Assigned
 1   1   1.5        1.05                3                  C1
 2   1   4.5        2.02                1.74               C2
 3   2   1.5        1.203               2.587              C1
 4   2   3.5        1.203               0.833              C2
 5   3   2.5        1.67                1.536              C2
 6   3   6.0        3.87                2.02               C2

 C1 = {1, 3}                            C2 = {2, 4, 5, 6}

 Mean calculation for C1:               Mean calculation for C2:
 New C1 = (1.5, 1.5)                    New C2 = (2.25, 4.125)
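The hand computation above can be checked with a few lines of NumPy (a sketch; the loop simply repeats the assign-and-update steps, starting from instances #1 and #3 as centroids):

```python
import numpy as np

# The six instances from the example; initial centroids are instances #1 and #3.
X = np.array([[1, 1.5], [1, 4.5], [2, 1.5], [2, 3.5], [3, 2.5], [3, 6.0]])
centers = X[[0, 2]].astype(float)

for it in range(1, 4):                                    # the three iterations above
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)                             # nearest-centroid assignment
    centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {it}: clusters = {labels + 1}, new centers = {centers.round(3)}")
```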
Practice Exercise

 K = 2. Data points (X, Y): (1, 2), (2, 3), (3, 4), (8, 7), (9, 8), (10, 9).
 Take (1, 2) as the initial Centroid 1 and (9, 8) as the initial Centroid 2.
 Fill in the distance and cluster-assignment columns, then compute the SSE of the
resulting clustering.

 #   X    Y    Distance with C1   Distance with C2   Cluster Assigned
 1   1    2
 2   2    3
 3   3    4
 4   8    7
 5   9    8
 6   10   9
For k = 3, apply the k-means algorithm to the following data:

 Instance   X-Coordinate   Y-Coordinate
 1           4              7
 2          -1              3
 3           2             -2
 4           5              1
 5           7              6
 6           3              8
 7          -5             -2
 8           3              6
 9           0             -3
 10         -3             -1
 11          4              2
 12          6              0
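One way to check your answer after working the exercise by hand (a sketch assuming scikit-learn; the exercise does not fix the initial centroids, so a hand-worked result may differ from any particular random seed):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[4, 7], [-1, 3], [2, -2], [5, 1], [7, 6], [3, 8],
              [-5, -2], [3, 6], [0, -3], [-3, -1], [4, 2], [6, 0]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
print("cluster of each instance:", km.labels_)
print("SSE:", km.inertia_)            # inertia_ is exactly the SSE defined later
```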
K-Means Clustering – Details
 Initial centroids are often chosen randomly.
 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the cluster.
 ‘Closeness’ is measured by Euclidean distance, Cosine similarity,
Correlation, etc.
 K-means will converge for common similarity measures mentioned
above.
 Most of the convergence happens in the first few iterations.
 Often the stopping condition is changed to ‘Until relatively few points change
clusters’
 Outlier removal and feature normalization are important data pre-
processing steps before applying K-Means.
Two Different K-means Clusterings

(Figure: the same original points clustered two ways: an optimal clustering and a sub-optimal clustering.)
Importance of Choosing Initial Centroids

(Figure: iterations 1-6 of K-means on the same data, showing the clusters converging from one choice of initial centroids.)
Importance of Choosing Initial Centroids

(Figure: iterations 1-5 of K-means for a different choice of initial centroids, leading to a different final clustering.)
Evaluating K-means Clusters

 The most common measure is the Sum of Squared Error (SSE)
 For each point, the error is the distance to its nearest cluster center
 To get SSE, we square these errors and sum them:

   SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

 x is a data point in cluster Ci and mi is the representative point for cluster Ci
 It can be shown that mi corresponds to the center (mean) of the cluster
 Given two clusterings, we can choose the one with the smallest error
 One easy way to reduce SSE is to increase K, the number of clusters
 A good clustering with smaller K can have a lower SSE than a poor clustering with
higher K
SSE

 SSE is the total sum of the squared Euclidean distances between each data
point and the centroid of the cluster it is assigned to.
 It quantifies how well the data points fit their respective clusters.
 A lower SSE indicates that the points are closer to their centroids, suggesting
that the clustering is better.
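A direct translation of this definition into code (a sketch; X, labels, and centers are assumed to come from a fitted clustering, such as the kmeans function sketched earlier):

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centers))
```

scikit-learn's KMeans exposes the same quantity as the inertia_ attribute of a fitted model.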
Determining the Right K
 Elbow Method:
 Plot a line chart of the SSE for a range of values of k. If the line chart looks like an
arm, then the "elbow" of the arm is the value of k that is best.
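A minimal elbow-method sketch (assuming scikit-learn and matplotlib; the toy data and the range of k values are arbitrary assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # toy data (assumed)

ks = range(1, 11)
sse_values = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse_values, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("SSE")
plt.title("Elbow method: pick k at the bend of the curve")
plt.show()
```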
Limitations of K-means

 K-means has problems when clusters are of differing


 Sizes
 Densities
 Non-globular shapes

 K-means has problems when the data contains outliers.

Limitations of K-means: Non-globular Shapes
(Figure: the original points vs. the two clusters K-means finds.)

Overcoming K-means Limitations
(Figure: the original points vs. the corresponding K-means clusters.)

Differing Sizes
(Figure)

Overcoming Differing Size
(Figure)

Differing Density
(Figure)

Overcoming Differing Density
(Figure)
Summary K-Means Method
 Strengths of K-Means:
 Fast and Scalable: K-means can handle large datasets efficiently, especially with optimizations like Mini-Batch K-
means.

 Easy to Understand and Implement: The algorithm’s concept is straightforward, making it a good starting point for
clustering tasks.

 Works Well for Simple Data: It performs well when clusters are well-separated, have similar sizes, and exhibit
spherical shapes.

 Key Limitations:
 Choosing the Right k: You need to specify the number of clusters in advance, which can be tricky without domain
knowledge or appropriate techniques like the elbow method or silhouette score.
 Sensitive to Initial Centroids: Poor initialization of centroids can lead to suboptimal clustering. The K-means++
initialization method can mitigate this issue.
 Sensitive to Outliers: K-means can be easily influenced by outliers because it relies on the mean to define cluster
centroids. K-medoids or other robust algorithms may be better for handling outliers.
 Assumes Equal Cluster Sizes, Densities, and Spherical Shapes: K-means assumes that clusters are spherical, of equal
size, and have similar densities. It may struggle with complex, non-spherical, or unevenly sized clusters.
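For reference, a sketch of how the mitigations mentioned above (k-means++ initialization, silhouette score for judging k) are typically invoked in scikit-learn; the toy data is an assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)     # toy data (assumed)

# k-means++ spreads the initial centroids out, reducing sensitivity to initialization
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

# Silhouette score: one way to judge the choice of k (closer to 1 is better)
print(silhouette_score(X, km.labels_))
```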
When to Use K-Means:

 When clusters are roughly spherical and have similar sizes and densities.
 When you have a good idea of the number of clusters (k) or can estimate it
using methods like the elbow method.
 When speed and scalability are important, especially with large datasets.
Pre-processing and Post-processing

 Pre-processing
 Normalize data
 Eliminate outliers

 Post-processing
 Eliminate small clusters that may represent outliers
 Split ‘loose’ clusters, i.e., clusters with relatively high SSE
 Merge clusters that are ‘close’ and that have relatively low SSE

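A sketch of the pre-processing side of this pipeline (assuming scikit-learn; the toy data and the crude distance-based outlier filter are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)     # toy data (assumed)

# Pre-processing: normalize features so no single feature dominates the distance
X_scaled = StandardScaler().fit_transform(X)

# Crude outlier elimination (assumption): drop the 1% of points farthest from the overall mean
dist_to_mean = np.linalg.norm(X_scaled - X_scaled.mean(axis=0), axis=1)
X_clean = X_scaled[dist_to_mean < np.percentile(dist_to_mean, 99)]

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_clean)
```

Post-processing steps such as splitting loose clusters or merging close ones would then operate on the resulting labels and per-cluster SSE.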
