PRACTICAL - 8
Aim:
To identify/prepare a dataset and implement the K-means clustering algorithm in
Python.
To visualize the resulting clusters and centroids.
To analyze the clustering result and study the effect of varying the number of clusters
K.
Theory:
Clustering is an unsupervised learning task that groups data points so that points in
the same group (cluster) are more similar to each other than to those in other groups.
K-means seeks to partition n observations into K clusters {C1, …, CK} by
minimizing the within-cluster sum of squares (WCSS):
WCSS = Σ (i = 1 to K) Σ (x ∈ Ci) ||x − μi||²
where μi is the centroid of cluster Ci.
Algorithm steps:
1. Initialize K centroids (randomly select K points).
2. Assign each data point to the nearest centroid.
3. Update each centroid as the mean of points assigned to it.
4. Repeat steps 2–3 until centroids move less than a tolerance or max iterations
reached.
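A minimal from-scratch sketch of these four steps in NumPy (the function name, tolerance, and seed below are illustrative choices, not part of the practical's own code):

import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-4, seed=42):
    # Step 1: initialize K centroids by picking K random data points
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        # Step 4: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # WCSS: sum of squared distances of each point to its own centroid
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss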
Dataset
Generated synthetically using sklearn.datasets.make_blobs:
o n_samples = 300
o centers = 4
o random_state = 42
This produces four well-separated Gaussian clusters in 2D.
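For reference, a minimal snippet generating such a dataset with these parameters (variable names are illustrative):

from sklearn.datasets import make_blobs

# 300 points drawn from 4 well-separated 2-D Gaussian blobs
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
print(X.shape)   # (300, 2)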
Program:
Step 1:
Step 2:
Step 3:
Step 4:
Step 5:
Step 6:
Step 7:
Step 8:
Step 9:
Step 10:
Step 11:
Step 12:
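A minimal end-to-end sketch of such a workflow, assuming scikit-learn's KMeans and matplotlib (the exact code behind the numbered steps may differ):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate the synthetic dataset described above
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-means with K = 4 and obtain cluster labels and centroids
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

# Plot points coloured by cluster, with centroids as red X's
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=30)
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X", s=200)
plt.title("K-Means Clustering Result")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()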
Output:
The resulting plot, titled "K-Means Clustering Result", shows the four clusters in different colors and their centroids marked with red X's.
Analysis of Results
1. Cluster quality
o Clusters are compact and well-separated, reflecting the way the data were
generated.
o Centroids lie near the “centre” of each blob.
2. Effect of changing K
o Under-clustering (K<4):
e.g. K=3 merges two true blobs into one cluster → increased WCSS.
o Over-clustering (K>4):
e.g. K=5 splits a true blob into two smaller clusters → may overfit noise.
o Use the Elbow method (plot WCSS vs. K) to pick the "elbow" point where
adding another cluster yields diminishing returns (see the sketch after this list).
3. Suggested extension
o Compute and plot WCSS for K = 1 to K = 8 and identify the elbow, as sketched below.
o Compute silhouette scores for different K to assess cluster separation.
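A sketch of this suggested extension, assuming the same make_blobs dataset; inertia_ is scikit-learn's name for the WCSS, and silhouette_score requires at least two clusters:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss, sil = [], {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                 # WCSS for the elbow plot
    if k >= 2:                               # silhouette needs K >= 2
        sil[k] = silhouette_score(X, km.labels_)

# Elbow plot: look for the K where the curve bends sharply
plt.plot(range(1, 9), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()

print("Silhouette scores:", sil)             # the highest score suggests the best K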
Conclusion
Successful implementation of K-means and visualization of four natural clusters in the
dataset.
Proper choice of K is crucial: too small merges distinct groups; too large over-splits.
Elbow and silhouette analyses help select an optimal K.