Dr. Rafiq Zakaria Campus
Maulana Azad College of Arts, Science & Commerce
P.G. Dept. of Computer Science
M.Sc. III Semester
Data Mining and Warehousing
Practical 03
Date : 9th October 2024
Aim : To study Data Clustering using Python.
Description:
Data clustering is an unsupervised learning technique that groups similar data points together
without using predefined labels. This practical walks through clustering in Python with the
`scikit-learn` library, using the K-Means algorithm.
1. Install Required Libraries
Make sure you have the necessary libraries installed:
pip install numpy pandas matplotlib scikit-learn
2. Import Libraries
Start by importing the required libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
3. Generate Sample Data
For this example, we’ll create synthetic data using `make_blobs`:
# Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
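Before plotting, it can help to check what `make_blobs` returned. The short check below is a minimal sketch that regenerates the same data and inspects it; the two-column shape relies on the `make_blobs` default of two features per sample.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as above: 300 points drawn around 4 centers
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

print(X.shape)       # (300, 2): one row per point, one column per feature
print(y_true.shape)  # (300,): the generating center of each point

# Number of points generated around each center
blob_ids, blob_counts = np.unique(y_true, return_counts=True)
print(blob_ids, blob_counts)
```

Note that `y_true` is known only because the data is synthetic. K-Means itself never sees these labels.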
4. Visualize the Data
Visualizing the data can help understand the structure before clustering:
plt.scatter(X[:, 0], X[:, 1], s=30)
plt.title('Sample Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
5. Perform K-Means Clustering
Now, let's apply the K-Means clustering algorithm:
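# Choose the number of clusters
k = 4
# random_state makes the run reproducible; n_init=10 keeps the best of 10 initializations
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
kmeans.fit(X)
# Get the cluster label assigned to each point
y_kmeans = kmeans.predict(X)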
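To see what K-Means does internally, the sketch below implements the basic assign-and-update loop (Lloyd's algorithm) in plain NumPy. The helper name `simple_kmeans` and the toy points are illustrative assumptions and not part of `scikit-learn`, whose `KMeans` adds smarter initialization and multiple restarts on top of this idea.

```python
import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    """Hypothetical helper: plain Lloyd's algorithm, for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (a center with no points keeps its previous position)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignments are stable
        centers = new_centers
    return labels, centers

# Two well-separated groups of toy points
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centers = simple_kmeans(toy, k=2)
print(toy_labels)
print(toy_centers)
```

The first three points should share one label and the last three the other. Which group is called 0 and which is called 1 depends on the random starting centers.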
6. Visualize the Clusters
You can visualize the resulting clusters:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=30, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
7. Evaluate the Clustering
Evaluate the clustering using the silhouette score, which ranges from -1 to 1. Values close to 1
indicate compact, well-separated clusters:
score = silhouette_score(X, y_kmeans)
print(f'Silhouette Score: {score}')
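To make the score less of a black box, the sketch below computes it by hand on four toy points. For each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points of the nearest other cluster, and the point's score is (b - a) / max(a, b). The helper name `silhouette_manual` and the toy data are assumptions made for illustration.

```python
import numpy as np

def silhouette_manual(X, labels):
    """Hypothetical helper: mean silhouette coefficient computed by hand."""
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # a point is not compared with itself
        if not same.any():
            scores.append(0.0)  # convention: a single-point cluster scores 0
            continue
        a = dists[i, same].mean()
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight pairs of points that lie far apart on a line
toy = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_labels = np.array([0, 0, 1, 1])
score = silhouette_manual(toy, toy_labels)
print(round(score, 4))  # 0.8997
```

The value is close to 1 because each pair is tight (a = 1) while the other pair is far away (b is about 10).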
8. Choosing the Right Number of Clusters
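To choose the number of clusters, you can use the Elbow method: plot the inertia (the within-cluster
sum of squared distances) for a range of K values and pick the K at which the curve bends, beyond
which adding clusters brings little further improvement: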
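inertia = []
K = range(1, 11)
for n in K:
    # Use a separate model so the fitted `kmeans` from step 5 is not overwritten
    model = KMeans(n_clusters=n, n_init=10, random_state=0)
    model.fit(X)
    # inertia_ is the within-cluster sum of squared distances
    inertia.append(model.inertia_)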
plt.figure(figsize=(8, 4))
plt.plot(K, inertia, 'bx-')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal K')
plt.show()
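The quantity plotted in the Elbow curve can be reproduced by hand. The sketch below computes inertia for a toy dataset with one and with two clusters; the helper name `inertia_manual`, the toy points, and the hand-picked centers are assumptions made for illustration.

```python
import numpy as np

def inertia_manual(X, labels, centers):
    """Hypothetical helper: sum of squared distances to the assigned centers."""
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))

toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 1: a single center at the overall mean (5, 1)
labels_1 = np.zeros(4, dtype=int)
centers_1 = toy.mean(axis=0, keepdims=True)
inertia_k1 = inertia_manual(toy, labels_1, centers_1)
print(inertia_k1)  # 104.0

# K = 2: one center per group, at (0, 1) and (10, 1)
labels_2 = np.array([0, 0, 1, 1])
centers_2 = np.array([[0.0, 1.0], [10.0, 1.0]])
inertia_k2 = inertia_manual(toy, labels_2, centers_2)
print(inertia_k2)  # 4.0
```

The sharp drop from 104 to 4 is the kind of bend the Elbow plot looks for. Inertia keeps decreasing as K grows, which is why the bend, not the minimum, is used to pick K.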
Conclusion
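In this practical, you learned how to perform K-Means clustering in Python with `scikit-learn`:
generating sample data, fitting the model, visualizing the clusters, evaluating them with the
silhouette score, and choosing K with the Elbow method. Adjusting parameters such as the number of
clusters, and preprocessing your data (for example, by scaling the features), can yield better
clustering results.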
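As one example of preprocessing, the sketch below standardizes the features before K-Means. The data is hypothetical: two informative features separate two groups, while a third feature is pure noise on a much larger scale. Because K-Means relies on Euclidean distance, the large-scale feature would be expected to dominate unless the features are scaled first. The helper `agreement` is also an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)  # hypothetical ground-truth groups
X_raw = np.column_stack([
    group * 4.0 + rng.normal(0.0, 0.3, 200),  # informative, small scale
    group * 4.0 + rng.normal(0.0, 0.3, 200),  # informative, small scale
    rng.normal(0.0, 1000.0, 200),             # noise, large scale
])

def agreement(labels, truth):
    """Fraction of points grouped as in `truth`, ignoring which label is 0 or 1."""
    match = np.mean(labels == truth)
    return max(match, 1.0 - match)

# K-Means on the raw features
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_raw)

# K-Means after scaling each feature to zero mean and unit variance
scaled_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
scaled_labels = scaled_model.fit_predict(X_raw)

raw_agreement = agreement(raw_labels, group)
scaled_agreement = agreement(scaled_labels, group)
print(f'Agreement without scaling: {raw_agreement:.2f}')
print(f'Agreement with scaling:    {scaled_agreement:.2f}')
```

Scaling is not always the right choice. When all features already share the same unit, as in the `make_blobs` data used in this practical, it may make little difference.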
Prepared by Khan Shagufta (Assistant Professor, P.G. Dept. of Computer Science)