Exp No: 6 Build a Clustering Model and Calculate the Error Measures and Bias
Aim:
To build clustering models and calculate error measures.
Theory:
Clustering:
Clustering is an unsupervised machine learning technique used to group similar data points
into clusters based on their characteristics.
The goal is to identify patterns and organize data without using pre-labeled outputs.
Data points within a cluster are more similar to each other than to those in other clusters.
K-Means Clustering
K-Means is a centroid-based clustering algorithm that partitions data into K clusters by
minimizing the distance between data points and their nearest cluster center.
The optimal value of K is determined using the Elbow Method, which plots inertia (sum of
squared distances to the nearest cluster center) against the number of clusters.
The point where the curve bends (the "elbow," where the rate of decrease slows sharply) is
taken as the optimal number of clusters.
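For reference, the inertia that K-Means minimizes can be written as (a standard formulation):

J = \sum_{i=1}^{n} \min_{1 \le k \le K} \lVert x_i - \mu_k \rVert^2

where x_i is a data point and \mu_k is the centroid of cluster k; the Elbow Method plots J against K.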
Hierarchical Clustering
Hierarchical clustering builds a tree-like structure (dendrogram) to represent data groupings
through agglomerative (bottom-up) or divisive (top-down) approaches.
Agglomerative Clustering starts with each point as an individual cluster and merges them
step by step.
The optimal number of clusters is found by cutting the dendrogram at the appropriate height
where clusters are well separated.
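The cut can also be performed programmatically. A minimal sketch with SciPy's fcluster (the random data and the cut height t=10 are illustrative placeholders, not values from this experiment):

# Minimal sketch: cut a Ward-linkage dendrogram at a chosen height.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 3)                         # stand-in for the scaled feature matrix
Z = linkage(X, method='ward')                     # bottom-up (agglomerative) merge tree
labels = fcluster(Z, t=10, criterion='distance')  # cluster ids after cutting at height 10
print(np.unique(labels))                          # resulting cluster labels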
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based algorithm that groups points in dense regions and flags points in
low-density regions as outliers (noise).
It requires two key parameters:
o ε (epsilon): the maximum distance between two points for them to be considered neighbors.
o min_samples: the minimum number of points within ε of a point for it to count as a core point and seed a cluster.
DBSCAN is effective in detecting arbitrary-shaped clusters and outliers but is sensitive to
parameter selection.
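A common heuristic for choosing ε is the k-distance plot: sort each point's distance to its k-th nearest neighbor and read ε off the knee of the curve. A minimal sketch (the random data is a stand-in for the scaled features, and taking k = min_samples is a convention, not a rule):

# Minimal sketch: k-distance plot to guide the choice of eps for DBSCAN.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(200, 3)              # stand-in for the scaled feature matrix
k = 5                                   # conventionally set k = min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nn.kneighbors(X)             # distances to the k nearest neighbors (self included)
plt.plot(np.sort(dists[:, -1]))         # sorted distance to the k-th neighbor
plt.ylabel("distance to k-th nearest neighbor")
plt.title("k-distance plot: pick eps near the knee")
plt.show()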
Spectral Clustering
Spectral clustering is a graph-based method that converts the data into a similarity graph and
uses the eigenvectors of the graph Laplacian to embed the points in a lower-dimensional space
before applying a standard algorithm such as K-Means.
It is suitable for non-linearly separable and complex cluster structures.
It requires a similarity measure (e.g., a Gaussian (RBF) kernel or k-nearest neighbors) to build
the graph.
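A minimal sketch of both similarity-graph choices in scikit-learn (the random data, gamma, and n_neighbors values are illustrative assumptions):

# Minimal sketch: spectral clustering with the two affinity options named above.
import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.rand(100, 2)   # stand-in for the scaled feature matrix
# Gaussian (RBF) kernel similarity graph
labels_rbf = SpectralClustering(n_clusters=3, affinity='rbf', gamma=1.0,
                                random_state=42).fit_predict(X)
# k-nearest-neighbors similarity graph
labels_knn = SpectralClustering(n_clusters=3, affinity='nearest_neighbors',
                                n_neighbors=10, random_state=42).fit_predict(X)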
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from scipy.cluster.hierarchy import dendrogram, linkage
# Load dataset
df = pd.read_csv("/mnt/data/updated_pollution_dataset.csv")
# Drop non-numeric columns
df_numeric = df.select_dtypes(include=[np.number])
# Handle missing values (if any)
df_numeric = df_numeric.fillna(df_numeric.mean())
# Normalize data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_numeric)
# Detect outliers using Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
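# fit_predict returns +1 for inliers and -1 for outliers; contamination=0.1 flags ~10% of points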
outlier_labels = iso_forest.fit_predict(df_scaled)
df['Outlier'] = outlier_labels
# Determine optimal clusters using Elbow Method
inertia = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(df_scaled).inertia_
           for k in range(1, 11)]
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal k")
plt.show()
# Hierarchical Clustering Dendrogram
plt.figure(figsize=(10, 5))
dendrogram(linkage(df_scaled, method='ward'))
plt.title("Dendrogram for Hierarchical Clustering")
plt.show()
# Apply K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(df_scaled)
# Apply Hierarchical Clustering
hierarchical = AgglomerativeClustering(n_clusters=3)
hierarchical_labels = hierarchical.fit_predict(df_scaled)
# Apply DBSCAN Clustering
dbscan = DBSCAN(eps=1.5, min_samples=5)
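# fit_predict labels points in low-density regions as -1 (noise)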
dbscan_labels = dbscan.fit_predict(df_scaled)
# PCA for visualization
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)
# Apply Spectral Clustering
spectral = SpectralClustering(n_clusters=3, random_state=42, affinity='nearest_neighbors')
spectral_labels = spectral.fit_predict(df_scaled)
# Visualizing Spectral Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=spectral_labels, palette='coolwarm', edgecolor='k')
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("Spectral Clusters Visualization using PCA")
plt.legend(title="Clusters")
plt.show()
# Visualizing K-Means Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=kmeans_labels, palette='viridis', edgecolor='k')
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("KMeans Clusters Visualization using PCA")
plt.legend(title="Clusters")
plt.show()
# Visualizing Outliers
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=outlier_labels,
                palette={1: 'blue', -1: 'red'}, edgecolor='k')
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("Outlier Detection using Isolation Forest")
plt.legend(title="Outlier Status", labels=["Normal", "Outlier"])
plt.show()
# Visualizing DBSCAN Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=dbscan_labels, palette='Set1', edgecolor='k')
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("DBSCAN Clusters Visualization using PCA")
plt.legend(title="Clusters")
plt.show()
# Add cluster labels to original data
df['KMeans_Cluster'] = kmeans_labels
df['Hierarchical_Cluster'] = hierarchical_labels
df['DBSCAN_Cluster'] = dbscan_labels
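To cover the error-measures part of the aim, internal validity indices can be computed for each labeling. A sketch using scikit-learn's metrics (excluding DBSCAN's noise points, labeled -1, before scoring is an assumption about how to treat them):

# Error measures: internal cluster-validity indices for each clustering.
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

for name, labels in [("KMeans", kmeans_labels),
                     ("Hierarchical", hierarchical_labels),
                     ("Spectral", spectral_labels)]:
    print(f"{name}: silhouette={silhouette_score(df_scaled, labels):.3f}, "
          f"Davies-Bouldin={davies_bouldin_score(df_scaled, labels):.3f}, "
          f"Calinski-Harabasz={calinski_harabasz_score(df_scaled, labels):.1f}")

# DBSCAN: score only non-noise points, and only if more than one cluster remains
mask = dbscan_labels != -1
if len(set(dbscan_labels[mask])) > 1:
    print(f"DBSCAN (noise removed): "
          f"silhouette={silhouette_score(df_scaled[mask], dbscan_labels[mask]):.3f}")

Higher silhouette and Calinski-Harabasz values and lower Davies-Bouldin values indicate better-separated clusters.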
Screenshots:
Inference:
Outlier Detection: Approximately 10% of the points were flagged as outliers (consistent with the
contamination=0.1 setting), likely representing extreme pollution levels or rare environmental conditions.
Optimal Clusters: Both the Elbow Method and the dendrogram suggest 3 clusters, indicating three
distinct air quality profiles.
K-Means & Hierarchical Clustering: Both methods identified low, moderate, and high pollution
groups, with K-Means providing clearer separation.
DBSCAN Clustering: Successfully captured irregular patterns and outliers, useful for detecting rare
pollution spikes.
Spectral Clustering: Effectively identified complex, non-linear pollution patterns, revealing subtle
variations in air quality levels across different regions.
The dataset reveals three primary pollution categories, with outliers representing unusual air quality
events or anomalies.
RESULT:
Thus, the clustering models were built and the error measures were calculated.