
Exp 6

The document outlines the process of building a clustering model using various algorithms such as K-Means, Hierarchical, DBSCAN, and Spectral Clustering to analyze air quality data. It includes steps for data preprocessing, determining optimal clusters using the Elbow Method, and visualizing results through PCA. The findings indicate three distinct pollution profiles and identify approximately 10% of the data as outliers, highlighting the model's effectiveness in detecting unusual air quality events.

Uploaded by

jemimam278
Copyright
© All Rights Reserved

Exp No: 6 Build a Clustering model and calculate the Error measures and Bias

Aim:

To build a clustering model and calculate error measures.

Theory:

Clustering:

• Clustering is an unsupervised machine learning technique that groups similar data points into clusters based on their characteristics.

• The goal is to identify patterns and organize data without using pre-labeled outputs.

• Data points within a cluster are more similar to each other than to those in other clusters.

K-Means Clustering

• K-Means is a centroid-based clustering algorithm that partitions data into K clusters by minimizing the sum of squared distances between data points and their nearest cluster center.

• A suitable value of K can be estimated with the Elbow Method, which plots inertia (sum of squared distances to the nearest cluster center) against the number of clusters.

• The point where the inertia curve bends (where the rate of decrease slows) is taken as the optimal number of clusters.
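
As an illustration of the inertia quantity used by the Elbow Method, the following is a minimal sketch that computes it by hand with NumPy. The function name inertia and the toy points and centers are hypothetical and chosen only so the arithmetic is easy to follow; they are not part of the experiment's dataset.

```python
import numpy as np

def inertia(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_points, n_centers)
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes the squared distance to its closest center
    return sq_dist.min(axis=1).sum()

# Hypothetical toy data: two tight pairs of points far apart
toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

# One center in the middle versus one center per pair
inertia_k1 = inertia(toy_points, [[5.0, 1.0]])
inertia_k2 = inertia(toy_points, [[0.0, 1.0], [10.0, 1.0]])

print(inertia_k1, inertia_k2)
```

Going from one center to two drops the inertia sharply here, which is the kind of bend the elbow plot is meant to reveal; adding further centers to this toy set would reduce it only slightly.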

Hierarchical Clustering

• Hierarchical clustering builds a tree-like structure (dendrogram) to represent data groupings through agglomerative (bottom-up) or divisive (top-down) approaches.

• Agglomerative Clustering starts with each point as an individual cluster and merges them step by step.

• The optimal number of clusters is found by cutting the dendrogram at the appropriate height where clusters are well separated.
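
Cutting the dendrogram can also be done programmatically. The sketch below uses SciPy's fcluster on hypothetical synthetic data (three well-separated blobs); the cut height of 10.0 is an assumption that suits this toy data only and would need to be read off the dendrogram for real data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three compact blobs of 20 points each
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(20, 2)),
])

# Build the merge tree with Ward linkage (the same method the dendrogram uses)
Z = linkage(toy, method="ward")

# Option 1: ask for a fixed number of clusters
labels_by_count = fcluster(Z, t=3, criterion="maxclust")

# Option 2: cut the tree at a chosen height (assumed value for this toy data)
labels_by_height = fcluster(Z, t=10.0, criterion="distance")

print(np.unique(labels_by_count), np.unique(labels_by_height))
```

Both options describe the same cut when the chosen height falls in the gap between the last within-blob merge and the first between-blob merge.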

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

• DBSCAN is a density-based algorithm that groups dense regions and identifies points in low-density regions as outliers.

• It requires two key parameters:

o ε (epsilon): Maximum distance between two points to be considered neighbors.

o min_samples: Minimum number of points (including the point itself) that must lie within ε of a point for it to count as a core point of a cluster.

• DBSCAN is effective in detecting arbitrary-shaped clusters and outliers but is sensitive to parameter selection.
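
To make the roles of ε and min_samples concrete, here is a minimal NumPy sketch of the core-point test only, not a full DBSCAN implementation. The function name core_point_mask and the toy coordinates are hypothetical.

```python
import numpy as np

def core_point_mask(points, eps, min_samples):
    """Return True for each point that has at least min_samples points
    (itself included) within distance eps."""
    points = np.asarray(points, dtype=float)
    # Full pairwise Euclidean distance matrix
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Count neighbors within eps; the diagonal (distance 0) counts the point itself
    neighbor_counts = (dist <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

# Hypothetical toy data: four points packed together and one far away
toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]

mask_small_eps = core_point_mask(toy, eps=0.2, min_samples=4)
mask_large_eps = core_point_mask(toy, eps=10.0, min_samples=4)

print(mask_small_eps, mask_large_eps)
```

With the small ε the isolated point has no neighbors and cannot be a core point, so DBSCAN would treat it as noise; with the very large ε every point becomes a core point and the outlier is absorbed, which illustrates the sensitivity to parameter selection.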

Spectral Clustering

• Spectral clustering is a graph-based method that converts data into a similarity graph and uses the eigenvectors of the graph Laplacian to embed the data in a lower-dimensional space before applying clustering algorithms like K-Means.

• It is suitable for non-linearly separable and complex cluster structures.

• Requires defining a similarity measure (e.g., Gaussian kernel or K-nearest neighbors) to build the graph.
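
The embedding step can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: it uses a Gaussian kernel with an arbitrary gamma of 1.0 and the symmetric normalized Laplacian, whereas the code in this experiment relies on scikit-learn's SpectralClustering with a nearest-neighbors graph. The function name spectral_embedding and the toy points are hypothetical.

```python
import numpy as np

def spectral_embedding(points, n_components, gamma=1.0):
    """Row-normalized eigenvector embedding from a Gaussian similarity graph."""
    points = np.asarray(points, dtype=float)
    # Gaussian (RBF) similarity between every pair of points
    sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    affinity = np.exp(-gamma * sq_dist)
    np.fill_diagonal(affinity, 0.0)  # no self-loops in the graph
    # Symmetric normalized Laplacian: L = I - D^(-1/2) A D^(-1/2)
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(points)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the smallest ones
    _, eigvecs = np.linalg.eigh(laplacian)
    embedding = eigvecs[:, :n_components]
    # Normalize each row to unit length before handing the rows to K-Means
    return embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

# Hypothetical toy data: two groups of three points, far apart
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
embedded = spectral_embedding(toy, n_components=2)

# Cosine similarity between embedded rows: about 1 within a group, about 0 across
similarity = embedded @ embedded.T
print(np.round(similarity, 3))
```

In the embedded space the two groups collapse onto two nearly orthogonal directions, so a simple algorithm such as K-Means can separate them even when the original shapes are not linearly separable.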

Code:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

from sklearn.decomposition import PCA

from sklearn.ensemble import IsolationForest

from scipy.cluster.hierarchy import dendrogram, linkage

# Load dataset

df = pd.read_csv("/mnt/data/updated_pollution_dataset.csv")

# Drop non-numeric columns

df_numeric = df.select_dtypes(include=[np.number])

# Handle missing values (if any)

df_numeric = df_numeric.fillna(df_numeric.mean())

# Normalize data

scaler = StandardScaler()

df_scaled = scaler.fit_transform(df_numeric)

# Detect outliers using Isolation Forest

iso_forest = IsolationForest(contamination=0.1, random_state=42)


outlier_labels = iso_forest.fit_predict(df_scaled)

df['Outlier'] = outlier_labels

# Determine optimal clusters using Elbow Method

inertia = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(df_scaled).inertia_
    for k in range(1, 11)
]

plt.plot(range(1, 11), inertia, marker='o')

plt.xlabel("Number of Clusters")

plt.ylabel("Inertia")

plt.title("Elbow Method for Optimal k")

plt.show()

# Hierarchical Clustering Dendrogram

plt.figure(figsize=(10, 5))

dendrogram(linkage(df_scaled, method='ward'))

plt.title("Dendrogram for Hierarchical Clustering")

plt.show()

# Apply K-Means Clustering

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)

kmeans_labels = kmeans.fit_predict(df_scaled)

# Apply Hierarchical Clustering

hierarchical = AgglomerativeClustering(n_clusters=3)

hierarchical_labels = hierarchical.fit_predict(df_scaled)

# Apply DBSCAN Clustering

dbscan = DBSCAN(eps=1.5, min_samples=5)

dbscan_labels = dbscan.fit_predict(df_scaled)
# PCA for visualization

pca = PCA(n_components=2)

df_pca = pca.fit_transform(df_scaled)

from sklearn.cluster import SpectralClustering

# Apply Spectral Clustering

spectral = SpectralClustering(n_clusters=3, random_state=42, affinity='nearest_neighbors')

spectral_labels = spectral.fit_predict(df_scaled)

# Visualizing Spectral Clusters

plt.figure(figsize=(8, 6))

sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=spectral_labels, palette='coolwarm', edgecolor='k')

plt.xlabel("PCA Component 1")

plt.ylabel("PCA Component 2")

plt.title("Spectral Clusters Visualization using PCA")

plt.legend(title="Clusters")

plt.show()

# Visualizing K-Means Clusters

plt.figure(figsize=(8, 6))

sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=kmeans_labels, palette='viridis', edgecolor='k')

plt.xlabel("PCA Component 1")

plt.ylabel("PCA Component 2")

plt.title("KMeans Clusters Visualization using PCA")

plt.legend(title="Clusters")

plt.show()

# Visualizing Outliers

plt.figure(figsize=(8, 6))
# Map -1/1 to readable names so the legend entries cannot be mislabeled

outlier_status = np.where(outlier_labels == -1, "Outlier", "Normal")

sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=outlier_status, palette={"Normal": "blue", "Outlier": "red"}, edgecolor='k')

plt.xlabel("PCA Component 1")

plt.ylabel("PCA Component 2")

plt.title("Outlier Detection using Isolation Forest")

plt.legend(title="Outlier Status")

plt.show()

# Visualizing DBSCAN Clusters

plt.figure(figsize=(8, 6))

sns.scatterplot(x=df_pca[:, 0], y=df_pca[:, 1], hue=dbscan_labels, palette='Set1', edgecolor='k')

plt.xlabel("PCA Component 1")

plt.ylabel("PCA Component 2")

plt.title("DBSCAN Clusters Visualization using PCA")

plt.legend(title="Clusters")

plt.show()

# Add cluster labels to original data

df['KMeans_Cluster'] = kmeans_labels

df['Hierarchical_Cluster'] = hierarchical_labels

df['DBSCAN_Cluster'] = dbscan_labels

df['Spectral_Cluster'] = spectral_labels
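
The aim mentions error measures, but the listing above stops once the labels are attached. Since clustering has no ground-truth labels, one common option is to report internal validation indices. The sketch below is an assumption about which measures could be used (silhouette, Davies-Bouldin, Calinski-Harabasz); the original does not say which were intended. The helper clustering_scores and the synthetic data X_toy are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

def clustering_scores(X, labels):
    """Internal validation indices for one labeling; None if fewer than 2 clusters."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    # DBSCAN marks noise as -1; leave those points out of the scores
    keep = labels != -1
    X, labels = X[keep], labels[keep]
    if len(np.unique(labels)) < 2:
        return None
    return {
        "silhouette": silhouette_score(X, labels),                # higher is better, in [-1, 1]
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }

# Hypothetical synthetic data standing in for the scaled pollution features
X_toy, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)
toy_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_toy)

toy_scores = clustering_scores(X_toy, toy_labels)
print(toy_scores)
```

With the variables from this experiment, the same helper could be called as clustering_scores(df_scaled, kmeans_labels), and likewise for hierarchical_labels, dbscan_labels and spectral_labels, to compare the four methods on equal terms.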

ScreenShots:
Inference:

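Outlier Detection: Isolation Forest flagged approximately 10% of the data as outliers. This share follows directly from the contamination=0.1 setting rather than being discovered from the data; the flagged points likely represent extreme pollution levels or rare environmental conditions.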
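
The link between the contamination parameter and the share of flagged points can be checked with a short sketch. The synthetic data below is hypothetical and contains no injected anomalies, so any flagged points come only from the threshold implied by contamination.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 200 samples from a single normal distribution
rng = np.random.default_rng(42)
toy = rng.normal(size=(200, 3))

flagged_fraction = {}
for contamination in (0.05, 0.10, 0.20):
    model = IsolationForest(contamination=contamination, random_state=42)
    labels = model.fit_predict(toy)  # -1 marks outliers, 1 marks inliers
    flagged_fraction[contamination] = float(np.mean(labels == -1))

print(flagged_fraction)
```

The flagged share tracks the requested contamination closely in each run, so the percentage of outliers should be read as a chosen setting, not as an estimate of how many anomalies the data contains.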

Optimal Clusters: The Elbow method and Dendrogram suggest 3 optimal clusters, indicating three
distinct air quality profiles.

K-Means & Hierarchical Clustering: Both methods identified low, moderate, and high pollution
groups, with K-Means providing clearer separation.

DBSCAN Clustering: Successfully captured irregular patterns and outliers, useful for detecting rare
pollution spikes.

Spectral Clustering: Effectively identified complex, non-linear pollution patterns, revealing subtle variations in air quality levels across different regions.

The dataset reveals three primary pollution categories, with outliers representing unusual air quality
events or anomalies.

RESULT:

Thus, a clustering model has been built.
