Name: Vidya Janani V
Register Number: 913121205090
EX.NO: 4 CLUSTERING THE GIVEN DATA USING PYTHON/R
Date: 12.03.2024
AIM:
To perform clustering of the given data using K-Means in Python and R
STEPS:
1. Data Preparation: Load and pre-process the data. Ensure it is in a suitable format for
clustering.
2. Library Imports: Import the necessary Python libraries, such as sklearn for K-Means
and matplotlib for visualization.
3. K-Means Clustering: Initialize and fit a K-Means model, specifying the number of
clusters (K).
4. Visualization: Visualize the clusters to identify patterns and structures within the data.
PYTHON:
ELBOW METHOD:
The Elbow Method is used to find the optimal number of clusters (K) for K-Means clustering. The
code loads a dataset, selects the numeric features, and calculates the Within-Cluster Sum of
Squares (WSS, reported by scikit-learn as inertia) for K values ranging from 1 to 10. The
resulting WSS values are plotted to visualize the "elbow" point where the rate of decrease in
WSS slows down, indicating the optimal K. This helps in determining the most suitable number
of clusters for the given dataset.
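The WSS value is the sum of squared distances from every point to the centroid of its assigned cluster. The following is a minimal sketch of that calculation, not the code used in this experiment: it assumes synthetic data from make_blobs as a stand-in for the real dataset, and within_cluster_ss is a hypothetical helper name. It computes WSS by hand and compares it with the inertia_ attribute of the fitted KMeans model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 300 points scattered around 3 centres.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)


def within_cluster_ss(points, labels, centers):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))


wss_manual = []
wss_sklearn = []
for k in range(1, 11):
    # n_init is set explicitly so the behaviour does not depend on the
    # scikit-learn version's default.
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss_manual.append(within_cluster_ss(X, model.labels_, model.cluster_centers_))
    wss_sklearn.append(model.inertia_)

# Print both values side by side for each K.
for k, (manual, reported) in enumerate(zip(wss_manual, wss_sklearn), start=1):
    print(k, round(manual, 2), round(reported, 2))
```

The two columns should agree up to floating-point rounding, which is why the elbow plot can be drawn directly from inertia_.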
K-MEANS
The Python code performs K-Means clustering with a specified number of clusters (K) on a
dataset reduced to two features (the first two principal components). It assigns every data
point to a cluster and visualizes the data points with a different color for each cluster.
Additionally, it plots the cluster centroids. The n_clusters argument should be set to the
chosen number of clusters (3 in the code below), and the code provides a visual
representation of the clustering results.
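Cluster labels can also be written back to the table they were computed from, which makes it easy to inspect the size of each cluster. The following is a minimal sketch under stated assumptions: the DataFrame is synthetic, and the column names feature_a, feature_b and cluster are hypothetical, not columns of the dataset used in this experiment.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in table (assumption): two well-separated groups of 50 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": np.concatenate([rng.normal(0, 1, 50), rng.normal(8, 1, 50)]),
    "feature_b": np.concatenate([rng.normal(0, 1, 50), rng.normal(8, 1, 50)]),
})

k = 2  # chosen number of clusters for this toy example

# Standardize the two features so both contribute equally to the distances.
scaled = StandardScaler().fit_transform(df[["feature_a", "feature_b"]])

# Fit K-Means and store the label of each row in a new column.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(scaled)

# Count how many rows fall into each cluster.
cluster_sizes = df["cluster"].value_counts().sort_index()
print(cluster_sizes)
```

Because fit_predict returns one label per row in the original row order, the assignment to a new column lines up with the rows of the DataFrame, provided no rows were dropped between scaling and clustering.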
SCATTER PLOT
1. A scatter plot will be displayed, where data points are colored differently based on their
assigned clusters, showing the clusters formed by K-Means.
2. The cluster centroids will be marked as red "x" symbols on the plot.
3. The title of the plot will indicate the number of clusters used for K-Means clustering
(specified by the 'k' variable).
4. A legend will be displayed in the upper right corner of the plot, indicating the labels for
data points and centroids.
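The four points above can be sketched as follows. This is an illustrative example, not the plotting code of this experiment: it assumes synthetic data from make_blobs, and the axis labels "Feature 1" and "Feature 2" are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

k = 3  # number of clusters, also shown in the plot title

# Synthetic stand-in data (assumption): 150 points around k centres.
X, _ = make_blobs(n_samples=150, centers=k, n_features=2, random_state=42)

kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

fig, ax = plt.subplots(figsize=(8, 6))

# One scatter call per cluster, so each cluster gets its own color and legend entry.
for cluster_id in range(k):
    members = X[labels == cluster_id]
    ax.scatter(members[:, 0], members[:, 1], s=50, alpha=0.5,
               label=f"Cluster {cluster_id}")

# Mark the centroids as red "x" symbols.
centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], s=200, c="red", marker="x",
           label="Centroids")

ax.set_title(f"K-Means Clustering (k={k})")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.legend(loc="upper right")
```

Calling plt.show() after this block displays the figure in an interactive session. Passing loc="upper right" fixes the legend position; without it, matplotlib chooses the position it considers best.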
Thus, K-Means clustering was performed on the job placement dataset (job_placement.csv).
21PCS02 – Exploratory Data Analysis Laboratory
Python Code:
1. Elbow method
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load the dataset
data = pd.read_csv("job_placement.csv")
# Display the first few rows of the dataset
print(data.head())
# Preprocessing the data
# Dropping non-numeric columns if any and handling missing values
data = data.dropna()
numeric_data = data.select_dtypes(include=['float64', 'int64'])
# Standardizing the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)
# Applying PCA for dimensionality reduction
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)
# Elbow Method to find the optimal number of clusters
inertia = []
for i in range(1, 11):
    # Fit K-Means with i clusters and record its inertia (WSS)
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(pca_data)
    inertia.append(kmeans.inertia_)
# Plotting the Elbow Method
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
Output
2. K-Means Clustering
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load the dataset
data = pd.read_csv("job_placement.csv")
# Display the first few rows of the dataset
print(data.head())
# Preprocessing the data
# Dropping non-numeric columns if any and handling missing values
data = data.dropna()
numeric_data = data.select_dtypes(include=['float64', 'int64'])
# Standardizing the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)
# Applying PCA for dimensionality reduction
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)
# Applying K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(pca_data)
# Visualizing the clusters
plt.figure(figsize=(8, 6))
plt.scatter(pca_data[:, 0], pca_data[:, 1], c=cluster_labels, cmap='viridis', s=50, alpha=0.5)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red',
            marker='X', label='Centroids')
plt.title('K-means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
Output
3. Scatter plot
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load the dataset
data = pd.read_csv("job_placement.csv")
# Display the first few rows of the dataset
print(data.head())
# Preprocessing the data
# Dropping non-numeric columns if any and handling missing values
data = data.dropna()
numeric_data = data.select_dtypes(include=['float64', 'int64'])
# Standardizing the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)
# Applying PCA for dimensionality reduction
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)
# Applying K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(pca_data)
# Visualizing the clusters
plt.figure(figsize=(10, 6))
# Plotting points with cluster centers
plt.scatter(pca_data[:, 0], pca_data[:, 1], c=cluster_labels, cmap='viridis', s=50, alpha=0.5)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red',
            marker='X', label='Centroids')
plt.title('K-means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True)
plt.show()
Output
Result:
In this experiment, clustering of the given data using Python/R was implemented and the
output was verified successfully.