Experiment 10
Develop a program to implement K-Means clustering on the Wisconsin Breast Cancer
dataset and visualize the clustering results.
Introduction to K-Means Clustering
What is Clustering?
Clustering is an unsupervised machine learning technique used to group data points into
clusters based on their similarity. The goal is to identify hidden patterns or natural
groupings in the data.
One of the most widely used clustering algorithms is K-Means Clustering, which divides the
dataset into K clusters, assigning each data point to the cluster with the nearest centroid.
What is K-Means Clustering?
K-Means is a centroid-based clustering algorithm that partitions data into K clusters by
minimizing the variance within each cluster.
Working of K-Means Algorithm
1. Choose the number of clusters (K).
2. Randomly initialize K cluster centroids.
3. Assign each data point to the nearest centroid based on distance (e.g., Euclidean
distance).
4. Update the centroids by computing the mean of all points assigned to each cluster.
5. Repeat Steps 3 and 4 until convergence (when centroids no longer change
significantly).
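To make these steps concrete, below is a minimal NumPy sketch of the loop above (the function name kmeans_numpy, the iteration cap, and the convergence check are illustrative choices, not part of the exercise):

import numpy as np

def kmeans_numpy(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (assumes no cluster becomes empty, which suffices for a sketch)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: stop once the centroids no longer change significantly
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids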
Mathematical Representation
The objective is to minimize the sum of squared distances (SSD) between data points
and their assigned cluster centroids:

J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2

where:
K = number of clusters
x_j = a data point
\mu_i = centroid of cluster C_i
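As a quick sanity check on the objective, J can be computed directly from the assignments and compared with scikit-learn's inertia_ attribute, which stores the same quantity (a small sketch on toy data; the variable names are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# J: squared distance of every point to its assigned centroid, summed
J = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(J, km.inertia_)  # the two values agree up to floating-point error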
Choosing the Optimal Number of Clusters (K)
Selecting the right value of K is crucial. Some common methods include:
1. Elbow Method:
Plots the within-cluster sum of squares (WCSS) for different K values.
The "elbow point", where WCSS stops decreasing significantly, is chosen as the
optimal K (see the sketch after this list).
2. Silhouette Score:
Measures how well separated the clusters are.
A higher score indicates better clustering.
3. Gap Statistic:
Compares clustering performance against randomly generated reference data.
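A minimal sketch of the Elbow Method and Silhouette Score on the dataset used in the program below (scikit-learn's inertia_ attribute is the WCSS; the range of K values and the plot styling are arbitrary choices):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

ks = range(2, 11)
wcss, sil = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                    # within-cluster sum of squares
    sil.append(silhouette_score(X, km.labels_)) # higher = better separated

plt.plot(ks, wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()

print(dict(zip(ks, sil)))  # silhouette score per K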
Distance Metrics in K-Means
K-Means typically uses Euclidean distance to measure how close a data point x is to a
centroid \mu:

d(x, \mu) = \sqrt{\sum_{k=1}^{n} (x_k - \mu_k)^2}

Other distance metrics include:
Manhattan Distance
Cosine Similarity
Mahalanobis Distance
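For comparison, Euclidean and Manhattan distances reduce to one-line NumPy expressions, and cosine similarity to a normalized dot product (a small illustrative sketch; Mahalanobis distance additionally needs an inverse covariance matrix, e.g. via scipy.spatial.distance.mahalanobis):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
mu = np.array([0.0, 2.0, 5.0])

euclidean = np.sqrt(((x - mu) ** 2).sum())  # sqrt(1 + 0 + 4) ≈ 2.236
manhattan = np.abs(x - mu).sum()            # 1 + 0 + 2 = 3
cosine_sim = x @ mu / (np.linalg.norm(x) * np.linalg.norm(mu))
print(euclidean, manhattan, cosine_sim)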
Advantages of K-Means Clustering
Efficient and Scalable – Works well with large datasets.
Easy to Implement – Simple and interpretable.
Handles High-Dimensional Data – Can be applied to complex, high-dimensional datasets,
often together with dimensionality reduction such as PCA.
Challenges of K-Means Clustering
• Sensitive to Initial Centroid Selection – Different initializations may lead to different
results (see the sketch below).
• Not Suitable for Non-Spherical Clusters – Assumes clusters are roughly spherical and
similar in size.
• Outliers Affect Centroids – Outliers can pull centroids away and distort the clustering.
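The initialization sensitivity can be observed directly by running K-Means with a single random initialization under different seeds (an illustrative sketch; in practice the default k-means++ initialization and a larger n_init, as in the program below, mitigate this):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)
for seed in range(3):
    km = KMeans(n_clusters=2, init='random', n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)  # the final WCSS can differ between single runs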
Visualization of Clusters
After applying K-Means Clustering, the results can be visualized using:
Scatter Plots: Plot the clusters with different colors.
Centroid Markers: Display cluster centers for better interpretation.
2D/3D PCA Visualization: Reduce dimensions for better visualization.
Applications of K-Means Clustering
Customer Segmentation – Grouping customers based on purchasing behavior.
Image Compression – Reducing image colors to dominant clusters.
Anomaly Detection – Identifying fraudulent transactions.
Medical Diagnosis – Classifying patients based on symptoms and medical data.
Program
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report
# Suppress known benign warnings (seaborn edgecolor notice, sklearn n_init notice)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply KMeans clustering with explicit n_init
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X_scaled)
# Evaluation: K-Means cluster IDs are arbitrary, so map each cluster to the
# majority true label within it before comparing against the ground truth
mapping = {c: np.bincount(y[y_kmeans == c]).argmax() for c in np.unique(y_kmeans)}
y_aligned = np.array([mapping[c] for c in y_kmeans])
print("Confusion Matrix:")
print(confusion_matrix(y, y_aligned))
print("\nClassification Report:")
print(classification_report(y, y_aligned))
# PCA for 2D visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Create DataFrame for plotting
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y
# Plot: K-Means Clustering
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
edgecolor='black', alpha=0.7, marker='o')
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
# Plot: True Labels
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label', palette='coolwarm',
s=100, edgecolor='black', alpha=0.7, marker='o')
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()
# Plot: K-Means with Centroids
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=100,
edgecolor='black', alpha=0.7, marker='o')
# Project the centroids (in scaled feature space) onto the PCA plane
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()