MACHINE LEARNING COURSE CODE-A8703 MODULE-04

SYLLABUS: Unsupervised Learning: Introduction, Unsupervised vs Supervised Learning, Applications of Unsupervised Learning, Clustering: K-Means, K-Medoids, DBSCAN.

What is Unsupervised Learning?


Unsupervised learning is a type of machine learning where the algorithm is trained on unlabelled
data. The goal is to infer the natural structure present within a set of data points. Unlike supervised
learning, there are no predefined labels or outcomes associated with the data. Unsupervised learning
can be used to discover patterns, groupings, and associations within the data.
Types of Unsupervised Learning
1. Clustering: Grouping data points into clusters based on their similarities.
2. Dimensionality Reduction: Reducing the number of features while preserving the
significant information.
3. Anomaly Detection: Identifying rare items or events which differ significantly from the
majority of the data.
4. Association: Discovering interesting relationships between variables in large databases.
List of Unsupervised Learning Algorithms
1. Clustering Algorithms:
1. K-Means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
4. Gaussian Mixture Models (GMM)
5. Mean Shift Clustering
2. Dimensionality Reduction Algorithms:
1. Principal Component Analysis (PCA)
2. t-Distributed Stochastic Neighbour Embedding (t-SNE)
3. Linear Discriminant Analysis (LDA)
4. Independent Component Analysis (ICA)
5. Autoencoders
3. Association Rule Learning Algorithms:
1. Apriori Algorithm
2. Eclat Algorithm
4. Anomaly Detection Algorithms:
1. Isolation Forest
2. Local Outlier Factor (LOF)
3. One-Class SVM
Advantages
1. No Need for Labelled Data: Does not require labelled data, which can be expensive and time-
consuming to produce.
2. Discover Hidden Patterns: Can reveal hidden patterns and structures in the data that were
not previously apparent.
3. Data Exploration: Useful for exploratory data analysis to understand the underlying
structure of the data.
4. Flexibility: Can be applied to various types of data, including images, text, and numerical data.
Disadvantages
1. Interpretability: The results can be harder to interpret compared to supervised learning
outcomes.
2. Evaluation: No straightforward way to evaluate the quality of the results due to the absence
of ground truth labels.
3. Scalability: Some unsupervised learning algorithms may struggle with very large datasets.
4. Complexity: Algorithms can be complex and computationally intensive.
Real-time Applications
1. Customer Segmentation: Clustering customers based on purchasing behavior for targeted
marketing.
2. Anomaly Detection in Networks: Identifying unusual patterns that may indicate fraudulent
activity or network intrusions.
3. Market Basket Analysis: Finding associations between products to understand buying
patterns.
4. Document Clustering: Grouping similar documents for organizing large collections of text
data.
5. Image Compression: Using techniques like Principal Component Analysis (PCA) to reduce
the dimensionality of images, making them easier to store and transmit.
Limitations
1. No Output Guidance: Lack of labelled output makes it challenging to know what the results
should look like.
2. Arbitrary Decisions: The number of clusters in clustering or the dimensions to reduce to in
dimensionality reduction can be somewhat arbitrary and may require domain knowledge.

3. Sensitivity to Initial Conditions: Some algorithms, like k-means clustering, can be sensitive
to the initial placement of centroids.
4. Scalability Issues: Processing large datasets can be computationally expensive, especially for
algorithms like hierarchical clustering.
5. Quality of Results: The quality of results is highly dependent on the quality and nature of the
data. Outliers or noise in the data can significantly affect the outcome.


Comparison between Unsupervised and Supervised Learning

1. Definition
Unsupervised Learning: Unsupervised learning algorithms are designed to find hidden patterns or intrinsic structures in input data without any labelled responses. These algorithms are particularly useful for tasks like clustering, dimensionality reduction, and anomaly detection.
Supervised Learning: Supervised learning involves training a model on a labelled dataset, which means that each training example is paired with an output label. The goal is for the model to learn the mapping from inputs to outputs and make predictions on new, unseen data.

2. Algorithms
Unsupervised Learning:
1. Clustering Algorithms: K-Means Clustering, Hierarchical Clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), Gaussian Mixture Models (GMM), Mean Shift Clustering
2. Dimensionality Reduction Algorithms: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbour Embedding (t-SNE), Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), Autoencoders
3. Association Rule Learning Algorithms: Apriori Algorithm, Eclat Algorithm
4. Anomaly Detection Algorithms: Isolation Forest, Local Outlier Factor (LOF), One-Class SVM
Supervised Learning:
1. Linear Regression
2. Logistic Regression
3. Decision Trees
4. Random Forests
5. Support Vector Machines (SVM)
6. K-Nearest Neighbours (KNN)
7. Naive Bayes
8. Gradient Boosting Machines (GBM)
9. Neural Networks

3. Advantages
Unsupervised Learning:
1. No Need for Labelled Data: Unsupervised learning does not require labelled data, which can be costly and time-consuming to obtain.
2. Discovery of Hidden Patterns: Capable of uncovering hidden structures and patterns in data that may not be evident.
3. Data Exploration: Useful for exploring data to find natural groupings or to understand data distribution.
4. Flexibility: Can be applied to various types of data, including images, text, and numerical data.
Supervised Learning:
1. Predictive Accuracy: Generally provides higher predictive accuracy when labelled data is available.
2. Interpretability: Many supervised learning models offer interpretable results.
3. Evaluation: Clear and well-defined metrics (e.g., accuracy, precision, recall) to evaluate model performance.
4. Control: Ability to train models with specific desired outputs, making them highly controllable for predictions.

4. Disadvantages
Unsupervised Learning:
1. Interpretability: The results can be difficult to interpret compared to supervised learning outcomes.
2. Evaluation: No straightforward way to evaluate the quality of the results due to the absence of ground truth labels.
3. Scalability: Some unsupervised learning algorithms may struggle with very large datasets.
4. Complexity: Algorithms can be complex and computationally intensive.
Supervised Learning:
1. Labelled Data Requirement: Requires a large amount of labelled data, which can be costly and time-consuming to obtain.
2. Overfitting: Risk of overfitting to the training data, especially with complex models and small datasets.
3. Limited Generalization: May not perform well on data that is significantly different from the training data.

5. Real-time Applications
Unsupervised Learning:
1. Customer Segmentation: Clustering customers based on purchasing behaviour for targeted marketing campaigns.
2. Anomaly Detection in Networks: Identifying unusual patterns that may indicate fraudulent activity or network intrusions.
3. Market Basket Analysis: Finding associations between products to understand buying patterns.
4. Document Clustering: Grouping similar documents for organizing large collections of text data.
5. Image Compression: Using techniques like PCA to reduce the dimensionality of images, making them easier to store and transmit.
6. Recommender Systems: Building systems that recommend products or content based on user behaviour and preferences.
Supervised Learning:
1. Email Spam Detection: Using labelled examples of spam and non-spam emails.
2. Fraud Detection: Identifying fraudulent transactions based on historical labelled data.
3. Image Recognition: Classifying images into categories such as objects, faces, or handwritten digits.
4. Medical Diagnosis: Predicting diseases based on labelled medical data (e.g., patient records, lab results).
5. Speech Recognition: Converting spoken language into text using labelled audio data.

6. Limitations
Unsupervised Learning:
1. No Output Guidance: Lack of labelled output makes it challenging to know what the results should look like.
2. Arbitrary Decisions: The number of clusters in clustering or the dimensions to reduce to in dimensionality reduction can be somewhat arbitrary and may require domain knowledge.
3. Sensitivity to Initial Conditions: Some algorithms, like K-Means clustering, can be sensitive to the initial placement of centroids.
4. Scalability Issues: Processing large datasets can be computationally expensive, especially for algorithms like hierarchical clustering.
5. Quality of Results: The quality of results is highly dependent on the quality and nature of the data. Outliers or noise in the data can significantly affect the outcome.
6. Interpretation Challenges: The lack of labels makes it difficult to interpret the results, as there is no clear indication of what the discovered patterns represent.
Supervised Learning:
1. Data Dependency: Performance heavily depends on the quality and quantity of the labelled data.
2. Time and Resource Intensive: Collecting and labelling large datasets can be expensive and time-consuming.
3. Bias: If the training data is biased, the model may produce biased predictions.


Applications of Unsupervised Learning


1. Customer Segmentation
Use Case: E-commerce and retail companies use clustering techniques to segment customers based
on purchasing behavior, demographics, and other factors.
Benefit: Helps in targeting marketing strategies, personalized recommendations, and improving
customer satisfaction.
Example: Using K-Means clustering to identify different customer segments for tailored marketing
campaigns.
2. Market Basket Analysis
Use Case: Retailers analyse transaction data to discover products frequently bought together.
Benefit: Helps in designing promotions, cross-selling, and optimizing inventory.
Example: Using the Apriori algorithm to identify that customers who buy bread often buy butter and
jam.
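To make the Apriori step concrete, here is a minimal, hypothetical sketch using the mlxtend library (an assumption; any Apriori implementation works the same way). The toy transactions and thresholds are illustrative only.

```python
# Market-basket sketch with mlxtend (assumed installed: pip install mlxtend).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
    ["bread", "butter", "jam", "milk"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 40%, then rules with confidence >= 60%
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```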
3. Anomaly Detection in Fraud Detection
Use Case: Financial institutions use anomaly detection to identify fraudulent transactions.
Benefit: Early detection of fraudulent activities, reducing financial losses.
Example: Using Isolation Forest to flag unusual transaction patterns that may indicate fraud.
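A minimal sketch of this idea with scikit-learn's IsolationForest; the transaction amounts and the contamination rate below are made-up illustrative values, not from the notes.

```python
# Flagging unusual transaction amounts with an Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=50, scale=10, size=(200, 1))   # typical transaction amounts
fraud = np.array([[500.0], [750.0], [1200.0]])          # a few extreme amounts
X = np.vstack([normal, fraud])

model = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = model.predict(X)            # +1 = inlier, -1 = flagged as anomaly
print(X[labels == -1].ravel())       # the flagged transaction amounts
```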
4. Document Clustering
Use Case: Organizing large collections of documents (e.g., news articles, research papers) into
meaningful clusters.
Benefit: Facilitates document retrieval, content recommendation, and topic discovery.
Example: Using Hierarchical Clustering to group news articles by topic.
5. Image Segmentation
Use Case: Dividing an image into segments to simplify its analysis.
Benefit: Enhances object detection, medical image analysis, and computer vision tasks.
Example: Using K-Means clustering to segment different parts of an MRI scan.
6. Genomics and Bioinformatics
Use Case: Grouping similar genetic expressions or identifying patterns in genetic data.
Benefit: Understanding genetic factors in diseases, drug discovery, and personalized medicine.
Example: Using PCA to reduce dimensionality of gene expression data and identify significant
patterns.
7. Recommendation Systems
Use Case: Providing personalized recommendations for products, movies, music, etc.
Benefit: Enhances user experience and increases engagement and sales.


Example: Using collaborative filtering to recommend movies based on user viewing history and
clustering similar users together.
8. Dimensionality Reduction for Visualization
Use Case: Reducing the number of dimensions in high-dimensional data for visualization purposes.
Benefit: Helps in understanding the data, identifying patterns, and making it accessible to non-
experts.
Example: Using t-SNE to visualize high-dimensional data in 2D or 3D space.
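A short sketch of such a visualization with scikit-learn's TSNE, using the bundled digits dataset purely as a stand-in for any high-dimensional data.

```python
# Projecting 64-dimensional digit images to 2D with t-SNE and plotting them.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()                       # 64-dimensional feature vectors
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)

plt.scatter(emb[:, 0], emb[:, 1], c=digits.target, s=5, cmap="tab10")
plt.title("t-SNE projection of the digits dataset")
plt.show()
```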
9. Pattern Recognition in Speech and Audio
Use Case: Identifying patterns in audio data for tasks like speech recognition, music classification,
and sound event detection.
Benefit: Enhances user interfaces, aids in automated transcription, and improves audio content
analysis.
Example: Using spectral clustering to group similar audio segments.
10. Anomaly Detection in Network Security
Use Case: Detecting unusual network traffic patterns that may indicate security threats.
Benefit: Improves network security by identifying potential intrusions or attacks.
Example: Using One-Class SVM to detect anomalies in network traffic data.
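As a rough sketch, a One-Class SVM can be fitted on mostly normal traffic features and then used to flag new points. The synthetic feature vectors and the nu value below are illustrative assumptions standing in for real network-traffic statistics.

```python
# One-Class SVM sketch: train on "normal" records, flag far-away test records.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
normal_traffic = rng.normal(0, 1, size=(300, 2))     # mostly normal behaviour
test = np.vstack([rng.normal(0, 1, size=(5, 2)),     # normal-looking samples
                  rng.normal(6, 1, size=(5, 2))])    # far-away, suspicious samples

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_traffic)
print(ocsvm.predict(test))   # +1 = normal, -1 = anomaly
```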

Limitations of Unsupervised Learning


1. Lack of Interpretability: The results can be difficult to interpret without domain knowledge.
2. No Clear Evaluation Metric: Unlike supervised learning, there's no straightforward way to
evaluate the performance of an unsupervised learning model.
3. Scalability: Some algorithms may not scale well with very large datasets.
4. Quality of Results: Highly dependent on the quality and structure of the input data. Poor data
quality can lead to misleading patterns.
5. Complexity: Implementing and fine-tuning unsupervised learning models can be complex
and time-consuming.


What is Clustering?
Clustering is an unsupervised learning technique used to group similar data points into clusters.
Each cluster consists of data points that are more similar to each other than to those in other clusters.
This technique is used to identify patterns or structures in data when labels are not available.
Types of Clustering
1. Partitioning Clustering
o K-Means Clustering: Divides data into K clusters, where each data point belongs to
the cluster with the nearest mean.
o K-Medoids Clustering: Similar to K-Means, but uses medoids (actual data points)
instead of means to represent clusters.
2. Hierarchical Clustering
o Agglomerative Clustering: A "bottom-up" approach where each data point starts in
its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
o Divisive Clustering: A "top-down" approach where all data points start in one cluster,
and splits are performed recursively.
3. Density-Based Clustering
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters
data points based on density, finding clusters of arbitrary shape and marking outliers
as noise.
o OPTICS (Ordering Points to Identify the Clustering Structure): An extension of
DBSCAN that handles varying densities.
4. Grid-Based Clustering
o STING (Statistical Information Grid): Divides the data space into a grid and performs
clustering operations on the grid cells.
o CLIQUE (Clustering in Quest): Combines grid-based and density-based approaches
for high-dimensional data.
5. Model-Based Clustering
o Gaussian Mixture Models (GMM): Assumes that the data is generated from a mixture
of several Gaussian distributions with unknown parameters.
Advantages
1. Discovery of Patterns: Identifies hidden patterns and structures in data without pre-labeled
classes.
2. Data Reduction: Simplifies data by grouping similar items, making it easier to analyze.
3. Versatility: Applicable to various types of data and industries.
4. Feature Engineering: Helps in creating new features for other machine learning algorithms.
Disadvantages
1. Choosing the Right Algorithm: Different clustering algorithms can yield different results,
and selecting the appropriate one can be challenging.
2. Number of Clusters: Many algorithms require the number of clusters to be specified
beforehand, which can be difficult to determine.
3. Scalability: Some clustering algorithms do not scale well with large datasets.
4. Sensitivity to Initial Conditions: Methods like K-Means are sensitive to the initial placement
of centroids.
Real-Time Applications
1. Customer Segmentation
o Retail and E-commerce: Grouping customers based on purchasing behavior for
targeted marketing.
o Example: Using K-Means clustering to identify customer segments for personalized
offers.
2. Document Clustering
o Information Retrieval: Grouping similar documents for search optimization.
o Example: Hierarchical clustering of news articles to categorize them by topic.
3. Image Segmentation
o Computer Vision: Dividing an image into segments to identify objects or regions.
o Example: Using DBSCAN to segment regions in satellite images for land cover
classification.
4. Anomaly Detection
o Network Security: Identifying unusual patterns that may indicate security threats.
o Example: Clustering network traffic data to detect intrusions.
5. Market Basket Analysis
o Retail: Identifying products that are frequently bought together.
o Example: Using hierarchical clustering to group products based on co-purchase
patterns.
6. Bioinformatics
o Genomics: Grouping genes or proteins with similar expression patterns.
o Example: Using hierarchical clustering to analyze gene expression data.
7. Social Network Analysis
o Community Detection: Identifying groups of related individuals in social networks.
o Example: Using density-based clustering to find communities in social media data.
Limitations
1. Scalability Issues: Some clustering algorithms are computationally intensive and may not
perform well on large datasets.
2. Handling of Different Data Types: Many clustering algorithms assume data points are in a
continuous space and may struggle with categorical or mixed data types.
3. Interpretability: The results of clustering can sometimes be difficult to interpret, especially
with high-dimensional data.
4. Predefined Number of Clusters: Algorithms like K-Means require the number of clusters to
be specified beforehand, which can be impractical without prior knowledge.
5. Sensitivity to Noise and Outliers: Algorithms like K-Means are sensitive to outliers, which
can skew the results.


K-Means Clustering Algorithm:


K-Means is a popular partitioning clustering algorithm used to divide a dataset into K distinct, non-
overlapping subsets (clusters). The algorithm aims to minimize the variance within each cluster,
measured by the sum of squared distances between data points and their respective cluster
centroids.
How K-Means Works
1. Initialization: Randomly select K initial centroids from the data points.
2. Assignment: Assign each data point to the nearest centroid based on the Euclidean distance.
3. Update: Recalculate the centroids as the mean of all data points assigned to each cluster.
4. Repeat: Repeat the assignment and update steps until the centroids no longer change or the
maximum number of iterations is reached.
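This loop minimizes the within-cluster sum of squared distances, J = Σk Σx∈Ck ||x − μk||². A minimal sketch with scikit-learn follows; the 2-D points are illustrative and not from the notes, and k-means++ initialization (described below) is used.

```python
# Minimal K-Means sketch with scikit-learn on a tiny illustrative dataset.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print("labels   :", km.labels_)           # cluster index of each point
print("centroids:", km.cluster_centers_)  # mean of the points in each cluster
print("inertia  :", km.inertia_)          # within-cluster sum of squared distances
```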
Types of K-Means
1. Standard K-Means: The basic form of the algorithm described above.
2. K-Means++: An improved initialization technique that spreads out the initial centroids,
reducing the chances of poor clustering results.
3. Mini-Batch K-Means: A variation that processes small random batches of the dataset, which
improves efficiency and scalability for large datasets.
Advantages
1. Simplicity: Easy to understand and implement.
2. Efficiency: Computationally efficient, especially with K-Means++ for initialization.
3. Scalability: Performs well on large datasets with relatively low computational cost.
4. Speed: Converges quickly compared to other clustering algorithms.
Disadvantages
1. Choosing K: Requires the number of clusters (K) to be specified in advance, which can be
challenging without prior knowledge.
2. Sensitive to Initialization: Poor initialization can lead to suboptimal clustering results,
though K-Means++ mitigates this issue.
3. Assumes Spherical Clusters: Assumes clusters are spherical and evenly sized, which may
not hold true for all datasets.
4. Sensitive to Outliers: Outliers can significantly affect the positions of the centroids.
5. Local Optima: The algorithm only guarantees convergence to a local optimum, which may not be the global optimum.


Real-Time Applications
1. Customer Segmentation
o Retail and E-commerce: Grouping customers based on purchasing behavior for
personalized marketing strategies.
o Example: Identifying customer segments for targeted promotions.
2. Image Compression
o Computer Vision: Reducing the number of colors in an image by clustering similar
colors.
o Example: Compressing images by replacing similar colors with cluster centroids.
3. Document Clustering
o Information Retrieval: Grouping similar documents to improve search and
organization.
o Example: Organizing news articles into topics.
4. Market Segmentation
o Marketing: Identifying distinct market segments for tailored advertising campaigns.
o Example: Clustering survey responses to identify different consumer segments.
5. Anomaly Detection
o Network Security: Detecting unusual patterns in network traffic.
o Example: Identifying anomalies in network traffic data to detect potential security
threats.
6. Genomics
o Bioinformatics: Grouping genes with similar expression patterns.
o Example: Clustering gene expression data to identify functional gene groups.
Limitations
1. Fixed Number of Clusters: The need to specify the number of clusters (K) can be impractical
without prior knowledge.
2. Scalability with Large K: Performance degrades with a very large number of clusters.
3. Sensitive to Scale: Results can be influenced by the scale of the data, requiring normalization
or standardization.
4. Non-Convex Shapes: Struggles with clusters that are not spherical or have varying sizes and
densities.
5. Handling Categorical Data: Designed for continuous numerical data, making it less effective
for categorical or mixed data types without pre-processing.


EXAMPLE OF K-MEANS CLUSTERING


1. Use K-Means clustering to cluster the following data into two groups.
2. Data points: {2, 4, 10, 12, 3, 20, 30, 11, 25}
3. The distance function used is Euclidean distance.
   a. Distance formula (for one-dimensional points): d(x1, x2) = |x1 - x2|
4. The initial cluster centroids are M1 = 4 and M2 = 11.
5. Take the dataset above as an example.
6. The centroid values are x2 = 4 (for M1) and x2 = 11 (for M2); each data point supplies the x1 value.

   With respect to M1 = 4        With respect to M2 = 11
   X2    X1                      X2    X1
   4     2                       11    2
   4     4                       11    4
   4     10                      11    10
   4     12                      11    12
   4     3                       11    3
   4     20                      11    20
   4     30                      11    30
   4     11                      11    11
   4     25                      11    25

Figure 1: For each centroid value (X2) and each data point (X1), the distance |X1 - X2| is computed.


Data Points    Distance to M1    Distance to M2    Cluster
2              2                 9                 C1
4              0                 7                 C1
10             6                 1                 C2
12             8                 1                 C2
3              1                 8                 C1
20             16                9                 C2
30             26                19                C2
11             7                 0                 C2
25             21                14                C2

Table 1: Each data point is assigned to the cluster of its nearer centroid (C1 for M1, C2 for M2).
1. Therefore: C1 = {2, 4, 3}, new centroid M1 = (2 + 4 + 3) / 3 = 9 / 3 = 3
   C2 = {10, 12, 20, 30, 11, 25}, new centroid M2 = (10 + 12 + 20 + 30 + 11 + 25) / 6 = 108 / 6 = 18
2. New centroids are: M1 = 3, M2 = 18

Data Points    Distance to M1    Distance to M2    Cluster    New Cluster
2              1                 16                C1         C1
4              1                 14                C1         C1
10             7                 8                 C2         C1
12             9                 6                 C2         C2
3              0                 15                C1         C1
20             17                2                 C2         C2
30             27                12                C2         C2
11             8                 7                 C2         C2
25             22                7                 C2         C2

Therefore: data point 10 has changed its cluster, so the old and new assignments do not match. We repeat the same process until every point keeps the same place in both columns (old cluster and new cluster).

1. Therefore: C1 = {2, 4, 10, 3}, new centroid M1 = (2 + 4 + 10 + 3) / 4 = 19 / 4 = 4.75
   C2 = {12, 20, 30, 11, 25}, new centroid M2 = (12 + 20 + 30 + 11 + 25) / 5 = 98 / 5 = 19.6
2. New centroids are: M1 = 4.75, M2 = 19.6

Data Points    Distance to M1    Distance to M2    Cluster    New Cluster
2              2.75              17.6              C1         C1
4              0.75              15.6              C1         C1
10             5.25              9.6               C1         C1
12             7.25              7.6               C2         C1
3              1.75              16.6              C1         C1
20             15.25             0.4               C2         C2
30             25.25             10.4              C2         C2
11             6.25              8.6               C2         C1
25             20.25             5.4               C2         C2

The current centroids are M1 = 4.75 and M2 = 19.6. Data points 12 and 11 have changed clusters, so we iterate again.
1. Therefore: C1 = {2, 4, 10, 11, 12, 3}, new centroid M1 = (2 + 4 + 10 + 11 + 12 + 3) / 6 = 42 / 6 = 7
   C2 = {20, 30, 25}, new centroid M2 = (20 + 30 + 25) / 3 = 75 / 3 = 25
2. New centroids are: M1 = 7, M2 = 25


New table with centroids M1 = 7 and M2 = 25:

Data Points    Distance to M1    Distance to M2    Cluster    New Cluster
2              5                 23                C1         C1
4              3                 21                C1         C1
10             3                 15                C1         C1
12             5                 13                C1         C1
3              4                 22                C1         C1
20             13                5                 C2         C2
30             23                5                 C2         C2
11             4                 14                C1         C1
25             18                0                 C2         C2

Since no data point changed its cluster, the algorithm has converged. Final clusters: C1 = {2, 4, 10, 12, 3, 11}, C2 = {20, 30, 25}, with centroids M1 = 7 and M2 = 25.
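For reference, this 1-D worked example can be reproduced with a few lines of plain Python. This is only a sketch, under the assumption that a tie in distance sends a point to C1, as in the tables above.

```python
# Re-running the worked example: data and initial centroids taken from the notes.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 4, 11                         # initial centroids M1 and M2

while True:
    c1 = [x for x in data if abs(x - m1) <= abs(x - m2)]   # nearer to M1 (ties -> C1)
    c2 = [x for x in data if abs(x - m1) > abs(x - m2)]    # nearer to M2
    new_m1, new_m2 = sum(c1) / len(c1), sum(c2) / len(c2)
    if (new_m1, new_m2) == (m1, m2):   # centroids stopped moving -> converged
        break
    m1, m2 = new_m1, new_m2

print("C1 =", c1, " C2 =", c2)   # expected: C1 = [2, 4, 10, 12, 3, 11], C2 = [20, 30, 25]
print("centroids:", m1, m2)      # expected: 7.0 and 25.0
```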


K-MEDOID CLUSTERING ALGORITHM:


K-Medoid clustering is a partitioning method which clusters the data points into k clusters. Unlike k-
means clustering which uses the mean of the data points within a cluster as the center (centroid), k-
medoid uses actual data points to represent the center (medoid). A medoid is defined as the object
of a cluster, whose average dissimilarity to all the objects in the cluster is minimal.
Advantages of K-Medoid Clustering
1. Robustness to Noise and Outliers: Since k-medoids use actual data points as cluster centers,
it is less influenced by outliers and noise compared to k-means.
2. Interpretability: The medoids are actual data points, making the resulting clusters more
interpretable.
3. Flexibility in Dissimilarity Measures: K-medoids can use arbitrary distance or dissimilarity
measures, making it more flexible than k-means which typically uses Euclidean distance.
Disadvantages of K-Medoid Clustering
1. Computational Complexity: K-medoids is computationally more expensive than k-means,
especially for large datasets. The algorithm requires pairwise distance computations which
can be prohibitive.
2. Scalability: Due to its higher computational complexity, k-medoids may not scale well to very
large datasets.
3. Initialization Sensitivity: Similar to k-means, the initial selection of medoids can affect the
final clustering result.
Limitations of K-Medoid Clustering
1. Performance on Large Datasets: K-medoids can be slow and inefficient on large datasets
because it has a higher computational cost due to the pairwise distance calculations.
2. Fixed Number of Clusters: The algorithm requires the number of clusters (k) to be specified
in advance, which might not always be known.
3. Convergence to Local Optima: The algorithm can get stuck in local optima, resulting in
suboptimal cluster configurations.
Real-Time Applications of K-Medoid Clustering
1. Market Segmentation: Used to identify distinct groups of customers based on purchasing
behavior and demographics.
2. Bioinformatics: Clustering genes or proteins with similar expression patterns or
functionalities.

3. Document Clustering: Grouping similar documents together for information retrieval or
organization purposes.
4. Image Segmentation: Identifying and clustering similar regions within an image for object
recognition or image processing tasks.
5. Healthcare: Clustering patients with similar medical histories or symptoms for personalized
treatment plans.
Example: K-Medoids Clustering Algorithm
1. Take the following dataset as an example.
2. Apply the K-Medoids clustering algorithm to form two clusters.
3. Use the Manhattan distance to measure the distance between each data point and the medoids.
I X Y
X1 2 6
X2 3 4
X3 3 8
X4 4 7
X5 6 2
X6 6 4
X7 7 3
X8 7 4
X9 8 5
X10 7 6
STEPS:
STEP 1:
1. Select two medoids randomly.
2. C1 = (3, 4), C2 = (7, 4)
3. (x1, y1) and (x2, y2) are data points.
4. Manhattan distance formula = |x1 - x2| + |y1 - y2|


   a. For X1 = (2, 6) and C1 = (3, 4), the values are x1 = 2, x2 = 3, y1 = 6, y2 = 4; substitute them into the formula. The same is done for all points.
5. Manhattan distance [(2, 6), (3, 4)] = |2 - 3| + |6 - 4| = 3
   a. Manhattan distance [(3, 4), (3, 4)] = |3 - 3| + |4 - 4| = 0
   b. Similarly, we compute the distances for all points up to X10.
I      X    Y    C1    C2    Cluster
X1     2    6    3     7     C1
X2     3    4    0     4     C1
X3     3    8    4     8     C1
X4     4    7    4     6     C1
X5     6    2    5     3     C2
X6     6    4    3     1     C2
X7     7    3    5     1     C2
X8     7    4    4     0     C2
X9     8    5    6     2     C2
X10    7    6    6     3     C2

Table 2: Distances of each point to medoid C1 = (3, 4) and medoid C2 = (7, 4); each point is assigned to the cluster whose medoid is nearer (the minimum of the two distances).
6. Clusters are
   a. C1 = {(2, 6), (3, 4), (3, 8), (4, 7)}
   b. C2 = {(6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)}
   c. Calculate the cost: Cost(C, X) = Σi |ci − xi|, i.e. the sum of Manhattan distances of every point to its medoid.
   d. Total cost = Cost((3,4), (2,6)) + Cost((3,4), (3,4)) + Cost((3,4), (3,8)) + Cost((3,4), (4,7)) + Cost((7,4), (6,2)) + Cost((7,4), (6,4)) + Cost((7,4), (7,3)) + Cost((7,4), (7,4)) + Cost((7,4), (8,5)) + Cost((7,4), (7,6))
7. Find the values using the Manhattan distance formula = |x1 - x2| + |y1 - y2|
   a. For example, Cost((3,4), (2,6)): x1 = 3, x2 = 2, y1 = 4, y2 = 6, giving 3.
   b. Doing this for all the points:
   c. Total cost = 3 + 0 + 4 + 4 + 3 + 1 + 1 + 0 + 2 + 2 = 20


STEP 2:
1. Select a non-medoid point as a new candidate medoid.
2. New medoids are: C1 = (3, 4), P = (7, 3)
   a. Here the medoids are C1 and P.
   b. The C1 value stays the same; only the second medoid changes (the candidate P replaces C2).
3. Previous medoids were: C1 = (3, 4), C2 = (7, 4)
4. (x1, y1) and (x2, y2) are data points.
5. Manhattan distance formula = |x1 - x2| + |y1 - y2|
   a. For X1 = (2, 6) and P = (7, 3), the values are x1 = 2, y1 = 6, x2 = 7, y2 = 3; substitute them into the formula. The same is done for all points.
6. Manhattan distance [(2, 6), (7, 3)] = |2 - 7| + |6 - 3| = 8

   a. Manhattan distance [(3, 4), (7, 3)] = |3 - 7| + |4 - 3| = 5
   b. Manhattan distance [(3, 8), (7, 3)] = |3 - 7| + |8 - 3| = 9
   c. Similarly, we compute the distances for all points up to X10.

7. New clusters are
   a. C1 = {(2, 6), (3, 4), (3, 8), (4, 7)}
   b. C2 = {(6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)}
   c. Calculate the cost: Cost(C, X) = Σi |ci − xi|
   d. Total cost = Cost((3,4), (2,6)) + Cost((3,4), (3,4)) + Cost((3,4), (3,8)) + Cost((3,4), (4,7)) + Cost((7,3), (6,2)) + Cost((7,3), (6,4)) + Cost((7,3), (7,3)) + Cost((7,3), (7,4)) + Cost((7,3), (8,5)) + Cost((7,3), (7,6))
8. Find the values using the Manhattan distance formula = |x1 - x2| + |y1 - y2|
   a. For example, Cost((3,4), (2,6)): x1 = 3, y1 = 4, x2 = 2, y2 = 6, giving 3.
   b. Doing this for all the points:
   c. Current total cost = 3 + 0 + 4 + 4 + 2 + 2 + 0 + 1 + 3 + 3 = 22
I      X    Y    C1    P    Cluster
X1     2    6    3     8    C1
X2     3    4    0     5    C1
X3     3    8    4     9    C1
X4     4    7    4     7    C1
X5     6    2    5     2    P
X6     6    4    3     2    P
X7     7    3    5     0    P
X8     7    4    4     1    P
X9     8    5    6     3    P
X10    7    6    6     3    P

STEP 3:
1. Cost of swapping medoid C2 with P:
2. S = Current total cost - Previous total cost
   S = 22 - 20 = 2 > 0
Note: In the K-Medoids clustering algorithm, we found the current total cost = 22 and the previous total cost = 20. Since the current total cost is greater than the previous total cost, replacing C2 with P is not a good swap, so we keep the final medoids C1 = (3, 4) and C2 = (7, 4).
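The swap test above can be checked with a short Python sketch that recomputes the total Manhattan cost for each choice of medoids; this is only a verification aid using the points and medoids from the worked example.

```python
# Verifying the K-Medoids swap test with Manhattan distance.
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # each point contributes its distance to the nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

old_cost = total_cost([(3, 4), (7, 4)])   # medoids C1 and C2 -> 20
new_cost = total_cost([(3, 4), (7, 3)])   # medoids C1 and P  -> 22
print(old_cost, new_cost,
      "swap rejected" if new_cost - old_cost > 0 else "swap accepted")
```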


DBSCAN Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering
algorithm designed to identify clusters of varying shapes and sizes within a dataset. It relies on
density criteria to form clusters, distinguishing itself by effectively handling noise (outliers).
How It Works
1. Parameters:
o Epsilon (ε): The maximum distance between two points for them to be considered
neighbors.
o MinPts: The minimum number of points required to form a dense region (a cluster).
2. Core Points, Border Points, and Noise:
o Core Point: A point is a core point if it has at least MinPts points within its ε-
neighborhood.
o Border Point: A point is a border point if it has fewer than MinPts points within its ε-
neighborhood but is within the ε-neighborhood of a core point.
o Noise: Points that are neither core points nor border points are classified as noise.
3. Algorithm Steps:
1. Randomly select a point.
2. If the point is a core point, create a new cluster. Expand the cluster by adding all points
that are density-reachable from the core point (directly or indirectly connected via
other core points).
3. If the point is a border point and not part of any cluster, it is assigned to the nearest
cluster.
4. Continue until all points are processed.
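A minimal sketch with scikit-learn's DBSCAN follows; the two-moons dataset and the parameter values are illustrative assumptions, chosen only to show ε (eps) and MinPts (min_samples) in use.

```python
# DBSCAN on a non-spherical toy dataset; -1 in labels_ marks noise points.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = epsilon, min_samples = MinPts
labels = db.labels_                          # cluster ids; -1 marks noise

print("clusters found:", len(set(labels) - {-1}))
print("noise points  :", list(labels).count(-1))
```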
Advantages
1. Ability to Find Arbitrarily Shaped Clusters: Unlike k-means, DBSCAN can find clusters of
arbitrary shapes and sizes, making it suitable for more complex datasets.
2. No Need to Specify Number of Clusters: Unlike k-means, DBSCAN does not require the
number of clusters to be specified beforehand.
3. Robust to Noise: Effectively identifies outliers as noise points, improving the quality of
clustering results.
4. Minimal Parameters: Only requires two parameters (ε and MinPts), simplifying the model
tuning process.


Disadvantages
1. Parameter Sensitivity: The effectiveness of DBSCAN heavily depends on the choice of ε and
MinPts, which can be difficult to determine.
2. Varying Density: Struggles with datasets containing clusters of varying densities, as the same
ε and MinPts may not be appropriate for all clusters.
3. High-Dimensional Data: Performance degrades with high-dimensional data due to the curse
of dimensionality, making distance measures less meaningful.
4. Scalability: Computationally intensive for large datasets, particularly in higher dimensions,
due to the need to compute distances between all points.
Real-Time Applications
1. Geospatial Data Analysis:
o Application: Identifying geographical clusters, such as hotspots of disease outbreaks
or crime areas.
o Example: Analyzing spatial distribution of earthquakes to detect regions with high
seismic activity.
2. Market Segmentation:
o Application: Clustering customers based on purchasing behavior to identify distinct
market segments.
o Example: Segmenting customers of an e-commerce platform based on their browsing
and purchasing patterns.
3. Anomaly Detection:
o Application: Detecting anomalies in data, such as fraud detection in financial
transactions or identifying network intrusions.
o Example: Identifying unusual patterns in network traffic that may indicate security
breaches.
4. Image Processing:
o Application: Clustering similar images or regions within images for tasks like image
segmentation.
o Example: Grouping pixels in satellite images to identify different land cover types.
Limitations
1. Difficulty in Parameter Selection:
o Choosing appropriate values for ε and MinPts can be challenging and may require
domain knowledge or experimentation.
2. Inefficiency with Varying Density:
o Struggles with datasets containing clusters of varying densities, as the same ε and
MinPts values may not work well for all clusters.
3. Computational Complexity:
o DBSCAN has a time complexity of O(n log n) with spatial indexing structures, but without such optimizations it can be O(n²), making it less efficient for very large datasets.
4. High-Dimensional Data:
o The performance of DBSCAN degrades with high-dimensional data because distance
metrics become less meaningful in high-dimensional spaces, leading to poor clustering
results.


Important Questions
2 Marks Questions
1. Mention any two real-world applications of unsupervised learning.
2. List any two unsupervised learning algorithms in machine learning
3. What is the main difference between K-Means and K-Medoid clustering?
4. Differentiate between supervised and unsupervised learning.
5. What does DBSCAN stand for?
6. What is the primary goal of machine learning?
7. What is conditional probability?
8. How are the initial centroids selected in K-Means clustering?
9. What are the advantages of using K-Means clustering?
10. What is the purpose of using clustering in data analysis?

5 Marks Questions
1. Explain how machine learning works and its importance in modern technology; mention any five real-time applications.
2. Compare and contrast supervised learning and unsupervised learning in terms of data requirements,
algorithm complexity, and typical applications.
3. Why is the K-Means clustering algorithm widely used for grouping data points into clusters? Describe the steps involved in performing K-Means clustering on a dataset.
4. Explain how the K-Medoid clustering algorithm works, taking an example dataset and showing the working procedure.
5. Describe the DBSCAN clustering algorithm. Explain the concepts of core points, border points, and
noise points.
6. Discuss different methods for evaluating the quality of clustering results. Provide examples of metrics
used.
7. Define clustering and explain its significance in data analysis. Discuss its use in various domains.
