Unit 4
Unsupervised Learning vs. Supervised Learning
1. Definition
Unsupervised Learning: Unsupervised learning algorithms are designed to find hidden patterns or intrinsic structures in input data without any labelled responses. These algorithms are particularly useful for tasks like clustering, dimensionality reduction, and anomaly detection.
Supervised Learning: Supervised learning involves training a model on a labelled dataset, which means that each training example is paired with an output label. The goal is for the model to learn the mapping from inputs to outputs and make predictions on new, unseen data.
2. Algorithms
Unsupervised learning algorithms:
1. Clustering Algorithms:
o K-Means Clustering
o Hierarchical Clustering
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
o Gaussian Mixture Models (GMM)
o Mean Shift Clustering
2. Dimensionality Reduction Algorithms:
o Principal Component Analysis (PCA)
o t-Distributed Stochastic Neighbor Embedding (t-SNE)
o Linear Discriminant Analysis (LDA) (note: LDA uses class labels, so it is strictly a supervised technique, though it is often listed alongside these methods)
o Independent Component Analysis (ICA)
o Autoencoders
3. Association Rule Learning Algorithms:
o Apriori Algorithm
o Eclat Algorithm
4. Anomaly Detection Algorithms:
o Isolation Forest
o Local Outlier Factor (LOF)
o One-Class SVM
Supervised learning algorithms:
1. Linear Regression
2. Logistic Regression
3. Decision Trees
4. Random Forests
5. Support Vector Machines (SVM)
6. K-Nearest Neighbours (KNN)
7. Naive Bayes
8. Gradient Boosting Machines (GBM)
9. Neural Networks
3. Advantages
Unsupervised Learning:
1. No Need for Labelled Data: Unsupervised learning does not require labelled data, which can be costly and time-consuming to obtain.
2. Discovery of Hidden Patterns: Capable of uncovering hidden structures and patterns in data that may not be evident.
3. Data Exploration: Useful for exploring data to find natural groupings or to understand data distribution.
4. Flexibility: Can be applied to various types of data, including images, text, and numerical data.
Supervised Learning:
1. Predictive Accuracy: Generally provides higher predictive accuracy when labelled data is available.
2. Interpretability: Many supervised learning models offer interpretable results.
3. Evaluation: Clear and well-defined metrics (e.g., accuracy, precision, recall) to evaluate model performance.
4. Control: Ability to train models with specific desired outputs, making them highly controllable for predictions.
4. Disadvantages
Unsupervised Learning:
1. Interpretability: The results can be difficult to interpret compared to supervised learning outcomes.
2. Evaluation: No straightforward way to evaluate the quality of the results due to the absence of ground truth labels.
3. Scalability: Some unsupervised learning algorithms may struggle with very large datasets.
4. Complexity: Algorithms can be complex and computationally intensive.
Supervised Learning:
1. Labelled Data Requirement: Requires a large amount of labelled data, which can be costly and time-consuming to obtain.
2. Overfitting: Risk of overfitting to the training data, especially with complex models and small datasets.
3. Limited Generalization: May not perform well on data that is significantly different from the training data.
5. Real-time Applications
Unsupervised Learning:
1. Customer Segmentation: Clustering customers based on purchasing behaviour for targeted marketing campaigns.
2. Anomaly Detection in Networks: Identifying unusual patterns that may indicate fraudulent activity or network intrusions.
3. Market Basket Analysis: Finding associations between products to understand buying patterns.
4. Document Clustering: Grouping similar documents for organizing large collections of text data.
5. Image Compression: Using techniques like PCA to reduce the dimensionality of images, making them easier to store and transmit.
6. Recommender Systems: Building systems that recommend products or content based on user behaviour and preferences.
Supervised Learning:
1. Email Spam Detection: Using labelled examples of spam and non-spam emails.
2. Fraud Detection: Identifying fraudulent transactions based on historical labelled data.
3. Image Recognition: Classifying images into categories like identifying objects, faces, or handwritten digits.
4. Medical Diagnosis: Predicting diseases based on labelled medical data (e.g., patient records, lab results).
5. Speech Recognition: Converting spoken language into text using labelled audio data.
6. Limitations
Unsupervised Learning:
1. No Output Guidance: Lack of labelled output makes it challenging to know what the results should look like.
2. Arbitrary Decisions: The number of clusters in clustering, or the number of dimensions to reduce to in dimensionality reduction, can be somewhat arbitrary and may require domain knowledge.
3. Sensitivity to Initial Conditions: Some algorithms, like k-means clustering, can be sensitive to the initial placement of centroids.
4. Scalability Issues: Processing large datasets can be computationally expensive, especially for algorithms like hierarchical clustering.
5. Quality of Results: The quality of results is highly dependent on the quality and nature of the data. Outliers or noise in the data can significantly affect the outcome.
6. Interpretation Challenges: The lack of labels makes it difficult to interpret the results, as there is no clear indication of what the discovered patterns represent.
Supervised Learning:
1. Data Dependency: Performance heavily depends on the quality and quantity of the labelled data.
2. Time and Resource Intensive: Collecting and labelling large datasets can be expensive and time-consuming.
3. Bias: If the training data is biased, the model may produce biased predictions.
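To make the contrast concrete, here is a minimal sketch, assuming scikit-learn is installed; the iris dataset, model choices, and parameter values are illustrative assumptions rather than part of the notes above.

```python
# Hedged sketch: the same features, once with labels (supervised) and once
# without (unsupervised). Assumes scikit-learn; all choices are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn a mapping from features to labels, then evaluate it.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels are given; the algorithm finds groups on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("First ten cluster assignments:", km.labels_[:10])
```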
7. Recommender Systems
Example: Using collaborative filtering to recommend movies based on user viewing history and clustering similar users together.
8. Dimensionality Reduction for Visualization
Use Case: Reducing the number of dimensions in high-dimensional data for visualization purposes.
Benefit: Helps in understanding the data, identifying patterns, and making it accessible to non-
experts.
Example: Using t-SNE to visualize high-dimensional data in 2D or 3D space.
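A minimal sketch of this use case, assuming scikit-learn and matplotlib are installed; the digits dataset and the perplexity value are illustrative choices.

```python
# Hedged sketch: project 64-dimensional digit images down to 2D with t-SNE.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)            # 1797 samples, 64 features
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE projection of the digits dataset")
plt.show()
```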
9. Pattern Recognition in Speech and Audio
Use Case: Identifying patterns in audio data for tasks like speech recognition, music classification,
and sound event detection.
Benefit: Enhances user interfaces, aids in automated transcription, and improves audio content
analysis.
Example: Using spectral clustering to group similar audio segments.
10. Anomaly Detection in Network Security
Use Case: Detecting unusual network traffic patterns that may indicate security threats.
Benefit: Improves network security by identifying potential intrusions or attacks.
Example: Using One-Class SVM to detect anomalies in network traffic data.
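A minimal sketch of this use case, assuming scikit-learn is installed; the synthetic "traffic" features and the nu value are illustrative assumptions.

```python
# Hedged sketch: One-Class SVM trained on normal points, then used to flag
# anomalies. Assumes scikit-learn; data and parameters are illustrative.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # typical traffic
outliers = rng.uniform(low=-6.0, high=6.0, size=(10, 2))  # unusual traffic
X = np.vstack([normal, outliers])

# nu roughly bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal)
pred = model.predict(X)                # +1 = inlier, -1 = anomaly
print("Points flagged as anomalies:", int(np.sum(pred == -1)))
```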
What is Clustering?
Clustering is an unsupervised learning technique used to group similar data points into clusters.
Each cluster consists of data points that are more similar to each other than to those in other clusters.
This technique is used to identify patterns or structures in data when labels are not available.
Types of Clustering
1. Partitioning Clustering
o K-Means Clustering: Divides data into K clusters, where each data point belongs to
the cluster with the nearest mean.
o K-Medoids Clustering: Similar to K-Means, but uses medoids (actual data points)
instead of means to represent clusters.
2. Hierarchical Clustering
o Agglomerative Clustering: A "bottom-up" approach where each data point starts in
its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
o Divisive Clustering: A "top-down" approach where all data points start in one cluster,
and splits are performed recursively.
3. Density-Based Clustering
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters
data points based on density, finding clusters of arbitrary shape and marking outliers
as noise.
o OPTICS (Ordering Points to Identify the Clustering Structure): An extension of
DBSCAN that handles varying densities.
4. Grid-Based Clustering
o STING (Statistical Information Grid): Divides the data space into a grid and performs
clustering operations on the grid cells.
o CLIQUE (Clustering in Quest): Combines grid-based and density-based approaches
for high-dimensional data.
5. Model-Based Clustering
o Gaussian Mixture Models (GMM): Assumes that the data is generated from a mixture of several Gaussian distributions with unknown parameters. (A short code sketch of these clustering families follows this list.)
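As a rough illustration, the sketch below runs one representative algorithm from several of these families on the same toy data; it assumes scikit-learn is installed, and every parameter value is an illustrative choice.

```python
# Hedged sketch: one algorithm per clustering family on the same toy blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Partitioning, hierarchical, density-based, and model-based, respectively.
print(KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)[:10])
print(AgglomerativeClustering(n_clusters=3).fit_predict(X)[:10])
print(DBSCAN(eps=0.8, min_samples=5).fit_predict(X)[:10])   # -1 marks noise
print(GaussianMixture(n_components=3, random_state=42).fit_predict(X)[:10])
```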
Advantages
1. Discovery of Patterns: Identifies hidden patterns and structures in data without pre-labeled
classes.
2. Data Reduction: Simplifies data by grouping similar items, making it easier to analyze.
3. Versatility: Applicable to various types of data and industries.
4. Feature Engineering: Helps in creating new features for other machine learning algorithms.
Disadvantages
1. Choosing the Right Algorithm: Different clustering algorithms can yield different results,
and selecting the appropriate one can be challenging.
2. Number of Clusters: Many algorithms require the number of clusters to be specified
beforehand, which can be difficult to determine.
3. Scalability: Some clustering algorithms do not scale well with large datasets.
4. Sensitivity to Initial Conditions: Methods like K-Means are sensitive to the initial placement
of centroids.
Real-Time Applications
1. Customer Segmentation
o Retail and E-commerce: Grouping customers based on purchasing behavior for
targeted marketing.
o Example: Using K-Means clustering to identify customer segments for personalized
offers.
2. Document Clustering
o Information Retrieval: Grouping similar documents for search optimization.
o Example: Hierarchical clustering of news articles to categorize them by topic.
3. Image Segmentation
o Computer Vision: Dividing an image into segments to identify objects or regions.
o Example: Using DBSCAN to segment regions in satellite images for land cover
classification.
4. Anomaly Detection
o Network Security: Identifying unusual patterns that may indicate security threats.
o Example: Clustering network traffic data to detect intrusions.
5. Market Basket Analysis
o Retail: Identifying products that are frequently bought together.
o Example: Using hierarchical clustering to group products based on co-purchase
patterns.
6. Bioinformatics
o Genomics: Grouping genes or proteins with similar expression patterns.
o Example: Using hierarchical clustering to analyze gene expression data.
7. Social Network Analysis
o Community Detection: Identifying groups of related individuals in social networks.
o Example: Using density-based clustering to find communities in social media data.
Limitations
1. Scalability Issues: Some clustering algorithms are computationally intensive and may not
perform well on large datasets.
2. Handling of Different Data Types: Many clustering algorithms assume data points are in a
continuous space and may struggle with categorical or mixed data types.
3. Interpretability: The results of clustering can sometimes be difficult to interpret, especially
with high-dimensional data.
4. Predefined Number of Clusters: Algorithms like K-Means require the number of clusters to
be specified beforehand, which can be impractical without prior knowledge.
5. Sensitivity to Noise and Outliers: Algorithms like K-Means are sensitive to outliers, which
can skew the results.
Real-Time Applications of K-Means Clustering
1. Customer Segmentation
o Retail and E-commerce: Grouping customers based on purchasing behavior for
personalized marketing strategies.
o Example: Identifying customer segments for targeted promotions.
2. Image Compression
o Computer Vision: Reducing the number of colors in an image by clustering similar
colors.
o Example: Compressing images by replacing similar colors with cluster centroids.
3. Document Clustering
o Information Retrieval: Grouping similar documents to improve search and
organization.
o Example: Organizing news articles into topics.
4. Market Segmentation
o Marketing: Identifying distinct market segments for tailored advertising campaigns.
o Example: Clustering survey responses to identify different consumer segments.
5. Anomaly Detection
o Network Security: Detecting unusual patterns in network traffic.
o Example: Identifying anomalies in network traffic data to detect potential security
threats.
6. Genomics
o Bioinformatics: Grouping genes with similar expression patterns.
o Example: Clustering gene expression data to identify functional gene groups.
Limitations
1. Fixed Number of Clusters: The need to specify the number of clusters (K) can be impractical
without prior knowledge.
2. Scalability with Large K: Performance degrades with a very large number of clusters.
3. Sensitive to Scale: Results can be influenced by the scale of the data, requiring normalization or standardization (see the sketch after this list).
4. Non-Convex Shapes: Struggles with clusters that are not spherical or have varying sizes and
densities.
5. Handling Categorical Data: Designed for continuous numerical data, making it less effective
for categorical or mixed data types without pre-processing.
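As a sketch of limitation 3, the example below standardizes features before clustering so that one large-scale feature does not dominate the distance computation; it assumes scikit-learn is installed, and the tiny dataset is illustrative.

```python
# Hedged sketch: K-Means on features with very different scales, handled by
# standardizing first. Assumes scikit-learn; the data is illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Two features on very different scales (e.g., age vs. annual income).
X = np.array([[25, 20000], [30, 22000], [45, 90000], [50, 95000]], float)

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
print("Cluster assignments:", km.labels_)
```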
Worked Example: K-Means Clustering (K = 2)
Data points: {2, 4, 10, 12, 3, 20, 30, 11, 25}; initial centroids M1 = 4 and M2 = 11. The distance of a point x from a centroid M is the absolute difference |x - M|.

Data Point | Distance to M1 = 4 | Distance to M2 = 11 | Cluster
2 | 2 | 9 | C1
4 | 0 | 7 | C1
10 | 6 | 1 | C2
12 | 8 | 1 | C2
3 | 1 | 8 | C1
20 | 16 | 9 | C2
30 | 26 | 19 | C2
11 | 7 | 0 | C2
25 | 21 | 14 | C2

Table 1: Each data point is assigned to the cluster of the nearer centroid (C1 for M1, C2 for M2).

1. Therefore: C1 = {2, 4, 3}, so the new M1 = (2 + 4 + 3) / 3 = 9 / 3 = 3
C2 = {10, 12, 20, 30, 11, 25}, so the new M2 = (10 + 12 + 20 + 30 + 11 + 25) / 6 = 108 / 6 = 18
2. New centroids: M1 = 3, M2 = 18
Data Point | Distance to M1 = 3 | Distance to M2 = 18 | Old Cluster | New Cluster
2 | 1 | 16 | C1 | C1
4 | 1 | 14 | C1 | C1
10 | 7 | 8 | C2 | C1
12 | 9 | 6 | C2 | C2
3 | 0 | 15 | C1 | C1
20 | 17 | 2 | C2 | C2
30 | 27 | 12 | C2 | C2
11 | 8 | 7 | C2 | C2
25 | 22 | 7 | C2 | C2

Therefore: data point 10 has moved from C2 to C1, so the old and new assignments do not match. We repeat the same process until every point keeps the same cluster in two consecutive iterations. The new centroids are M1 = (2 + 4 + 10 + 3) / 4 = 4.75 and M2 = (12 + 20 + 30 + 11 + 25) / 5 = 19.6.
Data Point | Distance to M1 = 4.75 | Distance to M2 = 19.6 | Old Cluster | New Cluster
2 | 2.75 | 17.6 | C1 | C1
4 | 0.75 | 15.6 | C1 | C1
10 | 5.25 | 9.6 | C1 | C1
12 | 7.25 | 7.6 | C2 | C1
3 | 1.75 | 16.6 | C1 | C1
20 | 15.25 | 0.4 | C2 | C2
30 | 25.25 | 10.4 | C2 | C2
11 | 6.25 | 8.6 | C2 | C1
25 | 20.25 | 5.4 | C2 | C2

Data points 12 and 11 have now moved to C1, so we update the centroids once more: M1 = (2 + 4 + 10 + 12 + 3 + 11) / 6 = 42 / 6 = 7 and M2 = (20 + 30 + 25) / 3 = 75 / 3 = 25.
Final Iteration
Data Point | Distance to M1 = 7 | Distance to M2 = 25 | Old Cluster | New Cluster
2 | 5 | 23 | C1 | C1
4 | 3 | 21 | C1 | C1
10 | 3 | 15 | C1 | C1
12 | 5 | 13 | C1 | C1
3 | 4 | 22 | C1 | C1
20 | 13 | 5 | C2 | C2
30 | 23 | 5 | C2 | C2
11 | 4 | 14 | C1 | C1
25 | 18 | 0 | C2 | C2

Every point keeps the same cluster, so the algorithm has converged. Final clusters: C1 = {2, 3, 4, 10, 11, 12} with centroid M1 = 7, and C2 = {20, 25, 30} with centroid M2 = 25.
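The iterations above can be reproduced with a short NumPy sketch; the data points and starting centroids come directly from the tables, while the loop itself is an illustrative implementation of the assign-then-update cycle.

```python
# Sketch reproducing the 1-D worked example: assign points to the nearest
# centroid, recompute means, repeat until the centroids stop changing.
import numpy as np

points = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
centroids = np.array([4.0, 11.0])      # initial M1 and M2 from Table 1

while True:
    # Distance of every point to every centroid (absolute difference).
    assign = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # New centroid = mean of the points assigned to it.
    new_centroids = np.array([points[assign == k].mean() for k in range(2)])
    if np.array_equal(new_centroids, centroids):   # converged
        break
    centroids = new_centroids

print("Final centroids:", centroids)        # expected: [ 7. 25.]
print("Cluster C1:", points[assign == 0])   # expected: 2, 4, 10, 12, 3, 11
print("Cluster C2:", points[assign == 1])   # expected: 20, 30, 25
```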
3. Document Clustering: Grouping similar documents together for information retrieval or
organization purposes.
4. Image Segmentation: Identifying and clustering similar regions within an image for object
recognition or image processing tasks.
5. Healthcare: Clustering patients with similar medical histories or symptoms for personalized
treatment plans.
Example: K-Medoids Clustering Algorithm
1. Take the example dataset below.
2. Apply the K-Medoids clustering algorithm to form two clusters.
3. Use the Manhattan distance to find the distance between each data point and each medoid.
Point | X | Y
X1 | 2 | 6
X2 | 3 | 4
X3 | 3 | 8
X4 | 4 | 7
X5 | 6 | 2
X6 | 6 | 4
X7 | 7 | 3
X8 | 7 | 4
X9 | 8 | 5
X10 | 7 | 6
STEPS
STEP 1:
1. Select two medoids randomly: C1 = (3, 4) and C2 = (7, 4).
2. For data points (x1, y1) and (x2, y2), the Manhattan distance formula is |x1 - x2| + |y1 - y2|.
3. For example, for X1 = (2, 6) and C1 = (3, 4), the values are x1 = 2, x2 = 3, y1 = 6, y2 = 4, so Manhattan distance [(2, 6), (3, 4)] = |2 - 3| + |6 - 4| = 1 + 2 = 3. Substitute the values in the formula in the same way for all the data points and both medoids.
Point | X | Y | Distance to C1 (3, 4) | Distance to C2 (7, 4) | Cluster
X1 | 2 | 6 | 3 | 7 | C1
X2 | 3 | 4 | 0 | 4 | C1
X3 | 3 | 8 | 4 | 8 | C1
X4 | 4 | 7 | 4 | 6 | C1
X5 | 6 | 2 | 5 | 3 | C2
X6 | 6 | 4 | 3 | 1 | C2
X7 | 7 | 3 | 5 | 1 | C2
X8 | 7 | 4 | 4 | 0 | C2
X9 | 8 | 5 | 6 | 2 | C2
X10 | 7 | 6 | 6 | 2 | C2

Table 2: Distances of every data point to C1 = (3, 4) and C2 = (7, 4); each point is assigned to the cluster with the minimum distance.
4. The clusters are:
C1 = {(2, 6), (3, 4), (3, 8), (4, 7)}
C2 = {(6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)}
5. Previous Total Cost = the sum of each point's Manhattan distance to its own medoid:
Total Cost = Cost((3,4),(2,6)) + Cost((3,4),(3,4)) + Cost((3,4),(3,8)) + Cost((3,4),(4,7)) + Cost((7,4),(6,2)) + Cost((7,4),(6,4)) + Cost((7,4),(7,3)) + Cost((7,4),(7,4)) + Cost((7,4),(8,5)) + Cost((7,4),(7,6))
= 3 + 0 + 4 + 4 + 3 + 1 + 1 + 0 + 2 + 2 = 20
STEP 2:
6. Select a non-medoid point P = (7, 3) and test swapping medoid C2 = (7, 4) with P. Recompute the total cost with medoids C1 = (3, 4) and P = (7, 3):
Total Cost = Cost((3,4),(2,6)) + Cost((3,4),(3,4)) + Cost((3,4),(3,8)) + Cost((3,4),(4,7)) + Cost((7,3),(6,2)) + Cost((7,3),(6,4)) + Cost((7,3),(7,3)) + Cost((7,3),(7,4)) + Cost((7,3),(8,5)) + Cost((7,3),(7,6))
7. Find each value using the Manhattan distance formula |x1 - x2| + |y1 - y2|, as in the table below.
Point | X | Y | Distance to C1 (3, 4) | Distance to P (7, 3) | Cluster
X1 | 2 | 6 | 3 | 8 | C1
X2 | 3 | 4 | 0 | 5 | C1
X3 | 3 | 8 | 4 | 9 | C1
X4 | 4 | 7 | 4 | 7 | C1
X5 | 6 | 2 | 5 | 2 | P
X6 | 6 | 4 | 3 | 2 | P
X7 | 7 | 3 | 5 | 0 | P
X8 | 7 | 4 | 4 | 1 | P
X9 | 8 | 5 | 6 | 3 | P
X10 | 7 | 6 | 6 | 3 | P
STEP 3:
1. Cost of swapping medoid C2 with P:
Current Total Cost = 3 + 0 + 4 + 4 + 2 + 2 + 0 + 1 + 3 + 3 = 22
2. S = Current Total Cost - Previous Total Cost = 22 - 20 = 2 > 0
Note: In the K-Medoids clustering algorithm we found Current Total Cost = 22 and Previous Total Cost = 20. Since the current total cost is greater than the previous total cost, the swap would make the clustering worse, so replacing C2 with P is rejected. The final medoids are C1 = (3, 4) and C2 = (7, 4).
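The swap test above can be verified with a short Python sketch; the points, the medoids, and the candidate P come from the example, while the helper functions are illustrative.

```python
# Sketch of the K-Medoids swap test: total Manhattan cost before and after
# swapping medoid C2 = (7, 4) with the candidate point P = (7, 3).
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    # Each point contributes its distance to the nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

previous = total_cost(points, [(3, 4), (7, 4)])   # original medoids
current = total_cost(points, [(3, 4), (7, 3)])    # C2 swapped with P
print(previous, current)                          # expected: 20 22
print("Reject swap" if current > previous else "Accept swap")
```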
DBSCAN Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering
algorithm designed to identify clusters of varying shapes and sizes within a dataset. It relies on
density criteria to form clusters, distinguishing itself by effectively handling noise (outliers).
How It Works
1. Parameters:
o Epsilon (ε): The maximum distance between two points for them to be considered
neighbors.
o MinPts: The minimum number of points required to form a dense region (a cluster).
2. Core Points, Border Points, and Noise:
o Core Point: A point is a core point if it has at least MinPts points within its ε-
neighborhood.
o Border Point: A point is a border point if it has fewer than MinPts points within its ε-
neighborhood but is within the ε-neighborhood of a core point.
o Noise: Points that are neither core points nor border points are classified as noise.
3. Algorithm Steps:
1. Randomly select a point.
2. If the point is a core point, create a new cluster. Expand the cluster by adding all points
that are density-reachable from the core point (directly or indirectly connected via
other core points).
3. If the point is a border point and not part of any cluster, it is assigned to the nearest
cluster.
4. Continue until all points are processed. (A minimal code sketch of these steps follows below.)
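A minimal sketch of these steps using scikit-learn's DBSCAN implementation, assuming it is installed; the two-moons data and the ε and MinPts values are illustrative choices.

```python
# Hedged sketch: DBSCAN on toy data, reporting clusters, core points, and
# noise. Assumes scikit-learn; eps and min_samples are illustrative.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = ε, min_samples = MinPts
labels = db.labels_                          # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Core points:", len(db.core_sample_indices_))
print("Noise points:", int(np.sum(labels == -1)))
```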
Advantages
1. Ability to Find Arbitrarily Shaped Clusters: Unlike k-means, DBSCAN can find clusters of
arbitrary shapes and sizes, making it suitable for more complex datasets.
2. No Need to Specify Number of Clusters: Unlike k-means, DBSCAN does not require the
number of clusters to be specified beforehand.
3. Robust to Noise: Effectively identifies outliers as noise points, improving the quality of
clustering results.
4. Minimal Parameters: Only requires two parameters (ε and MinPts), simplifying the model
tuning process.
Disadvantages
1. Parameter Sensitivity: The effectiveness of DBSCAN heavily depends on the choice of ε and
MinPts, which can be difficult to determine.
2. Varying Density: Struggles with datasets containing clusters of varying densities, as the same
ε and MinPts may not be appropriate for all clusters.
3. High-Dimensional Data: Performance degrades with high-dimensional data due to the curse
of dimensionality, making distance measures less meaningful.
4. Scalability: Computationally intensive for large datasets, particularly in higher dimensions,
due to the need to compute distances between all points.
Real-Time Applications
1. Geospatial Data Analysis:
o Application: Identifying geographical clusters, such as hotspots of disease outbreaks
or crime areas.
o Example: Analyzing spatial distribution of earthquakes to detect regions with high
seismic activity.
2. Market Segmentation:
o Application: Clustering customers based on purchasing behavior to identify distinct
market segments.
o Example: Segmenting customers of an e-commerce platform based on their browsing
and purchasing patterns.
3. Anomaly Detection:
o Application: Detecting anomalies in data, such as fraud detection in financial
transactions or identifying network intrusions.
o Example: Identifying unusual patterns in network traffic that may indicate security
breaches.
4. Image Processing:
o Application: Clustering similar images or regions within images for tasks like image
segmentation.
o Example: Grouping pixels in satellite images to identify different land cover types.
Limitations
1. Difficulty in Parameter Selection:
o Choosing appropriate values for ε and MinPts can be challenging and may require domain knowledge or experimentation (a common k-distance heuristic is sketched after this list).
2. Inefficiency with Varying Density:
o Struggles with datasets containing clusters of varying densities, as the same ε and
MinPts values may not work well for all clusters.
3. Computational Complexity:
o DBSCAN has a time complexity of O(n log n) with spatial indexing structures, but without such optimizations it can be O(n²), making it less efficient for very large datasets.
4. High-Dimensional Data:
o The performance of DBSCAN degrades with high-dimensional data because distance
metrics become less meaningful in high-dimensional spaces, leading to poor clustering
results.
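One common heuristic for the parameter-selection problem is the k-distance plot: sort every point's distance to its k-th nearest neighbour and look for an "elbow", which is then used as ε. A minimal sketch, assuming scikit-learn and matplotlib are installed; the data and the value of k are illustrative.

```python
# Hedged sketch: k-distance plot as a heuristic for choosing DBSCAN's eps.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

k = 5  # typically set equal to MinPts; illustrative here
# kneighbors on the training data returns each point itself first (distance
# zero), so request k + 1 neighbours and keep the last column.
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)                        # a pronounced bend suggests eps
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}th nearest neighbour")
plt.show()
```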
Important Questions
2 Marks Questions
1. Mention any two real-world applications of unsupervised learning.
2. List any two unsupervised learning algorithms in machine learning.
3. What is the main difference between K-Means and K-Medoids clustering?
4. Differentiate between supervised and unsupervised learning.
5. What does DBSCAN stand for?
6. What is the primary goal of machine learning?
7. What is conditional probability?
8. How are the initial centroids selected in K-Means clustering?
9. What are the advantages of using K-Means clustering?
10. What is the purpose of using clustering in data analysis?
5 Marks Questions
1. Explain how machine learning works and its importance in modern technology; mention any five real-time applications.
2. Compare and contrast supervised learning and unsupervised learning in terms of data requirements, algorithm complexity, and typical applications.
3. Why is the K-Means clustering algorithm widely used for grouping data points into clusters? Describe the steps involved in performing K-Means clustering on a dataset.
4. Explain how the K-Medoids clustering algorithm works, using an example dataset and the working procedure.
5. Describe the DBSCAN clustering algorithm. Explain the concepts of core points, border points, and
noise points.
6. Discuss different methods for evaluating the quality of clustering results. Provide examples of metrics
used.
7. Define clustering and explain its significance in data analysis. Discuss its use in various domains.