UNIT 4: Unsupervised Learning
Unsupervised Learning
• Unsupervised learning is a type of machine learning where the algorithm learns patterns and
structures from input data without explicit supervision or labeled responses.
• In unsupervised learning, the algorithm is given a dataset consisting of features but not the
corresponding target labels. The goal is to explore the inherent structure of the data, identify
patterns, and extract meaningful insights.
• Unsupervised Learning is the training of a machine using information that is neither
classified nor labelled, which allows the algorithm to act on that information without
guidance.
TYPES OF UNSUPERVISED LEARNING
CLUSTERING:
❖ Clustering is a method of grouping objects into clusters such that objects with the most
similarities remain in the same group and have few or no similarities with the objects of
another group.
❖ Cluster analysis finds the commonalities between the data objects and categorizes them
as per the presence and absence of those commonalities.
ASSOCIATION:
❖ An association rule is an unsupervised learning method which is used for finding the
relationships between variables in a large database.
❖ It determines the sets of items that occur together in the dataset. Association rules make
marketing strategy more effective: for example, people who buy item X (say, bread)
also tend to purchase item Y (butter/jam). A typical example of association rules is
Market Basket Analysis.
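As a small illustration of market basket analysis, the sketch below assumes the mlxtend library (its TransactionEncoder, apriori, and association_rules helpers) and uses a tiny, made-up set of transactions:

```python
# Minimal market-basket sketch; the transactions are illustrative data and
# mlxtend is an assumed dependency (pip install mlxtend).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
]

# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 50%, then rules such as {bread} -> {butter}
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```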
Difference Between Supervised & Unsupervised Learning
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labelled data. | Unsupervised learning algorithms are trained using unlabelled data.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems.
A supervised learning model produces an accurate result. | An unsupervised learning model may give a less accurate result compared to supervised learning.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc. | It includes various algorithms such as Clustering, KNN, and the Apriori algorithm.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns in the same way a child learns daily routine things from experience.
Applications of Unsupervised Learning
GENOME ANALYSIS: These Algorithms can analyze genetic data to uncover patterns and
links for genetic research.
IMAGE AND TEXT CLUSTERING: It can automatically group similar
images and texts, aiding in tasks like image organization, document clustering,
or content recommendation.
CUSTOMER SEGMENTATION: Customers can be grouped based on their purchase
behavior, allowing organizations to customize marketing efforts.
SEMANTIC CLUSTERING: It organizes all the responses with the same meaning into
clusters to ensure that customers quickly and easily get the information they need.
ANOMALY DETECTION: Unsupervised learning is used to identify data points, events,
and/or observations that deviate from a dataset's normal behavior.
MARKET BASKET ANALYSIS: Past purchase behavior coupled with unsupervised
learning can be used to help businesses discover data trends that they could use to develop
effective cross-selling strategies.
CLUSTERING
Clustering, in unsupervised machine learning, is the process of grouping unlabeled data into
clusters based on their similarities.
Broadly, this technique is applied to group data based on the different patterns, such as
similarities or differences, that our machine model finds.
Clustering aims at forming groups of homogeneous data points from a heterogeneous
dataset. It evaluates similarity based on a metric like Euclidean distance, Cosine
similarity, Manhattan distance, etc., and then groups the points with the highest similarity
scores together.
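For a concrete feel for the similarity metrics mentioned above, here is a minimal sketch using SciPy's distance helpers on two arbitrary points:

```python
# Small sketch of common distance/similarity metrics; the two points are
# arbitrary illustrative vectors.
from scipy.spatial import distance

a = [185, 72]
b = [170, 56]

print("Euclidean:", distance.euclidean(a, b))        # straight-line distance
print("Manhattan:", distance.cityblock(a, b))        # sum of absolute differences
print("Cosine similarity:", 1 - distance.cosine(a, b))  # 1 minus cosine distance
```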
Types of Clustering
PARTITIONING-BASED CLUSTERING: These methods divide the data set into distinct groups.
They aim to create partitions such that objects within each partition are as similar as possible.
• K-means Clustering: Partitions data into k clusters based on centroid minimization.
• K-medoids Clustering: Similar to k-means but uses medoids (most centrally located
points) instead of centroids.
HIERARCHICAL CLUSTERING: These methods create a hierarchy of clusters that can be
represented in a tree structure (dendrogram).
• Agglomerative (Bottom-Up): Starts with each object in its own cluster and merges the
closest pairs of clusters iteratively.
• Divisive (Top-Down): Starts with all objects in one cluster and splits the cluster iteratively.
DENSITY-BASED CLUSTERING: These methods find clusters based on the density of points
in the data space.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Finds clusters
based on the density of data points, capable of finding arbitrary-shaped clusters and dealing
with noise.
• OPTICS (Ordering Points to Identify the Clustering Structure): An extension of
DBSCAN that works better with varied densities.
GRID BASED CLUSTERING: These methods involve dividing the data space into a finite
number of cells that form a grid structure.
• STING (Statistical Information Grid): Uses a multi-resolution grid data structure for
efficient clustering.
• CLIQUE (Clustering In QUEst): Identifies dense regions in subspaces of high-
dimensional data and merges them to form clusters.
APPLICATIONS OF CLUSTERING:
MARKETING: Clustering can be used for marketing reasons to characterize and discover client
segments.
BIOLOGY: It can be used for classification among different species of plants and animals.
IMAGE PROCESSING: It can be used to group similar images together, classify images based on
content and identify patterns in image data.
FRAUD DETECTION: Clustering is used to identify suspicious patterns or anomalies in financial
transactions, thereby helping in fraud detection.
CYBERSECURITY: It is used to group similar patterns of network traffic or system behavior,
which can help in detecting and preventing cyber attacks.
CLIMATE ANALYSIS: It is used for grouping comparable patterns of climate data, such as
temperature, precipitation, and wind, in order to better understand climate change and its influence
on the environment.
MEDICAL DIAGNOSIS: It is used to group patients with similar symptoms or diseases, which
helps in making accurate diagnoses and identifying effective treatments.
FINANCE: Clustering is a technique used to identify market groupings based on customer behavior,
find patterns in stock market data, and assess risk in investment portfolios.
K-MEANS CLUSTERING
➢ K-means clustering is a very popular clustering algorithm which is applied when we have a
dataset with unknown labels.
➢ The goal is to find certain groups based on some kind of similarity in the data with the
number of groups represented by K.
➢ This algorithm is generally used in areas like market segmentation, customer
segmentation, etc. But, it can also be used to segment different objects in the images on the
basis of the pixel values.
How K-Means Clustering Works?
Here’s how it works:
1. Initialization: Start by randomly selecting K points from the dataset. These points will
act as the initial cluster centroids.
2. Assignment: For each data point in the dataset, calculate the distance between that
point and each of the K centroids. Assign the data point to the cluster whose centroid
is closest to it. This step effectively forms K clusters.
3. Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the
centroids no longer change significantly or when a specified number of iterations is
reached.
5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster
centroids and the assignment of each data point to a cluster .
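A minimal from-scratch sketch of these steps (assuming a NumPy array X of shape (n_samples, n_features) and that no cluster ever becomes empty during the iterations) might look like this:

```python
# Bare-bones K-Means following the steps above: random init, assign, update,
# repeat until the centroids stop moving.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random points from the dataset as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```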
Problem on K-Means Clustering
Q. Apply K(=2)-Means algorithm over the data (185, 72), (170, 56), (168, 60),
(179,68), (182,72), (188,77) up to two iterations and show the clusters. Initially
choose first two objects as initial centroids.
Solution:
Given: the number of clusters to be created (K) = 2, say c1 and c2, and the number of iterations = 2.
The given data points can be represented in tabular form as:

Point | x | y
A1 | 185 | 72
A2 | 170 | 56
A3 | 168 | 60
A4 | 179 | 68
A5 | 182 | 72
A6 | 188 | 77

Also, taking the first two objects as the initial centroids:
Centroid for first cluster c1 = (185, 72)
Centroid for second cluster c2 = (170, 56)
Iteration 1: Now calculating similarity using the Euclidean distance measure,
d(a, b) = √((a1 − b1)² + (a2 − b2)²), and representing the distances in tabular form:

Point | Distance from c1 = (185, 72) | Distance from c2 = (170, 56) | Assigned cluster
A1 (185, 72) | 0.00 | 21.93 | c1
A2 (170, 56) | 21.93 | 0.00 | c2
A3 (168, 60) | 20.81 | 4.47 | c2
A4 (179, 68) | 7.21 | 15.00 | c1
A5 (182, 72) | 3.00 | 20.00 | c1
A6 (188, 77) | 5.83 | 27.66 | c1

The resulting clusters after the first iteration are:
c1 = {A1, A4, A5, A6} and c2 = {A2, A3}
Iteration 2: Now calculating the centroid of each cluster as the mean of its data points:
c1 = ((185 + 179 + 182 + 188)/4, (72 + 68 + 72 + 77)/4) = (183.5, 72.25)
c2 = ((170 + 168)/2, (56 + 60)/2) = (169, 58)

Now, again calculating the Euclidean distances and representing them in tabular form:

Point | Distance from c1 = (183.5, 72.25) | Distance from c2 = (169, 58) | Assigned cluster
A1 (185, 72) | 1.52 | 21.26 | c1
A2 (170, 56) | 21.13 | 2.24 | c2
A3 (168, 60) | 19.76 | 2.24 | c2
A4 (179, 68) | 6.19 | 14.14 | c1
A5 (182, 72) | 1.52 | 19.10 | c1
A6 (188, 77) | 6.54 | 26.87 | c1

The resulting clusters after the second iteration are:
c1 = {A1, A4, A5, A6} and c2 = {A2, A3}
As we were asked to perform only two iterations, the numerical ends here.
Since the clustering does not change after the second iteration, we could terminate the
iterations at this point even if the question had not said so.
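As a cross-check, the worked example can be reproduced with scikit-learn's KMeans by passing the first two objects as the initial centroids; the labels and centres noted in the comments are the expected values from the hand calculation above:

```python
# Reproducing the worked example with scikit-learn, using the first two points
# as the initial centroids as the question specifies.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77]])
init = np.array([[185, 72], [170, 56]])

km = KMeans(n_clusters=2, init=init, n_init=1, max_iter=2).fit(X)
print(km.labels_)           # expected: [0 1 1 0 0 0]  (cluster 0 = c1, cluster 1 = c2)
print(km.cluster_centers_)  # expected: [[183.5, 72.25], [169.0, 58.0]]
```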
Limitations of K Means Clustering
• Sensitivity to Initial Conditions: K-means is sensitive to initial conditions.
The algorithm randomly initializes the cluster centroids at the beginning, and
the final clustering results can vary depending on these initial positions.
Different initializations can lead to different local optima, resulting in different
clustering outcomes. This makes the K-means algorithm less reliable and
reproducible.
• Difficulty in Determining K: One of the drawbacks of the K-means
algorithm is that we have to set the number of clusters (k) in advance.
Choosing an incorrect number of clusters can lead to inaccurate results
(a common heuristic, the elbow method, is sketched after this list).
• K-means requires us to specify the number of clusters a priori.
• K-Means is sensitive towards outliers. Outliers can skew the clusters in K-
Means to a very large extent.
• K-Means forms spherical clusters only. It fails when the data is not spherical (e.g.,
oblong or elliptical) or is of arbitrary shape.
• Inability to Handle Categorical Data: When categorical data is used with the
K-means algorithm, it requires converting the categories into numerical values,
such as using one-hot encoding. One shortcoming of using one-hot encoding is
that it treats each feature independently and can degrade performance since it
can significantly increase data dimensionality.
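To illustrate the difficulty of choosing K mentioned in the list above, here is a sketch of the elbow method on a synthetic blob dataset; make_blobs and the range of k values are illustrative choices:

```python
# Elbow method sketch: plot inertia (within-cluster sum of squares) for several
# values of k and look for the "bend" in the curve.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # toy data

inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)]

plt.plot(range(1, 9), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.title("Elbow method: look for the bend in the curve")
plt.show()
```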
CLUSTERING FOR IMAGE SEGMENTATION
Image Segmentation is the process of partitioning an image into multiple regions based on the
characteristics of the pixels in the original image. Clustering is a technique to group similar entities
and label them. Thus, for image segmentation using clustering, we can cluster similar pixels using
a clustering algorithm and group a particular cluster pixel as a single segment.
K-Means clustering for Image Segmentation
K-Means is a partitioning type of clustering where we do not have the labels of pixels/data
points beforehand. The number of clusters or groups formed is taken as K. These clusters are
derived based on some similar or common characteristics between the pixels or data points.
The steps involved in K-Means clustering are
• Select a particular value of K
• A feature is taken for each pixel, such as its RGB value.
• Similar pixels are grouped by using distance measures such as the Euclidean distance.
• K-Means is then run, assigning each pixel to the nearest cluster centre.
• Generally, a threshold is defined; if the calculated distance falls within it, the pixel is
grouped into that cluster.
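Putting the steps above together, here is a sketch of K-Means image segmentation with scikit-learn and Pillow; "photo.jpg" and k = 4 are placeholder choices:

```python
# K-Means image segmentation sketch: each pixel's RGB value is a feature, and
# every pixel is replaced by the centre of the cluster it is assigned to.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.jpg").convert("RGB"))   # placeholder image path
pixels = img.reshape(-1, 3).astype(float)                   # (n_pixels, 3) feature matrix

k = 4                                                        # chosen number of segments
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Replace each pixel by its cluster centre and restore the image shape
segmented = km.cluster_centers_[km.labels_].astype(np.uint8).reshape(img.shape)
Image.fromarray(segmented).save("segmented.jpg")
```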
Clustering for Preprocessing:
Since a clustering process considers all the attributes of the dataset and reduces the information
to a cluster, which is really another attribute (i.e., the ID of the cluster to which a record
belongs), clustering can be used as a data compression technique. The output of clustering is
the cluster ID for each record, and it can be used as an input variable for other data science tasks.
Hence, clustering can be employed as a preprocessing technique for other data science processes.
In general, clustering can be used for two types of preprocessing:
1. Clustering to reduce dimensionality: In an n-dimensional dataset (n number of attributes),
the computational complexity is proportional to the number of dimensions or "n." With
clustering, n-dimensional attributes can be converted or reduced to one categorical
attribute—"Cluster ID." This reduces the complexity, although there will be some loss of
information because of the dimensionality reduction to one single attribute.
2. Clustering for object reduction: Assume that the number of customers for an organization
is in the millions and the number of cluster groups is 100. For each of these 100 cluster
groups, one “poster child” customer can be identified that represents the typical
characteristics of all the customers in that cluster group. The poster child customer can be
an actual customer or a fictional customer. The prototype of a cluster is the most common
representation of all the customers in a group. Reducing millions of customer records to
100 prototype records provides an obvious benefit. In some applications, instead of
processing millions of records, just the prototypes can be processed for further
classification or regression tasks. This greatly reduces the record count and the dataset can
be made appropriate for classification by algorithms like k-nearest neighbor (k-NN) where
computation complexity depends on the number of records.
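A rough sketch of both preprocessing ideas, using a synthetic customer matrix and scikit-learn's KMeans (the 100-cluster figure mirrors the example above; the data size and attribute count are arbitrary stand-ins):

```python
# (1) Dimensionality reduction: add the cluster ID as one categorical attribute.
# (2) Object reduction: keep 100 prototype records (cluster centres) instead of
#     processing every customer record.
import numpy as np
from sklearn.cluster import KMeans

customers = np.random.rand(100_000, 8)        # stand-in for a large customer table

km = KMeans(n_clusters=100, n_init=3, random_state=0).fit(customers)

cluster_id = km.labels_.reshape(-1, 1)        # one "Cluster ID" column per record
prototypes = km.cluster_centers_              # 100 prototype records, one per group

print(cluster_id.shape, prototypes.shape)
```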
Using Clustering for semi supervised learning
Using clustering for Semi-supervised learning can offer several benefits over purely
supervised or unsupervised learning methods. First, it can leverage the large amount of unlabeled
data that is often available in real-world scenarios, and use it to enhance the accuracy and
generalization of the model. Second, it can reduce the cost and effort of labeling data manually,
which can be tedious, subjective, or error-prone. Third, it can discover hidden patterns or structures
in the data that may not be obvious or captured by the existing labels.
How does clustering work with semi-supervised learning?
Semi-supervised learning with clustering can be implemented in various ways, depending on
the type of data, the clustering algorithm, and the supervised learning model. A common approach
is to first apply a clustering algorithm to the unlabelled data, and then combine the labelled and
unlabeled data, using the cluster labels as pseudo-labels for the unlabelled data. Next, a supervised
learning model is trained on the combined data, with the original labels and pseudo-labels serving
as target variables. Finally, the model can be evaluated on a test set, and optionally refined or
adjusted.
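A minimal sketch of this pseudo-labelling recipe, using scikit-learn's digits dataset, KMeans, and LogisticRegression purely as illustrative choices (the 10% labelled split and the majority-label transfer rule are assumptions of the sketch):

```python
# Cluster the unlabelled portion, transfer each cluster's majority true label
# onto its members as a pseudo-label, then train a supervised model on the
# combined labelled + pseudo-labelled data.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(X, y, train_size=0.1, random_state=0)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_unlab)

# Pseudo-label each cluster with the majority label of the labelled points
# that fall into it (fall back to 0 if no labelled point lands in a cluster).
lab_clusters = km.predict(X_lab)
pseudo = np.empty(len(X_unlab), dtype=int)
for c in range(10):
    mask = lab_clusters == c
    majority = np.bincount(y_lab[mask]).argmax() if mask.any() else 0
    pseudo[km.labels_ == c] = majority

clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_lab, X_unlab]), np.concatenate([y_lab, pseudo]))
print("true labels used:", len(y_lab), "| pseudo-labels used:", len(pseudo))
```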
Some examples of semi-supervised learning with clustering
Semi-supervised learning with clustering has been applied to various domains and
tasks in AI, such as natural language processing for text classification, sentiment analysis,
topic modelling, or named entity recognition. For example, one can cluster unlabelled
documents based on their word embeddings and use the cluster labels as pseudo-labels to
train a text classifier. This same approach can be used for computer vision tasks like image
classification, object detection, face recognition, or semantic segmentation. Clustering can
also be used in bioinformatics for gene expression analysis, protein function prediction, or
disease diagnosis. For instance, one can cluster unlabelled gene expression profiles based on
their similarity and use the generated labels as pseudo-labels to train a gene classifier.
DBSCAN
DBSCAN is the abbreviation for Density-Based Spatial Clustering of Applications with Noise. It
is an unsupervised clustering algorithm. DBSCAN clustering can work with clusters of any size
from huge amounts of data and can work with datasets containing a significant amount of noise.
It is basically based on the criteria of a minimum number of points within a region.
What is DBSCAN Algorithm?
DBSCAN algorithm can cluster densely grouped points efficiently into one cluster. It can identify
local density in the data points among large datasets. DBSCAN can very effectively handle
outliers. An advantage of DBSCAN over the K-means algorithm is that the number of centroids
need not be known beforehand in the case of DBSCAN.
The DBSCAN algorithm depends upon two parameters, epsilon and minPoints.
Epsilon is defined as the radius around each data point within which the density is considered.
minPoints is the minimum number of points required within that radius for the data point to
become a core point.
In higher dimensions, this epsilon "circle" simply becomes a hypersphere.
Working of DBSCAN Algorithm
In the DBSCAN algorithm, a circle with radius epsilon is drawn around each data point, and the
data point is classified as a Core Point, Border Point, or Noise Point. A data point is classified
as a core point if it has at least minPoints data points within its epsilon radius. If it has fewer
than minPoints points but lies within the epsilon radius of a core point, it is known as a Border
Point, and if it is neither a core point nor within reach of one, it is considered a Noise Point.
Let us understand the working through an example.
In the above figure, we can see that point A has no points inside its epsilon (ε) radius; hence it is a
Noise Point. Point B has minPoints (= 4) points within its epsilon radius, thus it is a Core
Point. A third point has fewer than minPoints points within its epsilon radius but lies in the
neighbourhood of a core point, hence it is a Border Point.
Steps Involved in DBSCAN Algorithm.
• First, all the points within the epsilon radius of each point are found, and the core points are
identified as those with a number of neighbours greater than or equal to minPoints.
• Next, for each core point, if not assigned to a particular cluster, a new cluster is created for
it.
• All the densely connected points related to the core point are found and assigned to the same
cluster. Two points are called densely connected points if they have a neighbor point that
has both the points within epsilon distance.
• Then all the points in the data are iterated, and the points that do not belong to any cluster
are marked as noise.
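A short sketch of these steps with scikit-learn's DBSCAN on a crescent-shaped toy dataset; eps and min_samples (scikit-learn's name for minPoints) are illustrative values:

```python
# DBSCAN sketch on a non-spherical toy dataset; noise points are labelled -1.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                                       # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", np.sum(labels == -1))
```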
Advantages of the DBSCAN Algorithm
• DBSCAN does not require the number of centroids to be known beforehand as in the case with the K-
Means Algorithm.
• It can find clusters with any shape.
• It can also locate clusters that are not connected to any other group or cluster, and it can work
well with noisy datasets.
• It is robust to outliers.
Disadvantages of the DBSCAN Algorithm
• It does not work well with datasets that have varying densities.
• Cannot be employed with multiprocessing as it cannot be partitioned.
• Cannot find the right cluster if the dataset is sparse.
• It is sensitive to parameters epsilon and minPoints
Applications of DBSCAN
• It is used in satellite imagery.
• Used in X-ray crystallography.
• Anomaly detection in temperature data.
Other Clustering algorithms
Hierarchical clustering is another unsupervised learning algorithm that is used to group together
the unlabeled data points having similar characteristics.
Hierarchical clustering is a technique for grouping data points based on similarities. The process
involves finding the two data points closest to each other and combining the two most similar ones.
After repeating this process until all data points are grouped into clusters, the end result is a
hierarchical tree of related groups known as a dendrogram
There are two major types of approaches:
• Agglomerative clustering: Start with each data point in its own cluster and then
aggregate (merge) the closest clusters step by step.
• Divisive clustering: Start with all the data points combined in a single cluster and divide
it into smaller clusters step by step as the distance between them increases.
Agglomerative Clustering
Agglomerative clustering is a bottom-up approach. It starts by treating each
individual data point as a single cluster; clusters are then merged continuously based on similarity
until one big cluster containing all objects is formed. It is good at identifying small clusters.
The steps for agglomerative clustering are as follows:
1. Compute the proximity matrix using a distance metric.
2. Use a linkage function to group objects into a hierarchical cluster tree based on the
computed distance matrix from the above step.
3. Data points with close proximity are merged together to form a cluster.
4. Repeat steps 2 and 3 until a single cluster remains.
The pictorial representation of the above steps would be:
In the above figure,
• The data points 1,2,...6 are assigned to each individual cluster.
• After calculating the proximity matrix, based on the similarity the points 2,3 and 4,5
are merged together to form clusters.
• Again, the proximity matrix is computed and clusters with points 4,5 and 6 are
merged together.
• And again, the proximity matrix is computed, then the clusters with points 4,5,6 and
2,3 are merged together to form a cluster.
• As a final step, the remaining clusters are merged together to form a single cluster.
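A compact sketch of agglomerative clustering with scikit-learn; the six 2-D points below are made up so that points 2,3 and 4,5 are the closest pairs, loosely mirroring the walk-through above:

```python
# Agglomerative clustering sketch on six toy points; single linkage merges the
# closest pair of clusters at each step.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0],    # point 1
              [4.0, 4.0],    # point 2
              [4.2, 4.1],    # point 3
              [8.0, 8.0],    # point 4
              [8.1, 8.2],    # point 5
              [7.5, 8.5]])   # point 6

agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)
print(agg.labels_)   # cluster membership of points 1..6
```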
Proximity Matrix and Linkage
The proximity matrix is a matrix consisting of the distance between each pair of data
points. The distance is computed by a distance function. Euclidean distance is one of the
most commonly used distance functions.
Such a proximity matrix consists of n points x1, x2, ..., xn, where d(xi, xj) represents the
distance between the points xi and xj.
In order to group the data points in a cluster, a linkage function is used where the values in
the proximity matrix are taken and the data points are grouped based on similarity. The
newly formed clusters are linked to each other until they form a single cluster containing all
the data points.
The most common linkage methods are as follows:
• Complete linkage: The maximum of all pairwise distances between elements in
each pair of clusters is used to measure the distance between two clusters.
• Single linkage: The minimum of all pairwise distances between elements in each
pair of clusters is used to measure the distance between two clusters.
• Average linkage: The average of all pairwise distances between elements in each
pair of clusters is used to measure the distance between two clusters.
• Centroid linkage: The distance between the two clusters' centroids is considered
before merging.
• Ward’s Method: It uses squared error to compute the similarity of the two clusters
for merging.
Dendrogram Charts
The way to find the optimal number of clusters in hierarchical clustering is to use a
dendrogram chart.
The chart is shown here:
From the above chart we can visualize the hierarchical technique. So how do we find the
optimal number of clusters from the chart?
To find it, draw a horizontal line where there is no overlap with the vertical lines of the bars.
The number of bars below the line without any overlap is the optimal number of clusters.
Refer to the figure below for a clear illustration.
From the above figure, we have three bars below the horizontal line, so the optimal number
of clusters is three. Also, recall that the Iris dataset used for this chart has three classes, and we
obtained the same number from the chart.
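The linkage matrix and dendrogram described above can be produced with SciPy; the six toy points and the "ward" method here are illustrative, and any of the linkage methods listed earlier ("single", "complete", "average", "centroid") can be swapped in:

```python
# Build a hierarchical merge history (linkage matrix) and draw its dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [4.0, 4.0], [4.2, 4.1],
              [8.0, 8.0], [8.1, 8.2], [7.5, 8.5]])

Z = linkage(X, method="ward")   # merge history used to draw the dendrogram
dendrogram(Z)
plt.title("Cut the dendrogram with a horizontal line to choose the cluster count")
plt.show()
```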
Divisive Clustering
Divisive clustering works just the opposite of agglomerative clustering. It starts by
considering all the data points into a big single cluster and later on splitting them into
smaller heterogeneous clusters continuously until all data points are in their own cluster.
Thus, they are good at identifying large clusters. It follows a top-down approach and is
more efficient than agglomerative clustering. But, due to its complexity in implementation,
it doesn’t have any predefined implementation in any of the major machine learning
frameworks.
Steps in Divisive Clustering
1. Consider all the data points as a single cluster.
2. Split the cluster using any flat-clustering method, say K-Means.
3. Choose the best cluster among the resulting clusters to split further: choose the one that
has the largest Sum of Squared Error (SSE).
4. Repeat steps 2 and 3 until each data point forms its own cluster (or the desired number of
clusters is reached).
In the above figure,
• The data points 1,2,...6 are initially assigned to one large cluster.
• After calculating the proximity matrix, based on the dissimilarity the points are split
up into separate clusters.
• The proximity matrix is again computed until each point is assigned to an individual
cluster.
The proximity matrix and linkage function follow the same procedure as in agglomerative
clustering. As divisive clustering is not used in many places, there is no predefined
class/function for it in the common Python libraries; a rough sketch is given below.
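Following the steps listed above, a bisecting-style sketch of divisive clustering can be written with NumPy and scikit-learn's KMeans; the random toy data, the SSE-based split choice, and the target of four clusters are illustrative assumptions:

```python
# Rough divisive (bisecting) sketch: repeatedly pick the cluster with the
# largest SSE and split it in two with K-Means until enough clusters exist.
import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    # Sum of squared distances of the points to their own mean
    return float(((points - points.mean(axis=0)) ** 2).sum())

def divisive(X, n_clusters):
    clusters = [X]                                        # start: one big cluster
    while len(clusters) < n_clusters:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters.pop(worst)                      # cluster with the largest SSE
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters

parts = divisive(np.random.rand(200, 2), 4)
print([len(p) for p in parts])   # sizes of the four resulting clusters
```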