Using Clustering to Subdivide Data
Prof. Sonia F. Panesar
Department of Computer Sci. & Eng.
Babaria Institute of Technology
Clustering
Clustering basics
"Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups."
Types of clustering algorithms
• Partitional
– Algorithms that create only a single set of clusters
• Hierarchical
– Algorithms that create nested sets of clusters, each at its own level of a hierarchy
Partitional
• Partitional clustering divides the data into non-hierarchical groups. It is also known as the centroid-based method.
• The dataset is divided into a set of k groups, where k is the pre-defined number of groups.
• Each cluster centroid is placed so that the data points in its cluster are closer to it than to any other cluster's centroid.
Hierarchical
• The dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram.
• Any number of clusters can be selected by cutting the tree at the appropriate level.
• The most common example of this method is the agglomerative hierarchical algorithm.
Partitional & Hierarchical
When to use a clustering algorithm
• You know and understand the dataset you're analyzing.
• You don't have an exact idea of the nature of the subsets (clusters).
• The subsets (clusters) are determined only by the single dataset you're analyzing.
• Your goal is to determine a model that describes the subsets in a single dataset, and only this dataset.
Geometric metrics
• Euclidean metric
– A measure of the distance between points plotted on a Euclidean plane.
• Manhattan metric
– A measure of the distance between points where distance is calculated as the sum of the absolute values of the differences between the two points' Cartesian coordinates.
• Minkowski distance metric
– A generalization of the Euclidean and Manhattan distance metrics. Quite often, these metrics can be used interchangeably.
• Cosine similarity metric
– Measures the similarity of two data points based on their orientation, as determined by taking the cosine of the angle between them.
Jaccard distance metric
• Compares the features of two observations: the more features the observations share, the smaller the distance (see the sketch below).
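A minimal sketch of these metrics using SciPy's distance module; the sample points are made up purely for illustration.

```python
from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]

print(distance.euclidean(a, b))        # straight-line distance
print(distance.cityblock(a, b))        # Manhattan: sum of absolute differences
print(distance.minkowski(a, b, p=3))   # Minkowski; p=1 is Manhattan, p=2 is Euclidean
print(distance.cosine(a, b))           # 1 - cosine similarity (orientation-based)

# Jaccard distance compares feature sets, encoded here as boolean vectors
u = [1, 1, 0, 1]
v = [1, 0, 0, 1]
print(distance.jaccard(u, v))          # fraction of non-shared features
```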
Clustering algorithms
• k-means
• kernel density estimation (KDE)
• hierarchical algorithms
• DBSCAN
K-means
• K-means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters.
• Here K defines the number of pre-defined clusters that need to be created in the process.
• If K=2, there will be two clusters; for K=3, there will be three clusters.
How the algorithm works
1. Initially, you randomly pick k centroids (points that will be the centers of your clusters) in d-dimensional space. Try to make them near the data but different from one another.
2. Then assign each data point to the closest centroid.
3. Move each centroid to the average location of the data points assigned to it.
4. Repeat the preceding two steps until the assignments don't change, or change very little.
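A minimal NumPy sketch of these four steps; the toy data, k value, and iteration cap are illustrative assumptions, and in practice you would typically use a library implementation such as scikit-learn's KMeans.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # toy data points in d=2 space
k = 3

# Step 1: randomly pick k centroids near the data (here: k distinct points)
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Step 2: assign each data point to the closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 3: move each centroid to the average location of its points
    # (assumes no cluster ends up empty, which a robust version would handle)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # Step 4: repeat until the centroids (and hence assignments) stop changing
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```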
k-means in action
Kernel density estimation (KDE)
• A kernel is a weighting function that is useful for quantifying density.
• KDE places a kernel on each data point in the dataset and then sums the kernels to generate a kernel density estimate for the overall region.
• Areas of greater point density sum out with greater kernel density, and areas of lower point density sum out with less kernel density.
• Unlike centroid-based methods, KDE doesn't rely on cluster center placement.
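A short sketch of this idea using scikit-learn's KernelDensity; the Gaussian kernel and bandwidth value are illustrative choices you would normally tune.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Toy 1-d data with two dense regions
X = np.concatenate([rng.normal(-2, 0.5, 100),
                    rng.normal(3, 0.8, 100)])[:, None]

# Place a Gaussian kernel on every data point and sum them
kde = KernelDensity(kernel="gaussian", bandwidth=0.4).fit(X)

# Evaluate the summed kernels over the region of interest
grid = np.linspace(-5, 6, 200)[:, None]
density = np.exp(kde.score_samples(grid))  # score_samples returns log-density

# Peaks in `density` mark the dense areas; no cluster centers are involved
```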
Hierarchical algorithms
• Slower and more computationally expensive than k-means
• Unsupervised clustering algorithms
• Not subject to errors caused by centroid convergence
• An ideal machine learning solution for studying or analyzing biological or environmental data
Types of Hierarchical Clustering
• Two types:
1. Agglomerative hierarchical clustering
2. Divisive hierarchical clustering
How the algorithm works
• The algorithm predicts groupings within a dataset by calculating the distance between each singular observation and its nearest neighbor and generating a link between them.
• The distance between observations is measured in three different ways: Euclidean, Manhattan, or cosine.
• Linkage is formed by three different methods: Ward, complete, and average (see the sketch below).
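A minimal sketch with scikit-learn's AgglomerativeClustering, assuming the Euclidean/Ward combination from the lists above; the toy data is illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Toy data: two well-separated groups
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(2, 0.3, (50, 2))])

# Ward linkage requires the Euclidean metric; for Manhattan or cosine
# distances you would switch linkage to "complete" or "average"
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
```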
How the algorithm works
• The number of clusters is the number of vertical lines in the dendrogram that are intersected by a horizontal line drawn at the chosen threshold.
• In the example dendrogram, the threshold line intersects 2 vertical lines, so we get 2 clusters: one containing samples (1, 2, 4) and the other containing samples (3, 5).
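A sketch of cutting the tree at a threshold with SciPy's hierarchy module; the threshold value of 5.0 is an illustrative assumption you would read off the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2))])

# Build the tree (the dendrogram's underlying linkage matrix)
Z = linkage(X, method="ward")

# Cut the tree at distance 5.0: every merge above the threshold is undone,
# and the cluster count equals the vertical lines the cut intersects
labels = fcluster(Z, t=5.0, criterion="distance")
print(np.unique(labels))  # e.g. [1 2] -> two clusters
```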
Why DBSCAN?
DBSCAN
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
• Unsupervised learning method
• Useful for identifying collective outliers
• Clusters core samples (dense areas of a dataset) while simultaneously demarcating non-core samples (portions of the dataset that are comparatively sparse)
How the algorithm works
DBSCAN
• Generally effective
• Has two weaknesses:
– It is computationally expensive.
– You might have to provide the model with empirical parameter values for the expected cluster size and cluster density (see the sketch below).
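A minimal sketch with scikit-learn's DBSCAN; the eps and min_samples values stand in for the empirical density and cluster-size parameters mentioned above and would need tuning for real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),   # dense core region
               rng.normal(3, 0.2, (50, 2)),   # dense core region
               rng.uniform(-2, 5, (10, 2))])  # comparatively sparse points

# eps is the neighborhood radius (density); min_samples the core-point size
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points labeled -1 are non-core samples flagged as noise (collective outliers)
print((labels == -1).sum(), "points flagged as noise")
```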
Decision Tree
• A decision tree algorithm works by developing a set of yes-or-no rules that you can follow for new data to see exactly how it will be characterized by the model.
• Runs a high risk of error propagation: a wrong answer at a rule near the root cascades through every decision beneath it.
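A short sketch of fitting a tree and printing its yes-or-no rules with scikit-learn; the Iris dataset and depth limit are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints the learned yes-or-no rules that new data follows
print(export_text(tree))
```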
How the algorithm works
Random Forest
• A slower but more powerful alternative to a single decision tree
• The algorithm creates many random trees and combines their predictions when classifying the testing data
• Greatly reduces the risk of error propagation, because no single tree's mistakes dominate the combined result
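A minimal sketch with scikit-learn's RandomForestClassifier; the dataset and the n_estimators value (the number of random trees) are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build many random trees and combine their votes, so a single tree's
# mistake no longer propagates into the final prediction
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```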
