Using Clustering to Subdivide Data
Prof. Sonia F. Panesar
Department of Computer Sci. & Eng.
Babaria Institute of Technology
Clustering
Clustering basics
"Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups."
Types of clustering algorithms
• Partitional
– Algorithms that create only a single set of clusters
• Hierarchical
– Algorithms that create nested sets of clusters, each at its own level of a hierarchy
Partitional
• Partitional clustering divides the data into non-hierarchical groups. It is also known as the centroid-based method.
• The dataset is divided into a set of k groups, where k is the pre-defined number of groups.
• Each cluster centroid is placed so that the data points in its cluster are closer to it than to any other cluster's centroid.
Hierarchical
• The dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram.
• Any number of clusters can be selected by cutting the tree at the appropriate level.
• The most common example of this method is the agglomerative hierarchical algorithm.
Partitional & Hierarchical
When to use a clustering algorithm
• You know and understand the dataset you're analyzing.
• You don't have an exact idea of the nature of the subsets (clusters).
• The subsets (clusters) are determined only by the single dataset you're analyzing.
• Your goal is to determine a model that describes the subsets in a single dataset, and only this dataset.
Geometric metrics
• Euclidean metric
– A measure of the distance between points plotted on a Euclidean plane.
• Manhattan metric
– A measure of the distance between points where distance is calculated as the sum of the absolute values of the differences between the two points' Cartesian coordinates.
• Minkowski distance metric
– A generalization of the Euclidean and Manhattan distance metrics. Quite often, these metrics can be used interchangeably.
• Cosine similarity metric
– Measures the similarity of two data points based on their orientation, as determined by taking the cosine of the angle between them.
Jaccard distance metric
• Compares the features of two observations: the more features the observations share, the smaller the distance (see the sketch below).
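A minimal sketch of these metrics using SciPy's distance module; the sample points are made up purely for illustration.

```python
from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]

print(distance.euclidean(a, b))        # straight-line distance
print(distance.cityblock(a, b))        # Manhattan: sum of absolute differences
print(distance.minkowski(a, b, p=3))   # Minkowski; p=1 is Manhattan, p=2 is Euclidean
print(distance.cosine(a, b))           # 1 - cosine similarity (orientation-based)

# Jaccard distance compares feature sets, encoded here as boolean vectors
u = [1, 1, 0, 1]
v = [1, 0, 0, 1]
print(distance.jaccard(u, v))          # fraction of non-shared features
```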
Clustering algorithms
• k-means
• kernel density estimation (KDE)
• hierarchical algorithms
• DBSCAN
K-means
• K-means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters.
• Here K defines the number of pre-defined clusters that need to be created in the process.
• If K=2, there will be two clusters; for K=3, there will be three clusters.
How the algorithm works
1. Initially, you randomly pick k centroids (points that will be the centers of your clusters) in d-dimensional space. Try to make them near the data but different from one another.
2. Then assign each data point to the closest centroid.
3. Move each centroid to the average location of the data points assigned to it.
4. Repeat the preceding two steps until the assignments don't change, or change very little.
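A minimal NumPy sketch of these four steps; the toy data, k value, and iteration cap are illustrative assumptions, and in practice you would typically use a library implementation such as scikit-learn's KMeans.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # toy data points in d=2 space
k = 3

# Step 1: randomly pick k centroids near the data (here: k distinct points)
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Step 2: assign each data point to the closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 3: move each centroid to the average location of its points
    # (assumes no cluster ends up empty, which a robust version would handle)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # Step 4: repeat until the centroids (and hence assignments) stop changing
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```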
k-means in action
Kernel density estimation (KDE)
• A kernel is a weighting function that is useful for quantifying density.
• KDE places a kernel on each data point in the dataset and then sums the kernels to generate a kernel density estimate for the overall region.
• Areas of greater point density sum out with greater kernel density, and areas of lower point density sum out with less kernel density.
• Unlike centroid-based methods, KDE doesn't rely on cluster center placement.
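A short sketch of this idea using scikit-learn's KernelDensity; the Gaussian kernel and bandwidth value are illustrative choices you would normally tune.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Toy 1-d data with two dense regions
X = np.concatenate([rng.normal(-2, 0.5, 100),
                    rng.normal(3, 0.8, 100)])[:, None]

# Place a Gaussian kernel on every data point and sum them
kde = KernelDensity(kernel="gaussian", bandwidth=0.4).fit(X)

# Evaluate the summed kernels over the region of interest
grid = np.linspace(-5, 6, 200)[:, None]
density = np.exp(kde.score_samples(grid))  # score_samples returns log-density

# Peaks in `density` mark the dense areas; no cluster centers are involved
```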
Hierarchical algorithms
• Slower and more computationally expensive than k-means
• Unsupervised clustering algorithms
• Not subject to errors caused by centroid convergence
• An ideal machine learning solution for studying or analyzing biological or environmental data
Types of Hierarchical Clustering
• Two types:
1. Agglomerative hierarchical clustering
2. Divisive hierarchical clustering
How the algorithm works
• The algorithm predicts groupings within a dataset by calculating the distance between each singular observation and its nearest neighbor and generating a link between them.
• The distance between observations is measured in three different ways: Euclidean, Manhattan, or cosine.
• Linkage is formed by three different methods: Ward, complete, and average (see the sketch below).
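A minimal sketch with scikit-learn's AgglomerativeClustering, assuming the Euclidean/Ward combination from the lists above; the toy data is illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Toy data: two well-separated groups
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(2, 0.3, (50, 2))])

# Ward linkage requires the Euclidean metric; for Manhattan or cosine
# distances you would switch linkage to "complete" or "average"
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
```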
How the algorithm works
• The number of clusters is the number of vertical lines in the dendrogram that are intersected by a horizontal line drawn at the chosen threshold.
• In the example dendrogram, the threshold line intersects 2 vertical lines, so we get 2 clusters: one containing samples (1, 2, 4) and the other containing samples (3, 5).
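A sketch of cutting the tree at a threshold with SciPy's hierarchy module; the threshold value of 5.0 is an illustrative assumption you would read off the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2))])

# Build the tree (the dendrogram's underlying linkage matrix)
Z = linkage(X, method="ward")

# Cut the tree at distance 5.0: every merge above the threshold is undone,
# and the cluster count equals the vertical lines the cut intersects
labels = fcluster(Z, t=5.0, criterion="distance")
print(np.unique(labels))  # e.g. [1 2] -> two clusters
```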
Why DBSCAN?
DBSCAN
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
• Unsupervised learning method
• Useful for identifying collective outliers
• Clusters core samples (dense areas of a dataset) while simultaneously demarcating non-core samples (portions of the dataset that are comparatively sparse)
How the algorithm works
DBSCAN
• Generally effective
• Has two weaknesses:
– It is computationally expensive.
– You might have to provide the model with empirical parameter values for the expected cluster size and cluster density (see the sketch below).
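A minimal sketch with scikit-learn's DBSCAN; the eps and min_samples values stand in for the empirical density and cluster-size parameters mentioned above and would need tuning for real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),   # dense core region
               rng.normal(3, 0.2, (50, 2)),   # dense core region
               rng.uniform(-2, 5, (10, 2))])  # comparatively sparse points

# eps is the neighborhood radius (density); min_samples the core-point size
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points labeled -1 are non-core samples flagged as noise (collective outliers)
print((labels == -1).sum(), "points flagged as noise")
```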
Decision Tree
• A decision tree algorithm works by developing a set of yes-or-no rules that you can follow for new data to see exactly how it will be characterized by the model.
• Runs a high risk of error propagation: a wrong answer at a rule near the root cascades through every decision beneath it.
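A short sketch of fitting a tree and printing its yes-or-no rules with scikit-learn; the Iris dataset and depth limit are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints the learned yes-or-no rules that new data follows
print(export_text(tree))
```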
How the algorithm works
Random Forest
• A slower but more powerful alternative to a single decision tree
• The algorithm creates many random trees and combines their predictions when classifying the testing data
• Greatly reduces the risk of error propagation, because no single tree's mistakes dominate the combined result
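A minimal sketch with scikit-learn's RandomForestClassifier; the dataset and the n_estimators value (the number of random trees) are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build many random trees and combine their votes, so a single tree's
# mistake no longer propagates into the final prediction
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```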
