ALY 6040: DATA MINING
CLUSTERING
FATEMEH (PARISSA) AHMADI, PH.D.
INTRODUCTION
In general, clustering is the use of unsupervised techniques to group similar
objects. In machine learning, unsupervised refers to the problem of finding hidden
structure within unlabeled data (you can practice clustering on a labeled dataset by
simply removing the label column). Clustering techniques are unsupervised in the sense
that the data scientist does not determine, in advance, the labels to apply to the clusters.
For example, based on customers' personal income, it is straightforward to divide the
customers into three groups at arbitrarily selected cut points, as follows:
• Earn less than $10,000
• Earn between $10,000 and $99,999
• Earn $100,000 or more
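This kind of rule-based grouping requires no learning at all. A minimal R sketch, with made-up incomes:

  # Hypothetical incomes; the cut points match the three groups above
  income <- c(8500, 42000, 73000, 125000, 9900, 210000)
  groups <- cut(income,
                breaks = c(-Inf, 10000, 100000, Inf),
                labels = c("<$10K", "$10K-$99,999", ">=$100K"),
                right  = FALSE)   # left-closed intervals: $10,000 falls in the middle group
  table(groups)                   # count of customers per group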
DIFFERENT CLUSTERING CATEGORIES
K-MEANS
Given a collection of objects each with n measurable attributes, k-means is an
analytical technique that, for a chosen value of k, identifies k clusters of objects based
on the objects’ proximity to the center of the k groups.
The center is determined as the arithmetic average (mean) of each cluster’s n-
dimensional vector of attributes.
The next figure illustrates three clusters of objects, each object having two attributes.
Each object in the dataset is represented by a small dot, color-coded to the closest
large dot, which marks the mean of its cluster.
[Figure: K-Means for k = 3]
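A minimal sketch of this idea in R, on simulated two-attribute data (all values are made up for illustration):

  # Simulate three loose groups of points, each with two attributes
  set.seed(42)
  x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))
  km <- kmeans(x, centers = 3)                      # base R k-means with k = 3
  plot(x, col = km$cluster, pch = 8)                # small points color-coded by cluster
  points(km$centers, col = 1:3, pch = 19, cex = 2)  # large dots mark each cluster's mean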
USE CASES
OVERVIEW OF THE METHOD
Step 1: Choose the value of k and select the k initial centroids (for example, k
randomly chosen points).
Step 2: Compute the distance from each data point to each centroid and assign each
point to the closest centroid.
Step 3: Recompute each centroid as the mean of the points assigned to it, then repeat
Steps 2 and 3 until the assignments no longer change.
Video walkthrough of the algorithm:
https://www.bing.com/videos/search?ptag=ICO-a37ed4490a84afc3&pc=1MSC&q=kmeans+algorithm+video&ru=%2fsearch%3fptag%3dICO-a37ed4490a84afc3%26form%3dINCOH1%26pc%3d1MSC%26q%3dkmeans%2520algorithm%2520video&view=detail&mmscn=vwrc&mid=FFB271DF90375FC9E55CFFB271DF90375FC9E55C&FORM=WRVORC
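A compact from-scratch sketch of these steps in R, for study only (in practice, use the built-in kmeans()). All names here are my own, and the empty-cluster edge case is not handled:

  simple_kmeans <- function(x, k, iter.max = 100) {
    # Step 1: pick k random data points as the initial centroids
    centroids <- x[sample(nrow(x), k), , drop = FALSE]
    assign <- rep(0L, nrow(x))
    for (it in seq_len(iter.max)) {
      # Step 2: Euclidean distance from every point to every centroid
      d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
      new_assign <- max.col(-d)             # index of the closest centroid per point
      if (all(new_assign == assign)) break  # assignments stable: converged
      assign <- new_assign
      # Step 3: recompute each centroid as the mean of its assigned points
      for (j in seq_len(k)) {
        centroids[j, ] <- colMeans(x[assign == j, , drop = FALSE])
      }
    }
    list(cluster = assign, centers = centroids)
  }

  res <- simple_kmeans(x, 3)   # x from the earlier example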
DETERMINING THE NUMBER OF CLUSTERS
With the preceding algorithm, k clusters can be identified in a given dataset, but what
value of k should be selected?
The value of k can be chosen based on a reasonable guess or some predefined
requirement. However, even then, it would be good to know how much better or
worse having k clusters versus k – 1 or k + 1 clusters would be in explaining the
structure of the data. Next, a heuristic using the Within Sum of Squares (WSS) metric
is examined to determine a reasonably optimal value of k. Using the Euclidean
distance function, WSS is defined as shown below:
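The two referenced equations, reconstructed from the definitions in the surrounding text (p_i is the ith of M data points, with n attributes, and q^(i) is its closest centroid):

  d(p, q) = \sqrt{\sum_{j=1}^{n} (p_j - q_j)^2}

  \mathrm{WSS} = \sum_{i=1}^{M} d\big(p_i, q^{(i)}\big)^2 = \sum_{i=1}^{M} \sum_{j=1}^{n} \big(p_{ij} - q_j^{(i)}\big)^2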
In other words, WSS is the sum of the squares of the distances between each data
point and the closest centroid. The term q(i) indicates the closest centroid that is
associated with the ith point. If the points are relatively close to their respective
centroids, the WSS is relatively small. Thus, if k + 1 clusters do not greatly reduce the
value of WSS from the case with only k clusters, there may be little benefit to adding
another cluster.
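A common way to apply this heuristic in R is to plot WSS against k and look for an "elbow". A minimal sketch, with x as in the earlier example (nstart is discussed later):

  # Total WSS for k = 1..10; look for the k past which WSS barely improves
  wss <- sapply(1:10, function(k) {
    kmeans(x, centers = k, nstart = 25)$tot.withinss
  })
  plot(1:10, wss, type = "b",
       xlab = "Number of clusters k",
       ylab = "Within Sum of Squares (WSS)")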
CLUSTERING-EXAMPLE
Notes on the example's R code: the point size is given relative to the default size of 1
(the cex argument), and the plot character is a star (the pch argument).
apply(dataset, MARGIN, function): MARGIN = 2 applies the function column-wise, so
apply(data, 2, var) computes each column's variance.
An elbow at k = 2 is clear. In cases where the elbow location is ambiguous, other
metrics, such as the Silhouette score, should be used.
[Figure: bar chart of "Best Number of Clusters" by "Number of Measures"]
PAMK (PARTITIONING AROUND MEDOIDS)
ADDITIONAL CONSIDERATIONS
The k-means algorithm is sensitive to the starting positions of the initial centroids. Thus, it is important
to rerun the k-means analysis several times for a particular value of k to ensure the cluster results
provide the overall minimum WSS. As seen earlier, this task is accomplished in R by using the nstart
option in the kmeans() function call.
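For example, a single call that tries 25 random initializations and keeps the best result:

  km <- kmeans(x, centers = 3, nstart = 25)  # kmeans() keeps the run with the lowest total WSS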
Here, the use of the Euclidean distance function is presented to assign the points to the closest
centroids. Other possible function choices include the Cosine similarity and the Manhattan distance
functions. The Cosine similarity function is often chosen to compare two documents based on the
frequency of each word that appears in each of the documents.
For two points, p and q, at (p1, p2, …, pn) and (q1, q2, …, qn), respectively, the Manhattan distance, d1,
between p and q is expressed as shown in the next equation:
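The referenced equation, reconstructed:

  d_1(p, q) = \sum_{i=1}^{n} |p_i - q_i|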
The Manhattan distance function is analogous to the distance traveled by a car in a city, where the
streets are laid out in a rectangular grid (such as city blocks). In Euclidean distance, the measurement is
made in a straight line. Using the previous equation, the distance from (1, 1) to (4, 5) would be
|1 − 4| + |1 − 5| = 7. From an optimization perspective, if there is a need to use the Manhattan
distance for a clustering analysis, the median is a better choice for the centroid than the mean.
K-means clustering is applicable to objects that can be described by attributes that are numerical with
a meaningful distance measure. However, k-means does not handle categorical variables well.
For example, suppose a clustering analysis is to be conducted on new car sales. Among other attributes,
such as the sale price, the color of the car is considered important. Although one could assign numerical
values to the color, such as red = 1, yellow = 2, and green = 3, it is not useful to consider that yellow is
as close to red as yellow is to green from a clustering perspective. In such cases, it may be necessary to
use an alternative clustering methodology.
ADDITIONAL ALGORITHMS
The k-means clustering method is easily applied to numeric data, where the concept of
distance arises naturally. However, it may be necessary or desirable to use an
alternative clustering algorithm.
As discussed at the end of the previous section, k-means does not handle categorical data.
In such cases, k-modes is a commonly used method for clustering categorical data based
on the number of differences in the respective components of the attributes.
For example, if each object has four attributes, the distance from (a, b, e, d) to (d, d, d, d) is 3.
In R, the kmodes() function is implemented in the klaR package.
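A minimal sketch, assuming the klaR package is installed; the toy data is made up to echo the four-attribute example above:

  library(klaR)
  # Toy categorical data: each row is an object with four categorical attributes
  cats <- data.frame(a1 = c("a", "d", "a", "d"),
                     a2 = c("b", "d", "b", "d"),
                     a3 = c("e", "d", "e", "d"),
                     a4 = c("d", "d", "d", "d"))
  cl <- kmodes(cats, modes = 2)   # k-modes with k = 2
  cl$cluster                      # cluster membership for each row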
Because k-means and k-modes divide the entire dataset into distinct groups, both approaches
are considered partitioning methods. A third partitioning method is known as Partitioning
around Medoids (PAM).
In general, a medoid is a representative object in a set of objects. In clustering, the medoids
are the objects in each cluster that minimize the sum of the distances from the medoid to the
other objects in the cluster.
The advantage of using PAM is that the “center” of each cluster is an actual object in
the dataset. PAM is implemented in R by the pam() function included in the cluster R
package. The fpc R package includes a function pamk(), which uses the pam() function to find
the optimal value for k.
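A minimal sketch, assuming the cluster and fpc packages are installed (x as in the earlier examples):

  library(cluster)
  library(fpc)
  p <- pam(x, k = 3)     # PAM with k = 3; p$medoids are actual rows of the data
  pk <- pamk(x)          # searches over k and returns the best PAM result
  pk$nc                  # estimated optimal number of clusters
  plot(pk$pamobject)     # diagnostic plots for the chosen clustering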
Other clustering methods include hierarchical agglomerative clustering and density-based
clustering. In hierarchical agglomerative clustering, each object is initially placed in
its own cluster. The most similar clusters are then merged, and this process is
repeated until a single cluster containing all the objects remains. The R stats package includes
the hclust() function for performing hierarchical agglomerative clustering.
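A minimal sketch (x as before):

  hc <- hclust(dist(x))   # agglomerative clustering on Euclidean distances
  plot(hc)                # dendrogram of the successive merges
  cutree(hc, k = 3)       # cut the tree to recover 3 cluster labels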
In density-based clustering methods, the clusters are identified by the concentration of
points. The fpc R package includes a function, dbscan(), to perform density-based clustering
analysis. Density-based clustering can be useful to identify irregularly shaped clusters.
DBSCAN
▪ Core points: points with at least MinPts neighbors within radius eps.
▪ Border/boundary points: points within eps of a core point but with fewer than MinPts neighbors.
▪ Noise/outliers: points that are neither core nor border points.
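A minimal sketch using the fpc package mentioned earlier; the eps and MinPts values are illustrative only, and x is the data from the earlier k-means sketch:

  library(fpc)
  db <- dbscan(x, eps = 0.8, MinPts = 5)  # density-based clustering
  db$cluster                              # cluster labels; 0 marks noise/outlier points
  plot(x, col = db$cluster + 1)           # noise (cluster 0) drawn in black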
https://www.bing.com/videos/search?q=dbscan+algorithm+video&qs=n&sp=-1&lq=0&pq=dbscan+algorithm+video&sc=6-22&sk=&cvid=F20866D7EA994B10A858FF746FB4DD29&ghsh=0&ghacc=0&ghpl=&ru=%2fsearch%3fq%3ddbscan%2balgorithm%2bvideo%26qs%3dn%26form%3dQBRE%26sp%3d-1%26lq%3d0%26pq%3ddbscan%2balgorithm%2bvideo%26sc%3d6-22%26sk%3d%26cvid%3dF20866D7EA994B10A858FF746FB4DD29%26ghsh%3d0%26ghacc%3d0%26ghpl%3d&view=detail&mmscn=vwrc&mid=4F8A4D63F261C954ABF64F8A4D63F261C954ABF6&FORM=WRVORC
https://www.bing.com/videos/riverview/relatedvideo?&q=dbscan&&mid=35884E6B4BFFDAB2F00035884E6B4BFFDAB2F000&&FORM=VRDGAR
https://www.bing.com/images/search?view=detailV2&ccid=z9asJXY5&id=D21C5B3AF4E34CE3433710B7AE71DBC8394C7621&thid=OIP.z9asJXY5UkRJZ3NdZ5TpqAHaFg&mediaurl=https%3A%2F%2Fwww.researchgate.net%2Fprofile%2FYijin_Liu%2Fpublication%2F308750501%2Ffigure%2Fdownload%2Ffig4%2FAS%3A412083041652736%401475259661770%2FSchematic-drawings-of-the-DBSCAN-clustering-algorithm-Panel-a-shows-the-clustering.png&cdnurl=https%3A%2F%2Fth.bing.com%2Fth%2Fid%2FR.cfd6ac25763952444967735d6794e9a8%3Frik%3DIXZMOcjbca63EA%26pid%3DImgRaw%26r%3D0&exph=510&expw=685&q=DBSCAN+Clustering&simid=607997061279082409&form=IRPRST&ck=BFA17CD056C1FA9D9F0093FC4AF169F1&selectedindex=3&ajaxhist=0&ajaxserp=0&vt=0
DBSCAN
kNNdistplot():
▪ Helps find a good value for eps based on the kNN distance.
▪ The kNN distance of a point is the distance from that point to its kth nearest
neighbor.
▪ We seek a value for eps at the knee of the sorted kNN-distance line.
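A minimal sketch, assuming the dbscan package (which provides kNNdistplot()) is installed; the eps guess is illustrative:

  library(dbscan)
  kNNdistplot(x, k = 4)     # sorted distance of each point to its 4th nearest neighbor
  abline(h = 0.8, lty = 2)  # a horizontal line at the knee suggests eps near 0.8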
HULLPLOT FOR DBSCAN
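hullplot() from the dbscan package draws a convex hull around each discovered cluster. A minimal sketch, with eps carried over from the kNN-distance plot:

  library(dbscan)
  db2 <- dbscan(x, eps = 0.8, minPts = 5)  # note: this package spells the parameter minPts
  hullplot(x, db2)                         # scatterplot with a convex hull per cluster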
DBSCAN
plot(iris3, col = db1$cluster)  # color each point by its DBSCAN cluster; iris3 and db1 are defined on earlier slides
https://www.bing.com/images/search?view=detailV2&ccid=3LdJGuOj&id=8CC2488C0F8B17E22DD76E9D59F2ACDFDA158EEE&thid=OIP.3LdJGuOjolHFG7m8W2NmXQHaCw&mediaurl=https%3a%2f%2fmiro.medium.com%2fmax%2f1200%2f1*KqWII7sFp1JL0EXwJGpqFw.png&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.dcb7491ae3a3a251c51bb9bc5b63665d%3frik%3d7o4V2t%252bs8lmdbg%26pid%3dImgRaw%26r%3d0&exph=448&expw=1200&q=dbscan&simid=607992534376195191&FORM=IRPRST&ck=D05FFCD01B189C12EA922B280124E7E8&selectedIndex=3&ajaxhist=0&ajaxserp=0
KMEANS IN PYTHON
DBSCAN IN PYTHON
The default parameter values in scikit-learn's DBSCAN are eps=0.5 and
min_samples=5.
▪ algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
▪ The algorithm used by the NearestNeighbors module to compute pointwise
distances and find nearest neighbors.
REFERENCE
EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data. Wiley, 2015. Chapter 4: Advanced Analytical Theory
and Methods: Clustering.