ALY 6040: DATA MINING
CLUSTERING
FATEMEH (PARISSA) AHMADI, PH.D.
INTRODUCTION
In general, clustering is the use of unsupervised techniques to group similar
objects. In machine learning, unsupervised refers to the problem of finding hidden
structure within unlabeled data (you can practice clustering on a labeled dataset by
simply removing the label column). Clustering techniques are unsupervised in the sense
that the data scientist does not determine, in advance, the labels to apply to the clusters.
For example, based on customers' personal income, it is straightforward to divide the
customers into three groups at arbitrarily selected cut points, as follows:
• Earn less than $10,000
• Earn between $10,000 and $99,999
• Earn $100,000 or more
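This kind of rule-based grouping requires no learning at all. A minimal R sketch, with made-up incomes:

  # Hypothetical incomes; the cut points match the three groups above
  income <- c(8500, 42000, 73000, 125000, 9900, 210000)
  groups <- cut(income,
                breaks = c(-Inf, 10000, 100000, Inf),
                labels = c("<$10K", "$10K-$99,999", ">=$100K"),
                right  = FALSE)   # left-closed intervals: $10,000 falls in the middle group
  table(groups)                   # count of customers per group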
DIFFERENT CLUSTERING CATEGORIES
K-MEANS
Given a collection of objects each with n measurable attributes, k-means is an
analytical technique that, for a chosen value of k, identifies k clusters of objects based
on the objects’ proximity to the center of the k groups.
The center is determined as the arithmetic average (mean) of each cluster’s n-
dimensional vector of attributes.
The next figure illustrates three clusters of objects, each object having two attributes.
Each object in the dataset is represented by a small dot, color-coded to the closest
large dot, which marks the mean of its cluster.
[Figure: K-Means for k = 3]
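A minimal sketch of this idea in R, on simulated two-attribute data (all values are made up for illustration):

  # Simulate three loose groups of points, each with two attributes
  set.seed(42)
  x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))
  km <- kmeans(x, centers = 3)                      # base R k-means with k = 3
  plot(x, col = km$cluster, pch = 8)                # small points color-coded by cluster
  points(km$centers, col = 1:3, pch = 19, cex = 2)  # large dots mark each cluster's mean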
USE CASES
OVERVIEW OF THE METHOD
Step 1: Choose the value of k and select the k initial centroids (for example, k
randomly chosen points).
Step 2: Compute the distance from each data point to each centroid and assign each
point to the closest centroid.
Step 3: Recompute each centroid as the mean of the points assigned to it, then repeat
Steps 2 and 3 until the assignments no longer change.
Video walkthrough of the algorithm:
https://www.bing.com/videos/search?ptag=ICO-a37ed4490a84afc3&pc=1MSC&q=kmeans+algorithm+video&ru=%2fsearch%3fptag%3dICO-a37ed4490a84afc3%26form%3dINCOH1%26pc%3d1MSC%26q%3dkmeans%2520algorithm%2520video&view=detail&mmscn=vwrc&mid=FFB271DF90375FC9E55CFFB271DF90375FC9E55C&FORM=WRVORC
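A compact from-scratch sketch of these steps in R, for study only (in practice, use the built-in kmeans()). All names here are my own, and the empty-cluster edge case is not handled:

  simple_kmeans <- function(x, k, iter.max = 100) {
    # Step 1: pick k random data points as the initial centroids
    centroids <- x[sample(nrow(x), k), , drop = FALSE]
    assign <- rep(0L, nrow(x))
    for (it in seq_len(iter.max)) {
      # Step 2: Euclidean distance from every point to every centroid
      d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
      new_assign <- max.col(-d)             # index of the closest centroid per point
      if (all(new_assign == assign)) break  # assignments stable: converged
      assign <- new_assign
      # Step 3: recompute each centroid as the mean of its assigned points
      for (j in seq_len(k)) {
        centroids[j, ] <- colMeans(x[assign == j, , drop = FALSE])
      }
    }
    list(cluster = assign, centers = centroids)
  }

  res <- simple_kmeans(x, 3)   # x from the earlier example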
DETERMINING THE NUMBER OF CLUSTERS
With the preceding algorithm, k clusters can be identified in a given dataset, but what
value of k should be selected?
The value of k can be chosen based on a reasonable guess or some predefined
requirement. However, even then, it would be good to know how much better or
worse having k clusters versus k – 1 or k + 1 clusters would be in explaining the
structure of the data. Next, a heuristic using the Within Sum of Squares (WSS) metric
is examined to determine a reasonably optimal value of k. Using the Euclidean
distance function, WSS is defined as shown below:
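The two referenced equations, reconstructed from the definitions in the surrounding text (p_i is the ith of M data points, with n attributes, and q^(i) is its closest centroid):

  d(p, q) = \sqrt{\sum_{j=1}^{n} (p_j - q_j)^2}

  \mathrm{WSS} = \sum_{i=1}^{M} d\big(p_i, q^{(i)}\big)^2 = \sum_{i=1}^{M} \sum_{j=1}^{n} \big(p_{ij} - q_j^{(i)}\big)^2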
In other words, WSS is the sum of the squares of the distances between each data
point and the closest centroid. The term q(i) indicates the closest centroid that is
associated with the ith point. If the points are relatively close to their respective
centroids, the WSS is relatively small. Thus, if k + 1 clusters do not greatly reduce the
value of WSS from the case with only k clusters, there may be little benefit to adding
another cluster.
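A common way to apply this heuristic in R is to plot WSS against k and look for an "elbow". A minimal sketch, with x as in the earlier example (nstart is discussed later):

  # Total WSS for k = 1..10; look for the k past which WSS barely improves
  wss <- sapply(1:10, function(k) {
    kmeans(x, centers = k, nstart = 25)$tot.withinss
  })
  plot(1:10, wss, type = "b",
       xlab = "Number of clusters k",
       ylab = "Within Sum of Squares (WSS)")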
CLUSTERING-EXAMPLE
Notes on the example's R code: the point size is given relative to the default size of 1
(the cex argument), and the plot character is a star (the pch argument).
apply(dataset, MARGIN, function): MARGIN = 2 applies the function column-wise, so
apply(data, 2, var) computes each column's variance.
An elbow at k = 2 is clear. In cases where the elbow location is ambiguous, other
metrics, such as the Silhouette score, should be used.
[Figure: bar chart of "Best Number of Clusters" by "Number of Measures"]
PAMK (PARTITIONING AROUND MEDOIDS)
ADDITIONAL CONSIDERATIONS
The k-means algorithm is sensitive to the starting positions of the initial centroids. Thus, it is important
to rerun the k-means analysis several times for a particular value of k to ensure the cluster results
provide the overall minimum WSS. As seen earlier, this task is accomplished in R by using the nstart
option in the kmeans() function call.
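For example, a single call that tries 25 random initializations and keeps the best result:

  km <- kmeans(x, centers = 3, nstart = 25)  # kmeans() keeps the run with the lowest total WSS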
Here, the use of the Euclidean distance function is presented to assign the points to the closest
centroids. Other possible function choices include the Cosine similarity and the Manhattan distance
functions. The Cosine similarity function is often chosen to compare two documents based on the
frequency of each word that appears in each of the documents.
For two points, p and q, at (p1, p2, …, pn) and (q1, q2, …, qn), respectively, the Manhattan distance, d1,
between p and q is expressed as shown in the next equation:
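The referenced equation, reconstructed:

  d_1(p, q) = \sum_{i=1}^{n} |p_i - q_i|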
The Manhattan distance function is analogous to the distance traveled by a car in a city, where the
streets are laid out in a rectangular grid (such as city blocks). In Euclidean distance, the measurement is
made in a straight line. Using the previous equation, the distance from (1, 1) to (4, 5) would be
|1 − 4| + |1 − 5| = 7. From an optimization perspective, if there is a need to use the Manhattan
distance for a clustering analysis, the median is a better choice for the centroid than the mean.
K-means clustering is applicable to objects that can be described by attributes that are numerical with
a meaningful distance measure. However, k-means does not handle categorical variables well.
For example, suppose a clustering analysis is to be conducted on new car sales. Among other attributes,
such as the sale price, the color of the car is considered important. Although one could assign numerical
values to the color, such as red = 1, yellow = 2, and green = 3, it is not useful to consider that yellow is
as close to red as yellow is to green from a clustering perspective. In such cases, it may be necessary to
use an alternative clustering methodology.
ADDITIONAL ALGORITHMS
The k-means clustering method is easily applied to numeric data, where the concept of
distance arises naturally. However, it may be necessary or desirable to use an
alternative clustering algorithm.
As discussed at the end of the previous section, k-means does not handle categorical data.
In such cases, k-modes is a commonly used method for clustering categorical data based
on the number of differences in the respective components of the attributes.
For example, if each object has four attributes, the distance from (a, b, e, d) to (d, d, d, d) is 3.
In R, the kmodes() function is implemented in the klaR package.
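A minimal sketch, assuming the klaR package is installed; the toy data is made up to echo the four-attribute example above:

  library(klaR)
  # Toy categorical data: each row is an object with four categorical attributes
  cats <- data.frame(a1 = c("a", "d", "a", "d"),
                     a2 = c("b", "d", "b", "d"),
                     a3 = c("e", "d", "e", "d"),
                     a4 = c("d", "d", "d", "d"))
  cl <- kmodes(cats, modes = 2)   # k-modes with k = 2
  cl$cluster                      # cluster membership for each row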
Because k-means and k-modes divide the entire dataset into distinct groups, both approaches
are considered partitioning methods. A third partitioning method is known as Partitioning
around Medoids (PAM).
In general, a medoid is a representative object in a set of objects. In clustering, the medoids
are the objects in each cluster that minimize the sum of the distances from the medoid to the
other objects in the cluster.
The advantage of using PAM is that the “center” of each cluster is an actual object in
the dataset. PAM is implemented in R by the pam() function included in the cluster R
package. The fpc R package includes a function pamk(), which uses the pam() function to find
the optimal value for k.
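A minimal sketch, assuming the cluster and fpc packages are installed (x as in the earlier examples):

  library(cluster)
  library(fpc)
  p <- pam(x, k = 3)     # PAM with k = 3; p$medoids are actual rows of the data
  pk <- pamk(x)          # searches over k and returns the best PAM result
  pk$nc                  # estimated optimal number of clusters
  plot(pk$pamobject)     # diagnostic plots for the chosen clustering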
Other clustering methods include hierarchical agglomerative clustering and density-based
clustering. In hierarchical agglomerative clustering, each object is initially placed in
its own cluster. The most similar clusters are then merged, and this process is
repeated until a single cluster containing all the objects remains. The R stats package includes
the hclust() function for performing hierarchical agglomerative clustering.
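A minimal sketch (x as before):

  hc <- hclust(dist(x))   # agglomerative clustering on Euclidean distances
  plot(hc)                # dendrogram of the successive merges
  cutree(hc, k = 3)       # cut the tree to recover 3 cluster labels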
In density-based clustering methods, the clusters are identified by the concentration of
points. The fpc R package includes a function, dbscan(), to perform density-based clustering
analysis. Density-based clustering can be useful to identify irregularly shaped clusters.
DBSCAN
▪ Core points: points with at least MinPts neighbors within radius eps.
▪ Border/boundary points: points within eps of a core point but with fewer than MinPts neighbors.
▪ Noise/outliers: points that are neither core nor border points.
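A minimal sketch using the fpc package mentioned earlier; the eps and MinPts values are illustrative only, and x is the data from the earlier k-means sketch:

  library(fpc)
  db <- dbscan(x, eps = 0.8, MinPts = 5)  # density-based clustering
  db$cluster                              # cluster labels; 0 marks noise/outlier points
  plot(x, col = db$cluster + 1)           # noise (cluster 0) drawn in black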
https://www.bing.com/videos/search?q=dbscan+algorithm+video&qs=n&sp=-1&lq=0&pq=dbscan+algorithm+video&sc=6-22&sk=&cvid=F20866D7EA994B10A858FF746FB4DD29&ghsh=0&ghacc=0&ghpl=&ru=%2fsearch%3fq%3ddbscan%2balgorithm%2bvideo%26qs%3dn%26form%3dQBRE%26sp%3d-1%26lq%3d0%26pq%3ddbscan%2balgorithm%2bvideo%26sc%3d6-22%26sk%3d%26cvid%3dF20866D7EA994B10A858FF746FB4DD29%26ghsh%3d0%26ghacc%3d0%26ghpl%3d&view=detail&mmscn=vwrc&mid=4F8A4D63F261C954ABF64F8A4D63F261C954ABF6&FORM=WRVORC
https://www.bing.com/videos/riverview/relatedvideo?&q=dbscan&&mid=35884E6B4BFFDAB2F00035884E6B4BFFDAB2F000&&FORM=VRDGAR
https://www.bing.com/images/search?view=detailV2&ccid=z9asJXY5&id=D21C5B3AF4E34CE3433710B7AE71DBC8394C7621&thid=OIP.z9asJXY5UkRJZ3NdZ5TpqAHaFg&mediaurl=https%3A%2F%2Fwww.researchgate.net%2Fprofile%2FYijin_Liu%2Fpublication%2F308750501%2Ffigure%2Fdownload%2Ffig4%2FAS%3A412083041652736%401475259661770%2FSchematic-drawings-of-the-DBSCAN-clustering-algorithm-Panel-a-shows-the-clustering.png&cdnurl=https%3A%2F%2Fth.bing.com%2Fth%2Fid%2FR.cfd6ac25763952444967735d6794e9a8%3Frik%3DIXZMOcjbca63EA%26pid%3DImgRaw%26r%3D0&exph=510&expw=685&q=DBSCAN+Clustering&simid=607997061279082409&form=IRPRST&ck=BFA17CD056C1FA9D9F0093FC4AF169F1&selectedindex=3&ajaxhist=0&ajaxserp=0&vt=0
DBSCAN
kNNdistplot():
▪ Helps find a good value for eps based on the kNN distance.
▪ The kNN distance of a point is the distance from that point to its kth nearest
neighbor.
▪ We seek a value for eps at the knee of the sorted kNN-distance line.
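A minimal sketch, assuming the dbscan package (which provides kNNdistplot()) is installed; the eps guess is illustrative:

  library(dbscan)
  kNNdistplot(x, k = 4)     # sorted distance of each point to its 4th nearest neighbor
  abline(h = 0.8, lty = 2)  # a horizontal line at the knee suggests eps near 0.8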
HULLPLOT FOR DBSCAN
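hullplot() from the dbscan package draws a convex hull around each discovered cluster. A minimal sketch, with eps carried over from the kNN-distance plot:

  library(dbscan)
  db2 <- dbscan(x, eps = 0.8, minPts = 5)  # note: this package spells the parameter minPts
  hullplot(x, db2)                         # scatterplot with a convex hull per cluster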
DBSCAN
plot(iris3, col = db1$cluster)  # color each point by its DBSCAN cluster; iris3 and db1 are defined on earlier slides
https://www.bing.com/images/search?view=detailV2&ccid=3LdJGuOj&id=8CC2488C0F8B17E22DD76E9D59F2ACDFDA158EEE&thid=OIP.3LdJGuOjolHFG7m8W2NmXQHaCw&mediaurl=https%3a%2f%2fmiro.medium.com%2fmax%2f1200%2f1*KqWII7sFp1JL0EXwJGpqFw.png&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.dcb7491ae3a3a251c51bb9bc5b63665d%3frik%3d7o4V2t%252bs8lmdbg%26pid%3dImgRaw%26r%3d0&exph=448&expw=1200&q=dbscan&simid=607992534376195191&FORM=IRPRST&ck=D05FFCD01B189C12EA922B280124E7E8&selectedIndex=3&ajaxhist=0&ajaxserp=0
KMEANS IN PYTHON
DBSCAN IN PYTHON
The default parameter values in scikit-learn's DBSCAN are eps=0.5 and
min_samples=5.
▪ algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
▪ The algorithm used by the NearestNeighbors module to compute pointwise
distances and find nearest neighbors.
REFERENCE
EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data. Wiley, 2015. Chapter 4: Advanced Analytical Theory
and Methods: Clustering.