DataCamp Cluster Analysis in R
What is Cluster Analysis?
Dmitriy (Dima) Gorenshteyn
Sr. Data Scientist,
Memorial Sloan Kettering Cancer Center
What is Clustering?
A form of exploratory data analysis (EDA) where
observations are divided into meaningful groups
that share common characteristics (features).
The Flow of Cluster Analysis
Structure of This Course
Let's Learn!
Distance Between Two Observations
Distance vs Similarity
DISTANCE = 1 − SIMILARITY
Distance Between Two Players
dist() Function
print(two_players)
     X  Y
BLUE 0  0
RED  9 12

dist(two_players, method = 'euclidean')
     BLUE
RED    15
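The value 15 can be checked by hand with the Euclidean formula, sqrt((x2 - x1)^2 + (y2 - y1)^2). The sketch below assumes `two_players` is a data frame with the row names shown above; the slides do not show how the object was built, so the constructor call is an assumption.

```r
# Assumed construction of the two_players object printed on the slide
two_players <- data.frame(X = c(0, 9), Y = c(0, 12),
                          row.names = c("BLUE", "RED"))

# Pull each player's coordinates out as a plain numeric vector
blue <- unlist(two_players["BLUE", ])
red  <- unlist(two_players["RED", ])

# Euclidean distance by hand: square root of the summed squared differences
manual_distance <- sqrt(sum((red - blue)^2))
manual_distance  # 15

# The same distance from dist(), converted to a plain number
dist_value <- as.numeric(dist(two_players, method = "euclidean"))
dist_value       # 15
```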
More than 2 Observations
print(three_players)
       X  Y
BLUE   0  0
RED    9 12
GREEN -2 19

dist(three_players)
          BLUE      RED
RED   15.00000
GREEN 19.10497 13.03840
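With more than two observations, `dist()` returns only the lower triangle of the pairwise distances. One way to look up a specific pair is to convert the result with `as.matrix()`. This is a minimal sketch; the way `three_players` is constructed here is an assumption, since the slides only print it.

```r
# Assumed construction of the three_players object printed on the slide
three_players <- data.frame(X = c(0, 9, -2), Y = c(0, 12, 19),
                            row.names = c("BLUE", "RED", "GREEN"))

# dist() defaults to method = "euclidean"
player_dist <- dist(three_players)

# A full symmetric matrix makes individual pairs easy to index by name
dist_matrix <- as.matrix(player_dist)
dist_matrix["GREEN", "RED"]   # 13.0384, i.e. sqrt(170)
dist_matrix["GREEN", "BLUE"]  # 19.10497, i.e. sqrt(365)
```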
Let's practice!
The Scales of Your Features
Distance Between Individuals
Observation  Height (feet)  Weight (lbs)
1            6.0            200
2            6.0            202
3            8.0            200
...          ...            ...
...          ...            ...
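To see why the units matter, the unscaled Euclidean distances for the three listed observations can be computed directly. This sketch uses only the three rows shown; the elided rows are not reconstructed, and the object name `height_weight` is borrowed from the later slides.

```r
# Only the three observations listed in the table (elided rows are left out)
height_weight <- data.frame(Height = c(6.0, 6.0, 8.0),
                            Weight = c(200, 202, 200))

raw_dist <- as.matrix(dist(height_weight))

# A 2 lb weight difference and a 2 ft height difference look identical
raw_dist[1, 2]  # 2 (observations 1 and 2 differ by 2 lbs)
raw_dist[1, 3]  # 2 (observations 1 and 3 differ by 2 feet)
```

Both pairs come out equally far apart, even though a 2-foot difference in height is far more unusual than a 2-pound difference in weight.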
Scaling our Features
$$\text{height}_{\text{scaled}} = \frac{\text{height} - \operatorname{mean}(\text{height})}{\operatorname{sd}(\text{height})}$$
scale() function
print(height_weight)
    Height Weight
1        6    200
2        6    202
3        8    200
...    ...    ...

scale(height_weight)
    Height Weight
1     0.60   0.67
2     0.60   0.73
3     11.3   0.67
...    ...    ...
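The scaling formula can be applied column by column and compared against `scale()`. This is a sketch that assumes only the three printed rows, so the scaled values differ from those on the slide, which were computed with the elided rows included.

```r
# Only the three printed rows; the slide's full data set is not available
height_weight <- data.frame(Height = c(6, 6, 8),
                            Weight = c(200, 202, 200))

# Apply (x - mean(x)) / sd(x) to every column by hand
manual_scaled <- sapply(height_weight, function(x) (x - mean(x)) / sd(x))

# scale() centers and scales each column in one call
scaled <- scale(height_weight)

# The two agree once scale()'s extra attributes are ignored
all.equal(manual_scaled, scaled, check.attributes = FALSE)  # TRUE

# After scaling, every column has mean 0 and standard deviation 1
colMeans(scaled)
apply(scaled, 2, sd)
```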
Let's practice!
Measuring Distance For Categorical Data
Binary Data
     wine  beer whiskey vodka
1    TRUE  TRUE   FALSE FALSE
2   FALSE  TRUE    TRUE  TRUE
...   ...   ...     ...   ...
Jaccard Index
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Calculating Jaccard Distance
   wine  beer whiskey vodka
1  TRUE  TRUE   FALSE FALSE
2 FALSE  TRUE    TRUE  TRUE
$$J(1, 2) = \frac{|1 \cap 2|}{|1 \cup 2|} = \frac{1}{4} = 0.25$$

$$\text{Distance}(1, 2) = 1 - J(1, 2) = 0.75$$
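The same arithmetic can be written as a small helper. The function name `jaccard_distance` and the vectors `obs_1` and `obs_2` are hypothetical, introduced here only for illustration.

```r
# Hypothetical helper: Jaccard distance between two logical vectors
jaccard_distance <- function(a, b) {
  stopifnot(is.logical(a), is.logical(b), length(a) == length(b))
  intersection <- sum(a & b)  # features that are TRUE in both
  union        <- sum(a | b)  # features that are TRUE in at least one
  # Note: returns NaN when both vectors are entirely FALSE (union of 0)
  1 - intersection / union
}

# Observations 1 and 2 from the table above
obs_1 <- c(wine = TRUE,  beer = TRUE, whiskey = FALSE, vodka = FALSE)
obs_2 <- c(wine = FALSE, beer = TRUE, whiskey = TRUE,  vodka = TRUE)

jaccard_distance(obs_1, obs_2)  # 0.75
```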
Calculating Jaccard Distance in R
print(survey_a)
   wine  beer whiskey vodka
  <lgl> <lgl>   <lgl> <lgl>
1  TRUE  TRUE   FALSE FALSE
2 FALSE  TRUE    TRUE  TRUE
3  TRUE FALSE    TRUE FALSE

dist(survey_a, method = "binary")
          1         2
2 0.7500000
3 0.6666667 0.7500000
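For a runnable version of this step, `survey_a` has to be built first. The sketch below assumes a plain data frame of logical columns; the `<lgl>` header on the slide suggests the original was a tibble, but the distances do not depend on that.

```r
# Assumed construction of survey_a with the three rows shown on the slide
survey_a <- data.frame(
  wine    = c(TRUE,  FALSE, TRUE),
  beer    = c(TRUE,  TRUE,  FALSE),
  whiskey = c(FALSE, TRUE,  TRUE),
  vodka   = c(FALSE, TRUE,  FALSE)
)

# method = "binary" gives the Jaccard distance between each pair of rows
survey_dist <- dist(survey_a, method = "binary")

# Convert to a full matrix to look up individual pairs by row index
dist_matrix <- as.matrix(survey_dist)
dist_matrix[2, 1]  # 0.75
dist_matrix[3, 1]  # 0.6666667 (1 shared drink out of 3 chosen by either)
```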
More Than Two Categories
    color  sport   |      colorblue  colorgreen  colorred  sporthockey  sportsoccer
1   red    soccer  |  1   0          0           1         0            1
2   green  hockey  |  2   0          1           0         1            0
3   blue   hockey  |  3   1          0           0         1            0
4   blue   soccer  |  4   1          0           0         0            1
... ...    ...     |  ... ...        ...         ...       ...          ...
Dummification in R
print(survey_b)
  color  sport
1   red soccer
2 green hockey
3  blue hockey
4  blue soccer

library(dummies)
dummy.data.frame(survey_b)
  colorblue colorgreen colorred sporthockey sportsoccer
1         0          0        1           0           1
2         0          1        0           1           0
3         1          0        0           1           0
4         1          0        0           0           1
Generalizing Categorical Distance in R
print(survey_b)
  color  sport
1   red soccer
2 green hockey
3  blue hockey
4  blue soccer

dummy_survey_b <- dummy.data.frame(survey_b)
dist(dummy_survey_b, method = 'binary')
          1         2         3
2 1.0000000
3 1.0000000 0.6666667
4 0.6666667 1.0000000 0.6666667
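The dummification and distance steps can also be sketched without an extra package. The example below swaps in base R's `model.matrix()` in place of `dummy.data.frame()`; this is a substitute technique, not the one used on the slides, and it assumes `survey_b` stores both columns as factors.

```r
# Assumed construction of survey_b; both columns must be factors here
survey_b <- data.frame(
  color = factor(c("red", "green", "blue", "blue")),
  sport = factor(c("soccer", "hockey", "hockey", "soccer"))
)

# One indicator column per level of every factor:
# "- 1" drops the intercept, and contrasts = FALSE keeps all levels
# instead of dropping a reference level
dummy_survey_b <- model.matrix(
  ~ color + sport - 1,
  data = survey_b,
  contrasts.arg = lapply(survey_b, contrasts, contrasts = FALSE)
)
colnames(dummy_survey_b)
# "colorblue" "colorgreen" "colorred" "sporthockey" "sportsoccer"

# Jaccard distance on the indicator columns
survey_b_dist <- as.matrix(dist(dummy_survey_b, method = "binary"))
survey_b_dist[2, 1]  # 1 (no shared color or sport)
survey_b_dist[4, 1]  # 0.6666667 (same sport, different color)
```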
Let's practice!