DataCamp Cluster Analysis in R
CLUSTER ANALYSIS IN R
Occupational Wage Data
Dmitriy (Dima) Gorenshteyn
Sr. Data Scientist,
Memorial Sloan Kettering Cancer Center
Occupational Wage Data
22 Occupation Observations
15 Measurements of Average Income from 2001-2016
print(oes)
                            2001  2002  2003  2004  2005 ...
Management                 70800 78870 83400 87090 88450 ...
Business Operations        50580 53350 56000 57120 57930 ...
Computer Science           60350 61630 64150 66370 67100 ...
Architecture/Engineering   56330 58020 60390 63060 63910 ...
Life/Physical/Social Sci.  49710 52380 54930 57550 58030 ...
Community Services         34190 34630 35800 37050 37530 ...
...                          ...   ...   ...   ...   ... ...
Next Steps: Hierarchical Clustering
Evaluate whether pre-processing is necessary
Create a distance matrix
Build a dendrogram
Extract clusters from dendrogram
Explore resulting clusters
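The steps above can be sketched in base R. This is a minimal illustration, not the course's exact solution: `oes_sample` is a hypothetical name for the six occupations and five years visible in the `print(oes)` output, and the choice of average linkage and `k = 3` is an assumption made only for the example. Because every column is a wage in the same units, the sketch assumes no scaling is needed.

```r
# Illustrative subset of oes: the six rows and five years shown in print(oes).
oes_sample <- matrix(
  c(70800, 78870, 83400, 87090, 88450,
    50580, 53350, 56000, 57120, 57930,
    60350, 61630, 64150, 66370, 67100,
    56330, 58020, 60390, 63060, 63910,
    49710, 52380, 54930, 57550, 58030,
    34190, 34630, 35800, 37050, 37530),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Management", "Business Operations", "Computer Science",
      "Architecture/Engineering", "Life/Physical/Social Sci.",
      "Community Services"),
    as.character(2001:2005)
  )
)

# 1. Pre-processing: all columns share the same units, so scaling is skipped (assumption).

# 2. Create a distance matrix between occupations.
dist_oes <- dist(oes_sample, method = "euclidean")

# 3. Build and plot the dendrogram (average linkage is an assumed choice).
hc_oes <- hclust(dist_oes, method = "average")
plot(hc_oes)

# 4. Extract clusters by cutting the tree into k groups (k = 3 is an assumed choice).
clusters <- cutree(hc_oes, k = 3)

# 5. Explore the clusters: membership and mean wage per cluster.
print(clusters)
print(tapply(rowMeans(oes_sample), clusters, mean))
```

With the full `oes` matrix, the same calls apply unchanged; only the linkage method and the cut height or `k` would need to be revisited after inspecting the dendrogram.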
Let's practice!
Reviewing the Hierarchical Clustering Results
The Dendrogram
The Trends
Connecting The Two
Next Steps: k-means Clustering
Evaluate whether pre-processing is necessary
Estimate the "best" k using the elbow plot
Estimate the "best" k using the maximum average silhouette width
Explore resulting clusters
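A rough sketch of the two ways of estimating k listed above. It assumes the `cluster` package is installed (for `silhouette()`), uses `oes_sample`, a hypothetical name for the six occupations and five years visible in the `print(oes)` output, and limits k to a small range because the sample has only six rows. The seed and `nstart` value are arbitrary choices for reproducibility.

```r
library(cluster)  # provides silhouette(); assumed to be installed

# Illustrative subset of oes: the six rows and five years shown in print(oes).
oes_sample <- matrix(
  c(70800, 78870, 83400, 87090, 88450,
    50580, 53350, 56000, 57120, 57930,
    60350, 61630, 64150, 66370, 67100,
    56330, 58020, 60390, 63060, 63910,
    49710, 52380, 54930, 57550, 58030,
    34190, 34630, 35800, 37050, 37530),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Management", "Business Operations", "Computer Science",
      "Architecture/Engineering", "Life/Physical/Social Sci.",
      "Community Services"),
    as.character(2001:2005)
  )
)

set.seed(42)  # k-means starts from random centers, so fix the seed

# Elbow plot: total within-cluster sum of squares for each candidate k.
k_elbow <- 1:4
tot_withinss <- sapply(k_elbow, function(k) {
  kmeans(oes_sample, centers = k, nstart = 20)$tot.withinss
})
plot(k_elbow, tot_withinss, type = "b",
     xlab = "k", ylab = "Total within-cluster sum of squares")

# Average silhouette width for each candidate k (defined only for k >= 2).
k_sil <- 2:4
d <- dist(oes_sample)
avg_sil <- sapply(k_sil, function(k) {
  km <- kmeans(oes_sample, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
best_k <- k_sil[which.max(avg_sil)]

# Explore the clusters for the k with the highest average silhouette width.
km_best <- kmeans(oes_sample, centers = best_k, nstart = 20)
print(km_best$cluster)
```

The two criteria need not agree; the elbow is read off the plot by eye, while the silhouette criterion gives a single numeric maximum.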
Let's cluster!
Reviewing the k-means Results
Comparing The Two Clustering Methods
                            Hierarchical Clustering        k-means
Distance Used:              virtually any                  euclidean only
Results Stable:             Yes                            No
Evaluating # of Clusters:   dendrogram, silhouette, elbow  silhouette, elbow
Computation Complexity:     Relatively Higher              Relatively Lower
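One way to see the comparison in practice is to run both methods on the same data and cross-tabulate the assignments. The sketch below is illustrative only: `oes_sample` is a hypothetical name for the six occupations and five years visible in the `print(oes)` output, and average linkage, `k = 3`, and the seed are assumed choices.

```r
# Illustrative subset of oes: the six rows and five years shown in print(oes).
oes_sample <- matrix(
  c(70800, 78870, 83400, 87090, 88450,
    50580, 53350, 56000, 57120, 57930,
    60350, 61630, 64150, 66370, 67100,
    56330, 58020, 60390, 63060, 63910,
    49710, 52380, 54930, 57550, 58030,
    34190, 34630, 35800, 37050, 37530),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Management", "Business Operations", "Computer Science",
      "Architecture/Engineering", "Life/Physical/Social Sci.",
      "Community Services"),
    as.character(2001:2005)
  )
)

# Hierarchical clustering is deterministic: repeated runs give identical labels.
hc_run1 <- cutree(hclust(dist(oes_sample), method = "average"), k = 3)
hc_run2 <- cutree(hclust(dist(oes_sample), method = "average"), k = 3)
print(identical(hc_run1, hc_run2))

# k-means depends on random starting centers, so a seed is needed to reproduce it.
set.seed(42)
km_clusters <- kmeans(oes_sample, centers = 3, nstart = 20)$cluster

# Cross-tabulate the two assignments; cluster numbers are arbitrary labels,
# so compare which rows group together rather than the numbers themselves.
comparison <- table(hierarchical = hc_run1, kmeans = km_clusters)
print(comparison)
```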
What have you learned?
Chapter 1:
What is distance
Why is scale important
Chapter 2:
How linkage works
How the dendrogram is formed
How to analyze your clusters
Chapter 3:
How k-means works
How to estimate k
Lots More to Learn
k-medoids
DBSCAN
OPTICS
Congratulations!