
What Is Cluster Analysis?: Dmitriy (Dima) Gorenshteyn

This document introduces cluster analysis in R. It first defines cluster analysis as an exploratory data analysis technique that groups observations into meaningful clusters based on shared characteristics. It then outlines the typical steps of a cluster analysis workflow and describes the structure of the course, which covers measuring distance between observations, scaling features, and handling categorical data.


DataCamp Cluster Analysis in R

CLUSTER ANALYSIS IN R

What is Cluster Analysis?

Dmitriy (Dima) Gorenshteyn


Sr. Data Scientist,
Memorial Sloan Kettering Cancer Center

What is Clustering?

A form of exploratory data analysis (EDA) where observations are divided into meaningful groups that share common characteristics (features).

The Flow of Cluster Analysis

Structure of This Course

Let's Learn!

Distance Between Two Observations

Distance vs Similarity

DISTANCE = 1 − SIMILARITY

Distance Between Two Players

dist() Function

print(two_players)

     X  Y
BLUE 0  0
RED  9 12

dist(two_players, method = 'euclidean')

    BLUE
RED   15
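
As a rough check on the output above, the Euclidean distance can also be computed by hand. This is a minimal sketch: `two_players` is rebuilt here from the printed values (the slides do not show how it was created), and the helper name `euclid` is hypothetical.

```r
# Rebuild two_players from the printed values (an assumption: the
# original object's construction is not shown in the slides).
two_players <- data.frame(
  X = c(0, 9),
  Y = c(0, 12),
  row.names = c("BLUE", "RED")
)

# Hypothetical helper: Euclidean distance between two numeric vectors.
euclid <- function(a, b) sqrt(sum((a - b)^2))

blue <- unlist(two_players["BLUE", ])
red  <- unlist(two_players["RED", ])

manual  <- euclid(blue, red)                        # sqrt(9^2 + 12^2)
builtin <- dist(two_players, method = "euclidean")  # same pair via dist()

print(manual)
print(builtin)
```

Both approaches should agree on a distance of 15 for this pair.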

More than 2 Observations

print(three_players)

       X  Y
BLUE   0  0
RED    9 12
GREEN -2 19

dist(three_players)

          BLUE      RED
RED   15.00000
GREEN 19.10497 13.03840
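
A `dist` object stores only the lower triangle, which can be awkward to index. One possible way to look up individual pairs, sketched here with the data rebuilt from the printed values (an assumption), is to convert it with `as.matrix()`. The variable names `dist_matrix` and `closest_pair` are illustrative only.

```r
# Rebuild three_players from the printed values (assumption).
three_players <- data.frame(
  X = c(0, 9, -2),
  Y = c(0, 12, 19),
  row.names = c("BLUE", "RED", "GREEN")
)

dist_players <- dist(three_players)    # Euclidean is the default method
dist_matrix  <- as.matrix(dist_players)

# Look up a single pair by name.
print(dist_matrix["GREEN", "RED"])

# Find the closest pair: mask the zero diagonal, then locate the minimum.
diag(dist_matrix) <- NA
closest <- which(dist_matrix == min(dist_matrix, na.rm = TRUE),
                 arr.ind = TRUE)[1, ]
closest_pair <- rownames(dist_matrix)[closest]
print(closest_pair)
```

For these three players the closest pair is RED and GREEN, matching the smallest value in the printed distance table.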

Let's practice!

The Scales of Your Features

Distance Between Individuals

Observation  Height (feet)  Weight (lbs)
1            6.0            200
2            6.0            202
3            8.0            200
...          ...            ...
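
To illustrate why the units of each feature matter, here is a small sketch that uses only the three rows shown; the elided rows are left out, so the numbers apply to this subset alone. The object name `raw_dist` is illustrative.

```r
# Only the three visible rows are used (the remaining rows are elided
# in the source and are not guessed at here).
height_weight <- data.frame(
  Height = c(6, 6, 8),        # feet
  Weight = c(200, 202, 200)   # lbs
)

raw_dist <- as.matrix(dist(height_weight))
print(raw_dist)

# Observations 1 and 2 differ by 2 lbs; observations 1 and 3 differ by
# 2 feet. On the unscaled values both pairs come out equally far apart.
print(raw_dist["1", "2"])
print(raw_dist["1", "3"])
```

A 2 lb difference and a 2 foot difference are treated as equal here, which is the motivation for putting features on a comparable scale before measuring distance.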



Scaling our Features

$$\text{height}_{\text{scaled}} = \frac{\text{height} - \text{mean}(\text{height})}{\text{sd}(\text{height})}$$
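
The standardization above can be sketched directly in R and compared against `scale()`. The three height values are illustrative, taken from the visible rows of the example table; the variable names are assumptions.

```r
height <- c(6, 6, 8)   # illustrative values from the three rows shown

# Standardize by hand: subtract the mean, divide by the standard deviation.
height_scaled <- (height - mean(height)) / sd(height)

# scale() applies the same centering and scaling column by column and
# returns a matrix, so convert it to a plain vector for comparison.
height_scaled_builtin <- as.numeric(scale(height))

print(height_scaled)
print(all.equal(height_scaled, height_scaled_builtin))
```

After scaling, the feature has mean 0 and standard deviation 1, so no single feature dominates the distance purely because of its units.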

scale() function

print(height_weight)

    Height Weight
1        6    200
2        6    202
3        8    200
...    ...    ...

scale(height_weight)

    Height Weight
1     0.60   0.67
2     0.60   0.73
3     11.3   0.67
...    ...    ...

Let's practice!

Measuring Distance For Categorical Data

Binary Data

     wine   beer   whiskey  vodka
1    TRUE   TRUE   FALSE    FALSE
2    FALSE  TRUE   TRUE     TRUE
...  ...    ...    ...      ...


Jaccard Index

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Calculating Jaccard Distance

     wine   beer   whiskey  vodka
1    TRUE   TRUE   FALSE    FALSE
2    FALSE  TRUE   TRUE     TRUE

$$J(1, 2) = \frac{|1 \cap 2|}{|1 \cup 2|} = \frac{1}{4} = 0.25$$

$$\text{Distance}(1, 2) = 1 - J(1, 2) = 0.75$$
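
The same calculation can be sketched with logical vectors in R. The vector names `obs1` and `obs2` are illustrative; the values are the two survey rows shown above.

```r
obs1 <- c(wine = TRUE,  beer = TRUE, whiskey = FALSE, vodka = FALSE)
obs2 <- c(wine = FALSE, beer = TRUE, whiskey = TRUE,  vodka = TRUE)

# Jaccard index: number of features both observations share, divided by
# the number of features at least one of them has.
jaccard <- sum(obs1 & obs2) / sum(obs1 | obs2)

# Distance is one minus similarity.
jaccard_distance <- 1 - jaccard

print(jaccard)
print(jaccard_distance)
```

Only beer is shared, and four drinks appear in at least one row, which gives an index of 0.25 and a distance of 0.75.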


Calculating Jaccard Distance in R

print(survey_a)

   wine  beer  whiskey vodka
  <lgl> <lgl>   <lgl> <lgl>
1  TRUE  TRUE   FALSE FALSE
2 FALSE  TRUE    TRUE  TRUE
3  TRUE FALSE    TRUE FALSE

dist(survey_a, method = "binary")

          1         2
2 0.7500000
3 0.6666667 0.7500000
More Than Two Categories

     color  sport   colorblue  colorgreen  colorred  sporthockey  sportsoccer
1    red    soccer  0          0           1         0            1
2    green  hockey  0          1           0         1            0
3    blue   hockey  1          0           0         1            0
4    blue   soccer  1          0           0         0            1
...  ...    ...     ...        ...         ...       ...          ...


Dummification in R

print(survey_b)

  color  sport
1   red soccer
2 green hockey
3  blue hockey
4  blue soccer

library(dummies)

dummy.data.frame(survey_b)

  colorblue colorgreen colorred sporthockey sportsoccer
1         0          0        1           0           1
2         0          1        0           1           0
3         1          0        0           1           0
4         1          0        0           0           1
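
If the `dummies` package is not available in your environment, a similar indicator matrix can be built with base R. This is a sketch of one alternative, not the course's method: it assumes `survey_b` holds factor columns (rebuilt here from the printed values) and uses `model.matrix()` with contrasts switched off so that no level is dropped. The name `full_contrasts` is illustrative.

```r
# Rebuild survey_b from the printed values, with factor columns (assumption).
survey_b <- data.frame(
  color = c("red", "green", "blue", "blue"),
  sport = c("soccer", "hockey", "hockey", "soccer"),
  stringsAsFactors = TRUE
)

# Request full indicator coding for every factor (no reference level
# dropped) and remove the intercept column with "- 1".
full_contrasts <- lapply(survey_b, contrasts, contrasts = FALSE)
dummy_survey_b <- model.matrix(~ . - 1, data = survey_b,
                               contrasts.arg = full_contrasts)

print(dummy_survey_b)

# The 0/1 matrix can then be passed to dist() with the binary method.
print(dist(dummy_survey_b, method = "binary"))
```

Each row ends up with exactly two indicator columns set to 1, one for its color and one for its sport.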

Generalizing Categorical Distance in R

print(survey_b)

  color  sport
1   red soccer
2 green hockey
3  blue hockey
4  blue soccer

dummy_survey_b <- dummy.data.frame(survey_b)

dist(dummy_survey_b, method = 'binary')

          1         2         3
2 1.0000000
3 1.0000000 0.6666667
4 0.6666667 1.0000000 0.6666667

Let's practice!
