Unsupervised learning
Takeaway from Unsupervised Learning
In this topic, you will learn about Unsupervised Learning algorithms. You
will also understand the various distance measures that are used in clustering.
Distance Measure
Distance Measure is a very important aspect of clustering. Knowing how close or how
far apart the data points are from one another helps in grouping them.
Jaccard Distance
The Jaccard index compares the elements of two sets to identify which members are
shared and which are not. The Jaccard Distance is a measure of how different the
two given sets are.
Jaccard Distance = 1 - (Jaccard Index)
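As an illustration, a minimal sketch in R (the two sets below are made up purely for this example):

a <- c("apple", "banana", "cherry")
b <- c("banana", "cherry", "date", "fig")

# Jaccard Index = size of the intersection / size of the union
jaccardIndex <- length(intersect(a, b)) / length(union(a, b))   # 2 / 5 = 0.4
1 - jaccardIndex                                                 # Jaccard Distance = 0.6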
Euclidean Distance is the shortest (straight-line) distance between two given points in
Euclidean space.
Cosine distance between two given vectors u and v is one minus the cosine of the
angle between them (i.e. one minus their cosine similarity).
Manhattan distance is calculated along strictly horizontal and vertical paths, i.e. as
the sum of the absolute differences of the coordinates.
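A minimal sketch of these three measures in R, for two illustrative vectors (the vectors themselves are made up for this example):

u <- c(1, 2, 3)
v <- c(4, 0, 3)

# Euclidean distance: the straight-line distance between the two points
sqrt(sum((u - v)^2))                 # equivalently: dist(rbind(u, v))

# Manhattan distance: the sum of absolute differences along each axis
sum(abs(u - v))                      # equivalently: dist(rbind(u, v), method = "manhattan")

# Cosine distance: one minus the cosine of the angle between the vectors
1 - sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))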
Module Summary
Unsupervised Learning at a high level
A few prominent distance measures used in clustering data
Hierarchical Clustering
Hierarchical clustering – What’s Covered?
In this course, we will be learning the following clustering techniques:
K Means
Hierarchical
Hierarchical Clustering Explained
1. Begin by assigning each item to its own cluster. If you have N items, you now have
N clusters, each containing just one item. The similarities (distances) between the
clusters are initially the same as the similarities (distances) between the items they
contain.
2. Find the most similar (closest) pair of clusters and merge them into a single
cluster, reducing the total number of clusters by one.
3. Compute the similarities (distances) between the new cluster and each of the old
clusters.
4. Repeat steps 2 and 3 until all items are clustered into one single cluster of size
N. (A short R sketch of this procedure follows the source link below.)
Source:
https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
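The same agglomerative procedure is available in R through hclust; a minimal sketch on a tiny made-up data set (the points below are assumptions chosen only for illustration):

# Five 2-D points, each starting out as its own cluster
pts <- matrix(c(1, 1,
                1.5, 1.2,
                5, 5,
                5.2, 4.8,
                9, 9),
              ncol = 2, byrow = TRUE)
hc <- hclust(dist(pts))   # repeatedly merges the closest pair of clusters
hc$merge                  # the order in which the clusters were merged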
Dendrogram
A dendrogram is a branching diagram that represents the relationships of similarity
among a group of entities
Each branch is called a clade
The terminal end of each clade is called a leaf
There is no limit to the number of leaves in a clade
The arrangement of the clades tells us which leaves are most similar to each other
The height of the branch points indicates how similar or different they are from
each other
The greater the height, the greater the difference between the points
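A minimal sketch of reading these quantities from a dendrogram in R, using the iris petal measurements that also appear later in this course:

hc <- hclust(dist(iris[, 3:4]))
plot(hc)           # leaves sit at the bottom; clades join higher up the more dissimilar they are
hc$height          # merge heights; the greater the height, the greater the difference
cutree(hc, k = 3)  # cut the tree into three clusters based on where the clades join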
Disadvantages of Agglomerative Clustering
If data points are grouped incorrectly at the start, they cannot be reassigned later.
If different similarity measures are used to calculate the similarity between
clusters, they may lead to entirely different outcomes.
Tips for Hierarchical Clustering
There is no one-size-fits-all answer to how many clusters you need; it depends on
what you intend to do with them. For a better solution, look at the basic
characteristics of the clusters at successive steps and make a decision once you
reach a solution that can be meaningfully interpreted.
Hierarchical clustering – Standardization
Standardizing the variables is a good practice to follow before clustering data, so
that variables on larger scales do not dominate the distance calculations (see the
sketch below).
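A minimal sketch, assuming the base R scale() function and the iris columns purely as an example:

# scale() centres each variable to mean 0 and standard deviation 1
irisScaled <- scale(iris[, 1:4])   # 'irisScaled' is just an illustrative name
hclust(dist(irisScaled))           # then cluster the standardized data as usual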
Summary on Hierarchical Clustering
In this module, you have learnt Hierarchical Clustering in detail. You have also
learnt how to read a Dendrogram and some tips to be followed when
fitting hierarchical clustering to a data set.
K Means Clustering
Takeaway from K-Means Clustering
In this topic, you will learn K Means Clustering in detail. You will also get to
understand the concept through an interactive game.
K-Means Algorithm Simplified
1. Place k points in the space represented by the objects being clustered. These
points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. (A bare-bones R sketch of
these steps follows the source link below.)
Source: http://eacharya.inflibnet.ac.in/data-server/eacharya-documents/53e0c6cbe413016f23443704_INFIEP_33/19/LM/33-19-LM-V1-S1__document_clustering_2.pdf
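A bare-bones sketch of the steps above on the iris petal measurements. The starting rows (1, 60, 120) and the fixed ten iterations are arbitrary choices made only for illustration; the kmeans() function used later in this course does this properly, iterating until the centroids stop moving and handling edge cases such as empty clusters:

x <- as.matrix(iris[, 3:4])
k <- 3
centroids <- x[c(1, 60, 120), ]   # Step 1: pick k initial centroids

for (i in 1:10) {
  # Step 2: assign each point to its closest centroid
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  # Step 3: move each centroid to the mean of the points assigned to it
  centroids <- apply(x, 2, function(col) tapply(col, assignment, mean))
}

table(assignment, iris$Species)   # compare the resulting groups with the species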
Tips for K Means Clustering
For large datasets, random sampling can be used to determine the k value for
clustering
Hierarchical Clustering can also be used for the same purpose
Choosing the Right K-value
Other ways to choose the right k value:
By rule of thumb
Elbow method
Information Criterion Approach
An Information Theoretic Approach
Choosing k using the Silhouette
Cross-validation
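As an illustration, a minimal sketch in R of two of these approaches: the elbow method using base kmeans(), and the silhouette using the cluster package (the data set and the range of k values are assumptions made only for illustration):

# Elbow method: plot the total within-cluster sum of squares against k and look for the bend
wss <- sapply(1:10, function(k) kmeans(iris[, 3:4], centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")

# Silhouette: a higher average silhouette width indicates better-separated clusters
library(cluster)
km  <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
sil <- silhouette(km$cluster, dist(iris[, 3:4]))
mean(sil[, "sil_width"])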
Summary on K Means Clustering
In this topic, you have learnt K-Means clustering in detail. You have also learnt
the concept through a game and understood some tips on how to fit the k-means
algorithm to your data set.
K Means Clustering using R
Code Snippet
K Means Clustering in R
Loading and exploring the dataset
library(datasets)
head(iris)
Visualizing the data
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Setting the seed and creating the cluster
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
Comparing the clusters with the species
table(irisCluster$cluster, iris$Species)
Plotting the dataset to view the clusters
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()
Code Snippet
Hierarchical Clustering in R
Loading and exploring the dataset
library(datasets)
head(iris)
Visualizing the data
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Calculating the distance and plotting the dendrogram
clusters <- hclust(dist(iris[, 3:4]))
plot(clusters)
Cutting the desired number of clusters and comparing it with the data
clusterCut <- cutree(clusters, 3)
table(clusterCut, iris$Species)
Visualizing the clusters
ggplot(iris, aes(Petal.Length, Petal.Width, color = iris$Species)) +
geom_point(alpha = 0.4, size = 3.5) + geom_point(col = clusterCut) +
scale_color_manual(values = c('black', 'red', 'green'))
Clustering Course Summary
In this course, you have learnt:
Unsupervised Learning Technique (Clustering)
Hierarchical Clustering
K Means Clustering
Hands-on Exercise on clustering using R
Hope you had fun on this journey.