Clustering
Most of the topics are based on the textbook Introduction to Data Mining by Tan et al. and
on video lectures by Andrew Ng.
Clustering is an unsupervised learning technique.
Clusters are potential classes, and cluster analysis is the study of techniques for automatically
finding classes.
Dividing objects into groups is clustering; assigning particular objects to these groups is called
classification.
Clustering is similar to classification, except that here we do not know the labels.
Examples: human genome analysis, social network analysis, market segmentation and astronomical
data analysis.
So, why can't we get labels?
They are too costly to obtain – e.g., labelling Amazon web pages
They don't exist (there is no ground truth) – e.g., classification of animals into species
We can represent each object by the index of the prototype associated with it. This type of
compression is called vector quantization
Cluster validity – methods for evaluating the goodness of the clusters produced by a clustering
algorithm.
The greater the similarity within a group and the greater the difference between groups, the
better or more distinct the clustering.
The definition of a cluster is imprecise, and the best definition depends on the nature of the
data and the desired results.
Clustering can be used either for utility or for understanding
Understanding:
Clustering is used in biology, information retrieval, climate science, business, etc. to
understand patterns in the data.
Utility: here cluster analysis is only the starting point for other purposes, such as:
Summarization
Compression
Efficiently finding nearest neighbours
Different types of clustering:
Hierarchical versus partitional clustering: a partitional clustering divides the objects into
non-overlapping subsets, while a hierarchical clustering is a set of nested clusters organized
as a tree.
Exclusive: assigning each object to a single cluster
Overlapping or non-exclusive clustering: used in situations in which a point can reasonably be
placed in more than one cluster
Fuzzy: Every object belongs to every cluster with a membership weight that is between 0 and 1
(all the weights must sum to 1)
Probabilistic clustering computes the probability with which each point belongs to each cluster
(all the probabilities must sum to 1)
Complete and partial clusterings: a complete clustering assigns every object to a cluster,
whereas a partial clustering does not.
Types of clusters:
Well-separated
Prototype-based (centre-based clusters)
Graph-based (connected components)
Density-based
Shared property
There are three major algorithms used in clustering:
1) K-means
2) DBSCAN
3) Agglomerative hierarchical clustering
In this exercise we mainly concentrate on K-means clustering, which is a prototype-based,
partitional clustering method.
Algorithm:
1: Select K points as initial centroids
2: repeat
3: Form K clusters by assigning each point to its closest centroid
4: Recompute the centroid of each cluster
5: until centroids do not change
Step 5 is often replaced by a weaker condition, e.g., repeat until fewer than 1% of the points
change clusters.
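As an illustration, the loop above can be written directly in R. This is a minimal Lloyd-style
sketch (the name simple_kmeans and its arguments are illustrative, not from the lecture), and it
does not handle empty clusters (see Issues below):
simple_kmeans <- function(X, K, max_iter = 100) {
  # step 1: pick K data points at random as the initial centroids
  centroids <- X[sample(nrow(X), K), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # step 3: assign each point to its closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
    new_assignment <- apply(d, 1, which.min)
    # step 5: stop when no assignment (and hence no centroid) changes
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # step 4: recompute each centroid as the mean of its assigned points
    for (k in 1:K)
      centroids[k, ] <- colMeans(X[assignment == k, , drop = FALSE])
  }
  list(cluster = assignment, centers = centroids)
}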
K-means uses a proximity measure to assign each point to its closest centre; the proximity
measure characterizes the similarity or dissimilarity between objects.
Some proximity measures:
Euclidean and Manhattan – used for data points
Cosine and Jaccard - used for documents
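As a quick illustration, base R's dist() covers the Euclidean and Manhattan measures; cosine
similarity has to be computed by hand, since base dist() has no cosine method (the vectors here
are made up):
x1 <- c(1, 2, 3)
x2 <- c(5, 10, 4)
dist(rbind(x1, x2), method = "euclidean")   # Euclidean distance
dist(rbind(x1, x2), method = "manhattan")   # Manhattan distance
cos_sim <- sum(x1 * x2) / (sqrt(sum(x1^2)) * sqrt(sum(x2^2)))
1 - cos_sim                                 # a common cosine distance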
Objective: Minimise the SSE (sum of squared errors) from a point to its centroid
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(c_i, x)^2
where C_i is the i-th cluster and c_i is its centroid.
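This objective is easy to evaluate in R; the built-in kmeans() reports it as tot.withinss. A
small sketch (the argument names are illustrative):
# X: data matrix, cluster: assignment vector, centers: K x n centroid matrix
sse <- function(X, cluster, centers) {
  sum((X - centers[cluster, , drop = FALSE])^2)
}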
If we have the true output labels, we can calculate the accuracy with which objects are
allocated to their respective clusters.
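For example, one simple sketch in R cross-tabulates the cluster assignments against the labels
and credits each cluster with its majority label (cluster and labels are assumed to be given):
tab <- table(cluster, labels)              # clusters versus true labels
sum(apply(tab, 1, max)) / length(labels)   # majority-label accuracy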
Time and space complexity:
Storage required – O((m+K)n)
Time required – O(I*K*m*n) – linear in m
Here I is the number of iterations, K the number of clusters, m the number of data points and
n the number of attributes; I and K are usually much smaller than m.
[Figure: the k-means clustering process]
Programming in R:
1) Using the library
2) Implementing the algorithm
Prototype:
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
       trace = FALSE)
# use the parameters required by the application
# x and centers are required parameters; the other arguments are optional
kmeans(dataframe, number_of_clusters) – the form generally used
fit <- kmeans(data, 2) – here 2 is the number of clusters
fit$centers – the centre of each cluster
fit$cluster – which points belong to which cluster
fit$size – the number of points in each cluster
fit$iter – the number of iterations needed to converge
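Putting these together, a minimal runnable example on made-up data (the two Gaussian blobs are
only for illustration):
set.seed(42)       # fix the random initialization for reproducibility
data <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
              matrix(rnorm(50, mean = 5), ncol = 2))
fit <- kmeans(data, 2)
fit$centers        # centre of each cluster
fit$cluster        # which points belong to which cluster
fit$size           # number of points in each cluster
fit$iter           # iterations needed to converge
fit$tot.withinss   # total within-cluster SSE (the objective above)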
Cluster Evaluation:
The evaluation measures that are applied to judge various aspects of cluster validity are
traditionally classified into three types
Unsupervised:
Measures the goodness of cluster structure without respect to external information. An
example of this is SSE.
Supervised:
Measures the extent to which the clustering structure discovered by a clustering algorithm
matches some external structure, such as externally provided class labels.
Relative:
Compares different clusterings or clusters. A relative cluster evaluation measure is a
supervised or unsupervised evaluation measure used for the purpose of comparison.
Issues:
Choosing initial clusters
Choosing number of clusters
Handling empty clusters
Outliers
Choosing initial clusters:
One way of choosing initial clusters Random initialization
Randomly pick objects and set centroid equal to these objects
K means can converge to different solution depending on the centroid initialization (So random
centroid initialisation is important)
We want the global optimum; the clustering should not get stuck at a local optimum.
To address this, try multiple random initialisations and keep the solution whose cost function
value is lowest.
This is most helpful when the number of clusters is small.
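In R this is exactly what the nstart argument does: kmeans() is run with nstart different
random initializations and the result with the lowest total within-cluster SSE is returned
(data as in the earlier example):
fit <- kmeans(data, centers = 2, nstart = 25)   # 25 random initialisations
fit$tot.withinss                                # cost of the best run found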
Choosing number of clusters:
There is no single best way to do this; one option is visualisation.
Elbow method for choosing the number of clusters:
The distortion (cost function value) goes down as we increase the number of clusters.
Choose the number of clusters at the elbow point (before the elbow the distortion drops
rapidly; after the elbow it drops slowly), as sketched below.
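A sketch of the elbow method in R, reusing the data object from the earlier example: compute
the distortion for a range of K and look for the bend in the curve:
wss <- sapply(1:10, function(k)
  kmeans(data, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Distortion (total within-cluster SSE)")
# choose K at the elbow: the drop is rapid before it, slow after it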
Questions:
1) Consider the points x1 <- c(1,2,3,6) and x2 <- c(5,10,4,12). Compute the (Euclidean)
distance.
2) Consider the points x1 <- c(1,2) and x2 <- c(5,10). Compute the cosine distance.
3) Using the data x <- c(1,2,2.5,3,3.5,4,4.5,5,7,8,8.5,9,9.5,10), find the clusters with
k = 2.
4) For the given sonar_test.csv, find the clustering with k = 2 for the first two columns.
5) Find the testing error.
6) For the given sonar_test.csv, find the clustering with k = 2 for the entire data set and
find the testing error.
7) For the given sonar_test.csv, do hierarchical clustering with 4 cuts.
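A hedged starting point for questions 1-3 and 7 (questions 4-6 need sonar_test.csv, whose
location and label column are not specified here):
# Q1: Euclidean distance
x1 <- c(1, 2, 3, 6); x2 <- c(5, 10, 4, 12)
sqrt(sum((x1 - x2)^2))
# Q2: cosine distance
x1 <- c(1, 2); x2 <- c(5, 10)
1 - sum(x1 * x2) / (sqrt(sum(x1^2)) * sqrt(sum(x2^2)))
# Q3: k-means with k = 2 on one-dimensional data
x <- c(1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7, 8, 8.5, 9, 9.5, 10)
kmeans(x, 2)
# Q7: hierarchical clustering could start from hclust() and cutree(),
# e.g. cutree(hclust(dist(data)), k = 4)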