Università degli Studi di Milano
Master's Degree in Computer Science
Information Management course
Teacher: Alberto Ceselli
Lecture 21: 09/12/2014
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 10 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Cluster Analysis: Basic Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Assessing Clustering Tendency
Assess whether a non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
Test spatial randomness with a statistical test: the Hopkins statistic
Given a dataset D regarded as a sample of a random variable z,
determine how far away z is from being uniformly distributed in
the data space
Sample n points, p1, …, pn, uniformly from the feature space of D. For
each pi, find its nearest neighbor in D: yi = min{dist(pi, v) : v ∈ D}
Sample n points, q1, …, qn, uniformly from D. For each qi, find its
nearest neighbor in D − {qi}: xi = min{dist(qi, v) : v ∈ D, v ≠ qi}
Calculate the Hopkins statistic:
H = ∑ yi / (∑ xi + ∑ yi)
If z (and so D) is uniformly distributed, ∑ xi and ∑ yi are close to
each other and H is close to 0.5.
If D is clustered, H is close to 1 (a sketch of the computation
follows below)
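As a concrete illustration, here is a minimal Python sketch of the Hopkins statistic as defined above, assuming NumPy and SciPy are available. The bounding-box approximation of the feature space and the function name hopkins are illustrative choices, not part of the original slides.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, n=50, seed=None):
    """Hopkins statistic (sketch): H near 0.5 suggests uniform data,
    H near 1 suggests a clustered structure (convention of these slides)."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    tree = cKDTree(D)

    # y_i: distance from each of n points sampled uniformly from the
    # feature space (approximated by D's bounding box) to its nearest
    # neighbor in D
    p = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, D.shape[1]))
    y, _ = tree.query(p, k=1)

    # x_i: distance from each of n points q_i sampled from D to its
    # nearest neighbor in D - {q_i} (k=2: the closest hit is q_i itself)
    q = D[rng.choice(len(D), size=n, replace=False)]
    d, _ = tree.query(q, k=2)
    x = d[:, 1]

    return y.sum() / (x.sum() + y.sum())
```

On uniformly scattered data the two sums are comparable and H ≈ 0.5; on clustered data the within-data nearest-neighbor distances xi shrink and H moves toward 1.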
Determine the Number of Clusters
Empirical method
# of clusters ≈ √(n/2) for a dataset of n points
Elbow method
Use the turning point in the curve of the sum of within-cluster
variance w.r.t. the # of clusters (see the sketch after this list)
Cross validation method
Divide a given data set into m parts
Use m – 1 parts to obtain a clustering model
Use the remaining part to test the quality of the clustering
E.g., for each point in the test set, find the closest
centroid, and use the sum of squared distances between
all points in the test set and their closest centroids to
measure how well the model fits the test set
For each k > 0, repeat the procedure m times, compare the overall
quality measures obtained for the different k's, and pick the # of
clusters that fits the data best (a sketch follows below)
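A minimal sketch of the elbow method, assuming scikit-learn is available: KMeans exposes the sum of squared distances to the closest centroids as inertia_, so the within-cluster variance curve can be traced directly. The function name and parameters are illustrative.

```python
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10, seed=0):
    """Within-cluster sum of squares for k = 1..k_max; plot it and look
    for the turning point (elbow) where larger k's barely help."""
    return [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
            for k in range(1, k_max + 1)]
```

And a sketch of the cross-validation method under the same assumptions (X a NumPy array): fit on m − 1 folds, then score the held-out fold by the sum of squared distances from each test point to its closest training centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_quality(X, k, m=5, seed=0):
    """m-fold cross-validation quality for k clusters (lower is better)."""
    total = 0.0
    for train, test in KFold(n_splits=m, shuffle=True,
                             random_state=seed).split(X):
        # fit on m - 1 parts, keep the resulting centroids
        centroids = KMeans(n_clusters=k, n_init=10,
                           random_state=seed).fit(X[train]).cluster_centers_
        # squared distance of each test point to its closest centroid
        d = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=-1)
        total += (d.min(axis=1) ** 2).sum()
    return total
```

Comparing cv_quality(X, k) over a range of k's and keeping the best-scoring k mirrors the procedure described in the slide.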
Measuring Clustering Quality
Two methods: extrinsic vs. intrinsic
Extrinsic: supervised, i.e., the ground truth (ideal
clustering, e.g. built by domain experts) is available
Compare a clustering against the ground truth using
certain clustering quality measure
Ex. BCubed precision and recall metrics (see the sketch after this list)
Intrinsic: unsupervised, i.e., the ground truth is
unavailable
Evaluate the goodness of a clustering by considering
how well the clusters are separated, and how compact
the clusters are
Ex. Silhouette coefficient
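For the extrinsic case, here is a minimal Python sketch of BCubed precision and recall; the input representation (two parallel lists of cluster ids and ground-truth category labels) is an assumption of this sketch.

```python
from collections import Counter

def bcubed(clusters, categories):
    """Per-object BCubed scores, averaged over the data set.

    For an object o: precision(o) = fraction of objects in o's cluster
    that share o's category; recall(o) = fraction of objects in o's
    category that were placed in o's cluster (o itself included).
    """
    n = len(clusters)
    joint = Counter(zip(clusters, categories))  # |cluster c ∩ category g|
    cluster_size = Counter(clusters)
    category_size = Counter(categories)
    precision = sum(joint[c, g] / cluster_size[c]
                    for c, g in zip(clusters, categories)) / n
    recall = sum(joint[c, g] / category_size[g]
                 for c, g in zip(clusters, categories)) / n
    return precision, recall
```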
Measuring Clustering Quality: Extrinsic Methods
Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg
Q is good if it satisfies the following four essential criteria
Cluster homogeneity: the purer, the better
Cluster completeness: objects belonging to the same
category in the ground truth should be assigned to the
same cluster
Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into
a rag bag (i.e., “miscellaneous” or “other” category)
Small cluster preservation: splitting a small category
into pieces is more harmful than splitting a large
category into pieces
Measuring Clustering Quality: Intrinsic Methods
Silhouette coefficient: evaluates compactness and separation
using a similarity (distance) metric between objects in the data set
Let C1, …, Ck be the clusters
For each object o in a cluster Ct:
let a(o) be the average distance between o and the
other objects of Ct
let bl(o) be the average distance between o and the
objects of cluster Cl; then b(o) = min{bl(o) : l ≠ t}
The silhouette coefficient is defined as follows:
s(o) = (b(o) − a(o)) / max{a(o), b(o)}
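A direct Python transcription of this definition, assuming NumPy, Euclidean distances, and at least two clusters; the O(n²) pairwise-distance matrix keeps the sketch short rather than efficient.

```python
import numpy as np

def silhouette(X, labels):
    """s(o) = (b(o) - a(o)) / max(a(o), b(o)) for every object o."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False  # a(o) averages over the *other* objects of Ct
        a = dist[i, own].mean() if own.any() else 0.0
        # b(o): smallest average distance to any other cluster Cl
        b = min(dist[i, labels == l].mean()
                for l in set(labels) if l != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

Values near 1 indicate compact, well-separated clusters; values near −1 indicate likely misassigned objects. Averaging s(o) over all objects gives a single quality score for the whole clustering.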
Cluster Analysis: Basic Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Summary
Cluster analysis groups objects based on their similarity and has wide
applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods, and
model-based methods
K-means and K-medoids algorithms are popular partitioning-based clustering
algorithms
Birch and Chameleon are interesting hierarchical clustering algorithms, and
there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace
clustering algorithm
Quality of clustering results can be evaluated in various ways
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications.
SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering
points to identify the clustering structure, SIGMOD’99.
F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. KDD'02.
M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying
density-based local outliers. SIGMOD'00.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering categorical
data using summaries. KDD'99.
References (2)
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering
algorithm for large databases. SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering
algorithm for categorical attributes. In ICDE'99, pp. 512-521,
Sydney, Australia, March 1999.
A. Hinneburg and D. A. Keim. An efficient approach to clustering in
large multimedia databases with noise. KDD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice
Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8):
68-75, 1999.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an
Introduction to Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in
large datasets. VLDB’98.
References (3)
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and
Applications to Clustering. John Wiley & Sons, 1988.
R. Ng and J. Han. Efficient and effective clustering method for spatial data
mining. VLDB'94.
L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional
Data: A Review, SIGKDD Explorations, 6(1), June 2004
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for
very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-
resolution clustering approach for very large spatial databases. VLDB’98.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based
Clustering in Large Databases, ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of
Obstacles, ICDE'01
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in
large data sets. SIGMOD'02.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid
approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data
clustering method for very large databases. SIGMOD'96.
X. Yin, J. Han, and P. S. Yu. LinkClus: Efficient clustering via
heterogeneous semantic links. Proc. 2006 Int. Conf. on Very Large Data
Bases (VLDB'06), Seoul, Korea, Sept. 2006.