Clustering Evaluation

The document discusses cluster analysis, outlining its basic concepts, methods, and evaluation techniques. It categorizes clustering algorithms into partitioning, hierarchical, density-based, and grid-based methods, highlighting popular examples like K-means and DBSCAN. Additionally, it covers methods for assessing clustering quality, both extrinsic and intrinsic, and provides references for further reading.

Università degli Studi di Milano

Master's Degree in Computer Science

Information Management
course
Teacher: Alberto Ceselli

Lecture 21: 09/12/2014


Data Mining:
Concepts and Techniques
(3rd ed.)

— Chapter 10 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Cluster Analysis: Basic Concepts and Methods
• Cluster Analysis: Basic Concepts
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Evaluation of Clustering
• Summary

Assessing Clustering Tendency
• Assess whether non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
• Test spatial randomness by a statistical test: the Hopkins Statistic
  • Given a dataset D regarded as a sample of a random variable z, determine how far away z is from being uniformly distributed in the data space
  • Sample n points, p1, …, pn, uniformly from the feature space of D. For each pi, find its nearest neighbor in D: yi = min{dist(pi, v) : v ∈ D}
  • Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D − {qi}: xi = min{dist(qi, v) : v ∈ D, v ≠ qi}
  • Calculate the Hopkins Statistic:

    H = Σi yi / (Σi xi + Σi yi)

  • If z (and so D) is uniformly distributed, Σ xi and Σ yi are close to each other and H is close to 0.5
  • If D is clustered, H is close to 1
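
For concreteness, here is a minimal sketch of the Hopkins statistic as defined above, assuming NumPy and Euclidean distances; the function name, the default sample size, and the use of the data's bounding box as the sampling region for the feature space are illustrative choices, not from the slides.

```python
import numpy as np

def hopkins_statistic(D, n=50, rng=None):
    """Hopkins statistic H = sum(y_i) / (sum(x_i) + sum(y_i)).

    H near 0.5 suggests D is close to uniformly distributed;
    H near 1 suggests clustering structure.
    """
    rng = np.random.default_rng(rng)
    D = np.asarray(D, dtype=float)
    m, d = D.shape
    n = min(n, m - 1)

    # y_i: nearest-neighbor distance in D for n points drawn
    # uniformly from the bounding box of D's feature space
    # (the bounding box is an illustrative choice).
    P = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, d))
    y = np.array([np.linalg.norm(D - p, axis=1).min() for p in P])

    # x_i: nearest-neighbor distance in D - {q_i} for n points
    # q_i sampled without replacement from D itself;
    # np.partition(..., 1)[1] skips the zero distance to q_i.
    Q = rng.choice(m, size=n, replace=False)
    x = np.array([np.partition(np.linalg.norm(D - D[i], axis=1), 1)[1]
                  for i in Q])

    return y.sum() / (x.sum() + y.sum())
```

On clustered data H approaches 1, while on uniform noise it stays near 0.5.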

Determine the Number of Clusters
• Empirical method
  • # of clusters ≈ sqrt(n/2) for a dataset of n points
• Elbow method
  • Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters (see the sketch below)
• Cross-validation method
  • Divide a given data set into m parts
  • Use m − 1 parts to obtain a clustering model
  • Use the remaining part to test the quality of the clustering
    • E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
  • For any k > 0, repeat it m times, compare the overall quality measures w.r.t. different k's, and find the # of clusters that best fits the data
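
A minimal sketch of the elbow method, assuming scikit-learn is available: KMeans exposes the within-cluster sum of squares as inertia_, so printing (or plotting) it against k and looking for the turning point implements the heuristic above. The helper name elbow_curve and the synthetic three-blob data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10, random_state=0):
    """Within-cluster sum of squares (KMeans inertia_) for k = 1..k_max;
    the 'elbow' is the k where the curve's decrease flattens out."""
    return [(k, KMeans(n_clusters=k, n_init=10,
                       random_state=random_state).fit(X).inertia_)
            for k in range(1, k_max + 1)]

# Three well-separated synthetic blobs: the curve should bend at k = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0.0, 5.0, 10.0)])
for k, sse in elbow_curve(X):
    print(k, round(sse, 1))
```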

Measuring Clustering Quality
• Two methods: extrinsic vs. intrinsic
  • Extrinsic: supervised, i.e., the ground truth (an ideal clustering, e.g. built by domain experts) is available
    • Compare a clustering against the ground truth using a certain clustering quality measure
    • Ex. BCubed precision and recall metrics
  • Intrinsic: unsupervised, i.e., the ground truth is unavailable
    • Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are
    • Ex. Silhouette coefficient

Measuring Clustering Quality: Extrinsic Methods
• Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg
• Q is good if it satisfies the following 4 essential criteria
  • Cluster homogeneity: the purer, the better
  • Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster
  • Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category)
  • Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces

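The BCubed precision and recall metrics mentioned earlier reward these criteria. Below is a minimal sketch based on the standard per-object definitions, which the slides do not spell out: for each object, precision is the fraction of its cluster sharing its ground-truth category, and recall is the fraction of its category sharing its cluster, both averaged over all objects.

```python
def bcubed(labels, clusters):
    """BCubed precision and recall, averaged over all objects.

    labels: ground-truth category per object; clusters: cluster id
    per object. Per object o, precision is the fraction of o's
    cluster sharing o's category; recall is the fraction of o's
    category sharing o's cluster.
    """
    n = len(labels)
    precision = recall = 0.0
    for i in range(n):
        in_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        in_category = [j for j in range(n) if labels[j] == labels[i]]
        correct = sum(1 for j in in_cluster if labels[j] == labels[i])
        precision += correct / len(in_cluster)
        recall += correct / len(in_category)
    return precision / n, recall / n

# One 'b' object absorbed into the 'a' cluster hurts precision
# (the merged cluster is impure) more than recall.
print(bcubed(['a', 'a', 'a', 'b', 'b'], [0, 0, 0, 0, 1]))  # (0.7, 0.8)
```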

Measuring Clustering Quality: Intrinsic Methods
• Silhouette coefficient: a similarity metric between objects in the data set
  • Let C1, …, Ck be the clusters
  • For each object o in a certain cluster Ct:
    • let a(o) be the average distance between o and the objects of Ct
    • let bl(o) be the average distance between o and the objects of cluster Cl; then b(o) = min over l ≠ t of bl(o)
  • The silhouette coefficient is defined as follows:

    s(o) = (b(o) − a(o)) / max(a(o), b(o))

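A minimal sketch of the silhouette coefficient, applying the formula above to every object with Euclidean distances; it assumes at least two clusters, and the handling of singleton clusters is an illustrative choice (scikit-learn's silhouette_score returns the mean coefficient directly).

```python
import numpy as np

def silhouette(X, labels):
    """s(o) = (b(o) - a(o)) / max(a(o), b(o)) for every object o,
    using Euclidean distances."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    s = np.empty(len(X))
    for i, t in enumerate(labels):
        dists = np.linalg.norm(X - X[i], axis=1)
        same = labels == t
        # a(o): average distance to the other members of o's cluster
        # (max(..., 1) sidesteps division by zero for singletons).
        a = dists[same].sum() / max(same.sum() - 1, 1)
        # b(o): smallest average distance to any other cluster.
        b = min(dists[labels == l].mean() for l in set(labels) - {t})
        s[i] = (b - a) / max(a, b)
    return s
```

Averaging s(o) over all objects gives a single quality score: values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned objects.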

Cluster Analysis: Basic Concepts and Methods
• Cluster Analysis: Basic Concepts
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Evaluation of Clustering
• Summary

Summary
• Cluster analysis groups objects based on their similarity and has wide applications
• Measures of similarity can be computed for various types of data
• Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
• K-means and K-medoids are popular partitioning-based clustering algorithms
• BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
• DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
• STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
• The quality of clustering results can be evaluated in various ways

References (1)
• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
• M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
• M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
• F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. KDD'02.
• M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD'00.
• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
• D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
• D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
• V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering categorical data using summaries. KDD'99.

References (2)
• S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
• S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. ICDE'99, pp. 512-521, Sydney, Australia, March 1999.
• A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. KDD'98.
• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
• G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68-75, 1999.
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
• E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.

References (3)
• G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
• R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
• L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6(1), June 2004.
• E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition.
• G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
• A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large databases. ICDT'01.
• A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE'01.
• H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
• W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
• T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
• X. Yin, J. Han, and P. S. Yu. LinkClus: Efficient clustering via heterogeneous semantic links. Proc. 2006 Int. Conf. on Very Large Data Bases (VLDB'06), Seoul, Korea, Sept. 2006.