Università degli Studi di Milano
Master's Degree in Computer Science
Information Management course
Teacher: Alberto Ceselli
Lecture 21: 09/12/2014
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 10 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Cluster Analysis: Basic Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Assessing Clustering Tendency
Assess whether a non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
Test spatial randomness with a statistical test: the Hopkins statistic
Given a dataset D regarded as a sample of a random variable z,
determine how far away z is from being uniformly distributed in
the data space
Sample n points, p1, …, pn, uniformly from the feature space of D. For
each pi, find its nearest neighbor in D: yi = min{dist(pi, v) : v ∈ D}
Sample n points, q1, …, qn, uniformly from D. For each qi, find its
nearest neighbor in D − {qi}: xi = min{dist(qi, v) : v ∈ D, v ≠ qi}
Calculate the Hopkins statistic:
H = ∑ yi / (∑ xi + ∑ yi)
If z (and so D) is uniformly distributed, ∑ xi and ∑ yi are close to
each other and H is close to 0.5.
If D is clustered, H is close to 1 (a sketch of the computation
follows below)
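As a concrete illustration, here is a minimal Python sketch of the Hopkins statistic as defined above, assuming NumPy and SciPy are available. The bounding-box approximation of the feature space and the function name hopkins are illustrative choices, not part of the original slides.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, n=50, seed=None):
    """Hopkins statistic (sketch): H near 0.5 suggests uniform data,
    H near 1 suggests a clustered structure (convention of these slides)."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    tree = cKDTree(D)

    # y_i: distance from each of n points sampled uniformly from the
    # feature space (approximated by D's bounding box) to its nearest
    # neighbor in D
    p = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, D.shape[1]))
    y, _ = tree.query(p, k=1)

    # x_i: distance from each of n points q_i sampled from D to its
    # nearest neighbor in D - {q_i} (k=2: the closest hit is q_i itself)
    q = D[rng.choice(len(D), size=n, replace=False)]
    d, _ = tree.query(q, k=2)
    x = d[:, 1]

    return y.sum() / (x.sum() + y.sum())
```

On uniformly scattered data the two sums are comparable and H ≈ 0.5; on clustered data the within-data nearest-neighbor distances xi shrink and H moves toward 1.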
Determine the Number of Clusters
Empirical method
# of clusters ≈ √(n/2) for a dataset of n points
Elbow method
Use the turning point in the curve of the sum of within-cluster
variance w.r.t. the # of clusters (see the sketch after this list)
Cross validation method
Divide a given data set into m parts
Use m – 1 parts to obtain a clustering model
Use the remaining part to test the quality of the clustering
E.g., for each point in the test set, find the closest
centroid, and use the sum of squared distances between
all points in the test set and their closest centroids to
measure how well the model fits the test set
For each k > 0, repeat the procedure m times, compare the overall
quality measures obtained for the different k's, and pick the # of
clusters that fits the data best (a sketch follows below)
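A minimal sketch of the elbow method, assuming scikit-learn is available: KMeans exposes the sum of squared distances to the closest centroids as inertia_, so the within-cluster variance curve can be traced directly. The function name and parameters are illustrative.

```python
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10, seed=0):
    """Within-cluster sum of squares for k = 1..k_max; plot it and look
    for the turning point (elbow) where larger k's barely help."""
    return [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
            for k in range(1, k_max + 1)]
```

And a sketch of the cross-validation method under the same assumptions (X a NumPy array): fit on m − 1 folds, then score the held-out fold by the sum of squared distances from each test point to its closest training centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_quality(X, k, m=5, seed=0):
    """m-fold cross-validation quality for k clusters (lower is better)."""
    total = 0.0
    for train, test in KFold(n_splits=m, shuffle=True,
                             random_state=seed).split(X):
        # fit on m - 1 parts, keep the resulting centroids
        centroids = KMeans(n_clusters=k, n_init=10,
                           random_state=seed).fit(X[train]).cluster_centers_
        # squared distance of each test point to its closest centroid
        d = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=-1)
        total += (d.min(axis=1) ** 2).sum()
    return total
```

Comparing cv_quality(X, k) over a range of k's and keeping the best-scoring k mirrors the procedure described in the slide.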
Measuring Clustering Quality
Two methods: extrinsic vs. intrinsic
Extrinsic: supervised, i.e., the ground truth (ideal
clustering, e.g. built by domain experts) is available
Compare a clustering against the ground truth using
certain clustering quality measure
Ex. BCubed precision and recall metrics (see the sketch after this list)
Intrinsic: unsupervised, i.e., the ground truth is
unavailable
Evaluate the goodness of a clustering by considering
how well the clusters are separated, and how compact
the clusters are
Ex. Silhouette coefficient
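For the extrinsic case, here is a minimal Python sketch of BCubed precision and recall; the input representation (two parallel lists of cluster ids and ground-truth category labels) is an assumption of this sketch.

```python
from collections import Counter

def bcubed(clusters, categories):
    """Per-object BCubed scores, averaged over the data set.

    For an object o: precision(o) = fraction of objects in o's cluster
    that share o's category; recall(o) = fraction of objects in o's
    category that were placed in o's cluster (o itself included).
    """
    n = len(clusters)
    joint = Counter(zip(clusters, categories))  # |cluster c ∩ category g|
    cluster_size = Counter(clusters)
    category_size = Counter(categories)
    precision = sum(joint[c, g] / cluster_size[c]
                    for c, g in zip(clusters, categories)) / n
    recall = sum(joint[c, g] / category_size[g]
                 for c, g in zip(clusters, categories)) / n
    return precision, recall
```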
Measuring Clustering Quality: Extrinsic Methods
Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg
Q is good if it satisfies the following four essential criteria
Cluster homogeneity: the purer, the better
Cluster completeness: objects belonging to the same
category in the ground truth should be assigned to the
same cluster
Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into
a rag bag (i.e., “miscellaneous” or “other” category)
Small cluster preservation: splitting a small category
into pieces is more harmful than splitting a large
category into pieces
Measuring Clustering Quality: Intrinsic Methods
Silhouette coefficient: evaluates compactness and separation
using a similarity (distance) metric between objects in the data set
Let C1, …, Ck be the clusters
For each object o in a cluster Ct:
let a(o) be the average distance between o and the
other objects of Ct
let bl(o) be the average distance between o and the
objects of cluster Cl; then b(o) = min{bl(o) : l ≠ t}
The silhouette coefficient is defined as follows:
s(o) = (b(o) − a(o)) / max{a(o), b(o)}
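A direct Python transcription of this definition, assuming NumPy, Euclidean distances, and at least two clusters; the O(n²) pairwise-distance matrix keeps the sketch short rather than efficient.

```python
import numpy as np

def silhouette(X, labels):
    """s(o) = (b(o) - a(o)) / max(a(o), b(o)) for every object o."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False  # a(o) averages over the *other* objects of Ct
        a = dist[i, own].mean() if own.any() else 0.0
        # b(o): smallest average distance to any other cluster Cl
        b = min(dist[i, labels == l].mean()
                for l in set(labels) if l != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

Values near 1 indicate compact, well-separated clusters; values near −1 indicate likely misassigned objects. Averaging s(o) over all objects gives a single quality score for the whole clustering.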
Cluster Analysis: Basic Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Summary
Cluster analysis groups objects based on their similarity and has wide
applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods, and
model-based methods
K-means and K-medoids algorithms are popular partitioning-based clustering
algorithms
Birch and Chameleon are interesting hierarchical clustering algorithms, and
there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace
clustering algorithm
Quality of clustering results can be evaluated in various ways
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications.
SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering
points to identify the clustering structure, SIGMOD’99.
F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. KDD'02.
M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying
density-based local outliers. SIGMOD'00.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering categorical
data using summaries. KDD'99.
References (2)
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering
algorithm for large databases. SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering
algorithm for categorical attributes. In ICDE'99, pp. 512-521,
Sydney, Australia, March 1999.
A. Hinneburg and D. A. Keim. An efficient approach to clustering in
large multimedia databases with noise. KDD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice
Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8):
68-75, 1999.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an
Introduction to Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in
large datasets. VLDB’98.
References (3)
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and
Applications to Clustering. John Wiley & Sons, 1988.
R. Ng and J. Han. Efficient and effective clustering method for spatial data
mining. VLDB'94.
L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional
Data: A Review, SIGKDD Explorations, 6(1), June 2004
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for
very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-
resolution clustering approach for very large spatial databases. VLDB’98.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based
Clustering in Large Databases, ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of
Obstacles, ICDE'01
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in
large data sets. SIGMOD'02.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid
approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data
clustering method for very large databases. SIGMOD'96.
X. Yin, J. Han, and P. S. Yu. LinkClus: Efficient clustering via
heterogeneous semantic links. Proc. 2006 Int. Conf. on Very Large Data
Bases (VLDB'06), Seoul, Korea, Sept. 2006.