Hierarchical Clustering: Methods, Formalism, and Contemporary Challenges

Abstract:
Hierarchical clustering, a fundamental unsupervised learning technique, constructs a hierarchy of partitions either by iterative
mergers (agglomerative) or divisions (divisive) of clusters. This paper presents a mathematically rigorous treatment of hierarchical
clustering, detailing the core principles, distance metrics, linkage criteria, computational complexities, and practical implications.
Formal notations and proofs are provided where applicable, and we examine recent enhancements addressing the scalability and
interpretability challenges inherent to traditional hierarchical models.

Keywords:
Hierarchical Clustering, Agglomerative Clustering, Divisive Clustering, Dendrogram, Linkage Methods, Distance Metrics,
Mathematical Formalism

1. Introduction
Clustering is a form of unsupervised learning in which the objective is to partition a set $X = \{x_1, x_2, \dots, x_n\} \subseteq \mathbb{R}^d$ into groups, or clusters, such that intra-cluster similarity is maximized and inter-cluster similarity is minimized. Hierarchical clustering approaches this task by successively merging or dividing clusters, resulting in a nested structure represented by a dendrogram. Unlike flat methods, hierarchical clustering does not require the number of clusters $k$ to be specified in advance.

2. Mathematical Foundations
2.1 Problem Definition
Given a finite dataset $X \subseteq \mathbb{R}^d$, the goal of hierarchical clustering is to construct a sequence of partitions

$$\mathcal{P}_0, \mathcal{P}_1, \dots, \mathcal{P}_{n-1}$$

where:

• $\mathcal{P}_0 = \{\{x_1\}, \{x_2\}, \dots, \{x_n\}\}$ (all points as singleton clusters)
• $\mathcal{P}_{n-1} = \{X\}$ (all points merged into a single cluster)
• Each partition $\mathcal{P}_{i+1}$ is obtained from $\mathcal{P}_i$ by merging two of its clusters.

In divisive clustering, the sequence is reversed.
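As a small worked example, suppose $n = 3$ and (purely by assumption) $x_1$ and $x_2$ form the closest pair; the agglomerative sequence of partitions is then

$$\mathcal{P}_0 = \{\{x_1\}, \{x_2\}, \{x_3\}\}, \quad \mathcal{P}_1 = \{\{x_1, x_2\}, \{x_3\}\}, \quad \mathcal{P}_2 = \{\{x_1, x_2, x_3\}\}.$$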

2.2 Distance Functions


A distance function (or dissimilarity measure) $d : X \times X \rightarrow \mathbb{R}_{\geq 0}$ must satisfy:

• (Non-negativity) $d(x, y) \geq 0$
• (Identity) $d(x, y) = 0 \iff x = y$
• (Symmetry) $d(x, y) = d(y, x)$
• (Triangle inequality) $d(x, z) \leq d(x, y) + d(y, z)$

Common distance metrics include:

• Euclidean distance: $d_{\text{Euclidean}}(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2}$
• Manhattan distance: $d_{\text{Manhattan}}(x_i, x_j) = \sum_{k=1}^{d} |x_{ik} - x_{jk}|$
• Cosine dissimilarity: $d_{\text{Cosine}}(x_i, x_j) = 1 - \dfrac{\langle x_i, x_j \rangle}{\lVert x_i \rVert \, \lVert x_j \rVert}$

where $\langle \cdot, \cdot \rangle$ denotes the dot product.
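A minimal NumPy sketch of these three dissimilarity measures (illustrative only; the function names and example vectors are assumptions):

```python
import numpy as np

def euclidean(xi, xj):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    # sum of absolute coordinate differences
    return np.sum(np.abs(xi - xj))

def cosine_dissimilarity(xi, xj):
    # 1 minus the cosine of the angle between the two vectors
    return 1.0 - np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([2.0, 0.0, 4.0])
print(euclidean(xi, xj), manhattan(xi, xj), cosine_dissimilarity(xi, xj))
```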

3. Hierarchical Agglomerative Clustering (HAC)


3.1 Algorithmic Structure
HAC begins with each data point in its own cluster. At each iteration:

1. Compute the pairwise distances between all clusters.
2. Merge the two clusters $A$ and $B$ such that
$$(A, B) = \arg\min_{(C_i, C_j)} d_{\text{linkage}}(C_i, C_j)$$
3. Update the distance matrix.

The algorithm terminates when a single cluster remains.
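A minimal, unoptimized sketch of this loop using single linkage on a NumPy array (the function name, variable names, and toy data are assumptions; this illustrates the three steps, it is not an efficient implementation):

```python
import numpy as np

def naive_single_linkage_hac(X):
    """Naive agglomerative clustering with single linkage.
    Returns the list of merges as (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(X))]                      # start: one singleton per point
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise point distances
    merges = []
    while len(clusters) > 1:
        # Step 1: cluster distance = minimum over member pairs (single linkage)
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        # Step 2: merge the closest pair of clusters
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        # Step 3: the cluster membership lists serve as the updated structure here
    return merges

X = np.random.rand(8, 2)
for a, b, d in naive_single_linkage_hac(X):
    print(a, b, round(d, 3))
```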

3.2 Linkage Criteria Formalism


Let $C_i, C_j \subseteq X$ be two clusters. Different linkage methods define the distance $d_{\text{linkage}}(C_i, C_j)$ between clusters:

• Single linkage (minimum distance): $d_{\text{single}}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$
• Complete linkage (maximum distance): $d_{\text{complete}}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$
• Average linkage (mean distance): $d_{\text{average}}(C_i, C_j) = \dfrac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$
• Ward's linkage (increase in variance): Ward's method merges, at each step, the pair of clusters whose union causes the smallest increase in total within-cluster variance. The resulting distance between clusters $C_i$ and $C_j$ is

$$d_{\text{Ward}}(C_i, C_j) = \frac{|C_i|\,|C_j|}{|C_i| + |C_j|} \, \lVert \mu_i - \mu_j \rVert^2$$

where $\mu_i$ and $\mu_j$ are the centroids of clusters $C_i$ and $C_j$, respectively.
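All four criteria are available through SciPy's hierarchy module; the following sketch compares them on random toy data (the dataset and variable names are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.random.rand(20, 3)             # toy dataset: 20 points in R^3
dists = pdist(X, metric='euclidean')  # condensed pairwise distance vector

for method in ('single', 'complete', 'average', 'ward'):
    # Z is the (n-1) x 4 merge history: indices of the two merged clusters,
    # the merge distance, and the size of the newly formed cluster
    Z = linkage(dists, method=method)
    print(method, 'final merge distance:', round(Z[-1, 2], 3))
```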

4. Divisive Hierarchical Clustering


Divisive clustering begins with all points in one cluster and recursively splits clusters.

At each step:

• Choose the cluster $C$ with the highest internal dissimilarity.
• Partition $C$ into two clusters $C_1, C_2$ by optimizing a criterion such as maximum dissimilarity or minimizing the within-cluster sum of squares (WCSS):

$$\text{WCSS}(C) = \sum_{x \in C} \lVert x - \mu_C \rVert^2$$

Due to its computational cost ($O(2^n)$ in worst-case, exhaustive-split scenarios), divisive methods are less popular than agglomerative ones.
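In practice, exhaustive splitting is avoided. One common heuristic, shown below purely as an illustrative sketch (the bisecting splitter, function names, and toy data are assumptions, not the method prescribed above), repeatedly bisects the cluster with the largest WCSS:

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss(C):
    # within-cluster sum of squares around the cluster centroid
    return np.sum((C - C.mean(axis=0)) ** 2)

def bisecting_divisive(X, n_clusters):
    """Heuristic divisive clustering: repeatedly bisect the cluster with the largest WCSS."""
    clusters = [X]
    while len(clusters) < n_clusters:
        # choose the cluster with the highest internal dissimilarity (largest WCSS)
        i = int(np.argmax([wcss(C) for C in clusters]))
        C = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(C)
        clusters.extend([C[labels == 0], C[labels == 1]])
    return clusters

parts = bisecting_divisive(np.random.rand(50, 2), n_clusters=4)
print([len(p) for p in parts])
```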

5. Dendrograms and Cluster Extraction


A dendrogram is a tree $T$ where:

• Leaves correspond to data points $x_i \in X$.
• Internal nodes represent cluster merges, annotated with their merge distances.

Cutting the dendrogram at a threshold $t$ yields a flat clustering. Mathematically, if $\mathrm{height}(u)$ denotes the merge height of node $u$, the set of clusters at threshold $t$ corresponds to the connected components of the subgraph

$$\{\, u \in T \mid \mathrm{height}(u) \leq t \,\}$$
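In SciPy, this cut corresponds to fcluster with the 'distance' criterion applied to a linkage matrix; a brief sketch (the threshold value and toy data are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)
Z = linkage(X, method='average')      # merge heights are stored in Z[:, 2]

t = 0.4                               # assumed threshold on merge height
labels = fcluster(Z, t=t, criterion='distance')
print('clusters at threshold', t, ':', len(np.unique(labels)))
```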

6. Computational Complexity
The naïve implementation of HAC requires:

• Distance matrix computation: $O(n^2)$
• Merge operations: $O(n^3)$ in the worst case (updating all pairwise distances at each merge step)

Using data structures such as priority queues, or optimizations such as the nearest-neighbor chain algorithm, this can be reduced to $O(n^2)$.

Divisive methods, especially exhaustive splits, can have exponential complexity.

7. Evaluation Metrics
7.1 Silhouette Coefficient
Given a point $x_i$:

• Let $a(i)$ be the average distance from $x_i$ to all other points in the same cluster.
• Let $b(i)$ be the minimum, over all other clusters, of the average distance from $x_i$ to the points of that cluster.

The silhouette coefficient $s(i)$ is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where $-1 \leq s(i) \leq 1$.
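scikit-learn computes both the per-point and the averaged coefficient; a short sketch scoring an agglomerative clustering (the cluster count and toy data are assumptions):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.rand(100, 2)
labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

print('mean silhouette:', silhouette_score(X, labels))        # average s(i) over all points
print('first five s(i):', silhouette_samples(X, labels)[:5])  # per-point coefficients
```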

7.2 Cophenetic Correlation Coefficient


Measures how faithfully a dendrogram preserves the original pairwise distances:

$$\text{Cophenetic Correlation} = \mathrm{corr}\big(d(x_i, x_j),\ d_{\text{cophenetic}}(x_i, x_j)\big)$$

where $d_{\text{cophenetic}}(x_i, x_j)$ is the dendrogram merge height at which $x_i$ and $x_j$ first join the same cluster.
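SciPy's cophenet returns both the coefficient and the cophenetic distances when given the original condensed distances; a brief sketch (toy data and linkage choice are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(40, 3)
d = pdist(X)                          # original pairwise distances
Z = linkage(d, method='average')

# c is the cophenetic correlation coefficient; coph_d are the cophenetic distances
c, coph_d = cophenet(Z, d)
print('cophenetic correlation:', round(c, 3))
```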

8. Limitations and Advances


8.1 Scalability
Techniques such as BIRCH and CURE address memory and runtime issues by:

• compressing data using clustering features (CF) in BIRCH (see the sketch after this list);
• representing clusters via multiple representative points in CURE.
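A minimal BIRCH sketch using scikit-learn's Birch estimator (the threshold, branching factor, cluster count, and toy data are assumed values chosen only for illustration):

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(10_000, 2)

# Birch builds a CF-tree that compresses the data before a final global clustering step
model = Birch(threshold=0.1, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)
print('cluster sizes:', np.bincount(labels))
```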

8.2 Robustness to Noise


Standard hierarchical clustering is sensitive to noise and outliers. Density-based variants like HDBSCAN improve noise handling
by incorporating density estimates into cluster formation.
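As an illustrative sketch (assuming scikit-learn ≥ 1.3, which ships an HDBSCAN estimator; the min_cluster_size value and toy data are assumptions), points that HDBSCAN treats as noise receive the label -1:

```python
import numpy as np
from sklearn.cluster import HDBSCAN  # available in scikit-learn >= 1.3

X = np.random.rand(500, 2)
labels = HDBSCAN(min_cluster_size=15).fit_predict(X)

n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('clusters:', n_clusters, 'noise points:', n_noise)
```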

8.3 Hybrid Approaches


Recent works combine hierarchical pre-clustering with deep autoencoders, resulting in models such as DeepCluster
and DEC-Hierarchy, where representations are learned jointly with clustering.

9. Conclusion
Hierarchical clustering offers a mathematically rich, flexible framework for exploratory data analysis. Despite computational
challenges, it remains valuable for its interpretability and lack of assumptions regarding the number of clusters. Future research
must further address efficiency, robustness, and integration with deep learning paradigms.
