Clustering
Most of the topics are based on the textbook Introduction to Data Mining by Tan et al. and
on video lectures by Andrew Ng.
Clustering is an unsupervised learning technique.
Clusters are potential classes, and cluster analysis is the study of techniques for automatically
finding classes.
Dividing objects into groups is clustering; assigning particular objects to these groups is called
classification.
Clustering is similar to classification, except that here we do not know the labels.
Examples: human genome analysis, social network analysis, market segmentation and astronomical
data analysis.
So, why can't we get labels?
They are too costly to obtain – e.g., labelling Amazon web pages
They don't exist (there is no ground truth) – e.g., classification of animals into species
We can represent each object by the index of the prototype associated with it. This type of
compression is called vector quantization
Cluster validity – methods for evaluating the goodness of the clusters produced by a clustering
algorithm.
The greater the similarity within a group and the greater the difference between groups, the
better or more distinct the clustering.
The definition of a cluster is imprecise, and the best definition depends on the nature of the
data and the desired results.
Clustering can be used either for utility or for understanding
Understanding:
Clustering is used in biology, information retrieval, climate science, business, etc. to
understand patterns in the data.
Utility: here cluster analysis is only the starting point for other purposes, such as:
Summarization
Compression
Efficiently finding nearest neighbours
Different types of clustering:
Hierarchical versus partitional clustering: a partitional clustering divides the objects into
non-overlapping subsets, while a hierarchical clustering is a set of nested clusters organized
as a tree.
Exclusive: assigning each object to a single cluster
Overlapping or non-exclusive clustering: used in situations in which a point can reasonably be
placed in more than one cluster
Fuzzy: Every object belongs to every cluster with a membership weight that is between 0 and 1
(all the weights must sum to 1)
Probabilistic clustering computes the probability with which each point belongs to each cluster
(all the probabilities must sum to 1)
Complete and partial clusterings: a complete clustering assigns every object to a cluster,
whereas a partial clustering does not.
Types of clusters:
Well-separated
Prototype-based (centre-based clusters)
Graph-based (connected components)
Density-based
Shared property
There are three major algorithms used in clustering:
1) K-means
2) DBSCAN
3) Agglomerative hierarchical clustering
In this exercise we mainly concentrate on K-means clustering, which is a prototype-based,
partitional clustering method.
Algorithm:
1: Select K points as initial centroids
2: repeat
3: Form K clusters by assigning each point to its closest centroid
4: Recompute the centroid of each cluster
5: until centroids do not change
Step 5 is often replaced by a weaker condition, e.g., repeat until fewer than 1% of the points
change clusters.
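As an illustration, the loop above can be written directly in R. This is a minimal Lloyd-style
sketch (the name simple_kmeans and its arguments are illustrative, not from the lecture), and it
does not handle empty clusters (see Issues below):
simple_kmeans <- function(X, K, max_iter = 100) {
  # step 1: pick K data points at random as the initial centroids
  centroids <- X[sample(nrow(X), K), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # step 3: assign each point to its closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
    new_assignment <- apply(d, 1, which.min)
    # step 5: stop when no assignment (and hence no centroid) changes
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # step 4: recompute each centroid as the mean of its assigned points
    for (k in 1:K)
      centroids[k, ] <- colMeans(X[assignment == k, , drop = FALSE])
  }
  list(cluster = assignment, centers = centroids)
}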
K-means uses a proximity measure to assign each point to its closest centre; the proximity
measure characterizes the similarity or dissimilarity between objects.
Some proximity measures:
Euclidean and Manhattan – used for data points
Cosine and Jaccard - used for documents
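As a quick illustration, base R's dist() covers the Euclidean and Manhattan measures; cosine
similarity has to be computed by hand, since base dist() has no cosine method (the vectors here
are made up):
x1 <- c(1, 2, 3)
x2 <- c(5, 10, 4)
dist(rbind(x1, x2), method = "euclidean")   # Euclidean distance
dist(rbind(x1, x2), method = "manhattan")   # Manhattan distance
cos_sim <- sum(x1 * x2) / (sqrt(sum(x1^2)) * sqrt(sum(x2^2)))
1 - cos_sim                                 # a common cosine distance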
Objective: Minimise the SSE (sum of squared errors) from a point to its centroid
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(c_i, x)^2
where C_i is the i-th cluster and c_i is its centroid.
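This objective is easy to evaluate in R; the built-in kmeans() reports it as tot.withinss. A
small sketch (the argument names are illustrative):
# X: data matrix, cluster: assignment vector, centers: K x n centroid matrix
sse <- function(X, cluster, centers) {
  sum((X - centers[cluster, , drop = FALSE])^2)
}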
If we have the true output labels, we can calculate the accuracy with which objects are
allocated to their respective clusters.
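For example, one simple sketch in R cross-tabulates the cluster assignments against the labels
and credits each cluster with its majority label (cluster and labels are assumed to be given):
tab <- table(cluster, labels)              # clusters versus true labels
sum(apply(tab, 1, max)) / length(labels)   # majority-label accuracy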
Time and space complexity:
Storage required – O((m+K)n)
Time required – O(I*K*m*n) – linear in m
Here I is the number of iterations, K the number of clusters, m the number of data points and
n the number of attributes; I and K are usually much smaller than m.
[Figure: the k-means clustering process]
Programming in R:
1) Using the library
2) Implementing the algorithm
Prototype:
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
       trace = FALSE)
# use the parameters required by the application
# x and centers are required parameters; the other arguments are optional
kmeans(dataframe, number_of_clusters) – the form generally used
fit <- kmeans(data, 2) – here 2 is the number of clusters
fit$centers – the centre of each cluster
fit$cluster – which points belong to which cluster
fit$size – the number of points in each cluster
fit$iter – the number of iterations needed to converge
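Putting these together, a minimal runnable example on made-up data (the two Gaussian blobs are
only for illustration):
set.seed(42)       # fix the random initialization for reproducibility
data <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
              matrix(rnorm(50, mean = 5), ncol = 2))
fit <- kmeans(data, 2)
fit$centers        # centre of each cluster
fit$cluster        # which points belong to which cluster
fit$size           # number of points in each cluster
fit$iter           # iterations needed to converge
fit$tot.withinss   # total within-cluster SSE (the objective above)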
Cluster Evaluation:
The evaluation measures that are applied to judge various aspects of cluster validity are
traditionally classified into three types
Unsupervised:
Measures the goodness of cluster structure without respect to external information. An
example of this is SSE.
Supervised:
Measures the extent to which the clustering structure discovered by a clustering algorithm
matches some external structure, such as externally provided class labels.
Relative:
Compares different clusterings or clusters. A relative cluster evaluation measure is a
supervised or unsupervised evaluation measure used for the purpose of comparison.
Issues:
Choosing initial clusters
Choosing number of clusters
Handling empty clusters
Outliers
Choosing initial clusters:
One way of choosing initial clusters Random initialization
Randomly pick objects and set centroid equal to these objects
K means can converge to different solution depending on the centroid initialization (So random
centroid initialisation is important)
We want the global optimum; the clustering should not get stuck at a local optimum.
To address this, try multiple random initialisations and keep the solution whose cost function
value is lowest.
This is most helpful when the number of clusters is small.
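In R this is exactly what the nstart argument does: kmeans() is run with nstart different
random initializations and the result with the lowest total within-cluster SSE is returned
(data as in the earlier example):
fit <- kmeans(data, centers = 2, nstart = 25)   # 25 random initialisations
fit$tot.withinss                                # cost of the best run found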
Choosing number of clusters:
There is no single best way to do this; one option is visualisation.
Elbow method for choosing the number of clusters:
The distortion (cost function value) goes down as we increase the number of clusters.
Choose the number of clusters at the elbow point (before the elbow the distortion drops
rapidly; after the elbow it drops slowly), as sketched below.
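A sketch of the elbow method in R, reusing the data object from the earlier example: compute
the distortion for a range of K and look for the bend in the curve:
wss <- sapply(1:10, function(k)
  kmeans(data, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Distortion (total within-cluster SSE)")
# choose K at the elbow: the drop is rapid before it, slow after it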
Questions:
1) Consider the points x1 <- c(1,2,3,6) and x2 <- c(5,10,4,12). Compute the (Euclidean)
distance.
2) Consider the points x1 <- c(1,2) and x2 <- c(5,10). Compute the cosine distance.
3) Using the data x <- c(1,2,2.5,3,3.5,4,4.5,5,7,8,8.5,9,9.5,10), find the clusters with
k = 2.
4) For the given sonar_test.csv, find the clustering with k = 2 for the first two columns.
5) Find the testing error.
6) For the given sonar_test.csv, find the clustering with k = 2 for the entire data set and
find the testing error.
7) For the given sonar_test.csv, do hierarchical clustering with 4 cuts.
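A hedged starting point for questions 1-3 and 7 (questions 4-6 need sonar_test.csv, whose
location and label column are not specified here):
# Q1: Euclidean distance
x1 <- c(1, 2, 3, 6); x2 <- c(5, 10, 4, 12)
sqrt(sum((x1 - x2)^2))
# Q2: cosine distance
x1 <- c(1, 2); x2 <- c(5, 10)
1 - sum(x1 * x2) / (sqrt(sum(x1^2)) * sqrt(sum(x2^2)))
# Q3: k-means with k = 2 on one-dimensional data
x <- c(1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7, 8, 8.5, 9, 9.5, 10)
kmeans(x, 2)
# Q7: hierarchical clustering could start from hclust() and cutree(),
# e.g. cutree(hclust(dist(data)), k = 4)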