
MACHINE LEARNING (Part 2)

In Part 1 of this topic, you learnt that unsupervised learning uses unlabelled datasets and reveals the
structure of data. One of the tasks of unsupervised machine learning is clustering: the process of dividing an
available dataset into subsets that share some similarities. These subsets of data are called clusters; for
example, an enterprise can group its clients by income level (clients with high, average, and low income), and
a librarian can group books by theme (romance, science fiction, drama, etc.). Each cluster is therefore made of
one or more data objects and is characterized by two aspects:
• the similarity of data objects within the cluster;
• the dissimilarity of data objects between clusters.

Concept of similarity
The concept of similarity (sometimes called similarity metric, distance, proximity, or closeness) forms the
basis of many unsupervised learning algorithms. The similarity of data objects is determined from their
feature-based descriptions (so-called feature similarity), and the notion of similarity differs for different
data types (Kubat, 2017).

For continuous feature values, it is possible to calculate the geometric distance between any pair of data
objects: the closer the data objects are to each other, the greater their mutual similarity. The two most
common metrics are the Euclidean distance and the Manhattan distance. The Euclidean distance is calculated
using the following formula (Jones, 2009; Kubat, 2017):

d(x, y) = sqrt( (x1 – y1)^2 + (x2 – y2)^2 + … + (xn – yn)^2 )
where x = (x1, x2,…, xn) is a feature vector of a new data object;


xi is the value of an ith feature of a new data object;
y = (y1, y2,…, yn) is a feature vector of a data object from the training dataset;
yi is the value of an ith feature of a data object from the training dataset;
n is the number of features describing data objects.
Another distance metric for continuous feature values is the Manhattan distance, which calculates the sum of
absolute differences between all feature values of the data objects:

d(x, y) = |x1 – y1| + |x2 – y2| + … + |xn – yn|
where x = (x1, x2,…, xn) is a feature vector of a new data object;


xi is the value of an ith feature of a new data object;
y = (y1, y2,…, yn) is a feature vector of a data object from the training dataset;
yi is the value of an ith feature of a data object from the training dataset;
n is the number of features describing data objects.

Figure 1 shows examples of calculating the Euclidean and Manhattan distances for two data objects.

Fig. 1. Calculating the Euclidean and Manhattan distances
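
As an illustration of the two metrics, the following minimal Python sketch computes both distances for a pair
of feature vectors; the vectors are invented for the example and are not the values used in Figure 1.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared feature differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    """Sum of absolute feature differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Illustrative feature vectors of two data objects (not taken from Figure 1)
x = [7.5, 2.0]
y = [6.0, 4.0]

print(euclidean_distance(x, y))  # 2.5
print(manhattan_distance(x, y))  # 3.5
```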

For categorical feature values, the Hamming distance is applied to calculate the number of features on
which two data objects differ (Russell & Norvig, 2010). The fewer the differences in features, the greater
the similarity of data objects. If the values of the same feature of both data objects are the same, then
the distance is equal to 0; otherwise, it is equal to 1. Figure 2 gives examples of calculating the Hamming
distance between three data objects. Student 1 and Student 2 have different values of the feature x1, so
the distance for this feature is 1; they have the same value of the feature x2, so the distance is 0 for this
feature; and they have different values for the feature x3, so the distance is 1 for this feature. The
distances between Student 1 and Student 3 and between Student 2 and Student 3 are calculated in the same way.
As a result, one can conclude that Student 1 and Student 3 are the most similar data objects, as the distance
between them is the smallest of the calculated distances.

Fig. 2. Calculating the Hamming distance
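
A Hamming distance of this kind can be computed with a one-line helper; the grades below are invented for the
example rather than taken from Figure 2, but they reproduce the situation described above, where Student 1 and
Student 3 are the most similar pair.

```python
def hamming_distance(x, y):
    """Number of features on which two data objects differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

# Illustrative categorical feature vectors (features x1, x2, x3)
student_1 = ["A", "B", "C"]
student_2 = ["B", "B", "A"]
student_3 = ["A", "B", "B"]

print(hamming_distance(student_1, student_2))  # 2
print(hamming_distance(student_1, student_3))  # 1 -> the most similar pair
print(hamming_distance(student_2, student_3))  # 2
```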

As datasets quite often include mixed-type data – both continuous and categorical – we can cope with such data
by relying on the sum of the squared distances along the corresponding features (Kubat, 2017):

d(x, y) = d1(x1, y1) + d2(x2, y2) + … + dn(xn, yn)
where x = (x1, x2,…, xn) is a feature vector of a new data object;


xi is the value of an ith feature of a new data object;
y = (y1, y2,…, yn) is a feature vector of a data object from the training dataset;
yi is the value of an ith feature of a data object from the training dataset;
n is the number of features describing data objects;
di(xi, yi) = (xi – yi)^2 for continuous features;
di(xi, yi) = 0 if xi = yi, and 1 otherwise, for categorical features (Hamming distance).

Figure 3 demonstrates the calculation of the distance for mixed-type feature values. The features
“Average mark in the semester work” and “The number of missed lectures” have continuous values,
therefore their contribution to the distance is the squared difference of the feature values. In turn, the
features that represent the student's evaluations in the laboratory works have categorical values, so the
Hamming distance is used to determine whether the feature values of the data objects are the same or
different. The values of the data objects for the feature “Laboratory work 1” are the same, so the
Hamming distance is equal to 0. The values of the data objects for the feature “Laboratory work 2” are
different, therefore the Hamming distance is equal to 1.

Fig.3. Calculating the distance for mixed-type feature values
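
The mixed-type distance can be sketched in Python as follows; the feature values and the flag list marking
which features are categorical are assumptions made for the example, not the data of Figure 3.

```python
def mixed_distance(x, y, is_categorical):
    """Per feature: squared difference for continuous values,
    0/1 Hamming contribution for categorical values."""
    total = 0.0
    for xi, yi, categorical in zip(x, y, is_categorical):
        if categorical:
            total += 0.0 if xi == yi else 1.0
        else:
            total += (xi - yi) ** 2
    return total

# Illustrative students: [average mark, missed lectures, lab work 1, lab work 2]
student_a = [7.5, 2, "passed", "passed"]
student_b = [6.5, 5, "passed", "failed"]
flags = [False, False, True, True]   # which features are categorical

print(mixed_distance(student_a, student_b, flags))  # 1 + 9 + 0 + 1 = 11.0
```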

K-Means clustering
The K-Means algorithm is one of the popular algorithms of unsupervised machine learning. According to
(Jones, 2009), “the algorithm is popular primarily because it works relatively well and is extremely simple
both to understand and to implement”. The algorithm is based on two central concepts:
• the concept of distance;
• the concept of a centroid.

A distance represents a similarity measure used to group data objects in clusters (see above). A
centroid is a centre of a cluster around which data objects are grouped based on their distance to the
centroid. It is the average of the current set of feature vectors within the cluster (Jones, 2009). It means
that the data objects that are close to the centroid in terms of distance make a cluster. The task of K-
Means is to minimize the sum of distances between data objects belonging to a cluster and the cluster
centroid. The algorithm maps each data object to only one cluster.

The K-Means algorithm is based on the following steps (Jones, 2009; Kubat, 2017; Tyugu, 2007):
1. Randomly select K data objects from the available dataset to serve as the initial centroids. K is a
hyperparameter of the algorithm: it determines how many clusters (K) the algorithm should create and is
usually chosen by the model developer using a trial-and-error approach or heuristic-based methods.
2. Starting with the first data object, for each data object in the dataset:
a) calculate the distance between the data object and each of the centroids;
b) find the smallest distance and assign the data object to the corresponding cluster.
3. Check whether the last data object in the dataset has been reached.
4. If not, retrieve the next data object and return to the distance calculation in Step 2.
5. When the last data object is reached, recalculate the centroid values by taking the average values of all
data objects belonging to each centroid's cluster.
6. Check whether the new centroid values differ from the previous values.
7. If they do, start the next iteration; otherwise, terminate.
These steps are represented in Figure 4, and a minimal Python sketch of them is given after the figure. The
typical termination criterion of the clustering process in the K-Means algorithm is the iteration in which data
objects no longer change their cluster membership (the centroid values do not change) (Jones, 2009; Kubat,
2017; Tyugu, 2007). It is also possible to terminate the clustering process after a pre-defined number of
iterations.

Fig.4. Steps of the K-Means algorithm
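
The steps above can be condensed into a short K-Means sketch in Python with NumPy. It is a minimal illustration
under simple assumptions (Euclidean distance, unchanged centroids as the stopping test, an upper bound on
iterations and an invented 2-D dataset), not the implementation from the cited sources.

```python
import numpy as np

def k_means(data, k, max_iterations=100, seed=0):
    """Minimal K-Means: random initial centroids, then assign-and-update loop."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K data objects at random as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iterations):
        # Steps 2-4: assign every data object to its nearest centroid.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 5: recalculate each centroid as the mean of its cluster.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Steps 6-7: stop when the centroid values no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Invented 2-D dataset with two visible groups
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```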

The algorithm has two main drawbacks (Jones, 2009; Kubat, 2017; Tyugu, 2007):
• it is necessary to define the number of clusters before the clustering process, which calls for serious
exploration of the data before clustering and intensive experimentation to check the performance of the
algorithm with different numbers of clusters;
• the initialization of the centroids can also be problematic.

Hierarchical clustering is another unsupervised machine learning algorithm; unlike K-Means, it does not
require the developer to assume a number of clusters in advance.

Hierarchical clustering
Hierarchical clustering is about building a hierarchy of clusters that is characterized by the following
aspects (Hastie et al., 2017):
• the clusters at each level of the hierarchy are created by merging/dividing clusters at the next
lower level;
• at the lowest level, each cluster contains a single data object;
• at the highest level, there is only one cluster containing all of the data objects.

There are two types of hierarchical clustering (Hastie et al., 2017). Agglomerative hierarchical
clustering is a bottom-up approach to building a hierarchy of clusters. Each data object is initially
attributed to its own cluster. Then, at each level, a selected pair of clusters is recursively merged into a
single cluster, producing a grouping at the next higher level with one cluster fewer. The pair chosen for
merging consists of the two groups with the smallest intergroup dissimilarity (Hastie et al., 2017). A pair of
clusters is thus merged at each level of the hierarchy until one cluster containing all data objects is
obtained at the highest level. Figure 5 represents the process of agglomerative hierarchical clustering, in
which the clusters with the smallest distance between them are merged at each step of clustering.

Fig.5. Agglomerative hierarchical clustering

The steps of agglomerative hierarchical clustering can be described as follows (Fig. 6); a short sketch based
on SciPy is given after the figure:
1. Attribute each data object to its own cluster. This results in N clusters, where N is the number of data
objects in the dataset.
2. Calculate the distances between the clusters and merge the two clusters with the smallest distance between
them into one cluster. This step reduces the total number of clusters by 1.
3. Check whether all data objects form a single cluster. If yes, terminate; otherwise, return to Step 2.

Fig. 6. Steps of agglomerative hierarchical clustering
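
The same procedure is available off the shelf; as a sketch, SciPy's scipy.cluster.hierarchy.linkage performs
the merging loop of Steps 2-3 and records every merge it makes. The data values below are invented for
illustration, and 'single' linkage is only one possible choice of merging rule (see the linkage methods
discussed below).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Illustrative 2-D data objects (five small feature vectors)
data = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 5.2], [9.0, 1.0]])

# Each row of the result describes one merge: the indices of the two clusters
# merged, the distance between them, and the size of the newly formed cluster.
merges = linkage(data, method='single', metric='euclidean')
print(merges)
```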

Divisive hierarchical clustering applies a top-down clustering approach. It starts with a single cluster
containing all data objects and then, at each level of the hierarchy, recursively splits one of the existing
clusters at that level into two new clusters. The split is chosen to produce the two new groups with the
largest between-group dissimilarity (Hastie et al., 2017). The clustering stops when each data object forms its
own cluster. Figure 7 represents the simplified idea of divisive hierarchical clustering: at each step of
clustering, the data objects with the largest distance from the other data objects in a cluster are split off
into a separate cluster. According to Hastie et al. (2017), divisive hierarchical clustering has not been
studied nearly as extensively as agglomerative clustering in the clustering literature.

Fig.7. Divisive hierarchical clustering
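
The simplified idea shown in Figure 7 can be sketched in Python as follows: at every step, the largest
remaining cluster gives up the data object that lies farthest, on average, from its other members. This is a
deliberately simplified illustration of the top-down principle, not a standard divisive algorithm; the data and
the splitting rule are assumptions made for the example.

```python
import numpy as np

def simplified_divisive(data):
    """Top-down sketch: repeatedly split off the data object that is, on
    average, farthest from the other members of the largest cluster."""
    clusters = [list(range(len(data)))]       # start: one cluster with all objects
    hierarchy = [[c[:] for c in clusters]]    # remember the clusters at each level
    while any(len(c) > 1 for c in clusters):
        target = max((c for c in clusters if len(c) > 1), key=len)
        points = data[target]
        # mean distance of each member to the rest of its cluster
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        farthest = target[int(dists.mean(axis=1).argmax())]
        clusters.remove(target)
        clusters.append([i for i in target if i != farthest])
        clusters.append([farthest])
        hierarchy.append([c[:] for c in clusters])
    return hierarchy

# Invented 2-D data objects
data = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
for level in simplified_divisive(data):
    print(level)
```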

There are several methods to measure the similarity (distance) between clusters and thereby decide the rules
for merging clusters in agglomerative hierarchical clustering; they are often called linkage methods. The most
commonly used methods are (Hastie et al., 2017):
• Complete-linkage: the cluster similarity is the distance between the clusters' most distant elements (the
longest distance):

dCL(G, H) = max d(i, i') over all i in G and i' in H
where G, H – clusters;
d – distance;
i – a data object from G cluster;
i’ – a data object from H cluster.

• Single-linkage: the cluster similarity is the distance between the closest elements of the two clusters (the
shortest distance):

dSL(G, H) = min d(i, i') over all i in G and i' in H
where G, H – clusters;
d – distance;
i – a data object from G cluster;
i’ – a data object from H cluster.

• Average-linkage: the distance between two clusters is defined as the average distance from each data object
in one cluster to every data object in the other cluster:

dAL(G, H) = (1 / (NG · NH)) · Σ d(i, i'), summed over all i in G and i' in H
where G, H – clusters;
d – distance;
i – a data object from G cluster;
i’ – a data object from H cluster;
NG – the number of data objects in the G cluster;
NH – the number of data objects in the H cluster.

Different linkage methods lead to different clusters, and thus the choice of the method depends on the
developer.
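
The three linkage definitions can be compared numerically on a small example. The sketch below uses two
invented clusters G and H and plain NumPy; the maximum, minimum and mean of the pairwise distance matrix give
the complete-, single- and average-linkage distances, respectively.

```python
import numpy as np

def pairwise_distances(cluster_g, cluster_h):
    """Euclidean distances d(i, i') for every i in G and every i' in H."""
    return np.linalg.norm(cluster_g[:, None, :] - cluster_h[None, :, :], axis=2)

# Illustrative clusters G (2 data objects) and H (3 data objects)
G = np.array([[1.0, 1.0], [1.5, 1.2]])
H = np.array([[4.0, 4.0], [4.5, 3.8], [5.0, 4.2]])

d = pairwise_distances(G, H)
print("complete linkage:", d.max())   # longest pairwise distance
print("single linkage:  ", d.min())   # shortest pairwise distance
print("average linkage: ", d.mean())  # mean over all NG * NH pairs
```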

A dendrogram is an output of the hierarchical clustering algorithm. It is a tree-like diagram representing the
hierarchical relationships between data objects in the dataset. According to Hastie et al. (2017), “a
dendrogram provides a highly interpretable complete description of the hierarchical clustering in a graphical
format”. Thus, it provides insight into how the clusters were formed.

A dendrogram consists of (Figure 8):
• clades, which represent the merging of data objects into clusters; each clade has its height;
• leaves, the terminal ends of clades, which correspond to the data objects used in the clustering process.

Fig.8. Constituent parts of a dendrogram

The key to interpreting a dendrogram is to focus on the height of the clades, that is, the height at which two
data objects are merged. There are five data objects in Figure 8 – A, B, C, D and E. The clade height allows
one to draw the following conclusions:
• Data objects (leaves) in the same clade are more similar to each other than to other data objects. In the
dendrogram in Figure 8, data objects C and E are the most similar in terms of the distance between them, as
the height of the clade connecting them is the smallest. Data objects A and D are the next most similar data
objects.

• Clades with a slight difference in height indicate more similar data objects/clusters. Data objects C, E, A
and D are more similar to each other than they are to data object B, because the heights of their clades differ
only slightly.
• Clades with a greater difference in height indicate more distinct data objects/clusters. The most distinct
are data objects C and E on the one hand and data object B on the other, because the corresponding clades have
the biggest difference in height.
Thus, the greater the difference in the height of clades, the more dissimilar the clusters.

A horizontal cut-off line through the dendrogram is usually used to decide the number of clusters produced by
the hierarchical clustering algorithm. Cut-offs can be made at different levels of the hierarchy, leading to
different numbers of clusters. The number of vertical lines intersected by the horizontal cut-off line equals
the number of clusters. For example, in Figure 9, Cut-off 1 intersects two vertical lines, so we obtain two
clusters with the following data objects: (C, E, A, D) and (B). Cut-off 2 intersects three vertical lines, so
we obtain three clusters with the following data objects: (C, E), (A, D) and (B).

Fig.9. Cutting-off the dendrogram
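
To illustrate the cut-off idea, the sketch below builds a dendrogram for five invented data objects labelled
A-E (chosen so that C/E and A/D lie close together while B lies far away, loosely mimicking Figures 8 and 9)
and then cuts the tree with scipy.cluster.hierarchy.fcluster; the coordinates and the cut height t=3.0 are
assumptions made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Illustrative data objects A-E: C/E and A/D are close pairs, B lies far away
labels = ["A", "B", "C", "D", "E"]
data = np.array([[1.0, 1.0], [8.0, 8.0], [2.0, 2.2], [1.2, 0.9], [2.1, 2.3]])

merges = linkage(data, method='average')

# Leaves are the data objects; the height of each clade is the merge distance.
dendrogram(merges, labels=labels)
plt.show()

# "Cutting" the tree at height 3.0 plays the role of Cut-off 1: every vertical
# line crossed at that height becomes one flat cluster.
clusters = fcluster(merges, t=3.0, criterion='distance')
print(dict(zip(labels, clusters)))   # e.g. {'A': 1, 'B': 2, 'C': 1, 'D': 1, 'E': 1}
```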

Summarising the information about hierarchical clustering given in this topic, it is worth noting that the
entire cluster hierarchy represents an ordered sequence of cluster merges (Hastie et al., 2017). The algorithm
does not require the number of clusters to be specified before it runs. Still, the developer needs to decide
a) where to cut the dendrogram to obtain the final number of clusters and b) which linkage method to use to
measure the similarity between clusters.

Information sources
• Hastie, T., Tibshirani, R., & Friedman, J. (2017). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer Series in Statistics.
• Jones, T. M. (2009). Artificial Intelligence: A Systems Approach. Jones & Bartlett Learning.
• Kubat, M. (2017). An Introduction to Machine Learning. Springer International Publishing.
• Tyugu, E. (2007). Algorithms and Architectures of Artificial Intelligence. IOS Press.
• Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Pearson.
