
MACHINE LEARNING (Part 2)

In Part 1 of this topic, you learnt that unsupervised learning uses unlabelled datasets and reveals the
structure of data. One of the tasks of unsupervised machine learning is clustering: the process of dividing an
available dataset into subsets that share some similarities. These subsets of data are called clusters; for
example, an enterprise can group its clients by income level (clients with high, average, and low income), and
a librarian can group books by theme (romance, science fiction, drama, etc.). Each cluster is therefore made of
one or more data objects and is characterized by two aspects:
• the similarity of data objects within the cluster;
• the dissimilarity of data objects between clusters.

Concept of similarity
The concept of similarity (sometimes called similarity metric, distance, proximity, or closeness) forms the
basis of many unsupervised learning algorithms. The similarity of data objects is determined from their
feature-based descriptions (so-called feature similarity), and the notion of similarity differs for different
data types (Kubat, 2017).

For continuous feature values, it is possible to calculate the geometric distance between any pair of data
objects: the closer the data objects are to each other, the greater their mutual similarity. The two most
common metrics are the Euclidean distance and the Manhattan distance. The Euclidean distance is calculated
using the following formula (Jones, 2009; Kubat, 2017):

d(x, y) = sqrt( (x1 – y1)^2 + (x2 – y2)^2 + … + (xn – yn)^2 )
where x = (x1, x2,…, xn) is a feature vector of a new data object;


xi is the value of an ith feature of a new data object;
y = (y1, y2,…, yn) is a feature vector of a data object from the training dataset;
yi is the value of an ith feature of a data object from the training dataset;
n is the number of features describing data objects.
Another distance metric for continuous feature values is the Manhattan distance, which calculates the sum of
absolute differences between all feature values of the data objects:

d(x, y) = |x1 – y1| + |x2 – y2| + … + |xn – yn|
where x = (x1, x2,…, xn) is a feature vector of a new data object;


xi is the value of an ith feature of a new data object;
y = (y1, y2,…, yn) is a feature vector of a data object from the training dataset;
yi is the value of an ith feature of a data object from the training dataset;
n is the number of features describing data objects.

Figure 1 shows examples of calculating the Euclidean and Manhattan distances for two data objects.

Fig. 1. Calculating the Euclidean and Manhattan distances
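
As an illustration of the two metrics, the following minimal Python sketch computes both distances for a pair
of feature vectors; the vectors are invented for the example and are not the values used in Figure 1.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared feature differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    """Sum of absolute feature differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Illustrative feature vectors of two data objects (not taken from Figure 1)
x = [7.5, 2.0]
y = [6.0, 4.0]

print(euclidean_distance(x, y))  # 2.5
print(manhattan_distance(x, y))  # 3.5
```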

For categorical feature values, the Hamming distance is applied to calculate the number of features on
which two data objects differ (Russell & Norvig, 2010). The fewer the differences in features, the greater
the similarity of data objects. If the values of the same feature of both data objects are the same, then
the distance is equal to 0; otherwise, it is equal to 1. Figure 2 gives examples of calculating the Hamming
distance between three data objects. Student 1 and Student 2 have different values of the feature x1, so
the distance for this feature is 1; they have the same value of the feature x2, so the distance is 0 for this
feature; and they have different values for the feature x3, so the distance is 1 for this feature. The
distances between Student 1 and Student 3 and between Student 2 and Student 3 are calculated in the same way.
As a result, one can conclude that Student 1 and Student 3 are the most similar data objects, as the distance
between them is the smallest of the calculated distances.

Fig. 2. Calculating the Hamming distance
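
A Hamming distance of this kind can be computed with a one-line helper; the grades below are invented for the
example rather than taken from Figure 2, but they reproduce the situation described above, where Student 1 and
Student 3 are the most similar pair.

```python
def hamming_distance(x, y):
    """Number of features on which two data objects differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

# Illustrative categorical feature vectors (features x1, x2, x3)
student_1 = ["A", "B", "C"]
student_2 = ["B", "B", "A"]
student_3 = ["A", "B", "B"]

print(hamming_distance(student_1, student_2))  # 2
print(hamming_distance(student_1, student_3))  # 1 -> the most similar pair
print(hamming_distance(student_2, student_3))  # 2
```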

As datasets quite often include mixed-type data – both continuous and categorical – we can cope with such data
by relying on the sum of the squared distances along the corresponding features (Kubat, 2017):

d(x, y) = d1(x1, y1) + d2(x2, y2) + … + dn(xn, yn)
where x = (x1, x2,…, xn) is a feature vector of a new data object;


xi is the value of an ith feature of a new data object;
y = (y1, y2,…, yn) is a feature vector of a data object from the training dataset;
yi is the value of an ith feature of a data object from the training dataset;
n is the number of features describing data objects;
di(xi, yi) = (xi – yi)^2 for continuous features;
di(xi, yi) = 0 if xi = yi, and 1 otherwise, for categorical features (Hamming distance).

Figure 3 demonstrates the calculation of the distance for mixed-type feature values. The features
“Average mark in the semester work” and “The number of missed lectures” have continuous values,
therefore their contribution to the distance is the squared difference of the feature values. In turn, the
features that represent the student's evaluations in the laboratory works have categorical values, so the
Hamming distance is used to determine whether the feature values of the data objects are the same or
different. The values of the data objects for the feature “Laboratory work 1” are the same, so the
Hamming distance is equal to 0. The values of the data objects for the feature “Laboratory work 2” are
different, therefore the Hamming distance is equal to 1.

Fig.3. Calculating the distance for mixed-type feature values
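
The mixed-type distance can be sketched in Python as follows; the feature values and the flag list marking
which features are categorical are assumptions made for the example, not the data of Figure 3.

```python
def mixed_distance(x, y, is_categorical):
    """Per feature: squared difference for continuous values,
    0/1 Hamming contribution for categorical values."""
    total = 0.0
    for xi, yi, categorical in zip(x, y, is_categorical):
        if categorical:
            total += 0.0 if xi == yi else 1.0
        else:
            total += (xi - yi) ** 2
    return total

# Illustrative students: [average mark, missed lectures, lab work 1, lab work 2]
student_a = [7.5, 2, "passed", "passed"]
student_b = [6.5, 5, "passed", "failed"]
flags = [False, False, True, True]   # which features are categorical

print(mixed_distance(student_a, student_b, flags))  # 1 + 9 + 0 + 1 = 11.0
```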

K-Means clustering
The K-Means algorithm is one of the popular algorithms of unsupervised machine learning. According to
(Jones, 2009), “the algorithm is popular primarily because it works relatively well and is extremely simple
both to understand and to implement”. The algorithm is based on two central concepts:
• the concept of distance;
• the concept of a centroid.

A distance represents a similarity measure used to group data objects in clusters (see above). A
centroid is a centre of a cluster around which data objects are grouped based on their distance to the
centroid. It is the average of the current set of feature vectors within the cluster (Jones, 2009). It means
that the data objects that are close to the centroid in terms of distance make a cluster. The task of K-
Means is to minimize the sum of distances between data objects belonging to a cluster and the cluster
centroid. The algorithm maps each data object to only one cluster.

The K-Means algorithm is based on the following steps (Jones, 2009; Kubat, 2017; Tyugu, 2007):
1. Randomly select K data objects from the available dataset to serve as the initial centroids. K is a
hyperparameter of the algorithm: it determines how many clusters (K) the algorithm should create and is
usually chosen by the model developer using a trial-and-error approach or heuristic-based methods.
2. Starting with the first data object, for each data object in the dataset:
a) calculate the distance between the data object and each of the centroids;
b) find the smallest distance and assign the data object to the corresponding cluster.
3. Check whether the last data object in the dataset has been reached.
4. If not, retrieve the next data object and return to the distance calculation in Step 2.
5. When the last data object is reached, recalculate the centroid values by taking the average values of all
data objects belonging to each centroid's cluster.
6. Check whether the new centroid values differ from the previous values.
7. If they do, start the next iteration; otherwise, terminate.
These steps are represented in Figure 4, and a minimal Python sketch of them is given after the figure. The
typical termination criterion of the clustering process in the K-Means algorithm is the iteration in which data
objects no longer change their cluster membership (the centroid values do not change) (Jones, 2009; Kubat,
2017; Tyugu, 2007). It is also possible to terminate the clustering process after a pre-defined number of
iterations.

Fig.4. Steps of the K-Means algorithm
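
The steps above can be condensed into a short K-Means sketch in Python with NumPy. It is a minimal illustration
under simple assumptions (Euclidean distance, unchanged centroids as the stopping test, an upper bound on
iterations and an invented 2-D dataset), not the implementation from the cited sources.

```python
import numpy as np

def k_means(data, k, max_iterations=100, seed=0):
    """Minimal K-Means: random initial centroids, then assign-and-update loop."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K data objects at random as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iterations):
        # Steps 2-4: assign every data object to its nearest centroid.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 5: recalculate each centroid as the mean of its cluster.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Steps 6-7: stop when the centroid values no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Invented 2-D dataset with two visible groups
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```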

The algorithm has two main drawbacks (Jones, 2009; Kubat, 2017; Tyugu, 2007):
• it is necessary to define the number of clusters before the clustering process, which calls for serious
exploration of the data before clustering and intensive experimentation to check the performance of the
algorithm with different numbers of clusters;
• the initialization of the centroids can also be problematic.

Hierarchical clustering is another unsupervised machine learning algorithm; unlike K-Means, it does not
require the developer to assume a number of clusters in advance.

Hierarchical clustering
Hierarchical clustering is about building a hierarchy of clusters that is characterized by the following
aspects (Hastie et al., 2017):
• the clusters at each level of the hierarchy are created by merging/dividing clusters at the next
lower level;
• at the lowest level, each cluster contains a single data object;
• at the highest level, there is only one cluster containing all of the data objects.

There are two types of hierarchical clustering (Hastie et al., 2017). Agglomerative hierarchical
clustering is a bottom-up approach to building a hierarchy of clusters. Each data object is initially
attributed to its own cluster. Then, at each level, a selected pair of clusters is recursively merged into a
single cluster, producing a grouping at the next higher level with one cluster fewer. The pair chosen for
merging consists of the two groups with the smallest intergroup dissimilarity (Hastie et al., 2017). A pair of
clusters is thus merged at each level of the hierarchy until one cluster containing all data objects is
obtained at the highest level. Figure 5 represents the process of agglomerative hierarchical clustering, in
which the clusters with the smallest distance between them are merged at each step of clustering.

Fig.5. Agglomerative hierarchical clustering

The steps of agglomerative hierarchical clustering can be described as follows (Fig. 6); a short sketch based
on SciPy is given after the figure:
1. Attribute each data object to its own cluster. This results in N clusters, where N is the number of data
objects in the dataset.
2. Calculate the distances between the clusters and merge the two clusters with the smallest distance between
them into one cluster. This step reduces the total number of clusters by 1.
3. Check whether all data objects form a single cluster. If yes, terminate; otherwise, return to Step 2.

Fig. 6. Steps of agglomerative hierarchical clustering
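
The same procedure is available off the shelf; as a sketch, SciPy's scipy.cluster.hierarchy.linkage performs
the merging loop of Steps 2-3 and records every merge it makes. The data values below are invented for
illustration, and 'single' linkage is only one possible choice of merging rule (see the linkage methods
discussed below).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Illustrative 2-D data objects (five small feature vectors)
data = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 5.2], [9.0, 1.0]])

# Each row of the result describes one merge: the indices of the two clusters
# merged, the distance between them, and the size of the newly formed cluster.
merges = linkage(data, method='single', metric='euclidean')
print(merges)
```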

Divisive hierarchical clustering applies a top-down clustering approach. It starts with a single cluster
containing all data objects and then, at each level of the hierarchy, recursively splits one of the existing
clusters at that level into two new clusters. The split is chosen to produce the two new groups with the
largest between-group dissimilarity (Hastie et al., 2017). The clustering stops when each data object forms its
own cluster. Figure 7 represents the simplified idea of divisive hierarchical clustering: at each step of
clustering, the data objects with the largest distance from the other data objects in a cluster are split off
into a separate cluster. According to Hastie et al. (2017), divisive hierarchical clustering has not been
studied nearly as extensively as agglomerative clustering in the clustering literature.

Fig.7. Divisive hierarchical clustering
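
The simplified idea shown in Figure 7 can be sketched in Python as follows: at every step, the largest
remaining cluster gives up the data object that lies farthest, on average, from its other members. This is a
deliberately simplified illustration of the top-down principle, not a standard divisive algorithm; the data and
the splitting rule are assumptions made for the example.

```python
import numpy as np

def simplified_divisive(data):
    """Top-down sketch: repeatedly split off the data object that is, on
    average, farthest from the other members of the largest cluster."""
    clusters = [list(range(len(data)))]       # start: one cluster with all objects
    hierarchy = [[c[:] for c in clusters]]    # remember the clusters at each level
    while any(len(c) > 1 for c in clusters):
        target = max((c for c in clusters if len(c) > 1), key=len)
        points = data[target]
        # mean distance of each member to the rest of its cluster
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        farthest = target[int(dists.mean(axis=1).argmax())]
        clusters.remove(target)
        clusters.append([i for i in target if i != farthest])
        clusters.append([farthest])
        hierarchy.append([c[:] for c in clusters])
    return hierarchy

# Invented 2-D data objects
data = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
for level in simplified_divisive(data):
    print(level)
```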

There are several methods to measure the similarity (distance) between clusters and thereby decide the rules
for merging clusters in agglomerative hierarchical clustering; they are often called linkage methods. The most
commonly used methods are (Hastie et al., 2017):
• Complete-linkage: the cluster similarity is the distance between the clusters' most distant elements (the
longest distance):

dCL(G, H) = max d(i, i') over all i in G and i' in H
where G, H – clusters;
d – distance;
i – a data object from G cluster;
i’ – a data object from H cluster.

• Single-linkage: the cluster similarity is the distance between the closest elements of the two clusters (the
shortest distance):

dSL(G, H) = min d(i, i') over all i in G and i' in H
where G, H – clusters;
d – distance;
i – a data object from G cluster;
i’ – a data object from H cluster.

• Average-linkage: the distance between two clusters is defined as the average distance from each data object
in one cluster to every data object in the other cluster:

dAL(G, H) = (1 / (NG · NH)) · Σ d(i, i'), summed over all i in G and i' in H
where G, H – clusters;
d – distance;
i – a data object from G cluster;
i’ – a data object from H cluster;
NG – the number of data objects in the G cluster;
NH – the number of data objects in the H cluster.

Different linkage methods lead to different clusters, and thus the choice of the method depends on the
developer.
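
The three linkage definitions can be compared numerically on a small example. The sketch below uses two
invented clusters G and H and plain NumPy; the maximum, minimum and mean of the pairwise distance matrix give
the complete-, single- and average-linkage distances, respectively.

```python
import numpy as np

def pairwise_distances(cluster_g, cluster_h):
    """Euclidean distances d(i, i') for every i in G and every i' in H."""
    return np.linalg.norm(cluster_g[:, None, :] - cluster_h[None, :, :], axis=2)

# Illustrative clusters G (2 data objects) and H (3 data objects)
G = np.array([[1.0, 1.0], [1.5, 1.2]])
H = np.array([[4.0, 4.0], [4.5, 3.8], [5.0, 4.2]])

d = pairwise_distances(G, H)
print("complete linkage:", d.max())   # longest pairwise distance
print("single linkage:  ", d.min())   # shortest pairwise distance
print("average linkage: ", d.mean())  # mean over all NG * NH pairs
```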

A dendrogram is an output of the hierarchical clustering algorithm. It is a tree-like diagram representing the
hierarchical relationships between data objects in the dataset. According to Hastie et al. (2017), “a
dendrogram provides a highly interpretable complete description of the hierarchical clustering in a graphical
format”. Thus, it provides insight into how the clusters were formed.

A dendrogram consists of (Figure 8):
• clades, which represent the merging of data objects into clusters; each clade has its height;
• leaves, the terminal ends of clades, which correspond to the data objects used in the clustering process.

Fig.8. Constituent parts of a dendrogram

The key to interpreting a dendrogram is to focus on the height of the clades, that is, the height at which two
data objects are merged. There are five data objects in Figure 8 – A, B, C, D and E. The clade height allows
one to draw the following conclusions:
• Data objects (leaves) in the same clade are more similar to each other than to other data objects. In the
dendrogram in Figure 8, data objects C and E are the most similar in terms of the distance between them, as
the height of the clade connecting them is the smallest. Data objects A and D are the next most similar data
objects.

• Clades with a slight difference in height indicate more similar data objects/clusters. Data objects C, E, A
and D are more similar to each other than they are to data object B, because the heights of their clades differ
only slightly.
• Clades with a greater difference in height indicate more distinct data objects/clusters. The most distinct
are data objects C and E on the one hand and data object B on the other, because the corresponding clades have
the biggest difference in height.
Thus, the greater the difference in the height of clades, the more dissimilar the clusters.

A horizontal cut-off line through the dendrogram is usually used to decide the number of clusters produced by
the hierarchical clustering algorithm. Cut-offs can be made at different levels of the hierarchy, leading to
different numbers of clusters. The number of vertical lines intersected by the horizontal cut-off line equals
the number of clusters. For example, in Figure 9, Cut-off 1 intersects two vertical lines, so we obtain two
clusters with the following data objects: (C, E, A, D) and (B). Cut-off 2 intersects three vertical lines, so
we obtain three clusters with the following data objects: (C, E), (A, D) and (B).

Fig.9. Cutting-off the dendrogram
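
To illustrate the cut-off idea, the sketch below builds a dendrogram for five invented data objects labelled
A-E (chosen so that C/E and A/D lie close together while B lies far away, loosely mimicking Figures 8 and 9)
and then cuts the tree with scipy.cluster.hierarchy.fcluster; the coordinates and the cut height t=3.0 are
assumptions made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Illustrative data objects A-E: C/E and A/D are close pairs, B lies far away
labels = ["A", "B", "C", "D", "E"]
data = np.array([[1.0, 1.0], [8.0, 8.0], [2.0, 2.2], [1.2, 0.9], [2.1, 2.3]])

merges = linkage(data, method='average')

# Leaves are the data objects; the height of each clade is the merge distance.
dendrogram(merges, labels=labels)
plt.show()

# "Cutting" the tree at height 3.0 plays the role of Cut-off 1: every vertical
# line crossed at that height becomes one flat cluster.
clusters = fcluster(merges, t=3.0, criterion='distance')
print(dict(zip(labels, clusters)))   # e.g. {'A': 1, 'B': 2, 'C': 1, 'D': 1, 'E': 1}
```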

Summarising the information about hierarchical clustering given in this topic, it is worth noting that the
entire cluster hierarchy represents an ordered sequence of cluster merges (Hastie et al., 2017). The algorithm
does not require the number of clusters to be specified before it runs. Still, the developer needs to decide
a) where to cut the dendrogram to obtain the final number of clusters and b) which linkage method to use to
measure the similarity between clusters.

Information sources
• Hastie, T., Tibshirani, R., & Friedman, J. (2017). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer Series in Statistics.
• Jones, T. M. (2009). Artificial Intelligence: A Systems Approach. Jones & Bartlett Learning.
• Kubat, M. (2017). An Introduction to Machine Learning. Springer International Publishing.
• Tyugu, E. (2007). Algorithms and Architectures of Artificial Intelligence. IOS Press.
• Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Pearson.
