Chapter Four
MACHINE LEARNING
By:
Irandufa Indebu
irandufa.indebu@haramaya.edu.et
Outline
Overview of machine learning
Supervised Learning (Classification and Regression)
Unsupervised Learning (with more focus on Clustering)
Model Evaluation Methods
Overview of Machine Learning
Machine learning falls under the larger umbrella of artificial
intelligence.
Artificial intelligence is a branch of computer science that
includes reasoning, natural language processing, planning, and
machine learning.
Data science encompasses both artificial intelligence and
machine learning.
Overview of Machine Learning
Machine learning is the field of study that develops the
algorithms that computers follow in order to identify and
extract patterns from data.
ML allows us to “teach” computers how to perform tasks by
providing examples of how they should be done.
It gives computers the ability to learn without being explicitly
programmed
It is programming computers to optimize a performance criterion
using example data or past experience
Overview of Machine Learning
ML algorithms and techniques are applied primarily during the
modeling stage of CRISP-DM. ML involves a two-step process.
First, an ML algorithm is applied to a data set to identify useful
patterns in the data
These patterns can be represented in a number of different ways
The model may be predictive to make predictions in the future,
or descriptive to gain knowledge from data, or both.
Overview of Machine Learning
A pattern may be represented as a decision tree, a regression
model, or a neural network.
These representations of patterns are known as “models,” which
is why this stage of the CRISP-DM life cycle is known as the
“modeling stage.”
Once a model has been created and represented, it is used for
analysis.
ML depends on:
• Statistics: making inferences from sample data.
• Numerical algorithms (linear algebra, optimization): to optimize
criteria and manipulate models.
• Computer science: data structures and programs that solve an
ML problem efficiently.
Some Application of ML
Fraud detection
• Algorithms are now capable of detecting when a financial
transaction has the characteristics of fraud.
• Companies can even spot fake reviews by recognizing word
patterns and review-posting timings that are more likely to
be fake.
Speech or face recognition: devices use speech recognition to
understand what we are saying and to respond to our requests. Social
media uses complex data analysis to recognize patterns in our photos
and can tell who is in a picture before we even begin tagging.
Some Application of ML
Medical diagnosis: machine learning can use image recognition to
diagnose x-rays, using imaging-scan data from several past patients
to make predictions about new patients.
Credit scoring: classify customers into high- and low-risk,
based on their income and savings, using data about past loans
(whether they were paid or not).
Regression: the labels to be predicted are continuous.
• E.g., predict the price of a car from its mileage.
Some Application of ML
Stock Predictions: Stock traders look at many variables to decide
on what to do with a stock, whether they want to buy or sell or
wait it out.
They look at certain characteristics of a stock, and trends in the
market environment to make an educated guess on what they
should do.
ML Pipeline
The high-level components of a machine learning system are
outlined in the following diagram.
This diagram illustrates the machine learning pipeline from
which we obtain data and in which we store data.
ML Pipeline
We then transform it into a form that is usable as input to a
machine learning model; train, test, and refine our model; and
then, deploy the final model to our production system.
The process is then repeated as new data is generated
Types of ML
Some of the main types of machine learning are:
1. Supervised Learning, in which the training data is labeled with
the correct answers, e.g. “spam” vs. “ham,” or “fraudulent” vs.
“non-fraudulent.”
Types of ML
2. Unsupervised learning, in which we are given a collection of
unlabeled data, which we wish to analyze and discover patterns
within. The two most important examples are dimension
reduction and clustering.
3. Reinforcement learning, in which an agent (e.g., a robot or
controller) seeks to learn the optimal actions to take based on the
outcomes of past actions.
Supervised Learning
Supervised Learning
In supervised learning, programmers use labeled data: the outcome
we are looking for in the data is already predetermined.
Supervised learning is “supervised” because each of the instances
in the data set lists both the input values and the output (target)
value for each instance
For supervised learning to take place, each instance in the data
set must be labeled with the value of the target attribute.
Supervised Learning
We are trying to find a relationship between an input X and an
output Y that we have chosen:
X (input) → Y (output)
After you find a relationship between X and Y, you get a model, which
will predict an outcome based on those relationships that your machine
has observed in the data.
Once we have estimated this relationship, we can predict Y from X.
Supervised Learning
The goal of supervised learning is to learn a function that maps
from the values of the attributes describing an instance to the value
of another attribute, known as the target attribute of that instance.
The two most common types of supervised learning are
classification (where the outputs are discrete labels, as in spam
filtering) and regression (where the outputs are real-valued).
An easy way to distinguish between classification and regression
tasks is to ask whether there is some kind of continuity in the
output
Classification
Classification requires labeled data and creates non-continuous
predictions.
Classification models are probably the most widely used part of
machine learning and data science.
Classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
In a classification problem, a label is a member of a finite set of
classes
Classification
The first type of classification is binary classification
With binary classification, the data is classified into two
categories. The size of the set of classes is two (“sick”/“healthy”,
“spam”/“not_spam”),
We call it binary classification because there are only two
possible categories and all of our data falls into one or the other.
But there are instances when we have more than two categories,
and for this, we use multi-class classification models
Classification
Multiclass classification (also called multinomial) is a classification
problem with three or more classes
E.g., a sentiment analysis result: positive, negative, or neutral.
Classification methods
Goal: predict the class Ci = f(x1, x2, …, xn)
There are various classification methods. Popular classification
techniques include the following.
K-nearest neighbor
Decision tree classifier: divide decision space into piecewise
constant regions.
Neural networks: partition by non-linear boundaries
Bayesian network: a probabilistic model
Support vector machine
K Nearest Neighbor
K-nearest neighbors is one of the most straightforward and
widely used methods of data classification.
It is a supervised machine learning classifier that uses the
observations it memorizes from a training dataset to predict
classifications for new, unlabeled observations.
kNN makes its predictions based on similarity: the more similar
the training observations are to the new, incoming observation,
the more likely it is that the classifier will assign both the
same class.
K Nearest Neighbor
In KNN, a new data point is classified by the majority vote of
its K nearest neighbors.
The purpose of this algorithm is to classify a new object based on
attributes and training samples (xi, f(xi)), i = 1…N.
Given a query point, we find the K objects (training points)
closest to the query point.
The classification uses a majority vote among the
classifications of the K objects.
The K nearest neighbor algorithm uses the neighborhood
classification as the prediction value of the new query
instance.
K Nearest Neighbor
K nearest neighbor algorithm is very simple. It works based on
minimum distance from the query instance to the training samples
to determine the K-nearest neighbors.
How to compute the K-Nearest Neighbor (KNN) algorithm:
Determine the parameter K = the number of nearest neighbors
Calculate the distance between the query instance and all the
training samples
We can use the Euclidean distance
Sort the distances and determine the nearest neighbors based on
the K-th minimum distance
Gather the categories of the nearest neighbors
Use the simple majority of the categories of the nearest neighbors
as the prediction value of the query instance
Any ties can be broken at random. These steps are sketched in code below.
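A minimal Python sketch of this procedure (the function name knn_classify is ours, not from any library), using squared Euclidean distance and a simple majority vote:

from collections import Counter

def knn_classify(query, samples, labels, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 2: squared Euclidean distance from the query to every training sample
    dists = [sum((q - s) ** 2 for q, s in zip(query, sample)) for sample in samples]
    # Step 3: sort by distance and keep the indices of the k nearest neighbors
    nearest = sorted(range(len(samples)), key=lambda i: dists[i])[:k]
    # Steps 4-5: gather the neighbors' categories and take a simple majority
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]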
K Nearest Neighbors: Key issues
The key issues involved in training a KNN model include:
1. Setting the variable K (the number of nearest neighbors)
The number of nearest neighbors (K) should be chosen based on
cross-validation over a number of K settings.
k = 1 is a good baseline model to benchmark against.
A good rule of thumb is that K should be less than or equal to
the square root of the total number of training patterns.
K Nearest Neighbors: Key issues
2. Setting the type of distance metric
We need a measure of distance in order to know which instances are
the neighbours.
Assume that we have D attributes for the learning problem. Then one
example point x has elements xt ∈ ℝ, t = 1, …, D.
The distance between two points X and Y is often defined as the
Euclidean distance:
Dist(X, Y) = √( Σ (Xi − Yi)² ), summing over i = 1, …, D
K Nearest Neighbors: Example
We have data from a questionnaire survey (asking people's opinions)
and objective testing, with two attributes (acid durability and
strength), to classify whether a special paper tissue is good or not.
Here are four training samples.
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good
K Nearest Neighbors: Example
Now the factory produces a new paper tissue that passes the
laboratory test with X1 = 3 and X2 = 7. Without undertaking another
expensive survey, can we guess the goodness of the new tissue? Use the
squared Euclidean distance as the similarity measurement.
Steps:
1. Determine the parameter K = number of nearest neighbors
Suppose we use K = 3
2. Calculate the distance between the query-instance and all the training
samples
K Nearest Neighbors: Example
The coordinate of the query instance is (3, 7). Instead of calculating the
Euclidean distance we compute the squared distance, which is faster to
calculate (no square root):
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Square distance to query instance (3, 7)
7 | 7 | (7−3)² + (7−7)² = 16
7 | 4 | (7−3)² + (4−7)² = 25
3 | 4 | (3−3)² + (4−7)² = 9
1 | 4 | (1−3)² + (4−7)² = 13
K Nearest Neighbors: Example
3. Sort the distances and determine the nearest neighbors based on the
K-th minimum distance
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Square distance to query instance (3, 7) | Rank (minimum distance)
7 | 7 | 16 | 3
7 | 4 | 25 | 4
3 | 4 | 9 | 1
1 | 4 | 13 | 2
K Nearest Neighbors: Example
4. Gather the category (Y) of the nearest neighbors. Notice in the
second row, last column, that the category of the nearest neighbor (Y) is
not included because the rank of this data point is more than 3 (= K).
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Square distance to query instance (3, 7) | Rank (minimum distance) | Included in 3-NNs? | Y = Category of NN
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good
K Nearest Neighbors: Example
5. Use the simple majority of the category of the nearest neighbors as
the prediction value of the query instance
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Square distance to query instance (3, 7) | Rank (minimum distance) | Included in 3-NNs? | Y = Category of NN
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good
K Nearest Neighbors: Example
We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new
paper tissue that passes the laboratory test with X1 = 3 and X2 = 7
is included in the Good category. Output for the test instance:
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Y = Category of NN
3 | 7 | Good
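Running the knn_classify sketch from earlier on this data reproduces the worked example (squared distances 16, 25, 9, 13; neighbors ranked 3, 4, 1, 2):

samples = [(7, 7), (7, 4), (3, 4), (1, 4)]   # (acid durability, strength)
labels = ["Bad", "Bad", "Good", "Good"]
print(knn_classify((3, 7), samples, labels, k=3))   # -> 'Good'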
K Nearest Neighbors: Exercise
Training set
Number | Lines | Line types | Rectangles | Colours | Mondrian?
1 | 6 | 1 | 10 | 4 | No
2 | 4 | 2 | 8 | 5 | No
3 | 5 | 2 | 7 | 4 | Yes
4 | 5 | 1 | 8 | 4 | Yes
5 | 5 | 1 | 10 | 5 | No
6 | 6 | 1 | 8 | 6 | Yes
7 | 7 | 1 | 14 | 5 | No
Test set
Number | Lines | Line types | Rectangles | Colours | Mondrian?
8 | 7 | 2 | 9 | 4 | ?
When to use K Nearest Neighbor?
kNN works best if the dataset is:
Low on noise
Free of outliers
Labeled
Composed only of relevant selected features
Composed of distinguishable groups
KNNs: advantages & Disadvantages
Advantages
Simple and powerful
Requires little training time
Nonparametric architecture
Disadvantages: difficulties with the k-nearest neighbour algorithm
Classification/estimation is slow
We have to calculate the distance of the test case from all training
cases
Memory intensive: we just store the training examples;
when a test example is given, we then find the closest matches
Decision Tree
If the input attributes in a data set are primarily nominal or
ordinal, ML algorithms and models such as decision trees are
more appropriate.
Decision tree is the most powerful and popular tool for
classification and prediction
It is a flowchart like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome
of the test, and each leaf node (terminal node) holds a class label.
A decision tree encodes a set of if-then-else rules in a tree
structure.
Decision Tree
This Figure illustrates a decision tree for deciding whether
an email is spam or not
Rectangles with rounded corners represent tests on attributes,
and the square nodes indicate decision, or classification, nodes
Decision Tree
This tree encodes the following rules:
if the email is from an unknown sender, then it is spam;
if it isn’t from an unknown sender but contains suspicious words, then it is
spam;
if it is neither from an unknown sender nor contains suspicious words, then it
is not spam.
In a decision tree, the decision for an instance is made by
starting at the top of the tree and navigating down through the
tree by applying a sequence of attribute tests to the instance.
Decision Tree
Decision Tree Terminology
Root Node – the node present at the beginning of a decision tree;
from this node the population starts dividing according to various
features.
Decision Nodes – the nodes we get after splitting the root node are
called decision nodes.
Leaf Nodes – the nodes where further splitting is not possible are called
leaf nodes or terminal nodes.
Sub-tree – just as a small portion of a graph is called a sub-graph,
a sub-section of a decision tree is called a sub-tree.
Decision Tree
Shall I play tennis today?
Decision Tree
The goal of decision-tree learning is to find a set of
classification rules that divide the training data set into sets of
instances that have the same value for the target attribute.
One of the strengths of decision trees is that they are simple to
understand
Also it is possible to create very accurate models based on
decision trees
Although decision trees work well with both nominal and ordinal
data, they struggle with numeric data
Decision Tree
In a decision tree, a separate branch descends from each node for
each value in the domain of the attribute tested at the node.
Numeric attributes, however, have an infinite number of values in
their domains, with the implication that a tree would need an
infinite number of branches
One solution to this problem is to transform numeric attributes into
ordinal attributes, although doing so requires the definition of
appropriate thresholds, which can also be difficult.
Decision Tree
Each node in the tree specifies one attribute to test, and the
process descends the tree node by node by choosing the branch
from the current node with the label matching the value of the test
attribute of the instance.
The final decision is the label of the terminating (or leaf) node
that the instance descends to.
Each path in a decision tree, from root to leaf, defines a
classification rule composed of a sequence of tests
Choosing the Splitting Attribute
At each node, the best attribute is selected for splitting the
training examples using a goodness function. The best attribute is
the one that separates the classes of the training examples fastest,
such that it results in the smallest tree.
Typical goodness functions: information gain, information gain
ratio, and the Gini index.
Information Gain: select the attribute with the highest information gain,
i.e., the one that creates the smallest average disorder.
• First, compute the disorder using Entropy: the expected information
needed to classify objects into classes.
• Second, measure the Information Gain: calculate by how much the
disorder of a set would be reduced by knowing the value of a particular
attribute.
Entropy
The Entropy measures the disorder of a set S containing a total
of n examples, of which n+ are positive and n− are negative, and it
is given by:
Entropy(S) = D(n+, n−) = −(n+/n)·log₂(n+/n) − (n−/n)·log₂(n−/n)
Some useful properties of the Entropy: it is 0 when all the examples
belong to the same class, and it is maximal (1, for a binary problem)
when the examples are evenly split between the classes.
Information Gain
The Information Gain measures the expected reduction in entropy due
to splitting on an attribute A:
GAIN_split = Entropy(S) − Σ (ni/n)·Entropy(i), summing over the partitions i = 1, …, k
where the parent node S is split into k partitions, and ni is the number
of records in partition i.
Information Gain measures the reduction in entropy achieved
because of the split. Choose the split that achieves the most reduction
(maximizes GAIN).
Example 1: The problem of “Sunburn”
You want to predict whether a person is likely to get
sunburned if he goes back to the beach. How can you do this?
Data collected: predict based on the observed properties of the
people.
Name | Hair | Height | Weight | Lotion | Result
Sarah | Blonde | Average | Light | No | Sunburned
Dana | Blonde | Tall | Average | Yes | None
Alex | Brown | Short | Average | Yes | None
Annie | Blonde | Short | Average | No | Sunburned
Emily | Red | Average | Heavy | No | Sunburned
Pete | Brown | Tall | Heavy | No | None
John | Brown | Average | Heavy | No | None
Kate | Blonde | Short | Light | Yes | None
Example 1: The problem of “Sunburn”
Attribute Selection by Information Gain to construct the
optimal decision tree:
Entropy: the disorder of “Sunburned”
D({“Sarah”, “Dana”, “Alex”, “Annie”, “Emily”, “Pete”, “John”, “Kate”})
= D(3+, 5−) = −(3/8)·log₂(3/8) − (5/8)·log₂(5/8) = 0.954
Example 1: The problem of “Sunburn”
Calculate the Average Disorder Associated with Hair Colour
D(S_blonde) = D({“Sarah”, “Annie”, “Dana”, “Kate”}) = D(2+, 2−) = 1
Its weight in the average disorder is |S_blonde| / |S| = 4/8 = 0.5,
so the blonde term contributes 0.5 × 1 = 0.5.
The second and third terms of the sum:
• Sred = {“Emily”}
• Sbrown = { “Alex”, “Pete”, “John”}.
• These are both 0 because within each set all the examples have
the same class
• So the average disorder created when splitting on ‘hair colour’ is
0.5+0+0=0.5
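These entropy and gain calculations are easy to verify in code. A minimal Python sketch (the function names are ours) reproduces the numbers above:

from math import log2

def entropy(counts):
    """Entropy of a set, given the number of examples in each class."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, partitions):
    """GAIN_split = Entropy(S) - sum over partitions of (n_i / n) * Entropy(i)."""
    n = sum(parent_counts)
    avg_disorder = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(parent_counts) - avg_disorder

print(round(entropy([3, 5]), 3))   # 0.954: the disorder of 3 sunburned vs 5 none
# Splitting on hair colour: blonde = (2+, 2-), red = (1+, 0-), brown = (0+, 3-)
print(round(information_gain([3, 5], [(2, 2), (1, 0), (0, 3)]), 3))   # 0.454 (= 0.954 - 0.5)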
Example 1: The problem of “Sunburn”
Which decision variable minimises the disorder?
Note: The attribute "hair color" is selected as the first test because it
minimizes the entropy
Example 1: The problem of “Sunburn”
The best decision tree?
Once we have finished with hair colour, we then need to calculate
the remaining branches of the decision tree.
Which attribute is better for classifying the remaining instances?
Example 1: The problem of “Sunburn”
The attribute "lotion" is selected because it minimizes the
entropy in the blonde hair subset.
The best Decision Tree
Thus, this is the completed decision tree: using the "hair color" and
"lotion" tests together ensures the proper identification of all the samples.
This is the simplest and optimal tree possible, and it makes a lot of sense.
It classifies 4 of the people on the hair colour alone.
Example 1: The problem of “Sunburn”
You can view the decision tree as an IF-THEN-ELSE statement
which tells us whether someone will suffer from sunburn:
If (hair-colour = “red”) then
return (sunburned = yes)
else if (hair-colour = “blonde” and lotion-used = “no”) then
return (sunburned = yes)
else
return (sunburned = no)
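As a small illustration, the same rule set can be written directly as a Python function (a sketch; the function name is ours):

def is_sunburned(hair_colour, lotion_used):
    # Root test: red hair -> sunburned
    if hair_colour == "red":
        return True
    # Second test: blonde hair without lotion -> sunburned
    if hair_colour == "blonde" and lotion_used == "no":
        return True
    # All other paths -> not sunburned
    return False

print(is_sunburned("blonde", "no"))   # True  (e.g., Sarah)
print(is_sunburned("brown", "no"))    # False (e.g., Pete)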
Strengths and Weakness of Decision Tree approach
The strengths of decision tree methods are:
Decision trees are able to generate understandable rules.
Decision trees perform classification without requiring much
computation.
Decision trees are able to handle both continuous and categorical
variables.
Decision trees provide a clear indication of which fields are most
important for prediction or classification.
Strengths and Weakness of Decision Tree approach
The weaknesses of decision tree methods are:
Decision trees are less appropriate for estimation tasks where the
goal is to predict the value of a continuous attribute.
Decision trees are prone to errors in classification problems with
many classes and a relatively small number of training examples.
Decision trees can be computationally expensive to train; the
process of growing a decision tree is computationally expensive.
Support vector machines: reading assignment.
Naïve Bayes Classifier
The Naïve Bayes algorithm is an ML algorithm used for classification
problems.
It is primarily used for classification involving high-dimensional
training data.
Examples: spam filtration and document classification.
It is based on Bayes' theorem and is a probabilistic classifier.
Find the probability of a previously unseen instance belonging
to each class, then simply pick the most probable class.
Naïve Bayes Classifier
Probability Basics:
•Prior, conditional and joint probability
– Prior probability: P(X )
– Conditional probability: P( X1 |X2 ), P(X2 | X1 )
– Joint probability: X = ( X1 , X2 ), P( X ) = P(X1 ,X2 )
– Relationship: P(X1 ,X2 ) = P( X2 | X1 )P( X1 ) = P( X1 | X2 )P( X2 )
– Independence: P( X2 | X1 ) = P( X2 ), P( X1 | X2 ) = P( X1 ), P(X1 ,X2 ) = P( X1 )P( X2 )
• Bayes' Rule:
P(C | X) = P(X | C) P(C) / P(X), i.e., Posterior = (Likelihood × Prior) / Evidence
Naïve Bayes Classifier
Bayesian classifiers use Bayes' theorem, which says:
p(cj | d) = p(d | cj) p(cj) / p(d)
where p(cj | d) = probability of instance d being in class cj;
p(d | cj) = probability of generating instance d given class cj
(we can imagine that being in class cj causes you to have feature
d with some probability);
p(cj) = probability of occurrence of class cj (this is just how
frequent the class cj is in our database);
p(d) = the probability of instance d occurring.
Naïve Bayes Classifier
Example: Play Tennis
Naïve Bayes Classifier
•Learning Phase
Naïve Bayes Classifier
•Test Phase
– Given a new instance,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High,
Wind=Strong)
Look up tables
P(Outlook=Sunny | Play=Yes) = 2/9        P(Outlook=Sunny | Play=No) = 3/5
P(Temperature=Cool | Play=Yes) = 3/9     P(Temperature=Cool | Play=No) = 1/5
P(Humidity=High | Play=Yes) = 3/9        P(Humidity=High | Play=No) = 4/5
P(Wind=Strong | Play=Yes) = 3/9          P(Wind=Strong | Play=No) = 3/5
P(Play=Yes) = 9/14                       P(Play=No) = 5/14
MAP rule:
P(Yes | x’) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] · P(Play=Yes) = 0.0053
P(No | x’) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] · P(Play=No) = 0.0206
Given that P(Yes | x’) < P(No | x’), we label x’ as “No”.
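The test-phase arithmetic above can be checked with a few lines of Python (a sketch; Fraction keeps the products exact):

from fractions import Fraction as F

# Conditional probabilities for x' = (Sunny, Cool, High, Strong), from the lookup tables
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)   # ~0.0053
score_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)    # ~0.0206

print(float(score_yes), float(score_no))
print("No" if score_no > score_yes else "Yes")   # label for x' -> "No"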
Naïve Bayes Classifier
•Learning Phase
Naïve Bayes Classifier
Using a Bayes classifier, predict the sex of an individual with
(over170cm = no, eye = brown, hair_length = long).
Naïve Bayes Classifier
Age | Income | Student | Credit rating | Buys computer
Youth | High | No | Fair | No
Youth | High | No | Excellent | No
Mid_age | High | No | Fair | Yes
Senior | Medium | No | Fair | Yes
Senior | Low | Yes | Fair | Yes
Senior | Low | Yes | Excellent | No
Mid_age | Low | Yes | Excellent | Yes
Youth | Medium | No | Fair | No
Youth | High | Yes | Fair | Yes
Senior | Medium | Yes | Fair | Yes
Youth | Medium | Yes | Excellent | Yes
Mid_age | Medium | No | Excellent | Yes
Mid_age | High | Yes | Fair | Yes
Senior | Medium | No | Excellent | No
Thus, using this table…
Naïve Bayes Classifier
Predict the class label of a tuple x using the Naïve Bayes (NB)
classifier, given the training data in the above table.
The data tuples are described by the attributes age, income, student,
and credit rating.
The class label attribute, buys computer, has two distinct values:
yes/no. The new tuple x is given as follows:
X = (age = youth, income = medium, student = yes, credit rating = fair)
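One way to work this exercise is to count the required frequencies directly from the table. A minimal Python sketch (variable names ours) computes P(x | class) · P(class) for each class:

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("mid_age", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("mid_age", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "high", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("mid_age", "medium", "no", "excellent", "yes"),
    ("mid_age", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
x = ("youth", "medium", "yes", "fair")

def score(cls):
    subset = [r for r in rows if r[-1] == cls]
    prior = len(subset) / len(rows)        # P(class)
    likelihood = 1.0
    for i, value in enumerate(x):          # naive independence assumption
        likelihood *= sum(r[i] == value for r in subset) / len(subset)
    return prior * likelihood

print(score("yes"), score("no"))   # ~0.028 vs ~0.007 -> x is classified as "yes"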
Naïve Bayes Classifier
Advantage and disadvantage of Naïve Bayes classifier:
Advantage:
Fast to train (a single scan); fast to classify
Training is very easy and fast: it just requires considering each
attribute in each class separately
Not sensitive to irrelevant features
Handles real and discrete data
Handles streaming data well
Testing is straightforward: just looking up tables or calculating
conditional probabilities with normal distributions
Disadvantage:
Assumes independence of features
Artificial Neural Networks - Deep Learning
Deep learning is a popular area within data science today.
Deep learning is also just a sexy term for Artificial Neural
Networks (ANN), which have been around for over forty years;
the networks are "deep" in terms of the number of hidden layers they have.
Artificial Neural Networks (ANN), also known as Neural
Networks, are one of the most widely used algorithms within the
field of machine learning.
Neural networks are commonly used in visual and audio
recognition
Artificial Neural Networks - Deep Learning
ANN emphasizes analyzing data in many layers, and was
inspired by the human brain, which visually processes objects
through layers of neurons.
ANN is presented in the form of interconnected neurons that
interact with each other.
Each connection has numeric weight that can be altered and is
based on experience.
The layers or neurons are stacked on top of each other starting
with a broad base.
Artificial Neural Networks - Deep Learning
The bottom layer consists of raw data such as text, images or
sound, which is divided across what we call neurons.
Within each neuron is a collection of data.
Each neuron then sends information up to the layer of neurons
above.
As the information ascends, it becomes less abstract and more
specific, and we can learn more from the data at each layer.
A simple neural network can be divided into input, hidden, and
output layers.
Artificial Neural Networks - Deep Learning
Data is first received by the input layer, which detects
broad features.
The hidden layer(s) then analyze and process that data; as it passes
through each layer (with fewer neurons at each layer), the data
becomes clearer, based on previous computations.
The final result is shown as the output layer
The middle layers are considered hidden layers, because like
human sight we are unable to naturally break down objects into
layered vision
Artificial Neural Networks - Deep Learning
• The neural network in figure 15 has five layers: one input layer on
the left containing three neurons, three hidden layers (the black
circles), and one output layer on the right containing two neurons.
The input layer has three neurons; the first hidden layer has five;
each of the next two hidden layers has four; and the output layer
has two.
The above ANN is a fully connected, feed-forward network.
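To make "fully connected, feed-forward" concrete, here is a minimal NumPy sketch of a forward pass through layers of exactly those sizes (3-5-4-4-2); the weights are random placeholders, since the figure specifies only the architecture:

import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [3, 5, 4, 4, 2]   # input layer, three hidden layers, output layer

# One weight matrix and one bias vector per pair of adjacent layers
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Feed-forward pass: each layer's activations feed the next layer."""
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(a @ W + b)   # weighted sum of inputs, then a nonlinearity
    return a

print(forward(np.array([0.5, -1.0, 2.0])))   # two output activations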
Artificial Neural Networks - Deep Learning
Most modern algorithms, including decision trees and naive
Bayes are considered shallow algorithms, as they do not analyze
information via numerous layers as ANN can.
The power of deep neural networks comes from the fact that they
can automatically learn useful attributes
It is the ability to automatically learn complex mappings of input
data to useful attribute representations that has made deep-
learning models so accurate in tasks with high-dimensional inputs
(such as image and text processing).
Regression
Both linear regression and logistic regression: reading
assignment.
Unsupervised Learning
Unsupervised Learning
Unsupervised machine learning uses unlabeled data
Unsupervised learning allows us to find patterns that would
otherwise be unobservable.
The algorithm must discover patterns on its own
Clustering
The most common type of unsupervised machine learning is
clustering.
Clustering methods find similarities between instances and group
similar instances together.
In clustering, the algorithm looks for clusters of instances that are
more similar to each other than they are to other instances in the
data.
A challenge for clustering is figuring out how to measure
similarity
Clustering
An unsupervised clustering algorithm will look for groups of
rows that are more similar to each other than they are to the other
rows in the data.
Each of these groups of similar rows defines a cluster of similar
instances.
Clustering example
Clustering
Simply put, a clustering algorithm computes the distance
between groupings and divides data points into multiple groups
based on their relational distance to one another.
Unlike classification, which starts with predefined labels
reflected in the database table, clustering creates its own labels
after clustering the data set.
Analysis by clustering can be used in various scenarios such as
pattern recognition, image processing and market research.
Clustering
For example, clustering can be applied to uncover customers that
share similar purchasing behavior
By understanding a particular cluster of customer purchasing
preferences, we can then form decisions on which products we can
recommend to the group based on their commonalities.
Clustering
The general goals of a clustering algorithm are:
Instances in the same group should be similar
Instances in different groups should be different
But the best clustering is hard to define, since we don't have a
test error
Generally, there is no single best method in unsupervised learning
So why cluster?
You may want to know what groups are there
You may want to know to which group a new instance x belongs
Clustering Algorithms
K-means Clustering Algorithm
Given new data points, k-Means will assign them to the closest
cluster center.
K-means Clustering Algorithm
k-Means clustering is one of the simplest and most commonly
used clustering algorithms
It tries to find cluster centers that are representative of certain
regions of the data.
The algorithm alternates between two steps: assigning each data
point to the closest cluster center, and then setting each cluster
center as the mean of the data points that are assigned to it.
The algorithm is finished when the assignment of instances to
clusters no longer changes.
K-means Clustering Algorithm
Input:
The number of clusters 'k' (a hyperparameter)
An initial guess of the center (the 'mean') of each cluster
Output: clustered instances
Algorithm: assign each xi to its closest mean;
update each mean based on the assignment;
repeat until convergence (see the sketch below).
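A minimal NumPy sketch of this loop (the function name is ours) might look as follows:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Alternate assignment and mean-update steps until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial guess of means
    assign = None
    for _ in range(n_iter):
        # Assign each X_i to its closest mean (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and (new_assign == assign).all():
            break   # convergence: the assignment no longer changes
        assign = new_assign
        # Update each mean to the average of the points assigned to it
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centers, labels = k_means(X, k=2)
print(labels)   # the two low points and the two high points form separate clusters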
KNN vs K-means
Don't confuse KNN classification with K-means clustering: KNN is a
supervised classifier that requires labeled data, while K-means is an
unsupervised algorithm that groups unlabeled data into clusters.
Challenges in unsupervised learning
A major challenge in unsupervised learning is evaluating whether the
algorithm learned something useful
Unsupervised learning algorithms are usually applied to data that
does not contain any label information, so we don’t know what the
right output should be
Therefore it is very hard to say whether a model “did well”
Most of the time, unsupervised learning is used as a preprocessing
step for supervised learning, rather than as an end in itself.
Evaluation of Machine learning algorithms
Confusion Matrix
A confusion matrix, also called a contingency table or error
matrix, is used to visualize the performance of a classifier
The columns of the matrix represent the instances of the
predicted classes and the rows represent the instances of the actual
class.
In the case of binary classification the table has 2 rows and 2
columns
Confusion Matrix
For 2 prediction classes, the matrix is a 2×2 table; for
3 classes, it is a 3×3 table, and so on.
The matrix has two dimensions, predicted values and actual
values, along with the total number of predictions.
Predicted values are the values predicted by the model,
and actual values are the true values for the given observations.
It looks like the table below:
            | Predicted: No        | Predicted: Yes
Actual: No  | True Negative (TN)   | False Positive (FP)
Actual: Yes | False Negative (FN)  | True Positive (TP)
Confusion Matrix
True Negative: the model predicted No, and the real or actual
value was also No.
True Positive: the model predicted Yes, and the actual value was
also Yes.
Confusion Matrix
False Negative: the model predicted No, but the actual value was
Yes. It is also called a Type-II error.
False Positive: the model predicted Yes, but the actual value was
No. It is also called a Type-I error.
Need for Confusion Matrix in Machine learning
It evaluates the performance of classification models when
they make predictions on test data, and tells how good our
classification model is.
It not only tells us the errors made by the classifier but also the
type of errors, i.e., whether they are Type-I or Type-II errors.
With the help of the confusion matrix, we can calculate the
different parameters for the model, such as accuracy, precision,
recall, etc.
Need for Confusion Matrix in Machine learning
Example: We can understand the confusion matrix using an
example.
Suppose we are trying to create a model that predicts whether or
not a person has a certain disease.
So, the confusion matrix for this is given as:
Need for Confusion Matrix in Machine learning
From the above example, we can conclude that:
The table is given for a two-class classifier, which has two
predictions, "Yes" and "No." Here, Yes means the patient has the
disease, and No means the patient does not have the disease.
The classifier made a total of 100 predictions. Out of 100
predictions, 89 are correct predictions, and 11 are incorrect predictions.
The model predicted "Yes" 32 times and "No" 68 times, whereas
the actual "Yes" occurred 27 times and the actual "No" 73 times.
Evaluation Metrics:
The most popular evaluation metrics used to evaluate the results of ML
models (usually classifiers) are: Accuracy, Precision, Recall, and F1-Score.
Classification Accuracy: one of the most important metrics for
classification problems. It measures how often the model predicts
the correct output, and is calculated as the ratio of the number
of correct predictions made by the classifier to the total number
of predictions made:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Evaluation Metrics:
Precision: of all the instances the model predicted as positive,
how many were actually positive. It can be calculated using the
formula below:
Precision = TP / (TP + FP)
Precision is the ratio of the correctly identified positive cases to all
the predicted positive cases, i.e. the cases predicted as positive
whether correctly or incorrectly.
Evaluation Metrics:
Recall: of the total actually positive instances, how many our model
predicted correctly. The recall should be as high as possible.
Recall, also known as sensitivity, is the ratio of the correctly
identified positive cases to all the actual positive cases, which is
the sum of the "False Negatives" and "True Positives":
Recall = TP / (TP + FN)
Evaluation Metrics:
F-measure: if one model has low precision and high recall or
vice versa, it is difficult to compare the models. For this
purpose, we can use the F-score. This score helps us evaluate
recall and precision at the same time. The F-score is maximal when
the recall equals the precision. It can be calculated using the
formula below:
F-score = 2 / (1/recall + 1/precision) = 2 · (precision × recall) / (precision + recall)
Evaluation Metrics:
Problem statement: prediction of Corona patients.
Let us assume that our model is trying to predict Corona patients,
and that its performance is summarized by the 2×2 matrix below.
Evaluation Metrics:
Let's understand TP, FP, FN, TN in terms of the Coronavirus-affected-people
analogy.
True Positive: the model predicted positive and it's true.
The model predicted that a person is Corona positive, and he actually has Corona.
True Negative: the model predicted negative and it's true.
The model predicted that a person is Corona negative, and he actually
does NOT have Corona.
False Positive (Type 1 Error): the model predicted positive and it's false.
The model predicted that a person is Corona positive, but he actually did NOT
have Corona.
False Negative (Type 2 Error): the model predicted negative and it's false.
The model predicted that a person is Corona negative, but actually he was
Corona positive.
Evaluation Metrics:
From the above table (TP = 560, TN = 330, FP = 60, FN = 50), the
accuracy of our machine learning model is calculated as follows:
Accuracy = (560 + 330) / (560 + 330 + 60 + 50) = 890/1000 = 0.89, i.e. 89%
accuracy.
This means 89% of the values (TP + TN) were correctly classified by the
model.
Recall = 560 / (560 + 50) = 560/610 ≈ 0.918
This means 91.8% of the actual positive values were correctly classified.
Precision = 560 / (560 + 60) = 560/620 ≈ 0.903
This means 90.3% of the values predicted as positive were actually positive.
F1 Score = 2 × 0.903 × 0.918 / (0.903 + 0.918) ≈ 1.658/1.821 ≈ 0.91
The F1 Score shows the balance between precision and recall.
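These calculations can be verified with a few lines of Python using the counts above:

TP, TN, FP, FN = 560, 330, 60, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)            # 0.89
recall = TP / (TP + FN)                               # ~0.918
precision = TP / (TP + FP)                            # ~0.903
f1 = 2 * precision * recall / (precision + recall)    # ~0.91

print(accuracy, round(recall, 3), round(precision, 3), round(f1, 2))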
Evaluation Metrics:
Example 2:
This means that the classifier correctly predicted a male person in 42
cases and wrongly predicted 8 male instances as female. It correctly
predicted 32 instances as female; 18 cases were wrongly predicted
as male instead of female.
Evaluation Metrics:
The classifier in our previous example correctly predicted
42 male instances and 32 female instances.
Therefore, the accuracy can be calculated as:
accuracy = (42 + 32) / (42 + 8 + 18 + 32) = 74/100 = 0.74, or 74%
Thanks!