Chapter Four
MACHINE LEARNING
By:
Irandufa Indebu
irandufa.indebu@haramaya.edu.et
Outline
Overview of machine learning
Supervised Learning (Classification and Regression)
Unsupervised Learning (with more focus on Clustering)
Model Evaluation Methods
Overview of Machine Learning
Machine learning falls under the larger umbrella of artificial
intelligence.
Artificial intelligence is a branch of computer science that
includes reasoning, natural language processing, planning, and
machine learning.
Data science encompasses both artificial intelligence and
machine learning.
Overview of Machine Learning
Machine learning is the field of study that develops the
algorithms that computers follow in order to identify and
extract patterns from data.
ML allows us to “teach” computers how to perform tasks by
providing examples of how they should be done.
It gives computers the ability to learn without being explicitly
programmed
It is programming computers to optimize a performance criterion
using example data or past experience
Overview of Machine Learning
ML algorithms and techniques are applied primarily during the
modeling stage of CRISP-DM. ML involves a two-step process.
First, an ML algorithm is applied to a data set to identify useful
patterns in the data
These patterns can be represented in a number of different ways
The model may be predictive to make predictions in the future,
or descriptive to gain knowledge from data, or both.
Overview of Machine Learning
A pattern may be represented as a decision tree, a regression
model, or a neural network.
These representations of patterns are known as “models,” which
is why this stage of the CRISP-DM life cycle is known as the
“modeling stage.”
Once a model has been created and represented, it is used for
analysis.
ML depends on:
• Statistics: making inferences from sample data.
• Numerical algorithms (linear algebra, optimization): to optimize
criteria and manipulate models.
• Computer science: data structures and programs that solve an
ML problem efficiently.
Some Application of ML
Fraud detection
• Algorithms are now capable of detecting when a financial
transaction has the characteristics of fraud.
• Companies can even spot fake reviews by recognizing word
patterns and review-posting timings that are more likely to
be fake.
Speech or face recognition: devices use speech recognition to
understand what we are saying and to respond to our requests. Social
media uses complex data analysis to recognize patterns in our photos
and can tell who is in a picture before we even begin tagging.
Some Application of ML
Medical diagnosis: machine learning can use image recognition to
diagnose x-rays, using imaging-scan data from several past patients
to make predictions about new patients.
Credit scoring: classify customers into high- and low-risk,
based on their income and savings, using data about past loans
(whether they were paid or not).
Regression: the labels to be predicted are continuous.
• E.g., predict the price of a car from its mileage.
Some Application of ML
Stock Predictions: Stock traders look at many variables to decide
on what to do with a stock, whether they want to buy or sell or
wait it out.
They look at certain characteristics of a stock, and trends in the
market environment to make an educated guess on what they
should do.
ML Pipeline
The high-level components of a machine learning system are
outlined in the following diagram.
This diagram illustrates the machine learning pipeline from
which we obtain data and in which we store data.
ML Pipeline
We then transform it into a form that is usable as input to a
machine learning model; train, test, and refine our model; and
then, deploy the final model to our production system.
The process is then repeated as new data is generated
Types of ML
Some of the main types of machine learning are:
1. Supervised Learning, in which the training data is labeled with
the correct answers, e.g. “spam” vs. “ham,” or “fraudulent” vs.
“non-fraudulent.”
Types of ML
2. Unsupervised learning, in which we are given a collection of
unlabeled data, which we wish to analyze and discover patterns
within. The two most important examples are dimension
reduction and clustering.
3. Reinforcement learning, in which an agent (e.g., a robot or
controller) seeks to learn the optimal actions to take based on the
outcomes of past actions.
Supervised Learning
Supervised Learning
In supervised learning, programmers use labeled data: the outcome
we are looking for in the data is already predetermined.
Supervised learning is “supervised” because each of the instances
in the data set lists both the input values and the output (target)
value for each instance
For supervised learning to take place, each instance in the data
set must be labeled with the value of the target attribute.
Supervised Learning
We are trying to find a relationship between an input X and an
output Y that we have chosen:
X (input) → Y (output)
After you find a relationship between X and Y, you get a model, which
will predict an outcome based on those relationships that your machine
has observed in the data.
Once we have estimated this relationship, we can predict Y from X.
Supervised Learning
The goal of supervised learning is to learn a function that maps
from the values of the attributes describing an instance to the value
of another attribute, known as the target attribute of that instance.
The two most common types of supervised learning are
classification (where the outputs are discrete labels, as in spam
filtering) and regression (where the outputs are real-valued).
An easy way to distinguish between classification and regression
tasks is to ask whether there is some kind of continuity in the
output
Classification
Classification requires labeled data and creates non-continuous
predictions.
Classification models are probably the most widely used part of
machine learning and data science.
Classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
In a classification problem, a label is a member of a finite set of
classes
Classification
The first type of classification is binary classification
With binary classification, the data is classified into two
categories. The size of the set of classes is two (“sick”/“healthy”,
“spam”/“not_spam”),
We call it binary classification because there are only two
possible categories and all of our data falls into one or the other.
But there are instances when we have more than two categories,
and for this, we use multi-class classification models
Classification
Multiclass classification (also called multinomial) is a classification
problem with three or more classes
E.g., a sentiment analysis result: positive, negative, or neutral.
Classification methods
Goal: predict the class Ci = f(x1, x2, …, xn)
There are various classification methods. Popular classification
techniques include the following.
K-nearest neighbor
Decision tree classifier: divide decision space into piecewise
constant regions.
Neural networks: partition by non-linear boundaries
Bayesian network: a probabilistic model
Support vector machine
K Nearest Neighbor
K-nearest neighbors is one of the most straightforward and
widely used methods of data classification.
It is a supervised machine learning classifier that uses the
observations it memorizes from a training dataset to predict
classifications for new, unlabeled observations.
kNN makes its predictions based on similarity: the more similar
the training observations are to the new, incoming observation,
the more likely it is that the classifier will assign both the
same class.
K Nearest Neighbor
In KNN, a new data point is classified by the majority vote of
its K nearest neighbors.
The purpose of this algorithm is to classify a new object based on
attributes and training samples (xi, f(xi)), i = 1…N.
Given a query point, we find the K objects (training points)
closest to the query point.
The classification uses a majority vote among the
classifications of the K objects.
The K nearest neighbor algorithm uses the neighborhood
classification as the prediction value of the new query
instance.
K Nearest Neighbor
K nearest neighbor algorithm is very simple. It works based on
minimum distance from the query instance to the training samples
to determine the K-nearest neighbors.
How to compute the K-Nearest Neighbor (KNN) algorithm:
Determine the parameter K = the number of nearest neighbors
Calculate the distance between the query instance and all the
training samples
We can use the Euclidean distance
Sort the distances and determine the nearest neighbors based on
the K-th minimum distance
Gather the categories of the nearest neighbors
Use the simple majority of the categories of the nearest neighbors
as the prediction value of the query instance
Any ties can be broken at random. These steps are sketched in code below.
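A minimal Python sketch of this procedure (the function name knn_classify is ours, not from any library), using squared Euclidean distance and a simple majority vote:

from collections import Counter

def knn_classify(query, samples, labels, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 2: squared Euclidean distance from the query to every training sample
    dists = [sum((q - s) ** 2 for q, s in zip(query, sample)) for sample in samples]
    # Step 3: sort by distance and keep the indices of the k nearest neighbors
    nearest = sorted(range(len(samples)), key=lambda i: dists[i])[:k]
    # Steps 4-5: gather the neighbors' categories and take a simple majority
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]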
K Nearest Neighbors: Key issues
The key issues involved in training a KNN model include:
1. Setting the variable K (the number of nearest neighbors)
The number of nearest neighbors (K) should be chosen based on
cross-validation over a number of K settings.
k = 1 is a good baseline model to benchmark against.
A good rule of thumb is that K should be less than or equal to
the square root of the total number of training patterns.
K Nearest Neighbors: Key issues
2. Setting the type of distance metric
We need a measure of distance in order to know which instances are
the neighbours.
Assume that we have D attributes for the learning problem. Then one
example point x has elements xt ∈ ℝ, t = 1, …, D.
The distance between two points X and Y is often defined as the
Euclidean distance:
Dist(X, Y) = √( Σ (Xi − Yi)² ), summing over i = 1, …, D
K Nearest Neighbors: Example
We have data from a questionnaire survey (asking people's opinions)
and objective testing, with two attributes (acid durability and
strength), to classify whether a special paper tissue is good or not.
Here are four training samples.
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good
K Nearest Neighbors: Example
Now the factory produces a new paper tissue that passes the
laboratory test with X1 = 3 and X2 = 7. Without undertaking another
expensive survey, can we guess the goodness of the new tissue? Use the
squared Euclidean distance as the similarity measurement.
Steps:
1. Determine the parameter K = number of nearest neighbors
Suppose we use K = 3
2. Calculate the distance between the query-instance and all the training
samples
K Nearest Neighbors: Example
The coordinate of the query instance is (3, 7). Instead of calculating the
Euclidean distance we compute the squared distance, which is faster to
calculate (no square root):
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Square distance to query instance (3, 7)
7 | 7 | (7−3)² + (7−7)² = 16
7 | 4 | (7−3)² + (4−7)² = 25
3 | 4 | (3−3)² + (4−7)² = 9
1 | 4 | (1−3)² + (4−7)² = 13
K Nearest Neighbors: Example
3. Sort the distances and determine the nearest neighbors based on the
K-th minimum distance
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Square distance to query instance (3, 7) | Rank (minimum distance)
7 | 7 | 16 | 3
7 | 4 | 25 | 4
3 | 4 | 9 | 1
1 | 4 | 13 | 2
K Nearest Neighbors: Example
4. Gather the category (Y) of the nearest neighbors. Notice in the
second row, last column, that the category of the nearest neighbor (Y) is
not included because the rank of this data point is more than 3 (= K).
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Square distance to query instance (3, 7) | Rank (minimum distance) | Included in 3-NNs? | Y = Category of NN
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good
K Nearest Neighbors: Example
5. Use the simple majority of the category of the nearest neighbors as
the prediction value of the query instance
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Square distance to query instance (3, 7) | Rank (minimum distance) | Included in 3-NNs? | Y = Category of NN
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good
K Nearest Neighbors: Example
We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new
paper tissue that passes the laboratory test with X1 = 3 and X2 = 7
is included in the Good category. Output for the test instance:
X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Y = Category of NN
3 | 7 | Good
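Running the knn_classify sketch from earlier on this data reproduces the worked example (squared distances 16, 25, 9, 13; neighbors ranked 3, 4, 1, 2):

samples = [(7, 7), (7, 4), (3, 4), (1, 4)]   # (acid durability, strength)
labels = ["Bad", "Bad", "Good", "Good"]
print(knn_classify((3, 7), samples, labels, k=3))   # -> 'Good'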
K Nearest Neighbors: Exercise
Training set
Number | Lines | Line types | Rectangles | Colours | Mondrian?
1 | 6 | 1 | 10 | 4 | No
2 | 4 | 2 | 8 | 5 | No
3 | 5 | 2 | 7 | 4 | Yes
4 | 5 | 1 | 8 | 4 | Yes
5 | 5 | 1 | 10 | 5 | No
6 | 6 | 1 | 8 | 6 | Yes
7 | 7 | 1 | 14 | 5 | No
Test set
Number | Lines | Line types | Rectangles | Colours | Mondrian?
8 | 7 | 2 | 9 | 4 | ?
When to use K Nearest Neighbor?
kNN works best if the dataset is:
Low on noise
Free of outliers
Labeled
Composed only of relevant selected features
Composed of distinguishable groups
KNNs: advantages & Disadvantages
Advantages
Simple and powerful
Requires little training time
Nonparametric architecture
Disadvantages: difficulties with the k-nearest neighbour algorithm
Classification/estimation is slow
We have to calculate the distance of the test case from all training
cases
Memory intensive: we just store the training examples;
when a test example is given, we then find the closest matches
Decision Tree
If the input attributes in a data set are primarily nominal or
ordinal, ML algorithms and models such as decision trees are
more appropriate.
Decision tree is the most powerful and popular tool for
classification and prediction
It is a flowchart like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome
of the test, and each leaf node (terminal node) holds a class label.
A decision tree encodes a set of if-then-else rules in a tree
structure.
Decision Tree
This Figure illustrates a decision tree for deciding whether
an email is spam or not
Rectangles with rounded corners represent tests on attributes,
and the square nodes indicate decision, or classification, nodes
Decision Tree
This tree encodes the following rules:
if the email is from an unknown sender, then it is spam;
if it isn’t from an unknown sender but contains suspicious words, then it is
spam;
if it is neither from an unknown sender nor contains suspicious words, then it
is not spam.
In a decision tree, the decision for an instance is made by
starting at the top of the tree and navigating down through the
tree by applying a sequence of attribute tests to the instance.
Decision Tree
Decision Tree Terminology
Root Node – the node present at the beginning of a decision tree;
from this node the population starts dividing according to various
features.
Decision Nodes – the nodes we get after splitting the root node are
called decision nodes.
Leaf Nodes – the nodes where further splitting is not possible are called
leaf nodes or terminal nodes.
Sub-tree – just as a small portion of a graph is called a sub-graph,
a sub-section of a decision tree is called a sub-tree.
Decision Tree
Shall I play tennis today?
Decision Tree
The goal of decision-tree learning is to find a set of
classification rules that divide the training data set into sets of
instances that have the same value for the target attribute.
One of the strengths of decision trees is that they are simple to
understand
Also it is possible to create very accurate models based on
decision trees
Although decision trees work well with both nominal and ordinal
data, they struggle with numeric data
Decision Tree
In a decision tree, a separate branch descends from each node for
each value in the domain of the attribute tested at the node.
Numeric attributes, however, have an infinite number of values in
their domains, with the implication that a tree would need an
infinite number of branches
One solution to this problem is to transform numeric attributes into
ordinal attributes, although doing so requires the definition of
appropriate thresholds, which can also be difficult.
Decision Tree
Each node in the tree specifies one attribute to test, and the
process descends the tree node by node by choosing the branch
from the current node with the label matching the value of the test
attribute of the instance.
The final decision is the label of the terminating (or leaf) node
that the instance descends to.
Each path in a decision tree, from root to leaf, defines a
classification rule composed of a sequence of tests
Choosing the Splitting Attribute
At each node, the best attribute is selected for splitting the
training examples using a goodness function. The best attribute is
the one that separates the classes of the training examples fastest,
such that it results in the smallest tree.
Typical goodness functions: information gain, information gain
ratio, and the Gini index.
Information Gain: select the attribute with the highest information gain,
i.e., the one that creates the smallest average disorder.
• First, compute the disorder using Entropy: the expected information
needed to classify objects into classes.
• Second, measure the Information Gain: calculate by how much the
disorder of a set would be reduced by knowing the value of a particular
attribute.
Entropy
The Entropy measures the disorder of a set S containing a total
of n examples, of which n+ are positive and n− are negative, and it
is given by:
Entropy(S) = D(n+, n−) = −(n+/n)·log₂(n+/n) − (n−/n)·log₂(n−/n)
Some useful properties of the Entropy: it is 0 when all the examples
belong to the same class, and it is maximal (1, for a binary problem)
when the examples are evenly split between the classes.
Information Gain
The Information Gain measures the expected reduction in entropy due
to splitting on an attribute A:
GAIN_split = Entropy(S) − Σ (ni/n)·Entropy(i), summing over the partitions i = 1, …, k
where the parent node S is split into k partitions, and ni is the number
of records in partition i.
Information Gain measures the reduction in entropy achieved
because of the split. Choose the split that achieves the most reduction
(maximizes GAIN).
Example 1: The problem of “Sunburn”
You want to predict whether a person is likely to get
sunburned if he goes back to the beach. How can you do this?
Data collected: predict based on the observed properties of the
people.
Name | Hair | Height | Weight | Lotion | Result
Sarah | Blonde | Average | Light | No | Sunburned
Dana | Blonde | Tall | Average | Yes | None
Alex | Brown | Short | Average | Yes | None
Annie | Blonde | Short | Average | No | Sunburned
Emily | Red | Average | Heavy | No | Sunburned
Pete | Brown | Tall | Heavy | No | None
John | Brown | Average | Heavy | No | None
Kate | Blonde | Short | Light | Yes | None
Example 1: The problem of “Sunburn”
Attribute Selection by Information Gain to construct the
optimal decision tree:
Entropy: the disorder of “Sunburned”
D({“Sarah”, “Dana”, “Alex”, “Annie”, “Emily”, “Pete”, “John”, “Kate”})
= D(3+, 5−) = −(3/8)·log₂(3/8) − (5/8)·log₂(5/8) = 0.954
Example 1: The problem of “Sunburn”
Calculate the Average Disorder Associated with Hair Colour
D(S_blonde) = D({“Sarah”, “Annie”, “Dana”, “Kate”}) = D(2+, 2−) = 1
Its weight in the average disorder is |S_blonde| / |S| = 4/8 = 0.5,
so the blonde term contributes 0.5 × 1 = 0.5.
The second and third terms of the sum:
• Sred = {“Emily”}
• Sbrown = { “Alex”, “Pete”, “John”}.
• These are both 0 because within each set all the examples have
the same class
• So the average disorder created when splitting on ‘hair colour’ is
0.5+0+0=0.5
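These entropy and gain calculations are easy to verify in code. A minimal Python sketch (the function names are ours) reproduces the numbers above:

from math import log2

def entropy(counts):
    """Entropy of a set, given the number of examples in each class."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, partitions):
    """GAIN_split = Entropy(S) - sum over partitions of (n_i / n) * Entropy(i)."""
    n = sum(parent_counts)
    avg_disorder = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(parent_counts) - avg_disorder

print(round(entropy([3, 5]), 3))   # 0.954: the disorder of 3 sunburned vs 5 none
# Splitting on hair colour: blonde = (2+, 2-), red = (1+, 0-), brown = (0+, 3-)
print(round(information_gain([3, 5], [(2, 2), (1, 0), (0, 3)]), 3))   # 0.454 (= 0.954 - 0.5)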
Example 1: The problem of “Sunburn”
Which decision variable minimises the disorder?
Note: The attribute "hair color" is selected as the first test because it
minimizes the entropy
Example 1: The problem of “Sunburn”
The best decision tree?
Once we have finished with hair colour, we then need to calculate
the remaining branches of the decision tree.
Which attribute is better for classifying the remaining instances?
Example 1: The problem of “Sunburn”
The attribute "lotion" is selected because it minimizes the
entropy in the blonde hair subset.
The best Decision Tree
Thus, this is the completed decision tree: using the "hair color" and
"lotion" tests together ensures the proper identification of all the samples.
This is the simplest and optimal tree possible, and it makes a lot of sense.
It classifies 4 of the people on the hair colour alone.
Example 1: The problem of “Sunburn”
You can view the decision tree as an IF-THEN-ELSE statement
which tells us whether someone will suffer from sunburn:
If (hair-colour = “red”) then
return (sunburned = yes)
else if (hair-colour = “blonde” and lotion-used = “no”) then
return (sunburned = yes)
else
return (sunburned = no)
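As a small illustration, the same rule set can be written directly as a Python function (a sketch; the function name is ours):

def is_sunburned(hair_colour, lotion_used):
    # Root test: red hair -> sunburned
    if hair_colour == "red":
        return True
    # Second test: blonde hair without lotion -> sunburned
    if hair_colour == "blonde" and lotion_used == "no":
        return True
    # All other paths -> not sunburned
    return False

print(is_sunburned("blonde", "no"))   # True  (e.g., Sarah)
print(is_sunburned("brown", "no"))    # False (e.g., Pete)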
Strengths and Weakness of Decision Tree approach
The strengths of decision tree methods are:
Decision trees are able to generate understandable rules.
Decision trees perform classification without requiring much
computation.
Decision trees are able to handle both continuous and categorical
variables.
Decision trees provide a clear indication of which fields are most
important for prediction or classification.
Strengths and Weakness of Decision Tree approach
The weaknesses of decision tree methods are:
Decision trees are less appropriate for estimation tasks where the
goal is to predict the value of a continuous attribute.
Decision trees are prone to errors in classification problems with
many classes and a relatively small number of training examples.
Decision trees can be computationally expensive to train; the
process of growing a decision tree is computationally expensive.
Support vector machines: reading assignment.
Naïve Bayes Classifier
The Naïve Bayes algorithm is an ML algorithm used for classification
problems.
It is primarily used for classification involving high-dimensional
training data.
Examples: spam filtration and document classification.
It is based on Bayes' theorem and is a probabilistic classifier.
Find the probability of a previously unseen instance belonging
to each class, then simply pick the most probable class.
Naïve Bayes Classifier
Probability Basics:
•Prior, conditional and joint probability
– Prior probability: P(X )
– Conditional probability: P( X1 |X2 ), P(X2 | X1 )
– Joint probability: X = ( X1 , X2 ), P( X ) = P(X1 ,X2 )
– Relationship: P(X1 ,X2 ) = P( X2 | X1 )P( X1 ) = P( X1 | X2 )P( X2 )
– Independence: P( X2 | X1 ) = P( X2 ), P( X1 | X2 ) = P( X1 ), P(X1 ,X2 ) = P( X1 )P( X2 )
• Bayes' Rule:
P(C | X) = P(X | C) P(C) / P(X), i.e., Posterior = (Likelihood × Prior) / Evidence
Naïve Bayes Classifier
Bayesian classifiers use Bayes' theorem, which says:
p(cj | d) = p(d | cj) p(cj) / p(d)
where p(cj | d) = probability of instance d being in class cj;
p(d | cj) = probability of generating instance d given class cj
(we can imagine that being in class cj causes you to have feature
d with some probability);
p(cj) = probability of occurrence of class cj (this is just how
frequent the class cj is in our database);
p(d) = the probability of instance d occurring.
Naïve Bayes Classifier
Example: Play Tennis
Naïve Bayes Classifier
•Learning Phase
Naïve Bayes Classifier
•Test Phase
– Given a new instance,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High,
Wind=Strong)
Look up tables
P(Outlook=Sunny | Play=Yes) = 2/9        P(Outlook=Sunny | Play=No) = 3/5
P(Temperature=Cool | Play=Yes) = 3/9     P(Temperature=Cool | Play=No) = 1/5
P(Humidity=High | Play=Yes) = 3/9        P(Humidity=High | Play=No) = 4/5
P(Wind=Strong | Play=Yes) = 3/9          P(Wind=Strong | Play=No) = 3/5
P(Play=Yes) = 9/14                       P(Play=No) = 5/14
MAP rule:
P(Yes | x’) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] · P(Play=Yes) = 0.0053
P(No | x’) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] · P(Play=No) = 0.0206
Given that P(Yes | x’) < P(No | x’), we label x’ as “No”.
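The test-phase arithmetic above can be checked with a few lines of Python (a sketch; Fraction keeps the products exact):

from fractions import Fraction as F

# Conditional probabilities for x' = (Sunny, Cool, High, Strong), from the lookup tables
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)   # ~0.0053
score_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)    # ~0.0206

print(float(score_yes), float(score_no))
print("No" if score_no > score_yes else "Yes")   # label for x' -> "No"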
Naïve Bayes Classifier
•Learning Phase
Naïve Bayes Classifier
Using a Bayes classifier, predict the sex of an individual with
(over170cm = no, eye = brown, hair_length = long).
Naïve Bayes Classifier
Age | Income | Student | Credit rating | Buys computer
Youth | High | No | Fair | No
Youth | High | No | Excellent | No
Mid_age | High | No | Fair | Yes
Senior | Medium | No | Fair | Yes
Senior | Low | Yes | Fair | Yes
Senior | Low | Yes | Excellent | No
Mid_age | Low | Yes | Excellent | Yes
Youth | Medium | No | Fair | No
Youth | High | Yes | Fair | Yes
Senior | Medium | Yes | Fair | Yes
Youth | Medium | Yes | Excellent | Yes
Mid_age | Medium | No | Excellent | Yes
Mid_age | High | Yes | Fair | Yes
Senior | Medium | No | Excellent | No
Thus, using this table…
Naïve Bayes Classifier
Predict the class label of a tuple x using the Naïve Bayes (NB)
classifier, given the training data in the above table.
The data tuples are described by the attributes age, income, student,
and credit rating.
The class label attribute, buys computer, has two distinct values:
yes/no. The new tuple x is given as follows:
X = (age = youth, income = medium, student = yes, credit rating = fair)
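One way to work this exercise is to count the required frequencies directly from the table. A minimal Python sketch (variable names ours) computes P(x | class) · P(class) for each class:

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("mid_age", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("mid_age", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "high", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("mid_age", "medium", "no", "excellent", "yes"),
    ("mid_age", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
x = ("youth", "medium", "yes", "fair")

def score(cls):
    subset = [r for r in rows if r[-1] == cls]
    prior = len(subset) / len(rows)        # P(class)
    likelihood = 1.0
    for i, value in enumerate(x):          # naive independence assumption
        likelihood *= sum(r[i] == value for r in subset) / len(subset)
    return prior * likelihood

print(score("yes"), score("no"))   # ~0.028 vs ~0.007 -> x is classified as "yes"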
Naïve Bayes Classifier
Advantage and disadvantage of Naïve Bayes classifier:
Advantage:
Fast to train (a single scan); fast to classify
Training is very easy and fast: it just requires considering each
attribute in each class separately
Not sensitive to irrelevant features
Handles real and discrete data
Handles streaming data well
Testing is straightforward: just looking up tables or calculating
conditional probabilities with normal distributions
Disadvantage:
Assumes independence of features
Artificial Neural Networks - Deep Learning
Deep learning is a popular area within data science today.
Deep learning is also just a sexy term for Artificial Neural
Networks (ANN), which have been around for over forty years;
the networks are "deep" in terms of the number of hidden layers they have.
Artificial Neural Networks (ANN), also known as Neural
Networks, are one of the most widely used algorithms within the
field of machine learning.
Neural networks are commonly used in visual and audio
recognition
Artificial Neural Networks - Deep Learning
ANN emphasizes analyzing data in many layers, and was
inspired by the human brain, which visually processes objects
through layers of neurons.
ANN is presented in the form of interconnected neurons that
interact with each other.
Each connection has numeric weight that can be altered and is
based on experience.
The layers or neurons are stacked on top of each other starting
with a broad base.
Artificial Neural Networks - Deep Learning
The bottom layer consists of raw data such as text, images or
sound, which is divided across what we call neurons.
Within each neuron is a collection of data.
Each neuron then sends information up to the layer of neurons
above.
As the information ascends, it becomes less abstract and more
specific, and we can learn more from the data at each layer.
A simple neural network can be divided into input, hidden, and
output layers.
Artificial Neural Networks - Deep Learning
Data is first received by the input layer, which detects
broad features.
The hidden layer(s) then analyze and process that data; as it passes
through each layer (with fewer neurons at each layer), the data
becomes clearer, based on previous computations.
The final result is shown as the output layer
The middle layers are considered hidden layers, because like
human sight we are unable to naturally break down objects into
layered vision
Artificial Neural Networks - Deep Learning
• The neural network in figure 15 has five layers: one input layer on
the left containing three neurons, three hidden layers (the black
circles), and one output layer on the right containing two neurons.
The input layer has three neurons; the first hidden layer has five;
each of the next two hidden layers has four; and the output layer
has two.
The above ANN is a fully connected, feed-forward network.
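To make "fully connected, feed-forward" concrete, here is a minimal NumPy sketch of a forward pass through layers of exactly those sizes (3-5-4-4-2); the weights are random placeholders, since the figure specifies only the architecture:

import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [3, 5, 4, 4, 2]   # input layer, three hidden layers, output layer

# One weight matrix and one bias vector per pair of adjacent layers
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Feed-forward pass: each layer's activations feed the next layer."""
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(a @ W + b)   # weighted sum of inputs, then a nonlinearity
    return a

print(forward(np.array([0.5, -1.0, 2.0])))   # two output activations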
Artificial Neural Networks - Deep Learning
Most modern algorithms, including decision trees and naive
Bayes are considered shallow algorithms, as they do not analyze
information via numerous layers as ANN can.
The power of deep neural networks comes from the fact that they
can automatically learn useful attributes
It is the ability to automatically learn complex mappings of input
data to useful attribute representations that has made deep-
learning models so accurate in tasks with high-dimensional inputs
(such as image and text processing).
Regression
Both linear regression and logistic regression: reading
assignment.
Unsupervised Learning
Unsupervised Learning
Unsupervised machine learning uses unlabeled data
Unsupervised learning allows us to find patterns that would
otherwise be unobservable.
The algorithm must discover patterns on its own
Clustering
The most common type of unsupervised machine learning is
clustering.
Clustering methods find similarities between instances and group
similar instances together.
In clustering, the algorithm looks for clusters of instances that are
more similar to each other than they are to other instances in the
data.
A challenge for clustering is figuring out how to measure
similarity
Clustering
An unsupervised clustering algorithm will look for groups of
rows that are more similar to each other than they are to the other
rows in the data.
Each of these groups of similar rows defines a cluster of similar
instances.
Clustering example
Clustering
Simply put, a clustering algorithm computes the distance
between groupings and divides data points into multiple groups
based on their relational distance to one another.
Unlike classification, which starts with predefined labels
reflected in the database table, clustering creates its own labels
after clustering the data set.
Analysis by clustering can be used in various scenarios such as
pattern recognition, image processing and market research.
Clustering
For example, clustering can be applied to uncover customers that
share similar purchasing behavior
By understanding a particular cluster of customer purchasing
preferences, we can then form decisions on which products we can
recommend to the group based on their commonalities.
Clustering
The general goals of a clustering algorithm are:
Instances in the same group should be similar
Instances in different groups should be different
But the best clustering is hard to define, since we don't have a
test error
Generally, there is no single best method in unsupervised learning
So why cluster?
You may want to know what groups are there
You may want to know to which group a new instance x belongs
Clustering Algorithms
K-means Clustering Algorithm
Given new data points, k-Means will assign them to the closest
cluster center.
K-means Clustering Algorithm
k-Means clustering is one of the simplest and most commonly
used clustering algorithms
It tries to find cluster centers that are representative of certain
regions of the data.
The algorithm alternates between two steps: assigning each data
point to the closest cluster center, and then setting each cluster
center as the mean of the data points that are assigned to it.
The algorithm is finished when the assignment of instances to
clusters no longer changes.
K-means Clustering Algorithm
Input:
The number of clusters 'k' (a hyperparameter)
An initial guess of the center (the 'mean') of each cluster
Output: clustered instances
Algorithm: assign each xi to its closest mean;
update each mean based on the assignment;
repeat until convergence (see the sketch below).
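A minimal NumPy sketch of this loop (the function name is ours) might look as follows:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Alternate assignment and mean-update steps until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial guess of means
    assign = None
    for _ in range(n_iter):
        # Assign each X_i to its closest mean (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and (new_assign == assign).all():
            break   # convergence: the assignment no longer changes
        assign = new_assign
        # Update each mean to the average of the points assigned to it
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centers, labels = k_means(X, k=2)
print(labels)   # the two low points and the two high points form separate clusters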
KNN vs K-means
Don't confuse KNN classification with K-means clustering: KNN is a
supervised classifier that requires labeled data, while K-means is an
unsupervised algorithm that groups unlabeled data into clusters.
Challenges in unsupervised learning
A major challenge in unsupervised learning is evaluating whether the
algorithm learned something useful
Unsupervised learning algorithms are usually applied to data that
does not contain any label information, so we don’t know what the
right output should be
Therefore it is very hard to say whether a model “did well”
Most of the time, unsupervised learning is used as a preprocessing
step for supervised learning, rather than as an end in itself.
Evaluation of Machine learning algorithms
Confusion Matrix
A confusion matrix, also called a contingency table or error
matrix, is used to visualize the performance of a classifier
The columns of the matrix represent the instances of the
predicted classes and the rows represent the instances of the actual
class.
In the case of binary classification the table has 2 rows and 2
columns
Confusion Matrix
For 2 prediction classes, the matrix is a 2×2 table; for
3 classes, it is a 3×3 table, and so on.
The matrix has two dimensions, predicted values and actual
values, along with the total number of predictions.
Predicted values are the values predicted by the model,
and actual values are the true values for the given observations.
It looks like the table below:
            | Predicted: No        | Predicted: Yes
Actual: No  | True Negative (TN)   | False Positive (FP)
Actual: Yes | False Negative (FN)  | True Positive (TP)
Confusion Matrix
True Negative: the model predicted No, and the real or actual
value was also No.
True Positive: the model predicted Yes, and the actual value was
also Yes.
Confusion Matrix
False Negative: the model predicted No, but the actual value was
Yes. It is also called a Type-II error.
False Positive: the model predicted Yes, but the actual value was
No. It is also called a Type-I error.
Need for Confusion Matrix in Machine learning
It evaluates the performance of classification models when
they make predictions on test data, and tells how good our
classification model is.
It not only tells us the errors made by the classifier but also the
type of errors, i.e., whether they are Type-I or Type-II errors.
With the help of the confusion matrix, we can calculate the
different parameters for the model, such as accuracy, precision,
recall, etc.
Need for Confusion Matrix in Machine learning
Example: We can understand the confusion matrix using an
example.
Suppose we are trying to create a model that predicts whether or
not a person has a certain disease.
So, the confusion matrix for this is given as:
Need for Confusion Matrix in Machine learning
From the above example, we can conclude that:
The table is given for a two-class classifier, which has two
predictions, "Yes" and "No." Here, Yes means the patient has the
disease, and No means the patient does not have the disease.
The classifier made a total of 100 predictions. Out of 100
predictions, 89 are correct predictions, and 11 are incorrect predictions.
The model predicted "Yes" 32 times and "No" 68 times, whereas
the actual "Yes" occurred 27 times and the actual "No" 73 times.
Evaluation Metrics:
The most popular evaluation metrics used to evaluate the results of ML
models (usually classifiers) are: Accuracy, Precision, Recall, and F1-Score.
Classification Accuracy: one of the most important metrics for
classification problems. It measures how often the model predicts
the correct output, and is calculated as the ratio of the number
of correct predictions made by the classifier to the total number
of predictions made:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Evaluation Metrics:
Precision: of all the instances the model predicted as positive,
how many were actually positive. It can be calculated using the
formula below:
Precision = TP / (TP + FP)
Precision is the ratio of the correctly identified positive cases to all
the predicted positive cases, i.e. the cases predicted as positive
whether correctly or incorrectly.
Evaluation Metrics:
Recall: of the total actually positive instances, how many our model
predicted correctly. The recall should be as high as possible.
Recall, also known as sensitivity, is the ratio of the correctly
identified positive cases to all the actual positive cases, which is
the sum of the "False Negatives" and "True Positives":
Recall = TP / (TP + FN)
Evaluation Metrics:
F-measure: if one model has low precision and high recall or
vice versa, it is difficult to compare the models. For this
purpose, we can use the F-score. This score helps us evaluate
recall and precision at the same time. The F-score is maximal when
the recall equals the precision. It can be calculated using the
formula below:
F-score = 2 / (1/recall + 1/precision) = 2 · (precision × recall) / (precision + recall)
Evaluation Metrics:
Problem statement: prediction of Corona patients.
Let us assume that our model is trying to predict Corona patients,
and that its performance is summarized by the 2×2 matrix below.
Evaluation Metrics:
Let's understand TP, FP, FN, TN in terms of the Coronavirus-affected-people
analogy.
True Positive: the model predicted positive and it's true.
The model predicted that a person is Corona positive, and he actually has Corona.
True Negative: the model predicted negative and it's true.
The model predicted that a person is Corona negative, and he actually
does NOT have Corona.
False Positive (Type 1 Error): the model predicted positive and it's false.
The model predicted that a person is Corona positive, but he actually did NOT
have Corona.
False Negative (Type 2 Error): the model predicted negative and it's false.
The model predicted that a person is Corona negative, but actually he was
Corona positive.
Evaluation Metrics:
From the above table (TP = 560, TN = 330, FP = 60, FN = 50), the
accuracy of our machine learning model is calculated as follows:
Accuracy = (560 + 330) / (560 + 330 + 60 + 50) = 890/1000 = 0.89, i.e. 89%
accuracy.
This means 89% of the values (TP + TN) were correctly classified by the
model.
Recall = 560 / (560 + 50) = 560/610 ≈ 0.918
This means 91.8% of the actual positive values were correctly classified.
Precision = 560 / (560 + 60) = 560/620 ≈ 0.903
This means 90.3% of the values predicted as positive were actually positive.
F1 Score = 2 × 0.903 × 0.918 / (0.903 + 0.918) ≈ 1.658/1.821 ≈ 0.91
The F1 Score shows the balance between precision and recall.
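These calculations can be verified with a few lines of Python using the counts above:

TP, TN, FP, FN = 560, 330, 60, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)            # 0.89
recall = TP / (TP + FN)                               # ~0.918
precision = TP / (TP + FP)                            # ~0.903
f1 = 2 * precision * recall / (precision + recall)    # ~0.91

print(accuracy, round(recall, 3), round(precision, 3), round(f1, 2))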
Evaluation Metrics:
Example 2:
This means that the classifier correctly predicted a male person in 42
cases and wrongly predicted 8 male instances as female. It correctly
predicted 32 instances as female; 18 cases were wrongly predicted
as male instead of female.
Evaluation Metrics:
The classifier in our previous example correctly predicted
42 male instances and 32 female instances.
Therefore, the accuracy can be calculated as:
accuracy = (42 + 32) / (42 + 8 + 18 + 32) = 74/100 = 0.74, or 74%
Thanks!