
Chapter Four

MACHINE LEARNING

By:
Irandufa Indebu

irandufa.indebu@haramaya.edu.et
Outlines
 Overview of machine learning
 Supervised Learning (Classification and Regression)
 Unsupervised Learning (more focus on clustering)
 Model Evaluation Methods

Overview of Machine Learning
 Machine learning falls under the larger umbrella of artificial intelligence.
 Artificial intelligence is a branch of computer science that includes reasoning, natural language processing, planning, and machine learning.
 Data science encompasses both artificial intelligence and machine learning.
Overview of Machine Learning
 Machine learning is the field of study that develops the algorithms that computers follow in order to identify and extract patterns from data.
 ML allows us to "teach" computers how to perform tasks by providing examples of how they should be done.
 It gives computers the ability to learn without being explicitly programmed.
 It is programming computers to optimize a performance criterion using example data or past experience.
Overview of Machine Learning
 ML algorithms and techniques are applied primarily during the modeling stage of CRISP-DM. ML involves a two-step process.
 First, an ML algorithm is applied to a data set to identify useful patterns in the data.
 These patterns can be represented in a number of different ways.
 The model may be predictive (to make predictions about the future), descriptive (to gain knowledge from data), or both.
Overview of Machine Learning
 Patterns may be represented as decision trees, regression models, or neural networks.
 These representations of patterns are known as "models," which is why this stage of the CRISP-DM life cycle is known as the "modeling stage."
 Once a model has been created and represented, it is used for analysis.
 ML depends on:
• Statistics: making inferences from sample data.
• Numerical algorithms (linear algebra, optimization): optimize criteria, manipulate models.
• Computer science: data structures and programs that solve a ML problem efficiently.
Some Applications of ML
 Fraud detection
• Algorithms are now capable of detecting when a financial transaction has the characteristics of fraud.
• Companies can even spot fake reviews by recognizing word patterns and posting times that are more likely to be fake.
 Speech or face recognition: speech recognition is used to understand what we are saying and to respond to our requests. Social media uses complex data analysis to recognize patterns in our photos and can tell who is in a picture before we even begin tagging.
Some Applications of ML
 Medical diagnosis: machine learning can use image recognition to diagnose x-rays, taking imaging scans from several patients in order to make predictions about new patients.
 Credit scoring: classify customers into high- and low-risk, based on their income and savings, using data about past loans (whether they were paid or not).
 Regression: the labels to be predicted are continuous, e.g. predicting the price of a car from its mileage.
Some Applications of ML
 Stock predictions: stock traders look at many variables to decide what to do with a stock, whether to buy, sell, or wait it out.
 They look at certain characteristics of a stock, and at trends in the market environment, to make an educated guess about what they should do.
ML Pipeline
 The high-level components of a machine learning system are outlined in the following diagram.
 The diagram illustrates the machine learning pipeline, in which we obtain and store data.
 We then transform it into a form that is usable as input to a machine learning model; train, test, and refine our model; and then deploy the final model to our production system.
 The process is then repeated as new data is generated.
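 To make the pipeline concrete, here is a minimal sketch in Python using scikit-learn. The file name data.csv and its "label" column are hypothetical placeholders; the steps mirror the stages above (obtain, transform, train, evaluate, then deploy and repeat).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("data.csv")                       # 1. obtain data (hypothetical file)
X, y = df.drop(columns="label"), df["label"]       # 2. transform into features and target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)  # hold out a test set
model = KNeighborsClassifier().fit(X_tr, y_tr)     # 3. train the model
print(model.score(X_te, y_te))                     # 4. test/refine; then deploy and repeat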


Types of ML
Some of the main types of machine learning are:
1. Supervised learning, in which the training data is labeled with the correct answers, e.g. "spam" or "ham," "fraudulent" or "non-fraudulent."

Types of ML
2. Unsupervised learning, in which we are given a collection of unlabeled data, which we wish to analyze and discover patterns within. The two most important examples are dimension reduction and clustering.
3. Reinforcement learning, in which an agent (e.g., a robot or controller) seeks to learn the optimal actions to take based on the outcomes of past actions.
Supervised Learning
 In supervised learning, programmers use labeled data.
 The data that we are looking at is already predetermined.
 Supervised learning is "supervised" because each of the instances in the data set lists both the input values and the output (target) value.
 For supervised learning to take place, each instance in the data set must be labeled with the value of the target attribute.
Supervised Learning
 We are trying to find a relationship between the input X and the output Y.
 After you find a relationship between X and Y, you get a model, which will predict an outcome based on the relationships your machine has observed in the data.
 Once we have estimated this relationship, we can predict Y from X.
Supervised Learning
 The goal of supervised learning is to learn a function that maps from the values of the attributes describing an instance to the value of another attribute, known as the target attribute of that instance.
 The two most common types of supervised learning are classification (where the outputs are discrete labels, as in spam filtering) and regression (where the outputs are real-valued).
 An easy way to distinguish between classification and regression tasks is to ask whether there is some kind of continuity in the output.
Classification
 Classification requires labeled data and creates non-continuous predictions.
 Classification models are probably the most widely used part of machine learning and data science.
 Classification classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data.
 In a classification problem, a label is a member of a finite set of classes.
Classification
 The first type of classification is binary classification.
 With binary classification, the data is classified into two categories; the size of the set of classes is two ("sick"/"healthy", "spam"/"not_spam").
 We call it binary classification because there are only two possible categories and all of our data falls into one or the other.
 But there are instances when we have more than two categories, and for this we use multi-class classification models.
Classification
 Multiclass classification (also called multinomial) is a classification problem with three or more classes.
 E.g. a sentiment analysis result may be positive, negative, or neutral.

Classification methods
 Goal: predict the class Ci = f(x1, x2, ..., xn)
 There are various classification methods. Popular classification techniques include the following:
 K-nearest neighbor
 Decision tree classifier: divides the decision space into piecewise constant regions.
 Neural networks: partition by non-linear boundaries.
 Bayesian network: a probabilistic model.
 Support vector machine
K Nearest Neighbor
 K-nearest neighbors is one of the most straightforward and widely used methods of data classification.
 It is a supervised machine learning classifier that uses the observations it memorizes from a training dataset to predict classifications for new, unlabeled observations.
 kNN makes its predictions based on similarity: the more similar the training observations are to the new, incoming observation, the more likely it is that the classifier will assign them both the same class.
K Nearest Neighbor
 In KNN, a new data point is classified by a majority vote among its K nearest neighbors.
 The purpose of this algorithm is to classify a new object based on attributes and training samples: (xi, f(xi)), i = 1..N.
 Given a query point, we find the K objects (training points) closest to the query point.
 The classification uses a majority vote among the classifications of the K objects.
 The K nearest neighbor algorithm uses neighborhood classification as the prediction value of the new query instance.
K Nearest Neighbor
 The K nearest neighbor algorithm is very simple. It works based on the minimum distance from the query instance to the training samples to determine the K nearest neighbors.
How to compute the K-Nearest Neighbor (KNN) algorithm? (A code sketch follows the steps below.)
 Determine the parameter K = number of nearest neighbors.
 Calculate the distance between the query instance and all the training samples.
 We can use Euclidean distance.
 Sort the distances and determine the nearest neighbors based on the K-th minimum distance.
 Gather the categories of the nearest neighbors.
 Use a simple majority of the categories of the nearest neighbors as the prediction value of the query instance.
 Any ties can be broken at random.
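 A minimal sketch of these steps in Python, assuming numeric attributes stored in NumPy arrays (the function name knn_predict is ours, not from any library):

import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    # Step 2: squared Euclidean distance from the query to every training sample
    dists = np.sum((X_train - query) ** 2, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: majority vote among the labels of those neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]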
K Nearest Neighbors: Key issues
 The key issues involved in training a KNN model include:
1. Setting the variable K (number of nearest neighbors)
 The number of nearest neighbors (K) should be chosen by cross-validation over a range of K settings.
 K = 1 is a good baseline model to benchmark against.
 A good rule of thumb is that K should be less than or equal to the square root of the total number of training patterns.
K Nearest Neighbors: Key issues
2. Setting the type of distance metric
 We need a measure of distance in order to know who the neighbours are.
 Assume that we have T attributes for the learning problem. Then one example point x has elements x_t ∈ ℝ, t = 1, ..., T.
 The distance between two points X and Y is often defined as the Euclidean distance:

Dist(X, Y) = √( Σ_{i=1..D} (X_i − Y_i)² ), where D is the number of attributes (here D = T).
K Nearest Neighbors: Example
 We have data from a questionnaire survey (asking people's opinions) and objective testing with two attributes (acid durability and strength) to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good
K Nearest Neighbors: Example
Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without undertaking another expensive survey, can we guess the goodness of the new tissue? Use squared Euclidean distance as the similarity measurement.
Steps:
1. Determine the parameter K = number of nearest neighbors. Suppose we use K = 3.
2. Calculate the distance between the query instance and all the training samples.
K Nearest Neighbors: Example
 The coordinate of the query instance is (3, 7). Instead of calculating the distance we compute the squared distance, which is faster to calculate (no square root):

X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Squared distance to query instance (3, 7)
7 | 7 | (7−3)² + (7−7)² = 16
7 | 4 | (7−3)² + (4−7)² = 25
3 | 4 | (3−3)² + (4−7)² = 9
1 | 4 | (1−3)² + (4−7)² = 13
K Nearest Neighbors: Example
3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance:

X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Squared distance to query instance (3, 7) | Rank (minimum distance)
7 | 7 | 16 | 3
7 | 4 | 25 | 4
3 | 4 | 9 | 1
1 | 4 | 13 | 2
K Nearest Neighbors: Example
4. Gather the category (Y) of the nearest neighbors. Notice in the second row of the table that the category of the nearest neighbor (Y) is not included, because the rank of this data point is greater than 3 (= K).

X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Squared distance to query instance (3, 7) | Rank | Included in 3-NNs? | Y = Category of NN
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good
K Nearest Neighbors: Example
5. Use a simple majority of the categories of the nearest neighbors as the prediction value of the query instance (see the table above: the 3 included neighbors have categories Bad, Good, Good).
K Nearest Neighbors: Example
 We have 2 Good and 1 Bad. Since 2 > 1, we conclude that the new paper tissue that passed the laboratory test with X1 = 3 and X2 = 7 is included in the Good category. Output for the test set:

X1 = Acid Durability (seconds) | X2 = Strength (kg/m2) | Y = Category of NN
3 | 7 | Good
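 As a quick check, the worked example can be reproduced with the knn_predict sketch from earlier (K = 3 and squared Euclidean distance, as above):

X_train = np.array([[7, 7], [7, 4], [3, 4], [1, 4]])
y_train = np.array(["Bad", "Bad", "Good", "Good"])
print(knn_predict(X_train, y_train, np.array([3, 7]), k=3))  # -> "Good"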
K Nearest Neighbors: Exercise
Training set:
Number | Lines | Line types | Rectangles | Colours | Mondrian?
1 | 6 | 1 | 10 | 4 | No
2 | 4 | 2 | 8 | 5 | No
3 | 5 | 2 | 7 | 4 | Yes
4 | 5 | 1 | 8 | 4 | Yes
5 | 5 | 1 | 10 | 5 | No
6 | 6 | 1 | 8 | 6 | Yes
7 | 7 | 1 | 14 | 5 | No

Test set:
Number | Lines | Line types | Rectangles | Colours | Mondrian?
8 | 7 | 2 | 9 | 4 | ?
When to use K Nearest Neighbor?
 kNN works best if the dataset is:
 Low on noise
 Free of outliers
 Labeled
 Composed only of relevant, selected features
 Composed of distinguishable groups

KNNs: Advantages & Disadvantages
Advantages:
 Simple and powerful
 Requires little training time
 Nonparametric architecture
Disadvantages: difficulties with k-nearest neighbour algorithms
 Classification/estimation is slow: we have to calculate the distance of the test case from all training cases
 Memory intensive: we must store all the training examples; when a test example is given we then find the closest matches
Decision Tree
 If the input attributes in a data set are primarily nominal or ordinal, ML algorithms and models such as decision trees are more appropriate.
 The decision tree is one of the most powerful and popular tools for classification and prediction.
 It is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
 A decision tree encodes a set of if-then-else rules in a tree structure.
Decision Tree
 The figure illustrates a decision tree for deciding whether an email is spam or not.
 Rectangles with rounded corners represent tests on attributes, and the square nodes indicate decision, or classification, nodes.
Decision Tree
 This tree encodes the following rules:
 If the email is from an unknown sender, then it is spam.
 If it isn't from an unknown sender but contains suspicious words, then it is spam.
 If it is neither from an unknown sender nor contains suspicious words, then it is not spam.
 In a decision tree, the decision for an instance is made by starting at the top of the tree and navigating down through the tree by applying a sequence of attribute tests to the instance.
Decision Tree
Decision tree terminology:
 Root node: the node present at the beginning of a decision tree; from this node the population starts dividing according to various features.
 Decision nodes: the nodes we get after splitting the root node.
 Leaf nodes: the nodes where further splitting is not possible, also called terminal nodes.
 Sub-tree: just as a small portion of a graph is called a sub-graph, a sub-section of a decision tree is called a sub-tree.
Decision Tree
 Example: Shall I play tennis today? (decision tree figure)
Decision Tree
 The goal of decision-tree learning is to find a set of classification rules that divide the training data set into sets of instances that have the same value for the target attribute.
 One of the strengths of decision trees is that they are simple to understand.
 It is also possible to create very accurate models based on decision trees.
 Although decision trees work well with both nominal and ordinal data, they struggle with numeric data.
Decision Tree
 In a decision tree, a separate branch descends from each node for each value in the domain of the attribute tested at the node.
 Numeric attributes, however, have an infinite number of values in their domains, with the implication that a tree would need an infinite number of branches.
 One solution to this problem is to transform numeric attributes into ordinal attributes, although doing so requires the definition of appropriate thresholds, which can also be difficult.
Decision Tree
 Each node in the tree specifies one attribute to test, and the process descends the tree node by node, choosing at each node the branch whose label matches the value of the test attribute for the instance.
 The final decision is the label of the terminating (leaf) node that the instance descends to.
 Each path in a decision tree, from root to leaf, defines a classification rule composed of a sequence of tests.
Choosing the Splitting Attribute
 At each node, the best attribute is selected for splitting the training examples using a goodness function. The best attribute is the one that separates the classes of the training examples fastest, so that it results in the smallest tree.
 Typical goodness functions: information gain, information gain ratio, and the GINI index.
 Information gain: select the attribute with the highest information gain, i.e. the one that creates the smallest average disorder.
• First, compute the disorder using entropy: the expected information needed to classify objects into classes.
• Second, measure the information gain: calculate by how much the disorder of a set would be reduced by knowing the value of a particular attribute.
Entropy
 The entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n− are negative. It is given by:

D(n+, n−) = Entropy(S) = −(n+/n) log2(n+/n) − (n−/n) log2(n−/n)

 Some useful properties of the entropy:
 Entropy(S) = 0 when all the examples belong to the same class (n+ = 0 or n− = 0).
 Entropy(S) = 1 when the set contains an equal number of positive and negative examples (n+ = n−).
 Otherwise Entropy(S) lies between 0 and 1.

Information Gain
 The information gain measures the expected reduction in entropy due to splitting on an attribute A:

GAIN_split = Entropy(S) − Σ_{i=1..k} (n_i / n) × Entropy(i)

 The parent node S is split into k partitions; n_i is the number of records in partition i.
 Information gain measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
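 A small sketch of these two formulas in Python (entropy and information_gain are our own helper names; labels can be any hashable class values):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum over classes of p * log2(p)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    # Entropy(S) minus the size-weighted entropy of each partition
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder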
Example 1: The problem of "Sunburn"
You want to predict whether another person is likely to get sunburned if he goes back to the beach. How can you do this?
Data collected: predict based on the observed properties of the people.

Name | Hair | Height | Weight | Lotion | Result
Sarah | Blonde | Average | Light | No | Sunburned
Dana | Blonde | Tall | Average | Yes | None
Alex | Brown | Short | Average | Yes | None
Annie | Blonde | Short | Average | No | Sunburned
Emily | Red | Average | Heavy | No | Sunburned
Pete | Brown | Tall | Heavy | No | None
John | Brown | Average | Heavy | No | None
Kate | Blonde | Short | Light | Yes | None
Example 1: The problem of "Sunburn"
 Attribute selection by information gain to construct the optimal decision tree.
 Entropy: the disorder of Sunburned:

D({"Sarah", "Dana", "Alex", "Annie", "Emily", "Pete", "John", "Katie"})
= D(3+, 5−) = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.954
Example 1: The problem of "Sunburn"
 Calculate the average disorder associated with hair colour:
• D(S_blonde) = D({"Sarah", "Annie", "Dana", "Katie"}) = D(2+, 2−) = 1
• Its weight in the average is |S_blonde| / |S| = 4/8 = 0.5, so the blonde term contributes 0.5 × 1 = 0.5.
The second and third terms of the sum:
• S_red = {"Emily"}
• S_brown = {"Alex", "Pete", "John"}
• These are both 0 because within each set all the examples have the same class.
• So the average disorder created when splitting on 'hair colour' is 0.5 + 0 + 0 = 0.5.
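 Checking this with the information_gain sketch from above (labels: "+" = Sunburned, "−" = None; the partitions follow the hair colours in the table):

people = ["+", "-", "-", "+", "+", "-", "-", "-"]        # Sarah .. Kate
blonde = ["+", "-", "+", "-"]                            # Sarah, Dana, Annie, Kate
red, brown = ["+"], ["-", "-", "-"]                      # Emily; Alex, Pete, John
print(information_gain(people, [blonde, red, brown]))    # 0.954 - 0.5 = 0.454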
Example 1: The problem of "Sunburn"
 Which decision variable minimises the disorder?
 Note: the attribute "hair colour" is selected as the first test because it minimizes the entropy.
Example 1: The problem of "Sunburn"
 What is the best decision tree?
 Once we have finished with hair colour, we then need to calculate the remaining branches of the decision tree.
 Which attribute is best for classifying the remaining instances?
Example 1: The problem of "Sunburn"
 The attribute "lotion" is selected because it minimizes the entropy in the blonde-hair subset.
 This completes the decision tree. Using the "hair colour" and "lotion" tests together ensures the proper identification of all the samples.
 This is the simplest tree possible and it makes a lot of sense: it classifies 4 of the people on hair colour alone.
Example 1: The problem of "Sunburn"
 You can view the decision tree as an IF-THEN-ELSE statement which tells us whether someone will suffer from sunburn:

if (hair-colour = "red") then
    return (sunburned = yes)
else if (hair-colour = "blonde" and lotion-used = "no") then
    return (sunburned = yes)
else
    return (sunburned = no)
Strengths and Weaknesses of the Decision Tree approach
 The strengths of decision tree methods are:
 Decision trees are able to generate understandable rules.
 Decision trees perform classification without requiring much computation.
 Decision trees are able to handle both continuous and categorical variables.
 Decision trees provide a clear indication of which fields are most important for prediction or classification.
Strengths and Weaknesses of the Decision Tree approach
 The weaknesses of decision tree methods are:
 Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
 Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
 Decision trees can be computationally expensive to train: the process of growing a decision tree is computationally expensive.
 Support vector machines: reading assignment.

Naïve Bayes Classifier
 The Naïve Bayes algorithm is a ML algorithm used for classification problems.
 It is primarily used for classification involving high-dimensional training data.
 Examples: spam filtering and document classification.
 It is based on Bayes' theorem and is a probabilistic classifier.
 It finds the probability of a previously unseen instance belonging to each class, then simply picks the most probable class.
Naïve Bayes Classifier
Probability basics:
• Prior, conditional and joint probability
– Prior probability: P(X)
– Conditional probability: P(X1 | X2), P(X2 | X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2 | X1) P(X1) = P(X1 | X2) P(X2)
– Independence: P(X2 | X1) = P(X2), P(X1 | X2) = P(X1), P(X1, X2) = P(X1) P(X2)
• Bayes' rule:

P(C | X) = P(X | C) P(C) / P(X), i.e. Posterior = (Likelihood × Prior) / Evidence
Naïve Bayes Classifier
 Bayesian classifiers use Bayes' theorem, which says:

p(cj | d) = p(d | cj) p(cj) / p(d)

 where p(cj | d) = probability of instance d being in class cj;
 p(d | cj) = probability of generating instance d given class cj (we can imagine that being in class cj causes you to have feature d with some probability);
 p(cj) = probability of occurrence of class cj (this is just how frequent the class cj is in our database);
 p(d) = the probability of instance d occurring.
Naïve Bayes Classifier
 Example: Play Tennis (the standard 14-day data set with attributes Outlook, Temperature, Humidity, and Wind, and target Play).
 Learning phase: estimate the prior P(Play) and the conditional probability of each attribute value given each class from the training data. The resulting look-up tables are used in the test phase below.
Naïve Bayes Classifier
• Test phase
– Given a new instance:
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Look up the tables:
P(Outlook=Sunny | Play=Yes) = 2/9 | P(Outlook=Sunny | Play=No) = 3/5
P(Temperature=Cool | Play=Yes) = 3/9 | P(Temperature=Cool | Play=No) = 1/5
P(Humidity=High | Play=Yes) = 3/9 | P(Humidity=High | Play=No) = 4/5
P(Wind=Strong | Play=Yes) = 3/9 | P(Wind=Strong | Play=No) = 3/5
P(Play=Yes) = 9/14 | P(Play=No) = 5/14

MAP rule:
P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No | x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Given the fact that P(Yes | x') < P(No | x'), we label x' to be "No".
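 The same computation as a few lines of Python (the probabilities are copied from the look-up tables above):

p_yes, p_no = 9/14, 5/14                     # class priors
like_yes = (2/9) * (3/9) * (3/9) * (3/9)     # P(Sunny,Cool,High,Strong | Yes)
like_no  = (3/5) * (1/5) * (4/5) * (3/5)     # P(Sunny,Cool,High,Strong | No)
print(like_yes * p_yes, like_no * p_no)      # ~0.0053 vs ~0.0206
print("Yes" if like_yes * p_yes > like_no * p_no else "No")  # -> "No"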


Naïve Bayes Classifier
• Learning phase: build the look-up tables from the training data, as above.
 Exercise: using a Bayes classifier, predict the sex of this individual:
(over170cm = no, eye = brown, hair_length = long)
Naïve Bayes Classifier

Age | Income | Student | Credit rating | Buys computer
Youth | High | No | Fair | No
Youth | High | No | Excellent | No
Mid_age | High | No | Fair | Yes
Senior | Medium | No | Fair | Yes
Senior | Low | Yes | Fair | Yes
Senior | Low | Yes | Excellent | No
Mid_age | Low | Yes | Excellent | Yes
Youth | Medium | No | Fair | No
Youth | High | Yes | Fair | Yes
Senior | Medium | Yes | Fair | Yes
Youth | Medium | Yes | Excellent | Yes
Mid_age | Medium | No | Excellent | Yes
Mid_age | High | Yes | Fair | Yes
Senior | Medium | No | Excellent | No

Thus, using this table…
Naïve Bayes Classifier
 Predict the class label of a tuple x using a Naïve Bayes (NB) classifier, given the training data in the table above.
 The data tuples are described by the attributes age, income, student, and credit rating.
 The class label attribute, buys computer, has two distinct values: yes/no. The new tuple x is given as follows:

X = (age = youth, income = medium, student = yes, credit rating = fair)
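 A hedged sketch of how this prediction could be computed (nb_predict and the column names are our own choices; the estimates are simple frequency ratios with no smoothing):

from collections import Counter

def nb_predict(rows, target, query):
    # argmax over classes c of P(c) * product over attributes a of P(query[a] | c)
    classes = Counter(r[target] for r in rows)
    n = len(rows)
    best, best_score = None, -1.0
    for c, nc in classes.items():
        score = nc / n                                    # prior P(c)
        for attr, value in query.items():
            match = sum(1 for r in rows if r[target] == c and r[attr] == value)
            score *= match / nc                           # conditional P(value | c)
        if score > best_score:
            best, best_score = c, score
    return best

cols = ("age", "income", "student", "credit", "buys")
data = [("youth","high","no","fair","no"), ("youth","high","no","excellent","no"),
        ("mid_age","high","no","fair","yes"), ("senior","medium","no","fair","yes"),
        ("senior","low","yes","fair","yes"), ("senior","low","yes","excellent","no"),
        ("mid_age","low","yes","excellent","yes"), ("youth","medium","no","fair","no"),
        ("youth","high","yes","fair","yes"), ("senior","medium","yes","fair","yes"),
        ("youth","medium","yes","excellent","yes"), ("mid_age","medium","no","excellent","yes"),
        ("mid_age","high","yes","fair","yes"), ("senior","medium","no","excellent","no")]
rows = [dict(zip(cols, r)) for r in data]
query = {"age": "youth", "income": "medium", "student": "yes", "credit": "fair"}
print(nb_predict(rows, "buys", query))   # -> "yes" (score ~0.0282 vs ~0.0069 for "no")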


Naïve Bayes Classifier
Advantages and disadvantages of the Naïve Bayes classifier:
Advantages:
 Fast to train (a single scan) and fast to classify
 Training is very easy and fast: it just requires considering each attribute in each class separately
 Not sensitive to irrelevant features
 Handles real and discrete data
 Handles streaming data well
 Testing is straightforward: just look up tables or calculate conditional probabilities with normal distributions
Disadvantages:
 Assumes independence of features
Artificial Neural Networks - Deep Learning
 Deep learning is a popular area within data science today.
 Deep networks are "deep" in terms of the number of hidden layers they have.
 Deep learning is also just a sexy term for Artificial Neural Networks (ANN), which have been around for over forty years.
 Artificial Neural Networks (ANN), also known as neural networks, are one of the most widely used algorithms within the field of machine learning.
 Neural networks are commonly used in visual and audio recognition.
Artificial Neural Networks - Deep Learning
 ANN emphasizes analyzing data in many layers, and was inspired by the human brain, which visually processes objects through layers of neurons.
 An ANN is presented in the form of interconnected neurons that interact with each other.
 Each connection has a numeric weight that can be altered based on experience.
 The layers of neurons are stacked on top of each other, starting with a broad base.
Artificial Neural Networks - Deep Learning
 The bottom layer consists of raw data such as text, images or sound, which is divided into what we call neurons.
 Within each neuron is a collection of data.
 Each neuron then sends information up to the layer of neurons above.
 As the information ascends it becomes less abstract and more specific, and the more we can learn from the data at each layer.
 A simple neural network can be divided into input, hidden, and output layers.
Artificial Neural Networks - Deep Learning
 Data is first received by the input layer, and this first layer detects broad features.
 The hidden layer(s) then analyze and process that data; as it passes through each layer (which diminish in neuron count at each step) the data becomes clearer, based on previous computations.
 The final result is shown in the output layer.
 The middle layers are considered hidden layers because, like human sight, we are unable to naturally break objects down into layered vision.
Artificial Neural Networks - Deep Learning
• The neural network in figure 15 has five layers: one input layer on the left containing three neurons, three hidden layers (the black circles), and one output layer on the right containing two neurons.
 The input layer has three neurons; the first hidden layer has five; each of the next two hidden layers has four; and the output layer has two.
 This ANN is a fully connected, feed-forward network.
Artificial Neural Networks - Deep Learning
 Most modern algorithms, including decision trees and naïve Bayes, are considered shallow algorithms, as they do not analyze information via numerous layers as an ANN can.
 The power of deep neural networks comes from the fact that they can automatically learn useful attributes.
 It is this ability to automatically learn complex mappings from input data to useful attribute representations that has made deep-learning models so accurate in tasks with high-dimensional inputs (such as image and text processing).
Regression
 Linear regression and logistic regression: reading assignment.
Unsupervised Learning
 Unsupervised machine learning uses unlabeled data.
 Unsupervised learning allows us to find patterns in data that would otherwise remain unobserved.
 The algorithm must discover patterns on its own.
Clustering
 The most common type of unsupervised machine learning is clustering.
 Clustering methods find similarities between instances and group instances accordingly.
 In clustering, the algorithm looks for clusters of instances that are more similar to each other than they are to other instances in the data.
 A challenge for clustering is figuring out how to measure similarity.
Clustering
 An unsupervised clustering algorithm will look for groups of rows that are more similar to each other than they are to the other rows in the data.
 Each of these groups of similar rows defines a cluster of similar instances.
 (Clustering example figure)
Clustering
 Simply put, a clustering algorithm computes the distance between groupings and divides data points into multiple groups based on their relational distance to one another.
 Unlike classification, which starts with predefined labels reflected in the database table, clustering creates its own labels after clustering the data set.
 Analysis by clustering can be used in various scenarios such as pattern recognition, image processing and market research.
Clustering
 For example, clustering can be applied to uncover customers that share similar purchasing behavior.
 By understanding a particular cluster of customer purchasing preferences, we can then decide which products to recommend to the group based on their commonalities.
Clustering
The general goals of a clustering algorithm are:
 Instances in the same group should be similar.
 Instances in different groups should be different.
 But the best clustering is hard to define, since we don't have a test error.
 Generally, there is no best method in unsupervised learning.
So why cluster?
 You may want to know what groups are there.
 You may want to know the group of a new instance x.
Clustering Algorithms
K-means Clustering Algorithm
 Given new data points, k-means will assign them to the closest cluster center.
 k-means clustering is one of the simplest and most commonly used clustering algorithms.
 It tries to find cluster centers that are representative of certain regions of the data.
 The algorithm alternates between two steps: assigning each data point to the closest cluster center, and then setting each cluster center to the mean of the data points that are assigned to it.
 The algorithm is finished when the assignment of instances to clusters no longer changes.
K-means Clustering Algorithm
Input:
 The number of clusters 'k' (a hyperparameter)
 An initial guess of the center (the 'mean') of each cluster
Output: clustered instances
Algorithm (a code sketch follows below):
 Assign each Xi to its closest mean
 Update each mean based on the assignment
 Repeat until convergence
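 A minimal NumPy sketch of these two alternating steps (kmeans is our own helper name; it assumes numeric data in an array of shape (n, d)):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial guess of the means
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # assignments have stabilized
            break
        centers = new_centers
    return centers, labels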


KNN vs K-means
 Don't confuse KNN classification with K-means clustering: KNN is a supervised classifier that needs labeled data, while K-means is an unsupervised clustering algorithm that works on unlabeled data.
Challenges in unsupervised learning
 A major challenge in unsupervised learning is evaluating whether the algorithm learned something useful.
 Unsupervised learning algorithms are usually applied to data that does not contain any label information, so we don't know what the right output should be.
 Therefore it is very hard to say whether a model "did well."
 Most of the time, unsupervised learning is used as a preprocessing step for supervised learning rather than as an end in itself.
Evaluation of Machine Learning Algorithms
Confusion Matrix
 A confusion matrix, also called a contingency table or error matrix, is used to visualize the performance of a classifier.
 The columns of the matrix represent the instances of the predicted classes and the rows represent the instances of the actual classes.
 In the case of binary classification the table has 2 rows and 2 columns.
Confusion Matrix
 For 2 prediction classes the matrix is a 2x2 table, for 3 classes it is a 3x3 table, and so on.
 The matrix is divided into two dimensions: predicted values and actual values, along with the total number of predictions.
 Predicted values are the values predicted by the model, and actual values are the true values for the given observations.
 It looks like the table below:

                | Actual: Yes     | Actual: No
Predicted: Yes  | True Positive   | False Positive
Predicted: No   | False Negative  | True Negative
Confusion Matrix
 True Negative: the model predicted No, and the real or actual value was also No.
 True Positive: the model predicted Yes, and the actual value was also Yes.
 False Negative: the model predicted No, but the actual value was Yes. It is also called a Type-II error.
 False Positive: the model predicted Yes, but the actual value was No. It is also called a Type-I error.
Need for the Confusion Matrix in Machine Learning
 It evaluates the performance of classification models when they make predictions on test data, and tells how good our classification model is.
 It tells not only the error made by the classifier but also the type of error, i.e. whether it is a Type-I or Type-II error.
 With the help of the confusion matrix, we can calculate different parameters for the model, such as accuracy, precision, recall, etc.
Need for the Confusion Matrix in Machine Learning
 Example: we can understand the confusion matrix using an example.
 Suppose we are trying to create a model that can predict whether or not a person has a disease. The confusion matrix for this (the counts follow from the totals discussed on the next slide) is given as:

                | Actual: Yes | Actual: No
Predicted: Yes  | TP = 24     | FP = 8
Predicted: No   | FN = 3      | TN = 65
Need for the Confusion Matrix in Machine Learning
 From the above example, we can conclude that:
 The table is given for a two-class classifier, which has two predictions, "Yes" and "No." Here, Yes means the patient has the disease, and No means the patient does not have the disease.
 The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are correct predictions and 11 are incorrect predictions.
 The model predicted "Yes" 32 times and "No" 68 times, whereas the actual "Yes" occurred 27 times and the actual "No" 73 times.
Evaluation Metrics
 The most popular evaluation metrics used to evaluate the results of ML (usually classification) are: accuracy, precision, recall, and F1-score.
 Classification accuracy: one of the important parameters for classification problems. It defines how often the model predicts the correct output. It can be calculated as the ratio of the number of correct predictions made by the classifier to the total number of predictions made. The formula is given below:

accuracy = (TP + TN) / (TP + FP + FN + TN)
Evaluation Metrics
 Precision: of all the cases the model predicted as positive, how many were actually positive. It can be calculated using the formula below:

precision = TP / (TP + FP)

 Precision is the ratio of the correctly identified positive cases to all the predicted positive cases, i.e. both the correctly and the incorrectly predicted positives.
Evaluation Metrics
 Recall: of all the actual positive cases, how many our model predicted correctly. The recall should be as high as possible.
 Recall, also known as sensitivity, is the ratio of the correctly identified positive cases to all the actual positive cases, which is the sum of the false negatives and true positives.

recall = TP / (TP + FN)
Evaluation Metrics
 F-measure: if one model has low precision and high recall or vice versa, it is difficult to compare models. For this purpose we can use the F-score, which helps us evaluate recall and precision at the same time. The F-score is maximized when the recall is equal to the precision. It can be calculated using the formula below:

F-score = 2 / (1/recall + 1/precision) = 2 × (precision × recall) / (precision + recall)
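 These four formulas as a small Python helper (classification_metrics is our own name, taking raw confusion-matrix counts):

def classification_metrics(tp, fp, fn, tn):
    # accuracy, precision, recall and F1 computed from confusion-matrix counts
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1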
Evaluation Metrics
 Problem statement: prediction of corona patients.
 Let us assume that our model is trying to predict corona patients, and that its performance is summarized by a 2x2 matrix with TP = 560, FP = 60, FN = 50, and TN = 330 (the counts used in the calculations below).
Evaluation Metrics
Let's understand TP, FP, FN, TN in terms of the coronavirus analogy:
 True Positive: the model predicted positive and it's true. The model predicted that a person is corona positive and he actually does have corona.
 True Negative: the model predicted negative and it's true. The model predicted that a person is corona negative and he actually does NOT have corona.
 False Positive (Type 1 error): the model predicted positive and it's false. The model predicted that a person is corona positive but he actually did NOT have corona.
 False Negative (Type 2 error): the model predicted negative and it's false. The model predicted that a person is corona negative but he actually was corona positive.
Evaluation Metrics
 From the above counts, the metrics of our machine learning model are calculated as follows:
 Accuracy = (560 + 330) / (560 + 330 + 60 + 50) = 890/1000 = 0.89, i.e. 89% accuracy.
 This means 89% of the values (TP + TN) were correctly classified by the model.
 Recall = 560 / (560 + 50) = 560/610 ≈ 0.918.
 This means 91.8% of the actual positive values were correctly classified.
 Precision = 560 / (560 + 60) = 560/620 ≈ 0.903, i.e. about 90.3% of the predicted positives were actually positive.
 F1 score = 2 × (0.903 × 0.918) / (0.903 + 0.918) ≈ 1.658 / 1.821 ≈ 0.91.
 The F1 score shows the balance between precision and recall.
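 The same numbers can be checked with the classification_metrics sketch above:

print(classification_metrics(tp=560, fp=60, fn=50, tn=330))
# -> (0.89, 0.903..., 0.918..., 0.910...)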
Evaluation Metrics
Example 2:
 The classifier correctly predicted a male person in 42 cases and wrongly predicted 8 male instances as female. It correctly predicted 32 instances as female, while 18 cases were wrongly predicted as male instead of female.
 The classifier therefore correctly predicted 42 male instances and 32 female instances.
 The accuracy can be calculated as:
 accuracy = (42 + 32) / (42 + 8 + 18 + 32) = 74/100 = 0.74, or 74%
Thanks!
