MATRUSRI ENGINEERING COLLEGE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SUBJECT NAME: Machine Learning
FACULTY NAME: Mrs.J.Samatha
MACHINE LEARNING
COURSE OBJECTIVES:
• To explore the supervised learning paradigms of machine learning
• To explore the unsupervised learning paradigms of machine learning
• To evaluate various machine learning algorithms and techniques
• To explore deep learning techniques and various feature extraction strategies
• To explore recent trends in machine learning methods for IoT applications
COURSE OUTCOMES:
• Extract features and apply supervised learning paradigms.
• Illustrate several clustering algorithms on the given data set.
• Compare and contrast various machine learning algorithms and gain insight into when to apply a particular machine learning approach.
• Apply basic deep learning algorithms and feature extraction strategies.
• Get familiarized with advanced topics of machine learning.
MODULE-I
Evaluating machine learning algorithms and model selection
OUTCOMES:
• Able to evaluate different machine learning algorithms and select a
model
Evaluation of Learning Algorithms
Evaluating the performance of learning systems is important because:
- Learning systems are usually designed to predict the class of future unlabeled data points
Typical choices for performance evaluation:
- Error
- Accuracy
- Precision/recall
Typical choices for sampling methods:
- Hold-out method (training/test split)
- K-fold cross-validation
- Leave-one-out cross-validation
- Bootstrap method
Hold-out Method (Training/Validation/Test set)
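The slide itself is a diagram; as a minimal sketch (assuming scikit-learn and a synthetic dataset, not the data used in class), a training/validation/test split can be produced like this:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 60% training, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))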
K-fold Cross Validation
1. Split the data into k equal subsets.
2. Perform k rounds of learning; on each round
   - 1/k of the data is held out as a test set and
   - the remaining examples are used as training data.
3. Compute the average test-set score of the k rounds.
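A minimal sketch of the three steps above (assuming scikit-learn and NumPy; the dataset and model are illustrative only):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out 1/k

print("average test-set score over the k rounds:", np.mean(scores))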
Bootstrap Method
• In statistics, the term "bootstrap sampling" (the "bootstrap" or "bootstrapping" for short) refers to the process of random sampling with replacement.
• Repeated sampling from the data with replacement, followed by repeated estimation.
• Each bootstrap sample has the same number of observations as the original data set.
• The same observation can be selected many times.
• The probability of selecting each observation is the same.
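A minimal sketch of bootstrap sampling (assuming NumPy; the data and the estimated quantity, the mean, are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)     # toy observations

estimates = []
for _ in range(1000):
    # each bootstrap sample has the same size as the data; observations may repeat
    sample = rng.choice(data, size=len(data), replace=True)
    estimates.append(sample.mean())

print("bootstrap mean estimate:", np.mean(estimates))
print("bootstrap standard error of the mean:", np.std(estimates))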
Leave-one-out Cross Validation
• A specific case of k-fold CV with k = N (the number of examples).
• Advantage: a thorough way to validate, since every example is used for testing exactly once.
• Disadvantage: high computation time, because N models must be trained.
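A minimal sketch (assuming scikit-learn; the dataset is synthetic). N separate models are fitted, which is exactly why LOOCV is expensive:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# LeaveOneOut() is k-fold CV with k = N: each example is the test set once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())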
Evaluating Predictions
Suppose we want to make a prediction of a value for a target feature
on example x:
-Y is the observed value of the target feature on example x
-Y' is the predicted value of the target feature on example x
-How is the error measured?
Terminology related to Evaluation of Algorithms
Supervised learning: Supervised learning teaches a model from labeled training data and helps you to make predictions about unseen or future data. During training, you give the algorithm a dataset that contains the correct answers (label y). Then, you validate the model accuracy with a test data set that also has correct answers. A data set must therefore be split into training and test sets.
Classification: With classification, you are trying to predict one of a small number of discrete-valued outputs. The label may be binary (binary classification) or categorical (multiclass classification).
Regression: In regression, the goal of the learning problem is to predict a continuous value output.
Ranking: Order items according to some criterion. Example: web search, returning web pages relevant to a search query. Many other similar ranking problems arise in the design of information extraction and natural language processing systems.
Unsupervised learning: Given a data set, try to find tendencies in the data by using techniques like clustering.
Feature: A feature is an attribute that is used as input for the model to train. Other names include dimension or column.
Terminology related to Evaluation of Algorithms
Bias: Bias is the expected difference between the parameters of a model that perfectly fits your data and the parameters that your algorithm learned. The sample error is a poor estimator of the true error.
Variance: Variance is how much the algorithm is impacted by the training data, i.e., how much the parameters change with new training data. The smaller the test set, the greater the expected variance.
Underfitting: The model is too simple to capture the patterns within the data. The model performs poorly both on the data it was trained on and on unseen data. High bias, low variance; high training error and high test error.
Overfitting: The model is too complicated or too specific, capturing trends that don't generalize. The model accurately predicts the data it was trained on but doesn't accurately predict unseen data. Low bias, high variance; low training error and high test error.
Bias-Variance trade-off: Refers to finding a model with the right balance between bias and variance, so that it neither underfits nor overfits.
Bias
Bias is the difference between our actual and predicted values. Bias reflects the simplifying assumptions that our model makes about the data in order to be able to predict new data.
Bias
When the bias is high, the assumptions made by our model are too basic and the model cannot capture the important features of our data. This means that our model has not captured the patterns in the training data and hence cannot perform well on the testing data either. If this is the case, our model cannot perform on new data and cannot be sent into production.
This situation, where the model cannot find patterns in our training set and hence fails on both seen and unseen data, is called underfitting.
Variance
We can define variance as the model's sensitivity to fluctuations in the data. Our model may learn from noise, which will cause it to treat trivial features as important.
For example, suppose a model has learned its training data of cat images extremely well. When given new data, such as the picture of a fox, it still predicts "cat", because that is what it has learned. This happens when the variance is high: the model captures all the features of the data given to it, including the noise, tunes itself to that data and predicts it very well, but it cannot predict on new data because it is too specific to the training data.
Variance
Hence, our model will perform really well on the training data and get high accuracy, but it will fail to perform on new, unseen data. New data may not have exactly the same features, and the model will not be able to predict it very well. This is called overfitting.
(Figure: an over-fitted model's performance on (a) training data and (b) new data.)
Bias & Variance
Evaluation Metrics
REGRESSION:
• Mean absolute error (MAE)
• Mean squared error (MSE)
• Root mean squared error (RMSE)
• Mean absolute percentage error (MAPE)
CLASSIFICATION:
• Confusion matrix
Mean Absolute Error
We will calculate the residual for every data point, taking only the
absolute value of each so that negative and positive residuals do not
cancel out. We then take the average of all these residuals.
Mean Absolute Error
• Take the absolute difference between Y and Ŷ for each of the available observations: |Yᵢ − Ŷᵢ|, where i ∈ [1, n] and n is the total number of points in the dataset.
• Sum the absolute differences to get a total error: Σ|Yᵢ − Ŷᵢ|
• Divide the sum by the total number of observations to get a mean error value: Σ|Yᵢ − Ŷᵢ| / n
MAE = Σ|Yᵢ − Ŷᵢ| / n
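As a minimal sketch of these steps (assuming NumPy; the arrays below are toy values, not data from the course):

import numpy as np

y_true = np.array([3.0, 0.5, 2.0, 7.0])   # observed values Y
y_pred = np.array([2.5, 1.0, 2.0, 8.0])   # predicted values Y-hat

mae = np.mean(np.abs(y_true - y_pred))    # sum of |Yi - Y-hat_i| divided by n
print("MAE:", mae)                        # 0.5 for these toy values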
Mean Squared Error
• Take the difference between Y and Ŷ for each of the available observations: Yᵢ − Ŷᵢ
• Square each difference: (Yᵢ − Ŷᵢ)²
• Sum the squared values: Σ(Yᵢ − Ŷᵢ)², where i ∈ [1, n] and n is the total number of points in the dataset
• Divide by the total number of observations: MSE = Σ(Yᵢ − Ŷᵢ)² / n
Root Mean Squared Error
As the name suggests, RMSE is simply the square root of the mean squared error: RMSE = √(Σ(Yᵢ − Ŷᵢ)² / n).
Mean Absolute Percentage Error
MAPE expresses the average absolute error as a percentage of the actual values: MAPE = (100/n) Σ|Yᵢ − Ŷᵢ| / |Yᵢ|. Because each error is divided by the actual value, MAPE weights the same absolute error more heavily when the actual value is small than when it is large, so it is not symmetric between under- and over-prediction.
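Continuing the same toy example (NumPy only; values are illustrative), the remaining regression metrics can be computed as:

import numpy as np

y_true = np.array([3.0, 0.5, 2.0, 7.0])
y_pred = np.array([2.5, 1.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)                     # mean squared error
rmse = np.sqrt(mse)                                       # root mean squared error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # mean absolute percentage error

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAPE={mape:.1f}%")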
Confusion matrix for 2-class problems
Confusion Matrix
Basic terminology (illustrated with a diabetes-prediction example):
• True Positives (TP): we correctly predicted that they do have diabetes
• True Negatives (TN): we correctly predicted that they don't have diabetes
• False Positives (FP): we incorrectly predicted that they do have diabetes
• False Negatives (FN): we incorrectly predicted that they don't have diabetes
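A minimal sketch (assuming scikit-learn; the label vectors are toy values, with 1 meaning "has diabetes" and 0 meaning "does not have diabetes"):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # classes predicted by some model

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels ordered 0, 1
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)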
Other accuracy metrics
Other accuracy metrics
• Precision tells us how many of the cases predicted as positive actually turned out to be positive.
• Recall tells us how many of the actual positive cases we were able to predict correctly with our model.
Other accuracy metrics
Recall and sensitivity are the same measure.
Other measures of performance
Using the data in the confusion matrix of a classifier on a two-class dataset, several measures of performance have been defined.
Accuracy = (TP + TN)/( TP + TN + FP + FN )
Error rate = 1− Accuracy
Sensitivity = TP/( TP + FN)
Precision = TP/( TP + FP)
Specificity = TN /(TN + FP)
F-measure = (2 × TP)/( 2 × TP + FP + FN)
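A minimal sketch of these formulas in plain Python (the four counts passed in at the end are toy values, purely for illustration):

def performance_measures(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    error_rate = 1 - accuracy
    sensitivity = tp / (tp + fn)          # also called recall or TPR
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return accuracy, error_rate, sensitivity, precision, specificity, f_measure

print(performance_measures(tp=50, tn=40, fp=5, fn=5))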
Suppose we had a classification dataset with 1000 data points.
We fit a classifier on it and get the below confusion matrix
True Positive (TP) = 560; meaning 560 positive class data points were
correctly classified by the model
True Negative (TN) = 330; meaning 330 negative class data points were
correctly classified by the model
False Positive (FP) = 60; meaning 60 negative class data points were
incorrectly classified as belonging to the positive class by the model
False Negative (FN) = 50; meaning 50 positive class data points were
incorrectly classified as belonging to the negative class by the model
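Applying the formulas from the previous slide to these counts gives, approximately:
Accuracy = (560 + 330)/1000 = 0.89
Precision = 560/(560 + 60) ≈ 0.90
Sensitivity (recall) = 560/(560 + 50) ≈ 0.92
Specificity = 330/(330 + 60) ≈ 0.85
F-measure = (2 × 560)/(2 × 560 + 60 + 50) ≈ 0.91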
Example
The total outcome values are:
TP = 30, TN = 930, FP = 30, FN = 10
So, the accuracy for our model turns out to be:
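Accuracy = (TP + TN)/(TP + TN + FP + FN) = (30 + 930)/1000 = 0.96, i.e. 96%.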
Precision vs. Recall
Precision tells us how many of the cases predicted as positive actually turned out to be positive.
Recall tells us how many of the actual positive cases we were able to predict correctly with our model.
Practice problem-1
Suppose a computer program for recognizing dogs in photographs
identifies eight dogs in a picture containing 12 dogs and some cats. Of the
eight dogs identified, five actually are dogs while the rest are cats.
Compute the precision and recall of the computer program.
Practice problem-2
Let there be 10 balls (6 white and 4 red balls) in a box and let it be
required to pick up the red balls from them. Suppose we pick up 7 balls as
the red balls of which only 2 are actually red balls. What are the values of
precision and recall in picking red ball?
Practice Problem-3
A database contains 80 records on a particular topic of which 55 are
relevant to a certain investigation. A search was conducted on that topic
and 50 records were retrieved. Of the 50 records retrieved, 40 were
relevant. Construct the confusion matrix for the search and calculate
the precision and recall scores for the search.
SOLUTION:
Each record may be assigned a class label “relevant" or “not relevant”. All
the 80 records were tested for relevance. The test classified 50 records as
“relevant”. But only 40 of them were actually relevant. Hence we have
the following confusion matrix for the search:
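Relevant and retrieved (TP) = 40; not relevant but retrieved (FP) = 10; relevant but not retrieved (FN) = 55 − 40 = 15; not relevant and not retrieved (TN) = 25 − 10 = 15.
Precision = TP/(TP + FP) = 40/50 = 0.80
Recall = TP/(TP + FN) = 40/55 ≈ 0.73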
Suppose we have a test dataset of 10 records with expected
outcomes and a set of predictions from our classification algorithm.
Compute the accuracy, precision, sensitivity
and specificity of the data.
Sample problem
Suppose 10000 patients get tested for flu; out
of them, 9000 are actually healthy and 1000
are actually sick. For the sick people, a test
was positive for 620 and negative for 380. For
the healthy people, the same test was
positive for 180 and negative for 8820.
Construct a confusion matrix for the data and
compute the accuracy, precision and recall for
the data.
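Taking "sick" as the positive class, the confusion matrix is: TP = 620, FN = 380, FP = 180, TN = 8820.
Accuracy = (620 + 8820)/10000 = 0.944
Precision = 620/(620 + 180) = 0.775
Recall = 620/(620 + 380) = 0.62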
Receiver Operating Characteristic (ROC)
•The acronym ROC stands for Receiver Operating
Characteristic, a terminology coming from signal
detection theory.
•The ROC curve was first developed by electrical
engineers and radar engineers during World War II
for detecting enemy objects in battlefields.
•They are now increasingly used in machine learning
and data mining research.
TPR and FPR
Let a binary classifier classify a collection of test data.
TP = Number of true positives
TN = Number of true negatives
FP = Number of false positives
FN = Number of false negatives
TPR = True Positive Rate = TP/( TP + FN )= Fraction of
positive examples correctly classified = Sensitivity
FPR = False Positive Rate = FP /(FP + TN) = Fraction of
negative examples incorrectly classified = 1 −
Specificity
ROC space
•We plot the values of FPR along the horizontal axis
(that is , x-axis) and the values of TPR along the
vertical axis (that is, y-axis) in a plane.
•For each classifier, there is a unique point in this
plane with coordinates (FPR,TPR).
•The ROC space is the part of the plane whose points
correspond to (FPR,TPR).
•Each prediction result or instance of a confusion
matrix represents one point in the ROC space.
ROC space
The position of the point (FPR, TPR) in the ROC space gives an indication of the performance of the classifier. For example, let us consider some special points in the space.
(When the test examples are ranked by classifier score, the curve is traced by moving one step up for each positive example and one step to the right for each negative example.)
Special points in ROC space
•The left bottom corner point (0, 0):
•Always negative prediction
•A classifier which produces this point in the ROC
space never classifies an example as positive,
neither rightly nor wrongly, because for this point
TP = 0 and FP = 0.
•It always makes negative predictions.
•All positive instances are wrongly predicted and
all negative instances are correctly predicted.
•It commits no false positive errors.
Special points in ROC space
•The right top corner point (1, 1):
•Always positive prediction
•A classifier which produces this point in the ROC
space always classifies an example as positive
because for this point FN = 0 and TN = 0.
•All positive instances are correctly predicted and
all negative instances are wrongly predicted.
• It commits no false negative errors.
Special points in ROC space
•The left top corner point (0, 1):
•Perfect prediction
•A classifier which produces this point in the ROC
space may be thought as a perfect classifier.
•It produces no false positives and no false
negatives
Special points in ROC space
• Points along the diagonal:
• Random performance
• Consider a classifier where the class labels
are randomly guessed, say by flipping a
coin.
• Then, the corresponding points in the ROC
space will be lying very near the diagonal
line joining the points (0, 0) and (1, 1).
ROC Space & some special points in the
space
ROC curve
•In the case of certain classification algorithms, the
classifier may depend on a parameter.
•Different values of the parameter will give different
classifiers and these in turn give different values to
TPR and FPR.
•The ROC curve is the curve obtained by plotting in the ROC space the points (FPR, TPR) obtained by assigning all possible values to the parameter in the classifier.
ROC curve
•The closer the ROC curve is to the top left corner (0,
1) of the ROC space, the better the accuracy of the
classifier.
•Among three classifiers A, B and C whose ROC curves are plotted, suppose the curve of classifier C is closest to the top left corner of the ROC space.
•Hence, among the three, C gives the best accuracy in predictions.
Area under the ROC curve (AUC)
•The measure of the area under the ROC curve is
denoted by the acronym AUC .
• The value of AUC is a measure of the performance
of a classifier.
•For the perfect classifier, AUC = 1.0.
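A minimal sketch (assuming scikit-learn; the classifier and data are illustrative). Sweeping the decision threshold over the predicted scores yields the (FPR, TPR) points of the ROC curve, and the area under it gives the AUC:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))       # 1.0 corresponds to a perfect classifier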
Sample problem on ROC & AUC
•The body mass index (BMI) of a person is defined as weight (kg) / height (m)².
•Researchers have established a link between BMI and the risk of breast cancer among women.
•The higher the BMI, the higher the risk of developing breast cancer.
•The critical threshold value of BMI may depend on several parameters like food habits, socio-cultural-economic background, life-style, etc.
•The table gives real data from a breast cancer study with a sample of 100 patients and 200 normal persons.
•The table also shows the values of TPR and FPR for various cut-off values of BMI.
Data for various values of BMI
ROC curve of the data
Given the following data, construct the ROC
curve of the data. Compute the AUC.
Statistical Learning Theory
•Statistical learning theory is a framework for machine
learning, drawing from the fields of statistics and
functional analysis.
•Statistical learning theory deals with the problem of
finding a predictive function based on data.
•The goal of learning is prediction. Learning falls into
many categories:
-Supervised learning
-Unsupervised learning
-Semi-supervised learning
-Transfer Learning
-Online learning
-Reinforcement learning
Statistical Learning Theory
•Statistical learning theory was introduced in the late 1960s, but until the 1990s it was treated simply as a problem of function estimation from a given collection of data.
•In the middle of the 1990s new types of learning algorithms (e.g.,
support vector machines) based on the developed theory were proposed.
This made statistical learning theory not only a tool for the theoretical
analysis but also a tool for creating practical algorithms for estimating
multidimensional functions.
•Statistical learning plays a key role in many areas of science, finance
and industry.
Statistical modeling from the perspective of
supervised learning
•In supervised learning, an algorithm is given samples
that are labeled in some useful way. For example, the
samples might be descriptions of apples, and the labels
could be whether or not the apples are edible.
•Supervised learning involves learning from a training set of data. Every point in the training set is an input-output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output in a predictive fashion, such that the learned function can be used to predict the output for future input.
Machine learning Vs Statistics
Machine Learning Vs Statistical Modelling
•Machine Learning is … an algorithm that can
learn from data without relying on rules-
based programming.
•Statistical Modelling is … formalization of
relationships between variables in the form of
mathematical equations.
Machine Learning Vs Statistical Modelling
•Machine Learning is … a subfield of computer
science and artificial intelligence which deals
with building systems that can learn from
data, instead of explicitly programmed
instructions.
•Statistical Modelling is … a subfield of mathematics which deals with finding relationships between variables in order to predict an outcome.
Examples of the learning problems
•Predict whether a patient, hospitalized due to a heart attack, will have a
second heart attack. The prediction is to be based on demographic, diet
and clinical measurements for that patient.
•Predict the price of a stock in 6 months from now, on the basis of
company performance measures and economic data.
•Estimate the amount of glucose in the blood of a diabetic person, from
the infrared absorption spectrum of that person’s blood.
•Identify the risk factors for prostate cancer, based on clinical and
demographic variables.
Sample problem
•Consider a set of patients coming for treatment in a certain clinic.
•Let A denote the event that a “Patient has liver disease” and B the
event that a “Patient is an alcoholic.”
•It is known from experience that 10% of the patients entering the
clinic have liver disease and 5% of the patients are alcoholics.
•Also, among those patients diagnosed with liver disease, 7% are
alcoholics.
•Given that a patient is alcoholic, what is the probability that he will
have liver disease?
Using the notation of probability:
P(A) = 10% = 0.10
P(B) = 5% = 0.05
P(B|A) = 7% = 0.07
By Bayes' theorem,
P(A|B) = P(B|A) P(A) / P(B) = (0.07 × 0.10) / 0.05 = 0.14
•In essence, a statistical learning problem is learning from the data.
•In a typical scenario, we have an outcome measurement, usually quantitative (such as a stock price) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (such as diet and clinical measurements).
•We have a Training Set which is used to observe the outcome and feature measurements for a set of objects.
•Using this data we build a Prediction Model, or a Statistical Learner, which enables us to predict the outcome for a set of new, unseen objects. A good learner is one that accurately predicts such an outcome.
Statistics + Machine Learning = Statistical Learning
Questions & Answers
1) What are the different choices for performance evaluation?
2) What is k-fold cross-validation?
3) Differentiate between bias and variance.
4) Explain the terms underfitting and overfitting.
5) What are the different evaluation metrics for classification and regression?
6) Explain the terminology of the confusion matrix.
7) Explain ROC and AUC.