Lesson 7. Performance metrics
Data Science and Automation course
Master's Degree in Smart Technology Engineering
Teacher: Mirko Mazzoleni
Place: University of Bergamo
Outline
1. Metrics
2. Precision and recall
3. Receiver Operating Characteristic (ROC) curves
Metrics
It is extremely important to use quantitative metrics for evaluating a machine learning
model
• Until now, we relied on the cost function value for regression and classification
• Other metrics can be used to better evaluate and understand the model
• For classification: Accuracy, Precision, Recall, F1-score, ROC curves, …
• For regression: Normalized RMSE, Normalized Mean Absolute Error (NMAE), … (see the sketch below)
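As a minimal sketch (not from the slides), the snippet below computes normalized regression metrics on made-up data. Normalization conventions vary (range, mean, or standard deviation of the targets); the range is used here as one common choice.

```python
import numpy as np

# Hypothetical regression targets and predictions (made-up numbers)
y_true = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
y_pred = np.array([2.2, 3.1, 4.3, 5.0, 7.4])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
mae  = np.mean(np.abs(y_true - y_pred))

# Normalize by the range of the observed targets (one common convention)
span = y_true.max() - y_true.min()
print("NRMSE:", rmse / span)
print("NMAE: ", mae / span)
```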
Classification case: metrics for skewed classes
Binary disease classification example
Train a logistic regression model ℎ(𝒙), with 𝑦 = 1 if disease, 𝑦 = 0 otherwise.
Suppose you find a 1% error on the test set (99% correct diagnoses).
However, only 0.50% of patients actually have the disease: the 𝑦 = 1 class has very few examples with respect to the 𝑦 = 0 class.
If we use a predictor that always predicts the 𝟎 class, we get 99.5% accuracy!
For skewed classes, the accuracy metric can be deceptive
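A minimal sketch of this effect, assuming a synthetic test set with roughly 0.5% positive cases: a classifier that always predicts 0 still reaches about 99.5% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic skewed test set: roughly 0.5% positives (disease), 99.5% negatives
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.005).astype(int)

# A trivial "classifier" that always predicts the majority class 0
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # about 0.995, yet no disease case is detected
```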
Precision and recall
Suppose that 𝑦 = 1 corresponds to a rare class that we want to detect.

Confusion matrix (rows: predicted class, columns: actual class):

                      Actual 1 (p)            Actual 0 (n)
Predicted 1 (Y)       True positive (TP)      False positive (FP)
Predicted 0 (N)       False negative (FN)     True negative (TN)

Precision (how precise we are in the detection)
Of all patients for which we predicted 𝑦 = 1, what fraction actually has the disease?

Precision = TP / # Predicted Positive = TP / (TP + FP)

Recall (how good we are at detecting)
Of all patients that actually have the disease, what fraction did we correctly detect as having the disease?

Recall = TP / # Actual Positive = TP / (TP + FN)
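A minimal sketch of these definitions using scikit-learn; the labels and predictions below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions for a small test set
y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

# For binary problems, ravel() returns the entries in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Precision:", tp / (tp + fp), "==", precision_score(y_true, y_pred))
print("Recall:   ", tp / (tp + fn), "==", recall_score(y_true, y_pred))
```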
Trading off precision and recall
Logistic regression outputs 0 ≤ ℎ(𝒙) ≤ 1
• Predict 1 if ℎ(𝒙) ≥ 0.5
• Predict 0 if ℎ(𝒙) < 0.5
The threshold does not have to be 0.5: different thresholds correspond to different confusion matrices!
Suppose we want to predict 𝑦 = 1 (disease) only if very confident
• Increase threshold → Higher precision, lower recall
Suppose we want to avoid missing too many cases of disease (avoid false negatives).
• Decrease threshold → Higher recall, lower precision
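A minimal sketch of the trade-off, sweeping the decision threshold over made-up predicted probabilities and labels: raising the threshold increases precision and lowers recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities h(x) and true labels (made up)
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.95, 0.80, 0.75, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10])

# Higher threshold -> fewer predicted positives -> higher precision, lower recall
for thr in (0.3, 0.5, 0.7):
    y_pred = (y_score >= thr).astype(int)
    print(f"threshold {thr}: "
          f"precision={precision_score(y_true, y_pred):.2f} "
          f"recall={recall_score(y_true, y_pred):.2f}")
```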
F1-score
It is usually better to compare models by means of a single number. The F1-score can be used to combine precision and recall.

              Precision (P)   Recall (R)   Average   F1 score
Algorithm 1       0.5            0.4         0.45      0.444
Algorithm 2       0.7            0.1         0.40      0.175
Algorithm 3       0.02           1.0         0.51      0.0392

Algorithm 3 always predicts 𝟏. The average (incorrectly) says that Algorithm 3 is the best; the F1 score correctly identifies Algorithm 1 as the best.

Average = (P + R) / 2          F1 score = 2 · P · R / (P + R)

• P = 0 or R = 0 ⇒ F1 score = 0
• P = 1 and R = 1 ⇒ F1 score = 1
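A minimal sketch reproducing the table above; the helper function f1 below simply implements the formula and is not a library call.

```python
def f1(p, r):
    """F1 score: 2*P*R/(P+R). Returns 0 when P = R = 0 to avoid division by zero."""
    return 0.0 if (p + r) == 0 else 2 * p * r / (p + r)

# The three algorithms from the table above
for name, p, r in [("Algorithm 1", 0.5, 0.4),
                   ("Algorithm 2", 0.7, 0.1),
                   ("Algorithm 3", 0.02, 1.0)]:
    print(name, "average:", round((p + r) / 2, 3), "F1:", round(f1(p, r), 3))
```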
Summaries of the confusion matrix
Different metrics can be computed from the confusion matrix, depending on the class of
interest (https://en.wikipedia.org/wiki/Precision_and_recall)
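A minimal sketch of some of these summaries, computed from hypothetical confusion-matrix counts:

```python
# Common summaries derived from the four confusion-matrix entries (made-up counts)
tp, fp, fn, tn = 30, 10, 20, 940

precision   = tp / (tp + fp)                  # positive predictive value
recall      = tp / (tp + fn)                  # sensitivity, true positive rate
specificity = tn / (tn + fp)                  # true negative rate
fpr         = fp / (fp + tn)                  # false positive rate = 1 - specificity
accuracy    = (tp + tn) / (tp + fp + fn + tn)

print(precision, recall, specificity, fpr, accuracy)
```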
Ranking instead of classifying
Classifiers such as logistic regression can output a probability of belonging to a class (or
something similar).
• We can use this to rank the different instances and take action on the cases at the top of the list
• We may have a limited budget, so we have to target the most promising individuals
• Ranking enables the use of different techniques for visualizing model performance
Ranking instead of classifying

Example (adapted from [1]): a test set with 100 positive and 100 negative instances, ranked by decreasing classifier score.

Instance description   True class   Score
……………                       1        0.99
……………                       1        0.98
……………                       0        0.96
……………                       0        0.90
……………                       1        0.88
……………                       1        0.87
……………                       0        0.85
……………                       1        0.80
……………                       0        0.70

Different confusion matrices are obtained by changing the threshold on the score (rows: predicted Y/N, columns: actual p/n). For example:
• Threshold above all scores (predict always negative): TP = 0, FP = 0, FN = 100, TN = 100
• Threshold after the first instance: TP = 1, FP = 0, FN = 99, TN = 100
• Threshold after the first two instances: TP = 2, FP = 0, FN = 98, TN = 100
• Threshold after the first three instances: TP = 2, FP = 1, FN = 98, TN = 99
• Moving the threshold further down the list: TP = 6, FP = 4, FN = 94, TN = 96
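A minimal sketch of this idea, using only the nine instances shown above (so the counts differ from the 100/100 example in the slide): each cut of the ranked list gives a different confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# The nine ranked instances shown above (true class, score), sorted by decreasing score
y_true  = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.99, 0.98, 0.96, 0.90, 0.88, 0.87, 0.85, 0.80, 0.70])

# Place the threshold at each score in turn: every cut of the ranked list
# produces a different confusion matrix
for thr in y_score:
    y_pred = (y_score >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold {thr:.2f}: TP={tp} FP={fp} FN={fn} TN={tn}")
```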
ROC curves
ROC curves are a very general way to represent and compare the performance of
different models (on a binary classification task)
[ROC plot: true positive rate (y-axis) vs. false positive rate (x-axis); the top-left corner is perfection, the diagonal line corresponds to random guessing]

Observations
• (0,0): predict always negative
• (1,1): predict always positive
• Diagonal line: random classifier
• Below the diagonal line: worse than a random classifier
• Different classifiers can be compared
• Area Under the Curve (AUC): probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance
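A minimal sketch, assuming made-up labels and scores: scikit-learn's roc_curve and roc_auc_score compute the curve and the AUC, and the pairwise-ranking computation illustrates the probabilistic interpretation of the AUC.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and classifier scores (made up)
y_true  = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.99, 0.98, 0.96, 0.90, 0.88, 0.87, 0.85, 0.80, 0.70, 0.50])

# ROC curve points (one per threshold) and the area under the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))

# AUC as the probability that a random positive is ranked above a random negative
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
auc_by_ranking = np.mean([p > n for p in pos for n in neg])
print("AUC (pairwise ranking):", auc_by_ranking)
```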