Linear Regression. Logistic Regression.
Overfitting and Regularization.
COMP5318/COMP4318 Machine Learning and Data Mining
semester 1, 2023, week 3
Irena Koprinska
Reference: Witten ch.4: 128-131, Müller & Guido: ch.2: 28-31, 47-63,
Geron: ch.4 132-137, 149-161
Outline
• Linear regression
• Logistic regression
• Overfitting and regularization
• Ridge and Lasso regression
Introduction
• Linear regression is a prediction method used for regression tasks
• Regression tasks – the predicted variable is numeric
• Examples: predict the exchange rate of AU$ based on economic
indicators, predict the sales of a company based on the amount
spent on advertising
• Logistic regression is an extension of linear regression for
classification tasks
• Classification tasks – the predicted variable is nominal
• Both linear regression and logistic regression are very popular in
statistics
Linear Regression
Simple (Bivariate) Regression
• Given: a dataset with 2 continuous variables:
• feature x (also called independent variable)
• predicted variable y (also called target variable or dependent
variable)
• Goal: Approximate the relationship between these variables with a
straight line for the given dataset
• Prediction (typical task in DM): Given a new value of the independent
variable, use the line to predict the value of the dependent variable
• Descriptive analysis (typical task in psychology, health and social
sciences): assess the strength of the relationship between x and y
Example – cereals dataset
• Contains nutritional information for 77 breakfast cereals
• 14 features
• cereal manufacturer, type (hot or cold), calories, protein [g], fat [g], sodium
[mg], fiber [g], carbohydrates [g], sugar [g], potassium [mg],
%recommended daily vitamins, weight of 1 serving, number of cups per
serving, shelf location (bottom, middle or top)
• Class variable (numeric): nutritional rating
• Task: Predict the nutritional rating of a cereal based on its sugar content
1. Use this data to build the model
2. Given the sugar content of a new cereal, use the model to predict its
nutritional rating
• New cereal = cereal not used for building the model
Task
Task: Predict the nutritional rating of a cereal based on its sugar content
1. Use this data to build the model
2. Given the sugar content of a new cereal, use the model to predict its
nutritional rating
Dependent variable?
rating
Independent variable?
sugars
Example from D. Larose, Data Mining: Methods and Models, 2006, Wiley
Regression model
• The relationship between sugars and rating is modeled by a line
• The line is used to make predictions
[Figure: scatter plot of rating vs. sugars with the fitted regression line (the model)]
Equation of a line
y = b0 + b1·x, where b0 is the intercept and b1 is the slope
Example: y = 5 + 2x (intercept = 5, slope = 2)
Equation of a regression line
ŷ = b0 + b1·x
ŷ – the estimated (predicted) value of y from the regression line
b0 and b1 – the regression coefficients
How to make predictions
• In our case the computed regression line (model) is:
ŷ = 59.4 − 2.42·x
• It can be used to make predictions
• e.g. predict the nutritional rating of a new cereal type (not in the original
data) that contains x = 1 g sugar:
ŷ = 59.4 − 2.42·1 = 56.98
• The predicted value (56.98) lies precisely on the regression line
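A minimal sketch (not from the slides) of reproducing this fit with scikit-learn; the file name cereals.csv and the column names sugars and rating are assumptions about how the dataset is stored:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# assumed file and column names - adjust to however the cereals data is stored
cereals = pd.read_csv("cereals.csv")
X = cereals[["sugars"]]               # single feature: sugar content [g]
y = cereals["rating"]                 # target: nutritional rating

reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_[0])   # on this dataset roughly 59.4 and -2.42

# predict the rating of a new cereal containing 1 g of sugar
print(reg.predict(pd.DataFrame({"sugars": [1.0]})))   # roughly 56.98
```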
How to make predictions (2)
• We have a cereal type in our dataset with sugar =1g: Cheerios
• Its nutritional rating is: 50.765 (actual value) not 56.98 (predicted)
• The difference is called prediction error or residual
56.98
50.765
Fitting a line
• There are many lines that can be fitted to the
given dataset. Which one is the best one?
• The one “closest” to the data
• Mathematically:
• Prediction error (residual) = actual value – predicted value: ei = yi − ŷi
• Performance index: sum of squared prediction errors (SSE): SSE = Σi (yi − ŷi)²
• Our goal: select the line which minimizes SSE
• Can be solved using the method of least squares
Solution using the least squares method
b1 = [Σ xi·yi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n]
b0 = ȳ − b1·x̄
x̄ – mean value of x
ȳ – mean value of y
n – number of training examples (= data points, observations)
• This solution is obtained by minimizing SSE using differential calculus
• If you are interested to see how this was done, please see Appendix 1 at the
end
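A small NumPy sketch of the least squares formulas above (not from the slides), using made-up x and y values rather than the cereals data:

```python
import numpy as np

# toy data: sugar content [g] and nutritional rating (illustrative values only)
x = np.array([1.0, 3.0, 6.0, 7.0, 9.0, 12.0])
y = np.array([56.0, 50.0, 45.0, 40.0, 35.0, 30.0])
n = len(x)

# least squares estimates of slope b1 and intercept b0
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x                 # predictions on the training data
sse = np.sum((y - y_hat) ** 2)      # the quantity the least squares method minimizes
print(b0, b1, sse)
```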
Coefficient of determination R2
• The least squares method finds the best fit to the data but doesn't tell
us how good this fit is
• E.g. SSE=12; is this large or small?
• R2 measures the goodness of fit of the regression line found by the
least squares method:
R² = SSR / SST
• Values between 0 and 1; the higher the better
• = 1: the regression line fits the training data perfectly
• close to 0: poor fit
• What are SSR and SST?
Three types of errors
• 1. SSE – Sum of squared prediction errors (actual value – predicted value):
SSE = Σi=1..n (yi − ŷi)²
• 2. SST – Sum of squared total errors (actual value – mean value):
SST = Σi=1..n (yi − ȳ)²
• Hence, SST measures the prediction error when the predicted value is the
mean value
• SST is a function of the variance of y (variance = standard deviation^2) => SST
is a measure of the variability of y, without considering x
SST = Σi=1..n (yi − ȳ)² = (n − 1)·var(y)
• Can be used as a baseline – predicting y without knowing x
Three types of errors (2)
• 3. SSR – Sum of squared regression errors (predicted value – mean value):
SSR = Σi=1..n (ŷi − ȳ)²
[Figure: distance travelled vs. number of hours, with SSE, SST and SSR marked on the plot]
Relation between SST, SSR and SSE
• From the graph: yi − ȳ = (ŷi − ȳ) + (yi − ŷi)
• It can be shown that SST=SSR+SSE
(For the interested students: how? By squaring each side and summing over all n examples:
Σi=1..n (yi − ȳ)² = Σi=1..n (ŷi − ȳ)² + Σi=1..n (yi − ŷi)²
The cross product cancels out as shown in this book:
N. Draper and H. Smith, Applied Regression Analysis, Wiley, 1998)
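As a quick numerical check of this identity (SST = SSR + SSE), here is a sketch reusing the made-up data from the earlier least-squares example:

```python
import numpy as np

x = np.array([1.0, 3.0, 6.0, 7.0, 9.0, 12.0])
y = np.array([56.0, 50.0, 45.0, 40.0, 35.0, 30.0])
n = len(x)

b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)         # error around the regression line
sst = np.sum((y - y.mean()) ** 2)      # error around the mean (the baseline)
ssr = np.sum((y_hat - y.mean()) ** 2)  # improvement of the line over the mean

print(np.isclose(sst, ssr + sse))      # True: SST = SSR + SSE
print(ssr / sst)                       # R^2 = SSR / SST
```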
Coefficient of determination R2 - again
• Measures the goodness of fit of the regression line to the training data:
R² = SSR / SST
• Values between 0 and 1; the higher the better
• 1: perfect fit, SSE = 0; Why is it 1 when SSE = 0?
• 0: x is not helpful for predicting y, SSR = 0
If SSE = 0:
R² = SSR/SST = (SST − SSE)/SST = SST/SST = 1
[Figure: regression line with SSE, SSR and SST marked – is R² high or low?]
Relation R2 and r
• r - correlation coefficient; measures linear relationship between 2 vectors x and
y (see slides for week 1b):
r = corr(x, y) = covar(x, y) / (std(x)·std(y)) = covar(x, y) / √(var(x)·var(y))
• R² – coefficient of determination; measures how well the regression line
represents the data: R² = SSR / SST
• It can be shown that the magnitude of r equals √R²
• Except for the sign of r, which depends on the direction of the relationship,
positive or negative, so:
r = ±√R²
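A short numerical check of this relation (a sketch on the same made-up data as above):

```python
import numpy as np

x = np.array([1.0, 3.0, 6.0, 7.0, 9.0, 12.0])
y = np.array([56.0, 50.0, 45.0, 40.0, 35.0, 30.0])

r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation coefficient

# R^2 of the simple least-squares line for the same data
b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # slope = covar(x, y) / var(x)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

print(np.isclose(r ** 2, r2))                # True: r = +/- sqrt(R^2)
```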
MAE, MSE and RMSE
• MAE, MSE and RMSE are other performance measures for evaluating:
• how good the model is (performance on training data) and
• how well it works on new data (performance on test data)
• They are widely used in ML and DM
• Mean Absolute Error (MAE): MAE = (1/n) Σi=1..n |ŷi − yi|
• Mean Squared Error (MSE): MSE = (1/n) Σi=1..n (ŷi − yi)²
• Root Mean Squared Error (RMSE): RMSE = √[ (1/n) Σi=1..n (ŷi − yi)² ]
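A minimal sketch of computing these with scikit-learn's metrics module and NumPy, on a few made-up actual/predicted values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([50.8, 45.0, 35.2, 60.1])   # actual values (illustrative)
y_pred = np.array([56.9, 44.0, 33.0, 58.5])   # predicted values (illustrative)

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                           # RMSE is the square root of MSE
print(mae, mse, rmse)
```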
Multiple regression
• Simple regression: 1 feature
• Multiple regression: more than 1 feature
• The line becomes a plane in 2-dim. space and a hyperplane in >2-dim.
space
• R2 is similarly defined, called multiple coefficient of determination
Question time
• True or False?
• 1) The regression line minimizes the sum of the residuals
• 2) If all residuals are 0, SST=SSR
• 3) If the value of the correlation coefficient is negative, this indicates
that the 2 variables are negatively correlated
• 4) The value of the correlation coefficient can be calculated given the
value of R2
• 5) SSR represents an overall measure of the prediction error on the
training set by using the regression line
Answers
• True or False?
• 1) The regression line minimizes the sum of the residuals False
No, the sum of squared residuals
• 2) If all residuals are 0, SST=SSR True
If the residuals are 0 =>SSE will be 0; SST=SSR+SSE => SST=SSR
• 3) If the value of the correlation coefficient is negative, this indicates
that the 2 variables are negatively correlated True
• 4) The value of the correlation coefficient can be calculated given the
value of R2 False
r = ±√R² – the sign of r cannot be determined from R² alone
• 5) SSR represents an overall measure of the prediction error on the
training set by using the regression line False
No, this is SSE, R² or other measures such as MAE, MSE, RMSE
Negative R squared
• Note that if the LR model is fitted on one dataset but tested on another
dataset, then it is possible that the R2 value is negative
• We will see such a case during the tutorial – a LR model trained on the
training set and tested on the test set
• R2 on the training set: 0.69
• R2 on the test set: -0.73 (negative)
• Negative value means a poor fit
• R2 on the training set: 0.69 - good fit on the training data
• R2 on the test set: -0.73 - poor fit on the test data
• => overfitting
• If the model is trained and tested on the same dataset, R2 is always
between 0 and 1
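A sketch of how this train/test comparison can be done in scikit-learn; the data below is a synthetic placeholder, not the tutorial dataset, so the printed values will differ from 0.69 and -0.73:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# placeholder data - in the tutorial this is the tutorial's own dataset
X, y = make_regression(n_samples=60, n_features=1, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
print("R^2 on the training set:", reg.score(X_train, y_train))
print("R^2 on the test set:", reg.score(X_test, y_test))  # can be negative for a poor fit
```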
Logistic Regression
Logistic regression
• Used for classification tasks
• Two classes: 0 and 1 (there are extensions for more than 2 classes)
• Fits the data to a logistic (sigmoidal) curve instead of fitting it to a straight line
• => assumes that the relationship between the feature and class variable
is nonlinear
[Figure: logistic regression (sigmoidal curve) vs. linear regression (straight line)]
Simple (bivariate) logistic regression
• Example: Predicting the presence (class=1) or absence (class=0) of a particular
disease, given the patient’s age
[Figure: disease (0/1) vs. age, with the fitted logistic regression curve and linear regression line]

ID  age  disease      ID  age  disease
 1   25     0         11   50     0
 2   29     0         12   59     1
 3   30     0         13   60     0
 4   31     0         14   62     0
 5   32     0         15   68     1
 6   41     0         16   72     0
 7   41     0         17   79     1
 8   42     0         18   80     0
 9   44     1         19   81     1
10   49     1         20   84     1
Logistic regression – example
• What will be the prediction of Logistic Regression for patient 11 from the
training data (age=50, disease=0)?
[Figure: fitted logistic regression curve over the training data]
Logistic curve
• The equation of the logistic (sigmoidal) curve is:
p = e^(b0 + b1·x) / (1 + e^(b0 + b1·x))
• It gives a value between 0 and 1 that is interpreted as the probability for
class membership:
• p is the probability for class 1 and 1-p is the probability for class 0
• It uses the maximum likelihood method to find the parameters b0 and b1 - the
curve that best fits the data
How to make predictions
• The logistic regression produced b0 = -4.372, b1 = 0.06696
• => the probability for a patient aged 50 (training example 11) to have the
disease:
p = e^(-4.372 + 0.06696·50) / (1 + e^(-4.372 + 0.06696·50)) ≈ 0.26
• => 26% to have the disease and 74% not to have the disease
• We can use the probability directly or convert it into 0/1 answer required for
classification tasks, e.g. 0 if p<0.5 and 1 if p>=0.5
• => We predict class 0 for this patient
• Other thresholds (not 0.5) are also possible depending on domain knowledge
• The class for new examples can be predicted similarly – e.g. make a
prediction for a patient aged 45
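A small sketch of this calculation in plain Python/NumPy, using the coefficients quoted above:

```python
import numpy as np

b0, b1 = -4.372, 0.06696          # coefficients reported on the slide

def predict_proba(age):
    """Probability of class 1 (disease) for a given age, from the logistic curve."""
    z = b0 + b1 * age
    return np.exp(z) / (1.0 + np.exp(z))

p = predict_proba(50)
print(p)                          # ~0.26 -> predict class 0 with a 0.5 threshold
print(predict_proba(45))          # probability for a new patient aged 45
```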
Logistic regression equation
p = e^(b0 + b1·x) / (1 + e^(b0 + b1·x))
• It also follows that (how can this be shown? See Appendix 2 at the end):
ln( p / (1 − p) ) = b0 + b1·x – a linear expression, as in linear regression
• p / (1 − p) is called the odds ratio for the default class (class 1):
ln(odds) = b0 + b1·x  =>  odds = e^(b0 + b1·x)
• Compare:
• Logistic regression: ln(odds) = b0 + b1·x
• Linear regression: ŷ = b0 + b1·x
• The model is still a linear combination of the input features, but this
combination determines the log odds of the class, not directly the predicted value
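The model can also be fitted with scikit-learn on the 20-patient table above. This is only a sketch; note that sklearn's LogisticRegression applies L2 regularization by default, so a large C (weak regularization) is used here to approximate the unregularized maximum-likelihood fit quoted on the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

age = np.array([25, 29, 30, 31, 32, 41, 41, 42, 44, 49,
                50, 59, 60, 62, 68, 72, 79, 80, 81, 84]).reshape(-1, 1)
disease = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
                    0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

# C is the inverse regularization strength; a very large C ~ almost no regularization
clf = LogisticRegression(C=1e6, max_iter=10_000).fit(age, disease)
print(clf.intercept_, clf.coef_)   # expected to be close to b0 = -4.372, b1 = 0.06696

print(clf.predict_proba([[50]]))   # [P(class 0), P(class 1)] for a 50-year-old
print(clf.predict([[50]]))         # 0/1 prediction with the default 0.5 threshold
```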
Overfitting and Regularization
Overfitting
• Overfitting:
• Small error on the training set but high error on test set (new examples)
• The classifier has memorized the training examples but has not learned
to generalize to new examples!
• It occurs when
• we fit a model too closely to the particularities of the training set – the
resulting model is too specific, works well on the training data but doesn’t
work well on new data
Ex.1: [Figure: a very complex decision boundary – may be overfitting, too complex]
Ex.2:
Rule1: may be overfitting – too specific
If age>45, income>100K, has_children=3, divorced=no -> buy_boat=yes
Rule2:
If age>45, income>100K -> buy_boat=yes
Overfitting (2)
• Various reasons, e.g.
• Issues with the data
• Noise in the training data
• Training data does not contain enough representative examples – too
small
• Training data very different from the test data – not representative enough
• How the algorithm operates
• Some algorithms are more susceptible to overfitting than others
• Different algorithms have different strategies to deal with overfitting,
e.g.
• Decision tree – prune the tree
• Neural networks – early stopping of the training
• …
Underfitting
• The model is too simple and doesn’t capture all important aspects
of the data
• It performs badly on both training and test data
Rule1: may be overfitting – too specific
If age>45, income>100K, has_children=3,
divorced=no -> buy_boat=yes
Rule2:
If age>45, income>100K -> buy_boat=yes
Rule3: may be underfitting – too general
If owns_house=yes -> buy_boat=yes
Trade-off between model complexity and
generalization performance
• generalization performance = accuracy on test set
• Usually, the more complex we allow the model to be, the better it will
predict on the training data
• However, if it becomes too complex, it will start focusing too much on each
individual data point, and will not generalize well to new data
Image from A. Mueller and S. Guido, Introduction to ML with Python
• There is a point in between which will yield the best test accuracy
• This is the model we want to find
Regularization
• Regularization means explicitly restricting a model to avoid
overfitting
• It is used in some regression models (e.g. Ridge and Lasso
regression) and in some neural networks
Ridge and Lasso Regression
Ridge regression
• A regularized version of the standard Linear Regression (LR)
• Also called Tikhonov regularization
• Uses the same equation as LR to make predictions
• However, the regression coefficients w are chosen so that they not only
fit the training data well (as in LR) but also satisfy an additional
constraint:
• the magnitude of the coefficients is as small as possible, i.e. close
to 0
• Small values of the coefficients mean
• each feature will have little effect on the outcome
• small slope of the regression line
• Rationale: a more restricted model (less complex) is less likely to overfit
• Ridge regression uses the so called L2 regularization (L2 norm of the
weight vector)
Ridge regression (2)
• Minimizes the following cost function:
(1/n) Σi=1..n (ŷi − yi)² + α Σi=1..m wi²
n – number of training examples
m – number of regression coefficients (weights)
• The first term is the MSE – goal: high accuracy on the training data (low MSE)
• The second term is the regularization term – goal: low complexity model, w close to 0
• The parameter α controls the trade-off between the performance on the
training set and model complexity
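A minimal scikit-learn sketch on synthetic data, showing where alpha enters; the data and the alpha value are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# synthetic data where only a few of the 10 features are informative
X, y = make_regression(n_samples=50, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

lr = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # larger alpha -> stronger shrinkage of w

print(np.abs(lr.coef_).sum())         # total magnitude of the unregularized weights
print(np.abs(ridge.coef_).sum())      # typically smaller for Ridge
```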
Ridge regression (3)
(1/n) Σi=1..n (ŷi − yi)² + α Σi=1..m wi²
MSE + regularization term (L2 norm)
• α controls the trade-off between the performance on the training set and
model complexity
• Increasing α makes the coefficients smaller (close to 0); this typically
decreases the performance on the training set but may improve the
performance on the test set
• Decreasing α means less restricted coefficients. For very small α, Ridge
Regression will behave similarly to LR
Image from A. Geron, Hands-on ML with Scikit-learn, Keras & TensorFlow
Lasso regression
• Another regularized version of the standard LR
• LASSO = Least Absolute Shrinkage and Selection Operator Regression
• Like Ridge Regression, it adds a regularization term to the cost function,
but it uses the L1 norm of the regression coefficient vector w
(1/n) Σi=1..n (ŷi − yi)² + α Σi=1..m |wi|
MSE + regularization term (L1 norm)
• The first term – goal: high accuracy on the training data (low MSE)
• The second term – goal: low complexity model
• Consequence of using L1 – some w will become exactly 0
• => some features will be completely ignored by the model – a form of
automatic feature selection
• Fewer features – simpler model, easier to interpret
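A minimal sketch on the same kind of synthetic data, showing Lasso driving some coefficients to exactly 0 (the alpha value is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=50, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=5.0).fit(X, y)    # L1 penalty zeroes out some weights
print(lasso.coef_)
print("features kept:", np.sum(lasso.coef_ != 0), "out of", X.shape[1])
```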
Lasso regression (2)
(1/n) Σi=1..n (ŷi − yi)² + α Σi=1..m |wi|
MSE + regularization term (L1 norm)
• As in Ridge Regression:
• α controls the trade-off between the performance on the training set and
model complexity
• Increasing/decreasing α – similar reasoning as before
Image from A. Geron, Hands-on ML with Scikit-learn, Keras & TensorFlow
Summary
• Linear regression
• Simple (bivariate) - a line is used to approximate the relationship between 2
continuous variables (feature x and class variable y)
• Multiple – more than 1 feature; the line becomes a hyperplane
• The least squares method is used to find the line (hyperplane) which best fits
the given data (training data)
• “Best fit”: minimizes the sum of the squared errors (SSE) between the
actual and predicted values of y, over all data points
• R² = coefficient of determination = SSR/SST – how well the line fits the data;
the higher the better
• MAE, MSE and RMSE – widely used accuracy measures in ML (can be
measured on both training and test data)
Summary (2)
• Logistic regression
• Simple (bivariate) - a sigmoidal curve is used to approximate the relationship
between the feature x and class variable y
• => assumes the relationship between the feature and class variable is
nonlinear
• Multiple – more than 1 feature; the sigmoidal curve becomes a sigmoidal
hyperplane
• Uses the maximum likelihood method to find the curve (hyperplane) which
best fits the given data (training data)
• Overfitting and regularization
• Overfitting - high accuracy on training data but low accuracy on test data (low
generalization)
• High model complexity -> low generalization
• Regularization is a method to avoid overfitting – it makes the model more
restrictive (less complex)
• Ridge and Lasso regression are regularized linear regression models
Acknowledgements
• M. Lewis-Beck, Applied statistics, SAGE University Paper Series on
Quantitative Analysis.
• D. Larose, Data Mining: Methods and Models, 2006, Wiley.
Appendix 1: Minimizing SSE
• For interested students; not examinable
• From D. Larose, Data Mining: Methods and Models, 2006, Wiley; p.36-37
Appendix 1: Minimizing SSE (2)
Appendix 2: Logistic regression
• For interested students; not examinable