
Week 3

Dr. Miqing Li
School of Computer Science
University of Birmingham

Code: 68426307
◆ A cohort with very diverse backgrounds.
◆ There is no complicated gradient-descent-related calculation in the class test and exams.
◆ Drop in to my weekly office hour and the help desk session.
◆ My office hour: 12:30-13:30 and 16:30-17:30 Wednesday, room 212 in the School. Zoom link: https://bham-ac-uk.zoom.us/j/2066382427
◆ Help desk session: 15:00-17:00 Thursday, room 222 in the School. Zoom link: https://bham-ac-uk.zoom.us/j/5206341281
Recap of Gradient Descent
◆ Suppose we want to find a linear model $f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ to fit a dataset $D = \{(x_1, x_2, y)\} = \{(0,1,1), (1,2,3), (2,3,4)\}$. We consider the cost function $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$. We use gradient descent to iteratively update the parameters $\theta$ starting from $\theta_0 = \theta_1 = 0$, $\theta_2 = 1$.
Question 1): Calculate the cost function at the initial point of gradient descent.

Model: $f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$

Data:
example | y | x1 | x2
1       | 1 | 0  | 1
2       | 3 | 1  | 2
3       | 4 | 2  | 3

[Figure: for each example, we care about the distance between the predicted value $(x^{(j)}, f(x^{(j)}))$ and the target value $(x^{(j)}, y^{(j)})$ when $x$ is $x^{(j)}$.]
Question 1) solution:

$f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 = x_2$ when $\theta_0 = \theta_1 = 0$, $\theta_2 = 1$.

$J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2 = \frac{1}{3}\Big[\big(x_2^{(1)} - y^{(1)}\big)^2 + \big(x_2^{(2)} - y^{(2)}\big)^2 + \big(x_2^{(3)} - y^{(3)}\big)^2\Big]$

$= \frac{1}{3}\big[(1-1)^2 + (2-3)^2 + (3-4)^2\big] = \frac{2}{3}$
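As a quick numerical check of Question 1, here is a minimal Python sketch (my own illustration, not part of the slides; the array and function names are mine) that evaluates the cost at the initial parameters:

```python
import numpy as np

# Dataset from the slides: rows are (x1, x2), targets y
X = np.array([[0, 1], [1, 2], [2, 3]], dtype=float)
y = np.array([1, 3, 4], dtype=float)
theta = np.array([0.0, 0.0, 1.0])          # theta0, theta1, theta2

def f(X, theta):
    # f(x1, x2) = theta0 + theta1*x1 + theta2*x2
    return theta[0] + X @ theta[1:]

def cost(X, y, theta):
    # J(theta) = (1/m) * sum_j (f(x^(j)) - y^(j))^2
    return np.mean((f(X, theta) - y) ** 2)

print(cost(X, y, theta))                   # 0.666..., i.e. 2/3
```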
Question 2): Calculate $\theta$ after the first iteration of the update process in gradient descent, where the learning rate is set to $\alpha = 0.1$ (same model, cost function, data, and starting point as above).

Gradient descent update: $\theta_i \leftarrow \theta_i - \alpha \dfrac{\partial J(\theta)}{\partial \theta_i}$

$\dfrac{\partial J(\theta)}{\partial \theta_0} = \dfrac{\partial}{\partial \theta_0}\Big[\frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2\Big] = \frac{1}{m}\sum_{j=1}^{m}\dfrac{\partial \big(f(x^{(j)}) - y^{(j)}\big)^2}{\partial \theta_0}$

As we have $f(x^{(j)}) - y^{(j)} = \theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}$, and according to the derivative rule $\dfrac{d(a\theta - b)^2}{d\theta} = 2(a\theta - b)a$:

$\dfrac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{j=1}^{m} 2\big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big) = \frac{2}{3}\big[(1-1) + (2-3) + (3-4)\big] = -\frac{4}{3}$

$\theta_0 \leftarrow \theta_0 - \alpha \dfrac{\partial J(\theta)}{\partial \theta_0} = 0 - 0.1 \times \Big(-\frac{4}{3}\Big) = \frac{2}{15}$
For $\theta_1$ (same setup):

$\dfrac{\partial J(\theta)}{\partial \theta_1} = \dfrac{\partial}{\partial \theta_1}\Big[\frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2\Big] = \frac{1}{m}\sum_{j=1}^{m}\dfrac{\partial \big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big)^2}{\partial \theta_1}$

According to the derivative rule $\dfrac{d(a\theta - b)^2}{d\theta} = 2(a\theta - b)a$:

$\dfrac{\partial J(\theta)}{\partial \theta_1} = \frac{1}{m}\sum_{j=1}^{m} 2\big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big)x_1^{(j)} = \frac{2}{3}\big[(1-1)\times 0 + (2-3)\times 1 + (3-4)\times 2\big] = -2$

$\theta_1 \leftarrow \theta_1 - \alpha \dfrac{\partial J(\theta)}{\partial \theta_1} = 0 - 0.1 \times (-2) = \frac{1}{5}$
For $\theta_2$ (same setup):

$\dfrac{\partial J(\theta)}{\partial \theta_2} = \frac{1}{m}\sum_{j=1}^{m}\dfrac{\partial \big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big)^2}{\partial \theta_2} = \frac{1}{m}\sum_{j=1}^{m} 2\big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big)x_2^{(j)}$

$= \frac{2}{3}\big[(1-1)\times 1 + (2-3)\times 2 + (3-4)\times 3\big] = -\frac{10}{3}$

$\theta_2 \leftarrow \theta_2 - \alpha \dfrac{\partial J(\theta)}{\partial \theta_2} = 1 - 0.1 \times \Big(-\frac{10}{3}\Big) = \frac{4}{3}$
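Similarly, one full gradient-descent update can be checked numerically. The following sketch (my own illustration, not the lecture's code) vectorises the three partial derivatives and reproduces $\theta = (2/15, 1/5, 4/3)$:

```python
import numpy as np

X = np.array([[0, 1], [1, 2], [2, 3]], dtype=float)   # columns: x1, x2
y = np.array([1, 3, 4], dtype=float)
theta = np.array([0.0, 0.0, 1.0])                     # starting point
alpha = 0.1                                           # learning rate
m = len(y)

# Prepend a column of ones so theta0 is handled like the other parameters
Xb = np.hstack([np.ones((m, 1)), X])                  # shape (3, 3)

residual = Xb @ theta - y                             # f(x^(j)) - y^(j)
grad = (2.0 / m) * Xb.T @ residual                    # [dJ/dtheta0, dJ/dtheta1, dJ/dtheta2]
theta = theta - alpha * grad

print(grad)    # [-1.333..., -2.0, -3.333...] = (-4/3, -2, -10/3)
print(theta)   # [ 0.1333...,  0.2,  1.333...] = (2/15, 1/5, 4/3)
```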
Training and Testing, and Polynomial Model

Recap – Linear Regression
◆ Determine a model: $f(x) = \theta_0 + \theta_1 x$
◆ Measure the model: $J(\theta_0, \theta_1) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Find the best parameters by gradient descent.

Training data:
y (price, 100000*£) | x (size, 1000*feet²)
1.2 | 0.5
2.0 | 0.9
3.4 | 1.0
3.0 | 1.3
6.0 | 1.4
4.0 | 1.6
7.0 | 1.8
9.0 | 2.3

Fitted parameters: $\theta_0 = -1.55$, $\theta_1 = 4.49$, i.e. $f(x) = -1.55 + 4.49x$. Cost (error): 0.9523.
The purpose of building the model is to predict.
◆ We need testing.
◆ We need to evaluate how the model performs when applied to another dataset.
Testing
◆ Model: $f(x) = -1.55 + 4.49x$
◆ Cost function: $\frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$

Training data (training error: 0.9523):
y (price, 100000*£) | x (size, 1000*feet²)
1.2 | 0.5
2.0 | 0.9
3.4 | 1.0
3.0 | 1.3
6.0 | 1.4
4.0 | 1.6
7.0 | 1.8
9.0 | 2.3

Testing data (testing error: 1.5281):
y (price, 100000*£) | x (size, 1000*feet²)
2.7 | 0.69
1.75 | 0.723
4.3995 | 0.9
2.25 | 0.915
4.0995 | 0.95
4.2 | 1.367
6.5 | 1.458
10.5 | 2.308

Can we do better?
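For readers who want to reproduce these figures, a small sketch along the following lines (my own code, not the lecture's) recomputes both errors for the fitted linear model; the exact numbers may differ slightly from the slide because the printed parameters (-1.55, 4.49) are rounded:

```python
import numpy as np

# House-price data from the two tables (y in 100000*GBP, x in 1000*sq ft)
x_train = np.array([0.5, 0.9, 1.0, 1.3, 1.4, 1.6, 1.8, 2.3])
y_train = np.array([1.2, 2.0, 3.4, 3.0, 6.0, 4.0, 7.0, 9.0])
x_test = np.array([0.69, 0.723, 0.9, 0.915, 0.95, 1.367, 1.458, 2.308])
y_test = np.array([2.7, 1.75, 4.3995, 2.25, 4.0995, 4.2, 6.5, 10.5])

def f(x):
    # Linear model with the (rounded) parameters reported on the slide
    return -1.55 + 4.49 * x

def mse(x, y):
    return np.mean((f(x) - y) ** 2)

print("training MSE:", mse(x_train, y_train))
print("testing MSE:", mse(x_test, y_test))
```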
Polynomial model
◆ Model: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
◆ Same cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Same method to find the best $\theta$: gradient descent (in the calculation, the coefficient associated with $\theta_2$ is $x^2$).

Fitted on the same training data (the eight (y, x) pairs above):
◆ $f(x) = -0.17 + 1.68x + 0.99x^2$
◆ Error = 0.8986
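A hypothetical sketch of this procedure: build the $x^2$ feature explicitly and run the same gradient-descent loop. The learning rate and iteration count below are arbitrary choices of mine, so the resulting coefficients need not match the slide's (-0.17, 1.68, 0.99) exactly:

```python
import numpy as np

x = np.array([0.5, 0.9, 1.0, 1.3, 1.4, 1.6, 1.8, 2.3])
y = np.array([1.2, 2.0, 3.4, 3.0, 6.0, 4.0, 7.0, 9.0])

# Design matrix for f(x) = theta0 + theta1*x + theta2*x^2:
# the coefficient associated with theta2 is simply the extra feature x^2
Xb = np.column_stack([np.ones_like(x), x, x ** 2])

theta = np.zeros(3)
alpha, iters = 0.01, 200_000            # arbitrary settings for this sketch
m = len(y)

for _ in range(iters):
    grad = (2.0 / m) * Xb.T @ (Xb @ theta - y)
    theta -= alpha * grad

print(theta)                            # fitted (theta0, theta1, theta2)
print(np.mean((Xb @ theta - y) ** 2))   # training MSE of the quadratic model
```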
Testing the quadratic model
◆ Model: $f(x) = -0.17 + 1.68x + 0.99x^2$
◆ Cost function: $\frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Evaluated on the same testing data (the eight (y, x) pairs above).

Training error: 0.8986    Testing error: 1.4041
Quadratic better than linear model
Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041


3rd-order polynomial
◆ Model: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
◆ $\theta_0 = 0.12$, $\theta_1 = 1.21$, $\theta_2 = 1.72$, $\theta_3 = -0.23$
◆ Training: MSE = 0.8872    Testing: MSE = 1.4746

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746

5th-order polynomial
◆ Model: $f(x) = \theta_0 + \theta_1 x + \dots + \theta_5 x^5$
◆ $\theta_0 = 0.74$, $\theta_1 = 0.67$, $\theta_2 = 0.60$, $\theta_3 = 0.49$, $\theta_4 = 0.30$, $\theta_5 = -0.17$
◆ Training: MSE = 0.8730    Testing: MSE = 1.6020

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746
5th order | 0.8730         | 1.6020
Higher-order polynomial
◆ Model: $f(x) = \theta_0 + \theta_1 x + \dots + \theta_9 x^9$
◆ Training: MSE = 0.8546    Testing: MSE = 2.0339

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746
5th order | 0.8730         | 1.6020
9th order | 0.8546         | 2.0339
Error of different models in training
◆ Models:
  ◆ $f(x) = \theta_0 + \theta_1 x$
  ◆ $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
  ◆ $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
  ◆ $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5$
  ◆ $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5 + \dots + \theta_9 x^9$

[Figure: training mean squared error against polynomial order, falling from about 0.95 (linear) to about 0.85 (9th order); a nested-ovals diagram shows the lower-order model's function set contained in the higher-order model's.]

◆ Why is a higher-order model's training cost never worse than a lower-order model's?
A more complex model yields lower error on training data.
Error of different models in testing
◆ Models: the same five polynomial models as above (orders 1, 2, 3, 5, 9).

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746
5th order | 0.8730         | 1.6020
9th order | 0.8546         | 2.0339

[Figure: training and testing mean squared error against polynomial order; training error keeps falling while testing error rises for the higher orders.]

A more complex model may lead to higher error on testing data; this is called overfitting.
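The whole sweep over polynomial orders can be mimicked with a short sketch (my own code, not the lecture's). Note this uses NumPy's least-squares polyfit rather than the lecture's gradient descent, so the numbers will not match the table exactly, but the pattern (training error falls, testing error eventually rises) is the same:

```python
import numpy as np

x_train = np.array([0.5, 0.9, 1.0, 1.3, 1.4, 1.6, 1.8, 2.3])
y_train = np.array([1.2, 2.0, 3.4, 3.0, 6.0, 4.0, 7.0, 9.0])
x_test = np.array([0.69, 0.723, 0.9, 0.915, 0.95, 1.367, 1.458, 2.308])
y_test = np.array([2.7, 1.75, 4.3995, 2.25, 4.0995, 4.2, 6.5, 10.5])

for order in (1, 2, 3, 5, 9):
    # Least-squares fit; for order 9 NumPy warns the fit is poorly conditioned,
    # since a 9th-order polynomial has more coefficients than the 8 data points.
    coeffs = np.polyfit(x_train, y_train, order)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order}: train MSE {mse_train:.4f}, test MSE {mse_test:.4f}")
```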
Underfitting and Overfitting

Training and Testing
Training:
◆ Define the model: a set of functions determined by $\theta_0, \dots, \theta_n$.
◆ Measure functions: how good is a function.
◆ Find the best function.
Testing:
◆ Evaluate the built model function on new data.

Example (linear):
◆ Linear model: $f(x_1) = \theta_0 + \theta_1 x_1$
◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Optimisation algorithm: gradient descent
◆ e.g. $f(x_1) = 3 + 5x_1$

Example (quadratic):
◆ Quadratic model: $f(x_1) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$
◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Optimisation algorithm: gradient descent
◆ e.g. $f(x_1) = 1 - x_1 + 2x_1^2$
Underfitting and overfitting
Underfitting:
◆ Underfitting occurs when a model cannot capture the underlying trend of the data.
◆ It is the effect of considering an oversimple model.
◆ Underfitting means high bias of the model.
◆ Example: $f(x) = \theta_0 + \theta_1 x$
Overfitting:
◆ Overfitting occurs when a model captures the noise of the data.
◆ It is the effect of considering an overcomplex model relative to the training data size.
◆ Overfitting means high variance of the model.
◆ Example: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_5 x^5$

[Figures: house price against house size, fitted with the underfitting (linear) and overfitting (5th-order) models.]

What causes them?
◆ Consider the set of functions that a model represents.
◆ There are three models, represented by three ovals.
◆ The set of functions that a simpler model represents is a subset of the set that a more complex model represents.

[Figure: three nested ovals. Labels: "An oversimple model (e.g., few features, linear model)"; "Optimal functions found under the oversimple model for different datasets"; "Optimal functions found under the overcomplex model for different datasets"; "True optimal function"; "An overcomplex model (e.g., many features, high order of polynomial)".]
Diagnose underfitting and overfitting
House price prediction example:

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746
5th order | 0.8730         | 1.6020
9th order | 0.8546         | 2.0339

[Figure: mean squared error against model complexity; the training curve keeps decreasing while the testing curve first decreases, then increases.]

◆ Underfitting (bias): high error in both training and testing.
◆ Overfitting (variance): low error in training, high error in testing.
Address underfitting

[Figure: nested-ovals diagram as before, with labels "An oversimple model (e.g., few features, linear model)", "Optimal functions found under the oversimple model for different datasets", and "True optimal function".]

◆ Try getting additional features (#bedrooms, house type, location).
◆ Try adding polynomial features ($x_1^2$, $x_2^2$, $x_1 x_2$, …).
Address overfitting

[Figure: nested-ovals diagram as before, with labels "Optimal functions found under the overcomplex model for different datasets", "True optimal function", and "An overcomplex model (many features, high order of polynomial)".]

◆ Get more training data.
◆ Try a smaller set of features (remove less important features and/or high-order terms).
Regularisation

Address overfitting
◆ Complex model: $f_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \cdots$
◆ Reduce the number of features (or high-order terms).
◆ Regularisation works on the cost function.

Features: $x_1$: size of house; $x_2$: #bedrooms; $x_3$: house type; $x_4$: #floors; $x_5$: age of the house; $x_6$: size of lounge; $x_7$: location; $x_8$: distance to train station; …

Step 1: Choose a model → Step 2: Measure goodness of functions → Step 3: Find the best function
Formalisation
◆ Model: $f_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$
◆ Original cost function: $\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2$
◆ Cost function with regularisation ($\lambda$ is the regularisation parameter):
$\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \sum_{k=1}^{n}\theta_k^2$

Regularisation makes the model less "perfect" in training, and then hopefully better in testing.
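A minimal sketch of this regularised cost in Python (function and argument names are mine; theta[0] is excluded from the penalty, matching the sum starting at k = 1):

```python
import numpy as np

def regularised_cost(theta, Xb, y, lam):
    """(1/m) * sum of squared errors + lambda * sum_{k>=1} theta_k^2.

    Xb is the design matrix with a leading column of ones;
    theta[0] is not penalised, matching the sum starting at k = 1.
    """
    residual = Xb @ theta - y
    return np.mean(residual ** 2) + lam * np.sum(theta[1:] ** 2)
```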
Why this regularisation term?

$\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \sum_{k=1}^{n}\theta_k^2$

◆ It is independent of the data. The parameters learned on different datasets will be more similar, so the model function depends less on a specific dataset.
◆ With the regularisation term, features will have a smaller slope. We don't want changing a feature $x$ a little to lead to a big change in $f(x)$.
◆ Less important features will have near-zero parameters.
What regularisation actually does

◆ $f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_{80} x_{80} + \dots$
◆ It effectively removes some less important features by driving their parameters very close to zero.

$x_1$: size of house; $x_2$: #bedrooms; $x_3$: house type; …; $x_{80}$: whether there is an apple tree in the back garden (parameter very close to zero).
Regularisation makes the line smoother

◆ With all the data points, a linear model actually fits very well (left figure).
◆ When the data are not enough (only the red points are available), the model built can be overfitting (right figure).
◆ Regularisation makes the wiggly model smoother, which is better for data that have not been seen in training.
Gradient descent with regularisation
◆ Cost function with regularisation:
$\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \sum_{k=1}^{n}\theta_k^2$
◆ Gradient descent with regularisation: $\theta_i \leftarrow \theta_i - \alpha \dfrac{\partial J(\theta)}{\partial \theta_i}$, where
$\dfrac{\partial J(\theta)}{\partial \theta_i} = \frac{1}{m}\sum_{j=1}^{m}\dfrac{\partial\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2}{\partial \theta_i} + \dfrac{\partial\big(\lambda\sum_{k=1}^{n}\theta_k^2\big)}{\partial \theta_i}$

This gives the updates (note that $\theta_0$ is not regularised):
$\theta_0 \leftarrow \theta_0 - \alpha \frac{2}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)$
$\theta_1 \leftarrow \theta_1 - \alpha \Big[\frac{2}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)x_1^{(j)} + 2\lambda\theta_1\Big]$
$\qquad\vdots$
$\theta_n \leftarrow \theta_n - \alpha \Big[\frac{2}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)x_n^{(j)} + 2\lambda\theta_n\Big]$
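A sketch of these updates in code (the function name and the default values of alpha, lambda and the iteration count are my own choices, not the lecture's):

```python
import numpy as np

def gradient_descent_regularised(Xb, y, alpha=0.01, lam=0.1, iters=10_000):
    """Run the regularised updates above; Xb has a leading column of ones."""
    m, n = Xb.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = (2.0 / m) * Xb.T @ (Xb @ theta - y)   # gradient of the data term
        grad[1:] += 2.0 * lam * theta[1:]            # gradient of lambda * sum_k theta_k^2 (k >= 1)
        theta -= alpha * grad
    return theta
```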
How to set the regularisation parameter
◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \sum_{k=1}^{n}\theta_k^2$
◆ Try a range of values, roughly doubling each time:
1. Try $\lambda = 0$
2. Try $\lambda = 0.01$
3. Try $\lambda = 0.02$
4. Try $\lambda = 0.04$
5. Try $\lambda = 0.08$
6. Try $\lambda = 0.16$
…
12. Try $\lambda = 10.24$
◆ Note that when testing, use the cost function without regularisation.

[Figure: example fits at $\lambda = 0$ and $\lambda = 10000$, and a plot of training and testing cost against $\lambda$.]
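The selection loop on this slide could look roughly like the following sketch, reusing the gradient_descent_regularised function sketched above (pick_lambda and the validation-data names are hypothetical):

```python
import numpy as np

def pick_lambda(fit, unreg_cost, Xb_train, y_train, Xb_val, y_val):
    """Try each candidate lambda; keep the one with the lowest *unregularised*
    cost on the held-out (validation/testing) data."""
    lambdas = [0.0] + [0.01 * 2 ** i for i in range(11)]   # 0, 0.01, 0.02, ..., 10.24
    best_lam, best_cost = None, np.inf
    for lam in lambdas:
        theta = fit(Xb_train, y_train, lam)                # train with regularisation
        c = unreg_cost(theta, Xb_val, y_val)               # evaluate without it
        if c < best_cost:
            best_lam, best_cost = lam, c
    return best_lam, best_cost

# Example usage with the earlier sketches:
# pick_lambda(lambda Xb, y, lam: gradient_descent_regularised(Xb, y, lam=lam),
#             lambda th, Xb, y: np.mean((Xb @ th - y) ** 2),
#             Xb_train, y_train, Xb_val, y_val)
```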
