Week 3: Code: 68426307
Dr. Miqing Li
School of Computer Science
University of Birmingham
◆ A cohort of very diverse backgrounds.
◆ Model: $f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$

◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f(x^{(j)}) - y^{(j)}\right)^2$

◆ Data:

example    y    x1    x2
   1       1    0     1
   2       3    1     2
   3       4    2     3

◆ When $\theta_0 = \theta_1 = 0$ and $\theta_2 = 1$: $f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 = x_2$, so

$J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f(x^{(j)}) - y^{(j)}\right)^2 = \frac{1}{3}\left[\left(x_2^{(1)} - y^{(1)}\right)^2 + \left(x_2^{(2)} - y^{(2)}\right)^2 + \left(x_2^{(3)} - y^{(3)}\right)^2\right]$

$= \frac{1}{3}\left[(1-1)^2 + (2-3)^2 + (3-4)^2\right] = \frac{2}{3}$
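To make the arithmetic checkable, here is a minimal sketch (my own Python, not module code) that evaluates this cost on the three data points; it should print 2/3 ≈ 0.6667.

```python
import numpy as np

# Data from the slide: one row per example, columns (x1, x2).
X = np.array([[0.0, 1.0],
              [1.0, 2.0],
              [2.0, 3.0]])
y = np.array([1.0, 3.0, 4.0])

def cost(theta, X, y):
    """J(theta) = (1/m) * sum_j (f(x^(j)) - y^(j))^2."""
    preds = theta[0] + X @ theta[1:]   # f(x1, x2) = th0 + th1*x1 + th2*x2
    return np.mean((preds - y) ** 2)

theta = np.array([0.0, 0.0, 1.0])      # theta0 = theta1 = 0, theta2 = 1
print(cost(theta, X, y))               # 0.6666... = 2/3
```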
◆ Suppose we want to find a linear model $f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ to fit the dataset $D = \{(x_1, x_2, y)\} = \{(0, 1, 1), (1, 2, 3), (2, 3, 4)\}$. We consider the cost function $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f(x^{(j)}) - y^{(j)}\right)^2$ and use gradient descent to iteratively update the parameters $\theta$, starting from $\theta_0 = \theta_1 = 0$, $\theta_2 = 1$.

Question 2) Calculate $\theta$ after the first iteration of the update process in gradient descent, where the learning rate is set to $\alpha = 0.1$.
◆ Model: $f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$

◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f(x^{(j)}) - y^{(j)}\right)^2$

◆ Gradient descent: $\theta_i \leftarrow \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}$

$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{\partial\left(\frac{1}{m}\sum_{j=1}^{m}(f(x^{(j)}) - y^{(j)})^2\right)}{\partial \theta_0} = \frac{1}{m}\sum_{j=1}^{m}\frac{\partial\left(f(x^{(j)}) - y^{(j)}\right)^2}{\partial \theta_0}$

As we have $f(x^{(j)}) - y^{(j)} = \theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}$, and according to the derivative rule $\frac{d(a\theta - b)^2}{d\theta} = 2(a\theta - b)a$,

$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{j=1}^{m} 2\left(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\right) = \frac{2}{3}\left[(1-1) + (2-3) + (3-4)\right] = -\frac{4}{3}$

$\theta_0 \leftarrow \theta_0 - \alpha\frac{\partial J(\theta)}{\partial \theta_0} = 0 - 0.1 \times \left(-\frac{4}{3}\right) = \frac{2}{15}$
$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{\partial\left(\frac{1}{m}\sum_{j=1}^{m}(f(x^{(j)}) - y^{(j)})^2\right)}{\partial \theta_1} = \frac{1}{m}\sum_{j=1}^{m}\frac{\partial\left(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\right)^2}{\partial \theta_1}$

According to the derivative rule $\frac{d(a\theta - b)^2}{d\theta} = 2(a\theta - b)a$,

$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{1}{m}\sum_{j=1}^{m} 2\left(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\right)x_1^{(j)} = \frac{2}{3}\left[(1-1)\times 0 + (2-3)\times 1 + (3-4)\times 2\right] = -2$

$\theta_1 \leftarrow \theta_1 - \alpha\frac{\partial J(\theta)}{\partial \theta_1} = 0 - 0.1 \times (-2) = \frac{1}{5}$
$\frac{\partial J(\theta)}{\partial \theta_2} = \frac{\partial\left(\frac{1}{m}\sum_{j=1}^{m}(f(x^{(j)}) - y^{(j)})^2\right)}{\partial \theta_2} = \frac{1}{m}\sum_{j=1}^{m}\frac{\partial\left(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\right)^2}{\partial \theta_2}$

According to the derivative rule $\frac{d(a\theta - b)^2}{d\theta} = 2(a\theta - b)a$,

$\frac{\partial J(\theta)}{\partial \theta_2} = \frac{1}{m}\sum_{j=1}^{m} 2\left(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\right)x_2^{(j)} = \frac{2}{3}\left[(1-1)\times 1 + (2-3)\times 2 + (3-4)\times 3\right] = -\frac{10}{3}$

$\theta_2 \leftarrow \theta_2 - \alpha\frac{\partial J(\theta)}{\partial \theta_2} = 1 - 0.1 \times \left(-\frac{10}{3}\right) = \frac{4}{3}$
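All three updates can be verified in a few lines. Below is a minimal sketch of one batch gradient-descent step (my own illustration, assuming nothing beyond NumPy); it should print the gradients (−4/3, −2, −10/3) and the updated θ = (2/15, 1/5, 4/3).

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])   # (x1, x2) per example
y = np.array([1.0, 3.0, 4.0])
m = len(y)

theta = np.array([0.0, 0.0, 1.0])   # (theta0, theta1, theta2)
alpha = 0.1

# Residuals f(x^(j)) - y^(j) at the current theta.
residuals = theta[0] + X @ theta[1:] - y

# Partials: (2/m) * sum_j residual_j * x_i^(j), with x_0 = 1 for the intercept.
Xb = np.hstack([np.ones((m, 1)), X])                 # prepend the bias column
grad = (2.0 / m) * (Xb.T @ residuals)

theta = theta - alpha * grad                          # one gradient-descent step
print(grad)    # approx [-1.3333, -2.0, -3.3333] = (-4/3, -2, -10/3)
print(theta)   # approx [ 0.1333,  0.2,  1.3333] = (2/15, 1/5, 4/3)
```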
Training and Testing, and Polynomial Models
Recap – Linear Regression
◆ Determine a model: $f(x) = \theta_0 + \theta_1 x$

◆ Measure the model: $J(\theta_0, \theta_1) = \frac{1}{m}\sum_{j=1}^{m}\left(f(x^{(j)}) - y^{(j)}\right)^2$

◆ Find the best parameters by gradient descent.

y (price, 100000*£)    x (size, 1000*feet²)
1.2                    0.5
2.0                    0.9
3.4                    1.0
3.0                    1.3
6.0                    1.4
4.0                    1.6
7.0                    1.8
9.0                    2.3

Result: $\theta_0 = -1.55$, $\theta_1 = 4.49$, i.e. $f(x) = -1.55 + 4.49x$, with cost (error) 0.9523.
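The slide's parameters come from gradient descent; an exact least-squares fit of the same line (a sketch below, using np.polyfit) should land close to, though not exactly on, θ0 = −1.55 and θ1 = 4.49.

```python
import numpy as np

x = np.array([0.5, 0.9, 1.0, 1.3, 1.4, 1.6, 1.8, 2.3])   # size, 1000*feet^2
y = np.array([1.2, 2.0, 3.4, 3.0, 6.0, 4.0, 7.0, 9.0])   # price, 100000*GBP

theta1, theta0 = np.polyfit(x, y, 1)   # least-squares line; highest degree first
mse = np.mean((theta0 + theta1 * x - y) ** 2)
print(theta0, theta1, mse)             # near -1.55, 4.49 and a cost near 0.9523
```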
The purpose of building the model is to predict on unseen data.
◆ We need testing.
◆ A quadratic model fitted to the same training data: $f(x) = -0.17 + 1.68x + 0.99x^2$

◆ Error = 0.8986
Testing

◆ Model: $f(x) = -0.17 + 1.68x + 0.99x^2$

◆ Cost function: $\frac{1}{m}\sum_{j=1}^{m}\left(f(x^{(j)}) - y^{(j)}\right)^2$

Testing data:

y (price, 100000*£)    x (size, 1000*feet²)
2.7                    0.69
1.75                   0.723
4.3995                 0.9
2.25                   0.915
4.0995                 0.95
4.2                    1.367
6.5                    1.458
10.5                   2.308
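Testing is the same cost computed on the new data with the parameters frozen. In the sketch below the x/y pairing is my reconstruction of the flattened slide table, and the coefficients are the rounded ones printed on the slide, so the result will only approximate the slide's reported testing error.

```python
import numpy as np

def f(x):
    # Quadratic model from the slide (coefficients rounded on the slide).
    return -0.17 + 1.68 * x + 0.99 * x ** 2

# Test set as reconstructed above.
x_test = np.array([0.69, 0.723, 0.9, 0.915, 0.95, 1.367, 1.458, 2.308])
y_test = np.array([2.7, 1.75, 4.3995, 2.25, 4.0995, 4.2, 6.5, 10.5])

test_mse = np.mean((f(x_test) - y_test) ** 2)   # same cost, no parameter updates
print(test_mse)
```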
◆ Model: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
◆ Model: $f(x) = \theta_0 + \theta_1 x + \cdots + \theta_5 x^5$
◆ Model: $f(x) = \theta_0 + \theta_1 x + \cdots + \theta_9 x^9$
As the polynomial order grows, the training error keeps decreasing:

[Figure: training error (MSE, roughly 0.84 to 0.94) against polynomial order (0 to 10); lower-order models sit at the left with higher error.]

Error of different models in testing
Models       MSE (Training)   MSE (Testing)
Linear          0.9523           1.5281
Quadratic       0.8986           1.4041
3rd order       0.8872           1.4746
5th order       0.8730           1.6020
9th order       0.8546           2.0339

[Figure: training and testing MSE against polynomial order; training error falls monotonically, while testing error rises towards 2.5 for the highest orders.]

A more complex model may lead to higher error on the testing data; this is called overfitting.
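The table's pattern can be reproduced end to end: fit each order on the training set, then score both sets with the plain MSE. A sketch using np.polyfit as the fitting routine (the slides fit by gradient descent, so exact numbers will differ slightly; the test pairing is my reconstruction from the earlier table):

```python
import numpy as np

x_tr = np.array([0.5, 0.9, 1.0, 1.3, 1.4, 1.6, 1.8, 2.3])
y_tr = np.array([1.2, 2.0, 3.4, 3.0, 6.0, 4.0, 7.0, 9.0])
x_te = np.array([0.69, 0.723, 0.9, 0.915, 0.95, 1.367, 1.458, 2.308])
y_te = np.array([2.7, 1.75, 4.3995, 2.25, 4.0995, 4.2, 6.5, 10.5])

for order in [1, 2, 3, 5, 9]:
    # Least-squares fit of this order; order 9 on 8 points is underdetermined,
    # so NumPy emits a RankWarning but still returns a solution.
    coeffs = np.polyfit(x_tr, y_tr, order)
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"order {order}: train MSE {mse_tr:.4f}, test MSE {mse_te:.4f}")
# Expected pattern: training MSE keeps falling with order,
# while testing MSE eventually rises (overfitting).
```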
Underfitting and Overfitting
Training and Testing
Training
- Define the model: a set of functions determined by $\theta_0, \ldots, \theta_n$.
- Measure the functions: how good is a function?
- Find the best function.

Testing
- Evaluate the built model (function) on new data.

◆ Linear model: $f(x_1) = \theta_0 + \theta_1 x_1$
◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f(x^{(j)}) - y^{(j)}\right)^2$
◆ Optimisation algorithm: gradient descent, giving e.g. $f(x_1) = 3 + 5x_1$

◆ Quadratic model: $f(x_1) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$
◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f(x^{(j)}) - y^{(j)}\right)^2$
◆ Gradient descent, giving e.g. $f(x_1) = 1 - x_1 + 2x_1^2$
Underfitting and overfitting
◆ Underfitting occurs when a model cannot capture the underlying trend of the data. It is the effect of considering an oversimple model. Underfitting means high bias of the model. Example: $f(x) = \theta_0 + \theta_1 x$.

◆ Overfitting occurs when a model captures the noise of the data. It is the effect of considering an overcomplex model relative to the training-data size. Overfitting means high variance of the model. Example: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_5 x^5$.

[Figure: house price against size for the two cases, with the training and testing MSE shown for each fitted curve.]
Step 1: Choose a model → Step 2: Measure the goodness of functions → Step 3: Find the best function
Formalisation
◆ Model: $f_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$

◆ Original cost function: $\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f_\theta(x^{(j)}) - y^{(j)}\right)^2$

◆ Cost function with regularisation: $\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f_\theta(x^{(j)}) - y^{(j)}\right)^2 + \lambda\sum_{k=1}^{n}\theta_k^2$, where $\lambda$ is the regularisation parameter.

Regularisation makes the model less "perfect" in training, and then hopefully better in testing.
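In code, regularisation is a one-line change to the cost. A minimal sketch (names are my own), with the penalty skipping the intercept $\theta_0$ exactly as the sum over $k = 1, \ldots, n$ does:

```python
import numpy as np

def cost(theta, X, y, lam=0.0):
    """MSE plus an L2 penalty on every parameter except the intercept theta[0]."""
    m = len(y)
    Xb = np.hstack([np.ones((m, 1)), X])      # bias column for theta0
    residuals = Xb @ theta - y
    return np.mean(residuals ** 2) + lam * np.sum(theta[1:] ** 2)

# Reusing the three-point dataset from the worked example:
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
y = np.array([1.0, 3.0, 4.0])
theta = np.array([0.0, 0.0, 1.0])
print(cost(theta, X, y, lam=0.0))   # 2/3, as computed earlier
print(cost(theta, X, y, lam=1.0))   # 2/3 + 1*(0^2 + 1^2) = 5/3
```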
Why this regularisation term?

$\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f_\theta(x^{(j)}) - y^{(j)}\right)^2 + \lambda\sum_{k=1}^{n}\theta_k^2$

◆ $f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_{80} x_{80} + \cdots$

$\frac{\partial J(\theta)}{\partial \theta_i} = \frac{1}{m}\sum_{j=1}^{m}\frac{\partial\left(f_\theta(x^{(j)}) - y^{(j)}\right)^2}{\partial \theta_i} + \lambda\sum_{k=1}^{n}\frac{\partial \theta_k^2}{\partial \theta_i}$

Since $\frac{\partial \theta_k^2}{\partial \theta_i} = 2\theta_i$ when $k = i$ (and 0 otherwise), the updates become:

$\theta_0 \leftarrow \theta_0 - \alpha\,\frac{2}{m}\sum_{j=1}^{m}\left(f_\theta(x^{(j)}) - y^{(j)}\right)$

$\theta_1 \leftarrow \theta_1 - \alpha\left(\frac{2}{m}\sum_{j=1}^{m}\left(f_\theta(x^{(j)}) - y^{(j)}\right)x_1^{(j)} + 2\lambda\theta_1\right)$

$\ldots$

$\theta_n \leftarrow \theta_n - \alpha\left(\frac{2}{m}\sum_{j=1}^{m}\left(f_\theta(x^{(j)}) - y^{(j)}\right)x_n^{(j)} + 2\lambda\theta_n\right)$

Every regularised parameter is pulled towards zero on each step, so large weights survive only if they clearly reduce the data term.
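Collected into one vectorised update, the rules above look as follows (an illustrative sketch; note the $2\lambda\theta$ factor from differentiating $\lambda\theta^2$, and that $\theta_0$ is excluded from the penalty):

```python
import numpy as np

def gd_step(theta, X, y, alpha, lam):
    """One gradient-descent step on the L2-regularised cost."""
    m = len(y)
    Xb = np.hstack([np.ones((m, 1)), X])      # x_0 = 1 for the intercept
    residuals = Xb @ theta - y                # f_theta(x^(j)) - y^(j)
    grad = (2.0 / m) * (Xb.T @ residuals)     # data term for every theta_i
    grad[1:] += 2.0 * lam * theta[1:]         # penalty term, skipping theta0
    return theta - alpha * grad
```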
How to set the regularisation parameter

◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\left(f_\theta(x^{(j)}) - y^{(j)}\right)^2 + \lambda\sum_{k=1}^{n}\theta_k^2$

[Figure: two fitted curves, one with $\lambda = 0$ (overfitted) and one with $\lambda = 10000$ (heavily regularised).]

1. Try $\lambda = 0$
2. Try $\lambda = 0.01$
3. Try $\lambda = 0.02$
4. Try $\lambda = 0.04$
5. Try $\lambda = 0.08$
6. Try $\lambda = 0.16$
…
12. Try $\lambda = 10.24$

Note that when testing, use the cost function without regularisation.

[Figure: cost against $\lambda$; the training cost increases with $\lambda$, while the testing cost first decreases and then increases.]
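The doubling schedule is easy to automate. Below is a toy sketch on the single-feature house data (the update and cost helpers repeat the earlier sketches so the snippet runs on its own; the test pairing is again my reconstruction). The key detail is that each $\lambda$ is judged by the penalty-free cost on the held-out data:

```python
import numpy as np

def gd_step(theta, X, y, alpha, lam):
    # One step on the regularised cost (same update as the earlier sketch).
    m = len(y)
    Xb = np.hstack([np.ones((m, 1)), X])
    grad = (2.0 / m) * (Xb.T @ (Xb @ theta - y))
    grad[1:] += 2.0 * lam * theta[1:]
    return theta - alpha * grad

def mse(theta, X, y):
    # Plain MSE -- no regularisation term, as the slide instructs for testing.
    Xb = np.hstack([np.ones((len(y), 1)), X])
    return np.mean((Xb @ theta - y) ** 2)

X_tr = np.array([0.5, 0.9, 1.0, 1.3, 1.4, 1.6, 1.8, 2.3])[:, None]
y_tr = np.array([1.2, 2.0, 3.4, 3.0, 6.0, 4.0, 7.0, 9.0])
X_te = np.array([0.69, 0.723, 0.9, 0.915, 0.95, 1.367, 1.458, 2.308])[:, None]
y_te = np.array([2.7, 1.75, 4.3995, 2.25, 4.0995, 4.2, 6.5, 10.5])

# Doubling schedule from the slide: 0, then 0.01, 0.02, ..., 10.24 (12 trials).
lambdas = [0.0] + [0.01 * 2 ** i for i in range(11)]

best = None
for lam in lambdas:
    theta = np.zeros(2)                   # restart from zero for each lambda
    for _ in range(5000):                 # fixed iteration budget
        theta = gd_step(theta, X_tr, y_tr, alpha=0.05, lam=lam)
    score = mse(theta, X_te, y_te)        # evaluate WITHOUT regularisation
    if best is None or score < best[1]:
        best = (lam, score)
print(best)                               # (chosen lambda, its testing MSE)
```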