
Week 3

Dr. Miqing Li
School of Computer Science
University of Birmingham

Code: 68426307
◆ A cohort with very diverse backgrounds.
◆ There is no complicated gradient-descent-related calculation in the class test and exams.
◆ Drop in to my weekly office hour and the help desk session.
◆ My office hour: 12:30-13:30 and 16:30-17:30 Wednesday, room 212 in the School. Zoom link: https://bham-ac-uk.zoom.us/j/2066382427
◆ Help desk session: 15:00-17:00 Thursday, room 222 in the School. Zoom link: https://bham-ac-uk.zoom.us/j/5206341281
Recap of Gradient Descent
◆ Suppose we want to find a linear model $f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ to fit a dataset $D = \{(x_1, x_2, y)\} = \{(0,1,1), (1,2,3), (2,3,4)\}$. We consider the cost function $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$. We use gradient descent to iteratively update the parameters $\theta$ starting from $\theta_0 = \theta_1 = 0$, $\theta_2 = 1$.
Question 1): Calculate the cost function at the initial point of gradient descent.

Model: $f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$

Data:
example | y | x1 | x2
1       | 1 | 0  | 1
2       | 3 | 1  | 2
3       | 4 | 2  | 3

[Figure: for each example, we care about the distance between the predicted value $(x^{(j)}, f(x^{(j)}))$ and the target value $(x^{(j)}, y^{(j)})$ when $x$ is $x^{(j)}$.]
Question 1) solution:

$f(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 = x_2$ when $\theta_0 = \theta_1 = 0$, $\theta_2 = 1$.

$J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2 = \frac{1}{3}\Big[\big(x_2^{(1)} - y^{(1)}\big)^2 + \big(x_2^{(2)} - y^{(2)}\big)^2 + \big(x_2^{(3)} - y^{(3)}\big)^2\Big]$

$= \frac{1}{3}\big[(1-1)^2 + (2-3)^2 + (3-4)^2\big] = \frac{2}{3}$
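As a quick numerical check of Question 1, here is a minimal Python sketch (my own illustration, not part of the slides; the array and function names are mine) that evaluates the cost at the initial parameters:

```python
import numpy as np

# Dataset from the slides: rows are (x1, x2), targets y
X = np.array([[0, 1], [1, 2], [2, 3]], dtype=float)
y = np.array([1, 3, 4], dtype=float)
theta = np.array([0.0, 0.0, 1.0])          # theta0, theta1, theta2

def f(X, theta):
    # f(x1, x2) = theta0 + theta1*x1 + theta2*x2
    return theta[0] + X @ theta[1:]

def cost(X, y, theta):
    # J(theta) = (1/m) * sum_j (f(x^(j)) - y^(j))^2
    return np.mean((f(X, theta) - y) ** 2)

print(cost(X, y, theta))                   # 0.666..., i.e. 2/3
```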
Question 2): Calculate $\theta$ after the first iteration of the update process in gradient descent, where the learning rate is set to $\alpha = 0.1$ (same model, cost function, data, and starting point as above).

Gradient descent update: $\theta_i \leftarrow \theta_i - \alpha \dfrac{\partial J(\theta)}{\partial \theta_i}$

$\dfrac{\partial J(\theta)}{\partial \theta_0} = \dfrac{\partial}{\partial \theta_0}\Big[\frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2\Big] = \frac{1}{m}\sum_{j=1}^{m}\dfrac{\partial \big(f(x^{(j)}) - y^{(j)}\big)^2}{\partial \theta_0}$

As we have $f(x^{(j)}) - y^{(j)} = \theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}$, and according to the derivative rule $\dfrac{d(a\theta - b)^2}{d\theta} = 2(a\theta - b)a$:

$\dfrac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{j=1}^{m} 2\big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big) = \frac{2}{3}\big[(1-1) + (2-3) + (3-4)\big] = -\frac{4}{3}$

$\theta_0 \leftarrow \theta_0 - \alpha \dfrac{\partial J(\theta)}{\partial \theta_0} = 0 - 0.1 \times \Big(-\frac{4}{3}\Big) = \frac{2}{15}$
For $\theta_1$ (same setup):

$\dfrac{\partial J(\theta)}{\partial \theta_1} = \dfrac{\partial}{\partial \theta_1}\Big[\frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2\Big] = \frac{1}{m}\sum_{j=1}^{m}\dfrac{\partial \big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big)^2}{\partial \theta_1}$

According to the derivative rule $\dfrac{d(a\theta - b)^2}{d\theta} = 2(a\theta - b)a$:

$\dfrac{\partial J(\theta)}{\partial \theta_1} = \frac{1}{m}\sum_{j=1}^{m} 2\big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big)x_1^{(j)} = \frac{2}{3}\big[(1-1)\times 0 + (2-3)\times 1 + (3-4)\times 2\big] = -2$

$\theta_1 \leftarrow \theta_1 - \alpha \dfrac{\partial J(\theta)}{\partial \theta_1} = 0 - 0.1 \times (-2) = \frac{1}{5}$
For $\theta_2$ (same setup):

$\dfrac{\partial J(\theta)}{\partial \theta_2} = \frac{1}{m}\sum_{j=1}^{m}\dfrac{\partial \big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big)^2}{\partial \theta_2} = \frac{1}{m}\sum_{j=1}^{m} 2\big(\theta_0 + \theta_1 x_1^{(j)} + \theta_2 x_2^{(j)} - y^{(j)}\big)x_2^{(j)}$

$= \frac{2}{3}\big[(1-1)\times 1 + (2-3)\times 2 + (3-4)\times 3\big] = -\frac{10}{3}$

$\theta_2 \leftarrow \theta_2 - \alpha \dfrac{\partial J(\theta)}{\partial \theta_2} = 1 - 0.1 \times \Big(-\frac{10}{3}\Big) = \frac{4}{3}$
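Similarly, one full gradient-descent update can be checked numerically. The following sketch (my own illustration, not the lecture's code) vectorises the three partial derivatives and reproduces $\theta = (2/15, 1/5, 4/3)$:

```python
import numpy as np

X = np.array([[0, 1], [1, 2], [2, 3]], dtype=float)   # columns: x1, x2
y = np.array([1, 3, 4], dtype=float)
theta = np.array([0.0, 0.0, 1.0])                     # starting point
alpha = 0.1                                           # learning rate
m = len(y)

# Prepend a column of ones so theta0 is handled like the other parameters
Xb = np.hstack([np.ones((m, 1)), X])                  # shape (3, 3)

residual = Xb @ theta - y                             # f(x^(j)) - y^(j)
grad = (2.0 / m) * Xb.T @ residual                    # [dJ/dtheta0, dJ/dtheta1, dJ/dtheta2]
theta = theta - alpha * grad

print(grad)    # [-1.333..., -2.0, -3.333...] = (-4/3, -2, -10/3)
print(theta)   # [ 0.1333...,  0.2,  1.333...] = (2/15, 1/5, 4/3)
```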
Training and Testing, and Polynomial Model

Recap – Linear Regression
◆ Determine a model: $f(x) = \theta_0 + \theta_1 x$
◆ Measure the model: $J(\theta_0, \theta_1) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Find the best parameters by gradient descent.

Training data:
y (price, 100000*£) | x (size, 1000*feet²)
1.2 | 0.5
2.0 | 0.9
3.4 | 1.0
3.0 | 1.3
6.0 | 1.4
4.0 | 1.6
7.0 | 1.8
9.0 | 2.3

Fitted parameters: $\theta_0 = -1.55$, $\theta_1 = 4.49$, i.e. $f(x) = -1.55 + 4.49x$. Cost (error): 0.9523.
The purpose of building the model is to predict.
◆ We need testing.
◆ We need to evaluate how the model performs when applied to another dataset.
Testing
◆ Model: $f(x) = -1.55 + 4.49x$
◆ Cost function: $\frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$

Training data (training error: 0.9523):
y (price, 100000*£) | x (size, 1000*feet²)
1.2 | 0.5
2.0 | 0.9
3.4 | 1.0
3.0 | 1.3
6.0 | 1.4
4.0 | 1.6
7.0 | 1.8
9.0 | 2.3

Testing data (testing error: 1.5281):
y (price, 100000*£) | x (size, 1000*feet²)
2.7 | 0.69
1.75 | 0.723
4.3995 | 0.9
2.25 | 0.915
4.0995 | 0.95
4.2 | 1.367
6.5 | 1.458
10.5 | 2.308

Can we do better?
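For readers who want to reproduce these figures, a small sketch along the following lines (my own code, not the lecture's) recomputes both errors for the fitted linear model; the exact numbers may differ slightly from the slide because the printed parameters (-1.55, 4.49) are rounded:

```python
import numpy as np

# House-price data from the two tables (y in 100000*GBP, x in 1000*sq ft)
x_train = np.array([0.5, 0.9, 1.0, 1.3, 1.4, 1.6, 1.8, 2.3])
y_train = np.array([1.2, 2.0, 3.4, 3.0, 6.0, 4.0, 7.0, 9.0])
x_test = np.array([0.69, 0.723, 0.9, 0.915, 0.95, 1.367, 1.458, 2.308])
y_test = np.array([2.7, 1.75, 4.3995, 2.25, 4.0995, 4.2, 6.5, 10.5])

def f(x):
    # Linear model with the (rounded) parameters reported on the slide
    return -1.55 + 4.49 * x

def mse(x, y):
    return np.mean((f(x) - y) ** 2)

print("training MSE:", mse(x_train, y_train))
print("testing MSE:", mse(x_test, y_test))
```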
Polynomial model
◆ Model: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
◆ Same cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Same method to find the best $\theta$: gradient descent (in the calculation, the coefficient associated with $\theta_2$ is $x^2$).

Fitted on the same training data (the eight (y, x) pairs above):
◆ $f(x) = -0.17 + 1.68x + 0.99x^2$
◆ Error = 0.8986
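A hypothetical sketch of this procedure: build the $x^2$ feature explicitly and run the same gradient-descent loop. The learning rate and iteration count below are arbitrary choices of mine, so the resulting coefficients need not match the slide's (-0.17, 1.68, 0.99) exactly:

```python
import numpy as np

x = np.array([0.5, 0.9, 1.0, 1.3, 1.4, 1.6, 1.8, 2.3])
y = np.array([1.2, 2.0, 3.4, 3.0, 6.0, 4.0, 7.0, 9.0])

# Design matrix for f(x) = theta0 + theta1*x + theta2*x^2:
# the coefficient associated with theta2 is simply the extra feature x^2
Xb = np.column_stack([np.ones_like(x), x, x ** 2])

theta = np.zeros(3)
alpha, iters = 0.01, 200_000            # arbitrary settings for this sketch
m = len(y)

for _ in range(iters):
    grad = (2.0 / m) * Xb.T @ (Xb @ theta - y)
    theta -= alpha * grad

print(theta)                            # fitted (theta0, theta1, theta2)
print(np.mean((Xb @ theta - y) ** 2))   # training MSE of the quadratic model
```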
Testing the quadratic model
◆ Model: $f(x) = -0.17 + 1.68x + 0.99x^2$
◆ Cost function: $\frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Evaluated on the same testing data (the eight (y, x) pairs above).

Training error: 0.8986    Testing error: 1.4041
Quadratic better than linear model
Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041


3rd-order polynomial
◆ Model: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
◆ $\theta_0 = 0.12$, $\theta_1 = 1.21$, $\theta_2 = 1.72$, $\theta_3 = -0.23$
◆ Training: MSE = 0.8872    Testing: MSE = 1.4746

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746

5th-order polynomial
◆ Model: $f(x) = \theta_0 + \theta_1 x + \dots + \theta_5 x^5$
◆ $\theta_0 = 0.74$, $\theta_1 = 0.67$, $\theta_2 = 0.60$, $\theta_3 = 0.49$, $\theta_4 = 0.30$, $\theta_5 = -0.17$
◆ Training: MSE = 0.8730    Testing: MSE = 1.6020

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746
5th order | 0.8730         | 1.6020
Higher-order polynomial
◆ Model: $f(x) = \theta_0 + \theta_1 x + \dots + \theta_9 x^9$
◆ Training: MSE = 0.8546    Testing: MSE = 2.0339

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746
5th order | 0.8730         | 1.6020
9th order | 0.8546         | 2.0339
Error of different models in training
◆ Models:
  ◆ $f(x) = \theta_0 + \theta_1 x$
  ◆ $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
  ◆ $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
  ◆ $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5$
  ◆ $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5 + \dots + \theta_9 x^9$

[Figure: training mean squared error against polynomial order, falling from about 0.95 (linear) to about 0.85 (9th order); a nested-ovals diagram shows the lower-order model's function set contained in the higher-order model's.]

◆ Why is a higher-order model's training cost never worse than a lower-order model's?
A more complex model yields lower error on training data.
Error of different models in testing
◆ Models: the same five polynomial models as above (orders 1, 2, 3, 5, 9).

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746
5th order | 0.8730         | 1.6020
9th order | 0.8546         | 2.0339

[Figure: training and testing mean squared error against polynomial order; training error keeps falling while testing error rises for the higher orders.]

A more complex model may lead to higher error on testing data; this is called overfitting.
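The whole sweep over polynomial orders can be mimicked with a short sketch (my own code, not the lecture's). Note this uses NumPy's least-squares polyfit rather than the lecture's gradient descent, so the numbers will not match the table exactly, but the pattern (training error falls, testing error eventually rises) is the same:

```python
import numpy as np

x_train = np.array([0.5, 0.9, 1.0, 1.3, 1.4, 1.6, 1.8, 2.3])
y_train = np.array([1.2, 2.0, 3.4, 3.0, 6.0, 4.0, 7.0, 9.0])
x_test = np.array([0.69, 0.723, 0.9, 0.915, 0.95, 1.367, 1.458, 2.308])
y_test = np.array([2.7, 1.75, 4.3995, 2.25, 4.0995, 4.2, 6.5, 10.5])

for order in (1, 2, 3, 5, 9):
    # Least-squares fit; for order 9 NumPy warns the fit is poorly conditioned,
    # since a 9th-order polynomial has more coefficients than the 8 data points.
    coeffs = np.polyfit(x_train, y_train, order)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order}: train MSE {mse_train:.4f}, test MSE {mse_test:.4f}")
```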
Underfitting and Overfitting

Training and Testing
Training:
◆ Define the model: a set of functions determined by $\theta_0, \dots, \theta_n$.
◆ Measure functions: how good is a function.
◆ Find the best function.
Testing:
◆ Evaluate the built model function on new data.

Example (linear):
◆ Linear model: $f(x_1) = \theta_0 + \theta_1 x_1$
◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Optimisation algorithm: gradient descent
◆ e.g. $f(x_1) = 3 + 5x_1$

Example (quadratic):
◆ Quadratic model: $f(x_1) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$
◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f(x^{(j)}) - y^{(j)}\big)^2$
◆ Optimisation algorithm: gradient descent
◆ e.g. $f(x_1) = 1 - x_1 + 2x_1^2$
Underfitting and overfitting
Underfitting:
◆ Underfitting occurs when a model cannot capture the underlying trend of the data.
◆ It is the effect of considering an oversimple model.
◆ Underfitting means high bias of the model.
◆ Example: $f(x) = \theta_0 + \theta_1 x$
Overfitting:
◆ Overfitting occurs when a model captures the noise of the data.
◆ It is the effect of considering an overcomplex model relative to the training data size.
◆ Overfitting means high variance of the model.
◆ Example: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_5 x^5$

[Figures: house price against house size, fitted with the underfitting (linear) and overfitting (5th-order) models.]

What causes them?
◆ Consider the set of functions that a model represents.
◆ There are three models, represented by three ovals.
◆ The set of functions that a simpler model represents is a subset of the set that a more complex model represents.

[Figure: three nested ovals. Labels: "An oversimple model (e.g., few features, linear model)"; "Optimal functions found under the oversimple model for different datasets"; "Optimal functions found under the overcomplex model for different datasets"; "True optimal function"; "An overcomplex model (e.g., many features, high order of polynomial)".]
Diagnose underfitting and overfitting
House price prediction example:

Model     | MSE (Training) | MSE (Testing)
Linear    | 0.9523         | 1.5281
Quadratic | 0.8986         | 1.4041
3rd order | 0.8872         | 1.4746
5th order | 0.8730         | 1.6020
9th order | 0.8546         | 2.0339

[Figure: mean squared error against model complexity; the training curve keeps decreasing while the testing curve first decreases, then increases.]

◆ Underfitting (bias): high error in both training and testing.
◆ Overfitting (variance): low error in training, high error in testing.
Address underfitting

[Figure: nested-ovals diagram as before, with labels "An oversimple model (e.g., few features, linear model)", "Optimal functions found under the oversimple model for different datasets", and "True optimal function".]

◆ Try getting additional features (#bedrooms, house type, location).
◆ Try adding polynomial features ($x_1^2$, $x_2^2$, $x_1 x_2$, …).
Address overfitting

[Figure: nested-ovals diagram as before, with labels "Optimal functions found under the overcomplex model for different datasets", "True optimal function", and "An overcomplex model (many features, high order of polynomial)".]

◆ Get more training data.
◆ Try a smaller set of features (remove less important features and/or high-order terms).
Regularisation

Address overfitting
◆ Complex model: $f_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \cdots$
◆ Reduce the number of features (or high-order terms).
◆ Regularisation works on the cost function.

Features: $x_1$: size of house; $x_2$: #bedrooms; $x_3$: house type; $x_4$: #floors; $x_5$: age of the house; $x_6$: size of lounge; $x_7$: location; $x_8$: distance to train station; …

Step 1: Choose a model → Step 2: Measure goodness of functions → Step 3: Find the best function
Formalisation
◆ Model: $f_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$
◆ Original cost function: $\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2$
◆ Cost function with regularisation ($\lambda$ is the regularisation parameter):
$\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \sum_{k=1}^{n}\theta_k^2$

Regularisation makes the model less "perfect" in training, and then hopefully better in testing.
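A minimal sketch of this regularised cost in Python (function and argument names are mine; theta[0] is excluded from the penalty, matching the sum starting at k = 1):

```python
import numpy as np

def regularised_cost(theta, Xb, y, lam):
    """(1/m) * sum of squared errors + lambda * sum_{k>=1} theta_k^2.

    Xb is the design matrix with a leading column of ones;
    theta[0] is not penalised, matching the sum starting at k = 1.
    """
    residual = Xb @ theta - y
    return np.mean(residual ** 2) + lam * np.sum(theta[1:] ** 2)
```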
Why this regularisation term?

$\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \sum_{k=1}^{n}\theta_k^2$

◆ It is independent of the data. The parameters learned on different datasets will be more similar, so the model function depends less on a specific dataset.
◆ With the regularisation term, features will have a smaller slope. We don't want changing a feature $x$ a little to lead to a big change in $f(x)$.
◆ Less important features will have near-zero parameters.
What regularisation actually does

◆ $f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_{80} x_{80} + \dots$
◆ It effectively removes some less important features by driving their parameters very close to zero.

$x_1$: size of house; $x_2$: #bedrooms; $x_3$: house type; …; $x_{80}$: whether there is an apple tree in the back garden (parameter very close to zero).
Regularisation makes the line smoother

◆ With all the data points, a linear model actually fits very well (left figure).
◆ When the data are not enough (only the red points are available), the model built can be overfitting (right figure).
◆ Regularisation makes the wiggly model smoother, which is better for data that have not been seen in training.
Gradient descent with regularisation
◆ Cost function with regularisation:
$\min J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \sum_{k=1}^{n}\theta_k^2$
◆ Gradient descent with regularisation: $\theta_i \leftarrow \theta_i - \alpha \dfrac{\partial J(\theta)}{\partial \theta_i}$, where
$\dfrac{\partial J(\theta)}{\partial \theta_i} = \frac{1}{m}\sum_{j=1}^{m}\dfrac{\partial\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2}{\partial \theta_i} + \dfrac{\partial\big(\lambda\sum_{k=1}^{n}\theta_k^2\big)}{\partial \theta_i}$

This gives the updates (note that $\theta_0$ is not regularised):
$\theta_0 \leftarrow \theta_0 - \alpha \frac{2}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)$
$\theta_1 \leftarrow \theta_1 - \alpha \Big[\frac{2}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)x_1^{(j)} + 2\lambda\theta_1\Big]$
$\qquad\vdots$
$\theta_n \leftarrow \theta_n - \alpha \Big[\frac{2}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)x_n^{(j)} + 2\lambda\theta_n\Big]$
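A sketch of these updates in code (the function name and the default values of alpha, lambda and the iteration count are my own choices, not the lecture's):

```python
import numpy as np

def gradient_descent_regularised(Xb, y, alpha=0.01, lam=0.1, iters=10_000):
    """Run the regularised updates above; Xb has a leading column of ones."""
    m, n = Xb.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = (2.0 / m) * Xb.T @ (Xb @ theta - y)   # gradient of the data term
        grad[1:] += 2.0 * lam * theta[1:]            # gradient of lambda * sum_k theta_k^2 (k >= 1)
        theta -= alpha * grad
    return theta
```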
How to set the regularisation parameter
◆ Cost function: $J(\theta) = \frac{1}{m}\sum_{j=1}^{m}\big(f_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \sum_{k=1}^{n}\theta_k^2$
◆ Try a range of values, roughly doubling each time:
1. Try $\lambda = 0$
2. Try $\lambda = 0.01$
3. Try $\lambda = 0.02$
4. Try $\lambda = 0.04$
5. Try $\lambda = 0.08$
6. Try $\lambda = 0.16$
…
12. Try $\lambda = 10.24$
◆ Note that when testing, use the cost function without regularisation.

[Figure: example fits at $\lambda = 0$ and $\lambda = 10000$, and a plot of training and testing cost against $\lambda$.]
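The selection loop on this slide could look roughly like the following sketch, reusing the gradient_descent_regularised function sketched above (pick_lambda and the validation-data names are hypothetical):

```python
import numpy as np

def pick_lambda(fit, unreg_cost, Xb_train, y_train, Xb_val, y_val):
    """Try each candidate lambda; keep the one with the lowest *unregularised*
    cost on the held-out (validation/testing) data."""
    lambdas = [0.0] + [0.01 * 2 ** i for i in range(11)]   # 0, 0.01, 0.02, ..., 10.24
    best_lam, best_cost = None, np.inf
    for lam in lambdas:
        theta = fit(Xb_train, y_train, lam)                # train with regularisation
        c = unreg_cost(theta, Xb_val, y_val)               # evaluate without it
        if c < best_cost:
            best_lam, best_cost = lam, c
    return best_lam, best_cost

# Example usage with the earlier sketches:
# pick_lambda(lambda Xb, y, lam: gradient_descent_regularised(Xb, y, lam=lam),
#             lambda th, Xb, y: np.mean((Xb @ th - y) ** 2),
#             Xb_train, y_train, Xb_val, y_val)
```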
