0% found this document useful (0 votes)

57 views42 pages

Lec 03

This document provides an overview of linear regression and gradient descent optimization. It introduces linear regression as a model where the prediction is a linear combination of the input features. It defines the mean squared error as a loss function to measure error. Gradient descent is presented as an optimization algorithm that can be used to find the optimal parameters by iteratively taking steps in the direction of steepest descent as defined by the gradient of the loss function.

Uploaded by

JustSaint

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

57 views42 pages

Lec 03

Uploaded by

JustSaint

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 42

CSC 311: Introduction to Machine Learning

Lecture 3 - Linear Regression &

Gradient Descent for Optimization

Amir-massoud Farahmand & Emad A.M. Andrews

University of Toronto

Intro ML (UofT) CSC311-Lec3 1 / 42

Modular Approach to ML Algorithm Design

So far, we have talked about procedures for learning.

I KNN and decision trees.
For the remainder of this course, we will take a more modular
approach:
I choose a model describing the relationships between variables of
interest
I define a loss function quantifying how bad the fit to the data is
I choose a regularizer saying how much we prefer different candidate
models (or explanations of data)
I fit the model that minimizes the loss function and satisfy the
constraint/penalty imposed by the regularizer, possibly using an
optimization algorithm
Mixing and matching these modular components gives us a lot of
new ML methods.

Intro ML (UofT) CSC311-Lec3 2 / 42

The Supervised Learning Setup

Recall that in supervised learning:

There is target t ∈ T (also called response, outcome, output, class)
There are features x ∈ X (also called inputs and covariates)
Objective is to learn a function f : X → T such that
t ≈ y = f (x)
based on some data D = {(x(i) , t(i) ) for i = 1, 2, ..., N }.
Intro ML (UofT) CSC311-Lec3 3 / 42
Linear Regression – Model

Model: In linear regression, we use linear functions of the inputs

x = (x1 , . . . , xD ) to make predictions y of the target value t:
X
y =f (x) = wj x j + b
j

I y is the prediction
I w is the weights
I b is the bias (or intercept) (do not confuse with the bias-variance
tradeoff in the next lecture)
w and b together are the parameters
We hope that our prediction is close to the target: y ≈ t.

Intro ML (UofT) CSC311-Lec3 4 / 42

What is Linear? 1 Feature vs. D Features

2.0 Fitted line

Data
1.5
1.0 If we have only 1 feature:
y: response

0.5
0.0
y = wx + b where w, x, b ∈ R.
0.5
1.0 y is linear in x.
2 1 0 1 2
x: features

If we have D features:
y = w> x + b where w, x ∈ RD ,
b∈R
y is linear in x.

Relation between the prediction y and inputs x is linear in both cases.

Intro ML (UofT) CSC311-Lec3 5 / 42

Linear Regression

We have a dataset D = {(x(i) , t(i) ) for i = 1, 2, ..., N } where,

(i) (i) (i)
x(i) = (x1 , x2 , ..., xD )> ∈ RD are the inputs, e.g., age, height.
t(i) ∈ R is the target or response (e.g. income),
predict t(i) with a linear function of x(i) :

2.02.0 Data
Fitted line
Data
1.51.5
1.01.0 t(i) ≈ y (i) = w> x(i) + b
y: response
y: response

0.50.5 Find the “best” line (w, b).

0.00.0 minimize N (i) (i)
P
i=1 L(y , t )
0.50.5 (w,b)

1.01.0
22 11 00 11 22
x:x:features
features

Intro ML (UofT) CSC311-Lec3 6 / 42

Linear Regression – Loss Function
How to quantify the quality of the fit to data?
A loss function L(y, t) defines how bad it is if, for some example x,
the algorithm predicts y, but the target is actually t.
Squared error loss function:
L(y, t) = 12 (y − t)2
y − t is the residual, and we want to make its magnitude small
The 21 factor is just to make the calculations convenient.
Cost function: loss function averaged over all training examples
N
1 X (i) 2
J (w, b) = y − t(i)
2N i=1
N
1 X > (i) 2
= w x + b − t(i)
2N i=1

The terminology is not universal. Some might call “loss” pointwise

loss and the “cost function” the empirical loss or average loss.
Intro ML (UofT) CSC311-Lec3 7 / 42
Vector Notation

We can organize all the training examples into a design matrix X

with one row per training example, and all the targets into the
target vector t.

Computing the predictions for the whole dataset:

 T (1)   (1) 
w x +b y
.
.. . 
Xw + b1 =   =  ..  = y
  

wT x(N ) + b y (N )

Intro ML (UofT) CSC311-Lec3 8 / 42

Vectorization
Computing the squared error cost across the whole dataset:
y = Xw + b1
1
J = ky − tk2
2N
Note that sometimes we may use J = 21 ky − tk2 , without
normalizer. That would correspond to the sum of losses, and not
the average loss. The minimizer does not depend on N .
We can also add a column of 1s to the design matrix, combine the
bias and the weights, and conveniently write
 
b
1 [x(1) ]>
 
w1 
(2) > 
X = 1 [x ]  ∈ RN ×D+1 and w = w  ∈ RD+1
  
..  2
1 . ..
.
Then, our predictions reduce to y = Xw.
Intro ML (UofT) CSC311-Lec3 9 / 42
Solving the Minimization Problem

We defined a cost function. This is what we would like to

minimize.
Recall from your calculus class: minimum of a smooth function (if
it exists) occurs at a critical point, i.e., point where the derivative
is zero.
Multivariate generalization: set the partial derivatives to zero (or
equivalently the gradient).
We would like to find a point where the gradient is (close to) zero.
How can we do it?
I Sometimes it is possible to directly find the parameters that make
the gradient zero in a closed-form. We call this the direct solution.
I We may also use optimization techniques that iteratively get us
closer to the solution. We will get back to this soon.

Intro ML (UofT) CSC311-Lec3 10 / 42

Direct Solution

Partial derivatives: derivatives of a multivariate function with

respect to one of its arguments.
∂ f (x1 + h, x2 ) − f (x1 , x2 )
f (x1 , x2 ) = lim
∂x1 h→0 h
To compute, take the single variable derivatives, pretending the
other arguments are constant.
Example: partial derivatives of the prediction y
 
∂y ∂ X
= wj 0 xj 0 + b
∂wj ∂wj 0 j

= xj
 
∂y ∂ X
= wj 0 xj 0 + b
∂b ∂b 0j

Intro ML (UofT) CSC311-Lec3 11 / 42

Direct Solution

Chain rule for derivatives:

∂L dL ∂y
=
∂wj dy ∂wj

d 1
= (y − t)2 · xj
dy 2
= (y − t)xj
∂L
=y−t
∂b

Cost derivatives (average over data points):

N
∂J 1 X (i) (i)
= (y − t(i) ) xj
∂wj N i=1
N
∂J 1 X (i)
= y − t(i)
∂b N i=1

Intro ML (UofT) CSC311-Lec3 12 / 42

Direct Solution

The minimum must occur at a point where the partial derivatives

are zero, i.e.,
∂J ∂J
= 0 (∀j), = 0.
∂wj ∂b
If ∂J /∂wj 6= 0, you could reduce the cost by changing wj .
This turns out to give a system of linear equations, which we can
solve efficiently. Full derivation in the preliminaries.pdf.
Optimal weights:
wLS = (XT X)−1 XT t

Linear regression is one of only a handful of models in this course

that permit direct solution.

Intro ML (UofT) CSC311-Lec3 13 / 42

Feature Mapping (Basis Expansion)

The relation between the input and output may not be linear.

We can still use linear regression by mapping the input feature to

another space using feature mapping (or basis expansion)
ψ(x) : RD → Rd and treat the mapped feature (in Rd ) as the
input of a linear regression procedure.
Let us see how it works when x ∈ R and we use polynomial feature
mapping.
Intro ML (UofT) CSC311-Lec3 14 / 42
Polynomial Feature Mapping

Fit the data using a degree-M polynomial function of the form:

M
X
y = w0 + w1 x + w2 x2 + ... + wM xM = wi x i
i=0

Good model (M=3): Achieves small test error (generalizes well).

1 M =3
t

−1

0 x 1

Intro ML (UofT) CSC311-Lec3 20 / 42

Model Complexity and Generalization

1 M =9
t

−1

0 x 1

As M increases, the magnitude of coefficients gets larger.

For M = 9, the coefficients have become finely tuned to the data.
Between data points, the function exhibits large oscillations.

Intro ML (UofT) CSC311-Lec3 21 / 42

Regularization for Controlling the Model Complexity

The degree of the polynomial M controls the complexity of the

model.
The value of M is a hyperparameter for polynomial expansion,
just like k in KNN. We can tune it using a validation set.
Restricting the number of parameters of a model (M here) is a
crude approach to control the complexity of the model.
A better solution: keep the number of parameters of the model
large, but enforce “simpler” solutions within the same space of
parameters.
This is done through regularization or penalization.
I Regularizer (or penalty): a function that quantifies how much we
prefer one hypothesis vs. another
Q: How?!

Intro ML (UofT) CSC311-Lec3 22 / 42

`2 (or L2 ) Regularization
We can encourage the weights to be small by choosing as our regularizer
the `2 (or L2 ) penalty.
1X 2
R(w) = 12 kwk22 = w .
2 j j
I Note: To be precise, we are regularizing the squared `2 norm.
The regularized cost function makes a tradeoff between fit to the data
and the norm of the weights:
λX 2
Jreg (w) = J (w) + λR(w) = J (w) + w .
2 j j

The basic idea is that “simpler” functions have smaller `2 -norm of their
weights w, and we prefer them to functions with larger `2 -norms.
If you fit training data poorly, J is large. If your optimal weights have
high values, R is large.
Large λ penalizes weight values more.
Here, λ is a hyperparameter that we can tune with a validation set.
Intro ML (UofT) CSC311-Lec3 23 / 42
`2 Regularized Least Squares: Ridge Regression

1
For the least squares problem, we have J (w) = 2N kXw − tk2 .
When λ > 0 (with regularization), regularized cost gives

1 λ
wλRidge = argmin Jreg (w) = argmin kXw − tk22 + kwk22
w w 2N 2
=(XT X + λN I)−1 XT t

The case λ = 0 (no regularization) reduces to least squares solution!

Q: What happens when λ → ∞?
Note that it is also common to formulate this problem as
argminw kXw − tk22 + λ2 kwk22 in which case the solution is
wλRidge = (XT X + λI)−1 XT t.

Intro ML (UofT) CSC311-Lec3 24 / 42

`1 vs. `2 Regularization

The `1 norm, or sum of absolute values, is another regularizer that encourages

weights to be exactly zero. (How can you tell?)
We can design regularizers based on whatever property we’d like to encourage.

— Bishop, Pattern Recognition and Machine Learning

Intro ML (UofT) CSC311-Lec3 25 / 42

Probabilistic Interpretation of the Squared Error
For the least squares: we minimize the sum of the squares of the errors
between the predictions for each data point x(i) and the corresponding
target values t(i) , i.e.,
n
X
minimize (w> x(i) + b − t(i) )2
(w,w0 )
i=1

t ≈ x> w + b, (w, b) ∈ RD × R
We measure the quality of the fit using the
squared error loss. Why?
Even though the squared error loss looks
natural, we did not really justify it.
We provide a probabilistic perspective here.
There are other justifications too; we get to
them in the next lecture.
Intro ML (UofT) CSC311-Lec3 26 / 42
Probabilistic Interpretation of the Squared Error

Suppose that our model arose from a

statistical model (b = 0 for simplicity):

y (i) = w> x(i) + (i)

where (i) ∼ N (0, σ 2 ) is independent

of anything else.
Thus, y (i) |x(i) ∼ p(y|x(i) , w) =
N (w> x(i) , σ 2 ).

Intro ML (UofT) CSC311-Lec3 27 / 42

Probabilistic Interpretation of the Squared Error:
Maximum Likelihood Estimation
Suppose that the input data {x(1) , x(2) , . . . , x(N ) } are given and
the outputs are independently drawn from
y (i) ∼ p(y|x(i) , w)
with an unknown parameter w. So the dataset is
D = {(x(1) , y (1) ), (x(2) , y (2) ), . . . , (x(N ) , y (N ) )}.
The likelihood function is Pr(D|w).
The maximum likelihood estimation (MLE) is the “principle” that
suggests we have to find a parameter ŵ that maximizes the
likelihood, i.e.,
ŵ ← argmax Pr(D|w).
w

Maximum likelihood estimation: after observing the data samples

(x(i) , y (i) ) for i = 1, 2, ..., N , we should choose w that maximizes the
likelihood.
Intro ML (UofT) CSC311-Lec3 28 / 42
Probabilistic Interpretation of the Squared Error:
Maximum Likelihood Estimation
For independent samples, the likelihood function of samples D is the product
of their likelihoods
N
Y
p(y (1) , y (2) , ..., y (n) |x(1) , x(2) , ..., x(N ) , w) = p(y (i) |x(i) , w) = L(w).
i=1

Product of N terms is not easy to minimize. Taking log reduces it to a sum!

Two objectives are equivalent since log is strictly increasing.
Maximizing the likelihood is equivalent to minimizing the negative
log-likelihood:
N
Y n
X
`(w) = − log L(w) = − log p(z (i) |w) = − log p(z (i) |w)
i=1 i=1

Maximum Likelihood Estimator (MLE)

After observing z (i) = (x(i) , y (i) ) for i = 1, ..., n i.i.d. samples from p(z|w), MLE is
n
X
wMLE = argmin l(w) = − log p(z (i) |w).
w
i=1
Intro ML (UofT) CSC311-Lec3 29 / 42
Probabilistic Interpretation of the Squared Error: From
MLE to Squared Error
Suppose that our model arose from a statistical model:
y (i) = w> x(i) + (i)
where (i) ∼ N (0, σ 2 ) is independent of anything else.
p(y (i) |x(i) , w) = √ 1 2 exp − 2σ1 2 (y (i) − w> x(i) )2

2πσ √
log p(y (i) |x(i) , w) = − 2σ1 2 (y (i) − w> x(i) )2 − log( 2πσ 2 )
The MLE solution is
N
X
w MLE
= argmin L(w) = 1
2σ 2
(y (i) − w> x(i) )2 + C.
w
i=1

As C and σ do not depend on w, they do not contribute to the

minimization.

wMLE = wLS when we work with Gaussian densities.

Intro ML (UofT) CSC311-Lec3 30 / 42
Gradient Descent

Intro ML (UofT) CSC311-Lec3 31 / 42

Gradient Descent

Now let’s see a second way to minimize the cost function which is
more broadly applicable: gradient descent.
Gradient descent is an iterative algorithm, which means we apply
an update repeatedly until some criterion is met.
We initialize the weights to something reasonable (e.g. all zeros)
and repeatedly adjust them in the direction of steepest descent.

Intro ML (UofT) CSC311-Lec3 32 / 42

Gradient Descent

Observe:
I if ∂J /∂wj > 0, then increasing wj increases J .
I if ∂J /∂wj < 0, then increasing wj decreases J .
The following update decreases the cost function:
∂J
wj ← wj − α
∂wj
N
α X (i) (i)
= wj − (y − t(i) ) xj
N
i=1

α is a learning rate. The larger it is, the faster w changes.

I We’ll see later how to tune the learning rate, but values are
typically small, e.g. 0.01 or 0.0001

Intro ML (UofT) CSC311-Lec3 33 / 42

Gradient Descent
This gets its name from the gradient:
 ∂J 
1 ∂w
∂J  .. 
= . 
∂w ∂J
∂wD

I This is the direction of fastest increase in J .

Update rule in vector form:
∂J
w ←w−α
∂w
N
α X (i)
=w− (y − t(i) ) x(i)
N
i=1

Hence, gradient descent updates the weights in the direction of

fastest decrease.
Observe that once it converges, we get a critical point, i.e. ∂J
∂w = 0.
Intro ML (UofT) CSC311-Lec3 34 / 42
Gradient Descent for Linear regression

Even for linear regression, where there is a direct solution, we

sometimes need to use GD.
Why gradient descent, if we can find the optimum directly?
I GD can be applied to a much broader set of models
I GD can be easier to implement than direct solutions
I For regression in high-dimensional spaces, GD is more efficient than
direct solution
I Linear regression solution: (XT X)−1 XT t
I matrix inversion is an O(D3 ) algorithm
I each GD update costs O(N D)
I Huge difference if D 1

Intro ML (UofT) CSC311-Lec3 35 / 42

Gradient Descent under the `2 Regularization

Recall the gradient descent update:

∂J
w ←w−α
∂w
The gradient descent update of the regularized cost J + λR has
an interesting interpretation as weight decay:

∂J ∂R
w ←w−α +λ
∂w ∂w

∂J
=w−α + λw
∂w
∂J
= (1 − αλ)w − α
∂w

Intro ML (UofT) CSC311-Lec3 36 / 42

Learning Rate (Step Size)

In gradient descent, the learning rate α is a hyperparameter we

need to tune. Here are some things that can go wrong:

α too small: α too large: α much too large:

slow progress oscillations instability
Good values are typically between 0.001 and 0.1. You should do a
grid search if you want good performance (i.e. try
0.1, 0.03, 0.01, . . .).

Intro ML (UofT) CSC311-Lec3 37 / 42

Training Curves

To diagnose optimization problems, it’s useful to look at training

curves: plot the training cost as a function of iteration.

Warning: it’s very hard to tell from the training curves whether an
optimizer has converged. They can reveal major problems, but
they can’t guarantee convergence.
Intro ML (UofT) CSC311-Lec3 38 / 42
Brief Matrix and Vector Calculus

For a function f : Rp → R, ∇f (z) denotes the gradient at z which

points in the direction of the greatest rate of increase.
∂
∇f (x) ∈ Rp is a vector with [∇f (x)]i = ∂xi f (x).
2
∇2 f (x) ∈ Rp×p is a matrix with [∇2 f (x)]ij = ∂x∂i ∂xj f (x)
At any minimum of a function f , we have ∇f (w) = 0,
∇2 f (w) 0.
Consider the problem minimize `(w) = 12 ky − Xwk22 ,
w
∇`(w) = X > (Xw − y) = 0 =⇒ ŵ = (X > X)−1 X > y (assuming
X > X is invertible)

At an arbitrary point x (old/new observation), our prediction is

y = ŵ> x.

Intro ML (UofT) CSC311-Lec3 39 / 42

Vectorization

Computing the prediction using a for loop:

For-loops in Python are slow, so we vectorize algorithms by

expressing them in terms of vectors and matrices.

w = (w1 , . . . , wD )T x = (x1 , . . . , xD )

y = wT x + b

This is simpler and much faster:

Intro ML (UofT) CSC311-Lec3 40 / 42

Vectorization

Why vectorize?
The equations, and the code, will be simpler and more readable.
Gets rid of dummy variables/indices!
Vectorized code is much faster
I Cut down on Python interpreter overhead
I Use highly optimized linear algebra libraries
I Matrix multiplication is very fast on a Graphics Processing Unit
(GPU)

Intro ML (UofT) CSC311-Lec3 41 / 42

Conclusion

Linear regression exemplifies recurring themes of this course:

choose a model and a loss function
formulate an optimization problem
solve the minimization problem using one of two strategies
I direct solution (set derivatives to zero)
I gradient descent (see appendix)
vectorize the algorithm, i.e. represent in terms of linear algebra
make a linear model more powerful using features
improve the generalization by adding a regularizer
Probabilistic Interpretation as MLE with Gaussian noise model

Intro ML (UofT) CSC311-Lec3 42 / 42

Lecture 0.2 - Linear Methods For Regression, Optimization
No ratings yet
Lecture 0.2 - Linear Methods For Regression, Optimization
53 pages
Lec06 Matt
No ratings yet
Lec06 Matt
60 pages
Lecture 1, Part 1: Linear Regression: Roger Grosse
No ratings yet
Lecture 1, Part 1: Linear Regression: Roger Grosse
9 pages
L02 Linear Regression
No ratings yet
L02 Linear Regression
9 pages
Linear Regression
No ratings yet
Linear Regression
9 pages
Lecture 5
No ratings yet
Lecture 5
18 pages
Da321m Part 3
No ratings yet
Da321m Part 3
48 pages
Intro To ML RevisionNotes
No ratings yet
Intro To ML RevisionNotes
24 pages
Lecture Slides - Linear Regression (2025)
No ratings yet
Lecture Slides - Linear Regression (2025)
45 pages
Linear Regression
No ratings yet
Linear Regression
29 pages
Week 2
No ratings yet
Week 2
43 pages
2EL1730 ML Lecture02 Linear and Logistic Regression
No ratings yet
2EL1730 ML Lecture02 Linear and Logistic Regression
65 pages
CS550 Lec2
No ratings yet
CS550 Lec2
24 pages
Week 4 Linear Regression
No ratings yet
Week 4 Linear Regression
38 pages
Lecture3 Upload
No ratings yet
Lecture3 Upload
28 pages
Lecture 2
No ratings yet
Lecture 2
66 pages
03 Linear Models
No ratings yet
03 Linear Models
46 pages
02 - Linear Models - A
No ratings yet
02 - Linear Models - A
23 pages
Sms Essay 2
No ratings yet
Sms Essay 2
6 pages
BITS F464 ML Lecture Notes
No ratings yet
BITS F464 ML Lecture Notes
86 pages
Group 30
No ratings yet
Group 30
33 pages
Lecture 0.3 - Linear Classifiers, Logistic Regression, Multiclass Classification
No ratings yet
Lecture 0.3 - Linear Classifiers, Logistic Regression, Multiclass Classification
48 pages
Day 1
No ratings yet
Day 1
41 pages
Lec4 Oct12 2022 PracticalNotes LinearRegression
No ratings yet
Lec4 Oct12 2022 PracticalNotes LinearRegression
34 pages
Abstract: y F X X X, X, X
No ratings yet
Abstract: y F X X X, X, X
10 pages
Linear Classifier: by Dr. Sanjeev Kumar Associate Professor Department of Mathematics IIT Roorkee, Roorkee-247 667, India
No ratings yet
Linear Classifier: by Dr. Sanjeev Kumar Associate Professor Department of Mathematics IIT Roorkee, Roorkee-247 667, India
86 pages
Linear Regression
No ratings yet
Linear Regression
91 pages
Lecture 3 - Regression
No ratings yet
Lecture 3 - Regression
47 pages
Cost Function
No ratings yet
Cost Function
17 pages
Cse 445 ML - 1
No ratings yet
Cse 445 ML - 1
28 pages
Linear Regression
No ratings yet
Linear Regression
62 pages
Machine Learning Lecture 1
No ratings yet
Machine Learning Lecture 1
5 pages
Notes5 Regression
No ratings yet
Notes5 Regression
14 pages
(Slide) Non Linear Regression
No ratings yet
(Slide) Non Linear Regression
39 pages
Linear Models and Learning Via Optimization: Piyush Rai Introduction To Machine Learning (CS771A)
No ratings yet
Linear Models and Learning Via Optimization: Piyush Rai Introduction To Machine Learning (CS771A)
26 pages
Linear Regression in Machine Learning
No ratings yet
Linear Regression in Machine Learning
38 pages
03 Linear Regression Intuition
No ratings yet
03 Linear Regression Intuition
23 pages
1.1 ID5059 1.2 Tom Kelsey - Jan 2021: February 15, 2021
No ratings yet
1.1 ID5059 1.2 Tom Kelsey - Jan 2021: February 15, 2021
43 pages
(MLP) Lecture Notes
No ratings yet
(MLP) Lecture Notes
22 pages
2-LR Optim
No ratings yet
2-LR Optim
60 pages
Intro to Machine Learning Basics
No ratings yet
Intro to Machine Learning Basics
61 pages
04 Linear
No ratings yet
04 Linear
31 pages
Linear - Regression - SGD
No ratings yet
Linear - Regression - SGD
71 pages
S1 - 25 (NSP) - ML - CS 34 - 10th17th Aug 2025
No ratings yet
S1 - 25 (NSP) - ML - CS 34 - 10th17th Aug 2025
89 pages
COL774 Practice Problems
No ratings yet
COL774 Practice Problems
22 pages
Regression Analysis
No ratings yet
Regression Analysis
54 pages
Lec9 - Linear Models
No ratings yet
Lec9 - Linear Models
44 pages
Hundred Page ML Book CH 3
No ratings yet
Hundred Page ML Book CH 3
16 pages
W2 Ecs7020p
No ratings yet
W2 Ecs7020p
54 pages
Linear Regression With One Variable
No ratings yet
Linear Regression With One Variable
48 pages
Linear Regression for Astronomers
No ratings yet
Linear Regression for Astronomers
24 pages
DS303: Introduction To Machine Learning: Manjesh K. Hanawal
No ratings yet
DS303: Introduction To Machine Learning: Manjesh K. Hanawal
17 pages
NN Theory
No ratings yet
NN Theory
138 pages
IML Summary
No ratings yet
IML Summary
12 pages
ML Cheatsheet PDF
100% (1)
ML Cheatsheet PDF
211 pages
Linear Regression
No ratings yet
Linear Regression
31 pages
ML Cheatsheet for Beginners
100% (1)
ML Cheatsheet for Beginners
211 pages
Linear Regression Lecture Notes
No ratings yet
Linear Regression Lecture Notes
34 pages
Linear Regression for Beginners
No ratings yet
Linear Regression for Beginners
36 pages
C-Band Radar Attenuation Correction
No ratings yet
C-Band Radar Attenuation Correction
6 pages
Monaco TPS Strategies Monaco Tips and Tricks
100% (2)
Monaco TPS Strategies Monaco Tips and Tricks
86 pages
Brajard Et Al 2021 Combining Data Assimilation and Machine Learning To Infer Unresolved Scale Parametrization
No ratings yet
Brajard Et Al 2021 Combining Data Assimilation and Machine Learning To Infer Unresolved Scale Parametrization
16 pages
Module 2
No ratings yet
Module 2
44 pages
Linaer Programming Noted
No ratings yet
Linaer Programming Noted
13 pages
Advanced Geoscience Remote Sensing
No ratings yet
Advanced Geoscience Remote Sensing
275 pages
The Maths That Drives Ai: Work / Technology & Tools
No ratings yet
The Maths That Drives Ai: Work / Technology & Tools
3 pages
Lecture 4 - CS50 - S Introduction To Artificial Intelligence With Python1
No ratings yet
Lecture 4 - CS50 - S Introduction To Artificial Intelligence With Python1
18 pages
Math 1010 Optimization Project
No ratings yet
Math 1010 Optimization Project
5 pages
Machine: Learning ATO Z - I
No ratings yet
Machine: Learning ATO Z - I
131 pages
07 CMOST Tutorial
100% (3)
07 CMOST Tutorial
37 pages
Chp13 LinearOptimization2
No ratings yet
Chp13 LinearOptimization2
45 pages
Variational Quantum Algorithm For Enhanced Continuous Variable Optical Phase Sensing
No ratings yet
Variational Quantum Algorithm For Enhanced Continuous Variable Optical Phase Sensing
14 pages
Man Sci 25 - 4.2-1
No ratings yet
Man Sci 25 - 4.2-1
27 pages
Design of Missile Control System Using Model Predictive Control
No ratings yet
Design of Missile Control System Using Model Predictive Control
7 pages
Elmachtoub-2022-Smart Predict, Then Optimize
No ratings yet
Elmachtoub-2022-Smart Predict, Then Optimize
19 pages
Empirical Risk Minimization Guide
No ratings yet
Empirical Risk Minimization Guide
6 pages
Optimization With COMSOL Multiphysics
No ratings yet
Optimization With COMSOL Multiphysics
55 pages
(Ind) 3. CHP 10 Horngren - Ca16 - PPT - 10
No ratings yet
(Ind) 3. CHP 10 Horngren - Ca16 - PPT - 10
47 pages
1 An Introduction To Linear Classifiers
No ratings yet
1 An Introduction To Linear Classifiers
9 pages
Optimal Replacement Time Estimation For Machines and Equipment Based On Cost Function
No ratings yet
Optimal Replacement Time Estimation For Machines and Equipment Based On Cost Function
4 pages
Introduction To Quality Engineering by Design
No ratings yet
Introduction To Quality Engineering by Design
26 pages
Handbook of Dynamic Game Theory
No ratings yet
Handbook of Dynamic Game Theory
1,288 pages
Calculaus
No ratings yet
Calculaus
8 pages
Chapter 10 Determining How Costs Behave
No ratings yet
Chapter 10 Determining How Costs Behave
133 pages
456 - Super Grain Corporation
No ratings yet
456 - Super Grain Corporation
5 pages
Preparing For The Future of Ai
No ratings yet
Preparing For The Future of Ai
58 pages
Mba 206
No ratings yet
Mba 206
240 pages
Module 2 Quiz - Correct
No ratings yet
Module 2 Quiz - Correct
4 pages
Chapter 10 - Determining How Costs Behave
100% (1)
Chapter 10 - Determining How Costs Behave
41 pages

Lec 03

Uploaded by

Lec 03

Uploaded by

CSC 311: Introduction to Machine Learning

Lecture 3 - Linear Regression &

Amir-massoud Farahmand & Emad A.M. Andrews

Intro ML (UofT) CSC311-Lec3 1 / 42

So far, we have talked about procedures for learning.

Intro ML (UofT) CSC311-Lec3 2 / 42

Recall that in supervised learning:

Model: In linear regression, we use linear functions of the inputs

Intro ML (UofT) CSC311-Lec3 4 / 42

2.0 Fitted line

Relation between the prediction y and inputs x is linear in both cases.

Intro ML (UofT) CSC311-Lec3 5 / 42

We have a dataset D = {(x(i) , t(i) ) for i = 1, 2, ..., N } where,

0.50.5 Find the “best” line (w, b).

Intro ML (UofT) CSC311-Lec3 6 / 42

The terminology is not universal. Some might call “loss” pointwise

We can organize all the training examples into a design matrix X

Computing the predictions for the whole dataset:

Intro ML (UofT) CSC311-Lec3 8 / 42

We defined a cost function. This is what we would like to

Intro ML (UofT) CSC311-Lec3 10 / 42

Partial derivatives: derivatives of a multivariate function with

Intro ML (UofT) CSC311-Lec3 11 / 42

Chain rule for derivatives:

Cost derivatives (average over data points):

Intro ML (UofT) CSC311-Lec3 12 / 42

The minimum must occur at a point where the partial derivatives

Linear regression is one of only a handful of models in this course

Intro ML (UofT) CSC311-Lec3 13 / 42

We can still use linear regression by mapping the input feature to

Fit the data using a degree-M polynomial function of the form:

Here the feature mapping is ψ(x) = [1, x, x2 , ...]> .

-Pattern Recognition and Machine Learning, Christopher Bishop.

Intro ML (UofT) CSC311-Lec3 16 / 42

-Pattern Recognition and Machine Learning, Christopher Bishop.

Intro ML (UofT) CSC311-Lec3 17 / 42

-Pattern Recognition and Machine Learning, Christopher Bishop.

Intro ML (UofT) CSC311-Lec3 18 / 42

-Pattern Recognition and Machine Learning, Christopher Bishop.

Intro ML (UofT) CSC311-Lec3 19 / 42

Good model (M=3): Achieves small test error (generalizes well).

Intro ML (UofT) CSC311-Lec3 20 / 42

As M increases, the magnitude of coefficients gets larger.

Intro ML (UofT) CSC311-Lec3 21 / 42

The degree of the polynomial M controls the complexity of the

Intro ML (UofT) CSC311-Lec3 22 / 42

The case λ = 0 (no regularization) reduces to least squares solution!

Intro ML (UofT) CSC311-Lec3 24 / 42

The `1 norm, or sum of absolute values, is another regularizer that encourages

— Bishop, Pattern Recognition and Machine Learning

Intro ML (UofT) CSC311-Lec3 25 / 42

Suppose that our model arose from a

y (i) = w> x(i) + (i)

where (i) ∼ N (0, σ 2 ) is independent

Intro ML (UofT) CSC311-Lec3 27 / 42

Maximum likelihood estimation: after observing the data samples

Product of N terms is not easy to minimize. Taking log reduces it to a sum!

Maximum Likelihood Estimator (MLE)

As C and σ do not depend on w, they do not contribute to the

wMLE = wLS when we work with Gaussian densities.

Intro ML (UofT) CSC311-Lec3 31 / 42

Intro ML (UofT) CSC311-Lec3 32 / 42

α is a learning rate. The larger it is, the faster w changes.

Intro ML (UofT) CSC311-Lec3 33 / 42

I This is the direction of fastest increase in J .

Hence, gradient descent updates the weights in the direction of

Even for linear regression, where there is a direct solution, we

Intro ML (UofT) CSC311-Lec3 35 / 42

Recall the gradient descent update:

Intro ML (UofT) CSC311-Lec3 36 / 42

In gradient descent, the learning rate α is a hyperparameter we

α too small: α too large: α much too large:

Intro ML (UofT) CSC311-Lec3 37 / 42

To diagnose optimization problems, it’s useful to look at training

For a function f : Rp → R, ∇f (z) denotes the gradient at z which

At an arbitrary point x (old/new observation), our prediction is

y (i) = w> x(i) + (i)

where (i) ∼ N (0, σ 2 ) is independent