CSC 311: Introduction to Machine Learning
Lecture 3 - Linear Regression &
Gradient Descent for Optimization
Amir-massoud Farahmand & Emad A.M. Andrews
University of Toronto
Intro ML (UofT) CSC311-Lec3 1 / 42
Modular Approach to ML Algorithm Design
So far, we have talked about procedures for learning.
I KNN and decision trees.
For the remainder of this course, we will take a more modular
approach:
I choose a model describing the relationships between variables of
interest
I define a loss function quantifying how bad the fit to the data is
I choose a regularizer saying how much we prefer different candidate
models (or explanations of data)
I fit the model that minimizes the loss function and satisfy the
constraint/penalty imposed by the regularizer, possibly using an
optimization algorithm
Mixing and matching these modular components gives us a lot of
new ML methods.
Intro ML (UofT) CSC311-Lec3 2 / 42
The Supervised Learning Setup
Recall that in supervised learning:
There is target t ∈ T (also called response, outcome, output, class)
There are features x ∈ X (also called inputs and covariates)
Objective is to learn a function f : X → T such that
t ≈ y = f (x)
based on some data D = {(x(i) , t(i) ) for i = 1, 2, ..., N }.
Intro ML (UofT) CSC311-Lec3 3 / 42
Linear Regression – Model
Model: In linear regression, we use linear functions of the inputs
x = (x1 , . . . , xD ) to make predictions y of the target value t:
X
y =f (x) = wj x j + b
j
I y is the prediction
I w is the weights
I b is the bias (or intercept) (do not confuse with the bias-variance
tradeoff in the next lecture)
w and b together are the parameters
We hope that our prediction is close to the target: y ≈ t.
Intro ML (UofT) CSC311-Lec3 4 / 42
What is Linear? 1 Feature vs. D Features
2.0 Fitted line
Data
1.5
1.0 If we have only 1 feature:
y: response
0.5
0.0
y = wx + b where w, x, b ∈ R.
0.5
1.0 y is linear in x.
2 1 0 1 2
x: features
If we have D features:
y = w> x + b where w, x ∈ RD ,
b∈R
y is linear in x.
Relation between the prediction y and inputs x is linear in both cases.
Intro ML (UofT) CSC311-Lec3 5 / 42
Linear Regression
We have a dataset D = {(x(i) , t(i) ) for i = 1, 2, ..., N } where,
(i) (i) (i)
x(i) = (x1 , x2 , ..., xD )> ∈ RD are the inputs, e.g., age, height.
t(i) ∈ R is the target or response (e.g. income),
predict t(i) with a linear function of x(i) :
2.02.0 Data
Fitted line
Data
1.51.5
1.01.0 t(i) ≈ y (i) = w> x(i) + b
y: response
y: response
0.50.5 Find the “best” line (w, b).
0.00.0 minimize N (i) (i)
P
i=1 L(y , t )
0.50.5 (w,b)
1.01.0
22 11 00 11 22
x:x:features
features
Intro ML (UofT) CSC311-Lec3 6 / 42
Linear Regression – Loss Function
How to quantify the quality of the fit to data?
A loss function L(y, t) defines how bad it is if, for some example x,
the algorithm predicts y, but the target is actually t.
Squared error loss function:
L(y, t) = 12 (y − t)2
y − t is the residual, and we want to make its magnitude small
The 21 factor is just to make the calculations convenient.
Cost function: loss function averaged over all training examples
N
1 X (i) 2
J (w, b) = y − t(i)
2N i=1
N
1 X > (i) 2
= w x + b − t(i)
2N i=1
The terminology is not universal. Some might call “loss” pointwise
loss and the “cost function” the empirical loss or average loss.
Intro ML (UofT) CSC311-Lec3 7 / 42
Vector Notation
We can organize all the training examples into a design matrix X
with one row per training example, and all the targets into the
target vector t.
Computing the predictions for the whole dataset:
T (1) (1)
w x +b y
.
.. .
Xw + b1 = = .. = y
wT x(N ) + b y (N )
Intro ML (UofT) CSC311-Lec3 8 / 42
Vectorization
Computing the squared error cost across the whole dataset:
y = Xw + b1
1
J = ky − tk2
2N
Note that sometimes we may use J = 21 ky − tk2 , without
normalizer. That would correspond to the sum of losses, and not
the average loss. The minimizer does not depend on N .
We can also add a column of 1s to the design matrix, combine the
bias and the weights, and conveniently write
b
1 [x(1) ]>
w1
(2) >
X = 1 [x ] ∈ RN ×D+1 and w = w ∈ RD+1
.. 2
1 . ..
.
Then, our predictions reduce to y = Xw.
Intro ML (UofT) CSC311-Lec3 9 / 42
Solving the Minimization Problem
We defined a cost function. This is what we would like to
minimize.
Recall from your calculus class: minimum of a smooth function (if
it exists) occurs at a critical point, i.e., point where the derivative
is zero.
Multivariate generalization: set the partial derivatives to zero (or
equivalently the gradient).
We would like to find a point where the gradient is (close to) zero.
How can we do it?
I Sometimes it is possible to directly find the parameters that make
the gradient zero in a closed-form. We call this the direct solution.
I We may also use optimization techniques that iteratively get us
closer to the solution. We will get back to this soon.
Intro ML (UofT) CSC311-Lec3 10 / 42
Direct Solution
Partial derivatives: derivatives of a multivariate function with
respect to one of its arguments.
∂ f (x1 + h, x2 ) − f (x1 , x2 )
f (x1 , x2 ) = lim
∂x1 h→0 h
To compute, take the single variable derivatives, pretending the
other arguments are constant.
Example: partial derivatives of the prediction y
∂y ∂ X
= wj 0 xj 0 + b
∂wj ∂wj 0 j
= xj
∂y ∂ X
= wj 0 xj 0 + b
∂b ∂b 0j
=1
Intro ML (UofT) CSC311-Lec3 11 / 42
Direct Solution
Chain rule for derivatives:
∂L dL ∂y
=
∂wj dy ∂wj
d 1
= (y − t)2 · xj
dy 2
= (y − t)xj
∂L
=y−t
∂b
Cost derivatives (average over data points):
N
∂J 1 X (i) (i)
= (y − t(i) ) xj
∂wj N i=1
N
∂J 1 X (i)
= y − t(i)
∂b N i=1
Intro ML (UofT) CSC311-Lec3 12 / 42
Direct Solution
The minimum must occur at a point where the partial derivatives
are zero, i.e.,
∂J ∂J
= 0 (∀j), = 0.
∂wj ∂b
If ∂J /∂wj 6= 0, you could reduce the cost by changing wj .
This turns out to give a system of linear equations, which we can
solve efficiently. Full derivation in the preliminaries.pdf.
Optimal weights:
wLS = (XT X)−1 XT t
Linear regression is one of only a handful of models in this course
that permit direct solution.
Intro ML (UofT) CSC311-Lec3 13 / 42
Feature Mapping (Basis Expansion)
The relation between the input and output may not be linear.
We can still use linear regression by mapping the input feature to
another space using feature mapping (or basis expansion)
ψ(x) : RD → Rd and treat the mapped feature (in Rd ) as the
input of a linear regression procedure.
Let us see how it works when x ∈ R and we use polynomial feature
mapping.
Intro ML (UofT) CSC311-Lec3 14 / 42
Polynomial Feature Mapping
Fit the data using a degree-M polynomial function of the form:
M
X
y = w0 + w1 x + w2 x2 + ... + wM xM = wi x i
i=0
Here the feature mapping is ψ(x) = [1, x, x2 , ...]> .
We can still use least squares to find w since y = ψ(x)> w is linear
in w0 , w1 , ....
In general, ψ can be any function. Another example: ψ =
[1, sin(2πx), cos(2πx), sin(4πx), cos(4πx), sin(6πx), cos(6πx), · · · ]> .
Intro ML (UofT) CSC311-Lec3 15 / 42
Polynomial Feature Mapping with M = 0
y = w0
1 M =0
t
−1
0 x 1
-Pattern Recognition and Machine Learning, Christopher Bishop.
Intro ML (UofT) CSC311-Lec3 16 / 42
Polynomial Feature Mapping with M = 1
y = w0 + w1 x
1 M =1
t
−1
0 x 1
-Pattern Recognition and Machine Learning, Christopher Bishop.
Intro ML (UofT) CSC311-Lec3 17 / 42
Polynomial Feature Mapping with M = 3
y = w0 + w1 x + w2 x 2 + w3 x 3
1 M =3
t
−1
0 x 1
-Pattern Recognition and Machine Learning, Christopher Bishop.
Intro ML (UofT) CSC311-Lec3 18 / 42
Polynomial Feature Mapping with M = 9
y = w0 + w1 x + w2 x2 + w3 x3 + . . . + w9 x9
1 M =9
t
−1
0 x 1
-Pattern Recognition and Machine Learning, Christopher Bishop.
Intro ML (UofT) CSC311-Lec3 19 / 42
Model Complexity and Generalization
Underfitting (M=0): model is too simple — does not fit the data.
Overfitting (M=9): model is too complex — fits perfectly.
1 M =0 1 M =9
t t
0 0
−1 −1
0 x 1 0 x 1
Good model (M=3): Achieves small test error (generalizes well).
1 M =3
t
−1
0 x 1
Intro ML (UofT) CSC311-Lec3 20 / 42
Model Complexity and Generalization
1 M =9
t
−1
0 x 1
As M increases, the magnitude of coefficients gets larger.
For M = 9, the coefficients have become finely tuned to the data.
Between data points, the function exhibits large oscillations.
Intro ML (UofT) CSC311-Lec3 21 / 42
Regularization for Controlling the Model Complexity
The degree of the polynomial M controls the complexity of the
model.
The value of M is a hyperparameter for polynomial expansion,
just like k in KNN. We can tune it using a validation set.
Restricting the number of parameters of a model (M here) is a
crude approach to control the complexity of the model.
A better solution: keep the number of parameters of the model
large, but enforce “simpler” solutions within the same space of
parameters.
This is done through regularization or penalization.
I Regularizer (or penalty): a function that quantifies how much we
prefer one hypothesis vs. another
Q: How?!
Intro ML (UofT) CSC311-Lec3 22 / 42
`2 (or L2 ) Regularization
We can encourage the weights to be small by choosing as our regularizer
the `2 (or L2 ) penalty.
1X 2
R(w) = 12 kwk22 = w .
2 j j
I Note: To be precise, we are regularizing the squared `2 norm.
The regularized cost function makes a tradeoff between fit to the data
and the norm of the weights:
λX 2
Jreg (w) = J (w) + λR(w) = J (w) + w .
2 j j
The basic idea is that “simpler” functions have smaller `2 -norm of their
weights w, and we prefer them to functions with larger `2 -norms.
If you fit training data poorly, J is large. If your optimal weights have
high values, R is large.
Large λ penalizes weight values more.
Here, λ is a hyperparameter that we can tune with a validation set.
Intro ML (UofT) CSC311-Lec3 23 / 42
`2 Regularized Least Squares: Ridge Regression
1
For the least squares problem, we have J (w) = 2N kXw − tk2 .
When λ > 0 (with regularization), regularized cost gives
1 λ
wλRidge = argmin Jreg (w) = argmin kXw − tk22 + kwk22
w w 2N 2
=(XT X + λN I)−1 XT t
The case λ = 0 (no regularization) reduces to least squares solution!
Q: What happens when λ → ∞?
Note that it is also common to formulate this problem as
argminw kXw − tk22 + λ2 kwk22 in which case the solution is
wλRidge = (XT X + λI)−1 XT t.
Intro ML (UofT) CSC311-Lec3 24 / 42
`1 vs. `2 Regularization
The `1 norm, or sum of absolute values, is another regularizer that encourages
weights to be exactly zero. (How can you tell?)
We can design regularizers based on whatever property we’d like to encourage.
— Bishop, Pattern Recognition and Machine Learning
Intro ML (UofT) CSC311-Lec3 25 / 42
Probabilistic Interpretation of the Squared Error
For the least squares: we minimize the sum of the squares of the errors
between the predictions for each data point x(i) and the corresponding
target values t(i) , i.e.,
n
X
minimize (w> x(i) + b − t(i) )2
(w,w0 )
i=1
t ≈ x> w + b, (w, b) ∈ RD × R
We measure the quality of the fit using the
squared error loss. Why?
Even though the squared error loss looks
natural, we did not really justify it.
We provide a probabilistic perspective here.
There are other justifications too; we get to
them in the next lecture.
Intro ML (UofT) CSC311-Lec3 26 / 42
Probabilistic Interpretation of the Squared Error
Suppose that our model arose from a
statistical model (b = 0 for simplicity):
y (i) = w> x(i) + (i)
where (i) ∼ N (0, σ 2 ) is independent
of anything else.
Thus, y (i) |x(i) ∼ p(y|x(i) , w) =
N (w> x(i) , σ 2 ).
Intro ML (UofT) CSC311-Lec3 27 / 42
Probabilistic Interpretation of the Squared Error:
Maximum Likelihood Estimation
Suppose that the input data {x(1) , x(2) , . . . , x(N ) } are given and
the outputs are independently drawn from
y (i) ∼ p(y|x(i) , w)
with an unknown parameter w. So the dataset is
D = {(x(1) , y (1) ), (x(2) , y (2) ), . . . , (x(N ) , y (N ) )}.
The likelihood function is Pr(D|w).
The maximum likelihood estimation (MLE) is the “principle” that
suggests we have to find a parameter ŵ that maximizes the
likelihood, i.e.,
ŵ ← argmax Pr(D|w).
w
Maximum likelihood estimation: after observing the data samples
(x(i) , y (i) ) for i = 1, 2, ..., N , we should choose w that maximizes the
likelihood.
Intro ML (UofT) CSC311-Lec3 28 / 42
Probabilistic Interpretation of the Squared Error:
Maximum Likelihood Estimation
For independent samples, the likelihood function of samples D is the product
of their likelihoods
N
Y
p(y (1) , y (2) , ..., y (n) |x(1) , x(2) , ..., x(N ) , w) = p(y (i) |x(i) , w) = L(w).
i=1
Product of N terms is not easy to minimize. Taking log reduces it to a sum!
Two objectives are equivalent since log is strictly increasing.
Maximizing the likelihood is equivalent to minimizing the negative
log-likelihood:
N
Y n
X
`(w) = − log L(w) = − log p(z (i) |w) = − log p(z (i) |w)
i=1 i=1
Maximum Likelihood Estimator (MLE)
After observing z (i) = (x(i) , y (i) ) for i = 1, ..., n i.i.d. samples from p(z|w), MLE is
n
X
wMLE = argmin l(w) = − log p(z (i) |w).
w
i=1
Intro ML (UofT) CSC311-Lec3 29 / 42
Probabilistic Interpretation of the Squared Error: From
MLE to Squared Error
Suppose that our model arose from a statistical model:
y (i) = w> x(i) + (i)
where (i) ∼ N (0, σ 2 ) is independent of anything else.
p(y (i) |x(i) , w) = √ 1 2 exp − 2σ1 2 (y (i) − w> x(i) )2
2πσ √
log p(y (i) |x(i) , w) = − 2σ1 2 (y (i) − w> x(i) )2 − log( 2πσ 2 )
The MLE solution is
N
X
w MLE
= argmin L(w) = 1
2σ 2
(y (i) − w> x(i) )2 + C.
w
i=1
As C and σ do not depend on w, they do not contribute to the
minimization.
wMLE = wLS when we work with Gaussian densities.
Intro ML (UofT) CSC311-Lec3 30 / 42
Gradient Descent
Intro ML (UofT) CSC311-Lec3 31 / 42
Gradient Descent
Now let’s see a second way to minimize the cost function which is
more broadly applicable: gradient descent.
Gradient descent is an iterative algorithm, which means we apply
an update repeatedly until some criterion is met.
We initialize the weights to something reasonable (e.g. all zeros)
and repeatedly adjust them in the direction of steepest descent.
Intro ML (UofT) CSC311-Lec3 32 / 42
Gradient Descent
Observe:
I if ∂J /∂wj > 0, then increasing wj increases J .
I if ∂J /∂wj < 0, then increasing wj decreases J .
The following update decreases the cost function:
∂J
wj ← wj − α
∂wj
N
α X (i) (i)
= wj − (y − t(i) ) xj
N
i=1
α is a learning rate. The larger it is, the faster w changes.
I We’ll see later how to tune the learning rate, but values are
typically small, e.g. 0.01 or 0.0001
Intro ML (UofT) CSC311-Lec3 33 / 42
Gradient Descent
This gets its name from the gradient:
∂J
1 ∂w
∂J ..
= .
∂w ∂J
∂wD
I This is the direction of fastest increase in J .
Update rule in vector form:
∂J
w ←w−α
∂w
N
α X (i)
=w− (y − t(i) ) x(i)
N
i=1
Hence, gradient descent updates the weights in the direction of
fastest decrease.
Observe that once it converges, we get a critical point, i.e. ∂J
∂w = 0.
Intro ML (UofT) CSC311-Lec3 34 / 42
Gradient Descent for Linear regression
Even for linear regression, where there is a direct solution, we
sometimes need to use GD.
Why gradient descent, if we can find the optimum directly?
I GD can be applied to a much broader set of models
I GD can be easier to implement than direct solutions
I For regression in high-dimensional spaces, GD is more efficient than
direct solution
I Linear regression solution: (XT X)−1 XT t
I matrix inversion is an O(D3 ) algorithm
I each GD update costs O(N D)
I Huge difference if D 1
Intro ML (UofT) CSC311-Lec3 35 / 42
Gradient Descent under the `2 Regularization
Recall the gradient descent update:
∂J
w ←w−α
∂w
The gradient descent update of the regularized cost J + λR has
an interesting interpretation as weight decay:
∂J ∂R
w ←w−α +λ
∂w ∂w
∂J
=w−α + λw
∂w
∂J
= (1 − αλ)w − α
∂w
Intro ML (UofT) CSC311-Lec3 36 / 42
Learning Rate (Step Size)
In gradient descent, the learning rate α is a hyperparameter we
need to tune. Here are some things that can go wrong:
α too small: α too large: α much too large:
slow progress oscillations instability
Good values are typically between 0.001 and 0.1. You should do a
grid search if you want good performance (i.e. try
0.1, 0.03, 0.01, . . .).
Intro ML (UofT) CSC311-Lec3 37 / 42
Training Curves
To diagnose optimization problems, it’s useful to look at training
curves: plot the training cost as a function of iteration.
Warning: it’s very hard to tell from the training curves whether an
optimizer has converged. They can reveal major problems, but
they can’t guarantee convergence.
Intro ML (UofT) CSC311-Lec3 38 / 42
Brief Matrix and Vector Calculus
For a function f : Rp → R, ∇f (z) denotes the gradient at z which
points in the direction of the greatest rate of increase.
∂
∇f (x) ∈ Rp is a vector with [∇f (x)]i = ∂xi f (x).
2
∇2 f (x) ∈ Rp×p is a matrix with [∇2 f (x)]ij = ∂x∂i ∂xj f (x)
At any minimum of a function f , we have ∇f (w) = 0,
∇2 f (w) 0.
Consider the problem minimize `(w) = 12 ky − Xwk22 ,
w
∇`(w) = X > (Xw − y) = 0 =⇒ ŵ = (X > X)−1 X > y (assuming
X > X is invertible)
At an arbitrary point x (old/new observation), our prediction is
y = ŵ> x.
Intro ML (UofT) CSC311-Lec3 39 / 42
Vectorization
Computing the prediction using a for loop:
For-loops in Python are slow, so we vectorize algorithms by
expressing them in terms of vectors and matrices.
w = (w1 , . . . , wD )T x = (x1 , . . . , xD )
y = wT x + b
This is simpler and much faster:
Intro ML (UofT) CSC311-Lec3 40 / 42
Vectorization
Why vectorize?
The equations, and the code, will be simpler and more readable.
Gets rid of dummy variables/indices!
Vectorized code is much faster
I Cut down on Python interpreter overhead
I Use highly optimized linear algebra libraries
I Matrix multiplication is very fast on a Graphics Processing Unit
(GPU)
Intro ML (UofT) CSC311-Lec3 41 / 42
Conclusion
Linear regression exemplifies recurring themes of this course:
choose a model and a loss function
formulate an optimization problem
solve the minimization problem using one of two strategies
I direct solution (set derivatives to zero)
I gradient descent (see appendix)
vectorize the algorithm, i.e. represent in terms of linear algebra
make a linear model more powerful using features
improve the generalization by adding a regularizer
Probabilistic Interpretation as MLE with Gaussian noise model
Intro ML (UofT) CSC311-Lec3 42 / 42