
CSC 311: Introduction to Machine Learning
Lecture 3 - Linear Regression & Gradient Descent for Optimization

Amir-massoud Farahmand & Emad A.M. Andrews
University of Toronto


Modular Approach to ML Algorithm Design

So far, we have talked about procedures for learning:
- KNN and decision trees.

For the remainder of this course, we will take a more modular approach:
- choose a model describing the relationships between the variables of interest
- define a loss function quantifying how bad the fit to the data is
- choose a regularizer saying how much we prefer different candidate models (or explanations of the data)
- fit the model that minimizes the loss function and satisfies the constraint/penalty imposed by the regularizer, possibly using an optimization algorithm

Mixing and matching these modular components gives us many new ML methods.


The Supervised Learning Setup

Recall that in supervised learning:
- There is a target t ∈ T (also called the response, outcome, output, or class).
- There are features x ∈ X (also called inputs or covariates).
- The objective is to learn a function f : X → T such that t ≈ y = f(x), based on some data D = {(x^(i), t^(i)) for i = 1, 2, ..., N}.
Linear Regression – Model

Model: In linear regression, we use linear functions of the inputs x = (x_1, ..., x_D) to make predictions y of the target value t:

    y = f(x) = Σ_j w_j x_j + b

- y is the prediction
- w is the vector of weights
- b is the bias (or intercept); do not confuse this with the bias-variance tradeoff in the next lecture

w and b together are the parameters. We hope that our prediction is close to the target: y ≈ t.
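As a minimal sketch of this prediction in NumPy (the variable names and values are illustrative, not from the slides):

    import numpy as np

    def predict(x, w, b):
        """Linear model: y = sum_j w_j * x_j + b."""
        return np.dot(w, x) + b

    x = np.array([2.0, -1.0, 0.5])   # D = 3 features
    w = np.array([0.3, 0.1, -0.2])   # weights
    b = 1.0                          # bias
    y = predict(x, w, b)             # 0.6 - 0.1 - 0.1 + 1.0 = 1.4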


What is Linear? 1 Feature vs. D Features

[Figure: scatter of data with a fitted line; x: features, y: response.]

If we have only 1 feature:

    y = wx + b, where w, x, b ∈ R.

y is linear in x.

If we have D features:

    y = w^T x + b, where w, x ∈ R^D, b ∈ R.

y is linear in x.

The relation between the prediction y and the inputs x is linear in both cases.


Linear Regression

We have a dataset D = {(x^(i), t^(i)) for i = 1, 2, ..., N} where:
- x^(i) = (x_1^(i), x_2^(i), ..., x_D^(i))^T ∈ R^D are the inputs, e.g., age, height.
- t^(i) ∈ R is the target or response (e.g., income).

We predict t^(i) with a linear function of x^(i):

    t^(i) ≈ y^(i) = w^T x^(i) + b

[Figure: data points with a fitted line; x: features, y: response.]

Find the "best" line (w, b):

    minimize_(w,b) Σ_{i=1}^N L(y^(i), t^(i))


Linear Regression – Loss Function

How do we quantify the quality of the fit to the data?

A loss function L(y, t) defines how bad it is if, for some example x, the algorithm predicts y, but the target is actually t.

Squared error loss function:

    L(y, t) = (1/2)(y − t)²

y − t is the residual, and we want to make its magnitude small. The 1/2 factor is just to make the calculations convenient.

Cost function: the loss function averaged over all training examples:

    J(w, b) = (1/2N) Σ_{i=1}^N (y^(i) − t^(i))²
            = (1/2N) Σ_{i=1}^N (w^T x^(i) + b − t^(i))²

The terminology is not universal: some call the "loss" the pointwise loss, and the "cost function" the empirical or average loss.
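A minimal NumPy sketch of this cost (assuming X is the N×D matrix of inputs with one row per example; names are ours, for illustration):

    import numpy as np

    def cost(w, b, X, t):
        """J(w, b) = (1/2N) * sum_i (w^T x^(i) + b - t^(i))^2."""
        y = X @ w + b                     # predictions for all N examples
        return np.mean((y - t) ** 2) / 2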
Vector Notation

We can organize all the training examples into a design matrix X with one row per training example, and all the targets into the target vector t.

Computing the predictions for the whole dataset:

    Xw + b1 = (w^T x^(1) + b, ..., w^T x^(N) + b)^T = (y^(1), ..., y^(N))^T = y


Vectorization

Computing the squared error cost across the whole dataset:

    y = Xw + b1
    J = (1/2N) ‖y − t‖²

Note that sometimes we may use J = (1/2)‖y − t‖², without the normalizer. That corresponds to the sum of losses rather than the average loss; the minimizer does not depend on N.

We can also add a column of 1s to the design matrix, combine the bias and the weights, and conveniently write

    X = [1, [x^(1)]^T; 1, [x^(2)]^T; ...] ∈ R^(N×(D+1))  and  w = (b, w_1, w_2, ...)^T ∈ R^(D+1)

Then our predictions reduce to y = Xw.
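One way to do this in NumPy (a sketch; the function name is made up for illustration):

    import numpy as np

    def add_bias_column(X):
        """Prepend a column of 1s so the bias b folds into the weight vector."""
        return np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)

    X = np.array([[2.0, 0.5],
                  [1.0, -1.0]])
    X_aug = add_bias_column(X)   # shape (2, 3); first column is all 1s
    # With w_aug = (b, w_1, w_2)^T, the predictions are simply y = X_aug @ w_aug.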
Solving the Minimization Problem

We defined a cost function. This is what we would like to minimize.

Recall from your calculus class: the minimum of a smooth function (if it exists) occurs at a critical point, i.e., a point where the derivative is zero. The multivariate generalization: set the partial derivatives to zero (or, equivalently, the gradient).

We would like to find a point where the gradient is (close to) zero. How can we do it?
- Sometimes it is possible to directly find the parameters that make the gradient zero in closed form. We call this the direct solution.
- We may also use optimization techniques that iteratively get us closer to the solution. We will get back to this soon.


Direct Solution

Partial derivatives: derivatives of a multivariate function with respect to one of its arguments:

    ∂f(x_1, x_2)/∂x_1 = lim_{h→0} [f(x_1 + h, x_2) − f(x_1, x_2)] / h

To compute, take the single-variable derivative, pretending the other arguments are constant.

Example: partial derivatives of the prediction y:

    ∂y/∂w_j = ∂/∂w_j (Σ_{j'} w_{j'} x_{j'} + b) = x_j
    ∂y/∂b = ∂/∂b (Σ_{j'} w_{j'} x_{j'} + b) = 1


Direct Solution

Chain rule for derivatives:

    ∂L/∂w_j = (dL/dy)(∂y/∂w_j) = d/dy [(1/2)(y − t)²] · x_j = (y − t) x_j
    ∂L/∂b = y − t

Cost derivatives (average over data points):

    ∂J/∂w_j = (1/N) Σ_{i=1}^N (y^(i) − t^(i)) x_j^(i)
    ∂J/∂b = (1/N) Σ_{i=1}^N (y^(i) − t^(i))
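In vectorized form these are ∂J/∂w = (1/N) X^T (y − t) and ∂J/∂b = mean(y − t). A quick finite-difference check of the analytic gradient, on made-up data (a sketch, not course code):

    import numpy as np

    def cost(w, b, X, t):
        return np.mean((X @ w + b - t) ** 2) / 2

    def grad(w, b, X, t):
        """Analytic gradients: dJ/dw = (1/N) X^T (y - t), dJ/db = mean(y - t)."""
        y = X @ w + b
        return X.T @ (y - t) / len(t), np.mean(y - t)

    rng = np.random.default_rng(0)
    X, t = rng.normal(size=(5, 3)), rng.normal(size=5)
    w, b, eps = rng.normal(size=3), 0.5, 1e-6
    dw, _ = grad(w, b, X, t)
    e0 = np.array([eps, 0.0, 0.0])                 # perturb only w_0
    approx = (cost(w + e0, b, X, t) - cost(w - e0, b, X, t)) / (2 * eps)
    print(np.isclose(dw[0], approx))               # True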


Direct Solution

The minimum must occur at a point where the partial derivatives are zero, i.e.,

    ∂J/∂w_j = 0 (∀j),  ∂J/∂b = 0.

If ∂J/∂w_j ≠ 0, you could reduce the cost by changing w_j.

This turns out to give a system of linear equations, which we can solve efficiently. (Full derivation in preliminaries.pdf.) Optimal weights:

    w_LS = (X^T X)^{−1} X^T t

Linear regression is one of only a handful of models in this course that permit a direct solution.
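A NumPy sketch of the direct solution (assuming the bias column has already been folded into X and that X^T X is invertible; solving the linear system is preferred over forming the inverse explicitly):

    import numpy as np

    def least_squares(X, t):
        """Solve the normal equations (X^T X) w = X^T t."""
        return np.linalg.solve(X.T @ X, X.T @ t)

    # np.linalg.lstsq(X, t, rcond=None) is a more robust alternative when
    # X^T X is ill-conditioned or singular.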


Feature Mapping (Basis Expansion)

The relation between the input and the output may not be linear. We can still use linear regression by mapping the input features to another space using a feature mapping (or basis expansion) ψ : R^D → R^d, and treating the mapped features (in R^d) as the input of a linear regression procedure.

Let us see how this works when x ∈ R and we use a polynomial feature mapping.
Polynomial Feature Mapping

Fit the data using a degree-M polynomial function of the form:

    y = w_0 + w_1 x + w_2 x² + ... + w_M x^M = Σ_{i=0}^M w_i x^i

Here the feature mapping is ψ(x) = [1, x, x², ...]^T.

We can still use least squares to find w, since y = ψ(x)^T w is linear in w_0, w_1, ....

In general, ψ can be any function. Another example:

    ψ(x) = [1, sin(2πx), cos(2πx), sin(4πx), cos(4πx), sin(6πx), cos(6πx), ...]^T
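A sketch of the polynomial mapping in NumPy (function name ours; np.vander with increasing=True builds the columns x^0, ..., x^M):

    import numpy as np

    def poly_features(x, M):
        """psi(x) = [1, x, x^2, ..., x^M] for each scalar input in x."""
        return np.vander(x, M + 1, increasing=True)

    x = np.linspace(0.0, 1.0, 10)
    Psi = poly_features(x, 3)     # design matrix of shape (10, 4)
    # Least squares then proceeds exactly as before, with Psi in place of X:
    # w = np.linalg.lstsq(Psi, t, rcond=None)[0]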
Polynomial Feature Mapping with M = 0

    y = w_0

[Figure: degree-0 fit; from Pattern Recognition and Machine Learning, Christopher Bishop.]

Polynomial Feature Mapping with M = 1

    y = w_0 + w_1 x

[Figure: degree-1 fit; from Pattern Recognition and Machine Learning, Christopher Bishop.]

Polynomial Feature Mapping with M = 3

    y = w_0 + w_1 x + w_2 x² + w_3 x³

[Figure: degree-3 fit; from Pattern Recognition and Machine Learning, Christopher Bishop.]

Polynomial Feature Mapping with M = 9

    y = w_0 + w_1 x + w_2 x² + w_3 x³ + ... + w_9 x⁹

[Figure: degree-9 fit; from Pattern Recognition and Machine Learning, Christopher Bishop.]


Model Complexity and Generalization

Underfitting (M = 0): the model is too simple and does not fit the data.
Overfitting (M = 9): the model is too complex and fits the training data perfectly, but generalizes poorly.

[Figure: the M = 0 and M = 9 fits side by side.]

Good model (M = 3): achieves small test error (generalizes well).

[Figure: the M = 3 fit.]


Model Complexity and Generalization

[Figure: the M = 9 fit, oscillating between data points.]

As M increases, the magnitude of the coefficients gets larger. For M = 9, the coefficients have become finely tuned to the data: between data points, the function exhibits large oscillations.


Regularization for Controlling the Model Complexity

The degree of the polynomial M controls the complexity of the model. The value of M is a hyperparameter for polynomial expansion, just like k in KNN. We can tune it using a validation set.

Restricting the number of parameters of a model (M here) is a crude approach to controlling the complexity of the model. A better solution: keep the number of parameters of the model large, but enforce "simpler" solutions within the same space of parameters. This is done through regularization or penalization.
- Regularizer (or penalty): a function that quantifies how much we prefer one hypothesis vs. another.

Q: How?!


ℓ2 (or L2) Regularization

We can encourage the weights to be small by choosing as our regularizer the ℓ2 (or L2) penalty:

    R(w) = (1/2)‖w‖₂² = (1/2) Σ_j w_j²

Note: to be precise, we are regularizing the squared ℓ2 norm.

The regularized cost function makes a tradeoff between the fit to the data and the norm of the weights:

    J_reg(w) = J(w) + λR(w) = J(w) + (λ/2) Σ_j w_j²

The basic idea is that "simpler" functions have a smaller ℓ2-norm of their weights w, and we prefer them to functions with larger ℓ2-norms. If you fit the training data poorly, J is large. If the optimal weights have high values, R is large. A large λ penalizes weight values more.

Here, λ is a hyperparameter that we can tune with a validation set.
ℓ2 Regularized Least Squares: Ridge Regression

For the least squares problem, we have J(w) = (1/2N)‖Xw − t‖². When λ > 0 (with regularization), the regularized cost gives

    w_λ^Ridge = argmin_w J_reg(w) = argmin_w (1/2N)‖Xw − t‖₂² + (λ/2)‖w‖₂²
              = (X^T X + λN I)^{−1} X^T t

The case λ = 0 (no regularization) reduces to the least squares solution!

Q: What happens when λ → ∞?

Note that it is also common to formulate this problem as argmin_w (1/2)‖Xw − t‖₂² + (λ/2)‖w‖₂², in which case the solution is w_λ^Ridge = (X^T X + λI)^{−1} X^T t.
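A NumPy sketch of the first formulation (for simplicity this regularizes every weight, including a folded-in bias, which in practice is often left unpenalized):

    import numpy as np

    def ridge(X, t, lam):
        """w = (X^T X + lambda * N * I)^{-1} X^T t, for the (1/2N)-normalized cost."""
        N, D = X.shape
        return np.linalg.solve(X.T @ X + lam * N * np.eye(D), X.T @ t)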


ℓ1 vs. ℓ2 Regularization

The ℓ1 norm, or sum of absolute values, is another regularizer that encourages weights to be exactly zero. (How can you tell?)

We can design regularizers based on whatever property we'd like to encourage.

[Figure: contours of the ℓ1 and ℓ2 penalties; from Bishop, Pattern Recognition and Machine Learning.]


Probabilistic Interpretation of the Squared Error

For least squares, we minimize the sum of the squares of the errors between the predictions for each data point x^(i) and the corresponding target values t^(i), i.e.,

    minimize_(w,b) Σ_{i=1}^N (w^T x^(i) + b − t^(i))²

    t ≈ x^T w + b,  (w, b) ∈ R^D × R

We measure the quality of the fit using the squared error loss. Why? Even though the squared error loss looks natural, we did not really justify it. We provide a probabilistic perspective here. There are other justifications too; we get to them in the next lecture.
Probabilistic Interpretation of the Squared Error

Suppose that our model arose from a statistical model (b = 0 for simplicity):

    y^(i) = w^T x^(i) + ε^(i)

where ε^(i) ∼ N(0, σ²) is independent of anything else. Thus,

    y^(i) | x^(i) ∼ p(y | x^(i), w) = N(w^T x^(i), σ²).


Probabilistic Interpretation of the Squared Error: Maximum Likelihood Estimation

Suppose that the input data {x^(1), x^(2), ..., x^(N)} are given and the outputs are independently drawn from

    y^(i) ∼ p(y | x^(i), w)

with an unknown parameter w. So the dataset is D = {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))}.

The likelihood function is Pr(D | w). Maximum likelihood estimation (MLE) is the "principle" that suggests we find a parameter ŵ that maximizes the likelihood, i.e.,

    ŵ ← argmax_w Pr(D | w).

In other words: after observing the data samples (x^(i), y^(i)) for i = 1, 2, ..., N, we should choose the w that maximizes the likelihood.
Probabilistic Interpretation of the Squared Error: Maximum Likelihood Estimation

For independent samples, the likelihood function of the samples D is the product of their likelihoods:

    L(w) = p(y^(1), y^(2), ..., y^(N) | x^(1), x^(2), ..., x^(N), w) = Π_{i=1}^N p(y^(i) | x^(i), w)

A product of N terms is not easy to optimize. Taking the log reduces it to a sum! The two objectives are equivalent since log is strictly increasing. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood:

    ℓ(w) = −log L(w) = −log Π_{i=1}^N p(z^(i) | w) = −Σ_{i=1}^N log p(z^(i) | w)

Maximum Likelihood Estimator (MLE): after observing z^(i) = (x^(i), y^(i)) for i = 1, ..., N i.i.d. samples from p(z | w), the MLE is

    w_MLE = argmin_w ℓ(w) = argmin_w −Σ_{i=1}^N log p(z^(i) | w).
Probabilistic Interpretation of the Squared Error: From MLE to Squared Error

Suppose that our model arose from a statistical model:

    y^(i) = w^T x^(i) + ε^(i)

where ε^(i) ∼ N(0, σ²) is independent of anything else. Then

    p(y^(i) | x^(i), w) = (1/√(2πσ²)) exp(−(1/2σ²)(y^(i) − w^T x^(i))²)
    log p(y^(i) | x^(i), w) = −(1/2σ²)(y^(i) − w^T x^(i))² − log √(2πσ²)

The MLE solution is

    w_MLE = argmin_w ℓ(w) = argmin_w (1/2σ²) Σ_{i=1}^N (y^(i) − w^T x^(i))² + C.

As C and σ do not depend on w, they do not contribute to the minimization. So w_MLE = w_LS when we work with Gaussian densities.
Gradient Descent



Gradient Descent

Now let’s see a second way to minimize the cost function which is
more broadly applicable: gradient descent.
Gradient descent is an iterative algorithm, which means we apply
an update repeatedly until some criterion is met.
We initialize the weights to something reasonable (e.g. all zeros)
and repeatedly adjust them in the direction of steepest descent.



Gradient Descent

Observe:
- if ∂J/∂w_j > 0, then increasing w_j increases J.
- if ∂J/∂w_j < 0, then increasing w_j decreases J.

The following update decreases the cost function:

    w_j ← w_j − α ∂J/∂w_j
        = w_j − (α/N) Σ_{i=1}^N (y^(i) − t^(i)) x_j^(i)

α is the learning rate. The larger it is, the faster w changes.
- We'll see later how to tune the learning rate, but values are typically small, e.g., 0.01 or 0.0001.


Gradient Descent

This gets its name from the gradient:

    ∂J/∂w = (∂J/∂w_1, ..., ∂J/∂w_D)^T

This is the direction of fastest increase in J.

Update rule in vector form:

    w ← w − α ∂J/∂w
      = w − (α/N) Σ_{i=1}^N (y^(i) − t^(i)) x^(i)

Hence, gradient descent updates the weights in the direction of fastest decrease. Observe that once it converges, we are at a critical point, i.e., ∂J/∂w = 0.
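A minimal NumPy sketch of this loop for linear regression (fixed iteration count for simplicity; a real implementation would use a convergence criterion):

    import numpy as np

    def gradient_descent(X, t, alpha=0.01, num_iters=1000):
        """Full-batch gradient descent on J(w, b) = (1/2N) sum_i (y_i - t_i)^2."""
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(num_iters):
            y = X @ w + b                     # predictions
            w -= alpha * X.T @ (y - t) / N    # w <- w - (alpha/N) sum_i (y_i - t_i) x^(i)
            b -= alpha * np.mean(y - t)       # analogous update for the bias
        return w, b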
Gradient Descent for Linear Regression

Even for linear regression, where there is a direct solution, we sometimes need to use GD. Why gradient descent, if we can find the optimum directly?
- GD can be applied to a much broader set of models.
- GD can be easier to implement than direct solutions.
- For regression in high-dimensional spaces, GD is more efficient than the direct solution:
  - the linear regression solution (X^T X)^{−1} X^T t requires a matrix inversion, an O(D³) algorithm;
  - each GD update costs O(ND);
  - this is a huge difference if D ≫ 1.


Gradient Descent under the ℓ2 Regularization

Recall the gradient descent update:

    w ← w − α ∂J/∂w

The gradient descent update of the regularized cost J + λR has an interesting interpretation as weight decay:

    w ← w − α (∂J/∂w + λ ∂R/∂w)
      = w − α (∂J/∂w + λw)
      = (1 − αλ)w − α ∂J/∂w
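As a one-step sketch (names ours): each iteration first shrinks the weights by the factor (1 − αλ), then applies the usual gradient step:

    import numpy as np

    def gd_step_weight_decay(w, grad_J, alpha, lam):
        """One regularized update: w <- (1 - alpha*lam) * w - alpha * dJ/dw."""
        return (1 - alpha * lam) * w - alpha * grad_J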


Learning Rate (Step Size)

In gradient descent, the learning rate α is a hyperparameter we need to tune. Here are some things that can go wrong:
- α too small: slow progress
- α too large: oscillations
- α much too large: instability

[Figure: GD trajectories under the three regimes.]

Good values are typically between 0.001 and 0.1. You should do a grid search if you want good performance (i.e., try 0.1, 0.03, 0.01, ...).
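A sketch of such a grid search on synthetic data (bias omitted for brevity; all names and data are illustrative):

    import numpy as np

    def final_cost(X, t, alpha, num_iters=500):
        """Run full-batch GD and return the final training cost."""
        w = np.zeros(X.shape[1])
        for _ in range(num_iters):
            w -= alpha * X.T @ (X @ w - t) / len(t)
        return np.mean((X @ w - t) ** 2) / 2

    rng = np.random.default_rng(0)
    X, t = rng.normal(size=(50, 3)), rng.normal(size=50)
    for alpha in [0.1, 0.03, 0.01, 0.003, 0.001]:
        print(f"alpha={alpha}: J={final_cost(X, t, alpha):.4f}")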


Training Curves

To diagnose optimization problems, it's useful to look at training curves: plot the training cost as a function of iteration.

Warning: it's very hard to tell from the training curves whether an optimizer has converged. They can reveal major problems, but they can't guarantee convergence.
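A sketch of producing such a curve with matplotlib (synthetic data, illustrative names):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    X, t = rng.normal(size=(50, 3)), rng.normal(size=50)
    w, alpha, costs = np.zeros(3), 0.05, []
    for _ in range(200):
        y = X @ w
        costs.append(np.mean((y - t) ** 2) / 2)   # record training cost each iteration
        w -= alpha * X.T @ (y - t) / len(t)

    plt.plot(costs)
    plt.xlabel("iteration")
    plt.ylabel("training cost J")
    plt.show()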
Brief Matrix and Vector Calculus

For a function f : R^p → R, ∇f(z) denotes the gradient at z, which points in the direction of the greatest rate of increase.
- ∇f(x) ∈ R^p is a vector with [∇f(x)]_i = ∂f(x)/∂x_i.
- ∇²f(x) ∈ R^(p×p) is a matrix with [∇²f(x)]_ij = ∂²f(x)/(∂x_i ∂x_j).

At any minimum of a function f, we have ∇f(w) = 0 and ∇²f(w) ⪰ 0.

Consider the problem minimize_w ℓ(w) = (1/2)‖y − Xw‖₂²:

    ∇ℓ(w) = X^T(Xw − y) = 0  ⟹  ŵ = (X^T X)^{−1} X^T y

(assuming X^T X is invertible). At an arbitrary point x (old/new observation), our prediction is y = ŵ^T x.


Vectorization

For-loops in Python are slow, so we vectorize algorithms by expressing them in terms of vectors and matrices:

    w = (w_1, ..., w_D)^T,  x = (x_1, ..., x_D)^T
    y = w^T x + b

The vectorized form is simpler and much faster than computing the prediction with a for loop over the features; see the sketch below.
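A sketch of both versions (the original slides showed similar code as images; this is our reconstruction, not the course's exact code):

    import numpy as np

    D = 1000
    rng = np.random.default_rng(0)
    w, x, b = rng.normal(size=D), rng.normal(size=D), 1.0

    # For-loop version: one Python-level iteration per feature (slow).
    y_loop = b
    for j in range(D):
        y_loop += w[j] * x[j]

    # Vectorized version: one call into optimized linear algebra (fast).
    y_vec = w @ x + b

    print(np.isclose(y_loop, y_vec))   # True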


Vectorization

Why vectorize?
- The equations, and the code, will be simpler and more readable. It gets rid of dummy variables/indices!
- Vectorized code is much faster:
  - it cuts down on Python interpreter overhead;
  - it uses highly optimized linear algebra libraries;
  - matrix multiplication is very fast on a Graphics Processing Unit (GPU).


Conclusion

Linear regression exemplifies recurring themes of this course:
- choose a model and a loss function
- formulate an optimization problem
- solve the minimization problem using one of two strategies:
  - direct solution (set derivatives to zero)
  - gradient descent (see appendix)
- vectorize the algorithm, i.e., represent it in terms of linear algebra
- make a linear model more powerful using feature mappings
- improve generalization by adding a regularizer
- probabilistic interpretation as MLE with a Gaussian noise model
