Chapter 3 of the document discusses linear models for regression, focusing on linear basis function models, maximum likelihood estimation, and regularization techniques. It covers the bias-variance trade-off, Bayesian linear regression, and model comparison, emphasizing the importance of choosing the right model and understanding the effective number of parameters. The chapter also introduces predictive distributions and equivalent kernels, providing insights into Gaussian processes and Bayesian model selection.

Machine Learning

Chapter 3: Linear Models for Regression

孫民 (Min Sun)
Department of Electrical Engineering, National Tsing Hua University
2 Linear Basis Function Models (1)- Recall
Example: Polynomial Curve Fitting

3 Linear Basis Function Models (2) - Basis Function

Generally,

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j\,\phi_j(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}),$$

where the φ_j(x) are known as basis functions (they can be nonlinear w.r.t. x).
Typically φ_0(x) = 1, so that w_0 acts as a bias.
In the simplest case we use linear basis functions, φ_d(x) = x_d, giving

y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + … + w_D x_D (the linear regression model),

which is linear with respect to both x and w.
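As an illustration (not from the slides), here is a minimal NumPy sketch of evaluating y(x, w) = w^T φ(x) with the identity basis plus a bias entry; the input and weight values are made up.

```python
import numpy as np

# Minimal sketch: y(x, w) = w0 + w1*x1 + ... + wD*xD written as a dot product
# with the design vector phi(x) = [1, x1, ..., xD] (phi_0(x) = 1 is the bias term).
def predict_linear(x, w):
    phi = np.concatenate(([1.0], x))   # identity basis functions plus bias entry
    return w @ phi

x = np.array([0.5, -1.2, 2.0])         # a D = 3 input (hypothetical values)
w = np.array([0.1, 2.0, -0.5, 1.0])    # w0, w1, w2, w3
print(predict_linear(x, w))            # 0.1 + 2.0*0.5 - 0.5*(-1.2) + 1.0*2.0 = 3.7
```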
4 Linear Basis Function Models (3)

Polynomial basis functions: φ_j(x) = x^j.

These are global: a small change in x affects all basis functions.
5 Linear Basis Function Models (4)

Gaussian basis functions: φ_j(x) = exp(−(x − μ_j)² / (2s²)).

Note that the basis functions do not need to be normalized. μ_j and s control location and scale (width). These are local: a small change in x only affects basis functions with nearby μ_j.
6 Linear Basis Function Models (5)
Sigmoidal basis functions: φ_j(x) = σ((x − μ_j)/s), where σ(a) = 1/(1 + exp(−a)).

μ_j and s control location and scale (slope).
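The three basis families above can be written in a few lines of NumPy; this is a sketch, and the centres, width, and number of basis functions are arbitrary choices rather than values from the slides.

```python
import numpy as np

def poly_basis(x, M):
    """Global polynomial basis: phi_j(x) = x**j, j = 0..M-1."""
    return np.array([x**j for j in range(M)])

def gaussian_basis(x, mus, s):
    """Local Gaussian basis centred at mu_j with width s."""
    return np.exp(-(x - mus)**2 / (2 * s**2))

def sigmoid_basis(x, mus, s):
    """Sigmoidal basis: phi_j(x) = sigma((x - mu_j) / s)."""
    return 1.0 / (1.0 + np.exp(-(x - mus) / s))

mus = np.linspace(-1, 1, 9)        # 9 centres (arbitrary)
print(poly_basis(0.3, 4))
print(gaussian_basis(0.3, mus, 0.2)[:3])
print(sigmoid_basis(0.3, mus, 0.2)[:3])
```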
7 Maximum Likelihood and Least Squares (1)

Assume observations come from a deterministic function with added Gaussian noise:

$$t = y(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1}),$$

which is the same as saying

$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}).$$

Given observed inputs X = {x_1, …, x_N} and targets t = (t_1, …, t_N)^T, we obtain the likelihood function

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\bigr).$$
8 Maximum Likelihood and Least Squares (2)

Taking the logarithm, we get

$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}),$$

where

$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2$$

is the sum-of-squares error.
9 Maximum Likelihood and Least Squares (3)

Computing the gradient and setting it to zero yields

$$\sum_{n=1}^{N} t_n\,\boldsymbol{\phi}(\mathbf{x}_n)^T - \mathbf{w}^T\left(\sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\right) = 0$$

$$\boldsymbol{\Phi}^T\mathbf{t} - (\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{w} = 0,$$

where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$.
10 Maximum Likelihood and Least Squares (4)

$$\boldsymbol{\Phi}^T\mathbf{t} - (\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{w} = 0
\quad\Longrightarrow\quad
(\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{w} = \boldsymbol{\Phi}^T\mathbf{t}$$

Solving for w, we get

$$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t} = \boldsymbol{\Phi}^{\dagger}\mathbf{t},$$

where $\boldsymbol{\Phi}^{\dagger} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T$ is the Moore-Penrose pseudo-inverse of $\boldsymbol{\Phi}$.
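A minimal NumPy sketch of the closed-form solution above, fitting a polynomial basis to noisy sinusoidal data; the data-generation details are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)   # noisy sinusoid (illustrative)

Phi = np.vander(x, 4, increasing=True)               # design matrix, phi_j(x) = x**j, M = 4
w_ml = np.linalg.pinv(Phi) @ t                       # w_ML = (pseudo-inverse of Phi) @ t
# Equivalently: np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w_ml)
```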
12 Geometry of Least Squares
Consider t as a vector in an N-dimensional space, and the M columns of Φ (each basis function evaluated at all data points) as vectors spanning an M-dimensional subspace S.
w_ML minimizes the distance between t and its orthogonal projection onto S, i.e. y = Φ w_ML.
13 Sequential Learning

Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent (SGD):

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\,\bigl(t_n - \mathbf{w}^{(\tau)T}\boldsymbol{\phi}(\mathbf{x}_n)\bigr)\boldsymbol{\phi}(\mathbf{x}_n)$$

This is known as the least-mean-squares (LMS) algorithm.
Issue: how to choose the learning rate η?
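A sketch of the LMS update in NumPy; the learning rate, epoch count, and plain sweeps over the data are assumptions, not prescriptions from the slides.

```python
import numpy as np

def lms(Phi, t, eta=0.05, epochs=20):
    """Sequential least-mean-squares: w <- w + eta * (t_n - w.phi_n) * phi_n."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_n, t_n in zip(Phi, t):
            w += eta * (t_n - w @ phi_n) * phi_n
    return w
```

If η is too large the updates can diverge; if it is too small, convergence is slow.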
14 Regularized Least Squares (1)

Consider the error function:

E_D(w) + λ E_W(w)   (data term + regularization term).

With the sum-of-squares error function and a quadratic regularizer, we get

$$\frac{1}{2}\sum_{n=1}^{N}\bigl(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w},$$

where λ is called the (weight-decay) regularization coefficient. Setting the gradient to zero,

$$-\boldsymbol{\Phi}^T\mathbf{t} + (\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{w} + \lambda\mathbf{w} = -\boldsymbol{\Phi}^T\mathbf{t} + (\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{w} = 0,$$

which is minimized by

$$\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}.$$
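The regularized solution is a one-line change to the normal equations; a sketch:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Weight-decay (quadratic) regularized least squares: (lam*I + Phi^T Phi) w = Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```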
15 Regularized Least Squares (2)

With a more general regularizer, we have

$$\frac{1}{2}\sum_{n=1}^{N}\bigl(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q.$$

q = 1 gives the lasso (sparsity); q = 2 gives the quadratic regularizer.
16 Regularized Least Squares (3)
The lasso tends to generate sparser solutions than a quadratic regularizer.
Equivalent constrained view: minimize $\frac{1}{2}\sum_{n=1}^{N}\bigl(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2$ subject to $\sum_{j=1}^{M}|w_j|^q \le \eta$.
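To see the sparsity effect numerically, a small sketch assuming scikit-learn is available; the data, sparse ground-truth weights, and regularization strengths are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge   # assumes scikit-learn is installed

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 10))
w_true = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0, 0, 0])    # sparse ground truth (made up)
t = Phi @ w_true + rng.normal(0, 0.1, 50)

print(Lasso(alpha=0.1).fit(Phi, t).coef_)   # several coefficients driven exactly to zero
print(Ridge(alpha=0.1).fit(Phi, t).coef_)   # coefficients shrunk but generally nonzero
```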
17 Multiple Outputs (1)

Analogously to the single-output case we have

$$\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}), \qquad p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}\bigl(\mathbf{t} \mid \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\mathbf{I}\bigr).$$

Given observed inputs X = {x_1, …, x_N} and targets T = (t_1, …, t_N)^T, we obtain the log-likelihood function

$$\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \frac{NK}{2}\ln\!\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}\lVert\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\rVert^2.$$
18 Multiple Outputs (2)

Maximizing with respect to W, we obtain

$$\mathbf{W}_{ML} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{T}.$$

If we consider a single target variable, t_k, we see that

$$\mathbf{w}_k = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}_k = \boldsymbol{\Phi}^{\dagger}\mathbf{t}_k,$$

where $\mathbf{t}_k = (t_{1k}, \ldots, t_{Nk})^T$, which is identical to the single-output case.
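A one-line sketch of the multi-output solution, where T is the N × K target matrix:

```python
import numpy as np

def fit_multi(Phi, T):
    """W_ML = (Phi^T Phi)^{-1} Phi^T T = Phi^+ T; each column solves a single-output problem."""
    return np.linalg.pinv(Phi) @ T
```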
19 The Bias-Variance Decomposition (1)
Recall the expected squared loss (Ch. 1, p. 54),

$$\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2 p(\mathbf{x})\,d\mathbf{x} + \int\!\!\int \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\,d\mathbf{x}\,dt,$$

where $h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t\,p(t \mid \mathbf{x})\,dt$.

The second term of E[L] corresponds to the noise inherent in the random variable t.
What about the first term?
20 The Bias-Variance Decomposition (2)

Suppose we were given multiple data sets, each of size N. Any particular data set, D, will give a particular function y(x; D). We then have
21 The Bias-Variance Decomposition (3)

Taking the expectation over D yields

$$\mathbb{E}_D\bigl[\{y(\mathbf{x};D) - h(\mathbf{x})\}^2\bigr] = \underbrace{\{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\bigl[\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2\bigr]}_{\text{variance}}$$
22 The Bias-Variance Decomposition (4)

Thus we can write

expected loss = (bias)² + variance + noise,

where

$$(\text{bias})^2 = \int \{\mathbb{E}_D[y(\mathbf{x};D)] - h(\mathbf{x})\}^2 p(\mathbf{x})\,d\mathbf{x}$$
$$\text{variance} = \int \mathbb{E}_D\bigl[\{y(\mathbf{x};D) - \mathbb{E}_D[y(\mathbf{x};D)]\}^2\bigr] p(\mathbf{x})\,d\mathbf{x}$$
$$\text{noise} = \int\!\!\int \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\,d\mathbf{x}\,dt$$
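A Monte-Carlo sketch of the decomposition, in the spirit of the example on the following slides; the polynomial basis (standing in for the 24 Gaussian basis functions), the noise level, and the λ values are assumptions, not the book's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
L, N, noise_sd = 100, 25, 0.3                         # 100 data sets of 25 points each
x_grid = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_grid)                        # true regression function

def bias_variance(lam):
    preds = []
    for _ in range(L):
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, N)
        Phi = np.vander(x, 10, increasing=True)       # simple polynomial basis, M = 10
        w = np.linalg.solve(lam * np.eye(10) + Phi.T @ Phi, Phi.T @ t)
        preds.append(np.vander(x_grid, 10, increasing=True) @ w)
    preds = np.array(preds)
    y_bar = preds.mean(axis=0)                        # average prediction over data sets
    bias2 = np.mean((y_bar - h) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for lam in (1e-6, 1e-2, 10.0):                        # under-, mid-, over-regularized
    print(lam, bias_variance(lam))
```

For this setup, the printed bias² should rise and the variance fall as λ grows, illustrating the trade-off.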
23 The Bias-Variance Decomposition (5)
Example: 100 data sets, each with 25 samples, from the sinusoidal, varying the degree of regularization (λ) for fitting 24 Gaussian basis functions.

[Figure: variance panel (left) and bias panel (right).]
24 The Bias-Variance Decomposition (6)
Example: 100 data sets, each with 25 samples, from the sinusoidal, varying the degree of regularization (λ) for fitting 24 Gaussian basis functions.

[Figure: variance panel (left) and bias panel (right).]
25 The Bias-Variance Decomposition (7)
Example: 100 data sets, each with 25 samples, from the sinusoidal, varying the degree of regularization (λ) for fitting 24 Gaussian basis functions.

[Figure: variance panel (left) and bias panel (right).]
26 The Bias-Variance Trade-off
From these plots, we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance.

[Figure: bias and variance versus model complexity (high to low).]
27 Bias/Variance Dilemma (1)

• Bias: the difference between the expected prediction value and the true value
• Variance: the variation of the estimated values
• Given a training set D, as we increase model complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data)
• High bias usually goes with low variance, and vice versa
• Bias/variance dilemma (Geman et al., 1992)
28 Bias/Variance Dilemma (2)

29 Bias/Variance Dilemma (3)

30 Bias/Variance Dilemma (4)

https://www.cs.cornell.edu/courses/cs578/2005fa/CS578.bagging.boosting.lecture.pdf
31 Bias/Variance Dilemma (5)
Error = noise² + bias² + variance. More data helps.

[Figure: test error versus model complexity for few vs. many training examples; the x-axis runs from high bias/low variance to low bias/high variance.]
32 Bias/Variance Dilemma (6)
• Need validation set
• Validation set is separate from the test set

[Figure: training error and test error versus model complexity; the x-axis runs from high bias/low variance to low bias/high variance.]
33 Bias/Variance Dilemma (7)
Fixed classifier

[Figure: training and testing error versus the number of training examples; generalization error indicated.]
34 Bayesian Linear Regression (1)
Given the likelihood $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$, define a conjugate prior over w (assuming β is known):

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).$$

Combining this with the likelihood function and using the results for marginal and conditional Gaussian distributions gives the posterior

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N),$$

where

$$\mathbf{m}_N = \mathbf{S}_N\bigl(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t}\bigr), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}.$$
35 Bayesian Linear Regression (2)

A common choice for the prior is

$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}),$$

for which

$$\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}.$$

Next we consider an example …
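A sketch of computing the posterior m_N and S_N for the zero-mean isotropic prior; α and β would be fixed or estimated elsewhere.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) with prior N(w | 0, alpha^{-1} I) and noise precision beta."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```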
36 Bayesian Linear Regression (3)

0 data points observed; model y(x, w) = w_0 + w_1 x.

[Figure panels: prior over (w_0, w_1) and samples of y(x, w) in data space.]
37 Bayesian Linear Regression (4)

1 data point observed; model y(x, w) = w_0 + w_1 x. The annotation w_0 = y − w_1 x marks the line in weight space consistent with a single observed point (x, y).

[Figure panels: likelihood of the point, updated posterior, and samples in data space.]
38 Bayesian Linear Regression (5)

2 data points observed; model y(x, w) = w_0 + w_1 x (w_0 = y − w_1 x).

[Figure panels: likelihood of the 2nd point, updated posterior, and samples in data space.]
39 Bayesian Linear Regression (6)

20 data points observed; model y(x, w) = w_0 + w_1 x (w_0 = y − w_1 x).

[Figure panels: likelihood of the 20th point, updated posterior, and samples in data space.]
40 Predictive Distribution (1)

Predict t for new values of x by integrating over w:

$$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\,p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\,d\mathbf{w} = \mathcal{N}\bigl(t \mid \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}),\ \sigma_N^2(\mathbf{x})\bigr),$$

where

$$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}).$$
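Continuing the sketch, the predictive mean and variance at a new input follow directly from m_N and S_N; posterior() refers to the hypothetical helper above.

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive Gaussian at one input: mean m_N.phi(x), variance 1/beta + phi(x)^T S_N phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
```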
41 Predictive Distribution (2)
Example: sinusoidal data, 9 Gaussian basis functions, 1 data point.

[Figure: samples from the posterior distribution.]


42 Predictive Distribution (3)
Example: sinusoidal data, 9 Gaussian basis functions, 2 data points.

[Figure: samples from the posterior distribution.]


43 Predictive Distribution (4)
Example: sinusoidal data, 9 Gaussian basis functions, 4 data points.

[Figure: samples from the posterior distribution.]


44 Predictive Distribution (5)
Example: sinusoidal data, 9 Gaussian basis functions, 25 data points.

[Figure: samples from the posterior distribution.]


45 Equivalent Kernel (1)

The predictive mean can be written

$$y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}) = \sum_{n=1}^{N} \beta\,\boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_n)\,t_n = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\,t_n,$$

where $k(\mathbf{x}, \mathbf{x}') = \beta\,\boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}')$ is known as the equivalent kernel or smoother matrix.

This is a weighted sum of the training-data target values t_n.
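A sketch of the equivalent-kernel weights; it reuses the hypothetical posterior() helper above and only illustrates that the predictive mean is a weighted sum of the targets.

```python
import numpy as np

def equivalent_kernel(phi_x, Phi, S_N, beta):
    """k(x, x_n) = beta * phi(x)^T S_N phi(x_n), one weight per training point."""
    return beta * (phi_x @ S_N @ Phi.T)

# Sanity check: equivalent_kernel(phi_x, Phi, S_N, beta) @ t should equal m_N @ phi_x.
```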
46 Equivalent Kernel (2) – Gaussian Basis

The weight of t_n depends on the distance between x and x_n; nearby x_n carry more weight (localization property).
47 Equivalent Kernel (3)

Non-local basis functions have local equivalent kernels:

[Figure: equivalent kernels for polynomial (left) and sigmoidal (right) basis functions.]
48 Equivalent Kernel (4) – properties

The kernel as a covariance function: consider

$$\mathrm{cov}[y(\mathbf{x}), y(\mathbf{x}')] = \mathrm{cov}\bigl[\boldsymbol{\phi}(\mathbf{x})^T\mathbf{w},\ \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}')\bigr] = \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}') = \beta^{-1}k(\mathbf{x}, \mathbf{x}').$$

We can avoid the use of basis functions and define the kernel function directly, leading to Gaussian processes (Chapter 6).
49 Equivalent Kernel (5) – properties

The equivalent kernel sums to one,

$$\sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) = 1,$$

for all values of x; however, the equivalent kernel may be negative for some values of x.
Like all kernel functions, the equivalent kernel can be expressed as an inner product:

$$k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\psi}(\mathbf{x})^T\boldsymbol{\psi}(\mathbf{z}),$$

where $\boldsymbol{\psi}(\mathbf{x}) = \beta^{1/2}\mathbf{S}_N^{1/2}\boldsymbol{\phi}(\mathbf{x})$.
50 Bayesian Model Comparison (1)

How do we choose the 'right' model?

Assume we want to compare models M_i, i = 1, …, L, using data D; this requires computing the posterior

$$p(\mathcal{M}_i \mid D) \propto p(\mathcal{M}_i)\,p(D \mid \mathcal{M}_i),$$

i.e. the prior times the model evidence (also called the marginal likelihood).

Bayes factor: the ratio of evidences for two models, $p(D \mid \mathcal{M}_i)/p(D \mid \mathcal{M}_j)$ (important when the priors are the same for all models).
51 Bayesian Model Comparison (2)

Having computed p(M_i | D), we can compute the predictive (mixture) distribution

$$p(t \mid \mathbf{x}, D) = \sum_{i=1}^{L} p(t \mid \mathbf{x}, \mathcal{M}_i, D)\,p(\mathcal{M}_i \mid D).$$

This combines the strengths of the models, e.g. when different models are accurate in different regions of x.
A simpler approximation, known as model selection, is to use the single model with the highest evidence.
52 Bayesian Model Comparison (3)

For a model with parameters w, we get the model evidence by marginalizing over w:

$$p(D \mid \mathcal{M}_i) = \int p(D \mid \mathbf{w}, \mathcal{M}_i)\,p(\mathbf{w} \mid \mathcal{M}_i)\,d\mathbf{w}.$$

Note that the evidence is exactly the normalizing constant in Bayes' theorem for the parameter posterior,

$$p(\mathbf{w} \mid D, \mathcal{M}_i) = \frac{p(D \mid \mathbf{w}, \mathcal{M}_i)\,p(\mathbf{w} \mid \mathcal{M}_i)}{p(D \mid \mathcal{M}_i)}.$$
53 Bayesian Model Comparison (4)
For a given model with a single parameter, w, consider the approximation

$$p(D) = \int p(D \mid w)\,p(w)\,dw \;\simeq\; p(D \mid w_{\mathrm{MAP}})\,\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}},$$

where the posterior is assumed to be sharply peaked (width Δw_posterior) and the prior is wider (width Δw_prior). We omit M for simplicity.
54 Bayesian Model Comparison (5)
Taking logarithms, we obtain

$$\ln p(D) \simeq \ln p(D \mid w_{\mathrm{MAP}}) + \ln\!\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right),$$

where the second term is negative.

With M parameters, all assumed to have the same ratio Δw_posterior/Δw_prior, we get

$$\ln p(D) \simeq \ln p(D \mid \mathbf{w}_{\mathrm{MAP}}) + M\,\ln\!\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right),$$

where the second term is negative and linear in M.
55 Bayesian Model Comparison (6)

The complexity penalty is negative and linear in M: the evidence automatically trades off data fit against model complexity.

Matching data and model complexity:

[Figure: evidence for three models of increasing complexity; M_3 is the most complex.]
56 The Evidence Approximation (1)
The fully Bayesian predictive distribution is given by

$$p(t \mid \mathbf{t}) = \int\!\!\int\!\!\int p(t \mid \mathbf{w}, \beta)\,p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\,p(\alpha, \beta \mid \mathbf{t})\,d\mathbf{w}\,d\alpha\,d\beta,$$

but this integral is intractable. Approximate it with

$$p(t \mid \mathbf{t}) \simeq p(t \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t \mid \mathbf{w}, \hat{\beta})\,p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\,d\mathbf{w},$$

where $(\hat{\alpha}, \hat{\beta})$ is the mode of $p(\alpha, \beta \mid \mathbf{t})$, which is assumed to be sharply peaked. This is a.k.a. empirical Bayes, type-2 (or generalized) maximum likelihood, or the evidence approximation.
57 The Evidence Approximation (2)
From Bayes' theorem we have

$$p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\,p(\alpha, \beta),$$

and if we assume p(α, β) to be flat, we see that maximizing the posterior over (α, β) is equivalent to maximizing the marginal likelihood p(t | α, β).

General results for Gaussian integrals give

$$\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi),$$

where $\mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ and

$$E(\mathbf{m}_N) = \frac{\beta}{2}\lVert\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N\rVert^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N.$$
58 The Evidence Approximation (3)

Example: sinusoidal data, Mth-degree polynomial.

[Figure: model evidence as a function of the polynomial order M.]
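A sketch of evaluating the log evidence from the Gaussian-integral expression above, useful for example when comparing polynomial orders as in this example; Phi, t, alpha, and beta are whatever design matrix, targets, and hyperparameters you are evaluating.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) = M/2 ln(a) + N/2 ln(b) - E(m_N) - 1/2 ln|A| - N/2 ln(2*pi)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = beta / 2 * np.sum((t - Phi @ m_N)**2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - N / 2 * np.log(2 * np.pi))
```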
59
60 Maximizing the Evidence Function (1)

To maximize ln p(t | α, β) w.r.t. α and β, we define the eigenvector equation

$$\bigl(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\bigr)\mathbf{u}_i = \lambda_i\mathbf{u}_i.$$

Thus

$$\mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$

has eigenvalues λ_i + α.
61 Maximizing the Evidence Function (2)

We can now differentiate ln p(t | α, β) w.r.t. α and β, and set the results to zero, to get

$$\alpha = \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N}, \qquad \frac{1}{\beta} = \frac{1}{N - \gamma}\sum_{n=1}^{N}\bigl(t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2,$$

where

$$\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$$

depends on both α and β, so the updates are iterated to a fixed point. Recall

$$E(\mathbf{m}_N) = \frac{\beta}{2}\lVert\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N\rVert^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N.$$
62 Effective Number of Parameters (1)

[Figure: likelihood and prior contours in (w_1, w_2) space.]

w_1 is not well determined by the likelihood, while w_2 is well determined by the likelihood. The effective number of parameters, γ, is the number of well-determined parameters.
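A sketch of the fixed-point re-estimation loop implied by the update equations for α, β, and γ above; the initial values and iteration count are arbitrary, and no convergence check is included.

```python
import numpy as np

def reestimate(Phi, t, alpha=1.0, beta=1.0, iters=50):
    """Alternate between the posterior mean m_N and the updates for alpha, beta, gamma."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)            # eigenvalues of Phi^T Phi
    for _ in range(iters):
        lam = beta * eig0                             # eigenvalues of beta * Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N)**2)
    return alpha, beta, gamma
```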
63 Effective Number of Parameters (2)
Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.
64 Effective Number of Parameters (3)
Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.

[Figure: test-set error.]
65 Effective Number of Parameters (4)
Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.

α = 0  ⇒  γ = M = 10: most λ_i > α, so the corresponding w_i are large.
α → ∞  ⇒  γ = 0: most λ_i < α, so the corresponding w_i are small.
66 Effective Number of Parameters (5)
In the limit N ≫ M, γ ≈ M, and we can consider using the easy-to-compute approximations

$$\alpha = \frac{M}{2E_W(\mathbf{m}_N)} = \frac{M}{\mathbf{m}_N^T\mathbf{m}_N}, \qquad \frac{1}{\beta} = \frac{2E_D(\mathbf{m}_N)}{N} = \frac{1}{N}\sum_{n=1}^{N}\bigl(t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2.$$
67 Limitations of Fixed Basis Functions

• M basis functions along each dimension of a D-dimensional input space require M^D basis functions: the curse of dimensionality.
• In later chapters, we shall see how we can get away with fewer basis functions by choosing them using the training data.
• High-dimensional data typically lies on a low-dimensional manifold.
• Perhaps we can learn a few basis functions rather than instantiating them exhaustively.
