Machine Learning
Chapter 3: Linear Models for Regression
孫民
Department of Electrical Engineering, National Tsing Hua University
3/13/23
Linear Basis Function Models (1) – Recall
Example: polynomial curve fitting,
  y(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = Σ_{j=0}^{M} w_j x^j
Linear Basis Function Models (2) – Basis Functions
Generally,
  y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x),
where the φ_j(x) are known as basis functions (they can be nonlinear in x).
Typically φ_0(x) = 1, so that w_0 acts as a bias.
In the simplest case, we use linear basis functions, φ_d(x) = x_d, giving
  y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + … + w_D x_D   (linear regression model),
which is linear with respect to both x and w.
Linear Basis Function Models (3)
Polynomial basis functions:
  φ_j(x) = x^j
These are global: a small change in x affects all basis functions.
Linear Basis Function Models (4)
Gaussian basis functions:
  φ_j(x) = exp( −(x − μ_j)² / (2s²) )
Note that the basis functions do not need to be normalized. μ_j and s control location and scale (width). These are local: a small change in x only affects basis functions with nearby μ_j.
Linear Basis Function Models (5)
Sigmoidal basis functions:
  φ_j(x) = σ( (x − μ_j) / s ),
where
  σ(a) = 1 / (1 + exp(−a)).
μ_j and s control location and scale (slope).
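The following is a minimal numpy sketch (mine, not from the slides; all function names and parameter values are illustrative) of these three basis-function families and of stacking them, together with φ_0(x) = 1, into a design matrix:

import numpy as np

def poly_basis(x, j):
    return x ** j                                     # global: a change in x moves every phi_j

def gauss_basis(x, mu_j, s):
    return np.exp(-(x - mu_j) ** 2 / (2 * s ** 2))    # local around mu_j

def sigmoid_basis(x, mu_j, s):
    return 1.0 / (1.0 + np.exp(-(x - mu_j) / s))      # logistic sigmoid of (x - mu_j)/s

def design_matrix(x, basis_list):
    """Stack phi_0(x) = 1 and the given basis functions into an N x M matrix."""
    cols = [np.ones_like(x)] + [phi(x) for phi in basis_list]
    return np.column_stack(cols)

x = np.linspace(-1.0, 1.0, 5)
Phi = design_matrix(x, [lambda x, j=j: gauss_basis(x, mu_j=-0.5 + 0.5 * j, s=0.3)
                        for j in range(3)])
print(Phi.shape)   # (5, 4): bias column plus 3 Gaussian basis functions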
Maximum Likelihood and Least Squares (1)
Assume observations come from a deterministic function with added Gaussian noise:
  t = y(x, w) + ε,   where p(ε | β) = N(ε | 0, β⁻¹),
which is the same as saying
  p(t | x, w, β) = N( t | y(x, w), β⁻¹ ).
Given observed inputs X = {x_1, …, x_N} and targets t = (t_1, …, t_N)^T, we obtain the likelihood function
  p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β⁻¹ ).
Maximum Likelihood and Least Squares (2)
Taking the logarithm, we get
  ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w),
where
  E_D(w) = (1/2) Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )²
is the sum-of-squares error.
Maximum Likelihood and Least Squares (3)
Computing the gradient and setting it to zero yields
  Σ_{n=1}^{N} t_n φ(x_n)^T − w^T ( Σ_{n=1}^{N} φ(x_n) φ(x_n)^T ) = 0,
or equivalently
  Φ^T t − (Φ^T Φ) w = 0,
where Φ is the N×M design matrix with elements Φ_{nj} = φ_j(x_n).
Maximum Likelihood and Least Squares (4)
  Φ^T t − (Φ^T Φ) w = 0
  (Φ^T Φ) w = Φ^T t
Solving for w, we get
  w_ML = (Φ^T Φ)⁻¹ Φ^T t = Φ† t,
where Φ† = (Φ^T Φ)⁻¹ Φ^T is the Moore-Penrose pseudo-inverse of Φ.
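A minimal numpy sketch (mine, not from the slides; the data are synthetic) of computing w_ML via the pseudo-inverse, together with the numerically preferable least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.normal(size=(N, M))                    # design matrix
w_true = np.array([0.5, -1.0, 2.0, 0.3])
t = Phi @ w_true + 0.1 * rng.normal(size=N)      # targets with Gaussian noise

# Moore-Penrose pseudo-inverse: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.pinv(Phi) @ t
# Equivalent result, computed more stably:
w_ml_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w_ml, w_ml_lstsq))             # True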
Geometry of Least Squares
Consider t as a vector in an N-dimensional space, and the M-dimensional subspace S spanned by the columns φ_j of the design matrix Φ.
w_ML minimizes the distance between t and its orthogonal projection onto S, i.e. y = Φ w_ML.
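A small numpy check (mine, not from the slides) that y = Φ w_ML is indeed the orthogonal projection of t onto S: the residual t − y is orthogonal to every column of Φ:

import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))          # N = 20, M = 3
t = rng.normal(size=20)
y = Phi @ (np.linalg.pinv(Phi) @ t)     # projection of t onto the column space S
print(np.allclose(Phi.T @ (t - y), 0))  # True: residual is orthogonal to span(Phi)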
Sequential Learning
Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent (SGD):
  w^(τ+1) = w^(τ) + η ( t_n − w^(τ)T φ(x_n) ) φ(x_n).
This is known as the least-mean-squares (LMS) algorithm.
Issue: how should we choose the learning rate η?
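A minimal numpy sketch (mine, not from the slides; η and the synthetic data are illustrative) of the LMS update applied one data point at a time:

import numpy as np

rng = np.random.default_rng(2)
N, M = 200, 3
Phi = rng.normal(size=(N, M))
w_true = np.array([1.0, -0.5, 0.25])
t = Phi @ w_true + 0.05 * rng.normal(size=N)

w = np.zeros(M)
eta = 0.05                       # too large diverges, too small converges slowly
for n in range(N):               # single pass over the data, one point at a time
    phi_n = Phi[n]
    w += eta * (t[n] - w @ phi_n) * phi_n
print(w)                         # close to w_true after one sweep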
Regularized Least Squares (1)
Consider the error function
  E_D(w) + λ E_W(w)   (data term + regularization term),
where λ is called the (weight decay) regularization coefficient.
With the sum-of-squares error function and a quadratic regularizer, we get
  (1/2) Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )² + (λ/2) w^T w.
Setting the gradient to zero,
  −Φ^T t + (Φ^T Φ) w + λ w = −Φ^T t + (λI + Φ^T Φ) w = 0,
which is minimized by
  w = (λI + Φ^T Φ)⁻¹ Φ^T t.
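A minimal numpy sketch (mine, not from the slides; λ and the data are illustrative) of the regularized solution:

import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(30, 5))
t = rng.normal(size=30)
lam = 0.1                                         # regularization coefficient

# w = (lambda I + Phi^T Phi)^{-1} Phi^T t; solve the linear system rather than invert
A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
w_reg = np.linalg.solve(A, Phi.T @ t)
print(w_reg)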
Regularized Least Squares (2)
With a more general regularizer, we have
  (1/2) Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )² + (λ/2) Σ_{j=1}^{M} |w_j|^q.
q = 1 gives the lasso (sparsity); q = 2 gives the quadratic regularizer.
Regularized Least Squares (3)
The lasso tends to generate sparser solutions than a quadratic regularizer.
Equivalently: minimize the unregularized sum-of-squares error subject to Σ_{j=1}^{M} |w_j|^q ≤ η.
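A short sketch (mine, not from the slides; it assumes scikit-learn is available, and the regularization strengths and data are illustrative) contrasting the sparsity of the two penalties:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# With comparable regularization strength, the q = 1 (lasso) penalty drives many
# coefficients exactly to zero, while the q = 2 (quadratic) penalty only shrinks them.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                   # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print((np.abs(lasso.coef_) < 1e-8).sum())       # many exact zeros
print((np.abs(ridge.coef_) < 1e-8).sum())       # typically none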
Multiple Outputs (1)
Analogously to the single-output case, we have
  y(x, W) = W^T φ(x),   p(t | x, W, β) = N( t | W^T φ(x), β⁻¹ I ).
Given observed inputs X = {x_1, …, x_N} and targets T = [t_1, …, t_N]^T, we obtain the log-likelihood function
  ln p(T | X, W, β) = (NK/2) ln( β / (2π) ) − (β/2) Σ_{n=1}^{N} || t_n − W^T φ(x_n) ||².
Multiple Outputs (2)
Maximizing with respect to W, we obtain
  W_ML = (Φ^T Φ)⁻¹ Φ^T T.
If we consider a single target variable, t_k, we see that
  w_k = (Φ^T Φ)⁻¹ Φ^T t_k,
where t_k = (t_{1k}, …, t_{Nk})^T, which is identical to the single-output case.
The Bias-Variance Decomposition (1)
Recall the expected squared loss (Chapter 1, slide 54),
  E[L] = ∫ ( y(x) − h(x) )² p(x) dx + ∫∫ ( h(x) − t )² p(x, t) dx dt,
where
  h(x) = E[t | x] = ∫ t p(t | x) dt.
The second term of E[L] corresponds to the noise inherent in the random variable t.
What about the first term?
The Bias-Variance Decomposition (2)
Suppose we were given multiple data sets, each of size N. Any particular data set D will give a particular function y(x; D). We then have
  { y(x; D) − h(x) }²
    = { y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x) }²
    = { y(x; D) − E_D[y(x; D)] }² + 2 { y(x; D) − E_D[y(x; D)] }{ E_D[y(x; D)] − h(x) } + { E_D[y(x; D)] − h(x) }².
The Bias-Variance Decomposition (3)
Taking the expectation over D yields
  E_D[ { y(x; D) − h(x) }² ] = { E_D[y(x; D)] − h(x) }² + E_D[ { y(x; D) − E_D[y(x; D)] }² ].
The Bias-Variance Decomposition (4)
Thus we can write
  expected loss = (bias)² + variance + noise,
where
  (bias)² = ∫ { E_D[y(x; D)] − h(x) }² p(x) dx,
  variance = ∫ E_D[ { y(x; D) − E_D[y(x; D)] }² ] p(x) dx,
  noise = ∫∫ ( h(x) − t )² p(x, t) dx dt.
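A minimal numpy sketch (mine, not from the slides; the noise level, λ, and basis widths are illustrative) of estimating (bias)² and variance by refitting on many data sets, in the spirit of the experiment on the next slides:

import numpy as np

rng = np.random.default_rng(5)
h = lambda x: np.sin(2 * np.pi * x)          # true (sinusoidal) function
x_grid = np.linspace(0, 1, 100)
lam, n_sets, N = 1e-2, 100, 25               # 100 data sets of 25 samples each

def gauss_design(x, mus, s=0.1):
    return np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * s ** 2))

mus = np.linspace(0, 1, 24)                  # 24 Gaussian basis functions
preds = np.empty((n_sets, x_grid.size))
for d in range(n_sets):
    x = rng.uniform(0, 1, N)
    t = h(x) + 0.3 * rng.normal(size=N)
    Phi = gauss_design(x, mus)
    w = np.linalg.solve(lam * np.eye(24) + Phi.T @ Phi, Phi.T @ t)
    preds[d] = gauss_design(x_grid, mus) @ w

y_bar = preds.mean(axis=0)                   # average fit over data sets
bias2 = np.mean((y_bar - h(x_grid)) ** 2)
variance = np.mean(preds.var(axis=0))
print(bias2, variance)    # large lambda: high bias; small lambda: high variance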
The Bias-Variance Decomposition (5)
Example: 100 data sets, each with 25 samples, drawn from the sinusoidal function, varying the degree of regularization λ for a fit with 24 Gaussian basis functions.
[Figure: individual fits (left, illustrating variance) and their average compared with the true function (right, illustrating bias) for one value of λ.]
The Bias-Variance Decomposition (6)
[Figure: the same experiment for a different value of the regularization coefficient λ; left panel illustrates variance, right panel bias.]
The Bias-Variance Decomposition (7)
[Figure: the same experiment for a third value of the regularization coefficient λ; left panel illustrates variance, right panel bias.]
The Bias-Variance Trade-off
From these plots, we note that an over-regularized model (large λ) will have high bias, while an under-regularized model (small λ) will have high variance.
[Figure: bias and variance as a function of model complexity, from high regularization (low complexity) to low regularization (high complexity).]
Bias/Variance Dilemma (1)
• Bias: the difference between the expected prediction and the true value.
• Variance: the variation of the estimated values around their mean.
• Given a training set D, as we increase model complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).
• High bias usually comes with low variance, and vice versa.
• This is the bias/variance dilemma (Geman et al., 1992).
Bias/Variance Dilemma (2)
[Figure]
Bias/Variance Dilemma (3)
[Figure]
Bias/Variance Dilemma (4)
https://www.cs.cornell.edu/courses/cs578/2005fa/CS578.bagging.boosting.lecture.pdf
Bias/Variance Dilemma (5)
Error = noise² + bias² + variance. More data helps.
[Figure: test error versus model complexity for few vs. many training examples; the complexity axis runs from high bias / low variance to low bias / high variance.]
Bias/Variance Dilemma (6)
• We need a validation set.
• The validation set is separate from the test set.
[Figure: training error and test error versus model complexity; the complexity axis runs from high bias / low variance to low bias / high variance.]
Bias/Variance Dilemma (7)
[Figure: for a fixed classifier, training and testing error versus the number of training examples; the gap between the two curves is the generalization error.]
Bayesian Linear Regression (1)
Given the likelihood
  p(t | w) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β⁻¹ ),
define a conjugate prior over w (assuming β is known):
  p(w) = N(w | m_0, S_0).
Combining this with the likelihood function, and using the results for marginal and conditional Gaussian distributions, gives the posterior
  p(w | t) = N(w | m_N, S_N),
where
  m_N = S_N ( S_0⁻¹ m_0 + β Φ^T t ),   S_N⁻¹ = S_0⁻¹ + β Φ^T Φ.
Bayesian Linear Regression (2)
A common choice for the prior is the zero-mean isotropic Gaussian
  p(w | α) = N(w | 0, α⁻¹ I),
for which
  m_N = β S_N Φ^T t,   S_N⁻¹ = α I + β Φ^T Φ.
Next we consider an example.
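A minimal numpy sketch (mine, not from the slides; α, β and the data are illustrative) of computing the posterior mean and covariance for this prior:

import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0                    # illustrative prior and noise precisions
Phi = rng.normal(size=(20, 4))             # design matrix
t = rng.normal(size=20)                    # targets

# S_N^{-1} = alpha*I + beta*Phi^T Phi,   m_N = beta * S_N Phi^T t
S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t
print(m_N)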
Bayesian Linear Regression (3)
0 data points observed: y(x, w) = w_0 + w_1 x.
[Panels: prior, data space.]
Bayesian Linear Regression (4)
1 data point observed: y(x, w) = w_0 + w_1 x, so the observed point constrains w_0 = y − w_1 x.
[Panels: likelihood, posterior, data space.]
Bayesian Linear Regression (5)
2 data points observed: y(x, w) = w_0 + w_1 x, w_0 = y − w_1 x.
[Panels: likelihood of the 2nd point, posterior, data space.]
Bayesian Linear Regression (6)
20 data points observed: y(x, w) = w_0 + w_1 x, w_0 = y − w_1 x.
[Panels: likelihood of the 20th point, posterior, data space.]
Predictive Distribution (1)
Predict t for new values of x by integrating over w:
  p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw = N( t | m_N^T φ(x), σ_N²(x) ),
where
  σ_N²(x) = 1/β + φ(x)^T S_N φ(x).
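A minimal numpy sketch (mine, not from the slides; the feature vector and precisions are made up) of evaluating the predictive mean and variance for one new input:

import numpy as np

rng = np.random.default_rng(7)
alpha, beta = 2.0, 25.0
Phi = rng.normal(size=(20, 4))
t = rng.normal(size=20)
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

phi_x = np.array([1.0, 0.2, -0.4, 0.7])        # phi(x) for one new input
mean = m_N @ phi_x                             # predictive mean m_N^T phi(x)
var = 1.0 / beta + phi_x @ S_N @ phi_x         # predictive variance sigma_N^2(x)
print(mean, var)                               # var shrinks as more data arrive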
Predictive Distribution (2)
Example: sinusoidal data, 9 Gaussian basis functions, 1 data point.
[Figure: samples from the posterior distribution.]
Predictive Distribution (3)
Example: sinusoidal data, 9 Gaussian basis functions, 2 data points.
[Figure: samples from the posterior distribution.]
Predictive Distribution (4)
Example: sinusoidal data, 9 Gaussian basis functions, 4 data points.
[Figure: samples from the posterior distribution.]
Predictive Distribution (5)
Example: sinusoidal data, 9 Gaussian basis functions, 25 data points.
[Figure: samples from the posterior distribution.]
Equivalent Kernel (1)
The predictive mean can be written
  y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^{N} k(x, x_n) t_n,
where
  k(x, x') = β φ(x)^T S_N φ(x')
is known as the equivalent kernel or smoother matrix.
The predictive mean is thus a weighted sum of the training-data target values t_n.
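A small numpy check (mine, not from the slides; the data are synthetic) that the predictive mean equals the kernel-weighted sum of the targets:

import numpy as np

rng = np.random.default_rng(8)
alpha, beta = 2.0, 25.0
Phi = rng.normal(size=(20, 4))                     # rows are phi(x_n)^T
t = rng.normal(size=20)
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

phi_x = np.array([1.0, 0.2, -0.4, 0.7])            # phi(x) for one new input
k = beta * (phi_x @ S_N @ Phi.T)                   # k(x, x_n) for all n
print(np.allclose(k @ t, m_N @ phi_x))             # True: same predictive mean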
Equivalent Kernel (2) – Gaussian Basis
The weight of t_n depends on the distance between x and x_n; nearby x_n carry more weight (a localization property).
Equivalent Kernel (3)
Non-local basis functions have local equivalent kernels:
[Figure: equivalent kernels for polynomial and sigmoidal basis functions.]
Equivalent Kernel (4) – Properties
The kernel as a covariance function: consider
  cov[ y(x), y(x') ] = cov[ φ(x)^T w, w^T φ(x') ] = φ(x)^T S_N φ(x') = β⁻¹ k(x, x').
We can avoid the use of basis functions and define the kernel function directly, leading to Gaussian processes (Chapter 6).
Equivalent Kernel (5) – Properties
The kernel weights sum to one,
  Σ_{n=1}^{N} k(x, x_n) = 1
for all values of x; however, the equivalent kernel may be negative for some values of x.
Like all kernel functions, the equivalent kernel can be expressed as an inner product:
  k(x, z) = ψ(x)^T ψ(z),   where ψ(x) = β^{1/2} S_N^{1/2} φ(x).
Bayesian Model Comparison (1)
How do we choose the 'right' model?
Assume we want to compare models M_i, i = 1, …, L, using data D; this requires computing
  p(M_i | D) ∝ p(M_i) p(D | M_i)
  (posterior ∝ prior × model evidence, a.k.a. marginal likelihood).
Bayes factor: the ratio of evidences for two models,
  p(D | M_i) / p(D | M_j),
which is what matters when the priors are the same for all models.
Bayesian Model Comparison (2)
Having computed p(M_i | D), we can compute the predictive (mixture) distribution
  p(t | x, D) = Σ_{i=1}^{L} p(t | x, M_i, D) p(M_i | D).
This combines the strengths of the individual models, which may each be accurate in different regions of x.
A simpler approximation, known as model selection, is to use the single model with the highest evidence.
Bayesian Model Comparison (3)
For a model with parameters w, we get the model evidence by marginalizing over w:
  p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw.
Note that the evidence is the normalizing constant in
  p(w | D, M_i) = p(D | w, M_i) p(w | M_i) / p(D | M_i).
Bayesian Model Comparison (4)
For a given model with a single parameter w, consider the approximation
  p(D) = ∫ p(D | w) p(w) dw ≈ p(D | w_MAP) · ( Δw_posterior / Δw_prior ),
where the posterior is assumed to be sharply peaked (with width Δw_posterior) and the prior is wider (with width Δw_prior, so p(w) ≈ 1/Δw_prior). We omit M for simplicity.
Bayesian Model Comparison (5)
Taking logarithms, we obtain
  ln p(D) ≈ ln p(D | w_MAP) + ln( Δw_posterior / Δw_prior ),
where the second term is negative.
With M parameters, all assumed to have the same ratio Δw_posterior / Δw_prior, we get
  ln p(D) ≈ ln p(D | w_MAP) + M ln( Δw_posterior / Δw_prior ),
where the second term is negative and linear in M.
Bayesian Model Comparison (6)
Since the complexity penalty is negative and linear in M, the evidence trades off data fit against model complexity: it favours a model whose complexity matches the data.
[Figure: evidence for three models M_1, M_2, M_3 of increasing complexity, with M_3 the most complex.]
The Evidence Approximation (1)
The fully Bayesian predictive distribution is given by
  p(t | t) = ∫∫∫ p(t | w, β) p(w | t, α, β) p(α, β | t) dw dα dβ,
but this integral is intractable. Approximate it with
  p(t | t) ≈ p(t | t, α̂, β̂) = ∫ p(t | w, β̂) p(w | t, α̂, β̂) dw,
where (α̂, β̂) is the mode of p(α, β | t), which is assumed to be sharply peaked. This is a.k.a. empirical Bayes, type II or generalized maximum likelihood, or the evidence approximation.
The Evidence Approximation (2)
From Bayes' theorem we have
  p(α, β | t) ∝ p(t | α, β) p(α, β),
and if we assume p(α, β) to be flat, we see that maximizing the posterior amounts to maximizing the evidence
  p(t | α, β) = ∫ p(t | w, β) p(w | α) dw.
General results for Gaussian integrals give
  ln p(t | α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln |A| − (N/2) ln(2π),
where A = α I + β Φ^T Φ and
  E(m_N) = (β/2) || t − Φ m_N ||² + (α/2) m_N^T m_N.
The Evidence Approximation (3)
Example: sinusoidal data, M-th degree polynomial.
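A minimal numpy sketch (mine, not from the slides; the data are synthetic) of evaluating the log evidence formula above; in an example like this one, its value would be compared across models, e.g. across polynomial orders M:

import numpy as np

def log_evidence(Phi, t, alpha, beta):
    # ln p(t|alpha,beta) = (M/2)ln alpha + (N/2)ln beta - E(m_N)
    #                      - (1/2)ln|A| - (N/2)ln(2*pi)
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(9)
Phi = rng.normal(size=(30, 5))
t = rng.normal(size=30)
print(log_evidence(Phi, t, alpha=1.0, beta=10.0))   # compare across candidate models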
Maximizing the Evidence Function (1)
To maximize ln p(t | α, β) with respect to α and β, we define the eigenvector equation
  ( β Φ^T Φ ) u_i = λ_i u_i.
Thus
  A = α I + β Φ^T Φ
has eigenvalues λ_i + α.
Maximizing the Evidence Function (2)
We can now differentiate ln p(t | α, β) with respect to α and β, and set the results to zero, to get
  α = γ / ( m_N^T m_N ),
  1/β = ( 1 / (N − γ) ) Σ_{n=1}^{N} ( t_n − m_N^T φ(x_n) )²,
where
  γ = Σ_i λ_i / ( α + λ_i )
depends on both α and β (recall E(m_N) = (β/2) || t − Φ m_N ||² + (α/2) m_N^T m_N).
Effective Number of Parameters (1)
[Figure: contours of the likelihood and the prior in (w_1, w_2) space.]
w_1 is not well determined by the likelihood, while w_2 is well determined by the likelihood.
The effective number of parameters, γ, is the number of well-determined parameters.
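A minimal numpy sketch (mine, not from the slides; the data and starting values are illustrative) of the fixed-point re-estimation of α and β from the previous slides, which also computes the effective number of well-determined parameters γ:

import numpy as np

rng = np.random.default_rng(10)
Phi = rng.normal(size=(50, 6))
t = Phi @ rng.normal(size=6) + 0.2 * rng.normal(size=50)
N, M = Phi.shape
eig0 = np.linalg.eigvalsh(Phi.T @ Phi)      # eigenvalues of Phi^T Phi

alpha, beta = 1.0, 1.0                      # illustrative starting values
for _ in range(100):
    lam = beta * eig0                       # eigenvalues of beta * Phi^T Phi
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    gamma = np.sum(lam / (alpha + lam))     # effective number of parameters
    alpha = gamma / (m_N @ m_N)
    beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
print(alpha, beta, gamma)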
Effective Number of Parameters (2)
Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.
Effective Number of Parameters (3)
Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.
[Figure: test set error as α is varied.]
Effective Number of Parameters (4)
Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.
α = 0  →  γ = M = 10: most λ_i > α, so the w_i are large (well determined by the data).
α = ∞  →  γ = 0: most λ_i < α, so the w_i are small (driven towards zero by the prior).
Effective Number of Parameters (5)
In the limit N ≫ M, γ = M, and we can consider using the easy-to-compute approximations
  α = M / ( 2 E_W(m_N) ),   β = N / ( 2 E_D(m_N) ).
Limitations of Fixed Basis Functions
• Using M basis functions along each dimension of a D-dimensional input space requires M^D basis functions: the curse of dimensionality.
• In later chapters, we shall see how we can get away with fewer basis functions by choosing them using the training data.
• High-dimensional data typically lies on a low-dimensional manifold.
• Perhaps we should learn a few basis functions rather than instantiate them exhaustively.