Machine Learning
Chapter 3: Linear Models for Regression
孫民
Department of Electrical Engineering, National Tsing Hua University
3/13/23
Linear Basis Function Models (1) – Recall
Example: polynomial curve fitting,
  y(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = Σ_{j=0}^{M} w_j x^j
Linear Basis Function Models (2) – Basis Functions
Generally,
  y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x),
where the φ_j(x) are known as basis functions (they can be nonlinear in x).
Typically φ_0(x) = 1, so that w_0 acts as a bias.
In the simplest case, we use linear basis functions, φ_d(x) = x_d, giving
  y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + … + w_D x_D   (linear regression model),
which is linear with respect to both x and w.
Linear Basis Function Models (3)
Polynomial basis functions:
  φ_j(x) = x^j
These are global: a small change in x affects all basis functions.
Linear Basis Function Models (4)
Gaussian basis functions:
  φ_j(x) = exp( −(x − μ_j)² / (2s²) )
Note that the basis functions do not need to be normalized. μ_j and s control location and scale (width). These are local: a small change in x only affects basis functions with nearby μ_j.
Linear Basis Function Models (5)
Sigmoidal basis functions:
  φ_j(x) = σ( (x − μ_j) / s ),
where
  σ(a) = 1 / (1 + exp(−a)).
μ_j and s control location and scale (slope).
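The following is a minimal numpy sketch (mine, not from the slides; all function names and parameter values are illustrative) of these three basis-function families and of stacking them, together with φ_0(x) = 1, into a design matrix:

import numpy as np

def poly_basis(x, j):
    return x ** j                                     # global: a change in x moves every phi_j

def gauss_basis(x, mu_j, s):
    return np.exp(-(x - mu_j) ** 2 / (2 * s ** 2))    # local around mu_j

def sigmoid_basis(x, mu_j, s):
    return 1.0 / (1.0 + np.exp(-(x - mu_j) / s))      # logistic sigmoid of (x - mu_j)/s

def design_matrix(x, basis_list):
    """Stack phi_0(x) = 1 and the given basis functions into an N x M matrix."""
    cols = [np.ones_like(x)] + [phi(x) for phi in basis_list]
    return np.column_stack(cols)

x = np.linspace(-1.0, 1.0, 5)
Phi = design_matrix(x, [lambda x, j=j: gauss_basis(x, mu_j=-0.5 + 0.5 * j, s=0.3)
                        for j in range(3)])
print(Phi.shape)   # (5, 4): bias column plus 3 Gaussian basis functions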
Maximum Likelihood and Least Squares (1)
Assume observations come from a deterministic function with added Gaussian noise:
  t = y(x, w) + ε,   where p(ε | β) = N(ε | 0, β⁻¹),
which is the same as saying
  p(t | x, w, β) = N( t | y(x, w), β⁻¹ ).
Given observed inputs X = {x_1, …, x_N} and targets t = (t_1, …, t_N)^T, we obtain the likelihood function
  p(t | X, w, β) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β⁻¹ ).
Maximum Likelihood and Least Squares (2)
Taking the logarithm, we get
  ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w),
where
  E_D(w) = (1/2) Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )²
is the sum-of-squares error.
Maximum Likelihood and Least Squares (3)
Computing the gradient and setting it to zero yields
  Σ_{n=1}^{N} t_n φ(x_n)^T − w^T ( Σ_{n=1}^{N} φ(x_n) φ(x_n)^T ) = 0,
or equivalently
  Φ^T t − (Φ^T Φ) w = 0,
where Φ is the N×M design matrix with elements Φ_{nj} = φ_j(x_n).
Maximum Likelihood and Least Squares (4)
  Φ^T t − (Φ^T Φ) w = 0
  (Φ^T Φ) w = Φ^T t
Solving for w, we get
  w_ML = (Φ^T Φ)⁻¹ Φ^T t = Φ† t,
where Φ† = (Φ^T Φ)⁻¹ Φ^T is the Moore-Penrose pseudo-inverse of Φ.
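A minimal numpy sketch (mine, not from the slides; the data are synthetic) of computing w_ML via the pseudo-inverse, together with the numerically preferable least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.normal(size=(N, M))                    # design matrix
w_true = np.array([0.5, -1.0, 2.0, 0.3])
t = Phi @ w_true + 0.1 * rng.normal(size=N)      # targets with Gaussian noise

# Moore-Penrose pseudo-inverse: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.pinv(Phi) @ t
# Equivalent result, computed more stably:
w_ml_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w_ml, w_ml_lstsq))             # True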
Geometry of Least Squares
Consider t as a vector in an N-dimensional space, and the M-dimensional subspace S spanned by the columns φ_j of the design matrix Φ.
w_ML minimizes the distance between t and its orthogonal projection onto S, i.e. y = Φ w_ML.
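A small numpy check (mine, not from the slides) that y = Φ w_ML is indeed the orthogonal projection of t onto S: the residual t − y is orthogonal to every column of Φ:

import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))          # N = 20, M = 3
t = rng.normal(size=20)
y = Phi @ (np.linalg.pinv(Phi) @ t)     # projection of t onto the column space S
print(np.allclose(Phi.T @ (t - y), 0))  # True: residual is orthogonal to span(Phi)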
Sequential Learning
Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent (SGD):
  w^(τ+1) = w^(τ) + η ( t_n − w^(τ)T φ(x_n) ) φ(x_n).
This is known as the least-mean-squares (LMS) algorithm.
Issue: how should we choose the learning rate η?
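A minimal numpy sketch (mine, not from the slides; η and the synthetic data are illustrative) of the LMS update applied one data point at a time:

import numpy as np

rng = np.random.default_rng(2)
N, M = 200, 3
Phi = rng.normal(size=(N, M))
w_true = np.array([1.0, -0.5, 0.25])
t = Phi @ w_true + 0.05 * rng.normal(size=N)

w = np.zeros(M)
eta = 0.05                       # too large diverges, too small converges slowly
for n in range(N):               # single pass over the data, one point at a time
    phi_n = Phi[n]
    w += eta * (t[n] - w @ phi_n) * phi_n
print(w)                         # close to w_true after one sweep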
Regularized Least Squares (1)
Consider the error function
  E_D(w) + λ E_W(w)   (data term + regularization term),
where λ is called the (weight decay) regularization coefficient.
With the sum-of-squares error function and a quadratic regularizer, we get
  (1/2) Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )² + (λ/2) w^T w.
Setting the gradient to zero,
  −Φ^T t + (Φ^T Φ) w + λ w = −Φ^T t + (λI + Φ^T Φ) w = 0,
which is minimized by
  w = (λI + Φ^T Φ)⁻¹ Φ^T t.
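A minimal numpy sketch (mine, not from the slides; λ and the data are illustrative) of the regularized solution:

import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(30, 5))
t = rng.normal(size=30)
lam = 0.1                                         # regularization coefficient

# w = (lambda I + Phi^T Phi)^{-1} Phi^T t; solve the linear system rather than invert
A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
w_reg = np.linalg.solve(A, Phi.T @ t)
print(w_reg)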
Regularized Least Squares (2)
With a more general regularizer, we have
  (1/2) Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )² + (λ/2) Σ_{j=1}^{M} |w_j|^q.
q = 1 gives the lasso (sparsity); q = 2 gives the quadratic regularizer.
Regularized Least Squares (3)
The lasso tends to generate sparser solutions than a quadratic regularizer.
Equivalently: minimize the unregularized sum-of-squares error subject to Σ_{j=1}^{M} |w_j|^q ≤ η.
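A short sketch (mine, not from the slides; it assumes scikit-learn is available, and the regularization strengths and data are illustrative) contrasting the sparsity of the two penalties:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# With comparable regularization strength, the q = 1 (lasso) penalty drives many
# coefficients exactly to zero, while the q = 2 (quadratic) penalty only shrinks them.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                   # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print((np.abs(lasso.coef_) < 1e-8).sum())       # many exact zeros
print((np.abs(ridge.coef_) < 1e-8).sum())       # typically none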
Multiple Outputs (1)
Analogously to the single-output case, we have
  y(x, W) = W^T φ(x),   p(t | x, W, β) = N( t | W^T φ(x), β⁻¹ I ).
Given observed inputs X = {x_1, …, x_N} and targets T = [t_1, …, t_N]^T, we obtain the log-likelihood function
  ln p(T | X, W, β) = (NK/2) ln( β / (2π) ) − (β/2) Σ_{n=1}^{N} || t_n − W^T φ(x_n) ||².
Multiple Outputs (2)
Maximizing with respect to W, we obtain
  W_ML = (Φ^T Φ)⁻¹ Φ^T T.
If we consider a single target variable, t_k, we see that
  w_k = (Φ^T Φ)⁻¹ Φ^T t_k,
where t_k = (t_{1k}, …, t_{Nk})^T, which is identical to the single-output case.
The Bias-Variance Decomposition (1)
Recall the expected squared loss (Chapter 1, slide 54),
  E[L] = ∫ ( y(x) − h(x) )² p(x) dx + ∫∫ ( h(x) − t )² p(x, t) dx dt,
where
  h(x) = E[t | x] = ∫ t p(t | x) dt.
The second term of E[L] corresponds to the noise inherent in the random variable t.
What about the first term?
The Bias-Variance Decomposition (2)
Suppose we were given multiple data sets, each of size N. Any particular data set D will give a particular function y(x; D). We then have
  { y(x; D) − h(x) }²
    = { y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x) }²
    = { y(x; D) − E_D[y(x; D)] }² + 2 { y(x; D) − E_D[y(x; D)] }{ E_D[y(x; D)] − h(x) } + { E_D[y(x; D)] − h(x) }².
The Bias-Variance Decomposition (3)
Taking the expectation over D yields
  E_D[ { y(x; D) − h(x) }² ] = { E_D[y(x; D)] − h(x) }² + E_D[ { y(x; D) − E_D[y(x; D)] }² ].
The Bias-Variance Decomposition (4)
Thus we can write
  expected loss = (bias)² + variance + noise,
where
  (bias)² = ∫ { E_D[y(x; D)] − h(x) }² p(x) dx,
  variance = ∫ E_D[ { y(x; D) − E_D[y(x; D)] }² ] p(x) dx,
  noise = ∫∫ ( h(x) − t )² p(x, t) dx dt.
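A minimal numpy sketch (mine, not from the slides; the noise level, λ, and basis widths are illustrative) of estimating (bias)² and variance by refitting on many data sets, in the spirit of the experiment on the next slides:

import numpy as np

rng = np.random.default_rng(5)
h = lambda x: np.sin(2 * np.pi * x)          # true (sinusoidal) function
x_grid = np.linspace(0, 1, 100)
lam, n_sets, N = 1e-2, 100, 25               # 100 data sets of 25 samples each

def gauss_design(x, mus, s=0.1):
    return np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * s ** 2))

mus = np.linspace(0, 1, 24)                  # 24 Gaussian basis functions
preds = np.empty((n_sets, x_grid.size))
for d in range(n_sets):
    x = rng.uniform(0, 1, N)
    t = h(x) + 0.3 * rng.normal(size=N)
    Phi = gauss_design(x, mus)
    w = np.linalg.solve(lam * np.eye(24) + Phi.T @ Phi, Phi.T @ t)
    preds[d] = gauss_design(x_grid, mus) @ w

y_bar = preds.mean(axis=0)                   # average fit over data sets
bias2 = np.mean((y_bar - h(x_grid)) ** 2)
variance = np.mean(preds.var(axis=0))
print(bias2, variance)    # large lambda: high bias; small lambda: high variance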
The Bias-Variance Decomposition (5)
Example: 100 data sets, each with 25 samples, drawn from the sinusoidal function, varying the degree of regularization λ for a fit with 24 Gaussian basis functions.
[Figure: individual fits (left, illustrating variance) and their average compared with the true function (right, illustrating bias) for one value of λ.]
The Bias-Variance Decomposition (6)
[Figure: the same experiment for a different value of the regularization coefficient λ; left panel illustrates variance, right panel bias.]
The Bias-Variance Decomposition (7)
[Figure: the same experiment for a third value of the regularization coefficient λ; left panel illustrates variance, right panel bias.]
The Bias-Variance Trade-off
From these plots, we note that an over-regularized model (large λ) will have high bias, while an under-regularized model (small λ) will have high variance.
[Figure: bias and variance as a function of model complexity, from high regularization (low complexity) to low regularization (high complexity).]
Bias/Variance Dilemma (1)
• Bias: the difference between the expected prediction and the true value.
• Variance: the variation of the estimated values around their mean.
• Given a training set D, as we increase model complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).
• High bias usually comes with low variance, and vice versa.
• This is the bias/variance dilemma (Geman et al., 1992).
Bias/Variance Dilemma (2)
[Figure]
Bias/Variance Dilemma (3)
[Figure]
Bias/Variance Dilemma (4)
https://www.cs.cornell.edu/courses/cs578/2005fa/CS578.bagging.boosting.lecture.pdf
Bias/Variance Dilemma (5)
Error = noise² + bias² + variance. More data helps.
[Figure: test error versus model complexity for few vs. many training examples; the complexity axis runs from high bias / low variance to low bias / high variance.]
Bias/Variance Dilemma (6)
• We need a validation set.
• The validation set is separate from the test set.
[Figure: training error and test error versus model complexity; the complexity axis runs from high bias / low variance to low bias / high variance.]
Bias/Variance Dilemma (7)
[Figure: for a fixed classifier, training and testing error versus the number of training examples; the gap between the two curves is the generalization error.]
Bayesian Linear Regression (1)
Given the likelihood
  p(t | w) = Π_{n=1}^{N} N( t_n | w^T φ(x_n), β⁻¹ ),
define a conjugate prior over w (assuming β is known):
  p(w) = N(w | m_0, S_0).
Combining this with the likelihood function, and using the results for marginal and conditional Gaussian distributions, gives the posterior
  p(w | t) = N(w | m_N, S_N),
where
  m_N = S_N ( S_0⁻¹ m_0 + β Φ^T t ),   S_N⁻¹ = S_0⁻¹ + β Φ^T Φ.
Bayesian Linear Regression (2)
A common choice for the prior is the zero-mean isotropic Gaussian
  p(w | α) = N(w | 0, α⁻¹ I),
for which
  m_N = β S_N Φ^T t,   S_N⁻¹ = α I + β Φ^T Φ.
Next we consider an example.
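A minimal numpy sketch (mine, not from the slides; α, β and the data are illustrative) of computing the posterior mean and covariance for this prior:

import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0                    # illustrative prior and noise precisions
Phi = rng.normal(size=(20, 4))             # design matrix
t = rng.normal(size=20)                    # targets

# S_N^{-1} = alpha*I + beta*Phi^T Phi,   m_N = beta * S_N Phi^T t
S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t
print(m_N)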
Bayesian Linear Regression (3)
0 data points observed: y(x, w) = w_0 + w_1 x.
[Panels: prior, data space.]
Bayesian Linear Regression (4)
1 data point observed: y(x, w) = w_0 + w_1 x, so the observed point constrains w_0 = y − w_1 x.
[Panels: likelihood, posterior, data space.]
Bayesian Linear Regression (5)
2 data points observed: y(x, w) = w_0 + w_1 x, w_0 = y − w_1 x.
[Panels: likelihood of the 2nd point, posterior, data space.]
Bayesian Linear Regression (6)
20 data points observed: y(x, w) = w_0 + w_1 x, w_0 = y − w_1 x.
[Panels: likelihood of the 20th point, posterior, data space.]
Predictive Distribution (1)
Predict t for new values of x by integrating over w:
  p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw = N( t | m_N^T φ(x), σ_N²(x) ),
where
  σ_N²(x) = 1/β + φ(x)^T S_N φ(x).
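A minimal numpy sketch (mine, not from the slides; the feature vector and precisions are made up) of evaluating the predictive mean and variance for one new input:

import numpy as np

rng = np.random.default_rng(7)
alpha, beta = 2.0, 25.0
Phi = rng.normal(size=(20, 4))
t = rng.normal(size=20)
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

phi_x = np.array([1.0, 0.2, -0.4, 0.7])        # phi(x) for one new input
mean = m_N @ phi_x                             # predictive mean m_N^T phi(x)
var = 1.0 / beta + phi_x @ S_N @ phi_x         # predictive variance sigma_N^2(x)
print(mean, var)                               # var shrinks as more data arrive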
Predictive Distribution (2)
Example: sinusoidal data, 9 Gaussian basis functions, 1 data point.
[Figure: samples from the posterior distribution.]
Predictive Distribution (3)
Example: sinusoidal data, 9 Gaussian basis functions, 2 data points.
[Figure: samples from the posterior distribution.]
Predictive Distribution (4)
Example: sinusoidal data, 9 Gaussian basis functions, 4 data points.
[Figure: samples from the posterior distribution.]
Predictive Distribution (5)
Example: sinusoidal data, 9 Gaussian basis functions, 25 data points.
[Figure: samples from the posterior distribution.]
Equivalent Kernel (1)
The predictive mean can be written
  y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^{N} k(x, x_n) t_n,
where
  k(x, x') = β φ(x)^T S_N φ(x')
is known as the equivalent kernel or smoother matrix.
The predictive mean is thus a weighted sum of the training-data target values t_n.
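A small numpy check (mine, not from the slides; the data are synthetic) that the predictive mean equals the kernel-weighted sum of the targets:

import numpy as np

rng = np.random.default_rng(8)
alpha, beta = 2.0, 25.0
Phi = rng.normal(size=(20, 4))                     # rows are phi(x_n)^T
t = rng.normal(size=20)
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

phi_x = np.array([1.0, 0.2, -0.4, 0.7])            # phi(x) for one new input
k = beta * (phi_x @ S_N @ Phi.T)                   # k(x, x_n) for all n
print(np.allclose(k @ t, m_N @ phi_x))             # True: same predictive mean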
Equivalent Kernel (2) – Gaussian Basis
The weight of t_n depends on the distance between x and x_n; nearby x_n carry more weight (a localization property).
Equivalent Kernel (3)
Non-local basis functions have local equivalent kernels:
[Figure: equivalent kernels for polynomial and sigmoidal basis functions.]
Equivalent Kernel (4) – Properties
The kernel as a covariance function: consider
  cov[ y(x), y(x') ] = cov[ φ(x)^T w, w^T φ(x') ] = φ(x)^T S_N φ(x') = β⁻¹ k(x, x').
We can avoid the use of basis functions and define the kernel function directly, leading to Gaussian processes (Chapter 6).
Equivalent Kernel (5) – Properties
The kernel weights sum to one,
  Σ_{n=1}^{N} k(x, x_n) = 1
for all values of x; however, the equivalent kernel may be negative for some values of x.
Like all kernel functions, the equivalent kernel can be expressed as an inner product:
  k(x, z) = ψ(x)^T ψ(z),   where ψ(x) = β^{1/2} S_N^{1/2} φ(x).
Bayesian Model Comparison (1)
How do we choose the 'right' model?
Assume we want to compare models M_i, i = 1, …, L, using data D; this requires computing
  p(M_i | D) ∝ p(M_i) p(D | M_i)
  (posterior ∝ prior × model evidence, a.k.a. marginal likelihood).
Bayes factor: the ratio of evidences for two models,
  p(D | M_i) / p(D | M_j),
which is what matters when the priors are the same for all models.
Bayesian Model Comparison (2)
Having computed p(M_i | D), we can compute the predictive (mixture) distribution
  p(t | x, D) = Σ_{i=1}^{L} p(t | x, M_i, D) p(M_i | D).
This combines the strengths of the individual models, which may each be accurate in different regions of x.
A simpler approximation, known as model selection, is to use the single model with the highest evidence.
Bayesian Model Comparison (3)
For a model with parameters w, we get the model evidence by marginalizing over w:
  p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw.
Note that the evidence is the normalizing constant in
  p(w | D, M_i) = p(D | w, M_i) p(w | M_i) / p(D | M_i).
Bayesian Model Comparison (4)
For a given model with a single parameter w, consider the approximation
  p(D) = ∫ p(D | w) p(w) dw ≈ p(D | w_MAP) · ( Δw_posterior / Δw_prior ),
where the posterior is assumed to be sharply peaked (with width Δw_posterior) and the prior is wider (with width Δw_prior, so p(w) ≈ 1/Δw_prior). We omit M for simplicity.
Bayesian Model Comparison (5)
Taking logarithms, we obtain
  ln p(D) ≈ ln p(D | w_MAP) + ln( Δw_posterior / Δw_prior ),
where the second term is negative.
With M parameters, all assumed to have the same ratio Δw_posterior / Δw_prior, we get
  ln p(D) ≈ ln p(D | w_MAP) + M ln( Δw_posterior / Δw_prior ),
where the second term is negative and linear in M.
Bayesian Model Comparison (6)
Since the complexity penalty is negative and linear in M, the evidence trades off data fit against model complexity: it favours a model whose complexity matches the data.
[Figure: evidence for three models M_1, M_2, M_3 of increasing complexity, with M_3 the most complex.]
The Evidence Approximation (1)
The fully Bayesian predictive distribution is given by
  p(t | t) = ∫∫∫ p(t | w, β) p(w | t, α, β) p(α, β | t) dw dα dβ,
but this integral is intractable. Approximate it with
  p(t | t) ≈ p(t | t, α̂, β̂) = ∫ p(t | w, β̂) p(w | t, α̂, β̂) dw,
where (α̂, β̂) is the mode of p(α, β | t), which is assumed to be sharply peaked. This is a.k.a. empirical Bayes, type II or generalized maximum likelihood, or the evidence approximation.
The Evidence Approximation (2)
From Bayes' theorem we have
  p(α, β | t) ∝ p(t | α, β) p(α, β),
and if we assume p(α, β) to be flat, we see that maximizing the posterior amounts to maximizing the evidence
  p(t | α, β) = ∫ p(t | w, β) p(w | α) dw.
General results for Gaussian integrals give
  ln p(t | α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln |A| − (N/2) ln(2π),
where A = α I + β Φ^T Φ and
  E(m_N) = (β/2) || t − Φ m_N ||² + (α/2) m_N^T m_N.
The Evidence Approximation (3)
Example: sinusoidal data, M-th degree polynomial.
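A minimal numpy sketch (mine, not from the slides; the data are synthetic) of evaluating the log evidence formula above; in an example like this one, its value would be compared across models, e.g. across polynomial orders M:

import numpy as np

def log_evidence(Phi, t, alpha, beta):
    # ln p(t|alpha,beta) = (M/2)ln alpha + (N/2)ln beta - E(m_N)
    #                      - (1/2)ln|A| - (N/2)ln(2*pi)
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(9)
Phi = rng.normal(size=(30, 5))
t = rng.normal(size=30)
print(log_evidence(Phi, t, alpha=1.0, beta=10.0))   # compare across candidate models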
Maximizing the Evidence Function (1)
To maximize ln p(t | α, β) with respect to α and β, we define the eigenvector equation
  ( β Φ^T Φ ) u_i = λ_i u_i.
Thus
  A = α I + β Φ^T Φ
has eigenvalues λ_i + α.
Maximizing the Evidence Function (2)
We can now differentiate ln p(t | α, β) with respect to α and β, and set the results to zero, to get
  α = γ / ( m_N^T m_N ),
  1/β = ( 1 / (N − γ) ) Σ_{n=1}^{N} ( t_n − m_N^T φ(x_n) )²,
where
  γ = Σ_i λ_i / ( α + λ_i )
depends on both α and β (recall E(m_N) = (β/2) || t − Φ m_N ||² + (α/2) m_N^T m_N).
Effective Number of Parameters (1)
[Figure: contours of the likelihood and the prior in (w_1, w_2) space.]
w_1 is not well determined by the likelihood, while w_2 is well determined by the likelihood.
The effective number of parameters, γ, is the number of well-determined parameters.
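A minimal numpy sketch (mine, not from the slides; the data and starting values are illustrative) of the fixed-point re-estimation of α and β from the previous slides, which also computes the effective number of well-determined parameters γ:

import numpy as np

rng = np.random.default_rng(10)
Phi = rng.normal(size=(50, 6))
t = Phi @ rng.normal(size=6) + 0.2 * rng.normal(size=50)
N, M = Phi.shape
eig0 = np.linalg.eigvalsh(Phi.T @ Phi)      # eigenvalues of Phi^T Phi

alpha, beta = 1.0, 1.0                      # illustrative starting values
for _ in range(100):
    lam = beta * eig0                       # eigenvalues of beta * Phi^T Phi
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    gamma = np.sum(lam / (alpha + lam))     # effective number of parameters
    alpha = gamma / (m_N @ m_N)
    beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
print(alpha, beta, gamma)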
Effective Number of Parameters (2)
Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.
Effective Number of Parameters (3)
Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.
[Figure: test set error as α is varied.]
Effective Number of Parameters (4)
Example: sinusoidal data, 9 Gaussian basis functions, β = 11.1.
α = 0  →  γ = M = 10: most λ_i > α, so the w_i are large (well determined by the data).
α = ∞  →  γ = 0: most λ_i < α, so the w_i are small (driven towards zero by the prior).
Effective Number of Parameters (5)
In the limit N ≫ M, γ = M, and we can consider using the easy-to-compute approximations
  α = M / ( 2 E_W(m_N) ),   β = N / ( 2 E_D(m_N) ).
Limitations of Fixed Basis Functions
• Using M basis functions along each dimension of a D-dimensional input space requires M^D basis functions: the curse of dimensionality.
• In later chapters, we shall see how we can get away with fewer basis functions by choosing them using the training data.
• High-dimensional data typically lies on a low-dimensional manifold.
• Perhaps we should learn a few basis functions rather than instantiate them exhaustively.