
STA 414/2104:

Machine Learning

Russ Salakhutdinov
Department of Computer Science
Department of Statistics
rsalakhu@cs.toronto.edu
http://www.cs.toronto.edu/~rsalakhu/

Lecture 6
Gaussian Processes
• So far, we have considered linear regression models of the form:

  y(x, w) = w^T φ(x),

where w is a vector of parameters and φ(x) is a vector of fixed nonlinear basis functions.

• A prior distribution over w induces a prior distribution over functions f(x, w).

• Given a training dataset, we compute the posterior distribution over w, which induces a posterior distribution over functions f(x, w).
Samples from the posterior
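• As an aside (not part of the original slides), the following NumPy sketch illustrates the idea: draws from the prior over w and from the posterior over w each induce draws of the function f(x, w). The Gaussian basis functions, the prior precision alpha, the noise precision beta, and the toy data are all assumptions made purely for illustration.

import numpy as np

def phi(x, centers, width=0.5):
    # Fixed nonlinear (Gaussian) basis functions: phi(x) has length M
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
centers = np.linspace(-3, 3, 9)          # M = 9 basis functions (assumed)
alpha, beta = 1.0, 25.0                  # prior and noise precisions (assumed)

xs = np.linspace(-3, 3, 200)
Phi_s = phi(xs, centers)                 # 200 x M matrix of basis functions on a grid

# Draws of w from the prior induce draws of f(x, w) = w^T phi(x)
w_prior = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=(5, len(centers)))
prior_functions = w_prior @ Phi_s.T      # 5 functions sampled from the prior

# A small training set induces a posterior over w, and hence over functions
X = np.array([-2.0, -0.5, 1.0, 2.5])
y = np.sin(X) + rng.normal(0, 1 / np.sqrt(beta), size=X.shape)
Phi = phi(X, centers)
S = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ y                 # posterior mean of w
w_post = rng.multivariate_normal(m, S, size=5)
posterior_functions = w_post @ Phi_s.T   # 5 functions sampled from the posterior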


Nonlinear regression
r the problem of nonlinear regression:
Gaussian Processes
nsider the problem of nonlinear regression:
nt to learn
• You a function
want withf error
to learn affunction bars
with error barsfrom data
from data D. D = {X, y}
u want to learn a function f with error bars from data D = {X, y}

y
y

xx
Gaussian process defines a distribution over functions p(f ) which can be used for
sian process
• A defines
Gaussian a distribution
process defines a over over
distribution functions
functions p(f
p(f) ) which
which can can be used
be
yesian regression:
n regression:
used for Bayesian regression:
p(f )p(D|f )
p(f |D) = p(f )p(D|f )
p(D)
p(f |D) =
p(D)
Gaussian Processes
• In the Gaussian process viewpoint, we define a prior probability distribution over functions directly.

• This may seem difficult: how can we define a distribution over the uncountably infinite space of functions?

• Insight: for a finite training set, we only need to consider the values of the functions at a discrete set of input values xn.

• Hence, in practice, we work in a finite-dimensional space.

• Many related models: in the geostatistics literature, GP regression is known as kriging. See also the book on GPs by Rasmussen & Williams (2006).
Linear Regression Revisited
• Consider the following linear model, defined in terms of M linear combinations of fixed basis functions:

  y(x, w) = w^T φ(x) = Σ_{m=1}^{M} w_m φ_m(x).

• We place a Gaussian prior over the model parameters:

  p(w) = N(w | 0, α^{-1} I).

• For any given fixed value of w, we have a corresponding linear function. A probability distribution over w therefore defines a probability distribution over functions.

• Given a dataset of inputs x_1, ..., x_N, we will denote the values of the function as

  y = (y(x_1), ..., y(x_N))^T = Φ w,

where Φ is the N by M design matrix with elements Φ_{nm} = φ_m(x_n), and w is the M by 1 vector of model parameters.
Linear Regression Revisited
• Observe that y = Φw is a linear combination of Gaussian random variables, and hence is itself Gaussian, with zero mean and covariance

  E[y y^T] = α^{-1} Φ Φ^T = K.

• Here, K is known as the Gram matrix, with elements

  K_{nm} = k(x_n, x_m) = α^{-1} φ(x_n)^T φ(x_m),

where k(x, x') is the kernel function.

• This model provides a particular example of a Gaussian process.
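• To make the correspondence concrete, the short sketch below (illustrative only, not from the slides; the Gaussian basis functions, the prior precision alpha, and the grid of inputs are assumptions) builds the Gram matrix K = α^{-1} Φ Φ^T and checks that it is symmetric and positive semi-definite.

import numpy as np

def phi(x, centers, width=0.5):
    # Vector of fixed Gaussian basis functions evaluated at each input in x
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

alpha = 2.0                              # prior precision on w (assumed)
centers = np.linspace(-3, 3, 12)         # M = 12 basis functions (assumed)
X = np.linspace(-3, 3, 50)               # N = 50 training inputs (assumed)

Phi = phi(X, centers)                    # N x M design matrix
K = (1.0 / alpha) * Phi @ Phi.T          # Gram matrix, K_nm = k(x_n, x_m)

# K should be symmetric and positive semi-definite
assert np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())   # >= 0 up to numerical error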


Gaussian Process
• A Gaussian process (GP) is a random function f: X → R such that, for any finite set of input points x_1, ..., x_N,

  (f(x_1), ..., f(x_N)) ~ N(μ, K),   with   μ_n = m(x_n)   and   K_{nm} = k(x_n, x_m),

where the parameters are the mean function m(x) and the covariance kernel k(x, x').

• Note that a random function is a stochastic process: it is a collection of random variables {f(x)}, one for each possible value x ∈ X (see Rasmussen and Williams, 2006).

• Key point about Gaussian processes: given a dataset x_1, ..., x_N, the marginal distribution over f(x_1), ..., f(x_N) is completely specified by the second-order statistics: the mean and covariance.
Gaussian Process
• In many applications, we will have no prior knowledge about the mean of f(x). By symmetry, we take it to be zero.

• The specification of a Gaussian process is then completed by specifying the covariance function, evaluated at any two input points xn and xm:

  E[f(x_n) f(x_m)] = k(x_n, x_m).

• One commonly used covariance function is the squared exponential:

  k(x_n, x_m) = exp( -||x_n - x_m||^2 / (2σ^2) ).

• The covariance (kernel) function is typically chosen to express the property that, for inputs xn and xm that are similar, the corresponding values f(xn) and f(xm) will be more strongly correlated than for dissimilar points.
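• A minimal sketch of drawing sample functions from a zero-mean GP prior with the squared-exponential covariance (not from the slides; the input grid, the length scale sigma, and the jitter term are assumptions for illustration):

import numpy as np

def sq_exp_kernel(X1, X2, sigma=1.0):
    # Squared-exponential covariance k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / sigma ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(0, 10, 200)                  # input grid (assumed)
K = sq_exp_kernel(xs, xs, sigma=1.0)
K += 1e-8 * np.eye(len(xs))                   # jitter for numerical stability

# Zero mean (no prior knowledge about the mean), covariance K
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
# Each row of `samples` is one function drawn from the GP prior, evaluated on xs.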
Visualizing Draws from GPs
• A way of visualizing draws from a 2-D Gaussian: plot the two components (f_1, f_2) of each draw against the input positions x_1 and x_2.

• It is then easy to show three draws from a 6-D Gaussian, plotted against x_1, ..., x_6.

Slide Credit: Iain Murray
Building Large Gaussians
• Three draws from a 25-D Gaussian, plotted against the input positions x.

• To generate these, the mean was set to zero: zeros(25,1).

• The covariances were set using a kernel function: Σ_ij = k(x_i, x_j).

• The x's are the positions at which the tics are planted on the axis.

• We need kernels k that ensure Σ is always positive semi-definite.

• We can also visualize draws from a GP by iteratively sampling f(x_n) | f(x_1), ..., f(x_{n-1}) on a sequence of input points x_1, x_2, ..., x_n.

Slide Credit: Iain Murray
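• A minimal sketch of the iterative view (not from the slides; it reuses the squared-exponential kernel and assumes 25 input positions): each new value f(x_n) is drawn from the Gaussian conditional given the values already sampled.

import numpy as np

def kern(a, b, sigma=1.0):
    # Squared-exponential kernel on scalar inputs
    return np.exp(-0.5 * (a - b) ** 2 / sigma ** 2)

rng = np.random.default_rng(1)
xs = np.linspace(0, 5, 25)                 # 25 input positions (assumed)
f = np.zeros(0)                            # values sampled so far

for n, xn in enumerate(xs):
    k_nn = kern(xn, xn)
    if n == 0:
        mean, var = 0.0, k_nn
    else:
        K = kern(xs[:n, None], xs[None, :n]) + 1e-8 * np.eye(n)
        k_vec = kern(xs[:n], xn)
        w = np.linalg.solve(K, k_vec)
        mean = w @ f                       # E[f(x_n) | f(x_1), ..., f(x_{n-1})]
        var = k_nn - w @ k_vec             # Var[f(x_n) | f(x_1), ..., f(x_{n-1})]
    f = np.append(f, rng.normal(mean, np.sqrt(max(var, 0.0))))
# `f` is one draw from the 25-D Gaussian, built one conditional at a time.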
Samples from GPs
• Left: samples drawn using the squared-exponential kernel.
• Right: samples drawn using the exponential kernel, which corresponds to the Ornstein-Uhlenbeck process that describes Brownian motion.
GPs for Regression
• We need to account for noise on the observed target values:

  t_n = f_n + ε_n,

where fn = f(xn), and εn is an independent random noise variable. We will assume Gaussian noise:

  p(t_n | f_n) = N(t_n | f_n, β^{-1}).

• Given a dataset of inputs x_1, ..., x_N and corresponding target values t = (t_1, ..., t_N)^T, the conditional takes the form:

  p(t | f) = N(t | f, β^{-1} I_N).

• From the definition of a Gaussian process, the marginal distribution p(f) is given by the Gaussian:

  p(f) = N(f | 0, K).
Illustration
• Illustration of sampling of targets {tn} from a Gaussian process.

• The blue curve shows a sample from a GP prior over f.

• The red points show the values of fn, obtained by evaluating the function at a set of input values {xn}.

• The green points show the corresponding values of {tn}, obtained by adding independent Gaussian noise to each fn.
Marginal Distribution
• The marginal distribution p(t), conditioned on the set of inputs X, can be obtained by integrating over f:

  p(t) = ∫ p(t | f) p(f) df = N(t | 0, C),

where the covariance matrix C has elements:

  C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}.

• The two Gaussian sources of randomness, one associated with f(x) and the other with the noise, are independent, and so their covariances add.
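• A short sketch of this construction (illustrative only; it reuses the squared-exponential kernel and assumes a noise precision beta): the noisy targets can be drawn from N(0, C) with C = K + β^{-1} I.

import numpy as np

def sq_exp_kernel(X1, X2, sigma=1.0):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / sigma ** 2)

rng = np.random.default_rng(2)
X = np.linspace(0, 5, 40)                 # training inputs (assumed)
beta = 25.0                               # noise precision (assumed)

K = sq_exp_kernel(X, X)                   # covariance of the latent function f
C = K + (1.0 / beta) * np.eye(len(X))     # covariance of the targets t

f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))
t = f + rng.normal(0.0, 1.0 / np.sqrt(beta), size=len(X))
# Equivalently, t could be drawn directly from N(0, C):
t_direct = rng.multivariate_normal(np.zeros(len(X)), C)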
Covariance Function
• One widely used covariance (kernel) function for GP regression is given by the squared exponential plus constant and linear terms:

  k(x_n, x_m) = θ_0 exp( -(θ_1 / 2) ||x_n - x_m||^2 ) + θ_2 + θ_3 x_n^T x_m.

• Note that the last term corresponds to a parametric model that is a linear function of the input variables.
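• As an illustration (the parameter values and the toy inputs are assumptions, not from the slides), this kernel can be written directly in NumPy:

import numpy as np

def covariance(Xn, Xm, theta0=1.0, theta1=4.0, theta2=0.0, theta3=0.0):
    # Squared exponential plus constant and linear terms:
    # k(x_n, x_m) = theta0 * exp(-theta1/2 * ||x_n - x_m||^2) + theta2 + theta3 * x_n^T x_m
    d2 = np.sum((Xn[:, None, :] - Xm[None, :, :]) ** 2, axis=-1)
    return theta0 * np.exp(-0.5 * theta1 * d2) + theta2 + theta3 * (Xn @ Xm.T)

X = np.random.default_rng(0).normal(size=(5, 2))   # 5 points in 2-D (assumed)
K = covariance(X, X)                               # 5 x 5 kernel matrix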
Samples from GPs with different K(x, x')
• Samples from GPs with different covariance functions: each panel shows sample functions f(x) drawn over the input range using a different covariance function K(x, x').
Prediction
• Suppose we are given a dataset of inputs x_1, ..., x_N with target values t = (t_1, ..., t_N)^T.

• Our goal is to predict tN+1 for a new input vector xN+1.

• Note that the joint distribution over t and tN+1 is given by:

  p(t, t_{N+1}) = N( 0, C_{N+1} ),   C_{N+1} = [[C_N, k], [k^T, c]]  (block form),

where CN is the N by N matrix with elements:

  C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm},

c is the scalar:

  c = k(x_{N+1}, x_{N+1}) + β^{-1},

and k is the N by 1 vector with elements k(xn, xN+1).
Prediction
• Given the joint Gaussian distribution over t and tN+1 from the previous slide, the conditional distribution is also Gaussian:

  p(t_{N+1} | t) = N( t_{N+1} | m(x_{N+1}), σ^2(x_{N+1}) ),

with mean and variance given by the key results that define GP regression:

  m(x_{N+1}) = k^T C_N^{-1} t,
  σ^2(x_{N+1}) = c - k^T C_N^{-1} k.

• The subtracted term k^T C_N^{-1} k is positive: hence the reduction in uncertainty relative to the prior variance c.
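• Putting the pieces together, here is a compact GP regression sketch (illustrative only; the kernel, the noise precision, and the sinusoidal toy data are assumptions) that computes the predictive mean and variance at new inputs.

import numpy as np

def sq_exp_kernel(X1, X2, sigma=1.0):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / sigma ** 2)

rng = np.random.default_rng(3)
beta = 25.0                                          # noise precision (assumed)
X = rng.uniform(0, 2 * np.pi, size=25)               # training inputs (assumed)
t = np.sin(X) + rng.normal(0, 1 / np.sqrt(beta), 25) # noisy sinusoidal targets

C_N = sq_exp_kernel(X, X) + (1.0 / beta) * np.eye(len(X))
C_inv_t = np.linalg.solve(C_N, t)

X_new = np.linspace(0, 2 * np.pi, 100)               # test inputs
k = sq_exp_kernel(X, X_new)                          # N x N* matrix of k(x_n, x_new)
c = 1.0 + 1.0 / beta                                 # k(x_new, x_new) + 1/beta

mean = k.T @ C_inv_t                                 # predictive mean, k^T C_N^{-1} t
var = c - np.sum(k * np.linalg.solve(C_N, k), axis=0)   # c - k^T C_N^{-1} k
# `mean` +/- 2*sqrt(var) gives the shaded predictive band shown in the slides.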
Illustration
• Illustration of GP regression applied to the sinusoidal data set.

• The green curve shows the true function.

• The blue data points are samples from the true function plus additive Gaussian noise.

• The red curve shows the mean of the GP predictive distribution, with the shaded region corresponding to +/- 2 standard deviations.

• Restriction on the kernel function: the covariance matrix CN, with elements C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}, must be positive definite.
Mean of Predictive Distribution
• Note that the mean of the predictive distribution, m(x_{N+1}) = k^T C_N^{-1} t, can be written as a function of xN+1, namely a linear combination of kernel functions centered on the training points:

  m(x_{N+1}) = Σ_{n=1}^{N} a_n k(x_n, x_{N+1}),

where an is the nth component of C_N^{-1} t.

• Also, note that the mean and variance of the predictive distribution both depend on xN+1.

• Remember: k is the N by 1 vector with elements k(xn, xN+1).
Prediction using GPs with different K(x, x')
• A sample from the prior for each covariance function.

• Corresponding predictions: the mean with two standard deviations.

(Figures generated with gpdemo.)
Computational Complexity
• The central computation in using GPs involves the inversion of an N by N matrix CN, which is of order O(N^3).

• By contrast, in the basis function model, we have to invert a matrix SN of size M by M (where M is the number of basis functions), which is of order O(M^3).

• If the number M of basis functions is smaller than the number N of data points, then it is computationally more efficient to work in the basis function framework (see the first few slides).

• The advantage of GPs is that we can consider covariance functions that can only be expressed in terms of an infinite number of basis functions.
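• As an implementation note (not from the slides): in practice the O(N^3) cost is usually paid once via a Cholesky factorization rather than an explicit matrix inverse; each subsequent solve is then O(N^2). A minimal sketch, assuming SciPy is available:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict_mean(C_N, t, k):
    # Factor C_N once (O(N^3)); reuse the factorization for O(N^2) solves.
    chol = cho_factor(C_N, lower=True)
    a = cho_solve(chol, t)           # a = C_N^{-1} t
    return k.T @ a                   # predictive mean, k^T C_N^{-1} t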
Learning the Hyperparameters
• The predictions of a GP regression model will depend on the choice of the covariance function.

• Instead of fixing the covariance function, we may prefer to use a parametric family of functions and infer the parameter values from the data.

• These parameters may govern the length scale of the correlations or the precision of the noise model, and correspond to the hyperparameters in a standard parametric model.

• How can we infer the values of these hyperparameters?
Learning the Hyperparameters
• We can compute the marginal likelihood function, where θ denotes the hyperparameters of the GP model:

  ln p(t | θ) = -1/2 ln|C_N| - 1/2 t^T C_N^{-1} t - (N/2) ln(2π).

• One option is to maximize the log of the marginal likelihood with respect to θ.

• This corresponds to type II maximum likelihood, or empirical Bayes.

• The maximization can be performed using gradient-based optimization techniques, such as conjugate gradients. The gradients take the form:

  ∂/∂θ_i ln p(t | θ) = -1/2 Tr( C_N^{-1} ∂C_N/∂θ_i ) + 1/2 t^T C_N^{-1} (∂C_N/∂θ_i) C_N^{-1} t.
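• A hedged sketch of type II maximum likelihood (the kernel parameterization, the toy data, and the use of scipy.optimize with numerical gradients are assumptions for illustration): compute ln p(t | θ) and maximize it numerically.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.uniform(0, 5, size=30)
t = np.sin(X) + rng.normal(0, 0.2, size=30)

def neg_log_marginal_likelihood(log_theta):
    # theta = (length scale sigma, noise precision beta), optimized in log space
    sigma, beta = np.exp(log_theta)
    C = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / sigma ** 2)
    C += (1.0 / beta) * np.eye(len(X))
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    # ln p(t | theta) = -1/2 ln|C| - 1/2 t^T C^{-1} t - N/2 ln(2*pi)
    return 0.5 * logdet + 0.5 * quad + 0.5 * len(X) * np.log(2 * np.pi)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 10.0]), method="L-BFGS-B")
sigma_hat, beta_hat = np.exp(res.x)    # type II ML estimates of the hyperparameters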
Learning the Hyperparameters
• Because ln p(t | θ) will in general be a non-convex function, it can have multiple maxima.

• In the fully Bayesian approach, we can introduce a prior p(θ) and infer the posterior p(θ | t).

• In general, the posterior will not have a closed-form solution, so we must resort to approximations (typically MCMC).

• Noise: we have assumed that the additive noise, governed by β, is constant.

• For some models, known as heteroscedastic, the noise variance itself depends on x (e.g. by introducing a second GP that models log β(x)).
Automatic Relevance Determination
• How can we detect input variables that have very little effect on the predictive distribution (irrelevant inputs)?

• Consider a GP with a 2-D input space x = (x1, x2) and the following covariance function:

  k(x, x') = θ_0 exp( -1/2 Σ_{i=1}^{2} η_i (x_i - x_i')^2 ).

• As ηi becomes small, the function becomes insensitive to the corresponding value of xi (input xi becomes less relevant).
Automatic Relevance Determination
• The ARD framework can be easily incorporated into the exponential-quadratic kernel:

  k(x_n, x_m) = θ_0 exp( -1/2 Σ_{i=1}^{D} η_i (x_{ni} - x_{mi})^2 ) + θ_2 + θ_3 Σ_{i=1}^{D} x_{ni} x_{mi},

where η_i controls the relevance of input dimension i, and D is the dimensionality of the input space.

• We can optimize these parameters by performing type II maximum likelihood (optimizing the marginal log-likelihood).

• The relative importance of different inputs can then be inferred from the data; a brief sketch follows below.
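• A brief sketch of the ARD kernel (the hyperparameter values and toy inputs are assumptions for illustration): each input dimension gets its own relevance parameter η_i.

import numpy as np

def ard_kernel(Xn, Xm, eta, theta0=1.0, theta2=0.0, theta3=0.0):
    # ARD exponential-quadratic kernel: a separate relevance eta[i] per input dimension
    d2 = np.sum(eta * (Xn[:, None, :] - Xm[None, :, :]) ** 2, axis=-1)
    return theta0 * np.exp(-0.5 * d2) + theta2 + theta3 * (Xn @ Xm.T)

X = np.random.default_rng(5).normal(size=(10, 3))   # 10 points, 3 input dims (assumed)
eta = np.array([1.0, 0.1, 1e-4])                    # third dimension nearly irrelevant
K = ard_kernel(X, X, eta)
# With eta[2] close to 0, changing x_3 barely changes k(x, x'): the GP ignores that input.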


Illustration
• Example: we have a dataset with 3-D inputs (x1, x2, x3). The target variables tn are sampled as follows:
  • Sample 100 values of x1 from a Gaussian, evaluate the function sin(2πx1), and add Gaussian noise.
  • Let x2 = x1, and add Gaussian noise.
  • Sample 100 values of x3 from an independent Gaussian distribution.
• Hence x1 is a good predictor of t, x2 is a noisier predictor of t, and x3 has only chance correlation with t.

• The plot displays η1 (red), η2 (green), and η3 (blue) as a function of the number of iterations when optimizing the marginal likelihood.
Classification with GPs
• Consider a two-class problem with targets t ∈ {0, 1}.
• Define a Gaussian process over a function f(x).
• Transform the function using the logistic sigmoid function:

  y(x) = σ(f(x)) = 1 / (1 + exp(-f(x))).

• Hence y(x) ∈ (0, 1).

[Figure: a GP sample f(x) and the corresponding sample y(x) transformed using the sigmoid function.]
Classification with GPs
• After the transformation, we obtain a non-Gaussian stochastic process over functions y(x).

• The probability distribution over t is given by the Bernoulli distribution:

  p(t | f) = σ(f)^t (1 - σ(f))^{1-t}.

[Figure: transformed sample using the sigmoid function.]
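• A small sketch of this construction (the input grid and the squared-exponential kernel are assumptions for illustration): draw f from a GP prior, squash it through the sigmoid, then sample Bernoulli targets.

import numpy as np

rng = np.random.default_rng(6)
xs = np.linspace(0, 5, 100)
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2) + 1e-8 * np.eye(len(xs))

f = rng.multivariate_normal(np.zeros(len(xs)), K)   # sample from the GP prior
y = 1.0 / (1.0 + np.exp(-f))                        # y(x) = sigmoid(f(x)), in (0, 1)
t = rng.binomial(1, y)                              # Bernoulli targets t in {0, 1}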
Classification with GPs
• Suppose we are given a dataset of inputs x_1, ..., x_N with target values t_N = (t_1, ..., t_N)^T.

• Our goal is to predict tN+1 for a new input vector xN+1.

• The predictive distribution is given by:

  p(t_{N+1} = 1 | t_N) = ∫ σ(f_{N+1}) p(f_{N+1} | t_N) df_{N+1},

where the posterior p(f_{N+1} | t_N) is itself intractable.

• This integral is analytically intractable. We can resort to MCMC by approximately sampling from the posterior and performing Monte Carlo integration:

  p(t_{N+1} = 1 | t_N) ≈ (1/M) Σ_{m=1}^{M} σ( f_{N+1}^{(m)} ),

where f_{N+1}^{(m)} are samples from the posterior p(f_{N+1} | t_N).
Approximations
• Another option: approximate the posterior p(f_{N+1} | t_N) by a Gaussian.

• We can then use the approximate formula for the convolution of a logistic sigmoid and a Gaussian distribution.

• There are three different approaches to obtaining a Gaussian approximation:
  • Variational inference
  • Expectation propagation
  • Laplace approximation
Laplace Approximation
• We seek a Gaussian approximation to the posterior over fN. Using Bayes' rule we have:

  p(f_N | t_N) ∝ p(f_N) p(t_N | f_N),

where p(fN) is given by a zero-mean GP with covariance matrix CN (easy to compute), and the data term is:

  p(t_N | f_N) = Π_{n=1}^{N} σ(f_n)^{t_n} (1 - σ(f_n))^{1 - t_n}.

• We obtain the Laplace approximation by Taylor expanding the log of the posterior, ln p(fN | tN), around its mode, which yields a Gaussian approximation centered at the mode.
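• The mode of the posterior can be found with a short Newton iteration. The sketch below is a minimal illustration of the standard Laplace treatment for GP classification (not from the slides); it assumes a precomputed covariance matrix K = C_N for the latent function values and binary targets t.

import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def laplace_approx(K, t, n_iters=20):
    # Newton iterations for the mode of p(f | t) ∝ N(f | 0, K) Π σ(f_n)^t_n (1 - σ(f_n))^(1 - t_n)
    N = len(t)
    f = np.zeros(N)
    for _ in range(n_iters):
        pi = sigmoid(f)
        W = pi * (1.0 - pi)              # negative Hessian of the log-likelihood (diagonal)
        grad = t - pi                    # gradient of the log-likelihood
        # Newton update: f <- (K^{-1} + W)^{-1} (W f + grad) = K (I + W K)^{-1} (W f + grad)
        B = np.eye(N) + W[:, None] * K   # B = I + W K
        f = K @ np.linalg.solve(B, W * f + grad)
    pi = sigmoid(f)
    W = pi * (1.0 - pi)
    # Gaussian approximation: mean = mode f, covariance = (K^{-1} + W)^{-1} = (I + K W)^{-1} K
    cov = np.linalg.solve(np.eye(N) + K * W[None, :], K)
    return f, cov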
Classification Results
• Left: the optimal decision boundary from the true distribution (green) and the decision boundary from the GP classifier (black).
• Right: the predictive posterior probability together with the GP decision boundary.
