
STA 414/2104:

Machine Learning

Russ Salakhutdinov
Department of Computer Science
Department of Statistics
rsalakhu@cs.toronto.edu
http://www.cs.toronto.edu/~rsalakhu/

Lecture 6
Gaussian Processes
• So far, we have considered linear regression models of the form:

  y(x, w) = w^T φ(x),

where w is a vector of parameters and φ(x) is a vector of fixed nonlinear basis functions.

• A prior distribution over w induces a prior distribution over functions f(x, w).

• Given a training dataset, we compute the posterior distribution over w, which induces a posterior distribution over functions f(x, w).
Samples from the posterior
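• As an aside (not part of the original slides), the following NumPy sketch illustrates the idea: draws from the prior over w and from the posterior over w each induce draws of the function f(x, w). The Gaussian basis functions, the prior precision alpha, the noise precision beta, and the toy data are all assumptions made purely for illustration.

import numpy as np

def phi(x, centers, width=0.5):
    # Fixed nonlinear (Gaussian) basis functions: phi(x) has length M
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
centers = np.linspace(-3, 3, 9)          # M = 9 basis functions (assumed)
alpha, beta = 1.0, 25.0                  # prior and noise precisions (assumed)

xs = np.linspace(-3, 3, 200)
Phi_s = phi(xs, centers)                 # 200 x M matrix of basis functions on a grid

# Draws of w from the prior induce draws of f(x, w) = w^T phi(x)
w_prior = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=(5, len(centers)))
prior_functions = w_prior @ Phi_s.T      # 5 functions sampled from the prior

# A small training set induces a posterior over w, and hence over functions
X = np.array([-2.0, -0.5, 1.0, 2.5])
y = np.sin(X) + rng.normal(0, 1 / np.sqrt(beta), size=X.shape)
Phi = phi(X, centers)
S = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ y                 # posterior mean of w
w_post = rng.multivariate_normal(m, S, size=5)
posterior_functions = w_post @ Phi_s.T   # 5 functions sampled from the posterior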


Nonlinear regression
r the problem of nonlinear regression:
Gaussian Processes
nsider the problem of nonlinear regression:
nt to learn
• You a function
want withf error
to learn affunction bars
with error barsfrom data
from data D. D = {X, y}
u want to learn a function f with error bars from data D = {X, y}

y
y

xx
Gaussian process defines a distribution over functions p(f ) which can be used for
sian process
• A defines
Gaussian a distribution
process defines a over over
distribution functions
functions p(f
p(f) ) which
which can can be used
be
yesian regression:
n regression:
used for Bayesian regression:
p(f )p(D|f )
p(f |D) = p(f )p(D|f )
p(D)
p(f |D) =
p(D)
Gaussian Processes
• In the Gaussian process viewpoint, we define a prior probability distribution over functions directly.

• This may seem difficult: how can we define a distribution over the uncountably infinite space of functions?

• Insight: for a finite training set, we only need to consider the values of the functions at a discrete set of input values xn.

• Hence, in practice, we work in a finite-dimensional space.

• Many related models: in the geostatistics literature, GP regression is known as kriging. See also the book on GPs by Rasmussen & Williams (2006).
Linear Regression Revisited
• Consider the following linear model, defined in terms of M linear combinations of fixed basis functions:

  y(x, w) = w^T φ(x) = Σ_{m=1}^{M} w_m φ_m(x).

• We place a Gaussian prior over the model parameters:

  p(w) = N(w | 0, α^{-1} I).

• For any given fixed value of w, we have a corresponding linear function. A probability distribution over w therefore defines a probability distribution over functions.

• Given a dataset of inputs x_1, ..., x_N, we will denote the values of the function as

  y = (y(x_1), ..., y(x_N))^T = Φ w,

where Φ is the N by M design matrix with elements Φ_{nm} = φ_m(x_n), and w is the M by 1 vector of model parameters.
Linear Regression Revisited
• Observe that y = Φw is a linear combination of Gaussian random variables, and hence is itself Gaussian, with zero mean and covariance

  E[y y^T] = α^{-1} Φ Φ^T = K.

• Here, K is known as the Gram matrix, with elements

  K_{nm} = k(x_n, x_m) = α^{-1} φ(x_n)^T φ(x_m),

where k(x, x') is the kernel function.

• This model provides a particular example of a Gaussian process.
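• To make the correspondence concrete, the short sketch below (illustrative only, not from the slides; the Gaussian basis functions, the prior precision alpha, and the grid of inputs are assumptions) builds the Gram matrix K = α^{-1} Φ Φ^T and checks that it is symmetric and positive semi-definite.

import numpy as np

def phi(x, centers, width=0.5):
    # Vector of fixed Gaussian basis functions evaluated at each input in x
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

alpha = 2.0                              # prior precision on w (assumed)
centers = np.linspace(-3, 3, 12)         # M = 12 basis functions (assumed)
X = np.linspace(-3, 3, 50)               # N = 50 training inputs (assumed)

Phi = phi(X, centers)                    # N x M design matrix
K = (1.0 / alpha) * Phi @ Phi.T          # Gram matrix, K_nm = k(x_n, x_m)

# K should be symmetric and positive semi-definite
assert np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())   # >= 0 up to numerical error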


Gaussian Process
• A Gaussian process (GP) is a random function f: X → R such that, for any finite set of input points x_1, ..., x_N,

  (f(x_1), ..., f(x_N)) ~ N(μ, K),   with   μ_n = m(x_n)   and   K_{nm} = k(x_n, x_m),

where the parameters are the mean function m(x) and the covariance kernel k(x, x').

• Note that a random function is a stochastic process: it is a collection of random variables {f(x)}, one for each possible value x ∈ X (see Rasmussen and Williams, 2006).

• Key point about Gaussian processes: given a dataset x_1, ..., x_N, the marginal distribution over f(x_1), ..., f(x_N) is completely specified by the second-order statistics: the mean and covariance.
Gaussian Process
• In many applications, we will have no prior knowledge about the mean of f(x). By symmetry, we take it to be zero.

• The specification of a Gaussian process is then completed by specifying the covariance function, evaluated at any two input points xn and xm:

  E[f(x_n) f(x_m)] = k(x_n, x_m).

• One commonly used covariance function is the squared exponential:

  k(x_n, x_m) = exp( -||x_n - x_m||^2 / (2σ^2) ).

• The covariance (kernel) function is typically chosen to express the property that, for inputs xn and xm that are similar, the corresponding values f(xn) and f(xm) will be more strongly correlated than for dissimilar points.
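• A minimal sketch of drawing sample functions from a zero-mean GP prior with the squared-exponential covariance (not from the slides; the input grid, the length scale sigma, and the jitter term are assumptions for illustration):

import numpy as np

def sq_exp_kernel(X1, X2, sigma=1.0):
    # Squared-exponential covariance k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / sigma ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(0, 10, 200)                  # input grid (assumed)
K = sq_exp_kernel(xs, xs, sigma=1.0)
K += 1e-8 * np.eye(len(xs))                   # jitter for numerical stability

# Zero mean (no prior knowledge about the mean), covariance K
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
# Each row of `samples` is one function drawn from the GP prior, evaluated on xs.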
Visualizing Draws from GPs
• A way of visualizing draws from a 2-D Gaussian: plot the two components (f_1, f_2) of each draw against the input positions x_1 and x_2.

• It is then easy to show three draws from a 6-D Gaussian, plotted against x_1, ..., x_6.

Slide Credit: Iain Murray
Building Large Gaussians
• Three draws from a 25-D Gaussian, plotted against the input positions x.

• To generate these, the mean was set to zero: zeros(25,1).

• The covariances were set using a kernel function: Σ_ij = k(x_i, x_j).

• The x's are the positions at which the tics are planted on the axis.

• We need kernels k that ensure Σ is always positive semi-definite.

• We can also visualize draws from a GP by iteratively sampling f(x_n) | f(x_1), ..., f(x_{n-1}) on a sequence of input points x_1, x_2, ..., x_n.

Slide Credit: Iain Murray
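• A minimal sketch of the iterative view (not from the slides; it reuses the squared-exponential kernel and assumes 25 input positions): each new value f(x_n) is drawn from the Gaussian conditional given the values already sampled.

import numpy as np

def kern(a, b, sigma=1.0):
    # Squared-exponential kernel on scalar inputs
    return np.exp(-0.5 * (a - b) ** 2 / sigma ** 2)

rng = np.random.default_rng(1)
xs = np.linspace(0, 5, 25)                 # 25 input positions (assumed)
f = np.zeros(0)                            # values sampled so far

for n, xn in enumerate(xs):
    k_nn = kern(xn, xn)
    if n == 0:
        mean, var = 0.0, k_nn
    else:
        K = kern(xs[:n, None], xs[None, :n]) + 1e-8 * np.eye(n)
        k_vec = kern(xs[:n], xn)
        w = np.linalg.solve(K, k_vec)
        mean = w @ f                       # E[f(x_n) | f(x_1), ..., f(x_{n-1})]
        var = k_nn - w @ k_vec             # Var[f(x_n) | f(x_1), ..., f(x_{n-1})]
    f = np.append(f, rng.normal(mean, np.sqrt(max(var, 0.0))))
# `f` is one draw from the 25-D Gaussian, built one conditional at a time.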
Samples from GPs
• Left: samples drawn using the squared-exponential kernel.
• Right: samples drawn using the exponential kernel, which corresponds to the Ornstein-Uhlenbeck process that describes Brownian motion.
GPs for Regression
• We need to account for noise on the observed target values:

  t_n = f_n + ε_n,

where fn = f(xn), and εn is an independent random noise variable. We will assume Gaussian noise:

  p(t_n | f_n) = N(t_n | f_n, β^{-1}).

• Given a dataset of inputs x_1, ..., x_N and corresponding target values t = (t_1, ..., t_N)^T, the conditional takes the form:

  p(t | f) = N(t | f, β^{-1} I_N).

• From the definition of a Gaussian process, the marginal distribution p(f) is given by the Gaussian:

  p(f) = N(f | 0, K).
Illustration
• Illustration of sampling of targets {tn} from a Gaussian process.

• The blue curve shows a sample from a GP prior over f.

• The red points show the values of fn, obtained by evaluating the function at a set of input values {xn}.

• The green points show the corresponding values of {tn}, obtained by adding independent Gaussian noise to each fn.
Marginal Distribution
• The marginal distribution p(t), conditioned on the set of inputs X, can be obtained by integrating over f:

  p(t) = ∫ p(t | f) p(f) df = N(t | 0, C),

where the covariance matrix C has elements:

  C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}.

• The two Gaussian sources of randomness, one associated with f(x) and the other with the noise, are independent, and so their covariances add.
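• A short sketch of this construction (illustrative only; it reuses the squared-exponential kernel and assumes a noise precision beta): the noisy targets can be drawn from N(0, C) with C = K + β^{-1} I.

import numpy as np

def sq_exp_kernel(X1, X2, sigma=1.0):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / sigma ** 2)

rng = np.random.default_rng(2)
X = np.linspace(0, 5, 40)                 # training inputs (assumed)
beta = 25.0                               # noise precision (assumed)

K = sq_exp_kernel(X, X)                   # covariance of the latent function f
C = K + (1.0 / beta) * np.eye(len(X))     # covariance of the targets t

f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))
t = f + rng.normal(0.0, 1.0 / np.sqrt(beta), size=len(X))
# Equivalently, t could be drawn directly from N(0, C):
t_direct = rng.multivariate_normal(np.zeros(len(X)), C)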
Covariance Function
• One widely used covariance (kernel) function for GP regression is given by the squared exponential plus constant and linear terms:

  k(x_n, x_m) = θ_0 exp( -(θ_1 / 2) ||x_n - x_m||^2 ) + θ_2 + θ_3 x_n^T x_m.

• Note that the last term corresponds to a parametric model that is a linear function of the input variables.
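• As an illustration (the parameter values and the toy inputs are assumptions, not from the slides), this kernel can be written directly in NumPy:

import numpy as np

def covariance(Xn, Xm, theta0=1.0, theta1=4.0, theta2=0.0, theta3=0.0):
    # Squared exponential plus constant and linear terms:
    # k(x_n, x_m) = theta0 * exp(-theta1/2 * ||x_n - x_m||^2) + theta2 + theta3 * x_n^T x_m
    d2 = np.sum((Xn[:, None, :] - Xm[None, :, :]) ** 2, axis=-1)
    return theta0 * np.exp(-0.5 * theta1 * d2) + theta2 + theta3 * (Xn @ Xm.T)

X = np.random.default_rng(0).normal(size=(5, 2))   # 5 points in 2-D (assumed)
K = covariance(X, X)                               # 5 x 5 kernel matrix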
Samples from GPs with different K(x, x')
• Samples from GPs with different covariance functions: each panel shows sample functions f(x) drawn over the input range using a different covariance function K(x, x').
Prediction
• Suppose we are given a dataset of inputs x_1, ..., x_N with target values t = (t_1, ..., t_N)^T.

• Our goal is to predict tN+1 for a new input vector xN+1.

• Note that the joint distribution over t and tN+1 is given by:

  p(t, t_{N+1}) = N( 0, C_{N+1} ),   C_{N+1} = [[C_N, k], [k^T, c]]  (block form),

where CN is the N by N matrix with elements:

  C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm},

c is the scalar:

  c = k(x_{N+1}, x_{N+1}) + β^{-1},

and k is the N by 1 vector with elements k(xn, xN+1).
Prediction
• Given the joint Gaussian distribution over t and tN+1 from the previous slide, the conditional distribution is also Gaussian:

  p(t_{N+1} | t) = N( t_{N+1} | m(x_{N+1}), σ^2(x_{N+1}) ),

with mean and variance given by the key results that define GP regression:

  m(x_{N+1}) = k^T C_N^{-1} t,
  σ^2(x_{N+1}) = c - k^T C_N^{-1} k.

• The subtracted term k^T C_N^{-1} k is positive: hence the reduction in uncertainty relative to the prior variance c.
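• Putting the pieces together, here is a compact GP regression sketch (illustrative only; the kernel, the noise precision, and the sinusoidal toy data are assumptions) that computes the predictive mean and variance at new inputs.

import numpy as np

def sq_exp_kernel(X1, X2, sigma=1.0):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / sigma ** 2)

rng = np.random.default_rng(3)
beta = 25.0                                          # noise precision (assumed)
X = rng.uniform(0, 2 * np.pi, size=25)               # training inputs (assumed)
t = np.sin(X) + rng.normal(0, 1 / np.sqrt(beta), 25) # noisy sinusoidal targets

C_N = sq_exp_kernel(X, X) + (1.0 / beta) * np.eye(len(X))
C_inv_t = np.linalg.solve(C_N, t)

X_new = np.linspace(0, 2 * np.pi, 100)               # test inputs
k = sq_exp_kernel(X, X_new)                          # N x N* matrix of k(x_n, x_new)
c = 1.0 + 1.0 / beta                                 # k(x_new, x_new) + 1/beta

mean = k.T @ C_inv_t                                 # predictive mean, k^T C_N^{-1} t
var = c - np.sum(k * np.linalg.solve(C_N, k), axis=0)   # c - k^T C_N^{-1} k
# `mean` +/- 2*sqrt(var) gives the shaded predictive band shown in the slides.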
Illustration
• Illustration of GP regression applied to the sinusoidal data set.

• The green curve shows the true function.

• The blue data points are samples from the true function plus additive Gaussian noise.

• The red curve shows the mean of the GP predictive distribution, with the shaded region corresponding to +/- 2 standard deviations.

• Restriction on the kernel function: the covariance matrix CN, with elements C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}, must be positive definite.
Mean of Predictive Distribution
• Note that the mean of the predictive distribution, m(x_{N+1}) = k^T C_N^{-1} t, can be written as a function of xN+1, namely a linear combination of kernel functions centered on the training points:

  m(x_{N+1}) = Σ_{n=1}^{N} a_n k(x_n, x_{N+1}),

where an is the nth component of C_N^{-1} t.

• Also, note that the mean and variance of the predictive distribution both depend on xN+1.

• Remember: k is the N by 1 vector with elements k(xn, xN+1).
Prediction using GPs with different K(x, x')
• A sample from the prior for each covariance function.

• Corresponding predictions: the mean with two standard deviations.

(Figures generated with gpdemo.)
Computational Complexity
• The central computation in using GPs involves the inversion of an N by N matrix CN, which is of order O(N^3).

• By contrast, in the basis function model, we have to invert a matrix SN of size M by M (where M is the number of basis functions), which is of order O(M^3).

• If the number M of basis functions is smaller than the number N of data points, then it is computationally more efficient to work in the basis function framework (see the first few slides).

• The advantage of GPs is that we can consider covariance functions that can only be expressed in terms of an infinite number of basis functions.
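• As an implementation note (not from the slides): in practice the O(N^3) cost is usually paid once via a Cholesky factorization rather than an explicit matrix inverse; each subsequent solve is then O(N^2). A minimal sketch, assuming SciPy is available:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict_mean(C_N, t, k):
    # Factor C_N once (O(N^3)); reuse the factorization for O(N^2) solves.
    chol = cho_factor(C_N, lower=True)
    a = cho_solve(chol, t)           # a = C_N^{-1} t
    return k.T @ a                   # predictive mean, k^T C_N^{-1} t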
Learning the Hyperparameters
• The predictions of a GP regression model will depend on the choice of the covariance function.

• Instead of fixing the covariance function, we may prefer to use a parametric family of functions and infer the parameter values from the data.

• These parameters may govern the length scale of the correlations or the precision of the noise model, and correspond to the hyperparameters in a standard parametric model.

• How can we infer the values of these hyperparameters?
Learning the Hyperparameters
• We can compute the marginal likelihood function, where θ denotes the hyperparameters of the GP model:

  ln p(t | θ) = -1/2 ln|C_N| - 1/2 t^T C_N^{-1} t - (N/2) ln(2π).

• One option is to maximize the log of the marginal likelihood with respect to θ.

• This corresponds to type II maximum likelihood, or empirical Bayes.

• The maximization can be performed using gradient-based optimization techniques, such as conjugate gradients. The gradients take the form:

  ∂/∂θ_i ln p(t | θ) = -1/2 Tr( C_N^{-1} ∂C_N/∂θ_i ) + 1/2 t^T C_N^{-1} (∂C_N/∂θ_i) C_N^{-1} t.
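• A hedged sketch of type II maximum likelihood (the kernel parameterization, the toy data, and the use of scipy.optimize with numerical gradients are assumptions for illustration): compute ln p(t | θ) and maximize it numerically.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.uniform(0, 5, size=30)
t = np.sin(X) + rng.normal(0, 0.2, size=30)

def neg_log_marginal_likelihood(log_theta):
    # theta = (length scale sigma, noise precision beta), optimized in log space
    sigma, beta = np.exp(log_theta)
    C = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / sigma ** 2)
    C += (1.0 / beta) * np.eye(len(X))
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    # ln p(t | theta) = -1/2 ln|C| - 1/2 t^T C^{-1} t - N/2 ln(2*pi)
    return 0.5 * logdet + 0.5 * quad + 0.5 * len(X) * np.log(2 * np.pi)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 10.0]), method="L-BFGS-B")
sigma_hat, beta_hat = np.exp(res.x)    # type II ML estimates of the hyperparameters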
Learning the Hyperparameters
• Because ln p(t | θ) will in general be a non-convex function, it can have multiple maxima.

• In the fully Bayesian approach, we can introduce a prior p(θ) and infer the posterior p(θ | t).

• In general, the posterior will not have a closed-form solution, so we must resort to approximations (typically MCMC).

• Noise: we have assumed that the additive noise, governed by β, is constant.

• For some models, known as heteroscedastic, the noise variance itself depends on x (e.g. by introducing a second GP that models log β(x)).
Automatic Relevance Determination
• How can we detect input variables that have very little effect on the predictive distribution (irrelevant inputs)?

• Consider a GP with a 2-D input space x = (x1, x2) and the following covariance function:

  k(x, x') = θ_0 exp( -1/2 Σ_{i=1}^{2} η_i (x_i - x_i')^2 ).

• As ηi becomes small, the function becomes insensitive to the corresponding value of xi (input xi becomes less relevant).
Automatic Relevance Determination
• The ARD framework can be easily incorporated into the exponential-quadratic kernel:

  k(x_n, x_m) = θ_0 exp( -1/2 Σ_{i=1}^{D} η_i (x_{ni} - x_{mi})^2 ) + θ_2 + θ_3 Σ_{i=1}^{D} x_{ni} x_{mi},

where η_i controls the relevance of input dimension i, and D is the dimensionality of the input space.

• We can optimize these parameters by performing type II maximum likelihood (optimizing the marginal log-likelihood).

• The relative importance of different inputs can then be inferred from the data; a brief sketch follows below.
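• A brief sketch of the ARD kernel (the hyperparameter values and toy inputs are assumptions for illustration): each input dimension gets its own relevance parameter η_i.

import numpy as np

def ard_kernel(Xn, Xm, eta, theta0=1.0, theta2=0.0, theta3=0.0):
    # ARD exponential-quadratic kernel: a separate relevance eta[i] per input dimension
    d2 = np.sum(eta * (Xn[:, None, :] - Xm[None, :, :]) ** 2, axis=-1)
    return theta0 * np.exp(-0.5 * d2) + theta2 + theta3 * (Xn @ Xm.T)

X = np.random.default_rng(5).normal(size=(10, 3))   # 10 points, 3 input dims (assumed)
eta = np.array([1.0, 0.1, 1e-4])                    # third dimension nearly irrelevant
K = ard_kernel(X, X, eta)
# With eta[2] close to 0, changing x_3 barely changes k(x, x'): the GP ignores that input.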


Illustration
• Example: we have a dataset with 3-D inputs (x1, x2, x3). The target variables tn are sampled as follows:
  • Sample 100 values of x1 from a Gaussian, evaluate the function sin(2πx1), and add Gaussian noise.
  • Let x2 = x1, and add Gaussian noise.
  • Sample 100 values of x3 from an independent Gaussian distribution.
• Hence x1 is a good predictor of t, x2 is a noisier predictor of t, and x3 has only chance correlation with t.

• The plot displays η1 (red), η2 (green), and η3 (blue) as a function of the number of iterations when optimizing the marginal likelihood.
Classification with GPs
• Consider a two-class problem with targets t ∈ {0, 1}.
• Define a Gaussian process over a function f(x).
• Transform the function using the logistic sigmoid function:

  y(x) = σ(f(x)) = 1 / (1 + exp(-f(x))).

• Hence y(x) ∈ (0, 1).

[Figure: a GP sample f(x) and the corresponding sample y(x) transformed using the sigmoid function.]
Classification with GPs
• After the transformation, we obtain a non-Gaussian stochastic process over functions y(x).

• The probability distribution over t is given by the Bernoulli distribution:

  p(t | f) = σ(f)^t (1 - σ(f))^{1-t}.

[Figure: transformed sample using the sigmoid function.]
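• A small sketch of this construction (the input grid and the squared-exponential kernel are assumptions for illustration): draw f from a GP prior, squash it through the sigmoid, then sample Bernoulli targets.

import numpy as np

rng = np.random.default_rng(6)
xs = np.linspace(0, 5, 100)
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2) + 1e-8 * np.eye(len(xs))

f = rng.multivariate_normal(np.zeros(len(xs)), K)   # sample from the GP prior
y = 1.0 / (1.0 + np.exp(-f))                        # y(x) = sigmoid(f(x)), in (0, 1)
t = rng.binomial(1, y)                              # Bernoulli targets t in {0, 1}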
Classification with GPs
• Suppose we are given a dataset of inputs x_1, ..., x_N with target values t_N = (t_1, ..., t_N)^T.

• Our goal is to predict tN+1 for a new input vector xN+1.

• The predictive distribution is given by:

  p(t_{N+1} = 1 | t_N) = ∫ σ(f_{N+1}) p(f_{N+1} | t_N) df_{N+1},

where the posterior p(f_{N+1} | t_N) is itself intractable.

• This integral is analytically intractable. We can resort to MCMC by approximately sampling from the posterior and performing Monte Carlo integration:

  p(t_{N+1} = 1 | t_N) ≈ (1/M) Σ_{m=1}^{M} σ( f_{N+1}^{(m)} ),

where f_{N+1}^{(m)} are samples from the posterior p(f_{N+1} | t_N).
Approximations
• Another option: approximate the posterior p(f_{N+1} | t_N) by a Gaussian.

• We can then use the approximate formula for the convolution of a logistic sigmoid and a Gaussian distribution.

• There are three different approaches to obtaining a Gaussian approximation:
  • Variational inference
  • Expectation propagation
  • Laplace approximation
Laplace Approximation
• We seek a Gaussian approximation to the posterior over fN. Using Bayes' rule we have:

  p(f_N | t_N) ∝ p(f_N) p(t_N | f_N),

where p(fN) is given by a zero-mean GP with covariance matrix CN (easy to compute), and the data term is:

  p(t_N | f_N) = Π_{n=1}^{N} σ(f_n)^{t_n} (1 - σ(f_n))^{1 - t_n}.

• We obtain the Laplace approximation by Taylor expanding the log of the posterior, ln p(fN | tN), around its mode, which yields a Gaussian approximation centered at the mode.
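• The mode of the posterior can be found with a short Newton iteration. The sketch below is a minimal illustration of the standard Laplace treatment for GP classification (not from the slides); it assumes a precomputed covariance matrix K = C_N for the latent function values and binary targets t.

import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def laplace_approx(K, t, n_iters=20):
    # Newton iterations for the mode of p(f | t) ∝ N(f | 0, K) Π σ(f_n)^t_n (1 - σ(f_n))^(1 - t_n)
    N = len(t)
    f = np.zeros(N)
    for _ in range(n_iters):
        pi = sigmoid(f)
        W = pi * (1.0 - pi)              # negative Hessian of the log-likelihood (diagonal)
        grad = t - pi                    # gradient of the log-likelihood
        # Newton update: f <- (K^{-1} + W)^{-1} (W f + grad) = K (I + W K)^{-1} (W f + grad)
        B = np.eye(N) + W[:, None] * K   # B = I + W K
        f = K @ np.linalg.solve(B, W * f + grad)
    pi = sigmoid(f)
    W = pi * (1.0 - pi)
    # Gaussian approximation: mean = mode f, covariance = (K^{-1} + W)^{-1} = (I + K W)^{-1} K
    cov = np.linalg.solve(np.eye(N) + K * W[None, :], K)
    return f, cov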
Classification Results
• Left: the optimal decision boundary from the true distribution (green) and the decision boundary from the GP classifier (black).
• Right: the predictive posterior probability together with the GP decision boundary.
