Lecture 6, 2015
Machine Learning
Russ Salakhutdinov
Department of Computer Science
Department of Statistics
rsalakhu@cs.toronto.edu
http://www.cs.toronto.edu/~rsalakhu/

Lecture 6: Gaussian Processes
• So far, we have considered linear regression models of the form:

  y(x, w) = w^T φ(x)
• A Gaussian process defines a distribution over functions p(f) which can be used for Bayesian regression:

  p(f | D) = p(f) p(D | f) / p(D)
Gaussian Processes
• In the Gaussian process viewpoint, we define a prior probability distribution over functions directly.
• This may seem difficult: how can we define a distribution over the uncountably infinite space of functions?
• Insight: for a finite training set, we only need to consider the values of the functions at a discrete set of input values x_n.
• Hence:
  y = Φw,
  where Φ is the N × M design matrix and w is the M × 1 vector of model parameters.
Linear Regression Revisited
• This defines a Gaussian process, whose parameters are the mean function m(x) and the covariance kernel k(x, x′).
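A brief sketch of the induced mean and covariance (following Bishop, PRML §6.4.1; the zero-mean Gaussian prior on w is an assumption consistent with the earlier linear-regression lectures):

\[
p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})
\;\;\Rightarrow\;\;
\mathbb{E}[\mathbf{y}] = \mathbf{0},
\quad
\operatorname{cov}[\mathbf{y}] = \alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^{T} = \mathbf{K},
\quad
K_{nm} = k(x_n, x_m) = \alpha^{-1}\,\phi(x_n)^{T}\phi(x_m).
\]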
[Figure: draws from a 2-D Gaussian, shown both as points in the (f_1, f_2) plane and as the corresponding function values plotted at the inputs x_1, x_2.]
• Now it's easy to show three draws from a 6-D Gaussian:

[Figure: three samples from a 6-D Gaussian, each plotted as its values f at the inputs x_1, …, x_6.]

Slide Credit: Iain Murray
Building large Gaussians
Visualizing Draws from GPs
• Three draws from a 25-D Gaussian:

[Figure: three samples from a 25-D Gaussian, plotted as f against the 25 input positions x.]
• To generate these, the mean was set to zero: zeros(25,1).
• The covariances were set using a kernel function: Σ_ij = k(x_i, x_j).
• The x's are the positions at which the tics were placed on the axis.
• Later we'll find k's that ensure Σ is always positive semi-definite.
• We can visualize draws from a GP by iteratively sampling f(x_n) | f(x_1), …, f(x_{n−1}) on a sequence of input points x_1, x_2, …, x_n.
Slide Credit: Iain Murray
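A minimal NumPy sketch of how such draws can be generated (the lecture's own snippets, e.g. zeros(25,1), are MATLAB; the squared-exponential kernel and its length-scale below are illustrative assumptions):

import numpy as np

def sq_exp_kernel(xi, xj, length_scale=0.2):
    """Squared-exponential covariance k(x_i, x_j) for 1-D inputs (an assumed choice)."""
    return np.exp(-0.5 * (xi[:, None] - xj[None, :]) ** 2 / length_scale ** 2)

x = np.linspace(0, 1, 25)                           # the 25 input positions (the "tics" on the axis)
mu = np.zeros(25)                                   # zero mean, as on the slide
Sigma = sq_exp_kernel(x, x) + 1e-9 * np.eye(25)     # Sigma_ij = k(x_i, x_j), plus jitter for stability

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(mu, Sigma, size=3)  # three draws from N(mu, Sigma)
# Each row of `draws` is one sampled "function", evaluated at the 25 positions in x.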
Samples from GPs
[Figure: samples drawn using a squared-exponential kernel (left) and an exponential kernel (right).]
• The two Gaussian sources of randomness, one associated with f(x) and the
other with noise, are independent, and so their covariances add.
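For reference, in Bishop's notation (PRML §6.4.2), which I assume the lecture follows, the resulting covariance of the observed targets is

\[
C(x_n, x_m) = k(x_n, x_m) + \beta^{-1}\,\delta_{nm},
\]

where β is the precision of the Gaussian observation noise.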
Covariance Function
• One widely used covariance (kernel) function for GP regression is given by
the squared-exponential plus constant and linear terms:
• Note that the last term corresponds to a parametric model that is a linear
function of the input variables.
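The kernel itself did not survive extraction; the standard squared-exponential-plus-constant-plus-linear form (Bishop, PRML eq. 6.63), which presumably matches the slide up to parameterization, is

\[
k(x_n, x_m) = \theta_0 \exp\!\Big(-\tfrac{\theta_1}{2}\,\lVert x_n - x_m\rVert^{2}\Big) + \theta_2 + \theta_3\, x_n^{T} x_m .
\]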
Samples from GPs
• Samples from GPs with different covariance functions K(x, x′):
[Figure: sample functions f(x) drawn from GPs with different covariance functions, plotted against x.]
Prediction
• Suppose we are given a dataset with target values t = (t_1, …, t_N)^T, and we wish to predict the target t_{N+1} at a new input x_{N+1}.
• Note that the joint distribution over t and t_{N+1} is a zero-mean Gaussian whose covariance matrix C_{N+1} has the partitioned form

  C_{N+1} = [ C_N  k ]
            [ k^T  c ]

  where k has elements k(x_n, x_{N+1}) and c is the scalar: c = k(x_{N+1}, x_{N+1}) + β^{-1}.
• Also, note that the mean and variance of the predictive distribution both
depend on xN+1.
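For completeness, the predictive mean and variance these bullets refer to, in the standard form (Bishop, PRML eqs. 6.66–6.67), assuming the lecture uses the same notation:

\[
m(x_{N+1}) = \mathbf{k}^{T} \mathbf{C}_N^{-1} \mathbf{t},
\qquad
\sigma^{2}(x_{N+1}) = c - \mathbf{k}^{T} \mathbf{C}_N^{-1} \mathbf{k}.
\]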
[Figure: sample functions f(x), plotted against x.]
• Corresponding predictions, mean with two standard deviations:

[Figure: GP predictive mean with a two-standard-deviation band.]
gpdemo
Computational Complexity
• The central computation in using GPs involves the inversion of an N × N matrix C_N, which is of order O(N³).
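A minimal sketch (illustrative, not the lecture's code) of how prediction is usually implemented: C_N is Cholesky-factorized once at O(N³) cost, after which each test input needs only O(N²) work and C_N^{-1} is never formed explicitly. The squared-exponential kernel and toy data are assumptions.

import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance for 1-D inputs (an assumed choice)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_predict(X, t, X_star, noise_var=0.1):
    """Predictive mean and variance at X_star for a zero-mean GP."""
    C = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))   # C_N = K + beta^{-1} I
    L = np.linalg.cholesky(C)                               # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))     # C_N^{-1} t via triangular solves
    k_star = sq_exp_kernel(X, X_star)                       # k(x_n, x_*)
    mean = k_star.T @ alpha                                  # k^T C_N^{-1} t
    v = np.linalg.solve(L, k_star)
    var = (1.0 + noise_var) - np.sum(v ** 2, axis=0)         # c - k^T C_N^{-1} k, with c = k(x_*, x_*) + beta^{-1}
    return mean, var

X = np.linspace(0, 5, 20)
t = np.sin(X) + 0.1 * np.random.default_rng(1).standard_normal(20)
mean, var = gp_predict(X, t, np.array([2.5, 6.0]))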
Hyperparameters of the GP Model
• Rather than fixing the covariance function, we can parameterize it by a vector of hyperparameters θ and learn θ from the data.
• These parameters may govern the length scale of the correlations or the precision of the noise model, and correspond to the hyperparameters in a standard parametric model.
• One option is to maximize the log of the marginal likelihood with respect to θ (see the sketch after this list).
• In the fully Bayesian approach, we can introduce a prior p(θ) and infer the posterior p(θ | t).
• In general, the posterior will not have a closed-form solution, so we must resort to approximations (typically MCMC).
• For some models, known as heteroscedastic, the noise variance itself will depend on x (e.g. by introducing a second GP to model log β(x)).
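A minimal sketch (not from the lecture) of the first option above, maximizing ln p(t | θ) numerically; the kernel form, its log-space parameterization, and the toy data are assumptions:

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, t):
    """-ln p(t | theta) for a zero-mean GP with a squared-exponential kernel plus noise.

    log_theta = [log amplitude, log length-scale, log noise variance]; working in
    log space keeps all three hyperparameters positive during optimization.
    """
    amp, ell, noise = np.exp(log_theta)
    sqdist = (X[:, None] - X[None, :]) ** 2
    C = amp * np.exp(-0.5 * sqdist / ell ** 2) + (noise + 1e-9) * np.eye(len(X))  # C_N(theta)
    L = np.linalg.cholesky(C)                                                     # C_N = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))                           # C_N^{-1} t
    # ln p(t | theta) = -1/2 t^T C_N^{-1} t - 1/2 ln|C_N| - (N/2) ln(2 pi)
    return 0.5 * t @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

# Toy 1-D data (illustrative only).
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 30)
t = np.sin(X) + 0.1 * rng.standard_normal(30)

res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, t))
amp, ell, noise = np.exp(res.x)   # learned amplitude, length-scale, and noise variance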
Automatic Relevance Determination
• How can we detect input variables that have very little effect on the predictive distribution (irrelevant inputs)?
• Consider a GP with a 2-D input space x = (x_1, x_2) and the following covariance function (reproduced below):
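The equation did not survive extraction; the standard ARD covariance for two inputs (Bishop, PRML eq. 6.71), whose η_i the next bullet refers to, is

\[
k(\mathbf{x}, \mathbf{x}') = \theta_0 \exp\!\Big(-\tfrac{1}{2}\sum_{i=1}^{2} \eta_i\,(x_i - x_i')^{2}\Big).
\]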
• As η_i becomes small, the function becomes insensitive to the corresponding value of x_i (input x_i becomes less relevant).
Automatic Relevance Determination
• The ARD framework can easily be incorporated into the exponential-quadratic kernel (see below), where the η_i play the role of relevance parameters for the individual input dimensions.
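The equation itself is missing from the extraction; the standard ARD form of the exponential-quadratic kernel (Bishop, PRML eq. 6.72), which presumably matches the slide, is

\[
k(\mathbf{x}, \mathbf{x}') = \theta_0 \exp\!\Big(-\tfrac{1}{2}\sum_{i=1}^{D} \eta_i\,(x_i - x_i')^{2}\Big) + \theta_2 + \theta_3 \sum_{i=1}^{D} x_i\, x_i' .
\]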
Approximations
• Another option: a Gaussian approximation to the posterior.
• Variational Inference
• Expectation Propagation
• Laplace Approximation
Laplace Approximation
• We seek to obtain a Gaussian approximation to the posterior. Using Bayes' rule we have (see the sketch below):
• Here p(f_N) is given by a zero-mean GP with covariance matrix C_N, and the data term is the likelihood p(t_N | f_N).
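The slide's equations are missing here; a sketch of what they presumably contain, assuming binary targets t_n ∈ {0, 1} with a logistic link σ(·), as in standard GP classification:

\[
p(\mathbf{f}_N \mid \mathbf{t}_N) = \frac{p(\mathbf{f}_N)\, p(\mathbf{t}_N \mid \mathbf{f}_N)}{p(\mathbf{t}_N)},
\qquad
p(\mathbf{t}_N \mid \mathbf{f}_N) = \prod_{n=1}^{N} \sigma(f_n)^{t_n}\,\big(1 - \sigma(f_n)\big)^{1 - t_n}.
\]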
[Figure captions: (left) optimal decision boundary from the true distribution (green) and the decision boundary from the GP classifier (black); (right) predictive posterior probability together with the GP decision boundary.]