Tutorial: Gaussian process models
for machine learning
Ed Snelson (snelson@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, UCL
26th October 2006
Info
The GP book: Rasmussen and Williams, 2006
Basic GP (Matlab) code available:
http://www.gaussianprocess.org/gpml/
1
Gaussian process history
Prediction with GPs:
Time series: Wiener, Kolmogorov 1940s
Geostatistics: kriging, 1970s; naturally restricted to two or three
dimensional input spaces
Spatial statistics in general: see Cressie [1993] for overview
General regression: O'Hagan [1978]
Computer experiments (noise free): Sacks et al. [1989]
Machine learning: Williams and Rasmussen [1996], Neal [1996]
2
Nonlinear regression
Consider the problem of nonlinear regression:
You want to learn a function f with error bars from data D = {X, y}
A Gaussian process is a prior over functions p(f ) which can be used
for Bayesian regression:
p(f|D) = p(f) p(D|f) / p(D)
3
What is a Gaussian process?
Continuous stochastic process (random functions): a set of
random variables indexed by a continuous variable: f(x)
Set of inputs X = {x1, x2, . . . , xN }; corresponding set of random
function variables f = {f1, f2, . . . , fN }
GP: any set of function variables {fn} has a joint (zero mean)
Gaussian distribution:
p(f|X) = N(0, K)
Conditional model - density of inputs not modeled
Consistency: p(f1) = ∫ df2 p(f1, f2)
4
Covariances
Where does the covariance matrix K come from?
Covariance matrix constructed from covariance function:
Kij = K(xi, xj )
Covariance function characterizes correlations between different
points in the process:
K(x, x') = E[f(x)f(x')]
Must produce positive semidefinite covariance matrices: vᵀKv ≥ 0
Ensures consistency
5
Squared exponential (SE) covariance
K(x, x') = σ₀² exp[ −(x − x')² / (2ℓ²) ]
Intuition: function variables close in input space are highly
correlated, whilst those far away are uncorrelated
ℓ, σ₀ hyperparameters. ℓ: lengthscale, σ₀: amplitude
Stationary: K(x, x') = K(x − x'), invariant to translations
Very smooth sample functions: infinitely differentiable
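As a concrete illustration, here is a minimal NumPy sketch (rather than the Matlab GPML code linked above) of the SE covariance and of drawing sample functions from the zero-mean GP prior p(f|X) = N(0, K); the small jitter added to the diagonal is an assumption for numerical stability.

```python
import numpy as np

def se_cov(x1, x2, amp=1.0, lengthscale=1.0):
    """Squared exponential covariance: K(x, x') = amp^2 exp(-(x - x')^2 / (2 lengthscale^2))."""
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

# Draw three sample functions from the zero-mean GP prior p(f|X) = N(0, K)
X = np.linspace(-5.0, 5.0, 200)
K = se_cov(X, X) + 1e-8 * np.eye(len(X))   # jitter for numerical stability (assumption)
samples = np.random.default_rng(0).multivariate_normal(np.zeros(len(X)), K, size=3)
```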
6
Matern class of covariances
K(x, x') = [2^(1−ν) / Γ(ν)] (√(2ν) |x − x'| / ℓ)^ν K_ν( √(2ν) |x − x'| / ℓ )
where K_ν is a modified Bessel function.
Stationary, isotropic
ν → ∞: SE covariance
Finite ν: much rougher sample functions
ν = 1/2: K(x, x') = exp(−|x − x'|/ℓ), OU process, very rough
sample functions
7
Nonstationary covariances
Linear covariance: K(x, x') = σ₀² + x x'
Brownian motion (Wiener process): K(x, x') = min(x, x')
Periodic covariance: K(x, x') = exp[ −2 sin²((x − x')/2) / ℓ² ]
Neural network covariance
8
Constructing new covariances from old
There are several ways to combine covariances:
Sum: K(x, x') = K1(x, x') + K2(x, x')
addition of independent processes
Product: K(x, x') = K1(x, x') K2(x, x')
product of independent processes
Convolution: K(x, x') = ∫ dz dz' h(x, z) K(z, z') h(x', z')
blurring of process with kernel h
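A short sketch of how sum and product covariances might be composed in code, redefining the illustrative se_cov helper from the earlier snippet so the example stands alone; the linear covariance form and the hyperparameter defaults are assumptions for the example.

```python
import numpy as np

def se_cov(x1, x2, amp=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

def linear_cov(x1, x2, sigma0=1.0):
    return sigma0**2 + np.outer(x1, x2)

# Sums and products of valid covariances are again valid covariances
def sum_cov(x1, x2):
    return se_cov(x1, x2) + linear_cov(x1, x2)       # sum of independent processes

def product_cov(x1, x2):
    return se_cov(x1, x2) * linear_cov(x1, x2)       # product of independent processes
```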
9
Prediction with GPs
We have seen examples of GPs with certain covariance functions
General properties of covariances controlled by small number of
hyperparameters
Task: prediction from noisy data
Use GP as a Bayesian prior expressing beliefs about underlying
function we are modeling
Link to data via noise model or likelihood
10
GP regression with Gaussian noise
Data generated with Gaussian white noise around the function f
y = f + ε,   E[ε(x)ε(x')] = σ² δ(x − x')
Equivalently, the noise model, or likelihood, is:
p(y|f) = N(f, σ²I)
Integrating over the function variables gives the marginal likelihood:
p(y) = ∫ df p(y|f) p(f) = N(0, K + σ²I)
11
Prediction
N training input and output pairs (X, y), and T test inputs XT
Consider joint training and test marginal likelihood:
p(y, y_T) = N(0, K_{N+T} + σ²I),   K_{N+T} = [ K_N  K_{NT} ; K_{TN}  K_T ]
Condition on training outputs: p(y_T|y) = N(μ_T, Σ_T)
μ_T = K_{TN} [K_N + σ²I]⁻¹ y
Σ_T = K_T − K_{TN} [K_N + σ²I]⁻¹ K_{NT} + σ²I
Gives correlated predictions. Defines a predictive Gaussian process
12
Prediction
Often only marginal variances (diag Σ_T) are required: sufficient
to consider a single test input x_*:
μ_* = K_{*N} [K_N + σ²I]⁻¹ y
σ_*² = K_{**} − K_{*N} [K_N + σ²I]⁻¹ K_{N*} + σ²
Mean predictor is a linear predictor: μ_* = K_{*N} α, where α = [K_N + σ²I]⁻¹ y
Inversion of K_N + σ²I costs O(N³)
Prediction cost per test case is O(N ) for the mean and O(N 2) for
the variance
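The predictive equations above translate directly into code. The following is a minimal sketch assuming the SE covariance from earlier; the Cholesky factorization is one standard way to implement the matrix inversion, and the function and argument names are illustrative.

```python
import numpy as np

def se_cov(x1, x2, amp=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

def gp_predict(X, y, Xstar, amp=1.0, lengthscale=1.0, noise=0.1):
    """Predictive mean mu_* = K_{*N}[K_N + s^2 I]^{-1} y and marginal variance
    s_*^2 = K_{**} - K_{*N}[K_N + s^2 I]^{-1} K_{N*} + s^2."""
    KN = se_cov(X, X, amp, lengthscale) + noise**2 * np.eye(len(X))   # K_N + sigma^2 I
    KsN = se_cov(Xstar, X, amp, lengthscale)                          # K_{*N}
    L = np.linalg.cholesky(KN)                                        # O(N^3), done once
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = KsN @ alpha                                                # O(N) per test case
    V = np.linalg.solve(L, KsN.T)
    var = np.diag(se_cov(Xstar, Xstar, amp, lengthscale)) - np.sum(V**2, axis=0) + noise**2
    return mean, var                                                  # O(N^2) per test case for the variance
```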
13
Determination of hyperparameters
Advantage of the probabilistic GP framework: the ability to choose
hyperparameters and covariances directly from the training data
Other models, e.g. SVMs, splines etc. require cross validation
GP: minimize negative log marginal likelihood L(θ) wrt
hyperparameters θ and noise level σ:
L(θ) = −log p(y|θ) = ½ log det C(θ) + ½ yᵀ C⁻¹(θ) y + (N/2) log(2π)
where C = K + σ²I
Uncertainty in the function variables f is taken into account
14
Gradient based optimization
Minimization of L(θ) is a non-convex optimization task
Standard gradient based techniques, such as CG or quasi-Newton
Gradients:
∂L/∂θ_i = ½ tr( C⁻¹ ∂C/∂θ_i ) − ½ yᵀ C⁻¹ (∂C/∂θ_i) C⁻¹ y
Local minima, but usually not much of a problem with few
hyperparameters
Use weighted sums of covariances and let ML choose
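A sketch of how the objective and one of its gradients might be computed, assuming the SE covariance and a log-parameterization of the hyperparameters; only the lengthscale gradient is shown, and the function and variable names are illustrative.

```python
import numpy as np

def se_cov(x1, x2, amp, lengthscale):
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

def nlml_and_ls_grad(params, X, y):
    """L = 1/2 log det C + 1/2 y^T C^{-1} y + N/2 log(2 pi), with C = K + sigma^2 I,
    plus dL/d(log lengthscale) via the trace/quadratic-form gradient formula."""
    amp, ls, noise = np.exp(params)            # log-parameterization keeps values positive (assumption)
    N = len(X)
    K = se_cov(X, X, amp, ls)
    C = K + noise**2 * np.eye(N)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    nlml = np.sum(np.log(np.diag(L))) + 0.5 * y @ alpha + 0.5 * N * np.log(2 * np.pi)

    d2 = (X[:, None] - X[None, :])**2
    dC = K * d2 / ls**2                        # dC/d(log lengthscale) for the SE covariance
    Cinv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(N)))
    grad_ls = 0.5 * np.trace(Cinv @ dC) - 0.5 * alpha @ dC @ alpha
    return nlml, grad_ls
```

The returned value and gradient can then be handed to a standard CG or quasi-Newton optimizer.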
15
Automatic relevance determination1
The ARD SE covariance function for multi-dimensional inputs:
K(x, x') = σ₀² exp[ −½ Σ_{d=1}^{D} ((x_d − x'_d) / ℓ_d)² ]
Learn an individual lengthscale hyperparameter ℓ_d for each input
dimension x_d
ℓ_d determines the relevancy of input feature d to the regression
If ℓ_d is very large, then the feature is irrelevant
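A possible NumPy sketch of the ARD SE covariance for multi-dimensional inputs; the function and argument names are illustrative.

```python
import numpy as np

def ard_se_cov(X1, X2, amp=1.0, lengthscales=None):
    """ARD squared exponential: one lengthscale per input dimension.

    K(x, x') = amp^2 exp(-1/2 sum_d ((x_d - x'_d) / l_d)^2)
    """
    D = X1.shape[1]
    lengthscales = np.ones(D) if lengthscales is None else np.asarray(lengthscales)
    d = (X1[:, None, :] - X2[None, :, :]) / lengthscales   # scaled differences per dimension
    return amp**2 * np.exp(-0.5 * np.sum(d**2, axis=-1))
```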
1 Neal, 1996. MacKay, 1998.
16
Relationship to generalized linear regression
Weighted sum of fixed finite set of M basis functions:
f(x) = Σ_{m=1}^{M} w_m φ_m(x) = wᵀ φ(x)
Place Gaussian prior on weights: p(w) = N(0, Σ_w)
Defines GP with finite rank M (degenerate) covariance function:
K(x, x') = E[f(x)f(x')] = φᵀ(x) Σ_w φ(x')
Function space vs. weight space interpretations
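To make the weight-space view concrete, here is a small sketch (with an assumed polynomial basis and unit prior covariance) checking empirically that weights drawn from p(w) = N(0, Σ_w) induce the finite-rank covariance Φ Σ_w Φᵀ on f = Φw.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 50)
Phi = np.column_stack([np.ones_like(X), X, X**2])   # M = 3 fixed basis functions (assumption)
Sigma_w = np.eye(3)                                 # Gaussian prior p(w) = N(0, Sigma_w)
K = Phi @ Sigma_w @ Phi.T                           # degenerate rank-3 covariance

# Monte Carlo check: sample covariance of f = Phi w approaches K as the number of draws grows
W = rng.multivariate_normal(np.zeros(3), Sigma_w, size=20000)
F = W @ Phi.T
print(np.abs(np.cov(F, rowvar=False) - K).max())    # small relative to the scale of K
```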
17
Relationship to generalized linear regression
General GP: can specify the covariance function directly rather than
via a set of basis functions
Mercer's theorem: can always decompose covariance function into
eigenfunctions and eigenvalues:
K(x, x') = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x')
If sum finite, back to linear regression. Often sum infinite, and no
analytical expressions for eigenfunctions
Power of kernel methods in general (e.g. GPs, SVMs etc.):
project x ↦ φ(x) into high or infinite dimensional feature space
and still handle computations tractably
18
Relationship to neural networks
Neural net with one hidden layer of N_H units:
f(x) = b + Σ_{j=1}^{N_H} v_j h(x; u_j)
h: bounded hidden layer transfer function
(e.g. h(x; u) = erf(uᵀx))
If the v_j and b are zero mean and independent, and the weights u_j are iid, then the CLT
implies NN → GP as N_H → ∞ [Neal, 1996]
NN covariance function depends on transfer function h, but is in
general non-stationary
19
Relationship to spline models
Univariate cubic spline has cost functional:
Σ_{n=1}^{N} (f(x_n) − y_n)² + ∫₀¹ f''(x)² dx
Can give probabilistic GP interpretation by considering RH term
as an (improper) GP prior
Make proper by weak penalty on zeroth and first derivatives
Can derive a spline covariance function, and full GP machinery can
be applied to spline regression (uncertainties, ML hyperparameters)
Penalties on derivatives are equivalent to specifying the inverse
covariance function: no natural marginalization
20
Relationship to support vector regression
[Figure: negative log likelihood −log p(y|f) as a function of f: quadratic for GP regression vs. ε-insensitive for SV regression]
The ε-insensitive error function can be considered as a non-Gaussian
likelihood or noise model. Integrating over f becomes intractable
SV regression can be considered as the MAP solution f_MAP of a GP with
ε-insensitive error likelihood
Advantages of SVR: naturally sparse solution by QP, robust to
outliers. Disadvantages: uncertainties not taken into account, no
predictive variances, no learning of hyperparameters by ML
21
Logistic and probit regression
Binary classification task: y = ±1
GLM likelihood: p(y = +1|x, w) = π(x) = σ(xᵀw)
σ(z): a sigmoid function such as the logistic or cumulative normal
Weight space viewpoint: prior p(w) = N(0, Σ_w)
Function space viewpoint: let f(x) = wᵀx, then likelihood π(x) = σ(f(x)),
Gaussian prior on f
22
GP classification
1. Place a GP prior directly on f (x)
2. Use a sigmoidal likelihood: p(y = +1|f) = σ(f)
Just as for SVR, the non-Gaussian likelihood makes integrating over f
intractable:
p(f_*|y) = ∫ df p(f_*|f) p(f|y)
where the posterior p(f|y) ∝ p(y|f) p(f)
Make tractable by using a Gaussian approximation to posterior.
Then prediction:
p(y_* = +1|y) = ∫ df_* σ(f_*) p(f_*|y)
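If the sigmoid is taken to be the cumulative normal (probit), this predictive integral against a Gaussian approximation q(f_*) = N(μ, s²) has a closed form; the sketch below assumes that probit choice, and the function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def probit_predict(mu, s2):
    """p(y* = +1 | y) = integral of Phi(f*) N(f*; mu, s2) df* = Phi(mu / sqrt(1 + s2))."""
    return norm.cdf(mu / np.sqrt(1.0 + s2))
```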
23
GP classification
Two common ways to make Gaussian approximation to posterior:
1. Laplace approximation. Second order Taylor approximation about
mode of posterior
2. Expectation propagation (EP)2. EP can be thought of as
approximately minimizing KL[p(f |y)||q(f |y)] by an iterative
procedure.
Kuss and Rasmussen [2005] evaluate both methods experimentally
and find EP to be significantly superior
Classification accuracy on digits data sets comparable to SVMs.
Advantages: probabilistic predictions, hyperparameters by ML
2 Minka, 2001
24
GP latent variable model (GPLVM)3
Probabilistic model for dimensionality reduction: data is set of
high dimensional vectors: y1, y2, . . . , yN
Model each dimension (k) of y as an independent GP with
unknown low dimensional latent inputs: x1, x2, . . . , xN
p({y_n}|{x_n}) ∝ ∏_k exp[ −½ w_k² Y_kᵀ K⁻¹ Y_k ]
Maximize likelihood to find latent projections {xn} (not a true
density model for y because cannot tractably integrate over x).
Smooth mapping from x to y with uncertainties
3 Lawrence, 2004
25
Application: GPLVMs for human pose modeling4
High dimensional data points are feature vectors derived from pose
information from mo-cap data.
Features: joint angles, vertical orientation, velocity and
accelerations
GPLVM used to learn low-dimensional trajectories of e.g. baseball
pitch, basketball jump shot
GPLVM predictive distribution used to make cost function for
finding new poses with constraints
4 Grochow, Martin, Hertzmann, Popovic, 2004. Style-based inverse kinematics. Demos available from http://grail.cs.washington.edu/projects/styleik/
26
Sparse GP approximations
Problem for large data sets: training a GP costs O(N³), prediction O(N²)
per test case
In recent years many approximations have been developed that reduce the cost to
O(NM²) for training and O(M²) for prediction per test case
Based around a low rank (M ) covariance approximation
See Quinonero Candela and Rasmussen [2005] for a review of
regression approximations
Classification more complicated, so simpler approximations such
as IVM5 may be more suitable
5 Lawrence et al., 2003
27
Two stage generative model
Pseudo-input prior: p(f̄|X̄) = N(0, K_M)
1. Choose any set of M pseudo-inputs X̄
2. Draw corresponding function values f̄ from the prior
28
Two stage generative model
Conditional: p(f|f̄) = N(μ, Σ)
μ = K_{NM} K_M⁻¹ f̄
Σ = K_N − K_{NM} K_M⁻¹ K_{MN}
3. Draw f conditioned on f̄
This two stage procedure defines exactly the same GP prior
We have not gained anything yet, but it inspires a sparse
approximation ...
29
Factorized approximation
Single point conditional: p(f_n|f̄) = N(μ_n, λ_n)
μ_n = K_{nM} K_M⁻¹ f̄
λ_n = K_{nn} − K_{nM} K_M⁻¹ K_{Mn}
Approximate: p(f|f̄) ≈ ∏_n p(f_n|f̄) = N(μ, Λ),   Λ = diag(λ)
Minimum KL: min_{q_n} KL[ p(f|f̄) ‖ ∏_n q_n(f_n) ]
30
Sparse pseudo-input Gaussian processes (SPGP)6
Integrate out f̄ to obtain the SPGP prior: p(f) = ∫ df̄ [ ∏_n p(f_n|f̄) ] p(f̄)
GP prior: N(0, K_N).   SPGP/FITC prior: p(f) = N(0, K_{NM} K_M⁻¹ K_{MN} + Λ)
[Figure: the two prior covariance matrices, with the FITC covariance drawn as a low rank term plus a diagonal]
SPGP/FITC covariance inverted in O(M²N). Predictive mean and
variance computed in O(M) and O(M²) per test case respectively
SPGP = GP with non-stationary covariance parameterized by X̄
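A minimal sketch of constructing the low-rank-plus-diagonal SPGP/FITC prior covariance, assuming the SE covariance from earlier; the jitter on K_M and the function names are illustrative.

```python
import numpy as np

def se_cov(x1, x2, amp=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

def fitc_prior_cov(X, Xbar, amp=1.0, lengthscale=1.0, jitter=1e-6):
    """SPGP/FITC prior covariance: Q_N + Lambda, where
       Q_N = K_NM K_M^{-1} K_MN and Lambda = diag(K_N - Q_N)."""
    KM = se_cov(Xbar, Xbar, amp, lengthscale) + jitter * np.eye(len(Xbar))
    KNM = se_cov(X, Xbar, amp, lengthscale)
    QN = KNM @ np.linalg.solve(KM, KNM.T)
    Lam = np.diag(np.diag(se_cov(X, X, amp, lengthscale) - QN))
    return QN + Lam
```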
6 Snelson and Ghahramani, 2005
31
1D demo
[Figure: 1D SPGP fit, with bars indicating the amplitude, lengthscale, and noise hyperparameters]
Initialize adversarially: amplitude and lengthscale too big
noise too small
pseudo-inputs bunched up
32
1D demo
[Figure: 1D SPGP fit after optimization, with bars indicating the amplitude, lengthscale, and noise hyperparameters]
Pseudo-inputs and hyperparameters optimized
32
Future GP directions
Design of covariance functions to incorporate more specific prior
knowledge
Beyond vectorial input data: structure in the input domain
Further improvements to sparse GP approximations to scale GPs
up for very large data sets
Beyond regression and classification, e.g. applications of latent
variable models such as GPLVM
33