Gaussian Processes Tutorial
Zoubin Ghahramani
       Department of Engineering
      University of Cambridge, UK
         zoubin@eng.cam.ac.uk
 http://learning.eng.cam.ac.uk/zoubin/
              MLSS 2011
                             Nonlinear regression
You want to learn a function f with error bars from data D = {X, y}
A Gaussian process defines a distribution over functions p(f) which can be used for
Bayesian regression:

                        p(f|D) = p(f) p(D|f) / p(D)
                              Gaussian Processes
f : X → ℝ.
Definition: p(f) is a Gaussian process if for any finite subset {x1, . . . , xn} ⊂ X,
the marginal distribution p(f(x1), . . . , f(xn)) is a multivariate Gaussian.
           Gaussian process covariance functions (kernels)
p(f) is a Gaussian process if for any finite subset {x1, . . . , xn} ⊂ X, the marginal
distribution p(f(x1), . . . , f(xn)) is a multivariate Gaussian. For a pair of points x, x′:

            p(f(x), f(x′)) = N(µ, Σ),   where

            µ = [ µ(x)  ]        Σ = [ K(x, x)   K(x, x′)  ]
                [ µ(x′) ]            [ K(x′, x)  K(x′, x′) ]

and similarly for p(f(x1), . . . , f(xn)), where now µ is an n × 1 vector and Σ is an
n × n matrix.
                 Gaussian process covariance functions
These kernel parameters are interpretable and can be learned from data. For example,
for the covariance function

            K(x, x′) = v0 exp{ −(|x − x′| / r)^α } + v1 + v2 δ(x, x′)

            v0   signal variance
            v1   variance of bias
            v2   noise variance
            r    lengthscale
            α    roughness
Once the mean and covariance functions are defined, everything else about GPs
follows from the basic rules of probability applied to multivariate Gaussians.
                    Samples from GPs with different K(x, x′)
[Figure: twelve panels of sample functions f(x) drawn from GP priors with different
covariance functions K(x, x′), for x from 0 to 100]
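For readers who want to reproduce such panels, here is a minimal NumPy sketch (not from
the original slides; the kernel parameters and the jitter term are illustrative assumptions):

    import numpy as np

    def kernel(X1, X2, v0=1.0, r=10.0, alpha=2.0):
        """K(x, x') = v0 * exp(-(|x - x'| / r)^alpha), cf. the kernel above."""
        d = np.abs(X1[:, None] - X2[None, :])
        return v0 * np.exp(-(d / r) ** alpha)

    x = np.linspace(0, 100, 200)
    K = kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability

    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    # each row of `samples` is one function drawn from the GP prior;
    # varying r (lengthscale) and alpha (roughness) reproduces the different panels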
         Using Gaussian processes for nonlinear regression
Model:                           yi = f(xi) + εi
                                 f  ∼ GP(·|0, K)
                                 εi ∼ N(·|0, σ²)
We can also compute the marginal likelihood (evidence) and use this to compare or
tune covariance functions:

                        p(y|X) = ∫ p(y|f, X) p(f) df
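For this model the marginal is Gaussian, p(y|X) = N(0, KN + σ²I) (see the GP regression
slide below), so the log evidence has a closed form. A hedged Cholesky-based sketch in
NumPy/SciPy, an implementation choice rather than the tutorial's own code:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def log_evidence(K, y, sigma2):
        """log N(y | 0, K + sigma^2 I), evaluated via a Cholesky factorization."""
        N = len(y)
        C = K + sigma2 * np.eye(N)
        L = cho_factor(C, lower=True)
        logdet = 2.0 * np.sum(np.log(np.diag(L[0])))
        return -0.5 * (y @ cho_solve(L, y) + logdet + N * np.log(2 * np.pi))

Comparing this quantity across kernels, or maximizing it over kernel parameters, is
exactly the "compare or tune" step mentioned above.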
                    Prediction using GPs with different K(x, x′)
[Figure: six panels of GP predictive means with error bars on the same data set,
under different covariance functions]
gpdemo
                        Gaussian process (GP) priors

N function values f = (f1, f2, . . . , fN ) with prior p(f|X) = N(0, KN )

[Figure: function values f1, . . . , fN plotted against x, with N × N covariance matrix KN ]

Covariance: Knn′ = K(xn, xn′ ; θ), hyperparameters θ, for example

            Knn′ = v exp[ −(1/2) Σ_{d=1}^{D} ( (xn^(d) − xn′^(d)) / rd )² ]
                                GP regression

marginal likelihood:       p(y|X) = N(0, KN + σ²I)

predictive distribution:   p(y∗|x∗, X, y) = N(µ∗, σ∗²)
                           µ∗  = K∗N (KN + σ²I)⁻¹ y
                           σ∗² = K∗∗ − K∗N (KN + σ²I)⁻¹ KN∗ + σ²

[Figure: training data y against x, and the predictive mean with error bars at a
test input x∗]
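These formulas translate directly into code. A hedged NumPy sketch (the kernel function,
passed in as an argument, and the noise level are assumptions; for 1-d inputs the
kernel sketched earlier would do):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def gp_predict(X, y, Xstar, kernel, sigma2):
        """Zero-mean GP predictive mean and variance at test inputs Xstar."""
        KN = kernel(X, X) + sigma2 * np.eye(len(X))   # K_N + sigma^2 I
        KsN = kernel(Xstar, X)                        # K_*N
        Kss = kernel(Xstar, Xstar)                    # K_**
        L = cho_factor(KN, lower=True)
        mu = KsN @ cho_solve(L, y)                               # mean mu_*
        var = np.diag(Kss - KsN @ cho_solve(L, KsN.T)) + sigma2  # variance sigma_*^2
        return mu, var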
                            GP learning the kernel
Consider the covariance function Kθ with hyperparameters θ = (v0, v1, r1, . . . , rD, α):

            Kθ(xi, xj) = v0 exp[ −Σ_{d=1}^{D} ( |xi^(d) − xj^(d)| / rd )^α ] + v1

The hyperparameters θ can be learned by maximizing the marginal likelihood p(y|X, θ).
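A hedged sketch of that learning step, using scipy's generic optimizer and a
log-parameterization (both implementation choices, not from the tutorial); α is
fixed at 2 for simplicity:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.optimize import minimize

    def neg_log_evidence(log_theta, X, y):
        """-log p(y|X, theta) for the ARD kernel above with alpha = 2."""
        v0, v1, sigma2 = np.exp(log_theta[:3])
        r = np.exp(log_theta[3:])                     # one lengthscale per dimension
        d = np.abs(X[:, None, :] - X[None, :, :]) / r
        K = v0 * np.exp(-np.sum(d ** 2, axis=-1)) + v1
        C = K + sigma2 * np.eye(len(y))
        L = cho_factor(C, lower=True)
        logdet = 2.0 * np.sum(np.log(np.diag(L[0])))
        return 0.5 * (y @ cho_solve(L, y) + logdet + len(y) * np.log(2 * np.pi))

    # toy data: dimension 2 is irrelevant, so its learned lengthscale should grow
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
    res = minimize(neg_log_evidence, np.zeros(3 + X.shape[1]), args=(X, y))
    print(np.exp(res.x[3:]))   # learned lengthscales r_1, r_2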
A multilayer perceptron (neural network) with infinitely many hidden units and
Gaussian priors on the weights converges to a GP (Neal, 1996).
                 Using Gaussian Processes for Classification
Binary classification problem: Given a data set D = {(xi, yi)}ni=1, with binary class
labels yi ∈ {−1, +1}, infer class label probabilities at new points.
[Figure: 1-d binary classification data with y = +1 and y = −1 labels, and a latent
function f over x]
There are many ways to relate function values fi = f(xi) to class probabilities:

            p(yi|fi) =  1 / (1 + exp(−yi fi))      sigmoid (logistic)
                        Φ(yi fi)                   cumulative normal (probit)
                        H(yi fi)                   threshold
                        ε + (1 − 2ε) H(yi fi)      robust threshold
Non-Gaussian likelihood, so we need to use approximate inference methods (Laplace, EP, MCMC).
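To make one of these options concrete: in the Laplace approximation, Newton iterations
find the mode of p(f|y), and a Gaussian centered there replaces the non-Gaussian
posterior. A hedged sketch for the logistic likelihood, following the standard
formulation in Rasmussen and Williams (2006) rather than this tutorial's code:

    import numpy as np

    def laplace_mode(K, y, n_iter=20):
        """Newton iterations for the mode of p(f|y); logistic likelihood, y in {-1,+1}."""
        N = len(y)
        f = np.zeros(N)
        for _ in range(n_iter):
            pi = 1.0 / (1.0 + np.exp(-f))        # p(y = +1 | f)
            g = (y + 1) / 2 - pi                 # gradient of log p(y|f)
            W = pi * (1 - pi)                    # negative Hessian (diagonal)
            b = W * f + g
            # f_new = (K^-1 + W)^-1 (W f + g), rewritten to avoid inverting K
            f = K @ np.linalg.solve(np.eye(N) + W[:, None] * K, b)
        return f   # mode f_hat; N(f_hat, (K^-1 + W)^-1) approximates p(f|y)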
                           Support Vector Machines
Consider soft-margin Support Vector Machines:

            min_w  (1/2)‖w‖² + C Σi (1 − yi fi)+

where (z)+ = max(0, z) is the hinge loss and fi = f(xi) = w · xi + w0. Kernelizing, with

            xi → φ(xi) = k(·, xi),     w → f(·),

this becomes

            min_f  (1/2) fᵀK⁻¹f + C Σi (1 − yi fi)+
        Support Vector Machines and Gaussian Processes
We can write the SVM objective as:

            min_f  (1/2) fᵀK⁻¹f + C Σi (1 − yi fi)+

and the negative log of the GP model (prior times likelihood) as:

            (1/2) fᵀK⁻¹f − Σi ln p(yi|fi) + c

Equivalent? No: the hinge loss (1 − yi fi)+ is not the negative log of any normalized
likelihood p(yi|fi).
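A tiny numerical comparison of the two per-point losses as a function of the margin
z = yi fi (illustrative only, not from the slides):

    import numpy as np

    z = np.linspace(-2.0, 3.0, 6)                # margins y_i * f_i
    hinge = np.maximum(0.0, 1.0 - z)             # SVM hinge loss
    logloss = np.log1p(np.exp(-z))               # -ln p(y_i|f_i), logistic link
    print(np.c_[z, hinge, logloss])
    # both penalize small margins, but the hinge is exactly zero for z >= 1,
    # while the logistic loss is strictly positive everywhere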
[Diagram: Linear Regression and Logistic Regression, their Bayesian variants
(Bayesian Linear/Logistic Regression), their kernelized variants (Kernel
Regression/Classification), and the combination of both: GP Regression and
GP Classification]
Matlab Demo: Gaussian Process Classification

    matlab/gpml-matlab/gpml-demo
    demo_ep_2d
    demo_gpr
         Sparse Approximations: Speeding up GP learning
   (Snelson and Ghahramani, 2006a, 2006b; Naish-Guzman and Holden, 2008)

We can approximate a GP through M < N inducing points f̄ to obtain this Sparse
Pseudo-input Gaussian process (SPGP) prior:

            p(f) = ∫ df̄ [ ∏n p(fn|f̄) ] p(f̄)

            GP prior: N(0, KN )   ≈   SPGP prior: p(f) = N(0, KNM KM⁻¹ KMN + Λ)

where Λ is diagonal. [Figure: full covariance ≈ low-rank matrix + diagonal matrix]

• Given data {X, y} with noise σ², the predictive mean and variance can be computed
  in O(M) and O(M²) per test case, respectively.

This builds on a large literature on sparse GPs (see Quiñonero-Candela and
Rasmussen, 2005).
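A hedged sketch of constructing the SPGP prior covariance, with
Λ = diag(KN − KNM KM⁻¹ KMN) as in Snelson and Ghahramani (2006a); the kernel and
pseudo-input placement are arbitrary choices:

    import numpy as np

    def se_kernel(A, B, v0=1.0, r=10.0):
        d = np.abs(A[:, None] - B[None, :])
        return v0 * np.exp(-0.5 * (d / r) ** 2)

    x = np.linspace(0, 100, 200)        # N = 200 training inputs
    xbar = np.linspace(0, 100, 10)      # M = 10 pseudo-inputs

    KN = se_kernel(x, x)
    KNM = se_kernel(x, xbar)
    KM = se_kernel(xbar, xbar) + 1e-8 * np.eye(len(xbar))

    Q = KNM @ np.linalg.solve(KM, KNM.T)    # low-rank part K_NM K_M^-1 K_MN
    Lam = np.diag(np.diag(KN - Q))          # diagonal correction
    K_spgp = Q + Lam                        # SPGP prior covariance, approx. K_N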
                                      Some Comparisons
[Table 1, contents not recovered: test errors and predictive accuracy (smaller is
better) for the GP classifier, the support vector machine, the informative vector
machine, and the sparse pseudo-input GP classifier.]
Example: classification

            input x = (x1, . . . , xD) ∈ R^D,    output y ∈ {+1, −1}

A Bayesian approach to deciding which inputs are relevant: choose the model
(feature subset) with the highest evidence,

            m̂ = argmax_m p(D|m)

• The usual objection to using all the features (overfitting) does not apply to fully
  Bayesian methods, since they don't involve any fitting.

Note: Radford Neal won the NIPS feature selection competition using Bayesian
methods that used 100% of the features.
                  Feature Selection using ARD in GPs
Problem: Often there are many possible inputs that might be relevant to predicting
a particular output. We need algorithms that automatically decide which inputs are
relevant.
The parameter rd is the lengthscale of the function along input dimension d.
As rd → ∞, the function f varies less and less as a function of x^(d); that is, the dth
dimension becomes irrelevant. This is automatic relevance determination (ARD); see
the sketch below.
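A small hedged illustration (function names hypothetical): with a per-dimension-lengthscale
squared-exponential kernel, making one rd huge leaves the covariance, and hence the GP,
essentially constant along that dimension.

    import numpy as np

    def ard_kernel(A, B, r):
        """SE-ARD: exp(-1/2 sum_d ((a_d - b_d) / r_d)^2), one lengthscale per dimension."""
        d2 = ((A[:, None, :] - B[None, :, :]) / r) ** 2
        return np.exp(-0.5 * np.sum(d2, axis=-1))

    a = np.array([[0.0, 0.0]])
    b = np.array([[0.0, 5.0]])                       # differs only in dimension 2

    print(ard_kernel(a, b, np.array([1.0, 1.0])))    # small: dimension 2 matters
    print(ard_kernel(a, b, np.array([1.0, 1e6])))    # ~1: dimension 2 irrelevant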
[Figure: three panels of SVM decision boundaries on 2-d data, axes from −3 to 3]
                                    Summary

[Diagram repeated: from Linear/Logistic Regression, adding Bayesian inference and
kernels leads to GP Regression and GP Classification]
• Gaussian processes define distributions on functions which can be used for nonlinear regression,
  classification, ranking, preference learning, ordinal regression, etc.
• GPs are closely related to many other models. We can derive them from:
  – Bayesian kernel machines
  – Linear regression with basis functions
  – Infinite multi-layer perceptron neural networks
  – Spline models
• Compared to SVMs, GPs offer several advantages: learning the kernel and regularization
  parameters, integrated feature selection, fully probabilistic predictions, interpretability.
Appendix
                    An example of ARD for classification
Data set: a 6-dimensional data set with three relevant features and three irrelevant
features. For each data point xi, the relevant features depend on its class label:
xi^(1), xi^(2), xi^(3) ∼ N(yi, 1), while the irrelevant features do not:
xi^(4), xi^(5), xi^(6) ∼ N(0, 1).
[Figure: scatter plot of irrelevant feature x4 against relevant feature x1; the classes
separate along x1 but overlap along x4]
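A hedged snippet that generates exactly this synthetic data set (the sample size is an
arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100                                              # arbitrary sample size
    y = rng.choice([-1, 1], size=n)                      # binary class labels
    X_rel = y[:, None] + rng.standard_normal((n, 3))     # x1..x3 ~ N(y_i, 1)
    X_irr = rng.standard_normal((n, 3))                  # x4..x6 ~ N(0, 1)
    X = np.hstack([X_rel, X_irr])                        # 6-d inputs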
Result: ARD drives r4, r5, r6 → ∞, improving both the marginal likelihood and the
classification error rate compared to a single-lengthscale model.
            prior         p(θ|α)
            posterior     p(θ|α, D) ∝ p(y|X, θ) p(θ|α)
            evidence      p(y|X, α) = ∫ p(y|X, θ) p(θ|α) dθ
            prediction    p(y′|D, x′, α) = ∫ p(y′|x′, θ) p(θ|D, α) dθ

Let the weights from feature xd have variance αd⁻¹:   p(wdj|αd) = N(0, αd⁻¹)
• Qi, Y., Minka, T.P., Picard, R.W., and Ghahramani, Z. (2004) Predictive Automatic Relevance
  Determination by Expectation Propagation. In Twenty-first International Conference on
  Machine Learning (ICML-04). Banff, Alberta, Canada.
• Quiñonero-Candela, J. and Rasmussen, C.E. (2005) A unifying view of sparse approximate
  Gaussian process regression. Journal of Machine Learning Research 6:1939–1959.
• Naish-Guzman, A. and Holden, S. (2008) The generalized FITC approximation. Advances in
  Neural Information Processing Systems 20:1057–1064.
• Neal, R. M. (1996) Bayesian learning for neural networks. Springer Verlag.
• Neal, R. M. (1998). Regression and classification using Gaussian process priors (with discussion).
  In Bernardo, J. M. et al., editors, Bayesian statistics 6, pages 475-501. Oxford University Press.
• O’Hagan, A. (1978). Curve Fitting and Optimal Design for Prediction (with discussion). Journal
  of the Royal Statistical Society B, 40(1):1-42.
• Rasmussen, C.E. and Williams, C.K.I. (2006) Gaussian Processes for Machine Learning. MIT
  Press.
• Snelson, E. and Ghahramani, Z. (2006a) Sparse Gaussian Processes using Pseudo-Inputs. In
  Advances in Neural Information Processing Systems 18 (NIPS-2005).
• Snelson, E. and Ghahramani, Z. (2006b) Variable noise and dimensionality reduction for sparse
  Gaussian processes. In Uncertainty in Artificial Intelligence 22 (UAI).
• More information and code at: http://www.gaussianprocess.org/