Gaussian Processes Tutorial
Zoubin Ghahramani
       Department of Engineering
      University of Cambridge, UK
         zoubin@eng.cam.ac.uk
 http://learning.eng.cam.ac.uk/zoubin/
              MLSS 2011
                             Nonlinear regression
You want to learn a function f with error bars from data D = {X, y}
A Gaussian process defines a distribution over functions p(f) which can be used for
Bayesian regression:

                        p(f|D) = p(f) p(D|f) / p(D)
                              Gaussian Processes
f : X → ℝ.
Definition: p(f) is a Gaussian process if for any finite subset {x1, . . . , xn} ⊂ X,
the marginal distribution p(f(x1), . . . , f(xn)) is a multivariate Gaussian.
           Gaussian process covariance functions (kernels)
p(f) is a Gaussian process if for any finite subset {x1, . . . , xn} ⊂ X, the marginal
distribution p(f(x1), . . . , f(xn)) is a multivariate Gaussian. For a pair of points x, x′:

            p(f(x), f(x′)) = N(µ, Σ),   where

            µ = [ µ(x)  ]        Σ = [ K(x, x)   K(x, x′)  ]
                [ µ(x′) ]            [ K(x′, x)  K(x′, x′) ]

and similarly for p(f(x1), . . . , f(xn)), where now µ is an n × 1 vector and Σ is an
n × n matrix.
                 Gaussian process covariance functions
These kernel parameters are interpretable and can be learned from data. For example,
for the covariance function

            K(x, x′) = v0 exp{ −(|x − x′| / r)^α } + v1 + v2 δ(x, x′)

            v0   signal variance
            v1   variance of bias
            v2   noise variance
            r    lengthscale
            α    roughness
Once the mean and covariance functions are defined, everything else about GPs
follows from the basic rules of probability applied to multivariate Gaussians.
                    Samples from GPs with different K(x, x′)
[Figure: twelve panels of sample functions f(x) drawn from GP priors with different
covariance functions K(x, x′), for x from 0 to 100]
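For readers who want to reproduce such panels, here is a minimal NumPy sketch (not from
the original slides; the kernel parameters and the jitter term are illustrative assumptions):

    import numpy as np

    def kernel(X1, X2, v0=1.0, r=10.0, alpha=2.0):
        """K(x, x') = v0 * exp(-(|x - x'| / r)^alpha), cf. the kernel above."""
        d = np.abs(X1[:, None] - X2[None, :])
        return v0 * np.exp(-(d / r) ** alpha)

    x = np.linspace(0, 100, 200)
    K = kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability

    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    # each row of `samples` is one function drawn from the GP prior;
    # varying r (lengthscale) and alpha (roughness) reproduces the different panels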
         Using Gaussian processes for nonlinear regression
Model:                           yi = f(xi) + εi
                                 f  ∼ GP(·|0, K)
                                 εi ∼ N(·|0, σ²)
We can also compute the marginal likelihood (evidence) and use this to compare or
tune covariance functions:

                        p(y|X) = ∫ p(y|f, X) p(f) df
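For this model the marginal is Gaussian, p(y|X) = N(0, KN + σ²I) (see the GP regression
slide below), so the log evidence has a closed form. A hedged Cholesky-based sketch in
NumPy/SciPy, an implementation choice rather than the tutorial's own code:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def log_evidence(K, y, sigma2):
        """log N(y | 0, K + sigma^2 I), evaluated via a Cholesky factorization."""
        N = len(y)
        C = K + sigma2 * np.eye(N)
        L = cho_factor(C, lower=True)
        logdet = 2.0 * np.sum(np.log(np.diag(L[0])))
        return -0.5 * (y @ cho_solve(L, y) + logdet + N * np.log(2 * np.pi))

Comparing this quantity across kernels, or maximizing it over kernel parameters, is
exactly the "compare or tune" step mentioned above.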
                    Prediction using GPs with different K(x, x′)
[Figure: six panels of GP predictive means with error bars on the same data set,
under different covariance functions]
gpdemo
                        Gaussian process (GP) priors

N function values f = (f1, f2, . . . , fN ) with prior p(f|X) = N(0, KN )

[Figure: function values f1, . . . , fN plotted against x, with N × N covariance matrix KN ]

Covariance: Knn′ = K(xn, xn′ ; θ), hyperparameters θ, for example

            Knn′ = v exp[ −(1/2) Σ_{d=1}^{D} ( (xn^(d) − xn′^(d)) / rd )² ]
                                GP regression

marginal likelihood:       p(y|X) = N(0, KN + σ²I)

predictive distribution:   p(y∗|x∗, X, y) = N(µ∗, σ∗²)
                           µ∗  = K∗N (KN + σ²I)⁻¹ y
                           σ∗² = K∗∗ − K∗N (KN + σ²I)⁻¹ KN∗ + σ²

[Figure: training data y against x, and the predictive mean with error bars at a
test input x∗]
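These formulas translate directly into code. A hedged NumPy sketch (the kernel function,
passed in as an argument, and the noise level are assumptions; for 1-d inputs the
kernel sketched earlier would do):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def gp_predict(X, y, Xstar, kernel, sigma2):
        """Zero-mean GP predictive mean and variance at test inputs Xstar."""
        KN = kernel(X, X) + sigma2 * np.eye(len(X))   # K_N + sigma^2 I
        KsN = kernel(Xstar, X)                        # K_*N
        Kss = kernel(Xstar, Xstar)                    # K_**
        L = cho_factor(KN, lower=True)
        mu = KsN @ cho_solve(L, y)                               # mean mu_*
        var = np.diag(Kss - KsN @ cho_solve(L, KsN.T)) + sigma2  # variance sigma_*^2
        return mu, var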
                            GP learning the kernel
Consider the covariance function Kθ with hyperparameters θ = (v0, v1, r1, . . . , rD, α):

            Kθ(xi, xj) = v0 exp[ −Σ_{d=1}^{D} ( |xi^(d) − xj^(d)| / rd )^α ] + v1

The hyperparameters θ can be learned by maximizing the marginal likelihood p(y|X, θ).
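A hedged sketch of that learning step, using scipy's generic optimizer and a
log-parameterization (both implementation choices, not from the tutorial); α is
fixed at 2 for simplicity:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.optimize import minimize

    def neg_log_evidence(log_theta, X, y):
        """-log p(y|X, theta) for the ARD kernel above with alpha = 2."""
        v0, v1, sigma2 = np.exp(log_theta[:3])
        r = np.exp(log_theta[3:])                     # one lengthscale per dimension
        d = np.abs(X[:, None, :] - X[None, :, :]) / r
        K = v0 * np.exp(-np.sum(d ** 2, axis=-1)) + v1
        C = K + sigma2 * np.eye(len(y))
        L = cho_factor(C, lower=True)
        logdet = 2.0 * np.sum(np.log(np.diag(L[0])))
        return 0.5 * (y @ cho_solve(L, y) + logdet + len(y) * np.log(2 * np.pi))

    # toy data: dimension 2 is irrelevant, so its learned lengthscale should grow
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
    res = minimize(neg_log_evidence, np.zeros(3 + X.shape[1]), args=(X, y))
    print(np.exp(res.x[3:]))   # learned lengthscales r_1, r_2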
A multilayer perceptron (neural network) with infinitely many hidden units and
Gaussian priors on the weights converges to a GP (Neal, 1996).
                 Using Gaussian Processes for Classification
Binary classification problem: Given a data set D = {(xi, yi)}ni=1, with binary class
labels yi ∈ {−1, +1}, infer class label probabilities at new points.
[Figure: 1-d binary classification data with y = +1 and y = −1 labels, and a latent
function f over x]
There are many ways to relate function values fi = f(xi) to class probabilities:

            p(yi|fi) =  1 / (1 + exp(−yi fi))      sigmoid (logistic)
                        Φ(yi fi)                   cumulative normal (probit)
                        H(yi fi)                   threshold
                        ε + (1 − 2ε) H(yi fi)      robust threshold
Non-Gaussian likelihood, so we need to use approximate inference methods (Laplace, EP, MCMC).
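To make one of these options concrete: in the Laplace approximation, Newton iterations
find the mode of p(f|y), and a Gaussian centered there replaces the non-Gaussian
posterior. A hedged sketch for the logistic likelihood, following the standard
formulation in Rasmussen and Williams (2006) rather than this tutorial's code:

    import numpy as np

    def laplace_mode(K, y, n_iter=20):
        """Newton iterations for the mode of p(f|y); logistic likelihood, y in {-1,+1}."""
        N = len(y)
        f = np.zeros(N)
        for _ in range(n_iter):
            pi = 1.0 / (1.0 + np.exp(-f))        # p(y = +1 | f)
            g = (y + 1) / 2 - pi                 # gradient of log p(y|f)
            W = pi * (1 - pi)                    # negative Hessian (diagonal)
            b = W * f + g
            # f_new = (K^-1 + W)^-1 (W f + g), rewritten to avoid inverting K
            f = K @ np.linalg.solve(np.eye(N) + W[:, None] * K, b)
        return f   # mode f_hat; N(f_hat, (K^-1 + W)^-1) approximates p(f|y)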
                           Support Vector Machines
Consider soft-margin Support Vector Machines:

            min_w  (1/2)‖w‖² + C Σi (1 − yi fi)+

where (z)+ = max(0, z) is the hinge loss and fi = f(xi) = w · xi + w0. Kernelizing, with

            xi → φ(xi) = k(·, xi),     w → f(·),

this becomes

            min_f  (1/2) fᵀK⁻¹f + C Σi (1 − yi fi)+
        Support Vector Machines and Gaussian Processes
We can write the SVM objective as:

            min_f  (1/2) fᵀK⁻¹f + C Σi (1 − yi fi)+

and the negative log of the GP model (prior times likelihood) as:

            (1/2) fᵀK⁻¹f − Σi ln p(yi|fi) + c

Equivalent? No: the hinge loss (1 − yi fi)+ is not the negative log of any normalized
likelihood p(yi|fi).
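A tiny numerical comparison of the two per-point losses as a function of the margin
z = yi fi (illustrative only, not from the slides):

    import numpy as np

    z = np.linspace(-2.0, 3.0, 6)                # margins y_i * f_i
    hinge = np.maximum(0.0, 1.0 - z)             # SVM hinge loss
    logloss = np.log1p(np.exp(-z))               # -ln p(y_i|f_i), logistic link
    print(np.c_[z, hinge, logloss])
    # both penalize small margins, but the hinge is exactly zero for z >= 1,
    # while the logistic loss is strictly positive everywhere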
[Diagram: Linear Regression and Logistic Regression, their Bayesian variants
(Bayesian Linear/Logistic Regression), their kernelized variants (Kernel
Regression/Classification), and the combination of both: GP Regression and
GP Classification]
Matlab Demo: Gaussian Process Classification

    matlab/gpml-matlab/gpml-demo
    demo_ep_2d
    demo_gpr
         Sparse Approximations: Speeding up GP learning
   (Snelson and Ghahramani, 2006a, 2006b; Naish-Guzman and Holden, 2008)

We can approximate a GP through M < N inducing points f̄ to obtain this Sparse
Pseudo-input Gaussian process (SPGP) prior:

            p(f) = ∫ df̄ [ ∏n p(fn|f̄) ] p(f̄)

            GP prior: N(0, KN )   ≈   SPGP prior: p(f) = N(0, KNM KM⁻¹ KMN + Λ)

where Λ is diagonal. [Figure: full covariance ≈ low-rank matrix + diagonal matrix]

• Given data {X, y} with noise σ², the predictive mean and variance can be computed
  in O(M) and O(M²) per test case, respectively.

This builds on a large literature on sparse GPs (see Quiñonero-Candela and
Rasmussen, 2005).
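A hedged sketch of constructing the SPGP prior covariance, with
Λ = diag(KN − KNM KM⁻¹ KMN) as in Snelson and Ghahramani (2006a); the kernel and
pseudo-input placement are arbitrary choices:

    import numpy as np

    def se_kernel(A, B, v0=1.0, r=10.0):
        d = np.abs(A[:, None] - B[None, :])
        return v0 * np.exp(-0.5 * (d / r) ** 2)

    x = np.linspace(0, 100, 200)        # N = 200 training inputs
    xbar = np.linspace(0, 100, 10)      # M = 10 pseudo-inputs

    KN = se_kernel(x, x)
    KNM = se_kernel(x, xbar)
    KM = se_kernel(xbar, xbar) + 1e-8 * np.eye(len(xbar))

    Q = KNM @ np.linalg.solve(KM, KNM.T)    # low-rank part K_NM K_M^-1 K_MN
    Lam = np.diag(np.diag(KN - Q))          # diagonal correction
    K_spgp = Q + Lam                        # SPGP prior covariance, approx. K_N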
                                      Some Comparisons
[Table 1, contents not recovered: test errors and predictive accuracy (smaller is
better) for the GP classifier, the support vector machine, the informative vector
machine, and the sparse pseudo-input GP classifier.]
Example: classification

            input x = (x1, . . . , xD) ∈ R^D,    output y ∈ {+1, −1}

A Bayesian approach to deciding which inputs are relevant: choose the model
(feature subset) with the highest evidence,

            m̂ = argmax_m p(D|m)

• The usual objection to using all the features (overfitting) does not apply to fully
  Bayesian methods, since they don't involve any fitting.

Note: Radford Neal won the NIPS feature selection competition using Bayesian
methods that used 100% of the features.
                  Feature Selection using ARD in GPs
Problem: Often there are many possible inputs that might be relevant to predicting
a particular output. We need algorithms that automatically decide which inputs are
relevant.
The parameter rd is the lengthscale of the function along input dimension d.
As rd → ∞, the function f varies less and less as a function of x^(d); that is, the dth
dimension becomes irrelevant. This is automatic relevance determination (ARD); see
the sketch below.
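A small hedged illustration (function names hypothetical): with a per-dimension-lengthscale
squared-exponential kernel, making one rd huge leaves the covariance, and hence the GP,
essentially constant along that dimension.

    import numpy as np

    def ard_kernel(A, B, r):
        """SE-ARD: exp(-1/2 sum_d ((a_d - b_d) / r_d)^2), one lengthscale per dimension."""
        d2 = ((A[:, None, :] - B[None, :, :]) / r) ** 2
        return np.exp(-0.5 * np.sum(d2, axis=-1))

    a = np.array([[0.0, 0.0]])
    b = np.array([[0.0, 5.0]])                       # differs only in dimension 2

    print(ard_kernel(a, b, np.array([1.0, 1.0])))    # small: dimension 2 matters
    print(ard_kernel(a, b, np.array([1.0, 1e6])))    # ~1: dimension 2 irrelevant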
[Figure: three panels of SVM decision boundaries on 2-d data, axes from −3 to 3]
                                    Summary

[Diagram repeated: from Linear/Logistic Regression, adding Bayesian inference and
kernels leads to GP Regression and GP Classification]
• Gaussian processes define distributions on functions which can be used for nonlinear regression,
  classification, ranking, preference learning, ordinal regression, etc.
• GPs are closely related to many other models. We can derive them from:
  – Bayesian kernel machines
  – Linear regression with basis functions
  – Infinite multi-layer perceptron neural networks
  – Spline models
• Compared to SVMs, GPs offer several advantages: learning the kernel and regularization
  parameters, integrated feature selection, fully probabilistic predictions, interpretability.
Appendix
                    An example of ARD for classification
Data set: a 6-dimensional data set with three relevant features and three irrelevant
features. For each data point xi, the relevant features depend on its class label:
xi^(1), xi^(2), xi^(3) ∼ N(yi, 1), while the irrelevant features do not:
xi^(4), xi^(5), xi^(6) ∼ N(0, 1).
[Figure: scatter plot of irrelevant feature x4 against relevant feature x1; the classes
separate along x1 but overlap along x4]
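A hedged snippet that generates exactly this synthetic data set (the sample size is an
arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100                                              # arbitrary sample size
    y = rng.choice([-1, 1], size=n)                      # binary class labels
    X_rel = y[:, None] + rng.standard_normal((n, 3))     # x1..x3 ~ N(y_i, 1)
    X_irr = rng.standard_normal((n, 3))                  # x4..x6 ~ N(0, 1)
    X = np.hstack([X_rel, X_irr])                        # 6-d inputs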
Result: ARD drives r4, r5, r6 → ∞, improving both the marginal likelihood and the
classification error rate compared to a single-lengthscale model.
            prior         p(θ|α)
            posterior     p(θ|α, D) ∝ p(y|X, θ) p(θ|α)
            evidence      p(y|X, α) = ∫ p(y|X, θ) p(θ|α) dθ
            prediction    p(y′|D, x′, α) = ∫ p(y′|x′, θ) p(θ|D, α) dθ

Let the weights from feature xd have variance αd⁻¹:   p(wdj|αd) = N(0, αd⁻¹)
• Qi, Y., Minka, T.P., Picard, R.W., and Ghahramani, Z. (2004) Predictive Automatic Relevance
  Determination by Expectation Propagation. In Twenty-first International Conference on
  Machine Learning (ICML-04). Banff, Alberta, Canada.
• Quiñonero-Candela, J. and Rasmussen, C.E. (2005) A unifying view of sparse approximate
  Gaussian process regression. Journal of Machine Learning Research 6:1939–1959.
• Naish-Guzman, A. and Holden, S. (2008) The generalized FITC approximation. Advances in
  Neural Information Processing Systems 20:1057–1064.
• Neal, R. M. (1996) Bayesian learning for neural networks. Springer Verlag.
• Neal, R. M. (1998). Regression and classification using Gaussian process priors (with discussion).
  In Bernardo, J. M. et al., editors, Bayesian statistics 6, pages 475-501. Oxford University Press.
• O’Hagan, A. (1978). Curve Fitting and Optimal Design for Prediction (with discussion). Journal
  of the Royal Statistical Society B, 40(1):1-42.
• Rasmussen, C.E. and Williams, C.K.I. (2006) Gaussian Processes for Machine Learning. MIT
  Press.
• Snelson, E. and Ghahramani, Z. (2006a) Sparse Gaussian Processes using Pseudo-Inputs. In
  Advances in Neural Information Processing Systems 18 (NIPS-2005).
• Snelson, E. and Ghahramani, Z. (2006b) Variable noise and dimensionality reduction for sparse
  Gaussian processes. In Uncertainty in Artificial Intelligence 22 (UAI).
• More information and code at: http://www.gaussianprocess.org/