Tutorial: Gaussian process models
for machine learning
Ed Snelson (snelson@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, UCL
26th October 2006
Info
The GP book: Rasmussen and Williams, 2006
Basic GP (Matlab) code available:
http://www.gaussianprocess.org/gpml/
1
Gaussian process history
Prediction with GPs:
Time series: Wiener, Kolmogorov 1940s
Geostatistics: kriging, 1970s; naturally restricted to two or three
dimensional input spaces
Spatial statistics in general: see Cressie [1993] for overview
General regression: O'Hagan [1978]
Computer experiments (noise free): Sacks et al. [1989]
Machine learning: Williams and Rasmussen [1996], Neal [1996]
2
Nonlinear regression
Consider the problem of nonlinear regression:
You want to learn a function f with error bars from data D = {X, y}
A Gaussian process is a prior over functions p(f ) which can be used
for Bayesian regression:
p(f|D) = p(f) p(D|f) / p(D)
3
What is a Gaussian process?
Continuous stochastic process (random functions): a set of
random variables indexed by a continuous variable: f(x)
Set of inputs X = {x1, x2, . . . , xN }; corresponding set of random
function variables f = {f1, f2, . . . , fN }
GP: any set of function variables {fn} has a joint (zero mean)
Gaussian distribution:
p(f|X) = N(0, K)
Conditional model - density of inputs not modeled
Consistency: p(f1) = ∫ df2 p(f1, f2)
4
Covariances
Where does the covariance matrix K come from?
Covariance matrix constructed from covariance function:
Kij = K(xi, xj )
Covariance function characterizes correlations between different
points in the process:
K(x, x') = E[f(x)f(x')]
Must produce positive semidefinite covariance matrices: vᵀKv ≥ 0
Ensures consistency
5
Squared exponential (SE) covariance
K(x, x') = σ₀² exp[ −(x − x')² / (2ℓ²) ]
Intuition: function variables close in input space are highly
correlated, whilst those far away are uncorrelated
ℓ, σ₀ hyperparameters. ℓ: lengthscale, σ₀: amplitude
Stationary: K(x, x') = K(x − x'), invariant to translations
Very smooth sample functions: infinitely differentiable
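As a concrete illustration, here is a minimal NumPy sketch (rather than the Matlab GPML code linked above) of the SE covariance and of drawing sample functions from the zero-mean GP prior p(f|X) = N(0, K); the small jitter added to the diagonal is an assumption for numerical stability.

```python
import numpy as np

def se_cov(x1, x2, amp=1.0, lengthscale=1.0):
    """Squared exponential covariance: K(x, x') = amp^2 exp(-(x - x')^2 / (2 lengthscale^2))."""
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

# Draw three sample functions from the zero-mean GP prior p(f|X) = N(0, K)
X = np.linspace(-5.0, 5.0, 200)
K = se_cov(X, X) + 1e-8 * np.eye(len(X))   # jitter for numerical stability (assumption)
samples = np.random.default_rng(0).multivariate_normal(np.zeros(len(X)), K, size=3)
```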
6
Matern class of covariances
K(x, x') = [2^(1−ν) / Γ(ν)] (√(2ν) |x − x'| / ℓ)^ν K_ν( √(2ν) |x − x'| / ℓ )
where K_ν is a modified Bessel function.
Stationary, isotropic
ν → ∞: SE covariance
Finite ν: much rougher sample functions
ν = 1/2: K(x, x') = exp(−|x − x'|/ℓ), OU process, very rough
sample functions
7
Nonstationary covariances
Linear covariance: K(x, x') = σ₀² + x x'
Brownian motion (Wiener process): K(x, x') = min(x, x')
Periodic covariance: K(x, x') = exp[ −2 sin²((x − x')/2) / ℓ² ]
Neural network covariance
8
Constructing new covariances from old
There are several ways to combine covariances:
Sum: K(x, x') = K1(x, x') + K2(x, x')
addition of independent processes
Product: K(x, x') = K1(x, x') K2(x, x')
product of independent processes
Convolution: K(x, x') = ∫ dz dz' h(x, z) K(z, z') h(x', z')
blurring of process with kernel h
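A short sketch of how sum and product covariances might be composed in code, redefining the illustrative se_cov helper from the earlier snippet so the example stands alone; the linear covariance form and the hyperparameter defaults are assumptions for the example.

```python
import numpy as np

def se_cov(x1, x2, amp=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

def linear_cov(x1, x2, sigma0=1.0):
    return sigma0**2 + np.outer(x1, x2)

# Sums and products of valid covariances are again valid covariances
def sum_cov(x1, x2):
    return se_cov(x1, x2) + linear_cov(x1, x2)       # sum of independent processes

def product_cov(x1, x2):
    return se_cov(x1, x2) * linear_cov(x1, x2)       # product of independent processes
```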
9
Prediction with GPs
We have seen examples of GPs with certain covariance functions
General properties of covariances controlled by small number of
hyperparameters
Task: prediction from noisy data
Use GP as a Bayesian prior expressing beliefs about underlying
function we are modeling
Link to data via noise model or likelihood
10
GP regression with Gaussian noise
Data generated with Gaussian white noise around the function f
y = f + ε,   E[ε(x)ε(x')] = σ² δ(x − x')
Equivalently, the noise model, or likelihood, is:
p(y|f) = N(f, σ²I)
Integrating over the function variables gives the marginal likelihood:
p(y) = ∫ df p(y|f) p(f) = N(0, K + σ²I)
11
Prediction
N training input and output pairs (X, y), and T test inputs XT
Consider joint training and test marginal likelihood:
p(y, y_T) = N(0, K_{N+T} + σ²I),   K_{N+T} = [ K_N  K_{NT} ; K_{TN}  K_T ]
Condition on training outputs: p(y_T|y) = N(μ_T, Σ_T)
μ_T = K_{TN} [K_N + σ²I]⁻¹ y
Σ_T = K_T − K_{TN} [K_N + σ²I]⁻¹ K_{NT} + σ²I
Gives correlated predictions. Defines a predictive Gaussian process
12
Prediction
Often only marginal variances (diag Σ_T) are required: sufficient
to consider a single test input x_*:
μ_* = K_{*N} [K_N + σ²I]⁻¹ y
σ_*² = K_{**} − K_{*N} [K_N + σ²I]⁻¹ K_{N*} + σ²
Mean predictor is a linear predictor: μ_* = K_{*N} α, where α = [K_N + σ²I]⁻¹ y
Inversion of K_N + σ²I costs O(N³)
Prediction cost per test case is O(N ) for the mean and O(N 2) for
the variance
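The predictive equations above translate directly into code. The following is a minimal sketch assuming the SE covariance from earlier; the Cholesky factorization is one standard way to implement the matrix inversion, and the function and argument names are illustrative.

```python
import numpy as np

def se_cov(x1, x2, amp=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

def gp_predict(X, y, Xstar, amp=1.0, lengthscale=1.0, noise=0.1):
    """Predictive mean mu_* = K_{*N}[K_N + s^2 I]^{-1} y and marginal variance
    s_*^2 = K_{**} - K_{*N}[K_N + s^2 I]^{-1} K_{N*} + s^2."""
    KN = se_cov(X, X, amp, lengthscale) + noise**2 * np.eye(len(X))   # K_N + sigma^2 I
    KsN = se_cov(Xstar, X, amp, lengthscale)                          # K_{*N}
    L = np.linalg.cholesky(KN)                                        # O(N^3), done once
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = KsN @ alpha                                                # O(N) per test case
    V = np.linalg.solve(L, KsN.T)
    var = np.diag(se_cov(Xstar, Xstar, amp, lengthscale)) - np.sum(V**2, axis=0) + noise**2
    return mean, var                                                  # O(N^2) per test case for the variance
```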
13
Determination of hyperparameters
Advantage of the probabilistic GP framework: the ability to choose
hyperparameters and covariances directly from the training data
Other models, e.g. SVMs, splines etc. require cross validation
GP: minimize negative log marginal likelihood L(θ) wrt
hyperparameters θ and noise level σ:
L(θ) = −log p(y|θ) = ½ log det C(θ) + ½ yᵀ C⁻¹(θ) y + (N/2) log(2π)
where C = K + σ²I
Uncertainty in the function variables f is taken into account
14
Gradient based optimization
Minimization of L(θ) is a non-convex optimization task
Standard gradient based techniques, such as CG or quasi-Newton
Gradients:
∂L/∂θ_i = ½ tr( C⁻¹ ∂C/∂θ_i ) − ½ yᵀ C⁻¹ (∂C/∂θ_i) C⁻¹ y
Local minima, but usually not much of a problem with few
hyperparameters
Use weighted sums of covariances and let ML choose
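A sketch of how the objective and one of its gradients might be computed, assuming the SE covariance and a log-parameterization of the hyperparameters; only the lengthscale gradient is shown, and the function and variable names are illustrative.

```python
import numpy as np

def se_cov(x1, x2, amp, lengthscale):
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

def nlml_and_ls_grad(params, X, y):
    """L = 1/2 log det C + 1/2 y^T C^{-1} y + N/2 log(2 pi), with C = K + sigma^2 I,
    plus dL/d(log lengthscale) via the trace/quadratic-form gradient formula."""
    amp, ls, noise = np.exp(params)            # log-parameterization keeps values positive (assumption)
    N = len(X)
    K = se_cov(X, X, amp, ls)
    C = K + noise**2 * np.eye(N)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    nlml = np.sum(np.log(np.diag(L))) + 0.5 * y @ alpha + 0.5 * N * np.log(2 * np.pi)

    d2 = (X[:, None] - X[None, :])**2
    dC = K * d2 / ls**2                        # dC/d(log lengthscale) for the SE covariance
    Cinv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(N)))
    grad_ls = 0.5 * np.trace(Cinv @ dC) - 0.5 * alpha @ dC @ alpha
    return nlml, grad_ls
```

The returned value and gradient can then be handed to a standard CG or quasi-Newton optimizer.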
15
Automatic relevance determination1
The ARD SE covariance function for multi-dimensional inputs:
K(x, x') = σ₀² exp[ −½ Σ_{d=1}^{D} ((x_d − x'_d) / ℓ_d)² ]
Learn an individual lengthscale hyperparameter ℓ_d for each input
dimension x_d
ℓ_d determines the relevancy of input feature d to the regression
If ℓ_d is very large, then the feature is irrelevant
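A possible NumPy sketch of the ARD SE covariance for multi-dimensional inputs; the function and argument names are illustrative.

```python
import numpy as np

def ard_se_cov(X1, X2, amp=1.0, lengthscales=None):
    """ARD squared exponential: one lengthscale per input dimension.

    K(x, x') = amp^2 exp(-1/2 sum_d ((x_d - x'_d) / l_d)^2)
    """
    D = X1.shape[1]
    lengthscales = np.ones(D) if lengthscales is None else np.asarray(lengthscales)
    d = (X1[:, None, :] - X2[None, :, :]) / lengthscales   # scaled differences per dimension
    return amp**2 * np.exp(-0.5 * np.sum(d**2, axis=-1))
```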
1 Neal, 1996. MacKay, 1998.
16
Relationship to generalized linear regression
Weighted sum of fixed finite set of M basis functions:
f(x) = Σ_{m=1}^{M} w_m φ_m(x) = wᵀ φ(x)
Place Gaussian prior on weights: p(w) = N(0, Σ_w)
Defines GP with finite rank M (degenerate) covariance function:
K(x, x') = E[f(x)f(x')] = φᵀ(x) Σ_w φ(x')
Function space vs. weight space interpretations
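To make the weight-space view concrete, here is a small sketch (with an assumed polynomial basis and unit prior covariance) checking empirically that weights drawn from p(w) = N(0, Σ_w) induce the finite-rank covariance Φ Σ_w Φᵀ on f = Φw.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 50)
Phi = np.column_stack([np.ones_like(X), X, X**2])   # M = 3 fixed basis functions (assumption)
Sigma_w = np.eye(3)                                 # Gaussian prior p(w) = N(0, Sigma_w)
K = Phi @ Sigma_w @ Phi.T                           # degenerate rank-3 covariance

# Monte Carlo check: sample covariance of f = Phi w approaches K as the number of draws grows
W = rng.multivariate_normal(np.zeros(3), Sigma_w, size=20000)
F = W @ Phi.T
print(np.abs(np.cov(F, rowvar=False) - K).max())    # small relative to the scale of K
```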
17
Relationship to generalized linear regression
General GP: can specify the covariance function directly rather than
via a set of basis functions
Mercer's theorem: can always decompose covariance function into
eigenfunctions and eigenvalues:
K(x, x') = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x')
If sum finite, back to linear regression. Often sum infinite, and no
analytical expressions for eigenfunctions
Power of kernel methods in general (e.g. GPs, SVMs etc.):
project x ↦ φ(x) into high or infinite dimensional feature space
and still handle computations tractably
18
Relationship to neural networks
Neural net with one hidden layer of N_H units:
f(x) = b + Σ_{j=1}^{N_H} v_j h(x; u_j)
h: bounded hidden layer transfer function
(e.g. h(x; u) = erf(uᵀx))
If the v_j and b are zero mean and independent, and the weights u_j are iid, then the CLT
implies NN → GP as N_H → ∞ [Neal, 1996]
NN covariance function depends on transfer function h, but is in
general non-stationary
19
Relationship to spline models
Univariate cubic spline has cost functional:
Σ_{n=1}^{N} (f(x_n) − y_n)² + ∫₀¹ f''(x)² dx
Can give probabilistic GP interpretation by considering RH term
as an (improper) GP prior
Make proper by weak penalty on zeroth and first derivatives
Can derive a spline covariance function, and full GP machinery can
be applied to spline regression (uncertainties, ML hyperparameters)
Penalties on derivatives are equivalent to specifying the inverse
covariance function: no natural marginalization
20
Relationship to support vector regression
[Figure: negative log likelihood −log p(y|f) as a function of f: quadratic for GP regression vs. ε-insensitive for SV regression]
The ε-insensitive error function can be considered as a non-Gaussian
likelihood or noise model. Integrating over f becomes intractable
SV regression can be considered as the MAP solution f_MAP of a GP with
ε-insensitive error likelihood
Advantages of SVR: naturally sparse solution by QP, robust to
outliers. Disadvantages: uncertainties not taken into account, no
predictive variances, no learning of hyperparameters by ML
21
Logistic and probit regression
Binary classification task: y = ±1
GLM likelihood: p(y = +1|x, w) = π(x) = σ(xᵀw)
σ(z): a sigmoid function such as the logistic or cumulative normal
Weight space viewpoint: prior p(w) = N(0, Σ_w)
Function space viewpoint: let f(x) = wᵀx, then likelihood π(x) = σ(f(x)),
Gaussian prior on f
22
GP classification
1. Place a GP prior directly on f (x)
2. Use a sigmoidal likelihood: p(y = +1|f) = σ(f)
Just as for SVR, the non-Gaussian likelihood makes integrating over f
intractable:
p(f_*|y) = ∫ df p(f_*|f) p(f|y)
where the posterior p(f|y) ∝ p(y|f) p(f)
Make tractable by using a Gaussian approximation to posterior.
Then prediction:
p(y_* = +1|y) = ∫ df_* σ(f_*) p(f_*|y)
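If the sigmoid is taken to be the cumulative normal (probit), this predictive integral against a Gaussian approximation q(f_*) = N(μ, s²) has a closed form; the sketch below assumes that probit choice, and the function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def probit_predict(mu, s2):
    """p(y* = +1 | y) = integral of Phi(f*) N(f*; mu, s2) df* = Phi(mu / sqrt(1 + s2))."""
    return norm.cdf(mu / np.sqrt(1.0 + s2))
```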
23
GP classification
Two common ways to make Gaussian approximation to posterior:
1. Laplace approximation. Second order Taylor approximation about
mode of posterior
2. Expectation propagation (EP)2. EP can be thought of as
approximately minimizing KL[p(f |y)||q(f |y)] by an iterative
procedure.
Kuss and Rasmussen [2005] evaluate both methods experimentally
and find EP to be significantly superior
Classification accuracy on digits data sets comparable to SVMs.
Advantages: probabilistic predictions, hyperparameters by ML
2 Minka, 2001
24
GP latent variable model (GPLVM)3
Probabilistic model for dimensionality reduction: data is set of
high dimensional vectors: y1, y2, . . . , yN
Model each dimension (k) of y as an independent GP with
unknown low dimensional latent inputs: x1, x2, . . . , xN
p({y_n}|{x_n}) ∝ ∏_k exp[ −½ w_k² Y_kᵀ K⁻¹ Y_k ]
Maximize likelihood to find latent projections {xn} (not a true
density model for y because cannot tractably integrate over x).
Smooth mapping from x to y with uncertainties
3 Lawrence, 2004
25
Application: GPLVMs for human pose modeling4
High dimensional data points are feature vectors derived from pose
information from mo-cap data.
Features: joint angles, vertical orientation, velocity and
accelerations
GPLVM used to learn low-dimensional trajectories of e.g. baseball
pitch, basketball jump shot
GPLVM predictive distribution used to make cost function for
finding new poses with constraints
4 Grochow, Martin, Hertzmann, Popovic, 2004. Style-based inverse kinematics. Demos available from http://grail.cs.washington.edu/projects/styleik/
26
Sparse GP approximations
Problem for large data sets: training a GP costs O(N³), prediction O(N²)
per test case
In recent years many approximations have been developed that reduce the cost to
O(NM²) for training and O(M²) for prediction per test case
Based around a low rank (M ) covariance approximation
See Quinonero Candela and Rasmussen [2005] for a review of
regression approximations
Classification more complicated, so simpler approximations such
as IVM5 may be more suitable
5 Lawrence et al., 2003
27
Two stage generative model
Pseudo-input prior: p(f̄|X̄) = N(0, K_M)
1. Choose any set of M pseudo-inputs X̄
2. Draw corresponding function values f̄ from the prior
28
Two stage generative model
Conditional: p(f|f̄) = N(μ, Σ)
μ = K_{NM} K_M⁻¹ f̄
Σ = K_N − K_{NM} K_M⁻¹ K_{MN}
3. Draw f conditioned on f̄
This two stage procedure defines exactly the same GP prior
We have not gained anything yet, but it inspires a sparse
approximation ...
29
Factorized approximation
Single point conditional: p(f_n|f̄) = N(μ_n, λ_n)
μ_n = K_{nM} K_M⁻¹ f̄
λ_n = K_{nn} − K_{nM} K_M⁻¹ K_{Mn}
Approximate: p(f|f̄) ≈ ∏_n p(f_n|f̄) = N(μ, Λ),   Λ = diag(λ)
Minimum KL: min_{q_n} KL[ p(f|f̄) ‖ ∏_n q_n(f_n) ]
30
Sparse pseudo-input Gaussian processes (SPGP)6
Integrate out f̄ to obtain the SPGP prior: p(f) = ∫ df̄ [ ∏_n p(f_n|f̄) ] p(f̄)
GP prior: N(0, K_N).   SPGP/FITC prior: p(f) = N(0, K_{NM} K_M⁻¹ K_{MN} + Λ)
[Figure: the two prior covariance matrices, with the FITC covariance drawn as a low rank term plus a diagonal]
SPGP/FITC covariance inverted in O(M²N). Predictive mean and
variance computed in O(M) and O(M²) per test case respectively
SPGP = GP with non-stationary covariance parameterized by X̄
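A minimal sketch of constructing the low-rank-plus-diagonal SPGP/FITC prior covariance, assuming the SE covariance from earlier; the jitter on K_M and the function names are illustrative.

```python
import numpy as np

def se_cov(x1, x2, amp=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / lengthscale)**2)

def fitc_prior_cov(X, Xbar, amp=1.0, lengthscale=1.0, jitter=1e-6):
    """SPGP/FITC prior covariance: Q_N + Lambda, where
       Q_N = K_NM K_M^{-1} K_MN and Lambda = diag(K_N - Q_N)."""
    KM = se_cov(Xbar, Xbar, amp, lengthscale) + jitter * np.eye(len(Xbar))
    KNM = se_cov(X, Xbar, amp, lengthscale)
    QN = KNM @ np.linalg.solve(KM, KNM.T)
    Lam = np.diag(np.diag(se_cov(X, X, amp, lengthscale) - QN))
    return QN + Lam
```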
6 Snelson and Ghahramani, 2005
31
1D demo
[Figure: 1D SPGP fit, with bars indicating the amplitude, lengthscale, and noise hyperparameters]
Initialize adversarially: amplitude and lengthscale too big
noise too small
pseudo-inputs bunched up
32
1D demo
[Figure: 1D SPGP fit after optimization, with bars indicating the amplitude, lengthscale, and noise hyperparameters]
Pseudo-inputs and hyperparameters optimized
32
Future GP directions
Design of covariance functions to incorporate more specific prior
knowledge
Beyond vectorial input data: structure in the input domain
Further improvements to sparse GP approximations to scale GPs
up for very large data sets
Beyond regression and classification, e.g. applications of latent
variable models such as GPLVM
33