

Machine Learning

Kernel Ridge Regression

Siamac Fazli

Nazarbayev University – Dept. of Computer Science


Overview

• Psychological Models of Categorization
• Kernel Methods
• Kernel Ridge Regression
• Generalization and Model Selection using Cross-Validation


Categorization Revisited

[Figure: two scatter plots of positive and negative examples in the (x1, x2) plane.
Left: some data is linearly separable. Right: other data is not separable.]


Categorization Revisited

[Figure: scatter plot of positive and negative examples that are linearly non-separable.]

Non-linear models (such as NNs) can learn linearly non-separable problems.
Kernel methods can implicitly model arbitrarily complex hidden layers.
Psychological models of learning are very similar to kernel methods.


Categorization Revisited

Psychological experiments show [Shepard, 1987]:


We can learn arbitrary concepts/categorizations
(i.e. also non-linear classification boundaries)
Central question: How do our brains do that?
Simpler question: What is our perceptual space?
• Do we extract non-linear features?
• How are perceptual similarities related to physical similarity?
• How can we measure internal similarities?

Categorization Revisited

Ivan Pavlov

Pavlovian Conditioning (example from [Jäkel et al., 2008]):
1. Condition a dog to salivate after a 1000 Hz tone
2. Present other tones (900 Hz, 1100 Hz, . . . )
3. Measure the amount of saliva secreted
4. Perceptual similarity ∝ amount of saliva

Nobel Prize 1904

Categorization Revisited

Human categorization behavior:
1. Generate two classes of stimuli
2. Train humans to categorize stimuli
3. Generate new stimuli
4. Test categorization on new stimuli
5. Human categorization reveals perceptual space

Example of a Perceptual Space
[Figure: reproduced from Jäkel, Schölkopf & Wichmann (2008), Figure 1 – an illustration of a
perceptual space. The stimuli are circles with a spoke and vary on two dimensions (perceived
angle and perceived size); two artificial categories are depicted, separated by a linear decision
boundary.]

Sheperd’s Law of Universal Generalization

Perceptual Similarity of new data x decays


exponentially with distance from prototype µ

8 / 42
Human Categorization Kernel Methods Kernel Ridge Regression Cross-Validation Summary

Sheperd’s Law of Universal Generalization

Perceptual Similarity of new data x decays


exponentially with distance from prototype µ

µ Gaussian Kernel
k(µ,x)
σ

−(µ−x )2
x k(µ, x ) = e σ (1)
Gaussian Kernel in R1

8 / 42

Connection To Machine Learning?

We can learn complex non-linear decision boundaries


The internal feature space is not clear
The similarity measure we use internally is not clear
We can investigate it by presenting examples
We can train/modify it by presenting examples
⇒ The same holds true for kernel methods!


Multilayer Perceptron

• Imagine one hidden layer
• Each unit stores a copy of one training data point xi
• Instead of f(w⊤x), each unit computes a similarity k(xi, x)
• k(xi, x) is a kernel function
• The kernel function in eq. (1) is a Gaussian of width σ on each data point

Any non-linear function (e.g. category borders) can be implemented as a
linear combination of similarities to all previously seen data points.

[Figure: reproduced from Jäkel, Schölkopf & Wichmann (2008), Figure 3 – an RBF network
calculates a weighted sum w1 k(x, x1) + w2 k(x, x2) + w3 k(x, x3) over the similarities to the
stored exemplars; a small network with three exemplars is depicted.]

Kernel Trick

The kernel trick is a powerful machine learning tool


[Aizerman et al., 1964; Müller et al., 2001; Schölkopf and Smola,
2002; Shawe-Taylor and Cristianini, 2004]

• Psychological intuition [Jaekel, 2007]:


New predictions of arbitrarily complex categorizations require
only comparisons with previously learned examples
• Mathematical intuition:
Optimal solution lies in the space spanned by the data
→ Even in an infinite dimensional feature space we need only
estimate N parameters (N is the number of data points)


Why Kernel Methods?

Kernel methods are powerful


Kernel methods can fit arbitrary functions
Often only a small number of kernel parameters needs to be
optimized (for MLPs we need to optimize more parameters)

Kernel methods are (sometimes) very efficient


Even if you don’t need to model non-linear dependencies
→ When you have less data points than your data has
dimensions kernel methods can offer a dramatic speedup


Kernel Trick

Kernel Trick
Prediction (e.g. category membership) ŷ
given a new data point xnew can be expressed
as linear combination of data points xi

ŷ = f(xnew) = Σ_{i=1}^{N} kφ(xnew, xi) αi        (2)

where i ∈ {1, . . . , N} are the indices of previously seen data points

Similarities between data points are stored in the kernel matrix K

Kij = k(xi , xj ): similarity between ith and jth data point
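As a minimal illustration of eq. (2) – a sketch in NumPy, assuming a Gaussian kernel of width 1 as an example; this is not code from the lecture – the prediction is simply a weighted sum of similarities to the stored training points:

import numpy as np

def kernel_predict(x_new, X_train, alpha, k):
    # eq. (2): y_hat = f(x_new) = sum_i k(x_new, x_i) * alpha_i
    # X_train holds one training point per row, alpha holds the N weights alpha_i
    return sum(k(x_new, x_i) * a_i for x_i, a_i in zip(X_train, alpha))

# example kernel: a Gaussian with sigma = 1
k = lambda a, b: float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2)))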


Some Popular Kernel Functions

Linear Kernel

k(xi, xj) = xi⊤ xj        (3)

Polynomial Kernel

k(xi, xj) = (xi⊤ xj + c)^p        (4)

Gaussian Kernel

k(xi, xj) = e^(−||xi − xj||² / (2σ²))        (5)
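The linear and polynomial kernels can be computed for a whole data matrix at once. A small NumPy sketch, assuming the slides' convention that samples are the columns of X ∈ R^{D×N} (the values of c and p below are arbitrary example choices); the Gaussian kernel is computed in Algorithm 1 further below:

import numpy as np

def linear_kernel(X, Z):
    # X: (D, n), Z: (D, m), samples as columns -> K[i, j] = x_i^T z_j, eq. (3)
    return X.T @ Z

def polynomial_kernel(X, Z, c=1.0, p=3):
    # eq. (4): (x_i^T z_j + c)^p, applied elementwise
    return (X.T @ Z + c) ** p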


Non-linear feature mapping

Example on the board
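The board example is not reproduced here, but one standard example of a non-linear feature mapping (an assumption on my part, not necessarily the one shown in class): for x ∈ R², the map φ(x) = (x1², √2·x1·x2, x2²) turns the polynomial kernel (x⊤z)² into an ordinary inner product, φ(x)⊤φ(z) = (x⊤z)², so a linear method in feature space acts non-linearly in input space.

import numpy as np

def phi(x):
    # explicit feature map whose inner product equals the polynomial kernel (x^T z)^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), (x @ z) ** 2)   # both give (approximately) 1.0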


Graphical explanation for non-linear feature mapping

taken from [Müller et al., 2001]


Kernel Methods - Pros and Cons

+ Powerful Modeling Tool


(non-linear problems become linear in kernel space)
+ Omnipurpose Kernels
(Gaussian works well in many cases)
+ Algorithms are independent of kernel!

– Difficult to understand what’s happening in kernel space


– Model complexity increases with number of data points
→ If you have too much data, kernel methods can be slow


Kernel Methods - Training and Testing in Practice


• Training and testing data X ∈ R^{D×N}, y ∈ R^{1×N}
• Compute the kernel on the training data

K = k(Xtrain, Xtrain)        (6)

• Training
Compute weights α ∈ R^{Ntrain×1}, one for each training sample

α = someKernelAlgorithm(K, ytrain)        (7)

• Testing
Compute predictions on test data Xtest

ŷtest = (k(Xtest, Xtrain) α)⊤        (8)


From Linear to Kernel Ridge Regression

board

19 / 42
Human Categorization Kernel Methods Kernel Ridge Regression Cross-Validation Summary

From Linear to Kernel Ridge Regression


In linear ridge regression we look for a linear combination of features w
that minimizes the prediction error and has a small norm:

E_RR(w) = ||y − w⊤X||²₂ + λ ||w||²₂        (9)

Computing its derivative and setting it to zero,

∂E_RR(w)/∂w = −2Xy⊤ + 2XX⊤w + 2λw = 0        (10)

→ w = X · (1/λ)(y⊤ − X⊤w) = Xa,   with a := (1/λ)(y⊤ − X⊤w)        (11)

we see that the optimal w is a linear combination of the data points X, with coefficients a.


board


From Linear to Kernel Ridge Regression

a = (1/λ)(y⊤ − X⊤w)        (substituting w = Xa)
λa = y⊤ − X⊤Xa
y⊤ = (X⊤X + λI) a
→ a = (X⊤X + λI)⁻¹ y⊤        (12)

K = X⊤X is called the kernel matrix – here it is just a linear kernel


From Linear to Kernel Ridge Regression

a = (1/λ)(y⊤ − X⊤w)
a = (K + λI)⁻¹ y⊤        (13)

K is called the kernel matrix – here it is just a linear kernel

Kij denotes the similarity between data points xi and xj

Predictions (linear kernel):   ŷ = w⊤X = (Xa)⊤X = a⊤X⊤X = (Ka)⊤

We derived kernel ridge regression only for K = X⊤X.
But eq. (13) works for all kernel functions!
All that changes is how you compute the kernel K.
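A quick numerical check of eq. (13) (a sketch; columns of X are samples, as in the derivation): with the linear kernel K = X⊤X, the kernel ridge predictions (Ka)⊤ coincide with the ordinary ridge regression predictions w⊤X.

import numpy as np

rng = np.random.default_rng(0)
D, N, lam = 3, 10, 0.5
X = rng.standard_normal((D, N))          # columns are samples
y = rng.standard_normal((1, N))

K = X.T @ X                                               # linear kernel
a = np.linalg.solve(K + lam * np.eye(N), y.T)             # eq. (13)
w = np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y.T)   # ridge regression weights

print(np.allclose((K @ a).T, w.T @ X))   # True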


Kernel Ridge Regression vs. Ridge Regression

For many algorithms there are kernel versions [Müller et al., 2001].

Ridge Regression:          w = (XX⊤ + λI)⁻¹ X y⊤

Kernel Ridge Regression:   w = Xa,   a = (X⊤X + λI)⁻¹ y⊤

When should we use one or the other?


Covariance and Kernel Matrix


Given a data matrix X ∈ R^{D×N} (D dimensions, N samples):

Covariance Matrix
• Covariance matrices grow quadratically with D
→ use for low dimensional data

(Linear) Kernel Matrix
• Kernel matrices grow quadratically with N
→ use for high dimensional data
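A small timing sketch of this trade-off (illustrative sizes, not from the lecture): when D ≫ N, solving the N×N kernel-side system is much cheaper than the D×D covariance-side system, even though both yield the same weights.

import numpy as np, time

rng = np.random.default_rng(0)
D, N, lam = 4000, 200, 1.0                            # high-dimensional case: D >> N
X = rng.standard_normal((D, N))                       # columns are samples
y = rng.standard_normal((1, N))

t0 = time.perf_counter()
np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y.T)   # ridge regression: D x D solve
t1 = time.perf_counter()
np.linalg.solve(X.T @ X + lam * np.eye(N), y.T)       # kernel ridge regression: N x N solve
t2 = time.perf_counter()
print(f"covariance side: {t1 - t0:.3f}s    kernel side: {t2 - t1:.3f}s")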


Uncertainty for Kernel Ridge Regression Predictions

• We can fit arbitrary functions with kernel ridge regression


• But: How uncertain are we about the predictions?
• We can obtain these uncertainties from kernel ridge regression
• Assuming that the φ(x) follow a Gaussian distribution,
Kernel Ridge Regression is a Gaussian Process


Gaussian Processes

A Gaussian Process
• describes a distribution over functions
• is completely specified by its mean and covariance function

µ(x) = E[φ(x)]        (the mean function – these are the kernel ridge regression predictions)
k(xp, xq) = E[(φ(xp) − µ(xp))(φ(xq) − µ(xq))]
f(x) ∼ GP(µ(x), k(xp, xq))


Gaussian Processes
There is a lot to know about Gaussian Processes (GPs),
most of which can be found in [Rasmussen and Williams, 2005]:

If you are interested in GPs, check Chapter 2 of Rasmussen and Williams.

It is free, and you can find it here:
http://www.gaussianprocess.org/gpml/chapters/

Gaussian Processes

Consider a toy data example:

f_true(x) = cos(x),    f_measured = f_true + ε,    ε ∼ N(0, 0.1)

[Figure: noisy measurements of f_measured and the GP fit f(x), shown for σ = 1.0 and a second
(regularization) parameter of 0.010.]

Computation of Gaussian Kernel

Algorithm 1 Gaussian Kernel


Require: Data x1 , . . . , xn = X , x̂1 , . . . , x̂m = X̂ , kernel width σ
1: H = repmat(sum(X .∗ X, 2), 1, m) − 2 X X̂⊤ + repmat(sum(X̂ .∗ X̂, 2)⊤, n, 1)
2: K = exp(−H ./ 2 ./ σ²)
3: return K

notes:
• x indicates train data and x̂ indicates test data
• the first and second lines calculate the Gaussian kernel between train and test
data all at once (K ∈ Rn×m )

HW: think about line 1 for 5min-1h
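As a hint for the homework (a sketch of one way to read line 1, not the official solution): line 1 computes all pairwise squared distances at once via the expansion ||x − x̂||² = ||x||² − 2 x⊤x̂ + ||x̂||². In NumPy, with one sample per row as in Algorithm 1:

import numpy as np

def gaussian_kernel(X, Xhat, sigma):
    # X: (n, d) train data, Xhat: (m, d) test data, one sample per row
    # H[i, j] = ||x_i||^2 - 2 x_i^T xhat_j + ||xhat_j||^2 = ||x_i - xhat_j||^2
    H = (X**2).sum(1)[:, None] - 2 * X @ Xhat.T + (Xhat**2).sum(1)[None, :]
    return np.exp(-H / (2 * sigma**2))   # K in R^{n x m}, as in line 2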


Kernel Ridge Regression Algorithm

Algorithm 2 Kernel Ridge Regression - Training


Require: Kernel matrix K ∈ R^{N×N}, labels y = [y1, . . . , yN]⊤ ∈ R^N, ridge κ
Ensure: a
1: # Compute dual coefficients
2: a = (K + κI)⁻¹ y⊤

Algorithm 3 Kernel Ridge Regression - Prediction


Require: Training Data Xtrain ∈ RD×Ntrain , Testing Data Xtest ∈ RD×Ntest , weights a,
regularizer κ, kernel parameter(s) Φ, kernel function k(., .)
Ensure: predicted output ŷ
1: Compute Ktrain = k(Xtrain , Xtrain , Φ)
2: Compute K = k(Xtest , Xtrain , Φ)
3: ŷ = (Ka)⊤
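A compact NumPy sketch of Algorithms 2 and 3 (my own illustration, reusing the gaussian_kernel sketch above and therefore assuming one sample per row, i.e. the transpose of the slides' D×N convention):

import numpy as np

def krr_train(K, y, kappa):
    # Algorithm 2: dual coefficients a = (K + kappa * I)^{-1} y^T
    return np.linalg.solve(K + kappa * np.eye(K.shape[0]), np.asarray(y).reshape(-1, 1))

def krr_predict(X_train, X_test, a, sigma):
    # Algorithm 3: y_hat = (K a)^T with K = k(X_test, X_train)
    K = gaussian_kernel(X_test, X_train, sigma)   # (N_test, N_train)
    return (K @ a).T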


Kernel Ridge Regression Algorithm

Algorithm 4 Gaussian Process - Prediction


Require: Training Data Xtrain ∈ RD×Ntrain , Testing Data Xtest ∈ RD×Ntest , weights a,
regularizer κ, kernel parameter(s) Φ, kernel function k(., .)
Ensure: Parameters of p(y |Xtrain , Xtest , Φ, κ) ∼ N (µ, Σ)
1: Compute Ktest = k(Xtest , Xtest , Φ)
2: Compute Ktrain = k(Xtrain , Xtrain , Φ)
3: Compute K = k(Xtest , Xtrain , Φ)
4: Compute mean of predicted label distribution p(y |Xtrain , Xtest , Φ, κ)
5: µ = (Ka)⊤
6: Compute covariance of p(y |Xtrain , Xtest , Φ, κ)
7: Σ = Ktest − K(Ktrain + κI)⁻¹ K⊤
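A matching NumPy sketch of Algorithm 4 (again reusing the gaussian_kernel sketch with one sample per row; an illustration, not the lecture's reference code):

import numpy as np

def gp_predict(X_train, y_train, X_test, sigma, kappa):
    # predictive mean and covariance of p(y | X_train, X_test, sigma, kappa)
    K_train = gaussian_kernel(X_train, X_train, sigma)
    K_test  = gaussian_kernel(X_test, X_test, sigma)
    K       = gaussian_kernel(X_test, X_train, sigma)              # cross-kernel
    a  = np.linalg.solve(K_train + kappa * np.eye(len(X_train)), np.asarray(y_train).reshape(-1, 1))
    mu = (K @ a).ravel()                                           # same as the KRR prediction
    Sigma = K_test - K @ np.linalg.solve(K_train + kappa * np.eye(len(X_train)), K.T)
    return mu, Sigma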


Hyperparameter Selection

For kernel ridge regression / GPs we need to specify


• Kernel parameter(s) σ
• Ridge parameter κ

These are called hyperparameters


How can we choose them optimally?


Hyperparameter Selection For Gaussian Processes

Different hyperparameters lead to different solutions

[Figure: GP fits f(x) to the toy data for kernel widths σ ∈ {0.5, 1.0, 2.0} and two settings of the
regularization parameter (0.100 and 0.500); the fitted function changes considerably across the grid.]


Generalization and Model Selection

Generalization: Correct categorization/prediction of new data

Similarity and Generalization are complementary concepts


[Jäkel et al., 2008; Shepard, 1987]:

Good algorithms generalize well to new data


Generalization and Model Selection

The best model is the model that generalizes best

[Figure: decision boundaries on the two-class toy data for kernel widths σ ∈ {0.2, 0.5, 2.0} and
regularizers κ ∈ {10⁻⁶, 100}.]

Overfitting (σ = 0.2): model is too complex → bad generalization
Better fit (σ = 0.5): appropriate complexity → good generalization
Underfitting (σ = 2.0): model is too simple → bad generalization

σ – kernel width,   κ – regularizer


How to Achieve Good Generalization?

When using powerful algorithms (MLPs, KRR, . . . )


every data set can be modeled perfectly! (overfitting)
But we want to model new data well (generalization)
Cross-validation implements this idea:
• Train model on part of data
• Test model on other part of data
• Repeat on different cross-validation folds
• Average performance on test set across all folds
Cross-validation can be used for:
Model evaluation
Test how good an algorithm actually is
Model selection
Optimize parameters of a model for generalization performance


Cross-Validation

Split the data set into F different training and test sets

fold 1:  [ x1, x2, x3, x4, x5, x6 ]   – the first points form F1train, the remaining points F1test
fold 2:  [ x1, x2, x3, x4, x5, x6 ]   – the first points form F2test, the remaining points F2train
fold 3:  . . .

For each fold:
Train your model on the training data
Test your model on the test data
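A minimal F-fold cross-validation loop for kernel ridge regression (my own sketch, reusing the gaussian_kernel function from the Algorithm 1 sketch; a library such as scikit-learn offers ready-made fold splitters that do the same job):

import numpy as np

def cross_validate_krr(X, y, sigma, kappa, F=5):
    # split the N samples into F folds, train on F-1 folds, test on the held-out fold
    folds = np.array_split(np.random.default_rng(0).permutation(len(X)), F)
    errors = []
    for f in range(F):
        test_idx  = folds[f]
        train_idx = np.concatenate([folds[g] for g in range(F) if g != f])
        K  = gaussian_kernel(X[train_idx], X[train_idx], sigma)
        a  = np.linalg.solve(K + kappa * np.eye(len(train_idx)), y[train_idx].reshape(-1, 1))
        Kt = gaussian_kernel(X[test_idx], X[train_idx], sigma)
        errors.append(np.mean(((Kt @ a).ravel() - y[test_idx]) ** 2))   # MSE on the held-out fold
    return np.mean(errors)   # average test performance across all folds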


Cross-Validation: Model Selection or Model Evaluation

Model Evaluation
Report mean evaluation score – e.g. accuracy – across folds

Model Selection
Take the parameter with the highest mean score across folds

Combined Model Selection and Evaluation


→ Nested Cross-Validation


Nested Cross-Validation

Algorithm 5 Cross-Validation for Model Selection and Evaluation


Require: Data (x1 , y1 ) . . . , (xN , yN ), parameters σ1 , . . . , σS , Number of CV folds F
1: # Split data into F disjoint folds
2: for Outer folds fouter = 1, . . . , F do
3: # Pick folds {1, . . . , F } \ fouter for Model Selection
4: # Model Selection
5: for Fold finner = 1, . . . , F − 1 do
6: for Parameter s = 1, . . . , S do
7: # Train model on folds {1, . . . , F } \ {fouter , finner } with parameter σs
8: # Compute prediction on fold finner
9: end for
10: end for
11: # Pick best parameter σs for all finner
12: # Model Evaluation
13: # Train model on folds {1, . . . , F } \ fouter with parameter σs
14: # Test model on fold fouter
15: end for
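A sketch of Algorithm 5 in Python, using the cross_validate_krr helper from the previous sketch for the inner model-selection loop (an illustration of the structure only, not the lecture's reference implementation):

import numpy as np

def nested_cv(X, y, sigmas, kappa, F=5):
    outer = np.array_split(np.random.default_rng(1).permutation(len(X)), F)
    outer_scores = []
    for f in range(F):
        test_idx  = outer[f]
        train_idx = np.concatenate([outer[g] for g in range(F) if g != f])
        # model selection: inner cross-validation on the outer training part only
        best_sigma = min(sigmas,
                         key=lambda s: cross_validate_krr(X[train_idx], y[train_idx], s, kappa, F - 1))
        # model evaluation: retrain with the selected sigma, test on the held-out outer fold
        K  = gaussian_kernel(X[train_idx], X[train_idx], best_sigma)
        a  = np.linalg.solve(K + kappa * np.eye(len(train_idx)), y[train_idx].reshape(-1, 1))
        Kt = gaussian_kernel(X[test_idx], X[train_idx], best_sigma)
        outer_scores.append(np.mean(((Kt @ a).ravel() - y[test_idx]) ** 2))
    return np.mean(outer_scores)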


Efficient Cross-Validation for KRR

For leave-one-out cross-validation,


we need to compute (K + λI)−1 for all different λ.

This is computationally very costly!


S = K(K + λI)⁻¹ can be computed without inverting K each time,
by computing the eigendecomposition K = ULU⊤ first.

Then,

K(K + λI)⁻¹ = U L (L + λI)⁻¹ U⊤

Here, L + λI is a diagonal matrix, so the inverse can be computed
just by inverting the diagonal elements.
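A sketch of the same trick applied to the dual weights (my own illustration): decompose K once, then sweep over many λ values almost for free, since (K + λI)⁻¹ = U (L + λI)⁻¹ U⊤.

import numpy as np

def krr_weights_for_many_lambdas(K, y, lambdas):
    # one eigendecomposition of the symmetric kernel matrix: K = U diag(L) U^T
    L, U = np.linalg.eigh(K)
    Uty = U.T @ np.asarray(y).reshape(-1, 1)
    # for each lambda: a = (K + lam I)^{-1} y^T = U diag(1 / (L + lam)) U^T y^T
    return {lam: U @ (Uty / (L + lam)[:, None]) for lam in lambdas}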


Summary

Kernel Ridge Regression


Non-linear functions modeled as comparison with training data
Gaussian Process yields prediction uncertainty
Optimization requires inversion of kernel matrix
(N × N, N=number of examples)
→ Difficult for very large data sets
Generalization and Model Selection
Good prediction on new data is called generalization
Good algorithms maximize generalization performance
Cross-Validation is a simple and powerful framework to
optimize generalization performance:
• Train model on part of data
• Test model on other part of data
• Repeat on different splits
• Average performance on test set


Further reading

Please read the following material:


• 2nd chapter in Shawe-Taylor and Cristianini, "Kernel Methods for Pattern Analysis".
In my opinion it gives an excellent (and easy to understand)
review on the topic.
• seminal paper on kernel methods: Mueller et al., An
introduction to kernel-based learning algorithms. IEEE
Transactions on Neural Networks.
• for GP’s: 2nd Chapter in Rasmussen


References

A. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control,
25:821–837, 1964.
C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer US, 2007.
T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. 2003.
F. Jaekel. Some Theoretical Aspects of Human Categorization Behaviour: Similarity and Generalization. PhD thesis, 2007.
F. Jäkel, B. Schölkopf, and F. A. Wichmann. Generalization and similarity in exemplar models of categorization: insights from machine learning. Psychon Bull Rev, 15(2):
256–71, 2008.
K.-R. Müller, S. Mika, G. Ratsch, K. Tsuda, and B. B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):
181–201, 2001.
K. P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning. The MIT Press, 1 edition, 2012. ISBN
0262018020,9780262018029.
C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2005.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237(4820):1317–23, 1987.
