

Machine Learning

Kernel Ridge Regression

Siamac Fazli

Nazarbayev University – Dept. of Computer Science


Overview

• Psychological Models of Categorization
• Kernel Methods
• Kernel Ridge Regression
• Generalization and Model Selection using Cross-Validation


Categorization Revisited

[Figure: two scatter plots of positive and negative examples in the (x1, x2) plane.
Left: some data is linearly separable. Right: other data is not separable.]


Categorization Revisited

[Figure: scatter plot of positive and negative examples that are linearly non-separable.]

Non-linear models (such as NNs) can learn linearly non-separable problems.
Kernel methods can implicitly model arbitrarily complex hidden layers.
Psychological models of learning are very similar to kernel methods.


Categorization Revisited

Psychological experiments show [Shepard, 1987]:


We can learn arbitrary concepts/categorizations
(i.e. also non-linear classification boundaries)
Central question: How do our brains do that?
Simpler question: What is our perceptual space?
• Do we extract non-linear features?
• How are perceptual similarities related to physical similarity?
• How can we measure internal similarities?

Categorization Revisited

Ivan Pavlov

Pavlovian Conditioning (example from [Jäkel et al., 2008]):
1. Condition a dog to salivate after a 1000 Hz tone
2. Present other tones (900 Hz, 1100 Hz, . . . )
3. Measure the amount of saliva secreted
4. Perceptual similarity ∝ amount of saliva

Nobel Prize 1904

Categorization Revisited

Human categorization behavior:
1. Generate two classes of stimuli
2. Train humans to categorize stimuli
3. Generate new stimuli
4. Test categorization on new stimuli
5. Human categorization reveals perceptual space

Example of a Perceptual Space
[Figure: reproduced from Jäkel, Schölkopf & Wichmann (2008), Figure 1 – an illustration of a
perceptual space. The stimuli are circles with a spoke and vary on two dimensions (perceived
angle and perceived size); two artificial categories are depicted, separated by a linear decision
boundary.]

Sheperd’s Law of Universal Generalization

Perceptual Similarity of new data x decays


exponentially with distance from prototype µ

8 / 42
Human Categorization Kernel Methods Kernel Ridge Regression Cross-Validation Summary

Sheperd’s Law of Universal Generalization

Perceptual Similarity of new data x decays


exponentially with distance from prototype µ

µ Gaussian Kernel
k(µ,x)
σ

−(µ−x )2
x k(µ, x ) = e σ (1)
Gaussian Kernel in R1

8 / 42

Connection To Machine Learning?

We can learn complex non-linear decision boundaries


The internal feature space is not clear
The similarity measure we use internally is not clear
We can investigate it by presenting examples
We can train/modify it by presenting examples
⇒ The same holds true for kernel methods!


Multilayer Perceptron

• Imagine one hidden layer
• Each unit stores a copy of one training data point xi
• Instead of f(w⊤x), each unit computes a similarity k(xi, x)
• k(xi, x) is a kernel function
• The kernel function in eq. (1) is a Gaussian of width σ on each data point

Any non-linear function (e.g. category borders) can be implemented as a
linear combination of similarities to all previously seen data points.

[Figure: reproduced from Jäkel, Schölkopf & Wichmann (2008), Figure 3 – an RBF network
calculates a weighted sum w1 k(x, x1) + w2 k(x, x2) + w3 k(x, x3) over the similarities to the
stored exemplars; a small network with three exemplars is depicted.]

Kernel Trick

The kernel trick is a powerful machine learning tool


[Aizerman et al., 1964; Müller et al., 2001; Schölkopf and Smola,
2002; Shawe-Taylor and Cristianini, 2004]

• Psychological intuition [Jaekel, 2007]:


New predictions of arbitrarily complex categorizations require
only comparisons with previously learned examples
• Mathematical intuition:
Optimal solution lies in the space spanned by the data
→ Even in an infinite dimensional feature space we need only
estimate N parameters (N is the number of data points)


Why Kernel Methods?

Kernel methods are powerful


Kernel methods can fit arbitrary functions
Often only a small number of kernel parameters needs to be
optimized (for MLPs we need to optimize more parameters)

Kernel methods are (sometimes) very efficient


Even if you don’t need to model non-linear dependencies
→ When you have less data points than your data has
dimensions kernel methods can offer a dramatic speedup


Kernel Trick

Kernel Trick
Prediction (e.g. category membership) ŷ
given a new data point xnew can be expressed
as linear combination of data points xi

ŷ = f(xnew) = Σ_{i=1}^{N} kφ(xnew, xi) αi        (2)

where i ∈ {1, . . . , N} are the indices of previously seen data points

Similarities between data points are stored in the kernel matrix K

Kij = k(xi , xj ): similarity between ith and jth data point
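As a minimal illustration of eq. (2) – a sketch in NumPy, assuming a Gaussian kernel of width 1 as an example; this is not code from the lecture – the prediction is simply a weighted sum of similarities to the stored training points:

import numpy as np

def kernel_predict(x_new, X_train, alpha, k):
    # eq. (2): y_hat = f(x_new) = sum_i k(x_new, x_i) * alpha_i
    # X_train holds one training point per row, alpha holds the N weights alpha_i
    return sum(k(x_new, x_i) * a_i for x_i, a_i in zip(X_train, alpha))

# example kernel: a Gaussian with sigma = 1
k = lambda a, b: float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2)))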


Some Popular Kernel Functions

Linear Kernel

k(xi, xj) = xi⊤ xj        (3)

Polynomial Kernel

k(xi, xj) = (xi⊤ xj + c)^p        (4)

Gaussian Kernel

k(xi, xj) = e^(−||xi − xj||² / (2σ²))        (5)
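The linear and polynomial kernels can be computed for a whole data matrix at once. A small NumPy sketch, assuming the slides' convention that samples are the columns of X ∈ R^{D×N} (the values of c and p below are arbitrary example choices); the Gaussian kernel is computed in Algorithm 1 further below:

import numpy as np

def linear_kernel(X, Z):
    # X: (D, n), Z: (D, m), samples as columns -> K[i, j] = x_i^T z_j, eq. (3)
    return X.T @ Z

def polynomial_kernel(X, Z, c=1.0, p=3):
    # eq. (4): (x_i^T z_j + c)^p, applied elementwise
    return (X.T @ Z + c) ** p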


Non-linear feature mapping

Example on the board
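The board example is not reproduced here, but one standard example of a non-linear feature mapping (an assumption on my part, not necessarily the one shown in class): for x ∈ R², the map φ(x) = (x1², √2·x1·x2, x2²) turns the polynomial kernel (x⊤z)² into an ordinary inner product, φ(x)⊤φ(z) = (x⊤z)², so a linear method in feature space acts non-linearly in input space.

import numpy as np

def phi(x):
    # explicit feature map whose inner product equals the polynomial kernel (x^T z)^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), (x @ z) ** 2)   # both give (approximately) 1.0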


Graphical explanation for non-linear feature mapping

taken from [Müller et al., 2001]


Kernel Methods - Pros and Cons

+ Powerful Modeling Tool


(non-linear problems become linear in kernel space)
+ Omnipurpose Kernels
(Gaussian works well in many cases)
+ Algorithms are independent of kernel!

– Difficult to understand what’s happening in kernel space


– Model complexity increases with number of data points
→ If you have too much data, kernel methods can be slow


Kernel Methods - Training and Testing in Practice


• Training and testing data X ∈ R^{D×N}, y ∈ R^{1×N}
• Compute the kernel on the training data

K = k(Xtrain, Xtrain)        (6)

• Training
Compute weights α ∈ R^{Ntrain×1}, one for each training sample

α = someKernelAlgorithm(K, ytrain)        (7)

• Testing
Compute predictions on test data Xtest

ŷtest = (k(Xtest, Xtrain) α)⊤        (8)


From Linear to Kernel Ridge Regression

board

19 / 42
Human Categorization Kernel Methods Kernel Ridge Regression Cross-Validation Summary

From Linear to Kernel Ridge Regression


In linear ridge regression we look for a linear combination of features w
that minimizes the prediction error and has a small norm:

E_RR(w) = ||y − w⊤X||²₂ + λ ||w||²₂        (9)

Computing its derivative and setting it to zero,

∂E_RR(w)/∂w = −2Xy⊤ + 2XX⊤w + 2λw = 0        (10)

→ w = X · (1/λ)(y⊤ − X⊤w) = Xa,   with a := (1/λ)(y⊤ − X⊤w)        (11)

we see that the optimal w is a linear combination of the data points X, with coefficients a.


board


From Linear to Kernel Ridge Regression

a = (1/λ)(y⊤ − X⊤w)        (substituting w = Xa)
λa = y⊤ − X⊤Xa
y⊤ = (X⊤X + λI) a
→ a = (X⊤X + λI)⁻¹ y⊤        (12)

K = X⊤X is called the kernel matrix – here it is just a linear kernel


From Linear to Kernel Ridge Regression

a = (1/λ)(y⊤ − X⊤w)
a = (K + λI)⁻¹ y⊤        (13)

K is called the kernel matrix – here it is just a linear kernel

Kij denotes the similarity between data points xi and xj

Predictions (linear kernel):   ŷ = w⊤X = (Xa)⊤X = a⊤X⊤X = (Ka)⊤

We derived kernel ridge regression only for K = X⊤X.
But eq. (13) works for all kernel functions!
All that changes is how you compute the kernel K.
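A quick numerical check of eq. (13) (a sketch; columns of X are samples, as in the derivation): with the linear kernel K = X⊤X, the kernel ridge predictions (Ka)⊤ coincide with the ordinary ridge regression predictions w⊤X.

import numpy as np

rng = np.random.default_rng(0)
D, N, lam = 3, 10, 0.5
X = rng.standard_normal((D, N))          # columns are samples
y = rng.standard_normal((1, N))

K = X.T @ X                                               # linear kernel
a = np.linalg.solve(K + lam * np.eye(N), y.T)             # eq. (13)
w = np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y.T)   # ridge regression weights

print(np.allclose((K @ a).T, w.T @ X))   # True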


Kernel Ridge Regression vs. Ridge Regression

For many algorithms there are kernel versions [Müller et al., 2001].

Ridge Regression:          w = (XX⊤ + λI)⁻¹ X y⊤

Kernel Ridge Regression:   w = Xa,   a = (X⊤X + λI)⁻¹ y⊤

When should we use one or the other?


Covariance and Kernel Matrix


Given a data matrix X ∈ R^{D×N} (D dimensions, N samples):

Covariance Matrix
• Covariance matrices grow quadratically with D
→ use for low dimensional data

(Linear) Kernel Matrix
• Kernel matrices grow quadratically with N
→ use for high dimensional data
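A small timing sketch of this trade-off (illustrative sizes, not from the lecture): when D ≫ N, solving the N×N kernel-side system is much cheaper than the D×D covariance-side system, even though both yield the same weights.

import numpy as np, time

rng = np.random.default_rng(0)
D, N, lam = 4000, 200, 1.0                            # high-dimensional case: D >> N
X = rng.standard_normal((D, N))                       # columns are samples
y = rng.standard_normal((1, N))

t0 = time.perf_counter()
np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y.T)   # ridge regression: D x D solve
t1 = time.perf_counter()
np.linalg.solve(X.T @ X + lam * np.eye(N), y.T)       # kernel ridge regression: N x N solve
t2 = time.perf_counter()
print(f"covariance side: {t1 - t0:.3f}s    kernel side: {t2 - t1:.3f}s")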


Uncertainty for Kernel Ridge Regression Predictions

• We can fit arbitrary functions with kernel ridge regression


• But: How uncertain are we about the predictions?
• We can obtain these uncertainties from kernel ridge regression
• Assuming that the φ(x) follow a Gaussian distribution,
Kernel Ridge Regression is a Gaussian Process


Gaussian Processes

A Gaussian Process
• describes a distribution over functions
• is completely specified by its mean and covariance function

µ(x) = E[φ(x)]        (the mean function – these are the kernel ridge regression predictions)
k(xp, xq) = E[(φ(xp) − µ(xp))(φ(xq) − µ(xq))]
f(x) ∼ GP(µ(x), k(xp, xq))


Gaussian Processes
There is a lot to know about Gaussian Processes (GPs),
most of which can be found in [Rasmussen and Williams, 2005]:

If you are interested in GPs, check Chapter 2 of Rasmussen and Williams.

It is free, and you can find it here:
http://www.gaussianprocess.org/gpml/chapters/

Gaussian Processes

Consider a toy data example:

f_true(x) = cos(x),    f_measured = f_true + ε,    ε ∼ N(0, 0.1)

[Figure: noisy measurements of f_measured and the GP fit f(x), shown for σ = 1.0 and a second
(regularization) parameter of 0.010.]

Computation of Gaussian Kernel

Algorithm 1 Gaussian Kernel


Require: Data x1 , . . . , xn = X , x̂1 , . . . , x̂m = X̂ , kernel width σ
1: H = repmat(sum(X .∗ X, 2), 1, m) − 2 X X̂⊤ + repmat(sum(X̂ .∗ X̂, 2)⊤, n, 1)
2: K = exp(−H ./ 2 ./ σ²)
3: return K

notes:
• x indicates train data and x̂ indicates test data
• the first and second lines calculate the Gaussian kernel between train and test
data all at once (K ∈ Rn×m )

HW: think about line 1 for 5min-1h
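As a hint for the homework (a sketch of one way to read line 1, not the official solution): line 1 computes all pairwise squared distances at once via the expansion ||x − x̂||² = ||x||² − 2 x⊤x̂ + ||x̂||². In NumPy, with one sample per row as in Algorithm 1:

import numpy as np

def gaussian_kernel(X, Xhat, sigma):
    # X: (n, d) train data, Xhat: (m, d) test data, one sample per row
    # H[i, j] = ||x_i||^2 - 2 x_i^T xhat_j + ||xhat_j||^2 = ||x_i - xhat_j||^2
    H = (X**2).sum(1)[:, None] - 2 * X @ Xhat.T + (Xhat**2).sum(1)[None, :]
    return np.exp(-H / (2 * sigma**2))   # K in R^{n x m}, as in line 2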


Kernel Ridge Regression Algorithm

Algorithm 2 Kernel Ridge Regression - Training


Require: Kernel matrix K ∈ R^{N×N}, labels y = [y1, . . . , yN]⊤ ∈ R^N, ridge κ
Ensure: a
1: # Compute dual coefficients
2: a = (K + κI)⁻¹ y⊤

Algorithm 3 Kernel Ridge Regression - Prediction


Require: Training Data Xtrain ∈ RD×Ntrain , Testing Data Xtest ∈ RD×Ntest , weights a,
regularizer κ, kernel parameter(s) Φ, kernel function k(., .)
Ensure: predicted output ŷ
1: Compute Ktrain = k(Xtrain , Xtrain , Φ)
2: Compute K = k(Xtest , Xtrain , Φ)
3: ŷ = (Ka)⊤
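A compact NumPy sketch of Algorithms 2 and 3 (my own illustration, reusing the gaussian_kernel sketch above and therefore assuming one sample per row, i.e. the transpose of the slides' D×N convention):

import numpy as np

def krr_train(K, y, kappa):
    # Algorithm 2: dual coefficients a = (K + kappa * I)^{-1} y^T
    return np.linalg.solve(K + kappa * np.eye(K.shape[0]), np.asarray(y).reshape(-1, 1))

def krr_predict(X_train, X_test, a, sigma):
    # Algorithm 3: y_hat = (K a)^T with K = k(X_test, X_train)
    K = gaussian_kernel(X_test, X_train, sigma)   # (N_test, N_train)
    return (K @ a).T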


Kernel Ridge Regression Algorithm

Algorithm 4 Gaussian Process - Prediction


Require: Training Data Xtrain ∈ RD×Ntrain , Testing Data Xtest ∈ RD×Ntest , weights a,
regularizer κ, kernel parameter(s) Φ, kernel function k(., .)
Ensure: Parameters of p(y |Xtrain , Xtest , Φ, κ) ∼ N (µ, Σ)
1: Compute Ktest = k(Xtest , Xtest , Φ)
2: Compute Ktrain = k(Xtrain , Xtrain , Φ)
3: Compute K = k(Xtest , Xtrain , Φ)
4: Compute mean of predicted label distribution p(y |Xtrain , Xtest , Φ, κ)
5: µ = (Ka)⊤
6: Compute covariance of p(y |Xtrain , Xtest , Φ, κ)
7: Σ = Ktest − K(Ktrain + κI)⁻¹ K⊤
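A matching NumPy sketch of Algorithm 4 (again reusing the gaussian_kernel sketch with one sample per row; an illustration, not the lecture's reference code):

import numpy as np

def gp_predict(X_train, y_train, X_test, sigma, kappa):
    # predictive mean and covariance of p(y | X_train, X_test, sigma, kappa)
    K_train = gaussian_kernel(X_train, X_train, sigma)
    K_test  = gaussian_kernel(X_test, X_test, sigma)
    K       = gaussian_kernel(X_test, X_train, sigma)              # cross-kernel
    a  = np.linalg.solve(K_train + kappa * np.eye(len(X_train)), np.asarray(y_train).reshape(-1, 1))
    mu = (K @ a).ravel()                                           # same as the KRR prediction
    Sigma = K_test - K @ np.linalg.solve(K_train + kappa * np.eye(len(X_train)), K.T)
    return mu, Sigma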


Hyperparameter Selection

For kernel ridge regression / GPs we need to specify


• Kernel parameter(s) σ
• Ridge parameter κ

These are called hyperparameters


How can we choose them optimally?


Hyperparameter Selection For Gaussian Processes

Different hyperparameters lead to different solutions

[Figure: GP fits f(x) to the toy data for kernel widths σ ∈ {0.5, 1.0, 2.0} and two settings of the
regularization parameter (0.100 and 0.500); the fitted function changes considerably across the grid.]


Generalization and Model Selection

Generalization: Correct categorization/prediction of new data

Similarity and Generalization are complementary concepts


[Jäkel et al., 2008; Shepard, 1987]:

Good algorithms generalize well to new data


Generalization and Model Selection

The best model is the model that generalizes best

[Figure: decision boundaries on the two-class toy data for kernel widths σ ∈ {0.2, 0.5, 2.0} and
regularizers κ ∈ {10⁻⁶, 100}.]

Overfitting (σ = 0.2): model is too complex → bad generalization
Better fit (σ = 0.5): appropriate complexity → good generalization
Underfitting (σ = 2.0): model is too simple → bad generalization

σ – kernel width,   κ – regularizer


How to Achieve Good Generalization?

When using powerful algorithms (MLPs, KRR, . . . )


every data set can be modeled perfectly! (overfitting)
But we want to model new data well (generalization)
Cross-validation implements this idea:
• Train model on part of data
• Test model on other part of data
• Repeat on different cross-validation folds
• Average performance on test set across all folds
Cross-validation can be used for:
Model evaluation
Test how good an algorithm actually is
Model selection
Optimize parameters of a model for generalization performance


Cross-Validation

Split the data set into F different training and test sets

fold 1:  [ x1, x2, x3, x4, x5, x6 ]   – the first points form F1train, the remaining points F1test
fold 2:  [ x1, x2, x3, x4, x5, x6 ]   – the first points form F2test, the remaining points F2train
fold 3:  . . .

For each fold:
Train your model on the training data
Test your model on the test data
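A minimal F-fold cross-validation loop for kernel ridge regression (my own sketch, reusing the gaussian_kernel function from the Algorithm 1 sketch; a library such as scikit-learn offers ready-made fold splitters that do the same job):

import numpy as np

def cross_validate_krr(X, y, sigma, kappa, F=5):
    # split the N samples into F folds, train on F-1 folds, test on the held-out fold
    folds = np.array_split(np.random.default_rng(0).permutation(len(X)), F)
    errors = []
    for f in range(F):
        test_idx  = folds[f]
        train_idx = np.concatenate([folds[g] for g in range(F) if g != f])
        K  = gaussian_kernel(X[train_idx], X[train_idx], sigma)
        a  = np.linalg.solve(K + kappa * np.eye(len(train_idx)), y[train_idx].reshape(-1, 1))
        Kt = gaussian_kernel(X[test_idx], X[train_idx], sigma)
        errors.append(np.mean(((Kt @ a).ravel() - y[test_idx]) ** 2))   # MSE on the held-out fold
    return np.mean(errors)   # average test performance across all folds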


Cross-Validation: Model Selection or Model Evaluation

Model Evaluation
Report mean evaluation score – e.g. accuracy – across folds

Model Selection
Take the parameter with the highest mean score across folds

Combined Model Selection and Evaluation


→ Nested Cross-Validation


Nested Cross-Validation

Algorithm 5 Cross-Validation for Model Selection and Evaluation


Require: Data (x1 , y1 ) . . . , (xN , yN ), parameters σ1 , . . . , σS , Number of CV folds F
1: # Split data into F disjoint folds
2: for Outer folds fouter = 1, . . . , F do
3: # Pick folds {1, . . . , F } \ fouter for Model Selection
4: # Model Selection
5: for Fold finner = 1, . . . , F − 1 do
6: for Parameter s = 1, . . . , S do
7: # Train model on folds {1, . . . , F } \ {fouter , finner } with parameter σs
8: # Compute prediction on fold finner
9: end for
10: end for
11: # Pick best parameter σs for all finner
12: # Model Evaluation
13: # Train model on folds {1, . . . , F } \ fouter with parameter σs
14: # Test model on fold fouter
15: end for
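A sketch of Algorithm 5 in Python, using the cross_validate_krr helper from the previous sketch for the inner model-selection loop (an illustration of the structure only, not the lecture's reference implementation):

import numpy as np

def nested_cv(X, y, sigmas, kappa, F=5):
    outer = np.array_split(np.random.default_rng(1).permutation(len(X)), F)
    outer_scores = []
    for f in range(F):
        test_idx  = outer[f]
        train_idx = np.concatenate([outer[g] for g in range(F) if g != f])
        # model selection: inner cross-validation on the outer training part only
        best_sigma = min(sigmas,
                         key=lambda s: cross_validate_krr(X[train_idx], y[train_idx], s, kappa, F - 1))
        # model evaluation: retrain with the selected sigma, test on the held-out outer fold
        K  = gaussian_kernel(X[train_idx], X[train_idx], best_sigma)
        a  = np.linalg.solve(K + kappa * np.eye(len(train_idx)), y[train_idx].reshape(-1, 1))
        Kt = gaussian_kernel(X[test_idx], X[train_idx], best_sigma)
        outer_scores.append(np.mean(((Kt @ a).ravel() - y[test_idx]) ** 2))
    return np.mean(outer_scores)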


Efficient Cross-Validation for KRR

For leave-one-out cross-validation,


we need to compute (K + λI)−1 for all different λ.

This is computationally very costly!


S = K(K + λI)⁻¹ can be computed without inverting K each time,
by computing the eigendecomposition K = ULU⊤ first.

Then,

K(K + λI)⁻¹ = U L (L + λI)⁻¹ U⊤

Here, L + λI is a diagonal matrix, so the inverse can be computed
just by inverting the diagonal elements.
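A sketch of the same trick applied to the dual weights (my own illustration): decompose K once, then sweep over many λ values almost for free, since (K + λI)⁻¹ = U (L + λI)⁻¹ U⊤.

import numpy as np

def krr_weights_for_many_lambdas(K, y, lambdas):
    # one eigendecomposition of the symmetric kernel matrix: K = U diag(L) U^T
    L, U = np.linalg.eigh(K)
    Uty = U.T @ np.asarray(y).reshape(-1, 1)
    # for each lambda: a = (K + lam I)^{-1} y^T = U diag(1 / (L + lam)) U^T y^T
    return {lam: U @ (Uty / (L + lam)[:, None]) for lam in lambdas}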


Summary

Kernel Ridge Regression


Non-linear functions modeled as comparison with training data
Gaussian Process yields prediction uncertainty
Optimization requires inversion of kernel matrix
(N × N, N=number of examples)
→ Difficult for very large data sets
Generalization and Model Selection
Good prediction on new data is called generalization
Good algorithms maximize generalization performance
Cross-Validation is a simple and powerful framework to
optimize generalization performance:
• Train model on part of data
• Test model on other part of data
• Repeat on different splits
• Average performance on test set


Further reading

Please read the following material:


• 2nd chapter in Shawe-Taylor and Cristianini, "Kernel Methods for Pattern Analysis".
In my opinion it gives an excellent (and easy to understand)
review on the topic.
• seminal paper on kernel methods: Mueller et al., An
introduction to kernel-based learning algorithms. IEEE
Transactions on Neural Networks.
• for GP’s: 2nd Chapter in Rasmussen


References

A. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control,
25:821–837, 1964.
C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer US, 2007.
T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. 2003.
F. Jaekel. Some Theoretical Aspects of Human Categorization Behaviour: Similarity and Generalization. PhD thesis, 2007.
F. Jäkel, B. Schölkopf, and F. A. Wichmann. Generalization and similarity in exemplar models of categorization: insights from machine learning. Psychon Bull Rev, 15(2):
256–71, 2008.
K.-R. Müller, S. Mika, G. Ratsch, K. Tsuda, and B. B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):
181–201, 2001.
K. P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning. The MIT Press, 1 edition, 2012. ISBN
0262018020,9780262018029.
C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2005.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237(4820):1317–23, 1987.
