Artificial Neural Networks and Deep Learning
Contents
• Introduction
Motivation, Biological Background
• Threshold Logic Units
Definition, Geometric Interpretation, Limitations, Networks of TLUs, Training
• General Neural Networks
Structure, Operation, Training
• Multi-layer Perceptrons
Definition, Function Approximation, Gradient Descent, Backpropagation, Variants, Sensitivity Analysis
• Deep Learning
Many-layered Perceptrons, Rectified Linear Units, Auto-Encoders, Feature Construction, Image Analysis
• Radial Basis Function Networks
Definition, Function Approximation, Initialization, Training, Generalized Version
• Self-Organizing Maps
Definition, Learning Vector Quantization, Neighborhood of Output Neurons
• Hopfield Networks and Boltzmann Machines
Definition, Convergence, Associative Memory, Solving Optimization Problems, Probabilistic Models
• Recurrent Neural Networks
Differential Equations, Vector Networks, Backpropagation through Time
Radial Basis Function Networks
Radial Basis Function Networks
A radial basis function network (RBFN) is a neural network with
a graph G = (U, C) that satisfies the following conditions:
(i) U_in ∩ U_out = ∅,
(ii) C = (U_in × U_hidden) ∪ C′,  C′ ⊆ (U_hidden × U_out).
The network input function of each hidden neuron is a distance function
of the input vector and the weight vector, that is,
∀u ∈ U_hidden : f_net^(u)(w_u, in_u) = d(w_u, in_u),
where d : IR^n × IR^n → IR_0^+ is a function satisfying, for all x, y, z ∈ IR^n,
(i) d(x, y) = 0 ⇔ x = y,
(ii) d(x, y) = d(y, x) (symmetry),
(iii) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
Distance Functions
Illustration of distance functions: Minkowski family
d_k(x, y) = ( Σ_{i=1}^n |x_i − y_i|^k )^{1/k}
Well-known special cases from this family are:
k = 1: Manhattan or city block distance,
k = 2: Euclidean distance,
k → ∞: maximum distance, that is, d_∞(x, y) = max_{i=1,...,n} |x_i − y_i|.
(Plots: the set of points with distance 1 from the origin for k = 1, k = 2, and k → ∞; see also the code sketch below.)
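As a concrete illustration (not part of the slides), a minimal NumPy sketch that evaluates the three special cases:

```python
import numpy as np

def minkowski(x, y, k):
    """Minkowski distance d_k(x, y); k = 1 is city block, k = 2 is Euclidean."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

def maximum_distance(x, y):
    """Limit k -> infinity: the largest coordinate difference."""
    return np.max(np.abs(x - y))

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))      # 7.0 (Manhattan / city block)
print(minkowski(x, y, 2))      # 5.0 (Euclidean)
print(maximum_distance(x, y))  # 4.0 (maximum distance)
```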
Radial Basis Function Networks
The network input function of the output neurons is the weighted sum of their inputs:
∀u ∈ U_out : f_net^(u)(w_u, in_u) = w_u^⊤ in_u = Σ_{v ∈ pred(u)} w_uv out_v.
The activation function of each hidden neuron is a so-called radial function, that
is, a monotonically decreasing function
f : IR_0^+ → [0, 1] with f(0) = 1 and lim_{x→∞} f(x) = 0.
The activation function of each output neuron is a linear function, namely
f_act^(u)(net_u, θ_u) = net_u − θ_u.
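Combining these three definitions, a minimal forward-pass sketch (an illustration, not code from the slides; the Euclidean distance and the Gaussian radial function are chosen here as one admissible combination):

```python
import numpy as np

def rbf_forward(x, centers, sigmas, weights, theta=0.0):
    """Forward pass of an RBF network with one output neuron.

    Hidden neurons: net input = Euclidean distance to the center,
                    activation = Gaussian radial function.
    Output neuron:  net input = weighted sum of hidden outputs,
                    activation = linear, net - theta.
    """
    dists = np.linalg.norm(centers - x, axis=1)       # d(w_u, in_u)
    hidden = np.exp(-dists**2 / (2.0 * sigmas**2))    # radial activation
    return float(weights @ hidden - theta)            # linear output

# Hypothetical example: one hidden neuron centered at (1, 1) with radius 1/2
centers = np.array([[1.0, 1.0]])
sigmas = np.array([0.5])
weights = np.array([1.0])
print(rbf_forward(np.array([1.0, 1.0]), centers, sigmas, weights))  # 1.0
print(rbf_forward(np.array([0.0, 0.0]), centers, sigmas, weights))  # ~0.018
```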
Radial Activation Functions
rectangle function:
f_act(net, σ) = 0 if net > σ, 1 otherwise.
triangle function:
f_act(net, σ) = 0 if net > σ, 1 − net/σ otherwise.
cosine until zero:
f_act(net, σ) = 0 if net > 2σ, (cos(π·net/(2σ)) + 1)/2 otherwise.
Gaussian function:
f_act(net, σ) = e^(−net²/(2σ²)).
(Plots: each function of net starts at 1 for net = 0; the rectangle and triangle functions reach 0 at net = σ, the cosine at net = 2σ; the Gaussian takes the values e^(−1/2) at net = σ and e^(−2) at net = 2σ.)
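As a sketch, the four radial functions written out in NumPy (reconstructed from the plots above; treat the exact formulas as illustrative):

```python
import numpy as np

def rectangle(net, sigma):
    return np.where(net > sigma, 0.0, 1.0)

def triangle(net, sigma):
    return np.where(net > sigma, 0.0, 1.0 - net / sigma)

def cosine_until_zero(net, sigma):
    return np.where(net > 2 * sigma, 0.0,
                    (np.cos(np.pi * net / (2 * sigma)) + 1) / 2)

def gaussian(net, sigma):
    return np.exp(-net**2 / (2 * sigma**2))

# All are 1 at net = 0 and decrease monotonically towards 0.
net = np.array([0.0, 0.5, 1.0, 2.0])
print(gaussian(net, sigma=1.0))  # 1.0, e^-0.125, e^-0.5, e^-2
```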
Radial Basis Function Networks: Examples
Radial basis function networks for the conjunction x1 ∧ x2
(Figure: two network diagrams with the regions of the input space mapped to 1. The first network uses a hidden neuron centered at (1, 1) with radius 1/2 and a connection of weight 1 to the output neuron (threshold 0); the second uses a hidden neuron centered at (0, 0) with radius 6/5 and a connection of weight −1 to the output neuron (threshold −1).)
Radial Basis Function Networks: Examples
Radial basis function networks for the biimplication x1 ↔ x2
Idea: logical decomposition
x1 ↔ x2 ≡ (x1 ∧ x2) ∨ ¬(x1 ∨ x2)
(Figure: network with two hidden neurons, one centered at (1, 1) and one at (0, 0), each with radius 1/2 and connected with weight 1 to the output neuron (threshold 0), together with the regions of the input space mapped to 1.)
Radial Basis Function Networks: Function Approximation
(Figure: left, a function with values y1, . . . , y4 at the points x1, . . . , x4; right, its approximation by a step function; below, the four weighted rectangular pulses 1·y1, . . . , 1·y4 that sum to the approximation.)
Approximation of a function by rectangular pulses, each of which can be represented by a neuron of a radial basis function network.
Radial Basis Function Networks: Function Approximation
(Figure: network with one input neuron x, four hidden neurons with centers x1, . . . , x4 and common radius σ, output weights y1, . . . , y4, and output threshold 0.)
σ = (1/2)·Δx = (1/2)·(x_{i+1} − x_i)
A radial basis function network that computes the step function shown on the preceding slide
and the piecewise linear function shown on the next slide (depending on the activation function).
Radial Basis Function Networks: Function Approximation
(Figure: left, the same function as before; right, its piecewise linear approximation; below, the four weighted triangular pulses 1·y1, . . . , 1·y4 that sum to the approximation.)
Approximation of a function by triangular pulses, each of which can be represented by a neuron of a radial basis function network.
Radial Basis Function Networks: Function Approximation
(Figure: left, a function on the interval [0, 8]; right, its approximation as a weighted sum of three Gaussian functions; below, the three weighted Gaussians.)
Approximation of a function by Gaussian functions with radius σ = 1 and weights w1 = 1, w2 = 3, and w3 = −2.
Radial Basis Function Networks: Function Approximation
Radial basis function network for a sum of three Gaussian functions
(Figure: network with one input neuron x, three hidden neurons with centers 2, 5, and 6 and radius 1, output weights 1, 3, and −2, and output threshold 0.)
• The weights of the connections from the input neuron to the hidden neurons
determine the locations of the Gaussian functions.
• The weights of the connections from the hidden neurons to the output neuron
determine the height/direction (upward or downward) of the Gaussian functions (see the sketch below).
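A minimal sketch (illustrative; the centers 2, 5, 6 are taken from the network diagram, the radius and weights from the preceding slide) that evaluates the resulting function:

```python
import numpy as np

centers = np.array([2.0, 5.0, 6.0])    # input-to-hidden weights: locations
weights = np.array([1.0, 3.0, -2.0])   # hidden-to-output weights: heights/directions
sigma = 1.0                            # common radius

def f(x):
    """Network output: weighted sum of three Gaussian bumps, output threshold 0."""
    return float(np.sum(weights * np.exp(-(x - centers)**2 / (2 * sigma**2))))

for x in (2.0, 5.0, 6.0):
    print(x, round(f(x), 3))
```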
Training Radial Basis Function Networks
Radial Basis Function Networks: Initialization
Let L_fixed = {l_1, . . . , l_m} be a fixed learning task, consisting of m
training patterns l = (ı^(l), o^(l)).
Simple radial basis function network:
One hidden neuron v_k, k = 1, . . . , m, for each training pattern:
∀k ∈ {1, . . . , m} : w_{v_k} = ı^(l_k).
If the activation function is the Gaussian function,
the radii σ_k are chosen heuristically as
∀k ∈ {1, . . . , m} : σ_k = d_max / √(2m),
where d_max = max_{l_j, l_k ∈ L_fixed} d(ı^(l_j), ı^(l_k)).
Radial Basis Function Networks: Initialization
Initializing the connections from the hidden to the output neurons:
∀u ∈ U_out : ∀l ∈ L_fixed : Σ_{k=1}^m w_{u v_k} out_{v_k}^(l) − θ_u = o_u^(l),   or abbreviated   A · w_u = o_u,
where o_u = (o_u^(l_1), . . . , o_u^(l_m))^⊤ is the vector of desired outputs, θ_u = 0, and
A is the m × m matrix of hidden neuron outputs with entries a_{jk} = out_{v_k}^(l_j).
This is a linear equation system that can be solved by inverting the matrix A:
w_u = A^(−1) · o_u.
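A minimal sketch of this initialization (illustrative NumPy code, assuming Euclidean distance, the Gaussian activation function, and the radius heuristic stated above):

```python
import numpy as np

def init_simple_rbf(X, o):
    """X: (m, n) training inputs, o: (m,) desired outputs.

    Centers are the training inputs themselves, the common radius follows
    sigma = d_max / sqrt(2 m), and the output weights solve A w = o,
    where a_jk is the output of hidden neuron k for training pattern j.
    """
    m = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    sigma = D.max() / np.sqrt(2 * m)
    A = np.exp(-D**2 / (2 * sigma**2))   # hidden outputs, one row per pattern
    w = np.linalg.solve(A, o)            # exact solution (theta = 0)
    return X.copy(), sigma, w

# Biimplication: four patterns with desired outputs (1, 0, 0, 1)
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
o = np.array([1., 0., 0., 1.])
centers, sigma, w = init_simple_rbf(X, o)
print(sigma, w)   # sigma = 0.5; the weights reproduce the desired outputs exactly
```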
RBFN Initialization: Example
Simple radial basis function network for the biimplication x1 ↔ x2
x1  x2  y
 0   0   1
 1   0   0
 0   1   0
 1   1   1
(Figure: network with four hidden neurons, one per training pattern, centered at (0, 0), (1, 0), (0, 1), and (1, 1), each with radius 1/2 and connected with weights w1, . . . , w4 to the output neuron y with threshold 0.)
RBFN Initialization: Example
Simple radial basis function network for the biimplication x1 ↔ x2
(Setting up the 4 × 4 matrix A of hidden neuron activations for the four training patterns and solving A · w_u = o_u yields the connection weights w1, . . . , w4.)
RBFN Initialization: Example
Simple radial basis function network for the biimplication x1 ↔ x2
(Figure: activation of a single basis function, the superposition of all basis functions, and the network output over the input space.)
• Initialization already leads to a perfect solution of the learning task.
• Subsequent training is not necessary.
Radial Basis Function Networks: Initialization
Normal radial basis function networks:
Select a subset of k training patterns as centers. The linear equation system A · w_u = o_u then has more equations than unknowns (it is over-determined) and is solved in the least-squares sense.
Compute the (Moore–Penrose) pseudo-inverse:
A^+ = (A^⊤ A)^(−1) A^⊤.
The weights can then be computed by
w_u = A^+ · o_u = (A^⊤ A)^(−1) A^⊤ · o_u.
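A least-squares sketch using NumPy's pseudo-inverse (illustrative; here the output threshold θ is treated as an additional unknown, which matches the θ shown in the following example network but is an assumption about the exact setup):

```python
import numpy as np

def init_output_weights(X, o, centers, sigma):
    """Least-squares output weights and threshold for a normal RBF network."""
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    A = np.exp(-D**2 / (2 * sigma**2))          # m x k matrix of hidden outputs
    A = np.hstack([A, -np.ones((len(X), 1))])   # extra column for the threshold
    sol = np.linalg.pinv(A) @ o                 # w_u = A^+ o_u
    return sol[:-1], sol[-1]                    # weights, theta

# Biimplication with the two centers (0, 0) and (1, 1) used in the next example
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
o = np.array([1., 0., 0., 1.])
centers = np.array([[0., 0.], [1., 1.]])
print(init_output_weights(X, o, centers, sigma=0.5))
```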
RBFN Initialization: Example
Normal radial basis function network for the biimplication x1 ↔ x2
Select two training patterns:
• l1 = (ı^(l1), o^(l1)) = ((0, 0), (1))
• l4 = (ı^(l4), o^(l4)) = ((1, 1), (1))
(Figure: network with two hidden neurons centered at (0, 0) and (1, 1), each with radius 1/2, connected with weights w1 and w2 to the output neuron with threshold θ.)
RBFN Initialization: Example
Normal radial basis function network for the biimplication x1 ↔ x2
where
a ≈ −0.1810, b ≈ 0.6810,
c ≈ 1.1781, d ≈ −0.6688, e ≈ 0.1594.
Resulting weights:
RBFN Initialization: Example
Normal radial basis function network for the biimplication x1 ↔ x2
(Figure: activation of the basis function centered at (0, 0), activation of the basis function centered at (1, 1), and the network output over the input space.)
• Initialization already leads to a perfect solution of the learning task.
• This is a coincidence, since the linear equation system is not over-determined,
due to linearly dependent equations.
Radial Basis Function Networks: Initialization
How to choose the radial basis function centers?
• Use all data points as centers for the radial basis functions.
◦ Advantages: Only radius and output weights need to be determined; desired
output values can be achieved exactly (unless there are inconsistencies).
◦ Disadvantage: Often far too many radial basis functions; computing the
weights to the output neuron via a pseudo-inverse can become infeasible.
• Use a random subset of data points as centers for the radial basis functions.
◦ Advantages: Fast; only radius and output weights need to be determined.
◦ Disadvantages: Performance depends heavily on the choice of data points.
• Use the result of clustering as centers for the radial basis functions, e.g.
◦ c-means clustering (on the next slides)
◦ Learning vector quantization (to be discussed later)
RBFN Initialization: c-Means Clustering
• Choose a number c of clusters to be found (user input).
• Initialize the cluster centers randomly
(for instance, by randomly selecting c data points).
• D a t a point assignment:
Assign each data point to the cluster center that is closest to it
(that is, closer than any other cluster center).
• C l u s t e r center update:
Compute new cluster centers as the mean vectors of the assigned data points.
(Intuitively: center of gravity if each data point has unit weight.)
• Repeat these two steps (data point assignment and cluster center update)
until the cluster centers do not change anymore (see the code sketch below).
It can be shown that this scheme must converge,
that is, the update of the cluster centers cannot go on forever.
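A minimal sketch of the two alternating steps (illustrative NumPy code, not from the slides):

```python
import numpy as np

def c_means(data, c, max_iter=100, seed=0):
    """Alternate data point assignment and cluster center update."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=c, replace=False)]  # random data points
    for _ in range(max_iter):
        # data point assignment: index of the closest cluster center for each point
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # cluster center update: mean vector of the assigned data points
        new_centers = np.array([data[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(c)])
        if np.allclose(new_centers, centers):  # centers no longer change
            break
        centers = new_centers
    return centers, assign
```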
c-Means Clustering: Example
(Left: data set to cluster; c = 3 clusters are chosen. From visual inspection, the number of clusters can be difficult to determine in general.)
(Right: initial position of the cluster centers, here randomly selected data points. Alternative methods include e.g. Latin hypercube sampling.)
Delaunay Triangulations and Voronoi Diagrams
• Dots represent cluster centers.
• Left: Delaunay Triangulation
The circle through the corners of a triangle does not contain another point.
• Right: Voronoi Diagram / Tessellation
Midperpendiculars of the Delaunay triangulation: boundaries of the regions
of points that are closest to the enclosed cluster center (Voronoi cells).
Delaunay Triangulations and Voronoi Diagrams
• Delaunay Triangulation: simple triangle (shown in gray on the left)
• Voronoi Diagram: midperpendiculars of the triangle’s edges
(shown in blue on the left, in gray on the right)
c-Means Clustering: Example
Radial Basis Function Networks: Training
Training radial basis function networks:
Derivation of update rules is analogous to that of multi-layer perceptrons.
Weights from the hidden to the output neurons.
Gradient:
∇_{w_u} e_u^(l) = ∂e_u^(l) / ∂w_u = −2 (o_u^(l) − out_u^(l)) in_u^(l),
Weight update rule:
Δw_u^(l) = −(η3/2) ∇_{w_u} e_u^(l) = η3 (o_u^(l) − out_u^(l)) in_u^(l).
Typical learning rate: η3 ≈ 0.001.
(Two more learning rates are needed for the center coordinates and the radii.)
Radial Basis Function Networks: Training
Training radial basis function networks:
Center coordinates (weights from the input to the hidden neurons).
Gradient:
Weight update rule:
Typical learning rate: η1 ≈ 0.02.
Radial Basis Function Networks: Training
Training radial basis function networks:
Center coordinates (weights from the input to the hidden neurons).
Special case: Euclidean distance
Special case: Gaussian activation function
Radial Basis Function Networks: Training
Training radial basis function networks:
Radii of radial basis functions.
Gradient:
Weight update rule:
Typical learning rate: η2 ≈ 0.01.
Radial Basis Function Networks: Training
Training radial basis function networks:
Radii of radial basis functions.
Special case: Gaussian activation function
(The distance function is irrelevant for the radius update,
since it only enters the network input function.)
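The gradient formulas themselves did not survive the extraction, so the following sketch only illustrates the special case named above (Euclidean distance, Gaussian activation function, linear output, squared error); the update directions are derived for that special case and are not necessarily the slides' exact notation:

```python
import numpy as np

def train_step(x, o, centers, sigmas, w, theta,
               eta1=0.02, eta2=0.01, eta3=0.001):
    """One online gradient step for a Gaussian RBF network with one output.

    eta1, eta2, eta3 are the learning rates for the center coordinates,
    the radii, and the output weights, as quoted on the slides.
    """
    diff = x - centers                         # (k, n): x minus each center
    d2 = np.sum(diff**2, axis=1)               # squared Euclidean distances
    hidden = np.exp(-d2 / (2 * sigmas**2))     # hidden neuron outputs
    out = w @ hidden - theta                   # network output
    delta = o - out                            # error signal, e = (o - out)^2

    # update directions (negative half gradients of e), computed before any update
    dw = delta * hidden                                      # output weights
    dtheta = -delta                                          # output threshold
    dc = delta * (w * hidden / sigmas**2)[:, None] * diff    # center coordinates
    ds = delta * w * hidden * d2 / sigmas**3                 # radii

    return (centers + eta1 * dc, sigmas + eta2 * ds,
            w + eta3 * dw, theta + eta3 * dtheta)
```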
Radial Basis Function Networks: Generalization
Generalization of the distance function
Idea: Use an anisotropic (direction-dependent) distance function.
Example: Mahalanobis distance
d(x, y) = √( (x − y)^⊤ Σ^(−1) (x − y) ),
where Σ is a symmetric, positive definite matrix.
Example: biimplication
(Figure: network with a single hidden neuron centered at (1/2, 1/2) that uses an anisotropic distance function, connected to the output neuron; the elliptical region of the input space mapped to 1 is oriented along the diagonal, so it covers (0, 0) and (1, 1) but not (1, 0) and (0, 1).)
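A small sketch of the Mahalanobis distance defined above (the matrix Σ below is a made-up example, chosen so that the unit ball is an ellipse stretched along the diagonal, as the biimplication example requires):

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """d(x, y) = sqrt((x - y)^T Sigma^-1 (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

# Hypothetical anisotropic matrix: differences along the diagonal count less.
Sigma = np.array([[0.5, 0.4],
                  [0.4, 0.5]])
center = np.array([0.5, 0.5])
for p in [(0, 0), (1, 1), (1, 0), (0, 1)]:
    print(p, round(mahalanobis(np.array(p, float), center, Sigma), 3))
# (0,0) and (1,1) come out close to the center, (1,0) and (0,1) far away,
# so a single radial basis function can separate the two classes.
```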
Application: Recognition of Handwritten Digits
• Comparison of various classifiers:
◦ Nearest Neighbor (1NN) ◦ Learning Vector Quantization (LVQ)
◦ Decision Tree (C4.5) ◦ Radial Basis Function Network (RBF)
◦ Multi-Layer Perceptron (MLP) ◦ Support Vector Machine (SVM)
• Distinction by the number of RBF training phases:
◦ 1 phase: find output connection weights e.g. with the pseudo-inverse.
◦ 2 phases: find RBF centers e.g. with some clustering, plus 1 phase.
◦ 3 phases: 2 phases plus error backpropagation training.
• Initialization of radial basis function centers:
◦ Random choice of data points
◦ c-means Clustering
◦ Learning Vector Quantization
◦ Decision Tree (one RBF center per leaf)
Application: Recognition of Handwritten Digits
Classification results:

Classifier                              Accuracy
Nearest Neighbor (1NN)                   97.68%
Learning Vector Quantization (LVQ)       96.99%
Decision Tree (C4.5)                     91.12%
2-Phase-RBF (data points)                95.24%
2-Phase-RBF (c-means)                    96.94%
2-Phase-RBF (LVQ)                        95.86%
2-Phase-RBF (C4.5)                       92.72%
3-Phase-RBF (data points)                97.23%
3-Phase-RBF (c-means)                    98.06%
3-Phase-RBF (LVQ)                        98.49%
3-Phase-RBF (C4.5)                       94.83%
Support Vector Machine (SVM)             98.76%
Multi-Layer Perceptron (MLP)             97.59%

• LVQ: 200 vectors (20 per class)
• C4.5: 505 leaves
• c-means: 60 centers(?) (6 per class)
• SVM: 10 classifiers, ≈ 4200 vectors
• MLP: 1 hidden layer with 200 neurons
• Results are medians of three training/test runs.
• Error backpropagation improves RBF results.