Neural Networks
Aisha Lichtner-Bajjaoui
Contents

1 Introduction
1.1 History of Neural Networks
1.2 Basic Results of Linear Algebra and Analysis

Bibliography
Chapter 1
Introduction
In this work, we are going to introduce Neural Networks. First, we are going to give a mathematical formulation of the concept of Neural Networks. Later on, we will examine some important properties of Neural Networks and make a connection to common statistical methods such as Principal Component Analysis and Singular Value Decomposition. In the last chapter, we will give a practical application of a neural network to a regression problem. The concept of a Neural Network is inspired by the activity of the human brain. Neurons receive information; if the information is relevant to the neuron, a signal is sent to other neurons via synapses. The main difference between Neural Networks and rule-based statistical methods is the learning ability of Neural Networks. At the beginning of a training phase a network has no explicit information. During the training phase the inter-neural connections are changed so that the network solves the given problem as well as possible. Therefore Neural Networks can provide solutions to a wide spectrum of problems.
A Neural Network is an abstract model consisting of one or more layers that are connected in a certain way. The weighted connections between the layers play the role of synapses. Each layer consists of units modelling neurons in the human brain. The units carry activation functions, modelling the impulses that the real neurons send when being triggered. Just like the brain, the network is trained to learn a specific task and later should perform a similar task in an unknown situation, using the experience that it gained before. For that matter, during the training phase already-observed information is passed to the model and the model produces an output. The output is evaluated on its ability to approximate the observed information. Depending on the result, the model is then changed to improve performance.
At the time, computers had just been invented and were not close to being able to handle algorithms of such complexity. In the following years, scientists lost interest in Neural Networks due to a lack of progress in the field and other, at the time more promising, methods. In 1970 Seppo Linnainmaa discovered Backpropagation, which would later revolutionize the performance of Neural Networks. Still, Neural Networks were not in the focus of the Artificial Intelligence community until 2012, when students won the ImageNet Large Scale Visual Recognition Competition with remarkable results. Since then, Neural Networks have been a rising topic in Artificial Intelligence.
1.2 Basic Results of Linear Algebra and Analysis

For square matrices M, N and Q and an orthogonal matrix U (of compatible dimensions), the trace satisfies:

1. tr(MN) = tr((MN)^T)

2. tr(U Q U^T) = tr(Q)
1. F is convex

and therefore

F'(z) = 2 M^T M z − 2 M^T c,

so with the previous Theorem we can conclude

M^T M z = M^T c.   (1.2.1)
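The normal equations (1.2.1) characterize the least-squares minimizer. A small NumPy sketch (illustrative; assuming, as the derivative above suggests, F(z) = ||Mz − c||², with made-up M and c) confirms that solving (1.2.1) agrees with a standard least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 3))   # overdetermined system, full column rank
c = rng.standard_normal(6)

# Solve the normal equations  M^T M z = M^T c  (eq. 1.2.1)
z = np.linalg.solve(M.T @ M, M.T @ c)

# The same minimizer of F(z) = ||Mz - c||^2 via a least-squares solver
z_ls, *_ = np.linalg.lstsq(M, c, rcond=None)
assert np.allclose(z, z_ls)
```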
Lemma 1.2.13. For any matrices P, Q, R and S (of dimensions that allow the multiplications) the following hold:

3. (P ⊗ Q)(R ⊗ S) = PR ⊗ QS

4. (P ⊗ Q)^{-1} = P^{-1} ⊗ Q^{-1}, for invertible P and Q

5. (P ⊗ Q)^T = P^T ⊗ Q^T
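The Kronecker-product identities of Lemma 1.2.13 can be checked numerically; the following NumPy sketch (illustrative, with randomly chosen matrices) verifies the mixed-product, transpose and inverse rules:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((2, 3))
Q = rng.standard_normal((4, 2))
R = rng.standard_normal((3, 5))
S = rng.standard_normal((2, 2))

# (P ⊗ Q)(R ⊗ S) = PR ⊗ QS  (mixed-product rule)
assert np.allclose(np.kron(P, Q) @ np.kron(R, S), np.kron(P @ R, Q @ S))

# (P ⊗ Q)^T = P^T ⊗ Q^T
assert np.allclose(np.kron(P, Q).T, np.kron(P.T, Q.T))

# (P ⊗ Q)^{-1} = P^{-1} ⊗ Q^{-1}  for square invertible factors
A = rng.standard_normal((3, 3))
B = rng.standard_normal((2, 2))
assert np.allclose(np.linalg.inv(np.kron(A, B)),
                   np.kron(np.linalg.inv(A), np.linalg.inv(B)))
```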
Chapter 2

Mathematical definition of a Neural Network
In this chapter, we will introduce Neural Networks in the language of Statistics. The general idea of a Neural Network is to find a model that, based on a set of N samples (x_i, y_i), approximates the relation

f(x_i) = y_i

as well as possible. The model always consists of one input layer, an output layer and at least one hidden layer. The layers are connected by weighted transitions, which can be modelled as the product of a weight matrix with a vector that represents the output of the previous layer. Each layer consists of several units, the neurons.
The number of units in each layer defines the dimension of the layer. The activation functions in the neurons essentially model a threshold that decides whether the information a neuron receives is relevant for the further calculations or not. If the information is relevant, it is passed on to the next layer until the output layer is reached. During the so-called training phase, the model is given a sample data set D, where the matrix [x_1, ..., x_N] is the input and [y_1, ..., y_N] the output matrix. After processing each sample (x_i, y_i) ∈ D, the distance between the output that the network produces and the actual desired output y_i is measured by some loss function, for example the L2- or Euclidean norm. According to the results, the weights of the transitions are modified to achieve a better result in the next iteration. The dimension of the input layer is determined by the dimension of the input data, while the dimension of the output layer corresponds to the problem we want to solve using the model. For example, if we are trying to use the model for regression, the output layer will be of dimension one, because we want the model to assign one value to each input. For a classification problem, we use an output layer of dimension k, where k is the number of categories into which we want to sort the data.
For a sufficiently large sample, the model is expected to produce a good approximation of the desired output. In the following, we will try to find the best estimator F̂ of f that satisfies

F̂(x_i) = y_i,

that means the best approximation of f for the given observation D. Then we will apply F̂ to unknown values z to make a prediction or classification.

Y := [y_1, ..., y_N]

will model the corresponding observed training output vectors y_i and form a k × N matrix.
Definition 2.0.2. By Z := [z_1, ..., z_M] we denote the test set with test vectors z_i ∈ R^n.

Definition 2.0.3. Let X̃ be an n-dimensional random vector that models the random choice of training input vectors x_i. The random vector Z describes the choice of test data. By Ỹ we will denote a k-dimensional random vector that models the output of the network for all possible inputs.
Example 2.0.5. Let the loss function L be the quadratic error, that is

L[F] = E[(Ỹ − F(X̃))²]
     = E[(Ỹ − E(Ỹ|X̃) + E(Ỹ|X̃) − F(X̃))²]
     = E[(Ỹ − E(Ỹ|X̃))²] + E[(E(Ỹ|X̃) − F(X̃))²]
       + 2 E[(Ỹ − E(Ỹ|X̃))(E(Ỹ|X̃) − F(X̃))].

The cross term vanishes by the tower property of conditional expectation, so the loss splits into the irreducible noise and the approximation error of F.
Observe that we do not always know whether the optimal estimator E(Ỹ|x) is contained in the space of functions that the network can implement, or that satisfy given conditions on the approximating function, such as, for example, being continuous. We call the difference between E(Ỹ|x) and F̂ the error. In order not to obtain a result that measures the accuracy of the approximation only for one specific sample D, we consider the expectation E_D of (E(Ỹ|x) − F̂_D(x))², that is, the expectation with respect to all possible sample sets D, hence the average over all possible samples.
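As an illustration of the decomposition in Example 2.0.5, a small Monte Carlo sketch (hypothetical model Y = f(X) + noise, so that E(Ỹ|X̃) = f(X̃); the functions and sample size are made up) shows that the cross term is negligible and the loss splits into noise plus approximation error:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

X = rng.uniform(-1.0, 1.0, n)
f = lambda x: np.sin(np.pi * x)          # conditional mean E(Y|X)
Y = f(X) + rng.normal(0.0, 0.3, n)       # noise independent of X
F = lambda x: 0.8 * x                    # an arbitrary (biased) estimator

total = np.mean((Y - F(X)) ** 2)
noise = np.mean((Y - f(X)) ** 2)
approx = np.mean((f(X) - F(X)) ** 2)
cross = np.mean((Y - f(X)) * (f(X) - F(X)))

# The cross term vanishes in expectation; the total error splits into
# irreducible noise plus squared approximation error.
assert abs(cross) < 1e-2
assert abs(total - (noise + approx)) < 2e-2
```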
If F̂_D on average differs from E(Ỹ|·), that means (E_D[F̂_D(x)] − E(Ỹ|x))² is big, we call F̂_D biased. Even if we find an unbiased F̂_D, the variance can still be a cause of poor performance. For example, in the optimal case F̂_D(x) = E(Ỹ|x), we obtain the variance

E_D[(E_D[E(Ỹ|x)] − E(Ỹ|x))²]

that can have a big impact, depending on how much E(Ỹ|·) varies for the given data. By that we can see that, on the one hand, a function F̂ can approximate f very well for a specific training sample, but on the other hand, applying the approximation to other samples, the result might not be good. This problem is called underfitting: the network has, due to a bad choice of training samples, not learned enough about the properties of the function f it approximates. On the other hand, F̂ might be an unbiased approximation that is very sensitive to the given data, and thus on average approximates E(Ỹ|·) well, but still causes poor performance. We call this behaviour of a network overfitting; that means the network models the noise in the training sample.
We can observe that there is a dilemma between reducing the bias and reducing the variance. The bias can be reduced by increasing the dimension of the training samples, for example by adding a layer of higher dimension, which has the effect of giving the training samples more features and therefore more information. But by providing more information, the variance will in general increase. This creates a trade-off between variance and bias.
2.2 The Neurons
There are different ways in which the neurons process the inputs. Essentially we
can distinguish between two types of neurons:
h(s) = ⟨w, s⟩
error function, the results depend only on the choice of weights, which are calculated during the training phase, so the problem (2.0.4) can be reduced to

W* = arg min_W (1/N) Σ_{i=1}^N (y_i − F(x_i; W))².   (2.3.1)

W is the matrix that models all weights of the whole network. The problem (2.3.1) is solved by a gradient descent method, like Backpropagation. Gradient descent methods are based on the idea of minimizing a function by iteratively moving in the direction of steepest descent, that is, the direction opposite to the gradient of the function. So for (2.3.1) the t-th iteration step is modelled by

ΔW_{k,j}(t) = −ε(t) ∂(y_i − F(x_i; W))² / ∂W_{k,j}

with learning rate ε(t).
Remark 2.3.2. Notice that another approach would be to compute the gradient of the whole expression (1/N) Σ_{i=1}^N (y_i − F(x_i; W))². However, that would slow down the procedure, since in each iteration step all samples are considered.
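For a purely linear network F(x; W) = Wx, the per-sample update rule above can be sketched in a few lines of NumPy (the data, constant learning rate and model below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
W_true = rng.standard_normal((2, 4))             # ground-truth weights
X = rng.standard_normal((4, 200))
Y = W_true @ X                                   # noiseless targets

W = np.zeros((2, 4))                             # initial weights
eps = 0.01                                       # learning rate ε(t), kept constant here
for epoch in range(50):
    for i in range(X.shape[1]):
        x, y = X[:, i], Y[:, i]
        err = y - W @ x                          # y_i - F(x_i; W)
        # gradient of ||y_i - W x_i||^2 w.r.t. W is  -2 (y_i - W x_i) x_i^T
        W += eps * 2 * np.outer(err, x)

assert np.allclose(W, W_true, atol=1e-3)
```

With a constant learning rate and noiseless data the per-sample updates recover the true weight matrix; in practice ε(t) is decreased over time.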
Definition 2.3.3. A Multilayer Perceptron where the output of the network equals
the given input is called an Autoassociative Multilayer Perceptron.
Chapter 3

Properties and Connection to Statistical Methods

Σ_XX := Σ_{i=1}^N (x_i − µ_X)(x_i − µ_X)^T

Σ_XY := Σ_{i=1}^N (x_i − µ_X)(y_i − µ_Y)^T

Σ_YX := Σ_{i=1}^N (y_i − µ_Y)(x_i − µ_X)^T

Σ_YY := Σ_{i=1}^N (y_i − µ_Y)(y_i − µ_Y)^T

Σ := Σ_YX Σ_XX^{-1} Σ_XY
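The matrices above can be computed directly from the sample matrices X and Y; a short NumPy sketch (with made-up data) also confirms Σ_YX = Σ_XY^T and the symmetry of Σ:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n, k = 100, 3, 2
X = rng.standard_normal((n, N))          # columns are the samples x_i
Y = rng.standard_normal((k, N))          # columns are the samples y_i

mu_X = X.mean(axis=1, keepdims=True)
mu_Y = Y.mean(axis=1, keepdims=True)
Xc, Yc = X - mu_X, Y - mu_Y              # centered samples

Sxx = Xc @ Xc.T                          # Σ_XX = Σ (x_i − µ_X)(x_i − µ_X)^T
Sxy = Xc @ Yc.T                          # Σ_XY
Syx = Yc @ Xc.T                          # Σ_YX
Syy = Yc @ Yc.T                          # Σ_YY
Sigma = Syx @ np.linalg.inv(Sxx) @ Sxy   # Σ = Σ_YX Σ_XX^{-1} Σ_XY

assert np.allclose(Syx, Sxy.T)           # Σ_YX = Σ_XY^T
assert np.allclose(Sigma, Sigma.T)       # Σ is symmetric
```

This matches Remark 3.0.4: for centered samples the matrices reduce to XX^T, XY^T, YX^T and YY^T.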
3.1 Global minimum and Principal Component Analysis
Remark 3.0.4. Notice that, for centered samples X and Y, that means µ_X = 0 and µ_Y = 0, we observe that Σ_XX = XX^T, Σ_YY = YY^T, Σ_XY = XY^T and Σ_YX = YX^T.
Remark 3.0.5. Additionally, we can expect the sample covariance matrix to have
only positive eigenvalues, which implies full rank. A rank deficient covariance matrix
would indicate a poor choice of training samples since linear dependencies describe
correlations in the training set.
Definition 3.0.6. For eigenvalues λ_1 > λ_2 > ... > λ_n of Σ and an index set I = {i_1, ..., i_p} with i_1 < i_2 < ... < i_p ≤ n, define the matrix of the orthonormal eigenvectors of Σ associated to the eigenvalues λ_{i_1}, ..., λ_{i_p} by

U_I := [u_{i_1}, ..., u_{i_p}].

Furthermore, define

U_p := [u_1, ..., u_p]

as the matrix of the first p eigenvectors corresponding to the p biggest eigenvalues λ_1 > ... > λ_p.
PM = M (M T M )−1 M T .
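The matrix P_M is the orthogonal projector onto the column space of M; a quick NumPy check (random M, assumed to have full column rank) verifies its defining properties:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((5, 2))          # full column rank (with probability 1)
P = M @ np.linalg.inv(M.T @ M) @ M.T     # P_M = M (M^T M)^{-1} M^T

assert np.allclose(P @ P, P)             # idempotent
assert np.allclose(P, P.T)               # symmetric
assert np.allclose(P @ M, M)             # fixes the column space of M
```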
Theorem 3.1.2. Suppose W is of rank at most p and Σ_XX is invertible. Then the problem stated above has a unique solution

M = Σ_YX Σ_XX^{-1},

where M is an (n × n)-matrix.
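For centered samples, M = Σ_YX Σ_XX^{-1} is exactly the Ordinary Least Squares estimator; a NumPy sketch with made-up data confirms the agreement with a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(6)
N, n, k = 200, 4, 2
X = rng.standard_normal((n, N))
X -= X.mean(axis=1, keepdims=True)       # centered input
Y = rng.standard_normal((k, n)) @ X + 0.1 * rng.standard_normal((k, N))
Y -= Y.mean(axis=1, keepdims=True)       # centered output

Syx = Y @ X.T                            # Σ_YX for centered samples
Sxx = X @ X.T                            # Σ_XX
M = Syx @ np.linalg.inv(Sxx)             # M = Σ_YX Σ_XX^{-1}

# agrees with ordinary least squares, computed column-wise
M_ls = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
assert np.allclose(M, M_ls)
```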
During the training phase, we want to minimize the error by adjusting the weights in each layer, that means by changing the function F. A minimum can be found by using a numerical algorithm, such as an implementation of the gradient descent method. Therefore, the network suffers from a very common problem in numerical analysis: it is hard to tell whether the gradient descent method can find the global minimum or whether it will end in a local minimum. Since we are considering a network that consists of only one hidden layer and all units are linear, the function F must be of the form F(x_i) = W x_i = AB x_i, where B is a real p × n matrix that represents the transition from the input layer to the hidden layer, and A a real n × p matrix representing the transition from the hidden layer to the output layer. As a network architecture that does not use all units of a hidden layer is not practicable, we can expect A and B to have full rank p, that means to represent surjective, respectively injective, maps. Although using linear layers strongly restricts the network's capabilities, since only linear functions can be approximated, it still serves an important purpose: understanding the activities in the hidden layers. Notice that the results found for one hidden layer can easily be extended to a deeper architecture with more hidden layers, since by linearity of the transitions they can be interpreted as one hidden layer.
We can now make some statements about optimal weights, that means the matrices A and B for which the quadratic error function attains a minimum. These optimal weights are never unique, since they can always be multiplied by an invertible matrix C and result in the same global map

(AC)(C^{-1} B) = AB = W.
Theorem 3.1.3. For a fixed n × p matrix A the function R_E(A, B) is convex in the coefficients of B. It also states that:

1. R_E attains a minimum in B if and only if B satisfies

A^T A B Σ_XX = A^T Σ_YX.   (3.1.1)

2. If Σ_XX is invertible and A has full rank p, then R_E is strictly convex in B and a unique minimum is reached in

B = B̂(A) = (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1}.   (3.1.2)

3. In the autoassociative case,

B = B̂(A) = (A^T A)^{-1} A^T.   (3.1.3)
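Statement 2 can be illustrated numerically: for a fixed full-rank A and made-up centered data, the matrix B̂(A) of (3.1.2) attains a smaller error than any perturbation of it (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(7)
N, n, k, p = 150, 5, 4, 2
X = rng.standard_normal((n, N)); X -= X.mean(axis=1, keepdims=True)
Y = rng.standard_normal((k, N)); Y -= Y.mean(axis=1, keepdims=True)
A = rng.standard_normal((k, p))                  # fixed A, full rank p

Sxx, Syx = X @ X.T, Y @ X.T
B_hat = np.linalg.inv(A.T @ A) @ A.T @ Syx @ np.linalg.inv(Sxx)   # (3.1.2)

def R_E(B):
    # quadratic error  Σ_i ||y_i − A B x_i||²
    return np.sum((Y - A @ B @ X) ** 2)

# every perturbation of B̂ increases the error: B̂ is the minimum in B
for _ in range(20):
    B_pert = B_hat + 0.1 * rng.standard_normal((p, n))
    assert R_E(B_hat) <= R_E(B_pert)
```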
Proof.
So we obtain that

R_E(A, B) = Σ_{i=1}^N ||y_i − AB x_i||².

So it follows that

vec(A^T A B Σ_XX) = vec(A^T Σ_YX)   (3.1.5)

and therefore R_E attains a minimum in B that satisfies

A^T A B Σ_XX = A^T Σ_YX.
2. To prove the second statement, we assume that Σ_XX is invertible and A has full rank p. By (1.2.2) and Lemma 1.2.8 we can conclude that A^T A is symmetric and positive definite. Lemma 3.0.3 states that Σ_XX is symmetric and positive semidefinite. Since by assumption Σ_XX has full rank, it follows that Σ_XX is positive definite. We can then deduce with Lemma 1.2.13(7) that

(X^T ⊗ A)^T (X^T ⊗ A) = Σ_XX ⊗ A^T A

is positive definite and symmetric. We know by Theorem 1.2.10(3) that R_E has a unique minimum in

B̂(A) = (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1},

which we obtain from (3.1.1).
3. Since in the autoassociative case x_i = y_i for i ∈ {1, ..., N}, it follows that Σ_XX = Σ_XY = Σ_YX = Σ_YY, so

B̂(A) = (A^T A)^{-1} A^T.
A B Σ_XX B^T = Σ_YX B^T.   (3.1.6)

2. If Σ_XX is invertible and B has full rank p, then R_E is strictly convex in A and a unique minimum is reached in

A = Â(B) = Σ_YX B^T (B Σ_XX B^T)^{-1}.   (3.1.7)

3. In the autoassociative case,

A = Â(B) = Σ_XX B^T (B Σ_XX B^T)^{-1}.   (3.1.8)
Proof.
We can now conclude with Theorem 1.2.10(1) that R_E is convex in A and has a global minimum if A satisfies

A B Σ_XX B^T = Σ_YX B^T.

2. Let B be of full rank and Σ_XX be invertible, therefore of full rank. With Remark 1.2.2 and Σ_XX being positive semidefinite and symmetric, we obtain

Â(B) = Σ_YX B^T (B Σ_XX B^T)^{-1},

which follows from (3.1.6) and the fact that B Σ_XX B^T is invertible as a positive definite square matrix.

3. In the autoassociative case,

Â(B) = Σ_XX B^T (B Σ_XX B^T)^{-1}.
Using the previous results, we can now make some statements about the global map that corresponds to a critical point.

Theorem 3.1.5. Let Σ_XX be invertible and A of full rank. A and B define a critical point of R_E, that means ∂R_E/∂a_{ij} = ∂R_E/∂b_{ij} = 0, if and only if

W = P_A Σ_YX Σ_XX^{-1}   (3.1.9)

and A satisfies

P_A Σ = Σ P_A = P_A Σ P_A.   (3.1.10)

Or equivalently, B = B̂ as defined in (3.1.2) and A satisfies (3.1.10).

2. In the autoassociative case,

W = P_A   (3.1.11)

with B = B̂, and A satisfies

P_A Σ_XX = Σ_XX P_A = P_A Σ_XX P_A.   (3.1.12)
Proof.
1. ”⇒”:
Let Σ_XX be invertible and A, B matrices that define a critical point of R_E. From Theorem 3.1.3 we can deduce that B must be of the form

B̂(A) = (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1}.

So

W = AB = A B̂(A) = A (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1} = P_A Σ_YX Σ_XX^{-1}

and therefore

W Σ_XX W^T = Σ_YX W^T,

which by (3.1.9) equals

P_A Σ_YX Σ_XX^{-1} Σ_XY P_A = Σ_YX Σ_XX^{-1} Σ_XY P_A,

that is, P_A Σ P_A = Σ P_A. Since the covariance matrix Σ_XX as well as the orthogonal projection are symmetric, the inverse of a symmetric matrix is symmetric and by definition (Σ_YX)^T = Σ_XY, transposing yields P_A Σ P_A = P_A Σ, so A satisfies (3.1.10).
”⇐”:
Assume A satisfies (3.1.10) and the global map W satisfies (3.1.9). By multiplying (3.1.9) by (A^T A)^{-1} A^T on the left we get:

(A^T A)^{-1} A^T P_A Σ_YX Σ_XX^{-1} = (A^T A)^{-1} A^T W = (A^T A)^{-1} A^T A B = B
⇐⇒ (A^T A)^{-1} A^T A (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1} = B
⇐⇒ B = (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1} = B̂(A).

We observe that B satisfies (3.1.2).
Now we can see that A satisfies (3.1.6), by

Σ P_A = P_A Σ P_A
⇐⇒ Σ_YX Σ_XX^{-1} Σ_XY P_A = P_A Σ_YX Σ_XX^{-1} Σ_XY P_A
⇐⇒ Σ_YX W^T = W Σ_XY P_A = W Σ_XX W^T
⇐⇒ Σ_YX B^T A^T = A B Σ_XX B^T A^T.

Multiplying this by A (A^T A)^{-1} on the right side, we obtain

Σ_YX B^T A^T A (A^T A)^{-1} = A B Σ_XX B^T A^T A (A^T A)^{-1}
⇐⇒ Σ_YX B^T = A B Σ_XX B^T.

So we can conclude that A and B satisfy (3.1.6) and (3.1.2) (thus (3.1.1)), respectively. Therefore, by Theorem 3.1.3 and Theorem 3.1.4, they define a critical point of R_E.
2. ”⇒”:
In the autoassociative case, we obtain by Σ = Σ_XX and (3.1.3), with the same reasoning as in the general case,

W = P_A and P_A Σ_XX = P_A Σ_XX P_A = Σ_XX P_A.

”⇐”:
Let A satisfy (3.1.12) and the global map W satisfy (3.1.11). Multiplying (3.1.11) on the left by (A^T A)^{-1} A^T gives us directly B = B̂ = (A^T A)^{-1} A^T.
Using (3.1.12) and (3.1.11) we get

Σ_XX P_A = P_A Σ_XX P_A ⇐⇒ Σ_XX P_A^T = P_A Σ_XX P_A^T
⇐⇒ Σ_XX B^T A^T = A B Σ_XX B^T A^T
⇐⇒ Σ_XX B^T (A^T A) = A B Σ_XX B^T (A^T A)
⇐⇒ Σ_XX B^T = A B Σ_XX B^T.

So A satisfies (3.1.6). Therefore, by Theorem 3.1.3 and Theorem 3.1.4, A and B define a critical point of R_E.
A = U_I C   (3.1.13)

and

B = C^{-1} U_I^T Σ_YX Σ_XX^{-1}   (3.1.14)

and

R_E(A, B) = tr Σ_YY − Σ_{i∈I} λ_i.   (3.1.16)

In the autoassociative case,

A = U_I C,   B = C^{-1} U_I^T   (3.1.17)

and

W = P_{U_I}.   (3.1.18)
Proof.
Since Σ is a real symmetric matrix, it follows by the Spectral Theorem 1.2.9 that it is diagonalizable:

Σ = U Λ U^T.   (3.1.19)

Since

(A^T A)^{-1} A^T Σ_YX Σ_XX^{-1} = ((U_I C)^T (U_I C))^{-1} (U_I C)^T Σ_YX Σ_XX^{-1}
= (C^T U_I^T U_I C)^{-1} C^T U_I^T Σ_YX Σ_XX^{-1}
= C^{-1} (U_I^T U_I)^{-1} (C^T)^{-1} C^T U_I^T Σ_YX Σ_XX^{-1}
= C^{-1} U_I^T Σ_YX Σ_XX^{-1}
= B,

together with the analogous computation for A, it follows that A and B satisfy (3.1.7) and (3.1.2), respectively. Therefore A and B define a critical point.
”⇒”:
Let A and B define a critical point of R_E. We have

P_{U^T A} = U^T P_A U,   (3.1.20)

since by definition of P_{U^T A}:

P_{U^T A} = U^T A (A^T U U^T A)^{-1} A^T U = U^T A (A^T A)^{-1} A^T U = U^T P_A U.

We can conclude by

U P_{U^T A} U^T U Λ U^T = P_A Σ = Σ P_A = U Λ U^T U P_{U^T A} U^T,

using (3.1.19) and (3.1.10), that

U P_{U^T A} Λ U^T = U Λ P_{U^T A} U^T

and consequently,

P_{U^T A} Λ = Λ P_{U^T A}.
Componentwise, P_{U^T A} Λ = Λ P_{U^T A} means

λ_k · p_{i,k} = λ_i · p_{i,k}   for all i, k ∈ {1, ..., n}.

Knowing that λ_1 > ... > λ_n > 0, it follows that λ_i ≠ λ_k for i ≠ k, thus p_{i,k} = 0 for i ≠ k and P_{U^T A} is a diagonal matrix of rank p. Since orthogonal projectors only have eigenvalues 1 and 0, we can define a unique ordered index set I = {i_1, ..., i_p}, with 1 ≤ i_1 < ... < i_p ≤ n, that represents the indices of the positions where P_{U^T A} has a diagonal entry equal to one. That means i_k ∈ I if p_{i_k, i_k} = 1. By that, we can define the diagonal matrix M_I with entries (M_I)_{ii} = 1 for i ∈ I and (M_I)_{ii} = 0 otherwise, and clearly

U M_I U^T = U_I U_I^T.
So,

tr(W Σ_XX W^T) = tr(P_A Σ P_A) = tr(P_A Σ) = tr(U P_{U^T A} U^T U Λ U^T) = tr(P_{U^T A} U^T U Λ) = tr(P_{U^T A} Λ),

using (3.1.10) and (1.2.7), and

tr(W Σ_XY) = tr(P_A Σ) = tr(U P_{U^T A} U^T U Λ U^T) = tr(P_{U^T A} Λ).
That means a critical point W of rank p is always the product of the Ordinary Least Squares Regression matrix Σ_YX Σ_XX^{-1} and an orthogonal projection onto the subspace spanned by p eigenvectors of Σ.
Finally, we can state our main result: R_E has no local minima, and all critical points other than the global minimum correspond to saddle points.
Theorem 3.1.7. The critical point W of R_E associated with the index set {1, ..., p}, in other words, associated with the p biggest eigenvalues, is the unique local and global minimum of R_E. The other (n choose p) − 1 index sets correspond to saddle points.
Proof. Let A and B be matrices such that (3.1.13) and (3.1.14) are satisfied, and therefore define a critical point of R_E. Furthermore, let I be a p-index set with I ≠ {1, ..., p}. Our aim is now to show that A and B do not define a local or global minimum, but a saddle point, that means there exist Ã and B̃ in a neighbourhood of A and B such that

R_E(Ã, B̃) < R_E(A, B).

Let U_p be the matrix of the first p eigenvectors of Σ corresponding to the p biggest eigenvalues, and U_I the matrix of the eigenvectors u_i, for i ∈ I. Moreover, let j ∈ I and k ∈ {1, ..., p}\I. Now let us construct a matrix Ũ_I by replacing the j-th vector of U_I by

ũ_j := (1 + ε²)^{-1/2} (u_j + ε u_k).

That can be interpreted as moving from u_j a little in the direction of the vector u_k
of U_p. Notice that

Ũ_I^T Ũ_I = Id,

since u_k with k ∉ I is not a vector of U_I and is orthogonal to all of its columns, the remaining columns of Ũ_I are orthonormal, and

⟨ũ_j, ũ_j⟩ = (1 + ε²)^{-1} ||u_j + ε u_k||² = 1.

Therefore,

P_{U^T Ã} = U^T Ũ_I Ũ_I^T U,

and evaluating R_E at the point defined by Ã and B̃ = B̂(Ã), along the same lines as before, yields

R_E(Ã, B̃) = R_E(A, B) − ε² (λ_k − λ_j) / (1 + ε²).
All eigenvalues are non-negative and in decreasing order, so k < j implies λ_k > λ_j. Therefore, since ε was chosen arbitrarily small, every neighbourhood of A, B contains Ã, B̃ satisfying R_E(Ã, B̃) < R_E(A, B). Hence A, B do not define a minimum, and by convexity of R_E neither a maximum, so they define a saddle point.
Remark 3.1.8. The critical points defined by rank-deficient matrices A and B correspond to saddle points as well, but will not be discussed in this work (see [3], page 58).
3.1.5 Conclusion

We took a closer look at a linear Multilayer Perceptron with one hidden layer and observed that, up to equivalence, the unique global minimum can be found and there are no local minima. All critical points apart from the global minimum correspond to saddle points. By considering the autoassociative architecture, we found a measure of how well a network can recognize unknown patterns. Moreover, we have seen that in some special cases the optimal results coincide with well-known results in Statistics, such as Ordinary Least Squares Regression and Principal Component Analysis. That means that instead of using a gradient descent method on the error function, in these special cases we could use these known methods directly. However, the gradient descent method, with appropriate parameters to avoid the algorithm ending in a saddle point, will also work on more general data and can be expected to produce the same results, as no local minima will disturb the gradient descent.
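The coincidence with Principal Component Analysis can be illustrated numerically. In the autoassociative case the optimal global map is the projection onto the first p eigenvectors of Σ_XX, and the minimal error equals tr Σ_XX minus the p biggest eigenvalues (cf. (3.1.16) and (3.1.18)); the following NumPy sketch with made-up data checks this:

```python
import numpy as np

rng = np.random.default_rng(8)
N, n, p = 300, 5, 2
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, N))
X -= X.mean(axis=1, keepdims=True)               # centered data, Y = X

Sxx = X @ X.T
lam, U = np.linalg.eigh(Sxx)                     # ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]                   # reorder: λ1 ≥ λ2 ≥ ...
Up = U[:, :p]                                    # top-p eigenvectors

W = Up @ Up.T                                    # global map W = P_{U_p}
err = np.sum((X - W @ X) ** 2)                   # R_E at the global minimum

# R_E = tr Σ_XX − Σ_{i≤p} λ_i  (eq. 3.1.16 in the autoassociative case)
assert np.allclose(err, np.trace(Sxx) - lam[:p].sum())
```

This is exactly what PCA does: project the data onto the top p principal directions and lose only the discarded eigenvalues.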
3.2 Autoassociation with Multilayer Perceptrons and Singular Value Decomposition
H_0 := BX + b_1 v^T

as the input of the hidden layer, with p-dimensional bias vector b_1 and v an N-dimensional vector of ones. Furthermore, define H := [h_1, ..., h_N] = g(H_0) as the output matrix of the hidden layer, where the nonlinear function g is applied entrywise. Since we are trying to approximate the identity map for linear input x_i, in the output layer the non-linearity has to be removed, that means the output must be of the form

y_i = A h_i + b_2,

where b_2 describes the bias of the output layer. Consequently, the network activity is described by the function F with component-wise activity

F(x_i) = A g(B x_i + b_1) + b_2,

or equivalently,

F(X) = A g(BX + b_1 v^T) + b_2 v^T = Y.

Since we want the network to reproduce the input, the occurring error will be measured by the quadratic error function

R_E := ||X − F(X)||²

with respect to the matrix norm ||A||² = tr(A A^T). We want to find the optimal weights A, B and optimal biases b_1, b_2 that minimize the error function.

Remark 3.2.1. Observe that this norm equals the norm used in the previous section (see (3.1.4)).
such that

M = U D V

with positive, decreasing and uniquely defined entries s_1 > ... > s_r > 0 that are the singular values of M.

Proof. See [4], pages 217-218.
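A NumPy sketch of the Singular Value Decomposition (random matrix, illustrative only) shows the decreasing singular values and the relation M M^T = U D² U^T to the eigendecomposition used before:

```python
import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((4, 6))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

assert np.all(s[:-1] >= s[1:])                   # singular values are decreasing
assert np.allclose(U.T @ U, np.eye(4))           # orthonormal columns of U
assert np.allclose(M, U @ np.diag(s) @ Vt)       # M = U D V

# relation to the eigendecomposition: M M^T = U D^2 U^T
assert np.allclose(M @ M.T, U @ np.diag(s**2) @ U.T)
```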
Remark 3.2.3. In the case that M is a square matrix, the singular value decomposition equals the eigendecomposition that was used in the previous section in the proof of (3.1.6).
By differentiating the error function with respect to b_2 we can find an optimal bias vector b̂_2.

Theorem 3.2.4. Let R_E be the quadratic error function defined by

R_E := ||X − AH − b_2 v^T||²

with respect to the matrix norm ||A||² = tr(A A^T). Let X be an n × N-matrix and v an N-dimensional vector of ones. Then minimization of R_E with respect to b_2 yields that the optimal bias vector b̂_2 is given by

b̂_2 = (1/N) (X − AH) v.

Observe that this bias vector ensures that the average input equals the average output, which is desired when trying to reproduce data in the most accurate way.
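The mean-matching property of b̂_2 is easy to check numerically; in this sketch A and H are arbitrary made-up matrices, since the property holds regardless of how H was produced:

```python
import numpy as np

rng = np.random.default_rng(10)
N, n, p = 50, 4, 2
X = rng.standard_normal((n, N))                  # data to reproduce
A = rng.standard_normal((n, p))
H = rng.standard_normal((p, N))                  # hidden-layer output
v = np.ones(N)

b2 = (X - A @ H) @ v / N                         # b̂_2 = (1/N)(X − AH)v
out = A @ H + np.outer(b2, v)                    # network output AH + b_2 v^T

# with the optimal bias the average output equals the average input
assert np.allclose(out.mean(axis=1), X.mean(axis=1))
```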
Proof.
First note that

R_E = ||X − F(X)||²
= ||X − AH − b_2 v^T||²
= tr((X − AH − b_2 v^T)(X − AH − b_2 v^T)^T)
= tr((X − AH)(X − AH)^T) − tr((X − AH) v b_2^T) − tr(b_2 v^T (X − AH)^T) + tr(b_2 v^T v b_2^T).

Differentiating the mixed terms with respect to b_2 gives

∂/∂b_2 (−tr(b_2 v^T (X − AH)^T)) = −(X − AH) v,

and equivalently for the other mixed term. Clearly, since v^T v = N,

∂/∂b_2 (N tr(b_2 b_2^T)) = 2N b_2.
Inserting the optimal bias gives

R_E = ||X − AH − (1/N)(X − AH) v v^T||² = ||X (Id − (1/N) v v^T) − AH (Id − (1/N) v v^T)||².

So to simplify, we define X' := X (Id − (1/N) v v^T) and H' := H (Id − (1/N) v v^T), so that

R_E = ||X' − AH'||².
Observe that the transition from the output of the hidden layer to the output layer is linear, so both A and H' are matrices with real entries that can be expected to have full rank, and we are considering the same norm as in the previous section (see Remark 3.2.1). Thereupon, we find that this problem can be reduced to the autoassociative case (3.1.4) from the previous section, with optimal matrix

Â = U_p' C^{-1}   (3.2.2)

for some arbitrary invertible (p × p)-matrix C and U_p' the matrix of the first p eigenvectors of the sample covariance matrix of X'.
Remark 3.2.5. Building X 0 out of X is centering the data by removing the average
vector µX .
Remark 3.2.6. We notice that the optimal weight matrix, as well as the optimal bias, is obtained independently of the way the output H, or rather H', of the hidden layer was produced. That means the choice of the nonlinear function g has no impact on the choice of the optimal biases and weights connecting the output layer and the output of the hidden layer. The corresponding optimal weight matrix for the first transition is

B̂ = C U_p'^T.
Now we can find an explicit representation of the optimal bias vector b̂2 .
Corollary 3.2.7. In the linear case the optimal bias vector is given by

b̂_2 = (Id − U_p' U_p'^T) µ_X − U_p' C^{-1} b_1.   (3.2.3)

Proof. In the linear case,

H_0 = H = C U_p'^T X + b_1 v^T.

Now, taking into account (3.2.4) and inserting the optimal weight matrix Â, we obtain

b̂_2 = (1/N) (X − U_p' C^{-1} (C U_p'^T X + b_1 v^T)) v
= (1/N) (X v − U_p' U_p'^T X v − U_p' C^{-1} b_1 N)
= (Id − U_p' U_p'^T) µ_X − U_p' C^{-1} b_1.
Remark 3.2.8. Since the eigenvalue decomposition that was used to find the optimal weights in the previous section is a special case of the singular value decomposition, the optimal weights obtained in this section have been calculated using Singular Value Decomposition.
Consequently,

Ĥ_0 = C U_p'^T X   (3.2.4)

for a small but invertible scaling matrix C, that means with entries that must satisfy |c_ij| < 1.
Now by substituting (3.2.7) in (3.2.6) we obtain

H̃_0 = α_1^{-1} C U_p'^T X + (−α_1^{-1} C U_p'^T µ_X) v^T
= α_1^{-1} C U_p'^T X − (1/N) α_1^{-1} C U_p'^T X v v^T
= α_1^{-1} C U_p'^T X (Id − (1/N) v v^T)
= α_1^{-1} C U_p'^T X'.
Using the singular value decomposition of the matrix X' gives us X' = U_n Λ_n V_n, with U_n consisting of the normalized eigenvectors of Σ_X'X', V_n of the normalized eigenvectors of X'^T X', and Λ_n a diagonal matrix with entries √λ_i for the decreasingly ordered eigenvalues of Σ_X'X'. We obtain, by using the Singular Value Decomposition,

H̃_0 = α_1^{-1} C U_p'^T U_n Λ_n V_n = α_1^{-1} C Λ_p V_p,

and now, by applying g to H̃_0,

g(H̃_0) = α_0 v v^T + C Λ_p V_p = Ĥ.

By H' = H (Id − (1/N) v v^T) it follows that

Ĥ' = C Λ_p V_p,

with optimal bias

b̂_2 := µ_X − α_0 U_p' C^{-1} v,

which equals (3.2.3) and therefore proves the Theorem.
3.2.5 Conclusion

In this section we observed that for linear input data the choice of a specific activation function for the units of the hidden layer does not affect the result of an autoassociative network, given that this function can be linearly approximated in a neighbourhood of zero. Hence, non-linear functions in the hidden layer do not improve a network that handles linear input. This result can be extended inductively to a network with several hidden layers. Therefore, for linear input data it is recommendable to use linear layers, since an explicit form of the solution to the optimization problem (3.2.1) can be obtained using the results of the previous section. This avoids the problem of local minima that appears when we try to find a minimum using gradient descent for a network with non-linear layers. Moreover, we understand that the optimal bias vector is the one that removes the average from the data, that means it transfers the data to centred data with average zero.
Chapter 4

An Application of Neural Networks
However, we will separate the scheduled time Std UTC into the day of the week, the time and the season in which the flight was performed.

Figure 4.1: The data set: relevant columns are marked blue.

Using the Interquartile Range we can then detect and remove outliers. The Interquartile Range method gives the interval, called the IQR interval, in which 50% of the differences between scheduled block time and actual block time lie. The interval boundaries are then extended by the factor 1.5, and everything that is not included in the extended interval is removed (see (4.2a)). Applying the IQR method to the column DiffBlock, we find that 6% of the flights produce outliers and the IQR interval covers 8 minutes. The extended interval covers values from zero to 23 minutes. This can be visualized by a boxplot (see (4.2b)).
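The IQR method can be sketched in a few lines of NumPy; the block-time differences below are made up for illustration, not the values from the actual data set:

```python
import numpy as np

# hypothetical DiffBlock values in minutes, with one extreme outlier
diff_block = np.array([3., 5., 6., 8., 9., 10., 11., 12., 14., 60.])

q1, q3 = np.percentile(diff_block, [25, 75])
iqr = q3 - q1                                    # width of the IQR interval
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # boundaries extended by factor 1.5

kept = diff_block[(diff_block >= lo) & (diff_block <= hi)]
assert 60. not in kept                           # the extreme value is removed
```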
• modSimple, a simple model with two hidden layers of dimensions 12 and 10.

• modDeep, a deeper architecture with four hidden layers of dimensions 5, 12, 10, 5.

• modWider, a wider model with two hidden layers of dimensions 600, 65.

• modPCA, a model with one linear hidden layer of dimension 70 followed by two non-linear hidden layers of dimensions 12, 10.

Notice that, since we are looking at a regression problem, the output layer must be of dimension one in all models.
As non-linear activation function we chose the relu function

g(s) = max(0, s).

The loss function L is, as in the previous chapter, the mean squared error. As optimizer we chose Adam, which is an implementation of a stochastic gradient descent method. In order to avoid overfitting, we force the model to stop training if the loss does not improve over 5 training epochs. Additionally, to prevent overfitting, 20% of the training dataset is extracted as a validation set that helps to measure the error for unknown values during the training phase. If the mean absolute error for the validation set differs significantly from the training error, the model is overfitting. We observe that the deeper model (see (4.3b)) and the simple model (see (4.3a)) tend to overfit faster.
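The training setup can be sketched without a deep learning library. The following NumPy sketch (made-up stand-in data, plain full-batch gradient descent instead of Adam) mimics modSimple with relu activations, a 20% validation split and early stopping with patience 5:

```python
import numpy as np

rng = np.random.default_rng(11)

# toy regression data standing in for the flight data set (hypothetical)
X = rng.standard_normal((500, 3))
y = (X ** 2).sum(axis=1, keepdims=True) + 0.1 * rng.standard_normal((500, 1))

# 20% of the training data is split off as validation set
n_val = len(X) // 5
X_tr, y_tr, X_val, y_val = X[n_val:], y[n_val:], X[:n_val], y[:n_val]

relu = lambda s: np.maximum(0.0, s)              # g(s) = max(0, s)

# "modSimple": two hidden layers of dimensions 12 and 10, output dim 1
sizes = [3, 12, 10, 1]
Ws = [rng.standard_normal((a, b)) * np.sqrt(2 / a) for a, b in zip(sizes, sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def forward(X):
    h, acts = X, [X]
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = relu(h @ W + b)
        acts.append(h)
    return acts, h @ Ws[-1] + bs[-1]             # linear output layer

def mae(X, y):
    return np.abs(forward(X)[1] - y).mean()

init_val = mae(X_val, y_val)                     # validation error before training
best, patience, lr = np.inf, 0, 1e-3
for epoch in range(500):
    # full-batch gradient descent on the mean squared error (backpropagation)
    acts, out = forward(X_tr)
    delta = 2 * (out - y_tr) / len(X_tr)
    grads = []
    for i in range(len(Ws) - 1, -1, -1):
        grads.append((acts[i].T @ delta, delta.sum(axis=0)))
        if i > 0:
            delta = (delta @ Ws[i].T) * (acts[i] > 0)   # relu derivative
    for (gW, gb), W, b in zip(reversed(grads), Ws, bs):
        W -= lr * gW
        b -= lr * gb
    # early stopping: halt if validation error has not improved for 5 epochs
    val = mae(X_val, y_val)
    if val < best:
        best, patience = val, 0
    else:
        patience += 1
        if patience >= 5:
            break
```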
Figure 4.3: Mean absolute error by epochs for (a) the simple model and (b) the deeper model.
The wider model (see (4.4c)) learns a bit better than the other two models. It ends after 43 epochs with a mean absolute error of 5.445 minutes, while the simple model (see (4.4a)) ends with a mean absolute training error of 5.562 and the deeper model (see (4.4b)) with 5.682. The better performance of the wider model could be an indicator that the data set provides too little information to the network (see (2.1)).
4.3 Using Multilayer Perceptrons for a Regression Problem
(a) Training progress of the simple model. (b) Training progress of the deeper model.
(a) Distribution of the error, simple model. (b) Distribution of the error, deeper model.
predictions were too short, while the others perform worse in this regard. Since this prevents causing delays in other flights, for our purpose the deeper model is the best choice.
4.3.4 Conclusion
We have applied Multilayer Perceptrons to a regression problem with relatively few variables. Although the networks were given little information, the approximations were satisfying. The wider architecture outperformed the other architectures during training, possibly because widening provides more features and therefore more variance. However, for our purpose, avoiding delays, the deeper network performed best.
Bibliography

[1] J. R. Magnus and H. Neudecker (1986). Symmetry, 0-1 matrices and Jacobians. Econometric Theory, 2, 157-190.

[3] P. Baldi and K. Hornik (1988). Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks, 2, 53-58.