
A Mathematical Introduction to

Neural Networks

Aisha Lichtner-Bajjaoui

Adviser: Josep Vives

Advanced Mathematics Master Program of the


Universitat de Barcelona
Barcelona, January 12, 2020
Contents

1 Introduction
  1.1 History of Neural Networks
  1.2 Basic Results of Linear Algebra and Analysis

2 Mathematical definition of a Neural Network
  2.1 Bias and Variance
  2.2 The Neurons
  2.3 Multilayer Perceptron
    2.3.1 Training a Multilayer Perceptron
    2.3.2 Autoassociative Multilayer Perceptron

3 Properties and Connection to Statistical Methods
  3.1 Global Minimum and Principal Component Analysis
    3.1.1 Principal Component Analysis
    3.1.2 Ordinary Least Square Regression
    3.1.3 Results about Local and Global Minima
    3.1.4 The Autoassociative Case and its Connection to PCA
    3.1.5 Conclusion
  3.2 Autoassociation with Multilayer Perceptrons and Singular Value Decomposition
    3.2.1 Singular Value Decomposition
    3.2.2 Explicit Results
    3.2.3 Linear Function in the Hidden Units
    3.2.4 Hidden Layer with Non-Linear Units
    3.2.5 Conclusion

4 An Application to Neural Networks
  4.1 The Problem
  4.2 The Dataset
    4.2.1 Decoding the Airport Combinations by One Hot Encoding
    4.2.2 Find and Remove Outliers
    4.2.3 Splitting and Centering the Data
  4.3 Using Multilayer Perceptrons for a Regression Problem
    4.3.1 Build Models
    4.3.2 Train Models
    4.3.3 Evaluate Models on Test Data
    4.3.4 Conclusion

Bibliography
Chapter 1

Introduction

In this work we introduce Neural Networks. First, we give a mathematical formulation of the concept of a Neural Network. We then examine some important properties of Neural Networks and make a connection to common statistical methods such as Principal Component Analysis and Singular Value Decomposition. In the last chapter we present a practical application of a neural network to a regression problem. The concept of a Neural Network is inspired by the activity of the human brain. Neurons receive information and, if the information is relevant to the neuron, a signal is sent to other neurons via synapses. The main difference between Neural Networks and rule-based statistical methods is the learning ability of Neural Networks. At the beginning of the training phase a network has no explicit information. During the training phase the inter-neural connections are changed so that the network solves the given problem as well as possible. Therefore Neural Networks can provide solutions to a wide spectrum of problems. A Neural Network is an abstract model consisting of one or more layers that are connected in a certain way. The weighted connections between the layers play the role of synapses. Each layer consists of units modelling the neurons of the human brain. The units carry activation functions, modelling the impulses that real neurons send when being triggered. Just like the brain, the network is trained to learn a specific task and should later perform a similar task in an unknown situation, using the experience it gained before. To that end, during the training phase already-observed information is passed to the model and the model produces an output. The output is evaluated on its ability to approximate the observed information. Depending on the result, the model is then changed to improve performance.

1.1 History of Neural Networks


The idea of artificial Neural Networks goes back to the 1940s, when Walter Pitts and Warren McCulloch observed that neurons in the human brain essentially perform combinations of logical operations and have binary outputs that depend on a specific threshold: active or not active. Based on that observation, mathematical models were built and created great interest in the Artificial Intelligence community. However, at that time computers had just been invented and were far from being able to handle algorithms of such complexity. In the following years scientists lost interest in Neural Networks, due to the lack of progress in the field and to other methods that seemed more promising at the time. In 1970 Seppo Linnainmaa described Backpropagation, which would later revolutionize the performance of Neural Networks. Still, Neural Networks were not in the focus of the Artificial Intelligence community until 2012, when students won the ImageNet Large Scale Visual Recognition Challenge with remarkable results. Since then, Neural Networks have been a rising topic in Artificial Intelligence.

1.2 Basic Results of Linear Algebra and Analysis


Definition 1.2.1. A square (n × n)-matrix A ∈ R^{n×n} is called

• positive definite if ⟨x, Ax⟩ = x^T A x > 0 for all x ∈ R^n with x ≠ 0,

• positive semidefinite if ⟨x, Ax⟩ = x^T A x ≥ 0 for all x ∈ R^n.

Remark 1.2.2. Note that for every (n × p)-matrix A the matrices A^T A ∈ R^{p×p} and A A^T ∈ R^{n×n} are symmetric and positive semidefinite, by bilinearity of the scalar product:

⟨x, A^T A x⟩ = ⟨Ax, Ax⟩ = ||Ax||² ≥ 0

and

⟨x, A A^T x⟩ = ⟨A^T x, A^T x⟩ = ||A^T x||² ≥ 0.
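As a quick numerical illustration of Remark 1.2.2 (a minimal sketch using numpy; the matrix A below is an arbitrary example, not taken from the text), one can check that A^T A and A A^T are symmetric and have no negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))          # an arbitrary (n x p) example matrix

AtA = A.T @ A                        # (p x p), symmetric by construction
AAt = A @ A.T                        # (n x n), symmetric by construction

# Symmetry and positive semidefiniteness (all eigenvalues >= 0, up to rounding)
print(np.allclose(AtA, AtA.T), np.all(np.linalg.eigvalsh(AtA) >= -1e-12))
print(np.allclose(AAt, AAt.T), np.all(np.linalg.eigvalsh(AAt) >= -1e-12))
```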
Lemma 1.2.3. Let A be an (n × p)-matrix. The map f defined by

f(x) = Ax

is

1. injective if and only if A has full column rank p,

2. surjective if and only if A has full row rank n.
Definition 1.2.4. Let A = (a_ij) be an (m × n)-matrix and B a (p × q)-matrix. Then the Kronecker product

A ⊗ B

is the (mp × nq)-matrix whose (i, j)-th block is a_ij B.
Definition 1.2.5. Let A be an (m × n)-matrix with column vectors a_1, ..., a_n. The transformation of the matrix A into a column vector by stacking its columns is denoted by

vec(A) := (a_1^T, ..., a_n^T)^T.

Definition 1.2.6. Let M = (m_ij) be a square (n × n)-matrix. Then the trace of M is defined by

tr(M) = Σ_{i=1}^n m_ii.

Lemma 1.2.7. Let M be an (n × m)-matrix, N an (m × n)-matrix, Q a square (n × n)-matrix and U an orthogonal (n × n)-matrix. Then:

1. tr(MN) = tr((MN)^T) = tr(NM),

2. tr(U Q U^T) = tr(Q).

Lemma 1.2.8. Let M be a symmetric, positive semidefinite (n × n)-matrix of full rank n. Then M is positive definite.

Theorem 1.2.9. Spectral Theorem


Let M ∈ Rn×n be a symmetric matrix. Then there exist a diagonal matrix D ∈ Rn×n
and an orthogonal matrix U ∈ Rn×n such that M = U DU T . The diagonal entries of
D are the eigenvalues of M and the columns of U are the corresponding eigenvectors.

Theorem 1.2.10. Let M be an (n × m)-matrix, c any n-dimensional vector and F : R^m → R the quadratic function given by F(z) = ||c − Mz||². Then:

1. F is convex;

2. ∇F(z) = 0 if and only if z corresponds to a global minimum;

3. if in addition M^T M is positive definite, then F is strictly convex and has a unique minimum in z = (M^T M)^{-1} M^T c.

Remark 1.2.11. Observe that

||c − Mz||² = ⟨c − Mz, c − Mz⟩ = ⟨c, c⟩ − 2⟨c, Mz⟩ + ⟨Mz, Mz⟩ = c^T c − 2 c^T M z + z^T M^T M z

and therefore

∇F(z) = 2 M^T M z − 2 M^T c,

so with the previous Theorem we can conclude:

Corollary 1.2.12. A minimum of F is reached exactly when z satisfies

M^T M z = M^T c.    (1.2.1)
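The normal equations (1.2.1) are easy to check numerically. The following sketch (with an arbitrary random M and c, assumed only for illustration) compares the closed-form minimizer of Theorem 1.2.10(3) with a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(6, 3))              # full column rank, so M^T M is positive definite
c = rng.normal(size=6)

# Closed-form minimizer z = (M^T M)^{-1} M^T c from Theorem 1.2.10(3)
z_closed = np.linalg.solve(M.T @ M, M.T @ c)

# Same minimizer obtained by a generic least-squares routine
z_lstsq, *_ = np.linalg.lstsq(M, c, rcond=None)

print(np.allclose(z_closed, z_lstsq))    # True: both satisfy M^T M z = M^T c
```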

Lemma 1.2.13. For any matrices P, Q, R and S (of dimensions that allow the multiplications) the following hold:

1. tr(P Q^T) = (vec P)^T vec(Q)

2. vec(P Q R^T) = (R ⊗ P) vec(Q)

3. (P ⊗ Q)(R ⊗ S) = PR ⊗ QS

4. (P ⊗ Q)^{-1} = P^{-1} ⊗ Q^{-1}

5. (P ⊗ Q)^T = P^T ⊗ Q^T

6. If additionally P and Q are symmetric and positive semidefinite, then P ⊗ Q is symmetric and positive semidefinite.

7. If additionally P and Q are symmetric and positive definite, then P ⊗ Q is symmetric and positive definite.

Proof. see [1], page 167-190.
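Items (1) and (2) of Lemma 1.2.13 can be verified numerically; a minimal sketch with arbitrary random matrices and a column-stacking vec matching Definition 1.2.5:

```python
import numpy as np

def vec(M):
    # Stack the columns of M into one long column vector (Definition 1.2.5)
    return M.reshape(-1, order="F")

rng = np.random.default_rng(2)
P, Q, R = rng.normal(size=(4, 3)), rng.normal(size=(4, 3)), rng.normal(size=(5, 3))

# Item (1): tr(P Q^T) = vec(P)^T vec(Q)
print(np.isclose(np.trace(P @ Q.T), vec(P) @ vec(Q)))

# Item (2): vec(P Q R^T) = (R ⊗ P) vec(Q); here Q must be (3 x 3) for P Q R^T to exist
Q2 = rng.normal(size=(3, 3))
print(np.allclose(vec(P @ Q2 @ R.T), np.kron(R, P) @ vec(Q2)))
```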


Chapter 2

Mathematical definition of a
Neural Network

In this chapter we introduce Neural Networks in the language of Statistics. The general idea of a Neural Network is to find a model that, based on a set of N samples

D = {[x_1, .., x_N], [y_1, .., y_N]},    (2.0.1)

approximates an unknown function f with

f(x_i) = y_i

as well as possible. The model always consists of one input layer, an output layer and at least one hidden layer. The layers are connected by weighted transitions, which can be modelled as the product of a weight matrix with a vector that represents the output of the previous layer. Each layer consists of several units, the neurons. The number of units in each layer defines the dimension of the layer. The activation functions in the neurons essentially model a threshold that decides whether the information a neuron receives is relevant for the further calculations or not. If the information is relevant, it is passed on to the next layer until the output layer is reached. During the so-called training phase, the model is given a sample data set D, where the matrix [x_1, ..., x_N] is the input and [y_1, ..., y_N] the output matrix. After processing each sample (x_i, y_i) ∈ D, the distance between the output that the network produces and the desired output y_i is measured by some loss function, for example the squared Euclidean (L²) distance. According to the result, the weights of the transitions are modified to achieve a better result in the next iteration. The dimension of the input layer is determined by the dimension of the input data, while the dimension of the output layer corresponds to the problem we want to solve with the model. For example, if we use the model for regression, the output layer is of dimension one, because we want the model to assign one value to each input. For a classification problem, we use an output layer of dimension k, where k is the number of categories into which we want to sort the data.


For a sufficiently large sample, the model is expected to produce a good approximation of the desired output. In the following we will try to find the best estimator F̂ of f, that is, the function that best satisfies

F̂(x_i) = y_i

for the given observations D. Then we will apply F̂ to unknown values z to make a prediction or classification.

Figure 2.1: Neural Network with several hidden layers

Definition 2.0.1. Let x_i ∈ R^n and y_i ∈ R^k for all i ≤ N ∈ N. Define

D := {[x_1, .., x_N], [y_1, .., y_N]},

the sample of size N. Let

X := [x_1, .., x_N]

denote the (n × N)-matrix that consists of the training input vectors x_i of a sample D of size N. In the following we will call D the training set. Similarly,

Y := [y_1, .., y_N]

models the corresponding observed training output vectors y_i and forms a (k × N)-matrix.

Definition 2.0.2. By Z := [z1 , ..zM ] we denote the test set with test vectors
zi ∈ Rn .

Definition 2.0.3. Let X̃ be an n-dimensional random vector that models the random choice of training input vectors x_i. The random vector Z describes the choice of test data. By Ỹ we denote a k-dimensional random vector that models the output of the network for all possible inputs.

Definition 2.0.4. A function F : R^n → R^k represents the network activity and maps input vectors to output vectors.

The function F satisfies

F(x_i) + ε_i = y_i,

where ε_i is a random variable with unknown distribution P that describes the noise contained in the observations.

Our goal is to train the network by changing the function F so that it approximates f as well as possible, that is, by minimizing a risk function R that depends on the choice of the model F. It is calculated by integrating a loss function L that measures the approximation error:

R(F) = ∫ L(Ỹ, F(X̃)) dP(X̃, Ỹ),    (2.0.2)

where P denotes the unknown joint distribution of (X̃, Ỹ).


Since we cannot calculate the integral, lacking knowledge of the measure P, we will instead try to minimize the average loss over the sample D, for the given observations (x_i, y_i):

R_E(F) = (1/N) Σ_{i=1}^N L(y_i, F(x_i)).
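A minimal sketch of the empirical risk R_E for the quadratic loss (the toy data and the candidate model F below are illustrative assumptions, not an example from the text):

```python
import numpy as np

def empirical_risk(F, xs, ys):
    # Average quadratic loss over the sample D = (xs, ys)
    return np.mean([np.sum((y - F(x)) ** 2) for x, y in zip(xs, ys)])

# Toy sample and a toy linear model standing in for the network map F
xs = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
ys = 3.0 * xs + 0.1 * np.random.default_rng(3).normal(size=xs.shape)
F = lambda x: 2.5 * x                      # some candidate model

print(empirical_risk(F, xs, ys))
```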

We can see, however, that an optimal solution to (2.0.2) must satisfy

F(x_i) = E(Ỹ | x_i),    (2.0.3)

that is, the conditional expectation of Ỹ given the input x_i.


To make that connection more understandable we will look at the following
example.

Example 2.0.5. Let the loss function L be the quadratic error that is

L(y, F (x)) = (y − F (x))2 .



Calculating the risk for this loss function we obtain

R(F) = ∫ (y − F(x))² dP(x, y)
     = E[(Ỹ − F(X̃))²]
     = E[(Ỹ − E(Ỹ|X̃) + E(Ỹ|X̃) − F(X̃))²]
     = E[(Ỹ − E(Ỹ|X̃))²] + E[(E(Ỹ|X̃) − F(X̃))²]
       + 2 E[(Ỹ − E(Ỹ|X̃))(E(Ỹ|X̃) − F(X̃))].

Notice that, since E(Ỹ|X̃) is the orthogonal projection of Ỹ onto the square-integrable functions measurable with respect to the sub-σ-field generated by X̃, the residual Ỹ − E(Ỹ|X̃) is orthogonal to every such function, in particular to E(Ỹ|X̃) − F(X̃). The cross term therefore vanishes and

R(F) = E[(Ỹ − E(Ỹ|X̃))²] + E[(E(Ỹ|X̃) − F(X̃))²],

so the best choice of F to minimize R is the conditional expectation

F(x) = E(Ỹ|x).

Definition 2.0.6. F̂ : R^n → R^k denotes an estimator that satisfies

F̂_D = arg min_F R(F)    (2.0.4)

for a given sample D.

Observe that we do not always know whether the optimal estimator E(Ỹ|x) is contained in the space of functions that the network can implement, or that satisfy given conditions on the approximating function, such as, for example, continuity. We call the difference between E(Ỹ|x) and F̂ the error.

2.1 Bias and Variance


We consider the estimator F̂_D for a given sample D and the quadratic loss function. Taking into account the previous observation (2.0.3), the expected quadratic error for an observation x, given D, is

E[(E(Ỹ|x) − F̂_D(x))² | D].    (2.1.1)

In order not to obtain a result that measures the accuracy of the approximation only for one specific sample D, we consider the expectation E_D of (E(Ỹ|x) − F̂_D(x))², that is, the expectation with respect to all possible sample sets D, hence the average over all possible samples of size N. We observe

E_D[(E(Ỹ|x) − F̂_D(x))²]
  = E_D[((F̂_D(x) − E_D[F̂_D(x)]) + (E_D[F̂_D(x)] − E(Ỹ|x)))²]
  = E_D[(F̂_D(x) − E_D[F̂_D(x)])²] + (E_D[F̂_D(x)] − E(Ỹ|x))²
    + 2 (E_D[F̂_D(x)] − E(Ỹ|x)) · E_D[F̂_D(x) − E_D[F̂_D(x)]]
  = (E_D[F̂_D(x)] − E(Ỹ|x))²  +  E_D[(E_D[F̂_D(x)] − F̂_D(x))²],
        (squared bias)                 (variance)

where the cross term vanishes because E_D[F̂_D(x) − E_D[F̂_D(x)]] = 0.

If F̂_D on average differs from E(Ỹ|·), that is, if (E_D[F̂_D(x)] − E(Ỹ|x))² is large, we call F̂_D biased. Even if we find an unbiased F̂_D, the variance can still cause poor performance: the estimator may agree with E(Ỹ|·) on average, E_D[F̂_D(x)] = E(Ỹ|x), while the individual F̂_D(x) still fluctuate strongly around that average, depending on how sensitive F̂_D is to the particular sample.

From this we see that, on the one hand, a function F̂ can approximate f very well on a specific training sample while the results on other samples are poor. This problem is called underfitting: the network has, due to a poor choice of training samples, not learned enough about the properties of the function f it approximates. On the other hand, F̂ might be an unbiased approximation that is very sensitive to the given data, and thus approximates E(Ỹ|·) well on average but still performs poorly. We call this behaviour of a network overfitting: the network models the noise in the training sample.

We can observe that there is a dilemma between reducing the bias and reducing the variance. The bias can be reduced by increasing the dimension of the representation of the training samples, for example by adding a layer of higher dimension, which gives the training samples more features and therefore more information. But with this additional information the variance will in general increase. This creates a trade-off between variance and bias.
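The bias-variance decomposition can be observed empirically. The sketch below is a toy setting with a known regression function, assumed purely for illustration: it fits polynomials of low and high degree to many random samples and estimates the squared bias and the variance of the prediction at a fixed point x₀:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * np.pi * x)         # "true" regression function E(Y|x)
x0, n_samples, n_train = 0.3, 500, 40

def fit_and_predict(degree):
    preds = []
    for _ in range(n_samples):              # draw many training sets D
        x = rng.uniform(size=n_train)
        y = f(x) + 0.3 * rng.normal(size=n_train)
        coeffs = np.polyfit(x, y, degree)   # the estimator F_D depends on the sample D
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2     # (E_D[F_D(x0)] - E(Y|x0))^2
    var = preds.var()                       # E_D[(F_D(x0) - E_D[F_D(x0)])^2]
    return bias2, var

print("degree 1 :", fit_and_predict(1))     # high bias, low variance
print("degree 10:", fit_and_predict(10))    # low bias, higher variance
```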

Figure 2.2: Bias and Variance

2.2 The Neurons


In this section we take a closer look at the activities inside the neurons. Each neuron receives a signal s = (s_1, ..., s_m), consisting of, depending on the architecture of the network, m ∈ N scalar inputs s_1, ..., s_m, together with a vector w that models the weights of the transitions from the previous layer towards the neuron. A function h then acts on the inputs, producing one output h(s_1, ..., s_m) that is passed on to an activation function g, generating the one-dimensional output g(h(s_1, ..., s_m)).

There are different ways in which the neurons process the inputs. Essentially we
can distinguish between two types of neurons:

1. Neurons that perform a scalar product

One way to handle the inputs is to take the scalar product of the input vector and the weight vector w. Hence the activity of the neuron can be described by

h(s) = ⟨w, s⟩.

Common choices for the activation function g are the sigmoid function g(t) = 1/(1 + e^{−t}), the identity, or a sign function.

2. Neurons that calculate a distance

Alternatively, a neuron can calculate the distance between the input signal s and w. Hence

h(s) = ||w − s||

for some norm || · ||.
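A minimal sketch of the two neuron types (a scalar-product neuron with a sigmoid activation, and a distance-based neuron), with arbitrary example weights and inputs:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([0.5, -1.0, 2.0])      # weights of the incoming connections
s = np.array([1.0, 0.3, -0.2])      # incoming signal s = (s_1, ..., s_m)

# Type 1: scalar-product neuron, output g(<w, s>)
out_dot = sigmoid(np.dot(w, s))

# Type 2: distance neuron, h(s) = ||w - s|| (Euclidean norm here)
out_dist = np.linalg.norm(w - s)

print(out_dot, out_dist)
```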

Figure 2.3: Neuron activity

2.3 Multilayer Perceptron


The activity of a network is mainly characterised by the choice of activation functions and of the weights connecting the layers, but also by the choice of a specific architecture. A network architecture is a choice of the number of hidden layers, the size of the layers (meaning the number of neurons) and the way the neurons are connected. In this work we will only examine Multilayer Perceptrons.

Definition 2.3.1. A Multilayer Perceptron is a network that consists of one input layer, one output layer and one or more hidden layers. The neurons of a layer are only connected to neurons of other layers, in such a way that no cycles are formed; in particular, there are no connections back to neurons of previous layers. The neurons only perform scalar-product operations, and the network activity can be described by a function

y = F(x; W)

for any input x and output y, where W describes the chosen weights.
Let us take now a closer look at how the Multilayer Perceptron learns.

2.3.1 Training a Multilayer Perceptron


In the training phase the Multilayer Perceptron is given a sample D, defined as in (2.0.1). Based on the given information, the network tries to find an F that is a solution to (2.0.4). Since, for a fixed choice of network architecture with quadratic error function, the result depends only on the choice of the weights calculated during the training phase, the problem (2.0.4) can be reduced to

W* = arg min_W (1/N) Σ_{i=1}^N (y_i − F(x_i; W))².    (2.3.1)

W is the matrix that models all weights of the whole network. The problem (2.3.1) is solved by a gradient descent method, such as Backpropagation. Gradient descent methods are based on the idea of minimizing a function by iteratively moving in the direction of steepest descent, that is, the direction opposite to the gradient of the function. So for (2.3.1) the t-th iteration step is modelled by

W_{k,j}(t) = W_{k,j}(t − 1) + ∆W_{k,j}(t),

where the update ∆W_{k,j}(t) is calculated from the gradient of the sample error as

∆W_{k,j}(t) = −ε(t) · ∂(y_i − F(x_i; W))² / ∂W_{k,j},

with learning rate ε(t) > 0.

Remark 2.3.2. Notice that another approach would be to compute the gradient of the whole expression (1/N) Σ_{i=1}^N (y_i − F(x_i; W))². However, that would slow down the procedure, since in each iteration step all samples would have to be considered.
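A minimal sketch of one stochastic gradient step of the form W_{k,j}(t) = W_{k,j}(t − 1) + ∆W_{k,j}(t), for a small one-hidden-layer perceptron with sigmoid hidden units; the architecture sizes, learning rate and data are illustrative assumptions, not the setting of any experiment in this text:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(5)
n, p = 4, 2                                  # input dimension, hidden units
B = rng.normal(scale=0.1, size=(p, n))       # input -> hidden weights
a = rng.normal(scale=0.1, size=p)            # hidden -> output weights
eps = 0.05                                   # learning rate eps(t)

x, y = rng.normal(size=n), 1.0               # one training sample (x_i, y_i)

# Forward pass: F(x; W) = a^T g(Bx)
h = sigmoid(B @ x)
out = a @ h

# Backward pass: gradient of (y - F(x; W))^2 with respect to a and B
err = out - y
grad_a = 2 * err * h
grad_B = np.outer(2 * err * a * h * (1 - h), x)   # chain rule through the sigmoid

# Gradient descent update W(t) = W(t-1) - eps * gradient
a -= eps * grad_a
B -= eps * grad_B
print((y - a @ sigmoid(B @ x)) ** 2)         # error after one step
```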

2.3.2 Autoassociative Multilayer Perceptron


An important special case of the Multilayer Perceptron is the Autoassociative Multilayer Perceptron, which can be interpreted as learning the identity map F(x) = x. A significant application of this architecture is data compression: since the network first lowers the dimension of the input and then tries to reproduce it as well as possible, in the optimal case, that is in the minimum of R, the data is compressed in the hidden layer in a way that preserves as much information as possible.

Definition 2.3.3. A Multilayer Perceptron where the output of the network equals
the given input is called Autoassociative Multilayer Perceptron.
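A minimal sketch of such an autoassociative perceptron with one linear hidden layer of dimension p < n, trained by plain gradient descent on the reconstruction error Σ_i ||x_i − ABx_i||²; the sizes, learning rate and random data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, N = 5, 2, 200
X = rng.normal(size=(n, N))                 # columns are the training patterns x_i
X -= X.mean(axis=1, keepdims=True)          # centre the data

A = rng.normal(scale=0.1, size=(n, p))      # hidden -> output weights
B = rng.normal(scale=0.1, size=(p, n))      # input -> hidden weights
lr = 1e-3

for _ in range(2000):                        # gradient descent on R_E = ||X - ABX||^2
    R = X - A @ B @ X                        # reconstruction residual
    grad_A = -2 * R @ X.T @ B.T
    grad_B = -2 * A.T @ R @ X.T
    A -= lr * grad_A
    B -= lr * grad_B

print(np.sum((X - A @ B @ X) ** 2))          # reconstruction error after training
```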
Chapter 3

Properties and Connection to


Statistical Methods

In this chapter we will discuss important properties of Multilayer Perceptrons and their connection to well-known statistical methods. To begin with, we introduce some notation and definitions that will be used throughout the chapter.
Definition 3.0.1. For an (n × m)-dimensional matrix M we define the average vector of its columns by

µ_M := (1/m) M v,

where v denotes the m-dimensional vector of ones.
Definition 3.0.2. For a sample matrix X with column vectors x_i and a sample matrix Y with column vectors y_i, i ∈ {1, .., N}, let the (n × n)-dimensional matrices

Σ_XX := Σ_{i=1}^N (x_i − µ_X)(x_i − µ_X)^T,
Σ_XY := Σ_{i=1}^N (x_i − µ_X)(y_i − µ_Y)^T,
Σ_YX := Σ_{i=1}^N (y_i − µ_Y)(x_i − µ_X)^T,
Σ_YY := Σ_{i=1}^N (y_i − µ_Y)(y_i − µ_Y)^T

define the sample covariance matrices. Finally, define

Σ := Σ_YX Σ_XX^{-1} Σ_XY.

Lemma 3.0.3. ΣXX is positive semidefinite and symmetric.


Remark 3.0.4. Notice that, for centered samples X and Y, that is µ_X = 0 and µ_Y = 0, we have Σ_XX = XX^T, Σ_YY = YY^T, Σ_XY = XY^T and Σ_YX = YX^T.

Remark 3.0.5. Additionally, we can expect the sample covariance matrix to have
only positive eigenvalues, which implies full rank. A rank deficient covariance matrix
would indicate a poor choice of training samples since linear dependencies describe
correlations in the training set.
Definition 3.0.6. Let λ_1 > λ_2 > ... > λ_n be the eigenvalues of Σ and let I = {i_1, .., i_p} with i_1 < i_2 < ... < i_p ≤ n be an ordered index set. Define the matrix of the orthonormal eigenvectors of Σ associated to the eigenvalues λ_{i_1}, ..., λ_{i_p} by

U_I := [u_{i_1}, ..., u_{i_p}].

Furthermore, define

U_p := [u_1, ..., u_p]

as the matrix of the first p eigenvectors, corresponding to the p biggest eigenvalues λ_1 > ... > λ_p.

Definition 3.0.7. For an n × p matrix M with p < n we denote the orthogonal


projection on the subspace spanned by the column vectors of M by PM .
If M is of full rank p, then

PM = M (M T M )−1 M T .

Let us first see that Multilayer Perceptrons are universal approximators.

Theorem 3.0.8. Let f be a measurable function, N the size of a sample, and Φ(W) the space of functions that a Multilayer Perceptron with one non-linear hidden layer and weights W can implement. Then, for any fixed precision value ε > 0, there exists an optimal weight matrix W such that the probability that F(·, W) ∈ Φ(W) approximates f with precision ε tends to one as N → ∞.

Proof. See [2], pages 359-366.

3.1 Global Minimum and Principal Component Analysis
In this section we will take a closer look at the learning capacity of Multilayer
Perceptrons using gradient descent methods, discussing, in particular, the problem
of local minima. As a result, we will observe that there is a unique global minimum
and no local minima. Moreover, we will find a connection to Ordinary Least Square
Regression and Principal Component Analysis in the autoassociative case. To begin
with, let us introduce Principal Component Analysis and Ordinary Least Square
Regression.

3.1.1 Principal Component Analysis


Principal Component Analysis is a statistical method to represent high-dimensional data in simplified terms, meaning in a lower dimension. To do so while losing as little information as possible, we look for the principal components of the data, that is, the directions of the specific data set that carry the most information, hence vary the most. The data is then represented with respect to the principal components: the dimension, and therefore the complexity, is reduced while losing as little information as possible.
Let us put this concept into mathematical terms.
Definition 3.1.1. Let X be a data matrix with columns X_1, ..., X_n and covariance matrix C of rank n. The orthonormal eigenvectors of C are denoted by u_i = (u_{1i}, ..., u_{ni}) and the eigenvalues by λ_i, arranged in decreasing order, that is,

λ_1 > λ_2 > ... > λ_n.

The vectors defined by

P_i = u_{1i} X_1 + ... + u_{ni} X_n

are called the principal components. Notice that

Var(P_1) = λ_1 > ... > Var(P_n) = λ_n,

so the first components carry the most information. The principal components can be obtained by diagonalizing the covariance matrix,

C = U Λ U^T,

with Λ = diag(λ_1, ..., λ_n) and U an orthogonal matrix whose columns are the orthonormal eigenvectors. The matrix X can now be represented by its principal components via the linear transformation

P = XU.
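A minimal numpy sketch of Principal Component Analysis as described above: diagonalize the covariance matrix and project the data onto the leading eigenvectors (the data here are arbitrary illustrative samples):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])   # rows are observations
Xc = X - X.mean(axis=0)                                    # centre the variables

C = np.cov(Xc, rowvar=False)                               # covariance matrix
eigvals, U = np.linalg.eigh(C)                             # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]                          # sort them decreasingly
eigvals, U = eigvals[order], U[:, order]

P = Xc @ U                                                 # principal components P = XU
print(eigvals)                 # variances of the principal components
print(P[:, :2].shape)          # data represented by the first two components
```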

3.1.2 Ordinary Least Square Regression


Let us briefly recall another common method in Statistics, Ordinary Least Squares Regression. That is, for fixed n-dimensional vectors x_i, y_i, finding an (n × n)-matrix M with M x_i ≈ y_i that minimizes

Σ_{i∈{1,..,N}} ||y_i − M x_i||².

Theorem 3.1.2. If Σ_XX is invertible, then the problem stated above has a unique solution

M = Σ_YX Σ_XX^{-1}.
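A numerical sketch of Theorem 3.1.2 (random centred data, assumed only for illustration), comparing M = Σ_YX Σ_XX^{-1} with a generic least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(8)
n, N = 3, 100
X = rng.normal(size=(n, N)); X -= X.mean(axis=1, keepdims=True)
M_true = rng.normal(size=(n, n))
Y = M_true @ X + 0.05 * rng.normal(size=(n, N)); Y -= Y.mean(axis=1, keepdims=True)

Sxx, Syx = X @ X.T, Y @ X.T                    # sample covariance matrices (centred data)
M = Syx @ np.linalg.inv(Sxx)                   # closed-form OLS solution of Theorem 3.1.2

# Same solution from a generic least-squares routine (minimizes sum_i ||y_i - M x_i||^2)
M_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
print(np.allclose(M, M_lstsq))
```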

3.1.3 Results about Local and Global Minima


We consider a Multilayer Perceptron with one linear input layer of n units, one linear output layer of n units and one linear hidden layer of p < n units. N training vectors x_i, i ∈ {1, .., N}, of dimension n are used; they represent different patterns that we want the network to recognize and map to N output patterns y_i, i ∈ {1, .., N}, also of dimension n. We denote by X the input matrix and by Y the output matrix. In this section we consider centred training data, i.e. µ_X = 0. The error of a chosen network map F is measured by the quadratic error function

R_E := Σ_{i∈{1,..,N}} ||y_i − F(x_i)||².

During the training phase we want to minimize this error by adjusting the weights in each layer, that is, by changing the function F. A minimum can be found by a numerical algorithm, such as an implementation of the gradient descent method. Therefore the network suffers from a very common problem in numerical analysis: it is hard to tell whether the gradient descent method will find the global minimum or end up in a local minimum. Since we consider a network with only one hidden layer and all units linear, the function F must be of the form F(x_i) = W x_i = AB x_i, where B is a real p × n matrix representing the transition from the input layer to the hidden layer and A is a real n × p matrix representing the transition from the hidden layer to the output layer. As a network architecture that does not use all units of a hidden layer is not practicable, we can expect A and B to have full rank p, that is, to represent a surjective, respectively injective, map. Although using linear layers restricts the network strongly, since only linear maps can be implemented, it still serves an important purpose: understanding the activities in the hidden layers. Notice that the results found for one hidden layer extend easily to a deeper architecture with more hidden layers, since by linearity of the transitions they can be collapsed into one hidden layer.

We can now make some statements about optimal weights, that is, the matrices A and B for which the quadratic error function attains a minimum. These optimal weights are never unique, since they can always be multiplied by an invertible matrix C and result in the same global map:

(AC)(C^{-1}B) = AB = W.

Figure 3.1: Multilayer Perceptron with one linear hidden layer

Theorem 3.1.3. For a fixed n × p matrix A the function R_E(A, B) is convex in the coefficients of B. Moreover:

1. R_E attains its minimum for any B that satisfies

A^T A B Σ_XX = A^T Σ_YX.    (3.1.1)

2. If Σ_XX is invertible and A has full rank p, then R_E is strictly convex in B and a unique minimum is reached in

B = B̂(A) = (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1}.    (3.1.2)

3. In the autoassociative case it follows that

B = B̂(A) = (A^T A)^{-1} A^T.    (3.1.3)

Proof.

1. Let A be any fixed (n × p)-matrix and B a (p × n)-matrix. First observe that, by definition, the square of the Euclidean norm of a vector is just the sum of its squared entries. Therefore the sum of the squared norms of two vectors v, w equals the squared norm of the stacked vector p := (v^T, w^T)^T = vec([v, w]), that is,

||v||² + ||w||² = ||p||².

So we obtain that

R_E(A, B) = Σ_{i∈{1,..,N}} ||y_i − ABx_i||²
          = ||vec(Y − ABX)||²    (3.1.4)
          = vec(Y − ABX)^T vec(Y − ABX)
          = tr(vec(Y − ABX) vec(Y − ABX)^T)        (by 1.2.13(1))

By Lemma 1.2.13(2) we know

||vec(Y − ABX)||2 = ||vec(Y ) − (X T ⊗ A)vecB||2

and therefore by Theorem 1.2.10 and (1.2.12) it follows that

RE (A, B) = ||vec(Y ) − (X T ⊗ A)vec(B)||2

is convex in the elements of B and a global minimum is reached if and only if

(X T ⊗ A)T (X T ⊗ A)vec(B) = (X T ⊗ A)T vec(Y ).

Looking at the left-hand side we obtain

(X^T ⊗ A)^T (X^T ⊗ A) vec(B) = (X ⊗ A^T)(X^T ⊗ A) vec(B)        (by 1.2.13(5))
                             = (XX^T ⊗ A^T A) vec(B)             (by 1.2.13(3))
                             = (Σ_XX ⊗ A^T A) vec(B)
                             = vec(A^T A B Σ_XX),                (by 1.2.13(2))

while on the right-hand side we observe

(X^T ⊗ A)^T vec(Y) = (X ⊗ A^T) vec(Y)        (by 1.2.13(5))
                   = vec(A^T Y X^T)           (by 1.2.13(2))
                   = vec(A^T Σ_YX).

So it follows
vec(AT ABΣXX ) = vec(AT ΣY X ) (3.1.5)
and therefore RE attains a minimum in B that satisfies
AT ABΣXX = AT ΣY X

2. To prove the second statement, assume that Σ_XX is invertible and A has full rank p. By Remark 1.2.2 and Lemma 1.2.8 we conclude that A^T A is symmetric and positive definite. Lemma 3.0.3 states that Σ_XX is symmetric and positive semidefinite; since by assumption Σ_XX has full rank, it is positive definite. With Lemma 1.2.13(7) we then deduce that

(X^T ⊗ A)^T (X^T ⊗ A) = Σ_XX ⊗ A^T A

is symmetric and positive definite. By Theorem 1.2.10(3), R_E has a unique minimum in

B̂(A) = (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1},

which we obtain from (3.1.1).

3. Since in the autoassociative case x_i = y_i for i ∈ {1, .., N}, we have Σ_XX = Σ_XY = Σ_YX = Σ_YY, so

B̂(A) = (A^T A)^{-1} A^T.

A similar result can be formulated for a fixed matrix B of weights connecting the input layer to the hidden layer.

Theorem 3.1.4. For a fixed p × n matrix B the function R_E(A, B) is convex in the coefficients of A. Moreover:

1. R_E attains its minimum for any A that satisfies

A B Σ_XX B^T = Σ_YX B^T.    (3.1.6)

2. If Σ_XX is invertible and B has full rank p, then R_E is strictly convex in A and a unique minimum is reached in

A = Â(B) = Σ_YX B^T (B Σ_XX B^T)^{-1}.    (3.1.7)

3. In the autoassociative case it follows that

A = Â(B) = Σ_XX B^T (B Σ_XX B^T)^{-1}.    (3.1.8)

Proof.

1. Like in the previous proof, we use

RE (A, B) = ||vec(Y − ABX)||2 .

Now, in order to find an expression depending on the vector vec(A), we write

R_E(A, B) = ||vec(Y − ABX)||² = ||vec(Y) − vec(Id · A · (BX))||²
          = ||vec(Y) − ((BX)^T ⊗ Id) vec(A)||².        (by 1.2.13(2))

We can now conclude with Theorem 1.2.10(1) that R_E is convex in A and has a global minimum if A satisfies

(X^T B^T ⊗ Id)^T (X^T B^T ⊗ Id) vec(A) = (X^T B^T ⊗ Id)^T vec(Y).

Looking at the left-hand side, we obtain

(X^T B^T ⊗ Id)^T (X^T B^T ⊗ Id) vec(A) = (BX ⊗ Id)(X^T B^T ⊗ Id) vec(A)        (by 1.2.13(5))
                                       = (B X X^T B^T ⊗ Id) vec(A)
                                       = (B Σ_XX B^T ⊗ Id) vec(A)
                                       = vec(Id · A · (B Σ_XX B^T)^T)           (by 1.2.13(2))
                                       = vec(A B Σ_XX B^T),

and the right-hand side provides

(X^T B^T ⊗ Id)^T vec(Y) = (BX ⊗ Id) vec(Y)        (by 1.2.13(5))
                        = vec(Y X^T B^T)           (by 1.2.13(2))
                        = vec(Σ_YX B^T).

So it follows that R_E has a global minimum if A satisfies

A B Σ_XX B^T = Σ_YX B^T.

2. Let B be of full rank and Σ_XX invertible, hence of full rank. With Remark 1.2.2 and Σ_XX being positive semidefinite and symmetric we obtain, by

⟨x, B Σ_XX B^T x⟩ = ⟨B^T x, Σ_XX B^T x⟩ ≥ 0,

that the matrix B Σ_XX B^T is positive semidefinite and, by symmetry of Σ_XX, symmetric. Since it is also of full rank p, we conclude by Lemma 1.2.8 that it is positive definite.

By Lemma 1.2.13(7) we can conclude that

(B Σ_XX B^T ⊗ Id) = (X^T B^T ⊗ Id)^T (X^T B^T ⊗ Id)

is positive definite. From Theorem 1.2.10(3) we deduce that R_E has a unique minimum in

Â(B) = Σ_YX B^T (B Σ_XX B^T)^{-1},

which we obtain from (3.1.6) and the fact that B Σ_XX B^T is invertible, being a positive definite square matrix.

3. As seen in the previous proof, in the autoassociative case we have Σ_XX = Σ_YX, and therefore the previous formula simplifies to

Â(B) = Σ_XX B^T (B Σ_XX B^T)^{-1}.

Using the previous results, we can now make some statements about the global map that corresponds to a critical point.

Theorem 3.1.5. Let Σ_XX be invertible and A of full rank. A and B define a critical point of R_E, that is ∂R_E/∂a_ij = ∂R_E/∂b_ij = 0, if and only if:

1. In the general case: the global map W = AB is of the form

W = P_A Σ_YX Σ_XX^{-1}    (3.1.9)

and A satisfies

P_A Σ = P_A Σ P_A = Σ P_A.    (3.1.10)

Or, equivalently, B = B̂ as defined in (3.1.2) and A satisfies (3.1.10).

2. In the autoassociative case, with Σ = Σ_XX: the global map is of the form

W = P_A    (3.1.11)

with B = B̂, and A satisfies

P_A Σ_XX = P_A Σ_XX P_A = Σ_XX P_A.    (3.1.12)



Proof.

1. "⇒": Let Σ_XX be invertible and A, B matrices that define a critical point of R_E. From Theorem 3.1.3 we deduce that B must be of the form

B̂(A) = (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1}.

So

W = AB = A B̂(A) = A (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1} = P_A Σ_YX Σ_XX^{-1},

which is (3.1.9). Furthermore, multiplying (3.1.6) by A^T on the right we get

A B Σ_XX B^T A^T = Σ_YX B^T A^T,

that is,

W Σ_XX W^T = Σ_YX W^T,

which by (3.1.9) equals

P_A Σ_YX Σ_XX^{-1} Σ_XX (P_A Σ_YX Σ_XX^{-1})^T = Σ_YX (P_A Σ_YX Σ_XX^{-1})^T.

Since the covariance matrix Σ_XX and the orthogonal projection P_A are symmetric, the inverse of a symmetric matrix is symmetric, and by definition (Σ_YX)^T = Σ_XY, this reduces to

P_A Σ_YX Σ_XX^{-1} Σ_XY P_A = Σ_YX Σ_XX^{-1} Σ_XY P_A,

which, by the definition of Σ, is

P_A Σ P_A = Σ P_A.

Since, by

Σ^T = (Σ_YX Σ_XX^{-1} Σ_XY)^T = Σ_XY^T (Σ_XX^{-1})^T Σ_YX^T = Σ_YX Σ_XX^{-1} Σ_XY = Σ,

Σ is symmetric, we can conclude

P_A Σ P_A = (P_A Σ P_A)^T = (Σ P_A)^T = P_A Σ = Σ P_A.



"⇐": Assume A satisfies (3.1.10) and the global map W satisfies (3.1.9). Multiplying (3.1.9) by (A^T A)^{-1} A^T on the left we get

(A^T A)^{-1} A^T P_A Σ_YX Σ_XX^{-1} = (A^T A)^{-1} A^T W = (A^T A)^{-1} A^T A B = B
⇐⇒ (A^T A)^{-1} A^T A (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1} = B
⇐⇒ B = (A^T A)^{-1} A^T Σ_YX Σ_XX^{-1} = B̂(A).

We observe that B satisfies (3.1.2). Now we can see that A satisfies (3.1.6), by

Σ P_A = P_A Σ P_A
⇐⇒ Σ_YX Σ_XX^{-1} Σ_XY P_A = P_A Σ_YX Σ_XX^{-1} Σ_XY P_A
⇐⇒ Σ_YX W^T = Σ_YX B^T A^T = W Σ_XY P_A = W Σ_XX W^T = A B Σ_XX B^T A^T.

Multiplying by A (A^T A)^{-1} on the right, we obtain

Σ_YX B^T A^T A (A^T A)^{-1} = A B Σ_XX B^T A^T A (A^T A)^{-1}
⇐⇒ Σ_YX B^T = A B Σ_XX B^T.

So we can conclude that A and B satisfy (3.1.6) and (3.1.2) (thus (3.1.1)), respectively. Therefore, by Theorem 3.1.3 and Theorem 3.1.4, they define a critical point of R_E.
2. "⇒": In the autoassociative case we obtain, by Σ = Σ_XX and (3.1.3) and with the same reasoning as in the general case, W = P_A and

P_A Σ_XX = P_A Σ_XX P_A = Σ_XX P_A.

"⇐": Let A satisfy (3.1.12) and the global map W satisfy (3.1.11). Multiplying (3.1.11) on the left by (A^T A)^{-1} A^T gives directly B = B̂ = (A^T A)^{-1} A^T. Using (3.1.12) and (3.1.11) we get

Σ_XX P_A = P_A Σ_XX P_A ⇐⇒ Σ_XX P_A^T = P_A Σ_XX P_A^T
⇐⇒ Σ_XX B^T A^T = A B Σ_XX B^T A^T
⇐⇒ Σ_XX B^T (A^T A) = A B Σ_XX B^T (A^T A)
⇐⇒ Σ_XX B^T = A B Σ_XX B^T,

so A satisfies (3.1.6). Therefore, by Theorem 3.1.3 and Theorem 3.1.4, A and B define a critical point of R_E.

Theorem 3.1.6. Let Σ ∈ R^{n×n} be of full rank with n ∈ N distinct ordered eigenvalues λ_1 > λ_2 > ... > λ_n, and let I = {i_1, .., i_p} with i_1 < i_2 < ... < i_p, p ≤ n, be an ordered index set, so that U_I := [u_{i_1}, ..., u_{i_p}] is the matrix formed by the orthonormal eigenvectors of Σ associated to the eigenvalues λ_{i_1}, ..., λ_{i_p}. Two full rank matrices A ∈ R^{n×p} and B ∈ R^{p×n} define a critical point of R_E if and only if there exist an ordered set I of p indices and an invertible (p × p)-matrix C such that:

1. In the general case, A and B satisfy

A = U_I C    (3.1.13)

and

B = C^{-1} U_I^T Σ_YX Σ_XX^{-1}.    (3.1.14)

Additionally, if such a critical point exists, then

W = P_{U_I} Σ_YX Σ_XX^{-1}    (3.1.15)

and

R_E(A, B) = tr Σ_YY − Σ_{i∈I} λ_i.    (3.1.16)

2. In the autoassociative case, A and B satisfy

A = U_I C,   B = C^{-1} U_I^T    (3.1.17)

and

W = P_{U_I}.    (3.1.18)

Proof.
Since Σ is a real symmetric matrix it follows by the Spectral Theorem 1.2.9 that it
is diagonalizable:

Σ = U ΛU T (3.1.19)

where U is an orthogonal matrix of eigenvectors of Σ and Λ = diag(λ1 , λ2 , ..., λn ),


a diagonal Matrix with the decreasing eigenvalues of Σ on the diagonal. Since Σ is
full rank we know that ΣXX , ΣY X , ΣXY are full rank.
"⇐": Let A = U_I C and B = C^{-1} U_I^T Σ_YX Σ_XX^{-1}. By

(A^T A)^{-1} A^T Σ_YX Σ_XX^{-1} = ((U_I C)^T (U_I C))^{-1} (U_I C)^T Σ_YX Σ_XX^{-1}
                                = (C^T U_I^T U_I C)^{-1} C^T U_I^T Σ_YX Σ_XX^{-1}
                                = C^{-1} (U_I^T U_I)^{-1} (C^T)^{-1} C^T U_I^T Σ_YX Σ_XX^{-1}
                                = C^{-1} U_I^T Σ_YX Σ_XX^{-1}
                                = B,

B satisfies (3.1.2). Furthermore, writing Λ_I := U_I^T Σ U_I = diag(λ_{i_1}, ..., λ_{i_p}) and using Σ U_I = U_I Λ_I, we obtain

B Σ_XX B^T = C^{-1} U_I^T Σ_YX Σ_XX^{-1} Σ_XX Σ_XX^{-1} Σ_XY U_I (C^T)^{-1} = C^{-1} Λ_I (C^T)^{-1}

and

Σ_YX B^T = Σ_YX Σ_XX^{-1} Σ_XY U_I (C^T)^{-1} = Σ U_I (C^T)^{-1} = U_I Λ_I (C^T)^{-1},

so that

Σ_YX B^T (B Σ_XX B^T)^{-1} = U_I Λ_I (C^T)^{-1} C^T Λ_I^{-1} C = U_I C = A,

and A satisfies (3.1.7). Therefore A and B define a critical point.
"⇒": Let A and B define a critical point of R_E.

First, observe that

P_{U^T A} = U^T P_A U,    (3.1.20)

since by definition of P_{U^T A}

P_{U^T A} = U^T A ((A^T U)(U^T A))^{-1} A^T U = U^T A (A^T A)^{-1} A^T U = U^T P_A U.

We can conclude, by (3.1.19) and (3.1.10),

U P_{U^T A} U^T U Λ U^T = P_A Σ = Σ P_A = U Λ U^T U P_{U^T A} U^T,

that

U P_{U^T A} Λ U^T = U Λ P_{U^T A} U^T

and consequently

P_{U^T A} Λ = Λ P_{U^T A}.

Now write P_{U^T A} = (p_{i,k})_{i,k=1,...,n}. The (i,k)-entry of P_{U^T A} · Λ is λ_k · p_{i,k}, while the (i,k)-entry of Λ · P_{U^T A} is λ_i · p_{i,k}. Hence the commutation relation P_{U^T A} Λ = Λ P_{U^T A} is equivalent to

λ_k · p_{i,k} = λ_i · p_{i,k}   for all i, k ∈ {1, .., n}.

Knowing that the eigenvalues λ_1 > ... > λ_n > 0 are distinct, it follows that p_{i,k} = 0 for i ≠ k, thus P_{U^T A} is a diagonal matrix of rank p. Since orthogonal projectors only have eigenvalues 1 and 0, we can define a unique ordered index set I = {i_1, ..., i_p}, with 1 ≤ i_1 < ... < i_p ≤ n, that represents the positions where P_{U^T A} has a diagonal entry equal to one; that is, i_k ∈ I if p_{i_k, i_k} = 1. By that, we can define the diagonal matrix

M_I := diag(j_1, ..., j_n),   with j_k = 1 if k ∈ I and j_k = 0 otherwise,

so that P_{U^T A} = M_I. Now define U_I := [u_{i_1}, ..., u_{i_p}] as the matrix of the i_k-th eigenvectors of Σ. To see the next step, note that, without loss of generality, if j_2 = 0 and j_1 = j_3 = ... = j_n = 1, then U_I = [u_1, u_3, .., u_n]

and, clearly, multiplying out the product, the columns of U corresponding to indices with j_k = 0 are annihilated, so that

U M_I U^T = U_I U_I^T.

So P_A = U P_{U^T A} U^T = U_I U_I^T = P_{U_I}. In other words, P_A is the orthogonal projection onto the subspace spanned by the column vectors of U_I. Therefore the column vectors of A span the same space as the column vectors of U_I, so there exists an invertible (p × p)-matrix C such that A = U_I C. Since by assumption A and B define a critical point of R_E and A has full rank, by (3.1.2) B must be of the form

B = B̂(A) = C^{-1} U_I^T Σ_YX Σ_XX^{-1}.

From the previous Theorem and P_A = P_{U_I}, (3.1.15) follows directly.

It is left to prove (3.1.16). As in the proof of Theorem 3.1.3 we can write

R_E(A, B) = (vec(Y − ABX))^T (vec(Y − ABX))
          = vec(Y)^T vec(Y) − 2 vec(ABX)^T vec(Y) + vec(ABX)^T vec(ABX),

and by using 1.2.13(1) we obtain

R_E(A, B) = tr(Y Y^T) − 2 tr(ABX Y^T) + tr(ABX X^T B^T A^T)
          = tr(Σ_YY) − 2 tr(W Σ_XY) + tr(W Σ_XX W^T).

If A is of full rank and B = B̂(A), then

W(A) = AB = U_I C C^{-1} U_I^T Σ_YX Σ_XX^{-1} = P_{U_I} Σ_YX Σ_XX^{-1} = P_A Σ_YX Σ_XX^{-1}.



So, using (3.1.10) and Lemma 1.2.7,

tr(W Σ_XX W^T) = tr(P_A Σ P_A) = tr(P_A Σ) = tr(U P_{U^T A} U^T U Λ U^T) = tr(P_{U^T A} U^T U Λ) = tr(P_{U^T A} Λ)

and

tr(W Σ_XY) = tr(P_A Σ) = tr(U P_{U^T A} U^T U Λ U^T) = tr(P_{U^T A} Λ).

Consequently, for any A of rank p we get an explicit formula for R_E(A):

R_E(A) = tr(Σ_YY) − tr(P_{U^T A} Λ).

Since A = U_I C, we conclude that P_{U^T A} = M_I and therefore tr(P_{U^T A} Λ) = Σ_{i∈I} λ_i, which proves (3.1.16). The autoassociative case follows directly from Σ_XX = Σ_YX = Σ_XY = Σ_YY.

That means a critical point W of rank p is always the product of an orthogonal projection onto the subspace spanned by p eigenvectors of Σ and the Ordinary Least Squares Regression matrix Σ_YX Σ_XX^{-1}. Finally, we can state our main result: R_E has no local minima, and all critical points other than the global minimum correspond to saddle points.

Theorem 3.1.7. The critical point W of R_E associated with the index set {1, ..., p}, in other words with the p biggest eigenvalues, is the unique local and global minimum of R_E. The other (n choose p) − 1 index sets correspond to saddle points.

Proof. Let A and B be matrices such that (3.1.13) and (3.1.14) are satisfied, and which therefore define a critical point of R_E. Furthermore, let I be a p-index set with I ≠ {1, .., p}. Our aim is to show that A and B do not define a local or global minimum but a saddle point, that is, that every neighbourhood of A and B contains matrices Ã and B̃ with

R_E(Ã, B̃) < R_E(A, B).

Let U_p be the matrix of the first p eigenvectors of Σ, corresponding to the p biggest eigenvalues, and U_I the matrix of the i-th eigenvectors, i ∈ I. Moreover, let j ∈ I \ {1, .., p} and k ∈ {1, .., p} \ I; both exist since |I| = p and I ≠ {1, .., p}. Now let us construct a matrix Ũ_I by replacing the j-th column u_j of U_I by

ũ_j := (1 + ε²)^{-1/2} (u_j + ε u_k),

for small ε > 0. This can be interpreted as moving from u_j a little in the direction of the vector u_k of U_p.

Notice that Ũ_I^T Ũ_I is the Gram matrix of the columns of Ũ_I, with entries ⟨ũ_r, ũ_s⟩.
Since u_k, with k ∉ I, is not a column of U_I, all columns of Ũ_I are pairwise orthogonal, and since

⟨(1 + ε²)^{-1/2}(u_j + ε u_k), (1 + ε²)^{-1/2}(u_j + ε u_k)⟩ = (1 + ε²)^{-1} ||u_j + ε u_k||² = 1,

it follows that Ũ_I^T Ũ_I = Id. Define now Ã := Ũ_I C and B̃ := B̂(Ã) for some invertible (p × p)-matrix C. Consequently,

P_Ã = Ũ_I C (C^T Ũ_I^T Ũ_I C)^{-1} C^T Ũ_I^T = Ũ_I Ũ_I^T


and therefore

P_{U^T Ã} = U^T P_Ã U = U^T Ũ_I Ũ_I^T U.

Computing this product entry by entry and using the orthonormality of the eigenvectors, we obtain a diagonal matrix with diagonal entries

d_i = 0 for i ∉ I ∪ {k},   d_i = 1 for i ∈ I with i ≠ j,   d_j = 1/(1 + ε²),   d_k = ε²/(1 + ε²).
Then we get

R_E(Ã, B̃) = tr Σ_YY − tr(P_{U^T Ã} Λ),

and since Λ carries the eigenvalues of Σ we conclude

R_E(Ã, B̃) = tr Σ_YY − ( Σ_{i∈I\{j}} λ_i + λ_j/(1 + ε²) + ε² λ_k/(1 + ε²) )
           = tr Σ_YY − Σ_{i∈I} λ_i − ε²(λ_k − λ_j)/(1 + ε²)
           = R_E(A, B) − ε²(λ_k − λ_j)/(1 + ε²).

All eigenvalues are positive and in decreasing order, so from k ≤ p < j it follows that λ_k > λ_j. Since ε can be chosen arbitrarily small, every neighbourhood of A, B contains matrices Ã, B̃ with R_E(Ã, B̃) < R_E(A, B). Hence A, B do not define a minimum, and by the convexity of R_E in each factor they do not define a maximum either, so they define a saddle point.
Remark 3.1.8. The critical points defined by rank deficient matrices A and B,
correspond to saddle points as well, but will not be discussed in this work. (See [3],
page 58)

3.1.4 The Autoassociative Case and its Connection to PCA


As a special case, we take a closer look at the autoassociative architecture. Looking at the autoassociative case of Theorem 3.1.6, we find that in the global minimum

A = U_p C,   B = C^{-1} U_p^T,   W = P_A,

where U_p := U_{{1,...,p}} is the matrix of the first p eigenvectors of Σ_XX. Choosing the identity map for C, we observe that the optimal way of lowering the dimension of the input data is, up to equivalence, the projection of the data onto the subspace spanned by the first p eigenvectors of Σ_XX. That coincides with the transformation to the principal components of X seen in Section 3.1.1. Furthermore, this result answers an important question: how well can a network recognize patterns that it has not seen in the training phase, that is, that cannot be found in X? The answer is given by

R_E(Â, B̂) = ||U_p U_p^T v − v||²,

where v describes an unknown input pattern. Hence the error can be measured by the distance of the unknown pattern to the space spanned by the principal components of the patterns seen during the training phase.
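This connection to PCA can be verified directly from Theorem 3.1.6: among the critical points W = P_{U_I}, the index set of the p largest eigenvalues gives the smallest error, and that error matches formula (3.1.16). A minimal sketch with arbitrary random data, assumed only for illustration:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(9)
n, p, N = 5, 2, 300
X = rng.normal(size=(n, N)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])[:, None]
X -= X.mean(axis=1, keepdims=True)                  # autoassociative case: Y = X

Sxx = X @ X.T
eigvals, U = np.linalg.eigh(Sxx)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

def error_for(I):
    # Critical point of Theorem 3.1.6 (autoassociative case): global map W = P_{U_I}
    UI = U[:, list(I)]
    W = UI @ UI.T
    return np.sum((X - W @ X) ** 2)

errors = {I: error_for(I) for I in combinations(range(n), p)}
best = min(errors, key=errors.get)
print(best)                                          # (0, 1): the two largest eigenvalues
print(np.isclose(errors[best], np.trace(Sxx) - eigvals[:p].sum()))   # formula (3.1.16)
```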

3.1.5 Conclusion
We took a closer look at a linear Multilayer Perceptron with one hidden layer and observed that, up to equivalence, there is a unique global minimum and there are no local minima. All critical points apart from the global minimum correspond to saddle points. By examining the autoassociative architecture, we found a measure of how well a network can recognize unknown patterns. Moreover, we have seen that in some special cases the optimal results coincide with well-known results in Statistics, such as Ordinary Least Squares Regression and Principal Component Analysis. That means that, instead of using a gradient descent method on the error function, in these special cases we could use these known methods directly. However, the gradient descent method, with appropriate parameters so that the algorithm does not end up in a saddle point, will also work on more general data and can be expected to produce the same results, as no local minima will disturb the gradient descent.

3.2 Autoassociation with Multilayer Perceptrons and Singular Value Decomposition
In this section we take a closer look at the autoassociative case. Having discussed the linear case in the previous section, we now move on to a non-linear case, which consists of a network with one non-linear hidden layer. Hence we will be looking at an autoassociative network consisting of one linear input and one linear output layer, both of n units, and one non-linear hidden layer of p < n units with non-linear activation function g. As in the previous section, let B denote a real p × n matrix that represents the transition from the input layer to the hidden layer and A a real n × p matrix representing the transition from the output of the hidden layer to the output layer. As before, X := [x_1, ..., x_N] is the input matrix and Y := [y_1, ..., y_N] the output matrix. Additionally, we define

H_0 := BX + b_1 v^T

as the input of the hidden layer, with p-dimensional bias vector b_1 and v an N-dimensional vector of ones. Furthermore, define H := [h_1, ..., h_N] = g(H_0) as the output matrix of the hidden layer, with

h_i = g(Bx_i + b_1),   i ∈ {1, .., N},

for some non-linear function g applied componentwise. Since we are trying to approximate the identity map for the linear input x_i, in the output layer the non-linearity has to be removed, that is, the output must be of the form

y_i = A h_i + b_2,

where b_2 describes the bias of the output layer. Consequently, the network activity is described by the function F with

F(x_i) = A g(Bx_i + b_1) + b_2,

or, in matrix form,

F(X) = A g(BX + b_1 v^T) + b_2 v^T = Y.

Since we want the network to reproduce the input, the occurring error will be measured by the quadratic error function

R_E := ||X − F(X)||²    (3.2.1)

with respect to the matrix norm ||M||² = tr(MM^T). We want to find the optimal weights A, B and optimal biases b_1, b_2 that minimize this error function.

Remark 3.2.1. Observe that this norm equals the norm used in the previous section (see (3.1.4)).

To begin with, let us recapitulate the Singular Value Decomposition.



Figure 3.2: Autoassociative Multilayer Perceptron with one hidden layer

3.2.1 Singular Value Decomposition


Singular Value Decomposition is a tool from linear algebra to factorize real- or
complex-valued matrices. Since in our case we are only interested in real values, we
will just state the Theorem for the real case.
Theorem 3.2.2. Let M ∈ R^{n×m} be a real-valued matrix of rank r ≤ min(m, n). There exist two real orthogonal matrices U ∈ R^{n×n}, V ∈ R^{m×m} and a quasi-diagonal matrix D ∈ R^{n×m}, whose only non-zero entries are the diagonal entries s_1, ..., s_r, such that

M = UDV

with positive, decreasing and uniquely defined values s_1 > ... > s_r > 0, the singular values of M.

Proof. See [4], pages 217-218.
Remark 3.2.3. In the case that M is a symmetric positive semidefinite matrix, the singular value decomposition coincides with the eigendecomposition that was used in the previous section in the proof of Theorem 3.1.6.
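A quick numpy illustration of Theorem 3.2.2 on an arbitrary example matrix (numpy returns the factor V^T, here called Vt):

```python
import numpy as np

rng = np.random.default_rng(10)
M = rng.normal(size=(4, 6))

U, s, Vt = np.linalg.svd(M, full_matrices=True)
D = np.zeros((4, 6))
D[:len(s), :len(s)] = np.diag(s)                    # quasi-diagonal matrix of singular values

print(np.allclose(M, U @ D @ Vt))                   # reconstruction of M
print(s)                                            # positive, decreasing singular values
```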

3.2.2 Explicit Results


In the following we first examine the transition from the output of the hidden layer, H, to the linear output layer Y, and then take a look at the non-linear activity between the input layer X and the output of the hidden layer H.

By differentiating the error function with respect to b_2 we can find an optimal bias vector b̂_2.
Theorem 3.2.4. Let R_E be the quadratic error function defined by

R_E := ||X − AH − b_2 v^T||²

with respect to the matrix norm ||M||² = tr(MM^T), let X be an n × N-matrix and v the N-dimensional vector of ones. Then minimization of R_E with respect to b_2 yields the optimal bias vector

b̂_2 = (1/N)(X − AH)v.

Observe that this bias vector ensures that the average input equals the average output, which is desired when trying to reproduce the data as accurately as possible.
Proof. First note that

R_E = ||X − F(X)||²
    = ||X − AH − b_2 v^T||²
    = tr((X − AH − b_2 v^T)(X − AH − b_2 v^T)^T)
    = tr((X − AH)(X − AH)^T) − tr((X − AH) v b_2^T) − tr(b_2 v^T (X − AH)^T) + tr(b_2 v^T v b_2^T),

so the derivative equals

∂R_E/∂b_2 = ∂/∂b_2 ( −tr(((X − AH)v) b_2^T) − tr(b_2 v^T (X − AH)^T) + N tr(b_2 b_2^T) ).

Write w := (X − AH)v, an n-dimensional vector. Then

∂/∂b_2 (−tr(w b_2^T)) = ∂/∂b_2 ( −Σ_{i≤n} w_i b_{2,i} ) = −w = −(X − AH)v,

and equivalently

∂/∂b_2 (−tr(b_2 v^T (X − AH)^T)) = −(X − AH)v.

Clearly,

∂/∂b_2 (N tr(b_2 b_2^T)) = 2N b_2.

Therefore

∂R_E/∂b_2 = −2((X − AH)v − N b_2).

So, using the previously seen result (1.2.10), we can deduce b̂_2 = (1/N)(X − AH)v.
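The remark after Theorem 3.2.4 — that with b̂_2 the average output equals the average input — is easy to check numerically (arbitrary random A, H and X, assumed only for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, N = 4, 2, 50
X = rng.normal(size=(n, N))
A = rng.normal(size=(n, p))
H = rng.normal(size=(p, N))                      # some hidden-layer output
v = np.ones(N)

b2 = (X - A @ H) @ v / N                         # optimal bias of Theorem 3.2.4
output = A @ H + np.outer(b2, v)                 # F(X) = A H + b2 v^T

# Average output equals average input
print(np.allclose(output.mean(axis=1), X.mean(axis=1)))

# And b2 is indeed a stationary point of R_E(b2) = ||X - A H - b2 v^T||^2
grad = -2 * ((X - A @ H) @ v - N * b2)
print(np.allclose(grad, 0.0))
```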

Now, to find a corresponding optimal matrix Â and output Ĥ, we substitute b̂_2 into R_E and obtain

R_E = ||X − AH − (1/N)(X − AH)vv^T||² = ||X(Id − (1/N)vv^T) − AH(Id − (1/N)vv^T)||².

To simplify, we define X' := X(Id − (1/N)vv^T) and H' := H(Id − (1/N)vv^T), so that

R_E = ||X' − AH'||².

Observe that the transition from the output of the hidden layer to the output layer is linear, so both A and H' are real matrices that can be expected to have full rank, and we are considering the same norm as in the previous section (see Remark 3.2.1). Thereupon we find that this problem reduces to the autoassociative case of the previous section (Theorem 3.1.6), with optimal matrix

Â = U'_p C^{-1}    (3.2.2)

for some arbitrary invertible (p × p)-matrix C, where U'_p is the matrix of the first p eigenvectors of the sample covariance matrix of X'.
Remark 3.2.5. Building X' out of X centres the data by removing the average vector µ_X.
Remark 3.2.6. We notice that the optimal weight matrix, as well as the optimal bias, is obtained independently of the way the output H, or rather H', of the hidden layer was produced. That means the choice of the non-linear function g has no impact on the choice of the optimal bias and weights connecting the output of the hidden layer to the output layer.

3.2.3 Linear Function in the Hidden Units


Before discussing the non-linear case we take a look at the linear case. Notice that, by linearity of the matrix operations, a linear activation function can be absorbed into the linear transition from the input layer to the output H of the hidden layer. Consequently, we can simply take H_0 = H. As before, this case can be reduced to the autoassociative case of the previous section, which gives the optimal weight matrix

B̂ = C U'_p^T.

Now we can find an explicit representation of the optimal bias vector b̂_2.

Corollary 3.2.7. In the linear case the optimal bias vector is given by
$$\hat{b}_2 = (Id - U'_p U'^T_p)\mu_X - U'_p C^{-1} b_1.$$

Proof. Substituting the optimal weight matrix $\hat{B}$ in the definition of $H_0$ we obtain
$$H_0 = H = C U'^T_p X + b_1 v^T. \qquad (3.2.3)$$
Now, taking into account Theorem 3.2.4 and inserting the optimal weight matrix $\hat{A}$, we obtain
$$\hat{b}_2 = \frac{1}{N}\big(X - U'_p C^{-1}(C U'^T_p X + b_1 v^T)\big)v = \frac{1}{N}\big(Xv - U'_p U'^T_p X v - U'_p C^{-1} b_1 N\big) = (Id - U'_p U'^T_p)\mu_X - U'_p C^{-1} b_1.$$

Remark 3.2.8. Since the eigenvalue decomposition, which was used to find the optimal weights in the previous chapter, is a special case of the singular value decomposition, the optimal weights obtained in this section have been calculated using Singular Value Decomposition.

3.2.4 Hidden layer with Non-Linear Units

Let us proceed now by discussing the transition from the input layer to the output of the hidden layer $H$. In particular, we want to observe the impact of the non-linear activation function in the hidden layer on the choice of the optimal weight matrices. We have seen already in the linear case that there is an optimal weight matrix $\hat{B}$, depending on $X'$, that produces an optimal output $\hat{H}_0$ of the hidden layer for the centered input $X'$.

Now we want to see if, by making small changes on $\hat{H}_0$ and $\hat{B}$, denoted by $\tilde{H}_0$ and $\tilde{B}$ respectively, we can obtain the result $g(\tilde{H}_0) = \hat{H}_0$. We have seen before (3.2.3) that $\hat{H}$ is of the form
$$\hat{H} = C U'^T_p X + b_1 v^T.$$
Consequently,
$$\hat{H}_0 = C U'^T_p X. \qquad (3.2.4)$$

We need to choose an appropriate candidate for the non-linear function $g$: that is, a function that can be linearly approximated somewhere, so that the produced hidden-layer output can be linearly approximated and we can use the results of the linear case. Therefore, we will assume that $g$ can be approximated linearly for values $x$ in a neighbourhood of zero.
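For instance, the hyperbolic tangent satisfies $\tanh(x) = x + O(x^3)$ and the logistic sigmoid satisfies $\sigma(x) = \frac{1}{2} + \frac{x}{4} + O(x^3)$ near zero, so both admit such a linear approximation in a neighbourhood of zero.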

Theorem 3.2.9. Let $g$ be a nonlinear function that can be approximated linearly for small values $x$; that means, for arbitrarily small $\epsilon > 0$ and $\alpha_0, \alpha_1 \in \mathbb{R}$ with $\alpha_1 \neq 0$,
$$g(x) = \alpha_0 + \alpha_1 x + \epsilon$$
for small $x$. Then changing the optimal matrices $\hat{B}$ and $\hat{H}_0$ to
$$\tilde{B} = \alpha_1^{-1} C U'^T_p \qquad (3.2.5)$$
$$\tilde{H}_0 = \alpha_1^{-1} C U'^T_p X + b_1 v^T \qquad (3.2.6)$$
satisfies
$$g(\tilde{H}_0) = \hat{H}_0.$$

Proof.
To begin with, notice that in order to use the linear approximation property of $g$ we have to scale the matrix $\tilde{H}_0$ to make its coefficients small. That can be done by adjusting the bias vector $b_1$ and the arbitrary invertible matrix $C$. A good choice for the bias $\hat{b}_1$ is the one forcing the average input $\frac{1}{N}\tilde{H}_0 v$ of the hidden layer to be zero, obtained by choosing
$$\hat{b}_1 = -\alpha_1^{-1} C U'^T_p \mu_X \qquad (3.2.7)$$
and a small but invertible scaling matrix $C$, that is, one with entries satisfying $|c_{ij}| < 1$.
Now, by substituting (3.2.7) in (3.2.6), we obtain
$$\tilde{H}_0 = \alpha_1^{-1} C U'^T_p X - \alpha_1^{-1} C U'^T_p \mu_X v^T = \alpha_1^{-1} C U'^T_p X\big(Id - \tfrac{1}{N} v v^T\big) = \alpha_1^{-1} C U'^T_p X'.$$
The singular value decomposition of the matrix $X'$ gives $X' = U_n \Lambda_n V_n$, with $U_n$ consisting of the normalized eigenvectors of $\Sigma_{X'X'}$, $V_n$ the normalized eigenvectors of $\Sigma_{X'^T X'}$ and $\Lambda_n$ a diagonal matrix with entries $\sqrt{\lambda_i}$, the square roots of the decreasingly ordered eigenvalues of $\Sigma_{X'X'}$. Using the Singular Value Decomposition we obtain
$$\tilde{H}_0 = \alpha_1^{-1} C U'^T_p U_n \Lambda_n V_n = \alpha_1^{-1} C \Lambda_p V_p,$$
and now, applying $g$ to $\tilde{H}_0$,
$$g(\tilde{H}_0) = \alpha_0 v v^T + C \Lambda_p V_p = \hat{H}.$$
From $H' = H(Id - \frac{1}{N} v v^T)$ it follows that
$$\hat{H}' = C \Lambda_p V_p,$$
with optimal bias
$$\hat{b}_2 := \mu_X - \alpha_0 U'_p C^{-1} v,$$
which equals (3.2.3) and therefore proves the Theorem.

3.2.5 Conclusion
In this section we could observe that, for linear input data, the choice of a specific activation function for the units of the hidden layer does not affect the result of an autoassociative network, given that this function can be linearly approximated in a neighbourhood of zero. Hence, non-linear functions at the hidden layer do not improve a network that handles linear input. This result can be extended inductively to a network with several hidden layers. Therefore, for linear input data it is advisable to use linear layers, since an explicit form of the solution to the optimization problem (3.2.1) can be obtained by using the results of the previous section. This avoids the problem of local minima that appears when we try to find a minimum using gradient descent for a network with non-linear layers.
Moreover, we understand that the optimal bias vector is the one that removes the average from the data, that is, it transforms the data into centred data with average zero.
Chapter 4

An Application to Neural Networks

4.1 The Problem


We are looking at a data set of flights that were performed by an airline in the
past three years. In order to plan efficiently, the airline wants to get an accurate
prediction of block times for future flights.
The block time describes the time between the moment when the brakes of a specific airplane are released before departure and the moment when the brakes are set at the destination airport. It includes the taxi-out time, that is, the time between leaving the parking spot and take-off, as well as the take-off time, the enroute time, the landing time and the taxi-in time. Block times determine the departure and landing slots to be requested in each of the airports. Thus, since slots need to be assigned several months before the beginning of each season, predicting the correct block times is key to achieving an optimal on-time performance for the whole season. We are trying to predict block times for a whole season, depending only on the weekday and the time the flight is scheduled for, not on the month. Ideally, we want the block time prediction to be as close to reality as possible, taking as acceptable a maximum deviation of 10 minutes. Also, a predicted time that is longer than the actual time is better than one that is shorter, because arriving late results in higher chances of propagating the delay to the following flights, increasing the costs drastically.

4.2 The Dataset


The data set used consists of 546863 rows of flights that were generated automatically. Since for future predictions we will only have information about the schedule, we should use the same variables for calculating the predictions. The relevant columns are


• Std UTC, referring to the time the flight is scheduled

• DP latacode, giving the code of the departure airport

• AR latacode, the arrival airport

• BlkSched, the scheduled block time

• ActualBlock, giving the measured actual block time

However, we will separate the scheduled time Std UTC into the day of the week, the time and the season in which the flight was performed.
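A minimal sketch of this split with pandas; the DataFrame df and the month-based season rule are assumptions for illustration:

```python
import pandas as pd

# Minimal sketch: split the scheduled time "Std UTC" into weekday, time of day
# and season. The DataFrame "df" and the month-based season rule are assumptions.
df["Std UTC"] = pd.to_datetime(df["Std UTC"])
df["Weekday"] = df["Std UTC"].dt.dayofweek                        # 0 = Monday, ..., 6 = Sunday
df["DepMinutes"] = df["Std UTC"].dt.hour * 60 + df["Std UTC"].dt.minute
df["Season"] = df["Std UTC"].dt.month.map(
    lambda m: "summer" if 4 <= m <= 10 else "winter"              # assumed season split
)
```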

Figure 4.1: The data set: relevant columns are marked blue.

4.2.1 Decoding the Airport Combinations by One Hot Encoding

For the columns DP latacode and AR latacode we have to find a representation that can be interpreted by a neural network. An easy, yet effective way is One Hot Encoding. It is used to encode categorical data, such as "flights that departed in Oslo". The idea is to add an extra column for each category, in our case the airports, and to encode with a one if the flight starts at the specific airport and with a zero otherwise. The same is done with the arrival airports. This way we have increased the matrix dimension to (546863 × 410).
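As a rough sketch (using pandas; the DataFrame df and its column names follow the list above):

```python
import pandas as pd

# Minimal sketch: one-hot encode the departure and arrival airport columns.
# The DataFrame "df" holding the flight data set is an assumption.
encoded = pd.get_dummies(
    df,
    columns=["DP latacode", "AR latacode"],
    prefix=["DP", "AR"],
)
print(encoded.shape)  # every airport now has its own 0/1 indicator column
```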

4.2.2 Find and Remove Outliers

An outlier describes an observation that differs significantly from other observations in a sample. In the dataset we can find some flights whose block times exceeded the scheduled block time by far. This can happen due to extraordinary weather conditions or strikes. However, these are exceptions and give no valuable information to the network. They are likely to disturb the network in finding the optimal weights, since high losses $RE(F)$ are created for a function $F$ that, in general, approximates block times well. The network might overfit (see 2.1). Therefore, we detect such outliers and remove them.
In order to do that, we first create a new column denoted by DiffBlock, containing the differences between scheduled and actual block time.

Using the Interquartile Range we can then detect and remove outliers. The Interquartile Range method gives the interval, called the IQR interval, in which the middle 50% of the differences between scheduled block time and actual block time lie.
The interval boundaries are then extended by 1.5 times the IQR, and everything that is not included in the extended interval is removed (see (4.2a)).
Applying the IQR method to the column DiffBlock, we find that 6% of the flights produce outliers and that the IQR interval covers 8 minutes. The extended interval covers values from zero to 23 minutes. This can be visualized by a plot called a boxplot (see (4.2b)).

Figure 4.2: Removing outliers: (a) the IQR method; (b) boxplot of the difference between scheduled block times and actual block times.
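A minimal sketch of this filtering step, continuing the pandas example above and assuming the column DiffBlock already exists:

```python
# Minimal sketch of the IQR rule, assuming a DataFrame "df" with the column
# "DiffBlock" (actual minus scheduled block time, in minutes).
q1 = df["DiffBlock"].quantile(0.25)
q3 = df["DiffBlock"].quantile(0.75)
iqr = q3 - q1                      # width of the IQR interval

lower = q1 - 1.5 * iqr             # extended interval boundaries
upper = q3 + 1.5 * iqr

# Keep only the flights whose deviation lies inside the extended interval.
df = df[(df["DiffBlock"] >= lower) & (df["DiffBlock"] <= upper)]
```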

4.2.3 Splitting and Centering the Data

Since we want to test our predictions on unknown data Z (see (2.0.2)), we first split our data set into two disjoint sets: 80% of the original dataset will be the training set X and the remaining 20% will serve as the test set Z. Furthermore, as we observed in the previous chapter, the optimal choice of a bias vector for the input data is the one that centers the data. Therefore, before processing the data with a Multilayer Perceptron, we center the data.
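A rough sketch of this preprocessing step, assuming NumPy arrays features and targets built from the encoded data set:

```python
from sklearn.model_selection import train_test_split

# Minimal sketch, assuming "features" and "targets" are NumPy arrays built from
# the one-hot-encoded data set; the 80/20 split follows the text above.
X_train, Z_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=0
)

# Centre the inputs with the training mean; the same mean is reused for the
# test set so that no information from Z leaks into the training phase.
mean = X_train.mean(axis=0)
X_train = X_train - mean
Z_test = Z_test - mean
```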

4.3 Use Multilayer Perceptrons for Regression Problem

We will now proceed by building different types of Multilayer Perceptrons, which we will refer to as models in the following. We want to observe their learning capacity on the training data and evaluate the predictions on the test data.

4.3.1 Build Models

We build four different models:

• modSimple, a simple model with two hidden layers of dimensions 12 and 10.

• modDeep, a deeper architecture with four hidden layers of dimensions 5, 12, 10, 5.

• modWider, a wider model with two hidden layers of dimensions 600, 65.

• modPCA, a model with one linear hidden layer of dimension 70 followed by two non-linear hidden layers of dimensions 12, 10.

Notice that since we are looking at a regression problem, the output layer must be
of dimension one in all models.
As the non-linear activation function we chose the relu function
$$g(s) = \max(0, s).$$
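As a rough sketch of how two of these architectures might be built (Keras is an assumption here, since the text does not name an implementation, and the input dimension is taken from the preprocessing sketch above; the remaining models are built analogously):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal sketch of two of the architectures listed above, using Keras.
# "input_dim" is an assumption: the number of columns after one-hot encoding.
input_dim = X_train.shape[1]

mod_simple = keras.Sequential([
    layers.Dense(12, activation="relu", input_shape=(input_dim,)),
    layers.Dense(10, activation="relu"),
    layers.Dense(1),  # regression output of dimension one
])

mod_deep = keras.Sequential([
    layers.Dense(5, activation="relu", input_shape=(input_dim,)),
    layers.Dense(12, activation="relu"),
    layers.Dense(10, activation="relu"),
    layers.Dense(5, activation="relu"),
    layers.Dense(1),
])
```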

The loss function L is, as in the previous chapter, the mean square error. As an optimizer we chose Adam, which is an implementation of a stochastic gradient descent method. In order to avoid overfitting, we force the model to stop training if the loss does not improve after 5 training epochs. Additionally, to prevent overfitting, 20% of the training dataset is extracted as a validation set that helps to measure the error for unknown values during the training phase. If the mean absolute error for the validation set differs significantly from the training error, the model is overfitting. We observe that the deeper model (see (4.3b)) and the simple model (see (4.3a)) tend to overfit faster.
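A rough sketch of this configuration, continuing the hypothetical Keras example above (monitoring the validation loss is an assumption, since the text only says "the loss"):

```python
from tensorflow import keras

# Minimal sketch of the training configuration described above.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",  # assumed choice: stop on the validation loss
    patience=5,          # ...if it does not improve for 5 epochs
)

mod_simple.compile(
    optimizer="adam",    # Adam, a stochastic gradient descent variant
    loss="mse",          # mean square error, as in the previous chapter
    metrics=["mae"],     # mean absolute error, reported per epoch
)
```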

4.3.2 Train Models


After having built the models we have to train them on our generated training set
X. Now we can observe the training capacity of each model. We notice that the

(a) Mean absolute error by epochs for (b) Mean absolute error by epochs for
the simple model the deeper model

(c) Mean absolute error by epochs for the


wider model

Figure 4.3: Training error.

wider model (see (4.4c)) learns a bit better than the other two models. It ends after
43 epochs with a mean absolute error of 5.445 minutes , while the simple model
(see (4.4a)) ends with a mean absolut training error of 5.562 and the deeper model
(see (4.4b)) with 5.682. The better performance of the wider model could be an
indicator of the data set providing too little information to the network (see (2.1)).
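A training call matching this description might look as follows (continuing the sketch above; the maximum number of epochs and the batch size are assumptions not fixed by the text):

```python
# Minimal sketch of the training call; the epoch limit and batch size are
# assumptions, since the text only fixes the early-stopping rule and the
# 20% validation split.
history = mod_simple.fit(
    X_train, y_train,
    validation_split=0.2,   # 20% of the training data as validation set
    epochs=100,             # stops earlier via the early-stopping callback
    batch_size=256,
    callbacks=[early_stop],
)
```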

Figure 4.4: Training progress of (a) the simple model, (b) the deeper model and (c) the wider model.

4.3.3 Evaluate Models on Test Data


The final step is to apply the trained models to the test data set Z and evaluate the results. Since we extracted Z from our initial data set, which only includes past flights, we have information about the actual block times and can therefore evaluate the predictions the models made for data not seen in the training phase. We are not only interested in minimizing the prediction error, but also in choosing a model whose predicted block times tend to be rather too long than too short. Besides that, in general we are interested in choosing a model that predicts block times such that the majority deviates by less than 10 minutes, whether too short or too long. Therefore, we need extra evaluation functions that measure the percentage of flights whose predicted block times are more than 10 minutes too short, and the percentage of flights whose predicted block times deviate from the actual block time by more than 10 minutes. To get a clearer idea of the results, we will also plot the distribution of the error. Count describes the number of flights for which the predictions have the corresponding Prediction Error. The red lines indicate the interval of 10 minutes deviation in which we want the error to be contained. We notice that the simple model (see (4.6a)) and the wider model (see (4.6c)), although having produced a smaller absolute error, have an error that mostly lies on the positive side. That means the error comes mostly from block times that were predicted too short.
Figure 4.5: Distribution of the error for (a) the simple model, (b) the deeper model and (c) the wider model.

The evaluation functions display the same result. The simple model predicts the block times such that 14.69% differ by more than 10 minutes from the actual block times, and for the wider model we get 15.79%. The deeper model performs significantly worse, with 18.62%, but it has the advantage that only 8.62% of its predictions were too short, while the other models perform worse in this regard. Since that prevents causing delays on other flights, for our purpose the deeper model is the best choice.

Figure 4.6: Evaluation functions for (a) the simple model, (b) the deeper model and (c) the wider model.
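A minimal sketch of the two evaluation functions described above, assuming NumPy arrays of predicted and actual block times in minutes:

```python
import numpy as np

def pct_too_short(predicted, actual, threshold=10):
    """Percentage of flights predicted more than `threshold` minutes too short."""
    return 100.0 * np.mean((actual - predicted) > threshold)

def pct_large_deviation(predicted, actual, threshold=10):
    """Percentage of flights whose prediction deviates by more than `threshold` minutes."""
    return 100.0 * np.mean(np.abs(actual - predicted) > threshold)
```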

4.3.4 Conclusion
We have applied Multilayer Perceptrons to a regression problem with relatively few variables. Although the networks were given little information, the approximations were satisfying. A deeper architecture did not outperform the other architectures in terms of absolute error, possibly because it has the effect of creating more variance. However, for our purpose of avoiding delays, the deeper network performed best.
