
Neural Networks - Foundations

Machine Learning
Michael Wand
TA: Vincent Herrmann

{michael.wand, vincent.herrmann}@idsia.ch

Dalle Molle Institute for Artificial Intelligence Studies (IDSIA), USI - SUPSI

Fall Semester 2024


Contents

The Perceptron
: ...and training by gradient descent
Multi-layer neural networks
Training by backpropagation
Design choices and tricks

Attribution: Some information about the perceptron is taken from Bishop's Pattern Recognition and Machine Learning textbook, https://www.microsoft.com/en-us/research/people/cmbishop/prml-book. The description of classical neural network theory is excellent; the practical design and training considerations are somewhat outdated.

Further reading: Goodfellow/Bengio/Courville, Deep Learning, https://www.deeplearningbook.org.


The Perceptron


Introduction

Again consider a two-class problem for classification.
The classes may or may not be linearly separable.
We know the SVM and the maximum margin criterion, as well as Logistic Regression. Which other ways can we think of to
1. parametrize a classifier,
2. define a criterion for training it,
3. actually compute the optimal solution?

[Figure: a two-class dataset in feature space, with axes ϕ(0) and ϕ(1).]


The Perceptron Model

We consider a linear model of the form

    y(x) = f(wᵀϕ(x))

with the usual fixed feature transformation ϕ and an activation function

    f(a) = +1 if a ≥ 0,   −1 if a < 0.

For simplicity, the bias is included in the feature transformation as a fixed value ϕ₀(x) = 1.
The class targets are encoded as t = ±1 to match the possible values of y(x) = f(wᵀϕ(x)).


Gradient Descent

We wish to use Gradient Descent to minimize the loss of a classifier.
Idea:
1. Start at any place x = x₀ in the "parameter space".
2. Consider the local shape of the loss function by computing the gradient at the current position x. Note that the gradient points in the direction of steepest ascent.
3. Take a "step" in the direction of the negative gradient to decrease the loss, arriving at a new position x.
4. Repeat steps 2 and 3 until satisfied.

Img src: Wikipedia, Gradient Descent
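A minimal sketch of these four steps on a hypothetical quadratic loss E(x) = ‖x‖², whose gradient is 2x (the starting point, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def loss(x):
    return np.sum(x ** 2)          # toy loss E(x) = ||x||^2

def grad(x):
    return 2.0 * x                 # its gradient points in the direction of steepest ascent

x = np.array([3.0, -2.0])          # step 1: arbitrary starting point x0
eta = 0.1                          # step size (learning rate)
for step in range(100):            # steps 2-4: repeatedly follow the negative gradient
    x = x - eta * grad(x)
print(loss(x))                     # approaches the minimum at x = 0
```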


Gradient Descent

Advantages of gradient descent:

: Conceptually simple and flexible
: Works for any underlying function; the only constraint is that the gradient must be defined and computable
: Works in any dimensionality (even in infinite-dimensional spaces)
: May offer a computationally tractable solution when other methods fail (e.g. for large numbers of training samples, or in high-dimensional spaces)
: The iterative approach allows a lot of flexible engineering where necessary


Gradient Descent

Disadvantages of gradient descent:

: May get stuck in a local minimum (or on a plateau)
: Convergence may be slow
: No (general) rule to determine the step size
: When the underlying function is not well known, there are no theoretical guarantees about the quality of the solution, speed, etc.

Image source: Belkin, Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation, arXiv:2105.14368


Gradient Descent

In many cases, gradient descent requires some trial and error and some heuristics to work.
A lot of engineering has gone into fixing fundamental issues of gradient descent (particularly for neural networks).
But to this day, it remains the method of choice for neural network training!
There is some advanced, but really interesting, research on why gradient descent works so well in the specific case of neural networks.
We start by applying gradient descent to the perceptron, which is a linear classifier.




The Perceptron Criterion

Assume a linear classifier.
Can we simply minimize the number of classification errors by gradient descent?
: No, because the number of classification errors is not differentiable (it only takes integer values).
We derive a differentiable criterion, the Perceptron Criterion (Rosenblatt, 1962).


The Perceptron Criterion

A sample ϕₙ with target tₙ is correctly classified if wᵀϕₙtₙ > 0 (in other words, wᵀϕₙ and tₙ must have matching signs).
We assign
: zero error to any correctly classified pattern,
: the error eₙ = −wᵀϕₙtₙ ≥ 0 to any wrongly classified pattern.
The Perceptron Criterion is thus

    E_P(w) = Σ_{n∈M} eₙ = − Σ_{n∈M} wᵀϕₙtₙ

where M is the set of all misclassified patterns.
Clearly, M can change in each gradient step.
The error E_P(w) is always nonnegative; we want to minimize it.


The Perceptron Criterion

We apply Stochastic Gradient Descent to the error E_P(w).
This means that we evaluate the error gradient for a single, randomly selected (misclassified) sample xₙ:

    ∇w eₙ = −∇w (wᵀϕₙtₙ) = −ϕₙtₙ

w is changed by taking a gradient step with learning rate η:

    w_new = w − η ∇w eₙ = w + η ϕₙtₙ.

It is easy to see that this reduces the error for this particular sample (but not necessarily the total error).
Nonetheless, if the training set is linearly separable, the algorithm finishes in finitely many steps (after which there are no more misclassified samples). Yet even then, convergence can be very slow.
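A minimal NumPy sketch of this update rule on hypothetical, linearly separable data (the data, learning rate, and epoch cap are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # separable toy targets

def phi(x):
    return np.concatenate(([1.0], x))        # phi_0(x) = 1 absorbs the bias

w = np.zeros(3)
eta = 1.0
for epoch in range(50):
    misclassified = 0
    for xn, tn in zip(X, t):
        if w @ phi(xn) * tn <= 0:            # wrong sign: sample is in M
            w += eta * phi(xn) * tn          # perceptron update w + eta*phi_n*t_n
            misclassified += 1
    if misclassified == 0:                   # finitely many steps on separable data
        break
print(w, "epochs:", epoch + 1)
```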


Perceptron Problems

The algorithm has a variety of shortcomings:

If the data set is not linearly separable, the algorithm never converges (and may not find a good solution if stopped randomly).
The method does not generalize to more than two classes.
Convergence can be very slow.
Despite these limitations, the perceptron remains a major milestone in the theory and practice of neural networks (and of machine learning in general).

Image source: Wikipedia, Perceptron, original image from Cornell University


Multi-layer Neural Networks
About Nonlinearities

We have seen that in many cases, classes (in classification) are not linearly separable, but may be better separable with a nonlinear function.
Also for regression, we have seen that we may need a nonlinear function of the data.
Conclusion: We need to perform some kind of nonlinear calculation.
We have done this by using nonlinear basis functions to model the data, then applying a linear model in feature space.
Are there better ways to parametrize a nonlinear model?

Image modified from Jeroen Kools (2020), 6 functions for generating artificial datasets (https://www.mathworks.com/matlabcentral/fileexchange/41459-6-functions-for-generating-artificial-datasets), MATLAB Central File Exchange.




About Iterated Computations

Linear classification consists of exactly two computational steps (computation of features, computation of a scalar product).
(The formulation is a bit different for the SVM, but the situation is fundamentally the same.)
Yet the optimization was already quite complex.
In order to do better, we want to allow the classifier to take even more complex functional forms. But how?
Idea: allow multiple computational steps
: each one of which may be simple.
From mathematics (dynamical systems, complexity theory): iterated application of simple rules can generate very complex behavior.
Take inspiration from the human brain, a network of neurons: each neuron has very simple behavior (and is somewhat understood), but the behavior of the whole brain, with billions of interconnected neurons, is extremely complex (and terribly hard to understand)!

Feedforward Fully-Connected Neural Networks

We define a feedforward fully-connected neural network as follows.
Let x = (x₁, . . . , x_D) be the D-dimensional input vector.
M^(1) neurons perform a perceptron-like computation

    u_m^(1) = (w_m^(1))ᵀ x + b_m^(1),   z_m^(1) = f(u_m^(1)),   m = 1, . . . , M^(1)

with a differentiable activation function f (for gradient descent).
This step is iterated multiple times, taking the outputs z^(ℓ−1) = (z_m^(ℓ−1))_{m=1,...,M^(ℓ−1)} of the previous step as input:

    u_m^(ℓ) = (w_m^(ℓ))ᵀ z^(ℓ−1) + b_m^(ℓ),   z_m^(ℓ) = f(u_m^(ℓ)),   m = 1, . . . , M^(ℓ),  ℓ = 2, . . . , L

(note that the weights are usually independent for each step).
The output of the entire network is then y = z^(L).


Feedforward Fully-Connected Neural Networks

We additionally define z^(0) to be the input, i.e. z^(0) = x.
For each layer ℓ ∈ 1, . . . , L, the computation is

    z_m^(ℓ) = f((w_m^(ℓ))ᵀ z^(ℓ−1) + b_m^(ℓ))

which can be written as a matrix multiplication:

    z^(ℓ) = f(W^(ℓ) z^(ℓ−1) + b^(ℓ)).

The activation function is usually applied component-wise, but can also be applied to the output vector as a whole.
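A minimal NumPy sketch of this forward computation, assuming tanh hidden activations and a linear last layer; the layer sizes and random weights are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]                        # D = 4 inputs, two hidden layers, K = 3 outputs
Ws = [rng.normal(scale=0.1, size=(m, n))
      for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
bs = [np.zeros(m) for m in layer_sizes[1:]]

def forward(x):
    z = x                                         # z^(0) = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        u = W @ z + b                             # pre-activation u^(l) = W^(l) z^(l-1) + b^(l)
        z = np.tanh(u) if l < len(Ws) - 1 else u  # component-wise f; linear last layer
    return z                                      # network output y = z^(L)

print(forward(rng.normal(size=4)))
```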


Representation of a Neural Network

The z_m^(ℓ) are neurons, each of which takes its input values and computes a single output value from them.
The inputs x₁, . . . , x_D are occasionally called input neurons (even though they do not compute anything).
The neurons are organized in layers 1, . . . , L. (Some people consider the input the zeroth layer.)
The weights w are directed connections between the neurons; e.g. the neurons of layer 2 are connected to the ones of layer 1 by the weights w_mn^(2), m = 1, . . . , M^(1), n = 1, . . . , M^(2).


Representation of a Neural Network

[Figure: a fully-connected network with inputs x₁, . . . , x_D (= z^(0)), hidden layers z^(1) and z^(2), outputs y₁, . . . , y_K, and weight matrices W^(1), W^(2), W^(3).]

The image graphically represents a neural network with three layers, or two hidden layers. Computation runs from left to right. Note that M^(0) = D and M^(3) = K.

Feedforward Fully-Connected Neural Networks

Each neuron computes the weighted sum of the connected inputs, followed by a differentiable activation function. The activation function should be nonlinear (why?); it can differ for each layer.
The neurons are organized in layers to allow parallel computation, to avoid cyclic dependencies (we will discuss later how to implement cycles, or recurrence, in NNs), and to simplify reasoning about the system: a feedforward network.
There is a full set of connections between successive layers: a fully connected network.
The process of computing NN outputs from inputs is called forward propagation.
This kind of network is also called a multi-layer perceptron (MLP).


Feedforward Fully-Connected Neural Networks

Activation functions need to be differentiable (because we wish to


apply gradient descent training).
For the hidden layers of the network, the activation function must be
nonlinear, because multiple linear computations can be collapsed to
a single one: In order to gain power from iterative computation, we
thus need nonlinear steps.
The activation function of the last layer usually depends on the task
(e.g. classification or regression).
Finally, in supervised training, we compare the output y = y(x) with
a target t and compute a scalar error E = E (y, t).
The error allows us to measure the performance of the network, and
to derive a criterion for training.



Feedforward Fully-Connected Neural Networks

Many possible activation functions for the hidden layers of a neural


network exist:
: Sigmoid, Hyperbolic Tangent: Monotonic, squeeze output to a fixed
range
: ReLU: “Almost linear” (a clipped identity function), works very well.
Encourages sparsity of representations. Currently state-of-the-art.
A large number of variants (not covered here) have been proposed.


Feedforward Fully-Connected Neural Networks

We see that the forward step comprises as many computation steps


as there are layers.
Thus we have achieved the goal of creating a “complex” calculation
from multiple simple steps (matrix multiplication + nonlinearity).
We now discuss how to set up the NN for a practical task.
Then we will derive the standard training method for the neural
network.



NN Setup for Regression

Assume a regression task: compute a mapping ℝ^D → ℝ^K.
Since the output of the last layer can have an arbitrary range, one usually chooses a linear activation function (for the last layer only!): f(x) = x.
The hidden layers can have any nonlinear activation function.
Use the well-known squared error E = ½ Σₖ (tₖ − yₖ)², where the sum runs over the K components of the vectors¹.
Note that the NN naturally handles multi-dimensional targets.

¹ This formula is for one sample only; for multiple samples, take the mean.
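A minimal sketch of this loss and its gradient, averaged over a batch as in the footnote; the toy values are hypothetical:

```python
import numpy as np

def mse_loss(y, t):
    # y, t: arrays of shape (batch, K); E = 1/2 * sum_k (t_k - y_k)^2 per sample
    return 0.5 * np.mean(np.sum((t - y) ** 2, axis=1))

def mse_grad(y, t):
    # dE/dy = y - t per sample (see the derivative table later), divided
    # by the batch size because the loss is the batch mean
    return (y - t) / y.shape[0]

y = np.array([[0.9, 2.1], [1.2, 0.3]])
t = np.array([[1.0, 2.0], [1.0, 0.0]])
print(mse_loss(y, t), mse_grad(y, t))
```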


NN Setup for Classification

For a classification task with K classes, we use a K-dimensional output layer.
A sample x ∈ ℝ^D is classified as belonging to class k if the output neuron yₖ has the maximal value:

    ĉ = arg maxₖ yₖ.

Problem: The arg max function has a degenerate gradient!
This is solved by letting the neural network output a probability distribution over classes, i.e.

    y = (yₖ)_{k=1,...,K}   with   yₖ ≥ 0,  Σₖ yₖ = 1.

Advantage: We can derive a (differentiable) measure of the quality of the output on theoretical grounds, using probability theory.


NN Setup for Classification

In order to make the network output a probability distribution, we take exponentials and normalize. This is the softmax nonlinearity:

    S(y) = ( e^{y₁} / Σₖ e^{yₖ},  . . . ,  e^{y_K} / Σₖ e^{yₖ} ).

Note that in contrast to other activation functions, it is applied to the full last layer of the network, not to each component independently.
The hidden layers can have any nonlinear activation function (just as for regression).
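A minimal, numerically careful sketch of this computation; subtracting the maximum before exponentiating is a standard stabilization trick and does not change the result, since the shift cancels in the ratio:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))   # shift for numerical stability
    return e / np.sum(e)        # S(y)_k = exp(y_k) / sum_j exp(y_j)

p = softmax(np.array([2.0, 1.0, -1.0]))
print(p, p.sum())               # nonnegative entries summing to 1
```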


NN Setup for Classification

Assume a neural network with softmax output.
We compute the loss by measuring the cross-entropy between the output distribution and the target distribution.
: We encode the targets in one-hot style, e.g. if a sample belongs to class k, the target is

    t = (0, . . . , 0, 1, 0, . . . , 0)    (1 in the k-th element).

: Consider this a probability distribution: obviously, a perfect hypothesis y would exactly match this t, assigning probability 1 to the correct class and probability 0 otherwise.
: The cross-entropy loss is defined as

    E_Crossent = − Σₖ tₖ log yₖ.

Intuition: The cross-entropy corresponds to the number of additional bits needed to encode the correct output, given that we have access to the (possibly wrong) prediction of the network.


NN Setup for Classification

We note some properties of the cross-entropy loss:
It is always nonnegative (do you see why?).
In the case of deterministic targets (exactly one tₖ = 1, all others are zero), the formula simplifies to

    E_Crossent = − log y_{k_correct},

and we see that the loss goes to zero as y_{k_correct} approaches one (since the yₖ must form a probability distribution, this implies that all other yₖ must go to zero).
However, the loss also works for probabilistic targets.
The neural network gracefully handles probabilistic outputs and multi-class classification.
Remark: For efficiency and numerical stability, one should merge the softmax nonlinearity and the cross-entropy criterion into one function.
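A minimal sketch of the merged computation suggested by this remark, using the log-sum-exp trick; the gradient S(y) − t assumes normalized targets (Σᵢ tᵢ = 1), matching the derivative table later on:

```python
import numpy as np

def softmax_cross_entropy(y, t):
    # y: pre-softmax network output, t: target distribution (e.g. one-hot)
    y_shifted = y - np.max(y)
    log_p = y_shifted - np.log(np.sum(np.exp(y_shifted)))  # log softmax, stable
    loss = -np.sum(t * log_p)                               # E = -sum_k t_k log S_k(y)
    grad = np.exp(log_p) - t                                # dE/dy = S(y) - t for normalized t
    return loss, grad

loss, grad = softmax_cross_entropy(np.array([2.0, 0.5, -1.0]),
                                   np.array([0.0, 1.0, 0.0]))
print(loss, grad)
```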


Training Neural Networks by Backpropagation
Gradient Descent by Backpropagation

We will use Gradient Descent to train a neural network.
: Remark 1: There are ways to perform gradient descent training even in unsupervised or semi-supervised scenarios (training targets unavailable or partially available).
: Remark 2: It is also possible to optimize neural networks without gradient descent (e.g. by evolution, http://people.idsia.ch/~juergen/compressednetworksearch.html).
This requires computing the gradient of the neural network error w.r.t. each weight.
We will first derive the Backpropagation algorithm, which allows performing this computation in an efficient way.


Backpropagation Training

Assume that for a given sample x, we have the error E(y) = E(z^(L)).
We must compute the gradients of E w.r.t. the weights.
We prepare ourselves by doing a few simple computations: since z_n^(ℓ) = f(u_n^(ℓ)) = f(Σ_m w_mn^(ℓ) z_m^(ℓ−1) + b_n^(ℓ)), we have (chain rule!)²

    ∂z_n^(ℓ) / ∂w_mn^(ℓ) = f′(u_n^(ℓ)) z_m^(ℓ−1)
    ∂z_n^(ℓ) / ∂b_n^(ℓ) = f′(u_n^(ℓ))
    ∂z_n^(ℓ) / ∂z_m^(ℓ−1) = f′(u_n^(ℓ)) w_mn^(ℓ)

for any ℓ = 1, . . . , L.

² This assumes that the nonlinearity f is computed independently for each neuron, which in practice is true except for the softmax nonlinearity. We will remove this restriction later on.

Backpropagation Training

For the last layer, we can now immediately compute the gradients:

    ∂E / ∂w_mn^(L) = (∂E / ∂z_n^(L)) (∂z_n^(L) / ∂w_mn^(L)) = (∂E / ∂z_n^(L)) f′(u_n^(L)) z_m^(L−1)
    ∂E / ∂b_n^(L) = (∂E / ∂z_n^(L)) (∂z_n^(L) / ∂b_n^(L)) = (∂E / ∂z_n^(L)) f′(u_n^(L)).

This computation is easiest for the last layer, since there is only one "path" in which the weight w_mn^(L) influences the error³.
Let us now consider the general case.

³ Again, this is not correct when the nonlinearity is computed on the entire layer.

Backpropagation Training

[Figure: the three-layer network from before, now with the loss E attached to the outputs y₁, . . . , y_K; one weight in W^(1) is indicated.]

The situation is slightly more complicated for the lower layers, since we need to consider all paths through which a certain weight influences the loss. In how many ways does the indicated weight influence the loss?

Backpropagation Training

We write

$$\frac{\partial E}{\partial \mathbf{z}^{(\ell)}} = \left( \frac{\partial E}{\partial z_1^{(\ell)}}, \ldots, \frac{\partial E}{\partial z_{M^{(\ell)}}^{(\ell)}} \right) \in \mathbb{R}^{1 \times M^{(\ell)}};$$

$$\frac{\partial \mathbf{z}^{(\ell)}}{\partial \mathbf{z}^{(\ell-1)}} = \begin{pmatrix} \frac{\partial z_1^{(\ell)}}{\partial z_1^{(\ell-1)}} & \cdots & \frac{\partial z_1^{(\ell)}}{\partial z_{M^{(\ell-1)}}^{(\ell-1)}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_{M^{(\ell)}}^{(\ell)}}{\partial z_1^{(\ell-1)}} & \cdots & \frac{\partial z_{M^{(\ell)}}^{(\ell)}}{\partial z_{M^{(\ell-1)}}^{(\ell-1)}} \end{pmatrix} \in \mathbb{R}^{M^{(\ell)} \times M^{(\ell-1)}}; \qquad \frac{\partial \mathbf{z}^{(\ell)}}{\partial w_{ij}^{(\ell)}} = \begin{pmatrix} \frac{\partial z_1^{(\ell)}}{\partial w_{ij}^{(\ell)}} \\ \vdots \\ \frac{\partial z_{M^{(\ell)}}^{(\ell)}}{\partial w_{ij}^{(\ell)}} \end{pmatrix} \in \mathbb{R}^{M^{(\ell)} \times 1}.$$

Remember the rules for derivatives of multivariate functions: input variables go into columns, output components go into rows, i.e. for f : ℝ^D → ℝ^K,

$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \left( \frac{\partial \mathbf{f}}{\partial x_1} \; \cdots \; \frac{\partial \mathbf{f}}{\partial x_D} \right) = \begin{pmatrix} \partial f_1 / \partial \mathbf{x} \\ \vdots \\ \partial f_K / \partial \mathbf{x} \end{pmatrix} \in \mathbb{R}^{K \times D}.$$


Backpropagation Training

We furthermore decompose ∂z^(ℓ+1)/∂z^(ℓ) into the gradient of the nonlinearity and the network part:

    ∂z^(ℓ+1)/∂z^(ℓ) = (∂z^(ℓ+1)/∂u^(ℓ+1)) (∂u^(ℓ+1)/∂z^(ℓ)).

We define F^(ℓ+1) := ∂z^(ℓ+1)/∂u^(ℓ+1). In the case of a component-wise nonlinearity, F^(ℓ+1) is a diagonal matrix.
Also note that because u^(ℓ+1) = W^(ℓ+1) z^(ℓ) + b^(ℓ+1), the second factor ∂u^(ℓ+1)/∂z^(ℓ) is just the weight matrix W^(ℓ+1)!


Backpropagation Training

Then, by the chain rule,

    ∂E/∂w_mn^(ℓ) = (∂E/∂z^(L)) (∂z^(L)/∂z^(L−1)) · · · (∂z^(ℓ+1)/∂z^(ℓ)) (∂z^(ℓ)/∂w_mn^(ℓ))

where the multiplications are matrix multiplications.
We leave out the formulas for updating the bias, since they are very similar.


Backpropagation Training

This gives us a straightforward way to compute the gradients for all weights in the network.
Let δ^(ℓ) be the gradient of the loss w.r.t. the activation of the ℓ-th layer:

    δ^(ℓ) = (∂E/∂z^(L)) (∂z^(L)/∂z^(L−1)) · · · (∂z^(ℓ+1)/∂z^(ℓ)) ∈ ℝ^{1 × M^(ℓ)}

and note that it can be computed recursively:

    δ^(ℓ) = δ^(ℓ+1) (∂z^(ℓ+1)/∂z^(ℓ)) = δ^(ℓ+1) F^(ℓ+1) W^(ℓ+1)   and   δ^(L) = ∂E/∂z^(L).

Combining prior results, we also see

    ∂E/∂w_mn^(ℓ) = δ^(ℓ) (∂z^(ℓ)/∂w_mn^(ℓ)) = (δ^(ℓ) F^(ℓ))ₙ z_m^(ℓ−1)

. . . and that's all we need.


Gradient Descent by Backpropagation

Here is the complete algorithm for performing a gradient step in neural network training, using the backpropagation algorithm to compute the gradients:
Perform the forward pass, saving intermediate results.
For ℓ = L, . . . , 1:
: compute δ^(ℓ) from δ^(ℓ+1) (except for ℓ = L, the start of the recursion)
: compute (and save) the weight gradients for layer ℓ
: all required formulas are on the previous slide.
Update all weights simultaneously:

    w_new = w − η ∇w

where η is the learning rate, and ∇w collects the gradients.
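A minimal end-to-end NumPy sketch of one such gradient step, assuming tanh hidden layers, a linear output layer, and the squared error; the sizes, data, and learning rate are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 3]
Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x, t = rng.normal(size=4), rng.normal(size=3)
eta = 0.01

# Forward pass, saving the intermediate activations (z^(0) = x included).
zs = [x]
for l, (W, b) in enumerate(zip(Ws, bs)):
    u = W @ zs[-1] + b
    zs.append(np.tanh(u) if l < len(Ws) - 1 else u)   # linear last layer

# Backward pass: delta^(L) = dE/dz^(L) = y - t for the squared error,
# then delta^(l) = delta^(l+1) F^(l+1) W^(l+1).
delta = zs[-1] - t
grads_W, grads_b = [], []
for l in reversed(range(len(Ws))):
    # f'(u): 1 for the linear output, 1 - tanh^2(u) = 1 - z^2 for tanh
    f_prime = np.ones_like(zs[l + 1]) if l == len(Ws) - 1 else 1 - zs[l + 1] ** 2
    d = delta * f_prime                       # delta F, elementwise (F is diagonal)
    grads_W.insert(0, np.outer(d, zs[l]))     # dE/dw_mn = (delta F)_n * z_m^(l-1)
    grads_b.insert(0, d)
    delta = d @ Ws[l]                         # propagate the error one layer down

# Update all weights simultaneously.
for W, b, gW, gb in zip(Ws, bs, grads_W, grads_b):
    W -= eta * gW
    b -= eta * gb
```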


Gradient Descent by Backpropagation

The name backpropagation for this implementation of gradient descent stems from the way the error is propagated from the network output to its layers, in backwards order.
Every partial derivative can be computed by a local computation (i.e. using the δ^(ℓ) from the backward pass and the z^(ℓ−1) from the forward pass).
The δ^(ℓ) are also called errors; they assign credit or blame to each node in each layer. Thus the error of the entire network (which we want to minimize) is distributed over its components.
Such credit assignment is a fundamental problem in machine learning.


Gradient Descent by Backpropagation

[Figure: the same network, annotated with the errors δ₁^(1), . . . , δ_K^(3) propagating backwards from the loss E through W^(3), W^(2), W^(1).]

The errors δᵢ^(ℓ) assign credit or blame to each node in each layer.
Thus we have quantified the contribution of each node to the network loss.


Gradient Descent by Backpropagation

At this point, you should have learned about backpropagation:
that it is very similar to forward propagation run in reverse!
In the forward case, we compute neuron activations from layer 1 to layer L.
In the backward case, we compute errors from layer L to layer 1. (Clearly, this only makes sense after a forward pass.)
Note that we need to collect intermediate results in both passes in order to train the network.
Also note that the backward and forward passes have the same complexity.
Finally, distinguish (Stochastic) Gradient Descent (an optimization method) from backpropagation (which is used to compute the gradients required for gradient descent).
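In practice, a finite-difference gradient check is a common sanity test for a backpropagation implementation: compare the analytic gradient with (E(w + ε) − E(w − ε)) / 2ε per weight. A minimal sketch, with a hypothetical quadratic loss standing in for the network error:

```python
import numpy as np

def numerical_grad(loss_fn, w, eps=1e-6):
    g = np.zeros_like(w)
    for i in range(w.size):
        w[i] += eps; e_plus = loss_fn(w)     # E(w + eps) in coordinate i
        w[i] -= 2 * eps; e_minus = loss_fn(w)
        w[i] += eps                          # restore the original weight
        g[i] = (e_plus - e_minus) / (2 * eps)
    return g

# Hypothetical example: E(w) = 1/2 ||w||^2, whose analytic gradient is w.
w = np.array([0.5, -1.0, 2.0])
print(numerical_grad(lambda v: 0.5 * np.sum(v ** 2), w), "vs", w)
```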


Backpropagation Training

To finish the picture, let us look up the derivatives of the nonlinearities:

Function   Formula                                           Derivative
Sigmoid    σ(x) = 1/(1 + e^{−x})                             σ′(x) = σ(x)(1 − σ(x))
Tanh       tanh(x) = (e^x − e^{−x})/(e^x + e^{−x})           tanh′(x) = 1 − tanh²(x)
ReLU       f(x) = max(0, x)                                  f′(x) = 1 if x > 0, 0 otherwise
Linear     f(x) = x                                          f′(x) = 1
Softmax    S(x) = (S₁, . . . , S_K), Sᵢ = e^{xᵢ}/Σₖ e^{xₖ}   ∂ᵢSⱼ = Sᵢ(1 − Sᵢ) if i = j, −SᵢSⱼ if i ≠ j

And here are the derivatives of the errors:

Function                  Formula                       Derivative
MSE                       E_MSE = ½ Σₖ (tₖ − yₖ)²       ∂E_MSE/∂y = (yₖ − tₖ)ₖ
Cross-Entropy             E_Crossent = −Σₖ tₖ log yₖ    ∂E_Crossent/∂y = (−tₖ/yₖ)ₖ
Cross-Entropy + Softmax   E_CE+SM = −Σₖ tₖ log Sₖ(y)    ∂E_CE+SM/∂y = ((Σᵢ tᵢ) Sₖ(y) − tₖ)ₖ

where in the latter case y is the network output before the softmax nonlinearity.

Exercise: Which simple form does the combined softmax + cross-entropy error take if the target is deterministic (only one tᵢ is nonzero)?


Training Setup

We have now defined (and derived) the complete backpropagation algorithm.
In practical setups, one usually accumulates gradient information over a mini-batch of several samples (say, 32 or 64):
: makes the gradient steps more stable
: parallelizes better.
A full iteration over all training samples is called an epoch.
The learning rate can be determined experimentally, but there are also algorithms which adapt it automatically.
A simple stopping criterion could be derived by checking the error on the training dataset: when the change is small for a few steps, we have reached convergence and stop
: but this is not how we usually do it.
In the next section, you will learn more about how network training is performed in practical situations.
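A minimal sketch of this mini-batch training loop; grad_fn stands in for a backpropagation routine returning the gradient for one sample (a hypothetical placeholder):

```python
import numpy as np

def train(w, X, T, grad_fn, eta=0.01, batch_size=32, epochs=10):
    n = X.shape[0]
    for epoch in range(epochs):                   # one epoch = one pass over the data
        order = np.random.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # accumulate (average) the gradient over the mini-batch
            g = np.mean([grad_fn(w, x, t) for x, t in zip(X[idx], T[idx])], axis=0)
            w = w - eta * g                       # one gradient step per batch
    return w
```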


Network Initialization

At the beginning of training, the NN parameters must be initialized with random values.
In particular, if all the weights have identical initial values (e.g. zero), all neurons will learn the exact same input weights, causing the whole learning process to fail.
Several strategies have been proposed; for a simple network, initialization from a uniform distribution is usually OK.
Usually, the mean over all weights should be zero. The standard deviation should not be too high (it often depends on the layer size).
Values that are too high or too low can lead to exploding or vanishing gradients.
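A minimal sketch of one such strategy: zero-mean uniform weights whose scale depends on the layer size. The Glorot/Xavier-style scale used here is one common choice and an assumption of this sketch, not prescribed by the slides:

```python
import numpy as np

def init_layer(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / (fan_in + fan_out))    # scale shrinks with layer size
    W = rng.uniform(-limit, limit, size=(fan_out, fan_in))  # zero mean
    b = np.zeros(fan_out)                        # biases can safely start at zero
    return W, b

rng = np.random.default_rng(0)
W, b = init_layer(fan_in=256, fan_out=128, rng=rng)
print(W.mean(), W.std())                         # mean near 0, modest spread
```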


Network Design Considerations
Advanced backpropagation

The basic algorithm requires fixing a learning rate (and a batch size).
The optimal learning rate depends on the task, the data quality, the batch size, the error function, . . .
The optimal batch size depends on the task, the data quality, the learning rate, the error function, . . .
Trial and error: if you see very small error reduction, the learning rate might be too low; if the error fluctuates wildly (or even increases), the learning rate may be too high.
You could also use a learning rate schedule (e.g. a higher learning rate in the beginning, a smaller learning rate for final finetuning).
If you observe high fluctuation, you may force smoother gradients by averaging the gradient over several batches (momentum; see the sketch below).
Several methods have been proposed to adapt the learning rate based on the observed convergence; a well-known one is the Adam optimizer (Kingma and Ba 2015).

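A minimal sketch of the momentum idea, written as an exponential moving average of past gradients; β = 0.9 and the toy quadratic are illustrative assumptions:

```python
import numpy as np

def sgd_momentum(w, grad_fn, eta=0.01, beta=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v + (1 - beta) * g   # smoothed gradient damps fluctuation
        w = w - eta * v
    return w

# Hypothetical example: gradient of 1/2 ||w||^2 is w itself.
print(sgd_momentum(np.array([5.0, -3.0]), lambda w: w))
```
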
Network Topologies

The optimal network topology depends on the task, the data quality
and the amount of data, . . .
No general rule, but note that if you have more than a few layers,
training quality decreases (i.e. the trained network does not perform
well).
: This is due to the structure of backpropagation (ultimately due to
the chain rule), where errors are computed by iterative
multiplications: The error norm follows a power law, with gradients
either vanishing or exploding.
: This can be avoided by gating techniques, including Highway
Networks (Srivastava et al. 2015) and Residual Networks (He et
al. 2016, a special case of Highway networks).
: The original gated neural network was the LSTM (Hochreiter &
Schmidhuber 1997), which we will get to know in the context of
recurrent neural networks.



Network Topologies

If the network is too shallow and/or too small (i.e. the number of
layers, or the number of neurons per layer is too small), the network
tends to underfit.
If the network is too large, it can overfit the training data, but in
practice this is not such a great problem.
You will usually make the network perform well on the training data, and then use regularization to improve generalization.


Early Stopping

One of the simplest ways to prevent overfitting the network is to monitor the error on a separate validation set (we know this from the first lecture).
When the validation error starts to rise, stop training (Early Stopping).
Note that the error fluctuates a bit: usually one defines a patience (say, 10 epochs) to wait and see whether the validation error falls again. If it does not, select the best-performing network found so far.
In practice, if your task is small to medium-sized, train with a small number of hidden nodes, then keep doubling until there is no more significant improvement on the validation set.
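A minimal sketch of the patience logic, driven by a hypothetical sequence of validation errors instead of a real training run:

```python
val_errors = [0.9, 0.7, 0.55, 0.5, 0.52, 0.51, 0.53, 0.5, 0.54, 0.56, 0.6]

best_err, best_epoch, patience, waited = float("inf"), -1, 3, 0
for epoch, err in enumerate(val_errors):
    if err < best_err:                       # validation error improved
        best_err, best_epoch = err, epoch    # remember the best network so far
        waited = 0
    else:
        waited += 1
        if waited >= patience:               # no improvement for `patience` epochs
            break
print("stop at epoch", epoch, "- restore the model from epoch", best_epoch)
```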


Regularization

The network can be regularized in various ways.
For example, one can penalize the absolute values of the weights, or the sum of their squares (we know this from linear regression):

    Ẽ(w) = E(w) + Σ_λ |w_λ|   or   Ẽ(w) = E(w) + Σ_λ |w_λ|²

(a sketch of the squared penalty follows below).
Another large class of regularization ideas comes from augmenting data by adding noise:
: Input noise (e.g. white noise) can be added to the input data.
: Noise can also be injected into the network in the form of Dropout.
: If we have knowledge of the underlying data, we can use domain-specific noise (e.g. image transformations).
In all cases, the idea is to artificially create more input samples (which should make sense, of course).
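A minimal sketch of the squared-weight penalty; the coefficient alpha in front of the penalty is an added assumption (the formula above sums the bare squares), and loss_fn/grad_fn are placeholders for the network's loss and backprop gradient:

```python
import numpy as np

def regularized(loss_fn, grad_fn, w, alpha=1e-4):
    loss = loss_fn(w) + alpha * np.sum(w ** 2)   # E~(w) = E(w) + alpha * sum w^2
    grad = grad_fn(w) + 2 * alpha * w            # extra term shrinks the weights
    return loss, grad

# Hypothetical base loss: E(w) = 1/2 ||w||^2 with gradient w.
w = np.array([1.0, -2.0])
print(regularized(lambda v: 0.5 * np.sum(v ** 2), lambda v: v, w))
```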


Special Layers

There exist a variety of methods that help with training the neural network. Often, they can be described as special layers (even though they are not really layers).
As an example, Batch Normalization standardizes the input of each layer for each mini-batch:
: can improve the quality of the solution
: often speeds up the training process (fewer epochs needed).
It also makes a lot of sense to standardize the input data.
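A minimal sketch of the Batch Normalization forward computation at training time, with the learnable scale gamma and shift beta initialized trivially; the running statistics used at test time are omitted here:

```python
import numpy as np

def batch_norm(U, gamma, beta, eps=1e-5):
    # U: one layer's inputs over a mini-batch, shape (batch, features)
    mean = U.mean(axis=0)
    var = U.var(axis=0)
    U_hat = (U - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * U_hat + beta               # learnable rescaling

U = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(32, 8))
out = batch_norm(U, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```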


Summary

In this lecture, you should have learned


the intuition behind a neural network
the practical implementation (as a series of matrix operations)
training by backpropagation (it is easier than it looks)

