Slides NN
Machine Learning
Michael Wand
TA: Vincent Herrmann
{michael.wand, vincent.herrmann}@idsia.ch
Dalle Molle Institute for Artificial Intelligence Studies (IDSIA) USI - SUPSI
The Perceptron
...and training by gradient descent
Multi-layer neural networks
Training by backpropagation
Design choices and tricks
Attribution: Some information about the perceptron is
taken from Bishop’s Machine Learning textbook,
https://www.microsoft.com/en-us/research/
people/cmbishop/prml-book. The description of
classical neural network theory is excellent; the
practical design and training considerations are
somewhat outdated.
Image source: Belkin, Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation,
arXiv:2105.14368
∇_w e_n = −∇_w (w^T ϕ_n t_n) = −ϕ_n t_n
It is easy to see that this reduces the error for this particular sample
(but not necessarily the total error).
Nonetheless, if the training set is linearly separable, the algorithm
finishes in finitely many steps (there are no more misclassified
samples). Yet even then, convergence can be very slow.
(note that the weights are usually independent for each step).
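The update rule above can be sketched in a few lines of plain Python (a minimal illustration, not the slides' code; all names are made up): for each misclassified sample (ϕ, t) with t ∈ {−1, +1}, add ϕ t to the weights, and stop once an epoch produces no errors.

```python
def train_perceptron(samples, targets, max_epochs=100):
    """Perceptron learning: for each misclassified sample (phi, t) with
    t in {-1, +1}, add phi * t to the weights. If the training set is
    linearly separable, this terminates in finitely many steps."""
    dim = len(samples[0])
    w = [0.0] * dim
    for _ in range(max_epochs):
        errors = 0
        for phi, t in zip(samples, targets):
            activation = sum(wi * xi for wi, xi in zip(w, phi))
            if activation * t <= 0:  # misclassified (or on the boundary)
                w = [wi + xi * t for wi, xi in zip(w, phi)]
                errors += 1
        if errors == 0:  # no misclassified samples left: done
            break
    return w

# Tiny separable toy data: first feature is a constant bias input of 1
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
T = [-1, -1, 1, 1]
w = train_perceptron(X, T)
```

On this toy set the algorithm converges after a handful of epochs; on nearly separable or badly scaled data it can take far longer, as noted above.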
The output of the entire network is then y = z^(L), with z^(0) = x.
The z_m^(ℓ) are neurons, each of which takes its input values and
computes a single output value from them
The inputs x_1, . . . , x_D are occasionally called input neurons (even
though they do not compute anything)
The neurons are organized in layers 1, . . . , L. (Some people consider
the input the zeroth layer.)
The weights w are directed connections between the neurons, e.g.
the neurons of layer 2 are connected to the ones of layer 1 by the
weights w_mn^(2), m = 1, . . . , M^(1), n = 1, . . . , M^(2).
[Figure: a fully connected network. The inputs x_1, . . . , x_D form z^(0); the weight matrices W^(1) and W^(2) feed the hidden layers z^(1) (with M^(1) neurons) and z^(2) (with M^(2) neurons), and W^(3) produces the outputs y_1, . . . , y_K.]
1 This formula is for one sample only; for multiple samples, take the mean.
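The layered forward computation described above can be sketched in plain Python (a hypothetical minimal implementation; the function name and the choice of a sigmoid nonlinearity in every layer are assumptions for illustration):

```python
import math

def forward(x, weights, biases):
    """Forward pass of a fully connected network: z^(0) = x, and
    z_n^(l) = f(sum_m w_mn^(l) * z_m^(l-1) + b_n^(l)).
    weights[l][m][n] connects neuron m of layer l to neuron n of
    layer l+1; here f is a sigmoid in every layer (an assumed choice).
    Returns the output y = z^(L)."""
    z = x
    for W, b in zip(weights, biases):
        u = [sum(W[m][n] * z[m] for m in range(len(z))) + b[n]
             for n in range(len(b))]
        z = [1.0 / (1.0 + math.exp(-v)) for v in u]
    return z

# One layer mapping two inputs to a single sigmoid unit
y = forward([1.0, 2.0], [[[0.5], [-0.5]]], [[0.0]])
```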
Neural Networks - Foundations 25
NN Setup for Classification
ĉ = arg max_k y_k.
t = (0, . . . , 0, 1, 0, . . . , 0), with the 1 in the k-th element.
Intuition: The cross-entropy corresponds to the number of additional bits needed to encode the
correct output, given that we have access to the (possibly wrong) prediction of the network.
and we see that the loss goes to zero if y_{k_correct} approaches one (since
the y_k must form a probability distribution, this implies that all other
y_k must go to zero).
However, the loss also works for probabilistic targets.
The neural network gracefully handles probabilistic outputs and
multi-class classification.
Remark: For efficiency and numerical stability, one should merge
the softmax nonlinearity and the cross-entropy criterion into one function.
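A minimal sketch of such a merged function, using the standard log-sum-exp trick (the function name is illustrative, not from the slides): since −log softmax(x)_k = log Σ_j e^(x_j) − x_k, shifting all logits by their maximum keeps every exponential in range.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Merged softmax + cross-entropy:
        -log softmax(x)[k] = log(sum_j exp(x_j)) - x_k.
    Shifting by max(logits) before exponentiating avoids overflow."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

# Computing softmax naively and then taking the log would overflow here;
# the merged version handles large logits without trouble.
loss = softmax_cross_entropy([1000.0, 1001.0, 999.0], target_index=1)
```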
Assume that for a given sample x, we have the error E (y) = E (z(L) ).
We must compute the gradients of E w.r.t. the weights.
We prepare ourselves by doing a few simple computations: Since
z_n^(ℓ) = f(u_n^(ℓ)) = f(Σ_m w_mn^(ℓ) z_m^(ℓ−1) + b_n^(ℓ)), we have (chain rule!)²

∂z_n^(ℓ)/∂w_mn^(ℓ) = f′(u_n^(ℓ)) z_m^(ℓ−1)

∂z_n^(ℓ)/∂b_n^(ℓ) = f′(u_n^(ℓ))

∂z_n^(ℓ)/∂z_m^(ℓ−1) = f′(u_n^(ℓ)) w_mn^(ℓ)
for any ℓ = 1, . . . , L.
2 This assumes that the nonlinearity f is computed independently for each neuron,
which in practice is true except for the softmax nonlinearity. We will remove this
restriction later on.
Backpropagation Training
For the last layer, we can now immediately compute the gradients:

∂E/∂w_mn^(L) = (∂E/∂z_n^(L)) (∂z_n^(L)/∂w_mn^(L)) = (∂E/∂z_n^(L)) f′(u_n^(L)) z_m^(L−1)

∂E/∂b_n^(L) = (∂E/∂z_n^(L)) (∂z_n^(L)/∂b_n^(L)) = (∂E/∂z_n^(L)) f′(u_n^(L)).

This computation is easiest for the last layer, since there is only one
“path” in which the weight w_mn^(L) influences the error³.
Let us now consider the general case.
3 Again, this is not correct when the nonlinearity is computed on the entire layer.
Backpropagation Training
[Figure: the same fully connected network, now with the loss E attached to the outputs y_1, . . . , y_K; one weight in the first layer is indicated.]
The situation is slightly more complicated for the lower layers, since we
need to consider all paths which lead to a certain weight. In how many
ways does the indicated weight influence the loss?
Backpropagation Training
We write

∂E/∂z^(ℓ) = (∂E/∂z_1^(ℓ), . . . , ∂E/∂z_{M^(ℓ)}^(ℓ)) ∈ R^(1 × M^(ℓ));

∂z^(ℓ)/∂z^(ℓ−1) = the matrix with entries ∂z_n^(ℓ)/∂z_m^(ℓ−1) (row n, column m) ∈ R^(M^(ℓ) × M^(ℓ−1));

∂z^(ℓ)/∂w_ij^(ℓ) = the column vector with entries ∂z_n^(ℓ)/∂w_ij^(ℓ) ∈ R^(M^(ℓ) × 1).
Remember the rules for derivatives of multivariate functions: Input variables go into columns, output
components go into rows, i.e. for f : R^D → R^K,

∂f/∂x = (∂f/∂x_1  ∂f/∂x_2  · · ·  ∂f/∂x_D) ∈ R^(K × D),

where each column ∂f/∂x_d contains the K components ∂f_1/∂x_d, . . . , ∂f_K/∂x_d.
We furthermore decompose ∂z^(ℓ+1)/∂z^(ℓ) into the gradient of the
nonlinearity and the network part:

∂z^(ℓ+1)/∂z^(ℓ) = (∂z^(ℓ+1)/∂u^(ℓ+1)) (∂u^(ℓ+1)/∂z^(ℓ)).

We define F^(ℓ+1) := ∂z^(ℓ+1)/∂u^(ℓ+1). In the case of a component-wise
nonlinearity, F^(ℓ+1) is a diagonal matrix.
Also note that because of u^(ℓ+1) = W^(ℓ+1) z^(ℓ) + b^(ℓ+1), the second
factor ∂u^(ℓ+1)/∂z^(ℓ) is just the weight matrix W^(ℓ+1)!
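Putting the two factors together gives the backward recursion in matrix form (simply substituting the definitions above):

```latex
\frac{\partial E}{\partial \mathbf{z}^{(\ell)}}
  = \frac{\partial E}{\partial \mathbf{z}^{(\ell+1)}}
    \frac{\partial \mathbf{z}^{(\ell+1)}}{\partial \mathbf{z}^{(\ell)}}
  = \frac{\partial E}{\partial \mathbf{z}^{(\ell+1)}}\,
    \mathbf{F}^{(\ell+1)}\, \mathbf{W}^{(\ell+1)}
```

Since ∂E/∂z^(ℓ+1) is a row vector, each backward step is a single vector-matrix product.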
w_new = w − η ∇_w E
[Figure: the network with an error δ_n^(ℓ) attached to each neuron of each layer, up to δ_1^(3), . . . , δ_K^(3) at the outputs feeding the loss E.]
The errors δ_i^(ℓ) assign credit or blame to each node in each layer.
Thus we have quantified the contribution of each node to the network loss.
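The whole procedure can be checked end to end on a tiny network. The sketch below is a minimal plain-Python illustration (illustrative names; one sigmoid hidden layer, a single linear output, and squared-error loss are assumed choices): it computes the δ values and then the gradients via ∂E/∂w_mn = δ_n z_m and ∂E/∂b_n = δ_n, exactly as in the formulas above.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def backprop(x, t, W1, b1, W2, b2):
    """One forward/backward pass for a net with a sigmoid hidden layer
    and a single linear output, with loss E = 0.5 * (y - t)^2.
    W1[m][n] connects input m to hidden neuron n.
    Each delta is dE/du for one neuron; the gradients follow
    dE/dw_mn = delta_n * z_m and dE/db_n = delta_n."""
    D, H = len(x), len(b1)
    # forward pass
    u1 = [sum(W1[m][n] * x[m] for m in range(D)) + b1[n] for n in range(H)]
    z1 = [sigmoid(u) for u in u1]
    y = sum(W2[n] * z1[n] for n in range(H)) + b2
    E = 0.5 * (y - t) ** 2
    # backward pass: output delta (linear unit, so f'(u) = 1)
    delta_out = y - t
    # hidden deltas: propagate through W2, times sigmoid'(u) = z(1 - z)
    delta1 = [delta_out * W2[n] * z1[n] * (1.0 - z1[n]) for n in range(H)]
    # gradients from the deltas
    gW1 = [[delta1[n] * x[m] for n in range(H)] for m in range(D)]
    gb1 = delta1[:]
    gW2 = [delta_out * z1[n] for n in range(H)]
    gb2 = delta_out
    return E, gW1, gb1, gW2, gb2
```

A useful sanity check, done in practice for any hand-written backprop, is to compare one analytic gradient entry against a finite difference of the loss.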
In order to finish the picture, let us google the derivatives of the nonlinearities:
Function   Formula                                                Derivative
Sigmoid    σ(x) = 1/(1 + e^(−x))                                  σ′(x) = σ(x)(1 − σ(x))
Tanh       tanh(x) = (e^x − e^(−x))/(e^x + e^(−x))                tanh′(x) = 1 − tanh²(x)
ReLU       f(x) = max(0, x)                                       f′(x) = 1 if x > 0, 0 otherwise
Linear     f(x) = x                                               f′(x) = 1
Softmax    S(x) = (S_1, . . . , S_K), S_i = e^(x_i)/Σ_k e^(x_k)   ∂_i S_j = S_i(1 − S_i) if i = j, −S_i S_j if i ≠ j
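The table entries are easy to verify numerically. The sketch below (illustrative helper names) implements the sigmoid, tanh, and ReLU derivatives so they can be compared against a central finite difference:

```python
import math

# Derivative formulas from the table: sigma'(x) = sigma(x)(1 - sigma(x)),
# tanh'(x) = 1 - tanh(x)^2, ReLU'(x) = 1 for x > 0 and 0 otherwise.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    return 1.0 if x > 0 else 0.0

def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference, for checking the closed-form derivatives."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)
```

Note that the ReLU derivative is undefined at x = 0; in practice any value in [0, 1] works there, and 0 is a common choice.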
The basic algorithm requires fixing a learning rate (and a batch size)
The optimal learning rate depends on the task, the data quality, the
batch size, the error function, . . .
The optimal batch size depends on the task, the data quality, the
learning rate, the error function, . . .
Trial and error: If you see very small error reduction, the learning
rate might be too low; if the error fluctuates wildly (or even
increases), the learning rate may be too high
You could also use a learning rate schedule (e.g. higher learning rate
in the beginning, smaller learning rate for final finetuning)
If you observe high fluctuation, you may force smoother gradients by
averaging the gradient over several batches (momentum)
Several methods have been proposed to adapt the learning rate
based on the observed convergence, a well-known one is the Adam
optimizer (Kingma and Ba 2015).
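A minimal sketch of the momentum idea mentioned above (illustrative names and hyperparameters; this is the classic heavy-ball update, not the full Adam optimizer): the velocity accumulates an exponentially decaying sum of past gradients, which smooths out fluctuating per-batch gradients.

```python
def sgd_momentum_step(w, velocity, grad, lr=0.05, beta=0.9):
    """One SGD step with momentum: velocity <- beta * velocity + grad,
    then w <- w - lr * velocity. beta controls how much gradient
    history is averaged in (beta = 0 recovers plain SGD)."""
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    w = [wi - lr * v for wi, v in zip(w, velocity)]
    return w, velocity

# Usage: minimize the toy loss f(w) = w1^2 + w2^2, whose gradient is 2w
w, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    grad = [2.0 * wi for wi in w]
    w, v = sgd_momentum_step(w, v, grad)
```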
Network Topologies
The optimal network topology depends on the task, the data quality
and the amount of data, . . .
No general rule, but note that if you have more than a few layers,
training quality decreases (i.e. the trained network does not perform
well).
This is due to the structure of backpropagation (ultimately due to
the chain rule), where errors are computed by iterative
multiplications: the error norm follows a power law, with gradients
either vanishing or exploding.
This can be avoided by gating techniques, including Highway
Networks (Srivastava et al. 2015) and Residual Networks (He et
al. 2016, a special case of Highway Networks).
The original gated neural network was the LSTM (Hochreiter &
Schmidhuber 1997), which we will get to know in the context of
recurrent neural networks.
If the network is too shallow and/or too small (i.e. the number of
layers, or the number of neurons per layer is too small), the network
tends to underfit.
If the network is too large, it can overfit the training data, but in
practice this is not such a great problem.
You will usually make the network perform well on the training data,
and then use regularization to improve generalization.