Slides NN
Machine Learning
Michael Wand
TA: Vincent Herrmann
{michael.wand, vincent.herrmann}@idsia.ch
Dalle Molle Institute for Artificial Intelligence Studies (IDSIA) USI - SUPSI
The Perceptron
...and training by gradient descent
Multi-layer neural networks
Training by backpropagation
Design choices and tricks
Attribution: Some information about the perceptron is
taken from Bishop’s Machine Learning textbook,
https://www.microsoft.com/en-us/research/
people/cmbishop/prml-book. The description of
classical neural network theory is excellent; the
practical design and training considerations are
somewhat outdated.
Image source: Belkin, Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation,
arXiv:2105.14368
∇_w e_n = −∇_w (w^T ϕ_n t_n) = −ϕ_n t_n
It is easy to see that this reduces the error for this particular sample
(but not necessarily the total error).
Nonetheless, if the training set is linearly separable, the algorithm
finishes in finitely many steps (there are no more misclassified
samples). Yet even then, convergence can be very slow.
(note that the weights are usually independent for each step).
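The update rule above can be sketched in a few lines of plain Python (a minimal illustration, not the slides' code; all names are made up): for each misclassified sample (ϕ, t) with t ∈ {−1, +1}, add ϕ t to the weights, and stop once an epoch produces no errors.

```python
def train_perceptron(samples, targets, max_epochs=100):
    """Perceptron learning: for each misclassified sample (phi, t) with
    t in {-1, +1}, add phi * t to the weights. If the training set is
    linearly separable, this terminates in finitely many steps."""
    dim = len(samples[0])
    w = [0.0] * dim
    for _ in range(max_epochs):
        errors = 0
        for phi, t in zip(samples, targets):
            activation = sum(wi * xi for wi, xi in zip(w, phi))
            if activation * t <= 0:  # misclassified (or on the boundary)
                w = [wi + xi * t for wi, xi in zip(w, phi)]
                errors += 1
        if errors == 0:  # no misclassified samples left: done
            break
    return w

# Tiny separable toy data: first feature is a constant bias input of 1
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
T = [-1, -1, 1, 1]
w = train_perceptron(X, T)
```

On this toy set the algorithm converges after a handful of epochs; on nearly separable or badly scaled data it can take far longer, as noted above.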
The output of the entire network is then y = z^(L), with z^(0) = x.
The z_m^(ℓ) are neurons, each of which takes its input values and
computes a single output value from them
The inputs x_1, . . . , x_D are occasionally called input neurons (even
though they do not compute anything)
The neurons are organized in layers 1, . . . , L. (Some people consider
the input the zeroth layer.)
The weights w are directed connections between the neurons, e.g.
the neurons of layer 2 are connected to the ones of layer 1 by the
weights w_mn^(2), m = 1, . . . , M^(1), n = 1, . . . , M^(2).
[Figure: a fully connected network. The inputs x_1, . . . , x_D form z^(0); the weight matrices W^(1) and W^(2) feed the hidden layers z^(1) (with M^(1) neurons) and z^(2) (with M^(2) neurons), and W^(3) produces the outputs y_1, . . . , y_K.]
1 This formula is for one sample only; for multiple samples, take the mean.
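The layered forward computation described above can be sketched in plain Python (a hypothetical minimal implementation; the function name and the choice of a sigmoid nonlinearity in every layer are assumptions for illustration):

```python
import math

def forward(x, weights, biases):
    """Forward pass of a fully connected network: z^(0) = x, and
    z_n^(l) = f(sum_m w_mn^(l) * z_m^(l-1) + b_n^(l)).
    weights[l][m][n] connects neuron m of layer l to neuron n of
    layer l+1; here f is a sigmoid in every layer (an assumed choice).
    Returns the output y = z^(L)."""
    z = x
    for W, b in zip(weights, biases):
        u = [sum(W[m][n] * z[m] for m in range(len(z))) + b[n]
             for n in range(len(b))]
        z = [1.0 / (1.0 + math.exp(-v)) for v in u]
    return z

# One layer mapping two inputs to a single sigmoid unit
y = forward([1.0, 2.0], [[[0.5], [-0.5]]], [[0.0]])
```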
Neural Networks - Foundations 25
NN Setup for Classification
ĉ = arg max_k y_k.
t = (0, . . . , 0, 1, 0, . . . , 0), with the 1 in the k-th element.
Intuition: The cross-entropy corresponds to the number of additional bits needed to encode the
correct output, given that we have access to the (possibly wrong) prediction of the network.
and we see that the loss goes to zero if y_{k_correct} approaches one (since
the y_k must form a probability distribution, this implies that all other
y_k must go to zero).
However, the loss also works for probabilistic targets.
The neural network gracefully handles probabilistic outputs and
multi-class classification.
Remark: For efficiency and numerical stability, one should merge
the softmax nonlinearity and the cross-entropy criterion into one function.
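A minimal sketch of such a merged function, using the standard log-sum-exp trick (the function name is illustrative, not from the slides): since −log softmax(x)_k = log Σ_j e^(x_j) − x_k, shifting all logits by their maximum keeps every exponential in range.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Merged softmax + cross-entropy:
        -log softmax(x)[k] = log(sum_j exp(x_j)) - x_k.
    Shifting by max(logits) before exponentiating avoids overflow."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

# Computing softmax naively and then taking the log would overflow here;
# the merged version handles large logits without trouble.
loss = softmax_cross_entropy([1000.0, 1001.0, 999.0], target_index=1)
```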
Assume that for a given sample x, we have the error E (y) = E (z(L) ).
We must compute the gradients of E w.r.t. the weights.
We prepare ourselves by doing a few simple computations: Since
z_n^(ℓ) = f(u_n^(ℓ)) = f(Σ_m w_mn^(ℓ) z_m^(ℓ−1) + b_n^(ℓ)), we have (chain rule!)²

∂z_n^(ℓ)/∂w_mn^(ℓ) = f′(u_n^(ℓ)) z_m^(ℓ−1)

∂z_n^(ℓ)/∂b_n^(ℓ) = f′(u_n^(ℓ))

∂z_n^(ℓ)/∂z_m^(ℓ−1) = f′(u_n^(ℓ)) w_mn^(ℓ)
for any ℓ = 1, . . . , L.
2 This assumes that the nonlinearity f is computed independently for each neuron,
which in practice is true except for the softmax nonlinearity. We will remove this
restriction later on.
Backpropagation Training
For the last layer, we can now immediately compute the gradients:

∂E/∂w_mn^(L) = (∂E/∂z_n^(L)) (∂z_n^(L)/∂w_mn^(L)) = (∂E/∂z_n^(L)) f′(u_n^(L)) z_m^(L−1)

∂E/∂b_n^(L) = (∂E/∂z_n^(L)) (∂z_n^(L)/∂b_n^(L)) = (∂E/∂z_n^(L)) f′(u_n^(L)).

This computation is easiest for the last layer, since there is only one
“path” in which the weight w_mn^(L) influences the error³.
Let us now consider the general case.
3 Again, this is not correct when the nonlinearity is computed on the entire layer.
Backpropagation Training
[Figure: the same fully connected network, now with the loss E attached to the outputs y_1, . . . , y_K; one weight in the first layer is indicated.]
The situation is slightly more complicated for the lower layers, since we
need to consider all paths which lead to a certain weight. In how many
ways does the indicated weight influence the loss?
Backpropagation Training
We write

∂E/∂z^(ℓ) = (∂E/∂z_1^(ℓ), . . . , ∂E/∂z_{M^(ℓ)}^(ℓ)) ∈ R^(1 × M^(ℓ));

∂z^(ℓ)/∂z^(ℓ−1) = the matrix with entries ∂z_n^(ℓ)/∂z_m^(ℓ−1) (row n, column m) ∈ R^(M^(ℓ) × M^(ℓ−1));

∂z^(ℓ)/∂w_ij^(ℓ) = the column vector with entries ∂z_n^(ℓ)/∂w_ij^(ℓ) ∈ R^(M^(ℓ) × 1).
Remember the rules for derivatives of multivariate functions: Input variables go into columns, output
components go into rows, i.e. for f : R^D → R^K,

∂f/∂x = (∂f/∂x_1  ∂f/∂x_2  · · ·  ∂f/∂x_D) ∈ R^(K × D),

where each column ∂f/∂x_d contains the K components ∂f_1/∂x_d, . . . , ∂f_K/∂x_d.
We furthermore decompose ∂z^(ℓ+1)/∂z^(ℓ) into the gradient of the
nonlinearity and the network part:

∂z^(ℓ+1)/∂z^(ℓ) = (∂z^(ℓ+1)/∂u^(ℓ+1)) (∂u^(ℓ+1)/∂z^(ℓ)).

We define F^(ℓ+1) := ∂z^(ℓ+1)/∂u^(ℓ+1). In the case of a component-wise
nonlinearity, F^(ℓ+1) is a diagonal matrix.
Also note that because of u^(ℓ+1) = W^(ℓ+1) z^(ℓ) + b^(ℓ+1), the second
factor ∂u^(ℓ+1)/∂z^(ℓ) is just the weight matrix W^(ℓ+1)!
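Putting the two factors together gives the backward recursion in matrix form (simply substituting the definitions above):

```latex
\frac{\partial E}{\partial \mathbf{z}^{(\ell)}}
  = \frac{\partial E}{\partial \mathbf{z}^{(\ell+1)}}
    \frac{\partial \mathbf{z}^{(\ell+1)}}{\partial \mathbf{z}^{(\ell)}}
  = \frac{\partial E}{\partial \mathbf{z}^{(\ell+1)}}\,
    \mathbf{F}^{(\ell+1)}\, \mathbf{W}^{(\ell+1)}
```

Since ∂E/∂z^(ℓ+1) is a row vector, each backward step is a single vector-matrix product.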
w_new = w − η ∇_w E
[Figure: the network with an error δ_n^(ℓ) attached to each neuron of each layer, up to δ_1^(3), . . . , δ_K^(3) at the outputs feeding the loss E.]
The errors δ_i^(ℓ) assign credit or blame to each node in each layer.
Thus we have quantified the contribution of each node to the network loss.
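The whole procedure can be checked end to end on a tiny network. The sketch below is a minimal plain-Python illustration (illustrative names; one sigmoid hidden layer, a single linear output, and squared-error loss are assumed choices): it computes the δ values and then the gradients via ∂E/∂w_mn = δ_n z_m and ∂E/∂b_n = δ_n, exactly as in the formulas above.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def backprop(x, t, W1, b1, W2, b2):
    """One forward/backward pass for a net with a sigmoid hidden layer
    and a single linear output, with loss E = 0.5 * (y - t)^2.
    W1[m][n] connects input m to hidden neuron n.
    Each delta is dE/du for one neuron; the gradients follow
    dE/dw_mn = delta_n * z_m and dE/db_n = delta_n."""
    D, H = len(x), len(b1)
    # forward pass
    u1 = [sum(W1[m][n] * x[m] for m in range(D)) + b1[n] for n in range(H)]
    z1 = [sigmoid(u) for u in u1]
    y = sum(W2[n] * z1[n] for n in range(H)) + b2
    E = 0.5 * (y - t) ** 2
    # backward pass: output delta (linear unit, so f'(u) = 1)
    delta_out = y - t
    # hidden deltas: propagate through W2, times sigmoid'(u) = z(1 - z)
    delta1 = [delta_out * W2[n] * z1[n] * (1.0 - z1[n]) for n in range(H)]
    # gradients from the deltas
    gW1 = [[delta1[n] * x[m] for n in range(H)] for m in range(D)]
    gb1 = delta1[:]
    gW2 = [delta_out * z1[n] for n in range(H)]
    gb2 = delta_out
    return E, gW1, gb1, gW2, gb2
```

A useful sanity check, done in practice for any hand-written backprop, is to compare one analytic gradient entry against a finite difference of the loss.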
In order to finish the picture, let us google the derivatives of the nonlinearities:
Function   Formula                                                Derivative
Sigmoid    σ(x) = 1/(1 + e^(−x))                                  σ′(x) = σ(x)(1 − σ(x))
Tanh       tanh(x) = (e^x − e^(−x))/(e^x + e^(−x))                tanh′(x) = 1 − tanh²(x)
ReLU       f(x) = max(0, x)                                       f′(x) = 1 if x > 0, 0 otherwise
Linear     f(x) = x                                               f′(x) = 1
Softmax    S(x) = (S_1, . . . , S_K), S_i = e^(x_i)/Σ_k e^(x_k)   ∂_i S_j = S_i(1 − S_i) if i = j, −S_i S_j if i ≠ j
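The table entries are easy to verify numerically. The sketch below (illustrative helper names) implements the sigmoid, tanh, and ReLU derivatives so they can be compared against a central finite difference:

```python
import math

# Derivative formulas from the table: sigma'(x) = sigma(x)(1 - sigma(x)),
# tanh'(x) = 1 - tanh(x)^2, ReLU'(x) = 1 for x > 0 and 0 otherwise.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    return 1.0 if x > 0 else 0.0

def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference, for checking the closed-form derivatives."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)
```

Note that the ReLU derivative is undefined at x = 0; in practice any value in [0, 1] works there, and 0 is a common choice.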
The basic algorithm requires fixing a learning rate (and a batch size)
The optimal learning rate depends on the task, the data quality, the
batch size, the error function, . . .
The optimal batch size depends on the task, the data quality, the
learning rate, the error function, . . .
Trial and error: If you see very small error reduction, the learning
rate might be too low; if the error fluctuates wildly (or even
increases), the learning rate may be too high
You could also use a learning rate schedule (e.g. higher learning rate
in the beginning, smaller learning rate for final finetuning)
If you observe high fluctuation, you may force smoother gradients by
averaging the gradient over several batches (momentum)
Several methods have been proposed to adapt the learning rate
based on the observed convergence, a well-known one is the Adam
optimizer (Kingma and Ba 2015).
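A minimal sketch of the momentum idea mentioned above (illustrative names and hyperparameters; this is the classic heavy-ball update, not the full Adam optimizer): the velocity accumulates an exponentially decaying sum of past gradients, which smooths out fluctuating per-batch gradients.

```python
def sgd_momentum_step(w, velocity, grad, lr=0.05, beta=0.9):
    """One SGD step with momentum: velocity <- beta * velocity + grad,
    then w <- w - lr * velocity. beta controls how much gradient
    history is averaged in (beta = 0 recovers plain SGD)."""
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    w = [wi - lr * v for wi, v in zip(w, velocity)]
    return w, velocity

# Usage: minimize the toy loss f(w) = w1^2 + w2^2, whose gradient is 2w
w, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    grad = [2.0 * wi for wi in w]
    w, v = sgd_momentum_step(w, v, grad)
```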
Network Topologies
The optimal network topology depends on the task, the data quality
and the amount of data, . . .
No general rule, but note that if you have more than a few layers,
training quality decreases (i.e. the trained network does not perform
well).
This is due to the structure of backpropagation (ultimately due to
the chain rule), where errors are computed by iterative
multiplications: the error norm follows a power law, with gradients
either vanishing or exploding.
This can be avoided by gating techniques, including Highway
Networks (Srivastava et al. 2015) and Residual Networks (He et
al. 2016, a special case of Highway Networks).
The original gated neural network was the LSTM (Hochreiter &
Schmidhuber 1997), which we will get to know in the context of
recurrent neural networks.
If the network is too shallow and/or too small (i.e. the number of
layers, or the number of neurons per layer is too small), the network
tends to underfit.
If the network is too large, it can overfit the training data, but in
practice this is not such a great problem.
You will usually make the network perform well on the training data,
and then use regularization to improve generalization.