Deep Learning book, by Ian Goodfellow,
Yoshua Bengio and Aaron Courville
Chapter 6: Deep Feedforward Networks
Benoit Massé, Dionyssos Kounades-Bastian
Linear regression (and classification)
Input vector x
Output vector y
Parameters: weight W and bias b
Prediction: y = Wᵀx + b
[Diagram: inputs x1, x2, x3 connected to outputs y1, y2 through the weights W11, …, W23 and the biases b1, b2.]
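As an illustration (a sketch, not part of the original slides), a minimal NumPy version of this prediction, with made-up shapes x ∈ R³ and y ∈ R² matching the diagram:

```python
import numpy as np

# Linear model y = W^T x + b, with illustrative shapes:
# x in R^3, y in R^2, so W is 3x2 and b is in R^2.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=2)

x = np.array([1.0, 2.0, 3.0])
y = W.T @ x + b        # prediction of the linear regressor
print(y.shape)         # (2,)
```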
Linear regression (and classification)
Advantages
Easy to use
Easy to train, low risk of overfitting
Drawbacks
Some problems are inherently non-linear
Solving XOR
Linear regressor
[Diagram: inputs x1, x2 connected directly to the output y through weights W and bias b.]
There is no value of W and b
such that ∀(x1, x2) ∈ {0, 1}²,
    Wᵀ(x1, x2)ᵀ + b = xor(x1, x2)
Solving XOR
What about... ?
[Diagram: x1, x2 → u1, u2 (weights W, bias b) → y (weights V, bias c), with no non-linearity.]
Strictly equivalent:
The composition of two linear operations is still a linear
operation
Solving XOR
And about... ?
[Diagram: x1, x2 → u1, u2 (weights W, bias b) → h1, h2 (non-linearity φ) → y (weights V, bias c).]
In which φ(x) = max{0, x}
It is possible!
With W = [[1, 1], [1, 1]], b = (0, −1)ᵀ, V = (1, −2) and c = 0,
    Vφ(Wx + b) = xor(x1, x2)
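As a quick check of these values (a sketch, not from the slides; the array names mirror the W, b, V, c above):

```python
import numpy as np

# Parameters from the slide: a single hidden layer with ReLU solves XOR.
W = np.array([[1., 1.],
              [1., 1.]])
b = np.array([0., -1.])
V = np.array([1., -2.])
c = 0.0

def relu(x):
    return np.maximum(0.0, x)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2], dtype=float)
    y = V @ relu(W @ x + b) + c
    print((x1, x2), "->", y)   # prints 0, 1, 1, 0
```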
Neural network with one hidden layer
Compact representation
[Diagram: x → h (weights W, bias b, activation φ) → y (weights V, bias c).]
Neural network
Hidden layer with non-linearity
→ can represent a broader class of functions
Universal approximation theorem
Theorem
A neural network with one hidden layer can approximate any
continuous function
More formally, given a continuous function f : Cn → R^m where
Cn is a compact subset of R^n,

    ∀ε > 0, ∃ f_NN^ε : x ↦ Σ_{i=1}^{K} v_i φ(w_iᵀ x + b_i) + c

such that
    ∀x ∈ Cn, ‖f(x) − f_NN^ε(x)‖ < ε
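To make the sum form above concrete, here is a minimal sketch (illustrative only; the theorem itself says nothing about how to obtain the weights) that fits a shallow ReLU network to a 1-D target function by least squares on randomly drawn hidden units:

```python
import numpy as np

# f_NN(x) = sum_i v_i * relu(w_i * x + b_i) + c, fitted to a target function.
rng = np.random.default_rng(0)
K = 200                                    # number of hidden units
w = rng.normal(size=K)                     # hidden weights (1-D input)
b = rng.uniform(-3.0, 3.0, size=K)         # hidden biases

X = np.linspace(-3.0, 3.0, 500)            # compact subset of R
target = np.sin(2.0 * X)                   # continuous function to approximate

H = np.maximum(0.0, np.outer(X, w) + b)    # hidden activations, shape (500, K)
A = np.column_stack([H, np.ones(len(X))])  # extra column for the bias c
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
v, c = coef[:-1], coef[-1]

approx = H @ v + c
print("max error:", np.max(np.abs(approx - target)))  # small for large K
```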
Problems
Obtaining the network
The universal approximation theorem gives no information about HOW to
obtain such a network
Size of the hidden layer h
Values of W and b
Using the network
Even if we find a way to obtain the network, the size of the
hidden layer may be prohibitively large.
Deep neural network
Why Deep?
Let's stack l hidden layers one after the other; l is called the
depth of the network.
[Diagram: x → h1 (W1, b1, φ) → h2 (W2, b2, φ) → … → hl (Wl, bl, φ) → y (V, c).]
Properties of DNN
The universal approximation theorem also applies
Some functions can be approximated by a DNN with N
hidden units, but would require O(e^N) hidden units to be
represented by a shallow network.
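As an illustration of this stacking (a sketch with made-up layer sizes, not from the slides):

```python
import numpy as np

def forward(x, hidden_params, V, c, phi=lambda z: np.maximum(0.0, z)):
    """Forward pass of a deep feedforward network.

    hidden_params is a list of (W_k, b_k) pairs, one per hidden layer;
    phi is the activation (ReLU by default); (V, c) is the output layer.
    """
    h = x
    for W, b in hidden_params:
        h = phi(W @ h + b)
    return V @ h + c

# Illustrative network: 3 inputs, two hidden layers of width 4, 2 outputs.
rng = np.random.default_rng(0)
hidden_params = [(rng.normal(size=(4, 3)), rng.normal(size=4)),
                 (rng.normal(size=(4, 4)), rng.normal(size=4))]
V, c = rng.normal(size=(2, 4)), rng.normal(size=2)

print(forward(np.array([1.0, 0.5, -0.2]), hidden_params, V, c))
```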
Summary
Comparison
Linear classifier
− Limited representational power
+ Simple
Shallow Neural network (exactly one hidden layer)
+ Unlimited representational power
− Sometimes prohibitively wide
Deep Neural network
+ Unlimited representational power
+ Relatively small number of hidden units needed
Remaining problem
How to get this DNN?
On the path of getting my own DNN
Hyperparameters
First, we need to define the architecture of the DNN
The depth l
The size of the hidden layers n1, …, nl
The activation function φ
The output unit
Parameters
When the architecture is defined, we need to train the DNN
W1, b1, …, Wl, bl
Hyperparameters
The depth l
The size of the hidden layers n1, …, nl
Strongly depend on the problem to solve
The activation function φ
ReLU: g : x ↦ max{0, x}
Sigmoid: σ : x ↦ 1 / (1 + e^(−x))
Many others: tanh, RBF, softplus...
The output unit
Linear output E[y] = Vᵀhl + c
For regression with Gaussian distribution y ∼ N(E[y], I)
Sigmoid output ŷ = σ(wᵀhl + b)
For classification with Bernoulli distribution P(y = 1|x) = ŷ
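As a small illustration (a sketch with made-up values, not from the slides), the two activations and the two output units above in NumPy:

```python
import numpy as np

def relu(x):
    # g(x) = max{0, x}
    return np.maximum(0.0, x)

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative last hidden layer h_l and output-layer parameters.
rng = np.random.default_rng(0)
h_l = relu(rng.normal(size=4))

V, c = rng.normal(size=(2, 4)), rng.normal(size=2)
mean_y = V @ h_l + c           # linear output: E[y], for Gaussian regression

w, b = rng.normal(size=4), 0.1
p_y1 = sigmoid(w @ h_l + b)    # sigmoid output: P(y = 1 | x), for Bernoulli classification
print(mean_y, p_y1)
```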
Parameters Training
Objective
Let's define θ = (W1, b1, …, Wl, bl).
We suppose we have a set of inputs X = (x1, …, xN) and a
set of expected outputs Y = (y1, …, yN). The goal is to find
a neural network fNN such that
∀i, fNN(xi, θ) ≈ yi.
Parameters Training
Cost function
To evaluate the error that our current network makes, let's
define a cost function L(X, Y, θ). The goal becomes to find
    argmin_θ L(X, Y, θ)
Loss function
Should represent a combination of the distances between every
yi and the corresponding fNN(xi, θ)
Mean square error (rare)
Cross-entropy
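For concreteness, a minimal sketch (illustrative data and names, not from the slides) of the two losses mentioned above, averaged over a batch:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average squared distance between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary cross-entropy, with p_pred = P(y = 1 | x) predicted by the network.
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y_true = np.array([0.0, 1.0, 1.0, 0.0])
p_pred = np.array([0.1, 0.8, 0.7, 0.2])
print(mean_squared_error(y_true, p_pred), cross_entropy(y_true, p_pred))
```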
Parameters Training
Find the minimum
The basic idea consists in computing θ̂ such that
∇θ L(X, Y, θ̂) = 0.
This is difficult to solve analytically, e.g. when θ has millions
of degrees of freedom.
Parameters Training
Gradient descent
Let's use a numerical way to optimize θ, called gradient
descent (section 4.3). The idea is that
    f(θ − εu) ≈ f(θ) − εuᵀ∇f(θ)
So if we take u = ∇f(θ), we have uᵀu > 0 and then
    f(θ − εu) ≈ f(θ) − εuᵀu < f(θ).
If f is a function to minimize, we have an update rule that
improves our estimate.
Parameters Training
Gradient descent algorithm
1. Have an estimate θ̂ of the parameters
2. Compute ∇θ L(X, Y, θ̂)
3. Update θ̂ ← θ̂ − ε∇θ L
4. Repeat steps 2-3 until ‖∇θ L‖ < threshold
Problem
How to estimate ∇θ L(X, Y, θ̂) efficiently?
Back-propagation algorithm
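A minimal sketch of this loop on a toy quadratic objective (the objective, step size ε and threshold are illustrative, not from the slides):

```python
import numpy as np

def loss(theta):
    # Toy objective with a known minimum at (1, -2).
    return np.sum((theta - np.array([1.0, -2.0])) ** 2)

def grad_loss(theta):
    # Gradient of the toy objective.
    return 2.0 * (theta - np.array([1.0, -2.0]))

theta = np.zeros(2)     # step 1: initial estimate
epsilon = 0.1           # learning rate
threshold = 1e-6

while True:
    g = grad_loss(theta)                  # step 2: compute the gradient
    theta = theta - epsilon * g           # step 3: update the estimate
    if np.linalg.norm(g) < threshold:     # step 4: stop when the gradient is small
        break

print(theta)  # close to (1, -2)
```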
Back-propagation for Parameter Learning
Consider the architecture:
[Diagram: x → φ(w1 x) → φ(w2 φ(w1 x)) = y, i.e. two scalar weights w1, w2 with the activation φ applied after each.]
with function:
    y = φ(w2 φ(w1 x)),
some training pairs T = {(x̂n, ŷn)}, n = 1, …, N, and
an activation function φ().
Learn w1, w2 so that feeding x̂n results in ŷn.
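A minimal sketch of this two-weight network (scalar x, ReLU assumed for φ, names mirroring the slide):

```python
import numpy as np

def phi(x):
    # Activation function; ReLU, as on the earlier slides.
    return np.maximum(0.0, x)

def net(x, w1, w2):
    # y = phi(w2 * phi(w1 * x)), the two-weight architecture above.
    return phi(w2 * phi(w1 * x))

print(net(1.5, w1=0.8, w2=0.3))   # 0.36
```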
Prerequisite: differentiable activation function
For learning to be possible, φ() has to be differentiable.
Let φ′(x) = ∂φ(x)/∂x denote the derivative of φ(x).
For example, when φ(x) = ReLU(x), we have φ′(x) = 1 for x > 0
and φ′(x) = 0 for x < 0.
Gradient-based Learning
Minimize the loss function L(w1, w2, T).
We will learn the weights by iterating:

    (w1, w2)^updated = (w1, w2) − γ (∂L/∂w1, ∂L/∂w2),    (1)

L is the loss function (must be differentiable): in detail it is
L(w1, w2, T) and we want to compute the gradient(s) at
w1, w2.
γ is the learning rate (a scalar, typically known).
Back-propagation
Calculate intermediate values on all units:
1. a = w1 x̂n
2. b = φ(w1 x̂n)
3. c = w2 φ(w1 x̂n)
4. d = φ(w2 φ(w1 x̂n))
5. L(d) = L(φ(w2 φ(w1 x̂n)))
The partial derivatives are:
6. ∂L(d)/∂d = L′(d)
7. ∂d/∂c = φ′(w2 φ(w1 x̂n))
8. ∂c/∂b = w2
9. ∂b/∂a = φ′(w1 x̂n)
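A minimal sketch (illustrative; ReLU for φ and a squared-error loss are assumptions, since the slide leaves both generic) of caching these intermediate values and local derivatives for one training pair:

```python
import numpy as np

def phi(x):                      # ReLU activation
    return np.maximum(0.0, x)

def phi_prime(x):                # its derivative (taken as 0 at the kink)
    return np.where(x > 0.0, 1.0, 0.0)

def forward_with_cache(x_n, y_n, w1, w2):
    # Intermediate values, numbered as on the slide.
    a = w1 * x_n                         # 1
    b = phi(a)                           # 2
    c = w2 * b                           # 3
    d = phi(c)                           # 4
    loss = 0.5 * (d - y_n) ** 2          # 5, assuming a squared-error loss L
    # Local partial derivatives.
    dL_dd = d - y_n                      # 6: L'(d) for the squared-error loss
    dd_dc = phi_prime(c)                 # 7
    dc_db = w2                           # 8
    db_da = phi_prime(a)                 # 9
    return loss, (dL_dd, dd_dc, dc_db, db_da)

print(forward_with_cache(x_n=1.0, y_n=0.5, w1=0.8, w2=1.2))
```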
Calculating the Gradients I
Apply chain rule:

    ∂L/∂w1 = (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a)(∂a/∂w1),

    ∂L(d)/∂w2 = (∂L(d)/∂d)(∂d/∂c)(∂c/∂w2).

Start the calculation from left to right.
We propagate the gradients (partial products) from the last
layer towards the input.
Calculating the Gradients
And because we have N training pairs:

    ∂L/∂w1 = Σ_{n=1}^{N} (∂L(dn)/∂dn)(∂dn/∂cn)(∂cn/∂bn)(∂bn/∂an)(∂an/∂w1),

    ∂L/∂w2 = Σ_{n=1}^{N} (∂L(dn)/∂dn)(∂dn/∂cn)(∂cn/∂w2).
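Putting the pieces together, a minimal sketch (same assumptions as before: ReLU φ, squared-error loss, made-up data) of one full gradient computation over the N training pairs and a single update of w1, w2:

```python
import numpy as np

def phi(x):                      # ReLU activation
    return np.maximum(0.0, x)

def phi_prime(x):                # its derivative (taken as 0 at the kink)
    return np.where(x > 0.0, 1.0, 0.0)

def gradients(w1, w2, xs, ys):
    # Accumulate dL/dw1 and dL/dw2 over all N training pairs.
    g1 = g2 = 0.0
    for x_n, y_n in zip(xs, ys):
        a = w1 * x_n; b = phi(a); c = w2 * b; d = phi(c)
        dL_dd = d - y_n                      # squared-error loss 0.5 * (d - y)^2
        dd_dc = phi_prime(c)
        g2 += dL_dd * dd_dc * b              # dc/dw2 = b
        g1 += dL_dd * dd_dc * w2 * phi_prime(a) * x_n   # chain through c, b, a
    return g1, g2

# Illustrative data and a single update step, as in equation (1).
xs = np.array([0.5, 1.0, 1.5, 2.0])
ys = np.array([0.2, 0.4, 0.6, 0.8])
w1, w2, gamma = 0.3, 0.7, 0.05
g1, g2 = gradients(w1, w2, xs, ys)
w1, w2 = w1 - gamma * g1, w2 - gamma * g2
print(w1, w2)
```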
Thank you!