7. Multilayer Perceptron Neural Networks
7.1 Introduction
In this part, we study the multilayer feed-forward network, which is an important class of neural networks.
Typically, the network consists of a set of inputs (source nodes)
that constitute the input layer, one or more hidden layers of
computational nodes, and an output layer of computational
nodes. The input signals propagate through the network in a forward direction, on a layer-by-layer basis. These neural networks are commonly referred to as multilayer perceptrons (MLP).
An MLP neural network has three distinctive characteristics:
(1) The model of each neuron in the network includes a nonlinear activation function. The important point here is that the nonlinearity is smooth; a sigmoid function is often used.
(2) The network contains one or more layers of hidden
neurons that are not part of the input or output of the network.
These hidden neurons enable the network to learn complex
tasks.
(3) The network exhibits a high degree of connectivity,
determined by the synapses of the network.
7.2 Some Preliminaries
The architecture of an MLP
The following diagram shows the architectural graph of an MLP neural network with two hidden layers and an output layer. Signals flow through the network in a forward direction, from left to right, on a layer-by-layer basis.
1-D gradient descent method
Suppose we have a 1-D minimization problem:
$$J(x) = (a - bx)^2$$

where $a$ and $b$ are known parameters. The goal of the optimization problem is to find a value of $x$ such that $J(x)$ is minimized.
To minimize the cost function J(x), we take the partial
derivative:
$$\frac{\partial J(x)}{\partial x} = -2b(a - bx)$$
Solving the equation:
$$\frac{\partial J(x)}{\partial x} = 0$$
We obtain the optimal solution:
$$x_o = a/b$$
The following is the graphical illustration:
[Figure: the parabola $J(x) = (a - bx)^2$, whose minimum lies at $x_o = a/b$.]
[Figure: the derivative $\partial J(x)/\partial x$ lies along the parameter axis and points in the direction that increases $J(x)$; its negative points inwards, towards the minimum at $a/b$.]

The following iterative equation can be used to search for the optimal solution $x_o$:

$$x(n+1) = x(n) - \eta \left. \frac{\partial J(x)}{\partial x} \right|_{x = x(n)}$$

where $\eta$ is the step-size parameter.
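The behaviour of this iteration is easy to check numerically. Below is a minimal Python sketch; the values a = 3, b = 2, the step size eta = 0.1 and the starting point are illustrative assumptions, not part of the text:

```python
# Minimal 1-D gradient descent for J(x) = (a - b*x)^2.
# The parameter values and step size below are illustrative choices.
a, b = 3.0, 2.0       # known parameters
eta = 0.1             # step-size parameter
x = 0.0               # initial guess

for n in range(100):
    grad = -2.0 * b * (a - b * x)   # dJ/dx
    x = x - eta * grad              # iterative update

print(x, a / b)   # x converges to the optimal solution a/b = 1.5
```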
The objective of MLP neural network learning
Given a set of training (learning) samples:
$$\{x(1), d(1)\}, \{x(2), d(2)\}, \ldots, \{x(N), d(N)\}$$

where

$$x(i) = [x_1(i), x_2(i), \ldots, x_{n_1}(i)]$$
$$d(i) = [d_1(i), d_2(i), \ldots, d_{n_2}(i)]$$
The objective of learning is to construct an MLP neural network so that the network response f[x(i)] approximates the desired response d(i) well:

$$f[x(i)] \approx d(i)$$
7.3 Back-propagation algorithm for MLP training
Two types of signals used in the back propagation algorithm
(1) Function signals
A function signal is an input signal that comes in at the input layer of the network, propagates forward through the network, and emerges at the output layer as an output signal. We refer to such a signal as a function signal.
(2) Error signal
An error signal originates at the output neurons of the network and
propagates backward, layer by layer, through the network. We refer to
such a signal as an error signal because its computation by every neuron
of the network involves an error-dependent function.
The error signal of the output neuron j at iteration n (i.e. when the n-th
training sample is presented to the network) is defined by:
$$e_j(n) = d_j(n) - y_j(n) \qquad (7.1)$$

where $d_j(n)$ and $y_j(n)$ denote the desired response and the actual response of neuron j at the n-th iteration.
Define the instantaneous value of the error energy for neuron j as:
$$\frac{1}{2} e_j^2(n) \qquad (7.2)$$
Then the instantaneous value of the total error energy is obtained by
summing the error energy of all neurons in the output layer:
$$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n) \qquad (7.3)$$
C is the set of output neurons.
Assuming there are N training samples, the average squared error energy is
obtained by summing E(n) over all n and then normalising with respect to
the number of samples N:
$$E_{\mathrm{ave}} = \frac{1}{N} \sum_{n=1}^{N} E(n) \qquad (7.4)$$
The instantaneous error energy $E(n)$, and therefore the average error energy $E_{\mathrm{ave}}$, is a function of all the free parameters (i.e. the synaptic weights) of the network. For a given training data set, $E_{\mathrm{ave}}$ represents the cost function as a measure of learning performance.
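As a quick illustration of Eqns (7.1), (7.3) and (7.4), the following sketch evaluates the error energies for a small batch of made-up desired and actual responses (all numbers are assumptions for illustration):

```python
import numpy as np

# Illustrative data: N = 4 samples, 2 output neurons (values are made up).
d = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])  # desired
y = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.9], [0.3, 0.1]])  # actual

e = d - y                              # error signals, Eqn (7.1)
E_n = 0.5 * np.sum(e**2, axis=1)       # total error energy E(n), Eqn (7.3)
E_ave = np.mean(E_n)                   # average error energy, Eqn (7.4)
print(E_n, E_ave)
```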
The objective of learning is to adjust the parameters (i.e. the weights) of the network to minimize $E_{\mathrm{ave}}$. To perform this minimization, we consider a simple method of training, in which the weights are updated on a sample-by-sample basis until one complete presentation of the entire training set has been made.
The average of these individual weight changes over the training set is
therefore an estimate of the true change that would result from modifying
the weights based on minimizing the cost function Eave over the entire
training set.
Consider the following diagram, which depicts neuron j being fed by a set of function signals produced by a layer of neurons to its left. The activation of neuron j is therefore:
$$v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n) \qquad (7.5)$$

where m+1 is the total number of inputs applied to neuron j. Hence, the function signal $y_j(n)$, i.e. the output of neuron j at iteration n, is:

$$y_j(n) = \varphi_j[v_j(n)] \qquad (7.6)$$
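In code, Eqns (7.5) and (7.6) amount to a weighted sum followed by the activation function. A minimal sketch, assuming the logistic nonlinearity of Eqn (7.25) below with a = 1 and a fixed input y_0 = +1 so that w_j0 plays the role of a bias:

```python
import numpy as np

def neuron_forward(w, y_prev):
    """Activation v_j, Eqn (7.5), and function signal y_j, Eqn (7.6).
    w: weights w_j0..w_jm (w_j0 is the bias weight); y_prev: inputs
    y_0..y_m, with y_0 = +1 fixed so that w_j0 acts as the bias."""
    v = np.dot(w, y_prev)                  # Eqn (7.5)
    y = 1.0 / (1.0 + np.exp(-v))           # Eqn (7.6), logistic with a = 1
    return v, y

# Illustrative values: 3 actual inputs plus the fixed +1 bias input.
w = np.array([0.1, -0.4, 0.8, 0.3])
y_prev = np.array([1.0, 0.5, -0.2, 0.7])
print(neuron_forward(w, y_prev))
```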
The back-propagation algorithm applies a correction Δwji(n) to the
synaptic weight wji(n), which is proportional to the partial derivative of
E(n) with respect to wji(n).
According to the chain rule of calculus, we may express this derivative
(gradient) as:
$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)} \qquad (7.7)$$
The partial derivative $\partial E(n)/\partial w_{ji}(n)$ represents a sensitivity factor that determines the direction of search in weight space for the synaptic weight $w_{ji}(n)$.
Differentiating both sides of Eqn (7.3) with respect to ej(n), we get:
$$\frac{\partial E(n)}{\partial e_j(n)} = e_j(n) \qquad (7.8)$$
Differentiating both sides of Eqn (7.1) with respect to yj(n), we get:
$$\frac{\partial e_j(n)}{\partial y_j(n)} = -1 \qquad (7.9)$$
Differentiating Eqn (7.6) with respect to vj(n), we obtain:
$$\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'[v_j(n)] \qquad (7.10)$$
Differentiating Eqn (7.5) with respect to $w_{ji}(n)$, we obtain:

$$\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n) \qquad (7.11)$$
Substituting Eqns (7.8)-(7.11) into Eqn (7.7) yields:

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\, \varphi_j'[v_j(n)]\, y_i(n) \qquad (7.12)$$
The correction applied to $w_{ji}(n)$ is defined by the delta rule:

$$\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)} = \eta\, \delta_j(n)\, y_i(n) \qquad (7.13)$$
where η is the learning-rate parameter of the back-propagation algorithm, and $\delta_j(n)$ is the local gradient, defined by:

$$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n)\, \varphi_j'[v_j(n)] \qquad (7.14)$$
The above equation shows that the local gradient points to the required
changes in synaptic weights. The local gradient is equal to the product
of the corresponding error signal for that neuron and the derivative of
the associated activation function.
In the above, we note that a key factor involved in the calculation of the weight adjustment $\Delta w_{ji}(n)$ is the error signal $e_j(n)$ at the output neuron j.
In this context, we may identify two distinct cases, depending on where
in the network neuron j is located.
Case 1: neuron j is at the output layer
When neuron j is located in the output layer of the network, it is
supplied with a desired response of its own. Thus, it is straightforward
to compute the error signal ej(n).
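In this case the whole update is a few lines of code. A minimal sketch of Eqns (7.13) and (7.14) for a single output neuron, assuming a logistic activation with a = 1 (so that, as derived later in Eqn (7.26), the derivative can be written as y_j(1 - y_j)); all numeric values are illustrative:

```python
import numpy as np

# Output-neuron update by the delta rule, Eqns (7.13)-(7.14).
# Illustrative values; logistic activation with a = 1 is assumed.
eta = 0.5                         # learning-rate parameter
d_j, y_j = 1.0, 0.731             # desired and actual responses of neuron j
y_i = np.array([1.0, 0.5, -0.2])  # input signals to neuron j (y_0 = +1 bias)

e_j = d_j - y_j                   # error signal, Eqn (7.1)
phi_prime = y_j * (1.0 - y_j)     # logistic derivative expressed via y_j
delta_j = e_j * phi_prime         # local gradient, Eqn (7.14)
dw_ji = eta * delta_j * y_i       # weight corrections, Eqn (7.13)
print(delta_j, dw_ji)
```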
Case 2: neuron j is at the hidden layer
When neuron j is located in a hidden layer of the network, there is no
supplied desired response for the neuron. Thus, the error signal for a
hidden layer neuron has to be determined recursively in terms of the
error signals of all neurons to which that hidden layer neuron is directly
connected. This recursive computation is what makes the back-propagation algorithm more involved.
The local gradient for a hidden neuron j is defined as:
$$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\, \varphi_j'[v_j(n)], \qquad \text{neuron } j \text{ hidden} \qquad (7.15)$$
To calculate the partial derivative ∂E(n)/∂yj(n), we may proceed as
follows:
$$E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n) \qquad (7.16)$$

where C is the set of output neurons.
Differentiating Eqn (7.16) with respect to yj(n), we get:
$$\frac{\partial E(n)}{\partial y_j(n)} = \sum_{k} e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)} \qquad (7.17)$$
Next, we use the chain rule for the partial derivative $\partial e_k(n)/\partial y_j(n)$, and rewrite Eqn (7.17) as:

$$\frac{\partial E(n)}{\partial y_j(n)} = \sum_{k} e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)} \qquad (7.18)$$
We note that:
$$e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k[v_k(n)] \qquad (7.19)$$

Hence, we have:

$$\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi_k'[v_k(n)] \qquad (7.20)$$
We also note from the above diagram that for neuron k, the activation is:
$$v_k(n) = \sum_{j=0}^{m} w_{kj}(n)\, y_j(n) \qquad (7.21)$$

Differentiating Eqn (7.21) with respect to $y_j(n)$ yields:

$$\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n) \qquad (7.22)$$
Substituting Eqns (7.20) and (7.22) into Eqn (7.18) yields:

$$\frac{\partial E(n)}{\partial y_j(n)} = -\sum_{k} e_k(n)\, \varphi_k'[v_k(n)]\, w_{kj}(n) = -\sum_{k} \delta_k(n)\, w_{kj}(n) \qquad (7.23)$$
Substituting Eqn (7.23) into Eqn (7.15), we get the back-propagation
formula for the local gradient of hidden neuron j:
$$\delta_j(n) = \varphi_j'[v_j(n)] \sum_{k} \delta_k(n)\, w_{kj}(n) \qquad (7.24)$$
The factor $\varphi_j'[v_j(n)]$ involved in the computation of the local gradient in the above equation depends solely on the activation function associated with hidden neuron j. The remaining factor, $\delta_k(n)$, requires knowledge of the error signals $e_k(n)$ for all neurons that lie in the layer to the immediate right of hidden neuron j, and $w_{kj}(n)$ are the associated synaptic weights.
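A minimal sketch of Eqn (7.24): the local gradients of the layer to the immediate right are propagated back through the corresponding weights. A logistic activation is again assumed, and all numbers are illustrative:

```python
import numpy as np

# Hidden-neuron local gradient, Eqn (7.24). Illustrative values.
y_j = 0.6                            # function signal of hidden neuron j
phi_prime_j = y_j * (1.0 - y_j)      # logistic derivative via y_j
delta_k = np.array([0.12, -0.05])    # local gradients of the next layer
w_kj = np.array([0.4, -0.7])         # weights from neuron j to those neurons

delta_j = phi_prime_j * np.sum(delta_k * w_kj)   # Eqn (7.24)
print(delta_j)
```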
Now, we summarize the relations that we have derived for the back-
propagation algorithm.
(1) First, the correction Δwji(n) applied to the synaptic weight connecting
neuron i to neuron j is defined by the delta rule:
$$\underbrace{\Delta w_{ji}(n)}_{\text{weight correction}} \;=\; \underbrace{\eta}_{\text{learning rate}} \;\times\; \underbrace{\delta_j(n)}_{\text{local gradient}} \;\times\; \underbrace{y_i(n)}_{\text{input signal}}$$
(2) Second, the local gradient depends on whether neuron j is an output
or hidden neuron:
(i) If neuron j is hidden, the gradient is computed using Eqn (7.24).
(ii) If neuron j is an output neuron, the gradient is computed using Eqn (7.14).
Two passes of the computation
In the application of the back-propagation algorithm, two distinct passes
of computations are distinguished. The first pass is referred to as the
forward pass, and the second is referred to as the backward pass.
(1) Forward pass
In the forward pass, the synaptic weights remain unchanged throughout
the network, and the function signals of the network are computed on a
neuron-by-neuron, layer-by-layer basis.
(2) Backward pass
The backward pass starts from the output layer, passes error signals
leftward through the network, layer-by-layer, and recursively computes
the local gradient for each neuron. This recursive process permits the weights to undergo changes, in accordance with the delta rule.
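The two passes can be combined into a single training step for a network with one hidden layer. The sketch below is one possible organization, assuming logistic activations with a = 1, a fixed +1 input for the biases, and made-up layer sizes; it is not the only way to arrange the computation:

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
n_in, n_hid, n_out, eta = 3, 4, 2, 0.5           # illustrative sizes
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in + 1))   # hidden weights (incl. bias)
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid + 1))  # output weights (incl. bias)

x = np.array([0.2, -0.1, 0.7])   # one training sample (made up)
d = np.array([1.0, 0.0])

# ---- Forward pass: weights fixed, signals computed layer by layer ----
y0 = np.concatenate(([1.0], x))          # +1 bias input
y1 = logistic(W1 @ y0)                   # hidden function signals
y1b = np.concatenate(([1.0], y1))
o = logistic(W2 @ y1b)                   # output signals

# ---- Backward pass: deltas computed from the output layer leftwards ----
delta_out = (d - o) * o * (1.0 - o)                       # Eqn (7.14)
delta_hid = y1 * (1.0 - y1) * (W2[:, 1:].T @ delta_out)   # Eqn (7.24)

W2 += eta * np.outer(delta_out, y1b)     # delta rule, Eqn (7.13)
W1 += eta * np.outer(delta_hid, y0)
print(0.5 * np.sum((d - o) ** 2))        # E(n) before the update, Eqn (7.3)
```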
Activation function
The computation of the local gradient and hence weight correction of each
neuron of the MLP neural network requires knowledge of the derivative of
the activation function. For this derivative to exist, we require the
activation function to be continuous and hence differentiable. An example
of a continuously differentiable nonlinear activation function commonly
used in MLP neural networks is the sigmoidal nonlinear function, which has
two typical forms:
(1) Logistic function.
This form of sigmoidal nonlinearity is defined in general by:

$$\varphi_j[v_j(n)] = \frac{1}{1 + \exp[-a v_j(n)]} \qquad (7.25)$$

where a > 0, and $v_j(n)$ is the activation signal of neuron j.
According to this nonlinearity, the amplitude of the output lies in the range:

$$0 \le y_j(n) \le 1$$
Differentiating Eqn (7.25) with respect to vj(n), we get:
$$\varphi_j'[v_j(n)] = \frac{a \exp[-a v_j(n)]}{\{1 + \exp[-a v_j(n)]\}^2} \qquad (7.26)$$

Using $y_j(n) = \varphi_j[v_j(n)]$, we may express Eqn (7.26) as:

$$\varphi_j'[v_j(n)] = a\, y_j(n)\,[1 - y_j(n)]$$
For a neuron j located in the output layer, $y_j(n) = o_j(n)$. Hence, the local gradient may be expressed as:

$$\delta_j(n) = e_j(n)\, \varphi_j'[v_j(n)] = a\,[d_j(n) - o_j(n)]\, o_j(n)\,[1 - o_j(n)] \qquad (7.27)$$
On the other hand, for an arbitrary hidden neuron j, we may express the local gradient as:

$$\delta_j(n) = \varphi_j'[v_j(n)] \sum_{k} \delta_k(n)\, w_{kj}(n) = a\, y_j(n)\,[1 - y_j(n)] \sum_{k} \delta_k(n)\, w_{kj}(n) \qquad (7.28)$$
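The identity $\varphi_j' = a\, y_j(1 - y_j)$ is easy to verify numerically against a finite-difference estimate; a minimal sketch with arbitrary test values:

```python
import numpy as np

a, v = 1.5, 0.8                                # arbitrary test values
phi = lambda v: 1.0 / (1.0 + np.exp(-a * v))   # logistic, Eqn (7.25)

y = phi(v)
analytic = a * y * (1.0 - y)                   # Eqn (7.26) rewritten via y
numeric = (phi(v + 1e-6) - phi(v - 1e-6)) / 2e-6
print(analytic, numeric)                       # the two should agree closely
```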
(2) Hyperbolic tangent function
The hyperbolic tangent function is another commonly used form of sigmoidal nonlinearity; in its general form, it is defined by:

$$\varphi_j[v_j(n)] = a \tanh[b\, v_j(n)] = a\, \frac{\exp[b\, v_j(n)] - \exp[-b\, v_j(n)]}{\exp[b\, v_j(n)] + \exp[-b\, v_j(n)]} \qquad (7.29)$$
Suitable values for a and b are:
$$a = 1.7159, \qquad b = 2/3$$
The derivative of the hyperbolic tangent function with respect to vj(n) is
given by:
$$\varphi_j'[v_j(n)] = ab\, \mathrm{sech}^2[b\, v_j(n)] = ab\,\{1 - \tanh^2[b\, v_j(n)]\} = \frac{b}{a}\,[a - y_j(n)]\,[a + y_j(n)] \qquad (7.30)$$

where sech denotes the hyperbolic secant:

$$\mathrm{sech}(z) = \frac{2}{\exp(z) + \exp(-z)}$$
For a neuron j located in the output layer, the local gradient is:
$$\delta_j(n) = e_j(n)\, \varphi_j'[v_j(n)] = \frac{b}{a}\,[d_j(n) - o_j(n)]\,[a - o_j(n)]\,[a + o_j(n)] \qquad (7.31)$$
For a neuron j located in a hidden layer, we have
$$\delta_j(n) = \varphi_j'[v_j(n)] \sum_{k} \delta_k(n)\, w_{kj}(n) = \frac{b}{a}\,[a - y_j(n)]\,[a + y_j(n)] \sum_{k} \delta_k(n)\, w_{kj}(n) \qquad (7.32)$$
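The same numerical check applies to the hyperbolic tangent form, using the recommended values a = 1.7159 and b = 2/3 (the test point v is arbitrary):

```python
import numpy as np

a, b, v = 1.7159, 2.0 / 3.0, 0.4        # recommended a, b; arbitrary v
phi = lambda v: a * np.tanh(b * v)      # Eqn (7.29)

y = phi(v)
analytic = (b / a) * (a - y) * (a + y)  # Eqn (7.30) expressed via y
numeric = (phi(v + 1e-6) - phi(v - 1e-6)) / 2e-6
print(analytic, numeric)                # the two should agree closely
```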
Learning-rate parameter η
The smaller we make the learning-rate parameter η, the smaller the changes to the synaptic weights in the network will be from one iteration to the next. This smoother trajectory in weight space, however, is attained at the cost of a slower rate of learning. If, on the other hand, we make the learning-rate parameter too large in order to speed up the rate of learning, the resulting large changes in the synaptic weights may make the network unstable.
A simple method of increasing the rate of learning, yet avoiding the
danger of instability, is to modify the delta rule by including a momentum
term:
$$\Delta w_{ji}(n) = \alpha\, \Delta w_{ji}(n-1) + \eta\, \delta_j(n)\, y_i(n)$$
where α is usually a positive number called the momentum constant. It controls the feedback loop acting around $\Delta w_{ji}(n)$. For convergence of the weight estimation, the momentum constant must be restricted to the range:

$$0 \le |\alpha| < 1$$
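In code, momentum only requires remembering the previous weight correction. A minimal sketch; the value α = 0.9 is a common choice but is an assumption here, as are the other numbers:

```python
import numpy as np

eta, alpha = 0.5, 0.9             # learning rate and momentum constant
dw_prev = np.zeros(3)             # previous correction Δw_ji(n-1)
delta_j = 0.08                    # local gradient (illustrative)
y_i = np.array([1.0, 0.5, -0.2])  # input signals to neuron j

dw = alpha * dw_prev + eta * delta_j * y_i   # delta rule with momentum
dw_prev = dw                                 # remember for next iteration
```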
Modes of training
In a practical application of the back-propagation algorithm, learning results
from the many presentations of a prescribed set of training samples to the
MLP. One complete presentation of the entire training set during the
learning process is called an epoch. The learning process is maintained on
an epoch-by-epoch basis, until the synaptic weights of the network stabilize.
For a given set of training samples, back-propagation learning may proceed
in one of the following two basic ways.
(1) Sequential mode
The sequential mode of the back-propagation learning is also referred to
as on-line learning. In this mode of operation, weight updating is performed
after the presentation of each training sample. To be specific, consider an
epoch consisting of N training samples (pairs):
$$\{x(1), d(1)\}, \{x(2), d(2)\}, \ldots, \{x(N), d(N)\}$$
The first training sample pair {x(1),d(1)} in the epoch is presented to the
network, and the sequence of forward and backward computations
described previously is performed, resulting in certain adjustments to the
synaptic weights of the network. Then the second training sample pair
{x(2),d(2)} in the epoch is presented, and the forward and backward
computations are repeated, resulting in further adjustments to the weights.
This process is continued until the last training sample pair {x(N), d(N)} in the epoch is presented.
(2) Batch mode
In the batch mode, the weight updating is performed after all the samples in the epoch are presented to the network. Details are not described here, since the sequential mode is the more commonly used method.
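The two modes differ only in when the update is applied. The sketch below illustrates the scheduling on the simple 1-D cost from Section 7.2 rather than on a full MLP; the data and step size are made up:

```python
import numpy as np

# Sequential vs. batch weight updating, illustrated on the 1-D cost
# J(x) = sum_n (a_n - b_n * x)^2 with made-up data.
a_n = np.array([3.0, 2.5, 3.5])
b_n = np.array([2.0, 2.0, 2.0])
eta = 0.05

def grad(a, b, x):              # dJ_n/dx for one sample
    return -2.0 * b * (a - b * x)

# Sequential (on-line) mode: update after each sample presentation.
x_seq = 0.0
for epoch in range(50):
    for a, b in zip(a_n, b_n):
        x_seq -= eta * grad(a, b, x_seq)

# Batch mode: accumulate over the whole epoch, then update once.
x_batch = 0.0
for epoch in range(50):
    g = sum(grad(a, b, x_batch) for a, b in zip(a_n, b_n))
    x_batch -= eta * g

print(x_seq, x_batch)   # both settle near the same minimizer
```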
Summary of the back-propagation algorithm
The signal flow graph for back-propagation learning, incorporating both
the forward and backward phases of the computations involved in the
learning process, is shown in the following diagram:
This graph is for the case of one hidden layer. The top part of the signal flow
graph accounts for the forward pass, while the lower part of the graph
accounts for the backward pass.
Often, the sequential learning of weights is the preferred method for on-
line implementation of the back-propagation algorithm. For this mode of
operation, the algorithm cycles as follows.
Assume there are N training samples:

$$\{x(1), d(1)\}, \{x(2), d(2)\}, \ldots, \{x(N), d(N)\}$$
(1) Initialization.
Assuming that no prior information is available, pick the synaptic weights
from a uniform distribution whose mean is zero and whose variance is
chosen to make the standard deviation of the activation signals lie at the transition between the linear and saturated parts of the sigmoidal activation function.
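A minimal sketch of such an initialization. The choice of making the standard deviation scale as $1/\sqrt{m}$, where m is the number of inputs per neuron, is a common heuristic and an assumption here, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10                               # inputs per neuron (incl. bias)
sigma = 1.0 / np.sqrt(m)             # heuristic: keeps activations near the
                                     # linear part of the sigmoid
half_width = np.sqrt(3.0) * sigma    # uniform on [-h, h] has std h/sqrt(3)
W = rng.uniform(-half_width, half_width, size=(5, m))  # 5 neurons' weights
print(W.std())                       # should be close to sigma
```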
(2) Presentation of training samples
Present the network with an epoch of training samples. For each sample in
the set, perform the sequence of forward and backward computations.
(3) Forward computation
Let a training sample in the epoch be denoted by {x(n),d(n)}, with the x(n)
applied to the input layer and the desired response d(n) presented to the
output layer. Compute the activation signals and function signals of the
network by proceeding through the network, layer by layer.
(4) Backward computation
Compute the local gradients of the network, and adjust the synaptic weights of the network accordingly.
(5) Iteration
Iterate the forward and backward computations in steps (3)-(4) by presenting new epochs of training samples to the network until the stopping criterion is met.
Note that the order of presentation of the training samples should be randomized from epoch to epoch.
The stopping criterion can be a maximum number of iterations, or the rate of change of the average error becoming sufficiently small.