Lecture15 NeuronNetworks
Lecture15 NeuronNetworks
1
Artificial neural networks
• Neuron
• Has an input/output (I/O) characteristic
• Implements a local computation
2
Artificial neural networks
3
Applications of ANNs
• Image processing and computer vision
• E.g., image matching, preprocessing, segmentation and analysis, computer
vision, image compression, stereo vision, and processing and understanding of
time-varying images
• Signal processing
• E.g., seismic signal analysis and morphology
• Pattern recognition
• E.g., feature extraction, radar signal classification and analysis, speech
recognition and understanding, fingerprint identification, character recognition,
face recognition, and handwriting analysis
• Medicine
• E.g., electrocardiographic signal analysis and understanding, diagnosis of various
diseases, and medical image processing
4
Applications of ANNs
• Military systems
• E.g., undersea mine detection, radar clutter classification, and tactical speaker
recognition
• Financial systems
• E.g., stock market analysis, real estate appraisal, credit card authorization, and
securities trading
• Planning, control, and search
• E.g., parallel implementation of constraint satisfaction problems, solutions to
Traveling Salesman, and control and robotics
• Power systems
• E.g., system state estimation, transient detection and classification, fault detection
and recovery, load forecasting, and security assessment
• ...
5
Structure and operation of a neuron
• The input signals to the
neuron (xi, i = 1..m) x0=1
• Each input xi associates
with a weight wi x1 w0
w1
• The bias w0 (with the input x2 Output
x0=1) …
w2
of the
• Net input is an integration wm neuron
xm (Out)
function of the inputs –
Net(w,x)
• Activation (transfer)
function computes the
Inputs Net Activation
output of the neuron –
f(Net(w,x)) to the input (transfer)
neuron (Net) function
• Output of the neuron: (x) (f)
Out=f(Net(w,x))
6
Net input and The bias
• The net input is typically computed using a linear function
m m
Net = w0 + w1 x1 + w2 x2 + ... + wm xm = w0 .1 + wi xi = wi xi
i =1 i =0
• The importance of the bias (w0)
→ The family of separation functions Net=w1x1 cannot separate the
instances into two classes
→ The family of functions Net=w1x1+w0 can
Net Net
Net = w1x1
Net = w1x1 + w0
x1 x1
7
Activation function – Hard-limiter
• Also called the threshold function 1, if Net
Out ( Net ) = hl1( Net , ) =
• The output of the hard-limiter is either 0, if otherwise
of the two values
• is the threshold value
• Disadvantage: neither continuous nor Out ( Net ) = hl 2( Net , ) = sign( Net , )
continuously differentiable
Binary
Out
Bipolar
Out
hard-limiter
hard-limiter
1
1
0 Net 0 Net
-1
8
Activation function – Threshold logic
0, if Net −
1
Out ( Net ) = tl ( Net , , ) = ( Net + ), if − Net −
1, if 1
Net −
(α >0)
= max(0, min(1, ( Net + ))) Out
9
Activation function – Sigmoidal
1
Out ( Net ) = sf ( Net , , ) =
1 + e − ( Net + )
10
Activation function – Hyperbolic tangent
1 − e − ( Net + ) 2
Out ( Net ) = tanh( Net , , ) = − ( Net + )
= − ( Net + )
−1
1+ e 1+ e
11 11
Network structure
bias
◼ Topology of an ANN is composed by:
❑ The number of input signals and input
output signals
❑ The number of layers hidden
❑ The number of neurons in each layer layer
❑ The number of weights in each neuron
output
❑ The way the weights are linked layer
together within or between the layer(s)
output
❑ Which neurons receive the (error)
correction signals • An ANN with one hidden layer
◼ Every ANN must have • Input space: 3-dimensional
❑ exactly one input layer • Output space: 2-dimensional
❑ exactly one output layer • In total, there are 6 neurons
❑ zero, one, or more than one hidden - 4 in the hidden layer
12
Network structure
• A layer is a group of neurons
• A hidden layer is any layer between the input and the output layers
• Hidden nodes do not directly interact with the external environment
• An ANN is said to be fully connected if every output from one layer is
connected to every node in the next layer
• An ANN is called feed-forward network if no node output is an input to a
node in the same layer or in a preceding layer
• When node outputs can be directed back as inputs to a node in the same (or
a preceding) layer, it is a feedback network
• If the feedback is directed back as input to the nodes in the same layer, then it is
called lateral feedback
• Feedback networks that have closed loops are called recurrent networks
13
Network structure – Example
single layer single node with
feed-forward feedback to itself
network
single layer
recurrent
network
multilayer
feed-forward
network
multilayer
recurrent
network
14
Learning rules
• Two kinds of learning in neural networks
• Parameter learning
→ Focus on the update of the connecting weights in an ANN
• Structure learning
→ Focus on the change of the network structure, including the number of
processing elements and their connection types
15
General weight learning rule
• At a learning step (t) the adjustment
of the weight vector w is x0= 1
w0
proportional to the product of the
x1 a neuron
learning signal r(t) and the input x(t) w1
Out
w(t) ~ r(t).x(t) x ... wj
w
xj
w(t) = .r(t).x(t) ...
wm x Learning d
where (>0) is the learning rate xm
signal
generator
• The learning signal r is a function of
w, x, and the desired output d
r = g(w,x,d) Note that xj can be either:
• The general weight learning rule • an (external) input signal, or
• an output from another neuron
w(t) = .g(w(t),x(t),d(t)).x(t)
16
Perceptron
• A perceptron is the
simplest type of ANNs x0=1
x1 w0
• Use the hard-limit w1
activation function x2 Out
…
w2
m
Out = sign(Net ( w, x) ) = sign w j x j xm
wm
j =0
• For an instance x, the
perceptron output is
• 1, if Net(w,x)>0
• -1, otherwise
17
Perceptron – Illustration
Output=1
Output=-1
x2
18
Perceptron – Learning
• Given a training set D= {(x,d)}
• x is the input vector
• d is the desired output value (i.e., -1 or 1)
• The perceptron learning is to determine a weight vector that makes the
perceptron produce the correct output (-1 or 1) for every training
instance
• If a training instance x is correctly classified, then no update is needed
• If d=1 but the perceptron outputs -1, then the weight w should be
updated so that Net(w,x) is increased
• If d=-1 but the perceptron outputs 1, then the weight w should be
updated so that Net(w,x) is decreased
19
Perceptron_incremental(D, η)
Initialize w (wi ← an initial (small) random value)
do
for each training instance (x,d)D
Compute the real output value Out
if (Outd)
w ← w + η(d-Out)x
end for
until all the training instances in D are correctly classified
return w
20 20
Perceptron_batch(D, η)
Initialize w (wi ← an initial (small) random value)
do
∆w ← 0
for each training instance (x,d)D
Compute the real output value Out
if (Outd)
∆w ← ∆w + η(d-Out)x
end for
w ← w + ∆w
until all the training instances in D are correctly classified
return w
21 21
Perceptron - Limitation
• The perceptron learning procedure is proven
to converge if
A perceptron cannot correctly
• The training instances are linearly separable
classify this training set!
• With a sufficiently small η used
• The perceptron may not converge if the
training instances are not linearly separable
• We need to use the delta rule
• Converges toward a best-fit approximation of
the target function
• The delta rule uses gradient descent to
search the hypothesis space (of possible
weight vectors) to find the weight vector that
best fits the training instances
22
Error (cost) function
• Let’s consider an ANN that has n output neurons
• Given a training instance (x,d), the training error made by the
currently estimated weights vector w:
n 2
Ex (w ) = (d i − Outi )
1
2 i =1
• The training error made by the currently estimated weights
vector w over the entire training set D:
1
ED (w ) =
D
E (w )
xD
x
23
Gradient descent
• Gradient of E (denoted as E) is a vector
• The direction points most uphill
• The length is proportional to steepness of hill
• Hence, the direction that produces the steepest decrease is the negative of
the gradient of E
w = -.E(w); E
wi = − , i = 1..N
wi must be
• Requirement: The activation functions used in the network
continuous functions of the weights, differentiable everywhere
24
Gradient descent – Illustration
One-dimensional Two-dimensional
E(w) E(w1,w2)
25
Gradient_descent_incremental (D, η)
Initialize w (wi ← an initial (small) random value)
do
for each training instance (x,d)D
Compute the network output
for each weight component wi
wi ← wi – η(∂Ex/∂wi)
end for
end for
until (stopping criterion satisfied)
return w
26 26
Multi-layer NNs and Back-propagation alg.
• As we have seen, a perceptron can only express a linear decision
surface
• A multi-layer NN learned by the back-propagation (BP) algorithm
can represent highly non-linear decision surfaces
• The BP learning algorithm is used to learn the weights of a multi-
layer NN
• Fixed structure (i.e., fixed set of neurons and interconnections)
• For every neuron the activation function must be continuously
differentiable
• The BP algorithm employs gradient descent in the weight update rule
• To minimize the error between the actual output values and the desired
output ones, given the training instances
27
Back-propagation algorithm (1)
• Back-propagation algorithm searches for the weights vector
that minimizes the total error made over the training set
• Back-propagation consists of the two phases
• Signal forward phase. The input signals (i.e., the input vector) are
propagated (forwards) from the input layer to the output layer (through
the hidden layers)
• Error backward phase
• Since the desired output value for the current input vector is known,
the error is computed
• Starting at the output layer, the error is propagated backwards through
the network, layer by layer, to the input layer
• The error back-propagation is performed by recursively computing the
local gradient of each neuron
28
Back-propagation algorithm (2)
29
Derivation of BP alg. – Network structure
• Let’s use this 3-layer NN to
illustrate the details of the BP
learning algorithm Input xj x1 ... xj ... xm
(j=1..m)
• m input signals xj (j=1..m) wqj
• l hidden neurons zq (q=1..l)
Hidden
• n output neurons yi (i=1..n) ... ...
neuron zq Outq
• wqj is the weight of the (q=1..l)
interconnection from input signal wiq
xj to hidden neuron zq
• wiq is the weight of the Output ... ...
interconnection from hidden neuron yi
neuron zq to output neuron yi (i=1..n) Outi
• Outq is the (local) output value of
hidden neuron zq
• Outi is the network output w.r.t.
the output neuron yi
30
BP algorithm – Forward phase (1)
• For each training instance x
• The input vector x is propagated from the input layer to the output layer
• The network produces an actual output Out (i.e., a vector of Outi, i=1..n)
31
BP algorithm – Forward phase (2)
• The net input for a neuron yi in the output layer is
l l m
Neti = wiq Out q = wiq f wqj x j
q =1 q =1 j =1
• Neuron yi produces the output value (i.e., an output of the
network)
l l m
Outi = f ( Neti ) = f wiq Out q = f wiq f wqj x j
q =1
q =1 j =1
• The vector of output values Outi (i=1..n) is the actual network
output, given the input vector x
32
BP algorithm – Backward phase (1)
• For each training instance x
• The error signals resulting from the difference between the desired output
d and the actual output Out are computed
• The error signals are back-propagated from the output layer to the
previous layers to update the weights
1 n 1 n
E (w) = (d i − Outi ) = d i − f ( Neti )
2 2
2 i =1 2 i =1
2
1 n l
= d i − f wiq Out q
2 i =1 q =1
33
BP algorithm – Backward phase (2)
• According to the gradient-descent method, the weights in the hidden-
to-output connections are updated by
E
wiq = −
wiq
• Using the derivative chain rule for E/wiq, we have
E Outi Neti
wiq = −
= d i − Outi f ' (Neti ) Out q = i Out q
Outi Neti wiq
(note that the negative sign is incorporated in E/Outi)
• i is the error signal of neuron yi in the output layer
E E Out
i = − = − = d i − Outi f ' ( Neti )
i
Neti Out Netyi in the output layer, and
where Neti is the net input to ineuron i
f'(Neti)=f(Neti)/Neti
34
BP algorithm – Backward phase (3)
• To update the weights of the input-to-hidden connections, we
also follow gradient-descent method and the derivative chain
rule
E E Out q Net q
wqj = − = −
wqj Out q
Net q
wqj
• From the equation of the error function E(w), it is clear that
each error term (di-yi) (i=1..n) is a function of Outq
2
1 n l
E (w ) = d i − f wiq Out q
2 i =1 q =1
35
BP algorithm – Backward phase (4)
• Evaluating the derivative chain rule, we have
wqj = (d i − Outi ) f ' ( Neti ) wiq f ' (Net q ) x j
n
i =1
= i wiq f ' (Net q ) x j = q x j
n
i =1
where Netq is the net input to neuron zq in the hidden layer, and
f'(Netq)=f(Netq)/Netq
36
BP algorithm – Backward phase (5)
• According to the error equations i and q above, the error signal of
a neuron in a hidden layer is different from the error signal of a
neuron in the output layer
• Because of this difference, the derived weight update procedure is
called the generalized delta learning rule
• The error signal q of a hidden neuron zq can be determined
• in terms of the error signals i of the neurons yi (i.e., that zq connects to)
in the output layer
• with the coefficients are just the weights wiq
• The important feature of the BP algorithm: the weights update
rule is local
• To compute the weight change for a given connection, we need only the
quantities available at both ends of that connection!
37
BP algorithm – Backward phase (6)
• The discussed derivation can be easily extended to the network
with more than one hidden layer by using the chain rule
continuously
• The general form of the BP update rule is
wab = axb
• b and a refer to the two ends of the (b→a) connection (i.e., from neuron
(or input signal) b to neuron a)
• xb is the output of the hidden neuron (or the input signal) b,
• a is the error signal of neuron a
38
Back_propagation_incremental(D, η)
A network with Q feed-forward layers, q = 1,2,...,Q
qNet and qOuti are the net input and output of the ith neuron in the qth layer
i
39
Step 3 (Output error measure)
Compute the error and error signals Qi for every neuron in the output layer
1 n
E = E + (d i( k ) − QOuti ) 2
2 i =1
Q
δi = (di(k) −QOuti )f '( QNeti )
Step 4 (Error back-propagation)
Propagate the error backward to update the weights and compute the error
signals q-1i for the preceding layers
qwij = .(qi).(q-1Outj); qw
ij = qwij + qwij
q −1
δi = f '( q −1Neti ) q w ji q δ j ; for all q = Q, Q − 1,...,2
j
Step 5 (One epoch check)
Check whether the entire training set has been exploited (i.e., one epoch)
If the entire training set has been exploited, then go to step 6; otherwise, go to step 1
Step 6 (Total error check)
If the current total error is acceptable (E<Ethreshold) then the training process terminates
and output the final weights;
Otherwise, reset E=0, and initiate the new training epoch by going to step 1
40
BP illustration – Forward phase (1)
f(Net1)
x1 f(Net4)
Out6
f(Net6)
f(Net2)
x2 f(Net5)
f(Net3)
41
BP illustration – Forward phase (2)
f(Net1)
w1x1 x1
x1 w1x2 x2 f(Net4)
Out6
f(Net6)
f(Net2)
x2 f(Net5)
42
BP illustration – Forward phase (3)
f(Net1)
x1 f(Net4)
w2 x1 x1
Out6
f(Net6)
f(Net2)
w2 x2 x2
x2 f(Net5)
f(Net3)
Out2 = f (w2 x1 x1 + w2 x2 x2 )
43
BP illustration – Forward phase (4)
f(Net1)
x1 f(Net4)
Out6
f(Net6)
f(Net2)
x2 w3 x1 x1 f(Net5)
w3 x2 x2 f(Net3)
Out3 = f ( w3 x1 x1 + w3 x2 x2 )
44
BP illustration – Forward phase (5)
f(Net1)
w41Out1
x1 w42 Out 2 f(Net4)
Out6
f(Net2)
w43 Out3 f(Net6)
x2 f(Net5)
f(Net3)
Out 4 = f ( w41Out1 + w42 Out 2 + w43Out3 )
45
BP illustration – Forward phase (6)
f(Net1)
x1 w51Out1 f(Net4)
Out6
f(Net6)
f(Net2)
w52 Out 2
x2 f(Net5)
w53 Out3
f(Net3)
Out5 = f ( w51Out1 + w52 Out 2 + w53Out3 )
46
BP illustration – Forward phase (7)
f(Net1)
x1 f(Net4)
w 64 Out4
Out6
f(Net6)
f(Net2)
w65 Out5
x2 f(Net5)
f(Net3)
Out6 = f ( w64 Out 4 + w65 Out5 )
47
BP illustration – Compute the error
f(Net1)
x1
6
f(Net4)
Out6
f(Net6)
f(Net2)
x2 f(Net5)
d is the desired
output value
f(Net3) E Out6
E
6 = − = − = d − Out6 f ' (Net 6 )
Net 6 Out6 Net6
48
BP illustration – Backward phase (1)
f(Net1)
4
x1 f(Net4)
w64 6
Out6
f(Net6)
f(Net2)
x2 f(Net5)
f(Net3)
δ4 = f ' (Net 4 )(w64 δ6 )
49
BP illustration – Backward phase (2)
f(Net1)
x1 f(Net4)
6
Out6
f(Net6)
5
f(Net2)
w65
x2 f(Net5)
f(Net3)
δ5 = f '(Net 5 )(w65 δ6 )
50
BP illustration – Backward phase (3)
1
f(Net1)
w41 4
x1 w51 f(Net4)
Out6
f(Net6)
5
f(Net2)
x2 f(Net5)
51
BP illustration – Backward phase (4)
4
f(Net1)
x1 w42 f(Net4)
2
Out6
f(Net6)
5
f(Net2) w52
x2 f(Net5)
f(Net3)
δ2 = f '(Net 2 )(w42 δ4 + w52 δ5 )
52
BP illustration – Backward phase (5)
4
f(Net1)
x1 f(Net4)
Out6
f(Net6)
f(Net2) w43
5
x2 w53 f(Net5)
3
f(Net3) δ3 = f '(Net 3 )(w43 δ4 + w53 δ5 )
53
BP illustration – Weight update (1)
1
w1x1 f(Net1)
x1 w1x2 f(Net4)
Out6
f(Net6)
f(Net2)
x2 f(Net5)
x1 f(Net4)
w2 x1 2 Out6
f(Net6)
f(Net2)
w2x2
x2 f(Net5)
w2 x1 = w2 x1 + 2 x1
f(Net3)
w2 x2 = w2 x2 + 2 x2
55
BP illustration – Weight update (3)
f(Net1)
x1 f(Net4)
Out6
f(Net6)
f(Net2)
x2 w3x1
3
f(Net5)
w3x2 w3 x1 = w3 x1 + 3 x1
f(Net3)
w3 x2 = w3 x2 + 3 x2
56
BP illustration – Weight update (4)
f(Net1)
w41 4
x1 w42 f(Net4)
Out6
w43 f(Net6)
f(Net2)
x2 f(Net5)
57
BP illustration – Weight update (5)
f(Net1)
x1 f(Net4)
Out6
f(Net6)
5
f(Net2)
w 51
w52
x2 f(Net5)
58
BP illustration – Weight update (6)
f(Net1)
x1 f(Net4)
w64 6
Out6
f(Net6)
f(Net2)
w65
x2 f(Net5)
f(Net3)
w64 = w64 + ηδ6Out4
w65 = w65 + ηδ6Out5
59
Advantages vs. Disadvantages
• Advantages
• Massively parallel in nature
• Fault (noise) tolerant because of parallelism
• Can be designed to be adaptive
• Disadvantages
• No clear rules or design guidelines for arbitrary applications
• No general way to assess the internal operation of the network
(therefore, an ANN system is seen as a “black-box”)
• Difficult to predict future network performance (generalization)
60
When using ANNs?
• Input is high-dimensional discrete or real-valued
• The target function is real-valued, discrete-valued or vector-
valued
• Possibly noisy data
• The form of the target function is unknown
• Human readability of result is not (very) important
• Long training time is accepted
• Short classification/prediction time is required
61