Introduction to Machine Learning
Neural Networks

Varun Chandola

March 8, 2019

Outline

• Why not work with thresholded perceptron?
• How to learn non-linear surfaces?
• How to generalize to multiple outputs, numeric output?

Contents

1 Extending Perceptrons
2 Multi Layered Perceptrons
  2.1 Generalizing to Multiple Labels
  2.2 Properties of Sigmoid Function
  2.3 Motivation for Using Non-linear Surfaces
3 Feed Forward Neural Networks
4 Backpropagation
  4.1 Derivation of the Backpropagation Rules
5 Final Algorithm
6 Wrapping up Neural Networks
7 Bias Variance Tradeoff

1 Extending Perceptrons

[Figure: a single-layer perceptron with an input layer (x1 through x5) feeding one output node.]

• Questions?
  – Why not work with thresholded perceptron?
    ∗ Not differentiable
  – How to learn non-linear surfaces?
  – How to generalize to multiple outputs, numeric output?

The reason we do not use the thresholded perceptron is that the objective function is not differentiable. To understand this, recall that to compute the gradient for perceptron learning we compute the partial derivative of the objective function with respect to every component of the weight vector:

  ∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_j (y_j − w^T x_j)^2 ]

Now if we use the thresholded perceptron, we need to replace w^T x_j with o in the above equation, where o is −1 if w^T x_j < 0 and 1 otherwise. Given that o is not smooth, the function is not differentiable. Hence we work with the unthresholded perceptron unit.

2 Multi Layered Perceptrons

2.1 Generalizing to Multiple Labels

• Distinguishing between multiple categories
• Solution: Add another layer - Multi Layer Neural Networks
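The differentiability argument above can be checked numerically for the unthresholded unit: the analytic gradient of the squared error matches a finite-difference estimate, while the thresholded unit's output is piecewise constant in w and admits no such gradient. A sketch with made-up data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # 5 examples, 3 features (illustrative)
y = rng.normal(size=5)
w = rng.normal(size=3)

def error(w):
    # E(w) = 1/2 * sum_j (y_j - w.x_j)^2, the unthresholded objective
    return 0.5 * np.sum((y - X @ w) ** 2)

# Analytic gradient: dE/dw = -sum_j (y_j - w.x_j) x_j
grad = -X.T @ (y - X @ w)

# Central finite-difference estimate, one coordinate at a time
eps = 1e-6
fd = np.array([(error(w + eps * e) - error(w - eps * e)) / (2 * eps)
               for e in np.eye(3)])
```

The two estimates agree to several decimal places, which is exactly what fails for the thresholded unit.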
[Figure: a multi-layer network with inputs x1–x4 (plus bias x0 = 1), a hidden layer, and output nodes o1–o4.]

Multi-class classification is more applicable than binary classification. Applications include handwritten digit recognition, robotics, etc.

• Linear Unit

• Perceptron Unit

• Sigmoid Unit
  – Smooth, differentiable threshold function

      σ(net) = 1 / (1 + e^(−net))

  – Non-linear output

[Figure: a single sigmoid unit; inputs x0–x4 are combined as net = w^T x and the output is o = σ(net).]

As mentioned earlier, the perceptron unit cannot be used as it is not differentiable. The linear unit is differentiable but only learns linear discriminating surfaces. So to learn non-linear surfaces, we need to use a non-linear unit such as the sigmoid.

2.2 Properties of Sigmoid Function

[Figure: plots of three threshold functions, f_h(x), f_s(x), and f_t(x).]

The threshold output in the case of the sigmoid unit is continuous and smooth, as opposed to a perceptron unit or a linear unit. A useful property of the sigmoid is that its derivative can be easily expressed in terms of the function itself:

  dσ(y)/dy = σ(y)(1 − σ(y))

One can also use e^(−ky) instead of e^(−y), where k controls the "steepness" of the threshold curve.

2.3 Motivation for Using Non-linear Surfaces

The learning problem is to recognize 10 different vowel sounds from the audio input. The raw sound signal is compressed into two features using spectral analysis.
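The derivative identity for the sigmoid stated in Section 2.2 can be verified numerically. This is a sketch assuming NumPy; the k parameter implements the "steepness" variant mentioned above:

```python
import numpy as np

def sigmoid(y, k=1.0):
    # sigma_k(y) = 1 / (1 + e^{-k y}); k = 1 gives the standard sigmoid
    return 1.0 / (1.0 + np.exp(-k * y))

ys = np.linspace(-4.0, 4.0, 9)

# The claimed identity: d sigma / dy = sigma(y) (1 - sigma(y))
analytic = sigmoid(ys) * (1.0 - sigmoid(ys))

# Central finite-difference estimate of the same derivative
eps = 1e-6
numeric = (sigmoid(ys + eps) - sigmoid(ys - eps)) / (2 * eps)
```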
3 Feed Forward Neural Networks

[Figure: a feed forward network with inputs x1–x4 (plus bias x0 = 1), a hidden layer, and output nodes o1–o4.]

• d + 1 input nodes (including bias)

• m hidden nodes

• k output nodes

• At hidden nodes: w_j, 1 ≤ j ≤ m, w_j ∈ R^(d+1)

• At output nodes: w_l, 1 ≤ l ≤ k, w_l ∈ R^(m+1)

The multi-layer neural network shown above is used in a feed forward mode, i.e., information only flows in one direction (forward). Each hidden node "collects" the inputs from all input nodes, computes a weighted sum of the inputs, and then applies the sigmoid function to the weighted sum. The output of each hidden node is forwarded to every output node. Each output node "collects" the inputs (from hidden layer nodes), computes a weighted sum of its inputs, and then applies the sigmoid function to obtain the final output. The class corresponding to the output node with the largest output value is assigned as the predicted class for the input. For implementation, one can represent the weights as two matrices, W^(1) of size m × (d + 1) and W^(2) of size k × (m + 1).

4 Backpropagation

• Assume that the network structure is predetermined (number of hidden nodes and interconnections)

• Objective function for N training examples:

  J = Σ_{i=1}^N J_i = (1/2) Σ_{i=1}^N Σ_{l=1}^k (y_il − o_il)^2

• y_il - target value associated with the l-th class for input x_i

• y_il = 1 when l is the true class for x_i, and 0 otherwise

• o_il - predicted output value at the l-th output node for x_i

What are we learning?
Weight vectors for all output and hidden nodes that minimize J.

The first question that comes to mind is: why not use a standard gradient descent based minimization like the one we saw in single perceptron unit learning? The reason is that the output at every output node (o_l) is directly dependent on the weights associated with the output nodes but not on the weights at the hidden nodes. The input values are "used" by the hidden nodes and are not "visible" to the output nodes. Hence, to learn all the weights simultaneously, direct minimization is not possible; methods such as backpropagation need to be employed.

1. Initialize all weights to small values

2. For each training example ⟨x, y⟩:
   (a) Propagate the input forward through the network
   (b) Propagate the errors backward through the network

Gradient Descent

• Move in the opposite direction of the gradient of the objective function: −η∇J

  ∇J = Σ_{i=1}^N ∇J_i
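The forward pass of Section 3, with the weights stored as the two matrices W^(1) and W^(2), can be sketched as follows. The dimensions and names (d, m, k, forward) are illustrative choices, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    # W1 has shape (m, d+1); W2 has shape (k, m+1)
    u = np.concatenate(([1.0], x))    # prepend the bias input x0 = 1
    z = sigmoid(W1 @ u)               # hidden-node outputs, shape (m,)
    uz = np.concatenate(([1.0], z))   # bias input for the output layer
    o = sigmoid(W2 @ uz)              # final outputs, shape (k,)
    return z, o

rng = np.random.default_rng(1)
d, m, k = 4, 3, 2
W1 = rng.normal(scale=0.1, size=(m, d + 1))
W2 = rng.normal(scale=0.1, size=(k, m + 1))

z, o = forward(rng.normal(size=d), W1, W2)
pred = int(np.argmax(o))              # predicted class = largest output
```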
• What is the gradient computed with respect to?
  – Weights - m weight vectors at the hidden nodes and k at the output nodes
  – w_j (j = 1 . . . m)
  – w_l (l = 1 . . . k)

• w_j ← w_j − η ∂J/∂w_j = w_j − η Σ_{i=1}^N ∂J_i/∂w_j

• w_l ← w_l − η ∂J/∂w_l = w_l − η Σ_{i=1}^N ∂J_i/∂w_l

  ∇J_i = [ ∂J_i/∂w_1, ∂J_i/∂w_2, . . . , ∂J_i/∂w_{m+k} ]^T

  ∂J_i/∂w_r = [ ∂J_i/∂w_r1, ∂J_i/∂w_r2, . . . ]^T

• Need to compute ∂J_i/∂w_rq

• Update rule for the q-th entry in the r-th weight vector:

  w_rq ← w_rq − η ∂J/∂w_rq = w_rq − η Σ_{i=1}^N ∂J_i/∂w_rq

4.1 Derivation of the Backpropagation Rules

Assume that we have only one training example, i.e., J = J_i. We drop the subscript i from here onwards.

• Consider any weight w_rq

• Let u_rq be the q-th element of the input vector coming in to the r-th unit.

Observation 1
Weight w_rq is connected to J through net_r = Σ_q w_rq u_rq.

  ∂J/∂w_rq = (∂J/∂net_r)(∂net_r/∂w_rq) = (∂J/∂net_r) u_rq

Observation 2
net_l for an output node is connected to J only through the output value of the node (o_l):

  ∂J/∂net_l = (∂J/∂o_l)(∂o_l/∂net_l)

The first term above can be computed as:

  ∂J/∂o_l = ∂/∂o_l [ (1/2) Σ_{l'=1}^k (y_l' − o_l')^2 ]

The entries in the summation on the right hand side are non-zero only for l' = l. This results in:

  ∂J/∂o_l = ∂/∂o_l [ (1/2)(y_l − o_l)^2 ] = −(y_l − o_l)

Moreover, the second term in the chain rule above can be computed as:

  ∂o_l/∂net_l = ∂σ(net_l)/∂net_l = o_l(1 − o_l)

The last result arises from the fact that o_l is a sigmoid function of net_l. Using the above results, one can compute the following:

  ∂J/∂net_l = −(y_l − o_l) o_l (1 − o_l)

Let

  δ_l = (y_l − o_l) o_l (1 − o_l)

Therefore,

  ∂J/∂net_l = −δ_l
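The identity ∂J/∂net_l = −δ_l derived above can be checked with a finite difference on the net inputs of the output nodes. The target and net values below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([1.0, 0.0, 0.0])        # one-hot targets for k = 3 outputs
net = np.array([0.3, -0.2, 0.8])     # net inputs of the output nodes

def J(net):
    o = sigmoid(net)
    return 0.5 * np.sum((y - o) ** 2)

o = sigmoid(net)
delta = (y - o) * o * (1 - o)        # delta_l for every output node

# Finite-difference estimate of dJ/dnet_l, one node at a time
eps = 1e-6
fd = np.array([(J(net + eps * e) - J(net - eps * e)) / (2 * eps)
               for e in np.eye(3)])
```

The finite-difference column matches −δ, confirming the derivation.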
Finally we can compute the partial derivative of the error with respect to the weight w_lj as:

  ∂J/∂w_lj = −δ_l u_lj

Update Rule for Output Units

  w_lj ← w_lj + η δ_l u_lj

where δ_l = (y_l − o_l) o_l (1 − o_l).

• Question: What is u_lj for the l-th output node?

• u_lj is the j-th input to the l-th output node, which is the output coming from the j-th hidden node.

Observation 3
net_j for a hidden node is connected to J through all the output nodes:

  ∂J/∂net_j = Σ_{l=1}^k (∂J/∂net_l)(∂net_l/∂net_j)

Remember that we have already computed the first term on the right hand side for the output nodes, ∂J/∂net_l = −δ_l, where δ_l = (y_l − o_l) o_l (1 − o_l). Let z_j denote the output of the j-th hidden node. This result gives us:

  ∂J/∂net_j = Σ_{l=1}^k −δ_l (∂net_l/∂net_j)
            = Σ_{l=1}^k −δ_l (∂net_l/∂z_j)(∂z_j/∂net_j)
            = Σ_{l=1}^k −δ_l w_lj (∂z_j/∂net_j)
            = Σ_{l=1}^k −δ_l w_lj z_j(1 − z_j)
            = −z_j(1 − z_j) Σ_{l=1}^k δ_l w_lj

Thus, the gradient becomes:

  ∂J/∂w_jp = (∂J/∂net_j) u_jp
           = −z_j(1 − z_j) ( Σ_{l=1}^k δ_l w_lj ) u_jp
           = −δ_j u_jp

Update Rule for Hidden Units

  w_jp ← w_jp + η δ_j u_jp

where

  δ_j = z_j(1 − z_j) Σ_{l=1}^k δ_l w_lj
  δ_l = (y_l − o_l) o_l (1 − o_l)

• Question: What is u_jp for the j-th hidden node?

• u_jp is the p-th input to the j-th hidden node, which is the p-th attribute value of the input, i.e., x_p.

5 Final Algorithm

• While not converged:
  – Move forward to compute the outputs at the hidden and output nodes
  – Move backward to propagate the errors back
    ∗ Compute the δ errors at the output nodes (δ_l)
    ∗ Compute the δ errors at the hidden nodes (δ_j)
  – Update all weights according to the weight update equations
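The final algorithm above can be sketched end-to-end for a single training example. The following toy implementation combines the forward pass with both update rules; the names (train_step, eta) and the data are my own illustrative choices, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, W1, W2, eta=0.5):
    # Forward pass
    u = np.concatenate(([1.0], x))           # inputs with bias x0 = 1
    z = sigmoid(W1 @ u)                      # hidden outputs z_j
    uz = np.concatenate(([1.0], z))          # hidden outputs with bias
    o = sigmoid(W2 @ uz)                     # final outputs o_l
    # Backward pass
    delta_l = (y - o) * o * (1 - o)          # output deltas
    delta_j = z * (1 - z) * (W2[:, 1:].T @ delta_l)  # hidden deltas
    # Updates: w <- w + eta * delta * input (in place)
    W2 += eta * np.outer(delta_l, uz)
    W1 += eta * np.outer(delta_j, u)
    return 0.5 * np.sum((y - o) ** 2)        # pre-update loss J

rng = np.random.default_rng(2)
W1 = rng.normal(scale=0.1, size=(3, 3))      # m = 3 hidden, d = 2 inputs
W2 = rng.normal(scale=0.1, size=(2, 4))      # k = 2 outputs
x, y = np.array([0.5, -1.0]), np.array([1.0, 0.0])

losses = [train_step(x, y, W1, W2) for _ in range(200)]
```

Note that the bias column of W^(2) is skipped when backpropagating the hidden deltas, since the bias input is not produced by any hidden node.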
6 Wrapping up Neural Networks

• Error function contains many local minima

• No guarantee of convergence
  – Not a "big" issue in practical deployments

• Improving backpropagation
  – Adding momentum
  – Using stochastic gradient descent
  – Training multiple times using different initializations

Adding momentum to the learning process refers to adding an "inertia" term which tries to keep the current update to a weight similar to the one applied in the previous round.

7 Bias Variance Tradeoff

• Neural networks are universal function approximators
  – By making the model more complex (increasing the number of hidden layers or m) one can lower the error

• Is the model with the least training error the best model?
  – The simple answer is no!
  – Risk of overfitting (chasing the data)
  – Overfitting ⇒ high generalization error

High Variance - Low Bias

• "Chases the data"
• Model parameters change significantly when the training data is changed, hence the term high variance
• Very low training error
• Poor performance on unseen data

Low Variance - High Bias

• Less sensitive to the training data
• Higher training error
• Better performance on unseen data

• General rule of thumb – if two models are giving similar training error, choose the simpler model

• What is simple for a neural network?

• Low weights in the weight matrices?
  – Why?
  – The simple answer is that if the weights in the weight vectors at each node are high, the resulting discriminating surface learnt by the neural network will be highly non-linear. If the weights are smaller, the surface will be smoother (and hence simpler).

• Penalize solutions in which the weights are high

• This can be done by introducing a penalty term in the objective function – Regularization

Regularization for Backpropagation

  J̃ = J + (λ / 2n) [ Σ_{j=1}^m Σ_{i=1}^{d+1} (w_ji^(1))^2 + Σ_{l=1}^k Σ_{j=1}^{m+1} (w_lj^(2))^2 ]
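The regularized objective above adds a "weight decay" term (λ/n) w to the gradient of every weight. A small sketch, where λ (lam) and the shapes are illustrative, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, k, n, lam = 3, 4, 2, 50, 0.1
W1 = rng.normal(size=(m, d + 1))
W2 = rng.normal(size=(k, m + 1))

def penalty(W1, W2, lam, n):
    # (lambda / 2n) * (sum of all squared weights in both matrices)
    return (lam / (2.0 * n)) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

# Gradient of the penalty term with respect to each weight matrix
grad_W1 = (lam / n) * W1
grad_W2 = (lam / n) * W2

# Finite-difference check on a single entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
fd = (penalty(W1p, W2, lam, n) - penalty(W1, W2, lam, n)) / eps
```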
Other Extensions?
• Use a different loss function (why?)
  – Quadratic (Squared), Cross-entropy, Exponential, KL Divergence, etc.

• Use a different activation function (why?)
  – Sigmoid

      f(z) = 1 / (1 + exp(−z))

  – Tanh

      f(z) = (e^z − e^(−z)) / (e^z + e^(−z))

  – Rectified Linear Unit (ReLU)

      f(z) = max(0, z)
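For reference, the three activation functions listed above as runnable one-liners (a sketch assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # written out as in the notes; equivalent to np.tanh
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

zs = np.array([-2.0, 0.0, 2.0])
```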