Deep Learning
BITS Pilani
Pilani Campus
Deep Neural Network
Disclaimer and Acknowledgement
• The content for these slides has been obtained from books and various other sources on the Internet.
• I hereby acknowledge all the contributors for their material and inputs.
• I have provided source information wherever necessary.
• I have added and modified the content to suit the requirements of the course.
Session Agenda
• Back Propagation
Training: Go Forwards, then Backwards…
Step 1: Calculate ŷ using the computation graph (forward pass).
Step 2: Determine the cost.
Step 3: Calculate the partial derivatives (using backpropagation).
Step 4: Update each parameter.
Source: Brad Quinton, Scott Chin
Training Neural Networks: Optimizing Parameters
• We are given an architecture, parameterized by its weights 𝐖.
• We are also given training data 𝐷 = {(𝐱ᵢ, 𝑦ᵢ)}.
• We are given a loss function ℒ(𝐷; 𝐖).
• We can use gradient descent to minimize the loss.
• At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.
Algorithm for Gradient Computation –
Backpropagation
• For each parameter 𝑤ᵢ ∈ 𝐖:  𝑤ᵢ(t+1) = 𝑤ᵢ(t) − α · ∂ℒ/∂𝑤ᵢ
• It all comes down to efficiently computing ∂ℒ/∂𝑤ᵢ for every parameter.
• The calculus just gets a bit more complicated for a neural network.
• Depth gives more representational capacity to neural networks.
• However, training deep nets is not trivial.
• The solution is the "Backpropagation" algorithm!
• Backpropagation is a systematic and efficient method to calculate the partial derivatives (i.e., the partial derivative of the cost w.r.t. each parameter). A minimal sketch of the resulting update rule follows the reference below.
Rumelhart, Hinton, Williams, “Learning Representations by Back-Propagating Errors”, 1986
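As a concrete illustration of the update rule above, here is a minimal NumPy sketch of one gradient-descent step, assuming the gradients have already been computed by backpropagation (the parameter names and the grads dictionary are illustrative, not from the slides):

import numpy as np

def gradient_descent_step(params, grads, lr=0.01):
    """Apply w <- w - lr * dL/dw to every parameter.

    params: dict of parameter arrays, e.g. {"W1": ..., "b1": ...}
    grads:  dict of gradients with matching keys
    lr:     learning rate (alpha in the slides)
    """
    for name in params:
        params[name] = params[name] - lr * grads[name]
    return params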
Key Intuitions Required for Backpropagation
1. Gradient Descent
• Change the weights 𝐖 in the direction of the negative gradient to minimize the error function.
2. Chain Rule
• Use the chain rule to calculate the gradients with respect to the intermediate variables and weights.
3. Dynamic Programming (Memoization)
• Gradients at a layer depend on the gradients of the layers above it (computed earlier in the backward pass)!
• So, when computing gradients at each layer, we can reuse the gradients already computed for higher layers when handling lower layers (i.e., memoization). A tiny sketch of this caching idea follows the list.
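To make the chain-rule and caching ideas concrete, here is a minimal sketch (not from the slides) that differentiates the composite f(x) = σ(w·x + b) by reusing the cached forward values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: cache intermediate values for reuse in the backward pass.
w, b, x = 0.8, 0.5, 1.0
z = w * x + b          # cached
a = sigmoid(z)         # cached

# Backward pass: chain rule, reusing the cached a and x.
da_dz = a * (1 - a)    # derivative of the sigmoid
df_dw = da_dz * x      # df/dw = (da/dz) * (dz/dw)
df_db = da_dz * 1.0    # df/db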
The Computation Graph of a Neural Network
• We can represent any neural network in terms of a computation graph.
• The loss function can be computed by moving from left to right.
The corresponding computation graph:
x → z[1] = W[1]x + b[1] → a[1] = σ(z[1]) → z[2] = W[2]a[1] + b[2] → a[2] = σ(z[2]) → ℒ(a[2], y)
Cross-entropy cost function:
Cost(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
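A minimal NumPy sketch of this forward pass and cross-entropy cost (the layer sizes and random initialization are illustrative assumptions, not taken from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_1 = 3, 4                      # assumed layer sizes
x = rng.normal(size=(n_x, 1))        # single input column vector
y = 1.0                              # binary label

W1 = rng.normal(size=(n_1, n_x)); b1 = np.zeros((n_1, 1))
W2 = rng.normal(size=(1, n_1));   b2 = np.zeros((1, 1))

z1 = W1 @ x + b1                     # z[1] = W[1]x + b[1]
a1 = sigmoid(z1)                     # a[1] = sigma(z[1])
z2 = W2 @ a1 + b2                    # z[2] = W[2]a[1] + b[2]
a2 = sigmoid(z2)                     # a[2] = y_hat
loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))  # cross-entropy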
Backward Propagation
(Computation graph as above, with ŷ = a[2].)
a[2] → ℒ
∂ℒ/∂a[2] = ∂/∂a[2] [ −y log a[2] − (1 − y) log(1 − a[2]) ]
         = −y/a[2] + (1 − y)/(1 − a[2])
Backward Propagation
(Computation graph as above.)
z[2] → a[2] → ℒ
a[2] = σ(z[2]), so ∂a[2]/∂z[2] = a[2](1 − a[2])
∂ℒ/∂z[2] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] = a[2] − y
Backward Propagation
(Computation graph as above.)
b[2] → z[2] → a[2] → ℒ
∂ℒ/∂b[2] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂b[2] = ∂ℒ/∂z[2] × 1 = ∂ℒ/∂z[2]
Backward Propagation
(Computation graph as above.)
W[2] → z[2] → a[2] → ℒ
∂ℒ/∂W[2] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂W[2] = ∂ℒ/∂z[2] × a[1]ᵀ
Dimension of dz[2] is (n[2], 1); dimension of a[1] is (n[1], 1), hence the transpose of a[1]; dimension of dW[2] is (n[2], n[1]).
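A quick NumPy shape check of dW[2] = dz[2] · a[1]ᵀ, with illustrative layer sizes n[1] = 4 and n[2] = 1 (assumed, not from the slides):

import numpy as np

n1, n2 = 4, 1                       # assumed layer widths
dz2 = np.ones((n2, 1))              # shape (n[2], 1)
a1 = np.ones((n1, 1))               # shape (n[1], 1)

dW2 = dz2 @ a1.T                    # outer product: shape (n[2], n[1])
assert dW2.shape == (n2, n1)        # matches the dimension of W[2]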
Backward Propagation
(Computation graph as above.)
a[1] → z[2] → a[2] → ℒ
∂ℒ/∂a[1] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂a[1] = W[2]ᵀ × ∂ℒ/∂z[2]
Dimension of dz[2] is (n[2], 1); dimension of W[2] is (n[2], n[1]); dimension of da[1] is (n[1], 1).
But da[1] need not be computed on its own: a[1] is not a parameter, so it is only an intermediate step toward dz[1].
Backward Propagation
(Computation graph as above.)
z[1] → a[1] → z[2] → a[2] → ℒ
a[1] = σ(z[1]), so ∂a[1]/∂z[1] = a[1](1 − a[1])
∂ℒ/∂z[1] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂a[1] × ∂a[1]/∂z[1] = W[2]ᵀ × ∂ℒ/∂z[2] ∗ a[1](1 − a[1])   (∗ denotes the element-wise product)
Dimensions: dz[2] is (n[2], 1); W[2] is (n[2], n[1]); da[1] is (n[1], 1); dz[1] is (n[1], 1).
Backward Propagation
(Computation graph as above.)
b[1] → z[1] → a[1] → z[2] → a[2] → ℒ
∂ℒ/∂b[1] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂a[1] × ∂a[1]/∂z[1] × ∂z[1]/∂b[1] = ∂ℒ/∂z[1] × 1 = ∂ℒ/∂z[1]
Backward Propagation
(Computation graph as above.)
W[1] → z[1] → a[1] → z[2] → a[2] → ℒ
∂ℒ/∂W[1] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂a[1] × ∂a[1]/∂z[1] × ∂z[1]/∂W[1] = ∂ℒ/∂z[1] × xᵀ
Backward Propagation: Summary
For cross-entropy loss and sigmoid activation in the last layer.

One training example:
dz[2] = ∂ℒ/∂z[2] = a[2] − y
db[2] = ∂ℒ/∂b[2] = dz[2]
dW[2] = ∂ℒ/∂W[2] = dz[2] · a[1]ᵀ
dz[1] = ∂ℒ/∂z[1] = W[2]ᵀ dz[2] ∗ g[1]′(z[1])
db[1] = ∂ℒ/∂b[1] = dz[1]
dW[1] = ∂ℒ/∂W[1] = dz[1] · xᵀ

All training examples (m examples stacked as columns):
dZ[2] = ∂ℒ/∂Z[2] = A[2] − Y
db[2] = ∂ℒ/∂b[2] = (1/m) Σ dZ[2]   (sum over the m columns)
dW[2] = ∂ℒ/∂W[2] = (1/m) dZ[2] · A[1]ᵀ
dZ[1] = ∂ℒ/∂Z[1] = W[2]ᵀ dZ[2] ∗ g[1]′(Z[1])
db[1] = ∂ℒ/∂b[1] = (1/m) Σ dZ[1]
dW[1] = ∂ℒ/∂W[1] = (1/m) dZ[1] · Xᵀ
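A minimal NumPy sketch of these summary equations for a batch of m examples stored as columns (layer sizes and the choice g[1] = sigmoid are illustrative assumptions):

import numpy as np

def backward_two_layer(X, Y, W2, A1, A2):
    """Gradients for a 2-layer sigmoid network with cross-entropy loss.

    X: (n_x, m), Y: (1, m), W2: (1, n_1), A1: (n_1, m), A2: (1, m).
    """
    m = X.shape[1]
    dZ2 = A2 - Y                                            # (1, m)
    dW2 = (1.0 / m) * dZ2 @ A1.T                            # (1, n_1)
    db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)    # (1, 1)
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)                      # (n_1, m); g[1]'(Z1) = A1(1-A1)
    dW1 = (1.0 / m) * dZ1 @ X.T                             # (n_1, n_x)
    db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)    # (n_1, 1)
    return dW1, db1, dW2, db2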
Equations for layer l
Input: da[l] (one example) or dA[l] (all examples)
Output: da[l−1], dW[l], db[l] (or dA[l−1], dW[l], db[l])

One training example:
dz[l] = da[l] ∗ g[l]′(z[l])
db[l] = dz[l]
dW[l] = dz[l] · a[l−1]ᵀ
da[l−1] = W[l]ᵀ dz[l]

All training examples (m examples stacked as columns):
dZ[l] = dA[l] ∗ g[l]′(Z[l])
db[l] = (1/m) Σ dZ[l]   (sum over the m columns)
dW[l] = (1/m) dZ[l] · A[l−1]ᵀ
dA[l−1] = W[l]ᵀ dZ[l]
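These per-layer equations translate directly into a reusable backward step; here is a minimal NumPy sketch (the way the cached activation derivative g[l]′(Z[l]) is supplied is an assumption, not the slides' notation):

import numpy as np

def linear_activation_backward(dA, W, A_prev, g_prime_Z):
    """Backward step for one layer, vectorized over m examples.

    dA:        (n_l, m)      gradient of the loss w.r.t. this layer's activations
    W:         (n_l, n_prev) this layer's weight matrix
    A_prev:    (n_prev, m)   activations of the previous layer (X for layer 1)
    g_prime_Z: (n_l, m)      g[l]'(Z[l]) evaluated on the cached Z[l]
    """
    m = dA.shape[1]
    dZ = dA * g_prime_Z                                   # element-wise product
    dW = (1.0 / m) * dZ @ A_prev.T
    db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db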
Scaling up for L layers and all training examples in NN
Forward propagation: takes the input, produces the output, and caches the intermediate values of each layer.
Backward propagation: takes the cached values as input and produces the gradients as output.
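A minimal sketch of how the forward cache feeds the backward pass for L layers (sigmoid on every layer and all variable names here are illustrative assumptions, not the slides' notation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Ws, bs):
    """Forward pass through L layers; cache every activation for the backward pass."""
    A, cache = X, [X]
    for W, b in zip(Ws, bs):
        A = sigmoid(W @ A + b)
        cache.append(A)
    return A, cache

def backward(Y, Ws, cache):
    """Backward pass reusing the cached activations (cross-entropy + sigmoid output)."""
    m = Y.shape[1]
    grads = []
    dZ = cache[-1] - Y                                   # dZ[L] = A[L] - Y
    for l in range(len(Ws) - 1, -1, -1):
        A_prev = cache[l]
        dW = (1.0 / m) * dZ @ A_prev.T
        db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)
        grads.append((dW, db))
        if l > 0:
            dZ = (Ws[l].T @ dZ) * A_prev * (1 - A_prev)  # dZ for the previous layer
    return grads[::-1]                                   # gradients ordered layer 1..L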
Scaling up for L layers and all training examples in NN
Update the parameters.
Scaling up for L layers and all training examples in NN
Example
Calculate all matrix dimensions and total number of parameters
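The network for this example is shown on the slide; as an illustration of the counting itself, here is a small helper applied to an assumed architecture with layer sizes [3, 4, 1] (the sizes are not taken from the slides):

def count_parameters(layer_sizes):
    """Return the total parameter count, printing per-layer W and b shapes."""
    total = 0
    for l in range(1, len(layer_sizes)):
        n_prev, n_curr = layer_sizes[l - 1], layer_sizes[l]
        w_params = n_curr * n_prev        # W[l] has shape (n[l], n[l-1])
        b_params = n_curr                 # b[l] has shape (n[l], 1)
        print(f"Layer {l}: W {n_curr}x{n_prev}, b {n_curr}x1")
        total += w_params + b_params
    return total

print(count_parameters([3, 4, 1]))        # assumed sizes; 3*4+4 + 4*1+1 = 21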
Array broadcasting
(Source: numpy.org)
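A small NumPy illustration of broadcasting as it appears in these networks, where a bias column vector is added to every column of a batch (the shapes are illustrative):

import numpy as np

Z = np.zeros((4, 3))                       # (n[1], m): 4 units, batch of 3 examples
b = np.array([[1.], [2.], [3.], [4.]])     # (n[1], 1) bias column

# b is broadcast across the 3 columns, so every example gets the same bias.
print((Z + b).shape)                       # (4, 3)
print(Z + b)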
Exercise - MSE Loss
Consider a neural network with two inputs x1 and x2 and initial weights w0 = 0.5, w1 = 0.8, w2 = 0.3. Draw the network, compute the output, the mean squared error loss, and the weight updates when the input is (1, 0), the learning rate is 0.01, and the target output is 1. Assume any other relevant information.
(Network: the bias input 1 and the inputs x1, x2 feed a single unit through w0, w1, w2; the unit computes the weighted sum Σ followed by the sigmoid σ to produce ŷ.)
Solution
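A minimal NumPy sketch of this exercise, assuming a sigmoid activation and the loss L = ½(y − ŷ)² (the ½ factor is an assumption the exercise leaves open):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, 0.8, 0.3])          # [w0, w1, w2]
x = np.array([1.0, 1.0, 0.0])          # [bias input 1, x1, x2]
y, lr = 1.0, 0.01

z = w @ x                              # 0.5 + 0.8*1 + 0.3*0 = 1.3
y_hat = sigmoid(z)                     # network output
loss = 0.5 * (y - y_hat) ** 2          # MSE loss (assumed 1/2 factor)

dL_dz = (y_hat - y) * y_hat * (1 - y_hat)   # chain rule through the sigmoid
w_new = w - lr * dL_dz * x                  # gradient step on w0, w1, w2
print(y_hat, loss, w_new)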
Exercise - BCE
Consider a neural network with two inputs x1 and x2 and initial weights w0 = 0.5, w1 = 0.8, w2 = 0.3. Draw the network, compute the output, the binary cross-entropy loss, and the weight updates when the input is (1, 0), the learning rate is 0.01, and the target output is 1. Assume any other relevant information.
(Network: the same single-unit network as in the MSE exercise above.)
Solution
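The same sketch with the binary cross-entropy loss instead of MSE (sigmoid activation assumed):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, 0.8, 0.3])          # [w0, w1, w2]
x = np.array([1.0, 1.0, 0.0])          # [bias input 1, x1, x2]
y, lr = 1.0, 0.01

z = w @ x
y_hat = sigmoid(z)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # BCE loss

dL_dz = y_hat - y                      # BCE + sigmoid simplifies to a - y
w_new = w - lr * dL_dz * x
print(y_hat, loss, w_new)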
Exercise (without vectorization)
Network: three inputs x1, x2, x3; one hidden layer with two sigmoid units (z1[1], a1[1] and z2[1], a2[1]); one sigmoid output unit (z1[2], a1[2] = ŷ).
Weights and biases:
w11[1] = 0.2, w12[1] = 0.4, w13[1] = −0.5
w21[1] = −0.3, w22[1] = 0.1, w23[1] = 0.2
w11[2] = −0.3, w12[2] = −0.2
b1[1] = −0.4, b2[1] = −0.2, b1[2] = 0.1
Learning rate = 0.9
For X = {1, 0, 1} and y = 1:
Find the cross-entropy loss and the weight updates after the 1st iteration.
Continued……
For layer 2:
dz1[2] = ∂ℒ/∂z1[2] = a1[2] − y                        (for the cross-entropy cost function)
dz1[2] = ∂ℒ/∂z1[2] = a1[2](1 − a1[2])(a1[2] − y)      (for the MSE cost function)
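A minimal NumPy sketch of the forward pass and the layer-2 error term for this exercise, using the weights listed above (sigmoid units and the cross-entropy loss assumed):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[0.2, 0.4, -0.5],
               [-0.3, 0.1, 0.2]])       # rows: hidden units; columns: x1, x2, x3
b1 = np.array([[-0.4], [-0.2]])
W2 = np.array([[-0.3, -0.2]])
b2 = np.array([[0.1]])

x = np.array([[1.0], [0.0], [1.0]])
y = 1.0

z1 = W1 @ x + b1                        # hidden pre-activations
a1 = sigmoid(z1)                        # hidden activations
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)                        # output y_hat

loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))   # cross-entropy
dz2 = a2 - y                            # layer-2 error term (CE + sigmoid)
print(a2, loss, dz2)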
Exercise – With vectorization
(Figure: a two-layer network given in vectorized form, with input x = [1, 0]ᵀ, target y = 1, and parameter matrices W[1], b[1], W[2], b[2] as shown. The forward pass computes prod = W[1]ᵀx, z[1] = prod + b[1], a[1] = σ(z[1]), then prod = W[2]ᵀa[1], z[2] = prod + b[2], ŷ = a[2] = σ(z[2]).)
Computation Graph for Forward Pass
(Figure: for the network above, the forward pass gives prod = [4, 3]ᵀ, z[1] = prod + b[1] = [3, 1]ᵀ, a[1] = σ(z[1]) = [0.95, 0.73]ᵀ, then prod = W[2]ᵀa[1] = −4.31, z[2] = prod + b[2] = 2.69, and ŷ = a[2] = σ(z[2]) = 0.94, with target y = 1.)
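A minimal NumPy check of this forward pass, assuming the parameter layout read off the figure (W[1] = [[4, 5], [3, 6]] with rows as hidden units, b[1] = [−1, −2]ᵀ, W[2] = [−3, −2], b[2] = 7); it reproduces the values shown above up to rounding:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[1.0], [0.0]])
W1 = np.array([[4.0, 5.0], [3.0, 6.0]])   # assumed layout (rows = hidden units)
b1 = np.array([[-1.0], [-2.0]])
W2 = np.array([[-3.0, -2.0]])
b2 = np.array([[7.0]])

z1 = W1 @ x + b1          # [[3], [1]]
a1 = sigmoid(z1)          # approx [[0.95], [0.73]]
z2 = W2 @ a1 + b2         # approx [[2.68]]
y_hat = sigmoid(z2)       # approx [[0.94]]
print(z1.ravel(), a1.ravel(), z2.ravel(), y_hat.ravel())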
Computation Graph for Cost Function
Computation Graph for Backward Pass
Demo
https://playground.tensorflow.org/
Practice problems
Example 1: Computational Graph for Back Propagation
Example 2
Optional
Derivative of the cost function w.r.t. the final-layer linear function z[2]
(uses the derivative of the sigmoid activation function)
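Since the derivation itself is on the slide figures, here is a hedged LaTeX sketch of the standard result for the cross-entropy cost with a sigmoid output:

\[
\mathcal{L} = -y \log a^{[2]} - (1-y)\log\!\left(1 - a^{[2]}\right),
\qquad a^{[2]} = \sigma\!\left(z^{[2]}\right)
\]
\[
\frac{\partial \mathcal{L}}{\partial a^{[2]}}
  = -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}},
\qquad
\frac{\partial a^{[2]}}{\partial z^{[2]}} = a^{[2]}\left(1 - a^{[2]}\right)
\]
\[
\frac{\partial \mathcal{L}}{\partial z^{[2]}}
  = \frac{\partial \mathcal{L}}{\partial a^{[2]}}
    \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}}
  = a^{[2]} - y
\]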
Thank You All !