Part 2 Module 2 DL BP

Deep Learning

BITS Pilani
Pilani Campus
Deep Neural Network

Disclaimer and Acknowledgement

• The content for these slides has been obtained from books and various other sources on the Internet.
• I hereby acknowledge all the contributors for their material and inputs.
• I have provided source information wherever necessary.
• I have added and modified the content to suit the requirements of the course.



Session Agenda

• Back Propagation



Training: Go Forwards, then Backwards…
Step 1: Calculate ŷ using the computation graph (forward pass).

Step 2: Determine the cost.

Step 3: Calculate the partial derivatives (using backpropagation).

Step 4: Update each parameter.

Brad Quinton, Scott Chin


Training Neural Networks: optimizing
parameters
• We are given an architecture, defined through its weights 𝐖.
• We are also given training data 𝐷 = {(𝐱_i, y_i)}.
• We are given a loss function ℒ(𝐷; 𝐖).
• We can use gradient descent to minimize the loss.
• At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.

Algorithm for Gradient Computation – Backpropagation
• For each parameter w_i ∈ 𝐖:  w_i^(t+1) = w_i^t − α ∂ℒ/∂w_i
• It all comes down to effectively computing ∂ℒ/∂w_i.
• How do we efficiently compute ∂ℒ/∂w_i for all parameters?
• The calculus just gets a bit more complicated for a neural network.
• Depth gives more representational capacity to neural networks.
• However, training deep nets is not trivial.
• The solution is the "Backpropagation" algorithm!
• Backpropagation is a systematic and efficient method to calculate the partial derivatives (i.e., the partial derivative of the cost w.r.t. each parameter).
Rumelhart, Hinton, Williams, “Learning Representations by Back-Propagating Errors”, 1986
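In code, one gradient-descent update is just the rule above applied to every parameter array. A minimal NumPy sketch, assuming params and grads are dictionaries of matching arrays produced elsewhere (the names and the lr value are illustrative):

```python
import numpy as np

def gradient_descent_step(params, grads, lr=0.01):
    """One update w_i <- w_i - lr * dL/dw_i for every parameter array."""
    for key in params:
        params[key] = params[key] - lr * grads[key]
    return params
```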
Key Intuitions Required for
Backpropagation
1. Gradient Descent
• Change the weights 𝐖 in the direction of the negative gradient to minimize the error function.

2. Chain Rule
• Use the chain rule to calculate the gradients of the intermediate layers.

3. Dynamic Programming (Memoization)
• Parameter gradients at a layer depend on the gradients of the later (higher) layers!
• So, when computing gradients layer by layer, we can reuse the gradients already computed for higher layers when working on lower layers (i.e., memoization).
The Computation Graph of A Neural
Network
• We can represent any neural network in terms of a computation graph.
• The loss function can be computed by moving through the graph from left to right (forward pass).

The corresponding computation graph:

x → z[1] = W[1]x + b[1] → a[1] = σ(z[1]) → z[2] = W[2]a[1] + b[2] → a[2] = σ(z[2]) → ℒ(a[2], y)
(with parameters W[1], b[1] feeding the first node and W[2], b[2] feeding the third)

Cross-entropy cost function (with ŷ = a[2]):
ℒ(ŷ, y) = −y log(ŷ) − (1 − y) log(1 − ŷ)
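As a reference point for the derivations that follow, here is a minimal NumPy sketch of this forward pass and the cross-entropy cost. The helper names are assumptions; shapes follow the column-vector convention of the slides (W1 is (n1, n0), x is (n0, 1), W2 is (n2, n1)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1          # z[1] = W[1] x + b[1]
    a1 = sigmoid(z1)          # a[1] = sigma(z[1])
    z2 = W2 @ a1 + b2         # z[2] = W[2] a[1] + b[2]
    a2 = sigmoid(z2)          # a[2] = sigma(z[2]) = y_hat
    return z1, a1, z2, a2

def bce_loss(a2, y):
    # Cross-entropy cost: -y*log(y_hat) - (1 - y)*log(1 - y_hat)
    return (-y * np.log(a2) - (1 - y) * np.log(1 - a2)).item()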
Backward Propagation

(Computation graph: x → z[1] = W[1]x + b[1] → a[1] = σ(z[1]) → z[2] = W[2]a[1] + b[2] → a[2] = σ(z[2]) → ℒ(a[2], y), with ŷ = a[2].)

a[2] → ℒ

∂ℒ/∂a[2] = ∂/∂a[2] [ −y log(a[2]) − (1 − y) log(1 − a[2]) ]
         = −y/a[2] + (1 − y)/(1 − a[2])
Backward Propagation
z[2] → a[2] → ℒ

∂ℒ/∂z[2] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] = a[2] − y

since a[2] = σ(z[2]) and therefore ∂a[2]/∂z[2] = a[2](1 − a[2]).
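A quick numerical sanity check of the result ∂ℒ/∂z[2] = a[2] − y, comparing it against a central finite difference at an arbitrary illustrative point:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_from_z2(z2, y):
    a2 = sigmoid(z2)
    return -y * np.log(a2) - (1 - y) * np.log(1 - a2)

z2, y, eps = 0.7, 1.0, 1e-6
analytic = sigmoid(z2) - y                                              # a[2] - y
numeric = (loss_from_z2(z2 + eps, y) - loss_from_z2(z2 - eps, y)) / (2 * eps)
print(analytic, numeric)   # the two values should agree closely
```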
Backward Propagation
b[2] → z[2] → a[2] → ℒ

∂ℒ/∂b[2] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂b[2] = ∂ℒ/∂z[2] × 1 = ∂ℒ/∂z[2]
Backward Propagation

W[2] → z[2] → a[2] → ℒ

∂ℒ/∂W[2] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂W[2] = ∂ℒ/∂z[2] · a[1]ᵀ

Dimension of dz[2] is (n[2], 1); dimension of a[1] is (n[1], 1), hence the transpose of a[1]; dimension of dW[2] is (n[2], n[1]).
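The dimension argument can be checked directly in NumPy (the layer sizes here are illustrative):

```python
import numpy as np

n1, n2 = 4, 3                      # illustrative layer sizes
dz2 = np.random.randn(n2, 1)       # dL/dz[2], shape (n[2], 1)
a1  = np.random.randn(n1, 1)       # a[1], shape (n[1], 1)

dW2 = dz2 @ a1.T                   # dL/dW[2] = dz[2] . a[1]^T
print(dW2.shape)                   # (3, 4) == (n[2], n[1]), matching W[2]
```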
Backward Propagation

a[1] → z[2] → a[2] → ℒ

∂ℒ/∂a[1] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂a[1] = W[2]ᵀ · ∂ℒ/∂z[2]

Dimension of dz[2] is (n[2], 1); dimension of W[2] is (n[2], n[1]); dimension of da[1] is (n[1], 1).
However, da[1] is not a parameter gradient (a[1] is not a parameter); it only serves as an intermediate quantity on the way to dz[1].
Backward Propagation

z[1] → a[1] → z[2] → a[2] → ℒ

∂ℒ/∂z[1] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂a[1] × ∂a[1]/∂z[1]
         = (W[2]ᵀ · ∂ℒ/∂z[2]) ∗ a[1](1 − a[1])      (∗ is the element-wise product)

since a[1] = σ(z[1]) and therefore ∂a[1]/∂z[1] = a[1](1 − a[1]).

Dimensions: dz[2] is (n[2], 1); W[2] is (n[2], n[1]); da[1] is (n[1], 1); dz[1] is (n[1], 1).
Backward Propagation
b[1] → z[1] → a[1] → z[2] → a[2] → ℒ

∂ℒ/∂b[1] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂a[1] × ∂a[1]/∂z[1] × ∂z[1]/∂b[1] = ∂ℒ/∂z[1] × 1 = ∂ℒ/∂z[1]
Backward Propagation
W[1] → z[1] → a[1] → z[2] → a[2] → ℒ

∂ℒ/∂W[1] = ∂ℒ/∂a[2] × ∂a[2]/∂z[2] × ∂z[2]/∂a[1] × ∂a[1]/∂z[1] × ∂z[1]/∂W[1] = ∂ℒ/∂z[1] · xᵀ
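Putting these results together, a minimal sketch of the single-example backward pass (sigmoid activations in both layers, cross-entropy loss; variable names are illustrative and assume the forward pass cached z[1], a[1] and a[2]):

```python
import numpy as np

def backward(x, y, W2, a1, a2):
    dz2 = a2 - y                          # dL/dz[2]
    dW2 = dz2 @ a1.T                      # dL/dW[2]
    db2 = dz2                             # dL/db[2]
    dz1 = (W2.T @ dz2) * (a1 * (1 - a1))  # dL/dz[1], element-wise product
    dW1 = dz1 @ x.T                       # dL/dW[1]
    db1 = dz1                             # dL/db[1]
    return dW1, db1, dW2, db2
```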
Backward Propagation: Summary
For cross-entropy loss and sigmoid activation in the last layer:

One training example:
dz[2] = ∂ℒ/∂z[2] = a[2] − y
db[2] = ∂ℒ/∂b[2] = dz[2]
dW[2] = ∂ℒ/∂W[2] = dz[2] · a[1]ᵀ
dz[1] = ∂ℒ/∂z[1] = W[2]ᵀ dz[2] ∗ g[1]′(z[1])
db[1] = ∂ℒ/∂b[1] = dz[1]
dW[1] = ∂ℒ/∂W[1] = dz[1] · xᵀ

All m training examples (vectorized):
dZ[2] = ∂ℒ/∂Z[2] = A[2] − Y
db[2] = ∂ℒ/∂b[2] = (1/m) Σ dZ[2]
dW[2] = ∂ℒ/∂W[2] = (1/m) dZ[2] · A[1]ᵀ
dZ[1] = ∂ℒ/∂Z[1] = W[2]ᵀ dZ[2] ∗ g[1]′(Z[1])
db[1] = ∂ℒ/∂b[1] = (1/m) Σ dZ[1]
dW[1] = ∂ℒ/∂W[1] = (1/m) dZ[1] · Xᵀ
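A sketch of the vectorized (all-m-examples) column of this summary, assuming the examples are stored as columns of X and that g1_prime is the derivative of the hidden-layer activation:

```python
import numpy as np

def backward_batch(X, Y, W2, Z1, A1, A2, g1_prime):
    m = X.shape[1]
    dZ2 = A2 - Y                                    # (n2, m)
    dW2 = (dZ2 @ A1.T) / m                          # (n2, n1)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m    # (n2, 1)
    dZ1 = (W2.T @ dZ2) * g1_prime(Z1)               # (n1, m), element-wise product
    dW1 = (dZ1 @ X.T) / m                           # (n1, n0)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m    # (n1, 1)
    return dW1, db1, dW2, db2
```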
Equations for layer l
Input: da[l].  Output: da[l−1], dW[l], db[l].

One training example:
dz[l] = da[l] ∗ g[l]′(z[l])
db[l] = dz[l]
dW[l] = dz[l] · a[l−1]ᵀ
da[l−1] = W[l]ᵀ dz[l]

All m training examples (vectorized):
dZ[l] = dA[l] ∗ g[l]′(Z[l])
db[l] = (1/m) Σ dZ[l]
dW[l] = (1/m) dZ[l] · A[l−1]ᵀ
dA[l−1] = W[l]ᵀ dZ[l]
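The layer-l equations translate directly into one reusable backward step. A sketch in NumPy, assuming the forward pass cached Z[l], A[l−1], and W[l]:

```python
import numpy as np

def layer_backward(dA_l, Z_l, A_prev, W_l, g_prime):
    m = A_prev.shape[1]
    dZ_l = dA_l * g_prime(Z_l)                      # dZ[l] = dA[l] * g[l]'(Z[l])
    dW_l = (dZ_l @ A_prev.T) / m                    # dW[l]
    db_l = np.sum(dZ_l, axis=1, keepdims=True) / m  # db[l]
    dA_prev = W_l.T @ dZ_l                          # dA[l-1], passed on to layer l-1
    return dA_prev, dW_l, db_l
```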
Scaling up for L layers and all training examples in the NN

Forward propagation (layer l): input A[l−1]; output A[l]; cache Z[l] (together with A[l−1], W[l], b[l]) for the backward pass.
Backward propagation (layer l): input dA[l] and the cached values; output dA[l−1], dW[l], db[l].
Scaling up for L layers and all training examples in the NN

Update the parameters: for each layer l, W[l] := W[l] − α dW[l] and b[l] := b[l] − α db[l].
Scaling up for L layers and all training examples in NN
Example
Calculate all matrix dimensions and total number of parameters
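As an illustration (the layer sizes here are assumed, not taken from the original figure), for a network with layer sizes n[0] = 3, n[1] = 4, n[2] = 2, n[3] = 1 the bookkeeping works out as:

W[1]: (4, 3) → 12 weights,  b[1]: (4, 1) → 4
W[2]: (2, 4) → 8 weights,   b[2]: (2, 1) → 2
W[3]: (1, 2) → 2 weights,   b[3]: (1, 1) → 1
Total parameters = 12 + 4 + 8 + 2 + 2 + 1 = 29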
Array broadcasting

(Figures from numpy.org illustrating NumPy's broadcasting rules, shown over three slides. Source: numpy.org)
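A small NumPy example of the broadcasting rule that the vectorized equations rely on when adding a bias column vector to a whole batch:

```python
import numpy as np

# The bias column vector b (n, 1) is automatically stretched across the m
# columns of W @ X (n, m), which is what makes Z = W @ X + b work for a batch.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])          # (2, 2)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])     # (2, 3): m = 3 examples as columns
b = np.array([[0.5],
              [-0.5]])              # (2, 1)

Z = W @ X + b                       # b is broadcast to shape (2, 3)
print(Z.shape)                      # (2, 3)
```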
Exercise - MSE Loss

Consider a neural network with two inputs x1 and x2 and initial weights w0 = 0.5, w1 = 0.8, w2 = 0.3. Draw the network, compute the output, the mean squared error loss, and the weight updates when the input is (1, 0), the learning rate is 0.01, and the target output is 1. Assume any other relevant information.

Network: a single sigmoid unit, ŷ = σ(w0·1 + w1·x1 + w2·x2), with bias weight w0 on the constant input 1.
Solution
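A sketch of the computation, assuming a sigmoid output unit and the loss L = ½(y − ŷ)² (these assumptions fill in the "assume any other relevant information" part):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, 0.8, 0.3])            # [w0, w1, w2]
x = np.array([1.0, 1.0, 0.0])            # [bias input 1, x1, x2]
y, lr = 1.0, 0.01

z = w @ x                                # 1.3
y_hat = sigmoid(z)                       # ~0.786
loss = 0.5 * (y - y_hat) ** 2            # ~0.023
dz = (y_hat - y) * y_hat * (1 - y_hat)   # dL/dz ~ -0.036
w_new = w - lr * dz * x                  # ~[0.50036, 0.80036, 0.3]
print(y_hat, loss, w_new)
```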
Exercise - BCE

Consider a neural network with two inputs x1 and x2 and initial weights w0 = 0.5, w1 = 0.8, w2 = 0.3. Draw the network, compute the output, the binary cross-entropy loss, and the weight updates when the input is (1, 0), the learning rate is 0.01, and the target output is 1. Assume any other relevant information.

Network: the same single sigmoid unit as above, ŷ = σ(w0·1 + w1·x1 + w2·x2).
Solution
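A sketch under the same assumptions as the previous exercise (sigmoid output unit); for a sigmoid output with binary cross-entropy, ∂L/∂z = ŷ − y:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, 0.8, 0.3])            # [w0, w1, w2]
x = np.array([1.0, 1.0, 0.0])            # [bias input 1, x1, x2]
y, lr = 1.0, 0.01

z = w @ x                                # 1.3
y_hat = sigmoid(z)                       # ~0.786
loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)   # ~0.241
dz = y_hat - y                           # ~ -0.214
w_new = w - lr * dz * x                  # ~[0.5021, 0.8021, 0.3]
print(y_hat, loss, w_new)
```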
Exercise (without vectorization)
Network: three inputs x1, x2, x3; a hidden layer with two sigmoid units (z1[1], a1[1]) and (z2[1], a2[1]); a single sigmoid output unit (z1[2], a1[2] = ŷ).

Initial parameters:
W11[1] = 0.2,  W12[1] = 0.4,  W13[1] = −0.5
W21[1] = −0.3, W22[1] = 0.1,  W23[1] = 0.2
W11[2] = −0.3, W12[2] = −0.2
b1[1] = −0.4,  b2[1] = −0.2,  b1[2] = 0.1
Learning rate = 0.9

For X = {1, 0, 1} and y = 1, find the cross-entropy loss and the weight updates after the 1st iteration.
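A sketch of the forward pass for this exercise in NumPy, to obtain the activations and the cross-entropy loss needed before the gradient step (column-vector shapes assumed):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[0.2, 0.4, -0.5],
               [-0.3, 0.1, 0.2]])
b1 = np.array([[-0.4], [-0.2]])
W2 = np.array([[-0.3, -0.2]])
b2 = np.array([[0.1]])

x = np.array([[1.0], [0.0], [1.0]])
y = 1.0

z1 = W1 @ x + b1          # [[-0.7], [-0.3]]
a1 = sigmoid(z1)          # ~[[0.332], [0.426]]
z2 = W2 @ a1 + b2         # ~[[-0.085]]
a2 = sigmoid(z2)          # ~[[0.479]]  -> y_hat
loss = (-y * np.log(a2) - (1 - y) * np.log(1 - a2)).item()   # ~0.74
print(a2, loss)
```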
Continued……
For layer 2:

For the cross-entropy cost function:   dz1[2] = ∂ℒ/∂z1[2] = a1[2] − y
For the MSE cost function:             dz1[2] = ∂ℒ/∂z1[2] = a1[2] (1 − a1[2]) (a1[2] − y)
Exercise – With vectorization

Network: two inputs, a hidden layer with two sigmoid units, and a single sigmoid output unit.
Given x = [1, 0]ᵀ, W[1] = [[4, 5], [3, 6]], b[1] = [−1, −2]ᵀ, W[2] = [−3, −2], b[2] = 7, and target y = 1.

Forward computation: z[1] = W[1]x + b[1];  a[1] = σ(z[1]);  z[2] = W[2]a[1] + b[2];  a[2] = σ(z[2]) = ŷ.
Computation Graph for Forward Pass

W[1]x = [4, 3]ᵀ
z[1] = W[1]x + b[1] = [3, 1]ᵀ
a[1] = σ(z[1]) ≈ [0.95, 0.73]ᵀ
W[2]a[1] ≈ −4.31
z[2] = W[2]a[1] + b[2] ≈ 2.69
a[2] = σ(z[2]) ≈ 0.94 = ŷ     (target y = 1)
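The same numbers can be reproduced in a few lines of NumPy (values rounded as on the slide):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x  = np.array([[1.0], [0.0]])
W1 = np.array([[4.0, 5.0], [3.0, 6.0]]); b1 = np.array([[-1.0], [-2.0]])
W2 = np.array([[-3.0, -2.0]]);           b2 = np.array([[7.0]])

z1 = W1 @ x + b1        # [[3.], [1.]]
a1 = sigmoid(z1)        # ~[[0.95], [0.73]]
z2 = W2 @ a1 + b2       # ~[[2.69]]
a2 = sigmoid(z2)        # ~[[0.94]]  -> y_hat
print(z1.ravel(), a1.ravel(), z2.ravel(), a2.ravel())
```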


Computation Graph for Cost Function
Computation Graph for Backward Pass
Demo

https://playground.tensorflow.org/
Practice problems
Example 1: Computational Graph for Back Propagation
(A worked example developed step by step on the computational graph over the following slides.)
Example 2

Optional
Derivative of the cost function w.r.t. the final-layer linear function
(Derived over several slides, using the derivative of the sigmoid activation function.)
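For reference, the derivation mirrors the layer-2 case worked out earlier; for a sigmoid output a[L] = σ(z[L]) with the cross-entropy loss:

```latex
\frac{\partial \mathcal{L}}{\partial a^{[L]}}
  = -\frac{y}{a^{[L]}} + \frac{1-y}{1-a^{[L]}}, \qquad
\frac{\partial a^{[L]}}{\partial z^{[L]}} = a^{[L]}\bigl(1-a^{[L]}\bigr)
\;\Longrightarrow\;
\frac{\partial \mathcal{L}}{\partial z^{[L]}}
  = \frac{\partial \mathcal{L}}{\partial a^{[L]}}\cdot
    \frac{\partial a^{[L]}}{\partial z^{[L]}}
  = a^{[L]} - y .
```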
Thank You All !
