Lecture 7
Math Foundations Team
Introduction
▶ In the last lecture, we discussed differentiation of univariate
functions, partial differentiation, gradients, and gradients of
vector-valued functions.
▶ Now we will look into gradients of matrices and some useful
identities for computing gradients.
▶ Finally, we will discuss backpropagation and automatic
differentiation.
Gradients of Matrices
The gradient of an $m \times n$ matrix $A$ with respect to a $p \times q$ matrix
$B$ is an $(m \times n) \times (p \times q)$ object, i.e., a four-dimensional tensor $J$,
whose entries are given as
\[
J_{ijkl} = \frac{\partial A_{ij}}{\partial B_{kl}}.
\]
Since we can consider $\mathbb{R}^{m \times n}$ as $\mathbb{R}^{mn}$, we can reshape the matrices
into vectors of length $mn$ and $pq$, respectively. The gradient computed using
these reshaped vectors results in a Jacobian of size $mn \times pq$.
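To make this identification concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture): it approximates the entries $J_{ijkl} = \partial A_{ij} / \partial B_{kl}$ by finite differences for an assumed example map $A(B) = M B C$ with fixed $M$ and $C$, and then reshapes the four-dimensional tensor into the equivalent $mn \times pq$ Jacobian.

```python
import numpy as np

# Sketch (assumption, not from the lecture): finite-difference approximation of the
# 4D Jacobian J[i, j, k, l] = dA_ij / dB_kl for the map A(B) = M @ B @ C, followed
# by a reshape into the (m*n) x (p*q) Jacobian of the vectorised map.
rng = np.random.default_rng(0)
m, n, p, q = 2, 3, 4, 5
M = rng.normal(size=(m, p))
C = rng.normal(size=(q, n))

def A(B):
    return M @ B @ C               # A has shape (m, n), B has shape (p, q)

B = rng.normal(size=(p, q))
eps = 1e-6
J = np.zeros((m, n, p, q))         # four-dimensional tensor of partial derivatives
for k in range(p):
    for l in range(q):
        E = np.zeros((p, q))
        E[k, l] = eps              # perturb a single entry B_kl
        J[:, :, k, l] = (A(B + E) - A(B)) / eps

J_flat = J.reshape(m * n, p * q)   # Jacobian of the reshaped map R^{pq} -> R^{mn}
print(J_flat.shape)                # (6, 20)
```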
Gradients of Matrices
Let $f = Ax$, where $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^{n}$. Then
\[
\frac{\partial f}{\partial A} \in \mathbb{R}^{m \times (m \times n)}.
\]
By definition,
\[
\frac{\partial f}{\partial A} =
\begin{bmatrix}
\frac{\partial f_1}{\partial A} \\
\vdots \\
\frac{\partial f_m}{\partial A}
\end{bmatrix},
\qquad
\frac{\partial f_i}{\partial A} \in \mathbb{R}^{1 \times (m \times n)}.
\]
Gradients of Matrices
Now, we have
\[
f_i = \sum_{j=1}^{n} A_{ij} x_j, \qquad i = 1, \dots, m.
\]
Therefore, taking the partial derivative with respect to $A_{iq}$, we get
\[
\frac{\partial f_i}{\partial A_{iq}} = x_q.
\]
Hence, the $i$-th row becomes
\[
\frac{\partial f_i}{\partial A_{i,:}} = x^{T} \in \mathbb{R}^{1 \times 1 \times n},
\qquad
\frac{\partial f_i}{\partial A_{k,:}} = 0^{T} \in \mathbb{R}^{1 \times 1 \times n} \quad \text{for } k \neq i.
\]
Hence, by stacking these partial derivatives over the rows of $A$, we get
\[
\frac{\partial f_i}{\partial A} =
\begin{bmatrix}
0^{T} \\ \vdots \\ 0^{T} \\ x^{T} \\ 0^{T} \\ \vdots \\ 0^{T}
\end{bmatrix}
\in \mathbb{R}^{1 \times m \times n},
\]
where $x^{T}$ appears in the $i$-th position.
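This row structure can be checked numerically. The following is a small sketch of mine (dimensions and data are arbitrary) comparing finite differences of $f = Ax$ with $x^{T}$ and $0^{T}$.

```python
import numpy as np

# Sketch (assumption): numerically verify that for f = A @ x,
# df_i/dA_{i,:} = x^T and df_i/dA_{k,:} = 0^T for k != i.
rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.normal(size=(m, n))
x = rng.normal(size=n)
eps = 1e-6

i = 1                                  # which output component f_i we differentiate
grad = np.zeros((m, n))                # df_i/dA, laid out with the shape of A
for k in range(m):
    for q in range(n):
        E = np.zeros((m, n))
        E[k, q] = eps                  # perturb a single entry A_kq
        grad[k, q] = ((A + E) @ x - A @ x)[i] / eps

print(np.allclose(grad[i, :], x, atol=1e-4))                     # row i equals x^T
print(np.allclose(np.delete(grad, i, axis=0), 0.0, atol=1e-4))   # other rows vanish
```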
Gradients of Matrices with respect to Matrices
Let $B \in \mathbb{R}^{m \times n}$ and $f : \mathbb{R}^{m \times n} \to \mathbb{R}^{n \times n}$ with
\[
f(B) = B^{T} B =: K \in \mathbb{R}^{n \times n}.
\]
Then, we have
\[
\frac{\partial K}{\partial B} \in \mathbb{R}^{(n \times n) \times (m \times n)}.
\]
Moreover,
\[
\frac{\partial K_{pq}}{\partial B} \in \mathbb{R}^{1 \times (m \times n)}, \qquad \text{for } p, q = 1, \dots, n,
\]
where $K_{pq}$ is the $(p, q)$-th entry of $K = f(B)$.
Gradients of Matrices with respect to Matrices
Let the $i$-th column of $B$ be $b_i$. Then
\[
K_{pq} = b_p^{T} b_q = \sum_{l=1}^{m} B_{lp} B_{lq}.
\]
Computing the partial derivative with respect to $B_{ij}$, we get
\[
\frac{\partial K_{pq}}{\partial B_{ij}} = \sum_{l=1}^{m} \frac{\partial}{\partial B_{ij}} B_{lp} B_{lq} =: \partial_{pqij}.
\]
Gradients of Matrices with respect to Matrices
Clearly, we have
\[
\partial_{pqij} =
\begin{cases}
B_{iq} & \text{if } j = p,\; p \neq q, \\
B_{ip} & \text{if } j = q,\; p \neq q, \\
2 B_{iq} & \text{if } j = p,\; p = q, \\
0 & \text{otherwise},
\end{cases}
\]
where $p, q, j = 1, \dots, n$ and $i = 1, \dots, m$.
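This case analysis can be verified numerically. The sketch below is my own check (with arbitrary dimensions), comparing the stated values of $\partial_{pqij}$ against finite differences of $K = B^{T} B$.

```python
import numpy as np

# Sketch (assumption): check the case analysis for dK_pq / dB_ij with K = B^T B.
rng = np.random.default_rng(2)
m, n = 4, 3
B = rng.normal(size=(m, n))
eps = 1e-6

def analytic(p, q, i, j):
    # the four cases stated on the slide
    if j == p and p != q:
        return B[i, q]
    if j == q and p != q:
        return B[i, p]
    if j == p and p == q:
        return 2 * B[i, q]
    return 0.0

ok = True
for p_ in range(n):
    for q_ in range(n):
        for i in range(m):
            for j in range(n):
                E = np.zeros((m, n))
                E[i, j] = eps      # perturb a single entry B_ij
                num = (((B + E).T @ (B + E)) - B.T @ B)[p_, q_] / eps
                ok &= abs(num - analytic(p_, q_, i, j)) < 1e-4
print(ok)                          # True if all cases match
```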
Useful Identities for Computing Gradients
▶ $\dfrac{\partial}{\partial X} f(X)^{T} = \left( \dfrac{\partial f(X)}{\partial X} \right)^{T}$
▶ $\dfrac{\partial}{\partial X} \operatorname{tr}(f(X)) = \operatorname{tr}\left( \dfrac{\partial f(X)}{\partial X} \right)$
▶ $\dfrac{\partial}{\partial X} \det(f(X)) = \det(f(X)) \operatorname{tr}\left( f(X)^{-1} \dfrac{\partial f(X)}{\partial X} \right)$
▶ $\dfrac{\partial}{\partial X} f(X)^{-1} = -f(X)^{-1} \dfrac{\partial f(X)}{\partial X} f(X)^{-1}$
Useful Identities for Computing Gradients
▶ $\dfrac{\partial a^{T} X^{-1} b}{\partial X} = -(X^{-1})^{T} a b^{T} (X^{-1})^{T}$
▶ $\dfrac{\partial x^{T} a}{\partial x} = a^{T}$
▶ $\dfrac{\partial a^{T} x}{\partial x} = a^{T}$
▶ $\dfrac{\partial a^{T} X b}{\partial X} = a b^{T}$
▶ $\dfrac{\partial x^{T} B x}{\partial x} = x^{T} (B + B^{T})$
▶ $\dfrac{\partial}{\partial s} (x - As)^{T} W (x - As) = -2 (x - As)^{T} W A$ for symmetric $W$.
(Two of these identities are checked numerically below.)
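The following sketch is my own sanity check, not part of the lecture; it verifies $\partial(a^{T} X b)/\partial X = a b^{T}$ and $\partial(x^{T} B x)/\partial x = x^{T}(B + B^{T})$ by finite differences with arbitrary random data.

```python
import numpy as np

# Sketch (assumption): finite-difference checks of two of the identities above.
rng = np.random.default_rng(3)
n = 4
a, b, x = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
X, Bmat = rng.normal(size=(n, n)), rng.normal(size=(n, n))
eps = 1e-6

# d(a^T X b)/dX = a b^T
grad_X = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad_X[i, j] = (a @ (X + E) @ b - a @ X @ b) / eps
print(np.allclose(grad_X, np.outer(a, b), atol=1e-4))

# d(x^T B x)/dx = x^T (B + B^T)
grad_x = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    grad_x[i] = ((x + e) @ Bmat @ (x + e) - x @ Bmat @ x) / eps
print(np.allclose(grad_x, x @ (Bmat + Bmat.T), atol=1e-4))
```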
Backpropagation and Automatic Differentiation
Consider the function
\[
f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\left(x^2 + \exp(x^2)\right).
\]
Taking derivatives,
\[
\frac{df}{dx} = \frac{2x + 2x \exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\left(x^2 + \exp(x^2)\right)\left(2x + 2x \exp(x^2)\right)
\]
\[
= 2x \left( \frac{1}{2\sqrt{x^2 + \exp(x^2)}} - \sin\left(x^2 + \exp(x^2)\right) \right) \left(1 + \exp(x^2)\right).
\]
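As a quick numerical check (my own sketch), we can compare this closed-form derivative with a central finite difference of $f$; the evaluation point is arbitrary.

```python
import numpy as np

# Sketch (assumption): verify the closed-form derivative of
# f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)) with a central difference.
def f(x):
    u = x**2 + np.exp(x**2)
    return np.sqrt(u) + np.cos(u)

def df(x):
    u = x**2 + np.exp(x**2)
    return 2 * x * (1.0 / (2.0 * np.sqrt(u)) - np.sin(u)) * (1.0 + np.exp(x**2))

x0, h = 0.7, 1e-6
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)
print(np.isclose(numeric, df(x0), atol=1e-5))   # True
```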
Motivation
▶ For such lengthy expressions, a naive implementation of the
gradient can be significantly more expensive than computing the
function itself, which imposes unnecessary overhead.
▶ We need an efficient way to compute the gradient of an error
function with respect to the parameters of the model.
▶ For training deep neural network models, the backpropagation
algorithm is one such method.
Backpropagation and Automatic Differentiation
In neural networks with multiple layers, layer $i$ computes
\[
f_i(x_{i-1}) = \sigma(A_{i-1} x_{i-1} + b_{i-1}),
\]
where $x_{i-1}$ is the output of layer $i-1$ and $\sigma$ is an activation
function.
Backpropagation
To train such models, the gradients of the loss function $L$ with
respect to all model parameters $\theta_j = \{A_j, b_j\}$, $j = 0, \dots, K-1$, and
with respect to the inputs of each layer need to be computed. Consider
\[
f_0 := x, \qquad f_i := \sigma_i(A_{i-1} f_{i-1} + b_{i-1}), \quad i = 1, \dots, K.
\]
We have to find $\theta_j = \{A_j, b_j\}$, $j = 0, \dots, K-1$, such that
\[
L(\theta) = \| y - f_K(\theta, x) \|^2
\]
is minimized, where $\theta = \{A_0, b_0, \dots, A_{K-1}, b_{K-1}\}$.
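A minimal NumPy sketch of this forward pass and loss follows; the layer sizes, the tanh activation, and the random data are my own assumptions for illustration.

```python
import numpy as np

# Sketch (assumption): forward pass f_0 = x, f_i = sigma(A_{i-1} f_{i-1} + b_{i-1}),
# and the squared-error loss L(theta) = ||y - f_K||^2 for a small random network.
rng = np.random.default_rng(4)
sizes = [3, 5, 4, 2]                      # input dim, two hidden layers, output dim
K = len(sizes) - 1                        # number of layers
A = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(K)]
b = [rng.normal(size=sizes[i + 1]) for i in range(K)]

def forward(x):
    f = x                                 # f_0 := x
    for i in range(K):
        f = np.tanh(A[i] @ f + b[i])      # f_{i+1} = sigma(A_i f_i + b_i)
    return f                              # f_K

x = rng.normal(size=sizes[0])
y = rng.normal(size=sizes[-1])
loss = np.sum((y - forward(x)) ** 2)      # L(theta) = ||y - f_K(theta, x)||^2
print(loss)
```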
Backpropagation
Using the chain rule, we get
\[
\frac{\partial L}{\partial \theta_{K-1}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial \theta_{K-1}}
\]
\[
\frac{\partial L}{\partial \theta_{K-2}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial \theta_{K-2}}
\]
\[
\frac{\partial L}{\partial \theta_{K-3}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial f_{K-2}} \frac{\partial f_{K-2}}{\partial \theta_{K-3}}
\]
\[
\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial \theta_i}
\]
Backpropagation
If the partial derivatives $\dfrac{\partial L}{\partial \theta_{i+1}}$ are computed, then the computation
can be reused to compute $\dfrac{\partial L}{\partial \theta_i}$.
Example
Consider the function
\[
f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\left(x^2 + \exp(x^2)\right).
\]
Let
\[
a = x^2, \quad b = \exp(a), \quad c = a + b, \quad d = \sqrt{c}, \quad e = \cos(c) \;\Rightarrow\; f = d + e.
\]
Example
Then
\[
\frac{\partial a}{\partial x} = 2x, \qquad
\frac{\partial b}{\partial a} = \exp(a), \qquad
\frac{\partial c}{\partial a} = 1 = \frac{\partial c}{\partial b},
\]
\[
\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}}, \qquad
\frac{\partial e}{\partial c} = -\sin(c), \qquad
\frac{\partial f}{\partial d} = 1 = \frac{\partial f}{\partial e}.
\]
Example
Thus, we have
\[
\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} + \frac{\partial f}{\partial e} \frac{\partial e}{\partial c},
\qquad
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b},
\]
\[
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial a},
\qquad
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x}.
\]
Example
Substituting the results, we get
\[
\frac{\partial f}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c)),
\qquad
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \cdot 1,
\]
\[
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \exp(a) + \frac{\partial f}{\partial c} \cdot 1,
\qquad
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \, 2x.
\]
Thus, the computation required to calculate the derivative is of similar
complexity to the computation of the function itself.
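The whole example can be written as a short forward and backward pass; the sketch below is my own and reuses the intermediate values $a, b, c, d, e$ from the forward pass, reproducing the closed-form derivative.

```python
import numpy as np

# Sketch (assumption): forward and backward pass for
# f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)), following a, b, c, d, e above.
def f_and_grad(x):
    # forward pass
    a = x**2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e
    # backward pass, reusing the forward intermediates
    df_dd, df_de = 1.0, 1.0
    df_dc = df_dd * (1.0 / (2.0 * np.sqrt(c))) + df_de * (-np.sin(c))
    df_db = df_dc * 1.0
    df_da = df_db * np.exp(a) + df_dc * 1.0
    df_dx = df_da * 2.0 * x
    return f, df_dx

x0 = 0.7
f0, g0 = f_and_grad(x0)
closed_form = 2 * x0 * (1 / (2 * np.sqrt(x0**2 + np.exp(x0**2)))
                        - np.sin(x0**2 + np.exp(x0**2))) * (1 + np.exp(x0**2))
print(np.isclose(g0, closed_form))   # True
```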
Formalization of Automatic Differentiation
Let $x_1, \dots, x_d$ be the input variables, $x_{d+1}, \dots, x_{D-1}$ the
intermediate variables, and $x_D$ the output variable. Then we have
\[
x_i = g_i(x_{\text{Pa}(x_i)}),
\]
where the $g_i$ are elementary functions, also called forward
propagation functions, and $x_{\text{Pa}(x_i)}$ is the set of parent nodes
of the variable $x_i$ in the graph.
Formalization of Automatic Differentiation
Now,
\[
f = x_D \;\Rightarrow\; \frac{\partial f}{\partial x_D} = 1.
\]
For the other variables, using the chain rule, we get
\[
\frac{\partial f}{\partial x_i}
= \sum_{x_j : x_i \in \text{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i}
= \sum_{x_j : x_i \in \text{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i}.
\]
The last equation is the backpropagation of the gradient through
the computation graph. For neural network training, we backpropagate
the error of the prediction with respect to the label.
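A minimal reverse-mode sketch of this formalization follows (my own construction, not a standard library API): each node stores its value, its parent nodes, and the local derivatives $\partial g_i / \partial x_{\text{Pa}}$, and the backward sweep accumulates $\partial f / \partial x_i$ over the children.

```python
import math

# Sketch (assumption): a tiny reverse-mode autodiff over a computation graph.
# Each node stores its value, its parents, and the local derivatives
# dg_i / dx_parent; backward() accumulates df/dx_i from the children.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # list of (parent_node, local_derivative)
        self.grad = 0.0

    def backward(self, upstream=1.0):
        self.grad += upstream       # df/dx_i, accumulated over all children
        for parent, local in self.parents:
            parent.backward(upstream * local)

def add(u, v):  return Var(u.value + v.value, [(u, 1.0), (v, 1.0)])
def mul(u, v):  return Var(u.value * v.value, [(u, v.value), (v, u.value)])
def exp(u):     return Var(math.exp(u.value), [(u, math.exp(u.value))])
def sqrt(u):    return Var(math.sqrt(u.value), [(u, 0.5 / math.sqrt(u.value))])
def cos(u):     return Var(math.cos(u.value), [(u, -math.sin(u.value))])

# f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)), built as a graph
x = Var(0.7)
a = mul(x, x)
b = exp(a)
c = add(a, b)
f = add(sqrt(c), cos(c))
f.backward()                        # back propagate df/dx_D = 1 through the graph
print(x.grad)                       # matches the closed-form derivative
```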