Lecture 7
Math Foundations Team
Introduction
▶ In the last lecture, we discussed differentiation of univariate
functions, partial differentiation, gradients, and gradients of
vector-valued functions.
▶ Now we will look into gradients of matrices and some useful
identities for computing gradients.
▶ Finally, we will discuss backpropagation and automatic
differentiation.
Gradients of Matrices
The gradient of an $m \times n$ matrix $A$ with respect to a $p \times q$ matrix
$B$ is an $(m \times n) \times (p \times q)$ object, i.e., a four-dimensional tensor $J$,
whose entries are given as
\[
J_{ijkl} = \frac{\partial A_{ij}}{\partial B_{kl}}.
\]
Since we can consider $\mathbb{R}^{m \times n}$ as $\mathbb{R}^{mn}$, we can reshape the matrices
into vectors of length $mn$ and $pq$, respectively. The gradient computed using
these reshaped vectors results in a Jacobian of size $mn \times pq$.
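To make this identification concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture): it approximates the entries $J_{ijkl} = \partial A_{ij} / \partial B_{kl}$ by finite differences for an assumed example map $A(B) = M B C$ with fixed $M$ and $C$, and then reshapes the four-dimensional tensor into the equivalent $mn \times pq$ Jacobian.

```python
import numpy as np

# Sketch (assumption, not from the lecture): finite-difference approximation of the
# 4D Jacobian J[i, j, k, l] = dA_ij / dB_kl for the map A(B) = M @ B @ C, followed
# by a reshape into the (m*n) x (p*q) Jacobian of the vectorised map.
rng = np.random.default_rng(0)
m, n, p, q = 2, 3, 4, 5
M = rng.normal(size=(m, p))
C = rng.normal(size=(q, n))

def A(B):
    return M @ B @ C               # A has shape (m, n), B has shape (p, q)

B = rng.normal(size=(p, q))
eps = 1e-6
J = np.zeros((m, n, p, q))         # four-dimensional tensor of partial derivatives
for k in range(p):
    for l in range(q):
        E = np.zeros((p, q))
        E[k, l] = eps              # perturb a single entry B_kl
        J[:, :, k, l] = (A(B + E) - A(B)) / eps

J_flat = J.reshape(m * n, p * q)   # Jacobian of the reshaped map R^{pq} -> R^{mn}
print(J_flat.shape)                # (6, 20)
```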
Gradients of Matrices
Let $f = Ax$, where $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^{n}$. Then
\[
\frac{\partial f}{\partial A} \in \mathbb{R}^{m \times (m \times n)}.
\]
By definition,
\[
\frac{\partial f}{\partial A} =
\begin{bmatrix}
\frac{\partial f_1}{\partial A} \\
\vdots \\
\frac{\partial f_m}{\partial A}
\end{bmatrix},
\qquad
\frac{\partial f_i}{\partial A} \in \mathbb{R}^{1 \times (m \times n)}.
\]
Gradients of Matrices
Now, we have
\[
f_i = \sum_{j=1}^{n} A_{ij} x_j, \qquad i = 1, \dots, m.
\]
Therefore, taking the partial derivative with respect to $A_{iq}$, we get
\[
\frac{\partial f_i}{\partial A_{iq}} = x_q.
\]
Hence, the $i$-th row becomes
\[
\frac{\partial f_i}{\partial A_{i,:}} = x^{T} \in \mathbb{R}^{1 \times 1 \times n},
\qquad
\frac{\partial f_i}{\partial A_{k,:}} = 0^{T} \in \mathbb{R}^{1 \times 1 \times n} \quad \text{for } k \neq i.
\]
Hence, by stacking these partial derivatives over the rows of $A$, we get
\[
\frac{\partial f_i}{\partial A} =
\begin{bmatrix}
0^{T} \\ \vdots \\ 0^{T} \\ x^{T} \\ 0^{T} \\ \vdots \\ 0^{T}
\end{bmatrix}
\in \mathbb{R}^{1 \times m \times n},
\]
where $x^{T}$ appears in the $i$-th position.
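This row structure can be checked numerically. The following is a small sketch of mine (dimensions and data are arbitrary) comparing finite differences of $f = Ax$ with $x^{T}$ and $0^{T}$.

```python
import numpy as np

# Sketch (assumption): numerically verify that for f = A @ x,
# df_i/dA_{i,:} = x^T and df_i/dA_{k,:} = 0^T for k != i.
rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.normal(size=(m, n))
x = rng.normal(size=n)
eps = 1e-6

i = 1                                  # which output component f_i we differentiate
grad = np.zeros((m, n))                # df_i/dA, laid out with the shape of A
for k in range(m):
    for q in range(n):
        E = np.zeros((m, n))
        E[k, q] = eps                  # perturb a single entry A_kq
        grad[k, q] = ((A + E) @ x - A @ x)[i] / eps

print(np.allclose(grad[i, :], x, atol=1e-4))                     # row i equals x^T
print(np.allclose(np.delete(grad, i, axis=0), 0.0, atol=1e-4))   # other rows vanish
```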
Gradients of Matrices with respect to Matrices
Let $B \in \mathbb{R}^{m \times n}$ and $f : \mathbb{R}^{m \times n} \to \mathbb{R}^{n \times n}$ with
\[
f(B) = B^{T} B =: K \in \mathbb{R}^{n \times n}.
\]
Then, we have
\[
\frac{\partial K}{\partial B} \in \mathbb{R}^{(n \times n) \times (m \times n)}.
\]
Moreover,
\[
\frac{\partial K_{pq}}{\partial B} \in \mathbb{R}^{1 \times (m \times n)}, \qquad \text{for } p, q = 1, \dots, n,
\]
where $K_{pq}$ is the $(p, q)$-th entry of $K = f(B)$.
Gradients of Matrices with respect to Matrices
Let the $i$-th column of $B$ be $b_i$. Then
\[
K_{pq} = b_p^{T} b_q = \sum_{l=1}^{m} B_{lp} B_{lq}.
\]
Computing the partial derivative with respect to $B_{ij}$, we get
\[
\frac{\partial K_{pq}}{\partial B_{ij}} = \sum_{l=1}^{m} \frac{\partial}{\partial B_{ij}} B_{lp} B_{lq} =: \partial_{pqij}.
\]
Gradients of Matrices with respect to Matrices
Clearly, we have
\[
\partial_{pqij} =
\begin{cases}
B_{iq} & \text{if } j = p,\; p \neq q, \\
B_{ip} & \text{if } j = q,\; p \neq q, \\
2 B_{iq} & \text{if } j = p,\; p = q, \\
0 & \text{otherwise},
\end{cases}
\]
where $p, q, j = 1, \dots, n$ and $i = 1, \dots, m$.
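This case analysis can be verified numerically. The sketch below is my own check (with arbitrary dimensions), comparing the stated values of $\partial_{pqij}$ against finite differences of $K = B^{T} B$.

```python
import numpy as np

# Sketch (assumption): check the case analysis for dK_pq / dB_ij with K = B^T B.
rng = np.random.default_rng(2)
m, n = 4, 3
B = rng.normal(size=(m, n))
eps = 1e-6

def analytic(p, q, i, j):
    # the four cases stated on the slide
    if j == p and p != q:
        return B[i, q]
    if j == q and p != q:
        return B[i, p]
    if j == p and p == q:
        return 2 * B[i, q]
    return 0.0

ok = True
for p_ in range(n):
    for q_ in range(n):
        for i in range(m):
            for j in range(n):
                E = np.zeros((m, n))
                E[i, j] = eps      # perturb a single entry B_ij
                num = (((B + E).T @ (B + E)) - B.T @ B)[p_, q_] / eps
                ok &= abs(num - analytic(p_, q_, i, j)) < 1e-4
print(ok)                          # True if all cases match
```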
Useful Identities for Computing Gradients
▶ $\dfrac{\partial}{\partial X} f(X)^{T} = \left( \dfrac{\partial f(X)}{\partial X} \right)^{T}$
▶ $\dfrac{\partial}{\partial X} \operatorname{tr}(f(X)) = \operatorname{tr}\left( \dfrac{\partial f(X)}{\partial X} \right)$
▶ $\dfrac{\partial}{\partial X} \det(f(X)) = \det(f(X)) \operatorname{tr}\left( f(X)^{-1} \dfrac{\partial f(X)}{\partial X} \right)$
▶ $\dfrac{\partial}{\partial X} f(X)^{-1} = -f(X)^{-1} \dfrac{\partial f(X)}{\partial X} f(X)^{-1}$
Useful Identities for Computing Gradients
▶ $\dfrac{\partial a^{T} X^{-1} b}{\partial X} = -(X^{-1})^{T} a b^{T} (X^{-1})^{T}$
▶ $\dfrac{\partial x^{T} a}{\partial x} = a^{T}$
▶ $\dfrac{\partial a^{T} x}{\partial x} = a^{T}$
▶ $\dfrac{\partial a^{T} X b}{\partial X} = a b^{T}$
▶ $\dfrac{\partial x^{T} B x}{\partial x} = x^{T} (B + B^{T})$
▶ $\dfrac{\partial}{\partial s} (x - As)^{T} W (x - As) = -2 (x - As)^{T} W A$ for symmetric $W$.
(Two of these identities are checked numerically below.)
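The following sketch is my own sanity check, not part of the lecture; it verifies $\partial(a^{T} X b)/\partial X = a b^{T}$ and $\partial(x^{T} B x)/\partial x = x^{T}(B + B^{T})$ by finite differences with arbitrary random data.

```python
import numpy as np

# Sketch (assumption): finite-difference checks of two of the identities above.
rng = np.random.default_rng(3)
n = 4
a, b, x = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
X, Bmat = rng.normal(size=(n, n)), rng.normal(size=(n, n))
eps = 1e-6

# d(a^T X b)/dX = a b^T
grad_X = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad_X[i, j] = (a @ (X + E) @ b - a @ X @ b) / eps
print(np.allclose(grad_X, np.outer(a, b), atol=1e-4))

# d(x^T B x)/dx = x^T (B + B^T)
grad_x = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    grad_x[i] = ((x + e) @ Bmat @ (x + e) - x @ Bmat @ x) / eps
print(np.allclose(grad_x, x @ (Bmat + Bmat.T), atol=1e-4))
```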
Backpropagation and Automatic Differentiation
Consider the function
\[
f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\left(x^2 + \exp(x^2)\right).
\]
Taking derivatives,
\[
\frac{df}{dx} = \frac{2x + 2x \exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\left(x^2 + \exp(x^2)\right)\left(2x + 2x \exp(x^2)\right)
\]
\[
= 2x \left( \frac{1}{2\sqrt{x^2 + \exp(x^2)}} - \sin\left(x^2 + \exp(x^2)\right) \right) \left(1 + \exp(x^2)\right).
\]
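As a quick numerical check (my own sketch), we can compare this closed-form derivative with a central finite difference of $f$; the evaluation point is arbitrary.

```python
import numpy as np

# Sketch (assumption): verify the closed-form derivative of
# f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)) with a central difference.
def f(x):
    u = x**2 + np.exp(x**2)
    return np.sqrt(u) + np.cos(u)

def df(x):
    u = x**2 + np.exp(x**2)
    return 2 * x * (1.0 / (2.0 * np.sqrt(u)) - np.sin(u)) * (1.0 + np.exp(x**2))

x0, h = 0.7, 1e-6
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)
print(np.isclose(numeric, df(x0), atol=1e-5))   # True
```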
Motivation
▶ For such lengthy expressions, a naive implementation of the
gradient can be significantly more expensive than computing the
function itself, which imposes unnecessary overhead.
▶ We need an efficient way to compute the gradient of an error
function with respect to the parameters of the model.
▶ For training deep neural network models, the backpropagation
algorithm is one such method.
Backpropagation and Automatic Differentiation
In neural networks with multiple layers, layer $i$ computes
\[
f_i(x_{i-1}) = \sigma(A_{i-1} x_{i-1} + b_{i-1}),
\]
where $x_{i-1}$ is the output of layer $i-1$ and $\sigma$ is an activation
function.
Backpropagation
To train such models, the gradients of the loss function $L$ with
respect to all model parameters $\theta_j = \{A_j, b_j\}$, $j = 0, \dots, K-1$, and
with respect to the inputs of each layer need to be computed. Consider
\[
f_0 := x, \qquad f_i := \sigma_i(A_{i-1} f_{i-1} + b_{i-1}), \quad i = 1, \dots, K.
\]
We have to find $\theta_j = \{A_j, b_j\}$, $j = 0, \dots, K-1$, such that
\[
L(\theta) = \| y - f_K(\theta, x) \|^2
\]
is minimized, where $\theta = \{A_0, b_0, \dots, A_{K-1}, b_{K-1}\}$.
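A minimal NumPy sketch of this forward pass and loss follows; the layer sizes, the tanh activation, and the random data are my own assumptions for illustration.

```python
import numpy as np

# Sketch (assumption): forward pass f_0 = x, f_i = sigma(A_{i-1} f_{i-1} + b_{i-1}),
# and the squared-error loss L(theta) = ||y - f_K||^2 for a small random network.
rng = np.random.default_rng(4)
sizes = [3, 5, 4, 2]                      # input dim, two hidden layers, output dim
K = len(sizes) - 1                        # number of layers
A = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(K)]
b = [rng.normal(size=sizes[i + 1]) for i in range(K)]

def forward(x):
    f = x                                 # f_0 := x
    for i in range(K):
        f = np.tanh(A[i] @ f + b[i])      # f_{i+1} = sigma(A_i f_i + b_i)
    return f                              # f_K

x = rng.normal(size=sizes[0])
y = rng.normal(size=sizes[-1])
loss = np.sum((y - forward(x)) ** 2)      # L(theta) = ||y - f_K(theta, x)||^2
print(loss)
```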
Backpropagation
Using the chain rule, we get
\[
\frac{\partial L}{\partial \theta_{K-1}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial \theta_{K-1}}
\]
\[
\frac{\partial L}{\partial \theta_{K-2}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial \theta_{K-2}}
\]
\[
\frac{\partial L}{\partial \theta_{K-3}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial f_{K-2}} \frac{\partial f_{K-2}}{\partial \theta_{K-3}}
\]
\[
\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial \theta_i}
\]
Backpropagation
If the partial derivatives $\dfrac{\partial L}{\partial \theta_{i+1}}$ are computed, then the computation
can be reused to compute $\dfrac{\partial L}{\partial \theta_i}$.
Example
Consider the function
\[
f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\left(x^2 + \exp(x^2)\right).
\]
Let
\[
a = x^2, \quad b = \exp(a), \quad c = a + b, \quad d = \sqrt{c}, \quad e = \cos(c) \;\Rightarrow\; f = d + e.
\]
Example
Then
\[
\frac{\partial a}{\partial x} = 2x, \qquad
\frac{\partial b}{\partial a} = \exp(a), \qquad
\frac{\partial c}{\partial a} = 1 = \frac{\partial c}{\partial b},
\]
\[
\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}}, \qquad
\frac{\partial e}{\partial c} = -\sin(c), \qquad
\frac{\partial f}{\partial d} = 1 = \frac{\partial f}{\partial e}.
\]
Example
Thus, we have
\[
\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} + \frac{\partial f}{\partial e} \frac{\partial e}{\partial c},
\qquad
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b},
\]
\[
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial a},
\qquad
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x}.
\]
Example
Substituting the results, we get
\[
\frac{\partial f}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c)),
\qquad
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \cdot 1,
\]
\[
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \exp(a) + \frac{\partial f}{\partial c} \cdot 1,
\qquad
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \, 2x.
\]
Thus, the computation required to calculate the derivative is of similar
complexity to the computation of the function itself.
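The whole example can be written as a short forward and backward pass; the sketch below is my own and reuses the intermediate values $a, b, c, d, e$ from the forward pass, reproducing the closed-form derivative.

```python
import numpy as np

# Sketch (assumption): forward and backward pass for
# f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)), following a, b, c, d, e above.
def f_and_grad(x):
    # forward pass
    a = x**2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e
    # backward pass, reusing the forward intermediates
    df_dd, df_de = 1.0, 1.0
    df_dc = df_dd * (1.0 / (2.0 * np.sqrt(c))) + df_de * (-np.sin(c))
    df_db = df_dc * 1.0
    df_da = df_db * np.exp(a) + df_dc * 1.0
    df_dx = df_da * 2.0 * x
    return f, df_dx

x0 = 0.7
f0, g0 = f_and_grad(x0)
closed_form = 2 * x0 * (1 / (2 * np.sqrt(x0**2 + np.exp(x0**2)))
                        - np.sin(x0**2 + np.exp(x0**2))) * (1 + np.exp(x0**2))
print(np.isclose(g0, closed_form))   # True
```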
Formalization of Automatic Differentiation
Let $x_1, \dots, x_d$ be the input variables, $x_{d+1}, \dots, x_{D-1}$ the
intermediate variables, and $x_D$ the output variable. Then we have
\[
x_i = g_i(x_{\text{Pa}(x_i)}),
\]
where the $g_i$ are elementary functions, also called forward
propagation functions, and $x_{\text{Pa}(x_i)}$ is the set of parent nodes
of the variable $x_i$ in the graph.
Formalization of Automatic Differentiation
Now,
\[
f = x_D \;\Rightarrow\; \frac{\partial f}{\partial x_D} = 1.
\]
For the other variables, using the chain rule, we get
\[
\frac{\partial f}{\partial x_i}
= \sum_{x_j : x_i \in \text{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i}
= \sum_{x_j : x_i \in \text{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i}.
\]
The last equation is the backpropagation of the gradient through
the computation graph. For neural network training, we backpropagate
the error of the prediction with respect to the label.
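A minimal reverse-mode sketch of this formalization follows (my own construction, not a standard library API): each node stores its value, its parent nodes, and the local derivatives $\partial g_i / \partial x_{\text{Pa}}$, and the backward sweep accumulates $\partial f / \partial x_i$ over the children.

```python
import math

# Sketch (assumption): a tiny reverse-mode autodiff over a computation graph.
# Each node stores its value, its parents, and the local derivatives
# dg_i / dx_parent; backward() accumulates df/dx_i from the children.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # list of (parent_node, local_derivative)
        self.grad = 0.0

    def backward(self, upstream=1.0):
        self.grad += upstream       # df/dx_i, accumulated over all children
        for parent, local in self.parents:
            parent.backward(upstream * local)

def add(u, v):  return Var(u.value + v.value, [(u, 1.0), (v, 1.0)])
def mul(u, v):  return Var(u.value * v.value, [(u, v.value), (v, u.value)])
def exp(u):     return Var(math.exp(u.value), [(u, math.exp(u.value))])
def sqrt(u):    return Var(math.sqrt(u.value), [(u, 0.5 / math.sqrt(u.value))])
def cos(u):     return Var(math.cos(u.value), [(u, -math.sin(u.value))])

# f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)), built as a graph
x = Var(0.7)
a = mul(x, x)
b = exp(a)
c = add(a, b)
f = add(sqrt(c), cos(c))
f.backward()                        # back propagate df/dx_D = 1 through the graph
print(x.grad)                       # matches the closed-form derivative
```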