Orthogonal Projections Explained
Math for Machine Learning

3.8 Orthogonal Projections
Figure 3.11: Projection onto a two-dimensional subspace U with basis b1, b2. The projection πU(x) of x ∈ R³ onto U can be expressed as a linear combination of b1, b2, and the displacement vector x − πU(x) is orthogonal to both b1 and b2.

Example 3.10 (Projection onto a Line)


Find the projection matrix P_π onto the line through the origin spanned by b = [1, 2, 2]^⊤. The vector b is a direction and a basis of the one-dimensional subspace (a line through the origin).
With (3.46), we obtain
P_\pi = \frac{b b^\top}{b^\top b} = \frac{1}{9} \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \begin{bmatrix} 1 & 2 & 2 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} . (3.47)
Let us now choose a particular x and see whether it lies in the subspace spanned by b. For x = [1, 1, 1]^⊤, the projection is
\pi_U(x) = P_\pi x = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \mathrm{span}\left[\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}\right] . (3.48)
Note that the application of P_π to πU(x) does not change anything, i.e., P_π πU(x) = πU(x). This is expected because according to Definition 3.10, we know that a projection matrix P_π satisfies P_π² x = P_π x for all x.

Remark. With the results from Chapter 4, we can show that πU (x) is an
eigenvector of P π , and the corresponding eigenvalue is 1. ♢
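As an illustrative aside (not part of the book's text), the following NumPy snippet reproduces Example 3.10 and checks both the idempotence P_π² = P_π and the eigenvalue-1 property from the remark:

import numpy as np

b = np.array([[1.0], [2.0], [2.0]])     # direction/basis of the line, as a column vector
P = (b @ b.T) / (b.T @ b)               # projection matrix b b^T / (b^T b), cf. (3.46)-(3.47)

x = np.array([[1.0], [1.0], [1.0]])
pi_x = P @ x                            # projection of x onto span[b], cf. (3.48): [5, 10, 10]/9
print(pi_x.ravel())
print(np.allclose(P @ P, P))            # True: applying P twice changes nothing
print(np.allclose(P @ pi_x, pi_x))      # True: pi_U(x) is an eigenvector of P with eigenvalue 1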

3.8.2 Projection onto General Subspaces


In the following, we look at orthogonal projections of vectors x ∈ Rⁿ onto lower-dimensional subspaces U ⊆ Rⁿ with dim(U) = m ⩾ 1. An illustration is given in Figure 3.11. (If U is given by a set of spanning vectors that are not a basis, make sure you determine a basis b1, ..., bm before proceeding.)
Assume that (b1, ..., bm) is an ordered basis of U. Any projection πU(x) onto U is necessarily an element of U. Therefore, they can be represented as linear combinations of the basis vectors b1, ..., bm of U, such that πU(x) = \sum_{i=1}^{m} \lambda_i b_i. (The basis vectors form the columns of B ∈ R^{n×m}, where B = [b1, ..., bm].)
As in the 1D case, we follow a three-step procedure to find the projection πU(x) and the projection matrix P_π:
1. Find the coordinates λ1, ..., λm of the projection (with respect to the basis of U), such that the linear combination
\pi_U(x) = \sum_{i=1}^{m} \lambda_i b_i = B\lambda , (3.49)
B = [b_1, \dots, b_m] \in \mathbb{R}^{n\times m}, \quad \lambda = [\lambda_1, \dots, \lambda_m]^\top \in \mathbb{R}^m , (3.50)
is closest to x ∈ Rⁿ. As in the 1D case, "closest" means "minimum distance", which implies that the vector connecting πU(x) ∈ U and x ∈ Rⁿ must be orthogonal to all basis vectors of U. Therefore, we obtain m simultaneous conditions (assuming the dot product as the inner product)
\langle b_1, x - \pi_U(x)\rangle = b_1^\top (x - \pi_U(x)) = 0 (3.51)
\vdots
\langle b_m, x - \pi_U(x)\rangle = b_m^\top (x - \pi_U(x)) = 0 (3.52)
which, with πU(x) = Bλ, can be written as
b_1^\top (x - B\lambda) = 0 (3.53)
\vdots
b_m^\top (x - B\lambda) = 0 (3.54)
such that we obtain a homogeneous linear equation system
\begin{bmatrix} b_1^\top \\ \vdots \\ b_m^\top \end{bmatrix} (x - B\lambda) = 0 \iff B^\top (x - B\lambda) = 0 (3.55)
\iff B^\top B \lambda = B^\top x . (3.56)
The last expression is called normal equation. Since b1, ..., bm are a basis of U and, therefore, linearly independent, B^⊤B ∈ R^{m×m} is regular and can be inverted. This allows us to solve for the coefficients/coordinates
\lambda = (B^\top B)^{-1} B^\top x . (3.57)

The matrix (B^⊤B)^{-1}B^⊤ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires that B^⊤B is positive definite, which is the case if B is full rank. In practical applications (e.g., linear regression), we often add a "jitter term" ϵI to B^⊤B to guarantee increased numerical stability and positive definiteness. This "ridge" can be rigorously derived using Bayesian inference. See Chapter 9 for details.
2. Find the projection πU(x) ∈ U. We already established that πU(x) = Bλ. Therefore, with (3.57),
\pi_U(x) = B(B^\top B)^{-1} B^\top x . (3.58)
3. Find the projection matrix P_π. From (3.58), we can immediately see that the projection matrix that solves P_π x = πU(x) must be
P_\pi = B(B^\top B)^{-1} B^\top . (3.59)

Remark. The solution for projecting onto general subspaces includes the 1D case as a special case: If dim(U) = 1, then B^⊤B ∈ R is a scalar and we can rewrite the projection matrix in (3.59), P_\pi = B(B^\top B)^{-1}B^\top, as P_\pi = \frac{BB^\top}{B^\top B}, which is exactly the projection matrix in (3.46). ♢
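The three steps translate directly into code. The following sketch is illustrative (not from the book); the function name and the optional jitter argument are my own, with the jitter implementing the ϵI stabilization mentioned above:

import numpy as np

def project_onto_subspace(B, x, jitter=0.0):
    """Orthogonal projection of x onto the column space of B (dot product as inner product)."""
    gram = B.T @ B + jitter * np.eye(B.shape[1])   # B^T B, optionally stabilized by eps * I
    lam = np.linalg.solve(gram, B.T @ x)           # coordinates from the normal equation (3.57)
    projection = B @ lam                           # projection point (3.58)
    P = B @ np.linalg.solve(gram, B.T)             # projection matrix (3.59)
    return lam, projection, P

Example 3.11 below carries out exactly these steps by hand.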

Example 3.11 (Projectiononto  a Two-dimensional


 Subspace)
 
1 0 6
For a subspace U = span[1 , 1] ⊆ R3 and x = 0 ∈ R3 find the
1 2 0
coordinates λ of x in terms of the subspace U , the projection point πU (x)
and the projection matrix P π .
First, we see that the generating set of U is a basis (linear
 indepen-

1 0
dence) and write the basis vectors of U into a matrix B = 1 1.
1 2
Second, we compute the matrix B ⊤ B and the vector B ⊤ x as
   
  1 0     6  
1 1 1  3 3 1 1 1   6
B⊤B = 1 1 = , B⊤x = 0 = .
0 1 2 3 5 0 1 2 0
1 2 0
(3.60)

Third, we solve the normal equation B ⊤ Bλ = B ⊤ x to find λ:


      
3 3 λ1 6 5
= ⇐⇒ λ = . (3.61)
3 5 λ2 0 −3
Fourth, the projection πU (x) of x onto U , i.e., into the column space of
B , can be directly computed via
 
5
πU (x) = Bλ =  2  . (3.62)
−1


The corresponding projection error is the norm of the difference vector between the original vector and its projection onto U, i.e.,
\|x - \pi_U(x)\| = \left\| [1, -2, 1]^\top \right\| = \sqrt{6} . (3.63)
(The projection error is also called the reconstruction error.)
Fifth, the projection matrix (for any x ∈ R³) is given by
P_\pi = B(B^\top B)^{-1}B^\top = \frac{1}{6}\begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix} . (3.64)

To verify the results, we can (a) check whether the displacement vector πU(x) − x is orthogonal to all basis vectors of U, and (b) verify that P_π = P_π² (see Definition 3.10).
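A quick numerical check of this example (illustrative, not from the book):

import numpy as np

B = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

lam = np.linalg.solve(B.T @ B, B.T @ x)    # normal equation (3.61): [5, -3]
projection = B @ lam                       # projection point (3.62): [5, 2, -1]
P = B @ np.linalg.solve(B.T @ B, B.T)      # projection matrix (3.64)

print(np.linalg.norm(x - projection))      # projection error sqrt(6) ≈ 2.449, cf. (3.63)
print(np.allclose(P @ P, P))               # True: P is idempotent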

Remark. The projections πU (x) are still vectors in Rn although they lie in
an m-dimensional subspace U ⊆ Rn . However, to represent a projected
vector we only need the m coordinates λ1 , . . . , λm with respect to the
basis vectors b1 , . . . , bm of U . ♢
Remark. In vector spaces with general inner products, we have to pay
attention when computing angles and distances, which are defined by
means of the inner product. ♢
Projections allow us to look at situations where we have a linear system Ax = b without a solution. Recall that this means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution: the idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least-squares solution (assuming the dot product as the inner product) of an overdetermined system; we can thus find approximate solutions to unsolvable linear equation systems using projections. This is discussed further in Section 9.4. Using reconstruction errors (3.63) is one possible approach to derive principal component analysis (Section 10.3).
Remark. We just looked at projections of vectors x onto a subspace U with basis vectors {b1, ..., bk}. If this basis is an ONB, i.e., (3.33) and (3.34) are satisfied, the projection equation (3.58) simplifies greatly to
\pi_U(x) = B B^\top x (3.65)
since B^⊤B = I, with coordinates
\lambda = B^\top x . (3.66)
This means that we no longer have to compute the inverse from (3.58), which saves computation time. ♢
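A small illustration (not from the book) of this simplification; here an ONB of the subspace is obtained via a QR decomposition:

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 2))                    # a basis of a two-dimensional subspace of R^5
Q, _ = np.linalg.qr(B)                         # columns of Q: an ONB of the same subspace
x = rng.normal(size=5)

full = B @ np.linalg.solve(B.T @ B, B.T @ x)   # projection via (3.58)
onb = Q @ (Q.T @ x)                            # projection via (3.65): no inverse required
print(np.allclose(full, onb))                  # True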


3.8.3 Gram-Schmidt Orthogonalization


Projections are at the core of the Gram-Schmidt method that allows us to constructively transform any basis (b1, ..., bn) of an n-dimensional vector space V into an orthogonal/orthonormal basis (u1, ..., un) of V. This basis always exists (Liesen and Mehrmann, 2015) and span[b1, ..., bn] = span[u1, ..., un]. The Gram-Schmidt orthogonalization method iteratively constructs an orthogonal basis (u1, ..., un) from any basis (b1, ..., bn) of V as follows:
u_1 := b_1 (3.67)
u_k := b_k - \pi_{\mathrm{span}[u_1,\dots,u_{k-1}]}(b_k) , \quad k = 2, \dots, n . (3.68)
In (3.68), the kth basis vector b_k is projected onto the subspace spanned by the first k − 1 constructed orthogonal vectors u_1, ..., u_{k-1}; see Section 3.8.2. This projection is then subtracted from b_k and yields a vector u_k that is orthogonal to the (k − 1)-dimensional subspace spanned by u_1, ..., u_{k-1}. Repeating this procedure for all n basis vectors b_1, ..., b_n yields an orthogonal basis (u_1, ..., u_n) of V. If we normalize the u_k, we obtain an ONB where ∥u_k∥ = 1 for k = 1, ..., n.

Example 3.12 (Gram-Schmidt Orthogonalization)

Figure 3.12: Gram-Schmidt orthogonalization. (a) Original non-orthogonal basis vectors b1, b2 of R²; (b) first constructed basis vector u1 = b1 and orthogonal projection of b2 onto span[u1]; (c) orthogonal basis (u1, u2) of R², where u2 = b2 − πspan[u1](b2).

Consider a basis (b1, b2) of R², where
b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix} , \quad b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} ; (3.69)
see also Figure 3.12(a). Using the Gram-Schmidt method, we construct an orthogonal basis (u1, u2) of R² as follows (assuming the dot product as the inner product):
u_1 := b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix} , (3.70)
u_2 := b_2 - \pi_{\mathrm{span}[u_1]}(b_2) \overset{(3.45)}{=} b_2 - \frac{u_1 u_1^\top}{\|u_1\|^2} b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} . (3.71)

Figure 3.13: Projection onto an affine space. (a) Original setting; (b) the setting is shifted by −x0 so that x − x0 can be projected onto the direction space U = L − x0; (c) the projection is translated back to x0 + πU(x − x0), which gives the final orthogonal projection πL(x).

These steps are illustrated in Figures 3.12(b) and (c). We immediately see that u1 and u2 are orthogonal, i.e., u1^⊤ u2 = 0.
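The following sketch (illustrative, not from the book) implements (3.67)–(3.68) with the dot product as the inner product and reproduces this example:

import numpy as np

def gram_schmidt(B):
    """Turn the columns of B (a basis) into an orthogonal basis, returned column-wise."""
    U = np.zeros_like(B, dtype=float)
    for k in range(B.shape[1]):
        u = B[:, k].astype(float)
        for j in range(k):                  # subtract projections onto u_1, ..., u_{k-1}
            u = u - (U[:, j] @ B[:, k]) / (U[:, j] @ U[:, j]) * U[:, j]
        U[:, k] = u
    return U

B = np.array([[2.0, 1.0], [0.0, 1.0]])      # columns b1, b2 from (3.69)
U = gram_schmidt(B)
print(U)                                    # columns [2, 0] and [0, 1], cf. (3.70)-(3.71)
print(U[:, 0] @ U[:, 1])                    # 0.0: u1 and u2 are orthogonal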

3.8.4 Projection onto Affine Subspaces


Thus far, we discussed how to project a vector onto a lower-dimensional
subspace U . In the following, we provide a solution to projecting a vector
onto an affine subspace.
Consider the setting in Figure 3.13(a). We are given an affine space L =
x0 + U , where b1 , b2 are basis vectors of U . To determine the orthogonal
projection πL (x) of x onto L, we transform the problem into a problem
that we know how to solve: the projection onto a vector subspace. In
order to get there, we subtract the support point x0 from x and from L,
so that L − x0 = U is exactly the vector subspace U . We can now use the
orthogonal projections onto a subspace we discussed in Section 3.8.2 and
obtain the projection πU (x − x0 ), which is illustrated in Figure 3.13(b).
This projection can now be translated back into L by adding x0 , such that
we obtain the orthogonal projection onto an affine space L as
πL (x) = x0 + πU (x − x0 ) , (3.72)
where πU (·) is the orthogonal projection onto the subspace U , i.e., the
direction space of L; see Figure 3.13(c).
From Figure 3.13, it is also evident that the distance of x from the affine
space L is identical to the distance of x − x0 from U , i.e.,
d(x, L) = ∥x − πL (x)∥ = ∥x − (x0 + πU (x − x0 ))∥ (3.73a)
= d(x − x0 , πU (x − x0 )) = d(x − x0 , U ) . (3.73b)
We will use projections onto an affine subspace to derive the concept of
a separating hyperplane in Section 12.1.
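A minimal sketch of (3.72) (illustrative, not from the book; the direction-space basis and support point below are made-up examples):

import numpy as np

def project_onto_affine(B, x0, x):
    """Project x onto the affine space L = x0 + U, where U is spanned by the columns of B."""
    shifted = x - x0                                    # reduce to a subspace problem
    pi_U = B @ np.linalg.solve(B.T @ B, B.T @ shifted)  # orthogonal projection onto U
    return x0 + pi_U                                    # translate back, cf. (3.72)

B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])      # U: the x-y plane in R^3
x0 = np.array([0.0, 0.0, 1.0])                          # support point of L
x = np.array([2.0, 3.0, 5.0])
pi_L = project_onto_affine(B, x0, x)
print(pi_L)                                             # [2. 3. 1.]
print(np.linalg.norm(x - pi_L))                         # 4.0 = d(x, L), cf. (3.73a)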


Figure 3.14: A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise. (Shown: an original object and the object rotated by 112.5°.)

Figure 3.15: The robotic arm needs to rotate its joints in order to pick up objects or to place them correctly. Figure taken from (Deisenroth et al., 2015).

3.9 Rotations
Length and angle preservation, as discussed in Section 3.4, are the two
characteristics of linear mappings with orthogonal transformation matri-
ces. In the following, we will have a closer look at specific orthogonal
transformation matrices, which describe rotations.
A rotation is a linear mapping (more specifically, an automorphism of a Euclidean vector space) that rotates a plane by an angle θ about the origin, i.e., the origin is a fixed point. For a positive angle θ > 0, by common convention, we rotate in a counterclockwise direction. An example is shown in Figure 3.14, where the transformation matrix is
R = \begin{bmatrix} -0.38 & -0.92 \\ 0.92 & -0.38 \end{bmatrix} . (3.74)
Important application areas of rotations include computer graphics and
robotics. For example, in robotics, it is often important to know how to
rotate the joints of a robotic arm in order to pick up or place an object,
see Figure 3.15.


Figure 3.16: Rotation of the standard basis in R² by an angle θ, with Φ(e1) = [cos θ, sin θ]^⊤ and Φ(e2) = [−sin θ, cos θ]^⊤.

3.9.1 Rotations in R2
    
Consider the standard basis e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix} of R², which defines the standard coordinate system in R². We aim to rotate this coordinate system by an angle θ as illustrated in Figure 3.16. Note that the rotated vectors are still linearly independent and, therefore, are a basis of R². This means that the rotation performs a basis change.
Rotations Φ are linear mappings so that we can express them by a rotation matrix R(θ). Trigonometry (see Figure 3.16) allows us to determine the coordinates of the rotated axes (the image of Φ) with respect to the standard basis in R². We obtain
\Phi(e_1) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} , \quad \Phi(e_2) = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} . (3.75)
Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as
R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} . (3.76)
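A short illustration (not from the book) that builds R(θ) and applies it:

import numpy as np

def rotation_2d(theta):
    """Rotation matrix (3.76) for a counterclockwise rotation by theta radians."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

R = rotation_2d(np.pi / 6)                 # rotate by 30 degrees
print(R @ np.array([1.0, 0.0]))            # image of e1: [cos 30°, sin 30°] ≈ [0.866, 0.5]
print(np.allclose(R.T @ R, np.eye(2)))     # True: R is an orthogonal matrix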

3.9.2 Rotations in R3
In contrast to the R2 case, in R3 we can rotate any two-dimensional plane
about a one-dimensional axis. The easiest way to specify the general rota-
tion matrix is to specify how the images of the standard basis e1 , e2 , e3 are
supposed to be rotated, and making sure these images Re1 , Re2 , Re3 are
orthonormal to each other. We can then obtain a general rotation matrix
R by combining the images of the standard basis.
To have a meaningful rotation angle, we have to define what “coun-
terclockwise” means when we operate in more than two dimensions. We
use the convention that a “counterclockwise” (planar) rotation about an
axis refers to a rotation about an axis when we look at the axis “head on,
from the end toward the origin”. In R3 , there are therefore three (planar)
rotations about the three standard basis vectors (see Figure 3.17):


Figure 3.17: Rotation of a vector (gray) in R³ by an angle θ about the e3-axis. The rotated vector is shown in blue.

Rotation about the e1-axis:
R_1(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) & \Phi(e_3) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} . (3.77)
Here, the e1 coordinate is fixed, and the counterclockwise rotation is performed in the e2–e3 plane.
Rotation about the e2-axis:
R_2(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} . (3.78)
If we rotate the e1–e3 plane about the e2-axis, we need to look at the e2-axis from its "tip" toward the origin.
Rotation about the e3-axis:
R_3(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} . (3.79)
Figure 3.17 illustrates this.

3.9.3 Rotations in n Dimensions


The generalization of rotations from 2D and 3D to n-dimensional Euclidean vector spaces can be intuitively described as fixing n − 2 dimensions and restricting the rotation to a two-dimensional plane in the n-dimensional space. As in the three-dimensional case, we can rotate any plane (two-dimensional subspace of Rⁿ).

Definition 3.11 (Givens Rotation). Let V be an n-dimensional Euclidean vector space and Φ : V → V an automorphism with transformation matrix
R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0 \\ 0 & \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 0 & I_{j-i-1} & 0 & 0 \\ 0 & \sin\theta & 0 & \cos\theta & 0 \\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n\times n} , (3.80)
for 1 ⩽ i < j ⩽ n and θ ∈ R. Then R_{ij}(θ) is called a Givens rotation. Essentially, R_{ij}(θ) is the identity matrix I_n with
r_{ii} = \cos\theta , \quad r_{ij} = -\sin\theta , \quad r_{ji} = \sin\theta , \quad r_{jj} = \cos\theta . (3.81)
In two dimensions (i.e., n = 2), we obtain (3.76) as a special case.
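A sketch (illustrative, not from the book) that constructs R_ij(θ) by modifying the identity matrix as in (3.81); note that it uses 0-based indices:

import numpy as np

def givens(n, i, j, theta):
    """Givens rotation R_ij(theta) in R^{n x n}, with 0-based indices i < j."""
    R = np.eye(n)
    R[i, i] = np.cos(theta)
    R[j, j] = np.cos(theta)
    R[i, j] = -np.sin(theta)
    R[j, i] = np.sin(theta)
    return R

R = givens(4, 1, 3, np.pi / 4)              # rotate the plane spanned by e2 and e4 in R^4
print(np.allclose(R.T @ R, np.eye(4)))      # True: Givens rotations are orthogonal
print(np.isclose(np.linalg.det(R), 1.0))    # True: determinant 1, i.e., a proper rotation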

3.9.4 Properties of Rotations


Rotations exhibit a number of useful properties, which can be derived by
considering them as orthogonal matrices (Definition 3.8):
Rotations preserve distances, i.e., ∥x−y∥ = ∥Rθ (x)−Rθ (y)∥. In other
words, rotations leave the distance between any two points unchanged
after the transformation.
Rotations preserve angles, i.e., the angle between Rθ x and Rθ y equals
the angle between x and y .
Rotations in three (or more) dimensions are generally not commuta-
tive. Therefore, the order in which rotations are applied is important,
even if they rotate about the same point. Only in two dimensions are vector rotations commutative, such that R(ϕ)R(θ) = R(θ)R(ϕ) for all
ϕ, θ ∈ [0, 2π). They form an Abelian group (with multiplication) only if
they rotate about the same point (e.g., the origin).

3.10 Further Reading


In this chapter, we gave a brief overview of some of the important concepts
of analytic geometry, which we will use in later chapters of the book.
For a broader and more in-depth overview of some of the concepts we
presented, we refer to the following excellent books: Axler (2015) and
Boyd and Vandenberghe (2018).
Inner products allow us to determine specific bases of vector (sub)spaces,
where each vector is orthogonal to all others (orthogonal bases) using the
Gram-Schmidt method. These bases are important in optimization and
numerical algorithms for solving linear equation systems. For instance,
Krylov subspace methods, such as conjugate gradients or the generalized
minimal residual method (GMRES), minimize residual errors that are or-
thogonal to each other (Stoer and Burlirsch, 2002).
In machine learning, inner products are important in the context of

kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the
fact that many linear algorithms can be expressed purely by inner prod-
uct computations. Then, the “kernel trick” allows us to compute these
inner products implicitly in a (potentially infinite-dimensional) feature
space, without even knowing this feature space explicitly. This allowed the
“non-linearization” of many algorithms used in machine learning, such as
kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaus-
sian processes (Rasmussen and Williams, 2006) also fall into the category
of kernel methods and are the current state of the art in probabilistic re-
gression (fitting curves to data points). The idea of kernels is explored
further in Chapter 12.
Projections are often used in computer graphics, e.g., to generate shad-
ows. In optimization, orthogonal projections are often used to (iteratively)
minimize residual errors. This also has applications in machine learning,
e.g., in linear regression where we want to find a (linear) function that
minimizes the residual errors, i.e., the lengths of the orthogonal projec-
tions of the data onto the linear function (Bishop, 2006). We will investi-
gate this further in Chapter 9. PCA (Pearson, 1901; Hotelling, 1933) also
uses projections to reduce the dimensionality of high-dimensional data.
We will discuss this in more detail in Chapter 10.


Exercises
3.1 Show that ⟨·, ·⟩ defined for all x = [x1 , x2 ]⊤ ∈ R2 and y = [y1 , y2 ]⊤ ∈ R2 by

⟨x, y⟩ := x1 y1 − (x1 y2 + x2 y1 ) + 2(x2 y2 )

is an inner product.
3.2 Consider R² with ⟨·, ·⟩ defined for all x and y in R² as
    \langle x, y \rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=:A} y .
    Is ⟨·, ·⟩ an inner product?


3.3 Compute the distance between
    x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} , \quad y = \begin{bmatrix} -1 \\ -1 \\ 0 \end{bmatrix}
    using
    a. ⟨x, y⟩ := x^⊤ y
    b. ⟨x, y⟩ := x^⊤ A y , A := \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}
3.4 Compute the angle between
    x = \begin{bmatrix} 1 \\ 2 \end{bmatrix} , \quad y = \begin{bmatrix} -1 \\ -1 \end{bmatrix}
    using
    a. ⟨x, y⟩ := x^⊤ y
    b. ⟨x, y⟩ := x^⊤ B y , B := \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}
3.5 Consider the Euclidean vector space R⁵ with the dot product. A subspace U ⊆ R⁵ and x ∈ R⁵ are given by
    U = \mathrm{span}\left[ \begin{bmatrix} 0 \\ -1 \\ 2 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ -3 \\ 1 \\ -1 \\ 2 \end{bmatrix}, \begin{bmatrix} -3 \\ 4 \\ 1 \\ 2 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -3 \\ 5 \\ 0 \\ 7 \end{bmatrix} \right] , \quad x = \begin{bmatrix} -1 \\ -9 \\ -1 \\ 4 \\ 1 \end{bmatrix} .
    a. Determine the orthogonal projection πU(x) of x onto U.
    b. Determine the distance d(x, U).
3.6 Consider R³ with the inner product
    \langle x, y \rangle := x^\top \begin{bmatrix} 2 & 1 & 0 \\ 1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix} y .
    Furthermore, we define e_1, e_2, e_3 as the standard/canonical basis in R³.


a. Determine the orthogonal projection πU (e2 ) of e2 onto


U = span[e1 , e3 ] .

Hint: Orthogonality is defined through the inner product.


b. Compute the distance d(e2 , U ).
c. Draw the scenario: standard basis vectors and πU (e2 )
3.7 Let V be a vector space and π an endomorphism of V .
a. Prove that π is a projection if and only if idV − π is a projection, where
idV is the identity endomorphism on V .
b. Assume now that π is a projection. Calculate Im(idV −π) and ker(idV −π)
as a function of Im(π) and ker(π).

3.8 Using the Gram-Schmidt method, turn the basis B = (b_1, b_2) of a two-dimensional subspace U ⊆ R³ into an ONB C = (c_1, c_2) of U, where
    b_1 := \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} , \quad b_2 := \begin{bmatrix} -1 \\ 2 \\ 0 \end{bmatrix} .

3.9 Let n ∈ N and let x_1, ..., x_n > 0 be n positive real numbers so that x_1 + ... + x_n = 1. Use the Cauchy-Schwarz inequality and show that
    a. \sum_{i=1}^{n} x_i^2 \geqslant \frac{1}{n}
    b. \sum_{i=1}^{n} \frac{1}{x_i} \geqslant n^2
    Hint: Think about the dot product on Rⁿ. Then, choose specific vectors x, y ∈ Rⁿ and apply the Cauchy-Schwarz inequality.
3.10 Rotate the vectors
    x_1 := \begin{bmatrix} 2 \\ 3 \end{bmatrix} , \quad x_2 := \begin{bmatrix} 0 \\ -1 \end{bmatrix}
    by 30°.



4 Matrix Decompositions

In Chapters 2 and 3, we studied ways to manipulate and measure vectors,


projections of vectors, and linear mappings. Mappings and transforma-
tions of vectors can be conveniently described as operations performed by
matrices. Moreover, data is often represented in matrix form as well, e.g.,
where the rows of the matrix represent different people and the columns
describe different features of the people, such as weight, height, and socio-
economic status. In this chapter, we present three aspects of matrices: how
to summarize matrices, how matrices can be decomposed, and how these
decompositions can be used for matrix approximations.
We first consider methods that allow us to describe matrices with just
a few numbers that characterize the overall properties of matrices. We
will do this in the sections on determinants (Section 4.1) and eigenval-
ues (Section 4.2) for the important special case of square matrices. These
characteristic numbers have important mathematical consequences and
allow us to quickly grasp what useful properties a matrix has. From here
we will proceed to matrix decomposition methods: An analogy for ma-
trix decomposition is the factoring of numbers, such as the factoring of
21 into prime numbers 7 · 3. For this reason matrix decomposition is also often referred to as matrix factorization. Matrix decompositions are used
to describe a matrix by means of a different representation using factors
of interpretable matrices.
We will first cover a square-root-like operation for symmetric, positive
definite matrices, the Cholesky decomposition (Section 4.3). From here
we will look at two related methods for factorizing matrices into canoni-
cal forms. The first one is known as matrix diagonalization (Section 4.4),
which allows us to represent the linear mapping using a diagonal trans-
formation matrix if we choose an appropriate basis. The second method,
singular value decomposition (Section 4.5), extends this factorization to
non-square matrices, and it is considered one of the fundamental concepts
in linear algebra. These decompositions are helpful, as matrices represent-
ing numerical data are often very large and hard to analyze. We conclude
the chapter with a systematic overview of the types of matrices and the
characteristic properties that distinguish them in the form of a matrix tax-
onomy (Section 4.7).
The methods that we cover in this chapter will become important in

both subsequent mathematical chapters, such as Chapter 6, but also in applied chapters, such as dimensionality reduction in Chapter 10 or density estimation in Chapter 11. This chapter's overall structure is depicted in the mind map of Figure 4.1.

Figure 4.1: A mind map of the concepts introduced in this chapter (determinant, invertibility, Cholesky, eigenvalues, eigenvectors, orthogonal matrix, diagonalization, SVD), along with where they are used in other parts of the book (Chapter 6: Probability and distributions; Chapter 10: Dimensionality reduction).

4.1 Determinant and Trace

Determinants are important concepts in linear algebra. A determinant is a mathematical object in the analysis and solution of systems of linear equations. Determinants are only defined for square matrices A ∈ R^{n×n}, i.e., matrices with the same number of rows and columns. (The determinant notation |A| must not be confused with the absolute value.) In this book, we write the determinant as det(A) or sometimes as |A| so that
\det(A) = \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix} . (4.1)
The determinant of a square matrix A ∈ R^{n×n} is a function that maps A
onto a real number. Before providing a definition of the determinant for


general n × n matrices, let us have a look at some motivating examples,
and define determinants for some special matrices.

Example 4.1 (Testing for Matrix Invertibility)


Let us begin with exploring if a square matrix A is invertible (see Section 2.2.2). For the smallest cases, we already know when a matrix is invertible. If A is a 1 × 1 matrix, i.e., it is a scalar number, then A = a ⟹ A^{-1} = 1/a. Thus a · (1/a) = 1 holds if and only if a ≠ 0.
For 2 × 2 matrices, by the definition of the inverse (Definition 2.3), we know that AA^{-1} = I. Then, with (2.24), the inverse of A is
A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} . (4.2)
Hence, A is invertible if and only if
a_{11}a_{22} - a_{12}a_{21} \neq 0 . (4.3)
This quantity is the determinant of A ∈ R^{2×2}, i.e.,
\det(A) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21} . (4.4)

Example 4.1 points already at the relationship between determinants


and the existence of inverse matrices. The next theorem states the same
result for n × n matrices.
Theorem 4.1. For any square matrix A ∈ Rn×n it holds that A is invertible
if and only if det(A) ̸= 0.
We have explicit (closed-form) expressions for determinants of small
matrices in terms of the elements of the matrix. For n = 1,
det(A) = det(a11 ) = a11 . (4.5)
For n = 2,
\det(A) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21} , (4.6)
which we have observed in the preceding example.
For n = 3 (known as Sarrus' rule),
\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11}a_{22}a_{33} + a_{21}a_{32}a_{13} + a_{31}a_{12}a_{23} - a_{31}a_{22}a_{13} - a_{11}a_{32}a_{23} - a_{21}a_{12}a_{33} . (4.7)


For a memory aid of the product terms in Sarrus’ rule, try tracing the
elements of the triple products in the matrix.
We call a square matrix T an upper-triangular matrix if T_{ij} = 0 for i > j, i.e., the matrix is zero below its diagonal. Analogously, we define a lower-triangular matrix as a matrix with zeros above its diagonal. For a triangular matrix T ∈ R^{n×n}, the determinant is the product of the diagonal elements, i.e.,
\det(T) = \prod_{i=1}^{n} T_{ii} . (4.8)
Example 4.2 (Determinants as Measures of Volume)
The notion of a determinant is natural when we consider it as a mapping from a set of n vectors spanning an object in Rⁿ. It turns out that the determinant det(A) is the signed volume of an n-dimensional parallelepiped formed by columns of the matrix A.
For n = 2, the columns of the matrix form a parallelogram; see Figure 4.2. As the angle between vectors gets smaller, the area of a parallelogram shrinks, too. Consider two vectors b, g that form the columns of a matrix A = [b, g]. Then, the absolute value of the determinant of A is the area of the parallelogram with vertices 0, b, g, b + g. In particular, if b, g are linearly dependent so that b = λg for some λ ∈ R, they no longer form a two-dimensional parallelogram. Therefore, the corresponding area is 0. On the contrary, if b, g are linearly independent and are multiples of the canonical basis vectors e1, e2, then they can be written as b = [b, 0]^⊤ and g = [0, g]^⊤, and the determinant is
\begin{vmatrix} b & 0 \\ 0 & g \end{vmatrix} = bg - 0 = bg .

Figure 4.2: The area of the parallelogram (shaded region) spanned by the vectors b and g is |det([b, g])|.

The sign of the determinant indicates the orientation of the spanning vectors b, g with respect to the standard basis (e1, e2). In our figure, flipping the order to g, b swaps the columns of A and reverses the orientation of the shaded area. This becomes the familiar formula: area = height × length. This intuition extends to higher dimensions. In R³, we consider three vectors r, b, g ∈ R³ spanning the edges of a parallelepiped, i.e., a solid with faces that are parallel parallelograms (see Figure 4.3). The absolute value of the determinant of the 3 × 3 matrix [r, b, g] is the volume of the solid. Thus, the determinant acts as a function that measures the signed volume formed by column vectors composed in a matrix.

Figure 4.3: The volume of the parallelepiped (shaded volume) spanned by vectors r, b, g is |det([r, b, g])|.

Consider the three linearly independent vectors r, g, b ∈ R³ given as
r = \begin{bmatrix} 2 \\ 0 \\ -8 \end{bmatrix} , \quad g = \begin{bmatrix} 6 \\ 1 \\ 0 \end{bmatrix} , \quad b = \begin{bmatrix} 1 \\ 4 \\ -1 \end{bmatrix} . (4.9)
Writing these vectors as the columns of a matrix
A = [r, g, b] = \begin{bmatrix} 2 & 6 & 1 \\ 0 & 1 & 4 \\ -8 & 0 & -1 \end{bmatrix} (4.10)
allows us to compute the desired volume as
V = |\det(A)| = 186 . (4.11)
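A quick numerical check (illustrative, not from the book):

import numpy as np

A = np.array([[2.0, 6.0, 1.0],
              [0.0, 1.0, 4.0],
              [-8.0, 0.0, -1.0]])       # columns r, g, b from (4.10)
print(abs(np.linalg.det(A)))            # ≈ 186: the volume of the parallelepiped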

Computing the determinant of an n × n matrix requires a general algo-


rithm to solve the cases for n > 3, which we are going to explore in the fol-
lowing. Theorem 4.2 below reduces the problem of computing the deter-
minant of an n×n matrix to computing the determinant of (n−1)×(n−1)
matrices. By recursively applying the Laplace expansion (Theorem 4.2),
we can therefore compute determinants of n × n matrices by ultimately
computing determinants of 2 × 2 matrices.
Theorem 4.2 (Laplace Expansion). Consider a matrix A ∈ R^{n×n}. Then, for all j = 1, ..., n:
1. Expansion along column j:
   \det(A) = \sum_{k=1}^{n} (-1)^{k+j} a_{kj} \det(A_{k,j}) . (4.12)
2. Expansion along row j:
   \det(A) = \sum_{k=1}^{n} (-1)^{k+j} a_{jk} \det(A_{j,k}) . (4.13)
Here A_{k,j} ∈ R^{(n−1)×(n−1)} is the submatrix of A that we obtain when deleting row k and column j. The term det(A_{k,j}) is called a minor and (−1)^{k+j} det(A_{k,j}) a cofactor.

Example 4.3 (Laplace Expansion)
Let us compute the determinant of
A = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \\ 0 & 0 & 1 \end{bmatrix} (4.14)
using the Laplace expansion along the first row. Applying (4.13) yields
\begin{vmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \\ 0 & 0 & 1 \end{vmatrix} = (-1)^{1+1} \cdot 1 \begin{vmatrix} 1 & 2 \\ 0 & 1 \end{vmatrix} + (-1)^{1+2} \cdot 2 \begin{vmatrix} 3 & 2 \\ 0 & 1 \end{vmatrix} + (-1)^{1+3} \cdot 3 \begin{vmatrix} 3 & 1 \\ 0 & 0 \end{vmatrix} . (4.15)


We use (4.6) to compute the determinants of all 2 × 2 matrices and obtain


det(A) = 1(1 − 0) − 2(3 − 0) + 3(0 − 0) = −5 . (4.16)
For completeness we can compare this result to computing the determi-
nant using Sarrus’ rule (4.7):
det(A) = 1·1·1+3·0·3+0·2·2−0·1·3−1·0·2−3·2·1 = 1−6 = −5 . (4.17)
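For illustration (not from the book), a direct recursive implementation of the expansion along the first row; it is exponentially slow and only meant to mirror (4.13):

import numpy as np

def laplace_det(A):
    """Determinant via Laplace expansion along the first row (j = 1 in (4.13))."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for k in range(n):
        minor = np.delete(np.delete(A, 0, axis=0), k, axis=1)   # delete row 1 and column k+1
        total += (-1) ** k * A[0, k] * laplace_det(minor)       # cofactor times matrix entry
    return total

A = np.array([[1.0, 2.0, 3.0], [3.0, 1.0, 2.0], [0.0, 0.0, 1.0]])
print(laplace_det(A), np.linalg.det(A))    # both -5 (up to floating-point error)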

For A ∈ Rn×n the determinant exhibits the following properties:

The determinant of a matrix product is the product of the corresponding


determinants, det(AB) = det(A)det(B).
Determinants are invariant to transposition, i.e., det(A) = det(A⊤ ).
If A is regular (invertible), then det(A^{-1}) = \frac{1}{\det(A)}.
Similar matrices (Definition 2.22) possess the same determinant. There-
fore, for a linear mapping Φ : V → V all transformation matrices AΦ
of Φ have the same determinant. Thus, the determinant is invariant to
the choice of basis of a linear mapping.
Adding a multiple of a column/row to another one does not change
det(A).
Multiplication of a column/row with λ ∈ R scales det(A) by λ. In
particular, det(λA) = λn det(A).
Swapping two rows/columns changes the sign of det(A).
Because of the last three properties, we can use Gaussian elimination (see
Section 2.1) to compute det(A) by bringing A into row-echelon form.
We can stop Gaussian elimination when we have A in a triangular form
where the elements below the diagonal are all 0. Recall from (4.8) that the
determinant of a triangular matrix is the product of the diagonal elements.
Theorem 4.3. A square matrix A ∈ Rn×n has det(A) ̸= 0 if and only if
rk(A) = n. In other words, A is invertible if and only if it is full rank.
When mathematics was mainly performed by hand, the determinant
calculation was considered an essential way to analyze matrix invertibil-
ity. However, contemporary approaches in machine learning use direct
numerical methods that superseded the explicit calculation of the deter-
minant. For example, in Chapter 2, we learned that inverse matrices can
be computed by Gaussian elimination. Gaussian elimination can thus be
used to compute the determinant of a matrix.
Determinants will play an important theoretical role for the following
sections, especially when we learn about eigenvalues and eigenvectors
(Section 4.2) through the characteristic polynomial.
Definition 4.4. The trace of a square matrix A ∈ R^{n×n} is defined as
\mathrm{tr}(A) := \sum_{i=1}^{n} a_{ii} , (4.18)
i.e., the trace is the sum of the diagonal elements of A.


The trace satisfies the following properties:
tr(A + B) = tr(A) + tr(B) for A, B ∈ Rn×n
tr(αA) = αtr(A) , α ∈ R for A ∈ Rn×n
tr(I n ) = n
tr(AB) = tr(BA) for A ∈ Rn×k , B ∈ Rk×n
It can be shown that only one function satisfies these four properties to-
gether – the trace (Gohberg et al., 2012).
The properties of the trace of matrix products are more general. Specifically, the trace is invariant under cyclic permutations, i.e.,
\mathrm{tr}(AKL) = \mathrm{tr}(KLA) (4.19)
for matrices A ∈ Ra×k , K ∈ Rk×l , L ∈ Rl×a . This property generalizes to
products of an arbitrary number of matrices. As a special case of (4.19), it
follows that for two vectors x, y ∈ Rn
tr(xy ⊤ ) = tr(y ⊤ x) = y ⊤ x ∈ R . (4.20)
Given a linear mapping Φ : V → V , where V is a vector space, we
define the trace of this map by using the trace of matrix representation
of Φ. For a given basis of V , we can describe Φ by means of the transfor-
mation matrix A. Then the trace of Φ is the trace of A. For a different
basis of V , it holds that the corresponding transformation matrix B of Φ
can be obtained by a basis change of the form S −1 AS for suitable S (see
Section 2.7.2). For the corresponding trace of Φ, this means
\mathrm{tr}(B) = \mathrm{tr}(S^{-1}AS) \overset{(4.19)}{=} \mathrm{tr}(ASS^{-1}) = \mathrm{tr}(A) . (4.21)
Hence, while matrix representations of linear mappings are basis depen-
dent the trace of a linear mapping Φ is independent of the basis.
In this section, we covered determinants and traces as functions char-
acterizing a square matrix. Taking together our understanding of determi-
nants and traces we can now define an important equation describing a
matrix A in terms of a polynomial, which we will use extensively in the
following sections.
Definition 4.5 (Characteristic Polynomial). For λ ∈ R and a square matrix A ∈ R^{n×n},
p_A(\lambda) := \det(A - \lambda I) (4.22a)
= c_0 + c_1\lambda + c_2\lambda^2 + \dots + c_{n-1}\lambda^{n-1} + (-1)^n \lambda^n , (4.22b)
c_0, ..., c_{n-1} ∈ R, is the characteristic polynomial of A. In particular,
c_0 = \det(A) , (4.23)
c_{n-1} = (-1)^{n-1}\,\mathrm{tr}(A) . (4.24)
The characteristic polynomial (4.22a) will allow us to compute eigen-
values and eigenvectors, covered in the next section.
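An illustrative check of (4.23) and (4.24) (not from the book), using numpy.poly, which returns the coefficients of det(λI − A) = (−1)ⁿ p_A(λ) from the highest degree down:

import numpy as np

A = np.array([[4.0, 2.0], [1.0, 3.0]])      # a 2 x 2 example
n = A.shape[0]
d = np.poly(A)                              # coefficients of det(lambda*I - A): [1, -7, 10]

c0 = (-1) ** n * d[-1]                      # constant term of p_A
c_nm1 = (-1) ** n * d[1]                    # coefficient of lambda^(n-1) in p_A
print(np.isclose(c0, np.linalg.det(A)))                    # True: c_0 = det(A), cf. (4.23)
print(np.isclose(c_nm1, (-1) ** (n - 1) * np.trace(A)))    # True: c_{n-1} = (-1)^(n-1) tr(A), cf. (4.24)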

4.2 Eigenvalues and Eigenvectors


We will now get to know a new way to characterize a matrix and its associ-
ated linear mapping. Recall from Section 2.7.1 that every linear mapping
has a unique transformation matrix given an ordered basis. We can in-
terpret linear mappings and their associated transformation matrices by
performing an "eigen" analysis. (Eigen is a German word meaning "characteristic", "self", or "own".) As we will see, the eigenvalues of a linear mapping will tell us how a special set of vectors, the eigenvectors, is transformed by the linear mapping.

Definition 4.6. Let A ∈ R^{n×n} be a square matrix. Then λ ∈ R is an eigenvalue of A and x ∈ Rⁿ \ {0} is the corresponding eigenvector of A if
Ax = \lambda x . (4.25)
We call (4.25) the eigenvalue equation.

Remark. In the linear algebra literature and software, it is often a conven-


tion that eigenvalues are sorted in descending order, so that the largest
eigenvalue and associated eigenvector are called the first eigenvalue and
its associated eigenvector, and the second largest called the second eigen-
value and its associated eigenvector, and so on. However, textbooks and
publications may have different or no notion of orderings. We do not want
to presume an ordering in this book if not stated explicitly. ♢
The following statements are equivalent:
λ is an eigenvalue of A ∈ Rn×n .
There exists an x ∈ Rn \{0} with Ax = λx, or equivalently, (A −
λI n )x = 0 can be solved non-trivially, i.e., x ̸= 0.
rk(A − λI n ) < n.
det(A − λI n ) = 0.
Definition 4.7 (Collinearity and Codirection). Two vectors that point in the same direction are called codirected. Two vectors are collinear if they point in the same or the opposite direction.

Remark (Non-uniqueness of eigenvectors). If x is an eigenvector of A


associated with eigenvalue λ, then for any c ∈ R\{0} it holds that cx is
an eigenvector of A with the same eigenvalue since
A(cx) = cAx = cλx = λ(cx) . (4.26)
Thus, all vectors that are collinear to x are also eigenvectors of A.


Theorem 4.8. λ ∈ R is an eigenvalue of A ∈ Rn×n if and only if λ is a


root of the characteristic polynomial pA (λ) of A.

Definition 4.9. Let a square matrix A have an eigenvalue λ_i. The algebraic multiplicity of λ_i is the number of times the root appears in the characteristic polynomial.

Definition 4.10 (Eigenspace and Eigenspectrum). For A ∈ R^{n×n}, the set of all eigenvectors of A associated with an eigenvalue λ spans a subspace of Rⁿ, which is called the eigenspace of A with respect to λ and is denoted by E_λ. The set of all eigenvalues of A is called the eigenspectrum, or just spectrum, of A.

If λ is an eigenvalue of A ∈ Rn×n , then the corresponding eigenspace


Eλ is the solution space of the homogeneous system of linear equations
(A−λI)x = 0. Geometrically, the eigenvector corresponding to a nonzero
eigenvalue points in a direction that is stretched by the linear mapping.
The eigenvalue is the factor by which it is stretched. If the eigenvalue is
negative, the direction of the stretching is flipped.

Example 4.4 (The Case of the Identity Matrix)


The identity matrix I ∈ R^{n×n} has characteristic polynomial p_I(λ) = det(I − λI) = (1 − λ)^n = 0, which has only one eigenvalue λ = 1 that occurs n times. Moreover, Ix = λx = 1x holds for all vectors x ∈ Rⁿ \ {0}.
Because of this, the sole eigenspace E1 of the identity matrix spans n di-
mensions, and all n standard basis vectors of Rn are eigenvectors of I .

Useful properties regarding eigenvalues and eigenvectors include the


following:

A matrix A and its transpose A⊤ possess the same eigenvalues, but not
necessarily the same eigenvectors.
The eigenspace Eλ is the null space of A − λI since

Ax = λx ⇐⇒ Ax − λx = 0 (4.27a)
⇐⇒ (A − λI)x = 0 ⇐⇒ x ∈ ker(A − λI). (4.27b)

Similar matrices (see Definition 2.22) possess the same eigenvalues.


Therefore, a linear mapping Φ has eigenvalues that are independent of
the choice of basis of its transformation matrix. This makes eigenvalues,
together with the determinant and the trace, key characteristic param-
eters of a linear mapping as they are all invariant under basis change.
Symmetric, positive definite matrices always have positive, real eigen-
values.


Example 4.5 (Computing Eigenvalues, Eigenvectors, and Eigenspaces)
Let us find the eigenvalues and eigenvectors of the 2 × 2 matrix
A = \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix} . (4.28)
Step 1: Characteristic Polynomial. From our definition of the eigen-
vector x ̸= 0 and eigenvalue λ of A, there will be a vector such that
Ax = λx, i.e., (A − λI)x = 0. Since x ̸= 0, this requires that the kernel
(null space) of A − λI contains more elements than just 0. This means
that A − λI is not invertible and therefore det(A − λI) = 0. Hence, we
need to compute the roots of the characteristic polynomial (4.22a) to find
the eigenvalues.
Step 2: Eigenvalues. The characteristic polynomial is
p_A(\lambda) = \det(A - \lambda I) (4.29a)
= \det\left( \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix} - \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} \right) = \begin{vmatrix} 4-\lambda & 2 \\ 1 & 3-\lambda \end{vmatrix} (4.29b)
= (4 - \lambda)(3 - \lambda) - 2 \cdot 1 . (4.29c)
We factorize the characteristic polynomial and obtain
p(\lambda) = (4 - \lambda)(3 - \lambda) - 2 \cdot 1 = 10 - 7\lambda + \lambda^2 = (2 - \lambda)(5 - \lambda) (4.30)
giving the roots λ_1 = 2 and λ_2 = 5.
Step 3: Eigenvectors and Eigenspaces. We find the eigenvectors that correspond to these eigenvalues by looking at vectors x such that
\begin{bmatrix} 4-\lambda & 2 \\ 1 & 3-\lambda \end{bmatrix} x = 0 . (4.31)
For λ = 5 we obtain
\begin{bmatrix} 4-5 & 2 \\ 1 & 3-5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -1 & 2 \\ 1 & -2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0 . (4.32)
We solve this homogeneous system and obtain a solution space
E_5 = \mathrm{span}\left[\begin{bmatrix} 2 \\ 1 \end{bmatrix}\right] . (4.33)
This eigenspace is one-dimensional as it possesses a single basis vector. Analogously, we find the eigenvector for λ = 2 by solving the homogeneous system of equations
\begin{bmatrix} 4-2 & 2 \\ 1 & 3-2 \end{bmatrix} x = \begin{bmatrix} 2 & 2 \\ 1 & 1 \end{bmatrix} x = 0 . (4.34)

   
This means any vector x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} with x_2 = -x_1, such as \begin{bmatrix} 1 \\ -1 \end{bmatrix}, is an eigenvector with eigenvalue 2. The corresponding eigenspace is given as
E_2 = \mathrm{span}\left[\begin{bmatrix} 1 \\ -1 \end{bmatrix}\right] . (4.35)
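A numerical check of this example (illustrative, not from the book). Note that np.linalg.eig normalizes eigenvectors to unit length, so the returned columns are scalar multiples of the basis vectors above:

import numpy as np

A = np.array([[4.0, 2.0], [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)                 # columns of eigvecs are unit-norm eigenvectors
print(eigvals)                                      # 5 and 2 (the order is not guaranteed)
print(eigvecs)                                      # multiples of [2, 1] and [1, -1]
print(np.allclose(A @ eigvecs, eigvecs * eigvals))  # True: A x = lambda x, column by column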

The two eigenspaces E5 and E2 in Example 4.5 are one-dimensional


as they are each spanned by a single vector. However, in other cases
we may have multiple identical eigenvalues (see Definition 4.9) and the
eigenspace may have more than one dimension.
Definition 4.11. Let λ_i be an eigenvalue of a square matrix A. Then the geometric multiplicity of λ_i is the number of linearly independent eigenvectors associated with λ_i. In other words, it is the dimensionality of the eigenspace spanned by the eigenvectors associated with λ_i.
Remark. A specific eigenvalue’s geometric multiplicity must be at least
one because every eigenvalue has at least one associated eigenvector. An
eigenvalue’s geometric multiplicity cannot exceed its algebraic multiplic-
ity, but it may be lower. ♢

Example 4.6
The matrix A = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix} has two repeated eigenvalues λ_1 = λ_2 = 2 and an algebraic multiplicity of 2. The eigenvalue has, however, only one distinct unit eigenvector x_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} and, thus, geometric multiplicity 1.

Graphical Intuition in Two Dimensions


Let us gain some intuition for determinants, eigenvectors, and eigenvalues using different linear mappings. Figure 4.4 depicts five transformation matrices A_1, ..., A_5 and their impact on a square grid of points, centered at the origin:

Figure 4.4: Determinants and eigenspaces. Overview of five linear mappings and their associated transformation matrices A_i ∈ R^{2×2}, projecting 400 color-coded points x ∈ R² (left column) onto target points A_i x (right column). The central column depicts the first eigenvector, stretched by its associated eigenvalue λ_1, and the second eigenvector, stretched by its eigenvalue λ_2. Each row depicts the effect of one of the five transformation matrices A_i with respect to the standard basis.

A_1 = \begin{bmatrix} \frac{1}{2} & 0 \\ 0 & 2 \end{bmatrix}. The directions of the two eigenvectors correspond to the canonical basis vectors in R², i.e., to two cardinal axes. The vertical axis is extended by a factor of 2 (eigenvalue λ_1 = 2), and the horizontal axis is compressed by factor 1/2 (eigenvalue λ_2 = 1/2). The mapping is area preserving (det(A_1) = 1 = 2 · 1/2).

A_2 = \begin{bmatrix} 1 & \frac{1}{2} \\ 0 & 1 \end{bmatrix} corresponds to a shearing mapping, i.e., it shears the points along the horizontal axis to the right if they are on the positive half of the vertical axis, and to the left vice versa. This mapping is area preserving (det(A_2) = 1). (In geometry, the area-preserving property of this type of shearing parallel to an axis is also known as Cavalieri's principle of equal areas for parallelograms (Katz, 2004).) The eigenvalue λ_1 = 1 = λ_2 is repeated and the eigenvectors are collinear (drawn in the figure for emphasis in two opposite directions). This indicates that the mapping acts only along one direction (the horizontal axis).

A_3 = \begin{bmatrix} \cos(\frac{\pi}{6}) & -\sin(\frac{\pi}{6}) \\ \sin(\frac{\pi}{6}) & \cos(\frac{\pi}{6}) \end{bmatrix} = \frac{1}{2}\begin{bmatrix} \sqrt{3} & -1 \\ 1 & \sqrt{3} \end{bmatrix}. The matrix A_3 rotates the points by π/6 rad = 30° counter-clockwise and has only complex eigenvalues, reflecting that the mapping is a rotation (hence, no eigenvectors are drawn). A rotation has to be volume preserving, and so the determinant is 1. For more details on rotations, we refer to Section 3.9.

A_4 = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} represents a mapping in the standard basis that collapses a two-dimensional domain onto one dimension. Since one eigenvalue is 0, the space in the direction of the (blue) eigenvector corresponding to λ_1 = 0 collapses, while the orthogonal (red) eigenvector stretches space by a factor λ_2 = 2. Therefore, the area of the image is 0.

A_5 = \begin{bmatrix} 1 & \frac{1}{2} \\ \frac{1}{2} & 1 \end{bmatrix} is a shear-and-stretch mapping that scales space by 75% since |det(A_5)| = 3/4. It stretches space along the (red) eigenvector of λ_2 by a factor 1.5 and compresses it along the orthogonal (blue) eigenvector by a factor 0.5.

Example 4.7 (Eigenspectrum of a Biological Neural Network)

Figure 4.5: Caenorhabditis elegans neural network (Kaiser and Hilgetag, 2006). (a) Symmetrized connectivity matrix (axes: neuron index); (b) eigenspectrum (eigenvalue against index of sorted eigenvalue).

Methods to analyze and learn from network data are an essential com-
ponent of machine learning methods. The key to understanding networks
is the connectivity between network nodes, especially if two nodes are
connected to each other or not. In data science applications, it is often
useful to study the matrix that captures this connectivity data.
We build a connectivity/adjacency matrix A ∈ R277×277 of the complete
neural network of the worm C. elegans. Each row/column represents one
of the 277 neurons of this worm’s brain. The connectivity matrix A has
a value of aij = 1 if neuron i talks to neuron j through a synapse, and
aij = 0 otherwise. The connectivity matrix is not symmetric, which im-
plies that eigenvalues may not be real valued. Therefore, we compute a
symmetrized version of the connectivity matrix as Asym := A + A⊤ . This
new matrix Asym is shown in Figure 4.5(a) and has a nonzero value aij if
and only if two neurons are connected (white pixels), irrespective of the
direction of the connection. In Figure 4.5(b), we show the correspond-
ing eigenspectrum of Asym . The horizontal axis shows the index of the
eigenvalues, sorted in descending order. The vertical axis shows the corre-
sponding eigenvalue. The S -like shape of this eigenspectrum is typical for
many biological neural networks. The underlying mechanism responsible
for this is an area of active neuroscience research.


Theorem 4.12. The eigenvectors x1 , . . . , xn of a matrix A ∈ Rn×n with n


distinct eigenvalues λ1 , . . . , λn are linearly independent.
This theorem states that eigenvectors of a matrix with n distinct eigen-
values form a basis of Rn .
Definition 4.13. A square matrix A ∈ R^{n×n} is defective if it possesses fewer than n linearly independent eigenvectors.
A non-defective matrix A ∈ Rn×n does not necessarily require n dis-
tinct eigenvalues, but it does require that the eigenvectors form a basis of
Rn . Looking at the eigenspaces of a defective matrix, it follows that the
sum of the dimensions of the eigenspaces is less than n. Specifically, a de-
fective matrix has at least one eigenvalue λi with an algebraic multiplicity
m > 1 and a geometric multiplicity of less than m.
Remark. A defective matrix cannot have n distinct eigenvalues, as distinct
eigenvalues have linearly independent eigenvectors (Theorem 4.12). ♢
Theorem 4.14. Given a matrix A ∈ Rm×n , we can always obtain a sym-
metric, positive semidefinite matrix S ∈ Rn×n by defining
S := A⊤ A . (4.36)
Remark. If rk(A) = n, then S := A⊤ A is symmetric, positive definite.

Understanding why Theorem 4.14 holds is insightful for how we can
use symmetrized matrices: Symmetry requires S = S ⊤ , and by insert-
ing (4.36) we obtain S = A⊤ A = A⊤ (A⊤ )⊤ = (A⊤ A)⊤ = S ⊤ . More-
over, positive semidefiniteness (Section 3.2.3) requires that x⊤ Sx ⩾ 0
and inserting (4.36) we obtain x⊤ Sx = x⊤ A⊤ Ax = (x⊤ A⊤ )(Ax) =
(Ax)⊤ (Ax) ⩾ 0, because the dot product computes a sum of squares
(which are themselves non-negative).
Theorem 4.15 (Spectral Theorem). If A ∈ R^{n×n} is symmetric, there exists an orthonormal basis of the corresponding vector space V consisting of eigenvectors of A, and each eigenvalue is real.
A direct implication of the spectral theorem is that the eigendecompo-
sition of a symmetric matrix A exists (with real eigenvalues), and that
we can find an ONB of eigenvectors so that A = P DP ⊤ , where D is
diagonal and the columns of P contain the eigenvectors.

Example 4.8
Consider the matrix
A = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 2 \\ 2 & 2 & 3 \end{bmatrix} . (4.37)


The characteristic polynomial of A is
p_A(\lambda) = -(\lambda - 1)^2(\lambda - 7) , (4.38)
so that we obtain the eigenvalues λ_1 = 1 and λ_2 = 7, where λ_1 is a repeated eigenvalue. Following our standard procedure for computing eigenvectors, we obtain the eigenspaces
E_1 = \mathrm{span}\left[\underbrace{\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}}_{=:x_1}, \underbrace{\begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}}_{=:x_2}\right], \quad E_7 = \mathrm{span}\left[\underbrace{\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}}_{=:x_3}\right] . (4.39)

We see that x_3 is orthogonal to both x_1 and x_2. However, since x_1^⊤ x_2 = 1 ≠ 0, they are not orthogonal. The spectral theorem (Theorem 4.15)

1 ̸= 0, they are not orthogonal. The spectral theorem (Theorem 4.15)
states that there exists an orthogonal basis, but the one we have is not
orthogonal. However, we can construct one.
To construct such a basis, we exploit the fact that x1 , x2 are eigenvec-
tors associated with the same eigenvalue λ. Therefore, for any α, β ∈ R it
holds that
A(αx1 + βx2 ) = Ax1 α + Ax2 β = λ(αx1 + βx2 ) , (4.40)
i.e., any linear combination of x1 and x2 is also an eigenvector of A as-
sociated with λ. The Gram-Schmidt algorithm (Section 3.8.3) is a method
for iteratively constructing an orthogonal/orthonormal basis from a set of
basis vectors using such linear combinations. Therefore, even if x1 and x2
are not orthogonal, we can apply the Gram-Schmidt algorithm and find
eigenvectors associated with λ1 = 1 that are orthogonal to each other
(and to x3 ). In our example, we will obtain
   
x'_1 = \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix} , \quad x'_2 = \frac{1}{2}\begin{bmatrix} -1 \\ -1 \\ 2 \end{bmatrix} , (4.41)

which are orthogonal to each other, orthogonal to x3 , and eigenvectors of


A associated with λ1 = 1.
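An illustrative check (not from the book) using np.linalg.eigh, which is tailored to symmetric matrices and returns real eigenvalues together with an orthonormal set of eigenvectors:

import numpy as np

A = np.array([[3.0, 2.0, 2.0], [2.0, 3.0, 2.0], [2.0, 2.0, 3.0]])
eigvals, P = np.linalg.eigh(A)                       # eigenvalues in ascending order: [1, 1, 7]
print(eigvals)
print(np.allclose(P.T @ P, np.eye(3)))               # True: the eigenvectors form an ONB
print(np.allclose(P @ np.diag(eigvals) @ P.T, A))    # True: A = P D P^T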

Before we conclude our considerations of eigenvalues and eigenvectors


it is useful to tie these matrix characteristics together with the concepts of
the determinant and the trace.
Theorem 4.16. The determinant of a matrix A ∈ Rn×n is the product of
its eigenvalues, i.e.,
\det(A) = \prod_{i=1}^{n} \lambda_i , (4.42)

where λi ∈ C are (possibly repeated) eigenvalues of A.


Figure 4.6: Geometric interpretation of eigenvalues. The eigenvectors of A get stretched by the corresponding eigenvalues. The area of the unit square changes by |λ_1 λ_2|, the perimeter changes by a factor of (|λ_1| + |λ_2|)/2.

Theorem 4.17. The trace of a matrix A ∈ R^{n×n} is the sum of its eigenvalues, i.e.,
\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i , (4.43)
where λ_i ∈ C are (possibly repeated) eigenvalues of A.

Let us provide a geometric intuition of these two theorems. Consider


a matrix A ∈ R2×2 that possesses two linearly independent eigenvectors
x1 , x2 . For this example, we assume (x1 , x2 ) are an ONB of R2 so that they
are orthogonal and the area of the square they span is 1; see Figure 4.6.
From Section 4.1, we know that the determinant computes the change of
area of unit square under the transformation A. In this example, we can
compute the change of area explicitly: Mapping the eigenvectors using
A gives us vectors v 1 = Ax1 = λ1 x1 and v 2 = Ax2 = λ2 x2 , i.e., the
new vectors v i are scaled versions of the eigenvectors xi , and the scaling
factors are the corresponding eigenvalues λi . v 1 , v 2 are still orthogonal,
and the area of the rectangle they span is |λ1 λ2 |.
Given that x1 , x2 (in our example) are orthonormal, we can directly
compute the perimeter of the unit square as 2(1 + 1). Mapping the eigen-
vectors using A creates a rectangle whose perimeter is 2(|λ1 | + |λ2 |).
Therefore, the sum of the absolute values of the eigenvalues tells us how
the perimeter of the unit square changes under the transformation matrix
A.

Example 4.9 (Google’s PageRank – Webpages as Eigenvectors)


Google uses the eigenvector corresponding to the maximal eigenvalue of
a matrix A to determine the rank of a page for search. The idea for the
PageRank algorithm, developed at Stanford University by Larry Page and
Sergey Brin in 1996, was that the importance of any web page can be ap-
proximated by the importance of pages that link to it. For this, they write
down all web sites as a huge directed graph that shows which page links
to which. PageRank computes the weight (importance) xi ⩾ 0 of a web
site ai by counting the number of pages pointing to ai . Moreover, PageR-
ank takes into account the importance of the web sites that link to ai . The
navigation behavior of a user is then modeled by a transition matrix A of
this graph that tells us with what (click) probability somebody will end up

on a different web site. The matrix A has the property that for any ini-
tial rank/importance vector x of a web site the sequence x, Ax, A²x, ... converges to a vector x∗. This vector is called the PageRank and satisfies
Ax∗ = x∗ , i.e., it is an eigenvector (with corresponding eigenvalue 1) of
A. After normalizing x∗ , such that ∥x∗ ∥ = 1, we can interpret the entries
as probabilities. More details and different perspectives on PageRank can
be found in the original technical report (Page et al., 1999).
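A minimal power-iteration sketch (illustrative, not from the book; the 3-page transition matrix below is made up):

import numpy as np

# Hypothetical column-stochastic transition matrix: entry A[i, j] is the
# probability of clicking from page j to page i.
A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

x = np.array([1.0, 0.0, 0.0])               # initial importance vector
for _ in range(100):                        # the sequence x, Ax, A^2 x, ... converges
    x = A @ x
    x = x / x.sum()                         # keep the entries summing to 1 (probabilities)

print(x)                                    # PageRank x*, here [1/3, 1/3, 1/3] by symmetry
print(np.allclose(A @ x, x))                # True: eigenvector of A with eigenvalue 1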

4.3 Cholesky Decomposition


There are many ways to factorize special types of matrices that we en-
counter often in machine learning. In the positive real numbers, we have
the square-root operation that gives us a decomposition of the number
into identical components, e.g., 9 = 3 · 3. For matrices, we need to be
careful that we compute a square-root-like operation on positive quanti-
ties. For symmetric, positive definite matrices (see Section 3.2.3), we can
choose from a number of square-root equivalent operations. The Cholesky decomposition/Cholesky factorization provides a square-root equivalent operation on symmetric, positive definite matrices that is useful in practice.
Theorem 4.18 (Cholesky Decomposition). A symmetric, positive definite matrix A can be factorized into a product A = LL^⊤, where L is a lower-triangular matrix with positive diagonal elements:
\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix} = \begin{bmatrix} l_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ l_{n1} & \cdots & l_{nn} \end{bmatrix} \begin{bmatrix} l_{11} & \cdots & l_{n1} \\ \vdots & \ddots & \vdots \\ 0 & \cdots & l_{nn} \end{bmatrix} . (4.44)
L is called the Cholesky factor of A, and L is unique.

Example 4.10 (Cholesky Factorization)
Consider a symmetric, positive definite matrix A ∈ R^{3×3}. We are interested in finding its Cholesky factorization A = LL^⊤, i.e.,
A = \begin{bmatrix} a_{11} & a_{21} & a_{31} \\ a_{21} & a_{22} & a_{32} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = LL^\top = \begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix} \begin{bmatrix} l_{11} & l_{21} & l_{31} \\ 0 & l_{22} & l_{32} \\ 0 & 0 & l_{33} \end{bmatrix} . (4.45)
Multiplying out the right-hand side yields
A = \begin{bmatrix} l_{11}^2 & l_{21}l_{11} & l_{31}l_{11} \\ l_{21}l_{11} & l_{21}^2 + l_{22}^2 & l_{31}l_{21} + l_{32}l_{22} \\ l_{31}l_{11} & l_{31}l_{21} + l_{32}l_{22} & l_{31}^2 + l_{32}^2 + l_{33}^2 \end{bmatrix} . (4.46)
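In code, the Cholesky factor can be computed directly; a short illustrative sketch (not from the book, with a made-up symmetric, positive definite matrix):

import numpy as np

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 3.0, 1.0],
              [2.0, 1.0, 3.0]])          # symmetric and positive definite

L = np.linalg.cholesky(A)                # lower-triangular Cholesky factor with positive diagonal
print(L)
print(np.allclose(L @ L.T, A))           # True: A = L L^T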
