Orthogonal Projections Explained
Figure 3.11 Projection onto a two-dimensional subspace U with basis b1, b2. The projection π_U(x) of x ∈ R³ onto U can be expressed as a linear combination of b1, b2, and the displacement vector x − π_U(x) is orthogonal to both b1 and b2.
Let us now choose a particular x and see whether it lies in the subspace spanned by b. For x = [1, 1, 1]⊤, the projection is
$$\pi_U(x) = P_\pi x = \frac{1}{9}\begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9}\begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \operatorname{span}\!\left[\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}\right]. \tag{3.48}$$
Note that the application of P_π to π_U(x) does not change anything, i.e., P_π π_U(x) = π_U(x). This is expected because, according to Definition 3.10, a projection matrix P_π satisfies P_π² x = P_π x for all x.
Remark. With the results from Chapter 4, we can show that πU (x) is an
eigenvector of P π , and the corresponding eigenvalue is 1. ♢
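A quick numerical check of (3.48) and the two remarks above, assuming NumPy and the dot product as the inner product (a sketch, not part of the book's text):

```python
import numpy as np

# Projection onto span([1, 2, 2]^T) with the dot product, as in (3.48).
b = np.array([[1.0], [2.0], [2.0]])
P = b @ b.T / (b.T @ b)            # 1D projection matrix b b^T / (b^T b)

x = np.array([1.0, 1.0, 1.0])
pi_x = P @ x
print(pi_x)                        # [5/9, 10/9, 10/9]

# P is idempotent (Definition 3.10), and pi_x is an eigenvector with eigenvalue 1.
assert np.allclose(P @ P, P)
assert np.allclose(P @ pi_x, pi_x)
```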
$$\iff B^\top B \lambda = B^\top x\,. \tag{3.56}$$
The last expression is called the normal equation. Since b1, …, bm are a basis of U and, therefore, linearly independent, B⊤B ∈ R^{m×m} is regular and can be inverted. This allows us to solve for the coefficients/coordinates
$$\lambda = (B^\top B)^{-1} B^\top x\,. \tag{3.57}$$
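As a sketch of (3.56)–(3.57) in NumPy, the basis and vector below are chosen to be consistent with the projection error reported in (3.63); they are not printed explicitly in this excerpt:

```python
import numpy as np

# Project x onto U = span(b1, b2) by solving the normal equation (3.56).
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                 # columns b1, b2
x = np.array([6.0, 0.0, 0.0])

lam = np.linalg.solve(B.T @ B, B.T @ x)    # lambda = (B^T B)^{-1} B^T x, see (3.57)
pi_x = B @ lam                             # pi_U(x) = B lambda

print(x - pi_x)                            # displacement [1, -2, 1], norm sqrt(6)
assert np.allclose(B.T @ (x - pi_x), 0.0)  # orthogonal to every basis vector of U
```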
Remark. The solution for projecting onto general subspaces includes the 1D case as a special case: If dim(U) = 1, then B⊤B ∈ R is a scalar and we can rewrite the projection matrix in (3.59), P_π = B(B⊤B)⁻¹B⊤, as
$$P_\pi = \frac{B B^\top}{B^\top B}\,,$$
which is exactly the projection matrix in (3.46). ♢
The corresponding projection error (also called the reconstruction error) is the norm of the difference vector between the original vector and its projection onto U, i.e.,
$$\|x - \pi_U(x)\| = \left\|\begin{bmatrix} 1 & -2 & 1 \end{bmatrix}^\top\right\| = \sqrt{6}\,. \tag{3.63}$$
To verify the results, we can (a) check whether the displacement vector
πU (x) − x is orthogonal to all basis vectors of U , and (b) verify that
P_π = P_π² (see Definition 3.10).
Remark. The projections πU (x) are still vectors in Rn although they lie in
an m-dimensional subspace U ⊆ Rn . However, to represent a projected
vector we only need the m coordinates λ1 , . . . , λm with respect to the
basis vectors b1 , . . . , bm of U . ♢
Remark. In vector spaces with general inner products, we have to pay
attention when computing angles and distances, which are defined by
means of the inner product. ♢
Projections allow us to look at situations where we have a linear system Ax = b without a solution: using projections, we can find approximate solutions to unsolvable linear equation systems. Recall that having no solution means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least-squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Section 9.4. Using reconstruction errors (3.63) is one possible approach to derive principal component analysis (Section 10.3).
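A minimal least-squares sketch of this idea, using an assumed random system rather than an example from the text:

```python
import numpy as np

# Least-squares solution of an unsolvable system Ax = b as the orthogonal
# projection of b onto the column space of A (dot product as inner product).
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))       # overdetermined: 10 equations, 3 unknowns
b = rng.normal(size=10)

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizes ||Ax - b||
b_proj = A @ x_ls                              # projection of b onto span of A's columns

# The residual is orthogonal to every column of A (normal equation).
assert np.allclose(A.T @ (b - b_proj), 0.0)
```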
Remark. We just looked at projections of vectors x onto a subspace U with basis vectors {b1, …, bk}. If this basis is an ONB, i.e., (3.33) and (3.34) are satisfied, the projection equation (3.58) simplifies greatly to
$$\pi_U(x) = B B^\top x\,. \tag{3.65}$$
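A short sketch of why (3.65) needs no matrix inverse, using an assumed random basis that is orthonormalized via a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 2))          # arbitrary basis of a 2D subspace of R^5
Q, _ = np.linalg.qr(B)               # orthonormal basis of the same subspace

x = rng.normal(size=5)
general = B @ np.linalg.solve(B.T @ B, B.T @ x)   # projection via (3.57)/(3.58)
onb = Q @ (Q.T @ x)                               # projection via (3.65)
assert np.allclose(general, onb)                  # same projection, cheaper formula
```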
Figure 3.12 Gram-Schmidt orthogonalization. (a) Original non-orthogonal basis (b1, b2) of R²; (b) first constructed basis vector u1 = b1 and the orthogonal projection of b2 onto span[u1]; (c) orthogonal basis (u1, u2) of R², where u2 = b2 − π_span[u1](b2).

Consider a basis (b1, b2) of R², where
$$b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \quad b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}; \tag{3.69}$$
see also Figure 3.12(a). Using the Gram-Schmidt method, we construct an orthogonal basis (u1, u2) of R² as follows (assuming the dot product as the inner product):
$$u_1 := b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \tag{3.70}$$
$$u_2 := b_2 - \pi_{\operatorname{span}[u_1]}(b_2) \stackrel{(3.45)}{=} b_2 - \frac{u_1 u_1^\top}{\|u_1\|^2}\, b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{3.71}$$
These steps are illustrated in Figures 3.12(b) and (c). We immediately see that u1 and u2 are orthogonal, i.e., u1⊤u2 = 0.
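The same construction as a small NumPy sketch (the helper name gram_schmidt is ours, not the book's):

```python
import numpy as np

def gram_schmidt(B):
    """Orthogonalize the columns of B (assumed linearly independent), dot product as inner product."""
    U = np.zeros_like(B, dtype=float)
    for k in range(B.shape[1]):
        u = B[:, k].astype(float)
        for j in range(k):                     # subtract projections onto earlier u_j
            u -= (U[:, j] @ B[:, k]) / (U[:, j] @ U[:, j]) * U[:, j]
        U[:, k] = u
    return U

B = np.array([[2.0, 1.0],
              [0.0, 1.0]])                      # columns b1, b2 from (3.69)
U = gram_schmidt(B)
print(U)                                        # columns [2, 0] and [0, 1], cf. (3.70)-(3.71)
assert np.isclose(U[:, 0] @ U[:, 1], 0.0)
```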
Figure 3.14 A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise (shown: the original points and the points rotated by 112.5°).
3.9 Rotations
Length and angle preservation, as discussed in Section 3.4, are the two
characteristics of linear mappings with orthogonal transformation matri-
ces. In the following, we will have a closer look at specific orthogonal
transformation matrices, which describe rotations.
A rotation is a linear mapping (more specifically, an automorphism of a Euclidean vector space) that rotates a plane by an angle θ about the origin, i.e., the origin is a fixed point. For a positive angle θ > 0, by common convention, we rotate in a counterclockwise direction. An example is shown in Figure 3.14, where the transformation matrix is
$$R = \begin{bmatrix} -0.38 & -0.92 \\ 0.92 & -0.38 \end{bmatrix}. \tag{3.74}$$
Important application areas of rotations include computer graphics and
robotics. For example, in robotics, it is often important to know how to
rotate the joints of a robotic arm in order to pick up or place an object,
see Figure 3.15.
Figure 3.16 Rotation of the standard basis of R² by an angle θ.
3.9.1 Rotations in R2
Consider the standard basis e1 = [1, 0]⊤, e2 = [0, 1]⊤ of R², which defines the standard coordinate system in R². We aim to rotate this coordinate system by an angle θ as illustrated in Figure 3.16. Note that the rotated vectors are still linearly independent and, therefore, are a basis of R². This means that the rotation performs a basis change.
Rotations Φ are linear mappings so that we can express them by a rotation matrix R(θ). Trigonometry (see Figure 3.16) allows us to determine the coordinates of the rotated axes (the image of Φ) with respect to the standard basis in R². We obtain
$$\Phi(e_1) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}, \quad \Phi(e_2) = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix}. \tag{3.75}$$
Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as
$$R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \tag{3.76}$$
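A compact sketch of (3.76) in NumPy (the helper name rotation_2d is ours):

```python
import numpy as np

def rotation_2d(theta):
    """Rotation matrix R(theta) from (3.76): counterclockwise rotation by theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

R = rotation_2d(np.pi / 2)               # 90 degrees counterclockwise
print(R @ np.array([1.0, 0.0]))          # e1 maps to approximately [0, 1]
assert np.allclose(R.T @ R, np.eye(2))   # rotation matrices are orthogonal
```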
3.9.2 Rotations in R3
In contrast to the R2 case, in R3 we can rotate any two-dimensional plane
about a one-dimensional axis. The easiest way to specify the general rota-
tion matrix is to specify how the images of the standard basis e1 , e2 , e3 are
supposed to be rotated, and making sure these images Re1 , Re2 , Re3 are
orthonormal to each other. We can then obtain a general rotation matrix
R by combining the images of the standard basis.
To have a meaningful rotation angle, we have to define what “coun-
terclockwise” means when we operate in more than two dimensions. We
use the convention that a “counterclockwise” (planar) rotation about an
axis refers to a rotation about an axis when we look at the axis “head on,
from the end toward the origin”. In R3 , there are therefore three (planar)
rotations about the three standard basis vectors (see Figure 3.17):
Figure 3.17 Rotation of a vector (gray) in R³ by an angle θ about the e3-axis. The rotated vector is shown in blue.
Consider an n-dimensional Euclidean vector space and an automorphism with transformation matrix
$$R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0 \\ 0 & \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 0 & I_{j-i-1} & 0 & 0 \\ 0 & \sin\theta & 0 & \cos\theta & 0 \\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n\times n}, \tag{3.80}$$
for 1 ⩽ i < j ⩽ n and θ ∈ R. Then R_ij(θ) is called a Givens rotation. Essentially, R_ij(θ) is the identity matrix I_n with
$$r_{ii} = \cos\theta\,, \quad r_{ij} = -\sin\theta\,, \quad r_{ji} = \sin\theta\,, \quad r_{jj} = \cos\theta\,. \tag{3.81}$$
In two dimensions (i.e., n = 2), we obtain (3.76) as a special case.
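A sketch of a Givens rotation built directly from (3.81), using 0-based indices as is conventional in NumPy (the helper name givens is ours):

```python
import numpy as np

def givens(n, i, j, theta):
    """Identity matrix with entries (i,i), (i,j), (j,i), (j,j) replaced as in (3.81)."""
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i], R[i, j] = c, -s
    R[j, i], R[j, j] = s, c
    return R

R = givens(4, 1, 3, np.pi / 6)
assert np.allclose(R.T @ R, np.eye(4))          # orthogonal
assert np.isclose(np.linalg.det(R), 1.0)        # determinant 1: a rotation
print(np.round(givens(2, 0, 1, np.pi / 6), 3))  # n = 2 recovers (3.76)
```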
kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the
fact that many linear algorithms can be expressed purely by inner prod-
uct computations. Then, the “kernel trick” allows us to compute these
inner products implicitly in a (potentially infinite-dimensional) feature
space, without even knowing this feature space explicitly. This allowed the
“non-linearization” of many algorithms used in machine learning, such as
kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaus-
sian processes (Rasmussen and Williams, 2006) also fall into the category
of kernel methods and are the current state of the art in probabilistic re-
gression (fitting curves to data points). The idea of kernels is explored
further in Chapter 12.
Projections are often used in computer graphics, e.g., to generate shad-
ows. In optimization, orthogonal projections are often used to (iteratively)
minimize residual errors. This also has applications in machine learning,
e.g., in linear regression where we want to find a (linear) function that
minimizes the residual errors, i.e., the lengths of the orthogonal projec-
tions of the data onto the linear function (Bishop, 2006). We will investi-
gate this further in Chapter 9. PCA (Pearson, 1901; Hotelling, 1933) also
uses projections to reduce the dimensionality of high-dimensional data.
We will discuss this in more detail in Chapter 10.
Exercises
3.1 Show that ⟨·, ·⟩ defined for all x = [x1 , x2 ]⊤ ∈ R2 and y = [y1 , y2 ]⊤ ∈ R2 by
is an inner product.
3.2 Consider R² with ⟨·, ·⟩ defined for all x and y in R² as
$$\langle x, y\rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=:A} y\,.$$

using
a. ⟨x, y⟩ := x⊤y
b. ⟨x, y⟩ := x⊤Ay, where
$$A := \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}$$
3.4 Compute the angle between
$$x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad y = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$$
using
a. ⟨x, y⟩ := x⊤y
b. ⟨x, y⟩ := x⊤By, where
$$B := \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$$
3.5 Consider the Euclidean vector space R⁵ with the dot product. A subspace U ⊆ R⁵ and x ∈ R⁵ are given by
$$U = \operatorname{span}\!\left[\begin{bmatrix} 0 \\ -1 \\ 2 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ -3 \\ 1 \\ -1 \\ 2 \end{bmatrix}, \begin{bmatrix} -3 \\ 4 \\ 1 \\ 2 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -3 \\ 5 \\ 0 \\ 7 \end{bmatrix}\right], \quad x = \begin{bmatrix} -1 \\ -9 \\ -1 \\ 4 \\ 1 \end{bmatrix}.$$
3.8 Using the Gram-Schmidt method, turn the basis B = (b1, b2) of a two-dimensional subspace U ⊆ R³ into an ONB C = (c1, c2) of U, where
$$b_1 := \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad b_2 := \begin{bmatrix} -1 \\ 2 \\ 0 \end{bmatrix}.$$
Matrix Decompositions
4.1 Determinant and Trace
Figure 4.1 A mind map of the concepts introduced in this chapter (eigenvalues, eigenvectors, orthogonal matrices, diagonalization, SVD), along with where they are used in other parts of the book (e.g., Chapter 6 Probability and distributions, Chapter 10 Dimensionality reduction).
For a memory aid of the product terms in Sarrus’ rule, try tracing the
elements of the triple products in the matrix.
We call a square matrix T an upper-triangular matrix if T_ij = 0 for i > j, i.e., the matrix is zero below its diagonal. Analogously, we define a lower-triangular matrix as a matrix with zeros above its diagonal. For a triangular matrix T ∈ R^{n×n}, the determinant is the product of the diagonal elements, i.e.,
$$\det(T) = \prod_{i=1}^{n} T_{ii}\,. \tag{4.8}$$
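A quick numerical illustration of (4.8), with an assumed example matrix:

```python
import numpy as np

T = np.array([[2.0, 5.0, -1.0],
              [0.0, 3.0,  4.0],
              [0.0, 0.0,  0.5]])       # upper-triangular

print(np.prod(np.diag(T)))             # 3.0, the product of the diagonal elements
print(np.linalg.det(T))                # matches, up to floating-point error
```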
Example 4.2 (Determinants as Measures of Volume)
The notion of a determinant is natural when we consider it as a mapping from a set of n vectors spanning an object in Rⁿ. It turns out that the determinant det(A) is the signed volume of an n-dimensional parallelepiped formed by the columns of the matrix A.

Figure 4.2 The area of the parallelogram (shaded region) spanned by the vectors b and g is |det([b, g])|.
Figure 4.3 The volume of the parallelepiped (shaded volume) spanned by the vectors r, b, g is |det([r, b, g])|.

For n = 2, the columns of the matrix form a parallelogram; see Figure 4.2. As the angle between the vectors gets smaller, the area of the parallelogram shrinks, too. Consider two vectors b, g that form the columns of a matrix A = [b, g]. Then, the absolute value of the determinant of A is the area of the parallelogram with vertices 0, b, g, b + g. In particular, if b, g are linearly dependent so that b = λg for some λ ∈ R, they no longer form a two-dimensional parallelogram. Therefore, the corresponding area is 0. On the contrary, if b, g are linearly independent and are multiples of the canonical basis vectors e1, e2, then they can be written as b = [b, 0]⊤ and g = [0, g]⊤, and the determinant is
$$\det\!\left(\begin{bmatrix} b & 0 \\ 0 & g \end{bmatrix}\right) = bg - 0 = bg\,.$$
The sign of the determinant indicates the orientation of the spanning vectors b, g with respect to the standard basis (e1, e2). In our figure, flipping the order to g, b swaps the columns of A and reverses the orientation of the shaded area. This becomes the familiar formula: area = height × length. This intuition extends to higher dimensions. In R³, we consider three vectors r, b, g ∈ R³ spanning the edges of a parallelepiped, i.e., a solid with faces that are parallel parallelograms (see Figure 4.3). The absolute value of the determinant of the 3 × 3 matrix [r, b, g] is the volume of the solid. Thus, the determinant acts as a function that measures the signed volume formed by the column vectors composed in a matrix.
Consider the three linearly independent vectors r, g, b ∈ R³ given as
$$r = \begin{bmatrix} 2 \\ 0 \\ -8 \end{bmatrix}, \quad g = \begin{bmatrix} 6 \\ 1 \\ 0 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 4 \\ -1 \end{bmatrix}. \tag{4.9}$$
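Continuing the example numerically, here is a sketch of computing the volume of this parallelepiped as |det([r, b, g])|:

```python
import numpy as np

r = np.array([2.0, 0.0, -8.0])
g = np.array([6.0, 1.0, 0.0])
b = np.array([1.0, 4.0, -1.0])

A = np.column_stack([r, b, g])       # columns r, b, g as in the text
print(np.linalg.det(A))              # 186 (up to floating-point error)
print(abs(np.linalg.det(A)))         # volume of the parallelepiped
```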
$$c_0 = \det(A)\,, \tag{4.23}$$
$$c_{n-1} = (-1)^{n-1}\operatorname{tr}(A)\,. \tag{4.24}$$
The characteristic polynomial (4.22a) will allow us to compute eigen-
values and eigenvectors, covered in the next section.
Definition 4.9. Let a square matrix A have an eigenvalue λi. The algebraic multiplicity of λi is the number of times the root appears in the characteristic polynomial.
A matrix A and its transpose A⊤ possess the same eigenvalues, but not
necessarily the same eigenvectors.
The eigenspace Eλ is the null space of A − λI since
Ax = λx ⇐⇒ Ax − λx = 0 (4.27a)
⇐⇒ (A − λI)x = 0 ⇐⇒ x ∈ ker(A − λI). (4.27b)
This means any vector x = [x1, x2]⊤, where x2 = −x1, such as [1, −1]⊤, is an eigenvector with eigenvalue 2. The corresponding eigenspace is given as
$$E_2 = \operatorname{span}\!\left[\begin{bmatrix} 1 \\ -1 \end{bmatrix}\right]. \tag{4.35}$$
Example 4.6
The matrix
$$A = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}$$
has two repeated eigenvalues λ1 = λ2 = 2 and an algebraic multiplicity of 2. The eigenvalue has, however, only one distinct unit eigenvector x1 = [1, 0]⊤ and, thus, geometric multiplicity 1.
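A minimal numerical check of Example 4.6 (algebraic multiplicity 2, geometric multiplicity 1):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 2.0]])

print(np.linalg.eigvals(A))                        # [2., 2.]: algebraic multiplicity 2

# Geometric multiplicity = dim ker(A - 2I) = n - rank(A - 2I).
rank = np.linalg.matrix_rank(A - 2.0 * np.eye(2))
print(2 - rank)                                    # 1
```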
Figure 4.4 Determinants and eigenspaces. Overview of five linear mappings and their associated transformation matrices Ai ∈ R^{2×2}, projecting 400 color-coded points x ∈ R² (left column) onto target points Ai x (right column). The central column depicts the first eigenvector, stretched by its associated eigenvalue λ1, and the second eigenvector, stretched by its eigenvalue λ2. Each row depicts the effect of one of five transformation matrices Ai with respect to the standard basis. (Rows, top to bottom: λ1 = 2.0, λ2 = 0.5, det(A) = 1.0; λ1 = λ2 = 1.0, det(A) = 1.0; λ1 = 0.87 − 0.5j, λ2 = 0.87 + 0.5j, det(A) = 1.0; λ1 = 0.0, λ2 = 2.0, det(A) = 0.0; λ1 = 0.5, λ2 = 1.5, det(A) = 0.75.)
half of the vertical axis, and to the left vice versa. This mapping is area
preserving (det(A2 ) = 1). The eigenvalue λ1 = 1 = λ2 is repeated
and the eigenvectors are collinear (drawn here for emphasis in two
opposite directions). This indicates that the mapping acts only along
one direction (the horizontal
axis).
$$A_3 = \begin{bmatrix} \cos(\tfrac{\pi}{6}) & -\sin(\tfrac{\pi}{6}) \\ \sin(\tfrac{\pi}{6}) & \cos(\tfrac{\pi}{6}) \end{bmatrix} = \frac{1}{2}\begin{bmatrix} \sqrt{3} & -1 \\ 1 & \sqrt{3} \end{bmatrix}$$
The matrix A3 rotates the points by π/6 rad = 30° counterclockwise and has only complex eigenvalues, reflecting that the mapping is a rotation (hence, no eigenvectors are drawn). A rotation has to be volume preserving, and so the determinant is 1. For more details on rotations, we refer to Section 3.9.
$$A_4 = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$$
represents a mapping in the standard basis that collapses a two-dimensional domain onto one dimension. Since one eigen-
Figure 4.5 Caenorhabditis elegans neural network (Kaiser and Hilgetag, 2006). (a) Symmetrized connectivity matrix; (b) eigenspectrum.
Methods to analyze and learn from network data are an essential component of machine learning. The key to understanding networks
is the connectivity between network nodes, especially if two nodes are
connected to each other or not. In data science applications, it is often
useful to study the matrix that captures this connectivity data.
We build a connectivity/adjacency matrix A ∈ R^{277×277} of the complete neural network of the worm C. elegans. Each row/column represents one
of the 277 neurons of this worm’s brain. The connectivity matrix A has
a value of aij = 1 if neuron i talks to neuron j through a synapse, and
aij = 0 otherwise. The connectivity matrix is not symmetric, which im-
plies that eigenvalues may not be real valued. Therefore, we compute a
symmetrized version of the connectivity matrix as Asym := A + A⊤ . This
new matrix Asym is shown in Figure 4.5(a) and has a nonzero value aij if
and only if two neurons are connected (white pixels), irrespective of the
direction of the connection. In Figure 4.5(b), we show the correspond-
ing eigenspectrum of Asym . The horizontal axis shows the index of the
eigenvalues, sorted in descending order. The vertical axis shows the corre-
sponding eigenvalue. The S -like shape of this eigenspectrum is typical for
many biological neural networks. The underlying mechanism responsible
for this is an area of active neuroscience research.
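A sketch of the symmetrization and eigenspectrum computation described above; since the C. elegans data set is not included here, a random sparse adjacency matrix stands in for it:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 277
A = (rng.random((n, n)) < 0.05).astype(float)        # assumed sparse directed connectivity
np.fill_diagonal(A, 0.0)

A_sym = A + A.T                                      # symmetrize: eigenvalues become real
eigvals = np.sort(np.linalg.eigvalsh(A_sym))[::-1]   # sorted in descending order
print(eigvals[:5])                                   # largest eigenvalues of the spectrum
```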
Example 4.8
Consider the matrix
$$A = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 2 \\ 2 & 2 & 3 \end{bmatrix}. \tag{4.37}$$
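As a quick numerical companion to this example (not the book's worked derivation), note that A is symmetric, so its eigenvalues are real:

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, 2.0],
              [2.0, 2.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(A)   # eigh: for symmetric matrices, ascending order
print(eigvals)                         # two eigenvalues equal to 1 and one equal to 7
```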
Figure 4.6 Geometric interpretation of eigenvalues. The eigenvectors of A get stretched by the corresponding eigenvalues. The area of the unit square changes by |λ1 λ2|; the perimeter changes by a factor of (1/2)(|λ1| + |λ2|).

Theorem 4.17. The trace of a matrix A ∈ R^{n×n} is the sum of its eigenvalues, i.e.,
$$\operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i\,. \tag{4.43}$$
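A one-line numerical check of Theorem 4.17 on an assumed random matrix (for a real matrix, complex eigenvalues come in conjugate pairs, so their imaginary parts cancel in the sum):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))

print(np.trace(A))                          # trace of A
print(np.sum(np.linalg.eigvals(A)).real)    # sum of eigenvalues, equal up to rounding
```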
on a different web site. The matrix A has the property that for any initial rank/importance vector x of a web site, the sequence x, Ax, A²x, ... converges to a vector x*. This vector is called the PageRank and satisfies Ax* = x*, i.e., it is an eigenvector (with corresponding eigenvalue 1) of A. After normalizing x*, such that ∥x*∥ = 1, we can interpret the entries as probabilities. More details and different perspectives on PageRank can be found in the original technical report (Page et al., 1999).
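A sketch of the power iteration x, Ax, A²x, … described above, on an assumed toy link matrix (column-stochastic, so each column sums to 1):

```python
import numpy as np

A = np.array([[0.0, 0.5, 0.5, 0.0],
              [1/3, 0.0, 0.0, 0.5],
              [1/3, 0.0, 0.0, 0.5],
              [1/3, 0.5, 0.5, 0.0]])    # toy web graph with 4 pages

x = np.ones(4) / 4                      # uniform initial importance vector
for _ in range(100):
    x = A @ x
    x = x / np.linalg.norm(x, 1)        # keep entries interpretable as probabilities

print(x)                                # the PageRank vector x*
assert np.allclose(A @ x, x, atol=1e-6) # x* is an eigenvector with eigenvalue 1
```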