
CIS3990-002: Mathematics of Machine Learning Fall 2023

Lecture: Linear Algebra


Date: October 4th, 2023 Author: Eric Wong

1 Linear Algebra Basics

Most likely you are familiar with basic operations on matrices and vectors. For example, A ∈ R3×5
is a matrix of real numbers with 3 rows and 5 columns, while b ∈ R3 is a vector of 3 elements.
These form the basis of a system that we call linear algebra, which has several key properties, listed below.
Our goal in this module will be to learn about the fundamental properties of linear systems,
and to generalize these properties to abstract vector spaces that are not necessarily over the field of real
numbers.

• For m, n ∈ N, a matrix A is an (m, n)-tuple of elements aij, where i denotes the row and j denotes
the column.
• Addition: if C = A + B and A, B ∈ Rm×n then cij = aij + bij
• Product: If C = AB with A ∈ Rm×k and B ∈ Rk×n, then cij = Σ_{l=1}^{k} ail blj, where C ∈ Rm×n
• Identity: Im ∈ Rm×m is an identity matrix when it is zero everywhere except the diagonal,
i.e. Iij = 1[i = j]
• Associativity: (AB)C = A(BC)
• Distributivity: (A + B)C = AC + BC, A(C + D) = AC + AD
• Multiplication with identity: ∀A ∈ Rm×n : Im A = AIn = A
• Inverse: Let A ∈ Rn×n . If AB = I then B = A−1 is the inverse of A
• Transpose: Let A ∈ Rm×n . The matrix B = A⊤ such that bij = aji is called the transpose.
• Symmetric: A ∈ Rn×n is symmetric if A = A⊤

We can also add scalars to the mix (single elements).

• Scalar multiplication: Let λ ∈ R. Then, λA = K where Kij = λaij .


• Associativity: (λϕ)C = λ(ϕC). In fact, scalars can be moved around freely: λ(BC) = (λB)C =
B(λC) = (BC)λ. Also, transposition does not affect scalars: (λC)⊤ = C⊤λ = λC⊤
• Distributivity: (λ + ϕ)C = λC + ϕC and λ(B + C) = λB + λC

One of the most common uses of matrices and vectors is to represent linear systems of equations
in a compact form:
Ax = b
Here each row of A contains the coefficients of one equation in the variables x, and the corresponding
entry of b is that equation's target value.
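As a concrete illustration, here is a minimal NumPy sketch (the specific numbers are made up for the example): each row of A encodes one equation, and np.linalg.solve recovers the unique solution when A is square and invertible.

```python
import numpy as np

# Each row of A holds the coefficients of one equation; b holds the targets.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([3.0, 5.0, 6.0])

x = np.linalg.solve(A, b)        # solve Ax = b (A is square and invertible here)
print(x)
print(np.allclose(A @ x, b))     # True: x satisfies every equation simultaneously
```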

2 Groups

The space of matrices and vectors behaves nicely, in that it has these properties of associativity,
distributivity, an identity and an inverse. Let’s now generalize this structure.

• Groups: Let G be a set and an operation ⊗ : G × G → G be defined on G. Then G = (G, ⊗)


is called a group if

1. Closure: ∀x, y ∈ G : x ⊗ y ∈ G
2. Associativity: ∀x, y, z ∈ G : (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z)
3. Neutral element: ∃e ∈ G, ∀x ∈ G : x ⊗ e = e ⊗ x = x
4. Inverse element: ∀x ∈ G, ∃y ∈ G : x ⊗ y = y ⊗ x = e. We write x⁻¹ to denote the inverse
element of x. This does not always mean 1/x; the inverse is with respect to the operator ⊗.
5. (Commutativity) If additionally ∀x, y ∈ G : x ⊗ y = y ⊗ x, then G is an Abelian group

• Examples of Abelian groups: (Z, +), (R \ {0}, ·)

• Examples of non-groups: (N0, +), (Z, ·), (R, ·)

• (Rn, +) and (Zn, +) are Abelian groups under component-wise addition

• Matrices and addition: (Rm×n , +) is Abelian with component-wise addition

• Matrices and multiplication: (Rn×n, ·) is not a group, since the inverse does not always exist

• General Linear Group: set of invertible matrices A ∈ Rn×n is a group with respect to matrix
multiplication, but is not Abelian (not commutative)
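To make these axioms concrete, here is a small Python sketch (not part of the lecture) that brute-force checks them for the integers {0, . . . , 4} under addition modulo 5, which forms an Abelian group:

```python
from itertools import product

G = range(5)
op = lambda x, y: (x + y) % 5   # the group operation on Z_5

closure = all(op(x, y) in G for x, y in product(G, G))
assoc = all(op(op(x, y), z) == op(x, op(y, z)) for x, y, z in product(G, G, G))
identity = next(e for e in G if all(op(x, e) == op(e, x) == x for x in G))   # e = 0
inverses = all(any(op(x, y) == identity for y in G) for x in G)
abelian = all(op(x, y) == op(y, x) for x, y in product(G, G))

print(closure, assoc, identity, inverses, abelian)   # True True 0 True True
```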

3 Vector Spaces

Groups have an operation whose structure stays within the group. This can be referred to as
an inner operation (e.g. element-wise addition), since the operation stays within the group. We can also
consider an outer operation, which takes in an element from outside of the group.

• Real-valued vector space V = (V, +, ·) is a set V with operations + : V × V → V and


·:R×V →V

• Distributivity:

1. ∀λ ∈ R, ∀x, y ∈ V : λ ⊙ (x + y) = λ ⊙ x + λ ⊙ y
2. ∀λ, ψ ∈ R, ∀x ∈ V : (λ + ψ) ⊙ x = λ ⊙ x + ψ ⊙ x

• x ∈ V are called vectors, the neutral element is 0, the inner operation is vector addition,
and the outer operation is multiplication by scalars.

• A subspace of a vector space is itself a vector space: if U ⊆ V, V = (V, +, ·) is a vector space,
and U = (U, +, ·) with the operations of V restricted to U is also a vector space, then we call U a
subspace of V.

• Subspaces inherit properties from the larger space, including being Abelian, distributivity, associa-
tivity, and the neutral element. To show that U is a subspace, we only need to show that 0 ∈ U and that
U is closed with respect to both the inner and outer operations (i.e. ∀λ ∈ R, ∀x ∈ U : λx ∈ U and
∀x, y ∈ U : x + y ∈ U).

• Example 2.12 from the textbook.

This structure gives us the nice properties we expect in linear algebra (i.e. we can do operations
on vectors that result in more vectors).

4 Linear Independence, Basis, and Rank

• A linear combination is a combination of scaled vectors:

  v = Σ_i λi xi

• If xi ∈ Rd, we will typically abbreviate this as λ⊤X, where X is the matrix with the vectors xi
stacked as its rows

• If there exists λ such that 0 = Σ_i λi xi with at least one λi ≠ 0, then the vectors are linearly
dependent. If no such non-zero solution exists, they are linearly independent.

• Properties of linear independence

– k vectors are either linearly dependent or independent


– If a vector is 0 or if the same vector is repeated, they are dependent

• Let V = (V, +, ·) be a vector space and let A = {x1, . . . , xk} ⊂ V. If every vector v ∈ V can
be expressed as a linear combination of vectors in A, then A is a generating set of V.

• The set of all linear combinations of vectors in A is the span of A.

• A generating set A is minimal if there does not exist a smaller Ā ⊊ A that spans V .

• Every linearly independent generating set of V is minimal and is called a basis of V .

• Let B ⊆ V, B ̸= ∅. The following statements are equivalent:

– B is a basis
– B is a minimal generating set
– B is a maximally linearly independent set of vectors in V , i.e. adding any vector will
make the set linearly dependent
– Every vector x ∈ V is a linear combination of vectors from B, and every linear combination is unique:

  x = Σ_i λi bi = Σ_i ψi bi ⇒ λi = ψi

• Example: Standard basis is B = {e1 , . . . , ek } where ei is zero everywhere except for the ith
position which is 1.

• Example: B = {(1, 0, 0)⊤, (1, 1, 0)⊤, (1, 1, 1)⊤}

• Every vector space has a basis B. There can be multiple bases, i.e. they are not unique.
However, they all have the same number of basis vectors.

• The dimension of V is the number of basis vectors of V , denoted as dim(V ).

• A finite dimensional vector space is one where dim(V ) < ∞.

• Example of an infinite dimensional basis: suppose V is the set of all countably infinite vectors
v = (v1, v2, . . .). Then it has an infinite basis B = {e1, e2, . . .}. Every vector can be written
as v = Σ_i λi ei for some coefficients λi that exist for each v.

• When we go to function spaces, these will be infinite dimensional spaces.

• Example: The functions en(θ) = e^{2πinθ} form an (orthonormal) basis of the Hilbert space L²([0, 1]),
where L²([0, 1]) is the space of functions on [0, 1] for which the Lebesgue integral of the square
of the absolute value is finite, i.e. ∫ |f|² dµ < ∞

• The number of linearly independent columns of a matrix A equals the number of linearly
independent rows and is called the rank of A, denoted as rk(A).

• Properties:

– rk(A) = rk(A⊤ )
– The columns of A span a subspace U ⊆ Rm with dim(U ) = rk(A). This subspace is
called the image or range of A.
– Similarly, the rows of A span a subspace W ⊆ Rn with dim(W ) = rk(A)
– For square matrices A ∈ Rn×n , A is invertible (regular) if and only if rk(A) = n.
– For all A, b the linear system Ax = b can be solved if and only if rk(A) = rk(A|b).
– For A ∈ Rm×n, the subspace of solutions x such that Ax = 0 has dimension n − rk(A). This
subspace is called the kernel, or the null space.

– A matrix has full rank if rk(A) = min(m, n). Otherwise, it is rank deficient.
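A quick NumPy sketch of these rank facts, using an arbitrary rank-deficient matrix chosen for illustration:

```python
import numpy as np

# The third row is the sum of the first two, so rows (and columns) are dependent.
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [1.0, 3.0, 4.0]])

print(np.linalg.matrix_rank(A))      # 2: rank deficient, since min(m, n) = 3
print(np.linalg.matrix_rank(A.T))    # 2: rk(A) = rk(A^T)
print(np.linalg.det(A))              # ~0: a square matrix with rk(A) < n is not invertible
```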

The goal of a basis is to provide a sense of structure to the vector space. We will now look at linear
mappings, which are operations that preserve the structure of a vector space. This is analogous to
the group operator, which preserves the structure of a group. Previously, we had operators + and
· for a vector space corresponding to elementwise addition and scalar multiplication. We will now
look at operators between vector spaces that preserve this structure.

• Let V, W be two vector spaces. A linear mapping Φ : V → W satisfies

∀x, y ∈ V, ∀λ, ψ ∈ R : Φ(λx + ψy) = λΦ(x) + ψΦ(y)

These are sometimes also called vector space homomorphisms or linear transformations.

• In the vector space of Rn , we can represent linear mappings as matrices.

• A mapping Φ : V → W on arbitrary sets V, W is called:

– Injective if ∀x, y : Φ(x) = Φ(y) ⇒ x = y (different vectors map to different outputs)


– Surjective if Φ(V) = W (all elements can be reached)
– Bijective if it is both injective and surjective (operation can be undone)

• Φ : V → W is an isomorphism if it is both linear and bijective

• Φ : V → V is an endomorphism if it is linear

• Φ : V → V is an automorphism if it is both linear and bijective

• idV : V → V is the identity mapping, which is an automorphism.

• Theorem: Finite dimensional vector spaces V, W are isomorphic if and only if dim(V ) =
dim(W ) (Axler 2015).

• Intuitively, this means that vector spaces with the same dimension are the "same", in that
you can transform from one to the other without losing anything. This means that we can
treat the space of Rm×n matrices as the same as the space of Rmn vectors, as there is a linear bijective
mapping from one to the other.

• More properties:

– Let V, W, X be vector spaces. If Φ : V → W and Ψ : W → X are linear mappings, then


Ψ ◦ Φ : V → X is also linear.
– If Φ : V → W is an isomorphism, then Φ−1 : W → V is an isomorphism.
– If Φ : V → W, Ψ : V → W are linear, then Φ + Ψ and λΦ are also linear.

The key point of the previous section is that any n dimensional vector space is isomorphic
to Rn. Therefore, any reasoning we can do in Rn applies to any finite dimensional vector space.
So in the finite dimensional case, we only need to study Rn, since everything finite dimensional can be
reduced to Rn. To make this concrete, we now introduce coordinates.

• Let B = (b1, . . . , bn) be an ordered basis of V. For any x ∈ V, let x = Σ_i αi bi be its unique
linear combination in this basis. Then, we call α1, . . . , αn the coordinates of x.

• Think of a basis as defining a coordinate system.

• Typical coordinate system: the standard basis e1, . . . , en. The coefficients tell us how to linearly
combine the basis vectors to obtain x. However, one could also use the basis ((1, 0), (1, 1)) to span R2.

• Transformation matrix: Let V, W be vector spaces with bases B and C, and consider a linear
mapping Φ : V → W . For j ∈ {1, . . . , n} let

  Φ(bj) = Σ_i αij ci

be the unique representation of Φ(bj) with respect to C. Then the matrix A with entries
Aij = αij is the transformation matrix of Φ with respect to B and C. It tells us
how to go from one vector space to the other, represented as a matrix.

• What this means is that any linear mapping between finite dimensional spaces can be represented
by a matrix. Just pick a basis for the domain and the target, and compute the coefficients!

• If x̂ is the coordinate vector of x ∈ V and ŷ is the coordinate vector of y = Φ(x) ∈ W, then
ŷ = AΦ x̂, where AΦ is the transformation matrix of Φ.

We now have a coordinate system for our vector spaces, which depends on a chosen basis B.
However, remember that the basis is not unique: there are multiple possible bases for a
vector space. Depending on the basis, the resulting transformation matrix could be easier or harder
to work with. We will work towards characterizing what it means for a basis to be "nice". But in
order to do so, we first need to understand how to change between bases.
As an example, consider the linear transformation Φ with transformation matrix

  A = [ 2  1 ]
      [ 1  2 ]

in the standard basis. If instead of the standard basis we use the basis B = ([1, 1], [1, −1]), then
the linear map Φ has transformation matrix

  Ã = [ 3  0 ]
      [ 0  1 ]

which is a diagonal matrix (which is nice). We'll see how to get this in the next section.

• For a linear mapping Φ : V → W , consider two bases B = (b1 , . . . , bn ) and B̃ = (b̃1 , . . . , b̃n )
on V and two bases C = (c1 , . . . , cm ) and C̃ = (c̃1 , . . . , c̃m ) on W .

• Let AΦ ∈ Rm×n be the transformation matrix of Φ with respect to B, C and let ÃΦ be the
transformation matrix of Φ with respect to B̃, C̃.

• How are A and à related?

• Theorem (Basis Change): The transformation matrix ÃΦ is given by

  ÃΦ = T⁻¹ AΦ S

where S ∈ Rn×n is the transformation matrix of idV that maps coordinates with respect to B̃ to
coordinates with respect to B, and T ∈ Rm×m is the transformation matrix of idW that maps
coordinates with respect to C̃ to coordinates with respect to C.

• Proof: First, by definition of S, we can write each b̃j as a combination of the basis vectors bi:

  b̃j = Σ_{i=1}^{n} sij bi

  Similarly, we can write each c̃k as a combination of the basis vectors cl of C:

  c̃k = Σ_{l=1}^{m} tlk cl

  Then S maps B̃ onto B and T maps C̃ onto C (the columns of S and T are the coordinate
  representations of the b̃j and c̃k with respect to B and C). Now, re-express Φ(b̃j) in two ways.
  First, using Ã and C̃:

  Φ(b̃j) = Σ_{k=1}^{m} ãkj c̃k = Σ_{k=1}^{m} ãkj Σ_{l=1}^{m} tlk cl = Σ_{l=1}^{m} ( Σ_{k=1}^{m} tlk ãkj ) cl

  Then, using S and A:

  Φ(b̃j) = Φ( Σ_{i=1}^{n} sij bi ) = Σ_{i=1}^{n} sij Φ(bi) = Σ_{i=1}^{n} sij Σ_{l=1}^{m} ali cl = Σ_{l=1}^{m} ( Σ_{i=1}^{n} ali sij ) cl

  Comparing the coefficient of each cl, it follows that for all j = 1, . . . , n and all l = 1, . . . , m,

  Σ_{k=1}^{m} tlk ãkj = Σ_{i=1}^{n} ali sij

  In matrix form, this is equivalent to

  T Ã = AS

  and therefore Ã = T⁻¹AS.

• Aside: Why are S and T regular (invertible)? They are the matrix representation of the
identity operator, which is an invertible operator.

• Two matrices A, Ã ∈ Rm×n are equivalent if there exist regular matrices S ∈ Rn×n, T ∈
Rm×m such that Ã = T⁻¹AS

• Two matrices A, Ã ∈ Rn×n are similar if there exists a regular matrix S ∈ Rn×n where
à = S −1 AS

• Informally, this basis change can be seen as the following:

– A maps V → W with respect to bases B and C
– Ã maps V → W with respect to bases B̃ and C̃
– S is the identity mapping from coordinates in basis B̃ to coordinates in basis B
– T is the identity mapping from coordinates in basis C̃ to coordinates in basis C
– Then, the map B̃ → C̃ can be rewritten as

  B̃ → B → C → C̃

  which reflects applying S, then A, then T⁻¹. Hence, Ãx = T⁻¹(A(Sx))


 
• To get the example from the start of this section: note that

  S = [ 1   1 ]
      [ 1  −1 ]

  (i.e. just horizontally stack the representations of the new basis vectors in the old basis) and that

  T⁻¹ = S⁻¹ = (1/2) [ 1   1 ]
                    [ 1  −1 ]

  Then,

  Ã = T⁻¹AS = (1/2) [ 1   1 ] [ 2  1 ] [ 1   1 ]  =  [ 3  0 ]
                    [ 1  −1 ] [ 1  2 ] [ 1  −1 ]     [ 0  1 ]
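This computation is easy to verify numerically; the following NumPy sketch reproduces the example above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])       # transformation matrix in the standard basis
S = np.array([[1.0, 1.0],
              [1.0, -1.0]])      # new basis vectors stacked as columns
T = S                            # here V = W and the same basis change is used on both sides

A_tilde = np.linalg.inv(T) @ A @ S
print(A_tilde)                   # [[3. 0.] [0. 1.]] -- diagonal, as claimed
```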

Now that we have a better understanding of bases and how to change between bases for linear
mappings, we’ll now cover our last fundamental concept for vector spaces: images and kernels.

• For Φ : V → W the kernel or null space is ker(Φ) = Φ⁻¹(0) = {v ∈ V : Φ(v) = 0}

• This is the set of vectors in V that map to 0

• The image or range is Im(Φ) = Φ(V) = {w ∈ W | ∃v ∈ V : Φ(v) = w}

• This is the set of vectors in W that get mapped to

• It is always true that Φ(0) = 0 and therefore 0 ∈ ker(Φ), so the null space is never empty.
• It is also always true that ker(Φ) ⊆ V and Im(Φ) ⊆ W .
• Φ is injective if and only if ker(Φ) = {0}

• Consider a linear mapping Φ : Rn → Rm with transformation matrix A ∈ Rm×n , so Φ(x) = Ax

• Let A = [a1, . . . , an] denote the columns of A. Then the image is the span of the columns (the column
space):

  Im(Φ) = {Ax : x ∈ Rn} = { Σ_i xi ai : xi ∈ R } = span[a1, . . . , an] ⊆ Rm

• Then it follows that the rank of A is the dimension of the image, i.e. rk(A) = dim(Im(Φ))

• Rank Nullity Theorem: For vector spaces V, W and linear mapping Φ : V → W it holds that

dim(ker(Φ)) + dim(Im(Φ)) = dim(V )

also known as the fundamental theorem of linear mappings.

• Some immediate consequences:

– If dim(Im(Φ)) < dim(V ) then ker(Φ) is non-trivial


– If AΦ is the transformation matrix of Φ and dim(Im(Φ)) < dim(V ) then Ax = 0 has
infinitely many solutions
– If dim(V ) = dim(W ) then Φ is injective if and only if it is surjective
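The rank-nullity theorem can be spot-checked numerically. The sketch below (assuming NumPy and SciPy; the matrix is an arbitrary example) computes rk(A) and an orthonormal basis of the kernel and confirms the dimensions add up:

```python
import numpy as np
from scipy.linalg import null_space

# Phi(x) = Ax maps R^4 -> R^3; column 3 = column 1 + column 2, column 4 = 2 * column 1.
A = np.array([[1.0, 0.0, 1.0, 2.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 2.0, 2.0]])

rank = np.linalg.matrix_rank(A)     # dim(Im(Phi)) = 2
kernel = null_space(A)              # orthonormal basis of ker(Phi); shape (4, 2)
print(rank + kernel.shape[1] == A.shape[1])   # True: rank-nullity holds
```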

The last part we will consider here is affine subspaces. These are translates of subspaces: they retain
a linear structure but do not necessarily pass through the origin.

• Let V be a vector space, x0 ∈ V, and U ⊆ V a subspace. Then

  L = x0 + U = {x0 + u : u ∈ U }

is an affine subspace.

• Examples of affine subspaces: points, lines, planes...

• If (b1, . . . , bk) is an ordered basis of U, then every element x ∈ L is uniquely described as
x = x0 + Σ_{i=1}^{k} λi bi

• In the same way that we can define linear mappings between vector spaces, we can also define
affine mappings between affine subspaces.

• For two vector spaces V, W, a linear mapping Φ : V → W, and a ∈ W, the mapping

  ϕ : V → W,  x ↦ a + Φ(x)

is an affine mapping from V to W, where a is the translation vector.

• Every affine mapping is the composition of a linear mapping Φ and a translation τ, such that
ϕ = τ ∘ Φ

• The composition ϕ′ ∘ ϕ of affine mappings is affine

• Affine mappings preserve dimension, parallelism, and other geometric structure

5 Inner Product Spaces

• A norm on a vector space V is a function

∥·∥ : V → R

where we say ∥x∥ is the norm of x. A norm must satisfy

1. Absolutely homogeneous: ∥λx∥ = |λ|∥x∥
2. Triangle inequality: ∥x + y∥ ≤ ∥x∥ + ∥y∥
3. Positive definite: ∥x∥ ≥ 0 and ∥x∥ = 0 ⇒ x = 0

• A norm represents the size of a vector

• Examples for V = Rd :

1. ℓ1 norm ∥x∥1 = Σ_i |xi| (also called the Manhattan norm)
2. ℓ2 norm ∥x∥2 = √(Σ_i xi²) = √(x⊤x) (also called the Euclidean norm, typically assumed
as the default)
3. ℓ0 "norm" ∥x∥0 = Σ_i 1[xi ≠ 0] (despite the name, not a true norm, since it is not absolutely homogeneous)
4. ℓ∞ norm ∥x∥∞ = maxi |xi |
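All of these are one call each in NumPy (a small sketch; the vector is arbitrary):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(x, 1))        # l1:    |3| + |-4| + |0| = 7
print(np.linalg.norm(x, 2))        # l2:    sqrt(9 + 16) = 5
print(np.count_nonzero(x))         # "l0":  number of non-zero entries = 2
print(np.linalg.norm(x, np.inf))   # l_inf: max |x_i| = 4
```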

• Recall: a linear map is one that is linear with respect to addition and multiplication with a
scalar, i.e. Φ(λx + ψy) = λΦ(x) + ψΦ(y)

• A bi-linear map Ω : V × V → R is a function of two arguments that is linear in both arguments:

Ω(λx + ψy, z) = λΩ(x, z) + ψΩ(y, z)

Ω(x, λy + ψz) = λΩ(x, y) + ψΩ(x, z)

• A bi-linear map Ω is symmetric if

Ω(x, y) = Ω(y, x)

• A bi-linear map is positive definite if

∀x ∈ V \ {0} : Ω(x, x) > 0, Ω(0, 0) = 0

• A positive definite, symmetric bilinear map Ω is an inner product on V , where we often write
Ω(x, y) = ⟨x, y⟩

• (V, ⟨·, ·⟩) is called an inner product space.

• Inner products capture the notion of alignment between two vectors x, y. This is analogous
to the angle between vectors in Euclidean space.

• If ⟨x, y⟩ = xT y then this is a Euclidean vector space

• Examples of inner products:

1. ⟨x, y⟩ = xT y
2. ⟨x, y⟩ = x1 y1 − (x1 y2 + x2 y1 ) + 2x2 y2

• Cauchy-Schwarz inequality:
|⟨x, y⟩| ≤ ∥x∥∥y∥

With an inner product, we can define other useful constructs such as positive definite matrices.

• Let V be an inner product space with an ordered basis B = (b1, . . . , bn). Then any vectors
x, y ∈ V can be written as sums of basis vectors, x = Σ_i ψi bi and y = Σ_j λj bj (so x̂ =
(ψ1, . . . , ψn) and ŷ = (λ1, . . . , λn)). Then,

  ⟨x, y⟩ = ⟨ Σ_i ψi bi , Σ_j λj bj ⟩ = Σ_i Σ_j ψi ⟨bi, bj⟩ λj = x̂⊤ A ŷ

where Aij = ⟨bi , bj ⟩.

• Similar to how a linear map is uniquely determined by its transformation matrix in a given
basis, an inner product is uniquely determined through A. Since an inner product is positive
definite, it follows that xT Ax > 0 for all x ∈ V \ {0}.

• If xT Ax > 0 for all non-zero x ∈ V , then A is positive definite.

• If x⊤Ax ≥ 0 for all x ∈ V , then A is positive semi-definite.

• Theorem: For a real-valued, finite dimensional vector space V with ordered basis B, ⟨·, ·⟩ is
an inner product if and only if there exists a symmetric positive definite matrix A such that
⟨x, y⟩ = x̂T Aŷ

• Properties of symmetric positive definite matrices:

1. ker(A) = {0}: since x⊤Ax > 0 for all x ≠ 0, it follows that Ax ≠ 0 if x ≠ 0.
2. The diagonal entries must be positive, since aii = ei⊤Aei > 0

• Therefore, a symmetric positive definite matrix defines a valid inner product with respect to a chosen basis.
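As a sketch (assuming NumPy; the matrix A below is one arbitrary choice), we can check symmetry and positive definiteness via the eigenvalues and then evaluate the induced inner product ⟨x, y⟩ = x̂⊤Aŷ:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                  # symmetric by construction

is_spd = np.allclose(A, A.T) and np.all(np.linalg.eigvalsh(A) > 0)
print(is_spd)                               # True: eigenvalues are 1 and 3

inner = lambda x, y: x @ A @ y              # inner product induced by A
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(inner(x, y))                          # 1.0: not orthogonal under this inner product
print(inner(x, x) > 0, inner(y, y) > 0)     # positive definiteness on these examples
```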

Inner products can do more than just capture the alignment of two vectors: they’re a formalization
of the notion of length and distance as well.

• A norm of a vector ∥x∥ formalizes the size or the length of x.

• Any inner product induces a norm: ∥x∥ = √⟨x, x⟩

• Can think of this induced norm as the alignment of a vector with itself

• Not all norms are defined by an inner product

• A norm then defines a distance between two vectors as the size of the difference:

  d(x, y) = ∥x − y∥ = √⟨x − y, x − y⟩

• Classic example: Euclidean distance

• Formally this mapping is called a metric if

1. d is positive definite, i.e. d(x, y) ≥ 0 for all x, y ∈ V and d(x, y) = 0 ⇔ x = y


2. d is symmetric, i.e. d(x, y) = d(y, x) for all x, y ∈ V
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ V

• Don't confuse ⟨x, y⟩ with d(x, y): they have similar properties but behave in opposite ways (similar vectors have a large inner product but a small distance)

6 Orthogonality

Now that we have distances and similarity, we can now formalize a notion of dissimilarity. This is
useful in defining what is “left” after finding some basis vectors while minimizing redundancies.

• Orthogonality: two vectors x, y are orthogonal if ⟨x, y⟩ = 0. We write this as x ⊥ y. If


∥x∥ = ∥y∥ = 1 then we further say x, y are orthonormal.

• Orthogonality generalizes the idea of perpendicularity to arbitrary inner products that are not the dot
product.

• Orthogonality depends on the inner product! (1, 1) and (−1, 1) are orthogonal with respect
to the dot product, but not with respect to the inner product

  ⟨x, y⟩ = x⊤ [ 2  0 ] y
              [ 0  1 ]
• Orthogonal matrix: a square matrix A is orthogonal if and only if its columns are orthonormal.

• This implies that A⊤ A = AA⊤ = I and so A−1 = A⊤

• Orthogonal matrices do not change the length of a vector x:

∥Ax∥2 = ∥x∥2

• Orthogonality is most often used to construct a basis. B is an orthonormal basis (ONB) if


⟨bi , bj ⟩ = 0 for i ̸= j and ⟨bi , bi ⟩ = 1

• Such a basis has minimal redundancy between basis vectors (with respect to the inner product)
and is standardized to all have a norm of 1

• Orthogonal complement: Let V be a D dimensional vector space and U ⊆ V an M dimen-
sional subspace. The orthogonal complement U⊥ ⊆ V is a (D − M) dimensional subspace such that
U⊥ contains all vectors in V that are orthogonal to every vector in U :

  U⊥ = {v ∈ V | ∀u ∈ U : ⟨u, v⟩ = 0}

• Note that U ∩ U⊥ = {0}. Therefore, every vector x ∈ V can be decomposed as

  x = Σ_{m=1}^{M} λm bm + Σ_{j=1}^{D−M} ψj bj⊥

where (b1, . . . , bM) is a basis of U and (b1⊥, . . . , b(D−M)⊥) is a basis of U⊥.

• Example: the normal vector of a plane through the origin in 3D space spans the (one dimensional)
orthogonal complement of the plane.

• Projection: Let U ⊆ V be a subspace of V . A linear mapping π : V → U is a projection if
π² = π ∘ π = π.

• Recall that linear mappings can be uniquely represented with transformation matrices (after
choosing a basis). Thus, projection matrices P satisfy P 2 = P .

• An orthogonal projection is a linear mapping that finds the "closest" point on the subspace:

  πU(x) = arg min_{u∈U} ∥x − u∥

If πU(x) is the closest point to x, then the difference πU(x) − x must be orthogonal to all
u ∈ U, i.e. ⟨πU(x) − x, u⟩ = 0. In particular, if U has a basis B, then ⟨πU(x) − x, bi⟩ = 0 for every basis vector bi.

• Aside: projections to general sets instead of subspaces can be defined as the closest point to
a set U with respect to some norm.

• Re-expressing this as a set of coordinates λ so πU (x) = Bλ gives us

⟨bi , x − Bλ⟩ = 0

for all i. This means that

B ⊤ (x − Bλ) = 0 ⇔ B ⊤ x = B ⊤ Bλ

Since B is a basis of U , the columns are linearly independent so B ⊤ B is invertible. Thus,

λ = (B ⊤ B)−1 B ⊤ x

Plug this back in to get


πU (x) = Bλ = B(B ⊤ B)−1 B ⊤ x

• In one dimension (projecting onto the span of a single vector b), this simplifies to λ = (1/∥b∥²) b⊤x

• If a basis is orthonormal, then this simplifies greatly to

πU (x) = Bλ = BB ⊤ x

where the coordinates are λ = B ⊤ x

• Projections let us reason about spaces without solving for an exact solution.

• Projection onto affine spaces: recall that an affine space is L = x0 + U . We can subtract x0 ,
project onto U , and then add back in x0 to get the orthogonal projection:

πL (x) = x0 + πU (x − x0 )
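Below is a small NumPy sketch of orthogonal projection using the formula πU(x) = B(B⊤B)⁻¹B⊤x derived above, followed by projection onto an affine subspace; the basis and points are made up for illustration:

```python
import numpy as np

# U = span of two independent vectors in R^3, stacked as the columns of B.
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
x = np.array([6.0, 0.0, 0.0])

P = B @ np.linalg.inv(B.T @ B) @ B.T      # projection matrix onto U
proj = P @ x
print(np.allclose(P @ P, P))              # True: projections satisfy P^2 = P
print(np.allclose(B.T @ (x - proj), 0))   # True: the residual is orthogonal to U

# Projection onto the affine subspace L = x0 + U: shift, project, shift back.
x0 = np.array([1.0, 1.0, 1.0])
proj_affine = x0 + P @ (x - x0)
print(proj_affine)
```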

Lastly, we’ll show some examples of inner products on functions, which will reflect what we need
later when we look at functional analysis.

• Inner products on non-finite spaces can be generalized to vectors with infinite entries (count-
ably infinite) and also continuous valued (uncountably infinite). In the latter case this becomes
an integral.

• Let u, v : R → R be two functions. One typical inner product is

  ⟨u, v⟩ = ∫_a^b u(x) v(x) dx

• Example: for u(x) = sin(x) and v(x) = cos(x), the product u(x)v(x) is an odd function, and therefore
its integral from −π to π is 0. Thus, sin and cos are orthogonal functions (for this inner product with a = −π, b = π).

• The collection of functions {1, cos(x), cos(2x), cos(3x), . . . , } is orthogonal when (a, b) = (−π, π).
This is a space of even and periodic functions, and projecting onto this space is the funda-
mental idea behind Fourier series.
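A quick numerical check of these orthogonality claims, sketched with SciPy quadrature (the inner product is the integral over [−π, π]):

```python
import numpy as np
from scipy.integrate import quad

# <u, v> = integral of u(x) v(x) over [-pi, pi]
inner = lambda u, v: quad(lambda x: u(x) * v(x), -np.pi, np.pi)[0]

print(inner(np.sin, np.cos))                    # ~0: sin and cos are orthogonal
print(inner(np.cos, lambda x: np.cos(2 * x)))   # ~0: cos(x) and cos(2x) are orthogonal
print(inner(np.cos, np.cos))                    # ~pi: cos is not orthogonal to itself
```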

7 Decompositions

We’ll now explore how various decompositions in linear algebra reduce down to change of basis.

• With norms we have lengths, and with inner products we have similarity between vectors (i.e.
angles). Lastly, we can introduce a notion of volume.

• Determinant: det : Rn×n → R is a measure of the (signed) volume of the hyper-parallelepiped formed
from the columns of A

• det(T) = Π_{i=1}^{n} Tii where T is triangular

• Laplace expansion along column j:

  det(A) = Σ_{k=1}^{N} (−1)^{k+j} akj det(Akj)

where Akj is the (N − 1) × (N − 1) submatrix of A obtained by deleting row k and column j

• A matrix A ∈ Rn×n is invertible if and only if det(A) ̸= 0

• Intuition: if a matrix is not invertible, then some of its columns are linearly dependent and they
span a space of dimension less than N. Therefore, the volume in N dimensional space is 0. On the other
hand, a volume of 0 implies that the columns span a subspace with dimension less than N,
and the matrix is therefore non-invertible.

• det(AB) = det(A) det(B)

• det(A) = det(A⊤ )

• det(A⁻¹) = 1/det(A) if A is invertible

• Similar matrices have the same determinant, i.e. if Ã = S⁻¹AS then det(Ã) = det(A).
Therefore, the determinant does not depend on the basis.

• Multiplying a row by λ scales det(A) by λ, and det(λA) = λN det(A)

• Swapping rows or columns changes the sign of det(A)

• det(A) ̸= 0 if and only if rk(A) = N

• The trace of a matrix, tr(A) = Σ_i aii, is the sum of the diagonal elements

• Properties of trace:

1. tr(A + B) = tr(A) + tr(B)


2. tr(αA) = αtr(A)
3. tr(I) = N
4. tr(AB) = tr(BA)

• Furthermore, the only operator that satisfies these properties is the trace

• Trace is invariant under cyclic permutations:

tr(AKL) = tr(KLA)

Simpler:
tr(xy T ) = tr(y T x) = y T x

• The trace of a linear map Φ is defined as the trace of its transformation matrix:

  tr(Φ) = tr(AΦ)

• The trace is independent of the basis, since for Ã = S⁻¹AS,

  tr(Ã) = tr(S⁻¹AS) = tr(SS⁻¹A) = tr(A)
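These invariances are easy to spot-check numerically (a sketch with arbitrary random matrices; a generic random S is invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
S = rng.standard_normal((4, 4))           # generic random matrix, invertible in practice
A_sim = np.linalg.inv(S) @ A @ S          # a matrix similar to A

print(np.isclose(np.trace(A), np.trace(A_sim)))            # True: trace is basis-independent
print(np.isclose(np.linalg.det(A), np.linalg.det(A_sim)))  # True: det is basis-independent
print(np.isclose(np.linalg.det(A), np.linalg.det(A.T)))    # True: det(A) = det(A^T)
```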

With the determinant and the trace, we can now cover eigenvalues and eigenvectors.

• Let A be square. λ ∈ R is an eigenvalue and x ̸= 0 is a corresponding eigenvector if Ax = λx.

• Eigenvectors are directions along which the transformation A acts as a pure scaling by λ. In other
words, transforming an eigenvector with A simply scales it by λ.

• The following are equivalent:

1. λ is an eigenvalue of A
2. There exists x such that Ax = λx, or equivalently (A − λI)x = 0 can be solved for x ̸= 0
3. rk(A − λI) < n
4. det(A − λI) = 0

• x, y are collinear if x = cy for some c. All vectors collinear to an eigenvector are also
eigenvectors, i.e.
A(cx) = cAx = cλx = λ(cx)

• A and A⊤ have the same eigenvalues

• Similar matrices à and A have the same eigenvalues. Therefore, a linear map Φ has the same
eigenvalues regardless of the choice of basis.

• In total, eigenvalues, determinant, and trace are all invariant under basis change.

• Symmetric, positive definite matrices have positive, real eigenvalues

• Spectral Theorem: if A is symmetric, there exists an ONB of V of eigenvectors of A, and


each eigenvalue is real:
A = P DP ⊤
where P contains the eigenvectors and D is diagonal of the eigenvalues

• Theorem: The determinant is det(A) = Π_{i=1}^{N} λi. You can think of this as the "area" of the
parallelepiped after a change of basis to one that behaves like a "standard" basis.

• Theorem: The trace is tr(A) = Σ_{i=1}^{N} λi

A diagonal matrix is very “nice”. It has fast computation of determinants, powers, inverses, etc.
An eigenvalue decomposition is a way to transform matrices into a diagonal form, and is a direct
application of the change of basis. Let’s explore this in more detail.

• Recall two matrices A, D are similar if D = P −1 AP where P is invertible.

• Diagonalizable: we would like to have matrices A that are similar to diagonal matrices D.
Then, this would mean that applying the linear map A is equivalent to scaling each coordinate
in the new basis P .

• Let D = diag(λ1, . . . , λN) and let A be a square matrix. Then

  AP = PD

if and only if the λi are eigenvalues of A and the columns of P are corresponding eigenvectors. Why? By definition,

  AP = [Ap1, . . . , ApN]

  PD = [λ1p1, . . . , λNpN]

so AP = PD means Api = λipi for each column pi.

• Therefore, D = P⁻¹AP (with P invertible) if and only if D contains the eigenvalues and the columns of P
are the corresponding eigenvectors (i.e. the diagonalization of a matrix is given by its eigenvalues and eigenvectors).

• Eigendecomposition: A square matrix A ∈ Rn×n can be factored into

  A = PDP⁻¹

where P is square and D is diagonal with the eigenvalues of A, if and only if the eigenvectors of
A form a basis of Rn.

• Symmetric matrices can always be diagonalized. In particular, AA⊤ and A⊤ A are always
diagonalizable.
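A short NumPy sketch of the eigendecomposition for a symmetric matrix, which also checks the determinant and trace identities above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                    # symmetric, so an ONB of eigenvectors exists

eigvals, P = np.linalg.eigh(A)                # eigh: for symmetric matrices, P is orthogonal
D = np.diag(eigvals)

print(np.allclose(A, P @ D @ P.T))            # A = P D P^T (spectral theorem)
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))   # det(A) = product of eigenvalues
print(np.isclose(np.trace(A), np.sum(eigvals)))         # tr(A) = sum of eigenvalues
print(np.allclose(P.T @ P, np.eye(2)))        # the eigenvectors form an orthonormal basis
```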

Lastly, we’ll summarize a few of the major decompositions and how these can be understood as
finding a change of basis that results in a nice transformation matrix.

• The Eigendecomposition is a change of basis for square, symmetric matrices A = U ΛU T ,


which results in a diagonal transformation matrix. The resulting basis is an orthonormal
basis (the eigenvectors) with respect to the standard inner product ⟨x, y⟩ = xT y.

• The singular value decomposition is a change of basis for rectangular matrices A = U ΣV T ,


which results in a diagonal transformation matrix. The resulting bases are orthonormal with
respect to the standard inner product ⟨x, y⟩ = xT y

• The QR decomposition is a change of basis for rectangular matrices A = QR which results


in an identity transformation matrix. The resulting bases are unchanged for the target but
orthonormal for the input with respect to the standard inner product.

• The Cholesky decomposition for symmetric, positive definite matrices is A = LDLT , which
can be viewed as a change of basis that results in a diagonal transformation matrix. In
particular, the resulting basis is an orthogonal basis with respect to the matrix inner product
⟨x, y⟩ = xT Ay.

• The Sinkhorn normal form for square positive matrices is A = D1SD2, which can be viewed as a
change of basis that results in a doubly stochastic transformation matrix S.

• See many more decompositions at https://en.wikipedia.org/wiki/Matrix_decomposition
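For reference, the first few of these factorizations are each a single call in NumPy. A sketch is below; note that np.linalg.cholesky returns the closely related form A = LL⊤ rather than LDL⊤, and the matrices are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))      # a rectangular matrix
A = M.T @ M + np.eye(3)              # symmetric positive definite by construction

w, U = np.linalg.eigh(A)             # eigendecomposition: A = U diag(w) U^T
U_svd, s, Vt = np.linalg.svd(M)      # SVD of a rectangular matrix: M = U diag(s) V^T
Q, R = np.linalg.qr(M)               # QR: M = QR, Q has orthonormal columns
L = np.linalg.cholesky(A)            # Cholesky: A = L L^T, L lower triangular

print(np.allclose(A, U @ np.diag(w) @ U.T))
print(np.allclose(M, U_svd[:, :3] @ np.diag(s) @ Vt))
print(np.allclose(M, Q @ R))
print(np.allclose(A, L @ L.T))
```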

As an example, PCA is often described as finding a basis that retains as much information as
possible. This is formalized as finding basis directions that maximize the variance of the coordinates:

  max_{∥b∥=1} V[b⊤x]

where b⊤x is the 1D projection of x onto the coordinate b. PCA finds basis vectors b that maximize
this variance, then iteratively repeats this on the remaining orthogonal subspace.
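A minimal sketch of this idea with synthetic data (assuming NumPy): the direction maximizing the variance of b⊤x is the top eigenvector of the sample covariance matrix, and the variance of the projected coordinates equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])  # anisotropic data
Xc = X - X.mean(axis=0)                      # center the data

C = Xc.T @ Xc / (len(Xc) - 1)                # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # ascending eigenvalues, orthonormal eigenvectors
b = eigvecs[:, -1]                           # top eigenvector = first principal direction

print(b)                                     # roughly aligned with the high-variance axis
print(np.var(Xc @ b, ddof=1), eigvals[-1])   # variance of the 1D projection = top eigenvalue
```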
