Linear Algebra
1 Linear Algebra
Most likely you are familiar with basic operations on matrices and vectors. For example, A ∈ R3×5
is a matrix of real numbers with 3 rows and 5 columns, while b ∈ R3 is a vector of 3 elements.
These form the basis of a system that we call linear algebra, which has several main properties.
Our goal in this module will be to learn about the fundamental properties of linear systems,
and generalize these properties to abstract vector spaces that are not necessarily in the field of real
numbers.
• For m, n ∈ N, a matrix A is an (m, n)-tuple of elements a_{ij}, where i denotes the row and j denotes the column.
• Addition: if C = A + B and A, B ∈ Rm×n then cij = aij + bij
• Product: If C = AB with A ∈ R^{m×k} and B ∈ R^{k×n}, then c_{ij} = \sum_{l=1}^{k} a_{il} b_{lj}, where C ∈ R^{m×n}
• Identity: Im ∈ Rm×m is an identity matrix when it is zero everywhere except the diagonal,
i.e. Iij = 1[i = j]
• Associativity: (AB)C = A(BC)
• Distributivity: (A + B)C = AC + BC, A(C + D) = AC + AD
• Multiplication with identity: ∀A ∈ Rm×n : Im A = AIn = A
• Inverse: Let A ∈ R^{n×n}. If AB = I_n = BA then B = A^{-1} is the inverse of A
• Transpose: Let A ∈ Rm×n . The matrix B = A⊤ such that bij = aji is called the transpose.
• Symmetric: A ∈ Rn×n is symmetric if A = A⊤
One of the most common uses of matrices and vectors is to represent linear systems of equations in a compact form. That is,
Ax = b
represents a system of linear equations, where each row of A contains the coefficients of the variables x and the corresponding entry of b is the target scalar for that equation.
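As a small illustration (my addition, not part of the original notes; it assumes NumPy is available and uses made-up values), the system Ax = b can be built and solved numerically:

import numpy as np

# Hypothetical 3x3 system: each row of A holds the coefficients of one equation.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([3.0, 5.0, 3.0])

x = np.linalg.solve(A, b)      # solves Ax = b when A is invertible
assert np.allclose(A @ x, b)   # each row of A dotted with x recovers the corresponding entry of b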
2 Groups
The space of matrices and vectors behaves nicely, in that it has these properties of associativity,
distributivity, an identity and an inverse. Let's now generalize this structure. A group is a set G together with an operation ⊗ defined on G that satisfies the following properties:
1. Closure: ∀x, y ∈ G : x ⊗ y ∈ G
2. Associativity: ∀x, y, z ∈ G : (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z)
3. Neutral element: ∃e ∈ G∀x ∈ G : x ⊗ e = e ⊗ x = x
4. Inverse element: ∀x ∈ G ∃y ∈ G : x ⊗ y = y ⊗ x = e. We write x^{-1} to denote the inverse element of x. This does not necessarily mean 1/x; the inverse is defined with respect to the operator ⊗.
5. (Commutativity) If additionally ∀x, y ∈ G : x ⊗ y = y ⊗ x, then G is an Abelian group
• Matrices and multiplication: (R^{n×n}, ·) is only a group if the inverse always exists, which is not the case, so the set of all square matrices with matrix multiplication is not a group
• General Linear Group: set of invertible matrices A ∈ Rn×n is a group with respect to matrix
multiplication, but is not Abelian (not commutative)
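As a quick numerical illustration of the last point (a sketch I am adding, with made-up matrices), two invertible matrices generally do not commute, which is why the general linear group is not Abelian:

import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # invertible: det(A) = 1
B = np.array([[1.0, 0.0], [1.0, 1.0]])   # invertible: det(B) = 1

print(np.linalg.det(A), np.linalg.det(B))   # both nonzero, so A and B lie in GL(2, R)
print(np.allclose(A @ B, B @ A))            # False: A @ B != B @ A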
3 Vector Spaces
Groups have an operation with structure that stays within the group. This can be referred to as an inner operation (e.g. elementwise addition), since the result stays within the group. We can also consider an outer operation, which takes in an element from outside the group. A (real) vector space V = (V, +, ·) is a set V with an inner operation + : V × V → V (vector addition) and an outer operation · : R × V → V (scalar multiplication) such that (V, +) is an Abelian group and the two operations are compatible; in particular:
• Distributivity:
1. ∀λ ∈ R, ∀x, y ∈ V : λ ⊙ (x + y) = λ ⊙ x + λ ⊙ y
2. ∀λ, ψ ∈ R, ∀x ∈ V : (λ + ψ) ⊙ x = λ ⊙ x + ψ ⊙ x
• x ∈ V are called vectors, the neutral element is 0, and the inner operation is vector addition while the outer operation is multiplication by scalars.
• Subspaces inherit properties from the larger space, including being Abelian, distributivity, associativity, and the neutral element. To show that U is a subspace, we need to show that 0 ∈ U and that U is closed with respect to both the inner and outer operations (i.e. ∀λ ∈ R, ∀x ∈ U : λx ∈ U and ∀x, y ∈ U : x + y ∈ U).
This structure gives us the nice properties we expect in linear algebra (i.e. we can do operations
on vectors that result in more vectors).
• Let V = (V, +, ·) be a vector space and let A = {x_1, . . . , x_k} ⊂ V. If every vector v ∈ V can be expressed as a linear combination of the vectors in A, then A is a generating set of V.
• A generating set A is minimal if there does not exist a smaller Ā ⊊ A that spans V .
• Let B ⊆ V, B ̸= ∅. The following statements are equivalent:
– B is a basis
– B is a minimal generating set
– B is a maximally linearly independent set of vectors in V , i.e. adding any vector will
make the set linearly dependent
– Every vector x ∈ V is a linear combination of vectors from B, and every linear combination is unique:
x = \sum_i λ_i b_i = \sum_i ψ_i b_i ⟹ λ_i = ψ_i
• Example: Standard basis is B = {e1 , . . . , ek } where ei is zero everywhere except for the ith
position which is 1.
• Example:
B = \left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \right\}
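A small numerical check (my addition, assuming NumPy) that the three vectors in this example are linearly independent and hence form a basis of R^3:

import numpy as np

# Columns are the candidate basis vectors (1,0,0), (1,1,0), (1,1,1).
B = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

print(np.linalg.matrix_rank(B))   # 3: the columns are linearly independent
print(np.linalg.det(B))           # 1.0: nonzero determinant, so the columns form a basis of R^3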
• Every vector space has a basis B. There can be multiple bases, i.e. they are not unique.
However, they all have the same number of basis vectors.
• Example of an infinite dimensional basis: suppose V is the set of all countably infinite vectors v = (v_1, v_2, . . .). Then it has an infinite basis B = {e_1, e_2, . . .}: every vector can be written as v = \sum_i λ_i e_i for some λ_i that exists for each v.
• Example: The functions e_n(θ) = e^{2πinθ}, n ∈ Z, form an (orthonormal) basis of the Hilbert space L^2([0, 1]), where L^2([0, 1]) is the space of functions on [0, 1] for which the Lebesgue integral of the square of the absolute value is finite, i.e. \int |f|^2 dµ < ∞.
• The number of linearly independent columns of a matrix A equals the number of linearly
independent rows and is called the rank of A, denoted as rk(A).
• Properties:
– rk(A) = rk(A⊤ )
– The columns of A span a subspace U ⊆ Rm with dim(U ) = rk(A). This subspace is
called the image or range of A.
– Similarly, the rows of A span a subspace W ⊆ Rn with dim(W ) = rk(A)
– For square matrices A ∈ Rn×n , A is invertible (regular) if and only if rk(A) = n.
– For all A, b the linear system Ax = b can be solved if and only if rk(A) = rk(A|b).
– For A ∈ R^{m×n}, the subspace of solutions x such that Ax = 0 has dimension n − rk(A). This subspace is called the kernel, or the null space (a small numerical sketch follows this list).
– A matrix has full rank if rk(A) = min(m, n). Otherwise, it is rank deficient.
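To make the rank properties above concrete, here is a minimal sketch (my own, with a made-up matrix; it assumes NumPy and SciPy are available) that computes the rank and the dimension of the null space:

import numpy as np
from scipy.linalg import null_space   # assumption: SciPy is installed

# A deliberately rank-deficient 3x4 matrix: the third row is the sum of the first two.
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 3.0, 1.0, 1.0]])

r = np.linalg.matrix_rank(A)   # rk(A) = 2 = dim(Im(A))
N = null_space(A)              # orthonormal basis of the kernel {x : Ax = 0}
print(r, N.shape[1])           # 2 and n - rk(A) = 4 - 2 = 2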
The goal of a basis is to provide a sense of structure to the vector space. We will now look at linear
mappings, which are operations that preserve the structure of a vector space. This is analogous to
the group operator, which preserves the structure of a group. Previously, we had operators + and
· for a vector space corresponding to elementwise addition and scalar multiplication. We will now
look at operators between vector spaces that preserve this structure.
A mapping Φ : V → W between vector spaces is linear if Φ(λx + ψy) = λΦ(x) + ψΦ(y) for all x, y ∈ V and λ, ψ ∈ R. Linear mappings are sometimes also called vector space homomorphisms or linear transformations.
• Φ : V → V is an endomorphism if it is linear; Φ : V → W is an isomorphism if it is linear and bijective
• Theorem: Finite dimensional vector spaces V, W are isomorphic if and only if dim(V ) =
dim(W ) (Axler 2015).
• Intuitively, this means that vector spaces with the same dimension are the ”same” in that
you can transform from one to the other without losing anything. This means that we can
treat the space of Rm×n matrices as the same as Rmn vectors, as there is a linear bijective
mapping from one to the other.
• More properties:
The key point of the previous section is that any n-dimensional vector space is isomorphic to R^n. Therefore, any reasoning we can do in R^n applies to any finite dimensional vector space. So in the finite dimensional case, we only need to study R^n, since everything that is finite dimensional can be reduced to R^n. As an example, the identification of R^{m×n} with R^{mn} can be made concrete by flattening a matrix into a vector (a small sketch follows).
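A minimal sketch of that identification (my own, assuming NumPy): flattening a matrix into a vector and reshaping it back is a linear bijection between R^{m×n} and R^{mn}:

import numpy as np

A = np.arange(6.0).reshape(2, 3)   # an element of R^{2x3}
v = A.reshape(-1)                  # the corresponding element of R^6 (a linear, bijective map)
A_back = v.reshape(2, 3)           # the inverse map recovers the matrix exactly
assert np.array_equal(A, A_back)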
• Let B = (b_1, . . . , b_n) be an ordered basis of V. For any x ∈ V, let x = \sum_i α_i b_i be its unique linear combination. Then, we call α_1, . . . , α_n the coordinates of x.
• Typical coordinate system: standard basis e1 , . . . en . The coefficients tell us how to linearly
combine to obtain x. However, one could also use the basis ((1, 0), (1, 1)) to span R2 .
• Transformation matrix: Let V, W be vector spaces with bases B, C, and consider a linear
mapping Φ : V → W . For j ∈ {1, . . . , n} let
Φ(b_j) = \sum_{i=1}^{m} α_{ij} c_i
be the unique representation of Φ(b_j) with respect to C. Then, if A is the matrix given by A_{ij} = α_{ij}, then A is the transformation matrix of Φ with respect to B and C. This tells us how to go from one vector space to another, represented as a matrix.
• What this means is that any linear mapping between finite dimensional spaces can be represented with a matrix. Just pick a basis for the domain and target, and compute the coefficients!
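As a sketch of that recipe (my own example, assuming NumPy; the map and bases are chosen for illustration): take Φ : R^2 → R^2 to be rotation by 90 degrees and use the standard basis for both the domain and the codomain, so the columns of the transformation matrix are the coordinate vectors of Φ(b_j):

import numpy as np

def phi(x):
    # A concrete linear map on R^2: rotation by 90 degrees counter-clockwise.
    return np.array([-x[1], x[0]])

# Ordered standard basis, used here for both the domain and the codomain.
basis = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

# Column j of A holds the coordinates of phi(b_j) with respect to the codomain basis
# (with the standard basis, those coordinates are just the entries of phi(b_j)).
A = np.column_stack([phi(b) for b in basis])
print(A)   # [[ 0. -1.]
           #  [ 1.  0.]]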
We now have a coordinate system for our vector spaces, which depends on a chosen basis B. However, remember that the basis is not unique: there are multiple possible bases for a vector space. Depending on the basis, the resulting linear transformation could be easier or harder to work with. We will work towards characterizing what it means for a basis to be “nice”. But in order to do so, we first need to understand how to change between bases.
As an example, consider the linear transformation Φ with transformation matrix
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
in the standard basis. If instead of the standard basis, we use the basis B = ([1, 1], [1, −1]) then
the linear map Φ has transformation matrix
Ã = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}
which is a diagonal matrix (which is nice). We'll see how to get this in the next section.
• For a linear mapping Φ : V → W , consider two bases B = (b1 , . . . , bn ) and B̃ = (b̃1 , . . . , b̃n )
on V and two bases C = (c1 , . . . , cm ) and C̃ = (c̃1 , . . . , c̃m ) on W .
• Let AΦ ∈ Rm×n be the transformation matrix of Φ with respect to B, C and let ÃΦ be the
transformation matrix of Φ with respect to B̃, C̃.
• Theorem (Basis Change): The transformation matrix ÃΦ is given by
ÃΦ = T^{-1} AΦ S
where S ∈ R^{n×n} is the transformation matrix of id_V that maps coordinates with respect to B̃ to coordinates with respect to B, and T ∈ R^{m×m} is the transformation matrix of id_W that maps coordinates with respect to C̃ to coordinates with respect to C.
• Proof: First, by definition of S we can write each b̃_j as a sum of the basis vectors b_i:
b̃_j = \sum_{i=1}^{n} s_{ij} b_i
Then, S maps B̃ onto B and T maps C̃ onto C (the columns are the coordinate representation
of b̃j and c̃k with respect to B and C). Now, re-express Φ(b̃j ) in two ways using these two
bases. First using C:
Φ(b̃_j) = \sum_{k=1}^{m} ã_{kj} c̃_k = \sum_{k=1}^{m} ã_{kj} \sum_{l=1}^{m} t_{lk} c_l = \sum_{l=1}^{m} c_l \sum_{k=1}^{m} t_{lk} ã_{kj}
Then using B:
Φ(b̃_j) = Φ\left( \sum_{i=1}^{n} s_{ij} b_i \right) = \sum_{i=1}^{n} s_{ij} Φ(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li} c_l = \sum_{l=1}^{m} c_l \sum_{i=1}^{n} a_{li} s_{ij}
Comparing the coefficients of each c_l in the two expressions gives T ÃΦ = AΦ S, and since T is regular, ÃΦ = T^{-1} AΦ S.
• Aside: Why are S and T regular (invertible)? They are the matrix representation of the
identity operator, which is an invertible operator.
• Two matrices A, Ã ∈ R^{m×n} are equivalent if there exist regular matrices S ∈ R^{n×n}, T ∈ R^{m×m} such that Ã = T^{-1} AS
• Two matrices A, Ã ∈ R^{n×n} are similar if there exists a regular matrix S ∈ R^{n×n} such that Ã = S^{-1} AS
– A represents Φ : V → W with respect to bases B and C
– Ã represents Φ : V → W with respect to bases B̃ and C̃
– S is the transformation matrix of the identity mapping from basis B̃ to B
– T is the transformation matrix of the identity mapping from basis C̃ to C
– Then, the map from B̃ to C̃ can be rewritten as the composition
B̃ → B → C → C̃
Then,
Ã = T^{-1} A S = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}
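A quick numerical check of this computation (my addition, assuming NumPy):

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])    # transformation matrix in the standard basis
S = np.array([[1.0, 1.0], [1.0, -1.0]])   # columns are the new basis vectors (1,1) and (1,-1)
T = S.copy()                              # the same basis change is used on both sides here

A_tilde = np.linalg.inv(T) @ A @ S
print(A_tilde)                            # [[3. 0.]
                                          #  [0. 1.]], a diagonal matrix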
Now that we have a better understanding of bases and how to change between bases for linear
mappings, we’ll now cover our last fundamental concept for vector spaces: images and kernels.
• For a linear mapping Φ : V → W, the kernel (null space) is ker(Φ) = {v ∈ V : Φ(v) = 0} and the image (range) is Im(Φ) = {Φ(v) : v ∈ V}.
– It is always true that Φ(0) = 0 and therefore 0 ∈ ker(Φ), so the null space is never empty.
– It is also always true that ker(Φ) ⊆ V and Im(Φ) ⊆ W.
– Φ is injective if and only if ker(Φ) = {0}
• Let a_1, . . . , a_n be the columns of A = [a_1, . . . , a_n]. Then the image is the span of the columns (the column space):
Im(Φ) = {Ax : x ∈ R^n} = \left\{ \sum_i x_i a_i : x_i ∈ R \right\} = span[a_1, . . . , a_n] ⊆ R^m
• Then it follows that the rank of A is the dimension of the image, i.e. rk(A) = dim(Im(Φ))
• Rank–Nullity Theorem: For vector spaces V, W and a linear mapping Φ : V → W it holds that
dim(ker(Φ)) + dim(Im(Φ)) = dim(V)
• Some immediate consequences:
The last part we will consider here is affine subspaces. These are translates of linear subspaces: they retain a linear structure but need not pass through the origin. For a vector space V, a point x_0 ∈ V, and a subspace U ⊆ V, the set
L = x_0 + U = {x_0 + u : u ∈ U}
is an affine subspace.
• If (b_1, . . . , b_k) is an ordered basis of U, then every element x ∈ L is uniquely described as x = x_0 + \sum_{i=1}^{k} λ_i b_i
• In the same way that we can define linear mappings between vector spaces, we can also define affine mappings between affine subspaces. For a linear mapping Φ : V → W and a ∈ W, the map
ϕ : V → W
x ↦ a + Φ(x)
is an affine mapping from V to W, where a is the translation vector.
• Every affine mapping is the composition of a linear mapping Φ and a translation τ_a such that ϕ = τ_a ∘ Φ
• Affine mappings preserve dimension, parallelism, and other geometric structure; a small sketch follows.
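A tiny sketch of an affine map (my addition, assuming NumPy; the linear part and translation are made up) as a linear map followed by a translation:

import numpy as np

Phi = np.array([[2.0, 0.0], [0.0, 1.0]])   # linear part
a = np.array([1.0, -1.0])                  # translation vector

def affine(x):
    # phi(x) = a + Phi(x): the composition of the linear map Phi and the translation tau_a.
    return a + Phi @ x

print(affine(np.array([0.0, 0.0])))   # [ 1. -1.]: the origin is not fixed, so phi is not linear
print(affine(np.array([1.0, 1.0])))   # [ 3.  0.]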
5 Inner Product Spaces
• A norm on a vector space V is a function
∥·∥ : V → R
that satisfies absolute homogeneity (∥λx∥ = |λ| ∥x∥), the triangle inequality (∥x + y∥ ≤ ∥x∥ + ∥y∥), and positive definiteness (∥x∥ ≥ 0, with ∥x∥ = 0 if and only if x = 0).
• Examples for V = R^d:
1. ℓ1 norm: ∥x∥_1 = \sum_i |x_i| (also called the Manhattan norm)
2. ℓ2 norm: ∥x∥_2 = \sqrt{\sum_i x_i^2} = \sqrt{x^T x} (also called the Euclidean norm, typically assumed as the default)
3. ℓ0 “norm”: ∥x∥_0 = \sum_i 1[x_i ≠ 0] (not a true norm, since it is not absolutely homogeneous)
4. ℓ∞ norm: ∥x∥_∞ = \max_i |x_i|
• Recall: a linear map is one that is linear with respect to addition and multiplication with a scalar, i.e. Φ(λx + ψy) = λΦ(x) + ψΦ(y)
• A map Ω : V × V → R is bilinear if it is linear in each argument separately. It is symmetric if
Ω(x, y) = Ω(y, x)
and positive definite if Ω(x, x) > 0 for all x ≠ 0.
• A positive definite, symmetric bilinear map Ω is an inner product on V, where we often write Ω(x, y) = ⟨x, y⟩
• Inner products capture the notion of alignment between two vectors x, y. This is analogous to the angle between vectors in Euclidean space.
• Examples on V = R^2:
1. ⟨x, y⟩ = x^⊤ y (the dot product)
2. ⟨x, y⟩ = x_1 y_1 − (x_1 y_2 + x_2 y_1) + 2 x_2 y_2
• Cauchy-Schwarz inequality:
|⟨x, y⟩| ≤ ∥x∥∥y∥
With an inner product, we can define other useful constructs such as positive definite matrices.
• Let V be an inner product space with an ordered basis B = (b_1, . . . , b_n). Then any vectors x, y ∈ V can be written as sums of basis vectors x = \sum_i ψ_i b_i and y = \sum_j λ_j b_j (so x̂ = (ψ_1, . . . , ψ_n) and ŷ = (λ_1, . . . , λ_n) are the coordinate vectors). Then,
⟨x, y⟩ = \left\langle \sum_i ψ_i b_i, \sum_j λ_j b_j \right\rangle = \sum_i \sum_j ψ_i ⟨b_i, b_j⟩ λ_j = x̂^T A ŷ
where A_{ij} = ⟨b_i, b_j⟩.
• Similar to how a linear map is uniquely determined by its transformation matrix in a given
basis, an inner product is uniquely determined through A. Since an inner product is positive
definite, it follows that xT Ax > 0 for all x ∈ V \ {0}.
• Theorem: For a real-valued, finite dimensional vector space V with ordered basis B, ⟨·, ·⟩ is
an inner product if and only if there exists a symmetric positive definite matrix A such that
⟨x, y⟩ = x̂T Aŷ
• Therefore, a symmetric positive definite matrix defines an inner product with respect to a chosen basis.
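A small sketch (my own, assuming NumPy) of such an inner product on R^2: the symmetric positive definite matrix A below defines ⟨x, y⟩ = x̂^T A ŷ, and a successful Cholesky factorization confirms positive definiteness:

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # symmetric with eigenvalues 1 and 3, hence positive definite
np.linalg.cholesky(A)                    # raises LinAlgError unless A is positive definite

def inner(x, y):
    # Inner product induced by A in the chosen (here: standard) basis.
    return x @ A @ y

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(inner(x, y))       # 1.0: e1 and e2 are not orthogonal under this inner product
print(inner(x, x) > 0)   # True: positive definiteness on a sample vector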
Inner products can do more than just capture the alignment of two vectors: they’re a formalization
of the notion of length and distance as well.
• The inner product induces a norm ∥x∥ = \sqrt{⟨x, x⟩}; we can think of this induced norm as the alignment of a vector with itself
• A norm then defines a distance between two vectors as the size of the difference:
d(x, y) = ∥x − y∥ = \sqrt{⟨x − y, x − y⟩}
• Classic example: Euclidean distance
• Don't confuse ⟨x, y⟩ with d(x, y): they have similar properties but behave oppositely (similar vectors have a large inner product but a small distance)
6 Orthogonality
Now that we have distances and similarity, we can now formalize a notion of dissimilarity. This is
useful in defining what is “left” after finding some basis vectors while minimizing redundancies.
• Two vectors x, y are orthogonal with respect to an inner product if ⟨x, y⟩ = 0. Orthogonality generalizes the idea of perpendicularity to arbitrary inner products other than the dot product.
• Orthogonality depends on the inner product! (1, 1) and (−1, 1) are orthogonal with respect to the dot product, but not with respect to the inner product ⟨x, y⟩ = x^⊤ \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix} y
• Orthogonal matrix: a square matrix A is orthogonal if and only if its columns are orthonormal, i.e. A^⊤A = AA^⊤ = I (so A^{-1} = A^⊤). Orthogonal matrices preserve lengths:
∥Ax∥_2 = ∥x∥_2
• An orthonormal basis is a basis whose vectors are pairwise orthogonal and each have norm 1. Such a basis has minimal redundancy between basis vectors (with respect to the inner product) and is standardized to all have a norm of 1.
• Orthogonal complement: for a subspace U ⊆ V,
U^⊥ = {v ∈ V | ∀u ∈ U : ⟨u, v⟩ = 0}
If V is D-dimensional and U is an M-dimensional subspace with orthonormal basis (b_1, . . . , b_M), and (b_1^⊥, . . . , b_{D−M}^⊥) is a basis of U^⊥, then every x ∈ V can be uniquely written as
x = \sum_{m=1}^{M} λ_m b_m + \sum_{j=1}^{D−M} ψ_j b_j^⊥
• Example: the normal vector of a plane in 3D space spans the (one-dimensional) subspace that is the orthogonal complement of the plane.
• A projection is a linear mapping π that satisfies π ∘ π = π (applying it twice gives the same result as applying it once). Recall that linear mappings can be uniquely represented with transformation matrices (after choosing a basis); thus, projection matrices P satisfy P^2 = P.
• An orthogonal projection is a linear mapping that finds the “closest” point on the subspace: if π_U(x) is the closest point in U to x, then the difference π_U(x) − x must be orthogonal to every u ∈ U, i.e. ⟨π_U(x) − x, u⟩ = 0 for all u ∈ U. If U has a basis B = (b_1, . . . , b_k), it suffices that ⟨π_U(x) − x, b_i⟩ = 0 for every basis vector.
• Aside: projections to general sets instead of subspaces can be defined as the closest point to
a set U with respect to some norm.
• Stacking the basis vectors as the columns of B = [b_1, . . . , b_k] and writing π_U(x) = Bλ, the orthogonality conditions become
⟨b_i, x − Bλ⟩ = 0 for all i
B^⊤(x − Bλ) = 0 ⇔ B^⊤ x = B^⊤ B λ
λ = (B^⊤ B)^{-1} B^⊤ x
π_U(x) = Bλ = B (B^⊤ B)^{-1} B^⊤ x
which simplifies to π_U(x) = B B^⊤ x when the columns of B are orthonormal (B^⊤ B = I).
• Projections let us reason about spaces without solving for an exact solution.
• Projection onto affine spaces: recall that an affine space is L = x0 + U . We can subtract x0 ,
project onto U , and then add back in x0 to get the orthogonal projection:
πL (x) = x0 + πU (x − x0 )
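A minimal sketch (mine, assuming NumPy) of the projection formula above, projecting a vector in R^3 onto the plane spanned by the columns of B; the columns are not orthonormal, so the general formula B(B^T B)^{-1}B^T is used:

import numpy as np

# Basis of the subspace U stacked as the columns of B (not orthonormal).
B = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])

P = B @ np.linalg.inv(B.T @ B) @ B.T      # projection matrix onto the column space of B
pi_x = P @ x
print(pi_x)                               # [1. 2. 0.]: the component along x3 is removed
print(np.allclose(B.T @ (x - pi_x), 0))   # True: the residual is orthogonal to every basis vector of U
print(np.allclose(P @ P, P))              # True: P is idempotent, P^2 = P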
Lastly, we’ll show some examples of inner products on functions, which will reflect what we need
later when we look at functional analysis.
• Inner products on non-finite spaces can be generalized to vectors with infinitely many entries (countably infinite) and also to continuous-valued functions (uncountably infinite). In the latter case the sum becomes an integral, e.g. ⟨u, v⟩ = \int_a^b u(x) v(x) dx.
• Example: for u(x) = sin(x) and v(x) = cos(x), the product u(x)v(x) is an odd function, and therefore its integral from −π to π is 0. Thus, sin and cos are orthogonal functions with respect to this inner product on [−π, π].
• The collection of functions {1, cos(x), cos(2x), cos(3x), . . . , } is orthogonal when (a, b) = (−π, π).
This is a space of even and periodic functions, and projecting onto this space is the funda-
mental idea behind Fourier series.
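A numerical check of the sin/cos orthogonality claim (my addition, assuming NumPy), approximating the inner product ⟨u, v⟩ = ∫ u(x) v(x) dx on [−π, π] with the trapezoidal rule:

import numpy as np

def inner(u, v, a=-np.pi, b=np.pi, n=100001):
    # Approximate the L^2 inner product of u and v on [a, b] with the trapezoidal rule.
    x = np.linspace(a, b, n)
    return np.trapz(u(x) * v(x), x)

print(inner(np.sin, np.cos))                    # ~0: sin and cos are orthogonal on [-pi, pi]
print(inner(np.cos, lambda x: np.cos(2 * x)))   # ~0: cos(x) and cos(2x) are also orthogonal
print(inner(np.sin, np.sin))                    # ~pi: sin is, of course, not orthogonal to itself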
7 Decompositions
We’ll now explore how various decompositions in linear algebra reduce down to change of basis.
• With norms we have lengths, and with inner products we have similarity between vectors (i.e. angles). Lastly, we can introduce a notion of volume: the determinant det(A) of a square matrix A ∈ R^{n×n} can be interpreted as the signed volume of the parallelepiped spanned by its columns.
• Intuition: if a matrix is not invertible, then some of its columns are linearly dependent and they span a subspace of dimension less than n; therefore the volume in n-dimensional space is 0. Conversely, a volume of 0 implies that the columns span a subspace of dimension less than n, and the matrix is therefore non-invertible.
• det(A) = det(A⊤ )
• det(A^{-1}) = 1/det(A) if A is invertible
• Similar matrices have the same determinant, i.e. if Ã = S^{-1}AS then det(Ã) = det(A). Therefore, the determinant does not depend on the basis.
• The trace of a matrix, tr(A) = \sum_i a_{ii}, is the sum of its diagonal elements
• Properties of trace:
• Furthermore, the only operator that satisfies these properties is the trace
• Trace is invariant under cyclic permutations:
tr(AKL) = tr(KLA)
Simpler:
tr(xy T ) = tr(y T x) = y T x
• This lets us define the trace of a linear mapping Φ as tr(Φ) = tr(A_Φ) for any transformation matrix A_Φ of Φ; since similar matrices have the same trace, this definition does not depend on the chosen basis.
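A quick numerical illustration of the cyclic property (my addition, assuming NumPy; the shapes are arbitrary as long as the products are square):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
K = rng.standard_normal((4, 5))
L = rng.standard_normal((5, 3))

# All three cyclic orderings produce square matrices with the same trace.
print(np.trace(A @ K @ L), np.trace(K @ L @ A), np.trace(L @ A @ K))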
With the determinant and the trace, we can now cover eigenvalues and eigenvectors.
• Eigenvectors are directions in which the transformation matrix A acts as a direct scaling by λ; in other words, transforming an eigenvector with A simply scales it by λ. For A ∈ R^{n×n}, the following statements are equivalent:
1. λ is an eigenvalue of A
2. There exists x such that Ax = λx, or equivalently (A − λI)x = 0 can be solved for x ̸= 0
3. rk(A − λI) < n
4. det(A − λI) = 0
• x, y are collinear if x = cy for some c. All vectors collinear to an eigenvector are also
eigenvectors, i.e.
A(cx) = cAx = cλx = λ(cx)
• Similar matrices à and A have the same eigenvalues. Therefore, a linear map Φ has the same
eigenvalues regardless of the choice of basis.
• In total, eigenvalues, determinant, and trace are all invariant under basis change.
A diagonal matrix is very “nice”. It has fast computation of determinants, powers, inverses, etc.
An eigenvalue decomposition is a way to transform matrices into a diagonal form, and is a direct
application of the change of basis. Let’s explore this in more detail.
• Diagonalizable: we would like to find matrices A that are similar to diagonal matrices D. Then applying the linear map A is equivalent to scaling each coordinate in the new basis P. We have
AP = PD
if and only if the diagonal entries of D are eigenvalues of A and the columns of P are the corresponding eigenvectors. Why? By definition,
AP = [Ap_1, . . . , Ap_N]
PD = [λ_1 p_1, . . . , λ_N p_N]
so AP = PD means Ap_i = λ_i p_i for each column.
• Therefore, D = P^{-1}AP if and only if the diagonal of D contains the eigenvalues of A and the columns of P are the corresponding eigenvectors (i.e. the diagonalization of a matrix consists of its eigenvalues and eigenvectors). A matrix A can be written as
A = PDP^{-1}
where P is square and D is diagonal with the eigenvalues of A, if and only if the eigenvectors of A form a basis of V.
• Symmetric matrices can always be diagonalized. In particular, AA⊤ and A⊤ A are always
diagonalizable.
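A short sketch (my own, assuming NumPy) of the eigendecomposition as a change of basis, reusing the symmetric matrix from the earlier basis-change example:

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, P = np.linalg.eigh(A)   # eigh: for symmetric matrices the eigenvectors are orthonormal
D = np.diag(eigvals)

print(eigvals)                                    # [1. 3.]: the eigenvalues of A
print(np.allclose(A, P @ D @ np.linalg.inv(P)))   # True: A = P D P^{-1}
print(np.allclose(np.linalg.inv(P), P.T))         # True: for symmetric A, P is orthogonal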
Lastly, we’ll summarize a few of the major decompositions and how these can be understood as
finding a change of basis that results in a nice transformation matrix.
• The Cholesky decomposition (in its LDL^⊤ form) for symmetric, positive definite matrices is A = LDL^⊤, which can be viewed as a change of basis that results in a diagonal transformation matrix. In particular, the resulting basis is an orthogonal basis with respect to the matrix inner product ⟨x, y⟩ = x^⊤ A y.
• Sinkhorn normal form for square positive matrices is A = D_1 S D_2, which can be viewed as a change of basis that results in a doubly stochastic transformation matrix S.
As an example, PCA is often described as finding a basis that retains as much information as
possible. This is formalized as finding basis directions that maximize the variance of the coordinates:
max_{∥b∥ = 1} V[b^⊤ x]
where b⊤ x is the 1D projection of x onto the coordinate b. PCA finds basis vectors b that maximize
this variance by iteratively repeating this on the remaining orthogonal subspace.
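A minimal PCA sketch (my own, on synthetic data, assuming NumPy): the unit direction maximizing the variance of b^T x is the top eigenvector of the sample covariance matrix:

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2D data stretched along the direction (1, 1).
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 1.5], [1.5, 2.0]])
X = X - X.mean(axis=0)                       # center the data

C = X.T @ X / (len(X) - 1)                   # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # ascending eigenvalues, orthonormal eigenvectors
b = eigvecs[:, -1]                           # top principal direction (unit norm)

print(b)                                     # roughly proportional to (1, 1) / sqrt(2)
print(np.var(X @ b, ddof=1), eigvals[-1])    # variance of the 1D projection equals the top eigenvalue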