Lecture 9
Least Squares, QR and SVD
T. Gambill
Department of Computer Science
University of Illinois at Urbana-Champaign
March 15, 2011
T. Gambill (UIUC) CS 357 March 15, 2011 1 / 22
Example 1: Finding a curve that best fits the data
Suppose we are given the data {(t1, y1), ..., (t21, y21)} and we want to
find a parabolic curve that best fits the data.
T. Gambill (UIUC) CS 357 March 15, 2011 2 / 22
We are looking for a curve of the form
$$f(t, x) = x_1 + x_2 t + x_3 t^2$$
so that
$$A x = \begin{bmatrix} 1 & t_1 & t_1^2 \\ 1 & t_2 & t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_{21} & t_{21}^2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \approx \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{21} \end{bmatrix} = b$$
The matrix A has the form of a Vandermonde matrix.
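As an illustration (not part of the original slides), one can form this Vandermonde system and solve it with NumPy; the data t, y below are made-up synthetic values:

```python
import numpy as np

# Synthetic data (made up for illustration): y is a noisy parabola.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 21)
y = 2.0 - 3.0 * t + 5.0 * t**2 + 0.01 * rng.standard_normal(21)

# Vandermonde-style matrix with columns [1, t, t^2]
A = np.column_stack([np.ones_like(t), t, t**2])

# Least squares solution of A x ~ y
x, *_ = np.linalg.lstsq(A, y, rcond=None)
print(x)  # close to the true coefficients [2, -3, 5]
```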
Example 2: Reducing Measurement Error
Suppose a surveyor determined the heights of three hills above some
reference point as x1 = 1237 ft., x2 = 1941 ft., and x3 = 2417 ft. To confirm
these measurements, the surveyor climbs to the top of the first hill and
measures the height of the second hill above the first to be x2 = x1 + 711, and
the third above the first to be x3 = x1 + 1177. Similarly, the surveyor climbs the
second hill and measures the height of the third above the second to be
x3 = x2 + 475. (M. Heath) These equations can be written in matrix form as
$$A x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \approx \begin{bmatrix} 1237 \\ 1941 \\ 2417 \\ 711 \\ 1177 \\ 475 \end{bmatrix} = b$$
Systems with more equations than unknowns are called overdetermined.
The system above, Ax = b, is an overdetermined linear system. What values
should the surveyor give for the heights of the hills?
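A quick numerical check (a sketch, not from the slides; NumPy's `lstsq` stands in for the methods developed below):

```python
import numpy as np

# The surveyor's six measurements of three hill heights.
A = np.array([[ 1.0, 0.0, 0.0],
              [ 0.0, 1.0, 0.0],
              [ 0.0, 0.0, 1.0],
              [-1.0, 1.0, 0.0],
              [-1.0, 0.0, 1.0],
              [ 0.0,-1.0, 1.0]])
b = np.array([1237.0, 1941.0, 2417.0, 711.0, 1177.0, 475.0])

# Least squares reconciles the direct and relative measurements.
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x)  # [1236. 1943. 2416.]
```

Note the adjusted heights differ slightly from each direct measurement, spreading the inconsistency across all six equations.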
Overdetermined Systems
If A is an m × n matrix, then in general, an m × 1 vector b may not lie in the
column space of A. Hence Ax = b may not have an exact solution.
Definition
The residual vector is
r = b − Ax.
The least squares solution is given by minimizing the square of the residual
in the 2-norm.
Normal equations
Writing r = (b − Ax) and substituting, we want to find an x that minimizes the
following function:
$$\phi(x) = \|r\|_2^2 = r^T r = (b - Ax)^T (b - Ax) = b^T b - 2x^T A^T b + x^T A^T A x$$
From calculus we know that the minimizer occurs where ∇φ(x) = 0.
The derivative is given by
$$\nabla \phi(x) = -2A^T b + 2A^T A x = 0$$
Definition
The system of normal equations is given by
AT Ax = AT b.
The normal equations have a unique solution if rank(A) = n.
Normal equations, a Geometric View
If the vector b is not in the span (the set of all linear combinations of vectors)
of the columns of A then in order to find the minimum distance from b to the
span of the columns of A we need to find x∗ such that r = (b − Ax∗ ) is
orthogonal to Ax for any x.
$$\langle r, Ax \rangle = \langle b - Ax^*, Ax \rangle = 0$$
$$\langle r, Ax \rangle = \langle b - Ax^*, Ax \rangle = 0$$
or
$$\langle A^T (b - Ax^*), x \rangle = 0$$
so that, with x = e_i (the column vectors of the identity matrix), we have
$$(A^T (b - Ax^*))^T e_i = 0 \quad \text{for } i = 1, \ldots, n$$
and thus
$$A^T b = A^T A x^*$$
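This orthogonality condition is easy to check numerically; a sketch with made-up random data:

```python
import numpy as np

# The residual of a least squares solution is orthogonal to the
# column space of A, i.e. A^T (b - A x*) = 0.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_star
print(A.T @ r)  # essentially the zero vector
```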
Solving normal equations
Since the normal equations form a symmetric positive definite system
(assuming rank(A) = n), we can solve them by computing the Cholesky
factorization
$$A^T A = LL^T$$
and solving $Ly = A^T b$ and $L^T x = y$.
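A sketch of this two-triangular-solve procedure in NumPy (the random A and b are made up; `np.linalg.solve` stands in for dedicated triangular solvers):

```python
import numpy as np

# Normal-equations solve: A^T A = L L^T, then two triangular systems.
rng = np.random.default_rng(2)
A = rng.standard_normal((10, 4))
b = rng.standard_normal(10)

L = np.linalg.cholesky(A.T @ A)   # A^T A = L L^T, L lower triangular
y = np.linalg.solve(L, A.T @ b)   # forward solve: L y = A^T b
x = np.linalg.solve(L.T, y)       # back solve:    L^T x = y

# Agrees with the reference least squares solution.
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```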
Consider
$$A = \begin{bmatrix} 1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon \end{bmatrix}$$
where $0 < \epsilon < \sqrt{\epsilon_{mach}}$. The normal equations for this system are given by
$$A^T A = \begin{bmatrix} 1 + \epsilon^2 & 1 \\ 1 & 1 + \epsilon^2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$
in floating-point arithmetic, since $1 + \epsilon^2$ rounds to 1. The computed $A^T A$ is singular.
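This failure is easy to reproduce in floating point; a sketch with eps chosen below the square root of double-precision machine epsilon (about 1.49e-8):

```python
import numpy as np

# With eps < sqrt(eps_mach), 1 + eps**2 rounds to exactly 1 in float64,
# so the computed A^T A is exactly singular.
eps = 1e-9
A = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])
AtA = A.T @ A
print(AtA)                         # [[1. 1.] [1. 1.]]
print(np.linalg.matrix_rank(AtA))  # 1
```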
Normal equations: conditioning
Forming the normal equations tends to worsen the conditioning of the problem.
Since we defined the condition number for square matrices only, we will have
to extend this definition to an m × n matrix A.
Definition
Let Am×n have rank(A) = n. Then we define the pseudo-inverse A+ of A as
$$A^+ = (A^T A)^{-1} A^T$$
and we define the condition number of A as
$$\text{cond}(A) = \|A\|_2 \, \|A^+\|_2$$
Theorem
cond(AT A) = (cond(A))2
Normal equations: Matlab conditioning example
>> A = rand(10,10);
>> cond(A)
ans =
   43.4237
>> cond(A'*A)
ans =
   1.8856e+03
How can we solve the least squares problem without squaring the condition of
the matrix?
Other approaches
QR factorization
  - For A ∈ R^{m×n}, factor A = QR where
    - Q is an m × m orthogonal matrix
    - R is an m × n upper triangular matrix (since R is m × n upper triangular,
      we can write $R = \begin{bmatrix} R' \\ 0 \end{bmatrix}$ where R' is n × n upper triangular
      and 0 is the (m − n) × n matrix of zeros)
SVD - singular value decomposition
  - For A ∈ R^{m×n}, factor A = USV^T where
    - U is an m × m orthogonal matrix
    - V is an n × n orthogonal matrix
    - S is an m × n diagonal matrix whose elements are the singular values.
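The factor shapes above can be inspected with NumPy (an illustrative sketch; the 5 × 3 matrix is made up):

```python
import numpy as np

A = np.random.default_rng(3).standard_normal((5, 3))

# Full ("complete") QR: Q is m x m, R is m x n with zeros below row n.
Q, R = np.linalg.qr(A, mode='complete')
print(Q.shape, R.shape)            # (5, 5) (5, 3)

# Full SVD: U is m x m, Vt is n x n, s holds the singular values.
U, s, Vt = np.linalg.svd(A)
print(U.shape, s.shape, Vt.shape)  # (5, 5) (3,) (3, 3)
```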
Orthogonal matrices
Definition
A matrix Q is orthogonal if
QT Q = QQT = I
Orthogonal matrices preserve the Euclidean norm of any vector v:
$$\|Qv\|_2^2 = (Qv)^T (Qv) = v^T Q^T Q v = v^T v = \|v\|_2^2.$$
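A quick numerical confirmation (a sketch; here Q is obtained from the QR factorization of a made-up random matrix):

```python
import numpy as np

# Any orthogonal Q satisfies Q^T Q = I and preserves Euclidean norms.
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
v = rng.standard_normal(6)

print(np.allclose(Q.T @ Q, np.eye(6)))                        # True
print(np.allclose(np.linalg.norm(Q @ v), np.linalg.norm(v)))  # True
```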
Using QR factorization for least squares
Now that we know orthogonal matrices preserve the Euclidean norm, we can
apply orthogonal matrices to the residual vector without changing its norm.
Note that A is m × n, Q is m × m, R' is n × n, x is n × 1, and b is m × 1.
$$\|r\|_2^2 = \|b - Ax\|_2^2 = \left\| b - Q \begin{bmatrix} R' \\ 0 \end{bmatrix} x \right\|_2^2 = \left\| Q^T b - Q^T Q \begin{bmatrix} R' \\ 0 \end{bmatrix} x \right\|_2^2 = \left\| Q^T b - \begin{bmatrix} R' \\ 0 \end{bmatrix} x \right\|_2^2$$
If $Q^T b = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}$ where $c_1$ is an n × 1 vector, then
$$\left\| Q^T b - \begin{bmatrix} R' \\ 0 \end{bmatrix} x \right\|_2^2 = \left\| \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} - \begin{bmatrix} R' x \\ 0 \end{bmatrix} \right\|_2^2 = \left\| \begin{bmatrix} c_1 - R' x \\ c_2 \end{bmatrix} \right\|_2^2 = \|c_1 - R' x\|_2^2 + \|c_2\|_2^2$$
Hence the least squares solution is given by solving $R' x = c_1$. We can solve
$R' x = c_1$ using back substitution, and the residual is $\|r\|_2 = \|c_2\|_2$.
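The whole procedure can be sketched in NumPy (made-up data; `np.linalg.solve` stands in for a dedicated back-substitution routine):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 8, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Q, R = np.linalg.qr(A, mode='complete')  # A = Q [R'; 0], Q is m x m
c = Q.T @ b
c1, c2 = c[:n], c[n:]                    # split Q^T b = [c1; c2]

x = np.linalg.solve(R[:n, :], c1)        # triangular solve: R' x = c1

# Matches the least squares solution; ||c2|| is the residual norm.
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))       # True
print(np.isclose(np.linalg.norm(b - A @ x), np.linalg.norm(c2)))  # True
```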
Gram-Schmidt orthogonalization
One way to obtain the QR factorization of a matrix Am×n (rank(A) = n) is by
Gram-Schmidt orthogonalization.
For each column of A, subtract out its components along the previously
orthogonalized columns.
For the simple case of 2 vectors {a1, a2}, first normalize a1 and obtain
$$q_1 = \frac{a_1}{\|a_1\|}.$$
Now subtract from a2 its component along q1. This is given by
$$a_2' = a_2 - \langle q_1, a_2 \rangle q_1 = a_2 - (q_1^T a_2) q_1.$$
Normalizing $a_2'$ we have
$$q_2 = \frac{a_2'}{\|a_2'\|}$$
Repeating this idea for n columns gives us Gram-Schmidt orthogonalization.
Since R is upper triangular and A = QR where Q = [q1 | q2 | ... | qm], we have
$$a_1 = r_{11} q_1$$
$$a_2 = r_{12} q_1 + r_{22} q_2$$
$$\vdots$$
$$a_n = r_{1n} q_1 + r_{2n} q_2 + \cdots + r_{nn} q_n$$
From this we see that $r_{ij} = q_i^T a_j$ for $j \geq i$.
function [Q,R] = gs_qr(A)
% GS_QR  QR factorization by (modified) Gram-Schmidt orthogonalization.
%   Assumes A has full column rank.

    m = size(A,1);
    n = size(A,2);

    for i = 1:n
        R(i,i) = norm(A(:,i),2);              % r_ii = ||a_i||
        Q(:,i) = A(:,i)./R(i,i);              % normalize to get q_i
        for j = i+1:n
            R(i,j) = Q(:,i)' * A(:,j);        % r_ij = q_i^T a_j
            A(:,j) = A(:,j) - R(i,j)*Q(:,i);  % remove the q_i component
        end
    end

end
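For comparison, a NumPy port of the same loop (a sketch, not part of the lecture; assumes A has full column rank):

```python
import numpy as np

def gs_qr(A):
    """QR factorization by modified Gram-Schmidt (reduced form)."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.linalg.norm(A[:, i])  # r_ii = ||a_i||
        Q[:, i] = A[:, i] / R[i, i]        # normalize to get q_i
        for j in range(i + 1, n):
            R[i, j] = Q[:, i] @ A[:, j]    # r_ij = q_i^T a_j
            A[:, j] -= R[i, j] * Q[:, i]   # remove the q_i component
    return Q, R

A = np.random.default_rng(6).standard_normal((5, 3))
Q, R = gs_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))  # True True
```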
Using SVD for least squares
Recall that a singular value decomposition is given by
$$A = USV^T = \begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix} \begin{bmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix}$$
where σi are the singular values.
Assume that A has rank k (and hence k nonzero singular values σi) and recall
that we want to minimize
$$\|r\|_2^2 = \|b - Ax\|_2^2.$$
Substituting the SVD for A we find that
$$\|r\|_2^2 = \|b - Ax\|_2^2 = \|b - USV^T x\|_2^2$$
where U and V are orthogonal and S is diagonal with k nonzero singular
values. Since orthogonal matrices preserve the norm,
$$\|b - USV^T x\|_2^2 = \|U^T b - U^T USV^T x\|_2^2 = \|U^T b - SV^T x\|_2^2$$
Let $c = U^T b$ and $y = V^T x$ (and hence $x = Vy$) in $\|U^T b - SV^T x\|_2^2$. We now have
$$\|r\|_2^2 = \|c - Sy\|_2^2$$
Since S has only k nonzero diagonal elements, we have
$$\|r\|_2^2 = \sum_{i=1}^{k} (c_i - \sigma_i y_i)^2 + \sum_{i=k+1}^{m} c_i^2$$
which is minimized when $y_i = \frac{c_i}{\sigma_i}$ for $1 \leq i \leq k$.
Theorem
Let A be an m × n matrix of rank r and let $A = USV^T$ be its singular value
decomposition. The least squares solution of the system Ax = b is
$$x = \sum_{i=1}^{r} (\sigma_i^{-1} c_i) v_i$$
where $c_i = u_i^T b$.
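A sketch of the theorem in NumPy (made-up full-rank data, so r = n; the reduced SVD keeps exactly those r singular values):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((7, 3))
b = rng.standard_normal(7)

# Reduced SVD: U is 7x3, s has the 3 singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
c = U.T @ b          # c_i = u_i^T b
x = Vt.T @ (c / s)   # x = sum_i (c_i / sigma_i) v_i

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```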