LN2 Projection Ver2 Slides
Projection
Ping Yu
3 Projection in Rn
Projection Matrices
Hilbert Space
[Figure here]
Definition
Let $M_1$ and $M_2$ be two disjoint subspaces of $H$ so that $M_1 \cap M_2 = \{0\}$. The space
$$V = \{h \in H \mid h = h_1 + h_2,\ h_1 \in M_1,\ h_2 \in M_2\}$$
is called the direct sum of $M_1$ and $M_2$, denoted $M_1 \oplus M_2$.
Definition
Let $M$ be a subspace of $H$. The space
$$M^{\perp} \equiv \{h \in H \mid \langle h, M\rangle = 0\}$$
is called the orthogonal complement of $M$.
Definition
Suppose $H = M_1 \oplus M_2$. Let $h \in H$ so that $h = h_1 + h_2$ for unique $h_i \in M_i$, $i = 1, 2$. Then
$P$ is a projector onto $M_1$ along $M_2$ if $Ph = h_1$ for all $h$. In other words, $PM_1 = M_1$ and
$PM_2 = 0$. When $M_2 = M_1^{\perp}$, we call $P$ an orthogonal projector. [Figure here]
What is M2 ?
[Back to Lemma 9]
The first part of the theorem states the existence and uniqueness of the projector.
The second part of the theorem states something related to the first-order
conditions (FOCs) of (1) or, simply, the orthogonal conditions. [Figure here]
From the theorem, $\Pi(\cdot)$ is the orthogonal projector onto $M$, where "orthogonality"
is defined by $\langle\cdot, \cdot\rangle$ and need not be the intuitive orthogonality in the Euclidean
inner product.$^1$
- In other words, given any closed subspace $M$ of $H$, $H = M \oplus M^{\perp}$.
Also, the closest element in $M$ to $y$ is determined by $M$ itself, not by the vectors
generating $M$, since there may be some redundancy in these vectors.
$^1$ If we insist on using the Euclidean inner product, then $\Pi(\cdot)$ need not be an orthogonal projector but may be a projector along a subspace; see GLS below.
Hilbert Space and Projection Theorem
Figure: Projection
Sequential Projection
Proof.
Write $y = \Pi_2(y) + \Pi_2^{\perp}(y)$. Then
$$\Pi_1(y) = \Pi_1\!\left(\Pi_2(y) + \Pi_2^{\perp}(y)\right) = \Pi_1(\Pi_2(y)) + \Pi_1(\Pi_2^{\perp}(y)) = \Pi_1(\Pi_2(y)),$$
where the last equality holds because $\Pi_2^{\perp}(y) \in M_2^{\perp} \subseteq M_1^{\perp}$ (recall $M_1 \subseteq M_2$), so $\Pi_1(\Pi_2^{\perp}(y)) = 0$.
We first project y onto a larger space M2 , and then project the projection of y (in
the first step) onto a smaller space M1 .
The theorem shows that such a sequential procedure is equivalent to projecting y
onto M1 directly.
We will see some applications of this theorem below.
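As a quick numerical check of this result, the sketch below builds a small simulated design (the dimensions, seed, and matrices are illustrative assumptions, not from the slides) in which $\mathrm{span}(X_1) \subset \mathrm{span}(X_2)$, and verifies that projecting $y$ onto $\mathrm{span}(X_2)$ first and then onto $\mathrm{span}(X_1)$ gives the same vector as projecting $y$ onto $\mathrm{span}(X_1)$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X2 = rng.normal(size=(n, 3))   # columns span the larger space M2
X1 = X2[:, :1]                 # first column spans the smaller space M1 ⊂ M2
y = rng.normal(size=n)

def proj(A, v):
    """Orthogonal projection of v onto the column span of A."""
    return A @ np.linalg.solve(A.T @ A, A.T @ v)

direct = proj(X1, y)                    # Π1(y)
sequential = proj(X1, proj(X2, y))      # Π1(Π2(y))
print(np.allclose(direct, sequential))  # True
```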
Linear Projection
$^2$ $\mathrm{span}(x) = \{z \in L_2(P) \mid z = x'\alpha,\ \alpha \in \mathbb{R}^k\}$.
Projection in the L2 Space
continue...
$$-2E\left[x\left(y - x'\beta_0\right)\right] = 0 \;\Rightarrow\; E[xu] = 0, \qquad (3)$$
where $u = y - \Pi(y)$ is the error, and $\beta_0 = \arg\min_{\beta \in \mathbb{R}^k} E\left[(y - x'\beta)^2\right]$.
$\Pi(y)$ always exists and is unique, but $\beta_0$ need not be unique unless $x_1, \ldots, x_k$ are
linearly independent, that is, there is no nonzero vector $a \in \mathbb{R}^k$ such that $a'x = 0$
almost surely (a.s.).
Why? If $\forall\, a \neq 0$, $a'x \neq 0$, then $E[(a'x)^2] > 0$ and $a'E[xx']a > 0$, thus $E[xx'] > 0$.$^4$
So from (3),
$$\beta_0 = E[xx']^{-1}E[xy] \quad \text{(why?)} \qquad (4)$$
and $\Pi(y) = x'(E[xx'])^{-1}E[xy]$.
In the literature, β with a subscript 0 usually represents the true value of β .
$^3$ $\frac{\partial}{\partial x}(a'x) = \frac{\partial}{\partial x}(x'a) = a$.
$^4$ For a matrix $A$, $A > 0$ means it is positive definite.
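As a sanity check on formula (4), here is a minimal sketch (the data-generating process is an arbitrary assumption made for illustration) that approximates $E[xx']$ and $E[xy]$ by sample means on a large simulated sample and confirms that the resulting $\beta_0$ matches the minimizer of the sample analogue of $E[(y - x'\beta)^2]$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000                                   # large sample to approximate population moments
x = np.column_stack([np.ones(N), rng.normal(size=N)])                # x = (1, x1)'
y = 1.0 + 2.0 * x[:, 1] + 0.5 * x[:, 1] ** 2 + rng.normal(size=N)    # y is nonlinear in x1

Exx = x.T @ x / N                             # ≈ E[xx']
Exy = x.T @ y / N                             # ≈ E[xy]
beta0 = np.linalg.solve(Exx, Exy)             # formula (4): β0 = E[xx']^{-1} E[xy]

beta_ls = np.linalg.lstsq(x, y, rcond=None)[0]  # minimizer of the sample E[(y - x'β)^2]
print(np.allclose(beta0, beta_ls))              # True
```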
Projection in the L2 Space
Regression
The setup is the same as in linear projection except that $M = L_2(P, \sigma(x))$, where
$L_2(P, \sigma(x))$ is the space spanned by all functions of $x$ (not only linear functions
of $x$) that are in $L_2(P)$.
$$\Pi(y) = \arg\min_{h \in M} E\left[(y - h)^2\right] \qquad (5)$$
Note that
$$\begin{aligned}
E\left[(y - h)^2\right]
&= E\left[(y - E[y|x] + E[y|x] - h)^2\right] \\
&= E\left[(y - E[y|x])^2\right] + 2E\left[(y - E[y|x])(E[y|x] - h)\right] + E\left[(E[y|x] - h)^2\right] \\
&= E\left[(y - E[y|x])^2\right] + E\left[(E[y|x] - h)^2\right] \;\geq\; E\left[(y - E[y|x])^2\right] \equiv E[u^2],
\end{aligned}$$
where the cross term vanishes by iterated expectations (both $E[y|x]$ and $h$ are functions of $x$),
so $\Pi(y) = E[y|x]$, which is called the population regression function (PRF); the error $u = y - E[y|x]$ satisfies $E[u|x] = 0$ (why?).
We can use a variational argument to characterize the FOCs:
$$0 = \arg\min_{\varepsilon \in \mathbb{R}} E\left[(y - (\Pi(y) + \varepsilon h(x)))^2\right]$$
$$\Rightarrow\; -2\,E\left[h(x)\left(y - (\Pi(y) + \varepsilon h(x))\right)\right]\Big|_{\varepsilon = 0} = 0 \qquad (6)$$
$$\Rightarrow\; E[h(x)u] = 0, \quad \forall\, h(x) \in L_2(P, \sigma(x)).$$
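A small simulation can illustrate the orthogonality condition in (6). The model below ($y = \sin(x) + e$ with $e$ independent of $x$) is an assumed example chosen so that $E[y|x]$ is known in closed form; sample analogues of $E[h(x)u]$ are then close to zero for several choices of $h(x)$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000
x = rng.normal(size=N)
e = rng.normal(size=N)
y = np.sin(x) + e          # E[y|x] = sin(x), so the regression error is u = e

u = y - np.sin(x)          # u = y - E[y|x]
for h in (lambda t: t, lambda t: t**2, np.exp, np.cos):
    print(np.mean(h(x) * u))   # each ≈ 0, consistent with E[h(x)u] = 0
```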
Projection in the L2 Space
[Figure: conditional densities of y at education levels x = 6, 9, 12, 16 and 21]
$E[y|x]$ is a very nonlinear function, but over some range of education levels, e.g.,
$[6, 16]$, it can be approximated by a linear function quite well.
$x'(E[xx'])^{-1}E[xy]$ is the BLP of $E[y|x]$, i.e., the BLPs of $y$ and $E[y|x]$ are the
same. [Figure here]
This is a straightforward application of the law of iterated projections.
Explicitly, define
$$\beta_o = \arg\min_{\beta \in \mathbb{R}^k} E\left[\left(E[y|x] - x'\beta\right)^2\right] = \arg\min_{\beta \in \mathbb{R}^k} \int \left(E[y|x] - x'\beta\right)^2 dF(x).$$
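One way to spell out the iterated-projections argument behind this claim (a sketch filling in steps not printed above): for any $\beta$,
$$E\left[(y - x'\beta)^2\right] = E\left[(y - E[y|x])^2\right] + E\left[(E[y|x] - x'\beta)^2\right],$$
since the cross term vanishes by iterated expectations, and the first term on the right does not depend on $\beta$. Hence
$$\beta_o = \arg\min_{\beta \in \mathbb{R}^k} E\left[(E[y|x] - x'\beta)^2\right] = \arg\min_{\beta \in \mathbb{R}^k} E\left[(y - x'\beta)^2\right] = \beta_0 = E[xx']^{-1}E[xy],$$
so the BLP of $E[y|x]$ is $x'\beta_o = x'(E[xx'])^{-1}E[xy]$, the BLP of $y$.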
Linear Regression
$$y = x'\beta + u, \quad E[u|x] = 0.$$
Projection in Rn
The LSE
where $E_n[\cdot]$ is the expectation under the empirical distribution of the data, and
$$\mathrm{SSR}(\beta) \equiv \sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 = \sum_{i=1}^{n} y_i^2 - 2\beta'\sum_{i=1}^{n} x_i y_i + \beta'\left(\sum_{i=1}^{n} x_i x_i'\right)\beta = \mathbf{y}'\mathbf{y} - 2\beta'\mathbf{X}'\mathbf{y} + \beta'\mathbf{X}'\mathbf{X}\beta.$$
$^5$ $\mathbf{X}$ and $\mathbf{y}$ will be defined on the following slide.
Projection in Rn
Normal Equations
The FOCs are$^6$
$$\mathbf{0} = \frac{\partial\,\mathrm{SSR}(\beta)}{\partial\beta}\bigg|_{\beta=\hat\beta} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat\beta,$$
i.e.,
$$\mathbf{X}'\mathbf{X}\hat\beta = \mathbf{X}'\mathbf{y}.$$
So
$$\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
$^6$ $\frac{\partial}{\partial x}(a'x) = \frac{\partial}{\partial x}(x'a) = a$, and $\frac{\partial}{\partial x}(x'Ax) = (A + A')x$.
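A minimal numerical sketch of the normal equations on simulated data (the variable names and design are mine, not the slides'): solving $\mathbf{X}'\mathbf{X}\hat\beta = \mathbf{X}'\mathbf{y}$ reproduces what a generic least-squares routine returns.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations: X'X β̂ = X'y
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # generic least-squares solver
print(np.allclose(beta_hat, beta_lstsq))           # True
```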
Projection in Rn
Notations
Matrices are represented using uppercase bold. In matrix notation the sample
(data, or dataset) is $(\mathbf{y}, \mathbf{X})$, where $\mathbf{y}$ is an $n \times 1$ vector with $i$th entry $y_i$ and $\mathbf{X}$ is an
$n \times k$ matrix with $i$th row $x_i'$, i.e.,
$$\underset{(n\times 1)}{\mathbf{y}} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \quad \text{and} \quad \underset{(n\times k)}{\mathbf{X}} = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix}.$$
The first column of $\mathbf{X}$ is assumed to be ones unless otherwise specified, i.e., the
first column of $\mathbf{X}$ is $\mathbf{1} = (1, \ldots, 1)'$.
The bold zero, 0, denotes a vector or matrix of zeros.
Re-express $\mathbf{X}$ as
$$\mathbf{X} = \left(\mathbf{X}_1, \ldots, \mathbf{X}_k\right),$$
where, different from $x_i$, $\mathbf{X}_j$, $j = 1, \ldots, k$, represents the $j$th column of $\mathbf{X}$, i.e.,
all the observations on the $j$th variable.
The linear regression model, upon stacking all $n$ observations, is then
$$\mathbf{y} = \mathbf{X}\beta + \mathbf{u},$$
where $\mathbf{u}$ is an $n \times 1$ column vector with $i$th entry $u_i$.
Projection in Rn
LSE as a Projection
The above derivation of $\hat\beta$ expresses the LSE using rows of the data matrices $\mathbf{y}$
and $\mathbf{X}$. The following expresses the LSE using columns of $\mathbf{y}$ and $\mathbf{X}$.
$\mathbf{y} \in \mathbb{R}^n$; $\mathbf{X}_1, \ldots, \mathbf{X}_k \in \mathbb{R}^n$ are linearly independent;
$M = \mathrm{span}(\mathbf{X}_1, \ldots, \mathbf{X}_k) \equiv \mathrm{span}(\mathbf{X})$,$^7$ $H = \mathbb{R}^n$ with the Euclidean inner product.$^8$
The projection is
$$\Pi(\mathbf{y}) = \mathbf{X}\arg\min_{\beta \in \mathbb{R}^k} \|\mathbf{y} - \mathbf{X}\beta\|^2 = \mathbf{X}\arg\min_{\beta \in \mathbb{R}^k} \sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2,$$
where $\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2$ is exactly the objective function of OLS.
$^7$ $\mathrm{span}(\mathbf{X}) = \{\mathbf{z} \in \mathbb{R}^n \mid \mathbf{z} = \mathbf{X}\alpha,\ \alpha \in \mathbb{R}^k\}$ is called the column space of $\mathbf{X}$.
$^8$ Recall that for $x = (x_1, \ldots, x_n)'$ and $z = (z_1, \ldots, z_n)'$, the Euclidean inner product of $x$ and $z$ is $\langle x, z\rangle = \sum_{i=1}^{n} x_i z_i$, so $\|x\|^2 = \langle x, x\rangle = \sum_{i=1}^{n} x_i^2$.
Projection in Rn
Solving $\hat\beta$
As $\Pi(\mathbf{y}) = \mathbf{X}\hat\beta$, we can solve out $\hat\beta$ by premultiplying both sides by $\mathbf{X}'$. Since the orthogonal conditions give
$$\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0},$$
we have $\mathbf{X}'\mathbf{y} = \mathbf{X}'\Pi(\mathbf{y}) + \mathbf{X}'\hat{\mathbf{u}} = \mathbf{X}'\mathbf{X}\hat\beta$, so $\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.
Finally,
$$\Pi(\mathbf{y}) = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{P}_X\,\mathbf{y},$$
where $\mathbf{P}_X$ is called the projection matrix.
Multicollinearity
In the calculation of the above two slides, we first project $\mathbf{y}$ on $\mathrm{span}(\mathbf{X})$ to get $\Pi(\mathbf{y})$
and then find $\hat\beta$ such that $\Pi(\mathbf{y}) = \mathbf{X}\hat\beta$.
The two steps involve very different operations: optimization versus solving linear
equations.
Furthermore, although $\Pi(\mathbf{y})$ is unique, $\hat\beta$ may not be. When $\mathrm{rank}(\mathbf{X}) < k$, or $\mathbf{X}$ is
rank deficient, there is more than one (in fact, infinitely many) $\hat\beta$ such that $\mathbf{X}\hat\beta = \Pi(\mathbf{y})$.
This is called multicollinearity and will be discussed in more detail in the next
chapter.
In the following discussion, we always assume $\mathrm{rank}(\mathbf{X}) = k$, i.e., $\mathbf{X}$ has full column
rank; otherwise, some columns of $\mathbf{X}$ can be deleted to make it so.
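The toy design below (an assumed example) makes the point concrete: with a rank-deficient $\mathbf{X}$, two different coefficient vectors deliver exactly the same fitted vector $\Pi(\mathbf{y})$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, 2 * x1])   # third column = 2 × second, so rank(X) = 2 < 3
y = 1 + x1 + rng.normal(size=n)

b_pinv = np.linalg.pinv(X) @ y                  # one least-squares solution (minimum norm)
b_other = b_pinv + np.array([0.0, 2.0, -1.0])   # (0, 2, -1)' is in the null space of X

print(np.allclose(X @ b_pinv, X @ b_other))     # True: same Π(y), different β̂
```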
Everything is the same as in the last example except that the inner product is now $\langle\mathbf{x}, \mathbf{z}\rangle_W = \mathbf{x}'\mathbf{W}\mathbf{z}$, where the weight
matrix $\mathbf{W} > 0$ (this is the setup of GLS).
The projection is
$$\Pi(\mathbf{y}) = \mathbf{X}\arg\min_{\beta \in \mathbb{R}^k} \|\mathbf{y} - \mathbf{X}\beta\|_W^2. \qquad (8)$$
The FOCs are
$$\langle\mathbf{X}, \tilde{\mathbf{u}}\rangle_W = \mathbf{0} \quad \text{(orthogonal conditions)}$$
where $\tilde{\mathbf{u}} = \mathbf{y} - \mathbf{X}\tilde\beta$, that is, $\mathbf{X}'\mathbf{W}(\mathbf{y} - \mathbf{X}\tilde\beta) = \mathbf{0}$ and $\tilde\beta = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}$.
Thus
$$\Pi(\mathbf{y}) = \mathbf{X}(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y} = \mathbf{P}_{X\perp WX}\,\mathbf{y},$$
where the notation $\mathbf{P}_{X\perp WX}$ will be explained later.
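A brief numerical check of (8) under an assumed positive definite weight matrix: the fitted value $\mathbf{X}(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}$ minimizes $\|\mathbf{y} - \mathbf{X}\beta\|_W^2$ over $\beta$, and the associated projector is idempotent but not symmetric.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 40, 2
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
A = rng.normal(size=(n, n))
W = A @ A.T + n * np.eye(n)          # an (assumed) positive definite weight matrix

beta_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted FOC solution
P = X @ np.linalg.solve(X.T @ W @ X, X.T @ W)        # P_{X⊥WX}

def wnorm2(v):                        # ||v||_W^2 = v'Wv
    return v @ W @ v

# No nearby β does better in the weighted norm, and P is idempotent but not symmetric
print(all(wnorm2(y - X @ (beta_w + d)) >= wnorm2(y - X @ beta_w)
          for d in 0.1 * rng.normal(size=(20, k))))   # True
print(np.allclose(P @ P, P), np.allclose(P, P.T))     # True False
```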
Projection Matrices
continue...
$\mathbf{P}_X$ is symmetric: $\mathbf{P}_X' = \left(\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right)' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{P}_X$.
and, since $\mathrm{tr}(\mathbf{P}_X) = \mathrm{tr}\!\left(\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right) = \mathrm{tr}\!\left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\right) = \mathrm{tr}(\mathbf{I}_k) = k$,
$$\mathrm{tr}(\mathbf{M}_X) = \mathrm{tr}(\mathbf{I}_n - \mathbf{P}_X) = \mathrm{tr}(\mathbf{I}_n) - \mathrm{tr}(\mathbf{P}_X) = n - k < n.$$
For a general "nonorthogonal" projector $\mathbf{P}$, it is still unique and idempotent, but it
need not be symmetric (let alone positive semidefinite).
For example, $\mathbf{P}_{X\perp WX}$ in the GLS estimation is not symmetric.
$^9$ A square matrix $A$ is idempotent if $A^2 = AA = A$. An idempotent matrix need not be symmetric.
$^{10}$ The trace of a square matrix is the sum of its diagonal elements. $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.
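These properties are easy to confirm numerically on a simulated $\mathbf{X}$ (dimensions assumed): $\mathbf{P}_X$ is symmetric and idempotent, and $\mathrm{tr}(\mathbf{M}_X) = n - k$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 25, 4
X = rng.normal(size=(n, k))
PX = X @ np.linalg.solve(X.T @ X, X.T)   # P_X = X (X'X)^{-1} X'
MX = np.eye(n) - PX                      # M_X = I_n - P_X

print(np.allclose(PX, PX.T), np.allclose(PX @ PX, PX))   # True True (symmetric, idempotent)
print(np.isclose(np.trace(MX), n - k))                   # True: tr(M_X) = n - k
```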
Partitioned Fit and Residual Regression
Partitioned Fit
Partition $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)$, so that the fitted regression is $\mathbf{y} = \mathbf{X}_1\hat\beta_1 + \mathbf{X}_2\hat\beta_2 + \hat{\mathbf{u}}$, with $\mathrm{rank}(\mathbf{X}) = k$.
We will show that $\hat\beta_1$ is the "net" effect of $\mathbf{X}_1$ on $\mathbf{y}$ when the effect of $\mathbf{X}_2$ is removed
from the system. This result is called the Frisch-Waugh-Lovell (FWL) theorem due
to Frisch and Waugh (1933) and Lovell (1963). [Figure here]
The FWL theorem is an excellent implication of the projection property of least
squares.
To simplify notation, let $\mathbf{P}_{X_j} \equiv \mathbf{P}_j$, $\mathbf{M}_{X_j} \equiv \mathbf{M}_j$, and $\Pi_j(\mathbf{y}) = \mathbf{X}_j\hat\beta_j$, $j = 1, 2$.
History of FWL
Theorem
$\hat\beta_1$ can be obtained by regressing the residuals from a regression of $\mathbf{y}$ on $\mathbf{X}_2$ alone on the set of residuals obtained when each column of $\mathbf{X}_1$ is regressed on
$\mathbf{X}_2$. In mathematical notation,
$$\hat\beta_1 = \left(\mathbf{X}_{1\perp2}'\mathbf{X}_{1\perp2}\right)^{-1}\mathbf{X}_{1\perp2}'\mathbf{y}_{\perp2} = \left(\mathbf{X}_1'\mathbf{M}_2\mathbf{X}_1\right)^{-1}\mathbf{X}_1'\mathbf{M}_2\mathbf{y},$$
where $\mathbf{X}_{1\perp2} \equiv \mathbf{M}_2\mathbf{X}_1$ and $\mathbf{y}_{\perp2} \equiv \mathbf{M}_2\mathbf{y}$ collect these residuals.
Corollary
$$\Pi_1(\mathbf{y}) = \mathbf{X}_1\hat\beta_1 = \mathbf{X}_1\left(\mathbf{X}_{1\perp2}'\mathbf{X}_1\right)^{-1}\mathbf{X}_{1\perp2}'\mathbf{y} \equiv \mathbf{P}_{12}\,\mathbf{y} = \mathbf{P}_{12}(\Pi(\mathbf{y})).^{\,a}$$
$^a$ Will be explained below.
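A numerical illustration of the theorem on simulated data (the design is assumed for illustration only): the subvector $\hat\beta_1$ from the full regression coincides with the coefficient from regressing $\mathbf{M}_2\mathbf{y}$ on $\mathbf{M}_2\mathbf{X}_1$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k1, k2 = 200, 2, 3
X1 = rng.normal(size=(n, k1))
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, k2 - 1))])
X = np.column_stack([X1, X2])
y = X @ rng.normal(size=k1 + k2) + rng.normal(size=n)

# Full regression: the first k1 coefficients form β̂1
beta1_full = np.linalg.lstsq(X, y, rcond=None)[0][:k1]

# Residual regression: partial X2 out of both y and X1, then regress residuals on residuals
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
beta1_fwl = np.linalg.lstsq(M2 @ X1, M2 @ y, rcond=None)[0]

print(np.allclose(beta1_full, beta1_fwl))  # True
```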
P12
$$\mathbf{P}_{12} = \underbrace{\mathbf{X}_1}_{\text{trailing term}}\left(\mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)\mathbf{X}_1\right)^{-1}\underbrace{\mathbf{X}_1'(\mathbf{I} - \mathbf{P}_2)}_{\text{leading term}}.$$
$\mathbf{I} - \mathbf{P}_2$ in the leading term annihilates $\mathrm{span}(\mathbf{X}_2)$ so that $\mathbf{P}_{12}(\Pi_2(\mathbf{y})) = \mathbf{0}$. The
leading term sends $\Pi(\mathbf{y})$ toward $\mathrm{span}^{\perp}(\mathbf{X}_2)$.
But the trailing $\mathbf{X}_1$ ensures that the final result will lie in $\mathrm{span}(\mathbf{X}_1)$.
The rest of the expression for $\mathbf{P}_{12}$ ensures that $\mathbf{X}_1$ is preserved under the
transformation: $\mathbf{P}_{12}\mathbf{X}_1 = \mathbf{X}_1$.
Why is $\mathbf{P}_{12}\,\mathbf{y} = \mathbf{P}_{12}(\Pi(\mathbf{y}))$? We can treat the projector $\mathbf{P}_{12}$ as a sequential projector:
first project $\mathbf{y}$ onto $\mathrm{span}(\mathbf{X})$ to get $\Pi(\mathbf{y})$, and then project $\Pi(\mathbf{y})$ onto $\mathrm{span}(\mathbf{X}_1)$ along
$\mathrm{span}(\mathbf{X}_2)$ to get $\Pi_1(\mathbf{y})$. [Figure here]
$\hat\beta_1$ is calculated from $\Pi_1(\mathbf{y})$ by
$$\hat\beta_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\Pi_1(\mathbf{y}).$$
The estimator from the residual regression in the theorem is
$$\begin{aligned}
\tilde\beta_1 &= \left(\hat{\mathbf{U}}_{x_1}'\hat{\mathbf{U}}_{x_1}\right)^{-1}\hat{\mathbf{U}}_{x_1}'\hat{\mathbf{u}}_y = \left(\mathbf{X}_1'\mathbf{M}_2\mathbf{X}_1\right)^{-1}\mathbf{X}_1'\mathbf{M}_2\mathbf{y} \\
&= \left[\mathbf{X}_1'\mathbf{X}_1 - \mathbf{X}_1'\mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{X}_1\right]^{-1}\left[\mathbf{X}_1'\mathbf{y} - \mathbf{X}_1'\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{y}\right] \\
&\equiv \mathbf{B}^{-1}\left[\mathbf{X}_1'\mathbf{y} - \mathbf{X}_1'\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{y}\right],
\end{aligned}$$
where $\hat{\mathbf{U}}_{x_1} = \mathbf{M}_2\mathbf{X}_1 = \mathbf{X}_{1\perp2}$ and $\hat{\mathbf{u}}_y = \mathbf{M}_2\mathbf{y} = \mathbf{y}_{\perp2}$.
continue...
and
$$\hat\beta_1 = \mathbf{B}^{-1}\mathbf{X}_1'\mathbf{y} - \mathbf{B}^{-1}\mathbf{X}_1'\mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{y} = \mathbf{B}^{-1}\left[\mathbf{X}_1'\mathbf{y} - \mathbf{X}_1'\mathbf{X}_2\left(\mathbf{X}_2'\mathbf{X}_2\right)^{-1}\mathbf{X}_2'\mathbf{y}\right] = \tilde\beta_1,$$
where, for a partitioned matrix $\mathbf{A}$, $\mathbf{A}_{11\cdot2} = \mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}$ and $\mathbf{A}_{22\cdot1}$ is similarly defined (so $\mathbf{B} = (\mathbf{X}'\mathbf{X})_{11\cdot2}$).
To show $\hat\beta_1 = \left(\mathbf{X}_{1\perp2}'\mathbf{X}_{1\perp2}\right)^{-1}\mathbf{X}_{1\perp2}'\mathbf{y}_{\perp2}$, we need only show that
$$\mathbf{X}_1'\mathbf{M}_2\mathbf{y} = \mathbf{X}_1'\mathbf{M}_2\mathbf{X}_1\hat\beta_1.$$
Multiplying $\mathbf{y} = \mathbf{X}_1\hat\beta_1 + \mathbf{X}_2\hat\beta_2 + \hat{\mathbf{u}}$ by $\mathbf{X}_1'\mathbf{M}_2$ on both sides, we have
$$\mathbf{X}_1'\mathbf{M}_2\mathbf{y} = \mathbf{X}_1'\mathbf{M}_2\mathbf{X}_1\hat\beta_1 + \mathbf{X}_1'\mathbf{M}_2\mathbf{X}_2\hat\beta_2 + \mathbf{X}_1'\mathbf{M}_2\hat{\mathbf{u}} = \mathbf{X}_1'\mathbf{M}_2\mathbf{X}_1\hat\beta_1,$$
since $\mathbf{M}_2\mathbf{X}_2 = \mathbf{0}$ and $\mathbf{M}_2\hat{\mathbf{u}} = \hat{\mathbf{u}}$ with $\mathbf{X}_1'\hat{\mathbf{u}} = \mathbf{0}$.
Lemma
Define $\mathbf{P}_{X\perp Z}$ as the projector onto $\mathrm{span}(\mathbf{X})$ along $\mathrm{span}^{\perp}(\mathbf{Z})$, where $\mathbf{X}$ and $\mathbf{Z}$ are $n \times k$
matrices and $\mathbf{Z}'\mathbf{X}$ is nonsingular. Then $\mathbf{P}_{X\perp Z}$ is idempotent, and
$$\mathbf{P}_{X\perp Z} = \mathbf{X}(\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'.$$
For example, with $\mathbf{X} = (1, 0)'$ and $\mathbf{Z} = (1, 1)'$,
$$\mathbf{P}_X = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad \mathbf{P}_{X\perp Z} = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}.$$
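Plugging $\mathbf{X} = (1, 0)'$ and $\mathbf{Z} = (1, 1)'$ (the vectors behind the matrices above) into the lemma's formula confirms that $\mathbf{P}_{X\perp Z}$ is idempotent, preserves $\mathrm{span}(\mathbf{X})$, and annihilates $\mathrm{span}^{\perp}(\mathbf{Z})$:

```python
import numpy as np

X = np.array([[1.0], [0.0]])             # span(X) = the first coordinate axis
Z = np.array([[1.0], [1.0]])
P = X @ np.linalg.inv(Z.T @ X) @ Z.T     # P_{X⊥Z} = X (Z'X)^{-1} Z' = [[1, 1], [0, 0]]

print(P)                                  # [[1. 1.] [0. 0.]]
print(np.allclose(P @ P, P))              # True: idempotent
print(P @ X)                              # equals X: span(X) is preserved
print(P @ np.array([1.0, -1.0]))          # equals 0: vectors orthogonal to Z are annihilated
```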