Understanding Conditional Expectation via Vector Projection
Cheng-Shang Chang
Department of Electrical Engineering
National Tsing Hua University
Hsinchu, Taiwan, R.O.C.
Jan. 14, 2008

Motivation and References

- Many students are confused by conditional expectation.
- In this talk, we explain how conditional expectation (taught in probability) is related to linear transformation and vector projection (taught in linear algebra).
- References:
  - S. J. Leon. Linear Algebra with Applications. New Jersey: Prentice Hall, 1998.
  - S. Ghahramani. Fundamentals of Probability. Pearson Prentice Hall, 2005.

Conditional Expectation

- Consider two discrete random variables X and Y.
- Let p(x, y) = P(X = x, Y = y) be the joint probability mass function.
- Then the marginal distribution of X is
      p_X(x) = P(X = x) = \sum_{y ∈ B} p(x, y),
  where B is the set of possible values of Y.
- Similarly,
      p_Y(y) = P(Y = y) = \sum_{x ∈ A} p(x, y),
  where A is the set of possible values of X.
- Then the conditional probability mass function of X given Y = y is
      p_{X|Y}(x|y) = P(X = x | Y = y) = \frac{p(x, y)}{p_Y(y)}.
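- To make the definitions concrete, the following Python sketch computes the marginals and the conditional pmf from a joint pmf table. The table itself is a made-up example, not one taken from these slides.

```python
# A minimal sketch; the joint pmf below is a hypothetical example.

# Joint pmf p(x, y) = P(X = x, Y = y) on A = {0, 1, 2}, B = {0, 1}.
p = {(0, 0): 0.10, (0, 1): 0.20,
     (1, 0): 0.25, (1, 1): 0.15,
     (2, 0): 0.05, (2, 1): 0.25}

A = sorted({x for (x, y) in p})   # possible values of X
B = sorted({y for (x, y) in p})   # possible values of Y

# Marginal distributions.
p_X = {x: sum(p[(x, y)] for y in B) for x in A}
p_Y = {y: sum(p[(x, y)] for x in A) for y in B}
print("p_X =", p_X)

# Conditional pmf of X given Y = y: p_{X|Y}(x|y) = p(x, y) / p_Y(y).
p_X_given_Y = {(x, y): p[(x, y)] / p_Y[y] for (x, y) in p}

for y in B:
    total = sum(p_X_given_Y[(x, y)] for x in A)
    print(f"y = {y}: p_X|Y(.|y) = "
          f"{[round(p_X_given_Y[(x, y)], 3) for x in A]}, sums to {total:.3f}")
```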

Conditional Expectation

- The conditional expectation of X given Y = y is defined as
      E[X|Y = y] = \sum_{x ∈ A} x p_{X|Y}(x|y).   (1)
- Consider a real-valued function h from ℝ to ℝ.
- From the law of the unconscious statistician, the conditional expectation of h(X) given Y = y is
      E[h(X)|Y = y] = \sum_{x ∈ A} h(x) p_{X|Y}(x|y).
- The conditional expectation of X given Y, denoted by E[X|Y], is the function of Y that is defined to be E[X|Y = y] when Y = y.
- Specifically, let δ(x) be the function with δ(0) = 1 and δ(x) = 0 for all x ≠ 0.
- Also, let δ_y(Y) = δ(Y - y) be the indicator random variable such that δ_y(Y) = 1 if the event {Y = y} occurs and δ_y(Y) = 0 otherwise.
- Then
      E[X|Y] = \sum_{y ∈ B} E[X|Y = y] δ_y(Y) = \sum_{y ∈ B} \sum_{x ∈ A} x p_{X|Y}(x|y) δ_y(Y).   (2)
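- A short Python sketch of (1) and (2), reusing the same hypothetical joint pmf as above: it tabulates E[X|Y = y] for each y, and E[X|Y] is then the function of Y built from that table.

```python
# A minimal sketch; the joint pmf is the same hypothetical example as above.
p = {(0, 0): 0.10, (0, 1): 0.20,
     (1, 0): 0.25, (1, 1): 0.15,
     (2, 0): 0.05, (2, 1): 0.25}
A = sorted({x for (x, y) in p})
B = sorted({y for (x, y) in p})
p_Y = {y: sum(p[(x, y)] for x in A) for y in B}

def cond_exp(h):
    """Return {y: E[h(X) | Y = y]} following (1) and the LOTUS form."""
    return {y: sum(h(x) * p[(x, y)] / p_Y[y] for x in A) for y in B}

# E[X | Y = y] for each y in B (take h = identity).
table = cond_exp(lambda x: x)
print(table)

# E[X | Y] is the random variable that takes the value table[y] on the event
# {Y = y}, as in equation (2); here we simply evaluate it for each y.
print([table[y] for y in B])
```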

Properties of Conditional Expectation

- The expectation of the conditional expectation of X given Y is the same as the expectation of X, i.e.,
      E[X] = E[E[X|Y]].   (3)
- Let h be a real-valued function from ℝ to ℝ. Then
      E[h(Y)X|Y] = h(Y)E[X|Y].   (4)
  As E[X|Y] is a function of Y,
      E[E[X|Y]|Y] = E[X|Y]E[1|Y] = E[X|Y].
- This then implies
      E[X - E[X|Y] | Y] = 0.   (5)
- Using (3) and (5) yields
      E[h(Y)(X - E[X|Y])] = E[E[h(Y)(X - E[X|Y]) | Y]]
                          = E[h(Y) E[X - E[X|Y] | Y]] = 0.   (6)
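- A quick numerical check of (3) and (6) by exact enumeration over a hypothetical joint pmf:

```python
# A small numerical check of (3) and (6) on a hypothetical joint pmf.
p = {(0, 0): 0.10, (0, 1): 0.20,
     (1, 0): 0.25, (1, 1): 0.15,
     (2, 0): 0.05, (2, 1): 0.25}
A = sorted({x for (x, y) in p})
B = sorted({y for (x, y) in p})
p_Y = {y: sum(p[(x, y)] for x in A) for y in B}
E_X_given_Y = {y: sum(x * p[(x, y)] / p_Y[y] for x in A) for y in B}

# (3): E[X] = E[E[X|Y]].
E_X = sum(x * pxy for (x, y), pxy in p.items())
E_E_X_given_Y = sum(E_X_given_Y[y] * p_Y[y] for y in B)
print(abs(E_X - E_E_X_given_Y) < 1e-12)

# (6): E[h(Y)(X - E[X|Y])] = 0 for an arbitrary h, e.g. h(y) = 3y + 1.
h = lambda y: 3 * y + 1
lhs = sum(h(y) * (x - E_X_given_Y[y]) * pxy for (x, y), pxy in p.items())
print(abs(lhs) < 1e-12)
```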

Properties of Conditional Expectation

- Let f be a real-valued function from ℝ to ℝ. Then
      E[(X - f(Y))^2] = E[((X - E[X|Y]) + (E[X|Y] - f(Y)))^2]
                      = E[(X - E[X|Y])^2] + 2E[(X - E[X|Y])(E[X|Y] - f(Y))] + E[(E[X|Y] - f(Y))^2]
                      = E[(X - E[X|Y])^2] + E[(E[X|Y] - f(Y))^2],
  where the cross term is 0 from (6) (take h(Y) = E[X|Y] - f(Y)).
- The conditional expectation of X given Y is the function of Y that minimizes E[(X - f(Y))^2] over the set of functions of Y, i.e.,
      E[(X - E[X|Y])^2] ≤ E[(X - f(Y))^2],   (7)
  for any function f.

Vector Space

- Let V be a set on which the operations of vector addition and scalar multiplication are defined.
- Axioms:
  - (Commutative law) u + v = v + u for all u and v in V.
  - (Associative law (i)) (u + v) + w = u + (v + w) for all u, v, w in V.
  - (Zero element) There exists an element 0 such that u + 0 = u for any u ∈ V.
  - (Inverse) For any u ∈ V, there exists an element -u ∈ V such that u + (-u) = 0.
  - (Distributive law (i)) α(u + v) = αu + αv for any scalar α and u, v ∈ V.
  - (Distributive law (ii)) (α + β)u = αu + βu for any scalars α and β and any u ∈ V.
  - (Associative law (ii)) (αβ)u = α(βu) for any scalars α and β and any u ∈ V.
  - (Identity) 1 · u = u for any u ∈ V.

Vector Space

- Closure properties:
  - If u ∈ V and α is a scalar, then αu ∈ V.
  - If u, v ∈ V, then u + v ∈ V.
- Additional properties from the axioms and the closure properties:
  - 0 · u = 0.
  - u + v = 0 implies that v = -u.
  - (-1) · u = -u.
- Example: the vector space C[a, b]
  - Let C[a, b] be the set of real-valued functions that are defined and continuous on the closed interval [a, b].
  - Vector addition: (f + g)(x) = f(x) + g(x).
  - Scalar multiplication: (αf)(x) = αf(x).

Subspace

- (Subspace) If S is a nonempty subset of a vector space V, and S satisfies the closure properties, then S is called a subspace of V.
- (Linear combination) Let v_1, v_2, ..., v_n be vectors in a vector space V. A sum of the form α_1 v_1 + α_2 v_2 + ... + α_n v_n is called a linear combination of v_1, v_2, ..., v_n.
- (Span) The set of all linear combinations of v_1, v_2, ..., v_n is called the span of v_1, v_2, ..., v_n (denoted by Span(v_1, v_2, ..., v_n)).
- (Spanning set) The set {v_1, v_2, ..., v_n} is a spanning set for V if and only if every vector in V can be written as a linear combination of v_1, v_2, ..., v_n, i.e.,
      V = Span(v_1, v_2, ..., v_n).
- (Linearly independent) The vectors v_1, v_2, ..., v_n in a vector space V are said to be linearly independent if
      c_1 v_1 + c_2 v_2 + ... + c_n v_n = 0
  implies that all of the scalars c_1, ..., c_n must be 0.

Basis and Dimension

- (Basis) The vectors v_1, v_2, ..., v_n form a basis for a vector space V if and only if
  - v_1, v_2, ..., v_n are linearly independent, and
  - v_1, v_2, ..., v_n span V.
- (Dimension) If a vector space V has a basis consisting of n vectors, we say that V has dimension n.
  - Finite-dimensional vector space: there is a finite set of vectors that spans the vector space.
  - Infinite-dimensional vector space: for example, C[a, b].
- Theorem: Suppose that V is a vector space of dimension n > 0.
  - Any set of n linearly independent vectors spans V.
  - Any n vectors that span V are linearly independent.
  - No set of fewer than n vectors can span V.

Coordinates

- Let E = {v_1, v_2, ..., v_n} be an ordered basis for a vector space V.
- Any vector v ∈ V can be written uniquely in the form
      v = c_1 v_1 + c_2 v_2 + ... + c_n v_n.
- The vector c = (c_1, c_2, ..., c_n)^T in ℝ^n is called the coordinate vector of v with respect to the ordered basis E (denoted by [v]_E).
- The c_i's are called the coordinates of v relative to E.
- A vector space with dimension n is isomorphic to ℝ^n once an ordered basis is chosen.

Random Variables on the Same Probability Space

- A probability space is a triplet (S, ℱ, P), where S is the sample space, ℱ is the set of (measurable) events, and P is the probability measure.
- A random variable X on a probability space (S, ℱ, P) is a mapping X : S → ℝ.
- The set of all random variables on the same probability space forms a vector space, with each random variable being a vector.
  - Vector addition: (X + Y)(s) = X(s) + Y(s) for every sample point s in the sample space S.
  - Scalar multiplication: (αX)(s) = αX(s) for every sample point s in the sample space S.
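- A minimal illustration of this viewpoint: on a finite sample space, every random variable is just a vector of its values at the sample points. The die example below is hypothetical.

```python
import numpy as np

# A minimal sketch: on a finite sample space S = {1, ..., 6} (the outcomes of
# a fair die), every random variable is just a vector in R^6.
S = np.arange(1, 7)                 # sample points (die faces)
P = np.full(6, 1 / 6)               # probability of each sample point

X = S.astype(float)                 # X(s) = face value
Y = (S % 2 == 0).astype(float)      # Y(s) = 1 if the face is even

# Vector addition and scalar multiplication are the usual pointwise operations.
Z = 2.0 * X + Y                     # another random variable on the same space
print(Z)                            # its values at the six sample points

# Expectation is a probability-weighted sum of the vector's entries.
print("E[X] =", np.dot(P, X), " E[Y] =", np.dot(P, Y))
```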

The Set of Functions of a Discrete Random Variable

- Suppose that X is a discrete random variable with the set of possible values A = {x_1, x_2, ..., x_n}.
- Let δ_{x_i}(X) = δ(X - x_i) be the indicator random variable with δ_{x_i}(X) = 1 if the event {X = x_i} occurs and 0 otherwise.
- Let σ(X) = Span(δ_{x_1}(X), δ_{x_2}(X), ..., δ_{x_n}(X)).
- δ_{x_1}(X), δ_{x_2}(X), ..., δ_{x_n}(X) are linearly independent. To see this, suppose s_i is a sample point such that X(s_i) = x_i. Then
      (c_1 δ_{x_1}(X) + c_2 δ_{x_2}(X) + ... + c_n δ_{x_n}(X))(s_i) = 0(s_i) = 0
  implies that c_i = 0.
- {δ_{x_1}(X), δ_{x_2}(X), ..., δ_{x_n}(X)} is a basis of σ(X).
- σ(X) is a vector space with dimension n.
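- The following sketch builds the indicator vectors δ_{x_i}(X) for a hypothetical X on a finite sample space, checks that they are linearly independent, and verifies that a function g(X) lies in their span.

```python
import numpy as np

# A minimal sketch with a hypothetical discrete X on a finite sample space:
# S = {0, ..., 5} with equal probabilities, and X takes values in A = {0, 1, 2}.
S = np.arange(6)
X = np.array([0, 1, 1, 2, 2, 2], dtype=float)    # values X(s) at each sample point
A = np.unique(X)                                  # possible values x_1, ..., x_n

# Indicator random variables delta_{x_i}(X), one vector per possible value.
deltas = np.stack([(X == x).astype(float) for x in A])   # shape (n, |S|)

# They are linearly independent: the matrix of their values has full row rank.
print("rank =", np.linalg.matrix_rank(deltas), " n =", len(A))

# Any function g(X) is in their span: g(X) = sum_i g(x_i) * delta_{x_i}(X).
g = lambda x: x ** 2 + 1
gX_direct = g(X)
gX_span = np.array([g(x) for x in A]) @ deltas
print(np.allclose(gX_direct, gX_span))
```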

The Set of Functions of a Discrete Random Variable

- σ(X) is the set of (measurable) functions of the random variable X.
  - For any real-valued function g from ℝ to ℝ, g(X) is a vector in σ(X), as
        g(X) = \sum_{i=1}^n g(x_i) δ_{x_i}(X).
  - For any vector v in σ(X), there is a real-valued function g from ℝ to ℝ such that v = g(X). To see this, suppose that
        v = \sum_{i=1}^n c_i δ_{x_i}(X).
    We simply find a function g such that g(x_i) = c_i for all i.
- The vector (g(x_1), g(x_2), ..., g(x_n))^T ∈ ℝ^n is the coordinate vector of g(X) with respect to the ordered basis {δ_{x_1}(X), δ_{x_2}(X), ..., δ_{x_n}(X)}.
- In probability theory, σ(X) is often called the σ-algebra generated by the random variable X, and a random variable Y is called σ(X)-measurable if there is a (measurable) function g such that Y = g(X).

Linear Transformation

- A mapping L from a vector space V into a vector space W is said to be a linear transformation if
      L(αv_1 + βv_2) = αL(v_1) + βL(v_2)
  for all v_1, v_2 ∈ V and for all scalars α, β.
- (Matrix representation theorem) If E = [v_1, v_2, ..., v_n] and F = [w_1, w_2, ..., w_m] are ordered bases for vector spaces V and W, respectively, then corresponding to each linear transformation L : V → W there is an m×n matrix A such that
      [L(v)]_F = A[v]_E   for each v ∈ V.
- The matrix A is called the matrix representing the linear transformation L relative to the ordered bases E and F.
- The j-th column of the matrix A is simply the coordinate vector of L(v_j) with respect to the ordered basis F, i.e.,
      a_j = [L(v_j)]_F.

Conditional Expectation As a Linear Transformation

- Suppose that X is a discrete random variable with the set of possible values A = {x_1, x_2, ..., x_n}.
- Suppose that Y is a discrete random variable with the set of possible values B = {y_1, y_2, ..., y_m}.
- Let σ(X) = Span(δ_{x_1}(X), δ_{x_2}(X), ..., δ_{x_n}(X)) be the vector space that consists of the set of functions of the random variable X.
- Let σ(Y) = Span(δ_{y_1}(Y), δ_{y_2}(Y), ..., δ_{y_m}(Y)) be the vector space that consists of the set of functions of the random variable Y.
- Consider the linear transformation L : σ(X) → σ(Y) with
      L(δ_{x_i}(X)) = \sum_{j=1}^m P(X = x_i | Y = y_j) δ_{y_j}(Y),   i = 1, 2, ..., n.
- The linear transformation L can be represented by the m×n matrix A whose (j, i) entry is
      a_{j,i} = P(X = x_i | Y = y_j).
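- A small numerical sketch of this matrix representation, using a hypothetical joint pmf: the matrix A maps the coordinates (g(x_1), ..., g(x_n)) of g(X) to the coordinates (E[g(X)|Y = y_1], ..., E[g(X)|Y = y_m]) of E[g(X)|Y].

```python
import numpy as np

# A minimal sketch with a hypothetical joint pmf on A = {0, 1, 2}, B = {0, 1}.
A_vals = [0, 1, 2]
B_vals = [0, 1]
# p[i, j] = P(X = x_i, Y = y_j)
p = np.array([[0.10, 0.20],
              [0.25, 0.15],
              [0.05, 0.25]])
p_Y = p.sum(axis=0)                      # P(Y = y_j)

# Matrix representing L relative to the indicator bases:
# row j, column i holds P(X = x_i | Y = y_j).
A_mat = (p / p_Y).T                      # shape (m, n)

# Coordinates of g(X) in the basis {delta_{x_i}(X)} are (g(x_1), ..., g(x_n)).
g = lambda x: x ** 2
g_coords = np.array([g(x) for x in A_vals])

# Applying A gives the coordinates of E[g(X) | Y] in the basis {delta_{y_j}(Y)},
# i.e., the values E[g(X) | Y = y_j].
lhs = A_mat @ g_coords
rhs = np.array([sum(g(A_vals[i]) * p[i, j] for i in range(len(A_vals))) / p_Y[j]
                for j in range(len(B_vals))])
print(np.allclose(lhs, rhs))             # True
```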

Conditional Expectation As a Linear Transformation

- Since g(X) = \sum_{i=1}^n g(x_i) δ_{x_i}(X), we then have
      L(g(X)) = L(\sum_{i=1}^n g(x_i) δ_{x_i}(X))
              = \sum_{i=1}^n g(x_i) L(δ_{x_i}(X))
              = \sum_{i=1}^n g(x_i) \sum_{j=1}^m P(X = x_i | Y = y_j) δ_{y_j}(Y)
              = \sum_{j=1}^m \sum_{i=1}^n g(x_i) P(X = x_i | Y = y_j) δ_{y_j}(Y)
              = \sum_{j=1}^m E[g(X)|Y = y_j] δ_{y_j}(Y)
              = E[g(X)|Y].
- The linear transformation L applied to the random variable g(X) is the conditional expectation of g(X) given Y.

Inner Product

- (Inner product) An inner product on a vector space V is a mapping that assigns to each pair of vectors u and v in V a real number ⟨u, v⟩ with the following three properties:
  - ⟨u, u⟩ ≥ 0, with equality if and only if u = 0.
  - ⟨u, v⟩ = ⟨v, u⟩ for all u and v in V.
  - ⟨αu + βv, w⟩ = α⟨u, w⟩ + β⟨v, w⟩ for all u, v, w in V and all scalars α and β.
- (Inner product space) A vector space with an inner product is called an inner product space.
- (Length) The length of a vector u is given by
      ||u|| = \sqrt{⟨u, u⟩}.
- (Orthogonality) Two vectors u and v are orthogonal if ⟨u, v⟩ = 0.
- (The Pythagorean law) If u and v are orthogonal vectors, then
      ||u + v||^2 = ||u||^2 + ||v||^2.

Inner Product on the Vector Space of Random Variables

- Consider the vector space of random variables on the same probability space.
- Then
      ⟨X, Y⟩ = E[XY]
  is an inner product on that vector space.
- Note that E[X^2] = 0 implies that X = 0 with probability 1.
- If we restrict ourselves to the set of random variables with mean 0, then two vectors are orthogonal if and only if they are uncorrelated.
- As a direct consequence, two independent random variables with mean 0 are orthogonal.
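- A minimal sketch of this inner product on a hypothetical four-point probability space (two independent fair coins written as ±1):

```python
import numpy as np

# A minimal sketch on a hypothetical finite probability space: S = {0, ..., 3},
# all sample points equally likely.
P = np.full(4, 0.25)

def inner(U, V):
    """Inner product <U, V> = E[UV] on this probability space."""
    return np.dot(P, U * V)

# Two independent, zero-mean random variables: X depends only on the first
# coin, Y only on the second, when each sample point is read as two coin flips.
X = np.array([-1.0, -1.0, 1.0, 1.0])   # first coin: -1, -1, +1, +1
Y = np.array([-1.0, 1.0, -1.0, 1.0])   # second coin: -1, +1, -1, +1

print(inner(X, Y))                      # 0.0: independent zero-mean RVs are orthogonal
print(inner(X, X) ** 0.5)               # length ||X|| = sqrt(E[X^2]) = 1.0
```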

Scalar Projection and Vector Projection

- (Scalar projection) If u and v are vectors in an inner product space V and v ≠ 0, then the scalar projection of u onto v is given by
      α = \frac{⟨u, v⟩}{||v||}.
- (Vector projection) The vector projection of u onto v is given by
      p = α \frac{1}{||v||} v = \frac{⟨u, v⟩}{⟨v, v⟩} v.
- Properties:
  - u - p and p are orthogonal.
  - u = p if and only if u is a scalar multiple of v.

Vector Projection on a Vector Space with an Orthogonal Basis

- An ordered basis {v_1, v_2, ..., v_n} for a vector space V is said to be an orthogonal basis for V if ⟨v_i, v_j⟩ = 0 for all i ≠ j.
- Let S be a subspace of an inner product space V. Suppose that S has an orthogonal basis {v_1, v_2, ..., v_n}. Then the vector projection of u onto S is given by
      p = \sum_{i=1}^n \frac{⟨u, v_i⟩}{⟨v_i, v_i⟩} v_i.
- Properties:
  - u - p is orthogonal to every vector in S.
  - u = p if and only if u ∈ S.
  - (Least squares) p is the element of S that is closest to u, i.e.,
        ||u - v|| > ||u - p||
    for any v ≠ p in S. This follows from the Pythagorean law:
        ||u - v||^2 = ||(u - p) + (p - v)||^2 = ||u - p||^2 + ||p - v||^2.

Conditional Expectation as a Vector Projection

- We have shown that E[g(X)|Y] is the image L(g(X)) under the linear transformation L from σ(X) to σ(Y) with
      L(δ_{x_i}(X)) = \sum_{j=1}^m P(X = x_i | Y = y_j) δ_{y_j}(Y) = E[δ_{x_i}(X)|Y],   i = 1, 2, ..., n.
- Note that δ_{y_i}(Y) δ_{y_j}(Y) = 0 for all i ≠ j.
- Thus, E[δ_{y_i}(Y) δ_{y_j}(Y)] = 0 for all i ≠ j.
- {δ_{y_1}(Y), δ_{y_2}(Y), ..., δ_{y_m}(Y)} is an orthogonal basis for σ(Y).
- The vector projection of δ_{x_i}(X) onto σ(Y) is then given by
      \sum_{j=1}^m \frac{⟨δ_{x_i}(X), δ_{y_j}(Y)⟩}{⟨δ_{y_j}(Y), δ_{y_j}(Y)⟩} δ_{y_j}(Y)
        = \sum_{j=1}^m \frac{E[δ_{x_i}(X) δ_{y_j}(Y)]}{E[δ_{y_j}(Y) δ_{y_j}(Y)]} δ_{y_j}(Y)
        = \sum_{j=1}^m \frac{E[δ_{x_i}(X) δ_{y_j}(Y)]}{E[δ_{y_j}(Y)]} δ_{y_j}(Y)
        = \sum_{j=1}^m \frac{P(X = x_i, Y = y_j)}{P(Y = y_j)} δ_{y_j}(Y)
        = \sum_{j=1}^m P(X = x_i | Y = y_j) δ_{y_j}(Y) = E[δ_{x_i}(X)|Y].

Conditional Expectation as a Vector Projection

- Recall that an inner product is linear in its first argument, i.e.,
      ⟨αu + βv, w⟩ = α⟨u, w⟩ + β⟨v, w⟩
  for all u, v, w in V and all scalars α and β.
- Since g(X) = \sum_{i=1}^n g(x_i) δ_{x_i}(X), the vector projection of g(X) onto σ(Y) is then given by
      \sum_{j=1}^m \frac{⟨g(X), δ_{y_j}(Y)⟩}{⟨δ_{y_j}(Y), δ_{y_j}(Y)⟩} δ_{y_j}(Y)
        = \sum_{j=1}^m \sum_{i=1}^n g(x_i) \frac{⟨δ_{x_i}(X), δ_{y_j}(Y)⟩}{⟨δ_{y_j}(Y), δ_{y_j}(Y)⟩} δ_{y_j}(Y)
        = \sum_{i=1}^n g(x_i) \sum_{j=1}^m \frac{⟨δ_{x_i}(X), δ_{y_j}(Y)⟩}{⟨δ_{y_j}(Y), δ_{y_j}(Y)⟩} δ_{y_j}(Y)
        = \sum_{i=1}^n g(x_i) E[δ_{x_i}(X)|Y] = E[\sum_{i=1}^n g(x_i) δ_{x_i}(X) | Y]
        = E[g(X)|Y].
- Thus, E[g(X)|Y] is the vector projection of g(X) onto σ(Y).

Conditional Expectation as a Vector Projection

- It then follows from the properties of vector projection that
  - g(X) - E[g(X)|Y] is orthogonal to every random variable in σ(Y), i.e., for any real-valued function h : ℝ → ℝ,
        ⟨g(X) - E[g(X)|Y], h(Y)⟩ = E[(g(X) - E[g(X)|Y]) h(Y)] = 0.
  - (Least squares) E[g(X)|Y] is the element of σ(Y) that is closest to g(X), i.e., for any real-valued function h : ℝ → ℝ with h(Y) ≠ E[g(X)|Y],
        E[(g(X) - h(Y))^2] = ||g(X) - h(Y)||^2 > ||g(X) - E[g(X)|Y]||^2 = E[(g(X) - E[g(X)|Y])^2].
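- The following sketch verifies these claims numerically on a hypothetical finite probability space: the orthogonal-basis projection of g(X) onto σ(Y) coincides with E[g(X)|Y] computed directly, the residual is orthogonal to members of σ(Y), and a competing h(Y) has a larger squared error.

```python
import numpy as np

# A minimal sketch on a hypothetical finite probability space.  Sample points
# s = 0..5 with probabilities P[s]; X(s) and Y(s) are given as vectors.
P = np.array([0.1, 0.2, 0.25, 0.15, 0.05, 0.25])
X = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
Y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

inner = lambda U, V: np.dot(P, U * V)        # <U, V> = E[UV]

g = lambda x: x ** 2
gX = g(X)

# Orthogonal basis of sigma(Y): the indicators delta_{y_j}(Y).
B_vals = np.unique(Y)
deltas = [(Y == y).astype(float) for y in B_vals]

# Projection of g(X) onto sigma(Y) using the orthogonal-basis formula.
proj = sum(inner(gX, d) / inner(d, d) * d for d in deltas)

# Direct computation of E[g(X) | Y] as a random variable.
cond = np.zeros_like(gX)
for y, d in zip(B_vals, deltas):
    cond += (np.dot(P, gX * d) / np.dot(P, d)) * d

print(np.allclose(proj, cond))                             # True: projection = E[g(X)|Y]
print(inner(gX - proj, Y), inner(gX - proj, np.ones(6)))   # both ~ 0 (orthogonality)

# Least squares: another h(Y) does no better.
h_of_Y = 0.5 * Y + 1.0
print(inner(gX - h_of_Y, gX - h_of_Y) >= inner(gX - proj, gX - proj))   # True
```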

Conditioning on a Set of Random Variables

- Note that Y only needs to be a random element in the previous development.
- In particular, if Y = (Y_1, Y_2, ..., Y_d) is a d-dimensional random vector, then σ(Y) = σ(Y_1, Y_2, ..., Y_d) is the set of functions of Y_1, Y_2, ..., Y_d.
- E[g(X)|Y] = E[g(X)|Y_1, Y_2, ..., Y_d] is the vector projection of g(X) onto σ(Y_1, Y_2, ..., Y_d).
- g(X) - E[g(X)|Y_1, Y_2, ..., Y_d] is orthogonal to every random variable in σ(Y_1, Y_2, ..., Y_d), i.e., for any function h : ℝ^d → ℝ,
      ⟨g(X) - E[g(X)|Y_1, Y_2, ..., Y_d], h(Y_1, Y_2, ..., Y_d)⟩
        = E[(g(X) - E[g(X)|Y_1, Y_2, ..., Y_d]) h(Y_1, Y_2, ..., Y_d)] = 0.
- (Least squares) E[g(X)|Y_1, Y_2, ..., Y_d] is the element of σ(Y_1, Y_2, ..., Y_d) that is closest to g(X), i.e., for any function h : ℝ^d → ℝ with h(Y_1, Y_2, ..., Y_d) ≠ E[g(X)|Y_1, Y_2, ..., Y_d],
      E[(g(X) - h(Y_1, Y_2, ..., Y_d))^2] > E[(g(X) - E[g(X)|Y_1, Y_2, ..., Y_d])^2].

General Definition of Conditional Expectation

- In some advanced probability books, conditional expectation is defined in a more general way.
- For a σ-algebra 𝒢, E[X|𝒢] is defined to be the random variable that satisfies
  (i) E[X|𝒢] is 𝒢-measurable, and
  (ii) \int_A X dP = \int_A E[X|𝒢] dP for all A ∈ 𝒢.
- To understand this definition, consider the σ-algebra generated by the random variable Y (denoted by σ(Y)).
- The condition that E[X|Y] is σ(Y)-measurable simply means that E[X|Y] is a (measurable) function of Y, i.e., E[X|Y] = h(Y) for some (measurable) function h.
- To understand the second condition, one may rewrite it as follows:
      E[1_A X] = E[1_A E[X|Y]],   (8)
  for every event A in σ(Y), where 1_A is the indicator random variable with 1_A = 1 when the event A occurs.

General Definition of Conditional Expectation

- Since 1_A is σ(Y)-measurable, it must be a function of Y. Thus, (8) is equivalent to
      E[g(Y)X] = E[g(Y)E[X|Y]],   (9)
  for any (measurable) function g.
- Now rewriting (9) using the inner product yields
      ⟨g(Y), X - E[X|Y]⟩ = 0,   (10)
  for any function g.
- The condition in (10) simply says that X - E[X|Y] is orthogonal to every vector in σ(Y) (X - E[X|Y] is in the orthogonal complement of σ(Y)).
- To summarize, the first condition says that the vector projection should be in the projected space, and the second condition says that the difference between the vector being projected and the vector projection should be in the orthogonal complement of the projected space.
- These two conditions are exactly the same as those used to define projections in linear algebra.
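- A small check of conditions (i) and (ii) on a hypothetical finite probability space; here every event in σ(Y) is a union of the atoms {Y = y}.

```python
import numpy as np

# A sketch of conditions (i) and (ii) on a hypothetical finite space:
# E[X|Y] is a function of Y, and E[1_A X] = E[1_A E[X|Y]] for events A in sigma(Y).
P = np.array([0.1, 0.2, 0.25, 0.15, 0.05, 0.25])
X = np.array([3.0, -1.0, 2.0, 0.0, 5.0, 1.0])
Y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# E[X|Y]: constant on each event {Y = y}, hence sigma(Y)-measurable (condition (i)).
E_X_given_Y = np.zeros_like(X)
for y in np.unique(Y):
    atom = (Y == y)
    E_X_given_Y[atom] = np.dot(P[atom], X[atom]) / P[atom].sum()

# Condition (ii): E[1_A X] = E[1_A E[X|Y]] for every event A in sigma(Y);
# here each such A is a union of the atoms {Y = 0} and {Y = 1}.
for A in [Y == 0, Y == 1, np.ones(6, dtype=bool)]:
    print(np.isclose(np.dot(P[A], X[A]), np.dot(P[A], E_X_given_Y[A])))
```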

Projections on the Set of Linear Functions of Y

- Recall that σ(Y) = Span(δ_{y_1}(Y), δ_{y_2}(Y), ..., δ_{y_m}(Y)) is the set of functions of Y.
- Let σ_L(Y) = Span(Y, 1) be the set of linear functions of Y, i.e., the set of functions of the form aY + b for some constants a and b.
- σ_L(Y) is a subspace of σ(Y).
- However, Y and 1 are in general not orthogonal, as E[Y · 1] = E[Y] may not be 0.
- (Gram-Schmidt orthogonalization) {Y - E[Y], 1} is an orthogonal basis for σ_L(Y), as
      E[(Y - E[Y]) · 1] = E[Y] - E[Y] = 0.
- The projection of a random variable X onto σ_L(Y) is then given by
      p_L = \frac{⟨X, Y - E[Y]⟩}{⟨Y - E[Y], Y - E[Y]⟩} (Y - E[Y]) + \frac{⟨X, 1⟩}{⟨1, 1⟩} · 1
          = \frac{E[XY] - E[X]E[Y]}{E[(Y - E[Y])^2]} (Y - E[Y]) + E[X].

Projections on the Set of Linear Functions of Y

- It then follows from the properties of vector projection that, with
      p_L = \frac{E[XY] - E[X]E[Y]}{E[(Y - E[Y])^2]} (Y - E[Y]) + E[X]
  as above,
  - X - p_L is orthogonal to every random variable in σ_L(Y), i.e., for any constants a and b,
        E[(X - p_L)(aY + b)] = 0.
  - (Least squares) p_L is the element of σ_L(Y) that is closest to X, i.e., for any constants a and b,
        E[(X - aY - b)^2] ≥ E[(X - p_L)^2].
- When X and Y are jointly normal, the vector projection of X onto σ(Y) is the same as that onto σ_L(Y), i.e.,
      E[X|Y] = \frac{E[XY] - E[X]E[Y]}{E[(Y - E[Y])^2]} (Y - E[Y]) + E[X] = p_L.
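- A Monte Carlo sketch of the jointly normal case, with hypothetical parameters: since E[X|Y] is itself linear here, the residual X - p_L is (nearly) orthogonal not only to linear functions of Y but to other functions of Y as well.

```python
import numpy as np

# A Monte Carlo sketch (not from the slides): for jointly normal X and Y, the
# projection onto sigma_L(Y) should match E[X|Y].  Parameters are hypothetical.
rng = np.random.default_rng(0)
n = 200_000
mu = np.array([1.0, -2.0])
cov = np.array([[2.0, 1.2],
                [1.2, 1.5]])
X, Y = rng.multivariate_normal(mu, cov, size=n).T

# Projection onto sigma_L(Y): (E[XY]-E[X]E[Y]) / E[(Y-E[Y])^2] * (Y-E[Y]) + E[X],
# computed here with the known population moments.
slope = cov[0, 1] / cov[1, 1]
p_L = slope * (Y - mu[1]) + mu[0]

# For jointly normal (X, Y), E[X|Y] has exactly this linear form, so the
# residual X - p_L is (up to Monte Carlo error) orthogonal to functions of Y.
resid = X - p_L
print(np.mean(resid * Y))          # ~ 0
print(np.mean(resid * np.sin(Y)))  # ~ 0 as well, since E[X|Y] is linear here

# Any other linear predictor aY + b has a larger mean squared error.
other = 0.5 * Y + 0.0
print(np.mean((X - other) ** 2) > np.mean(resid ** 2))   # True
```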

Projections on a Subspace of σ(Y)

- Let Y_i = φ_i(Y), i = 1, 2, ..., d, where the φ_i(·)'s are some known functions of Y.
- Let σ_φ(Y) = Span(1, Y_1, Y_2, ..., Y_d).
- σ_φ(Y) is a subspace of σ(Y).
- In general, {1, Y_1, Y_2, ..., Y_d} is not an orthogonal basis of σ_φ(Y).
- How do we find an orthogonal basis of σ_φ(Y)?
  - (Zero mean) Let Ỹ_i = Y_i - E[Y_i]. Then ⟨1, Ỹ_i⟩ = E[Ỹ_i] = 0.
  - (Matrix diagonalization) Let Ỹ = (Ỹ_1, Ỹ_2, ..., Ỹ_d)^T. Let A = E[Ỹ Ỹ^T] be the d×d covariance matrix. As A is symmetric, there is an orthogonal matrix U and a diagonal matrix D such that
        D = U^T A U.
    Let Z = (Z_1, Z_2, ..., Z_d)^T = U^T Ỹ. Then
        E[ZZ^T] = E[U^T Ỹ Ỹ^T U] = U^T E[Ỹ Ỹ^T] U = U^T A U = D.
- Thus, {1, Z_1, Z_2, ..., Z_d} is an orthogonal basis of σ_φ(Y).

Projections on a Subspace of σ(Y)

- The projection of a random variable X onto σ_φ(Y) is then given by
      p_φ = \sum_{k=1}^d \frac{⟨X, Z_k⟩}{⟨Z_k, Z_k⟩} Z_k + \frac{⟨X, 1⟩}{⟨1, 1⟩} · 1
          = \sum_{k=1}^d \frac{E[XZ_k]}{E[Z_k^2]} Z_k + E[X].
- It then follows from the properties of vector projection that
  - X - p_φ is orthogonal to every random variable in σ_φ(Y), i.e., for any constants a_k, k = 1, 2, ..., d, and b,
        E[(X - p_φ)(\sum_{k=1}^d a_k φ_k(Y) + b)] = 0.
  - (Least squares) p_φ is the element of σ_φ(Y) that is closest to X, i.e., for any constants a_k, k = 1, 2, ..., d, and b,
        E[(X - \sum_{k=1}^d a_k φ_k(Y) - b)^2] ≥ E[(X - p_φ)^2].
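- The recipe above, applied to samples (so all expectations become empirical averages): center the features φ_k(Y), diagonalize their covariance to get uncorrelated coordinates Z_k, and project. The data and feature maps below are hypothetical, and the result is checked against an ordinary least-squares fit on the same span.

```python
import numpy as np

# A sketch of the orthogonalization recipe above, applied to samples, so the
# expectations below are empirical averages.  X, Y, and the feature maps
# phi_k are all hypothetical choices for illustration.
rng = np.random.default_rng(1)
n = 100_000
Y = rng.normal(size=n)
X = np.sin(Y) + 0.3 * Y ** 2 + 0.1 * rng.normal(size=n)

# Features Y_k = phi_k(Y) spanning (together with 1) the subspace sigma_phi(Y).
features = np.column_stack([Y, Y ** 2, Y ** 3])          # shape (n, d)

# (Zero mean) center the features, then (diagonalization) rotate them so that
# the empirical covariance becomes diagonal.
centered = features - features.mean(axis=0)
A = centered.T @ centered / n                            # d x d covariance
eigvals, U = np.linalg.eigh(A)                           # A = U diag(eigvals) U^T
Z = centered @ U                                         # uncorrelated coordinates Z_k

# Projection p_phi = sum_k E[X Z_k] / E[Z_k^2] * Z_k + E[X].
coef = (Z * X[:, None]).mean(axis=0) / (Z ** 2).mean(axis=0)
p_phi = Z @ coef + X.mean()

# Sanity check: this equals ordinary least squares of X on [1, features].
design = np.column_stack([np.ones(n), features])
beta, *_ = np.linalg.lstsq(design, X, rcond=None)
print(np.allclose(p_phi, design @ beta, atol=1e-8))
```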

Regression

- We have shown how to compute the conditional expectation (and other projections onto a subspace of σ(Y)) when the joint distribution of X and Y is known.
- Suppose that the joint distribution of X and Y is unknown.
- Instead, a random sample of size n is given, i.e., {(x_k, y_k), k = 1, 2, ..., n} is known.
- How do we find h(Y) such that E[(X - h(Y))^2] is minimized?
- (Empirical distribution) Even though we do not know the true distribution, we still have the empirical distribution, i.e.,
      P(X = x_k, Y = y_k) = \frac{1}{n},   k = 1, 2, ..., n.
- One can then use the empirical distribution to compute the conditional expectation (and other projections onto a subspace of σ(Y)).

Linear Regression

- (Linear regression) Use the empirical distribution as the distribution of X and Y. Then
      p_L = \frac{E[XY] - E[X]E[Y]}{E[(Y - E[Y])^2]} (Y - E[Y]) + E[X],
  where
      E[XY] = \frac{1}{n} \sum_{k=1}^n x_k y_k,
      E[X] = \frac{1}{n} \sum_{k=1}^n x_k,   E[Y] = \frac{1}{n} \sum_{k=1}^n y_k,
      E[Y^2] = \frac{1}{n} \sum_{k=1}^n y_k^2.
- p_L minimizes the empirical squared error (risk)
      E[(X - aY - b)^2] = \frac{1}{n} \sum_{k=1}^n (x_k - a y_k - b)^2
  over all constants a and b.
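- A sketch of this empirical linear regression on hypothetical samples, checked against a standard least-squares fit:

```python
import numpy as np

# A sketch of the empirical linear regression above on hypothetical samples.
rng = np.random.default_rng(2)
n = 1_000
y = rng.uniform(-2, 2, size=n)
x = 1.5 * y - 0.7 + rng.normal(scale=0.5, size=n)   # samples (x_k, y_k)

# Empirical moments (expectations under the empirical distribution).
E_X, E_Y = x.mean(), y.mean()
E_XY = (x * y).mean()
var_Y = ((y - E_Y) ** 2).mean()

# p_L = ((E[XY] - E[X]E[Y]) / E[(Y - E[Y])^2]) (Y - E[Y]) + E[X] = a*Y + b*.
a_hat = (E_XY - E_X * E_Y) / var_Y
b_hat = E_X - a_hat * E_Y
print(a_hat, b_hat)                      # close to the true 1.5 and -0.7

# Same answer as a standard least-squares fit of x on y.
a_ls, b_ls = np.polyfit(y, x, deg=1)
print(np.allclose([a_hat, b_hat], [a_ls, b_ls]))
```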