
Full Solution Manual for

“Probabilistic Machine Learning: An Introduction”


Kevin Murphy
November 21, 2021

1 Solutions

Part I
Foundations

2 Solutions
2.1 Conditional independence
PRIVATE

1. Bayes’ rule gives

      P(H|E_1, E_2) = P(E_1, E_2|H) P(H) / P(E_1, E_2)    (1)

   Thus the information in (ii) is sufficient. In fact, we don’t need P(E_1, E_2) because it is equal to the
   normalization constant (to enforce the sum-to-one constraint). (i) and (iii) are insufficient.

2. Now the equation simplifies to

      P(H|E_1, E_2) = P(E_1|H) P(E_2|H) P(H) / P(E_1, E_2)    (2)

   so (i) and (ii) are obviously sufficient. (iii) is also sufficient, because we can compute P(E_1, E_2) using
   normalization.

2.2 Pairwise independence does not imply mutual independence


We provide two counter examples.
Let X_1 and X_2 be independent binary random variables, and X_3 = X_1 ⊕ X_2, where ⊕ is the XOR
operator. We have p(X_3|X_1, X_2) ≠ p(X_3), since X_3 can be deterministically calculated from X_1 and X_2. So
the variables {X_1, X_2, X_3} are not mutually independent. However, we also have p(X_3|X_1) = p(X_3), since
without X_2, no information is provided about X_3. So X_1 ⊥ X_3 and similarly X_2 ⊥ X_3. Hence {X_1, X_2, X_3}
are pairwise independent.
Here is a different example. Let there be four balls in a bag, numbered 1 to 4. Suppose we draw one at
random. Define 3 events as follows:
• X1 : ball 1 or 2 is drawn.
• X2 : ball 2 or 3 is drawn.
• X3 : ball 1 or 3 is drawn.

We have p(X_1) = p(X_2) = p(X_3) = 0.5. Also, p(X_1, X_2) = p(X_2, X_3) = p(X_1, X_3) = 0.25. Hence
p(X_1, X_2) = p(X_1)p(X_2), and similarly for the other pairs. Hence the events are pairwise independent.
However, p(X_1, X_2, X_3) = 0 ≠ 1/8 = p(X_1)p(X_2)p(X_3).
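As a quick sanity check (not part of the original solution), here is a small Matlab/Octave sketch that enumerates the four equally likely outcomes of the XOR construction:

% Enumerate the four equally likely (X1, X2) outcomes and check pairwise
% independence of X3 = xor(X1, X2).
states = dec2bin(0:3) - '0';            % rows are (X1, X2), each with probability 1/4
X1 = states(:,1); X2 = states(:,2);
X3 = double(xor(X1, X2));
[mean(X3), mean(X3(X1 == 1))]           % P(X3=1) and P(X3=1 | X1=1): both 0.5
% Yet X3 is a deterministic function of (X1, X2), so {X1, X2, X3} are not
% mutually independent.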

2.3 Conditional independence iff joint factorizes


PRIVATE
Independency ⇒ Factorization. Let g(x, z) = p(x|z) and h(y, z) = p(y|z). If X ⊥ Y |Z then

   p(x, y|z) = p(x|z)p(y|z) = g(x, z)h(y, z)    (3)

Factorization ⇒ Independency. If p(x, y|z) = g(x, z)h(y, z) then

   1 = Σ_{x,y} p(x, y|z) = Σ_{x,y} g(x, z)h(y, z) = [Σ_x g(x, z)] [Σ_y h(y, z)]    (4)

   p(x|z) = Σ_y p(x, y|z) = Σ_y g(x, z)h(y, z) = g(x, z) Σ_y h(y, z)    (5)

   p(y|z) = Σ_x p(x, y|z) = Σ_x g(x, z)h(y, z) = h(y, z) Σ_x g(x, z)    (6)

Multiplying (5) and (6) and using (4),

   p(x|z)p(y|z) = g(x, z)h(y, z) [Σ_x g(x, z)] [Σ_y h(y, z)]    (7)
               = g(x, z)h(y, z) = p(x, y|z)    (8)

2.4 Convolution of two Gaussians is a Gaussian


We follow the derivation of [Jay03, p221]. Define

   φ(x − µ|σ) ≜ N (x|µ, σ^2) = (1/σ) φ((x − µ)/σ)    (9)

where φ(z) is the pdf of the standard normal. We have

   p(y) = N (x_1|µ_1, σ_1^2) ⊗ N (x_2|µ_2, σ_2^2)    (10)
        = ∫ dx_1 φ(x_1 − µ_1|σ_1) φ(y − x_1 − µ_2|σ_2)    (11)

Now the product inside the integral can be written as follows

   φ(x_1 − µ_1|σ_1) φ(y − x_1 − µ_2|σ_2) = (1/(2π σ_1 σ_2)) exp{ −(1/2) [ ((x_1 − µ_1)/σ_1)^2 + ((y − x_1 − µ_2)/σ_2)^2 ] }    (12)

We can bring out the dependency on x_1 by rearranging the quadratic form inside the exponent as follows

   ((x_1 − µ_1)/σ_1)^2 + ((y − x_1 − µ_2)/σ_2)^2 = (w_1 + w_2)(x_1 − x̂)^2 + (w_1 w_2/(w_1 + w_2)) (y − (µ_1 + µ_2))^2    (13)

where we have defined the precision or weighting terms w_1 = 1/σ_1^2, w_2 = 1/σ_2^2, and the term

   x̂ = (w_1 µ_1 + w_2 (y − µ_2)) / (w_1 + w_2)    (14)

Note that

   w_1 w_2/(w_1 + w_2) = (1/(σ_1^2 σ_2^2)) (σ_1^2 σ_2^2/(σ_1^2 + σ_2^2)) = 1/(σ_1^2 + σ_2^2)    (15)

Hence

   p(y) = (1/(2π σ_1 σ_2)) ∫ exp[ −(1/2)(w_1 + w_2)(x_1 − x̂)^2 − (1/2)(w_1 w_2/(w_1 + w_2))(y − (µ_1 + µ_2))^2 ] dx_1    (16)

The integral over x_1 becomes one over the normalization constant for the Gaussian:

   ∫ exp[ −(1/2)(w_1 + w_2)(x_1 − x̂)^2 ] dx_1 = (2π)^{1/2} (σ_1^2 σ_2^2/(σ_1^2 + σ_2^2))^{1/2}    (17)

Hence

   p(y) = (2π)^{−1} (σ_1^2 σ_2^2)^{−1/2} (2π)^{1/2} (σ_1^2 σ_2^2/(σ_1^2 + σ_2^2))^{1/2} exp[ −(y − µ_1 − µ_2)^2 / (2(σ_1^2 + σ_2^2)) ]    (18)
        = (2π)^{−1/2} (σ_1^2 + σ_2^2)^{−1/2} exp[ −(y − µ_1 − µ_2)^2 / (2(σ_1^2 + σ_2^2)) ]    (19)
        = N (y|µ_1 + µ_2, σ_1^2 + σ_2^2)    (20)
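As a numerical sanity check (not part of the original solution), the following Matlab/Octave sketch convolves two Gaussian pdfs on a grid and compares the result to N (y|µ_1 + µ_2, σ_1^2 + σ_2^2); the parameter values are arbitrary.

% Discretized convolution of two Gaussian pdfs vs. the closed form (20).
mu1 = 1; sig1 = 0.7; mu2 = -2; sig2 = 1.5;          % arbitrary example values
x = linspace(-15, 15, 3001); dx = x(2) - x(1);
gauss = @(t, m, s) exp(-0.5*((t - m)/s).^2) / (s*sqrt(2*pi));
p1 = gauss(x, mu1, sig1);
p2 = gauss(x, mu2, sig2);
pconv = conv(p1, p2) * dx;                          % numerical convolution
y = 2*x(1) + dx*(0:numel(pconv)-1);                 % support of the convolution
ptheory = gauss(y, mu1 + mu2, sqrt(sig1^2 + sig2^2));
max(abs(pconv - ptheory))                           % should be very small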

2.5 Expected value of the minimum of two rv’s


PRIVATE
Let Z = min(X, Y ), where X and Y are independent and uniform on [0, 1]. We have P(Z > z) = P(X > z, Y > z) = P(X > z)P(Y > z) = (1 − z)^2 = 1 + z^2 − 2z.
Hence the cdf of the minimum is P(Z ≤ z) = 2z − z^2 and the pdf is p(z) = 2 − 2z. Hence the expected value is

   E[Z] = ∫_0^1 z(2 − 2z) dz = 1/3    (21)
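A one-line Monte Carlo check of this result (not part of the original solution), for X, Y ∼ U(0, 1) independent:

% Monte Carlo estimate of E[min(X,Y)].
N = 1e6;
mean(min(rand(N,1), rand(N,1)))   % approximately 1/3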

2.6 Variance of a sum


We have

   V[X + Y ] = E[(X + Y )^2] − (E[X] + E[Y ])^2    (22)
             = E[X^2 + Y^2 + 2XY ] − (E[X]^2 + E[Y ]^2 + 2E[X]E[Y ])    (23)
             = E[X^2] − E[X]^2 + E[Y^2] − E[Y ]^2 + 2E[XY ] − 2E[X]E[Y ]    (24)
             = V[X] + V[Y ] + 2Cov[X, Y ]    (25)

If X and Y are independent, then Cov [X, Y ] = 0, so V [X + Y ] = V [X] + V [Y ].

2.7 Deriving the inverse gamma density


PRIVATE
We have

   p_y(y) = p_x(x) |dx/dy|    (26)

where x = 1/y, so

   dx/dy = −1/y^2 = −x^2    (27)

So

   p_y(y) = (b^a/Γ(a)) x^{a−1} e^{−xb} x^2    (28)
          = (b^a/Γ(a)) x^{a+1} e^{−xb}    (29)
          = (b^a/Γ(a)) y^{−(a+1)} e^{−b/y} = IG(y|a, b)    (30)

2.8 Mean, mode, variance for the beta distribution
For the mode we can use simple calculus, as follows. Let f (x) = Beta(x|a, b). We have

   0 = df/dx    (31)
     = (a − 1)x^{a−2}(1 − x)^{b−1} − (b − 1)x^{a−1}(1 − x)^{b−2}    (32)

Dividing through by x^{a−2}(1 − x)^{b−2} gives

   0 = (a − 1)(1 − x) − (b − 1)x    (33)
   x = (a − 1)/(a + b − 2)    (34)

For the mean we have

   E[θ|D] = ∫_0^1 θ p(θ|D) dθ    (35)
          = (Γ(a + b)/(Γ(a)Γ(b))) ∫ θ^{(a+1)−1}(1 − θ)^{b−1} dθ    (36)
          = (Γ(a + b)/(Γ(a)Γ(b))) (Γ(a + 1)Γ(b)/Γ(a + 1 + b)) = a/(a + b)    (37)

where we used the definition of the Gamma function and the fact that Γ(x + 1) = xΓ(x).
We can find the variance in the same way, by first showing that

   E[θ^2] = (Γ(a + b)/(Γ(a)Γ(b))) ∫_0^1 θ^{(a+2)−1}(1 − θ)^{b−1} dθ    (38)
          = (Γ(a + b)/(Γ(a)Γ(b))) (Γ(a + 2)Γ(b)/Γ(a + 2 + b)) = (a/(a + b)) ((a + 1)/(a + 1 + b))    (39)

Now we use V[θ] = E[θ^2] − E[θ]^2 and E[θ] = a/(a + b) to get the variance.

2.9 Bayes rule for medical diagnosis


PRIVATE
Let T = 1 represent a positive test outcome, T = 0 represent a negative test outcome, D = 1 mean you have
the disease, and D = 0 mean you don’t have the disease. We are told

   P(T = 1|D = 1) = 0.99    (40)
   P(T = 0|D = 0) = 0.99    (41)
   P(D = 1) = 0.0001    (42)

We are asked to compute P(D = 1|T = 1), which we can do using Bayes’ rule:

   P(D = 1|T = 1) = P(T = 1|D = 1)P(D = 1) / [ P(T = 1|D = 1)P(D = 1) + P(T = 1|D = 0)P(D = 0) ]    (43)
                  = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999)    (44)
                  = 0.009804    (45)

So although you are much more likely to have the disease (given that you have tested positive) than a random
member of the population, you are still unlikely to have it.
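The arithmetic in Eqs. (43)-(45) is easy to reproduce; here is a short Matlab/Octave sketch (not part of the original solution):

% Posterior probability of disease given a positive test, Eqs. (43)-(45).
sens  = 0.99;      % P(T=1 | D=1)
spec  = 0.99;      % P(T=0 | D=0)
prior = 0.0001;    % P(D=1)
post = sens*prior / (sens*prior + (1 - spec)*(1 - prior))   % 0.009804...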

2.10 Legal reasoning
Let E be the evidence (the observed blood type), and I be the event that the defendant is innocent, and
G = ¬I be the event that the defendant is guilty.
1. The prosecutor is confusing p(E|I) with p(I|E). We are told that p(E|I) = 0.01 but the relevant
quantity is p(I|E). By Bayes rule, this is

      p(I|E) = p(E|I)p(I) / [ p(E|I)p(I) + p(E|G)p(G) ] = 0.01 p(I) / [ 0.01 p(I) + (1 − p(I)) ]    (46)

since p(E|G) = 1 and p(G) = 1 − p(I). So we cannot determine p(I|E) without knowing the prior
probability p(I). So p(E|I) = p(I|E) only if p(G) = p(I) = 0.5, which is hardly a presumption of
innocence.
To understand this more intuitively, consider the following isomorphic problem (from http://en.wikipedia.org/wiki/Prosecutor's_fallacy):

A big bowl is filled with a large but unknown number of balls. Some of the balls are made of
wood, and some of them are made of plastic. Of the wooden balls, 100 are white; out of the
plastic balls, 99 are red and only 1 is white. A ball is pulled out at random, and observed
to be white.

Without knowledge of the relative proportions of wooden and plastic balls, we cannot tell how likely it
is that the ball is wooden. If the number of plastic balls is far larger than the number of wooden balls,
for instance, then a white ball pulled from the bowl at random is far more likely to be a white plastic
ball than a white wooden ball — even though white plastic balls are a minority of the whole set of
plastic balls.

2. The defender is quoting p(G|E) while ignoring p(G). The prior odds are

      p(G)/p(I) = 1/799,999    (47)

   The posterior odds are

      p(G|E)/p(I|E) = 1/7999    (48)

   So the evidence has increased the odds of guilt by a factor of about 100 (the likelihood ratio p(E|G)/p(E|I) = 1/0.01). This is clearly relevant, although
   perhaps still not enough to find the suspect guilty.

2.11 Probabilities are sensitive to the form of the question that was used to
generate the answer
PRIVATE

1. The event space is shown below, where X is one child and Y the other.

      X   Y   Prob.
      G   G   1/4
      G   B   1/4
      B   G   1/4
      B   B   1/4
Let Ng be the number of girls and Nb the number of boys. We have the constraint (side information)
that Nb + Ng = 2 and 0 ≤ Nb , Ng ≤ 2. We are told Nb ≥ 1 and are asked to compute the probability of
   the event N_g = 1 (i.e., one child is a girl). By Bayes rule we have

      p(N_g = 1|N_b ≥ 1) = p(N_b ≥ 1|N_g = 1) p(N_g = 1) / p(N_b ≥ 1)    (49)
                         = (1 × 1/2) / (3/4) = 2/3    (50)

2. Let Y be the identity of the observed child and X be the identity of the other child. We want
   p(X = g|Y = b). By Bayes rule we have

      p(X = g|Y = b) = p(Y = b|X = g) p(X = g) / p(Y = b)    (51)
                     = (1/2 × 1/2) / (1/2) = 1/2    (52)

Tom Minka [Min98] has written the following about these results:
This seems like a paradox because it seems that in both cases we could condition on the fact that
"at least one child is a boy." But that is not correct; you must condition on the event actually
observed, not its logical implications. In the first case, the event was "He said yes to my question."
In the second case, the event was "One child appeared in front of me." The generating distribution
is different for the two events. Probabilities reflect the number of possible ways an event can
happen, like the number of roads to a town. Logical implications are further down the road and
may be reached in more ways, through different towns. The different number of ways changes the
probability.

2.12 Normalization constant for a 1D Gaussian


Following the first hint we have

   Z^2 = ∫_0^{2π} ∫_0^∞ r exp(−r^2/(2σ^2)) dr dθ    (53)
       = [ ∫_0^{2π} dθ ] [ ∫_0^∞ r exp(−r^2/(2σ^2)) dr ]    (54)
       = (2π) I    (55)

where I is the inner integral

   I = ∫_0^∞ r exp(−r^2/(2σ^2)) dr    (56)

Following the second hint we have

   I = −σ^2 ∫ (−r/σ^2) e^{−r^2/(2σ^2)} dr    (57)
     = −σ^2 [ e^{−r^2/(2σ^2)} ]_0^∞    (58)
     = −σ^2 [0 − 1] = σ^2    (59)

Hence

   Z^2 = 2πσ^2    (60)
   Z = σ √(2π)    (61)
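A quick numerical check of Eq. (61) (not part of the original solution), using an arbitrary value of σ and Matlab's integral:

% Compare the numerically computed normalization constant to sigma*sqrt(2*pi).
sigma = 1.7;                                            % arbitrary
Z = integral(@(x) exp(-x.^2/(2*sigma^2)), -Inf, Inf);
[Z, sigma*sqrt(2*pi)]                                   % should agree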

3 Solutions
3.1 Uncorrelated does not imply independent
PRIVATE
We have

   E[X] = 0    (62)
   V[X] = (1 − (−1))^2/12 = 1/3    (63)
   E[Y ] = E[X^2] = V[X] + (E[X])^2 = 1/3 + 0    (64)
   E[XY ] = ∫ p(x) x^3 dx = (1/2) [ ∫_{−1}^0 x^3 dx + ∫_0^1 x^3 dx ]    (65)
          = (1/2) [ (1/4)[x^4]_{−1}^0 + (1/4)[x^4]_0^1 ] = (1/8)[−1 + 1] = 0    (66)

where we have split the integral into two pieces, since x^3 changes sign in the interval [−1, 1]. Hence

   ρ = Cov[X, Y ]/(σ_X σ_Y) = (E[XY ] − E[X]E[Y ])/(σ_X σ_Y) = 0/(σ_X σ_Y) = 0    (67)
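A small simulation (not part of the original solution), with X ∼ U(−1, 1) and Y = X^2 as in the setup above:

% X and Y = X^2 are uncorrelated but clearly dependent.
N = 1e6;
X = 2*rand(N,1) - 1;
Y = X.^2;
cov_hat = mean(X.*Y) - mean(X)*mean(Y)   % approximately 0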

3.2 Correlation coefficient is between -1 and +1


We have

   0 ≤ V[X/σ_X + Y /σ_Y]    (68)
     = V[X/σ_X] + V[Y /σ_Y] + 2Cov[X/σ_X, Y /σ_Y]    (69)
     = V[X]/σ_X^2 + V[Y ]/σ_Y^2 + 2Cov[X, Y ]/(σ_X σ_Y)    (70)
     = 1 + 1 + 2ρ = 2(1 + ρ)    (71)

Hence ρ ≥ −1. Similarly,

   0 ≤ V[X/σ_X − Y /σ_Y] = 2(1 − ρ)    (72)

so ρ ≤ 1.

3.3 Correlation coefficient for linearly related variables is ±1


PRIVATE

Let E[X] = µ and V[X] = σ^2, and assume a > 0. Then we have

   E[Y ] = E[aX + b] = aE[X] + b = aµ + b    (73)
   V[Y ] = V[aX + b] = a^2 V[X] = a^2 σ^2    (74)
   E[XY ] = E[X(aX + b)] = E[aX^2 + bX] = aE[X^2] + bE[X] = a(µ^2 + σ^2) + bµ    (75)
   Cov[X, Y ] = E[XY ] − E[X]E[Y ] = a(µ^2 + σ^2) + bµ − µ(aµ + b) = aσ^2    (76)
   ρ = Cov[X, Y ] / √(V[X] V[Y ]) = aσ^2 / (σ √(a^2 σ^2)) = 1    (77)

If Y = aX + b, where a < 0, we find ρ = −1, since a/√(a^2) = a/|a| = −1.

3.4 Linear combinations of random variables
1. Let y = Ax. Then

      Cov[y] = E[(y − E[y])(y − E[y])^T]    (78)
             = E[(Ax − Am)(Ax − Am)^T]    (79)
             = A E[(x − m)(x − m)^T] A^T    (80)
             = AΣA^T    (81)

2. C = AB has entries c_ij = Σ_k a_ik b_kj. The diagonal elements of C (when i = j) are given by Σ_k a_ik b_ki.
   So the sum of the diagonal elements is tr(AB) = Σ_{ik} a_ik b_ki, which is symmetric in A and B.

3. We have

      E[x^T Ax] = E[tr(x^T Ax)] = E[tr(Axx^T)]    (82)
                = tr(A E[xx^T]) = tr(A(Σ + mm^T))    (83)
                = tr(AΣ) + m^T Am    (84)

3.5 Gaussian vs jointly Gaussian


1. For the mean, we have

      E[Y ] = E[W X] = E[W ] E[X] = 0    (85)

   For the variance, we have

      V[Y ] = E[V[Y |W ]] + V[E[Y |W ]]    (86)
            = E[W V[X] W ] + V[W E[X]]    (87)
            = E[W^2] + 0 = 1    (88)

   To show it’s Gaussian, we note that, conditional on W = w, Y = wX is N (0, 1) for either value of w, so Y ∼ N (0, 1) unconditionally.

2. To show that Cov[X, Y ] = 0, we use the rule of iterated expectation. First we have

      E[XY ] = E[E[XY |W ]]    (89)
             = Σ_{w∈{−1,1}} p(w) E[XY |w]    (90)
             = 0.5 E[X · (−X)] + 0.5 E[X · X]    (91)
             = 0    (92)

   Then we have

      E[Y ] = E[E[Y |W ]]    (93)
            = Σ_{w∈{−1,1}} p(w) E[Y |w]    (94)
            = 0.5 E[−X] + 0.5 E[X]    (95)
            = 0    (96)

   Hence

      Cov[X, Y ] = E[XY ] − E[X] E[Y ] = 0    (97)

   So X and Y are uncorrelated even though they are dependent.

3.6 Normalization constant for a multidimensional Gaussian
Let Σ = UΛU^T, so

   Σ^{−1} = U^{−T} Λ^{−1} U^{−1} = U Λ^{−1} U^T = Σ_{i=1}^p (1/λ_i) u_i u_i^T    (98)

Hence

   (x − µ)^T Σ^{−1} (x − µ) = (x − µ)^T ( Σ_{i=1}^p (1/λ_i) u_i u_i^T ) (x − µ)    (99)
                            = Σ_{i=1}^p (1/λ_i) (x − µ)^T u_i u_i^T (x − µ) = Σ_{i=1}^p y_i^2/λ_i    (100)

where y_i ≜ u_i^T (x − µ). The y variables define a new coordinate system that is shifted (by µ) and rotated
(by U) with respect to the original x coordinates: y = U(x − µ). Hence x = U^T y + µ.
The Jacobian of this transformation, from y to x, is a matrix with elements

   J_ij = ∂x_i/∂y_j = U_ji    (101)

so J = U^T and |J| = 1.
So

   ∫ exp( −(1/2)(x − µ)^T Σ^{−1}(x − µ) ) dx = ∫ Π_i exp( −(1/2) y_i^2/λ_i ) |J| dy_i    (102)
                                             = Π_i √(2πλ_i) = √|2πΣ|    (103)

3.7 Sensor fusion with known variances in 1d


Define the sufficient statistics as

   ȳ_1 = (1/n_1) Σ_{i=1}^{n_1} y_i^{(1)},   ȳ_2 = (1/n_2) Σ_{i=1}^{n_2} y_i^{(2)}    (104)

Define the prior as

   µ_µ = 0,  Σ_µ = ∞    (105)

Define the likelihood as

   A = [1; 1],   µ_y = [0; 0],   Σ_y = [ v_1/n_1  0;  0  v_2/n_2 ]    (106)

Now we just apply the equations. The posterior precision is a sum of the precisions of each sensor:

   Σ_{µ|y}^{−1} = A^T Σ_y^{−1} A = [1 1] [ n_1/v_1  0;  0  n_2/v_2 ] [1; 1] = n_1/v_1 + n_2/v_2    (107)

The posterior mean is a weighted sum of the observed values from each sensor:

   µ_{µ|y} = Σ_{µ|y} A^T Σ_y^{−1} [ȳ_1; ȳ_2] = Σ_{µ|y} ( n_1 ȳ_1/v_1 + n_2 ȳ_2/v_2 )    (108)

3.8 Show that the Student distribution can be written as a Gaussian scale
mixture
   p(w|µ, a, b) = ∫_0^∞ N (w|µ, α^{−1}) Ga(α|a, b) dα    (109)
                = ∫_0^∞ (b^a e^{−bα} α^{a−1}/Γ(a)) (α/(2π))^{1/2} exp( −(α/2)(w − µ)^2 ) dα    (110)
                = (b^a/Γ(a)) (1/(2π))^{1/2} ∫_0^∞ α^{a−1/2} exp( −α [ b + (1/2)(w − µ)^2 ] ) dα    (111)

Let us define ∆ ≜ b + (w − µ)^2/2 and z ≜ α∆. Then, using dz = ∆ dα, and the definition of the Gamma
function, ∫_0^∞ u^{x−1} e^{−u} du = Γ(x), the integral becomes

   ∫_0^∞ (z/∆)^{a+1/2−1} e^{−z} ∆^{−1} dz = ∆^{−a−1/2} Γ(a + 1/2)    (112)

so we have

   p(w|µ, a, b) = (Γ(a + 1/2)/Γ(a)) b^a (1/(2π))^{1/2} ∆^{−a−1/2}    (113)

Let us define a = ν/2 and b = ν/(2λ). Then we have

   p(w|µ, a, b) = (Γ((ν + 1)/2)/Γ(ν/2)) (ν/(2λ))^{ν/2} (1/(2π))^{1/2} [ ν/(2λ) + (w − µ)^2/2 ]^{−(ν+1)/2}    (114)
                = (Γ((ν + 1)/2)/Γ(ν/2)) (ν/(2λ))^{ν/2} (1/(2π))^{1/2} (ν/(2λ))^{−(ν+1)/2} [ 1 + (λ/ν)(w − µ)^2 ]^{−(ν+1)/2}    (115)
                = (Γ((ν + 1)/2)/Γ(ν/2)) (λ/(νπ))^{1/2} [ 1 + (λ/ν)(w − µ)^2 ]^{−(ν+1)/2}    (116)
                = T_ν(w|µ, λ^{−1})    (117)

4 Solutions
4.1 MLE for the univariate Gaussian
PRIVATE
Starting with the mean, we have

   ∂ℓ/∂µ = (2/(2σ^2)) Σ_i (x_i − µ) = 0    (118)
   µ̂ = (1/N) Σ_{i=1}^N x_i = x̄    (119)

We can derive σ̂^2 as follows.

   ∂ℓ/∂σ^2 = (1/(2σ^4)) Σ_i (x_i − µ̂)^2 − N/(2σ^2) = 0    (120)
   σ̂^2 = (1/N) Σ_{i=1}^N (x_i − µ̂)^2    (121)
       = (1/N) [ Σ_i x_i^2 + Σ_i µ̂^2 − 2 Σ_i x_i µ̂ ] = (1/N) [ Σ_i x_i^2 + N µ̂^2 − 2N µ̂^2 ]    (122)
       = (1/N) Σ_{i=1}^N x_i^2 − (x̄)^2    (123)

4.2 MAP estimation for 1D Gaussians


PRIVATE

1. Since the mean and mode of a Gaussian are the same, we can just write down the MAP estimate using
   the standard equations for the posterior mean of a Gaussian, which is

      E[µ|D] = ( ns^2/(ns^2 + σ^2) ) x̄ + ( σ^2/(ns^2 + σ^2) ) m    (124)

   However, we can also derive the MAP estimate explicitly, as follows. The prior distribution over the
   mean is

      p(µ) = (2πs^2)^{−1/2} exp[ −(µ − m)^2/(2s^2) ]    (125)

   Since the samples x_i are taken to be independent, we have:

      p(D|µ) = (2πσ^2)^{−n/2} e^{ −(1/(2σ^2)) Σ_{i=1}^n (x_i − µ)^2 }    (126)

   And combining them:

      log p(µ)p(D|µ) = −(1/2) log(2πs^2) − (µ − m)^2/(2s^2) − (n/2) log(2πσ^2) − (1/(2σ^2)) Σ_{i=1}^n (x_i − µ)^2    (127)

   Taking the derivative with respect to µ:

      ∂ log(p(µ)p(D|µ))/∂µ = 0 − 2(µ − m)/(2s^2) − 0 − (1/(2σ^2)) Σ_{i=1}^n 2(x_i − µ)(−1)    (128)
                           = −(µ − m)/s^2 + (1/σ^2) Σ_{i=1}^n (x_i − µ)    (129)
                           = −µ/s^2 + m/s^2 + (1/σ^2) Σ_{i=1}^n x_i − nµ/σ^2    (130)

   Setting this derivative to zero in order to find the maximum:

      0 = −µ̂_MAP/s^2 + m/s^2 + (1/σ^2) Σ_{i=1}^n x_i − nµ̂_MAP/σ^2    (131)
      µ̂_MAP ( n/σ^2 + 1/s^2 ) = (1/σ^2) Σ_{i=1}^n x_i + m/s^2    (132)
      µ̂_MAP = [ (1/σ^2) Σ_{i=1}^n x_i + m/s^2 ] / [ n/σ^2 + 1/s^2 ] = ( s^2 Σ_{i=1}^n x_i + mσ^2 ) / ( ns^2 + σ^2 )    (133)
             = ( ns^2/(ns^2 + σ^2) ) ( Σ_{i=1}^n x_i / n ) + ( σ^2/(ns^2 + σ^2) ) m    (134)

2. Notice that as n increases, while s, m and σ remain constant, we have σ^2/(ns^2 + σ^2) → 0 while
   ns^2/(ns^2 + σ^2) → 1, yielding µ̂_MAP → (Σ_i x_i)/n = µ̂_MLE. That is, when there are many samples, the prior knowledge becomes
   less and less relevant, and all MAP estimators (for any prior) converge to the maximum likelihood
   estimator µ̂_MLE. This is not surprising: as we have more data, it outweighs our prior speculations.

3. As the prior variance s^2 increases, even for a small number of samples n, we have σ^2/(ns^2 + σ^2) → 0 while
   ns^2/(ns^2 + σ^2) → 1, again yielding µ̂_MAP → (Σ_i x_i)/n = µ̂_MLE. That is, when we have an uninformative, almost
   uniform, prior, we are left only with our data to base our estimation on.

4. On the other hand, if the prior variance is very small, s^2 → 0, then we have σ^2/(ns^2 + σ^2) → 1 while
   ns^2/(ns^2 + σ^2) → 0, yielding µ̂_MAP → m. That is, if our prior is very concentrated, we essentially already
   know the answer, and can ignore the data.

4.3 Gaussian posterior credible interval


We want an interval that satisfies

   p(ℓ ≤ µ_n ≤ u|D) ≥ 0.95    (135)

where

   ℓ = µ_n + Φ^{−1}(0.025) σ_n = µ_n − 1.96 σ_n    (136)
   u = µ_n + Φ^{−1}(0.975) σ_n = µ_n + 1.96 σ_n    (137)

where Φ is the cumulative distribution function for the standard normal N (0, 1) distribution, Φ^{−1}(0.025) is
the value below which 2.5% of the probability mass lies (in Matlab, norminv(0.025)=-1.96), and Φ^{−1}(0.975)
is the value below which 97.5% of the probability mass lies (in Matlab, norminv(0.975)=1.96). We want to
find n such that

   u − ℓ = 1    (138)

Hence we solve

   2(1.96) σ_n = 1    (139)
   σ_n^2 = 1/(4(1.96)^2)    (140)

where

   σ_n^2 = σ^2 σ_0^2 / (nσ_0^2 + σ^2)    (141)

Hence

   nσ_0^2 + σ^2 = σ^2 σ_0^2 4(1.96)^2    (142)
   n = σ^2 ( σ_0^2 4(1.96)^2 − 1 ) / σ_0^2    (143)
     = 4(9 × 4(1.96)^2 − 1)/9 = 61.0212    (144)

Hence we need at least n ≥ 62 samples.
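The final arithmetic can be reproduced as follows (a sketch, not part of the original solution); the values σ^2 = 4 and σ_0^2 = 9 are inferred from the 61.0212 figure in Eq. (144) and may differ from the values stated in the exercise.

% Smallest n giving a 95% credible interval of width 1, Eq. (143).
sigma2 = 4; s02 = 9;                      % assumed problem values
n = sigma2*(s02*4*1.96^2 - 1)/s02         % 61.0212
ceil(n)                                   % 62 samples needed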

4.4 BIC for Gaussians


PRIVATE

1. Using the fact that Σ̂_ML = S, we have

      log p(D|θ̂_ML) = −(N/2) tr(Σ̂^{−1} S) − (N/2) log(|Σ̂|)    (145)
                    = −(N/2) tr(I_D) − (N/2) log(|Σ̂|)    (146)
                    = −(N/2) ( D + log |Σ̂| )    (147)

   There are D free parameters for the mean and D(D + 1)/2 for the covariance, so the BIC score is

      BIC = −(N/2) ( D + log |Σ̂| ) − (d/2) log N    (148)

   where d = D + D(D + 1)/2.

2. Let S_diag = diag(diag(S)) = Σ̂_diag. Then

      tr(Σ̂_diag^{−1} S) = tr(Σ̂_diag^{−1} S_diag) = tr(I_D) = D

   because the off-diagonal elements of the matrix product get ignored. Hence

      BIC = −(N/2) ( D + log |Σ̂| ) − (d/2) log N    (149)

   where d = D + D parameters (for the mean and diagonal).

4.5 BIC for a 2d discrete distribution
1. The joint distribution is p(x, y|θ) = p(x|θ_1) p(y|x, θ_2):

              y = 0                 y = 1
   x = 0   (1 − θ_1)θ_2       (1 − θ_1)(1 − θ_2)
   x = 1   θ_1(1 − θ_2)       θ_1 θ_2
2. The log likelihood is

      log p(D|θ) = Σ_i log p(x_i|θ_1) + Σ_i log p(y_i|x_i, θ_2)    (150)

   Hence we can optimize each term separately. For θ_1, we have

      θ̂_1 = Σ_i I(x_i = 1)/n = N(x = 1)/N = 4/7 = 0.5714    (151)

   For θ_2, we have

      θ̂_2 = Σ_i I(x_i = y_i)/n = N(x = y)/N = 4/7    (152)

   The likelihood is

      p(D|θ̂, M_2) = (4/7)^{N(x=1)} (3/7)^{N(x=0)} (4/7)^{N(x=y)} (3/7)^{N(x≠y)}    (153)
                  = (4/7)^4 (3/7)^3 (4/7)^4 (3/7)^3    (154)
                  = (4/7)^8 (3/7)^6 ≈ 7.04 × 10^{−5}    (155)

3. The table of joint counts is

              y = 0   y = 1
      x = 0     2       1
      x = 1     2       2

   We can think of this as a multinomial distribution with 4 states. Normalizing the counts gives the MLE:

              y = 0   y = 1
      x = 0    2/7     1/7
      x = 1    2/7     2/7

   The likelihood is

      p(D|θ̂, M_4) = θ_00^{N(x=0,y=0)} θ_01^{N(x=0,y=1)} θ_10^{N(x=1,y=0)} θ_11^{N(x=1,y=1)} = (2/7)^2 (1/7)^1 (2/7)^2 (2/7)^2    (156)
                  = (2/7)^6 (1/7)^1 ≈ 7.77 × 10^{−5}    (157)

   This is higher than the previous likelihood, because the model has more parameters.
4. For M4 , when we omit case 7, we will have θ̂01 = 0, so p(x7 , y7 |m4 , θ̂) = 0, so L(m4 ) = −∞. However,
L(m2 ) will be finite, since all counts remain non zero when we leave out a single case. Hence CV will
prefer M2 , since M4 is overfitting.
5. The BIC score is

      BIC(m) = log p(D|θ̂, m) − (dof(m)/2) log n    (158)

   where n = 7. For M_2, we have dof = 2, so

      BIC(m_2) = 8 log(4/7) + 6 log(3/7) − (2/2) log 7 = −11.5066    (159)

   For M_4, we have dof = 3 because of the sum-to-one constraint, so

      BIC(m_4) = 6 log(2/7) + 1 log(1/7) − (3/2) log 7 = −12.3814    (160)

   So BIC also prefers m_2.
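The numbers in Eqs. (153)-(160) are easy to verify; here is a short Matlab/Octave sketch (not part of the original solution):

% Likelihoods and BIC scores for M2 and M4.
n = 7;
L2 = 8*log(4/7) + 6*log(3/7);             % log-likelihood of M2, Eq. (155)
L4 = 6*log(2/7) + 1*log(1/7);             % log-likelihood of M4, Eq. (157)
[exp(L2), exp(L4)]                        % 7.04e-5 and 7.77e-5
BIC2 = L2 - (2/2)*log(n)                  % -11.5066
BIC4 = L4 - (3/2)*log(n)                  % -12.3814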

4.6 A mixture of conjugate priors is conjugate


PRIVATE

We now show that, if the prior is a mixture, so is the posterior:

   p(θ|D) = p(D|θ)p(θ)/p(D)    (161)
          = p(D|θ) Σ_k p(Z = k)p(θ|Z = k) / ∫ p(D|θ′) Σ_{k′} p(Z = k′)p(θ′|Z = k′) dθ′    (162)
          = Σ_k p(Z = k)p(D, θ|Z = k) / Σ_{k′} p(Z = k′) ∫ p(D, θ′|Z = k′) dθ′    (163)
          = Σ_k p(Z = k)p(θ|D, Z = k)p(D|Z = k) / Σ_{k′} p(Z = k′)p(D|Z = k′)    (164)
          = Σ_k [ p(Z = k)p(D|Z = k) / Σ_{k′} p(Z = k′)p(D|Z = k′) ] p(θ|D, Z = k)    (165)
          = Σ_k p(Z = k|D) p(θ|D, Z = k)    (166)

where p(Z = k) = π_k are the prior mixing weights, and p(Z = k|D) are the posterior mixing weights given by

   p(Z = k|D) = p(Z = k)p(D|Z = k) / Σ_{k′} p(Z = k′)p(D|Z = k′)    (167)

4.7 ML estimator σ^2_mle is biased
Because the variance of any random variable R is given by var(R) = E[R^2] − (E[R])^2, the expected value of the
square of a Gaussian random variable X_i with mean µ and variance σ^2 is E[X_i^2] = var(X_i) + (E[X_i])^2 = σ^2 + µ^2.

   E_{X_1,...,X_n ∼ N(µ,σ)} [σ̂^2(X_1, . . . , X_n)]
      = E[ (1/n) Σ_{i=1}^n ( X_i − (Σ_{j=1}^n X_j)/n )^2 ]
      = (1/n) Σ_{i=1}^n E[ ( X_i − (Σ_j X_j)/n )( X_i − (Σ_k X_k)/n ) ]
      = (1/n) Σ_{i=1}^n E[ X_i^2 − (2/n) X_i Σ_j X_j + (1/n^2) Σ_j Σ_k X_j X_k ]
      = (1/n) Σ_{i=1}^n E[X_i^2] − (2/n^2) Σ_i Σ_j E[X_i X_j] + (1/n^3) Σ_i Σ_j Σ_k E[X_j X_k]

Consider the two summations Σ_i Σ_j E[X_i X_j] and Σ_j Σ_k E[X_j X_k]. Of the n^2 terms in each of
these summations, n of them satisfy i = j or j = k, so these terms are of the form E[X_i^2]. By linearity
of expectation, these terms contribute n E[X_i^2] to the sum. The remaining n^2 − n terms are of the form
E[X_i X_j] or E[X_j X_k] for i ≠ j or j ≠ k. Because the X_i are independent samples, it follows from linearity of
expectation that these terms contribute (n^2 − n) E[X_i] E[X_j] to the summation.

   Σ_i Σ_j E[X_i X_j] = Σ_j Σ_k E[X_j X_k]
      = n E[X_i^2] + (n^2 − n) E[X_i] E[X_j]
      = n(σ^2 + µ^2) + (n^2 − n)µ^2 = nσ^2 + nµ^2 + n^2 µ^2 − nµ^2
      = nσ^2 + n^2 µ^2

Hence

   E_{X_1,...,X_n ∼ N(µ,σ)} [σ̂^2(X_1, . . . , X_n)]
      = (1/n) n(σ^2 + µ^2) − (2/n^2)(nσ^2 + n^2 µ^2) + (1/n^3) n(nσ^2 + n^2 µ^2)
      = σ^2 + µ^2 − 2σ^2/n − 2µ^2 + σ^2/n + µ^2
      = σ^2 − σ^2/n = ((n − 1)/n) σ^2

Since the expected value of σ̂^2(X_1, . . . , X_n) is not equal to the actual variance σ^2, σ̂^2 is not an unbiased
estimator. In fact, the maximum likelihood estimator tends to underestimate the variance. This is not
surprising: consider the case of only a single sample: we will never detect any variance. If there are multiple
samples, we will detect variance, but since our estimate for the mean will tend to be shifted from the true
mean in the direction of our samples, we will tend to underestimate the variance.
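A Monte Carlo illustration of the bias (not part of the original solution; n, reps, µ and σ^2 below are arbitrary):

% The sample-average MLE of the variance concentrates on ((n-1)/n) * sigma^2.
n = 5; reps = 1e5; mu = 0; sigma2 = 4;
X = mu + sqrt(sigma2)*randn(reps, n);
Xc = X - repmat(mean(X, 2), 1, n);        % subtract each row's sample mean
s2_mle = mean(Xc.^2, 2);                  % per-row MLE of the variance
[mean(s2_mle), (n-1)/n*sigma2]            % both approximately 3.2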

4.8 Estimation of σ 2 when µ is known


PRIVATE

We can re-do the derivation of σ̂^2_MLE, but instead of taking the maximum over µ, we just use the known
µ. We get:

   σ̂^2 = (1/n) Σ_{i=1}^n (x_i − µ)^2

In calculating the expected value of the estimator, note that we get exactly the definition of a variance of a
random variable:

   E_{X_1,...,X_n ∼ N(µ,σ)} [σ̂^2(X_1, X_2, . . . , X_n)] = (1/n) Σ_{i=1}^n E_{X_i ∼ N(µ,σ)} [(X_i − µ)^2]
      = (1/n) Σ_{i=1}^n Var[N (µ, σ)]
      = (1/n) Σ_{i=1}^n σ^2
      = σ^2

So the estimator is unbiased.

4.9 Variance and MSE of the unbiased estimator for Gaussian variance
PRIVATE
We have

   V[ (N − 1) σ^2_unb / σ^2 ] = 2(N − 1)    (168)
   ( (N − 1)^2/σ^4 ) V[σ^2_unb] = 2(N − 1)    (169)
   V[σ^2_unb] = 2σ^4/(N − 1)    (170)

Hence the variance of the ML estimator is

   V[σ̂^2_ML] = V[ ((N − 1)/N) σ̂^2_{N−1} ] = ((N − 1)/N)^2 ( 2σ^4/(N − 1) ) = ( 2(N − 1)/N^2 ) σ^4    (171)

Finally, we have

   MSE(σ^2_mle) = bias(σ^2_mle)^2 + V[σ^2_mle]    (172)
                = (−σ^2/N)^2 + ( 2(N − 1)/N^2 ) σ^4    (173)
                = ( (2N − 1)/N^2 ) σ^4    (174)

and

   MSE(σ^2_unb) = bias(σ^2_unb)^2 + V[σ^2_unb]    (175)
                = 0 + 2σ^4/(N − 1)    (176)
5 Solutions
5.1 Reject option in classifiers
1. We have to choose between rejecting, with risk λ_r, and choosing the most probable class, j_max =
   arg max_j p(Y = j|x), which has risk

      λ_s Σ_{j ≠ j_max} p(Y = j|x) = λ_s (1 − p(Y = j_max|x))    (177)

   Hence we should pick j_max if

      λ_r ≥ λ_s (1 − p(Y = j_max|x))    (178)
      λ_r/λ_s ≥ 1 − p(Y = j_max|x)    (179)
      p(Y = j_max|x) ≥ 1 − λ_r/λ_s    (180)

   otherwise we should reject.
   For completeness, we should prove that when we decide to choose a class (and not reject), we always
   pick the most probable one. If we choose a non-maximal category k ≠ j_max, the risk is

      λ_s Σ_{j ≠ k} p(Y = j|x) = λ_s (1 − p(Y = k|x)) ≥ λ_s (1 − p(Y = j_max|x))    (181)

   which is always at least as large as the risk of picking j_max.

2. If λ_r/λ_s = 0, there is no cost to rejecting, so we always reject. As λ_r/λ_s → 1, the cost of rejecting
   increases. When λ_r/λ_s = 1, the condition p(Y = j_max|x) ≥ 1 − λ_r/λ_s is always satisfied, so we always accept the most probable
   class.

5.2 Newsvendor problem


PRIVATE

   Eπ(Q) = ∫_Q^∞ (P − C)Q f (D) dD + ∫_0^Q (P − C)D f (D) dD − ∫_0^Q C(Q − D) f (D) dD    (182)
         = (P − C)Q(1 − F (Q)) + P ∫_0^Q D f (D) dD − CQ ∫_0^Q f (D) dD    (183)
         = (P − C)Q(1 − F (Q)) + P ∫_0^Q D f (D) dD − CQ F (Q)    (184)

Hence

   (d/dQ) Eπ(Q) = (P − C)(1 − F (Q)) − (P − C)Q f (Q) + P Q f (Q) − C F (Q) − CQ f (Q)    (185)
                = (P − C) − P F (Q)    (186)
                = 0    (187)

So

   F (Q) = (P − C)/P    (188)
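As a usage example (not part of the original solution), suppose demand is Gaussian; the optimal order quantity is then the (P − C)/P quantile of the demand distribution. The prices and demand parameters below are made up.

% Newsvendor quantity Q* solving F(Q*) = (P - C)/P, for Gaussian demand.
P = 10; C = 4;                            % selling price and unit cost (assumed)
mu = 100; sigma = 20;                     % demand ~ N(mu, sigma^2) (assumed)
crit = (P - C)/P;                         % critical ratio = 0.6
Qstar = mu + sigma*sqrt(2)*erfinv(2*crit - 1)   % Gaussian quantile via erfinv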

5.3 Bayes factors and ROC curves
PRIVATE

p(H1 |D) is a monotonically increasing function of B, and therefore cannot affect the shape of the ROC
curve. This makes sense intuitively, since, in the two class case, it should not matter whether we threshold the
ratio p(H1 |D)/p(H0 |D), or the posterior, p(H1 |D), since they contain the same information, just measured
on different scales.

5.4 Posterior median is optimal estimate under L1 loss


To prove this, we expand the posterior expected loss as follows:

   ρ(a|x) = E_{θ|x} |θ − a| = ∫_{θ≥a} (θ − a)p(θ|x) dθ + ∫_{θ≤a} (a − θ)p(θ|x) dθ    (189)
          = ∫_a^∞ (θ − a)p(θ|x) dθ + ∫_{−∞}^a (a − θ)p(θ|x) dθ    (190)

Now recall the rule to differentiate under the integral sign:

   (d/da) ∫_{A(a)}^{B(a)} φ(a, θ) dθ = ∫_{A(a)}^{B(a)} φ′(a, θ) dθ + φ(a, B(a)) B′(a) − φ(a, A(a)) A′(a)    (191)

where φ′(a, θ) = (d/da) φ(a, θ). Applying this to the first integral in Equation 190, with A(a) = a, B(a) = ∞,
φ(a, θ) = (θ − a)p(θ|x), we have

   (d/da) ∫_a^∞ (θ − a)p(θ|x) dθ = ∫_a^∞ −p(θ|x) dθ + 0 − 0    (192)

Analogously, one can show

   (d/da) ∫_{−∞}^a (a − θ)p(θ|x) dθ = ∫_{−∞}^a p(θ|x) dθ    (193)

Hence

   ρ′(a|x) = − ∫_a^∞ p(θ|x) dθ + ∫_{−∞}^a p(θ|x) dθ    (194)
           = −P (θ ≥ a|x) + P (θ ≤ a|x) = 0    (195)

So the value of a that makes ρ′(a|x) = 0 satisfies

   P (θ ≥ a|x) = P (θ ≤ a|x)    (196)

Hence the optimal a is the posterior median.

6 Solutions
6.1 Expressing mutual information in terms of entropies
PRIVATE

   I(X, Y ) = Σ_{x,y} p(x, y) log [ p(x, y)/(p(x)p(y)) ]    (197)
            = Σ_{x,y} p(x, y) log [ p(x|y)/p(x) ]    (198)
            = − Σ_{x,y} p(x, y) log p(x) + Σ_{x,y} p(x, y) log p(x|y)    (199)
            = − Σ_x p(x) log p(x) − ( − Σ_{x,y} p(x, y) log p(x|y) )    (200)
            = − Σ_x p(x) log p(x) − ( − Σ_y p(y) Σ_x p(x|y) log p(x|y) )    (201)
            = H(X) − H(X|Y )    (202)

We can show I(X, Y ) = H(Y ) − H(Y |X) by symmetry.

For the second, using H(Y |X) = H(X, Y ) − H(X) we have

   I(X; Y ) = H(X) + H(Y ) − H(X, Y )    (203)

Hence adding and subtracting H(Y ) we have

   H(X, Y ) = H(X, Y ) + H(X, Y ) − H(X, Y )    (204)
            = (H(X, Y ) − H(Y )) + H(X, Y ) + (H(Y ) − H(X, Y ))    (205)

Now adding and subtracting H(X) we have

   H(X, Y ) = (H(X, Y ) − H(Y )) + (H(X, Y ) − H(X)) + (H(Y ) + H(X) − H(X, Y ))    (206)

Simplifying yields

   H(X, Y ) = H(X|Y ) + H(Y |X) + I(X, Y )    (207)

6.2 Relationship between D(p||q) and χ2 statistic
We have

   D(p||q) = Σ_x p(x) log [ p(x)/q(x) ]    (208)
           = Σ_x (q(x) + ∆(x)) log ( 1 + ∆(x)/q(x) )    (209)
           = Σ_x (q(x) + ∆(x)) ( ∆(x)/q(x) − ∆(x)^2/(2q(x)^2) + · · · )    (210)
           = Σ_x ( ∆(x) + ∆(x)^2/q(x) − ∆(x)^2/(2q(x)) + · · · )    (211)
           = 0 + Σ_x ∆(x)^2/(2q(x)) + · · ·    (212)

since

   Σ_x ∆(x) = Σ_x [ p(x) − q(x) ] = 0    (213)

6.3 Fun with entropies


PRIVATE

Here is p(x, y), with the marginals p(x) and p(y) written in the margins (hence the name).

                x
           1      2      3      4    | p(y)
   y  1   1/8   1/16   1/32   1/32   | 1/4
      2  1/16    1/8   1/32   1/32   | 1/4
      3  1/16   1/16   1/16   1/16   | 1/4
      4   1/4     0      0      0    | 1/4
   p(x)   1/2    1/4    1/8    1/8

Here is p(x|y) and H(X|y):

                x
           1      2      3      4    | H(X|y)
   y  1   1/2    1/4    1/8    1/8   | 7/4
      2   1/4    1/2    1/8    1/8   | 7/4
      3   1/4    1/4    1/4    1/4   | 2
      4    1      0      0      0    | 0

1. H(X, Y ) = 27/8 = 3.375


2. H(X) = 7/4 = 1.75, H(Y ) = 2.0.

3. H(X|y) = 1.75, 1.75, 2.0, 0.0. Note that H(X|Y = 4) = 0 < H(X), so observing Y = 4 decreases
our entropy, but H(X|Y = 3) = 2.0 > H(X), so observing Y = 3 increases our entropy. Also,
H(X|Y = 1) = H(X|Y = 2) = H(X), so observing Y = 1 or Y = 2 does not change our entropy (but
does change our posterior distribution!).
4. H(X|Y ) = 11/8 = 1.375 < H(X), so on average, we are less uncertain after seeing Y .

5. I(X, Y ) = H(X) − H(X|Y ) = 3/8 = 0.375

6.4 Forwards vs reverse KL divergence
1. We have

      KL (p∥q) = Σ_{x,y} p(x, y)[log p(x, y) − log q(x) − log q(y)]    (214)
               = Σ_{x,y} p(x, y) log p(x, y) − Σ_x p(x) log q(x) − Σ_y p(y) log q(y)    (215)

   We can optimize wrt q(x) and q(y) separately. Imposing a Lagrange multiplier to enforce the constraint
   that Σ_x q(x) = 1, we have the Lagrangian

      L(q, λ) = Σ_x p(x) log q(x) + λ(1 − Σ_x q(x))    (216)

   Taking derivatives wrt q(x) (thinking of the function as a finite length vector, for simplicity), we have

      ∂L/∂q(x) = p(x)/q(x) − λ = 0    (217)
      q(x) = p(x)/λ    (218)

   Summing both sides over x we get λ = 1 and hence

      q(x) = p(x)    (219)

   Analogously, q(y) = p(y).

2. We require q(x, y) = 0 whenever p(x, y) = 0, otherwise log q(x, y)/p(x, y) = ∞. Since q(x, y) =
   q_x(x)q_y(y), it must be that q_x(x) = q_y(y) whenever x = y, and hence q_x = q_y are the same distribution.
   There are only 3 possible distributions that put 0s in the right places and yet sum to 1. The first is:

                  x
             1   2   3   4   | q(y)
      y  1   0   0   0   0   |  0
         2   0   0   0   0   |  0
         3   0   0   1   0   |  1
         4   0   0   0   0   |  0
      q(x)   0   0   1   0

   The second one is

                  x
             1   2   3   4   | q(y)
      y  1   0   0   0   0   |  0
         2   0   0   0   0   |  0
         3   0   0   0   0   |  0
         4   0   0   0   1   |  1
      q(x)   0   0   0   1

   For both of these, we have KL (q∥p) = 1 × log(1/(1/4)) = log 4. Furthermore, any slight perturbation of
   these probabilities away from the designated values will cause the KL to blow up, meaning these are
   local minima.
   The final local optimum is

                  x
              1    2   3   4   | q(y)
      y  1   1/4  1/4  0   0   | 1/2
         2   1/4  1/4  0   0   | 1/2
         3    0    0   0   0   |  0
         4    0    0   0   0   |  0
      q(x)   1/2  1/2  0   0

   This has KL (q∥p) = 4 ( (1/4) log((1/4)/(1/8)) ) = log 2, so this is actually the global optimum.
   To see that there are no other solutions, one can do a case analysis, and see that any other distribution
   will not put 0s in the right places. For example, consider this:

                  x
              1   2    3   4   | q(y)
      y  1   1/4  0   1/4  0   | 1/2
         2    0   0    0   0   |  0
         3   1/4  0   1/4  0   | 1/2
         4    0   0    0   0   |  0
      q(x)   1/2  0   1/2  0

   Obviously if we set q(x, y) = p(x)p(y) = 1/16, we get KL (q∥p) = ∞.

7 Solutions
7.1 Orthogonal matrices
1. Let c = cos(α) and s = sin(α). Using the fact that c^2 + s^2 = 1, we have

      R^T R = [ c  s  0;  −s  c  0;  0  0  1 ] [ c  −s  0;  s  c  0;  0  0  1 ]
            = [ c^2 + s^2   −cs + sc   0;  −sc + sc   s^2 + c^2   0;  0  0  1 ]
            = [ 1  0  0;  0  1  0;  0  0  1 ]    (220)

2. The z-axis v = (0, 0, 1) is not affected by a rotation around z. We can easily check that (0, 0, 1) is an
   eigenvector with eigenvalue 1 as follows:

      [ c  −s  0;  s  c  0;  0  0  1 ] [0; 0; 1] = [0; 0; 1]    (221)
Of course, (0, 0, −1) is also a valid solution. We can check this using the symbolic math toolbox in
Matlab:
syms c s x y z
S=solve(’c*x-s*y=x’,’s*x+c*y=y’,’x^2+y^2+z^2=1’)
>> S.x
ans =
0
0
>> S.y
ans =
0
0
>> S.z
ans =
1
-1
   If we ignore the unit norm constraint, we find that (0, 0, z) is a solution for any z ∈ R. We can see this
   as follows:

      [ c  −s  0;  s  c  0;  0  0  1 ] [x; y; z] = [x; y; z]    (222)

   becomes

      cx − sy = x    (223)
      sx + cy = y    (224)
      z = z    (225)

   From (223) we get

      y = x(c − 1)/s    (226)

   and substituting this into (224) gives

      sx + cx(c − 1)/s = x(c − 1)/s    (227)
      s^2 x = x(c − 1)(1 − c) = −x(c − 1)^2    (228)

   and hence x = y = 0 (for a nontrivial rotation, with s ≠ 0). We can check this in Matlab:

S=solve(’c*x-s*y=x’,’s*x+c*y=y’)
S =
x: [1x1 sym]
y: [1x1 sym]
>> S.x
0
>> S.y
0

7.2 Eigenvectors by hand


PRIVATE

In matlab, we find
>> [V,D]=eig([2 0; 0 3])
V =
1 0
0 1
D =
2 0
0 3

so λ_1 = 2, λ_2 = 3, v_1 = (1, 0)^T, v_2 = (0, 1)^T. Let us check this by hand. We want to solve

   A v_i = λ_i v_i    (229)
   (A − λ_i I) v_i = 0    (230)

The characteristic equation is

   |A − λI| = 0    (231)
   | 2 − λ   0;  0   3 − λ | = 0    (232)
   (2 − λ)(3 − λ) = 0    (233)

so we see λ_1 = 2, λ_2 = 3. Solving for the eigenvectors

   [ 2  0;  0  3 ] [v_i1; v_i2] = λ_i [v_i1; v_i2]    (234)

yields, for λ_1 = 2,

   2 v_11 = 2 v_11,   3 v_12 = 2 v_12    (235)

and, for λ_2 = 3,

   2 v_21 = 3 v_21,   3 v_22 = 3 v_22    (236)

so v_11 = 1, v_12 = 0, v_21 = 0, v_22 = 1.

8 Solutions
8.1 Subderivative of the hinge loss function
PRIVATE

At x = 0 and x = 2 the function is differentiable and ∂f (0) = {−1} and ∂f (2) = {0}. At x = 1 there is a
kink, and ∂f (1) = [−1, 0].

8.2 EM for the Student distribution


At first blush, it might not be apparent why EM can be used, since there is no missing data. The key idea
is to introduce an “artificial” hidden or auxiliary variable in order to simplify the algorithm. In particular,
we will exploit the fact that a Student distribution can be written as a Gaussian scale mixture [AM74;
Wes87] as follows:

   T (x_n|µ, Σ, ν) = ∫ N (x_n|µ, Σ/z_n) Ga(z_n| ν/2, ν/2) dz_n    (237)

This can be thought of as an “infinite” mixture of Gaussians, each one with a slightly different covariance
matrix.
Treating the z_n as missing data, we can write the complete data log likelihood as

   ℓ_c(θ) = Σ_{n=1}^N [ log N (x_n|µ, Σ/z_n) + log Ga(z_n|ν/2, ν/2) ]    (238)
          = Σ_{n=1}^N [ −(D/2) log(2π) − (1/2) log |Σ| − (z_n/2) δ_n + (ν/2) log(ν/2) − log Γ(ν/2)    (239)
                        + (ν/2)(log z_n − z_n) + (D/2 − 1) log z_n ]    (240)

where we have defined the Mahalanobis distance to be

   δ_n = (x_n − µ)^T Σ^{−1} (x_n − µ)    (241)

We can partition this into two terms, one involving µ and Σ, and the other involving ν. We have, dropping
irrelevant constants,

   ℓ_c(θ) = L_N(µ, Σ) + L_G(ν)    (242)
   L_N(µ, Σ) ≜ −(1/2) N log |Σ| − (1/2) Σ_{n=1}^N z_n δ_n    (243)
   L_G(ν) ≜ −N log Γ(ν/2) + (1/2) N ν log(ν/2) + (1/2) ν Σ_{n=1}^N (log z_n − z_n)    (244)

Let us first derive the algorithm with ν assumed known, for simplicity. In this case, we can ignore the L_G
term, so we only need to figure out how to compute E[z_n] wrt the old parameters.
One can show that

   p(z_n|x_n, θ) = Ga(z_n | (ν + D)/2, (ν + δ_n)/2)    (245)

Now if z_n ∼ Ga(a, b), then E[z_n] = a/b. Hence the E step at iteration t is

   z̄_n^{(t)} ≜ E[z_n|x_n, θ^{(t)}] = (ν^{(t)} + D) / (ν^{(t)} + δ_n^{(t)})    (246)

The M step is obtained by maximizing E[L_N(µ, Σ)] to yield

   µ̂^{(t+1)} = Σ_n z̄_n^{(t)} x_n / Σ_n z̄_n^{(t)}    (247)
   Σ̂^{(t+1)} = (1/N) Σ_n z̄_n^{(t)} (x_n − µ̂^{(t+1)})(x_n − µ̂^{(t+1)})^T    (248)
             = (1/N) [ Σ_{n=1}^N z̄_n^{(t)} x_n x_n^T − ( Σ_n z̄_n^{(t)} ) µ̂^{(t+1)} (µ̂^{(t+1)})^T ]    (249)

These results are quite intuitive: the quantity z̄_n is the precision of measurement n, so if it is small,
the corresponding data point is down-weighted when estimating the mean and covariance. This is how the
Student achieves robustness to outliers.
To compute the MLE for the degrees of freedom, we first need to compute the expectation of L_G(ν),
which involves z_n and log z_n. Now if z_n ∼ Ga(a, b), then one can show that

   ℓ̄_n^{(t)} ≜ E[log z_n|θ^{(t)}] = ψ(a) − log b    (250)

where ψ(x) = (d/dx) log Γ(x) is the digamma function. Hence, from Equation (245), we have

   ℓ̄_n^{(t)} = Ψ( (ν^{(t)} + D)/2 ) − log( (ν^{(t)} + δ_n^{(t)})/2 )    (251)
            = log(z̄_n^{(t)}) + Ψ( (ν^{(t)} + D)/2 ) − log( (ν^{(t)} + D)/2 )    (252)

Substituting into Equation (244), we have

   E[L_G(ν)] = −N log Γ(ν/2) + (Nν/2) log(ν/2) + (ν/2) Σ_n ( ℓ̄_n^{(t)} − z̄_n^{(t)} )    (253)

The gradient of this expression is equal to

   (d/dν) E[L_G(ν)] = −(N/2) Ψ(ν/2) + (N/2) log(ν/2) + N/2 + (1/2) Σ_n ( ℓ̄_n^{(t)} − z̄_n^{(t)} )    (254)

This has a unique solution in the interval (0, +∞] which can be found using a 1d constrained optimizer.
Performing a gradient-based optimization in the M step, rather than a closed-form update, is an example
of what is known as the generalized EM algorithm.
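To make the updates concrete, here is a minimal 1d sketch of the E and M steps (246)-(248) with ν held fixed (not part of the original solution; the toy data are made up):

% EM for a 1d Student likelihood with fixed degrees of freedom nu.
x  = [randn(100,1); 10 + 2*randn(5,1)];   % toy data with a few outliers
nu = 4; mu = mean(x); sig2 = var(x);      % initialize at the Gaussian estimates
for it = 1:100
  delta = (x - mu).^2 / sig2;             % squared Mahalanobis distances (D = 1)
  zbar  = (nu + 1) ./ (nu + delta);       % E step, Eq. (246)
  mu    = sum(zbar .* x) / sum(zbar);     % M step, Eq. (247)
  sig2  = sum(zbar .* (x - mu).^2) / length(x);   % M step, Eq. (248)
end
[mu, sig2]    % mu stays near 0 despite the outliers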

Part II
Linear models

9 Solutions
9.1 Derivation of Fisher’s linear discriminant
We have

   f = w^T S_B w    (255)
   f ′ = 2 S_B w    (256)
   g = w^T S_W w    (257)
   g ′ = 2 S_W w    (258)

   dJ(w)/dw = [ (2 S_B w)(w^T S_W w) − (w^T S_B w)(2 S_W w) ] / (w^T S_W w)^2 = 0    (259)

Hence

   (S_B w)(w^T S_W w) = (w^T S_B w)(S_W w)    (260)

Writing a = w^T S_W w and b = w^T S_B w, this is

   a S_B w = b S_W w    (261)
   S_B w = (b/a) S_W w    (262)

10 Solutions
10.1 Gradient and Hessian of log-likelihood for multinomial logistic regression
1. Let us drop the i subscript for simplicity. Let S = Σ_k e^{η_k} be the denominator of the softmax.

      ∂µ_k/∂η_j = (∂/∂η_j)[e^{η_k}] S^{−1} + e^{η_k} · (∂S/∂η_j) · (−S^{−2})    (263)
                = (δ_kj e^{η_k}) S^{−1} − e^{η_k} e^{η_j} S^{−2}    (264)
                = (e^{η_k}/S) ( δ_kj − e^{η_j}/S )    (265)
                = µ_k (δ_kj − µ_j)    (266)

2. We have

      ∇_{w_j} ℓ = Σ_i Σ_k (∂ℓ/∂µ_ik)(∂µ_ik/∂η_ij)(∂η_ij/∂w_j) = Σ_i Σ_k (y_ik/µ_ik) µ_ik (δ_jk − µ_ij) x_i    (267)
                = Σ_i Σ_k y_ik (δ_jk − µ_ij) x_i = Σ_i y_ij x_i − Σ_i ( Σ_k y_ik ) µ_ij x_i    (268)
                = Σ_i (y_ij − µ_ij) x_i    (269)

3. We consider a single term x_i in the log likelihood; we can sum over i at the end. Using the Jacobian
   expression from above, we have

      ∇_{w_{c′}} (∇_{w_c} ℓ)^T = ∇_{w_{c′}} ((y_ic − µ_ic) x_i^T)    (270)
                              = −(∇_{w_{c′}} µ_ic) x_i^T    (271)
                              = −(µ_ic (δ_{c,c′} − µ_{i,c′}) x_i) x_i^T    (272)
10.2 Regularizing separate terms in 2d logistic regression


PRIVATE

1. Unregularized solution: see Figure 1(a). There is a unique ML decision boundary which makes 0 errors,
but there are several possible decision boundaries which all makes 0 errors.
2. Heavily regularizing w0 sets w0 = 0, so the line must go through the origin. There are several possible
lines with different slopes, but all will make 1 error. see Figure 1(b).
3. Regularizing w1 makes the line horizontal since x1 is ignored. This incurs 2 errors. see Figure 1(c).
4. Regularizing w2 makes the line vertical, since x2 is ignored. This incurs 0 errors. see Figure 1(d).

10.3 Logistic regression vs LDA/QDA


PRIVATE

1. GaussI ≤ LinLog. Both have logistic (sigmoid) posteriors p(y|x, w) = σ(ywT x), but LinLog is the
logistic model which is trained to maximize p(y|x, w). (GaussI may have high joint p(y, x), but this
does not necessarily mean p(y|x) is high; LinLog can achieve the maximum of p(y|x), so will necessarily
do at least as well as GaussI.)

Figure 1: (a) Unregularized. (b) Regularizing w_0. (c) Regularizing w_1. (d) Regularizing w_2.

2. GaussX ≤ QuadLog. Both have logistic posteriors with quadratic features, but QuadLog is the model
of this class maximizing the average log probabilities.
3. LinLog ≤ QuadLog. Logistic regression models with linear features are a subclass of logistic regression
models with quadratic functions. The maximum from the superclass is at least as high as the maximum
from the subclass.
4. GaussI ≤ QuadLog. Follows from above inequalities.
5. Although one might expect that higher log likelihood results in better classification performance, in
general, having higher average log p(y|x) does not necessarily translate to higher or lower classification
error. For example, consider linearly separable data. We have L(linLog) > L(GaussI), since maximum
likelihood logistic regression will set the weights to infinity, to maximize the probability of the correct
labels (hence p(yi |xi , ŵ) = 1 for all i). However, we have R(linLog) = R(gaussI), since the data is
linearly separable. (The GaussI model may or may not set σ very small, resulting in possibly very large
class conditional pdfs; however, the posterior over y is a discrete pmf, and can never exceed 1.)
As another example, suppose the true label is always 1 (as opposed to 0), but model M always predicts
p(y = 1|x, M ) = 0.49. It will always misclassify, but it is at least close to the decision boundary.
By contrast, there might be another model M 0 that predicts p(y = 1|x, M 0 ) = 1 on even-numbered
inputs, and p(y = 1|x, M 0 ) = 0 on odd-numbered inputs. Clearly R(M 0 ) = 0.5 < R(M ) = 1, but
L(M 0 ) = −∞ < L(M ) = log(0.49).

11 Solutions
11.1 Multi-output linear regression
PRIVATE

We have

   Ŵ = (Φ^T Φ)^{−1} Φ^T Y    (273)

   Φ^T = [ 1 1 1 0 0 0;  0 0 0 1 1 1 ]    (274)

   Φ^T Φ = [ 3 0;  0 3 ]    (275)

   Y = [ −1 −1;  −1 −2;  −2 −1;  1 1;  1 2;  2 1 ]    (276)

   Φ^T Y = [ −4 −4;  4 4 ]    (277)

   (Φ^T Φ)^{−1} = (1/3) I_2    (278)

   Ŵ = (Φ^T Φ)^{−1} Φ^T Y = [ −4/3 −4/3;  4/3 4/3 ]    (279)
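This is easy to verify numerically (not part of the original solution):

% Verify Eqs. (273)-(279).
Phi = [1 0; 1 0; 1 0; 0 1; 0 1; 0 1];
Y   = [-1 -1; -1 -2; -2 -1; 1 1; 1 2; 2 1];
What = (Phi'*Phi) \ (Phi'*Y)     % [-4/3 -4/3; 4/3 4/3]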

11.2 Centering and ridge regression


Suppose X is centered, so x̄ = 0. Then

   J(w, w_0) = (y − Xw − w_0 1)^T (y − Xw − w_0 1) + λ w^T w    (280)
             = y^T y + w^T (X^T X) w − 2 y^T (Xw) + λ w^T w + [ −2 w_0 1^T y + 2 w_0 1^T Xw + w_0^2 1^T 1 ]    (281)

Consider the terms in brackets:

   w_0 1^T y = w_0 n ȳ    (282)
   w_0 1^T Xw = w_0 Σ_i x_i^T w = n w_0 x̄^T w = 0    (283)
   w_0^2 1^T 1 = n w_0^2    (284)

Optimizing wrt w_0 we find

   (∂/∂w_0) J(w, w_0) = −2nȳ + 2nw_0 = 0    (285)
   ŵ_0 = ȳ    (286)

Optimizing wrt w we find

   (∂/∂w) J(w, ŵ_0) = [ 2 X^T Xw − 2 X^T y ] + 2λw = 0    (287)
   w = (X^T X + λI)^{−1} X^T y    (288)
11.3 Partial derivative of the RSS
PRIVATE

1. We have

      RSS(w) = Σ_i (r_i − w_k x_ik)(r_i − w_k x_ik)    (289)
             = Σ_i [ r_i^2 + w_k^2 x_ik^2 − 2 w_k x_ik r_i ]    (290)

   So

      (∂/∂w_k) RSS(w) = Σ_i [ 2 w_k x_ik^2 − 2 x_ik r_i ]    (291)

   Let us check this matches the standard expression

      ∇_w = 2 X^T Xw − 2 X^T y    (292)

   Recall that [X^T X]_{jk} = Σ_i x_ij x_ik. For simplicity, suppose there are two features. Then

      [∇_w]_1 = 2 [ Σ_i x_i1^2,  Σ_i x_i1 x_i2 ] w − 2 Σ_i x_i1 y_i = 2 w_1 Σ_i x_i1^2 + 2 Σ_i x_i1 (w_2 x_i2 − y_i) = a_1 w_1 − c_1    (293)

2. Setting the derivative to zero, we have

      w_k ( Σ_i x_ik^2 ) = Σ_i x_ik r_i    (294)

   so

      w_k = x_{:,k}^T r / (x_{:,k}^T x_{:,k})    (295)

11.4 Reducing elastic net to lasso


We have

   J_1(w) = y^T y + (Xw)^T (Xw) − 2 y^T (Xw) + λ_2 w^T w + λ_1 |w|_1    (296)

and

   J_2(w) = y^T y + c^2 (Xw)^T (Xw) − 2c y^T (Xw) + λ_2 c^2 w^T w + c λ_1 |w|_1 = J_1(cw)    (297)

11.5 Shrinkage in linear regression


PRIVATE

1. From the book, we have that ŵ_k^{OLS} = x_k^T y = c_k/2. Hence the solid line (1) is OLS and has slope 1/2.
   Also, ŵ_k^{ridge} = ŵ_k^{OLS}/(1 + λ_2) = c_k/(2(1 + λ_2)). Hence the dotted line (2) is ridge and has slope 1/4. Finally,
   lasso must be the dashed line (3), since it sets coefficients to zero if −λ_1 ≤ c_k ≤ λ_1.

2. λ_1 = 1

3. λ_2 = 1

11.6 EM for mixture of linear regression experts
In the E step, we compute the conditional responsibilities

   r_nk = p(z_n = k|x_n, y_n) = p(y_n|x_n, z_n = k) p(z_n = k|x_n) / p(y_n|x_n)    (298)

In the M step, we update the parameters of the gating function by maximizing the weighted likelihood

   ℓ(θ_g) = Σ_n Σ_k r_nk log p(z_n = k|x_n, θ_g)    (299)

and the parameters of the k’th expert by maximizing the weighted likelihood

   ℓ(θ_k) = Σ_n r_nk log p(y_n|x_n, z_n = k, θ_k)    (300)

If the gating function and experts are linear models, these M steps correspond to convex subproblems that
can be solved efficiently.
For example, consider a mixture of linear regression experts using logistic regression gating functions. In
the M step, we need to maximize Q(θ, θ^old) wrt w_k, σ_k^2 and V. For the regression parameters for model k,
the objective has the form

   Q(θ_k, θ^old) = Σ_{n=1}^N r_nk [ −(1/σ_k^2) (y_n − w_k^T x_n)^2 ]    (301)

We recognize this as a weighted least squares problem, which makes intuitive sense: if r_nk is small, then data
point n will be downweighted when estimating model k’s parameters. From the weighted least squares section,
we can immediately write down the MLE as

   w_k = (X^T R_k X)^{−1} X^T R_k y    (302)

where R_k = diag(r_{:,k}). The MLE for the variance is given by

   σ_k^2 = Σ_{n=1}^N r_nk (y_n − w_k^T x_n)^2 / Σ_{n=1}^N r_nk    (303)

We replace the estimate of the unconditional mixing weights π with the estimate of the gating parameters,
V. The objective has the form

   ℓ(V) = Σ_n Σ_k r_nk log π_{n,k}    (304)

We recognize this as equivalent to the log-likelihood for multinomial logistic regression, except we replace the
“hard” 1-of-C encoding y_i with the “soft” 1-of-K encoding r_i. Thus we can estimate V by fitting a logistic
regression model to soft target labels.
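For concreteness, here is a minimal sketch of the M step for a single expert, Eqs. (302)-(303) (not part of the original solution; the data and responsibilities below are made up):

% Weighted least squares M step for one expert.
N = 20;
X = [ones(N,1), randn(N,1)];              % design matrix with a bias column
y = X*[1; 2] + 0.1*randn(N,1);            % toy targets
r = rand(N,1);                            % responsibilities r_{nk} for this expert
Rk = diag(r);
wk = (X'*Rk*X) \ (X'*Rk*y)                          % Eq. (302)
sigma2k = sum(r .* (y - X*wk).^2) / sum(r)          % Eq. (303)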

12 Solutions

Part III
Deep neural networks

13 Solutions
13.1 Backpropagation for a 1 layer MLP
To compute δ_1, let L = CrossEntropyWithLogits(y, a). Then

   δ_1 = ∂L/∂a = (p − y)^T    (305)

where p = S(a).
To compute δ_2, we have

   δ_2 = ∂L/∂z = (∂L/∂a)(∂a/∂h)(∂h/∂z) = δ_1 (∂a/∂h)(∂h/∂z)    (306)
       = δ_1 U (∂h/∂z)           since a = Uh + b_2    (307)
       = δ_1 U ◦ ReLU′(z)        since h = ReLU(z)    (308)
       = δ_1 U ◦ H(h)    (309)

Now we compute the gradients wrt the parameters:

   ∂L/∂U = (∂L/∂a)(∂a/∂U) = δ_1 h^T    (310)
   ∂L/∂b_2 = (∂L/∂a)(∂a/∂b_2) = δ_1    (311)
   ∂L/∂W = (∂L/∂z)(∂z/∂W) = δ_2 x^T    (312)
   ∂L/∂b_1 = (∂L/∂z)(∂z/∂b_1) = δ_2    (313)
   ∂L/∂x = (∂L/∂z)(∂z/∂x) = δ_2 W    (314)
14 Solutions

15 Solutions

Part IV
Nonparametric models

16 Solutions

17 Solutions
17.1 Fitting an SVM classifier by hand
PRIVATE

1. The perpendicular to the decision boundary is a line through φ(x_1) and φ(x_2), and hence is parallel to
   φ(x_2) − φ(x_1) = (1, 2, 2) − (1, 0, 0) = (0, 2, 2). Of course, any scalar multiple of this is acceptable, e.g.,
   (0, 1, 1).

2. There are 2 support vectors, namely the two data points. The decision boundary will be half way
   between them. This midpoint is m = ((1, 2, 2) + (1, 0, 0))/2 = (1, 1, 1). The distance of each of the training
   points to this midpoint is

      ||φ(x_1) − m|| = ||φ(x_2) − m|| = ||(0, 1, 1)|| = √(0^2 + 1^2 + 1^2) = √2    (315)

   Hence the margin is √2.

3. We have w = (0, 1/2, 1/2), which is parallel to (0, 2, 2) and has ||w|| = √((1/2)^2 + (1/2)^2) = 1/√2 as
   required.

4. Using the equations we have

      −1([1, 0, 0]^T w + w_0) = 1    (316)
      +1([1, 2, 2]^T w + w_0) = 1    (317)

   Hence w_0 = −1.

5. We have

      w_0 + w^T φ(x) = −1 + (0, 1/2, 1/2)^T (1, √2 x, x^2) = −1 + (√2/2) x + (1/2) x^2    (318)
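A tiny check of Eq. (318) (not part of the original solution): assuming the feature map φ(x) = (1, √2 x, x^2) implied by Eq. (318), the two training points correspond to x = 0 and x = √2, where the decision function should evaluate to −1 and +1.

% Evaluate the learned decision function at the two training inputs.
f = @(x) -1 + (sqrt(2)/2)*x + 0.5*x.^2;
[f(0), f(sqrt(2))]    % -1 and +1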

18 Solutions

Part V
Beyond supervised learning

19 Solutions
19.1 Information gain equations
To see this, let us define p_n ≜ p(θ|D) as the current belief state, and p_{n+1} ≜ p(θ|D, x, y) as the belief state
after observing (x, y). Then

   KL (p_{n+1}∥p_n) = ∫ p_{n+1} log (p_{n+1}/p_n) dθ    (319)
                    = ∫ p_{n+1} log p_{n+1} dθ − ∫ p_{n+1} log p_n dθ    (320)
                    = − H(p_{n+1}) − ∫ p_{n+1} log p_n dθ    (321)

Next we need to take expectations wrt p(y|x, D). We will use the following identity:

   Σ_y p(y|x, D) ∫ p_{n+1} log p_n dθ    (322)
      = Σ_y p(y|x, D) ∫ p(θ|D, x, y) log p(θ|D) dθ    (323)
      = ∫ Σ_y p(y, θ|x, D) log p(θ|D) dθ    (324)
      = ∫ p(θ|D) log p(θ|D) dθ    (325)

Using this, we have

   U ′(x) = E_{p(y|x,D)} [ − H(p_{n+1}) ] − ∫ p_n log p_n dθ    (326)
          = E_{p(y|x,D)} [ − H(p(θ|x, y, D)) ] + H(p(θ|D)) = U (x)    (327)

20 Solutions
20.1 EM for FA
In the E step, we can compute the posterior for z_n as follows:

   p(z_n|x_n, θ) = N (z_n|m_n, Σ_n)    (328)
   Σ_n ≜ (I + W^T Ψ^{−1} W)^{−1}    (329)
   m_n ≜ Σ_n (W^T Ψ^{−1} (x_n − µ))    (330)

We now discuss the M step. We initially assume µ = 0. Using the trace trick we have

   Σ_i E[ (x̃_i − Wz_i)^T Ψ^{−1} (x̃_i − Wz_i) ]
      = Σ_i [ x̃_i^T Ψ^{−1} x̃_i + E[z_i^T W^T Ψ^{−1} W z_i] − 2 x̃_i^T Ψ^{−1} W E[z_i] ]    (331)
      = Σ_i [ tr(Ψ^{−1} x̃_i x̃_i^T) + tr(Ψ^{−1} W E[z_i z_i^T] W^T)    (332)
              − tr(2 Ψ^{−1} W E[z_i] x̃_i^T) ]    (333)
      ≜ tr(Ψ^{−1} G(W))    (334)

Hence the expected complete data log likelihood is given by

   Q = (N/2) log |Ψ^{−1}| − (1/2) tr(Ψ^{−1} G(W))    (335)

Using the chain rule and the facts that (∂/∂W) tr(W^T A) = A and (∂/∂W) tr(W^T A W) = (A + A^T) W we have

   ∇_W Q(W) = −(1/2) Ψ^{−1} ∇_W G(W) = 0    (336)
   ∇_W G(W) = 2 W Σ_i E[z_i z_i^T] − 2 ( Σ_i E[z_i] x̃_i^T )^T    (337)
   W_mle = [ Σ_i x̃_i E[z_i]^T ] [ Σ_i E[z_i z_i^T] ]^{−1}    (338)

Using the facts that ∇_X log |X| = X^{−T} and ∇_X tr(XA) = A^T we have

   ∇_{Ψ^{−1}} Q = (N/2) Ψ − (1/2) G(W_mle) = 0    (339)
   Ψ = (1/N) diag(G(W_mle))    (340)

We can simplify this as follows, by plugging in the MLE (this simplification no longer holds if we use MAP
estimation). First note that

   Σ_i E[z_i z_i^T] W_mle^T = Σ_i E[z_i] x̃_i^T    (341)

so

   Ψ = (1/N) diag( Σ_i [ x̃_i x̃_i^T + W_mle E[z_i] x̃_i^T − 2 W_mle E[z_i] x̃_i^T ] )    (342)
     = (1/N) diag( Σ_i x̃_i x̃_i^T − W_mle Σ_i E[z_i] x̃_i^T )    (343)
To estimate µ and W at the same time, we can define W̃ = (W, µ) and z̃ = (z, 1). Also, define

   b_n ≜ E[z̃|x_n] = [m_n; 1]    (344)
   Ω_n ≜ E[z̃ z̃^T|x_n] = [ E[z z^T|x_n]   E[z|x_n];   E[z|x_n]^T   1 ]    (345)

Then the M step is as follows:

   W̃ ← [ Σ_n x_n b_n^T ] [ Σ_n Ω_n ]^{−1}    (346)
   Ψ̂ = (1/N) diag{ Σ_n (x_n − W̃ b_n) x_n^T }    (347)

It is interesting to apply the above equations to the PPCA case in the limit where σ^2 → 0. This provides
an alternative way to fit PCA models, as shown by [Row97]. Let Z̃ be a L × N matrix storing the posterior
means (low-dimensional representations) along its columns. Similarly, let X̃ = X^T be an D × N matrix
storing the original data along its columns. From the PPCA posterior, when σ^2 = 0, we have

   Z̃ = (W^T W)^{−1} W^T X̃    (348)

This constitutes the E step. Notice that this is just an orthogonal projection of the data.
From Equation (346), the M step is given by

   Ŵ = [ Σ_n x_n E[z_n]^T ] [ Σ_n E[z_n] E[z_n]^T ]^{−1}    (349)

where we exploited the fact that Σ = Cov[z_n|x_n, θ] = 0I when σ^2 = 0. It is worth comparing this expression
to the MLE for multi-output linear regression, which has the form W = ( Σ_n y_n x_n^T )( Σ_n x_n x_n^T )^{−1}. Thus we
see that the M step is like linear regression where we replace the observed inputs by the expected values of
the latent variables.
In summary, here is the entire algorithm:

• E step: Z̃ = (W^T W)^{−1} W^T X̃

• M step: W = X̃ Z̃^T (Z̃ Z̃^T)^{−1}

[TB99] showed that the only stable fixed point of the EM algorithm is the globally optimal solution.
That is, the EM algorithm converges to a solution where W spans the same linear subspace as that defined
by the first L eigenvectors. However, if we want W to be orthogonal, and to contain the eigenvectors in
descending order of eigenvalue, we have to orthogonalize the resulting matrix (which can be done quite
cheaply). Alternatively, we can modify EM to give the principal basis directly [AO03].
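Here is a small sketch of the σ^2 → 0 EM iteration summarized above (not part of the original solution); it checks that the learned W spans the leading principal subspace of some made-up data.

% EM for PCA: E step Z = (W'W)\(W'X), M step W = X*Z'/(Z*Z').
D = 5; L = 2; N = 200;
Xraw = randn(N, D) * randn(D, D);                 % toy data, one example per row
X = (Xraw - repmat(mean(Xraw), N, 1))';           % center; store examples in columns
W = randn(D, L);                                  % random initialization
for it = 1:200
  Z = (W'*W) \ (W'*X);                            % E step, Eq. (348)
  W = (X*Z') / (Z*Z');                            % M step
end
[V, E] = eig(cov(X'));                            % eigendecomposition of the covariance
[~, idx] = sort(diag(E), 'descend');
U = V(:, idx(1:L));                               % top-L eigenvectors
subspace(W, U)                                    % approximately 0: same subspace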

20.2 EM for mixFA


PRIVATE
We state the results below without proof (the derivation can be found in [GH96]).
In the E step, we compute the posterior responsibility of cluster c for data point n using

   r_nc ≜ p(s_n = c|x_n, θ) ∝ π_c N (x_n|µ_c, W_c W_c^T + Ψ)    (350)

The conditional posterior for z_n is given by

   p(z_n|x_n, s_n = c, θ) = N (z_n|m_nc, Σ_nc)    (351)
   Σ_nc ≜ (I + W_c^T Ψ_c^{−1} W_c)^{−1}    (352)
   m_nc ≜ Σ_nc (W_c^T Ψ_c^{−1} (x_n − µ_c))    (353)

For the M step, we define W̃_c = (W_c, µ_c), and z̃ = (z, 1). Also, define

   b_nc ≜ E[z̃|x_n, c_n = c] = [m_nc; 1]    (354)
   Ω_nc ≜ E[z̃ z̃^T|x_n, c_n = c] = [ E[z z^T|x_n, c_n = c]   E[z|x_n, c_n = c];   E[z|x_n, c_n = c]^T   1 ]    (355)

Then the M step is as follows:

   W̃_c ← [ Σ_n r_nc x_n b_nc^T ] [ Σ_n r_nc Ω_nc ]^{−1}    (356)
   Ψ̂ = (1/N) diag{ Σ_{nc} r_nc (x_n − W̃_c b_nc) x_n^T }    (357)
   π̂_c = (1/N) Σ_n r_nc    (358)

20.3 Deriving the second principal component


1. Dropping terms that no not involve z2 we have
n n
1 X 1 X
−2zi2 v2T (xi − zi1 v1 ) + zi2
2 T
−2zi2 v2T xi + zi2
2
(359)
 
J= v2 v2 =
n i=1 n i=1

since v2T v2 = 1 and v1T v2 = 0. Hence


∂J
= −2v2T xi + 2zi2 = 0 (360)
∂zi2
so
zi2 = v2T xi (361)

2. We have
∂ J˜
= −2Cv2 + 2λ2 v2 + λ12 v1 = 0 (362)
∂v2
Premultiplying by v1T yields

0 = −2v1T Cv2 + 2λ2 v1T v2 + λ12 v1T v1 (363)

Now v1T Cv2 = v1T (λ1 v2 ) = 0, and v1T v2 = 0, and v1T v1 = 1, so λ12 = 0. Hence

0 = −2Cv2 + 2λ2 v2 (364)


Cv2 = λ2 v2 (365)

So v2 is an eigenvector of C. Since we want to maximize the variance, we want to pick the eigenvector
with the largest eigenvalue, but the first one is already taken. Hence v2 is the evector with the second
largest evalue.

20.4 Deriving the residual error for PCA
PRIVATE

1. From the previous exercise we have

      (x_i − z_{i1} v_1 − z_{i2} v_2)^T (x_i − z_{i1} v_1 − z_{i2} v_2) = x_i^T x_i − 2 z_{i1} v_1^T x_i − 2 z_{i2} v_2^T x_i + z_{i1}^2 + z_{i2}^2    (366)
         = x_i^T x_i − z_{i1}^2 − z_{i2}^2    (367)
         = x_i^T x_i − v_1^T x_i x_i^T v_1 − v_2^T x_i x_i^T v_2    (368)

   This generalizes to K > 2 in the obvious way.

2. Hence

      J_K = (1/n) Σ_{i=1}^n [ x_i^T x_i − Σ_{j=1}^K v_j^T x_i x_i^T v_j ]    (369)
          = (1/n) Σ_{i=1}^n x_i^T x_i − Σ_j v_j^T ( (1/n) Σ_{i=1}^n x_i x_i^T ) v_j    (370)
          = (1/n) Σ_{i=1}^n x_i^T x_i − Σ_j v_j^T C v_j    (371)
          = (1/n) Σ_{i=1}^n x_i^T x_i − Σ_{j=1}^K λ_j    (372)

3. If K = d there is no truncation and J_d = 0. Hence

      0 = (1/n) Σ_{i=1}^n x_i^T x_i − Σ_{j=1}^K λ_j − Σ_{j=K+1}^d λ_j = J_K − Σ_{j=K+1}^d λ_j    (373)

   Hence the residual error arising from only using K < d terms is

      J_K = Σ_{j=K+1}^d λ_j    (374)

20.5 PCA via successive deflation


1. We have

      C̃ = (1/n) (I − v_1 v_1^T) X^T X (I − v_1 v_1^T)    (375)
        = (1/n) (X^T X − v_1 v_1^T X^T X)(I − v_1 v_1^T)    (376)
        = (1/n) [ X^T X − v_1 (v_1^T X^T X) − (X^T X v_1) v_1^T + v_1 (v_1^T X^T X v_1) v_1^T ]    (377)
        = (1/n) [ X^T X − v_1 (nλ_1 v_1^T) − (nλ_1 v_1) v_1^T + v_1 (v_1^T nλ_1 v_1) v_1^T ]    (378)
        = (1/n) [ X^T X − nλ_1 v_1 v_1^T − nλ_1 v_1 v_1^T + nλ_1 v_1 v_1^T ]    (379)
        = (1/n) X^T X − λ_1 v_1 v_1^T    (380)
2. Since X̃ lives in the d − 1 subspace orthogonal to v1 , the vector u must be orthogonal to v1 . Hence
uT v1 = 0 and uT u = 1, so u = v2 .
3. We have

function [V, lambda] = simplePCA(C, K, f)
% f(C) should return the leading eigenvalue and eigenvector of C
d = length(C);
V = zeros(d,K);
lambda = zeros(1,K);
for j=1:K
  [lambda(j), V(:,j)] = f(C);
  C = C - lambda(j)*V(:,j)*V(:,j)'; % deflation
end

20.6 PPCA variance terms


Define A = (Λ_K − σ^2 I)^{1/2}.

1. We have

      v^T C v = v^T (U A R R^T A^T U^T + σ^2 I) v = v^T σ^2 I v = σ^2    (381)

2. We have

      u_i^T C u_i = u_i^T (U A R R^T A^T U^T + σ^2 I) u_i    (382)
                  = u_i^T U (Λ − σ^2 I) U^T u_i + σ^2    (383)
                  = e_i^T (Λ − σ^2 I) e_i + σ^2 = λ_i − σ^2 + σ^2 = λ_i    (384)

20.7 Posterior inference in PPCA


PRIVATE

Using Bayes rule for Gaussians with the substitutions Λ = I, L^{−1} = σ^2 I, A = W, b = µ, µ = 0, we have

   Cov[z|y] = (I + W^T (1/σ^2) W)^{−1} = σ^2 (σ^2 I + W^T W)^{−1} = σ^2 M^{−1}    (385)

and

   E[z|y] = Cov[z|y] W^T σ^{−2} (x − µ) = M^{−1} W^T (x − µ)    (386)



20.8 Imputation in a FA model


PRIVATE

Letting C = WW^T + Ψ, we have

   p(x_h|x_v, θ) = N (x_h | µ_h + C_hv C_vv^{−1} (x_v − µ_v),  C_hh − C_hv C_vv^{−1} C_vh)    (387)

20.9 Efficiently evaluating the PPCA density


Since C is not full rank, we can use the matrix inversion lemma to invert it efficiently:

   C^{−1} = (1/σ^2) [ I − W (W^T W + σ^2 I)^{−1} W^T ]    (388)

Plugging in the MLE we find

   W = U_K (Λ_K − σ^2 I)^{1/2}    (389)

   C^{−1} = (1/σ^2) [ I − U_K (Λ_K − σ^2 I) Λ_K^{−1} U_K^T ]    (390)
          = (1/σ^2) [ I − U_K J U_K^T ]    (391)
   J = diag(1 − σ^2/λ_j)    (392)

Similarly it can be shown that

   log |WW^T + σ^2 I| = (d − K) log σ^2 + Σ_{i=1}^K log λ_i    (393)
21 Solutions

22 Solutions

23 Solutions
