1 Solutions
Part I
Foundations
2 Solutions
2.1 Conditional independence
so (i) and (ii) are obviously sufficient. (iii) is also sufficient, because we can compute $P(E_1, E_2)$ using normalization.
We have $p(X_1) = p(X_2) = p(X_3) = 0.5$. Also, $p(X_1, X_2) = p(X_2, X_3) = p(X_1, X_3) = 0.25$. Hence $p(X_1, X_2) = p(X_1)p(X_2)$, and similarly for the other pairs. Hence the events are pairwise independent. However, $p(X_1, X_2, X_3) = 0 \neq 1/8 = p(X_1)p(X_2)p(X_3)$, so the events are not mutually independent.
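A quick numerical check of this phenomenon (a Python sketch, not part of the original solution): take the sample space $\Omega = \{1,2,3,4\}$ with equal probabilities and the events $E_1 = \{1,2\}$, $E_2 = \{1,3\}$, $E_3 = \{2,3\}$; these specific events are an illustrative assumption, chosen to reproduce the probabilities quoted above.

import itertools

# Four equally likely outcomes; X_i = 1 iff the outcome lands in E_i.
E = {1: {1, 2}, 2: {1, 3}, 3: {2, 3}}
outcomes = [1, 2, 3, 4]

def prob(event):
    # probability of a set of outcomes under the uniform distribution
    return len(event) / len(outcomes)

for i in E:
    print(f"p(X{i}=1) =", prob(E[i]))                     # each 0.5
for i, j in itertools.combinations(E, 2):
    print(f"p(X{i}=1, X{j}=1) =", prob(E[i] & E[j]))      # each 0.25, so pairwise independent
print("p(X1=1, X2=1, X3=1) =", prob(E[1] & E[2] & E[3]))  # 0, not 1/8: not mutually independent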
Factorization ⇒ Independence. If $p(x,y|z) = g(x,z)h(y,z)$ then
$1 = \sum_{x,y} p(x,y|z) = \sum_{x,y} g(x,z)h(y,z) = \left(\sum_x g(x,z)\right)\left(\sum_y h(y,z)\right)$   (4)
$p(x|z) = \sum_y p(x,y|z) = \sum_y g(x,z)h(y,z) = g(x,z)\sum_y h(y,z)$   (5)
$p(y|z) = \sum_x p(x,y|z) = \sum_x g(x,z)h(y,z) = h(y,z)\sum_x g(x,z)$   (6)
$p(x|z)p(y|z) = g(x,z)h(y,z)\left(\sum_y h(y,z)\right)\left(\sum_x g(x,z)\right)$   (7)
By (4) the product of the two sums equals one, so $p(x|z)p(y|z) = g(x,z)h(y,z) = p(x,y|z)$, which establishes the conditional independence.
We can bring out the dependency on $x_1$ by rearranging the quadratic form inside the exponent as follows:
$\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{y-x_1-\mu_2}{\sigma_2}\right)^2 = (w_1+w_2)(x_1-\hat{x})^2 + \frac{w_1w_2}{w_1+w_2}\left(y-(\mu_1+\mu_2)\right)^2$   (13)
where we have defined the precision or weighting terms $w_1 = 1/\sigma_1^2$, $w_2 = 1/\sigma_2^2$, and the term
$\hat{x} = \frac{w_1\mu_1 + w_2(y-\mu_2)}{w_1+w_2}$   (14)
Note that
$\frac{w_1w_2}{w_1+w_2} = \frac{1}{\sigma_1^2\sigma_2^2}\cdot\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2} = \frac{1}{\sigma_1^2+\sigma_2^2}$   (15)
Hence
$p(y) = \frac{1}{2\pi\sigma_1\sigma_2}\int \exp\left[-\frac{1}{2}(w_1+w_2)(x_1-\hat{x})^2 - \frac{1}{2}\frac{w_1w_2}{w_1+w_2}\left(y-(\mu_1+\mu_2)\right)^2\right] dx_1$   (16)
The integral over $x_1$ becomes one over the normalization constant for the Gaussian:
$\int \exp\left[-\frac{1}{2}(w_1+w_2)(x_1-\hat{x})^2\right] dx_1 = (2\pi)^{\frac{1}{2}}\left(\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}\right)^{\frac{1}{2}}$   (17)
Hence
$p(y) = (2\pi)^{-1}(\sigma_1^2\sigma_2^2)^{-\frac{1}{2}}\,(2\pi)^{\frac{1}{2}}\left(\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}\right)^{\frac{1}{2}}\exp\left[-\frac{1}{2(\sigma_1^2+\sigma_2^2)}(y-\mu_1-\mu_2)^2\right]$   (18)
$= (2\pi)^{-\frac{1}{2}}\left(\sigma_1^2+\sigma_2^2\right)^{-\frac{1}{2}}\exp\left[-\frac{1}{2(\sigma_1^2+\sigma_2^2)}(y-\mu_1-\mu_2)^2\right]$   (19)
so the convolution of the two Gaussians is again Gaussian, $y\sim\mathcal{N}(\mu_1+\mu_2,\ \sigma_1^2+\sigma_2^2)$.
2.8 Mean, mode, variance for the beta distribution
For the mode we can use simple calculus, as follows. Let f (x) = Beta(x|a, b). We have
$0 = \frac{df}{dx}$   (31)
$= (a-1)x^{a-2}(1-x)^{b-1} - (b-1)x^{a-1}(1-x)^{b-2}$   (32)
(dropping the normalization constant, which does not affect the location of the mode). Dividing through by $x^{a-2}(1-x)^{b-2}$ gives
$0 = (a-1)(1-x) - (b-1)x$   (33)
$x = \frac{a-1}{a+b-2}$   (34)
For the mean we have
$E[\theta|D] = \int_0^1 \theta\, p(\theta|D)\, d\theta$   (35)
$= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_0^1 \theta^{(a+1)-1}(1-\theta)^{b-1}\, d\theta$   (36)
$= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+1+b)} = \frac{a}{a+b}$   (37)
where we used the normalization constant of the Beta distribution (a ratio of Gamma functions) and the fact that $\Gamma(x+1) = x\Gamma(x)$.
We can find the variance in the same way, by first showing that $E[\theta^2|D] = \frac{a(a+1)}{(a+b)(a+b+1)}$, from which $V[\theta|D] = E[\theta^2|D] - (E[\theta|D])^2 = \frac{ab}{(a+b)^2(a+b+1)}$.
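These closed forms are easy to check numerically; here is a small sketch using scipy (the shape parameters a = 3, b = 5 are arbitrary illustrative values with a, b > 1 so the mode formula applies).

import numpy as np
from scipy.stats import beta

a, b = 3.0, 5.0   # hypothetical shape parameters

mode_formula = (a - 1) / (a + b - 2)
mean_formula = a / (a + b)
var_formula = a * b / ((a + b) ** 2 * (a + b + 1))

xs = np.linspace(1e-6, 1 - 1e-6, 200001)
mode_numeric = xs[np.argmax(beta.pdf(xs, a, b))]

print(mode_formula, mode_numeric)      # both ~0.3333
print(mean_formula, beta.mean(a, b))   # both 0.375
print(var_formula, beta.var(a, b))     # both ~0.02604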
We are asked to compute P (D = 1|T = 1), which we can do using Bayes’ rule:
$P(D=1|T=1) = \frac{P(T=1|D=1)P(D=1)}{P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0)}$   (43)
$= \frac{0.99\times 0.0001}{0.99\times 0.0001 + 0.01\times 0.9999}$   (44)
$= 0.009804$   (45)
So although you are much more likely to have the disease (given that you have tested positive) than a random
member of the population, you are still unlikely to have it.
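The same arithmetic as a short sketch (base rate 1/10,000 and test accuracy 0.99, as in the exercise):

# Bayes rule for the diagnostic test: P(D=1 | T=1)
prior = 0.0001          # P(D=1), base rate of the disease
sens = 0.99             # P(T=1 | D=1), sensitivity
false_pos = 0.01        # P(T=1 | D=0), false positive rate

posterior = sens * prior / (sens * prior + false_pos * (1 - prior))
print(posterior)        # 0.009804..., matching (45)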
2.10 Legal reasoning
Let E be the evidence (the observed blood type), and I be the event that the defendant is innocent, and
G = ¬I be the event that the defendant is guilty.
1. The prosecutor is confusing p(E|I) with p(I|E). We are told that p(E|I) = 0.01 but the relevant
quantity is p(I|E). By Bayes rule, this is
$p(I|E) = \frac{p(E|I)p(I)}{p(E|I)p(I) + p(E|G)p(G)} = \frac{0.01\,p(I)}{0.01\,p(I) + (1-p(I))}$   (46)
since p(E|G) = 1 and p(G) = 1 − p(I). So we cannot determine p(I|E) without knowing the prior
probability p(I). So p(E|I) = p(I|E) only if p(G) = p(I) = 0.5, which is hardly a presumption of
innocence.
To understand this more intuitively, consider the following isomorphic problem (from http://en.wikipedia.org/wiki/Prosecutor's_fallacy):
A big bowl is filled with a large but unknown number of balls. Some of the balls are made of
wood, and some of them are made of plastic. Of the wooden balls, 100 are white; out of the
plastic balls, 99 are red and only 1 is white. A ball is pulled out at random, and observed
to be white.
Without knowledge of the relative proportions of wooden and plastic balls, we cannot tell how likely it
is that the ball is wooden. If the number of plastic balls is far larger than the number of wooden balls,
for instance, then a white ball pulled from the bowl at random is far more likely to be a white plastic
ball than a white wooden ball — even though white plastic balls are a minority of the whole set of
plastic balls.
2. The defender is quoting p(G|E) while ignoring p(G). The prior odds are
$\frac{p(G)}{p(I)} = \frac{1}{799{,}999}$   (47)
2.11 Probabilities are sensitive to the form of the question that was used to generate the answer
1. The event space is shown below, where X is one child and Y the other.
X Y Prob.
G G 1/4
G B 1/4
B G 1/4
B B 1/4
Let Ng be the number of girls and Nb the number of boys. We have the constraint (side information)
that Nb + Ng = 2 and 0 ≤ Nb , Ng ≤ 2. We are told Nb ≥ 1 and are asked to compute the probability of
the event Ng = 1 (i.e., one child is a girl). By Bayes rule we have
$p(N_g=1|N_b\geq 1) = \frac{p(N_b\geq 1|N_g=1)\,p(N_g=1)}{p(N_b\geq 1)}$   (49)
$= \frac{1\times 1/2}{3/4} = 2/3$   (50)
2. Let Y be the identity of the observed child and X be the identity of the other child. We want
p(X = g|Y = b). By Bayes rule we have
$p(X=g|Y=b) = \frac{p(Y=b|X=g)\,p(X=g)}{p(Y=b)}$   (51)
$= \frac{(1/2)\times(1/2)}{1/2} = 1/2$   (52)
Tom Minka [Min98] has written the following about these results:
This seems like a paradox because it seems that in both cases we could condition on the fact that
"at least one child is a boy." But that is not correct; you must condition on the event actually
observed, not its logical implications. In the first case, the event was "He said yes to my question."
In the second case, the event was "One child appeared in front of me." The generating distribution
is different for the two events. Probabilities reflect the number of possible ways an event can
happen, like the number of roads to a town. Logical implications are further down the road and
may be reached in more ways, through different towns. The different number of ways changes the
probability.
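Minka's point, that the answer depends on the observation process and not just its logical implications, is easy to verify by simulation; a sketch:

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
# Each family has two children, each independently a boy (1) or a girl (0).
kids = rng.integers(0, 2, size=(N, 2))

# Case 1: we learn (e.g., by asking) that at least one child is a boy.
at_least_one_boy = kids.sum(axis=1) >= 1
exactly_one_girl = kids.sum(axis=1) == 1
print(exactly_one_girl[at_least_one_boy].mean())   # ~2/3

# Case 2: a randomly chosen child appears and happens to be a boy.
which = rng.integers(0, 2, size=N)
observed = kids[np.arange(N), which]
other = kids[np.arange(N), 1 - which]
print((other[observed == 1] == 0).mean())          # ~1/2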
3 Solutions
3.1 Uncorrelated does not imply independent
We have
$E[X] = 0$   (62)
$V[X] = (1-(-1))^2/12 = 1/3$   (63)
$E[Y] = E[X^2] = V[X] + (E[X])^2 = 1/3$   (64)
$E[XY] = E[X^3] = \int_{-1}^{1}\frac{1}{2}x^3\,dx = \frac{1}{2}\left[\int_{-1}^{0}x^3\,dx + \int_{0}^{1}x^3\,dx\right]$   (65)
$= \frac{1}{2}\left(\frac{1}{4}[x^4]_{-1}^{0} + \frac{1}{4}[x^4]_{0}^{1}\right) = \frac{1}{8}(-1+1) = 0$   (66)
where we have split the integral into two pieces, since $x^3$ changes sign in the interval $[-1,1]$. Hence
$\rho = \frac{\mathrm{Cov}[X,Y]}{\sigma_X\sigma_Y} = \frac{E[XY]-E[X]E[Y]}{\sigma_X\sigma_Y} = 0$   (67)
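A Monte Carlo sketch of this example ($X$ uniform on $[-1,1]$, $Y = X^2$): the sample correlation is near zero even though $Y$ is a deterministic function of $X$.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])                  # ~0: uncorrelated
# ...yet clearly dependent: conditioning on |x| being small pins down y almost exactly.
print(np.var(y[np.abs(x) < 0.1]), np.var(y))    # conditional variance << marginal variance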
$\mathrm{Cov}[X,Y] = E[XY] - E[X]E[Y] = a(\mu^2+\sigma^2) + b\mu - \mu(a\mu+b) = a\sigma^2$   (76)
$\rho = \frac{\mathrm{Cov}[X,Y]}{\sqrt{V[X]V[Y]}} = \frac{a\sigma^2}{\sigma\sqrt{a^2\sigma^2}} = 1$   (77)
for $a>0$. If $Y = aX+b$ where $a<0$, we find $\rho = -1$, since $a/\sqrt{a^2} = a/|a| = -1$.
3.4 Linear combinations of random variables
1. Let $y = Ax$. Then
$\mathrm{Cov}[y] = A\,\mathrm{Cov}[x]\,A^T = A\Sigma A^T$   (81)
2. $C = AB$ has entries $c_{ij} = \sum_k a_{ik}b_{kj}$. The diagonal elements of $C$ (when $i=j$) are given by $\sum_k a_{ik}b_{ki}$. So the sum of the diagonal elements is $\mathrm{tr}(AB) = \sum_{i}\sum_k a_{ik}b_{ki}$, which is symmetric in $A$ and $B$; hence $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.
3. We have
$E\left[x^TAx\right] = \mathrm{tr}(A\Sigma) + m^TAm$   (84)
$V[Y] = E[V[Y|W]] + V[E[Y|W]]$   (86)
$= E\left[W^2\,V[X]\right] + V\left[W\,E[X]\right]$   (87)
$= E\left[W^2\right] + 0 = 1$   (88)
Then we have
$E[Y] = E[E[Y|W]]$   (93)
$= \sum_{w\in\{-1,1\}} p(w)\,E[Y|w] = 0$   (94)
since $E[Y|w] = w\,E[X] = 0$. Hence
$\mathrm{Cov}[X,Y] = E[XY] - E[X]E[Y] = 0$   (97)
So X and Y are uncorrelated even though they are dependent.
3.6 Normalization constant for a multidimensional Gaussian
Let Σ = UΛUT , so
$\Sigma^{-1} = U^{-T}\Lambda^{-1}U^{-1} = U\Lambda^{-1}U^{T} = \sum_{i=1}^{p}\frac{1}{\lambda_i}u_iu_i^T$   (98)
Hence
$(x-\mu)^T\Sigma^{-1}(x-\mu) = (x-\mu)^T\left(\sum_{i=1}^{p}\frac{1}{\lambda_i}u_iu_i^T\right)(x-\mu)$   (99)
$= \sum_{i=1}^{p}\frac{1}{\lambda_i}(x-\mu)^Tu_iu_i^T(x-\mu) = \sum_{i=1}^{p}\frac{y_i^2}{\lambda_i}$   (100)
where $y_i \triangleq u_i^T(x-\mu)$. The $y$ variables define a new coordinate system that is shifted (by $\mu$) and rotated (by $U$) with respect to the original $x$ coordinates: $y = U^T(x-\mu)$. Hence $x = Uy + \mu$.
The Jacobian of this transformation, from $y$ to $x$, is a matrix with elements
$J_{ij} = \frac{\partial x_i}{\partial y_j} = U_{ij}$   (101)
so $J = U$ and $|\det J| = 1$.
So
$\int \exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)dx = |\det J|\int\prod_i\exp\left(-\frac{1}{2}\frac{y_i^2}{\lambda_i}\right)dy_i$   (102)
$= \prod_i\sqrt{2\pi\lambda_i} = |2\pi\Sigma|^{\frac{1}{2}}$   (103)
Now we just apply the equations. The posterior precision is a sum of the precisions of each sensor:
$\Sigma_{\mu|y}^{-1} = A^T\Sigma_y^{-1}A = \begin{pmatrix}1 & 1\end{pmatrix}\begin{pmatrix}\frac{v_1}{n_1} & 0\\ 0 & \frac{v_2}{n_2}\end{pmatrix}^{-1}\begin{pmatrix}1\\ 1\end{pmatrix} = \frac{n_1}{v_1} + \frac{n_2}{v_2}$   (107)
The posterior mean is a weighted sum of the observed values from each sensor:
$\mu_{\mu|y} = \Sigma_{\mu|y}\,A^T\Sigma_y^{-1}\begin{pmatrix}y_1\\ y_2\end{pmatrix} = \Sigma_{\mu|y}\left(\frac{n_1y_1}{v_1} + \frac{n_2y_2}{v_2}\right)$   (108)
3.8 Show that the Student distribution can be written as a Gaussian scale mixture
$p(w|\mu,a,b) = \int_0^\infty \mathcal{N}(w|\mu,\alpha^{-1})\,\mathrm{Ga}(\alpha|a,b)\,d\alpha$   (109)
$= \int_0^\infty \frac{b^ae^{-b\alpha}\alpha^{a-1}}{\Gamma(a)}\left(\frac{\alpha}{2\pi}\right)^{\frac{1}{2}}\exp\left(-\frac{\alpha}{2}(w-\mu)^2\right)d\alpha$   (110)
$= \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{\frac{1}{2}}\int_0^\infty \alpha^{a-\frac{1}{2}}\exp\left(-\alpha\left[b+\frac{1}{2}(w-\mu)^2\right]\right)d\alpha$   (111)
Let us define $\Delta \triangleq b + (w-\mu)^2/2$ and $z = \alpha\Delta$. Then, using $dz = \Delta\,d\alpha$ and the definition of the Gamma function, $\int_0^\infty u^{x-1}e^{-u}\,du = \Gamma(x)$, the integral becomes
$\int_0^\infty \left(\frac{z}{\Delta}\right)^{a+\frac{1}{2}-1}e^{-z}\,\Delta^{-1}\,dz = \Delta^{-a-\frac{1}{2}}\,\Gamma\!\left(a+\tfrac{1}{2}\right)$   (112)
so we have
$p(w|\mu,a,b) = \frac{\Gamma(a+1/2)}{\Gamma(a)}\,b^a\left(\frac{1}{2\pi}\right)^{\frac{1}{2}}\Delta^{-a-\frac{1}{2}}$   (113)
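Result (113) is the Student density with $\nu = 2a$ degrees of freedom and squared scale $b/a$. A numerical sanity check (a sketch, with arbitrary illustrative parameter values) that compares the mixture integral (109) against scipy's Student-t under that identification:

import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, a, b = 0.5, 3.0, 2.0   # hypothetical parameter values
w = 1.7                    # evaluation point

# Left-hand side of (109): integrate N(w | mu, 1/alpha) Ga(alpha | shape a, rate b) over alpha.
def integrand(alpha):
    return stats.norm.pdf(w, loc=mu, scale=1.0 / np.sqrt(alpha)) * \
           stats.gamma.pdf(alpha, a, scale=1.0 / b)

lhs, _ = quad(integrand, 0.0, np.inf)

# Student-t with nu = 2a degrees of freedom and scale sqrt(b/a).
rhs = stats.t.pdf(w, df=2 * a, loc=mu, scale=np.sqrt(b / a))

print(lhs, rhs)   # the two numbers agree to numerical precision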
4 Solutions
4.1 MLE for the univariate Gaussian
Starting with the mean, we have
$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_i(x_i-\mu) = 0$   (118)
$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_i = \overline{x}$   (119)
1. Since the mean and mode of a Gaussian are the same, we can just write down the MAP estimate using
the standard equations for the posterior mean of a Gaussian, which is
$E[\mu|D] = \frac{ns^2}{ns^2+\sigma^2}\,\overline{x} + \frac{\sigma^2}{ns^2+\sigma^2}\,m$   (124)
However, we can also derive the MAP estimate explicitly, as follows. The prior distribution over the mean is
$p(\mu) = (2\pi s^2)^{-1/2}\exp\left[-\frac{(\mu-m)^2}{2s^2}\right]$   (125)
Since the samples $x_i$ are taken to be independent, we have
$p(D|\mu) \propto \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right]$   (126)
Taking the derivative of the log posterior with respect to $\mu$:
$\frac{\partial\log(p(\mu)p(D|\mu))}{\partial\mu} = 0 - \frac{2(\mu-m)}{2s^2} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}2(x_i-\mu)(-1)$   (128)
$= -\frac{\mu-m}{s^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu)$   (129)
$= -\frac{\mu}{s^2} + \frac{m}{s^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n}x_i - \frac{n\mu}{\sigma^2}$   (130)
Setting this to zero and solving for $\mu$ recovers the posterior mean (124).
2. Notice that as $n$ increases, while $s$, $m$ and $\sigma$ remain constant, we have $\frac{\sigma^2}{ns^2+\sigma^2}\to 0$ while $\frac{ns^2}{ns^2+\sigma^2}\to 1$, yielding $\hat{\mu}_{MAP}\to\frac{\sum_i x_i}{n} = \hat{\mu}_{MLE}$. That is, when there are many samples, the prior knowledge becomes less and less relevant, and all MAP estimators (for any prior) converge to the maximum likelihood estimator $\hat{\mu}_{MLE}$. This is not surprising: as we have more data, it outweighs our prior speculations.
3. As the prior variance $s^2$ increases, even for a small number of samples $n$, we have $\frac{\sigma^2}{ns^2+\sigma^2}\to 0$ while $\frac{ns^2}{ns^2+\sigma^2}\to 1$, again yielding $\hat{\mu}_{MAP}\to\frac{\sum_i x_i}{n} = \hat{\mu}_{MLE}$. That is, when we have an uninformative, almost uniform, prior, we are left only with our data to base our estimation on.
4. On the other hand, if the prior variance is very small, $s^2\to 0$, then we have $\frac{\sigma^2}{ns^2+\sigma^2}\to 1$ while $\frac{ns^2}{ns^2+\sigma^2}\to 0$, yielding $\hat{\mu}_{MAP}\to m$. That is, if our prior is very concentrated, we essentially already know the answer, and can ignore the data.
where $\Phi$ is the cumulative distribution function for the standard normal $\mathcal{N}(0,1)$ distribution, $\Phi^{-1}(0.025)$ is the value below which 2.5% of the probability mass lies (in Matlab, norminv(0.025)=-1.96), and $\Phi^{-1}(0.975)$ is the value below which 97.5% of the probability mass lies (in Matlab, norminv(0.975)=1.96). We want to find $n$ such that
$u - \ell = 1$   (138)
Hence we solve
$2(1.96)\,\sigma_n = 1$   (139)
$\sigma_n^2 = \frac{1}{4(1.96)^2}$   (140)
where
$\sigma_n^2 = \frac{\sigma^2\sigma_0^2}{n\sigma_0^2+\sigma^2}$   (141)
Hence
$\mathrm{BIC} = -\frac{N}{2}\left(D + \log|\hat{\Sigma}|\right) - \frac{d}{2}\log N$   (148)
where $d = D + D(D+1)/2$.
For the diagonal covariance model,
$\mathrm{tr}(\hat{\Sigma}_{\mathrm{diag}}^{-1}S) = \mathrm{tr}(\hat{\Sigma}_{\mathrm{diag}}^{-1}S_{\mathrm{diag}}) = \mathrm{tr}(I_D) = D$
because the off-diagonal elements of the matrix product get ignored by the trace. Hence
$\mathrm{BIC} = -\frac{N}{2}\left(D + \log|\hat{\Sigma}_{\mathrm{diag}}|\right) - \frac{d}{2}\log N$   (149)
where $d = D + D = 2D$ parameters (for the mean and diagonal).
4.5 BIC for a 2d discrete distribution
1. The joint distribution is p(x, y|θ) = p(x|θ1 )p(y|x, θ2 ):
y=0 y=1
x=0 (1 − θ1 )θ2 (1 − θ1 )(1 − θ2 )
x=1 θ1 (1 − θ2 ) θ1 θ2
2. The log likelihood is
$\log p(D|\theta) = \sum_i\log p(x_i|\theta_1) + \sum_i\log p(y_i|x_i,\theta_2)$   (150)
For $M_4$, we have dof = 3 because of the sum-to-one constraint, so
$\mathrm{BIC}(M_4) = 6\log\left(\tfrac{2}{7}\right) + 1\log\left(\tfrac{1}{7}\right) - \tfrac{3}{2}\log 7 = -12.3814$   (160)
So BIC also prefers m2 .
$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}$   (161)
$= \frac{p(D|\theta)\sum_k p(Z=k)p(\theta|Z=k)}{\int p(D|\theta')\sum_{k'}p(Z=k')p(\theta'|Z=k')\,d\theta'}$   (162)
$= \frac{\sum_k p(Z=k)\,p(D,\theta|Z=k)}{\sum_{k'}p(Z=k')\int p(D,\theta'|Z=k')\,d\theta'}$   (163)
$= \frac{\sum_k p(Z=k)\,p(\theta|D,Z=k)\,p(D|Z=k)}{\sum_{k'}p(Z=k')\,p(D|Z=k')}$   (164)
$= \sum_k\frac{p(Z=k)\,p(D|Z=k)}{\sum_{k'}p(Z=k')\,p(D|Z=k')}\,p(\theta|D,Z=k)$   (165)
$= \sum_k p(Z=k|D)\,p(\theta|D,Z=k)$   (166)
where $p(Z=k) = \pi_k$ are the prior mixing weights, and $p(Z=k|D)$ are the posterior mixing weights given by
$p(Z=k|D) = \frac{p(Z=k)\,p(D|Z=k)}{\sum_{k'}p(Z=k')\,p(D|Z=k')}$   (167)
4.7 ML estimator $\sigma^2_{mle}$ is biased
Because the variance of any random variable R is given by var(R) = E[R2 ]−(E[R])2 , the expected value of the
square of a Gaussian random variable Xi with mean µ and variance σ 2 is E[Xi2 ] = var(Xi )+(E[Xi ])2 = σ 2 +µ2 .
$E_{X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma)}\left[\hat{\sigma}^2(X_1,\dots,X_n)\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_i-\frac{\sum_{j=1}^nX_j}{n}\right)^2\right]$
$= \frac{1}{n}\sum_{i=1}^{n}E\left[\left(X_i-\frac{\sum_{j=1}^nX_j}{n}\right)^2\right]$
$= \frac{1}{n}\sum_{i=1}^{n}E\left[\left(X_i-\frac{\sum_{j=1}^nX_j}{n}\right)\left(X_i-\frac{\sum_{j=1}^nX_j}{n}\right)\right]$
$= \frac{1}{n}\sum_{i=1}^{n}E\left[X_i^2 - \frac{2}{n}X_i\sum_{j=1}^nX_j + \frac{1}{n^2}\sum_{j=1}^n\sum_{k=1}^nX_jX_k\right]$
$= \frac{1}{n}\sum_{i=1}^{n}E\left[X_i^2\right] - \frac{2}{n^2}\sum_{i=1}^n\sum_{j=1}^nE\left[X_iX_j\right] + \frac{1}{n^3}\sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^nE\left[X_jX_k\right]$
Consider the two summations $\sum_{i=1}^n\sum_{j=1}^nE[X_iX_j]$ and $\sum_{j=1}^n\sum_{k=1}^nE[X_jX_k]$. Of the $n^2$ terms in each of these summations, $n$ of them satisfy $i=j$ or $j=k$, so these terms are of the form $E[X_i^2]$. By linearity of expectation, these terms contribute $nE[X_i^2]$ to the sum. The remaining $n^2-n$ terms are of the form $E[X_iX_j]$ or $E[X_jX_k]$ for $i\neq j$ or $j\neq k$. Because the $X_i$ are independent samples, it follows from linearity of expectation that these terms contribute $(n^2-n)E[X_i]E[X_j] = (n^2-n)\mu^2$ to the summation. Hence
$\sum_{i=1}^n\sum_{j=1}^nE[X_iX_j] = \sum_{j=1}^n\sum_{k=1}^nE[X_jX_k] = n(\sigma^2+\mu^2) + (n^2-n)\mu^2$
Substituting back into the previous expression gives
$E\left[\hat{\sigma}^2\right] = (\sigma^2+\mu^2) - \frac{1}{n^2}\left[n(\sigma^2+\mu^2) + (n^2-n)\mu^2\right] = \frac{n-1}{n}\sigma^2$
Since the expected value of $\hat{\sigma}^2(X_1,\dots,X_n)$ is $\frac{n-1}{n}\sigma^2$, which is not equal to the actual variance $\sigma^2$, $\hat{\sigma}^2$ is not an unbiased estimator. In fact, the maximum likelihood estimator tends to underestimate the variance. This is not surprising: consider the case of only a single sample: we will never detect any variance. If there are multiple samples, we will detect variance, but since our estimate for the mean will tend to be shifted from the true mean in the direction of our samples, we will tend to underestimate the variance.
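A quick Monte Carlo sketch of this bias (the settings $\mu = 2$, $\sigma^2 = 9$, $n = 5$ are arbitrary illustrative values): the average of $\hat{\sigma}^2_{MLE}$ over many replicates is close to $\frac{n-1}{n}\sigma^2 = 7.2$, not $9$.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 2.0, 9.0, 5, 200_000   # hypothetical settings

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_mle = X.var(axis=1, ddof=0)    # (1/n) sum (x_i - xbar)^2
sigma2_unb = X.var(axis=1, ddof=1)    # (1/(n-1)) sum (x_i - xbar)^2

print(sigma2_mle.mean())   # ~ (n-1)/n * sigma2 = 7.2 (biased low)
print(sigma2_unb.mean())   # ~ sigma2 = 9.0 (unbiased)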
random variable:
$E_{X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma)}\left[\hat{\sigma}^2(X_1,X_2,\dots,X_n)\right] = \frac{1}{n}\sum_{i=1}^{n}E_{X_i\sim\mathcal{N}(\mu,\sigma)}\left[(X_i-\mu)^2\right]$
$= \frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}[\mathcal{N}(\mu,\sigma)]$
$= \frac{1}{n}\sum_{i=1}^{n}\sigma^2$
$= \sigma^2$
4.9 Variance and MSE of the unbiased estimator for Gaussian variance
We have
$V\left[\frac{N-1}{\sigma^2}\,\sigma^2_{unb}\right] = 2(N-1)$   (168)
$\frac{(N-1)^2}{\sigma^4}\,V\left[\sigma^2_{unb}\right] = 2(N-1)$   (169)
$V\left[\sigma^2_{unb}\right] = \frac{2\sigma^4}{N-1}$   (170)
Hence the variance of the ML estimator is
$V\left[\hat{\sigma}^2_{ML}\right] = V\left[\frac{N-1}{N}\,\hat{\sigma}^2_{N-1}\right] = \left(\frac{N-1}{N}\right)^2\frac{2\sigma^4}{N-1} = \frac{2(N-1)}{N^2}\sigma^4$   (171)
Finally, we have
$\mathrm{MSE}(\sigma^2_{mle}) = \mathrm{bias}(\sigma^2_{mle})^2 + V\left[\sigma^2_{mle}\right]$   (172)
$= \left(\frac{-\sigma^2}{N}\right)^2 + \frac{2(N-1)}{N^2}\sigma^4$   (173)
$= \frac{2N-1}{N^2}\sigma^4$   (174)
and
$\mathrm{MSE}(\sigma^2_{unb}) = \mathrm{bias}(\sigma^2_{unb})^2 + V\left[\sigma^2_{unb}\right]$   (175)
$= 0 + \frac{2\sigma^4}{N-1}$   (176)
5 Solutions
5.1 Reject option in classifiers
1. We have to choose between rejecting, with risk $\lambda_r$, and choosing the most probable class, $j_{max} = \arg\max_j p(Y=j|x)$, which has risk
$\lambda_s\sum_{j\neq j_{max}}p(Y=j|x) = \lambda_s\left(1 - p(Y=j_{max}|x)\right)$   (177)
$E\pi(Q) = \int_Q^\infty (P-C)Q\,f(D)\,dD + \int_0^Q\left[(P-C)D - C(Q-D)\right]f(D)\,dD$   (182)
$= (P-C)Q(1-F(Q)) + P\int_0^Q D\,f(D)\,dD - CQ\int_0^Q f(D)\,dD$   (183)
$= (P-C)Q(1-F(Q)) + P\int_0^Q D\,f(D)\,dD - CQ\,F(Q)$   (184)
Hence
$\frac{d}{dQ}E\pi(Q) = (P-C)(1-F(Q)) - (P-C)Q\,f(Q) + PQ\,f(Q) - CF(Q) - CQ\,f(Q)$   (185)
$= (P-C) - P\,F(Q)$   (186)
$= 0$   (187)
So
$F(Q^*) = \frac{P-C}{P}$   (188)
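Equation (188) says the optimal order quantity is the $(P-C)/P$ quantile of the demand distribution. A sketch with hypothetical numbers (price P = 10, cost C = 4, exponential demand with mean 100; all illustrative), checking the analytic optimum against a brute-force search over Q:

import numpy as np
from scipy import stats

P, C = 10.0, 4.0                       # hypothetical price and cost
demand = stats.expon(scale=100.0)      # hypothetical demand distribution f(D)

# Analytic optimum from (188): F(Q*) = (P - C) / P
Q_star = demand.ppf((P - C) / P)

# Brute force: Monte Carlo estimate of expected profit P*min(D,Q) - C*Q as a function of Q
rng = np.random.default_rng(0)
D = rng.exponential(100.0, size=200_000)
def expected_profit(Q):
    return np.mean(P * np.minimum(D, Q) - C * Q)

Qs = np.linspace(1.0, 400.0, 400)
Q_brute = Qs[np.argmax([expected_profit(Q) for Q in Qs])]

print(Q_star, Q_brute)   # both near 100 * log(2.5) ~ 91.6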
5.3 Bayes factors and ROC curves
p(H1 |D) is a monotonically increasing function of B, and therefore cannot affect the shape of the ROC
curve. This makes sense intuitively, since, in the two class case, it should not matter whether we threshold the
ratio p(H1 |D)/p(H0 |D), or the posterior, p(H1 |D), since they contain the same information, just measured
on different scales.
where $\phi'(a,\theta) = \frac{d}{da}\phi(a,\theta)$. Applying this (Leibniz's rule) to the first integral in Equation 190, with $A(a)=a$, $B(a)=\infty$, and $\phi(a,\theta) = (\theta-a)p(\theta|x)$, we have
$\frac{d}{da}\int_a^\infty(\theta-a)p(\theta|x)\,d\theta = \int_a^\infty -p(\theta|x)\,d\theta + 0 + 0$   (192)
Hence
$\rho'(a|x) = -\int_a^\infty p(\theta|x)\,d\theta + \int_{-\infty}^a p(\theta|x)\,d\theta$   (194)
$= -P(\theta\geq a|x) + P(\theta\leq a|x) = 0$   (195)
so the optimal $a$ satisfies $P(\theta\leq a|x) = P(\theta\geq a|x) = 0.5$, i.e., $a$ is the posterior median.
6 Solutions
6.1 Expressing mutual information in terms of entropies
$I(X,Y) = \sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$   (197)
$= \sum_{x,y}p(x,y)\log\frac{p(x|y)}{p(x)}$   (198)
$= -\sum_{x,y}p(x,y)\log p(x) + \sum_{x,y}p(x,y)\log p(x|y)$   (199)
$= -\sum_x p(x)\log p(x) - \left(-\sum_{x,y}p(x,y)\log p(x|y)\right)$   (200)
$= -\sum_x p(x)\log p(x) - \left(-\sum_y p(y)\sum_x p(x|y)\log p(x|y)\right)$   (201)
Simplifying yields $I(X,Y) = H(X) - H(X|Y)$.
6.2 Relationship between D(p||q) and χ2 statistic
We have, writing $p(x) = q(x) + \Delta(x)$ so that $\Delta(x) = p(x) - q(x)$,
$D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$   (208)
$= \sum_x (q(x)+\Delta(x))\log\left(1+\frac{\Delta(x)}{q(x)}\right)$   (209)
$= \sum_x (q(x)+\Delta(x))\left(\frac{\Delta(x)}{q(x)} - \frac{\Delta(x)^2}{2q(x)^2} + \cdots\right)$   (210)
$= \sum_x\left(\Delta(x) + \frac{\Delta(x)^2}{q(x)} - \frac{\Delta(x)^2}{2q(x)} + \cdots\right)$   (211)
$= 0 + \sum_x\frac{\Delta(x)^2}{2q(x)} + \cdots$   (212)
since
$\sum_x\Delta(x) = \sum_x p(x) - \sum_x q(x) = 0$   (213)
Here is p(x, y) and the marginals p(x) and p(y) written in the margins (hence the name).
x p(y)
1 2 3 4
1 1/8 1/16 1/32 1/32 1/4
y 2 1/16 1/8 1/32 1/32 1/4
3 1/16 1/16 1/16 1/16 1/4
4 1/4 0 0 0 1/4
p(x) 1/2 1/4 1/8 1/8
Here is p(x|y) and H(X|y)
x H(X|y)
1 2 3 4
1 1/2 1/4 1/8 1/8 7/4
y 2 1/4 1/2 1/8 1/8 7/4
3 1/4 1/4 1/4 1/4 2
4 1 0 0 0 0
3. H(X|y) = 1.75, 1.75, 2.0, 0.0. Note that H(X|Y = 4) = 0 < H(X), so observing Y = 4 decreases
our entropy, but H(X|Y = 3) = 2.0 > H(X), so observing Y = 3 increases our entropy. Also,
H(X|Y = 1) = H(X|Y = 2) = H(X), so observing Y = 1 or Y = 2 does not change our entropy (but
does change our posterior distribution!).
4. H(X|Y ) = 11/8 = 1.375 < H(X), so on average, we are less uncertain after seeing Y .
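A sketch that recomputes these quantities directly from the joint table (entropies in bits):

import numpy as np

# Joint p(x, y) from the table above; rows index y = 1..4, columns index x = 1..4.
p_xy = np.array([[1/8, 1/16, 1/32, 1/32],
                 [1/16, 1/8, 1/32, 1/32],
                 [1/16, 1/16, 1/16, 1/16],
                 [1/4, 0, 0, 0]])

p_x = p_xy.sum(axis=0)    # marginal over x: (1/2, 1/4, 1/8, 1/8)
p_y = p_xy.sum(axis=1)    # marginal over y: (1/4, 1/4, 1/4, 1/4)

def H(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

print(H(p_x))                                          # H(X) = 7/4
print([H(row / row.sum()) for row in p_xy])            # H(X|y) = 1.75, 1.75, 2.0, 0.0
H_X_given_Y = sum(p_y[i] * H(p_xy[i] / p_y[i]) for i in range(4))
print(H_X_given_Y)                                     # H(X|Y) = 11/8
print(H(p_x) - H_X_given_Y)                            # I(X;Y) = 3/8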
6.4 Forwards vs reverse KL divergence
1. We have
$\mathrm{KL}(p\|q) = \sum_{x,y}p(x,y)\left[\log p(x,y) - \log q(x) - \log q(y)\right]$   (214)
$= \sum_{x,y}p(x,y)\log p(x,y) - \sum_x p(x)\log q(x) - \sum_y p(y)\log q(y)$   (215)
We can optimize wrt $q(x)$ and $q(y)$ separately. Imposing a Lagrange multiplier to enforce the constraint that $\sum_x q(x) = 1$ we have the Lagrangian
$L(q,\lambda) = \sum_x p(x)\log q(x) + \lambda\left(1 - \sum_x q(x)\right)$   (216)
Taking derivatives wrt $q(x)$ (thinking of the function as a finite length vector, for simplicity), we have
$\frac{\partial L}{\partial q(x)} = \frac{p(x)}{q(x)} - \lambda = 0$   (217)
$q(x) = \frac{p(x)}{\lambda}$   (218)
Summing both sides over $x$ we get $\lambda = 1$ and hence $q(x) = p(x)$; by the same argument $q(y) = p(y)$. That is, minimizing the forwards KL matches the marginals of $p$.
2. We require q(x, y) = 0 whenever p(x, y) = 0, otherwise log q(x, y)/p(x, y) = ∞. Since q(x, y) =
qx (x)qy (y), it must be that qx (x) = qy (y) whenever x = y, and hence qx = qy are the same distribution.
There are only 3 possible distributions that put 0s in the right places and yet sum to 1. The first is:
x
1 2 3 4 q(y)
1 0 0 0 0 0
y 2 0 0 0 0 0
3 0 0 1 0 1
4 0 0 0 0 0
q(x) 0 0 1 0
The second one is
x
1 2 3 4 q(y)
1 0 0 0 0 0
y 2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 1 1
q(x) 0 0 0 1
For both of these, we have $\mathrm{KL}(q\|p) = 1\times\log\frac{1}{1/4} = \log 4$. Furthermore, any slight perturbation of these probabilities away from the designated values will cause the KL to blow up, meaning these are local minima.
The final local optimum is
x
1 2 3 4 q(y)
1 1/4 1/4 0 0 1/2
y 2 1/4 1/4 0 0 1/2
3 0 0 0 0 0
4 0 0 0 0 0
q(x) 1/2 1/2 0 0
This has $\mathrm{KL}(q\|p) = 4\left(\frac{1}{4}\log\frac{1/4}{1/8}\right) = \log 2$, so this is actually the global optimum.
To see that there are no other solutions, one can do a case analysis, and see that any other distribution
will not put 0s in the right places. For example, consider this:
x
1 2 3 4 q(y)
1 1/4 0 1/4 0 1/2
y 2 0 0 0 0 0
3 1/4 0 1/4 0 1/2
4 0 0 0 0 0
q(x) 1/2 0 1/2 0
Obviously if we set $q(x,y) = p(x)p(y) = 1/16$, we get $\mathrm{KL}(q\|p) = \infty$.
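A sketch that evaluates KL(q‖p) for the three factored local optima above. The joint p used here is an assumption reconstructed from the quoted KL values: 1/8 on each cell of the 2x2 block {1,2}x{1,2}, and 1/4 on (3,3) and (4,4).

import numpy as np

# Assumed joint p(x, y), indexed [y, x], consistent with the KL values quoted above.
p = np.zeros((4, 4))
p[:2, :2] = 1/8
p[2, 2] = 1/4
p[3, 3] = 1/4

def kl(q, p):
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

def factored(qx, qy):
    return np.outer(np.array(qy), np.array(qx))      # q(x, y) = q_x(x) q_y(y)

q1 = factored([0, 0, 1, 0], [0, 0, 1, 0])            # all mass on (3, 3)
q2 = factored([0, 0, 0, 1], [0, 0, 0, 1])            # all mass on (4, 4)
q3 = factored([1/2, 1/2, 0, 0], [1/2, 1/2, 0, 0])    # uniform on the 2x2 block

print(kl(q1, p), np.log(4))   # log 4
print(kl(q2, p), np.log(4))   # log 4
print(kl(q3, p), np.log(2))   # log 2: the best of the three, as claimed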
7 Solutions
7.1 Orthogonal matrices
1. Let c = cos(α) and s = sin(α). Using the fact that c2 + s2 = 1, we have
$R^TR = \begin{pmatrix}c & s & 0\\ -s & c & 0\\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix}c & -s & 0\\ s & c & 0\\ 0 & 0 & 1\end{pmatrix} = \begin{pmatrix}c^2+s^2 & -cs+sc & 0\\ -sc+cs & c^2+s^2 & 0\\ 0 & 0 & 1\end{pmatrix} = \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}$   (220)
2. The z-axis v = (0, 0, 1) is not affected by a rotation around z. We can easily check that (0, 0, 1) is an
eigenvector with eigenvalue 1 as follows:
$\begin{pmatrix}c & s & 0\\ -s & c & 0\\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix}0\\ 0\\ 1\end{pmatrix} = \begin{pmatrix}0\\ 0\\ 1\end{pmatrix}$   (221)
Of course, (0, 0, −1) is also a valid solution. We can check this using the symbolic math toolbox in
Matlab:
syms c s x y z
S=solve(’c*x-s*y=x’,’s*x+c*y=y’,’x^2+y^2+z^2=1’)
>> S.x
ans =
0
0
>> S.y
ans =
0
0
>> S.z
ans =
1
-1
If we ignore the unit norm constraint, we find that (0, 0, z) is a solution for any z ∈ R. We can see this
as follows:
$\begin{pmatrix}c & s & 0\\ -s & c & 0\\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix}x\\ y\\ z\end{pmatrix} = \begin{pmatrix}x\\ y\\ z\end{pmatrix}$   (222)
becomes
$cx - sy = x$   (223)
$sx + cy = y$   (224)
$z = z$   (225)
Solving the first equation gives
$y = \frac{x(c-1)}{s}$   (226)
Substituting this into the second equation gives
$sx + \frac{cx(c-1)}{s} = \frac{x(c-1)}{s}$   (227)
which, after multiplying through by $s$, simplifies to
$x\left(s^2 + (c-1)^2\right) = 0$   (228)
and hence $x = y = 0$ (assuming $\sin\alpha\neq 0$, i.e., a non-trivial rotation). We can check this in Matlab:
S=solve(’c*x-s*y=x’,’s*x+c*y=y’)
S =
x: [1x1 sym]
y: [1x1 sym]
>> S.x
0
>> S.y
0
In matlab, we find
>> [V,D]=eig([2 0; 0 3])
V =
1 0
0 1
D =
2 0
0 3
so $\lambda_1 = 2$, $\lambda_2 = 3$, $v_1 = (1,0)^T$, $v_2 = (0,1)^T$. Let us check this by hand. We want to solve
$Av_i = \lambda_iv_i$   (229)
$(A-\lambda_iI)v_i = 0$   (230)
$|A-\lambda I| = 0$   (231)
$\begin{vmatrix}2-\lambda & 0\\ 0 & 3-\lambda\end{vmatrix} = 0$   (232)
$(2-\lambda)(3-\lambda) = 0$   (233)
which yields $\lambda = 2$ or $\lambda = 3$, matching the Matlab output.
8 Solutions
8.1 Subderivative of the hinge loss function
At x = 0 and x = 2 the function is differentiable and ∂f (0) = {−1} and ∂f (2) = {0}. At x = 1 there is a
kink, and ∂f (1) = [−1, 0].
Let us first derive the algorithm with ν assumed known, for simplicity. In this case, we can ignore the LG
term, so we only need to figure out how to compute E [zn ] wrt the old parameters.
One can show that
$p(z_n|x_n,\theta) = \mathrm{Ga}\!\left(z_n\,\Big|\,\frac{\nu+D}{2},\ \frac{\nu+\delta_n}{2}\right)$   (245)
Now if $z_n\sim\mathrm{Ga}(a,b)$, then $E[z_n] = a/b$. Hence the E step at iteration $t$ is
$\overline{z}_n^{(t)} \triangleq E\left[z_n|x_n,\theta^{(t)}\right] = \frac{\nu^{(t)}+D}{\nu^{(t)}+\delta_n^{(t)}}$   (246)
The M step is obtained by maximizing $E[L_N(\mu,\Sigma)]$ to yield
$\hat{\mu}^{(t+1)} = \frac{\sum_n\overline{z}_n^{(t)}x_n}{\sum_n\overline{z}_n^{(t)}}$   (247)
$\hat{\Sigma}^{(t+1)} = \frac{1}{N}\sum_n\overline{z}_n^{(t)}(x_n-\hat{\mu}^{(t+1)})(x_n-\hat{\mu}^{(t+1)})^T$   (248)
$= \frac{1}{N}\left[\left(\sum_n\overline{z}_n^{(t)}x_nx_n^T\right) - \left(\sum_{n=1}^N\overline{z}_n^{(t)}\right)\hat{\mu}^{(t+1)}(\hat{\mu}^{(t+1)})^T\right]$   (249)
These results are quite intuitive: the quantity z n is the precision of measurement n, so if it is small,
the corresponding data point is down-weighted when estimating the mean and covariance. This is how the
Student achieves robustness to outliers.
To compute the MLE for the degrees of freedom, we first need to compute the expectation of LG (ν),
which involves zn and log zn . Now if zn ∼ Ga(a, b), then one can show that
$\overline{\ell}_n^{(t)} \triangleq E\left[\log z_n|\theta^{(t)}\right] = \psi(a) - \log b$   (250)
where $\psi(x) = \frac{d}{dx}\log\Gamma(x)$ is the digamma function. Hence, from Equation (245), we have
$\overline{\ell}_n^{(t)} = \psi\!\left(\frac{\nu^{(t)}+D}{2}\right) - \log\!\left(\frac{\nu^{(t)}+\delta_n^{(t)}}{2}\right)$   (251)
$= \log\!\left(\overline{z}_n^{(t)}\right) + \psi\!\left(\frac{\nu^{(t)}+D}{2}\right) - \log\!\left(\frac{\nu^{(t)}+D}{2}\right)$   (252)
Substituting into Equation (244), we have
$E[L_G(\nu)] = -N\log\Gamma(\nu/2) + \frac{N\nu}{2}\log(\nu/2) + \frac{\nu}{2}\sum_n\left(\overline{\ell}_n^{(t)} - \overline{z}_n^{(t)}\right)$   (253)
This has a unique solution in the interval (0, +∞] which can be found using a 1d constrained optimizer.
Performing a gradient-based optimization in the M step, rather than a closed-form update, is an example
of what is known as the generalized EM algorithm.
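The E and M steps (246)-(249) amount to only a few lines of code. Here is a sketch with ν held fixed, as in the derivation above; the data, the value of ν, and the iteration count are illustrative choices.

import numpy as np

def student_em(X, nu=4.0, n_iter=50):
    # EM for a multivariate Student with fixed dof nu: a sketch of (246)-(249).
    N, D = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        # E step: z_n = (nu + D) / (nu + delta_n), delta_n the Mahalanobis distance.
        diff = X - mu
        delta = np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
        z = (nu + D) / (nu + delta)
        # M step: weighted mean and scatter, equations (247)-(248).
        mu = (z[:, None] * X).sum(axis=0) / z.sum()
        diff = X - mu
        Sigma = (z[:, None] * diff).T @ diff / N
    return mu, Sigma

# Illustrative use: Gaussian data plus a few gross outliers.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
X[:10] += 20.0                      # outliers
print(X.mean(axis=0))               # the Gaussian MLE of the mean is pulled towards the outliers
print(student_em(X)[0])             # the Student EM estimate stays near 0: z_n downweights outliers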
Part II
Linear models
9 Solutions
9.1 Derivation of Fisher’s linear discriminant
We have
$f = w^TS_Bw$   (255)
$f' = 2S_Bw$   (256)
$g = w^TS_Ww$   (257)
$g' = 2S_Ww$   (258)
$\frac{dJ(w)}{dw} = \frac{(2S_Bw)(w^TS_Ww) - (w^TS_Bw)(2S_Ww)}{(w^TS_Ww)^2} = 0$   (259)
Hence $S_Bw\,(w^TS_Ww) = S_Ww\,(w^TS_Bw)$, i.e., $S_Bw = \lambda S_Ww$ with $\lambda = \frac{w^TS_Bw}{w^TS_Ww}$, a generalized eigenvalue problem. Since $S_Bw = (\mu_2-\mu_1)(\mu_2-\mu_1)^Tw$ always points along $\mu_2-\mu_1$, and the scale of $w$ is irrelevant, the solution is $w\propto S_W^{-1}(\mu_2-\mu_1)$.
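A minimal numerical sketch of this conclusion: the direction found by the generalized eigendecomposition coincides (up to scale and sign) with S_W^{-1}(mu2 - mu1). The synthetic data and class means are illustrative.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], [1.0, 2.0], size=(200, 2))   # class 1 (illustrative)
X2 = rng.normal([3, 1], [1.0, 2.0], size=(200, 2))   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
SW = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)
SB = np.outer(m2 - m1, m2 - m1)

# Generalized eigenproblem S_B w = lambda S_W w; take the top eigenvector.
evals, evecs = eigh(SB, SW)
w_eig = evecs[:, -1]
w_closed = np.linalg.solve(SW, m2 - m1)   # w proportional to S_W^{-1} (mu2 - mu1)

print(w_eig / np.linalg.norm(w_eig))
print(w_closed / np.linalg.norm(w_closed))   # same direction, up to sign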
10 Solutions
10.1 Gradient and Hessian of log-likelihood for multinomial logistic regression
1. Let us drop the $i$ subscript for simplicity. Let $S = \sum_k e^{\eta_k}$ be the denominator of the softmax.
$\frac{\partial\mu_k}{\partial\eta_j} = \frac{\partial}{\partial\eta_j}\left[e^{\eta_k}\right]S^{-1} + e^{\eta_k}\cdot\frac{\partial S}{\partial\eta_j}\cdot(-S^{-2})$   (263)
$= (\delta_{kj}e^{\eta_k})S^{-1} - e^{\eta_k}e^{\eta_j}S^{-2}$   (264)
$= \frac{e^{\eta_k}}{S}\delta_{kj} - \frac{e^{\eta_k}}{S}\frac{e^{\eta_j}}{S}$   (265)
$= \mu_k(\delta_{kj}-\mu_j)$   (266)
2. We have
$\nabla_{w_j}\ell = \sum_i\sum_k\frac{\partial\ell}{\partial\mu_{ik}}\frac{\partial\mu_{ik}}{\partial\eta_{ij}}\frac{\partial\eta_{ij}}{\partial w_j} = \sum_i\sum_k\frac{y_{ik}}{\mu_{ik}}\,\mu_{ik}(\delta_{jk}-\mu_{ij})\,x_i$   (267)
$= \sum_i\sum_k y_{ik}(\delta_{jk}-\mu_{ij})\,x_i = \sum_i y_{ij}x_i - \sum_i\left(\sum_k y_{ik}\right)\mu_{ij}x_i$   (268)
$= \sum_i(y_{ij}-\mu_{ij})\,x_i$   (269)
3. We consider a single term $x_i$ in the log likelihood; we can sum over $i$ at the end. Using the Jacobian expression from above, we have
$\nabla_{w_{c'}}(\nabla_{w_c}\ell)^T = \nabla_{w_{c'}}\left((y_{ic}-\mu_{ic})x_i^T\right)$   (270)
$= -(\nabla_{w_{c'}}\mu_{ic})\,x_i^T$   (271)
$= -\mu_{ic}(\delta_{c,c'}-\mu_{i,c'})\,x_ix_i^T$   (272)
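The gradient (269) is easy to verify with finite differences; a sketch on random data (shapes and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, D, C = 6, 4, 3
X = rng.normal(size=(N, D))
Y = np.eye(C)[rng.integers(0, C, size=N)]    # one-hot labels
W = rng.normal(size=(C, D))

def softmax(eta):
    e = np.exp(eta - eta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loglik(W):
    mu = softmax(X @ W.T)                    # eta_{ij} = w_j^T x_i
    return np.sum(Y * np.log(mu))

# Analytic gradient from (269): grad wrt w_j is sum_i (y_ij - mu_ij) x_i
mu = softmax(X @ W.T)
grad = (Y - mu).T @ X                        # shape (C, D); row j is the gradient wrt w_j

# Finite-difference check
eps = 1e-6
grad_fd = np.zeros_like(W)
for j in range(C):
    for d in range(D):
        Wp, Wm = W.copy(), W.copy()
        Wp[j, d] += eps
        Wm[j, d] -= eps
        grad_fd[j, d] = (loglik(Wp) - loglik(Wm)) / (2 * eps)

print(np.max(np.abs(grad - grad_fd)))        # tiny: analytic and numeric gradients agree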
1. Unregularized solution: see Figure 1(a). The ML solution makes 0 errors, but the decision boundary is not unique: there are several possible decision boundaries which all make 0 errors.
2. Heavily regularizing $w_0$ sets $w_0 = 0$, so the line must go through the origin. There are several possible lines with different slopes, but all will make 1 error; see Figure 1(b).
3. Regularizing $w_1$ makes the line horizontal, since $x_1$ is ignored. This incurs 2 errors; see Figure 1(c).
4. Regularizing $w_2$ makes the line vertical, since $x_2$ is ignored. This incurs 0 errors; see Figure 1(d).
1. GaussI ≤ LinLog. Both have logistic (sigmoid) posteriors p(y|x, w) = σ(ywT x), but LinLog is the
logistic model which is trained to maximize p(y|x, w). (GaussI may have high joint p(y, x), but this
does not necessarily mean p(y|x) is high; LinLog can achieve the maximum of p(y|x), so will necessarily
do at least as well as GaussI.)
[Figure 1: the decision boundaries for cases (a)-(d) discussed above; figure not reproduced here.]
2. GaussX ≤ QuadLog. Both have logistic posteriors with quadratic features, but QuadLog is the model
of this class maximizing the average log probabilities.
3. LinLog ≤ QuadLog. Logistic regression models with linear features are a subclass of logistic regression
models with quadratic functions. The maximum from the superclass is at least as high as the maximum
from the subclass.
4. GaussI ≤ QuadLog. Follows from above inequalities.
5. Although one might expect that higher log likelihood results in better classification performance, in
general, having higher average log p(y|x) does not necessarily translate to higher or lower classification
error. For example, consider linearly separable data. We have L(linLog) > L(GaussI), since maximum
likelihood logistic regression will set the weights to infinity, to maximize the probability of the correct
labels (hence p(yi |xi , ŵ) = 1 for all i). However, we have R(linLog) = R(gaussI), since the data is
linearly separable. (The GaussI model may or may not set σ very small, resulting in possibly very large
class conditional pdfs; however, the posterior over y is a discrete pmf, and can never exceed 1.)
As another example, suppose the true label is always 1 (as opposed to 0), but model M always predicts
p(y = 1|x, M ) = 0.49. It will always misclassify, but it is at least close to the decision boundary.
By contrast, there might be another model M 0 that predicts p(y = 1|x, M 0 ) = 1 on even-numbered
inputs, and p(y = 1|x, M 0 ) = 0 on odd-numbered inputs. Clearly R(M 0 ) = 0.5 < R(M ) = 1, but
L(M 0 ) = −∞ < L(M ) = log(0.49).
11 Solutions
11.1 Multi-output linear regression
We have
$w_01^Ty = w_0n\overline{y}$   (282)
$w_01^TXw = w_0\sum_i x_i^Tw = w_0\,n\,\overline{x}^Tw = 0$   (283)
(since the inputs are centered, $\overline{x} = 0$), and
$w_01^T1w_0 = nw_0^2$   (284)
11.3 Partial derivative of the RSS
1. We have
$\mathrm{RSS}(w) = \sum_i(r_i - w_kx_{ik})(r_i - w_kx_{ik})$   (289)
$= \sum_i\left[r_i^2 + w_k^2x_{ik}^2 - 2w_kx_{ik}r_i\right]$   (290)
So
$\frac{\partial}{\partial w_k}\mathrm{RSS}(w) = \sum_i\left[2w_kx_{ik}^2 - 2x_{ik}r_i\right]$   (291)
Recall that $[X^TX]_{jk} = \sum_i x_{ij}x_{ik}$. For simplicity, suppose there are two features. Then
$[\nabla_w\mathrm{RSS}]_1 = 2\left[\sum_i x_{i1}^2,\ \sum_i x_{i1}x_{i2}\right]w - 2\sum_i x_{i1}y_i = 2w_1\sum_i x_{i1}^2 + 2\sum_i x_{i1}(w_2x_{i2}-y_i) = a_1w_1 - c_1$   (293)
so
$w_k = \frac{x_{:,k}^Tr}{x_{:,k}^Tx_{:,k}}$   (295)
1. From the book, we have that $\hat{w}_k^{OLS} = x_k^Ty = c_k/2$. Hence the solid line (1) is OLS and has slope 1/2. Also, $\hat{w}_k^{ridge} = \hat{w}_k^{OLS}/(1+\lambda_2) = \frac{c_k}{2(1+\lambda_2)}$. Hence the dotted line (2) is ridge and has slope 1/4. Finally, lasso must be the dashed line (3), since it sets coefficients to zero if $-\lambda_1\leq c_k\leq\lambda_1$.
2. $\lambda_1 = 1$
3. $\lambda_2 = 1$
11.6 EM for mixture of linear regression experts
In the E step, we compute the conditional responsibilities
In the M step, we update the parameters of the gating function by maximizing the weighted likelihood
$\ell(\theta_g) = \sum_n\sum_k r_{nk}\log p(z_n=k|x_n,\theta_g)$   (299)
and the parameters of the k'th expert by maximizing the weighted likelihood
$\ell(\theta_k) = \sum_n r_{nk}\log p(y_n|x_n,z_n=k,\theta_k)$   (300)
If the gating function and experts are linear models, these M steps correspond to convex subproblems that
can be solved efficiently.
For example, consider a mixture of linear regression experts using logistic regression gating functions. In the M step, we need to maximize $Q(\theta,\theta^{old})$ wrt $w_k$, $\sigma_k^2$ and $V$. For the regression parameters for model $k$, the objective has the form
$Q(\theta_k,\theta^{old}) = \sum_{n=1}^N r_{nk}\left[-\frac{1}{\sigma_k^2}(y_n - w_k^Tx_n)^2\right]$   (301)
We recognize this as a weighted least squares problem, which makes intuitive sense: if rnk is small, then data
point n will be downweighted when estimating model k’s parameters. From the weighted least square section,
we can immediately write down the MLE as $\hat{w}_k = (X^TR_kX)^{-1}X^TR_ky$, where $R_k = \mathrm{diag}(r_{1k},\dots,r_{Nk})$.
We replace the estimate of the unconditional mixing weights π with the estimate of the gating parameters,
V. The objective has the form
$\ell(V) = \sum_n\sum_k r_{nk}\log\pi_{n,k}$   (304)
We recognize this as equivalent to the log-likelihood for multinomial logistic regression, except we replace the
“hard” 1-of-C encoding yi with the “soft” 1-of-K encoding ri . Thus we can estimate V by fitting a logistic
regression model to soft target labels.
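A sketch of the M step for the experts, i.e., one weighted-least-squares solve per expert. The responsibilities are assumed to come from the E step; the function and variable names are illustrative, and the weighted-variance update for sigma2 is a standard choice shown for completeness.

import numpy as np

def m_step_experts(X, y, R):
    # X: (N, D) design matrix, y: (N,) targets, R: (N, K) responsibilities r_nk.
    # Returns W (K, D) regression weights and sigma2 (K,) noise variances.
    N, D = X.shape
    K = R.shape[1]
    W = np.zeros((K, D))
    sigma2 = np.zeros(K)
    for k in range(K):
        r = R[:, k]
        # Solve (X^T R_k X) w_k = X^T R_k y, with R_k = diag(r)
        A = X.T @ (r[:, None] * X)
        b = X.T @ (r * y)
        W[k] = np.linalg.solve(A, b)
        resid = y - X @ W[k]
        sigma2[k] = np.sum(r * resid ** 2) / np.sum(r)
    return W, sigma2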
12 Solutions
Part III
Deep neural networks
13 Solutions
13.1 Backpropagation for a 1 layer MLP
To compute $\delta_1$, let $L = \mathrm{CrossEntropyWithLogits}(y,a)$. Then
$\delta_1 = \frac{\partial L}{\partial a} = (p-y)^T$   (305)
where $p = \mathcal{S}(a)$.
To compute $\delta_2$, we have
$\delta_2 = \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\frac{\partial a}{\partial h}\frac{\partial h}{\partial z} = \delta_1\frac{\partial a}{\partial h}\frac{\partial h}{\partial z}$   (306)
$= \delta_1U\,\frac{\partial h}{\partial z}$ since $a = Uh + b_2$   (307)
$= \delta_1U\circ\mathrm{ReLU}'(z)$ since $h = \mathrm{ReLU}(z)$   (308)
$= \delta_1U\circ H(h)$   (309)
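A sketch of these backprop equations together with a finite-difference check (written in a column-vector convention, so the deltas below are the transposes of the row vectors above; the shapes and data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
D, H, C = 5, 4, 3
x = rng.normal(size=D)
y = np.eye(C)[1]                              # one-hot target

W, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
U, b2 = rng.normal(size=(C, H)), rng.normal(size=C)

def forward(W, b1, U, b2):
    z = W @ x + b1
    h = np.maximum(z, 0)                      # ReLU
    a = U @ h + b2
    p = np.exp(a - a.max()); p /= p.sum()     # softmax
    return z, h, a, p, -np.sum(y * np.log(p)) # cross entropy with logits

z, h, a, p, L = forward(W, b1, U, b2)
delta1 = p - y                                # (305)
delta2 = (U.T @ delta1) * (z > 0)             # (306)-(309): backprop through U and the ReLU

# Finite-difference check of dL/db1, which equals delta2 since z = W x + b1
eps, fd = 1e-6, np.zeros(H)
for i in range(H):
    bp, bm = b1.copy(), b1.copy()
    bp[i] += eps; bm[i] -= eps
    fd[i] = (forward(W, bp, U, b2)[-1] - forward(W, bm, U, b2)[-1]) / (2 * eps)

print(np.max(np.abs(delta2 - fd)))            # tiny: the analytic deltas are correct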
14 Solutions
15 Solutions
Part IV
Nonparametric models
16 Solutions
17 Solutions
17.1 Fitting an SVM classifier by hand
1. The perpendicular to the decision boundary is a line through φ(x1 ) to φ(x2 ), and hence is parallel to
φ(x2 ) − φ(x1 ) = (1, 2, 2) − (1, 0, 0) = (0, 2, 2). Of course, any scalar multiple of this is acceptable, e.g.,
(0, 1, 1).
2. There are 2 support vectors, namely the two data points. The decision boundary will be half way between them. This midpoint is $m = \frac{(1,2,2)+(1,0,0)}{2} = (1,1,1)$. The distance of each of the training points to this midpoint is
$\|\phi(x_1)-m\| = \|\phi(x_2)-m\| = \|(0,1,1)\| = \sqrt{0^2+1^2+1^2} = \sqrt{2}$   (315)
Hence the margin is $\sqrt{2}$.
3. We have $w = (0,1/2,1/2)$, which is parallel to $(0,2,2)$ and has $\|w\| = \sqrt{(1/2)^2+(1/2)^2} = 1/\sqrt{2}$ as required.
4. Using the equations we have
Hence w0 = −1.
5. We have
$w_0 + w^T\phi(x) = -1 + (0,\tfrac{1}{2},\tfrac{1}{2})\cdot(1,\sqrt{2}\,x,x^2) = -1 + \frac{\sqrt{2}}{2}x + \frac{1}{2}x^2$   (318)
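A numerical sketch of these answers, using the feature map phi(x) = (1, sqrt(2) x, x^2) implied by (318), and taking x1 = 0 and x2 = sqrt(2) so that phi(x1) = (1,0,0) and phi(x2) = (1,2,2) as in the exercise:

import numpy as np

def phi(x):
    return np.array([1.0, np.sqrt(2) * x, x ** 2])

x1, x2 = 0.0, np.sqrt(2)
w = np.array([0.0, 0.5, 0.5])
w0 = -1.0

f = lambda x: w0 + w @ phi(x)
print(f(x1), f(x2))                       # -1 and +1: both points lie exactly on the margin
print(np.linalg.norm(w), 1 / np.sqrt(2))  # ||w|| = 1/sqrt(2), as in part 3
print(1 / np.linalg.norm(w))              # distance of each point to the boundary = sqrt(2), the margin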
18 Solutions
Part V
Beyond supervised learning
19 Solutions
19.1 Information gain equations
To see this, let us define $p_n \triangleq p(\theta|D)$ as the current belief state, and $p_{n+1} \triangleq p(\theta|D,x,y)$ as the belief state after observing $(x,y)$. Then
$\mathrm{KL}(p_{n+1}\|p_n) = \int p_{n+1}\log\frac{p_{n+1}}{p_n}\,d\theta$   (319)
$= \int p_{n+1}\log p_{n+1}\,d\theta - \int p_{n+1}\log p_n\,d\theta$   (320)
$= -\mathbb{H}\left[p_{n+1}\right] - \int p_{n+1}\log p_n\,d\theta$   (321)
Next we need to take expectations wrt $p(y|x,D)$. We will use the following identity:
$\sum_y p(y|x,D)\int p_{n+1}\log p_n\,d\theta$   (322)
$= \sum_y p(y|x,D)\int p(\theta|D,x,y)\log p(\theta|D)\,d\theta$   (323)
$= \int\sum_y p(y,\theta|x,D)\log p(\theta|D)\,d\theta$   (324)
$= \int p(\theta|D)\log p(\theta|D)\,d\theta$   (325)
20 Solutions
20.1 EM for FA
In the E step, we can compute the posterior for zn as follows:
We now discuss the M step. We initially assume $\mu = 0$. Using the trace trick we have
$\sum_i E\left[(\tilde{x}_i - Wz_i)^T\Psi^{-1}(\tilde{x}_i - Wz_i)\right] = \sum_i\left[\tilde{x}_i^T\Psi^{-1}\tilde{x}_i + E\left[z_i^TW^T\Psi^{-1}Wz_i\right] - 2\tilde{x}_i^T\Psi^{-1}WE[z_i]\right]$   (331)
$= \sum_i\left[\mathrm{tr}(\Psi^{-1}\tilde{x}_i\tilde{x}_i^T) + \mathrm{tr}(\Psi^{-1}WE\left[z_iz_i^T\right]W^T) - \mathrm{tr}(2\Psi^{-1}WE[z_i]\tilde{x}_i^T)\right]$   (332)-(333)
$\triangleq \mathrm{tr}(\Psi^{-1}G(W))$   (334)
$\nabla_WQ(W) = -\frac{1}{2}\Psi^{-1}\nabla_WG(W) = 0$   (336)
$\nabla_WG(W) = 2W\sum_iE\left[z_iz_i^T\right] - 2\left(\sum_iE[z_i]\tilde{x}_i^T\right)^T$   (337)
$W_{mle} = \left[\sum_i\tilde{x}_iE[z_i]^T\right]\left[\sum_iE\left[z_iz_i^T\right]\right]^{-1}$   (338)
Using the facts that $\nabla_X\log|X| = X^{-T}$ and $\nabla_X\mathrm{tr}(XA) = A^T$ we have
$\nabla_{\Psi^{-1}}Q = \frac{N}{2}\Psi - \frac{1}{2}G(W_{mle}) = 0$   (339)
$\Psi = \frac{1}{N}\mathrm{diag}(G(W_{mle}))$   (340)
We can simplify this as follows, by plugging in the MLE (this simplification no longer holds if we use MAP estimation). First note that
$\sum_iE\left[z_iz_i^T\right]W_{mle}^T = \sum_iE[z_i]\tilde{x}_i^T$   (341)
so
$\Psi = \frac{1}{N}\mathrm{diag}\left(\sum_i\left[\tilde{x}_i\tilde{x}_i^T + W_{mle}E[z_i]\tilde{x}_i^T - 2W_{mle}E[z_i]\tilde{x}_i^T\right]\right)$   (342)
$= \frac{1}{N}\mathrm{diag}\left(\sum_i\tilde{x}_i\tilde{x}_i^T - W_{mle}\sum_iE[z_i]\tilde{x}_i^T\right)$   (343)
To estimate $\mu$ and $W$ at the same time, we can define $\tilde{W} = (W,\mu)$ and $\tilde{z} = (z,1)$. Also, define
It is interesting to apply the above equations to the PPCA case in the limit where σ 2 → 0. This provides
an alternative way to fit PCA models, as shown by [Row97]. Let Z̃ be a L × N matrix storing the posterior
means (low-dimensional representations) along its columns. Similarly, let $\tilde{X} = X^T$ be a $D\times N$ matrix
storing the original data along its columns. From the PPCA posterior, when σ 2 = 0, we have
This constitutes the E step. Notice that this is just an orthogonal projection of the data.
From Equation (346), the M step is given by
$\hat{W} = \left[\sum_nx_nE[z_n]^T\right]\left[\sum_nE[z_n]E[z_n]^T\right]^{-1}$   (349)
[TB99] showed that the only stable fixed point of the EM algorithm is the globally optimal solution.
That is, the EM algorithm converges to a solution where W spans the same linear subspace as that defined
by the first L eigenvectors. However, if we want W to be orthogonal, and to contain the eigenvectors in
descending order of eigenvalue, we have to orthogonalize the resulting matrix (which can be done quite
cheaply). Alternatively, we can modify EM to give the principal basis directly [AO03].
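A sketch of this EM-for-PCA procedure (the sigma^2 -> 0 limit), checking that the span of W converges to the span of the top-L eigenvectors; the synthetic data and the number of iterations are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, D, L = 500, 10, 3
X = rng.normal(size=(N, L)) @ rng.normal(size=(L, D)) + 0.1 * rng.normal(size=(N, D))
X = X - X.mean(axis=0)
Xt = X.T                                               # D x N, data in columns

W = rng.normal(size=(D, L))
for _ in range(100):
    Z = np.linalg.solve(W.T @ W, W.T @ Xt)             # E step: orthogonal projection of the data
    W = Xt @ Z.T @ np.linalg.inv(Z @ Z.T)              # M step, equation (349)

# Compare span(W) with the span of the top-L eigenvectors of the sample covariance.
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
U = evecs[:, -L:]                                      # top-L eigenvectors
P_W = W @ np.linalg.solve(W.T @ W, W.T)                # projector onto span(W)
P_U = U @ U.T
print(np.linalg.norm(P_W - P_U))                       # ~0: the two subspaces coincide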
The conditional posterior for zn is given by
For the M step, we define W̃c = (Wc , µc ), and z̃ = (z, 1). Also, define
2. We have
$\frac{\partial\tilde{J}}{\partial v_2} = -2Cv_2 + 2\lambda_2v_2 + \lambda_{12}v_1 = 0$   (362)
Premultiplying by $v_1^T$ yields
$-2v_1^TCv_2 + 2\lambda_2v_1^Tv_2 + \lambda_{12}v_1^Tv_1 = 0$
Now $v_1^TCv_2 = (Cv_1)^Tv_2 = \lambda_1v_1^Tv_2 = 0$ (since $C$ is symmetric), $v_1^Tv_2 = 0$, and $v_1^Tv_1 = 1$, so $\lambda_{12} = 0$. Hence
$Cv_2 = \lambda_2v_2$
So $v_2$ is an eigenvector of $C$. Since we want to maximize the variance, we want to pick the eigenvector with the largest eigenvalue, but the first one is already taken. Hence $v_2$ is the eigenvector with the second largest eigenvalue.
20.4 Deriving the residual error for PCA
Hence the residual error arising from only using $K < d$ terms is
$J_K = \sum_{j=K+1}^{d}\lambda_j$   (374)
2. Since X̃ lives in the d − 1 subspace orthogonal to v1 , the vector u must be orthogonal to v1 . Hence
uT v1 = 0 and uT u = 1, so u = v2 .
3. We have
2. We have
Using Bayes rule for Gaussians with the substitutions $\Lambda = I$, $L^{-1} = \sigma^2I$, $A = W$, $b = \mu$, $\mu = 0$, we have
$\mathrm{Cov}[z|y] = \left(I + W^T\frac{1}{\sigma^2}W\right)^{-1} = \sigma^2\left(\sigma^2I + W^TW\right)^{-1} = \sigma^2M^{-1}$   (385)
and
Plugging in the MLE we find
$W = U_K(\Lambda_K - \sigma^2I)^{\frac{1}{2}}$   (389)
$C^{-1} = \frac{1}{\sigma^2}\left[I - U_K(\Lambda_K-\sigma^2I)\Lambda_K^{-1}U_K^T\right]$   (390)
$= \frac{1}{\sigma^2}\left[I - U_KJU_K^T\right]$   (391)
$J = \mathrm{diag}(1 - \sigma^2/\lambda_j)$   (392)
21 Solutions
22 Solutions
23 Solutions