Problem sheet 1, Information Theory, MT 2022
Designed for the first tutorial class
Question 1 We are given a deck of n cards in order 1, 2, · · · , n. Then a randomly chosen card
is removed and placed at a random position in the deck. What is the entropy of the resulting
deck of cards?
Answer 1 There are n cards that can be picked up, each equally likely, and n positions at which the chosen card can be re-inserted, each equally likely. So there are n² different actions, each with probability 1/n², and several of them can result in the same final order:
(a) The original order results from any action in which the chosen card is placed back at its original position, so the probability of the original order is n · 1/n² = 1/n;
(b) a swap of two adjacent cards results from exactly two different actions (move the left card one place to the right, or the right card one place to the left). There are n − 1 such orders, each with probability 2/n²;
(c) a card is moved at least 2 positions away: there are n(n − 3) + 2 = n² − 3n + 2 such orders, each arising from exactly one action and hence with probability 1/n².
So there are 1 + (n − 1) + (n² − 3n + 2) different results with the probabilities above, and the entropy (with logarithms to base 2) is
\[
H(\text{deck}) = \frac{1}{n}\log(n) + (n-1)\,\frac{2}{n^2}\log\!\Bigl(\frac{n^2}{2}\Bigr) + (n^2-3n+2)\,\frac{1}{n^2}\log(n^2)
\]
\[
= \frac{n\log(n) + 2(n-1)\bigl(2\log(n)-1\bigr) + (n^2-3n+2)\,2\log(n)}{n^2}
= \frac{2n-1}{n}\log(n) - \frac{2n-2}{n^2}.
\]
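As a sanity check on the counting above (not part of the original solution; the function names and the choice of base-2 logarithms are assumptions of this sketch), one can enumerate all n² actions by brute force and compare the resulting entropy with the closed form:

```python
import math
from collections import Counter

def deck_entropy_bruteforce(n):
    """Enumerate all n^2 (removed card, insertion position) actions and
    return the entropy (in bits) of the distribution over resulting orders."""
    counts = Counter()
    deck = list(range(n))
    for i in range(n):            # remove the card at position i ...
        for j in range(n):        # ... and insert it at position j
            rest = deck[:i] + deck[i + 1:]
            counts[tuple(rest[:j] + [deck[i]] + rest[j:])] += 1
    total = n * n
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def deck_entropy_formula(n):
    """Closed form derived above: (2n-1)/n * log2(n) - (2n-2)/n^2."""
    return (2 * n - 1) / n * math.log2(n) - (2 * n - 2) / n ** 2

for n in range(3, 8):
    print(n, deck_entropy_bruteforce(n), deck_entropy_formula(n))
```

For n = 3, for example, both computations give about 2.197 bits.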
Question 2 (Pooling inequalities) Let a ≥ 0, b ≥ 0 be given with a + b > 0. Show that
\[
-(a+b)\log(a+b) \;\le\; -a\log(a) - b\log(b) \;\le\; -(a+b)\log\!\Bigl(\frac{a+b}{2}\Bigr),
\]
and show that the first inequality becomes an equality iff ab = 0, and the second inequality becomes an equality iff a = b.
Answer 2 Denote p = a/(a + b). Dividing all three terms by a + b and then adding log(a + b) to each (for the middle term, split log(a + b) = p log(a + b) + (1 − p) log(a + b) and combine it with −p log(a) − (1 − p) log(b)), the inequalities are equivalent to
\[
0 \;\le\; -p\log(p) - (1-p)\log(1-p) \;\le\; -\log\!\Bigl(\frac{1}{2}\Bigr) = \log 2,
\]
which is the basic property of the binary entropy: it is non-negative, with equality iff p ∈ {0, 1} (i.e. ab = 0), and it is at most log 2, with equality iff p = 1/2 (i.e. a = b).
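A small numerical check of both inequalities and of the equality cases (an illustrative sketch only, using natural logarithms and the convention 0 log 0 = 0):

```python
import math

def pooling_bounds(a, b):
    """Return the (lower, middle, upper) terms of the pooling inequality,
    using natural logs and the convention 0*log(0) = 0."""
    xlogx = lambda t: t * math.log(t) if t > 0 else 0.0
    lower = -xlogx(a + b)
    middle = -xlogx(a) - xlogx(b)
    upper = -(a + b) * math.log((a + b) / 2)
    return lower, middle, upper

print(pooling_bounds(0.3, 0.0))   # left equality: ab = 0
print(pooling_bounds(0.2, 0.2))   # right equality: a = b
print(pooling_bounds(0.1, 0.4))   # both inequalities strict
```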
Question 3 Let X, Y, Z be discrete random variables. Prove or provide a counterexample to
the following statements:
(a) H(X) = H(42X);
(b) H(X|Y ) ≥ H(X|Y, Z);
(c) H(X, Y ) = H(X) + H(Y ).
Answer 3 The first one is true: f(x) = 42x is a bijection, and applying a bijection to X permutes the outcomes without changing their probabilities, so H(42X) = H(X).
The second is true: H(X|Y) − H(X|Y, Z) = I(X; Z|Y) ≥ 0, since conditional mutual information is non-negative; in terms of information/surprise, conditioning on the additional variable Z cannot increase the remaining uncertainty about X.
The third is wrong in general: by the chain rule, H(X, Y) = H(Y|X) + H(X), and H(Y|X) = H(Y) if and only if X and Y are independent. An easy counterexample: take Y = X with H(X) > 0; then H(X, Y) = H(X, X) = H(X) < H(X) + H(Y).
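The counterexample in the third part can be made concrete with a fair bit; the short check below (the helper H and the chosen distribution are assumptions of this sketch) computes both sides numerically:

```python
import math

def H(pmf):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Counterexample to (c): X a fair bit and Y = X.
pX = {0: 0.5, 1: 0.5}
pXY = {(0, 0): 0.5, (1, 1): 0.5}   # joint law of (X, X)
print(H(pXY), H(pX) + H(pX))       # 1.0 versus 2.0
```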
Question 4 Does there exist a discrete random variable X with a distribution such that
H(X) = +∞? If so, describe it as explicitly as possible.
Answer 4 Obviously H(X) < +∞ whenever X takes finitely many values, so we take the image space to be the natural numbers. Here is an example: P(X = n) = c/(n log²(n)) for n ≥ 2, where c = (Σ_{n≥2} 1/(n log²(n)))^{−1} > 0 is a normalising constant (the series converges). Then
\[
H(X) = \sum_{n\ge 2} \frac{c}{n\log^2(n)}\Bigl[\log\bigl(n\log^2(n)\bigr) - \log(c)\Bigr]
= \sum_{n\ge 2}\Bigl[\frac{c}{n\log(n)} + \frac{2c\log(\log(n))}{n\log^2(n)} - \frac{c\log(c)}{n\log^2(n)}\Bigr] = +\infty,
\]
since the last two terms give convergent series while Σ_{n≥2} 1/(n log(n)) = +∞.
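A numerical illustration of the divergence (not a proof, and the cutoff below is an arbitrary assumption): truncate the convergent normalising sum to approximate c, then watch the partial sums of the entropy series keep growing, roughly like c log(log(n)).

```python
import math

# Illustration for P(X = n) = c / (n * log(n)^2), n >= 2: the entropy's
# partial sums grow without bound.  c is approximated by truncating the
# convergent normalising series at an arbitrary cutoff.
CUTOFF = 10**6
c = 1 / sum(1 / (n * math.log(n) ** 2) for n in range(2, CUTOFF))

H, checkpoints = 0.0, {10**3, 10**4, 10**5, CUTOFF - 1}
for n in range(2, CUTOFF):
    p = c / (n * math.log(n) ** 2)
    H -= p * math.log(p)
    if n in checkpoints:
        print(n, H)   # partial sums keep increasing with the cutoff
```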
Question 5 Let 𝒳 be a finite set, f : 𝒳 → R a real-valued function, and fix α ∈ R. We want to maximise the entropy H(X) of a random variable X taking values in 𝒳 subject to the constraint
\[
E[f(X)] \le \alpha. \tag{1}
\]
Denote by U a uniformly distributed random variable over 𝒳. Prove the following optimal solutions for the maximisation.
(a) If α ∈ [E[f(U)], max_{x∈𝒳} f(x)], then the entropy is maximised subject to (1) by the uniformly distributed random variable U.
(b) If f is non-constant and α ∈ [min_{x∈𝒳} f(x), E[f(U)]], then the entropy is maximised subject to (1) by the random variable Z given by
\[
P(Z = x) = \frac{e^{\lambda f(x)}}{\sum_{y\in\mathcal{X}} e^{\lambda f(y)}} \quad \text{for } x \in \mathcal{X},
\]
where λ < 0 is chosen such that E[f(Z)] = α.
(c) (Optional) Prove that under the assumptions of (b), the choice of λ is unique and we have λ < 0.
Answer 5 (a) Since the uniform distribution achieves the maximal entropy over a finite set without any constraint, we just need to verify that U satisfies the constraint (1), which is immediate because α ≥ E[f(U)].
(b) Recall Gibbs' inequality: for any pmfs p and q,
\[
-\sum_{x\in\mathcal{X}} p(x)\log(p(x)) \;\le\; -\sum_{x\in\mathcal{X}} p(x)\log(q(x)).
\]
So, with p(·) the pmf of X, we try to relate E[f(X)] to −Σ_{x∈𝒳} p(x) log(q(x)) for a suitable pmf q. For this we look for q satisfying
\[
\lambda f(x) = \log(q(x)) + c(\lambda) \iff e^{\lambda f(x)}\, e^{-c(\lambda)} = q(x)
\]
for some λ < 0 and some constant c(λ). Using the fact that q is a pmf, we get
\[
c(\lambda) = \log\Bigl(\sum_{x\in\mathcal{X}} e^{\lambda f(x)}\Bigr), \qquad q(x) = \frac{e^{\lambda f(x)}}{\sum_{y\in\mathcal{X}} e^{\lambda f(y)}}.
\]
So for any λ < 0, define the pmf q(x) := e^{λf(x)} / Σ_{y∈𝒳} e^{λf(y)}; then
\[
E[f(X)] = \frac{1}{\lambda}\sum_{x\in\mathcal{X}} p(x)\log(q(x)) + \frac{c(\lambda)}{\lambda}.
\]
With λ < 0, the constraint E[f(X)] ≤ α is therefore equivalent to −Σ_{x∈𝒳} p(x) log(q(x)) ≤ −λα + c(λ). Hence
\[
H(X) \;\le\; -\sum_{x\in\mathcal{X}} p(x)\log(q(x)) \;\le\; -\lambda\alpha + c(\lambda),
\]
so −λα + c(λ) is an upper bound on the entropy for every λ < 0. Both inequalities become equalities (hence the upper bound is attained) iff p(x) = q(x) for all x, i.e.
\[
P(X = x) = q(x) = \frac{e^{\lambda f(x)}}{\sum_{y\in\mathcal{X}} e^{\lambda f(y)}},
\qquad
\alpha = \sum_{x\in\mathcal{X}} q(x) f(x) = \sum_{x\in\mathcal{X}} f(x)\,\frac{e^{\lambda f(x)}}{\sum_{y\in\mathcal{X}} e^{\lambda f(y)}}.
\]
The existence of a λ < 0 such that E[f(Z)] = Σ_{x∈𝒳} q(x) f(x) = α is proved in part (c).
(c) Denote g(λ) := Σ_{x∈𝒳} f(x) e^{λf(x)} / Σ_{y∈𝒳} e^{λf(y)}. Then g is a differentiable function with
\[
g'(\lambda) = \sum_{x\in\mathcal{X}} f(x)^2\,\frac{e^{\lambda f(x)}}{\sum_{y\in\mathcal{X}} e^{\lambda f(y)}}
\;-\; \sum_{x\in\mathcal{X}} f(x)\,\frac{e^{\lambda f(x)}}{\bigl(\sum_{y\in\mathcal{X}} e^{\lambda f(y)}\bigr)^2}\,\sum_{y\in\mathcal{X}} f(y)\,e^{\lambda f(y)}
= E[f(Z)^2] - \bigl(E[f(Z)]\bigr)^2,
\]
where Z has the pmf q of part (b) with parameter λ. Since f is not constant and q has full support, this variance is strictly positive, so g'(λ) > 0, which means g is a strictly increasing and continuous function. Furthermore, g(0) = E[f(U)] and g(λ) → min_{x∈𝒳} f(x) as λ → −∞. So for α ∈ (min_{x∈𝒳} f(x), E[f(U)]), the equation g(λ) = α admits a unique solution λ < 0.
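To make parts (b) and (c) concrete, here is a small numerical sketch (the toy set, the function f, and the value of α are assumptions chosen only for illustration). It uses the strict monotonicity of g established in (c) to find λ by bisection and then builds the corresponding Gibbs pmf.

```python
import math

def gibbs_pmf(f_vals, lam):
    """Gibbs pmf q(x) proportional to exp(lam * f(x)) over a finite list of f-values."""
    w = [math.exp(lam * v) for v in f_vals]
    Z = sum(w)
    return [wi / Z for wi in w]

def g(f_vals, lam):
    """g(lambda) = expectation of f under the Gibbs pmf with parameter lambda."""
    return sum(qi * vi for qi, vi in zip(gibbs_pmf(f_vals, lam), f_vals))

def solve_lambda(f_vals, alpha, lo=-50.0, hi=0.0, iters=100):
    """Bisection on the strictly increasing map g to solve g(lambda) = alpha."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(f_vals, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Toy example: states {0,1,2,3}, f(x) = x, constraint E[f(X)] <= alpha = 1.0,
# which lies below E[f(U)] = 1.5, so case (b) applies and lambda < 0.
f_vals = [0.0, 1.0, 2.0, 3.0]
lam = solve_lambda(f_vals, alpha=1.0)
q = gibbs_pmf(f_vals, lam)
print(lam, q, -sum(p * math.log(p) for p in q))
```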
Question 6 (A revision of the strong law of large numbers (SLLN) from probability theory; please take this question as a reference.) Let X be a real-valued random variable.
(a) Assume additionally that X is non-negative. Show that for every x > 0, we have
\[
P(X \ge x) \le \frac{E[X]}{x}.
\]
(b) Let X be a random variable with mean µ and variance σ². Show that for every ε > 0,
\[
P(|X - \mu| > \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}.
\]
(c) Let (X_n)_{n≥1} be a sequence of i.i.d. random variables with mean µ and variance σ². Show that (1/m) Σ_{n=1}^m X_n converges to µ in probability, i.e., for every ε > 0,
\[
\lim_{m\to+\infty} P\Biggl(\,\Bigl|\frac{1}{m}\sum_{n=1}^{m} X_n - \mu\Bigr| > \varepsilon\Biggr) = 0.
\]
This is a weak version of the SLLN. It can be strengthened by the Borel–Cantelli lemma to the often-used version: P(lim_{m→+∞} (1/m) Σ_{n=1}^m X_n = µ) = 1.
Answer 6 (a) E[X] = E[X 1_{X≥x}] + E[X 1_{X<x}] ≥ E[X 1_{X≥x}] ≥ E[x 1_{X≥x}] = x P(X ≥ x), which gives the inequality after dividing by x > 0.
(b) Similarly to part (a), for any random variable Y and constant ε > 0 we have P(|Y| > ε) ≤ P(Y² ≥ ε²) ≤ E[Y²]/ε², by applying part (a) to the non-negative random variable Y². Applying this with Y = X − µ gives the inequality in the question.
(c) For any integer m, denote Y_m = (1/m) Σ_{n=1}^m X_n − µ; then E[Y_m] = 0 and Var(Y_m) = σ²/m. Hence by part (b),
\[
P(|Y_m| > \varepsilon) \le \frac{\sigma^2}{m\varepsilon^2} \xrightarrow{\;m\to+\infty\;} 0.
\]
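The bound in (b) and the convergence in (c) can also be seen by simulation; in this sketch (the Uniform(0,1) distribution, ε, and the number of trials are arbitrary choices made for illustration), the empirical frequency of |sample mean − µ| > ε is compared with Chebyshev's bound σ²/(mε²).

```python
import random

# i.i.d. Uniform(0,1) samples: mu = 0.5, sigma^2 = 1/12.
mu, var, eps, trials = 0.5, 1 / 12, 0.05, 2000
for m in (10, 100, 1000):
    exceed = sum(
        abs(sum(random.random() for _ in range(m)) / m - mu) > eps
        for _ in range(trials)
    )
    print(m, exceed / trials, var / (m * eps * eps))  # empirical freq vs bound
```

Both columns decrease towards 0 as m grows, as the weak law predicts (for small m the Chebyshev bound may exceed 1 and is then vacuous).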
Question 7 (Optional) Partition the interval [0, 1] into n disjoint sub-intervals of lengths p_1, · · · , p_n. Let X_1, X_2, · · · be i.i.d. random variables, uniformly distributed on [0, 1], and let Z_m(i) be the number of X_1, · · · , X_m that lie in the i-th interval of the partition. Show that the random variables
\[
R_m = \prod_{i=1}^{n} p_i^{Z_m(i)}
\]
satisfy
\[
\frac{1}{m}\log(R_m) \xrightarrow{\;m\to+\infty\;} \sum_{i=1}^{n} p_i \log(p_i)
\]
with probability 1.
Answer 7 Denote by I_i the i-th sub-interval. By the definition of Z_m(i), we have Z_m(i) = Σ_{j=1}^m 1_{X_j∈I_i}, and by the (strong) law of large numbers, for each i,
\[
P\Biggl(\lim_{m\to+\infty} \frac{\sum_{j=1}^{m} 1_{X_j\in I_i}}{m} = p_i\Biggr) = 1.
\]
It is then easy to see that, with probability 1,
\[
\frac{1}{m}\log(R_m) = \frac{1}{m}\sum_{i=1}^{n} Z_m(i)\log(p_i) = \sum_{i=1}^{n} \log(p_i)\,\frac{\sum_{j=1}^{m} 1_{X_j\in I_i}}{m}
\xrightarrow{\;m\to+\infty\;} \sum_{i=1}^{n} p_i \log(p_i),
\]
since the intersection of the n almost-sure events above is again almost sure.
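A simulation sketch of the convergence in Question 7 (the particular partition and sample size below are illustrative assumptions, not part of the solution):

```python
import math
import random

p = [0.5, 0.3, 0.2]                      # sub-interval lengths
edges = [0.0, 0.5, 0.8, 1.0]             # corresponding partition of [0, 1]
target = sum(pi * math.log(pi) for pi in p)

m = 100_000
counts = [0] * len(p)                    # counts[i] plays the role of Z_m(i)
for _ in range(m):
    x = random.random()
    i = next(k for k in range(len(p)) if x < edges[k + 1])
    counts[i] += 1

log_Rm = sum(counts[i] * math.log(p[i]) for i in range(len(p)))
print(log_Rm / m, target)                # the two numbers should be close
```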