COS597D: Information Theory in Computer Science September 21, 2011
Lecture 2
Lecturer: Mark Braverman Scribe: Mark Braverman∗
In the last lecture, we introduced entropy H(X) and conditional entropy H(X|Y), and showed how they
are related via the chain rule. We also proved the inequality H(X1, . . . , Xn) ≤ H(X1) + · · · + H(Xn), and
showed that H(X|Y) ≤ H(X), with equality if and only if X and Y are independent. By a similar argument,
one can show the more general statement:
Lemma 1. H(X|Y Z) ≤ H(X|Y ).
In this lecture, we will derive one more useful inequality and then give some examples of applying entropy
in combinatorics. We start with some simple examples.
1 Warm-up examples
Example 2. Prove that for any deterministic function g(y), H(X|Y ) ≤ H(X|g(Y )).
Solution We have
H(X|Y ) = H(X|Y g(Y )) ≤ H(X|g(Y )).
Here the equality holds because the value of Y completely determines the value of g(Y ). The inequality is
an application of Lemma 1.
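As a quick numerical sanity check of Example 2, here is a minimal Python sketch. The joint distribution and the coarsening g(y) = y mod 2 are arbitrary illustrative choices, not anything from the lecture.

```python
from math import log2
from collections import defaultdict

# A small joint distribution p(x, y) on {0,1} x {0,1,2,3}, chosen arbitrarily.
p = {(0, 0): 0.10, (0, 1): 0.15, (0, 2): 0.05, (0, 3): 0.20,
     (1, 0): 0.05, (1, 1): 0.10, (1, 2): 0.25, (1, 3): 0.10}

def cond_entropy(joint):
    # H(X|Y) = sum_{x,y} p(x,y) * log( p(y) / p(x,y) ), for a dict {(x, y): prob}.
    py = defaultdict(float)
    for (_, y), q in joint.items():
        py[y] += q
    return sum(q * log2(py[y] / q) for (x, y), q in joint.items() if q > 0)

def g(y):
    return y % 2                      # an arbitrary deterministic function of Y

pg = defaultdict(float)               # joint distribution of (X, g(Y))
for (x, y), q in p.items():
    pg[(x, g(y))] += q

# H(X|Y) <= H(X|g(Y)), as in Example 2.
print(cond_entropy(p), "<=", cond_entropy(dict(pg)))
```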
Example 3. Consider a bin with n balls of various colors. In one experiment k ≤ n balls are taken out
with replacement. In another the balls are taken out without replacement. In which case does the resulting
sequence have higher entropy?
Solution One quick way to get the answer (but not the proof) is to test the question for very small
numbers. Suppose n = k = 2 and the bin contains one red and one blue ball. Then the possible outcomes
of the first experiment are RR, RB, BR, and BB – all with equal probabilities. Hence the entropy of the
first experiment is 2 bits. The possible outcomes of the second experiment are RB and BR, and hence the
entropy is only 1 bit. Thus we should try to prove that the first experiment has the higher entropy.
Denote the random variables representing the colors of the balls in the first experiment by X1 , . . . , Xk
and in the second experiment by Y1 , . . . , Yk . The first experiment is run with replacements, and hence all
the Xi ’s are independent and identically distributed. Thus we have:
H(X1 . . . Xk ) = H(X1 )+H(X2 |X1 )+. . .+H(Xk |X1 . . . Xk−1 ) = H(X1 )+H(X2 )+. . .+H(Xk ) = k ·H(X1 ),
where the first equality is just the Chain Rule, and the second equality follows from the fact that the variables
are independent.
Next, we observe that the distribution of the i-th ball drawn in the second experiment is the same
as the original distribution of the colors in the bag, which is the same as the distribution of X1 . Thus
H(Yi ) = H(X1 ) for all i. Finally, we get
H(Y1 . . . Yk ) = H(Y1 ) + H(Y2 |Y1 ) + . . . + H(Yk |Y1 . . . Yk−1 ) ≤ H(Y1 ) + H(Y2 ) + . . . + H(Yk ) = k · H(X1 ),
which shows that the second experiment has the lower entropy. Note that equality holds only when the Yi ’s
are independent, which would only happen if all the balls are of the same color, and thus both entropies are
0!
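Here is a small Python check of this conclusion for a hypothetical bin with two red balls and one blue ball and k = 2 draws; the without-replacement entropy is computed by brute-force enumeration of the equally likely orderings.

```python
from math import log2
from itertools import permutations
from collections import Counter

balls = ["R", "R", "B"]          # hypothetical bin: 2 red, 1 blue
k = 2                            # number of balls drawn

def entropy(dist):
    return sum(-p * log2(p) for p in dist.values() if p > 0)

# With replacement: the draws are i.i.d., so the entropy is k * H(one draw).
one_draw = {c: v / len(balls) for c, v in Counter(balls).items()}
h_with = k * entropy(one_draw)

# Without replacement: enumerate all orderings of the balls (each equally
# likely) and keep the colors of the first k draws.
seqs = Counter(perm[:k] for perm in permutations(balls))
total = sum(seqs.values())
h_without = entropy({s: c / total for s, c in seqs.items()})

print(h_with, ">=", h_without)   # approximately 1.84 >= 1.58
```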
∗ Based on lecture notes by Anup Rao and Kevin Zatloukal
Example 4. Let X and Y be two (not necessarily independent) random variables. Let Z be a random
variable that is obtained by tossing a fair coin, and setting Z = X if the coin comes up heads and Z = Y if
the coin comes up tails. What can we say about H(Z) in terms of H(X) and H(Y )?
Solution We claim that
(H(X) + H(Y ))/2 ≤ H(Z) ≤ (H(X) + H(Y ))/2 + 1.
These bounds are tight. To see this, consider the case when X = 0 and Y = 1 are two constant
random variables, so that H(X) = H(Y) = 0. We then have Z ∼ B_{1/2} distributed as a fair coin with
H(Z) = 1 = (H(X) + H(Y ))/2 + 1. At the same time, if X and Y are any i.i.d. (independent, identically
distributed) random variables, then Z will have the same distribution as X and Y and thus H(Z) = H(X) =
H(Y ) = (H(X) + H(Y ))/2.
To prove the inequalities denote by S the outcome of the coin toss when we select whether Z = X or
Z = Y . Then we have
H(Z) ≥ H(Z|S) = (1/2)·H(Z|S = 0) + (1/2)·H(Z|S = 1) = (1/2)·H(X) + (1/2)·H(Y).
For the other direction, note that
H(Z) ≤ H(ZS) = H(S) + H(Z|S) = 1 + (1/2)·H(X) + (1/2)·H(Y).
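The following Python sketch checks both bounds numerically. The marginal distributions pX and pY are arbitrary illustrative choices, and the coin is assumed independent of (X, Y), so the distribution of Z is simply the average of the two marginals.

```python
from math import log2

def entropy(dist):
    return sum(-p * log2(p) for p in dist.values() if p > 0)

# Hypothetical marginal distributions of X and Y over {0, 1, 2}.
pX = {0: 0.7, 1: 0.2, 2: 0.1}
pY = {0: 0.1, 1: 0.1, 2: 0.8}

# Z = X with prob 1/2 and Z = Y with prob 1/2 (coin independent of X and Y),
# so the distribution of Z is the average of the two marginals.
pZ = {v: 0.5 * pX.get(v, 0) + 0.5 * pY.get(v, 0) for v in set(pX) | set(pY)}

lower = (entropy(pX) + entropy(pY)) / 2
print(lower, "<=", entropy(pZ), "<=", lower + 1)
```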
2 One more inequality
We show that the uniform distribution has the highest entropy. In fact, we will see that a distribution has
the maximum entropy if and only if it is uniform.
Lemma 5. Let 𝒳 be the support of X. Then H(X) ≤ log |𝒳|.
Proof We write H(X) as an expectation and then apply Jensen's inequality (from Lecture 1) to the
concave function log t, evaluated at the random variable 1/p(X):

H(X) = \sum_{x∈𝒳} p(x) log(1/p(x)) ≤ log ( \sum_{x∈𝒳} p(x)·(1/p(x)) ) = log ( \sum_{x∈𝒳} 1 ) = log |𝒳|.
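For the "if" part of the equality claim at the beginning of this section: if X is uniform on 𝒳, then p(x) = 1/|𝒳| for every x, and

H(X) = \sum_{x∈𝒳} (1/|𝒳|) · log |𝒳| = log |𝒳|,

so the bound of Lemma 5 is attained exactly by the uniform distribution.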
3 Applications
3.1 Bounding the Binomial Tail
Suppose 2^n + 1 people have each watched a subset of n movies. Since there are only 2^n possible subsets
of these movies, there must be two people that have watched exactly the same subset by the pigeonhole
principle.
We shall show how to argue something similar in less trivial cases where the pigeonhole principle does
not apply. This first example will show one way to do that using information theory.
Suppose 2^n people have each watched a subset of 2n movies and every person has watched at least 90%
of the movies. If the number of possible subsets meeting this constraint is less than 2^n, then we must have
two people who have watched exactly the same subset of movies, as before. The following result will give us
what we need.
Lemma 6. If k ≤ n/2, then \sum_{i=0}^{k} \binom{n}{i} ≤ 2^{nH(k/n)}.

We would like to compute \sum_{i=0.9(2n)}^{2n} \binom{2n}{i}. Since \binom{2n}{i} = \binom{2n}{2n−i}, this sum is equal to
\sum_{i=0}^{0.1(2n)} \binom{2n}{i} ≤ 2^{2nH(1/10)} < 2^n, since we can compute H(0.1) < 0.469 < 0.5.
It remains only to prove the lemma.
Proof Let X1 X2 · · · Xn be a uniformly random string sampled from the set of n-bit strings of weight
at most k. Thus, H(X1 X2 · · · Xn ) = log \sum_{i=0}^{k} \binom{n}{i}. Further, we have that Pr[Xi = 1] = E[Xi ]. By
symmetry, this probability is the same for every i, and hence equal to (1/n) \sum_{j=1}^{n} E[Xj ] = (1/n) E[\sum_{j=1}^{n} Xj ] ≤ k/n,
where we have used linearity of expectation. We can relate this to the entropy using the fact that H(p)
is monotone increasing for 0 ≤ p ≤ 1/2: since Pr[Xi = 1] ≤ k/n ≤ 1/2, we get H(Xi ) ≤ H(k/n). Finally,
applying our inequality from last time, we see that H(X1 X2 · · · Xn ) ≤ H(X1 ) + · · · + H(Xn ) ≤ nH(k/n).
Thus, we have shown that log \sum_{i=0}^{k} \binom{n}{i} ≤ nH(k/n), proving the lemma.
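A brief Python check of the lemma, and of the constant H(0.1) used in the movie example; the parameter values below are arbitrary.

```python
from math import comb, log2

def H(p):
    # Binary entropy in bits.
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

# Lemma 6: sum_{i=0}^{k} C(n, i) <= 2^{n H(k/n)} whenever k <= n/2.
for n in (10, 20, 50):
    for k in range(n // 2 + 1):
        assert sum(comb(n, i) for i in range(k + 1)) <= 2 ** (n * H(k / n))

# The constant used in the movie example: H(0.1) < 0.469 < 0.5.
print(H(0.1))  # approximately 0.46899
```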
3.2 Triangles and Vees [KR10]
Let G = (V, E) be a directed graph. We say that vertices (x, y, z) (not necessarily distinct) form a triangle
if {(x, y), (y, z), (z, x)} ⊂ E. Similarly, we say they form a vee if {(x, y), (x, z)} ⊂ E. Let T be the number
of triangles in the graph, and V be the number of vees.
We are interested in the relationship between V and T . From any particular triangle, we can get one vee
from each edge, say (x, y), by repeating the second vertex: (x, y, y) is a vee. If the vertices of a triangle are
distinct, the number of vees in the triangle is equal to the number of triangles contributed by the vertices
of the triangle, since the three cyclic permutations of (x, y, z) are distinct triangles. However, the same edge
could be used in many different triangles, so that this simple counting argument does not tell us anything
about the relationship between V and T . We shall use an information theory based argument to show:
Lemma 7. In any directed graph, V ≥ T .
Proof Let (X, Y, Z) be a uniformly random triangle. Then by the chain rule, log(T ) = H(X, Y, Z) =
H(X) + H(Y |X) + H(Z|X, Y ). We will construct a distribution on vees with entropy at least log T , which,
together with Lemma 5, implies that log V ≥ log T and hence V ≥ T .
Since conditioning can only reduce entropy, we have that H(X, Y, Z) ≤ H(X) + H(Y |X) + H(Z|Y ). Now
observe that by symmetry, the joint distribution of Y, Z is exactly the same as that of X, Y . Thus we can
simplify the bound to log T ≤ H(X) + 2H(Y |X).
Sample a random vee, (A, B, C) with the distribution
q(a, b, c) = Pr[X = a] · Pr[Y = b|X = a] · Pr[Y = c|X = a].
In words, we sample the first vertex with the same distribution as X, and then sample two independent copies
of Y to use as the second and third vertices. Observe that if q(a, b, c) > 0, then (a, b, c) must be a vee, so this
is a distribution on vees. On the other hand, the entropy of this distribution is H(A)+H(B|A)+H(C|AB) =
H(A) + H(B|A) + H(C|A) = H(X) + 2H(Y |X), which is at least log T as required.
The lemma is tight. Consider a graph with 3n vertices partitioned into three sets A, B, C with |A| =
|B| = |C| = n. Suppose that the edges are {(a, b) | a ∈ A, b ∈ B} and similarly for B, C and C, A. For each
triple (a, b, c) with a ∈ A, b ∈ B, c ∈ C, we get three triangles – (a, b, c), (b, c, a), and (c, a, b) – so T = 3n^3.
On the other hand, each vertex a ∈ A is involved in n^2 vees of the form (a, b1 , b2 ), and similarly for B, C. So V = 3n^3.
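A brute-force Python check of Lemma 7 on small random directed graphs, together with the tight tripartite example above; the graph size and edge density are arbitrary choices.

```python
import random
from itertools import product

def count_triangles_and_vees(n, edges):
    # Ordered triples with vertices not necessarily distinct, as defined above.
    T = sum((x, y) in edges and (y, z) in edges and (z, x) in edges
            for x, y, z in product(range(n), repeat=3))
    V = sum((x, y) in edges and (x, z) in edges
            for x, y, z in product(range(n), repeat=3))
    return T, V

# Random directed graphs (no self-loops): V >= T should always hold.
random.seed(0)
for _ in range(100):
    n = 5
    edges = {(u, v) for u in range(n) for v in range(n)
             if u != v and random.random() < 0.4}
    T, V = count_triangles_and_vees(n, edges)
    assert V >= T, (T, V)

# The tight example from the text: three parts of size n with a cyclic edge pattern.
n = 3
A, B, C = range(0, n), range(n, 2 * n), range(2 * n, 3 * n)
tri = ({(a, b) for a in A for b in B} | {(b, c) for b in B for c in C}
       | {(c, a) for c in C for a in A})
T, V = count_triangles_and_vees(3 * n, tri)
assert T == V == 3 * n ** 3
print("V >= T held on all samples; tight example gives T = V =", T)
```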
3.3 Counting Perfect Matchings [Rad97]
Suppose that we have a bipartite graph G = (A∪B, E), with |A| = |B| = n. Let A = [n]. A perfect matching
is a subset of E that is incident to every vertex exactly once. Hence, it is a bijection between the sets A
and B. How many possible perfect matchings can there be in a given graph? If we let d_v be the degree of
vertex v, then a trivial bound on the number of perfect matchings is \prod_{i∈A} d_i. A tighter bound was proved
by Brégman:

Theorem 8 (Brégman). The number of perfect matchings is at most \prod_{i∈A} (d_i!)^{1/d_i}.

To see that this is tight, consider the complete bipartite graph. Any bijection can be chosen, and the
number of bijections is the number of permutations of n letters, which is n!. In this case, the bound of the
theorem is \prod_{i=1}^{n} (n!)^{1/n} = (n!)^{(1/n)·n} = n!.
We give a simple proof of this theorem using information theory, due to Radhakrishnan [Rad97].
Proof Let ρ be a uniformly random perfect matching, and for simplicity, assume that A = {1, 2, . . . , n}.
Write ρ(i) for the neighbor of i under the matching ρ. Given a permutation τ, we write τ(≤ i) to denote the
concatenation τ(i), τ(i − 1), . . . , τ(1).
Then, using the chain rule, the fact that conditioning only reduces entropy, and the fact that the entropy
of a variable is at most the logarithm of the size of its support,
H(ρ) = \sum_{i=1}^{n} H(ρ(i) | ρ(≤ i−1)) ≤ \sum_{i=1}^{n} H(ρ(i)) ≤ \sum_{i=1}^{n} log d_i = log \prod_{i=1}^{n} d_i.
This proves that the number of matchings is at most \prod_{i=1}^{n} d_i. Can we improve the proof? In computing
H(ρ(i) | ρ(≤ i−1)), we are losing too much by throwing away all the previous values.
To improve the bound, let us start by symmetrizing over the order in which we condition the individual
values of ρ. If π is any permutation, then conditioning in the order π(1), π(2), . . . , π(n) shows that
H(ρ) = \sum_{i=1}^{n} H(ρπ(i) | ρπ(≤ i−1)), where ρπ(i) denotes ρ(π(i)). Since this is true for any choice of π,
we can take the expectation over a uniformly random choice of π without changing the value. Let L be a
uniformly random index in {1, 2, . . . , n}. Then,

H(ρπ(L) | L, π, ρπ(≤ L−1)) = (1/n) \sum_{i=1}^{n} H(ρπ(i) | π, ρπ(≤ i−1)) = (1/n)·H(ρ).
Now we rewrite this quantity according to the contribution of each vertex in A:
H(ρπ(L) | L, π, ρπ(≤ L−1)) = \sum_{i=1}^{n} Pr[π(L) = i] · E_{π, L s.t. π(L)=i} [ H(ρπ(L) | L, π, ρπ(≤ L−1)) ]

= \sum_{i=1}^{n} (1/n) · E_{π, L s.t. π(L)=i} [ H(ρπ(L) | L, π, ρπ(≤ L−1)) ].
Consider any fixed perfect matching ρ. We are interested in the number of possible choices for ρπ(L)
conditioned on π(L) = i, after ρπ(1), . . . , ρπ(L − 1) have been revealed. Let a_1, a_2, . . . , a_{d_i−1} be such that
the set of neighbors of i in the graph is exactly {ρ(a_1), ρ(a_2), . . . , ρ(a_{d_i−1}), ρ(i)}. The pair π, L in the
expectation can be sampled by first sampling a uniformly random permutation π and then setting L so that
π(L) = i. Thus, the ordering of {a_1, a_2, . . . , a_{d_i−1}, i} induced by π is uniformly random, and

|{ρ(a_1), ρ(a_2), . . . , ρ(a_{d_i−1})} ∩ {ρπ(L − 1), . . . , ρπ(1)}| = |{a_1, a_2, . . . , a_{d_i−1}} ∩ {π(L − 1), . . . , π(1)}|

is equally likely to be 0, 1, 2, . . . , d_i − 1. The number of available choices for ρ(π(L)) is thus equally likely to
be bounded by 1, 2, . . . , d_i. This allows us to bound
(1/n)·H(ρ) = (1/n) \sum_{i=1}^{n} E_{π, L s.t. π(L)=i} [ H(ρπ(L) | L, π, ρπ(≤ L−1)) ]

≤ (1/n) \sum_{i=1}^{n} \sum_{j=1}^{d_i} (1/d_i) log j = (1/n) \sum_{i=1}^{n} log ( (d_i!)^{1/d_i} ) = (1/n) log \prod_{i∈A} (d_i!)^{1/d_i},

which proves that the number of perfect matchings is at most \prod_{i∈A} (d_i!)^{1/d_i}.
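As a sanity check, the following Python sketch brute-forces the number of perfect matchings in small random bipartite graphs and compares it with both the trivial bound \prod_{i∈A} d_i and Brégman's bound; the graph size and edge probability are arbitrary.

```python
import random
from itertools import permutations
from math import factorial, prod

def count_perfect_matchings(n, edges):
    # Brute force: a perfect matching is a bijection A -> B using only edges of G.
    return sum(all((i, sigma[i]) in edges for i in range(n))
               for sigma in permutations(range(n)))

random.seed(1)
for _ in range(50):
    n = 5
    edges = {(a, b) for a in range(n) for b in range(n) if random.random() < 0.6}
    deg = [sum((a, b) in edges for b in range(n)) for a in range(n)]
    if any(d == 0 for d in deg):
        continue                       # no perfect matching can exist; skip
    matchings = count_perfect_matchings(n, edges)
    assert matchings <= prod(deg)                                          # trivial bound
    assert matchings <= prod(factorial(d) ** (1 / d) for d in deg) + 1e-9  # Bregman
print("both bounds held on all samples")
```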
References
[KR10] Swastik Kopparty and Benjamin Rossman. The homomorphism domination exponent. Technical
report, arXiv, April 14, 2010.

[Rad97] Jaikumar Radhakrishnan. An entropy proof of Bregman's theorem. J. Comb. Theory, Ser. A,
77(1):161–164, 1997.