
Lecture 1: Entropy and Data Compression

The fundamental concepts of information theory can be motivated by the problem of data compression. Suppose that we have a countable set M of messages and that we want to transmit a sequence of b messages m1, m2, ..., mb, where the messages mi are drawn IID according to a probability distribution P on M. The main theorem we consider in this section is the following (stated somewhat informally).

Shannon’s Source Coding Theorem [Informal Version] In the limit as the block size goes to infinity, the number of bits required per message in the block is exactly the entropy H(P) of P, defined as follows.

H(P) = E_{m∼P}[log2(1/P(m))]

As a simple example suppose that P is the uniform distribution on 2^k messages. In this case we have that H(P) = k. As another example suppose that P assigns nonzero weight to 2^k messages but half of the weight is on a single message. In that case we can use the bit string 0 to represent the common message and codes of length k + 1, each starting with the bit 1, for all other messages. In this case the average number of bits per message is (1/2)·1 + (1/2)·(k + 1) = 1 + k/2. In general, if the distribution is non-uniform we get greater compression by assigning fewer bits to more common messages.
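
As a quick sanity check on this example, the following minimal Python sketch computes H(P) and the average code length 1 + k/2. The helper name entropy, and the assumption that the remaining half of the mass is spread uniformly over the other 2^k − 1 messages, are our own choices, not part of the lecture.

    import math

    def entropy(p):
        """H(P) = sum over m of P(m) * log2(1/P(m)), ignoring zero-probability messages."""
        return sum(q * math.log2(1.0 / q) for q in p.values() if q > 0)

    k = 4
    # Half of the mass on one common message; for concreteness the other half is
    # assumed to be spread uniformly over the remaining 2^k - 1 messages.
    p = {0: 0.5}
    p.update({i: 0.5 / (2**k - 1) for i in range(1, 2**k)})

    avg_bits = 0.5 * 1 + 0.5 * (k + 1)   # 1 bit for the common message, k+1 bits otherwise
    print(entropy(p), avg_bits)          # the entropy is slightly below 1 + k/2
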
We now state the main theorem more precisely. For a countable set M of messages define a code for M to be an assignment c(m) of a bit string (code word) to each m ∈ M with the property that if m ≠ m′ then c(m) ≠ c(m′). We will also consider coding a long sequence of messages by coding each message individually. In order for this to work one must be able to tell when one code word ends and the next begins. This can be done provided that no two distinct code words have the property that one is a prefix of the other.

Definition 1 A code is called prefix-free if all code words are distinct and no code word is a prefix of any other code word.

As an example of a prefix-free code we can consider the set of all null-terminated byte (or character) strings. In this case every code word is a certain byte string followed by the null byte, and hence the length of each code word is a multiple of 8.
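
The prefix-free condition of Definition 1 is easy to check mechanically. Here is a small sketch (the function name is_prefix_free and the example codes are our own) that tests a dictionary of code words directly.

    def is_prefix_free(code):
        """Return True if all code words are distinct and none is a prefix of another."""
        words = list(code.values())
        if len(set(words)) != len(words):
            return False
        for i, w in enumerate(words):
            for j, v in enumerate(words):
                if i != j and v.startswith(w):
                    return False
        return True

    print(is_prefix_free({"a": "0", "b": "10", "c": "11"}))   # True
    print(is_prefix_free({"a": "0", "b": "01", "c": "11"}))   # False: "0" is a prefix of "01"
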
We can now state our main theorem more precisely. Let |c(m)| be the length of the code word c(m) (the number of bits used in c(m)), let C(M) be the set of all prefix-free codes on M, and let C*(P) be defined as follows.

C*(P) = inf_{c∈C(M)} E_{m∼P}[|c(m)|]

Let b be a positive integer block size. Let M^b be the set of all tuples ⟨m1, ..., mb⟩ with mi ∈ M. Let P^b be the probability distribution on M^b in which each mi is selected independently with distribution P (we say that the mi are drawn IID, for independent and identically distributed). The above definition implies the following.

C*(P^b) = inf_{c∈C(M^b)} E_{⟨m1,...,mb⟩∼P^b}[|c(⟨m1, ..., mb⟩)|]

Note that a code for M^b codes b messages from M, so to get the number of bits per message we should divide the length of a code word by b. We can now state the main theorem as follows.

Theorem 1 (Shannon’s Source Coding Theorem)

lim_{b→∞} (1/b) C*(P^b) = H(P)

To prove this theorem we start with the following.

Lemma 2 (Kraft Inequality) For any prefix-free code c we have the following.

Σ_{m∈M} 2^{-|c(m)|} ≤ 1

Proof: Suppose we generate a bit string by repeatedly flipping an unbiased coin and stopping as soon as the bit string we have generated is a code word. The probability of stopping with code word c(m) is exactly 2^{-|c(m)|}. Because the code is prefix-free these stopping events are disjoint, so the Kraft inequality follows from the fact that their probabilities sum to at most 1 (there can be some nonzero probability that we miss all the code words and never stop the process).
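
Both the Kraft sum and the stopping probability used in this proof are easy to compute for a concrete code. The sketch below is purely illustrative; the example code and the function names are our own.

    import random

    def kraft_sum(code):
        """Sum of 2^(-|c(m)|) over all code words; at most 1 for a prefix-free code."""
        return sum(2.0 ** -len(w) for w in code.values())

    def stop_probability(code, trials=100_000):
        """Estimate, by simulation, the probability that flipping fair coins eventually
        produces a code word, as in the proof of Lemma 2."""
        words = set(code.values())
        max_len = max(len(w) for w in words)
        hits = 0
        for _ in range(trials):
            s = ""
            for _ in range(max_len):
                s += random.choice("01")
                if s in words:
                    hits += 1
                    break
        return hits / trials

    code = {"a": "0", "b": "10", "c": "110"}        # prefix-free; Kraft sum is 7/8
    print(kraft_sum(code), stop_probability(code))  # roughly 0.875 for both
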
Theorem 3 Let ℓ be an assignment of a positive integer ℓ(m) to each m ∈ M satisfying the Kraft inequality:

Σ_{m∈M} 2^{-ℓ(m)} ≤ 1

For any such assignment of lengths there exists a prefix-free code c with |c(m)| = ℓ(m).

Proof: Arrange the messages m in a sequence m1, m2, m3, ... of nondecreasing length according to the assignment ℓ, i.e., such that ℓ(m_{i+1}) ≥ ℓ(m_i). Pick code words in this order subject to the constraint that a selected code word may not have any previously selected code word as a prefix. We view a code word c as having probability mass 2^{-|c|}. When a code word c(m) is selected, the amount of probability mass that becomes unavailable for future selections is 2^{-ℓ(m)}. By the Kraft inequality, when it is time to select a code word for m the mass already made unavailable is at most 1 − 2^{-ℓ(m)}, so at least 2^{-ℓ(m)} of mass remains. Furthermore, every previously selected code word has length at most ℓ(m), so the unavailable mass consists of whole blocks of mass 2^{-ℓ(m)}, each corresponding to a single string of length ℓ(m); since some mass remains, some string of length ℓ(m) is still available and can be selected as c(m).
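
The construction in this proof can be carried out mechanically. The sketch below is one standard way to do it (a canonical code that assigns code words in order of nondecreasing length); the function name and the example lengths are our own.

    def canonical_code(lengths):
        """Build a prefix-free code from lengths that satisfy the Kraft inequality.

        lengths: dict mapping each message to a positive integer code length.
        """
        assert sum(2.0 ** -l for l in lengths.values()) <= 1.0, "Kraft inequality violated"
        code = {}
        value = 0          # next available code word, read as an integer
        prev_len = 0
        # Assign code words in order of nondecreasing length, as in the proof.
        for m, l in sorted(lengths.items(), key=lambda kv: kv[1]):
            value <<= (l - prev_len)            # pad to the new length
            code[m] = format(value, "0{}b".format(l))
            value += 1                          # skip past everything with this prefix
            prev_len = l
        return code

    print(canonical_code({"a": 1, "b": 2, "c": 3, "d": 3}))
    # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
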

Theorem 4 For any probability distribution P on a countable set of messages M there exists a prefix-free code c assigning a code word c(m) to each m ∈ M and satisfying the following.

E_{m∼P}[|c(m)|] ≤ H(P) + 1

Proof: Let ℓ(m) be ⌈log2(1/P(m))⌉. We have the following.

Σ_{m∈M} 2^{-ℓ(m)} = Σ_{m∈M} 2^{-⌈log2(1/P(m))⌉}
                  ≤ Σ_{m∈M} P(m)
                  = 1

Therefore by Theorem 3 there exists a prefix-free code c with |c(m)| = ℓ(m), and hence we have the following.

|c(m)| ≤ log2(1/P(m)) + 1

E_{m∼P}[|c(m)|] ≤ E_{m∼P}[log2(1/P(m)) + 1]
               = E_{m∼P}[log2(1/P(m))] + E_{m∼P}[1]
               = H(P) + 1
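
To see the bound of Theorem 4 numerically, the following sketch (with an arbitrarily chosen distribution) computes the Shannon code lengths ℓ(m) = ⌈log2(1/P(m))⌉, checks the Kraft inequality, and compares the expected length with H(P) + 1. These lengths could then be turned into actual code words with the construction sketched after Theorem 3.

    import math

    def shannon_lengths(p):
        """Code lengths l(m) = ceil(log2(1/P(m))), as in the proof of Theorem 4."""
        return {m: math.ceil(math.log2(1.0 / q)) for m, q in p.items() if q > 0}

    p = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
    lengths = shannon_lengths(p)
    H = sum(q * math.log2(1.0 / q) for q in p.values())
    expected_len = sum(p[m] * l for m, l in lengths.items())

    print(sum(2.0 ** -l for l in lengths.values()) <= 1.0)  # Kraft inequality holds
    print(H, expected_len, expected_len <= H + 1)           # E[|c(m)|] <= H(P) + 1
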
Now consider the distribution P^b on the message blocks M^b. The expected number of bits per message when coding blocks with a Shannon code for P^b is the following.

(1/b) E_{⟨m1,...,mb⟩∼P^b}[|c(⟨m1, ..., mb⟩)|] ≤ (1/b)(H(P^b) + 1)
                                             = H(P) + 1/b

The last step uses the identity H(P^b) = b·H(P), which holds because the messages in a block are independent, so log2(1/P^b(⟨m1, ..., mb⟩)) is the sum of the b terms log2(1/P(mi)). This implies that with sufficiently long blocks the number of bits per message can be made arbitrarily close to H(P). In the next section we prove that no prefix-free code can achieve fewer than H(P) bits per message in expectation.
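
The identity H(P^b) = b·H(P) is also easy to confirm numerically; the sketch below (with an arbitrarily chosen P and a small block size b) enumerates all blocks explicitly.

    import itertools, math

    def entropy(p):
        return sum(q * math.log2(1.0 / q) for q in p.values() if q > 0)

    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    b = 3
    # P^b assigns each block of b messages the product of the individual probabilities.
    p_block = {block: math.prod(p[m] for m in block)
               for block in itertools.product(p, repeat=b)}

    print(entropy(p_block), b * entropy(p))   # the two values agree: H(P^b) = b * H(P)
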

1 Cross Entropy and KL Divergence


We will now prove the other half of the Shannon source coding theorem by applying Jensen’s inequality to the KL divergence. We will have to consider functions g(x) which can take the value infinity. When g can be infinite we define the expectation of g as follows.

E_{m∼P}[g(m)] = Σ_{m∈M, P(m)>0} P(m) g(m)

Under this definition, if g(m) is infinite only when P(m) is zero then the expectation ignores the infinite values of g and is still finite (for finite M).

Now we assume, without loss of generality, that M contains a special element ⊥ with P(⊥) = 0 (we can always add such an element if none exists). Without loss of generality we can also restrict our attention to prefix-free codes c that assign code words only to the elements of M other than ⊥. For any such code c we can define P_c(m) as follows.

P_c(m) = 2^{-|c(m)|}  for m ≠ ⊥

P_c(⊥) = 1 − Σ_{m≠⊥} P_c(m)

The Kraft inequality implies that P_c(⊥) as defined above is non-negative. By introducing ⊥ we have arranged that P_c is a probability distribution on M (it sums to one). Although we are particularly interested in distributions of the form P_c for some code c, we now consider an arbitrary distribution Q on M and define the cross entropy H(P, Q) as follows.

H(P, Q) = E_{m∼P}[log2(1/Q(m))]

Note that messages with zero probability under P (such as ⊥) are ignored
in this definition. If Q assigns zero probability to a message that has nonzero
probability under P then H(P, Q) is infinite. We now have the following.

E_{m∼P}[|c(m)|] = H(P, P_c)
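
This identity is easy to confirm on a small example. In the sketch below the code, the distribution P, and the name bot (standing in for ⊥) are our own choices; the code builds P_c as defined above and checks that the expected code length equals H(P, P_c).

    import math

    def cross_entropy(p, q):
        """H(P, Q) = E_{m~P}[log2(1/Q(m))]; messages with P(m) = 0 are ignored."""
        return sum(pm * math.log2(1.0 / q[m]) for m, pm in p.items() if pm > 0)

    # A prefix-free code for three messages plus a dummy message "bot" with P(bot) = 0.
    code = {"a": "0", "b": "10", "c": "11"}
    p = {"a": 0.5, "b": 0.25, "c": 0.25, "bot": 0.0}

    # P_c as defined above: 2^(-|c(m)|) for coded messages, remaining mass on bot.
    p_c = {m: 2.0 ** -len(w) for m, w in code.items()}
    p_c["bot"] = 1.0 - sum(p_c.values())

    expected_len = sum(p[m] * len(code[m]) for m in code)
    print(expected_len, cross_entropy(p, p_c))   # equal: E_{m~P}[|c(m)|] = H(P, P_c)
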

In general H(P, Q) can be viewed as the expected code length when we select messages according to P but use a code that is optimal for Q. If Q is different from P then we are using the wrong code, so we should expect that H(P, Q) is at least H(P). We will now prove this as a consequence of Jensen’s inequality. First we define the Kullback-Leibler divergence KL(P, Q) as follows.

KL(P, Q) = H(P, Q) − H(P)

It now suffices to show that KL(P, Q) ≥ 0. We can do this as follows, where the first inequality is derived from Jensen’s inequality and the fact that the function f(x) = −log2(x) is convex.

KL(P, Q) = H(P, Q) − H(P)
         = E_{m∼P}[log2(1/Q(m))] − E_{m∼P}[log2(1/P(m))]
         = E_{m∼P}[log2(1/Q(m)) − log2(1/P(m))]
         = E_{m∼P}[log2(P(m)/Q(m))]
         = E_{m∼P}[−log2(Q(m)/P(m))]
         ≥ −log2(E_{m∼P}[Q(m)/P(m)])
         = −log2(Σ_{m∈M, P(m)>0} P(m)·Q(m)/P(m))
         = −log2(Σ_{m∈M, P(m)>0} Q(m))
         ≥ 0

We have now proved that H(P, P_c) ≥ H(P) for any prefix-free code c, which shows that any prefix-free code must use at least H(P) bits per message in expectation. Kullback-Leibler divergence is one of the central concepts of information theory. Intuitively, KL(P, Q) measures the degree to which P and Q are different. But in general KL(P, Q) does not equal KL(Q, P). Note that if U is the uniform distribution on a finite set of size k then H(P, U) = H(U) = log2 k, while H(U, P) = ∞ whenever there exists a value ⊥ with P(⊥) = 0.
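
The sketch below (with distributions chosen only for illustration) computes KL(P, Q) directly from its definition as an expectation; it shows a nonnegative, asymmetric divergence and reproduces the H(U, P) = ∞ behavior noted above.

    import math

    def kl(p, q):
        """KL(P, Q) = H(P, Q) - H(P); infinite if Q(m) = 0 while P(m) > 0."""
        total = 0.0
        for m, pm in p.items():
            if pm > 0:
                if q.get(m, 0.0) == 0.0:
                    return math.inf
                total += pm * math.log2(pm / q[m])
        return total

    p = {"a": 0.7, "b": 0.3, "c": 0.0}
    u = {"a": 1/3, "b": 1/3, "c": 1/3}   # uniform on a set of size 3

    print(kl(p, u), kl(u, p))   # both >= 0, but not equal: KL is not symmetric
    # kl(u, p) is infinite because p assigns zero probability to "c" while u does not.
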

2 Mutual Information

For two random variables X and Y we define the conditional entropy H(Y|X) as follows.

H(Y|X) = E[log2(1/P(y|x))]
       = Σ_x P(x) Σ_y P(y|x) log2(1/P(y|x))
       = Σ_x P(x) H(Y|x)

In the last line above H(Y|x) is a quantity that is different for different values of x, while H(Y|X) is a fixed quantity that does not depend on any particular value of X or Y.

Definition 2 The mutual information between X and Y, denoted I(X, Y), is defined as follows.

I(X, Y) = H(X) + H(Y) − H(⟨X, Y⟩)

If X and Y are dependent (not independent) then H(⟨X, Y⟩) is strictly smaller than H(X) + H(Y).

Lemma 5

I(X, Y) = H(X) − H(X|Y)
        = H(Y) − H(Y|X)

Proof:

I(X, Y) = H(X) + H(Y) − H(⟨X, Y⟩)
        = H(X) + E[log2(1/P(y))] − E[log2(1/P(⟨x, y⟩))]
        = H(X) + E[log2(1/P(y)) − log2(1/P(⟨x, y⟩))]
        = H(X) + E[log2(P(⟨x, y⟩)/P(y))]
        = H(X) + E[log2 P(x|y)]
        = H(X) − E[log2(1/P(x|y))]
        = H(X) − H(X|Y)

The second equality of the lemma follows by the same argument with the roles of X and Y exchanged.
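
Both forms of I(X, Y) can be checked numerically from a small joint distribution. In the sketch below the joint table is arbitrary; the code computes the marginals, the conditional entropy H(X|Y), and the mutual information both via Definition 2 and via Lemma 5.

    import math

    def H(dist):
        """Entropy of a distribution given as {outcome: probability}."""
        return sum(q * math.log2(1.0 / q) for q in dist.values() if q > 0)

    # An arbitrary joint distribution over pairs (x, y); the values sum to 1.
    joint = {("0", "0"): 0.4, ("0", "1"): 0.1,
             ("1", "0"): 0.1, ("1", "1"): 0.4}

    px, py = {}, {}
    for (x, y), q in joint.items():
        px[x] = px.get(x, 0.0) + q
        py[y] = py.get(y, 0.0) + q

    # H(X|Y) = sum over y of P(y) * H(X | Y = y), from the conditional distributions.
    H_x_given_y = sum(py[y] * H({x: joint[(x, y)] / py[y] for x in px}) for y in py)

    I_def   = H(px) + H(py) - H(joint)    # Definition 2
    I_lemma = H(px) - H_x_given_y         # Lemma 5
    print(I_def, I_lemma)                 # the two values agree
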
3 Problems
1. Let the set of messages M be the positive integers 1, 2, 3, .... Suppose we pick an integer by flipping an unbiased coin and stopping as soon as we get the first heads, and we then output the number of flips. This gives P(i) = (1/2)^i. Give a prefix-free code c(i) with |c(i)| = log2(1/P(i)). What is the entropy
of this distribution?
2. Suppose that there is a popular 10 megapixel digital camera where
a pixel is represented by 16 bits. Consider the probability distribution over
images to be taken by this model of camera over the next year (we call these
“natural” images). Give an upper bound on the entropy of this distribution.
Ignore the question of whether the distribution of natural images is really
well defined.
3. Suppose that we use rendering software to construct images of solid
models where a model consists of a set of objects each at a certain configura-
tion and under certain lighting conditions and from a certain camera position.
Suppose that the models are generated by kids using modeling software on
the web where each model must fit in a single ten kilobyte message. Consider
the probability distribution over the images rendered in this process. Give an
upper bound on the entropy of this distribution. (Again ignore the question
of whether the distribution is well defined).
4. Consider climate simulation software that samples future weather patterns by using a random number generator to add noise to the process of weather formation. If the program always starts from the same current state but uses a 32 bit random seed (selected uniformly from all such seeds), what is the entropy of the probability distribution over future weather patterns produced by this program, assuming that each different random seed produces a different weather pattern?
5. Use the fact that KL(P, Q) ≥ 0 to prove that the expected length of any prefix-free code is at least H(p), i.e.:

Σ_x p(x) ℓ(x) ≥ H(p)

Hint: Use the Kraft inequality and set q(x) = 2^{-ℓ(x)}. Make sure you add a “dummy” element ⊥ to the message set so that q(x) is a valid distribution.
6. Show that H(⟨X, Y⟩) = H(X) + H(Y|X).
7. Show that I(X, Y) = KL(P_⟨X,Y⟩, P_X P_Y), where P_⟨X,Y⟩ is the joint distribution on pairs ⟨x, y⟩ determined by the random variables X and Y, and P_X P_Y is the distribution on pairs ⟨x, y⟩ given by P(⟨x, y⟩) = P_X(x) P_Y(y).
8. Explain why problem 7 implies that I(X, Y) ≥ 0.
9. Explain why problem 8 implies that H(Y|X) ≤ H(Y).
