
Lecture 1: Entropy and Data Compression

The fundamental concepts of information theory can be motivated by the problem of data compression. Suppose that we have a countable set M of messages and that we want to transmit a sequence of b messages m1, m2, ..., mb, where the messages mi are drawn IID according to a probability distribution P on M. The main theorem we consider in this section is the following (stated somewhat informally).

Shannon’s Source Coding Theorem [Informal Version] In the limit as the block size goes to infinity, the number of bits required per message in the block is exactly the entropy H(P) of P, defined as follows.

H(P) = E_{m∼P}[log2(1/P(m))]

As a simple example suppose that P is the uniform distribution on 2^k messages. In this case we have that H(P) = k. As another example suppose that P assigns nonzero weight to 2^k messages but half of the weight is on a single message. In that case we can use the bit string 0 to represent the common message and codes of length k + 1, each starting with the bit 1, for all other messages. In this case the average number of bits per message is (1/2)·1 + (1/2)·(k + 1) = 1 + k/2. In general, if the distribution is non-uniform we get greater compression by assigning fewer bits to more common messages.
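
As a quick sanity check on this example, the following minimal Python sketch computes H(P) and the average code length 1 + k/2. The helper name entropy, and the assumption that the remaining half of the mass is spread uniformly over the other 2^k − 1 messages, are our own choices, not part of the lecture.

    import math

    def entropy(p):
        """H(P) = sum over m of P(m) * log2(1/P(m)), ignoring zero-probability messages."""
        return sum(q * math.log2(1.0 / q) for q in p.values() if q > 0)

    k = 4
    # Half of the mass on one common message; for concreteness the other half is
    # assumed to be spread uniformly over the remaining 2^k - 1 messages.
    p = {0: 0.5}
    p.update({i: 0.5 / (2**k - 1) for i in range(1, 2**k)})

    avg_bits = 0.5 * 1 + 0.5 * (k + 1)   # 1 bit for the common message, k+1 bits otherwise
    print(entropy(p), avg_bits)          # the entropy is slightly below 1 + k/2
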
We now state the main theorem more precisely. For a countable set M of messages define a code for M to be an assignment c(m) of a bit string (code word) to each m ∈ M with the property that if m ≠ m′ then c(m) ≠ c(m′). We will also consider coding a long sequence of messages by coding each message individually. In order for this to work one must be able to tell when one code word ends and the next begins. This can be done provided that no two distinct code words have the property that one is a prefix of the other.

Definition 1 A code is called prefix-free if all code words are distinct and no code word is a prefix of any other code word.

As an example of a prefix-free code we can consider the set of all null-terminated byte (or character) strings. In this case every code word is a certain byte string followed by the null byte, and hence the length of each code word is a multiple of 8.
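
The prefix-free condition of Definition 1 is easy to check mechanically. Here is a small sketch (the function name is_prefix_free and the example codes are our own) that tests a dictionary of code words directly.

    def is_prefix_free(code):
        """Return True if all code words are distinct and none is a prefix of another."""
        words = list(code.values())
        if len(set(words)) != len(words):
            return False
        for i, w in enumerate(words):
            for j, v in enumerate(words):
                if i != j and v.startswith(w):
                    return False
        return True

    print(is_prefix_free({"a": "0", "b": "10", "c": "11"}))   # True
    print(is_prefix_free({"a": "0", "b": "01", "c": "11"}))   # False: "0" is a prefix of "01"
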
We can now state our main theorem more precisely. Let |c(m)| be the length of the code word c(m) (the number of bits used in c(m)), let C(M) be the set of all prefix-free codes on M, and let C*(P) be defined as follows.

C*(P) = inf_{c∈C(M)} E_{m∼P}[|c(m)|]

Let b be a positive integer block size. Let M^b be the set of all tuples ⟨m1, ..., mb⟩ with mi ∈ M. Let P^b be the probability distribution on M^b in which each mi is selected independently with distribution P (we say that the mi are drawn IID, for independent and identically distributed). The above definition implies the following.

C*(P^b) = inf_{c∈C(M^b)} E_{⟨m1,...,mb⟩∼P^b}[|c(⟨m1, ..., mb⟩)|]

Note that a code for M^b codes b messages from M, so to get the number of bits per message we should divide the length of a code word by b. We can now state the main theorem as follows.

Theorem 1 (Shannon’s Source Coding Theorem)

lim_{b→∞} (1/b) C*(P^b) = H(P)

To prove this theorem we start with the following.

Lemma 2 (Kraft Inequality) For any prefix-free code c we have the following.

Σ_{m∈M} 2^{-|c(m)|} ≤ 1

Proof: Suppose we generate a bit string by repeatedly flipping an unbiased coin and stopping as soon as the bit string we have generated is a code word. The probability of stopping with code word c(m) is exactly 2^{-|c(m)|}. Because the code is prefix-free these stopping events are disjoint, so the Kraft inequality follows from the fact that their probabilities sum to at most 1 (there can be some nonzero probability that we miss all the code words and never stop the process).
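
Both the Kraft sum and the stopping probability used in this proof are easy to compute for a concrete code. The sketch below is purely illustrative; the example code and the function names are our own.

    import random

    def kraft_sum(code):
        """Sum of 2^(-|c(m)|) over all code words; at most 1 for a prefix-free code."""
        return sum(2.0 ** -len(w) for w in code.values())

    def stop_probability(code, trials=100_000):
        """Estimate, by simulation, the probability that flipping fair coins eventually
        produces a code word, as in the proof of Lemma 2."""
        words = set(code.values())
        max_len = max(len(w) for w in words)
        hits = 0
        for _ in range(trials):
            s = ""
            for _ in range(max_len):
                s += random.choice("01")
                if s in words:
                    hits += 1
                    break
        return hits / trials

    code = {"a": "0", "b": "10", "c": "110"}        # prefix-free; Kraft sum is 7/8
    print(kraft_sum(code), stop_probability(code))  # roughly 0.875 for both
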
Theorem 3 Let ℓ be an assignment of a positive integer ℓ(m) to each m ∈ M satisfying the Kraft inequality:

Σ_{m∈M} 2^{-ℓ(m)} ≤ 1

For any such assignment of lengths there exists a prefix-free code c with |c(m)| = ℓ(m).

Proof: Arrange the messages m in a sequence m1, m2, m3, ... of nondecreasing length according to the assignment ℓ, i.e., such that ℓ(m_{i+1}) ≥ ℓ(m_i). Pick code words in this order subject to the constraint that a selected code word may not have any previously selected code word as a prefix. We view a code word c as having probability mass 2^{-|c|}. When a code word c(m) is selected, the amount of probability mass that becomes unavailable for future selections is 2^{-ℓ(m)}. By the Kraft inequality, when it is time to select a code word for m the mass already made unavailable is at most 1 − 2^{-ℓ(m)}, so at least 2^{-ℓ(m)} of mass remains. Furthermore, every previously selected code word has length at most ℓ(m), so the unavailable mass consists of whole blocks of mass 2^{-ℓ(m)}, each corresponding to a single string of length ℓ(m); since some mass remains, some string of length ℓ(m) is still available and can be selected as c(m).
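
The construction in this proof can be carried out mechanically. The sketch below is one standard way to do it (a canonical code that assigns code words in order of nondecreasing length); the function name and the example lengths are our own.

    def canonical_code(lengths):
        """Build a prefix-free code from lengths that satisfy the Kraft inequality.

        lengths: dict mapping each message to a positive integer code length.
        """
        assert sum(2.0 ** -l for l in lengths.values()) <= 1.0, "Kraft inequality violated"
        code = {}
        value = 0          # next available code word, read as an integer
        prev_len = 0
        # Assign code words in order of nondecreasing length, as in the proof.
        for m, l in sorted(lengths.items(), key=lambda kv: kv[1]):
            value <<= (l - prev_len)            # pad to the new length
            code[m] = format(value, "0{}b".format(l))
            value += 1                          # skip past everything with this prefix
            prev_len = l
        return code

    print(canonical_code({"a": 1, "b": 2, "c": 3, "d": 3}))
    # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
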

Theorem 4 For any probability distribution P on a countable set of messages M there exists a prefix-free code c assigning a code word c(m) to each m ∈ M and satisfying the following.

E_{m∼P}[|c(m)|] ≤ H(P) + 1

Proof: Let ℓ(m) be ⌈log2(1/P(m))⌉. We have the following.

Σ_{m∈M} 2^{-ℓ(m)} = Σ_{m∈M} 2^{-⌈log2(1/P(m))⌉}
                  ≤ Σ_{m∈M} P(m)
                  = 1

Therefore by Theorem 3 there exists a prefix-free code c with |c(m)| = ℓ(m), and hence we have the following.

|c(m)| ≤ log2(1/P(m)) + 1

E_{m∼P}[|c(m)|] ≤ E_{m∼P}[log2(1/P(m)) + 1]
               = E_{m∼P}[log2(1/P(m))] + E_{m∼P}[1]
               = H(P) + 1
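
To see the bound of Theorem 4 numerically, the following sketch (with an arbitrarily chosen distribution) computes the Shannon code lengths ℓ(m) = ⌈log2(1/P(m))⌉, checks the Kraft inequality, and compares the expected length with H(P) + 1. These lengths could then be turned into actual code words with the construction sketched after Theorem 3.

    import math

    def shannon_lengths(p):
        """Code lengths l(m) = ceil(log2(1/P(m))), as in the proof of Theorem 4."""
        return {m: math.ceil(math.log2(1.0 / q)) for m, q in p.items() if q > 0}

    p = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
    lengths = shannon_lengths(p)
    H = sum(q * math.log2(1.0 / q) for q in p.values())
    expected_len = sum(p[m] * l for m, l in lengths.items())

    print(sum(2.0 ** -l for l in lengths.values()) <= 1.0)  # Kraft inequality holds
    print(H, expected_len, expected_len <= H + 1)           # E[|c(m)|] <= H(P) + 1
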
Now consider the distribution P^b on the message blocks M^b. The expected number of bits per message when coding blocks with a Shannon code for P^b is the following.

(1/b) E_{⟨m1,...,mb⟩∼P^b}[|c(⟨m1, ..., mb⟩)|] ≤ (1/b)(H(P^b) + 1)
                                             = H(P) + 1/b

The last step uses the identity H(P^b) = b·H(P), which holds because the messages in a block are independent, so log2(1/P^b(⟨m1, ..., mb⟩)) is the sum of the b terms log2(1/P(mi)). This implies that with sufficiently long blocks the number of bits per message can be made arbitrarily close to H(P). In the next section we prove that no prefix-free code can achieve fewer than H(P) bits per message in expectation.
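
The identity H(P^b) = b·H(P) is also easy to confirm numerically; the sketch below (with an arbitrarily chosen P and a small block size b) enumerates all blocks explicitly.

    import itertools, math

    def entropy(p):
        return sum(q * math.log2(1.0 / q) for q in p.values() if q > 0)

    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    b = 3
    # P^b assigns each block of b messages the product of the individual probabilities.
    p_block = {block: math.prod(p[m] for m in block)
               for block in itertools.product(p, repeat=b)}

    print(entropy(p_block), b * entropy(p))   # the two values agree: H(P^b) = b * H(P)
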

1 Cross Entropy and KL Divergence


We will now prove the other half of the Shannon source coding theorem by applying Jensen’s inequality to the KL divergence. We will have to consider functions g(x) which can take the value infinity. When g can be infinite we define the expectation of g as follows.

E_{m∼P}[g(m)] = Σ_{m∈M, P(m)>0} P(m) g(m)

Under this definition, if g(m) is infinite only when P(m) is zero then the expectation ignores the infinite values of g and is still finite (for finite M).

Now we assume, without loss of generality, that M contains a special element ⊥ with P(⊥) = 0 (we can always add such an element if none exists). Without loss of generality we can also restrict our attention to prefix-free codes c that assign code words only to the elements of M other than ⊥. For any such code c we can define P_c(m) as follows.

P_c(m) = 2^{-|c(m)|}  for m ≠ ⊥

P_c(⊥) = 1 − Σ_{m≠⊥} P_c(m)

The Kraft inequality implies that P_c(⊥) as defined above is non-negative. By introducing ⊥ we have arranged that P_c is a probability distribution on M (it sums to one). Although we are particularly interested in distributions of the form P_c for some code c, we now consider an arbitrary distribution Q on M and define the cross entropy H(P, Q) as follows.

H(P, Q) = E_{m∼P}[log2(1/Q(m))]

Note that messages with zero probability under P (such as ⊥) are ignored
in this definition. If Q assigns zero probability to a message that has nonzero
probability under P then H(P, Q) is infinite. We now have the following.

E_{m∼P}[|c(m)|] = H(P, P_c)
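
This identity is easy to confirm on a small example. In the sketch below the code, the distribution P, and the name bot (standing in for ⊥) are our own choices; the code builds P_c as defined above and checks that the expected code length equals H(P, P_c).

    import math

    def cross_entropy(p, q):
        """H(P, Q) = E_{m~P}[log2(1/Q(m))]; messages with P(m) = 0 are ignored."""
        return sum(pm * math.log2(1.0 / q[m]) for m, pm in p.items() if pm > 0)

    # A prefix-free code for three messages plus a dummy message "bot" with P(bot) = 0.
    code = {"a": "0", "b": "10", "c": "11"}
    p = {"a": 0.5, "b": 0.25, "c": 0.25, "bot": 0.0}

    # P_c as defined above: 2^(-|c(m)|) for coded messages, remaining mass on bot.
    p_c = {m: 2.0 ** -len(w) for m, w in code.items()}
    p_c["bot"] = 1.0 - sum(p_c.values())

    expected_len = sum(p[m] * len(code[m]) for m in code)
    print(expected_len, cross_entropy(p, p_c))   # equal: E_{m~P}[|c(m)|] = H(P, P_c)
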

In general H(P, Q) can be viewed as the expected code length when we select messages according to P but use a code that is optimal for Q. If Q is different from P then we are using the wrong code, so we should expect that H(P, Q) is at least H(P). We will now prove this as a consequence of Jensen’s inequality. First we define the Kullback-Leibler divergence KL(P, Q) as follows.

KL(P, Q) = H(P, Q) − H(P)

It now suffices to show that KL(P, Q) ≥ 0. We can do this as follows, where the first inequality is derived from Jensen’s inequality and the fact that the function f(x) = −log2(x) is convex.

KL(P, Q) = H(P, Q) − H(P)
         = E_{m∼P}[log2(1/Q(m))] − E_{m∼P}[log2(1/P(m))]
         = E_{m∼P}[log2(1/Q(m)) − log2(1/P(m))]
         = E_{m∼P}[log2(P(m)/Q(m))]
         = E_{m∼P}[−log2(Q(m)/P(m))]
         ≥ −log2(E_{m∼P}[Q(m)/P(m)])
         = −log2(Σ_{m∈M, P(m)>0} P(m)·Q(m)/P(m))
         = −log2(Σ_{m∈M, P(m)>0} Q(m))
         ≥ 0

We have now proved that H(P, P_c) ≥ H(P) for any prefix-free code c, which shows that any prefix-free code must use at least H(P) bits per message in expectation. Kullback-Leibler divergence is one of the central concepts of information theory. Intuitively, KL(P, Q) measures the degree to which P and Q are different. But in general KL(P, Q) does not equal KL(Q, P). Note that if U is the uniform distribution on a finite set of size k then H(P, U) = H(U) = log2 k, while H(U, P) = ∞ whenever there exists a value ⊥ with P(⊥) = 0.
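
The sketch below (with distributions chosen only for illustration) computes KL(P, Q) directly from its definition as an expectation; it shows a nonnegative, asymmetric divergence and reproduces the H(U, P) = ∞ behavior noted above.

    import math

    def kl(p, q):
        """KL(P, Q) = H(P, Q) - H(P); infinite if Q(m) = 0 while P(m) > 0."""
        total = 0.0
        for m, pm in p.items():
            if pm > 0:
                if q.get(m, 0.0) == 0.0:
                    return math.inf
                total += pm * math.log2(pm / q[m])
        return total

    p = {"a": 0.7, "b": 0.3, "c": 0.0}
    u = {"a": 1/3, "b": 1/3, "c": 1/3}   # uniform on a set of size 3

    print(kl(p, u), kl(u, p))   # both >= 0, but not equal: KL is not symmetric
    # kl(u, p) is infinite because p assigns zero probability to "c" while u does not.
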

2 Mutual Information

For two random variables X and Y we define the conditional entropy H(Y|X) as follows.

H(Y|X) = E[log2(1/P(y|x))]
       = Σ_x P(x) Σ_y P(y|x) log2(1/P(y|x))
       = Σ_x P(x) H(Y|x)

In the last line above H(Y|x) is a quantity that is different for different values of x, while H(Y|X) is a fixed quantity that does not depend on any particular value of X or Y.

Definition 2 The mutual information between X and Y, denoted I(X, Y), is defined as follows.

I(X, Y) = H(X) + H(Y) − H(⟨X, Y⟩)

If X and Y are dependent (not independent) then H(⟨X, Y⟩) is strictly smaller than H(X) + H(Y).

Lemma 5

I(X, Y) = H(X) − H(X|Y)
        = H(Y) − H(Y|X)

Proof:

I(X, Y) = H(X) + H(Y) − H(⟨X, Y⟩)
        = H(X) + E[log2(1/P(y))] − E[log2(1/P(⟨x, y⟩))]
        = H(X) + E[log2(1/P(y)) − log2(1/P(⟨x, y⟩))]
        = H(X) + E[log2(P(⟨x, y⟩)/P(y))]
        = H(X) + E[log2 P(x|y)]
        = H(X) − E[log2(1/P(x|y))]
        = H(X) − H(X|Y)

The second equality of the lemma follows by the same argument with the roles of X and Y exchanged.
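
Both forms of I(X, Y) can be checked numerically from a small joint distribution. In the sketch below the joint table is arbitrary; the code computes the marginals, the conditional entropy H(X|Y), and the mutual information both via Definition 2 and via Lemma 5.

    import math

    def H(dist):
        """Entropy of a distribution given as {outcome: probability}."""
        return sum(q * math.log2(1.0 / q) for q in dist.values() if q > 0)

    # An arbitrary joint distribution over pairs (x, y); the values sum to 1.
    joint = {("0", "0"): 0.4, ("0", "1"): 0.1,
             ("1", "0"): 0.1, ("1", "1"): 0.4}

    px, py = {}, {}
    for (x, y), q in joint.items():
        px[x] = px.get(x, 0.0) + q
        py[y] = py.get(y, 0.0) + q

    # H(X|Y) = sum over y of P(y) * H(X | Y = y), from the conditional distributions.
    H_x_given_y = sum(py[y] * H({x: joint[(x, y)] / py[y] for x in px}) for y in py)

    I_def   = H(px) + H(py) - H(joint)    # Definition 2
    I_lemma = H(px) - H_x_given_y         # Lemma 5
    print(I_def, I_lemma)                 # the two values agree
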
3 Problems
1. Let the set of messages M be the positive integers 1, 2, 3, .... Suppose we pick an integer by flipping an unbiased coin and stopping as soon as we get the first heads, and we then output the number of flips. This gives P(i) = (1/2)^i. Give a prefix-free code c(i) with |c(i)| = log2(1/P(i)). What is the entropy
of this distribution?
2. Suppose that there is a popular 10 megapixel digital camera where
a pixel is represented by 16 bits. Consider the probability distribution over
images to be taken by this model of camera over the next year (we call these
“natural” images). Give an upper bound on the entropy of this distribution.
Ignore the question of whether the distribution of natural images is really
well defined.
3. Suppose that we use rendering software to construct images of solid
models where a model consists of a set of objects each at a certain configura-
tion and under certain lighting conditions and from a certain camera position.
Suppose that the models are generated by kids using modeling software on
the web where each model must fit in a single ten kilobyte message. Consider
the probability distribution over the images rendered in this process. Give an
upper bound on the entropy of this distribution. (Again ignore the question
of whether the distribution is well defined).
4. Consider climate simulation software that samples future weather patterns by using a random number generator to add noise to the process of weather formation. If the program always starts from the same current state but uses a 32 bit random seed (selected uniformly from all such seeds), what is the entropy of the probability distribution over future weather patterns produced by this program, assuming that each different random seed produces a different weather pattern?
5. Use the fact that KL(P, Q) ≥ 0 to prove that the expected length of any prefix-free code is at least H(p), i.e.:

Σ_x p(x) ℓ(x) ≥ H(p)

Hint: Use the Kraft inequality and set q(x) = 2^{-ℓ(x)}. Make sure you add a “dummy” element ⊥ to the message set so that q(x) is a valid distribution.
6. Show that H(⟨X, Y⟩) = H(X) + H(Y|X).
7. Show that I(X, Y) = KL(P_⟨X,Y⟩, P_X P_Y), where P_⟨X,Y⟩ is the joint distribution on pairs ⟨x, y⟩ determined by the random variables X and Y, and P_X P_Y is the distribution on pairs ⟨x, y⟩ given by P(⟨x, y⟩) = P_X(x) P_Y(y).
8. Explain why problem 7 implies that I(X, Y) ≥ 0.
9. Explain why problem 8 implies that H(Y|X) ≤ H(Y).
