Information theory lecture-4
Source coding
An information source is a device which delivers symbols (or letters) randomly from a
set of symbols (or letters) called an alphabet. A code is a set of words called codewords.
A codeword is a sequence of symbols from the code alphabet assigned to a source output.
The most common codes are binary codes, i.e. codes whose code alphabet is {0, 1}.
The ASCII code: stands for American Standard Code for Information Interchange.
Originally intended to represent the full character set of a typewriter, it consists of
128 binary codewords, all of the same length (7 bits). Later on, additional and
non-printing characters were added to meet new demands. This gave birth to the
extended ASCII code, a fixed-length binary code of 256 codewords of 8 bits each.
Nowadays, keyboards still communicate with computers using ASCII codes, and when
a document is saved as "plain text", its characters are encoded with ASCII codes.
The Morse code. Invented by Samuel Morse in the 1840s, it allows letters of the
alphabet {a, b, …, z, "space", "full stop", "comma", …} to be sent as short
electrical signals (dots) and long electrical signals (dashes).
Morse code differs from ASCII code in that shorter codewords are assigned to
more frequent letters.
Coding Definitions
1. A code is said to be uniquely decodable if any sequence of codewords can be
interpreted in only one way.
- The code {1, 10, 11} is not uniquely decodable, as the sequence "1111" can be
interpreted as "1", "11", "1" or "1", "1", "11" or …
- The code {1, 10} is uniquely decodable. In the sequence 11011, the first codeword
is "1" since the following symbol is "1", whereas the second codeword is "10"
since the third symbol is "0", and so on …
2. An instantaneous code is a code in which any sequence of codewords can be
interpreted codeword by codeword, as soon as they are received.
- The code {0, 10} is instantaneous.
- The code {1, 10} is not instantaneous. For instance, in the sequence 1110, we
need to know whether the second symbol is "0" or "1" before interpreting the
first symbol. This is because the codeword "1" is the beginning of another
codeword, "10". An instantaneous code is also known as a prefix code.
3. A code is a prefix code if and only if no codeword is the beginning of another
codeword.
- The code {1, 01, 000, 001} is a prefix code.
An instantaneous (or prefix) code is uniquely decodable but some uniquely decodable
codes do not have the prefix property.
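The prefix property is easy to test mechanically: compare every pair of codewords and check that none begins another. The following Python sketch (the function name is an illustrative choice, not part of the lecture) verifies the two codes discussed above:

def is_prefix_code(codewords):
    # True if and only if no codeword is the beginning of another codeword
    for w in codewords:
        for v in codewords:
            if w != v and v.startswith(w):
                return False
    return True

print(is_prefix_code(["1", "10"]))                # False: "1" begins "10"
print(is_prefix_code(["1", "01", "000", "001"]))  # True: a prefix code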
Example-1
An information source outputs four symbols u1, u2, u3 and u4. The code alphabet is
{0, 1}. Using the coding definitions above, discuss the coding systems A, B, C and D
given in the table below.
Symbol   A    B    C     D
u1       0    00   0     0
u2       11   01   10    01
u3       00   10   110   011
u4       01   11   1110  0111
Solution
- Code A is not uniquely decodable because
u1u1u2 = 0011 and u3u2 = 0011
- Codes B, C and D are uniquely decodable codes.
Code B is instantaneous because it is a fixed-length code: the decoder simply
reads two bits at a time.
Code C is instantaneous because each codeword ends with 0, so no codeword is
the beginning of another.
Code D is not instantaneous because one must always wait for the first symbol
of the next codeword before the current codeword can be decoded: every codeword
of D is the beginning of another codeword. These observations are checked
programmatically in the sketch below.
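As a quick check (a minimal sketch; the helper repeats the prefix test given earlier and is not part of the lecture), we can test which of the four codes have the prefix property, i.e. which are instantaneous:

def is_prefix_code(codewords):
    # True if and only if no codeword is the beginning of another codeword
    return not any(w != v and v.startswith(w)
                   for w in codewords for v in codewords)

codes = {
    "A": ["0", "11", "00", "01"],
    "B": ["00", "01", "10", "11"],
    "C": ["0", "10", "110", "1110"],
    "D": ["0", "01", "011", "0111"],
}
for name, words in codes.items():
    print(name, is_prefix_code(words))
# Prints: A False, B True, C True, D False. Note the test only detects the
# prefix property: D fails it yet is still uniquely decodable, while A fails
# it and is not even uniquely decodable.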
The source coding problem:
Let U be a source generating values {A, B, C, D} with probabilities 1/2, 1/4, 1/8 and 1/8.
1,000 outputs of U are to be stored in the form of a binary file, and one seeks to
reduce the file to its smallest possible size.
Solution
Using the probability of the symbols, we deduce that in the sequence of 1,000 symbols,
there are roughly:
symbols of type A = 1000 ∗ 1/2 = 500
symbols of type B = 1000 ∗ 1/4 = 250
symbols of type C = 1000 ∗ 1/8 = 125
symbols of type D = 1000 ∗ 1/8 = 125
- First coding solution
There are 4 symbols to be encoded. Thus, each of them can be associated with a word
of two binary digits (2 bits) as follows:
A → 00, B → 01, C → 10, D → 11
Since all the codewords have the same length, this code is known as a fixed-length
code.
Hence,
file size = 500∗2 + 250∗2 + 125∗2 + 125∗2 = 2000 bits
- Second coding solution
The different symbols do not occur with the same probabilities. Therefore, we can think
of a code which assigns shorter words to more frequent symbols, as:
A → 1, B → 01, C → 000, D → 001
This code is said to be a variable-length code, as the codewords do not all have the
same length.
Hence,
file size = 500∗1 + 250∗2 + 125∗3 + 125∗3 = 1750 bits
The size of the file is reduced: on average, 1750/1000 = 1.75 bits are necessary to
represent one symbol.
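The arithmetic above can be reproduced directly. Below is a minimal Python sketch (variable names are illustrative choices) computing the expected file size and the average number of bits per symbol for both codes:

probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
fixed_code = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable_code = {"A": "1", "B": "01", "C": "000", "D": "001"}

n = 1000  # number of source outputs
for label, code in [("fixed", fixed_code), ("variable", variable_code)]:
    # expected total bits = sum over symbols of (count * codeword length)
    bits = sum(n * probs[s] * len(code[s]) for s in probs)
    print(label, int(bits), "bits,", bits / n, "bits per symbol")
# Prints: fixed 2000 bits, 2.0 bits per symbol
#         variable 1750 bits, 1.75 bits per symbol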
We are now faced with three questions:
Given an information source,
1. Is it possible to compress its data?
2. What is the minimum average number of bits necessary to represent one
symbol?
3. How do we design algorithms to achieve effective compression of the data?
These three questions constitute the source coding problem.