DSP: Huffman coding for data compression
1. Introduction
Data compression is a crucial aspect of modern computing, enabling efficient storage and
transmission of information. Huffman coding is a widely used technique for lossless data
compression. In this report, we explore Huffman coding, covering its algorithm, its MATLAB
implementation, experimental results, and observations.
2. Problem Description
The goal of Huffman coding is to efficiently represent data by assigning variable-length codes
to input symbols based on their frequencies. Symbols with higher frequencies are assigned
shorter codes, reducing the overall size of the encoded data.
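For example (an illustrative alphabet, not taken from the report's data): a source with four symbols of probabilities 0.5, 0.25, 0.125, and 0.125 could receive the codes 0, 10, 110, and 111. The average code length is then 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75 bits per symbol, versus the 2 bits per symbol a fixed-length code would require.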
3. Algorithm (Pseudocode)
1. Read the number of symbols and their corresponding probabilities.
2. Sort the probabilities in descending order.
3. Create a priority queue and insert the symbols with their probabilities.
4. Repeat the following steps until the priority queue contains only one element (a MATLAB sketch of this merge loop follows the list):
a. Remove the two elements with the lowest probabilities from the priority queue.
b. Create a new node with the sum of the two probabilities and the concatenation of their symbols.
c. Insert the new node into the priority queue.
5. The remaining element in the priority queue is the root of the Huffman tree.
6. Generate the Huffman codebook by traversing the Huffman tree.
7. Encode the input signal using the Huffman codebook.
8. Decode the encoded signal using the Huffman tree.
9. Check if the decoded signal is equal to the original input signal.
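Below is a minimal MATLAB sketch of the merge loop in steps 4-6, using example probabilities chosen purely for illustration. Instead of building an explicit tree, it prepends one bit to every symbol under each of the two lightest nodes at every merge, which yields the same codeword lengths as a tree traversal.

p = [0.4 0.3 0.15 0.1 0.05];        % example probabilities (assumed)
codes = repmat({''}, 1, numel(p));  % one growing bit string per symbol
nodes = num2cell(1:numel(p));       % each node starts as a single symbol
probs = p;
while numel(probs) > 1
    [probs, order] = sort(probs, 'ascend');   % lightest nodes first
    nodes = nodes(order);
    for k = nodes{1}, codes{k} = ['0' codes{k}]; end   % left branch bit
    for k = nodes{2}, codes{k} = ['1' codes{k}]; end   % right branch bit
    nodes = [{[nodes{1}, nodes{2}]}, nodes(3:end)];    % merge the two nodes
    probs = [probs(1) + probs(2), probs(3:end)];       % sum their probabilities
end
disp(codes)   % codeword lengths 1, 2, 3, 4, 4 for these probabilities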
Algorithm 1: User-Interactive Huffman Encoding
1. Prompt the user to enter the number of symbols (x).
2. Create a vector n with values from 1 to x (symbols).
3. Prompt the user to enter the probabilities for each symbol (p).
4. Sort probabilities in descending order (optional, for visualization).
5. Create a Huffman code dictionary (dict) using `huffmandict(n, p)`.
6. (Optional) Calculate the minimum bits per symbol (bps).
7. Generate a random symbol sequence (inputsig) with a defined length.
8. Encode the symbol sequence using the dictionary to get encoded data (code) with
`huffmanenco(inputsig, dict)`.
9. (Optional) Display the encoded data.
This pseudocode highlights the user interaction for defining symbols and probabilities.
Algorithm 2: Core Huffman Encoding Process
1. Input: Number of symbols (x), symbol probabilities (p).
2. Create a vector n containing symbols from 1 to x.
3. Build a Huffman tree using the probabilities (internal process of
`huffmandict`).
4. Generate Huffman codewords for each symbol based on the tree (internal process
of `huffmandict`).
5. Create a dictionary (dict) to map symbols to codewords.
6. Output: Huffman code dictionary (dict).
This focuses on the core logic of Huffman coding without user interaction details.
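In MATLAB, this entire algorithm is a single call. A brief sketch (the inputs here are assumed for illustration):

n = 1:5;  p = [0.4 0.3 0.15 0.1 0.05];   % assumed example inputs
dict = huffmandict(n, p);                % steps 3-5 in one call
cellfun(@numel, dict(:, 2)).'            % codeword lengths, here 1 2 3 4 4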
Algorithm 3: Encoding a Symbol Sequence with Huffman Coding
1. Input: Symbol sequence (inputsig), Huffman code dictionary (dict).
2. Loop through each symbol in inputsig:
- Find the corresponding Huffman codeword in dict.
- Append the codeword to encoded data (code).
3. Output: Encoded data (code) as a sequence of 0s and 1s.
This pseudocode outlines the process of encoding a symbol sequence using a predefined Huffman
code dictionary.
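A minimal sketch of this loop, assuming n = 1:x so that row s of dict corresponds to symbol s (huffmandict keeps the symbols in the order they were supplied):

code = [];                      % encoded bit stream
for s = inputsig(:).'           % visit each symbol in turn
    code = [code, dict{s, 2}];  % append that symbol's codeword
end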
Flowchart (START to End):
1. Enter number of symbols and probabilities
2. Display symbols and probabilities
3. Sort probabilities in descending order
4. Generate Huffman dictionary
5. Generate random input signal
6. Encode the signal
7. Decode the signal
8. Check if input equals decoded signal
9. Convert input and encoded signals to binary
10. Calculate sequence length
11. Calculate encoded length
12. Display results
4. MATLAB Implementation
x = input('enter the number of symbols ');
n = 1:x;                                 % symbols are numbered 1..x
disp('the number of symbols are n:')
disp(n)
p = input('enter the probabilities ');   % row vector that must sum to 1
disp(p)
s = sort(p,'descend');                   % for display only; huffmandict does not need sorted input
disp('the sorted probabilities are ')
disp(s)
dict = huffmandict(n,p);                 % build the Huffman code dictionary
bps = ceil(log2(max(n)));                % fixed-length bits per symbol
                                         % (use ceil(log2(max(n)+1)) if max(n) is an exact power of two)
inputsig = randsrc(100,1,[n;p]);         % random test signal drawn with probabilities p
code = huffmanenco(inputsig,dict)        % encode; no semicolon, so the bits are displayed
sig = huffmandeco(code,dict);            % decode with the same dictionary
isequal(inputsig,sig)                    % logical 1 confirms a lossless round trip
binarySig = int2bit(inputsig,bps);       % fixed-length binary form of the input
seqLen = numel(binarySig)
binaryComp = int2bit(code,1);            % code is already bits, so 1 bit per element
encodedLen = numel(binaryComp)
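As an optional follow-up (these lines are an assumed addition, not part of the original script), the compression achieved by the run can be summarized from the two lengths computed above:

ratio = seqLen / encodedLen;   % greater than 1 means the Huffman stream is shorter
fprintf('compressed to %.0f%% of the fixed-length size\n', 100 / ratio);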
5. Experimental Results and Observations
Sample Input: Assume 5 symbols with probabilities [0.4, 0.3, 0.15, 0.1, 0.05], each positive and summing to 1 (a zero-probability symbol would never occur and contributes nothing to the code).
Experimental Results: After applying Huffman coding, the average code length for these probabilities drops from 3 bits (fixed-length) to 2.05 bits per symbol, shrinking the encoded data by roughly a third.
Observations: Huffman coding effectively reduces redundancy in the data by assigning
shorter codes to more probable symbols and longer codes to less probable symbols.
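These observations can be checked numerically. A hedged sketch using the sample probabilities above (huffmandict's second output is the average codeword length):

p = [0.4 0.3 0.15 0.1 0.05];
[~, avglen] = huffmandict(1:5, p);   % avglen is 2.05 bits/symbol here
H = -sum(p .* log2(p));              % source entropy, about 2.01 bits/symbol
fprintf('avg %.2f bits/symbol, entropy %.2f, fixed-length 3\n', avglen, H)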
THEORY
1. x = input('enter the number of symbols ');
This prompts the user to enter the number of symbols in the data. input evaluates the typed
expression and stores the resulting number in the variable x (to capture raw text instead, the
's' option would be needed).
It is mandatory because the number of symbols defines how large the Huffman code dictionary
must be.
2. n = 1:x;
This creates a row vector n containing sequential integers from 1 to x; for x = 5, n is
[1 2 3 4 5]. It represents the symbols themselves, assuming they are simply numbered from
1 to x.
3. disp('the number of symbols are n:');
disp(n);
These lines use the disp function to display informative messages to the user.
The first line displays "the number of symbols are n:" on the console.
The second line displays the contents of the vector n, showing the user the defined symbols.
These are not mandatory for functionality but improve clarity and user interaction.
4. p = input('enter the probabilities ');
disp(p);
Like the earlier input call, this prompts the user to enter the probability associated with each
symbol and stores the values as a row vector in p.
The values in p should sum to 1 and represent the likelihood of each symbol appearing in your
data.
This is mandatory because Huffman coding uses probabilities to create optimal codewords.
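Because huffmandict expects the probabilities to sum to 1, a defensive check (an assumed addition, not in the original script) can catch input typos early:

assert(abs(sum(p) - 1) < 1e-6, 'probabilities must sum to 1');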
5. s = sort(p,'descend'); disp('the sorted probabilities are ');
disp(s);
sort(p,'descend') sorts the elements in p in descending order and stores the result in s.
disp functions again display messages and the sorted probabilities for user reference.
Sorting probabilities can be helpful for visualization and understanding the distribution of symbol
occurrences, but it's not strictly necessary for Huffman encoding itself.
6. dict = huffmandict(n, p);
This line is where the core Huffman coding happens.
huffmandict is a MATLAB function that creates a Huffman code dictionary based on the
symbols (n) and their probabilities (p).
The internal process (not directly accessible) involves building a Huffman tree using the
probabilities. The tree structure determines the optimal codewords (binary representations)
for each symbol to minimize the average codeword length.
The output, dict, is a cell array with two columns:
The first column lists the symbols (values from n).
The second column contains the corresponding Huffman codewords as row vectors of 0s
and 1s.
This is mandatory because the Huffman code dictionary is essential for encoding and
decoding your data.
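An illustrative look at that structure for a three-symbol alphabet (the exact bit patterns can differ between MATLAB versions, but the codeword lengths are optimal either way):

[dict, avglen] = huffmandict(1:3, [0.5 0.25 0.25]);
dict{1, 2}   % codeword for symbol 1: a 1-bit row vector, e.g. 0
dict{3, 2}   % codeword for symbol 3: 2 bits
avglen       % 1.5 bits/symbol for these probabilities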
7. bps = ceil(log2(max(n)));
This estimates the number of bits needed to represent the symbols in fixed-length binary
form: log2 takes the base-2 logarithm and ceil rounds the result up to the nearest integer,
so for 5 symbols bps = ceil(log2(5)) = 3.
(Strictly, representing the value max(n) itself takes ceil(log2(max(n)+1)) bits when max(n)
is an exact power of two.)
Knowing this fixed-length baseline is informative, but not crucial for encoding/decoding.
8. inputsig = randsrc(100, 1, [n; p]);
This generates a random sequence of 100 symbols (inputsig).
randsrc draws random symbols from a given alphabet.
The first argument (100) is the number of rows and the second (1) the number of columns, so
the output is a 100-by-1 column vector.
The third argument is a two-row matrix [n; p]:
The first row (n) lists the possible symbols (same as before).
The second row (p) gives the probability of drawing each symbol.
This step is not mandatory if you have your own data sequence, but it is useful for
demonstration and testing.
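A quick sanity check (an assumed addition) that randsrc really draws symbols with roughly the requested probabilities:

n = 1:5;  p = [0.4 0.3 0.15 0.1 0.05];
inputsig = randsrc(1000, 1, [n; p]);            % 1000 draws for a stabler estimate
freq = accumarray(inputsig, 1, [numel(n) 1]);   % count occurrences of each symbol
disp(freq.' / 1000)                             % should be close to p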
9. code = huffmanenco(inputsig, dict);
This encodes the random symbol sequence (inputsig) using the Huffman code dictionary
(dict).
huffmanenco is another MATLAB function that performs the encoding based on the
codewords in dict.
The internal process (not directly accessible) replaces each symbol in inputsig with its
corresponding Huffman codeword from dict.
The output, code, is a numeric vector of 0s and 1s containing the encoded data.
This is mandatory for encoding data with Huffman coding.
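The script in Section 4 completes the round trip with huffmandeco; a minimal sketch of that final check:

sig = huffmandeco(code, dict);   % invert the encoding using the same dictionary
isequal(inputsig, sig)           % logical 1 confirms the compression is lossless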
Example 1 and Example 2: sample program runs (output screenshots not reproduced here).
6. Conclusion
Huffman coding is a powerful technique for lossless data compression. This report presented an
overview of Huffman coding, its algorithm, implementation, and experimental results. Huffman coding
efficiently reduces the size of data by assigning shorter codes to frequently occurring symbols, thus
improving storage and transmission efficiency.
7. References
Huffman, D. A. (1952). "A Method for the Construction of Minimum-Redundancy Codes."
Proceedings of the IRE, 40(9), 1098-1101.
Sayood, K. (2017). Introduction to Data Compression (5th ed.). Morgan Kaufmann.
MATLAB Documentation: Huffman coding (huffmandict, huffmanenco, huffmandeco).