Submitted by:
Usama Dastagir
14030027011
Hassan Humayoun 14030027043
University of Management & Technology
History
In 1951, David A. Huffman and his MIT information theory classmates were given the choice of
a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the
problem of finding the most efficient binary code. Huffman, unable to prove any codes were the
most efficient, was about to give up and start studying for the final when he hit upon the idea of
using a frequency-sorted binary tree and quickly proved this method the most efficient.
Introduction
A Huffman Tree is a special type of binary tree used for data compression. A small Huffman
Tree appears below:
[Figure: a small Huffman tree whose leaves hold the characters e, b, c and d, with 0 labelling each left branch and 1 labelling each right branch]
Each leaf in the tree contains a symbol (in this case a Character) and an integer value. Each
non-leaf (internal node) contains only references to its left and right children. The 0's and 1's that
you see are not stored anywhere; we simply associate 0 with every left branch and 1 with every
right branch.
Each character that appears in the tree is assigned a unique code (a sequence of 0's and 1's)
obtained by following the path from the root of the tree to the leaf containing that character.
Below are the codes for the four characters found in this sample tree:
e is coded by "0"
b is coded by "100"
c is coded by "101"
d is coded by "11"
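To make the coding rule concrete, here is a minimal Python sketch (our own illustration, not part of any standard library): internal nodes are represented as (left, right) pairs, leaves as plain characters, and the tree shape is reconstructed from the codes listed above.

    # The sample tree: the root's left child is the leaf e; the right subtree
    # holds b, c and d so that b = "100", c = "101" and d = "11".
    sample_tree = ('e', (('b', 'c'), 'd'))

    def assign_codes(node, prefix="", codes=None):
        # Walk the tree, appending 0 for a left branch and 1 for a right branch.
        if codes is None:
            codes = {}
        if isinstance(node, str):      # a leaf: record the accumulated path
            codes[node] = prefix
        else:                          # an internal node: recurse into children
            left, right = node
            assign_codes(left, prefix + "0", codes)
            assign_codes(right, prefix + "1", codes)
        return codes

    print(assign_codes(sample_tree))
    # {'e': '0', 'b': '100', 'c': '101', 'd': '11'}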
Basic Technique
Encoding
The technique works by creating a binary tree of nodes. These can be stored in a regular array,
the size of which depends on the number of symbols, n. A node can be either a leaf node or
an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself,
the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node, which
makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain
a weight, links to two child nodes and the optional link to a parent node. As a common
convention, bit '0' represents following the left child and bit '1' represents following the right
child. A finished tree has up to n leaf nodes and n - 1 internal nodes. A Huffman tree that
omits unused symbols produces the optimal code lengths.
The process essentially begins with the leaf nodes containing the probabilities of the symbols they
represent. Then a new node is created whose children are the two nodes with the smallest
probabilities, such that the new node's probability is equal to the sum of its children's
probabilities. With the previous two nodes merged into one (and thus no longer considered), and
with the new node now under consideration, the procedure is repeated until only one node
remains: the Huffman tree.
The simplest construction algorithm uses a priority queue where the node with lowest probability
is given highest priority:
1. Create a leaf node for each symbol and add it to the priority queue.
2. While there is more than one node in the queue:
1. Remove the two nodes of highest priority (lowest probability) from the queue
2. Create a new internal node with these two nodes as children and with probability
equal to the sum of the two nodes' probabilities.
3. Add the new node to the queue.
3. The remaining node is the root node and the tree is complete.
Since efficient priority queue data structures require O(log n) time per insertion, and a tree
with n leaves has 2n - 1 nodes, this algorithm operates in O(n log n) time, where n is the number
of symbols.
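The steps above map almost directly onto Python's heapq module. The following sketch is illustrative only (the function name and node representation are our own):

    import heapq
    from itertools import count

    def build_huffman_tree(frequencies):
        # frequencies is a {symbol: weight} dict. Queue entries are
        # (weight, tiebreak, node); the counter breaks ties so that entries
        # with equal weights never compare their node payloads.
        tiebreak = count()
        # Step 1: one leaf per symbol; a leaf is just the symbol itself.
        heap = [(w, next(tiebreak), sym) for sym, w in frequencies.items()]
        heapq.heapify(heap)
        # Step 2: repeatedly merge the two lowest-weight nodes.
        while len(heap) > 1:
            w1, _, n1 = heapq.heappop(heap)
            w2, _, n2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tiebreak), (n1, n2)))
        # Step 3: the surviving node is the root of the Huffman tree.
        return heap[0][2]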
If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman
tree using two queues, the first one containing the initial weights (along with pointers to the
associated leaves), and combined weights (along with pointers to the trees) being put in the back
of the second queue. This assures that the lowest weight is always kept at the front of one of the
two queues:
1. Start with as many leaves as there are symbols.
2. Enqueue all leaf nodes into the first queue (by probability in increasing order, so that the
least likely item is at the head of the queue).
3. While there is more than one node in the queues:
1. Dequeue the two nodes with the lowest weight by examining the fronts of both
queues.
2. Create a new internal node, with the two just-removed nodes as children (either
node can be either child) and the sum of their weights as the new weight.
3. Enqueue the new node into the rear of the second queue.
4. The remaining node is the root node; the tree has now been generated.
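Written out in the same style (again an illustrative sketch with our own naming), the two-queue procedure looks as follows; it assumes the input is already sorted by increasing weight:

    from collections import deque

    def build_huffman_tree_sorted(sorted_freqs):
        # sorted_freqs is a list of (weight, symbol) pairs, smallest first.
        leaves = deque(sorted_freqs)   # queue 1: the initial leaves
        merged = deque()               # queue 2: combined (internal) nodes

        def pop_smallest():
            # The lowest remaining weight sits at the front of one of the queues.
            if not merged or (leaves and leaves[0][0] <= merged[0][0]):
                return leaves.popleft()
            return merged.popleft()

        # n leaves require exactly n - 1 merges.
        total = len(leaves)
        for _ in range(total - 1):
            w1, n1 = pop_smallest()
            w2, n2 = pop_smallest()
            merged.append((w1 + w2, (n1, n2)))
        return merged[0][1] if merged else leaves[0][1]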
Example
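As a small worked illustration (the weights below are chosen so that the result matches the sample tree from the introduction; they do not come from any real data set), suppose b and c each occur 2 times, d occurs 3 times and e occurs 8 times.
1. The two smallest weights are b (2) and c (2); merge them into a node of weight 4.
2. The two smallest remaining weights are d (3) and the new node (4); merge them into a node of weight 7.
3. Finally, merge e (8) with that node (7) into the root, of weight 15.
Reading off the root-to-leaf paths gives e = "0", d = "11", b = "100" and c = "101" (up to the arbitrary left/right choice at each merge), exactly the codes listed earlier. The same tree results from calling build_huffman_tree({'b': 2, 'c': 2, 'd': 3, 'e': 8}) from the sketch above.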
Decoding
Generally speaking, the process of decompression is simply a matter of translating the stream of
prefix codes to individual byte values, usually by traversing the Huffman tree node by node as
each bit is read from the input stream (reaching a leaf node necessarily terminates the search for
that particular byte value). Before this can take place, however, the Huffman tree must be
somehow reconstructed. In the simplest case, where character frequencies are fairly predictable,
the tree can be preconstructed (and even statistically adjusted on each compression cycle) and
thus reused every time, at the expense of at least some measure of compression efficiency.
Otherwise, the information to reconstruct the tree must be sent a priori. A naive approach might
be to prepend the frequency count of each character to the compression stream. Unfortunately,
the overhead in such a case could amount to several kilobytes, so this method has little practical
use. If the data is compressed using canonical encoding, the compression model can be precisely
reconstructed with just B·2^B bits of information (where B is the number of bits per symbol).
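To illustrate what canonical encoding buys, the sketch below (our own naming, not a standard API) reassigns codes from code lengths alone, so only each symbol's code length needs to be transmitted:

    def canonical_codes(code_lengths):
        # code_lengths is {symbol: length}. Symbols are sorted by (length,
        # symbol); each code is the previous code plus one, left-shifted
        # whenever the code length increases.
        symbols = sorted(code_lengths, key=lambda s: (code_lengths[s], s))
        codes, code, prev_len = {}, 0, code_lengths[symbols[0]]
        for sym in symbols:
            code <<= code_lengths[sym] - prev_len   # pad when lengths grow
            codes[sym] = format(code, '0%db' % code_lengths[sym])
            prev_len = code_lengths[sym]
            code += 1
        return codes

    # The lengths from the sample tree (e:1, d:2, b:3, c:3) yield the
    # equivalent canonical codes e = "0", d = "10", b = "110", c = "111".
    print(canonical_codes({'e': 1, 'd': 2, 'b': 3, 'c': 3}))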
Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For
example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the
latter is encountered the tree building routine simply reads the next 8 bits to determine the
character value of that particular leaf. The process continues recursively until the last leaf node is
reached; at that point, the Huffman tree will thus be faithfully reconstructed. The overhead using
such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet). Many other
techniques are possible as well. In any case, since the compressed data can include unused
"trailing bits", the decompressor must be able to determine when to stop producing output. This
can be accomplished by either transmitting the length of the decompressed data along with the
compression model or by defining a special code symbol to signify the end of input (the latter
method can adversely affect code length optimality, however).
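Putting the two ideas together, here is an illustrative decoder sketch (our own naming, assuming the 0-for-internal / 1-for-leaf convention and an up-front symbol count as described above):

    def read_tree(bits):
        # Rebuild the prepended tree; bits is an iterator of '0'/'1' chars.
        # A '1' introduces a leaf followed by 8 bits of character value,
        # a '0' introduces an internal node with two recursive subtrees.
        if next(bits) == '1':
            return chr(int(''.join(next(bits) for _ in range(8)), 2))
        left = read_tree(bits)
        right = read_tree(bits)
        return (left, right)

    def decode(bitstring, n_symbols):
        # n_symbols is transmitted alongside the model so that the unused
        # trailing bits mentioned above are ignored.
        bits = iter(bitstring)
        tree = read_tree(bits)
        out = []
        for _ in range(n_symbols):
            node = tree
            while not isinstance(node, str):   # descend until a leaf is hit
                node = node[0] if next(bits) == '0' else node[1]
            out.append(node)
        return ''.join(out)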
Time Complexity
The time complexity of the Huffman algorithm is O(n log n). Using a heap to store the weight of
each tree, each iteration requires O(log n) time to determine the cheapest weight and insert the
new weight. There are O(n) iterations, one for each item.
Applications
Arithmetic coding can be viewed as a generalization of Huffman coding, in the sense that the two
produce the same output when every symbol has a probability of the form 1/2^k; in particular,
arithmetic coding tends to offer significantly better compression for small alphabet sizes.
Huffman coding nevertheless remains in wide use because of its simplicity, high speed, and lack
of patent coverage. Intuitively, arithmetic coding can offer better compression than Huffman
coding because its "code words" can have effectively non-integer bit lengths, whereas code words
in Huffman coding can only have an integer number of bits. Therefore, there is an inefficiency in
Huffman coding where a code word of length k only optimally matches a symbol of probability
1/2^k and other probabilities are not represented as optimally, whereas the code word length in
arithmetic coding can be made to exactly match the true probability of the symbol.
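For example, a symbol with probability 1/3 ideally needs log2(3) ≈ 1.585 bits, but a Huffman code must assign it a whole number of bits (1 or 2), wasting a fraction of a bit either way.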
Huffman coding today is often used as a "back-end" to some other compression
method. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a
front-end model and quantization followed by Huffman coding (or variable-length prefix-free
codes with a similar structure, although not necessarily designed using Huffman's algorithm).
References
https://en.wikipedia.org/wiki/Huffman_coding#Formalized_description
https://www.cs.auckland.ac.nz/software/AlgAnim/huffman.html
https://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
http://www.cs.umd.edu/class/fall2012/cmsc132h/Projects/P7/project7.html
http://www.csee.umbc.edu/courses/undergraduate/341/fall11/projects/project3/HuffmanExplanation.shtml