Text Compression: Examples Huffman coding: go go gophers
“abcde” in the different formats ASCII 3 bits Huffman
Encodings ASCII: g 103 1100111 000 00
01100001011000100110001101100100… o 111 1101111 001 01
13
ASCII: 8 bits/character p 112 1110000 010 1100
Fixed: h 104 1101000 011 1101 6 7
Unicode: 16 bits/character
000001010011100 e 101 1100101 100 1110 3 4
10 g o
r 114 1110010 101 1111
Symbol ASCII Fixed Var. 3 3 s * 2 2
s 115 1110011 110 101
length length 0 1 sp. 32 1000000 111 101 1
0 0 2
a 01100001 000 000 p h e r
0 1 0 1
0
Encoding uses tree: 1 1 1 1
b 01100010 001 11 0 left/1 right
a b c d e 6
Var:
c 01100011 010 01 How many bits? 37!! 4
000110100110 g o
0 1 Savings? Worth it?
d 01100100 011 001 2 2 3 3
3
e 01100101 100 10 0 1 0 1
p h e r
c e b 1 1 1 s *
0 1 1
1 2
CPS 100 a d 8.1 CPS 100 8.2
Huffman Coding Building a Huffman tree
D.A Huffman in early 1950’s Begin with a forest of single-node trees (leaves)
Before compressing data, analyze the input stream Each node/tree/leaf is weighted with character count
Represent data using variable length codes Node stores two values: character and count
Variable length codes though Prefix codes There are n nodes in forest, n is size of alphabet?
Each letter is assigned a codeword
Codeword is for a given letter is produced by traversing Repeat until there is only one node left: root of tree
the Huffman tree Remove two minimally weighted trees from forest
Property: No codeword produced is the prefix of another Create new tree with minimal trees as children,
Letters appearing frequently have short codewords, • New tree root's weight: sum of children (character ignored)
while those that appear rarely have longer ones
Huffman coding is optimal per-character coding method Does this process terminate? How do we get minimal trees?
Remove minimal trees, hummm……
CPS 100 8.3 CPS 100 8.4
Building a tree Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
1 1
F C
11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 1 1 1 11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 2 1
CPS 100 8.5 CPS 100 8.6
I E N S M A B O T G D L R U P F C I E N S M A B O T G D L R U P
Building a tree Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
2 3 2 3 2 2
L R
1 1 1 2 1 1 1 2
C F P U C F P U
11 6 5 5 4 4 3 3 3 3 3 2 2 2 2 2 11 6 5 5 4 4 4 3 3 3 3 3 2 2 2
CPS 100 8.7 CPS 100 8.8
I E N S M A B O T G D L R I E N S M A B O T G D
Building a tree Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
23 37 23
21 16 12 11 21 16 12 11
10 11 8 8 6 6 10 11 8 8 6 6
I I
5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3
E N S M A B E N S M A B
3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2
O T G D L R O T G D L R
1 1 1 2 1 1 1 2
C F P U C F P U
23 21 16 37 23
CPS 100 8.9 CPS 100 8.10
Building a tree Mary Shaw
Software engineering and
software architecture
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
Tools for constructing
60 large software systems
Development is a small
37 23 piece of total cost,
maintenance is larger,
21 16 12 11
depends on well-designed
and developed techniques
10 11 8 8 6 6
I Interested in computer
5 5 5 6 4 4 4 4 3 3
science, programming,
E N S M A B curricula, and canoeing,
3 2 3 3 2 2 2 2 health-care costs
O T G D L R ACM Fellow, Alan Perlis
1 1 1 2
Professor of Compsci at CMU
C F P U
60
CPS 100 8.11 CPS 100 8.12
Huffman Complexities Writing code out to file
How do we measure? Size of input file, size of alphabet How do we go from characters to encodings?
Which is typically bigger? Build Huffman tree
Root-to-leaf path generates encoding
Accumulating character counts: ______
How can we do this in O(1) time, though not really Need way of writing bits out to file
Building the heap/priority queue from counts ____ Platform dependent?
Initializing heap guaranteed Complicated to write bits and read in same ordering
Building Huffman tree ____
Why? See BitInputStream and BitOutputStream classes
Create table of encodings from tree ____ Depend on each other, bit ordering preserved
Why?
Write tree and compressed file _____ How do we know bits come from compressed file?
Store a magic number
CPS 100 8.13 CPS 100 8.14
Decoding a message Decoding a message
01100000100001001101 1100000100001001101
60 60
37 23 37 23
21 16 12 11 21 16 12 11
10 11 8 8 6 6 10 11 8 8 6 6
I I
5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3
E N S M A B E N S M A B
3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2
O T G D L R O T G D L R
1 1 1 2 1 1 1 2
C F P U C F P U
CPS 100 8.15 CPS 100 8.16
Decoding a message Decoding a message
100000100001001101 00000100001001101
60 60
37 23 37 23
21 16 12 11 21 16 12 11
10 11 8 8 6 6 10 11 8 8 6 6
I I
5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3
E N S M A B E N S M A B
3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2
O T G D L R O T G D L R
1 1 1 2 1 1 1 2
C F P U C F P U
CPS 100 8.17 CPS 100 8.18
Decoding a message Decoding a message
0000100001001101 1
60 60
37 23 37 23
21 16 12 11 21 16 12 11
10 11 8 8 6 6 10 11 8 8 6 6
I I
5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3
E N S M A B E N S M A B
3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2
O T G D L R O T G D L R
1 1 1 2 1 1 1 2
C F P U C F P U
CPS 100 G 8.19 CPS 100 GOOD 8.20
Decoding a message Huffman coding: go go gophers
ASCII 3 bits Huffman
g 103 1100111 000 ?? g o p h e r s *
01100000100001001101 o 111 1101111 001 ?? 3 3 1 1 1 1 1 2
p 112 1110000 010
60 h 104 1101000 011 2 2 3
e 101 1100101 100
37 23 r 114 1110010 101 p h e r s *
s 115 1110011 110 1 1
21 16 12 11 1 1 1 2
sp. 32 1000000 111
choose two smallest weights
10 11 8 8 6 6
combine nodes + weights
I
Repeat
5 5 5 6 4 4 4 4 3 3
Priority queue? 4 6
E N S M A B
3 2 3 3 2 2 2 2 Encoding uses tree: 2 2 g o
O T G D L R 0 left/1 right
3 3
1 1 1 2 How many bits? p h e r
C F P U 1 1 1 1
CPS 100 GOOD 8.21 CPS 100 8.22
Huffman Tree 2 Huffman Tree 2
“A SIMPLE STRING TO BE ENCODED USING A “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”
E.g. “ A SIMPLE” ⇔ E.g. “ A SIMPLE” ⇔
“10101101001000101001110011100000” “10101101001000101001110011100000”
CPS 100 8.23 CPS 100 8.24
Huffman Tree 2 Huffman Tree 2
“A SIMPLE STRING TO BE ENCODED USING A “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”
E.g. “ A SIMPLE” ⇔ E.g. “ A SIMPLE” ⇔
“10101101001000101001110011100000” “10101101001000101001110011100000”
CPS 100 8.25 CPS 100 8.26
Huffman Tree 2 Huffman Tree 2
“A SIMPLE STRING TO BE ENCODED USING A “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”
E.g. “ A SIMPLE” ⇔ E.g. “ A SIMPLE” ⇔
“10101101001000101001110011100000” “10101101001000101001110011100000”
CPS 100 8.27 CPS 100 8.28
Huffman Tree 2 Huffman Tree 2
“A SIMPLE STRING TO BE ENCODED USING A “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”
E.g. “ A SIMPLE” ⇔ E.g. “ A SIMPLE” ⇔
“10101101001000101001110011100000” “10101101001000101001110011100000”
CPS 100 8.29 CPS 100 8.30
Huffman Tree 2 Other methods
“A SIMPLE STRING TO BE ENCODED USING A Adaptive Huffman coding
MINIMAL NUMBER OF BITS” Lempel-Ziv algorithms
E.g. “ A SIMPLE” ⇔ Build the coding table on the fly while reading
“10101101001000101001110011100000” document
Coding table changes dynamically
Protocol between encoder and decoder so that everyone
is always using the right coding scheme
Works well in practice (compress, gzip, etc.)
More complicated methods
Burrows-Wheeler (bunzip2)
PPM statistical methods
CPS 100 8.31 CPS 100 8.32