Algorithms for Data Science
CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Tuesday, September 29, 2015
Outline
1 Recap
2 Data compression
3 Symbol codes and optimal lossless compression
4 Prefix codes
5 Prefix codes and trees
6 The Huffman algorithm
Today: 1. Recap
Review of the last lecture

DFS applications:
- Cycle detection
- Topological sorting
- Strongly connected components in directed graphs
Today: 2. Data compression
Motivation

Data compression: find compact representations of data.

Data compression standards:
- jpeg for image transmission
- mp3 for audio content, mpeg2 for video transmission
- utilities: gzip, bzip2

All of the above use the Huffman algorithm as a basic building block.
Data representation

Chromosome maps: sequences of hundreds of millions of bases (symbols from {A, C, G, T}).

Goal: store a chromosome map with 200 million bases.

How do we represent a chromosome map?
- Encode every symbol that appears in the sequence separately by a fixed-length binary string.

Codeword c(x) for symbol x: a binary string encoding x.
Example code

- Alphabet A = {A, C, G, T} with 4 symbols
- Encode each symbol with 2 bits

  symbol x   codeword c(x)
  A          00
  C          01
  G          10
  T          11

Output of encoding: the concatenation of the codewords for every symbol in the input sequence.

Example
- Input sequence: ACGTAA
- Output: c(A)c(C)c(G)c(T)c(A)c(A) = 000110110000
- Total length of encoding = 6 × 2 = 12 bits.
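As a concrete illustration, here is a minimal Python sketch of this fixed-length encoding; the code table and sample sequence come from the slide, while the function name encode is my own.

  # Fixed-length symbol code C0 from the slide: 2 bits per symbol.
  C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}

  def encode(sequence, code):
      """Concatenate the codeword of every symbol in the input sequence."""
      return "".join(code[symbol] for symbol in sequence)

  assert encode("ACGTAA", C0) == "000110110000"  # 6 symbols x 2 bits = 12 bits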
Today: 3. Symbol codes and optimal lossless compression
Symbol codes

Symbol code: a set of codewords where every input symbol is encoded separately.

Examples of symbol codes:
- C0 = {00, 01, 10, 11} is a symbol code for {A, C, G, T}.
- ASCII encoding system: every character and special symbol on the computer keyboard is encoded by a different 7-bit binary string.

Remark 1.
C0 and ASCII are fixed-length symbol codes: each codeword has the same length.
Unique decodability

Decoding C0:
- read 2 bits of the output;
- print the symbol corresponding to this codeword;
- continue with the next 2 bits.

Remark 2.
- This decoding algorithm works for ASCII (replace 2 by 7).
- C0, ASCII: distinct input sequences have distinct encodings.

Definition 1.
A symbol code is uniquely decodable if, for any two distinct input sequences, their encodings are distinct.
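The fixed-width decoder above is a short loop; a minimal Python sketch (the helper name decode_fixed is my own):

  C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}
  DECODE_C0 = {codeword: symbol for symbol, codeword in C0.items()}

  def decode_fixed(bits, table, width=2):
      """Read `width` bits at a time and map each chunk back to its symbol."""
      return "".join(table[bits[i:i + width]] for i in range(0, len(bits), width))

  assert decode_fixed("000110110000", DECODE_C0) == "ACGTAA"  # use width=7 for ASCII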
Lossless compression

- Lossless compression: compress and decompress without errors.
- Uniquely decodable codes allow for lossless compression.
- A symbol code achieves optimal lossless compression when it produces an encoding of minimum length for its input (among all uniquely decodable symbol codes).
- The Huffman algorithm provides a symbol code that achieves optimal lossless compression.
Fixed-length vs variable-length codes

A chromosome map consists of 200 million bases as follows:

  symbol x   freq(x)
  A          110 million
  C          5 million
  G          25 million
  T          60 million

- A appears much more often than the other symbols. It might be best to encode A with fewer bits.
- It is unlikely that the fixed-length encoding C0 is optimal.
Today: 4. Prefix codes
Prefix codes

Variable-length encodings

Code C1:
  symbol x   codeword c(x)
  A          0
  C          00
  G          10
  T          1

- C1 is not uniquely decodable! E.g., 101110: how to decode it? Both GTTG (10·1·1·10) and TATTTA (1·0·1·1·1·0) are valid parses.
Prefix codes

Variable-length encodings

Code C2:
  symbol x   codeword c(x)
  A          0
  C          110
  G          111
  T          10

- C2 is uniquely decodable.
- C2 is such that no codeword is a prefix of another.

Definition 2 (prefix codes).
A symbol code is a prefix code if no codeword is a prefix of another.
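Definition 2 can be checked mechanically; a small Python sketch (the helper name is_prefix_code is mine):

  def is_prefix_code(codewords):
      """Return True iff no codeword is a prefix of another codeword."""
      for c in codewords:
          for d in codewords:
              if c != d and d.startswith(c):
                  return False
      return True

  assert not is_prefix_code({"0", "00", "10", "1"})   # C1: 0 is a prefix of 00
  assert is_prefix_code({"0", "110", "111", "10"})    # C2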
Decoding prefix codes

1. Scan the binary string from left to right until you've seen enough bits to match a codeword.
2. Output the symbol corresponding to this codeword.
   - Since no other codeword is a prefix of this codeword or contains it as a prefix, this sequence of bits cannot be used to encode any other symbol.
3. Continue starting from the next bit of the bit string.

Thus prefix codes allow for:
- unique decoding;
- fast decoding (the end of a codeword is instantly recognizable).

Examples of prefix codes: C0, C2. (A decoder sketch follows below.)
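A minimal Python sketch of this left-to-right decoder, assuming the input really is the output of a prefix code; a dictionary lookup stands in for "matching a codeword":

  def decode_prefix(bits, code):
      """Decode a bit string produced by a prefix code, scanning left to right."""
      inverse = {codeword: symbol for symbol, codeword in code.items()}
      out, current = [], ""
      for bit in bits:
          current += bit
          if current in inverse:          # end of a codeword: instantly recognizable
              out.append(inverse[current])
              current = ""
      return "".join(out)

  C2 = {"A": "0", "C": "110", "G": "111", "T": "10"}
  assert decode_prefix("0110111100", C2) == "ACGTA"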
Prefix codes and optimal lossless compression

- Decoding a prefix code is fast.
- We would like to focus on prefix codes (rather than all uniquely decodable symbol codes) for achieving optimal lossless compression.
- Information theory guarantees that this restriction costs nothing: by the Kraft-McMillan inequality, the codeword lengths of any uniquely decodable code can be matched by a prefix code.
- So we can focus solely on prefix codes for optimal compression.
Compression gains from variable-length prefix codes

Chromosome map: do we gain anything by using C2 instead of C0 when compressing the map of 200 million bases?

  symbol x   freq(x)        C0: c(x)   C2: c(x)
  A          110 million    00         0
  C          5 million      01         110
  G          25 million     10         111
  T          60 million     11         10

- C0: 2 bits × 200 million symbols = 400 million bits
- C2: 1×110 + 3×5 + 3×25 + 2×60 = 320 million bits
- Improvement of 20% in this example
The optimal prefix code problem

Input:
- Alphabet A = {a_1, ..., a_n}
- Set P = {p_1, ..., p_n} of probabilities over A such that p_i = Pr[a_i]

Output: a binary prefix code C = {c(a_1), c(a_2), ..., c(a_n)} for (A, P), where codeword c(a_i) has length \ell_i and the expected length of the code

  L(C) = \sum_{a_i \in A} p_i \ell_i

is minimum among all prefix codes.
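A sketch of this objective in Python, evaluated on the chromosome codes from the next slide (the function name is mine; the small tolerance absorbs floating-point error):

  def expected_length(code, prob):
      """L(C) = sum of p_i * len(c(a_i)) over all symbols a_i."""
      return sum(prob[a] * len(code[a]) for a in code)

  prob = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
  C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}
  C2 = {"A": "0", "C": "110", "G": "111", "T": "10"}

  assert abs(expected_length(C0, prob) - 2.0) < 1e-9
  assert abs(expected_length(C2, prob) - 1.6) < 1e-9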
Example

Chromosome example

  symbol x   Pr(x)     C0: c(x)   C2: c(x)
  A          110/200   00         0
  C          5/200     01         110
  G          25/200    10         111
  T          60/200    11         10

L(C0) = 2
L(C2) = (1×110 + 3×5 + 3×25 + 2×60)/200 = 1.6

Coming up: C2 is the output of the Huffman algorithm, hence an optimal encoding for (A, P).
Today: 5. Prefix codes and trees
Prefix codes and trees

- A binary tree T is a rooted tree such that each node that is not a leaf has at most two children.
- Binary tree for a prefix code: a branch to the left represents a 0 in the encoding and a branch to the right a 1.

Code C0:
  x   c(x)
  A   00
  C   01
  G   10
  T   11

[Figure: binary tree T0 for prefix code C0 — a complete binary tree of depth 2; the four leaves A, C, G, T carry probabilities Pr[A], Pr[C], Pr[G], Pr[T].]

Code C2:
  x   c(x)
  A   0
  C   110
  G   111
  T   10

[Figure: binary tree T2 for prefix code C2 — leaf A (Pr[A]) at depth 1, leaf T (Pr[T]) at depth 2, leaves C and G (Pr[C], Pr[G]) at depth 3.]
Properties of binary trees representing prefix codes
1. Where do alphabet symbols appear in the tree?
2. What do codewords correspond to in the tree?
3. Consider the tree corresponding to the optimal prefix code.
Can it have internal nodes with one child?
Properties of binary trees representing prefix codes

1. Symbols must appear at the leaves of the tree T (why?). T has n leaves.
2. Codewords c(a_i) are given by root-to-leaf paths. Recall that \ell_i is the length of the codeword c(a_i) for input symbol a_i. Therefore, \ell_i corresponds to the depth of a_i in T (we assume that the root is at depth 0). We can rewrite the expected length of the prefix code as (see the sketch after this list)

     L(C) = \sum_{1 \le i \le n} p_i \ell_i = \sum_{a_i \in A} p_i \cdot depth_T(a_i) = L(T).

3. Optimal tree must be full: all internal nodes must have exactly two children (why?).
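Property 2 says L(T) can be read off from leaf depths alone; a small Python sketch under my own tree encoding (a leaf is a symbol string, an internal node a (left, right) pair):

  def expected_depth(tree, prob, depth=0):
      """Sum p_i * depth_T(a_i) over the leaves of a binary tree."""
      if isinstance(tree, str):          # leaf: a symbol at this depth
          return prob[tree] * depth
      left, right = tree                 # internal node: two subtrees
      return (expected_depth(left, prob, depth + 1) +
              expected_depth(right, prob, depth + 1))

  # Tree T2 for code C2: A at depth 1; T at depth 2; C, G at depth 3.
  T2 = ("A", ("T", ("C", "G")))
  prob = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
  assert abs(expected_depth(T2, prob) - 1.6) < 1e-9   # matches L(C2)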
More on optimal tree

Claim 1.
There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency characters are assigned to leaves that are siblings in T* at maximum depth.

Proof.
By an exchange argument: start with a tree for an optimal prefix code and transform it into T*.
Proof of Claim 1

- Let T* be the tree for the optimal prefix code.
- Let α, β be the two symbols with the smallest probabilities, that is, Pr[α] ≤ Pr[β] ≤ Pr[s] for all s ∈ A − {α, β}.
- Let x and y be the two siblings at maximum depth in T*.

[Figure: trees T* and T' — T' is obtained from T* by exchanging leaf α with leaf x, one of the two siblings at maximum depth.]

We want L(T*) − L(T') ≥ 0.

[Figure: trees T' and T'' — T'' is obtained from T' by exchanging leaf β with leaf y, so that α and β become siblings at maximum depth.]

We want L(T') − L(T'') ≥ 0.
How do the expected lengths of the two trees compare?

L(T*) − L(T')
  = \sum_{a_i \in A} Pr[a_i] · depth_{T*}(a_i) − \sum_{a_i \in A} Pr[a_i] · depth_{T'}(a_i)
  = Pr[α] · depth_{T*}(α) + Pr[x] · depth_{T*}(x) − Pr[α] · depth_{T'}(α) − Pr[x] · depth_{T'}(x)
  = Pr[α] · depth_{T*}(α) + Pr[x] · depth_{T*}(x) − Pr[α] · depth_{T*}(x) − Pr[x] · depth_{T*}(α)
  = (Pr[α] − Pr[x]) · (depth_{T*}(α) − depth_{T*}(x)) ≥ 0

The third equality follows from the exchange: depth_{T'}(α) = depth_{T*}(x) and depth_{T'}(x) = depth_{T*}(α). The final product is nonnegative because both factors are at most zero: Pr[α] ≤ Pr[x] and depth_{T*}(α) ≤ depth_{T*}(x).

Similarly, exchanging β and y in T' yields L(T') − L(T'') ≥ 0.

Hence L(T*) − L(T'') ≥ 0. Since T* is optimal, it must be that L(T*) = L(T''). So T'' is also optimal. The claim follows by setting T* to be T''.
Building the optimal tree

Claim 1 tells us how to build the optimal tree greedily!

1. Find the two symbols with the lowest probabilities.
2. Remove them from the alphabet and replace them with a new meta-character with probability equal to the sum of their probabilities.
   - Idea: this meta-character will be the parent of the two deleted symbols in the tree.
3. Recursively construct the optimal tree using this process.

Greedy algorithms: make a local (myopic) decision at every step that optimizes some criterion.
Today: 6. The Huffman algorithm
Huffman algorithm

Huffman(A, P)
  if |A| = 2 then
    return the tree encoding one symbol using 0 and the other using 1
  end if
  Let α and β be the two symbols with the lowest probabilities
  Let σ be a new meta-character with probability Pr[σ] = Pr[α] + Pr[β]
  Let A1 = A − {α, β} + {σ}
  Let P1 be the new set of probabilities over A1
  T1 = Huffman(A1, P1)
  return T as follows: replace leaf node σ in T1 by an internal node, and add two children labelled α and β below σ

Remark 3.
The output of the Huffman procedure is a binary tree T; the code for (A, P) is the prefix code corresponding to T.
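A Python sketch mirroring this recursive pseudocode; the tuple-based tree encoding and the tie-breaking on equal probabilities are my own choices. Representing the meta-character σ directly by the pair (α, β) makes the final "replace leaf σ" step happen automatically when the recursion returns.

  def huffman(prob):
      """Recursive Huffman: returns a tree (leaf = symbol, node = (left, right))."""
      if len(prob) == 2:
          a, b = prob                                  # base case: codewords 0 and 1
          return (a, b)
      alpha, beta = sorted(prob, key=prob.get)[:2]     # two lowest probabilities
      p1 = {s: p for s, p in prob.items() if s not in (alpha, beta)}
      p1[(alpha, beta)] = prob[alpha] + prob[beta]     # meta-character sigma
      return huffman(p1)                               # leaf sigma *is* the pair (alpha, beta)

  def codewords(tree, prefix=""):
      """Read the prefix code off the tree: left edge = 0, right edge = 1."""
      if isinstance(tree, str):
          return {tree: prefix}
      left, right = tree
      return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

  prob = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
  code = codewords(huffman(prob))
  assert sorted(len(c) for c in code.values()) == [1, 2, 3, 3]  # expected length 1.6

The exact codewords depend on child order and tie-breaking, but the codeword lengths, and hence the expected length, match the optimal code C2.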
Example: recursive Huffman for chromosome map

Recursive call 1: Huffman({A, C, G, T}, {110/200, 5/200, 25/200, 60/200})
Recursive call 2: Huffman({A, CG, T}, {110/200, 30/200, 60/200})
Recursive call 3: Huffman({A, CGT}, {110/200, 90/200})

[Figure: unwinding the recursion — end of call 3: root with edge 0 to leaf A (110/200) and edge 1 to meta-character CGT (90/200); end of call 2: CGT expands into T (60/200) and meta-character CG (30/200); end of call 1: CG expands into C (5/200) and G (25/200).]
Correctness

Proof: by induction on the size of the alphabet n ≥ 2.

- Base case. For n = 2, Huffman is optimal.
- Hypothesis. Assume that Huffman returns the optimal prefix code for alphabets of size n.
- Induction Step. Let A be an alphabet of size n + 1, P the corresponding set of probabilities. Let T1 be the tree returned by our algorithm for (A1, P1), where A1, P1, T1 are as in the pseudocode. By the hypothesis, T1 is optimal. Let T be the final tree returned for (A, P) by our algorithm. We claim that T is optimal.

We will prove the claim by contradiction. Suppose T* is the optimal tree for (A, P), with L(T*) < L(T).
Replacing α, β by some σ' in the optimal tree T*

W.l.o.g., suppose T* is an optimal tree where α, β appear as siblings at maximum depth (Claim 1 guarantees there is such an optimal tree).

Replace α, β in T* by a meta-character σ' with Pr[σ'] = Pr[α] + Pr[β] to obtain the tree T1* on n leaves with expected length L(T1*).

[Figure: trees T1 and T1* — in each, a single leaf σ' with probability Pr[α] + Pr[β] replaces the sibling leaves α and β; dashed edges and nodes belong to T (resp. T*) but not to T1 (resp. T1*).]
Expressing L(T*) in terms of L(T1*)

Shorthand notation: d_T(a_i) = depth_T(a_i).

L(T*)
  = \sum_{a_i \in A} Pr[a_i] · d_{T*}(a_i)
  = \sum_{a_i \in A−{α,β}} Pr[a_i] · d_{T*}(a_i) + (Pr[α] + Pr[β]) · d_{T*}(α)
  = \sum_{a_i \in A−{α,β}} Pr[a_i] · d_{T*}(a_i) + (Pr[α] + Pr[β]) · (d_{T*}(α) − 1) + Pr[α] + Pr[β]
  = \sum_{a_i \in A−{α,β}+{σ'}} Pr[a_i] · d_{T1*}(a_i) + Pr[α] + Pr[β]
  = L(T1*) + Pr[α] + Pr[β]     (1)
Correctness (cont'd)

Similarly, we obtain that

  L(T) = L(T1) + Pr[α] + Pr[β]     (2)

By the induction hypothesis, we have L(T1) ≤ L(T1*), since T1 is optimal for alphabets with n characters and probabilities given by P1. Hence

  L(T) = L(T1) + Pr[α] + Pr[β] ≤ L(T1*) + Pr[α] + Pr[β] = L(T*),     (3)

where the first equality follows from equation (2), the inequality from the induction hypothesis, and the last equality from equation (1).

But equation (3) contradicts the assumption that L(T*) < L(T). Hence T must be optimal.
Implementation and running time

1. Straightforward implementation: O(n^2) time.
2. Store the alphabet symbols in a priority queue implemented as a binary min-heap, with their probabilities as keys.
   - Operations: Initialize (O(n)), Extract-min (O(log n)), Insert (O(log n))
   - Total time: O(n log n)

For an iterative implementation of Huffman, see your textbook; a sketch follows below.
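A minimal iterative sketch using Python's heapq module as the binary min-heap; this is not necessarily the textbook's exact formulation, and the counter is only there so ties never force the heap to compare trees.

  import heapq
  from itertools import count

  def huffman_iterative(prob):
      """Iterative Huffman with a binary min-heap: O(n log n) total."""
      tiebreak = count()                        # breaks ties without comparing trees
      heap = [(p, next(tiebreak), symbol) for symbol, p in prob.items()]
      heapq.heapify(heap)                       # Initialize: O(n)
      while len(heap) > 1:
          p1, _, left = heapq.heappop(heap)     # Extract-min: O(log n)
          p2, _, right = heapq.heappop(heap)    # Extract-min: O(log n)
          heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))  # Insert
      return heap[0][2]                         # root of the Huffman tree

  prob = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
  tree = huffman_iterative(prob)                # leaf depths match the optimal code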
Example: iterative Huffman for chromosome map

Input (A, P):
  symbol x   Pr(x)
  A          110/200
  C          5/200
  G          25/200
  T          60/200

Output code:
  symbol x   c(x)
  A          0
  C          111
  G          110
  T          10

[Figure: the three merge steps — Step 1: merge C (5/200) and G (25/200) into a node of weight 30/200; Step 2: merge this node with T (60/200) into a node of weight 90/200; Step 3: merge this node with A (110/200) to form the root.]
Beyond Huffman coding

- The Huffman algorithm provides an optimal symbol code.
- Codes that encode larger blocks of input symbols might do better.
- Storage on noisy media: what if a bit of the output of the compressor is flipped?
  - decompression cannot carry through
  - need error correction on top of compression