Lecture 5 - Big Data
Lecture 5 - Big Data
5
THE BIG PICTURE
Candidate
Document pairs:
Locality-
those pairs
Sensitive
of signatures
Hashing
that we need
to test for
The set of Signatures: similarity
strings of length short integer
k that appear in vectors that
the document represent the
sets, and
reflect their
similarity
6
Document
SHINGLING
The Jaccard similarity of two sets is the size of their intersection divided
by the size of their union
𝑠𝑖𝑚 𝐶1, 𝐶2 = 𝐶1 ∩ 𝐶2 Τ 𝐶1 ∪ 𝐶2
12
SIMILARITY METRIC FOR SHINGLES
Caveat: You must pick k large enough, or most documents will have
most shingles
k = 5 is OK for short documents
k = 10 is better for long documents
14
MOTIVATION FOR MINHASH/LSH
Shingles
Typical matrix is sparse!
So far:
Documents → Sets of shingles
Represent sets as boolean vectors in a matrix
19
OUTLINE: FINDING SIMILAR COLUMNS
3) Optional: Check that columns with similar signatures are really similar
Warnings:
Comparing all pairs may take too much time: Job for LSH
These methods can produce false negatives, and even false positives (if the optional check is not made) 20
HASHING COLUMNS (SIGNATURES)
Key idea: “hash” each column C to a small signature h(C), such that:
(1) h(C) is small enough that the signature fits in RAM
(2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)
Hash docs into buckets. Expect that “most” pairs of near duplicate docs
hash into the same bucket! 21
MIN-HASHING
Define a “hash” function h(C) = the index of the first (in the
permuted order ) row in which column C has value 1:
h (C) = min (C)
Use several (e.g., 100) independent hash functions (that is, permutations)
to create a signature of a column 23
MIN-HASHING Note: Another (equivalent) way is to
store row indexes:
EXAMPLE 1 5 1 5
2nd element of the permutation
2 3 1 3
is the first to map to a 1
6 4 6 4
Permutation Input matrix (Shingles x Documents)
Signature matrix M
2 4 3 1 0 1 0 2 1 2 1
3 2 4 1 0 0 1 2 1 4 1
7 1 7 0 1 0 1
1 2 1 2
6 3 2 0 1 0 1
1 6 6 0 1 0 1 4th element of the permutation
is the first to map to a 1
5 7 1 1 0 1 0
24
4 5 5 1 0 1 0
THE MIN-HASH PROPERTY
Choose a random permutation 0 0
Claim: Pr[h(C1) = h(C2)] = sim(C1, C2) 0 0
Why? 1 1
Let X be a doc (set of shingles), y X is a shingle
Then: Pr[(y) = min((X))] = 1/|X|
0 0
It is equally likely that any y X is mapped to the min element 0 1
Let y be s. t. (y) = min((C1C2))
1 0
Then either: (y) = min((C1)) if y C1 , or (y) = min((C2)) if y C2
So, the prob. that both are true is the prob. y C1 C2 One of the two
cols had to have 25
Pr[min((C1))=min((C2))]=|C1C2|/|C1C2|= sim(C1, C2) 1 at position y
FOUR TYPES OF ROWS
Given cols C1 and C2, rows may be classified as:
C1 C2
A 1 1
B 1 0
C 0 1
D 0 0
a = # rows of type A, etc.
Note: sim(C1, C2) = a/(a +b +c)
Then: Pr[h(C1) = h(C2)] = Sim(C1, C2)
Look down the cols C1 and C2 until we see a 1 26
If it’s a type-A row, then h(C1) = h(C2) If a type-B or type-C row, then not
SIMILARITY FOR SIGNATURES
Note: Because of the Min-Hash property, the similarity of columns is the same as
the expected similarity of their signatures
27
MIN-HASHING EXAMPLE
Permutation Input matrix (Shingles x Documents)
Signature matrix M
2 4 3 1 0 1 0 2 1 2 1
3 2 4 1 0 0 1 2 1 4 1
7 1 7 0 1 0 1
1 2 1 2
6 3 2 0 1 0 1
1 6 6 0 1 0 1 Similarities:
1-3 2-4 1-2 3-4
5 7 1 1 0 1 0 Col/Col 0.75 0.75 0 0
4 5 5 1 0 1 0 Sig/Sig 0.67 1.00 0 0
28
MIN-HASH SIGNATURES
29
IMPLEMENTATION TRICK
Permuting rows even once is prohibitive
Row hashing!
Pick K = 100 hash functions ki
Ordering under ki gives a random row permutation!
One-pass implementation
For each column C and hash-func. ki keep a “slot” for the min-hash value
Initialize all sig(C)[i] =
Scan rows looking for 1s How to pick a random hash function h(x)?
Universal hashing:
Suppose row j has 1 in column C ha,b(x)=((a·x+b) mod p) mod N
Then for each ki : where:
a,b … random integers 30
If ki(j) < sig(C)[i], then sig(C)[i] ki(j)
p … prime number (p > N)
EXAMPLE
Sig1 Sig 2
Raw C1 C2 h(1)=1 1 ∞
1 1 0 g(1)=3 3 ∞
2 0 1 h(2)=2 1 2
3 1 1 g(2)=0 3 0
h(3)=3 1 2
4 1 0
g(3)=2 2 0
5 0 1
h(4)=4 1 2
g(4)=4 2 0
h(x) = x mod 5 h(5)=0 1 0
g(x) = (2x+1) mod 5 g(5)=1 2 0 31
Goal: Find documents with Jaccard similarity at least s (for some similarity threshold,
e.g., s=0.8)
LSH – General idea: Use a function f(x,y) that tells whether x and y is a
candidate pair: a pair of elements whose similarity must be evaluated
For Min-Hash matrices:
2 1 4 1
Hash columns of signature matrix M to many buckets
1 2 1 2
Each pair of documents that hashes into the
same bucket is a candidate pair 2 1 2 1
33
CANDIDATES FROM MIN-HASH
r rows
per band
b bands
2 1 4 1
1 2 1 2
2 1 2 1
One
signature
36
Signature matrix M
PARTITION M INTO BANDS
For each band, hash its portion of each column to a hash table with k
buckets
Make k as large as possible
Candidate column pairs are those that hash to the same bucket for ≥ 1
band
Tune b and r to catch most similar pairs, but few non-similar pairs
37
Columns 2 and 6
HASHING BANDS Buckets are probably identical
(candidate pair)
Matrix M
r rows b bands
38
SIMPLIFYING ASSUMPTION
There are enough buckets that columns are unlikely to hash to the
same bucket unless they are identical in a particular band
EXAMPLE OF BANDS 1 2 1 2
2 1 2 1
Assume the following case:
Suppose 100,000 columns of M (100k docs)
Signatures of 100 integers (rows)
Therefore, signatures take 40Mb
Want all 80% similar pair of document
5,000,000,000 pairs of signatures can take a while to compare
Choose b = 20 bands of r = 5 integers/band
Pick:
The number of Min-Hashes (rows of M)
The number of bands b, and
The number of rows r per band
Probability = 1
Similarity threshold s
if t > s
Probability
No chance
of sharing
if t < s
a bucket
44
Probability Remember:
of sharing Probability of
a bucket equal hash-values
= similarity
45
46
WHAT B BANDS OF R ROWS GIVES YOU
At least
No bands
one band
identical
identical
Probability s ~ (1/b)1/r 1 - (1 -t r )b
of sharing
a bucket
All rows
Some row of a band
of a band are equal
Similarity t=sim(C1, C2) of two sets unequal
47
EXAMPLE: B = 20; R = 5
s 1-(1-sr)b
.2 .006
.3 .047
Similarity threshold s
.4 .186
Prob. that at least 1 band is identical:
.5 .470
.6 .802
.7 .975
.8 .9996 48
PICKING R AND B: THE S-CURVE
0.9
0.8
0.7
50 hash-functions (r=5, b=10)
0.6
0.5
0.4
0.3
0.2
Blue area: False Negative rate
0.1
Green area: False Positive rate
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Similarity
49
LSH SUMMARY
Tune M, b, r to get almost all pairs with similar signatures, but eliminate
most pairs that do not have similar signatures