Mansoura University
Faculty of Computers and Information
Department of Information System
Second Semester
[AI3502] BIG DATA ANALYTICS
Grade: 3rd AI
Dr. Amira Rezk
FINDING SIMILAR ITEMS
A COMMON METAPHOR
Many problems can be expressed as finding “similar” sets:
Pages with similar words, for duplicate detection or classification by topic
Customers who purchased similar products
Products with similar customer sets
Images with similar features
Users who visited similar websites
TASK: FINDING SIMILAR DOCUMENTS
Goal: Given a large number (𝑵 in the millions or billions) of documents, find “near duplicate” pairs
Applications:
Mirror websites, or approximate mirrors
Application: Don’t want to show both in search results
Similar news articles at many news sites
Application: Cluster articles by “same story”
Problems:
Many small pieces of one document can appear out of order in another
Too many documents to compare all pairs
Documents are so large or so many that they cannot fit in main memory
THREE ESSENTIAL TECHNIQUES FOR SIMILAR DOCUMENTS
Shingling: Convert documents to sets
Min-Hashing: Convert large sets to short signatures, while preserving similarity
Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar
documents
THE BIG PICTURE
Document
→ Shingling → The set of strings of length k that appear in the document
→ Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity
→ Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity
Document → The set of strings of length k that appear in the document
SHINGLING
STEP 1: SHINGLING: CONVERT DOCUMENTS TO SETS
DOCUMENTS AS HIGH-DIM DATA
Step 1: Shingling: Convert documents to sets
Simple approaches:
Document = set of words appearing in document
Document = set of “important” words
Don’t work well for this application. Why?
Need to account for ordering of words!
A different way: Shingles!
DEFINE: SHINGLES
A k-shingle (or k-gram) for a document is a sequence of k tokens that
appears in the document
Tokens can be characters, words or something else, depending on the application
Assume tokens = characters for examples
Example: k=2; document D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Option: Shingles as a bag (multiset), count ab twice: S’(D1) = {ab, bc, ca, ab}
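As a concrete sketch (assuming tokens = characters, as in the example above), extracting the set of k-shingles is a one-line set comprehension:

```python
def shingles(doc, k):
    """Return the set of character k-shingles of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# Slide example: k=2, D1 = "abcab"
print(shingles("abcab", 2))  # the set {'ab', 'bc', 'ca'}
```

Using a `list` instead of a `set` comprehension would give the bag (multiset) variant, which keeps the duplicate "ab".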
SHINGLES AND SIMILARITY
Documents that are intuitively similar will have many shingles in common.
Changing a word only affects the k-shingles within distance k of the word.
Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.
Example:
k=3, “The dog which chased the cat” versus “The dog that chased the cat”
Only the 3-shingles replaced are g_w, _wh, whi, hic, ich, ch_, and h_c.
COMPRESSING SHINGLES
To compress long shingles, we can hash them to (say) 4 bytes; each hash value acts as a token
Represent a document by the set of hash values of its k-shingles
Idea: Two documents could (rarely) appear to have shingles in common, when in
fact only the hash-values were shared
Example: k=2; document D1= abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
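A minimal sketch of this compression, using `zlib.crc32` only as an example of a 4-byte hash (the values {1, 5, 7} on the slide are illustrative, not the output of any particular hash function):

```python
import zlib

def hashed_shingles(doc, k):
    """Represent a document by the 4-byte hash values of its k-shingles.

    crc32 is used here purely as an example of a hash that fits in 4 bytes.
    """
    return {zlib.crc32(doc[i:i + k].encode("utf-8"))
            for i in range(len(doc) - k + 1)}

h = hashed_shingles("abcab", 2)  # three distinct 32-bit values, one per shingle
```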
SIMILARITY METRIC FOR SHINGLES
Document D1 is a set of its k-shingles C1=S(D1)
Equivalently, each document is a 0/1 vector in the space of k-shingles
Each unique shingle is a dimension
Vectors are very sparse
A natural similarity measure is the
Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
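Jaccard similarity is a direct translation of the formula into set operations (the second document "abcba" below is a made-up example):

```python
def jaccard(c1, c2):
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2| of two sets."""
    if not (c1 or c2):
        return 1.0  # convention for two empty sets
    return len(c1 & c2) / len(c1 | c2)

# 2-shingle sets of "abcab" and of a hypothetical document "abcba":
s1 = {"ab", "bc", "ca"}
s2 = {"ab", "bc", "cb", "ba"}
print(jaccard(s1, s2))  # |{ab, bc}| / |{ab, bc, ca, cb, ba}| = 2/5 = 0.4
```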
WORKING ASSUMPTION
Documents that have lots of shingles in common have similar
text, even if the text appears in different order
Caveat: You must pick k large enough, or most documents will have
most shingles in common
k = 5 is OK for short documents
k = 10 is better for long documents
MOTIVATION FOR MINHASH/LSH
Suppose we need to find near-duplicate documents among 𝑵
= 𝟏 million documents
Naïvely, we would have to compute pairwise Jaccard similarities for
every pair of docs
𝑵(𝑵 − 𝟏)/𝟐 ≈ 5×10^11 comparisons
At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days
For 𝑵 = 𝟏𝟎 million, it takes more than a year…
Document → The set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets, and reflect their similarity

MINHASHING
Step 2: Minhashing: Convert large sets to short signatures, while preserving similarity
ENCODING SETS AS BIT VECTORS
Many similarity problems can be formalized as finding subsets that have
significant intersection
Encode sets using 0/1 (bit, Boolean) vectors
One dimension per element in the universal set
Interpret set intersection as bitwise AND, and
set union as bitwise OR
Example: C1 = 10111; C2 = 10011
Size of intersection = 3; size of union = 4,
Jaccard similarity (not distance) = 3/4
Distance: d(C1,C2) = 1 – (Jaccard similarity) = 1/4
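The bitwise view can be sketched with Python integers used as bit vectors, where `&` is intersection and `|` is union:

```python
# Sets encoded as bit vectors; AND = intersection, OR = union.
c1 = 0b10111
c2 = 0b10011
intersection = bin(c1 & c2).count("1")  # 3 ones in 10011
union = bin(c1 | c2).count("1")         # 4 ones in 10111
sim = intersection / union              # Jaccard similarity = 3/4
dist = 1 - sim                          # Jaccard distance  = 1/4
print(sim, dist)  # 0.75 0.25
```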
FROM SETS TO BOOLEAN MATRICES
Rows = elements (shingles) of the universal set
Columns = sets (documents)
1 in row e and column s if and only if e is a member of s
Column similarity is the Jaccard similarity of the corresponding sets
(rows with value 1)
Typical matrix is sparse!
Each document is a column:
EXAMPLE: COLUMN SIMILARITY
C1 C2
 0  1
 1  0
 1  1
 0  0
 1  1
 0  1
Sim(C1, C2) = 2/5 = 0.4
(2 rows where both columns have a 1, out of 5 rows where at least one does)
OUTLINE: FINDING SIMILAR COLUMNS
So far:
Documents → Sets of shingles
Represent sets as Boolean vectors in a matrix
Next goal: Find similar columns while computing small
signatures
Similarity of columns == similarity of signatures
OUTLINE: FINDING SIMILAR COLUMNS
Next Goal: Find similar columns, Small signatures
Naïve approach:
1) Signatures of columns: small summaries of columns
2) Examine pairs of signatures to find similar columns
Essential: Similarities of signatures and columns are related
3) Optional: Check that columns with similar signatures are really similar
Warnings:
Comparing all pairs may take too much time
These methods can produce false negatives, and even false positives (if the optional check is not made)
HASHING COLUMNS (SIGNATURES)
Key idea: “hash” each column C to a small signature h(C), such that:
(1) h(C) is small enough that the signature fits in RAM
(2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)
Goal: Find a hash function h(·) such that:
If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
Hash docs into buckets. Expect that “most” pairs of near duplicate docs
hash into the same bucket!
MIN-HASHING
Goal: Find a hash function h(·) such that:
if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
Clearly, the hash function depends on the similarity metric:
Not all similarity metrics have a suitable hash function
There is a suitable hash function for the Jaccard similarity: it is
called Min-Hashing
MIN-HASHING
Imagine the rows of the Boolean matrix permuted under a random
permutation π
Define a “hash” function hπ(C) = the index of the first (in the
permuted order) row in which column C has value 1:
hπ(C) = min π(C)
Use several (e.g., 100) independent hash functions (that is, permutations)
to create a signature of a column
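A minimal sketch of one such hash function, assuming a column is a 0/1 list and a permutation is a list mapping each row index to its permuted position:

```python
import random

def minhash(column, perm):
    """Index, in permuted order, of the first row where the column has a 1."""
    return min(perm[row] for row, bit in enumerate(column) if bit)

# One random permutation of 7 rows (permuted positions 1..7):
rng = random.Random(0)
perm = list(range(1, 8))
rng.shuffle(perm)
h = minhash([1, 0, 0, 0, 0, 1, 1], perm)  # column with 1s in rows 0, 5, 6
```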
MIN-HASHING EXAMPLE

Permutations (π1, π2, π3)    Input matrix (Shingles x Documents)
  2  4  3                      1  0  1  0
  3  2  4                      1  0  0  1
  7  1  7                      0  1  0  1
  6  3  2                      0  1  0  1
  1  6  6                      0  1  0  1
  5  7  1                      1  0  1  0
  4  5  5                      1  0  1  0

Signature matrix M
  2  1  2  1
  2  1  4  1
  1  2  1  2

(Under π1, the 2nd element of the permutation is the first to map to a 1 in column 1, so M[1][1] = 2; under π2, the 4th element is the first to map to a 1 in column 3, so M[2][3] = 4.)
THE MIN-HASH PROPERTY
Choose a random permutation π
Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
Why?
Let X be a doc (set of shingles), y ∈ X is a shingle
Then: Pr[π(y) = min(π(X))] = 1/|X|
It is equally likely that any y ∈ X is mapped to the min element
Let y be s.t. π(y) = min(π(C1 ∪ C2))
Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
(One of the two columns had to have a 1 at position y)
So, the prob. that both are true is the prob. y ∈ C1 ∩ C2
Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
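The claim can be checked empirically: over many random permutations, the fraction on which two sets receive the same min-hash approaches their Jaccard similarity. A simulation sketch with two made-up shingle sets:

```python
import random

c1, c2 = {0, 2, 3, 5}, {0, 3, 4, 5, 6}   # made-up shingle sets
universe = sorted(c1 | c2)               # rows outside the union can never be the min
jac = len(c1 & c2) / len(c1 | c2)        # = 3/6 = 0.5

rng = random.Random(42)
trials, agree = 20000, 0
for _ in range(trials):
    order = rng.sample(universe, len(universe))      # one random permutation
    pos = {elem: i for i, elem in enumerate(order)}  # element -> permuted position
    if min(pos[e] for e in c1) == min(pos[e] for e in c2):
        agree += 1

print(agree / trials, jac)  # the two numbers should be close (both near 0.5)
```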
SIMILARITY FOR SIGNATURES
We know: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
Now generalize to multiple hash functions
The similarity of two signatures is the fraction of the hash functions in
which they agree
Note: Because of the Min-Hash property, the similarity of columns is the same as
the expected similarity of their signatures
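In code, the estimated similarity is simply the fraction of agreeing positions; the doc-1 and doc-3 signature columns from the example matrix agree on 2 of 3 rows:

```python
def signature_similarity(sig1, sig2):
    """Fraction of hash functions (signature rows) on which two signatures agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

# Signature columns of documents 1 and 3 from the example:
print(signature_similarity([2, 2, 1], [2, 4, 1]))  # 2/3 ≈ 0.67
```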
MIN-HASHING EXAMPLE

Permutations (π1, π2, π3)    Input matrix (Shingles x Documents)
  2  4  3                      1  0  1  0
  3  2  4                      1  0  0  1
  7  1  7                      0  1  0  1
  6  3  2                      0  1  0  1
  1  6  6                      0  1  0  1
  5  7  1                      1  0  1  0
  4  5  5                      1  0  1  0

Signature matrix M
  2  1  2  1
  2  1  4  1
  1  2  1  2

Similarities:
          1-3   2-4   1-2   3-4
Col/Col  0.75  0.75     0     0
Sig/Sig  0.67  1.00     0     0
MIN-HASH SIGNATURES
Pick K=100 random permutations of the rows
Think of sig(C) as a column vector
sig(C)[i] = according to the i-th permutation πi, the index of the first row that has a 1 in
column C
sig(C)[i] = min πi(C)
Note: The sketch (signature) of document C is small ~𝟏𝟎𝟎 bytes!
We achieved our goal! We “compressed” long bit vectors into short
signatures
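Putting the pieces together, a sketch of signature computation with explicit random permutations (fine at this small scale; at real scale, implementations approximate permutations with random hash functions instead of materializing them):

```python
import random

def minhash_signatures(columns, n_rows, num_perms=100, seed=0):
    """Min-hash signature (one value per permutation) for each 0/1 column."""
    rng = random.Random(seed)
    sigs = [[] for _ in columns]
    for _ in range(num_perms):
        perm = list(range(n_rows))
        rng.shuffle(perm)  # one random permutation of the rows
        for sig, col in zip(sigs, columns):
            sig.append(min(perm[r] for r in range(n_rows) if col[r]))
    return sigs

# The four document columns from the example matrix:
docs = [[1, 1, 0, 0, 0, 1, 1],
        [0, 0, 1, 1, 1, 0, 0],
        [1, 0, 0, 0, 0, 1, 1],
        [0, 1, 1, 1, 1, 0, 0]]
sigs = minhash_signatures(docs, n_rows=7, num_perms=100)
```

With 100 permutations, the fraction of rows on which the doc-1 and doc-3 signatures agree should already be close to their Jaccard similarity of 0.75.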