Lecture 5 - Big Data

Mansoura University

Faculty of Computers and Information


Department of Information System
Second Semester

[AI3502] BIG DATA ANALYTICS


Grade: 3rd AI
Dr. Amira Rezk
FINDING SIMILAR ITEMS
A COMMON METAPHOR

 Many problems can be expressed as finding “similar” sets:


 Pages with similar words (for duplicate detection, classification by topic)
 Customers who purchased similar products
 Products with similar customer sets
 Images with similar features
 Users who visited similar websites
TASK: FINDING SIMILAR DOCUMENTS
 Goal: Given a large number (𝑵 in the millions or billions) of documents, find “near duplicate” pairs
 Applications:
 Mirror websites, or approximate mirrors
 Application: Don’t want to show both in search results
 Similar news articles at many news sites
 Application: Cluster articles by “same story”
 Problems:
 Many small pieces of one document can appear out of order in another
 Too many documents to compare all pairs
 Documents are so large or so many that they cannot fit in main memory
THREE ESSENTIAL TECHNIQUES FOR SIMILAR DOCUMENTS

 Shingling: Convert documents to sets

 Min-Hashing: Convert large sets to short signatures, while preserving similarity

 Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
THE BIG PICTURE

Candidate
Document pairs:
Locality-
those pairs
Sensitive
of signatures
Hashing
that we need
to test for
The set of Signatures: similarity
strings of length short integer
k that appear in vectors that
the document represent the
sets, and
reflect their
similarity
6
Document → the set of strings of length k that appear in the document

SHINGLING

STEP 1: SHINGLING: CONVERT DOCUMENTS TO SETS


DOCUMENTS AS HIGH-DIM DATA

 Step 1: Shingling: Convert documents to sets


 Simple approaches:
 Document = set of words appearing in document
 Document = set of “important” words
 Don’t work well for this application. Why?

 Need to account for ordering of words!


 A different way: Shingles!
DEFINE: SHINGLES

 A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the document
 Tokens can be characters, words or something else, depending on the application
 Assume tokens = characters for examples

 Example: k=2; document D1 = abcab


Set of 2-shingles: S(D1) = {ab, bc, ca}
 Option: Shingles as a bag (multiset), count ab twice: S’(D1) = {ab, bc, ca, ab}
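For concreteness, a minimal Python sketch of character-level k-shingling that matches the example above (the helper name shingles is illustrative, not from the slides):

```python
def shingles(doc, k):
    """Set of k-shingles (k-grams) of a document, with tokens = characters."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(shingles("abcab", 2))   # {'ab', 'bc', 'ca'} = S(D1) from the example
```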
SHINGLES AND SIMILARITY

 Documents that are intuitively similar will have many shingles in common.
 Changing a word only affects the k-shingles within distance k from the word.
 Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.
 Example: k=3, “The dog which chased the cat” versus “The dog that chased the cat”
  The only 3-shingles replaced are g_w, _wh, whi, hic, ich, ch_, and h_c.
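The claim in this example can be checked directly by diffing the character 3-shingles of the two sentences (a small illustrative sketch):

```python
s1 = "The dog which chased the cat"
s2 = "The dog that chased the cat"
k = 3
sh1 = {s1[i:i + k] for i in range(len(s1) - k + 1)}
sh2 = {s2[i:i + k] for i in range(len(s2) - k + 1)}
print(sorted(sh1 - sh2))   # the seven replaced shingles listed above, with '_' shown as a space
```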
COMPRESSING SHINGLES

 To compress long shingles, we can hash them to (say) 4 bytes (each hash value acts as the token)


 Represent a document by the set of hash values of its k-shingles
 Idea: Two documents could (rarely) appear to have shingles in common, when in
fact only the hash-values were shared
 Example: k=2; document D1= abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
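As a sketch of this compression step, each shingle can be mapped to a 4-byte integer with any fixed 32-bit hash; CRC-32 is used here purely for illustration (the slides do not prescribe a particular hash, and the values 1, 5, 7 above are just an example):

```python
import zlib

def hashed_shingles(doc, k):
    """Represent a document by the 32-bit (4-byte) hash values of its k-shingles."""
    return {zlib.crc32(doc[i:i + k].encode("utf-8")) for i in range(len(doc) - k + 1)}

print(hashed_shingles("abcab", 2))   # three 32-bit integers, one per distinct 2-shingle
```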
JACCARD SIMILARITY

 The Jaccard similarity of two sets is the size of their intersection divided
by the size of their union
 sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|

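In code, the definition is a one-liner over sets (the example sets below are illustrative):

```python
def jaccard(c1, c2):
    """Jaccard similarity: |C1 ∩ C2| / |C1 ∪ C2|."""
    return len(c1 & c2) / len(c1 | c2)

print(jaccard({"ab", "bc", "ca"}, {"ab", "ca", "de"}))   # 2/4 = 0.5
```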
SIMILARITY METRIC FOR SHINGLES

 Document D1 is a set of its k-shingles C1=S(D1)


 Equivalently, each document is a 0/1 vector in the space of k-shingles
 Each unique shingle is a dimension
 Vectors are very sparse

 A natural similarity measure is the Jaccard similarity:
   sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
WORKING ASSUMPTION

 Documents that have lots of shingles in common have similar


text, even if the text appears in different order

 Caveat: You must pick k large enough, or most documents will have most shingles in common
  k = 5 is OK for short documents
  k = 10 is better for long documents
MOTIVATION FOR MINHASH/LSH

 Suppose we need to find near-duplicate documents among N = 1 million documents
  Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs
  N(N − 1)/2 ≈ 5×10^11 comparisons
  At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days

 For N = 10 million, it takes more than a year…
Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity

 Step 2: Min-Hashing: Convert large sets to short signatures, while preserving similarity
ENCODING SETS AS BIT VECTORS
 Many similarity problems can be formalized as finding subsets that have
significant intersection
 Encode sets using 0/1 (bit, boolean) vectors
 One dimension per element in the universal set
 Interpret set intersection as bitwise AND, and
set union as bitwise OR
 Example: C1 = 10111; C2 = 10011
 Size of intersection = 3; size of union = 4,
 Jaccard similarity (not distance) = 3/4
 Distance: d(C1,C2) = 1 – (Jaccard similarity) = 1/4
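Treating the example columns as binary numbers, the bitwise view can be checked directly (a small sketch; Python integers stand in for the bit vectors):

```python
c1, c2 = 0b10111, 0b10011            # the example columns C1 and C2
inter = bin(c1 & c2).count("1")      # bitwise AND -> size of intersection
union = bin(c1 | c2).count("1")      # bitwise OR  -> size of union
print(inter, union, inter / union)   # 3 4 0.75
```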
FROM SETS TO BOOLEAN MATRICES
 Rows = elements (shingles); Columns = sets (documents)
 1 in row e and column s if and only if e is a member of s
 Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
 Typical matrix is sparse!
 Each document is a column. Example matrix (shingles × documents):
   1 1 1 0
   1 1 0 1
   0 1 0 1
   0 0 0 1
   1 0 0 1
   1 1 1 0
   1 0 1 0
 Example: sim(C1, C2) = ?
  Size of intersection = 3; size of union = 6, Jaccard similarity (not distance) = 3/6
  d(C1, C2) = 1 − (Jaccard similarity) = 3/6
OUTLINE: FINDING SIMILAR COLUMNS

 So far:
 Documents → Sets of shingles
 Represent sets as boolean vectors in a matrix

 Next goal: Find similar columns while computing small


signatures
 Similarity of columns == similarity of signatures

OUTLINE: FINDING SIMILAR COLUMNS

 Next Goal: Find similar columns, Small signatures


 Naïve approach:
 1) Signatures of columns: small summaries of columns
 2) Examine pairs of signatures to find similar columns
 Essential: Similarities of signatures and columns are related

 3) Optional: Check that columns with similar signatures are really similar

 Warnings:
 Comparing all pairs may take too much time: Job for LSH
 These methods can produce false negatives, and even false positives (if the optional check is not made)
HASHING COLUMNS (SIGNATURES)

 Key idea: “hash” each column C to a small signature h(C), such that:
 (1) h(C) is small enough that the signature fits in RAM
 (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)

 Goal: Find a hash function h(·) such that:


 If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
 If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Hash docs into buckets. Expect that “most” pairs of near-duplicate docs hash into the same bucket!
MIN-HASHING

 Goal: Find a hash function h(·) such that:


 if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
 if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Clearly, the hash function depends on the similarity metric:


 Not all similarity metrics have a suitable hash function

 There is a suitable hash function for the Jaccard similarity: it is called Min-Hashing
MIN-HASHING

 Imagine the rows of the boolean matrix permuted under a random permutation π

 Define a “hash” function h_π(C) = the index of the first (in the permuted order π) row in which column C has value 1:
   h_π(C) = min π(C)

 Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
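A minimal sketch of this definition using explicit random permutations of the row indices (for illustration only; the one-pass row-hashing trick shown later is what one would use in practice, and the function name is not from the slides):

```python
import random

def minhash_by_permutation(columns, n_rows, num_perm=100, seed=0):
    """Signature matrix via explicit random permutations.
    `columns` is a list of sets of row indices where each column has a 1."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    sig = []
    for _ in range(num_perm):
        perm = rows[:]
        rng.shuffle(perm)                     # perm[r] = position of row r in the permuted order
        sig.append([min(perm[r] for r in col) for col in columns])
    return sig                                # num_perm rows, one signature value per column

# The example matrix from the next slide, with rows 0-indexed:
cols = [{0, 1, 5, 6}, {2, 3, 4}, {0, 5, 6}, {1, 2, 3, 4}]
M = minhash_by_permutation(cols, n_rows=7, num_perm=3)   # random perms, so values differ from the slide
```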
MIN-HASHING EXAMPLE

Permutations π (each column below is one permutation of the 7 rows), input matrix (shingles × documents), and the resulting signature matrix M:

 Permutations     Input matrix     Signature matrix M
  2  4  3          1 0 1 0          2 1 2 1
  3  2  4          1 0 0 1          2 1 4 1
  7  1  7          0 1 0 1          1 2 1 2
  6  3  2          0 1 0 1
  1  6  6          0 1 0 1
  5  7  1          1 0 1 0
  4  5  5          1 0 1 0

E.g., for the first permutation and the first column, the 2nd element of the permutation is the first to map to a 1, so the signature value is 2; for the second permutation and the third column, the 4th element of the permutation is the first to map to a 1, giving 4.

Note: Another (equivalent) way is to store row indexes instead of permuted positions:
  1 5 1 5
  2 3 1 3
  6 4 6 4
THE MIN-HASH PROPERTY
 Choose a random permutation π
 Claim: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
 Why?
  Let X be a doc (set of shingles), y ∈ X is a shingle
  Then: Pr[π(y) = min(π(X))] = 1/|X|
   It is equally likely that any y ∈ X is mapped to the min element
  Let y be s.t. π(y) = min(π(C1 ∪ C2))
  Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
   (one of the two columns had to have a 1 at position y)
  So, the prob. that both are true is the prob. y ∈ C1 ∩ C2
  Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
FOUR TYPES OF ROWS
 Given cols C1 and C2, rows may be classified as:
C1 C2
A 1 1
B 1 0
C 0 1
D 0 0
 a = # rows of type A, etc.
 Note: sim(C1, C2) = a/(a +b +c)
 Then: Pr[h(C1) = h(C2)] = Sim(C1, C2)
 Look down the columns C1 and C2 until we see a 1
  If it’s a type-A row, then h(C1) = h(C2); if it’s a type-B or type-C row, then not
SIMILARITY FOR SIGNATURES

 We know: Pr[h(C1) = h(C2)] = sim(C1, C2)


 Now generalize to multiple hash functions

 The similarity of two signatures is the fraction of the hash functions in


which they agree

 Note: Because of the Min-Hash property, the similarity of columns is the same as
the expected similarity of their signatures
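A small sketch of the signature-level estimate (the values below are columns 1 and 3 of the example signature matrix M):

```python
def signature_similarity(sig1, sig2):
    """Fraction of min-hash components on which two signatures agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

print(signature_similarity([2, 2, 1], [2, 4, 1]))   # 2/3 ≈ 0.67, as on the next slide
```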
MIN-HASHING EXAMPLE
(Same permutations, input matrix, and signature matrix M as above.)

 Similarities:   1-3    2-4    1-2    3-4
  Col/Col        0.75   0.75   0      0
  Sig/Sig        0.67   1.00   0      0
MIN-HASH SIGNATURES

 Pick K=100 random permutations of the rows


 Think of sig(C) as a column vector
  sig(C)[i] = according to the i-th permutation, the index of the first row that has a 1 in column C
    sig(C)[i] = min π_i(C)

 Note: The sketch (signature) of document C is small: ~100 bytes!

 We achieved our goal! We “compressed” long bit vectors into short signatures
IMPLEMENTATION TRICK
 Permuting rows even once is prohibitive
 Row hashing!
  Pick K = 100 hash functions k_i
  Ordering under k_i gives a random row permutation!
 One-pass implementation
  For each column C and hash function k_i, keep a “slot” for the min-hash value
  Initialize all sig(C)[i] = ∞
  Scan rows looking for 1s
   Suppose row j has 1 in column C
   Then for each k_i:
    If k_i(j) < sig(C)[i], then sig(C)[i] ← k_i(j)

 How to pick a random hash function h(x)? Universal hashing:
  h_a,b(x) = ((a·x + b) mod p) mod N
  where: a, b … random integers, p … prime number (p > N)
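A minimal one-pass sketch following the slide's scheme; the prime p, the seed, and the function name are illustrative choices, not part of the slides:

```python
import random

def one_pass_minhash(columns, n_rows, k=100, seed=42):
    """One-pass Min-Hash: row hashing with universal hash functions
    h_{a,b}(x) = ((a*x + b) mod p) mod n_rows instead of real permutations.
    `columns` is a list of sets of row indices holding a 1."""
    rng = random.Random(seed)
    p = 2_147_483_647                                 # a prime larger than n_rows
    funcs = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(k)]
    sig = [[float("inf")] * k for _ in columns]       # initialize all sig(C)[i] = infinity
    for j in range(n_rows):                           # scan rows looking for 1s
        hj = [((a * j + b) % p) % n_rows for a, b in funcs]
        for c, col in enumerate(columns):
            if j in col:                              # row j has a 1 in column c
                sig[c] = [min(s, h) for s, h in zip(sig[c], hj)]   # keep the min-hash value
    return sig
```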
EXAMPLE

Input matrix (rows × columns):
 Row  C1  C2
  1    1   0
  2    0   1
  3    1   1
  4    1   0
  5    0   1

Hash functions: h(x) = x mod 5, g(x) = (2x + 1) mod 5
Update rule: if k_i(j) < sig(C)[i], then sig(C)[i] ← k_i(j)

Processing the rows in order (slots shown after each row is scanned):
            Sig1  Sig2
 h(1) = 1    1     ∞
 g(1) = 3    3     ∞
 h(2) = 2    1     2
 g(2) = 0    3     0
 h(3) = 3    1     2
 g(3) = 2    2     0
 h(4) = 4    1     2
 g(4) = 4    2     0
 h(5) = 0    1     0
 g(5) = 1    2     0


Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
LSH: FIRST CUT

 Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8)

 LSH – General idea: Use a function f(x, y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated

 For Min-Hash matrices:
  Hash columns of signature matrix M to many buckets
  Each pair of documents that hashes into the same bucket is a candidate pair
CANDIDATES FROM MIN-HASH

 Pick a similarity threshold s (0 < s < 1)


 Columns x and y of M are a candidate pair if their signatures agree on
at least fraction s of their rows:
M (i, x) = M (i, y) for at least fraction s values of i
 We expect documents x and y to have the same (Jaccard) similarity as their
signatures
LSH FOR MIN-HASH

 Big idea: Hash columns of signature matrix M several


times
 Arrange that (only) similar columns are likely to hash to the
same bucket, with high probability
 Candidate pairs are those that hash to the same bucket
PARTITION M INTO B BANDS

(Figure: the example signature matrix M

  2 1 4 1
  1 2 1 2
  2 1 2 1

is divided into b bands of r rows per band; each column is one signature.)
PARTITION M INTO BANDS

 Divide matrix M into b bands of r rows

 For each band, hash its portion of each column to a hash table with k
buckets
 Make k as large as possible

 Candidate column pairs are those that hash to the same bucket for ≥ 1
band

 Tune b and r to catch most similar pairs, but few non-similar pairs
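A minimal sketch of the banding step (the function name and the use of Python's built-in hash are illustrative; a real implementation would typically use one table of k buckets per band):

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, b, r):
    """Split each signature (a list of b*r min-hash values) into b bands of r rows,
    hash each band's portion, and report all column pairs that share a bucket in
    at least one band as candidate pairs."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)                        # a fresh hash table for this band
        for col, sig in enumerate(signatures):
            chunk = tuple(sig[band * r:(band + 1) * r])    # this band's portion of the column
            buckets[hash(chunk)].append(col)
        for cols in buckets.values():                      # same bucket -> candidate pair
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add((cols[i], cols[j]))
    return candidates
```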
HASHING BANDS
 (Figure: one band of matrix M, with r rows and b bands, hashed into buckets.)
 Columns 2 and 6 are probably identical (candidate pair)
 Columns 6 and 7 are surely different.
SIMPLIFYING ASSUMPTION

 There are enough buckets that columns are unlikely to hash to the
same bucket unless they are identical in a particular band

 Hereafter, we assume that “same bucket” means “identical in that band”

 Assumption needed only to simplify analysis, not for correctness of algorithm
EXAMPLE OF BANDS
Assume the following case:
 Suppose 100,000 columns of M (100k docs)
 Signatures of 100 integers (rows)
 Therefore, signatures take 40 MB
 Want all 80%-similar pairs of documents
 5,000,000,000 pairs of signatures can take a while to compare
 Choose b = 20 bands of r = 5 integers/band

 Goal: Find pairs of documents that are at least s = 0.8 similar


C1, C2 ARE 80% SIMILAR

 Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.8
  Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: we want them to hash to at least 1 common bucket (at least one band is identical)
 Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328
 Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035
  i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
  We would find 99.965% of the pairs of truly similar documents
C1, C2 ARE 30% SIMILAR

 Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.3
  Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)
 Probability C1, C2 identical in one particular band: (0.3)^5 = 0.00243
 Probability C1, C2 identical in at least 1 of 20 bands: 1 − (1 − 0.00243)^20 = 0.0474
 In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up becoming candidate pairs
  They are false positives, since we will have to examine them (they are candidate pairs) but it will then turn out that their similarity is below threshold s
LSH INVOLVES A TRADEOFF

 Pick:
 The number of Min-Hashes (rows of M)
 The number of bands b, and
 The number of rows r per band

to balance false positives/negatives

 Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up
ANALYSIS OF LSH – WHAT WE WANT

(Ideal plot: probability of sharing a bucket as a function of the similarity t = sim(C1, C2) of two sets.)
 Probability = 1 if t > s (the similarity threshold)
 No chance of sharing a bucket if t < s


WHAT 1 BAND OF 1 ROW GIVES YOU

(Plot: probability of sharing a bucket grows linearly with the similarity t = sim(C1, C2) of two sets.)
 Remember: probability of equal hash-values = similarity


B BANDS, R ROWS/BAND

 Columns C1 and C2 have similarity t

 Pick any band (r rows)
  Prob. that all rows in band equal = t^r
  Prob. that some row in band unequal = 1 − t^r

 Prob. that no band identical = (1 − t^r)^b

 Prob. that at least 1 band identical = 1 − (1 − t^r)^b
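This S-curve is easy to evaluate directly; a small sketch reproducing the numbers used on the surrounding slides:

```python
def prob_candidate(t, r, b):
    """Probability that two columns of similarity t become a candidate pair
    with b bands of r rows: 1 - (1 - t^r)^b."""
    return 1 - (1 - t**r) ** b

print(prob_candidate(0.8, 5, 20))   # ≈ 0.9996: 80%-similar pairs almost always become candidates
print(prob_candidate(0.3, 5, 20))   # ≈ 0.047: 30%-similar pairs rarely do
```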
WHAT B BANDS OF R ROWS GIVES YOU

(S-curve plot: probability of sharing a bucket vs. similarity t = sim(C1, C2) of two sets; the curve is 1 − (1 − t^r)^b and its threshold sits at s ≈ (1/b)^(1/r).)
 To the left of the threshold: no bands identical (some row of a band is unequal)
 To the right: at least one band identical (all rows of a band are equal)
EXAMPLE: B = 20; R = 5

 Prob. that at least 1 band is identical, as a function of similarity s (the similarity threshold):

  s     1 − (1 − s^r)^b
  .2    .006
  .3    .047
  .4    .186
  .5    .470
  .6    .802
  .7    .975
  .8    .9996
PICKING R AND B: THE S-CURVE

 Picking r and b to get the best S-curve
  Example: 50 hash-functions (r = 5, b = 10)
 (Plot: prob. of sharing a bucket vs. similarity; blue area = false negative rate, green area = false positive rate.)
LSH SUMMARY

 Tune M, b, r to get almost all pairs with similar signatures, but eliminate
most pairs that do not have similar signatures

 Check in main memory that candidate pairs really do have similar signatures

 Optional: In another pass through the data, check that the remaining candidate pairs really represent similar documents
SUMMARY: 3 STEPS

 Shingling: Convert documents to sets


 We used hashing to assign each shingle an ID

 Min-Hashing: Convert large sets to short signatures, while preserving similarity


 We used similarity preserving hashing to generate signatures with property Pr[h(C1) = h(C2)]
= sim(C1, C2)
 We used hashing to get around generating random permutations

 Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
  We used hashing to find candidate pairs of similarity ≥ s
