We will be releasing HW1 today
¡ It is due in 2 weeks (1/24 at 23:59)
¡ The homework is long
§ Requires proving theorems as well as coding
¡ Please start early
Recitation sessions:
¡ Spark Tutorial and Clinic:
Today 4:30-5:50pm in Skilling Auditorium
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
Supermarket shelf management –
Market-basket model:
¡ Goal: Identify items that are bought together by
sufficiently many customers
¡ Approach: Process the sales data collected with
barcode scanners to find dependencies among
items
¡ A classic rule:
§ If someone buys diapers and milk, then they are
likely to buy beer
§ Don’t be surprised if you find six-packs next to diapers!
Input:
¡ A large set of items
§ e.g., things sold in a supermarket
¡ A large set of baskets
§ Each basket is a small subset of items
§ e.g., the things one customer buys on one day

Basket | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk

Output:
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

¡ Discover association rules:
People who bought {x,y,z} tend to buy {v,w}
§ Example application: Amazon
¡ A general many-to-many mapping
(association) between two kinds of things
§ But we ask about connections among “items”,
not “baskets”
¡ Items and baskets are abstract:
§ For example:
§ Items/baskets can be products/shopping basket
§ Items/baskets can be words/documents
§ Items/baskets can be basepairs/genes
§ Items/baskets can be drugs/patients
¡ Items = products; Baskets = sets of products
someone bought in one trip to the store
¡ Real market baskets: Chain stores keep TBs of
data about what customers buy together
§ Tells how typical customers navigate stores, lets
them position tempting items:
§ Apocryphal story of “diapers and beer” discovery
§ Used to position potato chips between diapers and beer to
enhance sales of potato chips
¡ Amazon’s ‘people who bought X also bought Y’
¡ Baskets = sentences; Items = documents in
which those sentences appear
§ Items that appear together too often could
represent plagiarism
§ Notice items do not have to be “in” baskets
¡ Baskets = patients; Items = drugs & side-effects
§ Has been used to detect combinations
of drugs that result in particular side-effects
§ But requires extension: Absence of an item
needs to be observed as well as presence
First: Define
Frequent itemsets
Association rules:
Confidence, Support, Interestingness
Then: Algorithms for finding frequent itemsets
Finding frequent pairs
A-Priori algorithm
PCY algorithm
¡ Simplest question: Find sets of items that
appear together “frequently” in baskets
¡ Support for itemset I: Number of baskets containing all items in I
§ (Often expressed as a fraction of the total number of baskets)
¡ Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets

TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk

Support of {Beer, Bread} = 2
¡ Items = {milk, coke, pepsi, beer, juice}
¡ Support threshold = 3 baskets
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
¡ Frequent itemsets: {m}, {c}, {b}, {j},
{m,b} , {b,c} , {c,j}.
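As a concrete illustration (not part of the slides), here is a minimal Python sketch that brute-force counts supports of singletons and pairs for the toy baskets above and keeps those meeting the threshold s = 3; all names are my own.

from itertools import combinations
from collections import Counter

# Toy baskets from the slide: m = milk, c = coke, p = pepsi, b = beer, j = juice
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3  # support threshold (number of baskets)

# Brute-force support counts for all singletons and pairs
support = Counter()
for basket in baskets:
    for item in basket:
        support[frozenset([item])] += 1
    for pair in combinations(sorted(basket), 2):
        support[frozenset(pair)] += 1

frequent = {itemset: cnt for itemset, cnt in support.items() if cnt >= s}
# -> {m}:5, {c}:5, {b}:6, {j}:4, {m,b}:4, {b,c}:4, {c,j}:3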
¡ Define: Association Rules:
If-then rules about the contents of baskets
¡ {i1, i2,…,ik} → j means: “if a basket contains
all of i1,…,ik then it is likely to contain j”
¡ In practice there are many rules, want to find
significant/interesting ones!
¡ Confidence of association rule is the
probability of j given I = {i1,…,ik}
conf(I → j) = support(I ∪ j) / support(I)
¡ Not all high-confidence rules are interesting
§ The rule X → milk may have high confidence for
many itemsets X, because milk is just purchased very
often (independent of X) and the confidence will be
high
¡ Interest of an association rule I → j:
abs. difference between its confidence and
the fraction of baskets that contain j
Interest(I → j) = |conf(I → j) − Pr[j]|
§ Interesting rules are those with high positive or
negative interest values (usually above 0.5)
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
¡ Association rule: {m, b} →c
§ Support = 2
§ Confidence = 2/4 = 0.5
§ Interest = |0.5 – 5/8| = 1/8
§ Item c appears in 5/8 of the baskets
§ The rule is not very interesting!
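A small sketch (my own illustration, not the course's code) that computes the confidence and interest of the rule {m, b} → c on the toy baskets, reproducing the numbers above.

def support_of(itemset, baskets):
    """Number of baskets containing every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, j, baskets):
    """conf(I -> j) = support(I ∪ {j}) / support(I)."""
    return support_of(I | {j}, baskets) / support_of(I, baskets)

def interest(I, j, baskets):
    """Interest(I -> j) = |conf(I -> j) - Pr[j]|."""
    pr_j = support_of({j}, baskets) / len(baskets)
    return abs(confidence(I, j, baskets) - pr_j)

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
print(confidence({"m", "b"}, "c", baskets))  # 0.5
print(interest({"m", "b"}, "c", baskets))    # 0.125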
¡ Problem: Find all association rules with
support ≥s and confidence ≥c
§ Note: Support of an association rule is the support
of the set of items in the rule (left and right side)
¡ Hard part: Finding the frequent itemsets!
§ If {i1, i2,…, ik} → j has high support and
confidence, then both {i1, i2,…, ik} and
{i1, i2,…,ik, j} will be “frequent”
conf(I → j) = support(I ∪ j) / support(I)
conf(I → j) = support(I ∪ j) / support(I)
¡ Step 1: Find all frequent itemsets I
§ (we will explain this next)
¡ Step 2: Rule generation
§ For every subset A of I, generate a rule A → I \ A
§ Since I is frequent, A is also frequent
§ Variant 1: Single pass to compute the rule confidence
§ confidence(A,B→C,D) = support(A,B,C,D) / support(A,B)
§ Variant 2:
§ Observation: If A,B,C→D is below confidence, so is A,B→C,D
§ Can generate “bigger” rules from smaller ones!
§ Output the rules above the confidence threshold
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, c, b, n} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
¡ Support threshold s = 3, confidence c = 0.75
¡ Step 1) Find frequent itemsets:
§ {b,m} {b,c} {c,m} {c,j} {m,c,b}
¡ Step 2) Generate rules:
§ b→m: c=4/6   b→c: c=5/6   b,c→m: c=3/5
§ m→b: c=4/5   b→c,m: c=3/6   b,m→c: c=3/4
§ …
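A minimal sketch (my own code, not the course's) of the Step 2 rule generation described earlier, reproducing the example above: for every frequent itemset I and every proper subset A, output A → I \ A if its confidence clears the threshold. It assumes the support dictionary contains every frequent itemset and all of its subsets, which monotonicity guarantees.

from itertools import combinations

def generate_rules(supports, conf_threshold):
    """supports: dict frozenset -> support count, containing every frequent
    itemset and all of its subsets. Returns (A, B, conf) for rules A -> B."""
    rules = []
    for I, supp_I in supports.items():
        if len(I) < 2:
            continue
        for r in range(1, len(I)):
            for A in combinations(sorted(I), r):
                A = frozenset(A)
                conf = supp_I / supports[A]        # conf(A -> I \ A)
                if conf >= conf_threshold:
                    rules.append((set(A), set(I - A), conf))
    return rules

# Supports from the example above (s = 3)
supports = {frozenset(x): c for x, c in [
    (("b",), 6), (("c",), 6), (("m",), 5), (("j",), 4),
    (("b", "m"), 4), (("b", "c"), 5), (("c", "m"), 3), (("c", "j"), 3),
    (("b", "c", "m"), 3)]}
for A, B, conf in generate_rules(supports, 0.75):
    print(A, "->", B, round(conf, 2))   # e.g. {m} -> {b} 0.8, {b,m} -> {c} 0.75, ...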
¡ To reduce the number of rules, we can
post-process them and only output:
§ Maximal frequent itemsets:
No immediate superset is frequent
§ Gives more pruning
or
§ Closed itemsets:
No immediate superset has the same support (> 0)
§ Stores not only frequent information, but exact
supports/counts
Itemset | Support | Maximal (s=3) | Closed | Notes
A       | 4       | No            | No     |
B       | 5       | No            | Yes    |
C       | 3       | No            | No     | Frequent, but superset BC also frequent; superset BC has same support
AB      | 4       | Yes           | Yes    | Frequent, and its only superset, ABC, not frequent
AC      | 2       | No            | No     |
BC      | 3       | Yes           | Yes    | Its only superset, ABC, has smaller support
ABC     | 2       | No            | Yes    |
¡ Back to finding frequent itemsets
¡ Typically, data is kept in flat files rather than in a database system:
§ Stored on disk
§ Stored basket-by-basket
§ Baskets are small, but we have many baskets and many items
§ Expand baskets into pairs, triples, etc. as you read baskets
§ Use k nested loops to generate all sets of size k
[File layout: items are positive integers, and boundaries between baskets are –1]
Note: We want to find frequent itemsets. To find them, we have to count them. To count them, we have to enumerate them.
¡ The true cost of mining disk-resident data is usually the number of disk I/Os
¡ In practice, association-rule algorithms read the data in passes – all baskets read in turn
¡ We measure the cost by the number of passes an algorithm makes over the data
[File layout: items are positive integers, and boundaries between baskets are –1]
¡ For many frequent-itemset algorithms,
main-memory is the critical resource
§ As we read baskets, we need to count
something, e.g., occurrences of pairs of items
§ The number of different things we can count
is limited by main memory
§ Swapping counts in/out is a disaster
¡ The hardest problem often turns out to be
finding the frequent pairs of items {i1, i2}
§ Why? Freq. pairs are common, freq. triples are rare
§ Why? Probability of being frequent drops exponentially
with size; number of sets grows more slowly with size
¡ Let’s first concentrate on pairs, then extend to
larger sets
¡ The approach:
§ We always need to generate all the itemsets
§ But we would only like to count (keep track of) those itemsets that in the end turn out to be frequent
¡ Naïve approach to finding frequent pairs
¡ Read file once, counting in main memory
the occurrences of each pair:
§ From each basket of n items, generate its
n(n-1)/2 pairs by two nested loops
¡ Fails if (#items)^2 exceeds main memory
§ Remember: #items can be 100K (Wal-Mart) or 10B (Web pages)
§ Suppose 10^5 items, counts are 4-byte integers
§ Number of pairs of items: 10^5(10^5 − 1)/2 ≈ 5·10^9
§ Therefore, 2·10^10 bytes (20 gigabytes) of memory is needed
Goal: Count the number of occurrences of
each pair of items (i,j):
¡ Approach 1: Count all pairs using a matrix
¡ Approach 2: Keep a table of triples [i, j, c] =
“the count of the pair of items {i, j} is c.”
§ If integers and item ids are 4 bytes, we need
approximately 12 bytes for pairs with count > 0
§ Plus some additional overhead for the hashtable
[Figure: the two approaches side by side – Triangular Matrix (4 bytes per pair {i, j}) vs. Triples (12 bytes per occurring pair)]
¡ Approach 1: Triangular Matrix
§ n = total number of items
§ Count pair of items {i, j} only if i < j
§ Keep pair counts in lexicographic order:
§ {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…, {2,n}, {3,4},…
§ Pair {i, j} is at position: [n(n − 1) − (n − i)(n − i + 1)]/2 + (j − i)
§ Total number of pairs n(n − 1)/2; total bytes = O(n^2)
§ Triangular Matrix requires 4 bytes per pair
¡ Approach 2 uses 12 bytes per occurring pair (but only for pairs with count > 0)
¡ Approach 2 beats Approach 1 if less than 1/3 of possible pairs actually occur
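A tiny sketch (illustrative, not from the slides) checking the triangular-matrix position formula: pairs {i, j} with i < j fill positions 1 … n(n−1)/2 in lexicographic order, one slot per pair.

def pair_index(i, j, n):
    """1-based position of pair {i, j}, 1 <= i < j <= n, in the order
    {1,2}, {1,3}, ..., {1,n}, {2,3}, ...  (the slide's formula)."""
    assert 1 <= i < j <= n
    return (n * (n - 1) - (n - i) * (n - i + 1)) // 2 + (j - i)

n = 5
print([pair_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)])
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  -- exactly n(n-1)/2 = 10 consecutive slots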
¡ Problem: if we have too many items, the pairs do not fit into memory.
Can we do better?
• Monotonicity of “Frequent”
• Notion of Candidate Pairs
• Extension to Larger Itemsets
¡ A two-pass approach called
A-Priori limits the need for
main memory
¡ Key idea: monotonicity
§ If a set of items I appears at
least s times, so does every subset J of I
¡ Contrapositive for pairs:
If item i does not appear in s baskets, then no
pair including i can appear in s baskets
¡ So, how does A-Priori find freq. pairs?
¡ Pass 1: Read baskets and count in main memory
the # of occurrences of each individual item
§ Requires only memory proportional to #items
¡ Items that appear ≥ 𝒔 times are the frequent items
¡ Pass 2: Read baskets again and keep track of the
count of only those pairs where both elements
are frequent (from Pass 1)
§ Requires memory proportional to square of frequent
items only (for counts)
§ Plus a list of the frequent items (so you know what must
be counted)
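A compact Python sketch of the two passes just described (my own illustration; the function and variable names are not from the course).

from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for frequent pairs."""
    # Pass 1: count individual items
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two elements are both frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= s}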
[Figure: Main-memory usage of A-Priori. Pass 1: item counts. Pass 2: frequent items plus counts of pairs of frequent items (candidate pairs). The green box represents the amount of available main memory; smaller boxes represent how the memory is used.]
¡ You can use the triangular matrix method with n = number of frequent items
§ May save space compared with storing triples
¡ Trick: re-number frequent items 1, 2, … and keep a table relating new numbers to original item numbers
[Figure: Pass 1: item counts. Pass 2: frequent items, table of old item IDs, counts of pairs of frequent items.]
¡ For each k, we construct two sets of k-tuples (sets of size k):
§ Ck = candidate k-tuples = those that might be frequent sets (support ≥ s) based on information from the pass for k–1
§ Lk = the set of truly frequent k-tuples

All items → C1 → [count the items; filter] → L1 → [construct all pairs of items from L1] → C2 → [count the pairs; filter] → L2 → [construct; to be explained] → C3 → …
** Note: here we generate new candidates by
generating Ck from Lk-1 and L1.
But one can be more careful with candidate
generation. For example, in C3 we know {b,m,j}
cannot be frequent since {m,j} is not frequent
¡ Hypothetical steps of the A-Priori algorithm
§ C1 = { {b} {c} {j} {m} {n} {p} }
§ Count the support of itemsets in C1
§ Prune non-frequent. We get: L1 = { b, c, j, m }
§ Generate C2 = { {b,c} {b,j} {b,m} {c,j} {c,m} {j,m} }
§ Count the support of itemsets in C2
§ Prune non-frequent. L2 = { {b,m} {b,c} {c,m} {c,j} }
§ Generate C3 = { {b,c,m} {b,c,j} {b,m,j} {c,m,j} }
§ Count the support of itemsets in C3 **
§ Prune non-frequent. L3 = { {b,c,m} }
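An illustrative sketch (not the course's code) of the "more careful" candidate generation mentioned in the note: build Ck from Lk−1 and L1, then drop any candidate that has a non-frequent (k−1)-subset.

from itertools import combinations

def construct_candidates(L_prev, L1_items):
    """C_k from L_{k-1} x L_1, pruned: drop candidates with a (k-1)-subset
    that is not in L_{k-1}."""
    L_prev = {frozenset(t) for t in L_prev}
    k = len(next(iter(L_prev))) + 1
    candidates = {itemset | {item}
                  for itemset in L_prev for item in L1_items
                  if item not in itemset}
    return {c for c in candidates
            if all(frozenset(sub) in L_prev for sub in combinations(c, k - 1))}

L1 = {"b", "c", "j", "m"}
L2 = [{"b", "m"}, {"b", "c"}, {"c", "m"}, {"c", "j"}]
print(construct_candidates(L2, L1))
# Only {b, c, m} survives; {b, m, j} is dropped because {m, j} is not in L2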
¡ One pass for each k (itemset size)
¡ Needs room in main memory to count
each candidate k–tuple
¡ For typical market-basket data and reasonable
support (e.g., 1%), k = 2 requires the most memory
¡ Many possible extensions:
§ Association rules with intervals:
§ For example: Men over 65 have 2 cars
§ Association rules when items are in a taxonomy
§ Bread, Butter → FruitJam
§ BakedGoods, MilkProduct → PreservedGoods
§ Lower the support s as itemset gets bigger
• Improvement to A-Priori
• Exploits Empty Memory on First Pass
• Frequent Buckets
¡ Observation:
In pass 1 of A-Priori, most memory is idle
§ We store only individual item counts
§ Can we use the idle memory to reduce
memory required in pass 2?
¡ Pass 1 of PCY: In addition to item counts, maintain a hash table with as many buckets as fit in memory (note: bucket ≠ basket)
§ Keep a count for each bucket into which pairs of items are hashed
§ For each bucket just keep the count, not the actual pairs that hash to the bucket!
FOR (each basket):
    FOR (each item in the basket):
        add 1 to item's count;
    FOR (each pair of items in the basket):    ← new in PCY
        hash the pair to a bucket;
        add 1 to the count for that bucket;
¡ A few things to note:
§ Pairs of items need to be generated from the input
file; they are not present in the file
§ We are not just interested in the presence of a pair,
but we need to see whether it is present at least s
(support) times
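A runnable version of the Pass 1 pseudocode above (a sketch under my own naming; any hash function works, as long as Pass 2 reuses the same one).

from collections import Counter
from itertools import combinations

def pcy_pass1(baskets, num_buckets):
    """PCY Pass 1: count items and, in the same pass, hash each pair that
    occurs in a basket to a bucket and increment that bucket's count."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % num_buckets] += 1   # count buckets, not pairs
    return item_counts, bucket_counts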
¡ Observation: If a bucket contains a frequent pair,
then the bucket is surely frequent
¡ However, even without any frequent pair,
a bucket can still be frequent
§ So, we cannot use the hash to eliminate any
member (pair) of a “frequent” bucket
¡ But, for a bucket with total count less than s,
none of its pairs can be frequent
§ Pairs that hash to this bucket can be eliminated as
candidates (even if the pair consists of 2 frequent items)
¡ Pass 2:
Only count pairs that hash to frequent buckets
¡ Replace the buckets by a bit-vector:
§ 1 means the bucket count exceeded the support s
(call it a frequent bucket); 0 means it did not
¡ 4-byte integer counts are replaced by bits,
so the bit-vector requires 1/32 of memory
¡ Also, decide which items are frequent
and list them for the second pass
¡ Count all pairs {i, j} that meet the
conditions for being a candidate pair:
1. Both i and j are frequent items
2. The pair {i, j} hashes to a bucket whose bit in
the bit vector is 1 (i.e., a frequent bucket)
¡ Both conditions are necessary for the
pair to have a chance of being frequent
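Continuing the sketch from Pass 1 above (illustrative code, not the course's): Pass 2 counts a pair only when both conditions hold.

from collections import Counter
from itertools import combinations

def pcy_pass2(baskets, s, item_counts, bucket_counts, num_buckets):
    """PCY Pass 2: count only candidate pairs, i.e. both items frequent and
    the pair hashes to a frequent bucket (bitmap bit = 1)."""
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [1 if c >= s else 0 for c in bucket_counts]    # 1 bit per bucket

    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[hash(pair) % num_buckets]:            # same hash as Pass 1
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}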
[Figure: Main-memory usage of PCY. Pass 1: item counts plus a hash table for pair counts. Pass 2: frequent items, a bitmap derived from the hash table, and counts of candidate pairs.]
¡ The MMDS book covers several other extensions
beyond the PCY idea: “Multistage” and
“Multihash”
¡ For reading on your own, Sect. 6.4 of MMDS
¡ Recommended video (starting about 10:10):
https://www.youtube.com/watch?v=AGAkNiQnbjY
• Simple Algorithm
• Savasere-Omiecinski-Navathe (SON) Algorithm
• Toivonen’s Algorithm
¡ A-Priori, PCY, etc., take k passes to find
frequent itemsets of size k
¡ Can we use fewer passes?
¡ Use 2 or fewer passes for all sizes,
but may miss some frequent itemsets
§ Random sampling
§ Do not sneer; “random sample” is often a cure for the
problem of having too large a dataset.
§ SON (Savasere, Omiecinski, and Navathe)
§ Toivonen
¡ Take a random sample of the market baskets
¡ Run A-Priori or one of its improvements in main memory
§ So we don't pay for disk I/O each time we increase the size of itemsets
§ Reduce the support threshold proportionally to match the sample size
§ Example: if your sample is 1/100 of the baskets, use s/100 as your support threshold instead of s
[Figure: Main memory holds a copy of the sample baskets plus space for counts]
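A minimal sketch of the sampling idea (my own illustration): draw a random sample, scale the threshold down proportionally, and run any in-memory miner, e.g. the apriori_pairs sketch from earlier.

import random

def sample_frequent_itemsets(baskets, s, sample_fraction, miner):
    """Mine a random sample in main memory with a proportionally lowered
    support threshold. `miner(sample, threshold)` is any in-memory
    frequent-itemset algorithm (e.g. the apriori_pairs sketch above)."""
    sample = [b for b in baskets if random.random() < sample_fraction]
    scaled_s = max(1, round(s * sample_fraction))   # e.g. s/100 for a 1% sample
    return miner(sample, scaled_s)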
¡ To avoid false positives: Optionally, verify that
the candidate pairs are truly frequent in the
entire data set by a second pass
¡ But you don’t catch sets frequent in the
whole but not in the sample
§ Smaller threshold, e.g., s/125, helps catch more
truly frequent itemsets
§ But requires more space
¡ SON Algorithm: Repeatedly read small
subsets of the baskets into main memory and
run an in-memory algorithm to find all
frequent itemsets
§ Note: we are not sampling, but processing the
entire file in memory-sized chunks
¡ An itemset becomes a candidate if it is found
to be frequent in any one or more subsets of
the baskets.
¡ On a second pass, count all the candidate
itemsets and determine which are frequent in
the entire set
¡ Key “monotonicity” idea: An itemset cannot
be frequent in the entire set of baskets unless
it is frequent in at least one subset
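A sketch of SON under my own naming (not the course's reference code): mine each memory-sized chunk with a proportionally lowered threshold, union the results as candidates, then count candidates exactly on a second pass.

def son(baskets, s, chunk_size, miner):
    """SON: chunked Pass 1 produces candidates; Pass 2 counts them exactly."""
    n = len(baskets)
    candidates = set()
    # Pass 1: any itemset frequent in some chunk (with scaled threshold) is a candidate
    for start in range(0, n, chunk_size):
        chunk = baskets[start:start + chunk_size]
        chunk_s = max(1, s * len(chunk) // n)
        candidates |= {frozenset(i) for i in miner(chunk, chunk_s)}
    # Pass 2: count every candidate over the entire data set
    counts = {c: 0 for c in candidates}
    for basket in baskets:
        for c in candidates:
            if c <= basket:
                counts[c] += 1
    return {c for c, cnt in counts.items() if cnt >= s}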
Pass 1:
¡ Start with a random sample, but lower the
threshold slightly for the sample:
§ Example: if the sample is 1% of the baskets, use
s/125 as the support threshold rather than s/100
¡ Find frequent itemsets in the sample
¡ Add to the itemsets that are frequent in the
sample the negative border of these itemsets:
§ Negative border: An itemset is in the negative
border if it is not frequent in the sample, but all its
immediate subsets are
§ Immediate subset = “delete exactly one element”
¡ {A,B,C,D} is in the negative border if and only if:
1. It is not frequent in the sample, but
2. All of {A,B,C}, {B,C,D}, {A,C,D}, and {A,B,D} are.
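An illustrative sketch (my own) of computing the negative border from the sample's frequent itemsets: extend each frequent itemset by one item and keep extensions that are not frequent but have all immediate subsets frequent; non-frequent singletons belong to the border as well.

def negative_border(frequent, all_items):
    """`frequent`: collection of itemsets frequent in the sample.
    Returns itemsets not frequent in the sample whose immediate subsets
    (delete exactly one element) are all frequent."""
    frequent = {frozenset(f) for f in frequent}
    border = set()
    for f in frequent:
        for item in all_items:
            cand = f | {item}
            if len(cand) == len(f) or cand in frequent:
                continue
            if all(cand - {x} in frequent for x in cand):
                border.add(cand)
    # Non-frequent singletons: their only subset (the empty set) is trivially frequent
    border |= {frozenset([i]) for i in all_items if frozenset([i]) not in frequent}
    return border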
[Figure: Frequent itemsets from the sample (singletons, doubletons, tripletons, …) surrounded by the negative border]
¡ Pass 1:
§ Start with the random sample, but lower the threshold
slightly for the subset
§ Add to the itemsets that are frequent in the sample the
negative border of these itemsets
¡ Pass 2:
§ Count all candidate frequent itemsets from the first pass,
and also count sets in their negative border
¡ If no itemset from the negative border turns out to be
frequent, then we found all the frequent itemsets.
§ What if we find that something in the negative border is
frequent?
§ We must start over again with another sample!
§ Try to choose the support threshold so the probability of failure is low, while the number of itemsets checked on the second pass fits in main memory.
We broke through the negative border. How far does the problem go?
[Figure: Frequent itemsets from the sample surrounded by the negative border (singletons, doubletons, tripletons, …), with the problem extending beyond the border]
¡ If there is an itemset S that is frequent in full data, but not
frequent in the sample, then the negative border contains
at least one itemset that is frequent in the whole.
Proof by contradiction:
¡ Suppose not; i.e.,
1. There is an itemset S frequent in the full data but not
frequent in the sample, and
2. Nothing in the negative border is frequent in the full data
¡ Let T be a smallest subset of S that is not frequent in the
sample (but every proper subset of T is)
¡ T is frequent in the whole (S is frequent + monotonicity).
¡ But then T is in the negative border (contradiction)