Association Analysis, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar
Association Analysis
Market-Basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Total number of possible rules that can be extracted from a dataset with d items:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1
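For example, d = 6 gives R = 3^6 − 2^7 + 1 = 602 rules. A small Python check (assuming nothing beyond the formula above) that the closed form matches the double sum:

from math import comb

def rule_count(d):
    # Count all rules X → Y with X, Y non-empty, disjoint, and X ∪ Y drawn from d items
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

for d in range(2, 8):
    assert rule_count(d) == 3**d - 2**(d + 1) + 1
print(rule_count(6))   # 602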
Example of Rules (from the market-basket transactions above):

{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)
Observations:
All the rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but
can have different confidence
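As a quick check of the numbers above, here is a minimal Python sketch (not from the textbook) that computes support and confidence over the five example transactions:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # s(antecedent ∪ consequent) / s(antecedent)
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Milk", "Diaper", "Beer"}))        # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}))   # 0.666...
print(confidence({"Milk", "Beer"}, {"Diaper"}))   # 1.0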
For items I = {a, b, c, d, e}, the possible itemsets are:
1-itemsets: a, b, c, d, e
2-itemsets: ab, ac, ad, ae, bc, bd, be, cd, ce, de
3-itemsets: abc, abd, abe, acd, ace, ade, bcd, bce, bde, cde
and so on, up to the full itemset abcde.

Itemset lattice for I = {A, B, C, D, E}:
Level 1: A B C D E
Level 2: AB AC AD AE BC BD BE CD CE DE
Level 3: ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
Brute-force support counting: match each transaction against every candidate itemset.

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

With N transactions, M candidates, and maximum transaction width w, this costs roughly N × M × w comparisons.
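A minimal sketch of this brute-force counting (the choice of all 2-itemsets as candidates is just for illustration):

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

items = sorted(set().union(*transactions))
# Every possible 2-itemset is a candidate (M grows quickly with the number of items)
candidates = [frozenset(c) for c in combinations(items, 2)]

# Brute force: every transaction is compared against every candidate
counts = {c: 0 for c in candidates}
for t in transactions:            # N transactions
    for c in candidates:          # M candidates
        if c <= t:                # subset test costs up to w comparisons
            counts[c] += 1
print(counts)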
Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent.

For example:
– If {Milk, Bread, Diapers} is a frequent itemset, its support is high, i.e. it occurs in many transactions. Every transaction containing {Milk, Bread, Diapers} also contains each of its subsets, such as {Milk, Bread}, so those subsets are frequent as well.
Equivalently (the contrapositive of the Apriori principle):
– If an itemset is infrequent, then all of its supersets must also be infrequent.

For example:
– If {Coke} and {Jam} are infrequent itemsets, their supports are low, i.e. they occur in few transactions. The superset {Coke, Jam} cannot occur in more transactions than either of them, so it is also infrequent.
Support-Based Pruning:
– Trimming the exponential search space based on the support measure.
– Based on the anti-monotone property: the support of an itemset never exceeds the support of its subsets.

Let I be the set of items and J = 2^I its power set (all possible itemsets).
Support s is anti-monotone (downward closed):
∀X, Y ∈ J : (X ⊆ Y) → s(X) ≥ s(Y)

Any anti-monotone measure can be used for this kind of pruning in mining.
The Apriori Principle: Monotonicity Property
Let I be the set of items and J = 2^I its power set (all possible itemsets).
– A measure f is monotone (upward closed) if
∀X, Y ∈ J : (X ⊆ Y) → f(X) ≤ f(Y)
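A tiny self-contained check of the anti-monotone property of support on the example transactions (a sketch; it simply tests every subset pair):

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
all_itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

# Anti-monotonicity: X ⊆ Y implies s(X) >= s(Y)
for X in all_itemsets:
    for Y in all_itemsets:
        if X <= Y:
            assert support(X) >= support(Y)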
Apriori Algorithm: example transactions

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk
Itemset lattice from null down to ABCDE: once an itemset (here, for example, AB) is found to be infrequent, all of its supersets (ABC, ABD, ..., ABCDE) are pruned from the search space.
Illustrating Apriori Principle

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Item counts (candidate 1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Minimum Support = 3
Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  Prune candidate itemsets that contain an infrequent subset of length k
  Count the support of each candidate by scanning the DB
  Eliminate candidates that are infrequent, leaving only those that are frequent
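A compact Python sketch of this method (an illustrative implementation, not the textbook's pseudocode; minsup is taken as an absolute support count):

from itertools import combinations

def apriori(transactions, minsup):
    # Return {frequent itemset: support count}
    transactions = [frozenset(t) for t in transactions]

    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    # Frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    freq = {c: s for c, s in count(items).items() if s >= minsup}
    result = dict(freq)
    k = 1
    while freq:
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets
        prev = list(freq)
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Candidate pruning: every k-subset must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # Support counting and elimination of infrequent candidates
        freq = {c: s for c, s in count(cands).items() if s >= minsup}
        result.update(freq)
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Beer", "Bread", "Diaper", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Bread", "Coke", "Diaper", "Milk"},
]
print(apriori(transactions, minsup=3))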
Enumerating the 3-item subsets of transaction t = {1 2 3 5 6} with a prefix structure:
Level 1: prefixes 1 (followed by 2 3 5 6), 2 (followed by 3 5 6), 3 (followed by 5 6)
Level 2: 12 (3 5 6), 13 (5 6), 15 (6), 23 (5 6), 25 (6), 35 (6)
Level 3 (all 3-subsets): 123, 125, 126, 135, 136, 156, 235, 236, 256, 356
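The same enumeration with itertools (a sketch; the prefix-based traversal above produces exactly these ten subsets, each generated once):

from itertools import combinations

t = [1, 2, 3, 5, 6]
# All 3-item subsets of the transaction, in lexicographic (prefix) order
print(list(combinations(t, 3)))
# [(1, 2, 3), (1, 2, 5), (1, 2, 6), (1, 3, 5), (1, 3, 6), (1, 5, 6),
#  (2, 3, 5), (2, 3, 6), (2, 5, 6), (3, 5, 6)]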
Candidate hash tree for 15 candidate 3-itemsets:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

Hash function applied to one item at each level of the tree:
items 1, 4 or 7 → left branch; items 2, 5 or 8 → middle branch; items 3, 6 or 9 → right branch.
Matching transaction {1 2 3 5 6} against the hash tree:
At the root, hash on items 1, 2 and 3 of the transaction: 1 + 2 3 5 6, 2 + 3 5 6, 3 + 5 6.
At the next level, extend each prefix with the remaining items (12 + 3 5 6, 13 + 5 6, 15 + 6, and so on), following the hash function (1,4,7 / 2,5,8 / 3,6,9) at every node until a leaf is reached, where the stored candidates are compared against the transaction.
Match transaction against 11 out of 15 candidates
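A minimal hash-tree sketch along these lines (an illustrative implementation, not the textbook's; the leaf-size threshold of 3 and the helper names are assumptions):

MAX_LEAF = 3                       # assumed leaf-size threshold

def bucket(item):
    # Hash function from the figure: 1,4,7 → left, 2,5,8 → middle, 3,6,9 → right
    return (item - 1) % 3

class Node:
    def __init__(self):
        self.children = {}         # bucket → Node, empty for a leaf
        self.candidates = []       # itemsets stored in a leaf

    def insert(self, itemset, depth=0):
        if self.children:                               # internal node: descend
            b = bucket(itemset[depth])
            self.children.setdefault(b, Node()).insert(itemset, depth + 1)
        else:                                           # leaf node
            self.candidates.append(itemset)
            if len(self.candidates) > MAX_LEAF and depth < len(itemset):
                stored, self.candidates = self.candidates, []
                for c in stored:                        # split an overfull leaf
                    b = bucket(c[depth])
                    self.children.setdefault(b, Node()).insert(c, depth + 1)

    def reachable_leaves(self, transaction, start=0):
        # Leaves visited by the subset operation for this transaction
        if not self.children:
            return {self}
        leaves = set()
        for i in range(start, len(transaction)):        # hash on each remaining item
            b = bucket(transaction[i])
            if b in self.children:
                leaves |= self.children[b].reachable_leaves(transaction, i + 1)
        return leaves

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    root.insert(c)

compared = {c for leaf in root.reachable_leaves((1, 2, 3, 5, 6))
            for c in leaf.candidates}
print(len(compared), "of", len(candidates))   # 11 of 15 with these settings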
3/8/2021 Introduction to Data Mining, 2 nd Edition 50
Rule Generation
Lattice of rules for the frequent itemset ABCD:
ABCD → { }
BCD → A    ACD → B    ABD → C    ABC → D
...
If a rule in this lattice (for example BCD → A) has low confidence, then every rule below it that moves further items from the antecedent into the consequent (CD → AB, BD → AC, ..., D → ABC) must also have low confidence and can be pruned.
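A sketch of confidence-based rule generation from one frequent itemset (an illustration of the idea, with made-up helper names; it assumes the supports of all subsets are already available from the frequent-itemset phase, here filled in from the earlier {Milk, Diaper, Beer} example, and minconf = 0.6 is chosen for illustration):

def gen_rules(itemset, support, minconf):
    # Rules antecedent → consequent with confidence >= minconf.
    # Once a rule fails, no rule with a larger consequent containing the
    # failed consequent is generated (confidence-based pruning).
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]      # size-1 consequents first
    while consequents and len(next(iter(consequents))) < len(itemset):
        kept = []
        for cons in consequents:
            ante = itemset - cons
            conf = support[itemset] / support[ante]
            if conf >= minconf:
                rules.append((ante, cons, conf))
                kept.append(cons)
        # merge surviving consequents into larger ones (like candidate join)
        consequents = list({a | b for a in kept for b in kept
                            if len(a | b) == len(a) + 1})
    return rules

support = {frozenset(s): v for s, v in [
    (("Milk",), 0.8), (("Diaper",), 0.8), (("Beer",), 0.6),
    (("Milk", "Diaper"), 0.6), (("Milk", "Beer"), 0.4),
    (("Diaper", "Beer"), 0.6), (("Milk", "Diaper", "Beer"), 0.4),
]}
for ante, cons, conf in gen_rules({"Milk", "Diaper", "Beer"}, support, minconf=0.6):
    print(set(ante), "→", set(cons), round(conf, 2))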
The cost of these algorithms also depends on factors such as the size of the database and the choice of minimum support threshold, as the following comparison of candidate counts illustrates.
TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Item counts (candidate 1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

With Minimum Support = 3:
If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 4 = 16 candidates

With Minimum Support = 2:
If every subset is considered: 6C1 + 6C2 + 6C3 + 6C4 = 6 + 15 + 20 + 15 = 56 candidates
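The counts above as a quick computation (the numbers follow directly from the binomial coefficients):

from math import comb

no_pruning_minsup3 = sum(comb(6, k) for k in (1, 2, 3))      # 6 + 15 + 20 = 41
no_pruning_minsup2 = sum(comb(6, k) for k in (1, 2, 3, 4))   # 6 + 15 + 20 + 15 = 56
with_pruning = 6 + 6 + 4                                     # candidates actually generated
print(no_pruning_minsup3, no_pruning_minsup2, with_pruning)  # 41 56 16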
Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
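Evaluating this sum (a one-line check; Σ_{k=1}^{10} C(10, k) = 2^10 − 1 = 1023):

from math import comb

print(3 * sum(comb(10, k) for k in range(1, 11)))   # 3 * 1023 = 3069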
Maximal Itemsets

Itemset lattice for {A, B, C, D, E}, from the single items down to ABCDE, with a border separating the frequent itemsets from the infrequent ones. The maximal frequent itemsets are the frequent itemsets lying directly on the frequent side of this border, i.e. those none of whose immediate supersets is frequent.