Association Rules & Frequent Itemsets

The Market-Basket Problem
• Given a database of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication here means co-occurrence, not causality!
(All you ever wanted to know about diapers, beer and their correlation!)
The Market-Basket Problem
• Given a database of transactions, where each transaction is a collection of items (purchased by a customer in a visit), find all rules that correlate the presence of one set of items with that of another set of items
• Example: 30% of all transactions that contain diapers also contain beer; 5% of all transactions contain both items
  – 30%: confidence of the rule
  – 5%: support of the rule
• We are interested in finding all rules, rather than verifying that a particular rule holds

Applications of Market-Basket Analysis
• Supermarkets
  – Placement
  – Advertising
  – Sales
  – Coupons
• Many applications outside market-basket data analysis
  – Prediction (telecom switch failure)
  – Web usage mining
• Many different types of association rules
  – Temporal
  – Spatial
  – Causal
Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2 in the market-basket transactions above
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X
• Example, over the market-basket transactions above:
  {Milk, Diaper} → {Beer}
  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
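To make these definitions concrete, here is a minimal Python sketch (the transaction list and the helper name support_count are illustrative, not from any particular library) that computes the support and confidence of {Milk, Diaper} → {Beer} over the five example transactions:

    # Minimal sketch: support and confidence over the example transactions.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        # sigma(X): number of transactions that contain every item in X
        return sum(1 for t in transactions if itemset <= t)

    X, Y = {"Milk", "Diaper"}, {"Beer"}
    s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
    c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ≈ 0.67
    print(f"support={s:.2f}, confidence={c:.2f}")

Both numbers match the values derived on the slide: s = 0.4 and c ≈ 0.67.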
Aspects of Association Rule Mining
• How do we generate rules fast?
  – Performance is measured in
    • the number of database scans
    • the number of itemsets that must be counted
• Which are the interesting rules?

Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold
• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!
Mining Association Rules
Example rules over the market-basket transactions above:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Finding Association Rules
Two-step approach:
1. Frequent Itemset Generation
   • Generate all itemsets whose support ≥ minsup
2. Rule Generation
   • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (see the sketch after this slide)
• Frequent itemset generation is still computationally expensive
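As referenced above, the rule-generation step can be sketched in Python; rules_from_itemset is a hypothetical helper (not from any library) that enumerates every binary partition of one frequent itemset and keeps the rules meeting a confidence threshold:

    # Sketch: generating rules as binary partitions of a single frequent itemset.
    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        return sum(1 for t in transactions if itemset <= t)

    def rules_from_itemset(freq, transactions, minconf):
        # Yield (X, Y, support, confidence) for every rule X -> Y with X ∪ Y = freq.
        n = len(transactions)
        freq_count = support_count(freq, transactions)
        for r in range(1, len(freq)):                       # every non-empty proper subset of freq
            for antecedent in combinations(sorted(freq), r):
                X = set(antecedent)
                Y = freq - X
                conf = freq_count / support_count(X, transactions)
                if conf >= minconf:
                    yield X, Y, freq_count / n, conf

    for X, Y, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6):
        print(sorted(X), "->", sorted(Y), f"(s={s:.2f}, c={c:.2f})")

With minconf = 0.6 this keeps the four rules with c ≥ 0.67 from the list above and drops the two with c = 0.5, illustrating that all six rules share the support of {Milk, Diaper, Beer} but differ in confidence.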
Frequent Itemset Generation
[Figure: the itemset lattice over five items A–E, from the null itemset at the top through the 1-itemsets, 2-itemsets, …, down to ABCDE at the bottom. Given d items, there are 2^d possible candidate itemsets.]

Frequent Itemset Generation
• Brute-force approach (sketched in code below):
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database of N transactions (average transaction width w)
  – Match each transaction against every one of the M candidates
  – Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
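The brute-force approach can be written down directly; this sketch (toy data, and minsup taken as an absolute count, both illustrative) generates every non-empty itemset in the lattice and counts it against every transaction, which is exactly the O(NMw) behaviour described above:

    # Sketch: brute-force frequent-itemset generation over the full lattice.
    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]
    items = sorted(set().union(*transactions))       # the d unique items
    minsup_count = 3

    frequent = {}
    for k in range(1, len(items) + 1):                # every level of the lattice
        for candidate in combinations(items, k):      # M = 2^d candidates overall
            c = frozenset(candidate)
            count = sum(1 for t in transactions if c <= t)   # scan all N transactions
            if count >= minsup_count:
                frequent[c] = count

    for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)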
Computational Complexity
• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

      R = Σ_{k=1..d−1} [ C(d, k) × Σ_{j=1..d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

  – If d = 6, R = 602 rules (a quick check is sketched after the next slide)

Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M
• Reduce the number of transactions (N)
  – Reduce the size of N as the size of the itemset increases
  – Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction
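As a quick sanity check of the rule-count formula two slides back, a few lines of Python (assuming Python 3.8+ for math.comb; purely illustrative) evaluate both the double sum and the closed form for d = 6:

    # Sketch: verifying R = 3^d - 2^(d+1) + 1 against the double-sum definition.
    from math import comb

    def rule_count(d):
        return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
                   for k in range(1, d))

    d = 6
    print(rule_count(d), 3**d - 2**(d + 1) + 1)   # both evaluate to 602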
Reducing Number of Candidates
• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following property of the support measure:

      ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – The support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support

Illustrating Apriori Principle
[Figure: the itemset lattice over items A–E. Once an itemset is found to be infrequent, all of its supersets, i.e. the entire sub-lattice below it, can be pruned.]
Illustrating Apriori Principle
Minimum support (count) = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
Itemset           Count
{Bread, Milk}     3
{Bread, Beer}     2
{Bread, Diaper}   3
{Milk, Beer}      2
{Milk, Diaper}    3
{Beer, Diaper}    3

Triplets (3-itemsets):
Itemset                 Count
{Bread, Milk, Diaper}   3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

The Idea of the Apriori Algorithm
• start with all 1-itemsets
• go through the data, count their support, and find all “large” 1-itemsets
• combine them to form “candidate” 2-itemsets
• go through the data, count their support, and find all “large” 2-itemsets
• combine them to form “candidate” 3-itemsets
• …
large itemset: an itemset with support > s
candidate itemset: an itemset that may have support > s
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code (a runnable Python sketch follows below):

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;

Apriori Algorithm from Agrawal et al. (1993)
[Figure: the pseudo-code of the Apriori algorithm as given by Agrawal et al.]
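The pseudo-code above maps fairly directly onto Python. The following is a minimal sketch, not the authors' reference implementation; the names apriori and generate_candidates are illustrative, and support counting uses a plain dictionary rather than the hash-tree discussed later:

    # Sketch: a level-wise Apriori loop following the pseudo-code above.
    from itertools import combinations
    from collections import defaultdict

    def generate_candidates(prev_frequent, k):
        # Join step: combine frequent (k-1)-itemsets that share their first k-2 (sorted) items.
        # Prune step: drop any candidate that has an infrequent (k-1)-subset.
        prev = sorted(sorted(s) for s in prev_frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k - 2] == prev[j][:k - 2]:
                    cand = frozenset(prev[i]) | frozenset(prev[j])
                    if len(cand) == k and all(
                            frozenset(sub) in prev_frequent
                            for sub in combinations(cand, k - 1)):
                        candidates.add(cand)
        return candidates

    def apriori(transactions, minsup_count):
        # Return a dict mapping every frequent itemset (frozenset) to its support count.
        counts = defaultdict(int)
        for t in transactions:                      # first scan: count 1-itemsets
            for item in t:
                counts[frozenset([item])] += 1
        current = {s: c for s, c in counts.items() if c >= minsup_count}   # L1
        frequent = dict(current)
        k = 2
        while current:
            candidates = generate_candidates(set(current), k)
            counts = defaultdict(int)
            for t in transactions:                  # one database scan per level
                for cand in candidates:
                    if cand <= t:
                        counts[cand] += 1
            current = {s: c for s, c in counts.items() if c >= minsup_count}   # Lk
            frequent.update(current)
            k += 1
        return frequent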
Apriori Algorithm Example (s = 50%)

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1 (candidate 1-itemsets with support counts):
  {1}: 2   {2}: 3   {3}: 3   {4}: 1   {5}: 3
L1 (frequent 1-itemsets):
  {1}: 2   {2}: 3   {3}: 3   {5}: 3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → support counts for C2:
  {1 2}: 1   {1 3}: 2   {1 5}: 1   {2 3}: 2   {2 5}: 3   {3 5}: 2
L2: {1 3}: 2   {2 3}: 2   {2 5}: 3   {3 5}: 2

C3 (generated from L2): {2 3 5}
Scan D → L3: {2 3 5}: 2
(The same run is reproduced in code after the next slide.)

Algorithm to Guess Itemsets
• Naïve way:
  – Extend all itemsets with all possible items
• More sophisticated:
  – Join Lk-1 with itself, adding only a single, final item
    e.g.: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} produces {1 2 3 4} and {1 3 4 5}
  – Remove itemsets with an unsupported subset
    e.g.: {1 3 4 5} has an unsupported subset, {1 4 5}, if minsup = 50%
  – Use the database to further refine Ck
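Assuming the apriori sketch defined earlier, running it on this four-transaction database with a minimum support count of 2 (i.e. 50%) reproduces L1, L2, and L3 from the worked example:

    # Usage sketch: the example database above, minsup = 50% (count >= 2).
    database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    frequent = apriori(database, minsup_count=2)          # apriori() from the earlier sketch
    for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)
    # Expected: {1}:2 {2}:3 {3}:3 {5}:3, then {1 3}:2 {2 3}:2 {2 5}:3 {3 5}:2, then {2 3 5}:2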
Apriori: How to Generate Candidates?
• STEP 1: Self-join operation
• STEP 2: Subset filtering (prune step)

How to Count Supports of Candidates?
• Why is counting the supports of candidates a problem?
  – The total number of candidates can be very large
  – One transaction may contain many candidates
• Method:
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction
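A full hash-tree is beyond a short example, but the core idea of the subset function can be sketched with an ordinary hash table (a simplified stand-in for the hash-tree, not the structure itself): enumerate only the k-subsets actually present in each transaction and look them up among the candidates.

    # Sketch: counting candidate supports by hashing the k-subsets of each transaction.
    from itertools import combinations
    from collections import defaultdict

    def count_supports(transactions, candidates, k):
        # candidates: set of frozensets of size k; returns candidate -> support count
        counts = defaultdict(int)
        for t in transactions:
            for subset in combinations(sorted(t), k):   # only subsets present in this transaction
                fs = frozenset(subset)
                if fs in candidates:                    # O(1) hash lookup, no scan over all candidates
                    counts[fs] += 1
        return counts

    transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    C2 = {frozenset(p) for p in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]}
    print(dict(count_supports(transactions, C2, 2)))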
Example of Generating Candidate Itemsets
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning based on the Apriori principle:
  – acde is removed because ade is not in L3
• C4 = {abcd} (checked in code after the next slide)

Run Time of Apriori
• k passes over the data, where k is the size of the largest candidate itemset
• Memory chunking algorithm ⇒ 2 passes over the data on disk, but multiple passes in memory
• Toivonen 1996 gives a statistical technique which requires 1 + e passes (but more memory)
• Brin 1997 – Dynamic Itemset Counting ⇒ 1 + e passes (less memory)
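The join-and-prune example above can be checked with a short sketch (single letters stand for items; the logic mirrors the generate_candidates helper from the earlier Apriori sketch):

    # Sketch: reproducing the L3 -> C4 example with an explicit join + prune.
    from itertools import combinations

    L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}

    # Join: merge two 3-itemsets that agree on their first two (sorted) items.
    joined = set()
    sorted_l3 = sorted(sorted(s) for s in L3)
    for i in range(len(sorted_l3)):
        for j in range(i + 1, len(sorted_l3)):
            if sorted_l3[i][:2] == sorted_l3[j][:2]:
                joined.add(frozenset(sorted_l3[i]) | frozenset(sorted_l3[j]))
    # joined now holds {a,b,c,d} and {a,c,d,e}

    # Prune: keep only candidates whose every 3-subset is in L3.
    C4 = {c for c in joined
          if all(frozenset(sub) in L3 for sub in combinations(c, 3))}
    print([sorted(c) for c in C4])   # [['a', 'b', 'c', 'd']]; {a,c,d,e} is dropped because {a,d,e} is not in L3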
Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting: a k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent
• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling: mining on a subset of the given data
  – lower the support threshold
  – plus a method to determine the completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

Is Apriori Fast Enough? — Performance Bottlenecks
• The core of the Apriori algorithm:
  – Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
  – Use database scans and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g. {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  – Multiple scans of the database:
    • Needs (n + 1) scans, where n is the length of the longest pattern