Data Mining
Association Analysis: Basic Concepts
and Algorithms
What Are Patterns?
– Patterns: sets of items, subsequences, or substructures
that occur frequently together (or are strongly correlated) in
a data set
– Patterns represent intrinsic and important properties of
data sets
Examples: frequent itemsets, frequent sequences, frequent structures
What Is Pattern Discovery?
Pattern discovery: Uncovering patterns from
massive data sets
It can answer questions such as:
– What products were often purchased together?
– What are the subsequent purchases after buying an
iPad?
Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
An association rule consists of two parts: an antecedent (if) and a consequent
(then).
Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Basic Concepts: Transactional
Database
Transactional Database (TDB)
– Each transaction is associated with an identifier, called
a TID.
– May also have counts associated with each item sold
Tid Items bought
1 Beer, Nuts, Diaper
2 Beer, Coffee, Diaper
3 Beer, Diaper, Eggs
4 Nuts, Eggs, Milk
5 Nuts, Coffee, Diaper, Eggs, Milk
Definition: Frequent Itemset
Itemset
– A collection of one or more items
◆ Example: {Milk, Bread, Diaper}
– k-itemset
◆ An itemset that contains k items
Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
Relative support (s)
– The fraction of transactions that contains X (i.e., the
probability that a transaction contains X)
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
– An itemset whose support is greater than or equal to a
minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
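The definitions of support count σ(X) and relative support s(X) can be sketched in a few lines of Python, using the five market-basket transactions from the table above (function names here are my own, not from the slides):

```python
def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support s(X) = sigma(X) / |T|."""
    return support_count(itemset, transactions) / len(transactions)

# The five market-basket transactions from the table
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

print(support_count({"Milk", "Bread", "Diaper"}, T))  # sigma = 2
print(support({"Milk", "Bread", "Diaper"}, T))        # s = 2/5 = 0.4
```

With a minsup threshold of, say, 0.4, {Milk, Bread, Diaper} would qualify as frequent.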
Rule Measures: Support and Confidence
Find all the rules X ∧ Y ⇒ Z with minimum confidence and support
– support, s, probability that a transaction contains {X ∪ Y ∪ Z}
– confidence, c, conditional probability that a transaction having {X ∪ Y} also contains Z

[Venn diagram: customers who buy diapers, customers who buy beer, and the overlap of customers who buy both]

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

With minimum support 50% and minimum confidence 50%, we have
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
Definition: Association Rule
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Rule Evaluation Metrics
– Support (s)
◆ Fraction of transactions that contain both X and Y
– Confidence (c)
◆ Measures how often items in Y appear in transactions that contain X
◆ Confidence is the conditional probability of Y given that X has occurred

Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
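The two metrics for a rule X → Y can be computed together in a minimal Python sketch over the same five transactions (function names are my own):

```python
def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    both = support_count(X | Y, transactions)
    s = both / len(transactions)               # s = sigma(X u Y) / |T|
    c = both / support_count(X, transactions)  # c = sigma(X u Y) / sigma(X)
    return s, c

T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, T)
print(round(s, 2), round(c, 2))  # 0.4 0.67, matching the slide's example
```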
Mining Association Rules—An Example
Min. support 50%
Min. confidence 50%

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A,C}             50%

For rule A ⇒ C:
support = P(A ∪ C) = 50%
confidence = P(C|A) = 66.6%
Association Rule Mining Task
Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!
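The brute-force approach just described can be sketched directly, to make its cost concrete: enumerate every itemset, then every binary partition of it as a candidate rule, and keep only the rules that clear both thresholds (a minimal sketch; names are my own, and it is only feasible for tiny datasets):

```python
from itertools import combinations

def brute_force(transactions, minsup, minconf):
    """List every candidate rule X -> Y, compute s and c for each,
    and keep the survivors. Exponential in the number of items."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    sigma = lambda X: sum(1 for t in transactions if set(X) <= t)
    rules = []
    for k in range(2, len(items) + 1):          # every itemset of size >= 2
        for itemset in combinations(items, k):
            s = sigma(itemset) / n
            for r in range(1, k):               # every binary partition
                for X in combinations(itemset, r):
                    sx = sigma(X)
                    if sx and s >= minsup and sigma(itemset) / sx >= minconf:
                        rules.append((frozenset(X),
                                      frozenset(itemset) - frozenset(X)))
    return rules

T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

found = brute_force(T, minsup=0.4, minconf=0.6)
print((frozenset({"Milk", "Diaper"}), frozenset({"Beer"})) in found)  # True
```

With only 6 distinct items this is fast, but the itemset loop alone grows as 2^d, which is exactly why the slides call the approach prohibitive.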
Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive
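Step 2 above can be sketched on its own: since every binary partition of a frequent itemset has the same support, only confidence needs checking here (a minimal Python sketch under that observation; names are my own):

```python
from itertools import combinations

def rules_from_itemset(freq_itemset, transactions, minconf):
    """Emit each binary partition X -> (itemset - X) of a frequent
    itemset whose confidence sigma(itemset)/sigma(X) meets minconf."""
    items = frozenset(freq_itemset)
    sigma = lambda X: sum(1 for t in transactions if X <= t)
    whole = sigma(items)
    out = []
    for r in range(1, len(items)):
        for X in map(frozenset, combinations(items, r)):
            if whole / sigma(X) >= minconf:
                out.append((set(X), set(items - X)))
    return out

T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

rules = rules_from_itemset({"Milk", "Diaper", "Beer"}, T, minconf=0.6)
print(len(rules))  # 4 of the 6 binary partitions clear c >= 0.6
```

This reproduces the six-rule table from the previous slide: the four rules with c ≥ 0.6 survive, the two with c = 0.5 do not.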
Frequent Itemset Generation

[Itemset lattice for d = 5 items: the empty set (null) at the top; below it the 1-itemsets A–E, the 2-itemsets AB–DE, the 3-itemsets ABC–CDE, the 4-itemsets ABCD–BCDE, and ABCDE at the bottom]

Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
Transactions (N)                 List of M candidates
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

– Match each transaction (of width w) against every candidate
– Complexity ~ O(NMw) => expensive, since M = 2^d !!!
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
– uses prior knowledge of frequent itemset properties.
Apriori principle holds due to the following property
of the support measure:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
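The anti-monotone property can be checked numerically on the running market-basket example (a small sketch; the helper name is my own):

```python
from itertools import combinations

def support_count(X, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if set(X) <= t)

T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Anti-monotonicity: every proper subset of an itemset
# is at least as frequent as the itemset itself.
items = ("Milk", "Diaper", "Beer")
whole = support_count(items, T)
ok = all(support_count(sub, T) >= whole
         for r in range(1, len(items))
         for sub in combinations(items, r))
print(whole, ok)  # 2 True
```

This is the property Apriori exploits in reverse: once a subset is known to be infrequent, no superset can reach minsup.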
Illustrating Apriori Principle
Lattice structure of frequent itemsets

[Itemset lattice for items A–E: once an itemset (e.g., AB) is found to be infrequent, all of its supersets are pruned from the search]
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs
Itemset         Count
{Bread,Milk}    3
{Bread,Beer}    2
{Bread,Diaper}  3
{Milk,Beer}     2
{Milk,Diaper}   3
{Beer,Diaper}   3

Triplets (3-itemsets)
Itemset              Count
{Bread,Milk,Diaper}  3
Mining Frequent Itemsets: the Key Step
1. Find the frequent itemsets: the sets of items that
have minimum support
– A subset of a frequent itemset must also be a frequent
itemset
◆ i.e., if {A, B} is a frequent itemset, both {A} and {B} must
also be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k
(k-itemset)
2. Use the frequent itemsets to generate association
rules.
The Apriori Algorithm—An Example
minsup = 2

Database TDB
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1                F1
Itemset  sup                 Itemset  sup
{A}      2                   {A}      2
{B}      3                   {B}      3
{C}      3                   {C}      3
{D}      1                   {E}      3
{E}      3

C2 (candidates)    2nd scan → C2 with counts    F2
Itemset            Itemset  sup                 Itemset  sup
{A, B}             {A, B}   1                   {A, C}   2
{A, C}             {A, C}   2                   {B, C}   2
{A, E}             {A, E}   1                   {B, E}   3
{B, C}             {B, C}   2                   {C, E}   2
{B, E}             {B, E}   3
{C, E}             {C, E}   2

C3 (candidates)    3rd scan → F3
Itemset            Itemset    sup
{B, C, E}          {B, C, E}  2
Apriori Algorithm
Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
◆ Generate length (k+1) candidate itemsets from length k
frequent itemsets
◆ Prune candidate itemsets containing subsets of length k that
are infrequent
◆ Count the support of each candidate by scanning the DB
◆ Eliminate candidates that are infrequent, leaving only those
that are frequent
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a
subset of a frequent k-itemset
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
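The pseudocode above can be turned into a compact runnable sketch in Python (a minimal implementation of the join/prune/scan loop, not an optimized one; names are my own):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise frequent-itemset mining following the pseudocode:
    join, prune, scan the DB to count, then filter by min_count."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates
        keys = list(Lk)
        cands = {a | b for i, a in enumerate(keys) for b in keys[i + 1:]
                 if len(a | b) == k + 1}
        # Prune step: drop any candidate with an infrequent k-subset
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Scan: count support of the surviving candidates in the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in cands}
        Lk = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(Lk)
        k += 1
    return frequent

# The TDB example from the earlier slide, with minsup (count) = 2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_count=2)
print(len(result))  # 9 frequent itemsets: 4 singletons, 4 pairs, and {B,C,E}
```

Run on the TDB example, this reproduces the F1/F2/F3 tables: {A}, {B}, {C}, {E}; {A,C}, {B,C}, {B,E}, {C,E}; and {B,C,E} with support 2.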