0% found this document useful (0 votes)
6 views56 pages

9 Association

The document outlines a course on Data Mining focusing on Principal Component Analysis and Association Rule Mining, detailing important dates for assignments and exams. It discusses market basket analysis, the concept of association rules, and the Apriori algorithm for finding frequent itemsets in transaction data. The document also highlights the significance of support and confidence metrics in evaluating association rules.

Uploaded by

wuyuman6
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views56 pages

9 Association

The document outlines a course on Data Mining focusing on Principal Component Analysis and Association Rule Mining, detailing important dates for assignments and exams. It discusses market basket analysis, the concept of association rules, and the Apriori algorithm for finding frequent itemsets in transaction data. The document also highlights the significance of support and confidence metrics in evaluating association rules.

Uploaded by

wuyuman6
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 56

Data Mining

Principal Component Analysis

Association Rule Mining

CS 584 :: Fall 2024


Ziwei Zhu
Department of Computer Science
George Mason University
Part of slides is from Drs. Tan, Steinbach and Kumar.
Part of slides is from Dr. Jessica Lin.
1
• HW4 and HW5 are due 12/02
• HW3 is due 11/04
• 11/05 election day, no class
• 11/12, no class, watch recorded video
• Quiz 4 will be move to 11/19
• Final exam: 12/17, 7:30pm-9:30pm

2
So far …

• Classification & Regression: data with features and labels


• Clustering: data with features

Now, let’s study transaction data!

3
Outline

• Introduction and Motivations


• Apriori Algorithm
• Association Rule Evaluation

4
Market Basket Analysis
If you are a manager of a supermarket, how do you
design your store layout to promote sales?

You want to put products always being purchased


together close. 5
Market Basket Analysis
You want to put products always being purchased
together close. By looking into the baskets of
customers.

6
Association Rules
• Stores are interested in associations between
different items that people buy together.
• E.g., Someone who buys bread is quite likely to also buy milk

• Association Rules (if-then rules): consequent


• E.g., bread ⟹ milk
• E.g., {Spider Man, Iron Man} ⟹ The Avengers

antecedent

7
Association Rules
• Stores are interested in associations between
different items that people buy together.
• E.g., Someone who buys bread is quite likely to also buy milk

• Association Rules (if-then rules): consequent


• E.g., bread ⟹ milk
• E.g., {Spider Man, Iron Man} ⟹ The Avengers

antecedent
• Associations information is used beyond market
basket analysis.
• E.g., medicine, recommender systems (Amazon’s
“Customers who bought this item also bought…” )
8
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)

• (Osco Drug) Customers who buy diapers are more likely to also
buy beer.
• Daddy needs a beer.

9
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)

• (Osco Drug) Customers who buy diapers are more likely to also
buy beer.
• Daddy needs a beer.
• (Walmart) 60% of customers who buy a Barbie doll buy one of
three types of candy bars.
• Kids come along for errands.

10
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)

• (Osco Drug) Customers who buy diapers are more likely to also
buy beer.
• Daddy needs a beer.
• (Walmart) 60% of customers who buy a Barbie doll buy one of
three types of candy bars.
• Kids come along for errands.
• (Some large retailer) The purchase of a stapler often
accompanies the purchase of paper, waste baskets, scissors,
paper clips, and so on.
• New hires?

11
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)

• (Osco Drug) Customers who buy diapers are more likely to also
buy beer.
• Daddy needs a beer.
• (Walmart) 60% of customers who buy a Barbie doll buy one of
three types of candy bars.
• Kids come along for errands.
• (Some large retailer) The purchase of a stapler often
accompanies the purchase of paper, waste baskets, scissors,
paper clips, folders, and so on.
• New hires?
• (Car insurance) Low credit rating, more car accidents.

12
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)

• (Osco Drug) Customers who buy diapers are more likely to also
buy beer.
• Daddy needs a beer.
• (Walmart) 60% of customers who buy a Barbie doll buy one of
three types of candy bars.
• Kids come along for errands.
• (Some large retailer) The purchase of a stapler often
accompanies the purchase of paper, waste baskets, scissors,
paper clips, folders, and so on.
• New hires?
• (Car insurance) Low credit rating, more car accidents.
• Music taste and political affiliation. 13
Association Rule Mining
Market-basket transactions
• Given a set of transactions
(baskets)
• A transaction contains a
small subset of items
• Typically, #items in a
transaction is small,
#transactions is very large

14
Association Rule Mining
Market-basket transactions
• Given a set of transactions
(baskets)
• A transaction contains a
small subset of items
• Typically, #items in a
transaction is small,
#transactions is very large
• Goal: find association rules
of the form {x, y} -> {z}

15
Definition: Frequent Itemset
• Itemset
• A collection of many items
o Example: {Milk, Bread, Diaper}
• k-itemset
TID Items
o An itemset containing k items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

16
Definition: Frequent Itemset
• Itemset
• A collection of many items
o Example: {Milk, Bread, Diaper}
• k-itemset
TID Items
o An itemset containing k items
1 Bread, Milk
• Support count (𝝈) 2 Bread, Diaper, Beer, Eggs
• Frequency of occurrence of an itemset
3 Milk, Diaper, Beer, Coke
• E.g., 𝜎({Milk, Bread, Diaper})=2
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

17
Definition: Frequent Itemset
• Itemset
• A collection of many items
o Example: {Milk, Bread, Diaper}
• k-itemset
TID Items
o An itemset containing k items
1 Bread, Milk
• Support count (𝝈) 2 Bread, Diaper, Beer, Eggs
• Frequency of occurrence of an itemset
3 Milk, Diaper, Beer, Coke
• E.g., 𝜎({Milk, Bread, Diaper})=2
4 Bread, Milk, Diaper, Beer
• Support 5 Bread, Milk, Diaper, Coke
• Fraction of transactions that contain an
itemset
• E.g., s({Milk, Bread, Diaper})=2/5

18
Definition: Frequent Itemset
• Itemset
• A collection of many items
o Example: {Milk, Bread, Diaper}
• k-itemset
TID Items
o An itemset containing k items
1 Bread, Milk
• Support count (𝝈) 2 Bread, Diaper, Beer, Eggs
• Frequency of occurrence of an itemset
3 Milk, Diaper, Beer, Coke
• E.g., 𝜎({Milk, Bread, Diaper})=2
4 Bread, Milk, Diaper, Beer
• Support 5 Bread, Milk, Diaper, Coke
• Fraction of transactions that contain an
itemset
• E.g., s({Milk, Bread, Diaper})=2/5
• Frequent Itemset
• An itemset whose support is greater than
or equal to a minsup threshold
19
Definition: Association Rule
TID Items
• Association Rule 1 Bread, Milk
• An implication expression of the 2 Bread, Diaper, Beer, Eggs
form X → Y, where X and Y are 3 Milk, Diaper, Beer, Coke
itemsets 4 Bread, Milk, Diaper, Beer
• Example: {Milk, Diaper} → 5 Bread, Milk, Diaper, Coke
{Beer}
• Rule Evaluation Metrics
• Support (s)
o Fraction of transactions that
contain both X and Y

20
Definition: Association Rule
TID Items
• Association Rule 1 Bread, Milk
• An implication expression of the 2 Bread, Diaper, Beer, Eggs
form X → Y, where X and Y are 3 Milk, Diaper, Beer, Coke
itemsets 4 Bread, Milk, Diaper, Beer
• Example: {Milk, Diaper} → 5 Bread, Milk, Diaper, Coke
{Beer}
• Rule Evaluation Metrics
• Support (s)
o Fraction of transactions that
contain both X and Y
• Confidence (c)
o How often Y appears in
transactions that contain X
𝝈(𝑿∪𝒀)
o conf(X→Y)=P(Y|X)=
𝝈(𝑿) 21
Definition: Association Rule
TID Items
• Rule Evaluation Metrics
• Support (s) 1 Bread, Milk
o Fraction of transactions that 2 Bread, Diaper, Beer, Eggs
contain both X and Y 3 Milk, Diaper, Beer, Coke
• Confidence (c) 4 Bread, Milk, Diaper, Beer
o How often Y appears in
5 Bread, Milk, Diaper, Coke
transactions that contain X
𝜎(𝑋∪𝑌)
o conf(X→Y)=P(Y|X )= 𝜎(𝑋)
Example:
{Milk, Diaper}  Beer
 (Milk, Diaper, Beer) 2
s= = = 0.4
|T| 5
 (Milk, Diaper, Beer) 2
c= = = 0.67
 (Milk, Diaper) 3 22
Association Rule Mining Task

• Given a set of transactions 𝕋, the goal is to find all rules


having
• Support ≥ minsup threshold
• Confidence ≥ minconf threshold

23
Association Rule Mining Task

• Given a set of transactions 𝕋, the goal is to find all rules


having
• Support ≥ minsup threshold
• Confidence ≥ minconf threshold

• One possible way: brute-force approach


• List all possible association rules
• Compute support and confidence for each rule
• Prune rules failing the minsup and minconf thresholds
• Computationally Impossible

24
Association Rule Mining Task

25
Association Rule Mining Task

26
Find Frequent Itemset
null
Items: A, B, C, D, E
A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

25 candidate itemsets 27
Find Frequent Itemset
• If there are n items, there will be 2n possible itemsets
• 1000 items → 21000 > 10300
• 1080 hydrogen atoms in the observable universe!

How to solve this?

28
Outline

• Introduction and Motivations


• Apriori Algorithm
• Association Rule Evaluation

29
Apriori
• Apriori principle:
• If an itemset is frequent, then all its subsets must also be
frequent
• Apriori principle holds due to the following property
of the support measure:
X , Y : ( X  Y )  s( X )  s(Y )

30
Apriori
X , Y : ( X  Y )  s( X )  s(Y )
• Support of an itemset cannot TID Items

be larger than the support of its 1 Bread, Milk


subsets 2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
• This is known as the anti- 4 Bread, Milk, Diaper, Beer
monotone property of support 5 Bread, Milk, Diaper, Coke

s(Bread, Milk, Diaper)=3


s(Bread, Milk)=4

31
Apriori
null

Found to be infrequent A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE
Pruned supersets
32
Minimum support count=3
Apriori 1-itemsets
TID Items
Item Count
1 Bread, Milk Bread 4
2 Bread, Diaper, Beer, Eggs Coke 2
3 Milk, Diaper, Beer, Coke Milk 4
4 Bread, Milk, Diaper, Beer Beer 3
5 Bread, Milk, Diaper, Coke Diaper 4
Eggs 1

33
Minimum support count=3
Apriori 1-itemsets
TID Items
Item Count
1 Bread, Milk
Bread 4
2 Bread, Diaper, Beer, Eggs Coke 2
3 Milk, Diaper, Beer, Coke Milk 4
4 Bread, Milk, Diaper, Beer Beer 3
5 Bread, Milk, Diaper, Coke Diaper 4
Eggs 1

34
Minimum support count=3
Apriori 1-itemsets
2-itemsets
Item Count
Itemset Bread 4
{Bread,Milk} Coke 2
{Bread, Beer } Milk 4
{Bread,Diaper} Beer 3
{Beer, Milk} Diaper 4
{Diaper, Milk} Eggs 1
{Beer,Diaper}
(No need to generate candidates
involving Coke or Eggs) TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

35
Minimum support count=3
Apriori 1-itemsets
2-itemsets
Item Count
Itemset Count Bread 4
{Bread,Milk} 3 Coke 2
{Bread,Beer} 2 Milk 4
{Bread,Diaper} 3 Beer 3
{Milk,Beer} 2 Diaper 4
{Milk,Diaper} 3 Eggs 1
{Beer,Diaper} 3

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

36
Minimum support count=3
Apriori 1-itemsets
2-itemsets
Item Count
Itemset Count Bread 4
{Bread,Milk} 3 Coke 2
{Bread,Beer} 2 Milk 4
{Bread,Diaper} 3 Beer 3
{Milk,Beer} 2 Diaper 4
{Milk,Diaper} 3 Eggs 1
{Beer,Diaper} 3

3-itemsets TID Items


1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

37
Minimum support count=3
Apriori 1-itemsets
2-itemsets
Item Count
Itemset Count Bread 4
{Bread,Milk} 3 Coke 2
{Bread,Beer} 2 Milk 4
{Bread,Diaper} 3 Beer 3
{Milk,Beer} 2 Diaper 4
{Milk,Diaper} 3 Eggs 1
{Beer,Diaper} 3

3-itemsets TID Items


1 Bread, Milk
Itemset Count 2 Bread, Diaper, Beer, Eggs
{ Beer, Diaper, Milk} 2 3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
{ Beer,Bread, Diaper} 2
5 Bread, Milk, Diaper, Coke
{Bread, Diaper, Milk} 1
{Beer, Bread, Milk} 1 38
Minimum support count=3
Apriori 1-itemsets
2-itemsets
Item Count
Itemset Count Bread 4
{Bread,Milk} 3 Coke 2
{Bread,Beer} 2 Milk 4
{Bread,Diaper} 3 Beer 3
{Milk,Beer} 2 Diaper 4
{Milk,Diaper} 3 Eggs 1
{Beer,Diaper} 3

If every subset is considered,


6C + 6C + 6C
1 2 3
3-itemsets
6 + 15 + 20 = 41
Itemset Count
{ Beer, Diaper, Milk} 2
With support-based pruning,
{ Beer,Bread, Diaper} 2
{Bread, Diaper, Milk} 1 6 + 6 + 4 = 16
{Beer, Bread, Milk} 1 39
Apriori
Algorithm:
• Generate frequent itemsets of length 1
• Let k=2
• Repeat until no new frequent itemsets are
identified
o Generate length k candidate itemsets from length
k-1 frequent itemsets
o Count the support of each candidate by scanning
the database
o Eliminate candidates that are infrequent, leaving
only those that are frequent

40
Apriori
Algorithm:
• Generate frequent itemsets of length 1
• Let k=2
• Repeat until no new frequent itemsets are
identified
o Generate length k candidate itemsets from length
k-1 frequent itemsets (how???)
o Count the support of each candidate by scanning
the database
o Eliminate candidates that are infrequent, leaving
only those that are frequent

41
Candidate Itemsets Generation
o Generate length k candidate itemsets from
length k-1 frequent itemsets

A basic approach:
• Fk-1x F1 method
• Fk-1x Fk-1 method

42
Candidate Itemsets Generation
• Fk-1x F1 Method TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

43
Candidate Itemsets Generation
• Fk-1x Fk-1 Method TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Only merge a pair of frequent (k-1)-itemsets if


their first k-2 items are identical!

44
Association Rule Mining Task

Two steps:
• Find all frequent itemsets
• Generate high-confidence association rules from
each frequent itemset
?

45
Rule Generation

46
Rule Generation
How to efficiently generate rules from frequent itemsets?
ABCD=>{ }

BCD=>A ACD=>B ABD=>C ABC=>D

CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD

47
Rule Generation
How to efficiently generate rules from frequent itemsets?

• confidence of rules generated from the same itemset


has an anti-monotone property
• E.g., L = {A,B,C,D}:
𝜎(𝐴𝐵𝐶) ≤ 𝜎(𝐴𝐵) ≤ 𝜎(𝐴)
𝜎(𝐴𝐵𝐶𝐷) 𝜎(𝐴𝐵𝐶𝐷) 𝜎(𝐴𝐵𝐶𝐷)
≥ ≥
𝜎(𝐴𝐵𝐶) 𝜎(𝐴𝐵) 𝜎(𝐴)

𝑐 𝐴𝐵𝐶 → 𝐷 ≥ 𝑐 𝐴𝐵 → 𝐶𝐷  𝑐 𝐴 → 𝐵𝐶𝐷

48
Rule Generation
Low-confidence rule
ABCD=>{ }

BCD=>A ACD=>B ABD=>C ABC=>D

CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD

Pruned rules
49
Outline

• Introduction and Motivations


• Apriori Algorithm
• Association Rule Evaluation

50
Limitation of Confidence

Coffee Coffee
Tea 15 5 20
Tea 75 5 80
90 10 100

Association Rule: Tea → Coffee

Confidence= P(Coffee|Tea) = 15/20 = 0.75


 Although confidence is high, rule is misleading
 P(Coffee|Tea) = 0.9375
 Because P(Coffee) = 90/100 = 0.9
51
Statistical Independence
• Population of 1000 students
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)

– P(SÙB) = 420/1000 = 0.42


– P(S) ´ P(B) = 0.6 ´ 0.7 = 0.42

– P(SÙB) = P(S) ´ P(B) => Statistical independence


– P(SÙB) > P(S) ´ P(B) => Positively correlated
– P(SÙB) < P(S) ´ P(B) => Negatively correlated
52
Statistical-based Measure
Lift: the ratio between the observed support and the
expected support if X and Y were independent

sup(𝑋 ∪ 𝑌)
𝐿𝑖𝑓𝑡 𝑋 → 𝑌 =
sup 𝑋 sup(𝑌)
conf(𝑋 → 𝑌)
= = 𝐿𝑖𝑓𝑡 𝑌 → 𝑋
sup(𝑌)
• Range [0, ∞)
• Lift(X,Y)=1: X and Y are independent
• Lift(X,Y)>1: X and Y are positively correlated
• Lift(X,Y)<1: X and Y are negatively correlated
53
Statistical-based Measure

Coffee Coffee
Tea 15 5 20
Tea 75 5 80
90 10 100

Association Rule: Tea → Coffee

Confidence= P(Coffee|Tea) = 0.75


P(Coffee) = 0.9
 Lift = P(Coffee|Tea)/ P(Coffee) = 0.75/0.9= 0.8333
(< 1, therefore is negatively associated)
54
More

Tan, Pang-Ning, Vipin Kumar, and Jaideep Srivastava. "Selecting the right interestingness measure for
association patterns." In Proceedings of the eighth ACM SIGKDD international conference on Knowledge 55
discovery and data mining. 2002.
What we have learnt so far

• The basic concepts and applications of


association rule mining.
• The Apriori Algorithm to find frequent
itemsets.
• How to efficiently generate rules from frequent
itemset.
• Lift: the statistical-based evaluation

56

You might also like