
Data Mining

Association Analysis

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar



Introduction

 Business enterprises accumulate large quantities of data
from their day-to-day operations
 Huge amounts of customer purchase data are collected daily at
the checkout counters of grocery stores.
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke



Analyze the data to learn about the purchasing behaviour of the
customers.

Valuable information can be used to support a variety of
business-related applications:
→ Marketing promotions
→ Inventory management
→ Customer relationship management (CRM)
→ Pricing

Association Rule Mining

 Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction

Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!




Definition: Frequent Itemset

 Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

 Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

 Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

 Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke




Definition: Association Rule

 Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

 Rule Evaluation Metrics
– Support (s)
  Fraction of transactions that contain both X and Y
– Confidence (c)
  Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
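The two metrics can be computed directly from the table. A minimal Python sketch (our own illustration; the transactions come from the table above, and the helper name support_count is assumed):

    # Support and confidence for the rule {Milk, Diaper} -> {Beer}
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset):
        """sigma(X): number of transactions containing every item of X."""
        return sum(1 for t in transactions if itemset <= t)

    X, Y = {"Milk", "Diaper"}, {"Beer"}
    s = support_count(X | Y) / len(transactions)   # 2/5 = 0.4
    c = support_count(X | Y) / support_count(X)    # 2/3 ~ 0.67
    print(f"s = {s:.2f}, c = {c:.2f}")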


Association Rule Mining Task

 Given a set of transactions T, the goal of association


rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

 Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!



Computational Complexity
 Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ]
    = 3^d − 2^(d+1) + 1

  If d = 6, R = 602 rules
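The closed form can be checked numerically. A small sketch (illustration only, not part of the slides) verifying R = 602 for d = 6:

    from math import comb

    def total_rules(d):
        # R = sum over k of C(d, k) * sum over j of C(d - k, j)
        return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
                   for k in range(1, d))

    d = 6
    assert total_rules(d) == 3**d - 2**(d + 1) + 1 == 602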



Mining Association Rules

Example of Rules:

{Milk, Diaper} → {Beer}    (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}    (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}    (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}    (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}    (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}    (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke



Mining Association Rules

Observations:
 All the rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}

 Rules originating from the same itemset have identical support but
can have different confidence

 Thus, we may decouple the support and confidence requirements



Mining Association Rules

Divide the problem into two sub-tasks:

1. Frequent Itemset Generation
   Find all itemsets whose support ≥ minsup threshold
2. Rule Generation
   From each frequent itemset, generate all rules with high confidence.
   These are the strong rules.

 Frequent Itemset Generation is still more computationally
expensive than Rule Generation



Frequent Itemset Generation

What is the list of possible itemsets?

For items I = {a, b, c, d, e}, the possible itemsets are:

a, b, c, d, e

ab, ac, ad, ae, bc, bd, be, cd, ce, de

abc, abd, abe, acd, ace, ... and all other such combinations
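These itemsets can be enumerated in a few lines of Python (an assumed sketch, not from the slides); for 5 items there are 2^5 − 1 = 31 non-empty itemsets:

    from itertools import combinations

    items = ["a", "b", "c", "d", "e"]
    itemsets = [set(c) for k in range(1, len(items) + 1)
                       for c in combinations(items, k)]
    print(len(itemsets))   # 31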





Frequent Itemset Generation
[Figure: the itemset lattice over items A–E, from the null set at the top down to the 5-itemset ABCDE at the bottom.]

How many possible candidate itemsets? Given d items, there are
2^d possible candidate itemsets.
Frequent Itemset Generation
 Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database

[Figure: each of the N transactions is matched against the M candidate itemsets; w is the maximum transaction width.]

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

– Match (compare) each transaction against every candidate
– Complexity ~ O(NMw) => expensive, since M = 2^d − 1 !!!
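A runnable sketch of this brute-force approach (illustrative only; brute_force_frequent is an assumed helper name): every itemset is a candidate and every candidate is checked against every transaction, which is exactly the O(NMw) cost noted above.

    from itertools import combinations

    def brute_force_frequent(transactions, items, minsup):
        n = len(transactions)
        frequent = {}
        for k in range(1, len(items) + 1):            # M = 2^d - 1 candidates in total
            for cand in map(frozenset, combinations(sorted(items), k)):
                count = sum(1 for t in transactions if cand <= t)   # N comparisons each
                if count / n >= minsup:
                    frequent[cand] = count
        return frequent

    transactions = [{"Bread", "Milk"},
                    {"Bread", "Diaper", "Beer", "Eggs"},
                    {"Milk", "Diaper", "Beer", "Coke"},
                    {"Bread", "Milk", "Diaper", "Beer"},
                    {"Bread", "Milk", "Diaper", "Coke"}]
    items = set().union(*transactions)
    print(len(brute_force_frequent(transactions, items, 0.6)))   # 8 frequent itemsets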
Frequent Itemset Generation Strategies

 Reduce the number of candidates (M)


– Complete search: M = 2^d
– Use pruning techniques to reduce M

 Reduce the number of transactions (N)


– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms

 Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every transaction



The Apriori Principle

 Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent

 For example:
– If {Milk, Bread, Diapers} is a frequent itemset, its support is
  high, i.e. {Milk, Bread, Diapers} occurs in many transactions.



The Apriori Principle

 Apriori principle:

If {Milk, Bread, Diapers} occurs in many transactions,

so do {Milk}, {Bread}, {Diapers}

and

{Milk, Bread}, {Milk, Diapers}, {Bread, Diapers}





The Apriori Principle

 Apriori principle:
– If an itemset is infrequent, then all of its supersets must
also be infrequent

 For example:
– If {Coke} or {Jam} is an infrequent itemset, its support is low,
  so {Coke, Jam} cannot occur in many transactions either.





The Apriori Principle

 Support-Based Pruning
– Trimming the exponential search space based on the support measure
– Based on the anti-monotone property: the support of an itemset
  never exceeds the support of its subsets

I : set of items
J = 2^I (power set) // all possible combinations

Support s is anti-monotone (or downward closed):

∀X, Y ∈ J : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

Any anti-monotone measure can be used for this kind of pruning
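A small self-contained check of the anti-monotone property on the market-basket data used in these slides (illustration only):

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]
    sigma = lambda X: sum(1 for t in transactions if X <= t)

    X, Y = {"Milk"}, {"Milk", "Diaper", "Beer"}   # X is a subset of Y
    assert sigma(X) >= sigma(Y)                   # 4 >= 2: adding items cannot raise support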
The Apriori Principle

 Monotonicity Property
 I : set of items
 J = 2^I (power set) // all possible combinations
– f is monotone (or upward closed) if

  ∀X, Y ∈ J : (X ⊆ Y) ⇒ f(X) ≤ f(Y)



Apriori Algorithm

 Uses the Apriori principle

 Support-based pruning

 Controls the exponential growth of candidate itemsets





Apriori Algorithm

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

If every subset is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidate itemsets
With support-based pruning:
6 + 6 + 1 = 13 candidate itemsets
Illustrating Apriori Principle

[Figure: itemset lattice over items A–E. The itemset AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the search space.]


Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Minimum Support = 3

Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets)
Itemset                Count
{Beer, Diaper, Milk}   2
{Beer, Bread, Diaper}  2
{Bread, Diaper, Milk}  2
{Beer, Bread, Milk}    1

If every subset is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16
6 + 6 + 1 = 13 (after also pruning candidate triplets that contain an infrequent pair)
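The 1-itemset and 2-itemset counts above can be reproduced with a short script (our own sketch, not the textbook's code), using minimum support 3:

    from itertools import combinations
    from collections import Counter

    transactions = [{"Bread", "Milk"},
                    {"Beer", "Bread", "Diaper", "Eggs"},
                    {"Beer", "Coke", "Diaper", "Milk"},
                    {"Beer", "Bread", "Diaper", "Milk"},
                    {"Bread", "Coke", "Diaper", "Milk"}]
    minsup = 3

    c1 = Counter(item for t in transactions for item in t)
    f1 = {i for i, n in c1.items() if n >= minsup}           # Bread, Milk, Beer, Diaper

    c2 = Counter(frozenset(p) for t in transactions
                 for p in combinations(sorted(t & f1), 2))    # pairs of frequent items only
    f2 = {s for s, n in c2.items() if n >= minsup}
    print(sorted(tuple(sorted(s)) for s in f2))
    # [('Beer', 'Diaper'), ('Bread', 'Diaper'), ('Bread', 'Milk'), ('Diaper', 'Milk')]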


Apriori Algorithm


Method:

– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified

Generate length(k+1) candidate itemsets from length k
frequent itemsets

Prune candidate itemsets containing any length-k subset
that is infrequent

Count the support of each candidate by scanning the DB

Eliminate candidates that are infrequent, leaving only
those that are frequent
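A compact, runnable sketch of this method (an assumed implementation; the slide above gives only pseudocode). On the five example transactions with a minimum support count of 3 it examines 6 + 6 + 1 = 13 candidates, matching the count shown earlier:

    from itertools import combinations

    def apriori(transactions, minsup_count):
        """Return {itemset: support count} for every frequent itemset."""
        transactions = [frozenset(t) for t in transactions]

        def count(candidates):
            counts = dict.fromkeys(candidates, 0)
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            return {c: n for c, n in counts.items() if n >= minsup_count}

        # k = 1: every item is a candidate
        frequent = count({frozenset([i]) for t in transactions for i in t})
        result, k = dict(frequent), 1
        while frequent:
            prev = list(frequent)
            # generate length-(k+1) candidates by joining frequent k-itemsets
            candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
            # prune candidates containing an infrequent k-subset (Apriori principle)
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k))}
            frequent = count(candidates)
            result.update(frequent)
            k += 1
        return result

    transactions = [{"Bread", "Milk"},
                    {"Beer", "Bread", "Diaper", "Eggs"},
                    {"Beer", "Coke", "Diaper", "Milk"},
                    {"Beer", "Bread", "Diaper", "Milk"},
                    {"Bread", "Coke", "Diaper", "Milk"}]
    for itemset, n in sorted(apriori(transactions, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), n)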



Support Counting: An Example
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

How many of these itemsets are supported by transaction (1,2,3,5,6)?

[Figure: systematic enumeration of the 3-item subsets of transaction t = (1, 2, 3, 5, 6). Level 1 fixes the first item (1, 2 or 3), Level 2 fixes the second item, and Level 3 lists the resulting 3-item subsets.]
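Without a hash tree, the question can be answered directly by enumerating the C(5,3) = 10 three-item subsets of the transaction and checking each candidate against them (a small illustrative sketch):

    from itertools import combinations

    candidates = [{1,4,5}, {1,2,4}, {4,5,7}, {1,2,5}, {4,5,8}, {1,5,9}, {1,3,6}, {2,3,4},
                  {5,6,7}, {3,4,5}, {3,5,6}, {3,5,7}, {6,8,9}, {3,6,7}, {3,6,8}]
    t = (1, 2, 3, 5, 6)

    subsets = {frozenset(s) for s in combinations(t, 3)}      # C(5, 3) = 10 subsets
    matched = [c for c in candidates if frozenset(c) in subsets]
    print(matched)   # {1, 2, 5}, {1, 3, 6}, {3, 5, 6} -> 3 candidates are supported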


Support Counting Using a Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
 Hash function
 Max leaf size: max number of itemsets stored in a leaf node (if number
of candidate itemsets exceeds max leaf size, split the node)

Hash function: items 1, 4 and 7 hash to the left branch, items 2, 5 and 8 to the middle branch, and items 3, 6 and 9 to the right branch.

[Figure: the 15 candidate 3-itemsets stored in the leaves of the resulting hash tree.]
Support Counting Using a Hash Tree

[Figure: the candidate hash tree. To count support for transaction {1, 2, 3, 5, 6}, the transaction is hashed from the root, first on items 1, 2 and 3, then on the remaining items at the lower levels, so only the leaves reachable along those branches are visited.]

Match transaction against 11 out of 15 candidates
Rule Generation

 Given a frequent itemset L, find all non-empty subsets f ⊂ L
such that f → L − f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC → D, ABD → C, ACD → B, BCD → A,
  A → BCD, B → ACD, C → ABD, D → ABC,
  AB → CD, AC → BD, AD → BC, BC → AD,
  BD → AC, CD → AB

 If |L| = k, then there are 2^k − 2 candidate association rules
(ignoring L → ∅ and ∅ → L)



Rule Generation

 In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)

 But confidence of rules generated from the same itemset has an
anti-monotone property
– E.g., suppose {A,B,C,D} is a frequent 4-itemset:

  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

– Confidence is anti-monotone w.r.t. the number of items on the
  RHS of the rule
Rule Generation for Apriori Algorithm

[Figure: lattice of rules derived from {A,B,C,D}, from ABCD → {} at the top down to the single-antecedent rules at the bottom. If BCD → A is found to be a low-confidence rule, then all rules whose antecedent is a subset of BCD (CD → AB, BD → AC, BC → AD, D → ABC, C → ABD, B → ACD) are pruned.]



Association Analysis: Basic Concepts

and Algorithms

Algorithms and Complexity



Factors Affecting Complexity of Apriori

 Choice of minimum support threshold

 Dimensionality (number of items) of the data set

 Size of database

 Average transaction width






Impact of Support Based Pruning

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3:
If every subset is considered, 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning, 6 + 6 + 4 = 16

Minimum Support = 2:
If every subset is considered, 6C1 + 6C2 + 6C3 + 6C4 = 6 + 15 + 20 + 15 = 56



Factors Affecting Complexity of
Apriori
 Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
 Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
 Size of database
– run time of algorithm increases with number of transactions
 Average transaction width
– transaction width increases the max length of frequent itemsets
– number of subsets in a transaction increases with its width,
increasing computation time for support counting





Compact Representation of Frequent Itemsets

 Some frequent itemsets are redundant because their


supersets are also frequent
Consider the following data set. Assume the support threshold = 5.
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

 10 
10
Number of frequent itemsets 3   
k
k 1

 Need a compact representation
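A quick numerical check of this count (illustration only): 3 × (2^10 − 1) = 3069 frequent itemsets.

    from math import comb
    print(3 * sum(comb(10, k) for k in range(1, 11)))   # 3069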



Maximal Frequent Itemset
An itemset is maximal frequent if it is frequent and none of its
immediate supersets is frequent.

[Figure: itemset lattice over items A–E showing the border between frequent and infrequent itemsets; the maximal frequent itemsets lie immediately below this border.]
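A small assumed helper for extracting the maximal frequent itemsets from a collection of frequent itemsets; the example input is the set of frequent itemsets found earlier for the market-basket data with minimum support 3:

    def maximal(frequent_itemsets):
        """Keep only the itemsets with no frequent proper superset."""
        fs = [frozenset(x) for x in frequent_itemsets]
        return [x for x in fs if not any(x < y for y in fs)]

    frequent = [{"Bread"}, {"Milk"}, {"Diaper"}, {"Beer"},
                {"Bread", "Milk"}, {"Bread", "Diaper"},
                {"Milk", "Diaper"}, {"Beer", "Diaper"}]
    print([tuple(sorted(x)) for x in maximal(frequent)])
    # the four frequent 2-itemsets; every frequent 1-itemset has a frequent superset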

