Data Mining
Principal Component Analysis
   Association Rule Mining
CS 584 :: Fall 2024
Ziwei Zhu
Department of Computer Science
George Mason University
     Part of slides is from Drs. Tan, Steinbach and Kumar.
              Part of slides is from Dr. Jessica Lin.
                                                             1
•   HW4 and HW5 are due 12/02
•   HW3 is due 11/04
•   11/05 election day, no class
•   11/12, no class, watch recorded video
•   Quiz 4 will be move to 11/19
•   Final exam: 12/17, 7:30pm-9:30pm
                                            2
So far …
• Classification & Regression: data with features and labels
• Clustering: data with features
             Now, let’s study transaction data!
                                                       3
Outline
  • Introduction and Motivations
  • Apriori Algorithm
  • Association Rule Evaluation
                                   4
Market Basket Analysis
  If you are a manager of a supermarket, how do you
  design your store layout to promote sales?
  You want to put products always being purchased
  together close.                                     5
Market Basket Analysis
  You want to put products always being purchased
  together close. By looking into the baskets of
  customers.
                                                    6
Association Rules
 • Stores are interested in associations between
   different items that people buy together.
   •   E.g., Someone who buys bread is quite likely to also buy milk
 • Association Rules (if-then rules):                 consequent
   •   E.g., bread ⟹ milk
   •   E.g., {Spider Man, Iron Man} ⟹ The Avengers
antecedent
                                                                7
Association Rules
 • Stores are interested in associations between
   different items that people buy together.
   •   E.g., Someone who buys bread is quite likely to also buy milk
 • Association Rules (if-then rules):                 consequent
   •   E.g., bread ⟹ milk
   •   E.g., {Spider Man, Iron Man} ⟹ The Avengers
antecedent
 • Associations information is used beyond market
   basket analysis.
   •   E.g., medicine, recommender systems (Amazon’s
       “Customers who bought this item also bought…” )
                                                                8
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)
• (Osco Drug) Customers who buy diapers are more likely to also
  buy beer.
  • Daddy needs a beer.
                                                             9
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)
• (Osco Drug) Customers who buy diapers are more likely to also
  buy beer.
  • Daddy needs a beer.
• (Walmart) 60% of customers who buy a Barbie doll buy one of
  three types of candy bars.
  • Kids come along for errands.
                                                             10
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)
• (Osco Drug) Customers who buy diapers are more likely to also
  buy beer.
  • Daddy needs a beer.
• (Walmart) 60% of customers who buy a Barbie doll buy one of
  three types of candy bars.
  • Kids come along for errands.
• (Some large retailer) The purchase of a stapler often
  accompanies the purchase of paper, waste baskets, scissors,
  paper clips, and so on.
  • New hires?
                                                             11
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)
• (Osco Drug) Customers who buy diapers are more likely to also
  buy beer.
  • Daddy needs a beer.
• (Walmart) 60% of customers who buy a Barbie doll buy one of
  three types of candy bars.
  • Kids come along for errands.
• (Some large retailer) The purchase of a stapler often
  accompanies the purchase of paper, waste baskets, scissors,
  paper clips, folders, and so on.
  • New hires?
• (Car insurance) Low credit rating, more car accidents.
                                                                12
Bizarre and Surprising Rules
(From “Predictive Analytics” by Eric Siegel)
• (Osco Drug) Customers who buy diapers are more likely to also
  buy beer.
  • Daddy needs a beer.
• (Walmart) 60% of customers who buy a Barbie doll buy one of
  three types of candy bars.
  • Kids come along for errands.
• (Some large retailer) The purchase of a stapler often
  accompanies the purchase of paper, waste baskets, scissors,
  paper clips, folders, and so on.
  • New hires?
• (Car insurance) Low credit rating, more car accidents.
• Music taste and political affiliation.                        13
 Association Rule Mining
                                Market-basket transactions
• Given a set of transactions
  (baskets)
• A transaction contains a
  small subset of items
• Typically, #items in a
  transaction is small,
  #transactions is very large
                                                       14
 Association Rule Mining
                                 Market-basket transactions
• Given a set of transactions
  (baskets)
• A transaction contains a
  small subset of items
• Typically, #items in a
  transaction is small,
  #transactions is very large
• Goal: find association rules
  of the form {x, y} -> {z}
                                                        15
    Definition: Frequent Itemset
•       Itemset
    •   A collection of many items
      o    Example: {Milk, Bread, Diaper}
    • k-itemset
                                            TID   Items
      o    An itemset containing k items
                                            1     Bread, Milk
                                            2     Bread, Diaper, Beer, Eggs
                                            3     Milk, Diaper, Beer, Coke
                                            4     Bread, Milk, Diaper, Beer
                                            5     Bread, Milk, Diaper, Coke
                                                                      16
    Definition: Frequent Itemset
•       Itemset
    •   A collection of many items
      o    Example: {Milk, Bread, Diaper}
    • k-itemset
                                                  TID   Items
      o    An itemset containing k items
                                                  1     Bread, Milk
•       Support count (𝝈)                         2     Bread, Diaper, Beer, Eggs
    •     Frequency of occurrence of an itemset
                                                  3     Milk, Diaper, Beer, Coke
    •     E.g., 𝜎({Milk, Bread, Diaper})=2
                                                  4     Bread, Milk, Diaper, Beer
                                                  5     Bread, Milk, Diaper, Coke
                                                                            17
    Definition: Frequent Itemset
•       Itemset
    •   A collection of many items
      o    Example: {Milk, Bread, Diaper}
    • k-itemset
                                                     TID   Items
      o    An itemset containing k items
                                                     1     Bread, Milk
•       Support count (𝝈)                            2     Bread, Diaper, Beer, Eggs
    •     Frequency of occurrence of an itemset
                                                     3     Milk, Diaper, Beer, Coke
    •     E.g., 𝜎({Milk, Bread, Diaper})=2
                                                     4     Bread, Milk, Diaper, Beer
•       Support                                      5     Bread, Milk, Diaper, Coke
    •     Fraction of transactions that contain an
          itemset
    •     E.g., s({Milk, Bread, Diaper})=2/5
                                                                               18
    Definition: Frequent Itemset
•       Itemset
    •   A collection of many items
      o    Example: {Milk, Bread, Diaper}
    • k-itemset
                                                     TID   Items
      o    An itemset containing k items
                                                     1     Bread, Milk
•       Support count (𝝈)                            2     Bread, Diaper, Beer, Eggs
    •     Frequency of occurrence of an itemset
                                                     3     Milk, Diaper, Beer, Coke
    •     E.g., 𝜎({Milk, Bread, Diaper})=2
                                                     4     Bread, Milk, Diaper, Beer
•       Support                                      5     Bread, Milk, Diaper, Coke
    •     Fraction of transactions that contain an
          itemset
    •     E.g., s({Milk, Bread, Diaper})=2/5
•       Frequent Itemset
    •     An itemset whose support is greater than
          or equal to a minsup threshold
                                                                               19
 Definition: Association Rule
                                          TID   Items
• Association Rule                        1     Bread, Milk
  •   An implication expression of the    2     Bread, Diaper, Beer, Eggs
      form X → Y, where X and Y are       3     Milk, Diaper, Beer, Coke
      itemsets                            4     Bread, Milk, Diaper, Beer
  •   Example: {Milk, Diaper} →           5     Bread, Milk, Diaper, Coke
      {Beer}
• Rule Evaluation Metrics
  •     Support (s)
      o   Fraction of transactions that
          contain both X and Y
                                                                    20
 Definition: Association Rule
                                             TID   Items
• Association Rule                           1     Bread, Milk
  •       An implication expression of the   2     Bread, Diaper, Beer, Eggs
          form X → Y, where X and Y are      3     Milk, Diaper, Beer, Coke
          itemsets                           4     Bread, Milk, Diaper, Beer
  •       Example: {Milk, Diaper} →          5     Bread, Milk, Diaper, Coke
          {Beer}
• Rule Evaluation Metrics
  •   Support (s)
    o   Fraction of transactions that
        contain both X and Y
  • Confidence (c)
    o   How often Y appears in
        transactions that contain X
                                𝝈(𝑿∪𝒀)
      o     conf(X→Y)=P(Y|X)=
                                  𝝈(𝑿)                                 21
    Definition: Association Rule
                                              TID   Items
•       Rule Evaluation Metrics
    •   Support (s)                           1     Bread, Milk
      o    Fraction of transactions that      2     Bread, Diaper, Beer, Eggs
           contain both X and Y               3     Milk, Diaper, Beer, Coke
    • Confidence (c)                          4     Bread, Milk, Diaper, Beer
      o    How often Y appears in
                                              5     Bread, Milk, Diaper, Coke
           transactions that contain X
                               𝜎(𝑋∪𝑌)
         o   conf(X→Y)=P(Y|X )= 𝜎(𝑋)
                                           Example:
                              {Milk, Diaper}  Beer
                                     (Milk, Diaper, Beer)     2
                              s=                              = = 0.4
                                             |T|               5
                                 (Milk, Diaper, Beer) 2
                             c=                       = = 0.67
                                    (Milk, Diaper)    3                22
 Association Rule Mining Task
• Given a set of transactions 𝕋, the goal is to find all rules
  having
  •   Support ≥ minsup threshold
  •   Confidence ≥ minconf threshold
                                                           23
 Association Rule Mining Task
• Given a set of transactions 𝕋, the goal is to find all rules
  having
  •   Support ≥ minsup threshold
  •   Confidence ≥ minconf threshold
• One possible way: brute-force approach
  •   List all possible association rules
  •   Compute support and confidence for each rule
  •   Prune rules failing the minsup and minconf thresholds
  •   Computationally Impossible
                                                              24
Association Rule Mining Task
                               25
Association Rule Mining Task
                               26
Find Frequent Itemset
                                        null
Items: A, B, C, D, E
                     A       B           C            D       E
        AB    AC      AD   AE     BC           BD      BE   CD     CE    DE
        ABC   ABD    ABE   ACD    ACE          ADE    BCD   BCE    BDE   CDE
                    ABCD   ABCE         ABDE         ACDE   BCDE
                                    ABCDE
25 candidate itemsets                                                          27
Find Frequent Itemset
• If there are n items, there will be 2n possible itemsets
• 1000 items → 21000 > 10300
• 1080 hydrogen atoms in the observable universe!
                             How to solve this?
                                                        28
Outline
  • Introduction and Motivations
  • Apriori Algorithm
  • Association Rule Evaluation
                                   29
Apriori
• Apriori principle:
  •   If an itemset is frequent, then all its subsets must also be
      frequent
• Apriori principle holds due to the following property
  of the support measure:
           X , Y : ( X  Y )  s( X )  s(Y )
                                                                     30
Apriori
            X , Y : ( X  Y )  s( X )  s(Y )
•   Support of an itemset cannot        TID   Items
    be larger than the support of its   1     Bread, Milk
    subsets                             2     Bread, Diaper, Beer, Eggs
                                        3     Milk, Diaper, Beer, Coke
•   This is known as the anti-          4     Bread, Milk, Diaper, Beer
    monotone property of support        5     Bread, Milk, Diaper, Coke
                                        s(Bread, Milk, Diaper)=3
                                        s(Bread, Milk)=4
                                                                    31
 Apriori
                                                   null
Found to be infrequent          A       B           C            D       E
                 AB      AC      AD   AE     BC           BD      BE   CD     CE         DE
                 ABC     ABD    ABE   ACD    ACE          ADE    BCD   BCE    BDE        CDE
                               ABCD   ABCE         ABDE         ACDE   BCDE
                                               ABCDE
              Pruned supersets
                                                                                    32
               Minimum support count=3
Apriori                                  1-itemsets
TID   Items
                                    Item       Count
1     Bread, Milk                   Bread        4
2     Bread, Diaper, Beer, Eggs     Coke         2
3     Milk, Diaper, Beer, Coke      Milk         4
4     Bread, Milk, Diaper, Beer     Beer         3
5     Bread, Milk, Diaper, Coke     Diaper       4
                                    Eggs         1
                                                       33
               Minimum support count=3
Apriori                                  1-itemsets
TID   Items
                                    Item       Count
1     Bread, Milk
                                    Bread        4
2     Bread, Diaper, Beer, Eggs     Coke         2
3     Milk, Diaper, Beer, Coke      Milk         4
4     Bread, Milk, Diaper, Beer     Beer         3
5     Bread, Milk, Diaper, Coke     Diaper       4
                                    Eggs         1
                                                       34
           Minimum support count=3
Apriori                                    1-itemsets
     2-itemsets
                                  Item                Count
  Itemset                         Bread                 4
  {Bread,Milk}                    Coke                  2
  {Bread, Beer }                  Milk                  4
  {Bread,Diaper}                  Beer                  3
  {Beer, Milk}                    Diaper                4
  {Diaper, Milk}                  Eggs                  1
  {Beer,Diaper}
(No need to generate candidates
involving Coke or Eggs)              TID   Items
                                     1     Bread, Milk
                                     2     Bread, Diaper, Beer, Eggs
                                     3     Milk, Diaper, Beer, Coke
                                     4     Bread, Milk, Diaper, Beer
                                     5     Bread, Milk, Diaper, Coke
                                                              35
           Minimum support count=3
Apriori                                    1-itemsets
     2-itemsets
                                 Item                 Count
Itemset           Count          Bread                  4
{Bread,Milk}        3            Coke                   2
{Bread,Beer}        2            Milk                   4
{Bread,Diaper}      3            Beer                   3
{Milk,Beer}         2            Diaper                 4
{Milk,Diaper}       3            Eggs                   1
{Beer,Diaper}       3
                                     TID   Items
                                     1     Bread, Milk
                                     2     Bread, Diaper, Beer, Eggs
                                     3     Milk, Diaper, Beer, Coke
                                     4     Bread, Milk, Diaper, Beer
                                     5     Bread, Milk, Diaper, Coke
                                                              36
           Minimum support count=3
Apriori                                    1-itemsets
     2-itemsets
                                 Item                 Count
Itemset           Count          Bread                  4
{Bread,Milk}        3            Coke                   2
{Bread,Beer}        2            Milk                   4
{Bread,Diaper}      3            Beer                   3
{Milk,Beer}         2            Diaper                 4
{Milk,Diaper}       3            Eggs                   1
{Beer,Diaper}       3
    3-itemsets                       TID   Items
                                     1     Bread, Milk
                                     2     Bread, Diaper, Beer, Eggs
                                     3     Milk, Diaper, Beer, Coke
                                     4     Bread, Milk, Diaper, Beer
                                     5     Bread, Milk, Diaper, Coke
                                                              37
            Minimum support count=3
Apriori                                     1-itemsets
     2-itemsets
                                  Item                 Count
Itemset            Count          Bread                  4
{Bread,Milk}         3            Coke                   2
{Bread,Beer}         2            Milk                   4
{Bread,Diaper}       3            Beer                   3
{Milk,Beer}          2            Diaper                 4
{Milk,Diaper}        3            Eggs                   1
{Beer,Diaper}        3
     3-itemsets                       TID   Items
                                      1     Bread, Milk
Itemset               Count           2     Bread, Diaper, Beer, Eggs
{ Beer, Diaper, Milk}   2             3     Milk, Diaper, Beer, Coke
                                      4     Bread, Milk, Diaper, Beer
{ Beer,Bread, Diaper}   2
                                      5     Bread, Milk, Diaper, Coke
{Bread, Diaper, Milk}   1
{Beer, Bread, Milk}     1                                      38
            Minimum support count=3
Apriori                                       1-itemsets
     2-itemsets
                                         Item         Count
Itemset            Count                 Bread          4
{Bread,Milk}         3                   Coke           2
{Bread,Beer}         2                   Milk           4
{Bread,Diaper}       3                   Beer           3
{Milk,Beer}          2                   Diaper         4
{Milk,Diaper}        3                   Eggs           1
{Beer,Diaper}        3
                              If every subset is considered,
                                  6C + 6C + 6C
                                    1     2      3
     3-itemsets
                                  6 + 15 + 20 = 41
Itemset               Count
{ Beer, Diaper, Milk}   2
                              With support-based pruning,
{ Beer,Bread, Diaper}   2
{Bread, Diaper, Milk}   1        6 + 6 + 4 = 16
{Beer, Bread, Milk}     1                                   39
Apriori
Algorithm:
• Generate frequent itemsets of length 1
• Let k=2
• Repeat until no new frequent itemsets are
   identified
   o   Generate length k candidate itemsets from length
       k-1 frequent itemsets
   o   Count the support of each candidate by scanning
       the database
   o   Eliminate candidates that are infrequent, leaving
       only those that are frequent
                                                      40
Apriori
Algorithm:
• Generate frequent itemsets of length 1
• Let k=2
• Repeat until no new frequent itemsets are
   identified
   o   Generate length k candidate itemsets from length
       k-1 frequent itemsets (how???)
   o   Count the support of each candidate by scanning
       the database
   o   Eliminate candidates that are infrequent, leaving
       only those that are frequent
                                                      41
Candidate Itemsets Generation
   o   Generate length k candidate itemsets from
       length k-1 frequent itemsets
              A basic approach:
                • Fk-1x F1 method
                • Fk-1x Fk-1 method
                                                   42
Candidate Itemsets Generation
• Fk-1x F1 Method        TID   Items
                         1     Bread, Milk
                         2     Bread, Diaper, Beer, Eggs
                         3     Milk, Diaper, Beer, Coke
                         4     Bread, Milk, Diaper, Beer
                         5     Bread, Milk, Diaper, Coke
                                                    43
Candidate Itemsets Generation
• Fk-1x Fk-1 Method                      TID   Items
                                         1     Bread, Milk
                                         2     Bread, Diaper, Beer, Eggs
                                         3     Milk, Diaper, Beer, Coke
                                         4     Bread, Milk, Diaper, Beer
                                         5     Bread, Milk, Diaper, Coke
                 Only merge a pair of frequent (k-1)-itemsets if
                 their first k-2 items are identical!
                                                                    44
Association Rule Mining Task
 Two steps:
 • Find all frequent itemsets
 • Generate high-confidence association rules from
   each frequent itemset
                              ?
                                                     45
Rule Generation
                  46
Rule Generation
How to efficiently generate rules from frequent itemsets?
                                ABCD=>{ }
            BCD=>A     ACD=>B               ABD=>C        ABC=>D
   CD=>AB     BD=>AC   BC=>AD         AD=>BC         AC=>BD        AB=>CD
            D=>ABC     C=>ABD               B=>ACD        A=>BCD
                                                                            47
Rule Generation
How to efficiently generate rules from frequent itemsets?
• confidence of rules generated from the same itemset
   has an anti-monotone property
  • E.g., L = {A,B,C,D}:
                𝜎(𝐴𝐵𝐶) ≤ 𝜎(𝐴𝐵) ≤ 𝜎(𝐴)
               𝜎(𝐴𝐵𝐶𝐷)       𝜎(𝐴𝐵𝐶𝐷)       𝜎(𝐴𝐵𝐶𝐷)
                         ≥             ≥
                𝜎(𝐴𝐵𝐶)        𝜎(𝐴𝐵)         𝜎(𝐴)
      𝑐 𝐴𝐵𝐶 → 𝐷 ≥ 𝑐 𝐴𝐵 → 𝐶𝐷  𝑐 𝐴 → 𝐵𝐶𝐷
                                                      48
 Rule Generation
Low-confidence rule
                                        ABCD=>{ }
                 BCD=>A        ACD=>B               ABD=>C        ABC=>D
        CD=>AB        BD=>AC   BC=>AD         AD=>BC         AC=>BD        AB=>CD
                 D=>ABC        C=>ABD               B=>ACD        A=>BCD
                  Pruned rules
                                                                              49
Outline
  • Introduction and Motivations
  • Apriori Algorithm
  • Association Rule Evaluation
                                   50
Limitation of Confidence
          Coffee   Coffee
   Tea     15        5      20
   Tea     75        5      80
           90       10      100
 Association Rule: Tea → Coffee
 Confidence= P(Coffee|Tea) = 15/20 = 0.75
  Although confidence is high, rule is misleading
  P(Coffee|Tea) = 0.9375
  Because P(Coffee) = 90/100 = 0.9
                                                     51
Statistical Independence
  • Population of 1000 students
    – 600 students know how to swim (S)
    – 700 students know how to bike (B)
    – 420 students know how to swim and bike (S,B)
    – P(SÙB) = 420/1000 = 0.42
    – P(S) ´ P(B) = 0.6 ´ 0.7 = 0.42
    – P(SÙB) = P(S) ´ P(B) => Statistical independence
    – P(SÙB) > P(S) ´ P(B) => Positively correlated
    – P(SÙB) < P(S) ´ P(B) => Negatively correlated
                                                         52
Statistical-based Measure
 Lift: the ratio between the observed support and the
 expected support if X and Y were independent
                 sup(𝑋 ∪ 𝑌)
  𝐿𝑖𝑓𝑡 𝑋 → 𝑌 =
               sup 𝑋 sup(𝑌)
               conf(𝑋 → 𝑌)
             =              = 𝐿𝑖𝑓𝑡 𝑌 → 𝑋
                  sup(𝑌)
       •   Range [0, ∞)
       •   Lift(X,Y)=1: X and Y are independent
       •   Lift(X,Y)>1: X and Y are positively correlated
       •   Lift(X,Y)<1: X and Y are negatively correlated
                                                            53
Statistical-based Measure
          Coffee   Coffee
   Tea      15       5      20
   Tea      75       5      80
            90      10      100
 Association Rule: Tea → Coffee
 Confidence= P(Coffee|Tea) = 0.75
 P(Coffee) = 0.9
  Lift = P(Coffee|Tea)/ P(Coffee) = 0.75/0.9= 0.8333
 (< 1, therefore is negatively associated)
                                                    54
More
Tan, Pang-Ning, Vipin Kumar, and Jaideep Srivastava. "Selecting the right interestingness measure for
association patterns." In Proceedings of the eighth ACM SIGKDD international conference on Knowledge    55
discovery and data mining. 2002.
What we have learnt so far
  • The basic concepts and applications of
    association rule mining.
  • The Apriori Algorithm to find frequent
    itemsets.
  • How to efficiently generate rules from frequent
    itemset.
  • Lift: the statistical-based evaluation
                                                      56