DMDW - Association Analysis

The document discusses association rule mining, focusing on frequent patterns and the generation of rules that predict item occurrences based on transaction data. It outlines methods such as the Apriori algorithm and FP-growth for mining frequent itemsets, emphasizing the importance of support and confidence thresholds. Additionally, it explores various types of association rules, including multilevel, multidimensional, and quantitative rules, highlighting their applications in data analysis.

Association Analysis

Association Rule Mining

Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear
in a data set frequently.

Given a set of transactions, the goal is to find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset.

TID    List of Items
1      Milk, Bread, Cereal
2      Milk, Bread, Sugar, Eggs
3      Milk, Bread, Butter
4      Sugar, Eggs

Example of Association Rules


{Milk} → {Bread}
{Milk, Bread} → {Butter}
{Bread} → {Sugar}

Definitions:
Itemset
– A collection of one or more items
  • Example: {Milk, Bread}
– k-itemset
  • An itemset that contains k items
Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread}) = 3
Support (s)
– Fraction of transactions that contain an itemset
– E.g., s({Milk} → {Bread}) = 3/4
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
Example: computer ⇒ antivirus software [support = 2%, confidence = 60%]

Rule support and confidence are two measures of rule interestingness.

A support of 2% for this association rule means that 2% of all the transactions under
analysis show that computer and antivirus software are purchased together.

A confidence of 60% means that 60% of the customers who purchased a computer
also bought the software.

Typically, association rules are considered interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold.
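To make the two measures concrete, here is a minimal Python sketch (using the transactions from the table above; function names are illustrative, not from the source) that computes support and confidence directly from a list of transactions:

```python
# Minimal sketch: support and confidence computed from transactions.
# The transactions mirror the example table above.
transactions = [
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread", "Sugar", "Eggs"},
    {"Milk", "Bread", "Butter"},
    {"Sugar", "Eggs"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """confidence(A -> B) = support(A ∪ B) / support(A)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"Milk", "Bread"}, transactions))       # 0.75
print(confidence({"Milk"}, {"Bread"}, transactions))  # 1.0 (every Milk buyer also bought Bread)
```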

Association Rule Mining Task:

Given a set of transactions T, the goal of association rule mining is to find all rules
having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
– Computationally prohibitive!
The total number of possible rules that can be extracted from a data set containing d items is R = 3^d − 2^(d+1) + 1. For example, a data set with d = 6 items already yields 602 candidate rules.
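As a quick sanity check of this formula, the following brute-force sketch (illustrative, not from the source) enumerates every rule A → B with nonempty, disjoint A and B and compares the count against the closed form; for d = 6 both give 602:

```python
from itertools import combinations

def rule_count(d):
    """Closed form: R = 3^d - 2^(d+1) + 1."""
    return 3**d - 2**(d + 1) + 1

def brute_force_count(d):
    """Count all rules A -> B with A, B nonempty disjoint subsets of d items."""
    items = range(d)
    count = 0
    for a_size in range(1, d):
        for A in combinations(items, a_size):
            rest = [i for i in items if i not in A]
            for b_size in range(1, len(rest) + 1):
                count += sum(1 for _ in combinations(rest, b_size))
    return count

print(rule_count(6), brute_force_count(6))  # 602 602
```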
Mining Association Rules:

A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks.
Two-step approach:
1. Frequent Itemset Generation
– whose objective is to find all the itemsets that satisfy the minsup
threshold.
– These itemsets are called frequent itemsets.
2. Rule Generation
– whose objective is to extract all the high-confidence rules from the
frequent itemsets found in the previous step.
– These rules are called strong rules.
Frequent itemset generation is still computationally expensive.
Association Rule Mining Methods:

1. Apriori Method: Finding Frequent Itemsets with candidate generation


The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties, as we shall see later. Apriori employs an iterative approach
known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the
set of frequent 1-itemsets is found by scanning the database to accumulate the count for each
item, and collecting those items that satisfy minimum support. The resulting set is denoted by
L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so
on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full
scan of the database.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

It includes two steps:

i. Join step:
• To find Lk, a set of candidate k-itemsets is generated by joining Lk−1 with itself.
• In the join, Lk−1 ⋈ Lk−1, members l1 and l2 of Lk−1 are joined if
  (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k−2] = l2[k−2]) ∧ (l1[k−1] < l2[k−1]).
• The condition l1[k−1] < l2[k−1] ensures that no duplicates are generated.

ii. Prune step:
The prune component employs the Apriori property to remove candidates that have a
subset that is not frequent.
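The join and prune steps above translate directly into code. Below is a minimal Python sketch of one possible implementation (function names and the sorted-tuple representation are my own, not from the source):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets L_prev.

    Itemsets are represented as sorted tuples.
    """
    candidates = set()
    for l1 in L_prev:
        for l2 in L_prev:
            # Join step: first k-2 items equal, last item of l1 < last item of l2.
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                c = l1 + (l2[k - 2],)
                # Prune step (Apriori property): every (k-1)-subset must be frequent.
                if all(s in L_prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, min_count):
    """Level-wise search: use L_{k-1} to find L_k until no more frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    L = {(i,) for i in items
         if sum(i in t for t in transactions) >= min_count}
    all_frequent, k = set(L), 2
    while L:
        C = apriori_gen(L, k)
        # One full database scan per level to count candidate support.
        L = {c for c in C
             if sum(set(c) <= set(t) for t in transactions) >= min_count}
        all_frequent |= L
        k += 1
    return all_frequent
```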

Apriori Example-1: The Apriori Algorithm (worked example figure not reproduced in this copy)
Generating association rules from Frequent Itemsets:
Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using the following equation:

confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)

where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.
Suppose the data contain the frequent itemset l = {I1, I2, I5}
What are the association rules that can be generated from l?
The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
The resulting association rules are as shown below, each listed with its confidence (computed from the support counts in D):

{I1, I2} ⇒ I5, confidence = 2/4 = 50%
{I1, I5} ⇒ I2, confidence = 2/2 = 100%
{I2, I5} ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
I5 ⇒ {I1, I2}, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because these are the only ones generated that are strong.
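A hedged sketch of this rule-generation procedure in Python (helper names are mine; the support counts at the bottom are the ones quoted in the example above):

```python
from itertools import combinations

def generate_rules(freq_itemsets, support_count, min_conf):
    """Emit strong rules A -> B from each frequent itemset.

    freq_itemsets: iterable of frozensets; support_count: dict mapping
    frozenset -> absolute support count (assumed precomputed by Apriori).
    """
    rules = []
    for l in freq_itemsets:
        if len(l) < 2:
            continue
        for size in range(1, len(l)):
            for antecedent in map(frozenset, combinations(l, size)):
                # confidence(A -> B) = support_count(l) / support_count(A)
                conf = support_count[l] / support_count[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(l - antecedent), conf))
    return rules

# For l = {I1, I2, I5} with the support counts used above:
counts = {frozenset(s): c for s, c in [
    ({"I1", "I2", "I5"}, 2), ({"I1", "I2"}, 4), ({"I1", "I5"}, 2),
    ({"I2", "I5"}, 2), ({"I1"}, 6), ({"I2"}, 7), ({"I5"}, 2)]}
for a, b, c in generate_rules([frozenset({"I1", "I2", "I5"})], counts, 0.7):
    print(a, "->", b, f"confidence = {c:.0%}")   # the three 100% rules
```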

Example-2: A database has 5 transactions. Let min_sup = 60% and min_conf = 80%.

TID     Items Bought
T100    {m, o, n, k, e, y}
T200    {d, o, n, k, e, y}
T300    {m, a, k, e}
T400    {m, u, c, k, y}
T500    {c, o, o, k, i, e}

a) Find all frequent itemsets using Apriori.

b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and item_i denotes variables representing items (e.g., "a", "b", etc.):

∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
Scan the database and find the frequent 1-itemsets (single-item patterns), i.e., those whose support count is ≥ 3 (60% of 5 transactions). Continuing the level-wise search finally results in the complete set of frequent itemsets: {e, k, m, o, y, ke, oe, mk, ok, ky, oke}.

2. Frequent Pattern (FP)-Growth Approach: Mining Frequent Itemsets without Candidate Generation
As we have seen, in many cases the Apriori candidate generate-and-test method significantly
reduces the size of candidate sets, leading to good performance gain. However, it can suffer
from two nontrivial costs:

• It may still need to generate a huge number of candidate sets. For example, if there
are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7
candidate 2-itemsets.
• It may need to repeatedly scan the whole database and check a large set of candidates
by pattern matching. It is costly to go over each transaction in the database to
determine the support of the candidate itemsets.

“Can we design a method that mines the complete set of frequent itemsets without such a
costly candidate generation process?” An interesting method in this attempt is called
frequent pattern growth, or simply FP-growth, which adopts a divide-and-conquer strategy
as follows. First, it compresses the database representing frequent items into a frequent
pattern tree, or FP-tree, which retains the itemset association information. It then divides
the compressed database into a set of conditional databases (a special kind of projected
database), each associated with one frequent item or “pattern fragment,” and mines each
database separately. For each “pattern fragment,” only its associated data sets need to be
examined. Therefore, this approach may substantially reduce the size of the data sets to be
searched, along with the “growth” of patterns being examined.
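A full FP-tree implementation is too long to show here, but the algorithm is available off the shelf. The sketch below uses the third-party mlxtend library (an assumption: it must be installed, e.g., via pip install mlxtend, and exact function signatures may vary across versions) on the Example-2 transactions from earlier:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# The five transactions from Example-2 above.
transactions = [list("monkey"), list("donkey"), list("make"),
                list("mucky"), list("cookie")]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Mine frequent itemsets without candidate generation (min_sup = 60%).
freq = fpgrowth(df, min_support=0.6, use_colnames=True)
print(freq)  # expect {e, k, m, o, y, ke, oe, mk, ok, ky, oke}

# Derive strong rules at min_conf = 80%.
rules = association_rules(freq, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```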

Mining Various Kinds of Association Rules

• Mining multilevel association rules
• Mining multidimensional association rules
• Mining quantitative association rules

Multilevel associations involve concepts at different abstraction levels. Multidimensional associations involve more than one dimension or predicate (e.g., rules that relate what a customer buys to his or her age). Quantitative association rules involve numeric attributes that have an implicit ordering among values (e.g., age). Rare patterns are patterns that suggest interesting, though rare, item combinations.

1. Mining multilevel association rules

For many applications, strong associations discovered at high abstraction levels, though with
high support, could be commonsense knowledge. We may want to drill down to find novel
patterns at more detailed levels.

Uniform support: the same minimum support threshold is used when mining at each abstraction level. For example, in the figure, a minimum support threshold of 5% is used throughout (e.g., for mining from “computer” downward to “laptop computer”). Both “computer” and “laptop computer” are found to be frequent, whereas “desktop computer” is not.

Reduced support: each abstraction level has its own minimum support threshold. The deeper the abstraction level, the smaller the corresponding threshold. For example, in the figure, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all considered frequent.
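A minimal sketch contrasting the two threshold strategies (the counts and concept names below are illustrative assumptions chosen to mirror the figure described above):

```python
# Sketch: uniform vs. reduced support across abstraction levels.
# Counts are illustrative assumptions, not from the source.
n_transactions = 50
support_count = {  # (level, concept) -> number of transactions containing it
    (1, "computer"): 5,          # 10% support
    (2, "laptop computer"): 3,   # 6% support
    (2, "desktop computer"): 2,  # 4% support
}

def frequent_at(level, min_sup):
    """Concepts at `level` whose support meets that level's threshold."""
    return {c for (lvl, c), k in support_count.items()
            if lvl == level and k / n_transactions >= min_sup}

# Uniform support (5% at every level): "desktop computer" drops out.
print(frequent_at(1, 0.05), frequent_at(2, 0.05))
# Reduced support (5% at level 1, 3% at level 2): all three are frequent.
print(frequent_at(1, 0.05), frequent_at(2, 0.03))
```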

2. Mining Multi-Dimensional Association rules

Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension. Hence, a rule such as buys(X, “computer”) ⇒ buys(X, “antivirus software”) is a single-dimensional or intra-dimensional association rule, because it contains a single distinct predicate (buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule). Such rules are commonly mined from transactional data.

Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules. Consider, for example,

age(X, “20..29”) ∧ occupation(X, “student”) ⇒ buys(X, “laptop”)    (7.7)

Rule (7.7) contains three predicates (age, occupation, and buys), each of which occurs only once in the rule. Hence, we say that it has no repeated predicates. Multidimensional association rules with no repeated predicates are called inter-dimensional association rules. We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules. An example of such a rule is the following, where the predicate buys is repeated:

age(X, “20..29”) ∧ buys(X, “laptop”) ⇒ buys(X, “HP printer”)

3. Mining Quantitative Association Rules

Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes

• The transformed multidimensional data may be used to construct a data cube.
• Data cubes are well suited to the mining of multidimensional association rules.
• They store aggregates (such as counts) in multidimensional space, which is essential for computing the support and confidence of multidimensional association rules.
• The base cuboid aggregates the task-relevant data by age, income, and buys.
• The 2-D cuboid (age, income) aggregates by age and income, and so on.
• The 0-D (apex) cuboid contains the total number of transactions in the task-relevant data.
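A brief sketch of why cube aggregates suffice (the cuboid cells and counts below are illustrative assumptions, not from the source): once counts are stored per cuboid, the support and confidence of a multidimensional rule become simple lookups rather than database scans.

```python
# Sketch: computing rule support/confidence from data cube aggregates.
# Cuboid cells and counts are illustrative assumptions.
apex_count = 10000  # 0-D (apex) cuboid: total task-relevant transactions

# 2-D cuboid (age, income) and 3-D base cuboid (age, income, buys): cell -> count
cuboid_age_income = {("20..29", "40K..49K"): 600}
base_cuboid = {("20..29", "40K..49K", "laptop"): 180}

# Rule: age(X, "20..29") ∧ income(X, "40K..49K") ⇒ buys(X, "laptop")
rule_count = base_cuboid[("20..29", "40K..49K", "laptop")]
antecedent_count = cuboid_age_income[("20..29", "40K..49K")]

support = rule_count / apex_count            # 180 / 10000 = 1.8%
confidence = rule_count / antecedent_count   # 180 / 600 = 30%
print(f"support = {support:.1%}, confidence = {confidence:.1%}")
```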
