
DATA MINING

UNIT 2: ASSOCIATION RULES

Association Rules: Problem Definition, Frequent Item Set Generation, The APRIORI
Principle, Support and Confidence Measures, Association Rule Generation; APRIORI
Algorithm, The Partition Algorithm, FP-Growth Algorithm, Compact Representation of
Frequent Item Sets: Maximal Frequent Item Set, Closed Frequent Item Set.

Association Rules in Data Mining


Association rules are a fundamental concept in data mining, particularly in the context of
market basket analysis. They help identify relationships between different items in large
datasets. Below is a comprehensive overview covering definitions, examples, and steps
involved in generating association rules.
What are Association Rules?
• Definition: An association rule is a rule that implies a strong relationship between two
items or itemsets in a dataset. It is typically expressed in the form A → B, where:
o A is the antecedent (the item or itemset that is present).
o B is the consequent (the item or itemset that is predicted to occur).
Example
If a customer buys Bread (A), they are likely to also buy Butter (B). This can be represented
as:
• Rule: Bread → Butter
Components of Association Rules
Association rules are evaluated based on several key metrics:
1. Support
• Definition: Support measures the frequency of occurrence of an itemset in the dataset.

2. Confidence
• Definition: Confidence measures the likelihood that the consequent occurs given the
antecedent.

3. Lift
• Definition: Lift measures how much more likely the consequent is to occur in the
presence of the antecedent compared to its general occurrence.
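As a quick illustration, here is a minimal Python sketch (variable and rule names are illustrative) that computes all three metrics for a single rule, using the grocery transactions from the example that follows:

# Minimal sketch: support, confidence and lift for one rule A -> B.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Support of the combined itemset divided by support of the antecedent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    # Confidence of the rule divided by the support of the consequent.
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"Milk", "Diaper"}, transactions))       # 0.6
print(confidence({"Milk"}, {"Diaper"}, transactions))  # 0.75
print(lift({"Milk"}, {"Diaper"}, transactions))        # 0.9375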

Steps to Generate Association Rules


Step 1: Data Collection
• Gather transactional data, such as sales data from a retail store.
Step 2: Preprocessing
• Clean the data by removing duplicates and handling missing values.
• Transform the data into a suitable format (e.g., transaction format).
Step 3: Frequent Itemset Generation



• Use algorithms like Apriori or FP-Growth to identify frequent itemsets, which are
item combinations that appear together frequently in transactions.
Step 4: Rule Generation
• Generate association rules from the frequent itemsets.
Step 5: Rule Evaluation
• Evaluate the generated rules using support, confidence, and lift to filter out the most
relevant rules.
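Steps 3 to 5 are usually delegated to a library. The sketch below shows one possible way to run them with the mlxtend package; this is an assumption about the environment (mlxtend and pandas installed), and the exact API may differ between versions.

# Hedged sketch of Steps 3-5 using the mlxtend library (assumed installed).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer"],
    ["Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper"],
]

# Step 3: one-hot encode the transactions and mine frequent itemsets.
encoder = TransactionEncoder()
onehot = encoder.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=encoder.columns_)
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

# Steps 4-5: generate rules and keep those with confidence of at least 0.7.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])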
Example of Association Rule Mining
Consider a dataset of transactions from a grocery store:

Transaction ID Items Purchased

1 Bread, Milk

2 Bread, Diaper, Beer

3 Milk, Diaper, Beer

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper

Frequent Itemsets
Using an appropriate minimum support threshold (e.g., 60%), we might identify the
following frequent itemsets:
• {Bread}
• {Milk}
• {Diaper}
• {Bread, Milk}
• {Milk, Diaper}
Generated Rules
From the frequent itemsets, we can generate the following rules:
• Rule 1: Bread → Milk
• Rule 2: Milk → Diaper
Evaluation of Rules
• Rule 1:
o Support: 0.6 (3 out of 5 transactions contain both)
o Confidence: 0.75 (3 out of 4 transactions with Bread also contain Milk)
o Lift: 0.9375 (less than 1 indicates no strong relationship)
• Rule 2:
o Support: 0.6 (3 out of 5 transactions contain both)
o Confidence: 0.75 (3 out of 4 transactions with Milk also contain Diaper)
o Lift: 0.9375 (0.75 / 0.8; less than 1, so the rule does not indicate a strong positive association)
Applications of Association Rules
• Market Basket Analysis: Understanding customer purchase behavior to optimize
product placement.
• Cross-Selling Recommendations: Suggesting additional products based on customer
purchases.



• Web Usage Mining: Analyzing user behavior on websites to improve user
experience.
Conclusion
Association rules are a powerful tool in data mining that help to uncover hidden patterns in
transactional data. By understanding and applying support, confidence, and lift, businesses
can gain valuable insights into customer behavior and enhance their marketing strategies.

Second Method

Association Rules in Data Mining


Association rule mining is a technique used in data mining to discover interesting relationships,
patterns, or associations between variables in large datasets. It is commonly used in market
basket analysis, where the goal is to find product combinations that customers frequently
purchase together.
The process involves two main steps:
1. Frequent Itemset Generation: Identifying itemsets (groups of items) that appear
frequently in the dataset.
2. Rule Generation: Creating rules that describe how the presence of certain items in a
transaction is related to the presence of other items.
Structure of an Association Rule:
An association rule is typically written in the form:
X → Y
Where:
• X is called the antecedent (or the left-hand side of the rule), which refers to the item
or set of items.
• Y is called the consequent (or the right-hand side of the rule), which refers to the item
or set of items that are likely to occur when X occurs.

For example:
{Bread, Milk} → {Butter}
This rule means that customers who buy Bread and Milk together are likely to also buy
Butter.



Key Measures in Association Rules
1. Support
Support is a measure of how frequently the itemset (or combination of items) appears in the
dataset. It gives an idea of how relevant the rule is across all transactions.
• Support of an itemset X ∪ Y:
Support(X → Y) = (Number of transactions containing X ∪ Y) / (Total number of transactions)
For example, if Bread and Milk appear together in 3 out of 10 transactions, the support for
{Bread, Milk} is 0.3 (or 30%).

2. Confidence
Confidence measures how often the rule X → Y holds: given that itemset X occurs, how often
itemset Y also occurs. It is the conditional probability of Y given X:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
For example, if 3 out of 5 customers who bought Bread also bought Butter, the confidence of
the rule {Bread} → {Butter} is 3/5 = 0.6 (or 60%).

3. Lift
Lift measures how much more likely the consequent Y is to occur given the antecedent X,
compared to the overall likelihood of Y occurring in the dataset. It helps assess the strength
of the rule:
Lift(X → Y) = Confidence(X → Y) / Support(Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
A lift greater than 1 indicates that the occurrence of X increases the likelihood of Y, whereas
a lift less than 1 indicates a negative correlation.

Example of Association Rules


Let's consider a simple example of a grocery store's transactions:
Transaction ID Items Purchased
1 {Bread, Milk, Butter}
2 {Bread, Milk}



Transaction ID Items Purchased
3 {Milk, Butter}
4 {Bread, Butter}
5 {Milk, Eggs, Butter}

Step 1: Frequent Itemsets


Let’s assume the minimum support threshold is 40%, and we identify frequent itemsets.
• {Bread} (3/5 = 60%)
• {Milk} (4/5 = 80%)
• {Butter} (4/5 = 80%)
• {Bread, Butter} (2/5 = 40%)
• {Milk, Butter} (3/5 = 60%)

Step 2: Generate Association Rules


From the frequent itemsets, we can generate rules such as:
• Rule 1: {Bread} → {Butter}
o Support: 40%
o Confidence: 66.7% (2 out of 3 transactions containing Bread also contain
Butter)
o Lift: 0.667 / 0.8 = 0.833
• Rule 2: {Milk} → {Butter}
o Support: 60%
o Confidence: 75% (3 out of 4 transactions containing Milk also contain Butter)
o Lift: 0.75 / 0.8 = 0.9375

Applications of Association Rules


• Market Basket Analysis: Identifying products that are frequently purchased together.
• Recommendation Systems: Suggesting additional products based on user behavior
(e.g., "People who bought X also bought Y").
• Fraud Detection: Finding unusual patterns in financial transactions.
• Healthcare: Identifying relationships between symptoms and diagnoses.

Algorithms for Association Rule Mining


1. Apriori Algorithm: The most widely used algorithm that generates frequent itemsets
and then derives association rules from them. It uses a "bottom-up" approach by
generating candidate itemsets and pruning those that don’t meet the minimum
support.
2. FP-Growth Algorithm: A more efficient algorithm than Apriori, which constructs a
frequent pattern tree (FP-tree) to generate frequent itemsets without candidate
generation.



Association Rules in Data Mining
Problem Statement
In data mining, association rule learning is a rule-based machine learning method for
discovering interesting relations between variables in large databases. The goal is to identify
strong rules discovered in databases using measures of interestingness.
Key Objectives
• Identify frequent itemsets that occur together.
• Generate association rules that show relationships between these itemsets.
• Evaluate the rules using metrics like support, confidence, and lift.
Frequent Itemset Generation
Frequent itemset generation is the first step in the process of association rule mining. It involves
finding all itemsets that have a support greater than or equal to a specified
minimum support threshold.
Example Scenario
Consider a retail store that wants to analyze customer purchase behavior.
Transaction Data
Assume the following transactions from the store:

Transaction ID Items Purchased

1 Bread, Milk

2 Bread, Diaper, Beer

3 Milk, Diaper, Beer

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper

Step 1: Define Minimum Support


Let’s define a minimum support threshold of 60%. In this case, with 5 transactions, an
itemset must appear in at least 3 transactions to be considered frequent.



Step 2: Generate Frequent Itemsets
Now, we analyze the transactions to find frequent itemsets:
1. Single Items:
o Bread: 4 occurrences (80%)
o Milk: 4 occurrences (80%)
o Diaper: 4 occurrences (80%)
o Beer: 3 occurrences (60%)
All single items are frequent since they all exceed the support threshold.
2. Pairs:
o {Bread, Milk}: 3 occurrences (60%)
o {Bread, Diaper}: 3 occurrences (60%)
o {Milk, Diaper}: 3 occurrences (60%)
o {Diaper, Beer}: 3 occurrences (60%)
o All other pairs do not meet the threshold.
3. Triplets:
o {Bread, Milk, Diaper}: 2 occurrences (40%)
o {Bread, Diaper, Beer}: 2 occurrences (40%)
o No triplets meet the 60% threshold.
Step 3: Resulting Frequent Itemsets

Itemset Support (%)

{Bread} 80%

{Milk} 80%

{Diaper} 80%

{Beer} 60%

{Bread, Milk} 60%

{Bread, Diaper} 60%



Itemset Support (%)

{Milk, Diaper} 60%

{Diaper, Beer} 60%

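A brute-force sketch that computes such a support table directly from the transaction list above; it enumerates every possible itemset, so it is only workable for very small datasets like this one.

# Brute-force frequent-itemset enumeration (only practical for tiny datasets).
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    # Return {itemset: support} for every itemset whose support meets min_support.
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            support = sum(1 for t in transactions if set(combo) <= t) / n
            if support >= min_support:
                result[combo] = support
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper"},
]
for itemset, support in frequent_itemsets(transactions, 0.6).items():
    print(itemset, f"{support:.0%}")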

Conclusion
The frequent itemsets identified here can be further used to generate association rules. These
rules help the store in marketing strategies, such as cross-selling and product placement.
Next Steps
• Generate association rules from the frequent itemsets.
• Evaluate the rules using confidence and lift metrics.

Problem Statement: Association Rules in Data Mining


In the context of Association Rule Mining, the goal is to identify relationships or patterns
among a set of items in large datasets. These relationships are commonly used in various
applications such as market basket analysis, recommendation systems, and medical diagnosis.
Example: Market Basket Analysis
A supermarket wants to identify product combinations that customers frequently buy together
to optimize marketing strategies and store layouts. Using historical transaction data, the
supermarket applies association rule mining to uncover these relationships.
Steps Involved:
1. Frequent Itemset Generation: Find all itemsets (combinations of items) that
frequently appear in the transaction dataset.
2. Association Rule Generation: From the frequent itemsets, generate rules that
describe how the presence of one or more items influences the likelihood of the
presence of other items.
3. Evaluation of Rules: Rules are evaluated using measures like support, confidence,
and lift to determine their significance and usefulness.



Key Metrics:
• Support: The proportion of transactions that contain a specific itemset.

Confidence: The likelihood that if an item or itemset X is bought, itemset Y is also bought.

Lift: Measures how much more likely two items are to be bought together compared to if
they were independent.

Frequent Itemset Generation (Example using Apriori Algorithm)


Let’s use an example of transactions to generate frequent itemsets.
Transaction Data (Market Basket Example):

Transaction ID Items Purchased

1 {Bread, Milk, Butter}

2 {Bread, Milk}

3 {Milk, Butter}

4 {Bread, Butter}

5 {Milk, Eggs, Butter}

Step 1: Set a Minimum Support Threshold


Let's say we set a minimum support threshold of 2 transactions (40% support for 5
transactions).
Step 2: Generate Frequent 1-itemsets
• {Bread}: Appears in 3 transactions (Support = 3/5 = 60%)
• {Milk}: Appears in 4 transactions (Support = 4/5 = 80%)
• {Butter}: Appears in 4 transactions (Support = 4/5 = 80%)
• {Eggs}: Appears in 1 transaction (Support = 1/5 = 20%)
Since {Eggs} doesn't meet the threshold, it’s eliminated.
Step 3: Generate Frequent 2-itemsets
• {Bread, Milk}: Appears in 2 transactions (Support = 2/5 = 40%)
• {Bread, Butter}: Appears in 2 transactions (Support = 2/5 = 40%)
• {Milk, Butter}: Appears in 3 transactions (Support = 3/5 = 60%)
All the above 2-itemsets meet the minimum support threshold.
Step 4: Generate Frequent 3-itemsets
• {Bread, Milk, Butter}: Appears in 1 transaction (Support = 1/5 = 20%)
Since this itemset doesn't meet the support threshold, it is eliminated.
Step 5: Generate Association Rules
From the frequent itemsets, generate association rules. For example:
• Rule: {Milk} → {Butter}
o Support = 3/5 = 60%
o Confidence = 3/4 = 75%
o Lift = 0.75 / 0.8 = 0.9375
• Rule: {Butter} → {Milk}
o Support = 3/5 = 60%
o Confidence = 3/4 = 75%
o Lift = 0.75 / 0.8 = 0.9375
Rules are then evaluated based on these metrics to decide their strength and applicability.

Support and Confidence Measures in Association Rules

Types of Association Rule Learning



Association rule learning is commonly implemented using the following algorithms:

Apriori Algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on
databases that contain transactions. It uses a breadth-first search and a Hash
Tree to count candidate itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the products that can be
bought together. It can also be used in the healthcare field to find drug reactions for patients.

FP-Growth Algorithm

The FP-Growth algorithm stands for Frequent Pattern Growth, and it is an improved version of the
Apriori algorithm. It represents the database in the form of a tree structure known as a
frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.

Metrics for Evaluating Association Rules

In association rule mining, several metrics are commonly used to evaluate the quality and
importance of the discovered association rules.

These metrics can be used to evaluate the quality and importance of association rules and to
select the most relevant rules for a given application. It is important to note that the appropriate
choice of metric will depend on the specific goals and requirements of the application.

Interpreting the results of association rule mining metrics requires understanding the meaning
and implications of each metric, as well as how to use them to evaluate the quality and
importance of the discovered association rules. Here are some guidelines for interpreting the
results of the main association rule mining metrics:

Support

Support is a measure of how frequently an item or itemset appears in the dataset. It is calculated
as the number of transactions containing the item(s) divided by the total number of transactions
in the dataset. High support indicates that an item or itemset is common in the dataset, while
low support indicates that it is rare.

Confidence

Confidence is a measure of the strength of the association between two items. It is calculated
as the number of transactions containing both items divided by the number of transactions
containing the first item. High confidence indicates that the presence of the first item is a strong
predictor of the presence of the second item.



Lift

Lift is a measure of the strength of the association between two items, taking into account the
frequency of both items in the dataset. It is calculated as the confidence of the association
divided by the support of the second item. Lift is used to compare the strength of the association
between two items to the expected strength of the association if the items were independent.

A lift value greater than 1 indicates that the association between two items is stronger than
expected based on the frequency of the individual items. This suggests that the association may
be meaningful and worth further investigation. A lift value less than 1 indicates that the
association is weaker than expected and may be less reliable or less significant.

Association Rule Mining Algorithms

There are several algorithms used for association rule mining. Some common ones are:

Apriori algorithm

The Apriori algorithm is one of the most widely used algorithms for association rule mining.
It works by first identifying the frequent itemsets in the dataset (itemsets that appear in a certain
number of transactions). It then uses these frequent itemsets to generate association rules,
which are statements of the form "if item A is purchased, then item B is also likely to be
purchased." The Apriori algorithm uses a bottom-up approach, starting with individual items
and gradually building up to more complex itemsets.

FP-Growth algorithm

The FP-Growth (Frequent Pattern Growth) algorithm is another popular algorithm for
association rule mining. It works by constructing a tree-like structure called a FP-tree, which
encodes the frequent itemsets in the dataset. The FP-tree is then used to generate association
rules in a similar manner to the Apriori algorithm. The FP-Growth algorithm is generally faster
than the Apriori algorithm, especially for large datasets.

Apriori Algorithm

The apriori algorithm has become one of the most widely used algorithms for frequent itemset
mining and association rule learning. It has been applied to a variety of applications, including
market basket analysis, recommendation systems, and fraud detection, and has inspired the
development of many other algorithms for similar tasks.



Algorithm Details

The apriori algorithm starts by setting the minimum support threshold. This is the minimum
number of times an item must occur in the database in order for it to be considered a frequent
itemset. The algorithm then filters out any candidate itemsets that do not meet the minimum
support threshold.

The algorithm then generates a list of all possible combinations of frequent itemsets and counts
the number of times each combination appears in the database. The algorithm then generates a
list of association rules based on the frequent itemset combinations.

An association rule is a statement of the form "if item A is present in a transaction, then item
B is also likely to be present". The strength of the association is measured using the confidence
of the rule, which is the probability that item B is present given that item A is present.

The algorithm then filters out any association rules that do not meet a minimum confidence
threshold. These rules are referred to as strong association rules. Finally, the algorithm then
returns the list of strong association rules as output.

Apriori uses a "bottom-up" approach, starting with individual items and gradually combining
them into larger and larger itemsets as it searches for frequent patterns. It also efficiently prunes
the search space by eliminating infrequent itemsets (and, by the Apriori property, all of their
supersets) from consideration.
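Based on the description above, a deliberately unoptimised sketch of the core Apriori loop might look like the following. Itemsets are kept as frozensets, both thresholds are fractions, and a real implementation would count all candidates of one level in a single database pass.

# Minimal Apriori sketch: level-wise frequent-itemset mining plus rule generation.
from itertools import combinations

def apriori(transactions, min_support, min_confidence):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent individual items.
    singletons = {frozenset([item]) for t in transactions for item in t}
    frequent = {s: support(s) for s in singletons if support(s) >= min_support}
    level = list(frequent)

    # Join + prune: repeatedly build (k+1)-itemsets from frequent k-itemsets.
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in level})

    # Rule generation: keep rules whose confidence meets the threshold.
    rules = []
    for itemset in frequent:
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                conf = frequent[itemset] / support(antecedent)
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return frequent, rules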

Pros:

• Apriori is relatively simple and easy to understand.


• It effectively identifies frequent patterns and association rules in large datasets.
• It is efficient at pruning the search space by eliminating infrequent itemsets from
consideration, reducing the algorithm's computational complexity.
• It has been widely used and tested in various applications, making it a well-established
and reliable algorithm.

Cons:

• Apriori is not suitable for very large datasets, as the computational complexity of the
algorithm increases exponentially with the size of the dataset.
• It may not be as effective at identifying patterns in datasets with many rare items or
infrequent transactions.
• It is sensitive to the minimum support and minimum confidence thresholds, which can
affect the quality of the results.
• It may be prone to generating a large number of association rules, which can make it
difficult to interpret the results.



Mining Frequent Patterns in Data Mining
Item Set:
An itemset is a collection or set of items.
Examples:
{Computer, Printer, MSOffice} is a 3-itemset
{Milk, Bread} is a 2-itemset

Similarly, a set of k items is called a k-itemset.

Frequent patterns
These are patterns that appear frequently in a data set. Patterns may be itemsets or
subsequences.
Example: Transaction Database (Dataset)

TID Items

T1 Bread, Coke, Milk

T2 Popcorn, Bread

T3 Bread, Egg, Milk.

T4 Egg, Bread, Coke, Milk

A set of items, such as Milk and Bread, that appears together frequently in a transaction data set is
also called a frequent itemset.
Frequent item set mining leads to the discovery of associations and correlations among items
in large transactional (or) relational data sets.
Finding frequent patterns plays an essential role in mining associations, correlations, and many
other interesting relationships among data. Moreover, it helps in data classification,
clustering, and other data mining tasks.



Associations and correlations
Association rule mining (or) Frequent item set mining finds interesting associations and
relationships (correlations) in large transactional or relational data sets. This rule shows how
frequently an item set occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show
associations between items. It allows retailers to identify relationships between the items that
people buy together frequently.
This process analyzes customer buying habits by finding associations between the different
items that customers place in their “shopping baskets”.

The discovery of these associations can help retailers develop marketing strategies by gaining
insight into which items are frequently purchased together by customers. For instance, if
customers are buying milk, how likely are they to also buy bread (and what kind of bread) on
the same trip to the supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.
Understanding these buying patterns can help to increase sales in several ways. If there is a
pair of items, X and Y, which are frequently bought together:
• Both X and Y can be placed on the same shelf, so that buyers of one item would be
prompted to buy the other.
• Promotional discounts could be applied to just one out of the two items.
• Advertisements on X could be targeted at buyers who purchase Y.
• X and Y could be combined into a new product, such as having Y in flavours of X.
Association rule: If there is a pair of items, X and Y, which are frequently bought together
then association rule is represented as X ⇒ Y.
For example, the information that customers who purchase computers also tend to
buy antivirus software at the same time is represented as



Computer ⇒ Antivirus_Software
Measures to discover interestingness of association rules
Association rules analysis is a technique to discover how items are associated to each other.
There are three measures to discover the interestingness of association rules. Those are:
Support: The support of an item / item set is the number of transactions in which the item /
item set appears, divided by the total number of transactions.
Formula:
Support (A → B) = Frequency (A, B) / N
Where, A, B are items and N is the total number of transactions.


Example: Table-1 Example Transactions

TID Items

T1 Bread, Coke, Milk

T2 Popcorn, Bread

T3 Bread, Egg, Milk.

T4 Egg, Bread, Coke, Milk

T5 Egg, Apple



Confidence: This says how likely item B is purchased when item A is purchased,
expressed as {A → B}. The confidence of the rule is the number of transactions in which
the items A and B appear together, divided by the number of transactions in which item A
appears.
Formula:
Confidence (A → B) = Frequency (A, B) / Frequency (A)
Example: From Table-1, Confidence {Bread → Milk} = 3 / 4 = 0.75

Lift: This says how likely item B is purchased when item A is purchased, while taking into
account how frequently item B is purchased overall; it is expressed for an association rule
{A → B}. The lift is a measure to predict the performance of an association rule
(targeting model).
If lift value is:
• greater than 1 means that item B is likely to be bought if item A is bought,
• less than 1 means that item B is unlikely to be bought if item A is bought,
• equals to 1 means there is no association between items (A and B).



Formula:
Lift (A → B) = Support (A, B) / (Support (A) × Support (B))

Example: From Table-1, the lift of {Bread → Milk} is

Support {Bread, Milk} = 3 / 5 = 0.6
Support {Bread} = 4 / 5 = 0.8
Support {Milk} = 3 / 5 = 0.6
Lift {Bread → Milk} = 0.6 / (0.8 × 0.6) = 1.25

The lift value is greater than 1, which means that item Milk is likely to be bought if item Bread is
bought.
Example: To find Support, Confidence and Lift measures on the following transactional data
set.
Table-2: Example Transactions

TID Items

T1 Bread, Milk

T2 Bread, Diaper, Burger, Eggs

T3 Milk, Diaper, Burger, Coke

T4 Bread, Milk, Diaper, Burger

T5 Bread, Milk, Diaper, Coke



Number of transactions = 5.
Support:
1 – Item Set:
Support {Bread} = 4 / 5 = 0.8 = 80%
Support {Diaper} = 4 / 5 = 0.8 = 80%
Support {Milk} = 4 / 5 = 0.8 = 80%
Support {Burger} = 3 / 5 = 0.6 = 60%
Support {Coke} = 2 / 5 = 0.4 = 40%
Support {Eggs} = 1 / 5 = 0.2 = 20%
2 – Item Set:
Support {Bread, Milk} = 3 / 5 = 0.6 = 60%
Support {Milk, Diaper} = 3 / 5 = 0.6 = 60%
Support {Milk, Burger} = 2 / 5 = 0.4 = 40%
Support {Burger, Coke} = 1 / 5 = 0.2 = 20%
Support {Milk, Eggs} = 0 / 5 = 0.0 = 0%
3 – Item Set:
Support {Bread, Milk, Diaper} = 2 / 5 = 0.4 = 40%
Support {Milk, Diaper, Burger} = 2 / 5 = 0.4 = 40%



Mining Methods
The most famous story about association rule mining is the “beer and diaper.” Researchers
discovered that customers who buy diapers also tend to buy beer. This classic example shows
that there might be many interesting association rules hidden in our daily data.
Association rules help to predict the occurrence of one item based on the occurrences of other
items in a set of transactions.
Association rules Examples
• People who buy bread will also buy milk; represented as{ bread → milk }
• People who buy milk will also buy eggs; represented as { milk → eggs }
• People who buy bread will also buy jam; represented as { bread → jam }
Association Rules discover the relationship between two or more attributes. They are mainly in the
form of: if antecedent, then consequent. For example, a supermarket sees that there are 200
customers on Friday evening. Out of the 200 customers, 100 bought chicken, and out of the
100 customers who bought chicken, 50 have bought Onions. Thus, the association rule would
be- If customers buy chicken then buy onion too, with a support of 50/200 = 25% and a
confidence of 50/100=50%.
Association rule mining is a technique to identify interesting relations between different items.
Association rule mining has to:
• Find all the frequent items.
• Generate association rules from the above frequent itemset.
There are many methods or algorithms to perform Association Rule Mining or Frequent Itemset
Mining, those are:
• Apriori algorithm
• FP-Growth algorithm
Apriori algorithm
The Apriori algorithm is a classic and powerful tool in data mining used to discover frequent
itemsets and generate association rules. Imagine a grocery store database with customer
transactions. Apriori can help you find out which items frequently appear together, revealing
valuable insights like:
• Customers buying bread often buy butter and milk too. (Frequent itemset)
• 70% of people who purchase diapers also buy baby wipes. (Association rule)
How Apriori algorithm works:
• Bottom-up Approach: Starts with finding frequent single items, then combines them
to find frequent pairs, triplets, and so on.
• Apriori Property: If a smaller itemset isn't frequent, none of its larger versions can be
either. This "prunes" the search space for efficiency.
• Support and Confidence: Two key measures used to define how often an itemset
appears and how strong the association between items is.
Limitations for Apriori algorithm
• Can be computationally expensive for large datasets.
• Sensitive to minimum support and confidence thresholds.
FP-Growth algorithm
FP-Growth stands for Frequent Pattern Growth, and it's a smarter sibling of the Apriori
algorithm for mining frequent itemsets in data. But instead of brute force, it uses a clever
strategy to avoid generating and testing tons of candidate sets, making it much faster and more
memory-efficient.
Here's its secret weapon:
• Frequent Pattern Tree (FP-Tree): This special data structure efficiently stores the
frequent itemsets and their relationships. Think of it as a compressed and organized
representation of your grocery store database.
• Pattern Fragment Growth: Instead of building candidate sets, FP-Growth focuses on
"growing" smaller frequent patterns (fragments) by adding items at their frequent ends.
This avoids the costly generation and scanning of redundant patterns.
Advantages of FP-Growth over Apriori
• Faster for large datasets: No more candidate explosions, just targeted pattern growth.
• Less memory required: The compact FP-Tree minimizes memory usage.
• More versatile: Can easily mine conditional frequent patterns without building new
trees.
When to Choose FP-Growth
• If you're dealing with large datasets and want faster results.
• If memory limitations are a concern.
• If you need to mine conditional frequent patterns.
Remember: Both Apriori and FP-Growth have their strengths and weaknesses. Choosing the
right tool depends on your specific data and needs.



Apriori algorithm
Apriori algorithm was the first algorithm that was proposed for frequent itemset mining. It
was introduced by R. Agrawal and R. Srikant.
The algorithm is named Apriori because it uses prior knowledge of frequent itemset
properties.
Frequent Item Set
• A frequent itemset is an itemset whose support value is greater than or equal to a threshold
value (minimum support).
Apriori algorithm uses frequent itemsets to generate association rules. To improve the
efficiency of level-wise generation of frequent itemsets, an important property is used
called Apriori property which helps by reducing the search space.
Apriori Property
• All subsets of a frequent itemset must be frequent (Apriori property).
• If an itemset is infrequent, all its supersets will be infrequent.
Steps in Apriori algorithm
Apriori algorithm is a sequence of steps to be followed to find the most frequent itemset in the
given database. A minimum support threshold is given in the problem or it is assumed by
the user.
The steps followed in the Apriori Algorithm of data mining are:
Join Step: This step generates candidate (k+1)-itemsets from the frequent k-itemsets by joining them
with one another.
Prune Step: This step scans the database to count the support of each candidate itemset. If a candidate
does not meet the minimum support, it is regarded as infrequent and is removed. This step
is performed to reduce the size of the candidate itemsets.
The join and prune steps are repeated iteratively until no new frequent itemsets can be generated; a
short sketch of one such iteration is given below.
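A minimal sketch of one iteration (join, Apriori-property prune, then support counting); frequent_k is assumed to be a set of frozensets of equal size, and the function names are illustrative.

# Sketch of one Apriori iteration: join, prune, and support filtering.
from itertools import combinations

def join_step(frequent_k):
    # Candidate (k+1)-itemsets formed by uniting pairs of frequent k-itemsets.
    k = len(next(iter(frequent_k)))
    return {a | b for a in frequent_k for b in frequent_k if len(a | b) == k + 1}

def prune_step(candidates, frequent_k):
    # Drop any candidate that has an infrequent k-subset (Apriori property).
    return {c for c in candidates
            if all(frozenset(sub) in frequent_k for sub in combinations(c, len(c) - 1))}

def filter_by_support(candidates, transactions, min_count):
    # Keep only candidates whose support count meets the minimum support count.
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_count}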



Apriori Algorithm Example
Consider the following dataset and find frequent item sets and generate association rules for
them. Assume that the minimum support threshold (s) is 50% and the minimum confidence threshold
(c) is 80%.

Transaction List of items

T1 I1, I2, I3

T2 I2, I3, I4

T3 I4, I5

T4 I1, I2, I4

T5 I1, I2, I3, I5

T6 I1, I2, I3, I4

Solution
Finding frequent item sets:
Support threshold=50% ⇒ 0.5*6 = 3 ⇒ min_sup = 3
Step-1:
(i) Create a table containing support count of each item present in dataset – Called C1
(candidate set).

Item Count

I1 4

I2 5

I3 4

I4 4

I5 2



(ii) Prune Step: Compare candidate set item’s support count with minimum support count.
The above table shows that item I5 does not meet min_sup = 3, so it is removed; only I1,
I2, I3 and I4 meet the min_sup count.
This gives us the following item set L1.

Item Count

I1 4

I2 5

I3 4

I4 4

Step-2:
(i) Join step: Generate candidate set C2 (2-itemset) using L1.And find out the occurrences of
2-itemset from the given dataset.

Item Count

I1, I2 4

I1, I3 3

I1, I4 2

I2, I3 4

I2, I4 3

I3, I4 2

(ii) Prune Step: Compare candidate set item’s support count with minimum support count.
The above table shows that the item sets {I1, I4} and {I3, I4} do not meet min_sup = 3, so
they are removed.
This gives us the following item set L2.



Item Count

I1, I2 4

I1, I3 3

I2, I3 4

I2, I4 3

Step-3:
(i) Join step: Generate candidate set C3 (3-itemset) using L2.And find out the occurrences of
3-itemset from the given dataset.

Item Count

I1, I2, I3 3

I1, I2, I4 2

I1, I3, I4 1

I2, I3, I4 2

(ii) Prune Step: Compare candidate set item’s support count with minimum support count.
The above table shows that the itemsets {I1, I2, I4}, {I1, I3, I4} and {I2, I3, I4} do not meet
min_sup = 3, so they are removed. Only the item set {I1, I2, I3} meets the min_sup count.

Generate Association Rules:


Thus, we have discovered all the frequent item sets. Now we need to generate strong
association rules (rules that satisfy the minimum confidence threshold) from the frequent item sets.
For that we need to calculate the confidence of each rule.
The given confidence threshold is 80%.
All possible association rules from the frequent item set {I1, I2, I3}, with their confidence values, are:
• {I1, I2} → I3: Confidence = 3 / 4 = 75% (rejected, below 80%)
• {I1, I3} → I2: Confidence = 3 / 3 = 100% (strong rule)
• {I2, I3} → I1: Confidence = 3 / 4 = 75% (rejected)
• I1 → {I2, I3}: Confidence = 3 / 4 = 75% (rejected)
• I2 → {I1, I3}: Confidence = 3 / 5 = 60% (rejected)
• I3 → {I1, I2}: Confidence = 3 / 4 = 75% (rejected)
Among these, only the rule {I1, I3} → I2 meets the 80% confidence threshold, so it is a strong
association rule.


Exercise1: Apriori Algorithm

TID Items

T1 I1, I2, I5

T2 I2, I4

T3 I2, I3

T4 I1, I2, I4

T5 I1, I3

T6 I2, I3

T7 I1, I3



TID Items

T8 I1, I2, I3, I5

T9 I1, I2, I3

Consider the above dataset and find frequent item sets and generate association rules for
them. Assume that the minimum support count is 2 and the minimum confidence threshold (c) is 50%.

Exercise2: Apriori Algorithm

TID Items

T1 {milk, bread}

T2 {bread, sugar}

T3 {bread, butter}

T4 {milk, bread, sugar}

T5 {milk, bread, butter}

T6 {milk, bread, butter}

T7 {milk, sugar}

T8 {milk, sugar}

T9 {sugar, butter}

T10 {milk, sugar, butter}

T11 {milk, bread, butter}



Consider the above dataset and find frequent item sets and generate association rules for
them. Assume that the minimum support count is 3 and the minimum confidence threshold (c) is 60%.

Drawbacks of Apriori Algorithm


1. Using Apriori needs a generation of candidate itemsets. These candidate itemsets may be large
in number if the number of items in the database is huge.
2. Apriori needs multiple scans of the database to check the support of each itemset
generated and this leads to high costs.
These drawbacks can be overcome using the FP growth algorithm.

Frequent Pattern Growth Algorithm


This algorithm is an improvement to the Apriori method. A frequent pattern is generated
without the need for candidate generation. FP growth algorithm represents the database in the
form of a tree called a frequent pattern tree or FP tree.
This tree structure will maintain the association between the itemsets. The database is
fragmented using one frequent item. This fragmented part is called “pattern fragment”. The
itemsets of these fragmented patterns are analyzed. Thus with this method, the search for
frequent itemsets is reduced comparatively.
FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the
database. The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP
tree represents an item of the itemset.
The root node represents null while the lower nodes represent the itemsets. The association of
the nodes with the lower nodes that is the itemsets with the other itemsets are maintained while
forming the tree.
Frequent Pattern Algorithm Steps
The frequent pattern growth method lets us find the frequent pattern without candidate
generation.
Let us see the steps followed to mine the frequent pattern using frequent pattern growth
algorithm:
1) The first step is to scan the database to find the occurrences of the itemsets in the database.
This step is the same as the first step of Apriori. The count of 1-itemsets in the database is
called support count or frequency of 1-itemset.
2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the items in it. The item with the maximum support count is taken at the top,
the next item with a lower count below it, and so on. It means that the branch of the tree is constructed
with the items of the transaction in descending order of their support counts.
4) The next transaction in the database is examined. The itemsets are ordered in descending
order of count. If any itemset of this transaction is already present in another branch (for
example in the 1st transaction), then this transaction branch would share a common prefix to
the root.
This means that the common itemset is linked to the new node of another itemset in this
transaction.
5) Also, the count of the itemset is incremented as it occurs in the transactions. Both the
common node and new node count is increased by 1 as they are created and linked according
to transactions.
6) The next step is to mine the created FP Tree. For this, the lowest node is examined first along
with the links of the lowest nodes. The lowest node represents the frequency pattern length 1.
From this, traverse the path in the FP Tree. This path or paths are called a conditional pattern
base.
Conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring
with the lowest node (suffix).
7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
8) Frequent Patterns are generated from the Conditional FP Tree.
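A minimal construction-only sketch of steps 1 to 5 (mining of conditional pattern bases is omitted for brevity; class and variable names are illustrative):

# Minimal FP-tree construction sketch (steps 1-5 above); mining is not shown.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_count):
    # Step 1: count 1-itemsets and keep only the frequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {item for item, c in counts.items() if c >= min_count}

    # Step 2: the root of the tree represents null.
    root = FPNode(None, None)

    # Steps 3-5: insert each transaction with its items ordered by descending
    # support count, sharing common prefixes and incrementing counts.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, counts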

Example Of FP-Growth Algorithm


Support threshold=50%, Confidence= 60%
Table 1
Transaction List of items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3



1. Count of each item
Table 2
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2

2. Sort the items in descending order of their support counts (I5 is dropped, since its count of 2 is below min_sup).


Table 3
Item Count
I2 5
I1 4
I3 4
I4 4

3. Build FP Tree
1. Considering the root node null.
2. The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1}, {I3:1},
where I2 is linked as a child to root, I1 is linked to I2 and I3 is linked to I1.
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked to I2 and I4
is linked to I3. But this branch would share I2 node as common as it is already used in
T1.
4. Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is linked as a child to
I3. The count is {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node,
hence it will be incremented by 1. Similarly I1 will be incremented by 1 as it is already
linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
7. T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2},
{I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3},
{I4:1}.

4. Mining of FP-tree is summarized below:


1. The lowest node item I5 is not considered as it does not have a min support count, hence
it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2, I1, I3, I4: 1} and {I2, I3, I4: 1}.
Therefore, considering I4 as the suffix, the prefix paths will be {I2, I1, I3: 1}, {I2, I3: 1}.
This forms the conditional pattern base.
3. The conditional pattern base is considered a transaction database, an FP-tree is
constructed. This will contain {I2:2, I3:2}, I1 is not considered as it does not meet the
min support count.
4. This path will generate all combinations of frequent patterns :
{I2,I4:2},{I3,I4:2},{I2,I3,I4:2}
5. For I3, the prefix path would be: {I2, I1: 3}, {I2: 1}; this will generate a 2-node FP-tree:
{I2:4, I1:3} and the frequent patterns generated are: {I2, I3: 4}, {I1, I3: 3}, {I2, I1, I3: 3}.
6. For I1, the prefix path would be: {I2:4} this will generate a single node FP-tree: {I2:4}
and frequent patterns are generated: {I2, I1:4}.

Item Conditional Pattern Base Conditional FP-tree Frequent Patterns Generated


I4 {I2,I1,I3:1},{I2,I3:1} {I2:2, I3:2} {I2,I4:2},{I3,I4:2},{I2,I3,I4:2}
I3 {I2,I1:3},{I2:1} {I2:4, I1:3} {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}
I1 {I2:4} {I2:4} {I2,I1:4}



The diagram given below depicts the conditional FP tree associated with the conditional node
I3.

Advantages Of FP Growth Algorithm


1. This algorithm needs to scan the database only twice when compared to Apriori
which scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages Of FP-Growth Algorithm
1. FP Tree is more cumbersome and difficult to build than Apriori.
2. It may be expensive.
3. When the database is large, the algorithm may not fit in the shared memory.

FP Growth vs Apriori

Pattern Generation
• FP Growth generates patterns by constructing an FP-tree.
• Apriori generates patterns by pairing the items into singletons, pairs and triplets.

Candidate Generation
• FP Growth does not generate candidate itemsets.
• Apriori uses candidate generation.

Process
• FP Growth is faster; its runtime increases linearly with an increase in the number of itemsets.
• Apriori is comparatively slower; its runtime increases exponentially with an increase in the number of itemsets.

Memory Usage
• FP Growth saves a compact version of the database (the FP-tree) in memory.
• Apriori saves the candidate combinations in memory.

Compact Representation of Frequent Item Sets: Maximal Frequent Item Set


In data mining, particularly in the context of association rule learning, frequent item sets are
used to identify sets of items that frequently occur together in transactional data. A maximal
frequent item set (MFI) is a key concept within this framework, representing a compact way
to capture the most relevant item sets.
1. Definitions
• Frequent Item Set: A set of items that appears in a dataset with a frequency above a
specified threshold (minimum support).
• Maximal Frequent Item Set (MFI): A frequent item set that is not a subset of any
other frequent item set. In other words, if an item set is frequent and adding any other
item to it would make it infrequent, it is considered maximal.
2. Importance of Maximal Frequent Item Sets
• Compactness: MFIs provide a more compact representation of frequent item sets,
reducing the number of item sets to analyze while retaining meaningful relationships.
• Efficiency: By focusing on MFIs, algorithms can improve performance and reduce
computational complexity in mining association rules.
• Insightful Patterns: MFIs help identify the most relevant combinations of items,
which can be particularly valuable in market basket analysis and recommendation
systems.

3. Example of Maximal Frequent Item Sets


A. Transaction Dataset
Consider a simple transaction dataset:

Transaction ID Items

1 {A, B, C}

2 {A, B}



Transaction ID Items

3 {A, C, D}

4 {B, C}

5 {A, C}

B. Finding Frequent Item Sets


Assuming a minimum support of 3, we first identify the frequent item sets:
1. Single Items:
o A: 4
o B: 3
o C: 4
o D: 1
Frequent items: A, B, C (support >= 3)
2. Pairs of Items:
o {A, B}: 2
o {A, C}: 3
o {A, D}: 1
o {B, C}: 3
o {B, D}: 0
o {C, D}: 1
Frequent pairs: {A, C}, {B, C} (support >= 3)
3. Triplets of Items:
o {A, B, C}: 2
o {A, B, D}: 0
o {A, C, D}: 1
o {B, C, D}: 0
No triplets meet the support threshold.



C. Maximal Frequent Item Sets
From the frequent item sets identified, we can derive the maximal frequent item sets:
• Maximal Frequent Item Sets:
o {A, C}
o {B, C}
These sets are maximal because:
• {A, C} cannot be extended by adding any other frequent item without losing its status
as frequent.
• {B, C} is similarly maximal.
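Given any collection of frequent itemsets (for instance, the output of the Apriori sketch earlier), the maximal ones can be filtered out in a few lines. This is only a post-processing sketch, not a dedicated maximal-itemset mining algorithm.

# Sketch: keep only the maximal frequent itemsets (no frequent proper superset exists).
def maximal_itemsets(frequent):
    # Keep itemsets that are not a proper subset of any other frequent itemset.
    frequent = [frozenset(s) for s in frequent]
    return [s for s in frequent if not any(s < other for other in frequent)]

frequent = [{"A"}, {"B"}, {"C"}, {"A", "C"}, {"B", "C"}]
print(maximal_itemsets(frequent))   # -> {A, C} and {B, C}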

4. Algorithm for Finding Maximal Frequent Item Sets


The process of finding MFIs can be done using algorithms such as:
• Apriori Algorithm:
o Generate candidate item sets and prune based on support.
o After obtaining frequent item sets, identify MFIs by checking if they can be
extended.
• FP-Growth Algorithm:
o Construct a frequent pattern tree (FP-tree) and mine the tree for frequent item
sets.
o Extract MFIs directly from the FP-tree structure.
5. Applications of Maximal Frequent Item Sets
• Market Basket Analysis: Understanding which products are frequently purchased
together can inform marketing strategies and store layouts.
• Recommendation Systems: Suggesting items based on frequent combinations can
enhance user experience.
• Web Usage Mining: Analyzing user behavior on websites to improve navigation and
content delivery.

Maximal Frequent Item Sets (MFIs) serve as an essential tool in data mining, offering a
compact representation of the most significant associations within a dataset. By focusing on
MFIs, analysts can gain insights into patterns of behavior while optimizing resource usage and
computational efficiency. Understanding and applying MFIs can lead to powerful outcomes in
various domains, including marketing, e-commerce, and web analytics.



Closed Frequent Item Set
In the context of data mining and association rule learning, closed frequent item sets are an
important concept that enhances the efficiency and effectiveness of discovering patterns
in transactional datasets. They provide a more compact representation of frequent item sets
while retaining the essential information about their support.
1. Definitions
• Frequent Item Set: A collection of items that appears together in a dataset with a
frequency above a specified threshold (minimum support).
• Closed Frequent Item Set: A frequent item set is considered closed if there are no
superset item sets that have the same support count. In other words, if an item set is
frequent and no item can be added to it without decreasing its support, it is considered
closed.
2. Importance of Closed Frequent Item Sets
• Compactness: By focusing on closed frequent item sets, the number of item sets that
need to be considered is reduced, providing a more efficient way to represent frequent
patterns without losing information.
• Reduction in Redundancy: Closed frequent item sets eliminate redundancy since all
supersets of a closed frequent item set will have a support count greater than or equal
to it.
• Insightful Patterns: Closed frequent item sets capture all the essential information
about the frequent item sets, making them useful for generating association rules.

3. Example of Closed Frequent Item Sets


Let's go through a detailed example to illustrate the concept of Closed Frequent Item
Sets (CFIs) using a simple transaction dataset.
Let's break down the concept of Closed Frequent Item Sets with a clearer example, step by
step:
1. Transaction Dataset
Here is a simplified dataset representing transactions made by customers:

Transaction ID Items

1 {A, B, C}

2 {A, B}



Transaction ID Items

3 {A, C}

4 {B, C}

5 {A, B, C, D}

2. Minimum Support Threshold


Set the minimum support threshold to 2. We will only consider item sets that appear in at
least 2 transactions.

3. Finding Frequent Item Sets


A. Single Items
Count the occurrences of individual items:

Item Count

A 4

B 4

C 4

D 1

• Frequent Items (support >= 2): A, B, C


B. Pairs of Items
Next, count pairs of items:

Item Pair Count

{A, B} 3

{A, C} 3



Item Pair Count

{A, D} 1

{B, C} 3

{B, D} 0

{C, D} 0

• Frequent Pairs (support >= 2): {A, B}, {A, C}, {B, C}

C. Triplets of Items
Now, count triplets of items:

Item Triplet Count

{A, B, C} 2

{A, B, D} 1

{A, C, D} 1

{B, C, D} 0

• Frequent Triplets (support >= 2): {A, B, C}

4. Identifying Closed Frequent Item Sets


Now we identify the Closed Frequent Item Sets.
• A closed frequent item set is a frequent item set for which there are no supersets
with the same support.
Frequent Item Sets and Their Supersets:

Item Set Support Supersets

{A} 4 {A, B}, {A, C}, {A, B, C}, {A, B, D}



Item Set Support Supersets

{B} 4 {A, B}, {B, C}, {A, B, C}

{C} 4 {A, C}, {B, C}, {A, B, C}

{A, B} 3 {A, B, C}

{A, C} 3 {A, B, C}

{B, C} 3 {A, B, C}

{A, B, C} 2 None with the same support

Closed Frequent Item Sets:


• {A, B}: Support = 3, no superset with the same support.
• {A, C}: Support = 3, no superset with the same support.
• {B, C}: Support = 3, no superset with the same support.
• {A, B, C}: Support = 2, also closed, because its only superset {A, B, C, D} appears in just one
transaction and therefore does not have the same support.
5. Final Summary of Closed Frequent Item Sets
The Closed Frequent Item Sets identified from the dataset are:
• {A, B} (support = 3)
• {A, C} (support = 3)
• {B, C} (support = 3)
• {A, B, C} (support = 2)
(The single items {A}, {B} and {C} are also closed, since none of their supersets has the same support.)

This example illustrates how to find Closed Frequent Item Sets from a transaction dataset
clearly. By focusing on closed frequent item sets, we can obtain a more compact representation
of frequent patterns while preserving essential information about their support. This is
particularly useful in applications like market basket analysis, where we want to understand
the relationships between products without redundancy.
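Closedness can be checked the same way once the supports are known. The sketch below assumes a dictionary mapping frozenset itemsets to their support counts; it is a post-processing filter, not a dedicated closed-itemset miner.

# Sketch: keep only closed frequent itemsets (no superset with the same support).
def closed_itemsets(supports):
    # Keep itemsets that have no proper superset with the same support count.
    return [s for s, sup in supports.items()
            if not any(s < other and sup == other_sup
                       for other, other_sup in supports.items())]

supports = {frozenset({"A", "B"}): 3, frozenset({"A", "C"}): 3,
            frozenset({"B", "C"}): 3, frozenset({"A", "B", "C"}): 2}
print(closed_itemsets(supports))    # all four itemsets are closed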



Partition Algorithm in Association Rule Data Mining
The Partition Algorithm is a data mining technique used for discovering frequent
itemsets and generating association rules. It is particularly useful when dealing with large
datasets, as it divides the dataset into smaller subsets (partitions) to improve computational
efficiency.

In association rule mining, the partition algorithm is an approach used to find frequent
itemsets (sets of items that appear together frequently) in large datasets. The idea is to break
the dataset into smaller partitions, find frequent itemsets locally within each partition, and then
combine results to determine the globally frequent itemsets.

Overview of the Partition Algorithm


Key Features
• Divide and Conquer: The algorithm splits the dataset into smaller partitions, processes
each partition to identify frequent itemsets, and then combines the results.
• Scalability: It is effective for large datasets, reducing the overall computational time
compared to traditional methods like Apriori.
• Memory Efficient: By processing smaller partitions, it minimizes memory usage.
Steps of the Partition Algorithm
1. Partitioning the Dataset:
o The dataset is divided into k partitions. Each partition is treated as a
separate dataset.
2. Frequent Itemset Generation:
o For each partition, frequent itemsets are identified using a frequent itemset
mining algorithm (e.g., Apriori).
3. Global Frequent Itemsets:
o The local frequent itemsets from each partition are combined to form a global
frequent itemset. This step may involve counting the frequencies of itemsets
across all partitions.
4. Rule Generation:
o Generate association rules from the global frequent itemsets, evaluating them
based on support and confidence.
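A compact sketch of steps 1 to 3 (the final rule-generation step is the same as in ordinary Apriori and is omitted). The brute-force local miner stands in for Apriori or FP-Growth and is only suitable for tiny partitions; the function names are illustrative.

# Sketch of the Partition algorithm: mine each partition locally, union the
# local frequent itemsets as global candidates, then verify with one global pass.
from itertools import combinations

def local_frequent(transactions, min_support):
    # Brute-force frequent itemsets of a single partition.
    items = sorted(set().union(*transactions))
    found = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count >= min_support * len(transactions):
                found.add(frozenset(combo))
    return found

def partition_algorithm(transactions, min_support, k):
    # Step 1: split the database into k roughly equal partitions.
    size = -(-len(transactions) // k)   # ceiling division
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Step 2: locally frequent itemsets per partition, unioned into global candidates.
    candidates = set().union(*(local_frequent(p, min_support) for p in partitions))

    # Step 3: one pass over the full database keeps only globally frequent itemsets.
    n = len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_support * n}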



Example to illustrate the partition algorithm:

Problem Setup
Consider a small dataset of 6 transactions and 5 unique items:

Transaction ID Items Purchased

T1 {A, B, C}

T2 {A, B}

T3 {B, C, D}

T4 {A, B, D}

T5 {A, B, C, D}

T6 {A, C, D}

Step 1: Define Minimum Support


Let's set the minimum support threshold to 50%, meaning an itemset must appear in at
least 3 transactions (since we have 6 transactions).
Step 2: Partition the Dataset
We divide the dataset into two partitions for processing:
• Partition 1: {T1, T2, T3}
• Partition 2: {T4, T5, T6}
Step 3: Identify Local Frequent Itemsets
Partition 1 (T1, T2, T3):

Transaction ID Items Purchased

T1 {A, B, C}

T2 {A, B}

T3 {B, C, D}

Find local frequent 1-itemsets (minimum support in Partition 1: 2 out of 3 transactions):


• {A}: appears in T1, T2 (2 times) → Frequent
• {B}: appears in T1, T2, T3 (3 times) → Frequent
• {C}: appears in T1, T3 (2 times) → Frequent
• {D}: appears in T3 (1 time) → Not frequent
Find local frequent 2-itemsets:
• {A, B}: appears in T1, T2 (2 times) → Frequent
• {A, C}: appears in T1 (1 time) → Not frequent
• {B, C}: appears in T1, T3 (2 times) → Frequent
Frequent itemsets in Partition 1:
• 1-itemsets: {A}, {B}, {C}
• 2-itemsets: {A, B}, {B, C}

Partition 2 (T4, T5, T6):

Transaction ID Items Purchased

T4 {A, B, D}

T5 {A, B, C, D}

T6 {A, C, D}

Find local frequent 1-itemsets (minimum support in Partition 2: 2 out of 3 transactions):


• {A}: appears in T4, T5, T6 (3 times) → Frequent
• {B}: appears in T4, T5 (2 times) → Frequent
• {C}: appears in T5, T6 (2 times) → Frequent
• {D}: appears in T4, T5, T6 (3 times) → Frequent
Find local frequent 2-itemsets:
• {A, B}: appears in T4, T5 (2 times) → Frequent
• {A, C}: appears in T5, T6 (2 times) → Frequent
• {A, D}: appears in T4, T5, T6 (3 times) → Frequent
• {B, C}: appears in T5 (1 time) → Not frequent
• {B, D}: appears in T4, T5 (2 times) → Frequent
Frequent itemsets in Partition 2:
• 1-itemsets: {A}, {B}, {C}, {D}
• 2-itemsets: {A, B}, {A, C}, {A, D}, {B, D}



Step 4: Combine Frequent Itemsets from All Partitions
Now, we combine the frequent itemsets from both partitions:
• From Partition 1: {A}, {B}, {C}, {A, B}, {B, C}
• From Partition 2: {A}, {B}, {C}, {D}, {A, B}, {A, C}, {A, D}, {B, D}
Step 5: Prune Itemsets by Global Support
We now calculate the global support for each combined itemset and check if they meet the
minimum support threshold (3 transactions).
Global Support Count

Itemset Global Support (No. of Transactions) Meets Minimum Support?

{A} Appears in T1, T2, T4, T5, T6 (5 times) Yes

{B} Appears in T1, T2, T3, T4, T5 (5 times) Yes

{C} Appears in T1, T3, T5, T6 (4 times) Yes

{D} Appears in T3, T4, T5, T6 (4 times) Yes

{A, B} Appears in T1, T2, T4, T5 (4 times) Yes

{A, C} Appears in T5, T6 (2 times) No

{A, D} Appears in T4, T5, T6 (3 times) Yes

{B, C} Appears in T1, T3, T5 (3 times) Yes

{B, D} Appears in T4, T5 (2 times) No

Final Global Frequent Itemsets:


After pruning, the globally frequent itemsets that meet the minimum support threshold of 3
transactions are:
• 1-itemsets: {A}, {B}, {C}, {D}
• 2-itemsets: {A, B}, {A, D}, {B, C}

Step-by-Step Recap:
1. Partition the dataset: Split the dataset into smaller subsets.
2. Find local frequent itemsets: Identify frequent 1-itemsets and 2-itemsets in each
partition based on the local minimum support.
3. Combine local frequent itemsets: Merge the frequent itemsets from both partitions.
4. Prune the combined itemsets: Calculate the global support for each itemset across
all transactions, and prune those that do not meet the global minimum support
threshold.
In this example, the globally frequent 2-itemsets that meet the minimum support are: {A, B},
{A, D}, and {B, C}.

Benefits of Partition Algorithm


• Efficient for large datasets: By dividing the data, it reduces the memory
requirements for processing.
• Local computation: It can find frequent itemsets within partitions without needing
the entire dataset loaded at once.
This method is useful when dealing with very large datasets that cannot fit into memory all at
once.

Example: 2
Let's consider a different dataset to illustrate the Partition Algorithm in association rule
mining. This example will help reinforce understanding with a new set of transactions.
Example Dataset
Transaction Dataset
Suppose we have a dataset representing purchases made by customers at an electronics store.
Each transaction consists of various items purchased.

Transaction ID Items Purchased

1 Laptop, Mouse

2 Laptop, Keyboard

3 Mouse, Keyboard, Monitor

4 Laptop, Mouse, Monitor



Transaction ID Items Purchased

5 Keyboard, Monitor

6 Laptop, Mouse, Headphones

7 Laptop, Monitor

8 Mouse, Headphones

9 Keyboard, Mouse

10 Laptop, Headphones

Step 1: Partitioning the Dataset


We will divide this dataset into two partitions for processing.
• Partition 1 (Transactions 1 to 5):
o Transaction 1: Laptop, Mouse
o Transaction 2: Laptop, Keyboard
o Transaction 3: Mouse, Keyboard, Monitor
o Transaction 4: Laptop, Mouse, Monitor
o Transaction 5: Keyboard, Monitor
• Partition 2 (Transactions 6 to 10):
o Transaction 6: Laptop, Mouse, Headphones
o Transaction 7: Laptop, Monitor
o Transaction 8: Mouse, Headphones
o Transaction 9: Keyboard, Mouse
o Transaction 10: Laptop, Headphones
Step 2: Frequent Itemset Generation
We will now find the frequent itemsets in each partition using a minimum support count of 2.
For Partition 1:
Let's count the occurrences of each itemset.



Itemset Count

{Laptop} 3

{Mouse} 4

{Keyboard} 3

{Monitor} 3

{Laptop, Mouse} 3

{Laptop, Keyboard} 2

{Mouse, Keyboard} 2

{Mouse, Monitor} 2

{Keyboard, Monitor} 2

Frequent Itemsets from Partition 1:


• {Laptop}, {Mouse}, {Keyboard}, {Monitor}, {Laptop, Mouse}, {Laptop,
Keyboard}, {Mouse, Keyboard}, {Mouse, Monitor}, {Keyboard, Monitor}
For Partition 2:
Next, we count the occurrences of itemsets in Partition 2.

Itemset Count

{Laptop} 4

{Mouse} 3

{Headphones} 3

{Monitor} 2



Itemset Count

{Laptop, Mouse} 2

{Laptop, Headphones} 2

{Mouse, Headphones} 2

{Keyboard} 1

Frequent Itemsets from Partition 2:


• {Laptop}, {Mouse}, {Headphones}, {Monitor}, {Laptop, Mouse}, {Laptop,
Headphones}, {Mouse, Headphones}
Step 3: Global Frequent Itemsets
Now, we combine the frequent itemsets from both partitions and count their occurrences
across the entire dataset.

Itemset Global Count

{Laptop} 7

{Mouse} 7

{Keyboard} 3

{Monitor} 5

{Headphones} 3

{Laptop, Mouse} 5

{Laptop, Keyboard} 2

{Mouse, Keyboard} 2



Itemset Global Count

{Mouse, Monitor} 2

{Laptop, Headphones} 2

{Mouse, Headphones} 2

{Keyboard, Monitor} 2

Global Frequent Itemsets (with support count >= 2):


• {Laptop}, {Mouse}, {Keyboard}, {Monitor}, {Headphones}, {Laptop, Mouse},
{Laptop, Keyboard}, {Mouse, Keyboard}, {Mouse, Monitor}, {Laptop,
Headphones}, {Mouse, Headphones}, {Keyboard, Monitor}

Step 4: Rule Generation


From the global frequent itemsets, we can now generate association rules. We will calculate
the support and confidence for each rule.
Example Rules
1. Rule: Laptop → Mouse
o Support: 5/10 = 0.5 (50%)
o Confidence: 5/7 = 0.71 (71%)
2. Rule: Mouse → Laptop
o Support: 5/10 = 0.5 (50%)
o Confidence: 5/7 = 0.71 (71%)
3. Rule: Laptop → Headphones
o Support: 2/10 = 0.2 (20%)
o Confidence: 2/7 = 0.29 (29%)
4. Rule: Mouse → Headphones
o Support: 2/10 = 0.2 (20%)
o Confidence: 2/7 = 0.29 (29%)
5. Rule: Keyboard → Monitor



o Support: 2/10 = 0.2 (20%)
o Confidence: 2/3 = 0.67 (67%)

This example illustrates the Partition Algorithm using a different dataset focused on customer
purchases in an electronics store. By partitioning the dataset and analyzing frequent itemsets,
we can generate valuable association rules that help understand customer behavior.
