Web Mining Unit 1

The document provides an overview of web data mining, detailing the processes and techniques involved in data mining and web mining, including their applications and tasks. It explains the different categories of web mining such as web content, structure, and usage mining, as well as the concepts of association rules and frequent pattern mining. Additionally, it discusses various algorithms used for association rule learning and their applications across different industries.

Uploaded by merlyne.noel

UNIT- I INTRODUCTION TO WEB DATA MINING

DATA MINING
 The process of using software and algorithms to find patterns and relationships in a large
data set is called Data Mining.
 It is also called Knowledge Discovery in Data (KDD).
 Data Mining involves :
a) Machine Learning
b) Artificial Intelligence
c) Data Visualization
d) Statistics
e) Information Retrieval
f) Database
 Data Mining tasks are :
a) Supervised Learning (Classification)
b) Unsupervised Learning (Clustering)
c) Association rule mining
d) Sequential pattern mining
 Data Analysts ( Data Miners ) are professionals who collect, analyze and interpret data to
help businesses to make better decisions. For example,
a) What kind of customers should a business target in its next ad campaign?
b) Which age group is most vulnerable to a particular disease?
c) What behavioral patterns are connected to financial fraud?
 Data Mining involves 3 steps
a) Preprocessing
 Raw data contains irrelevant attributes, duplicates, noise, errors and outliers (a
statistical observation that is markedly different in value from the others in
the sample).
 Preprocessing tasks clean the data and maintain it in a
separate spreadsheet or through a programming language so that the
interpretations won’t be wrong.
b) Data Mining
 The processed data is fed to a mining algorithm which will produce
patterns or knowledge.
c) Post Processing
 Not all discovered patterns are useful
 This step will identify useful patterns with the help of various evaluation measures.
 Visualization techniques are used to make decisions.
 Features of Data Mining
a) It can analyze big datasets.
b) It finds patterns and connections in large databases easily.
c) It is used to predict future outcomes and trends.
d) It groups similar types of data together.
e) It provides useful information to make data-based decisions.
f) Data Mining uses many computers to process data faster.

WEB MINING
 Web Mining is the application of Data Mining techniques to automatically discover and
extract information from Web documents and services.
 Example:
 Analyze website traffic
 Analyze user interactions with a website
 How do users navigate the site?
 What content is most popular?
 Where are users dropping off?
 Web mining involves
 Data mining
 Artificial Intelligence
 Machine Learning
 Information Retrieval
 Statistics
Process of Web Mining

CATEGORIES OF WEB MINING

1. Web Content Mining


 Web content mining is the application of extracting useful information from the
content of web documents.
 Web content consists of several types of data – text, image, audio, video etc.
Content data is the group of facts that a web page is designed to convey.
 It can provide effective and interesting patterns about user needs.
 Text documents are related to text mining, machine learning and natural language
processing.
 This mining is also known as text mining. This type of mining performs scanning
and mining of the text, images and groups of web pages according to the content
of the input.

2. Web Structure Mining


 Web structure mining is the application of discovering structure information from
the web.
 The structure of the web graph consists of web pages as nodes, and hyperlinks as
edges connecting related pages.
 It uses graph theory to analyze the nodes and connections in the structure of a
website.
 It identifies relationships between web pages linked by information or direct link
connection.
 The unstructured web data from various websites is transformed into structured,
usable information that businesses can use for their competitive advantage.
 Web structure mining can be very useful to determine the connection between two
commercial websites.

3. Web Usage Mining


 Web usage mining is the application of identifying or discovering interesting
usage patterns from large data sets.
 Web Usage Mining is watching what people do on a website. It involves looking
at web activity logs, like page views, clicks, downloads, session durations, etc., to
understand how users interact with web applications. Thus it can be used to
understand user behavior, preferences, and patterns.
 In web usage mining, user access data on the web is collected in the form of
logs. So, web usage mining is also called log mining.
Algorithm used:
• Page rank algorithm
• HITS algorithm
• Weighted page rank algorithm
• Distance rank algorithm
• Weighted page content rank algorithm
• EigenRumor algorithm
• Time rank algorithm
• Query dependent ranking algorithm

COMPARISON BETWEEN DATA MINING AND WEB MINING

Parameters    Data Mining                              Web Mining

Definition    Data Mining is the process that          Web Mining is the application of data
              attempts to discover patterns and        mining techniques to automatically
              hidden knowledge in large data           discover and extract information from
              sets in any system.                      web documents.

Application   Data Mining is very useful for           Web Mining is very useful for a
              web page analysis.                       particular website and e-service.

Structure     Data Mining gets the information         Web Mining gets the information from
              from explicit structure.                 structured, unstructured and
                                                       semi-structured web pages.

Problem       Clustering, classification,              Web content mining, web structure
Type          regression, prediction,                  mining.
              optimization and control.

Skills        It includes approaches for data          It includes application-level knowledge,
              cleansing and machine learning           data engineering with mathematical
              algorithms, statistics and               modules like statistics and probability.
              probability.

Basic Concepts of Association Rule


 A powerful technique in data mining to discover interesting
relationships between different items or attributes within large
datasets.
 Aim: To identify patterns, correlations or dependencies among
different groups of items, called itemsets, in data, which can then be
used to predict the likelihood of certain items being purchased or used
together.
 One of the most important measures used in association rule mining is
LIFT, which provides a way to determine the strength and significance
of the discovered association rules.
 Association rules are represented as sets of items.
o Eg: {A,B,C} -> where A, B, C are different items in a dataset.
 The goal of association rule mining is to identify rules that have a high
level of support and confidence.
 Association rules are “if-then” statements.
o If an itemset A is present in a transaction, then item B is also
likely to be present.
 Antecedent : The “if” part where the condition is being tested.
 Consequent : The “then” part that denotes the outcome that occurs if
the condition is true.

Types of association rules in data mining


There are multiple types of association rules in data mining. They include the following:
 Generalized- Rules in this category are meant to be general examples of association
rules that provide a high-level overview of what these associations of data points look
like.
 Multilevel- Multilevel association rules separate data points into different levels of
importance known as levels of abstraction -- and distinguish between associations of
more important data points and ones of lower importance.
 Quantitative- This type of association rule is used to describe examples where
associations are made between numerical data points.
 Multi-relational- This type is more in-depth than traditional association rules that
consider relationships between single data points. Multi-relational rules are made across
multiple or multidimensional databases.
Itemset
 A set of items together is called an itemset.
 If an itemset has k items it is called a k-itemset.
 An itemset consists of one or more items. An itemset that occurs frequently is called a
frequent itemset.
 Thus frequent itemset mining is a data mining technique to identify the items that often
occur together.
o For Example, Bread and butter, Laptop and Antivirus software, etc

Frequent Pattern Mining (FPM)


 The frequent pattern mining algorithm is one of the most important techniques of data
mining to discover relationships between different items in a dataset.
These relationships are represented in the form of association rules. It helps to find the
regularities in data.
 FPM has many applications in the field of data analysis, software bugs, cross-marketing,
sale campaign analysis, market basket analysis, etc.
 Frequent itemsets discovered through Apriori have many applications in data mining
tasks, such as finding interesting patterns in the database, sequence mining and, most
important of them, mining of association rules.
 Association rules apply to supermarket transaction data, that is, to examine the customer
behavior in terms of the purchased products. Association rules describe how often the
items are purchased together.

Key Components of Association Rules


 Antecedent: The “if” part of the rule, representing the condition.
 Example: A customer buys bread.
 Consequent: The “then” part of the rule, representing the outcome.
 Example: The customer also buys butter.
 Association rules are derived through algorithms that evaluate the frequency and strength
of these relationships.
 They use metrics like support, confidence, and lift to measure the relevance and
reliability of discovered patterns.
 These rules have applications in various fields, such as retail, healthcare, and marketing,
where analyzing customer behavior or trends is critical for success.

Rule Evaluation Metrics


 Association rules are evaluated using key metrics that determine their relevance, strength,
and reliability.
 These metrics include support, confidence, and lift, which quantify the frequency and
strength of relationships between data items.
1. Support
 Support measures how frequently an itemset (both antecedent and consequent) appears in
the dataset. It provides an indication of how common a particular association is.
Formula:
Support(A -> B) = (Transactions containing both A and B) / (Total transactions)
 Example: If bread and butter appear together in 100 out of 1,000 transactions, the
support is 100 / 1,000 = 0.1 (10%).
 A higher support value indicates a more frequently occurring pattern in the dataset.

2. Confidence
 Confidence measures the likelihood of the consequent occurring given that the antecedent
has already occurred. It evaluates the reliability of the rule.
Formula:
Confidence(A -> B) = Support(A and B) / Support(A)

Example: If 70% of customers who buy bread also buy butter, the confidence is 0.7 (70%).

 Higher confidence suggests a stronger relationship between the antecedent and


consequent.
3. Lift
 Lift measures the strength of an association compared to its random occurrence in the
dataset.
 It identifies how much more likely the antecedent and consequent are to appear together
than independently.
Formula:
Lift(A -> B) = Confidence(A -> B) / Support(B)
 if Lift=1, two items are independent.
 if Lift>1, indicates positive association.
 if Lift<1, indicates negative association.

Example
 A lift value greater than 1 indicates a strong positive association, while a value equal to 1
suggests no association.
 For instance, if the lift is 1.5, it means the antecedent makes the consequent 1.5 times
more likely.
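As a minimal sketch, the three metrics above can be computed over a list of transactions represented as Python sets of items. The function names here are illustrative, not from the text.

```python
# Hedged sketch: support, confidence, and lift over a transaction list.
# Each transaction is a set of items.

def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Likelihood of the consequent given the antecedent:
    support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Strength of the rule relative to chance:
    confidence(A -> B) / support(B)."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))
```

A lift above 1 from these functions indicates the antecedent and consequent co-occur more often than independence would predict, matching the interpretation above.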
Association Rule – Working
 Association rule learning is a multi-step process designed to identify meaningful patterns
and relationships in large datasets. It involves two main stages:
1. Identifying Frequent Itemsets:
 The process begins by identifying frequent itemsets—combinations of items that
appear together in transactions with a frequency above a predefined threshold.
 Metrics like support are used to measure how often these itemsets occur in the
dataset. For example, a frequent itemset might reveal that bread and butter are
purchased together in 10% of transactions.
2. Generating Association Rules
 Once frequent itemsets are identified, association rules are generated.
 These rules take the form of if-then statements that describe relationships between
items (e.g., “If a customer buys bread, they are likely to buy butter”).
 Metrics such as confidence and lift are applied to evaluate the strength and
reliability of these rules.
Iterative Refinement
 The process is iterative, with thresholds for support and confidence adjusted to refine the
rules.
 This ensures that only the most significant and actionable rules are selected. For instance,
a rule with low confidence may be excluded from further analysis.
 Through this systematic approach, association rule learning uncovers valuable insights
from raw data, enabling organizations to make data-driven decisions.

Types of Association Rule Learning Algorithms


 Several algorithms are used for association rule learning, each with unique strengths and
applications. The three most commonly used algorithms are:
1. Apriori Algorithm
 The Apriori algorithm employs a breadth-first search approach to identify frequent
itemsets. It relies on the principle that all subsets of a frequent itemset must also be
frequent, reducing the search space.
 Advantage: Simple to implement and effective for small datasets with low
dimensionality.
 Limitation: Performance degrades significantly with large or dense datasets due to
repeated scanning of the database.
2. Eclat Algorithm
 The Eclat algorithm uses a depth-first search strategy to discover frequent itemsets.
 Instead of scanning the database multiple times, it represents transactions as vertical
itemsets and directly computes intersections.
 Advantage: Efficient for datasets with sparse data or where there are fewer
frequent itemsets.
3. FP-Growth Algorithm
 The FP-Growth (Frequent Pattern Growth) algorithm leverages a prefix-tree
structure called the FP-tree to represent transactional data compactly.
 Unlike Apriori, it avoids generating candidate itemsets explicitly, making it faster and
more efficient.
 Advantage: Significantly faster and more memory-efficient than Apriori,
especially for large datasets.

APRIORI PROPERTY / DOWNWARD CLOSURE PROPERTY

If an itemset appears frequently in a dataset, then all its subsets
must also be frequent.
Conversely, if an itemset is identified as infrequent, then its supersets
are also considered infrequent.
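This property is what lets Apriori prune its candidate itemsets. As an illustrative sketch (helper name is mine), a k-candidate is kept only if every one of its (k-1)-subsets was found frequent at the previous level:

```python
from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    """Downward-closure pruning: keep a candidate k-itemset only if
    all of its (k-1)-subsets are in the frequent (k-1)-itemsets."""
    frequent_prev = {frozenset(s) for s in frequent_prev}
    kept = []
    for cand in candidates:
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(cand, len(cand) - 1)):
            kept.append(cand)
    return kept
```

For example, with frequent 2-itemsets {A,B}, {A,C}, {B,C} and {A,D}, the candidate {A,B,C} survives, while {A,B,D} is pruned because its subset {B,D} is not frequent.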

Applications of Association Rules


 Association rules are widely applied across various industries to uncover patterns and
relationships in data, enabling better decision-making and operational efficiency.
1. Retail and Market Basket Analysis:
 Retailers use association rules to identify frequently purchased product
combinations, helping them optimize store layouts or create product bundles to
increase sales.
 Example: A supermarket discovers that customers who buy bread often purchase
butter and jam, leading to strategic placement of these items together.
2. Healthcare:
 In healthcare, association rules help discover co-occurrence patterns in symptoms,
aiding in diagnostic processes and treatment plans.
 Example: Identifying that patients with high blood pressure often have a higher risk
of developing diabetes can guide preventative care strategies.
3. E-Commerce and Recommendation Systems
 E-commerce platforms leverage association rules to build recommendation systems
that enhance user experiences and drive sales.
 Example: Amazon’s “Customers who bought this also bought” feature suggests
complementary products, boosting cross-selling opportunities.
4. Fraud Detection
o Association rules are used in financial services to identify unusual patterns in
transaction data, which can help detect fraudulent activities.
o Example: Flagging transactions that deviate significantly from established
spending patterns for further investigation.

Example of Association Rules


 Consider a small transaction dataset where customers purchase items like bread, butter,
and milk.
Dataset Example:

Transaction ID Items Purchased

1 Bread, Butter

2 Bread, Milk

3 Bread, Butter, Milk

4 Milk

5 Bread, Butter

Rule Discovery Process:


Rule Example: “If bread is purchased, then butter is likely to be purchased.”
1. Support Calculation:
Support = Transactions containing both bread and butter ÷ Total transactions
Support = 3 / 5 = 0.6 (60%)
2. Confidence Calculation:
Confidence = Support of bread and butter ÷ Support of bread
Confidence = 0.6 / 0.8 = 0.75 (75%)
3. Lift Calculation:
Lift = Confidence ÷ Support of butter
Lift = 0.75 / 0.6 = 1.25

 A lift value greater than 1 indicates a positive association between bread and butter.
 This example demonstrates how association rules are derived and evaluated, providing
actionable insights from transactional data.
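The same calculation can be reproduced directly from the five-transaction dataset above; the `sup` helper is my own shorthand:

```python
# The five transactions from the dataset example above, used to verify the
# metrics for the rule "if bread, then butter".
transactions = [
    {"bread", "butter"},          # T1
    {"bread", "milk"},            # T2
    {"bread", "butter", "milk"},  # T3
    {"milk"},                     # T4
    {"bread", "butter"},          # T5
]

def sup(items):
    """Support: fraction of transactions containing all given items."""
    return sum(items <= t for t in transactions) / len(transactions)

support_rule = sup({"bread", "butter"})          # 3/5 = 0.6
confidence_rule = support_rule / sup({"bread"})  # 0.6 / 0.8 = 0.75
lift_rule = confidence_rule / sup({"butter"})    # 0.75 / 0.6 = 1.25
```

The lift of 1.25 is above 1, confirming the positive association between bread and butter.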

Association rule algorithms


 Examples of popular data mining algorithms that use association rules include the
following:
1. AIS
o This algorithm generates and counts itemsets as it scans data.
o In transaction data, AIS determines which large itemsets contain a transaction.
o New candidate itemsets are created by extending the large itemsets with other
items in the transaction data.
2. Apriori
o The Apriori algorithm takes an iterative approach, where it scans a database to
apply association rules, focusing on identifying large itemsets with subsets of
similar features as potential candidates for these rules.
o Itemsets that aren't large are discarded, and the remaining itemsets are the
candidates the algorithm considers.
o The Apriori algorithm considers any subset of a frequent itemset to also be a
frequent itemset. With this approach, the algorithm reduces the number of
candidates being considered by only exploring itemsets with support counts
greater than the minimum support count.
3. SETM
o This algorithm also generates candidate itemsets as it scans a database, but it
accounts for the itemsets at the end of its scan.
o New candidate itemsets are generated the same way as with the AIS algorithm,
but the transaction ID of the generating transaction is saved with the candidate
itemset in a sequential data structure.
o At the end of the scan, the support count of candidate itemsets is created by
aggregating the sequential structure.
4. ECLAT
o This name stands for Equivalence Class Clustering and Bottom-up Lattice
Traversal. The ECLAT algorithm is a version of the Apriori algorithm that
explores complex classes of itemsets first and then repeatedly boils them down
and simplifies them.
5. FP-growth
o FP stands for frequent pattern. The FP-growth algorithm uses a tree structure,
called an FP-tree, to map out relationships between individual items to find the
most frequently recurring patterns.

Apriori Algorithm for Mining Association


The Apriori algorithm is an algorithm that is used in mining frequent itemsets and
relevant association rules. Generally, the Apriori algorithm operates on a database containing a
huge number of transactions. For example, the items customers buy at a Big Bazaar.

The Apriori algorithm helps customers to buy their products with ease and increases the sales
performance of the particular store.

Components of Apriori algorithm


The following three components comprise the Apriori algorithm.

 Support
 Confidence
 Lift

Application of the Apriori Algorithm:

 The Apriori algorithm has a very important contribution in various fields.

 This algorithm has gained a vital place in recent years.
 The Apriori algorithm is being used in different industries for data mining and data handling.
 The domains where the Apriori algorithm is helpful are discussed below.
1. Mobile e-commerce:
 Big data is capable of helping mobile e-commerce companies to provide a convenient,
simple, and customised shopping experience with the help of the Apriori algorithm. The
company's product recommendation accuracy increases, which brings an excellent
customer experience. Thus, the customer engagement as well as the sales of the company
increase.
2. Education:
 Researchers and students use this algorithm to store and analyse their data. The
educational institution stores and monitors students' data like age, gender, traits, grades,
characteristics, performance, parent details etc.
3. Forestry:
 Nowadays it is important to focus upon forestry and wildlife. Apriori algorithm is being
used to store, analyse, and manage the minute details of every flora and fauna of a
particular territory.
4. Medical:
 To treat an existing patient a doctor needs to retrieve lots of past data related to the
patient. In this case, hospitals use the Apriori algorithm generally to manage the database
of patients without mixing it with other patients.
5. Market Basket Analysis:
 Retailers can analyse customers' purchase patterns by monitoring the combinations of
products that they purchase together frequently. This information can be useful to
recommend such products together to increase sales. The Apriori algorithm is also used
here.
6. Improving Website's design and ranking:
 Using the Apriori algorithm we can analyse users' navigation patterns on websites.
Using such information we can improve website design and user experience.
7. Tour and travel:
 The Apriori algorithm analyses the booking patterns of tourists. This algorithm has a very
important role in the tourism sector.
Minimum Support : The minimum number of transactions in which an
itemset must appear for it to be considered frequent.

Example:
Find the frequent itemsets and generate association rules on this.
Assume that minimum support threshold (s = 33.33%) and minimum
confident threshold (c = 60%)

1. Generate Frequent 1-itemset

2. Generate Frequent 2-itemset

3. Generate Frequent 3-itemset

There is only one itemset with minimum support 2. So only one itemset is
frequent.

Frequent Itemset (I) = {Hot Dogs, Coke, Chips}


4. Generate Association rules

TO FORM ASSOCIATION RULE


For every non-empty proper subset S of I, the
association rule is
S → (I − S)


 Non Empty subsets are S= {{Hot Dogs},
{Coke},
{Chips},
{Hot Dogs, Coke},
{Hot Dogs, Chips},
{Coke, Chips}}
[Hot Dogs, Coke] => [Chips]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Hot Dogs, Coke)
           = 2/2 * 100 = 100%      // Selected
[Hot Dogs, Chips] => [Coke]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Hot Dogs, Chips)
           = 2/2 * 100 = 100%      // Selected
[Coke, Chips] => [Hot Dogs]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Coke, Chips)
           = 2/3 * 100 = 66.67%    // Selected
[Hot Dogs] => [Coke, Chips]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Hot Dogs)
           = 2/4 * 100 = 50%       // Rejected
[Coke] => [Hot Dogs, Chips]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Coke)
           = 2/3 * 100 = 66.67%    // Selected
[Chips] => [Hot Dogs, Coke]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Chips)
           = 2/4 * 100 = 50%       // Rejected

There are four strong rules (confidence of at least the minimum
confidence of 60%):
 [Hot Dogs,Coke]=>[Chips]
 [Hot Dogs,Chips]=>[Coke]
 [Coke,Chips]=>[Hot Dogs]
 [Coke]=>[Hot Dogs,Chips]
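The rule-generation step above can be sketched by enumerating every non-empty proper subset S of I and keeping the rules whose confidence meets the 60% threshold. The support counts are read off the worked example; the dictionary layout and variable names are my own.

```python
from itertools import combinations

# Support counts from the worked Apriori example above.
sup = {
    frozenset({"Hot Dogs"}): 4,
    frozenset({"Coke"}): 3,
    frozenset({"Chips"}): 4,
    frozenset({"Hot Dogs", "Coke"}): 2,
    frozenset({"Hot Dogs", "Chips"}): 2,
    frozenset({"Coke", "Chips"}): 3,
    frozenset({"Hot Dogs", "Coke", "Chips"}): 2,
}

I = frozenset({"Hot Dogs", "Coke", "Chips"})
strong = []
for r in range(1, len(I)):                  # every non-empty proper subset S
    for combo in combinations(sorted(I), r):
        S = frozenset(combo)
        conf = sup[I] / sup[S]              # confidence of S -> (I - S)
        if conf >= 0.60:
            strong.append((S, I - S))
```

Running this yields exactly the four strong rules listed above.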
Data Formats for Association Rule Mining
1. Transaction Dataset Format (TID List Format)
The most common format is a transactional database, where each
transaction (e.g., a purchase) contains a list of items.
Example:
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Cola
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Cola

2. Flat File (List of Transactions)


This is a simplified list format without explicit TIDs:
Example:
Bread Milk
Bread Diaper Beer Eggs
Milk Diaper Beer Cola
Bread Milk Diaper Beer
Bread Milk Diaper Cola

3. Binary Matrix Format (Less Common)


A matrix where each row is a transaction, and each column is an item.
1 = present, 0 = absent.

TID  Bread  Milk  Diaper  Beer  Eggs  Cola

 1     1     1      0      0     0     0
 2     1     0      1      1     1     0
 3     0     1      1      1     0     1
 4     1     1      1      1     0     0
 5     1     1      1      0     0     1
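The TID-list format can be converted into the binary matrix format mechanically; the variable names in this sketch are my own.

```python
# Converting the TID-list dataset above into the binary matrix format:
# one row per transaction, one column per item, 1 = present, 0 = absent.
tid_lists = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Cola"},
}
items = ["Bread", "Milk", "Diaper", "Beer", "Eggs", "Cola"]

matrix = {tid: [1 if item in basket else 0 for item in items]
          for tid, basket in tid_lists.items()}
```

Row 1 of `matrix` comes out as [1, 1, 0, 0, 0, 0], matching the first row of the binary matrix table.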

Mining with multiple minimum supports


 Rare Item Problem
 The rare item problem occurs when infrequent but important items
are eliminated too early in the mining process because their support
is below the global minimum support threshold.
 Example:
Imagine a retail store dataset:
TID  Items
 1   Bread, Milk
 2   Bread, Milk, Diaper
 3   Milk, Chocolate, Caviar
 4   Bread, Caviar
 5   Milk, Caviar
 Suppose:
 Milk appears in 4/5 transactions → support = 80%
 Caviar appears in 3/5 transactions → support = 60%
 Set min_sup = 70%
 Then Caviar would be dropped, even though it might be:
• High-value
• Rare but very important (luxury product)
• Part of interesting combinations (e.g., "People who buy
chocolate also buy caviar")
 So in a rare item problem, valuable items get filtered out because
they're not frequent overall.
In fields like:
• Healthcare: rare symptoms or diseases might be critical
• Fraud detection: rare patterns may indicate fraud
• Retail: rare purchases might be high-margin luxury goods
Ignoring rare items can result in missing valuable
knowledge.

Cause                              Effect
Global min_support is too high     Rare but useful items are dropped
Global min_support is too low      Too many frequent itemsets (noise)

Solutions:
1. Multiple Minimum Supports (e.g., MSApriori)
Assign each item its own minimum support.
Rare items → lower MIS
Common items → higher MIS
2. Weighted Association Rules
Assign weights to items based on importance or utility, not just
frequency.
3. Utility Mining
Consider the profit or importance of an item, not just how often it
appears.

MS-Apriori Algorithm
 MS-Apriori is an algorithm to find frequent itemsets from data with
multiple minimum supports.
 With the regular Apriori algorithm, all frequent itemsets share a
single minimum support regardless of their frequency. This creates
two problems:
1. If the minimum support is very high, we miss all the rare
itemsets and end up with frequent itemsets which are very
obvious.
2. If the minimum support is very low, we have a problem of
combinatorial explosion, wherein we have a very vast amount of
itemsets which might not be practically useful.
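A minimal sketch of the MS-Apriori frequency test: an itemset is frequent when its support meets the lowest minimum item support (MIS) among its items. The MIS values here are illustrative, chosen to match the Caviar example above, and the function name is my own.

```python
# Per-item minimum supports (MIS): rare items get lower thresholds.
mis = {"Milk": 0.70, "Bread": 0.70, "Caviar": 0.20}

def is_frequent(itemset, support):
    """MS-Apriori test: support must reach the smallest MIS of the
    items in the itemset."""
    return support >= min(mis[item] for item in itemset)

# Caviar (support 60%) survives under its own low threshold, even though
# it would fail a single global min_sup of 70%.
```

This is how MS-Apriori avoids the rare item problem: Caviar is kept because 0.60 >= 0.20, while Milk still needs its higher 70% threshold.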

Basic Concepts of Sequential Patterns


 Sequential pattern mining is the process of finding frequent sequences in
a sequence database, where data is ordered (usually by time).
 It’s commonly used in customer behavior analysis, web clickstream
mining, medical records, etc.
 Sequential pattern is a mining technique used to discover the
relationships between occurrences of sequential events, mainly to
discover any specific order in the occurrences of these events.
 Sequence
 A sequence is an ordered list of itemsets.
 Each itemset is a collection of items/events that occur together.
 Sequence is denoted with angle brackets < >
 Example:
 Customers who buy a phone often buy a case and then
earbuds.
 Customers who buy shoes often buy socks and then shoe
polish.
 Sequence Database
 A sequence database is a set of sequences, usually from multiple
users or sessions.

 Sequential Pattern
 A sequential pattern is a sequence that appears frequently (i.e., its
support ≥ min_sup).
 Subsequence
 A sequence S1 is a subsequence of S2 if all items of S1 appear in S2
in the same order, but not necessarily consecutively.
 S1 = < (A), (C) > is a subsequence of S2 = < (A), (B, C), (D) >
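The subsequence relation can be checked with a simple greedy scan: each itemset of S1 must be contained in some later itemset of S2, in order. This sketch represents a sequence as a list of item sets; the function name is my own.

```python
def is_subsequence(s1, s2):
    """True if every itemset of s1 is contained in a distinct itemset of
    s2, in the same order (not necessarily consecutively)."""
    i = 0
    for itemset in s2:
        if i < len(s1) and set(s1[i]) <= set(itemset):
            i += 1
    return i == len(s1)
```

For the example above, `is_subsequence([{"A"}, {"C"}], [{"A"}, {"B", "C"}, {"D"}])` is true: (A) matches the first itemset and (C) is contained in (B, C).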
 Applications of sequential pattern mining
 Customer purchase behavior analysis (identifying common purchase
sequences over time)
 Web usage mining (analyzing user navigation patterns on websites)
 Biomedical sequence analysis (discovering patterns in DNA
sequences or medical events)
 Sequential Pattern Mining Algorithm Types
• Breadth-first search based
– GSP (Generalized Sequential Pattern) algorithm
– SPADE
• Depth-first search based
– PrefixSpan
– SPAM

Mining Sequential Patterns on GSP


 GSP (Generalized Sequential Pattern) algorithm performs a level-wise
(breadth-first) search within the prefix tree.
 Considers the order of events or items (temporal or positional
relationships).
 Useful for analyzing time-series data and user behavior (customer
purchase sequences, web navigation paths).
 It is an Apriori-based algorithm for sequential pattern mining
(generates and prunes candidates):
 Given the set of frequent sequences at level k, generate all possible
sequence extensions or candidates at level k + 1.
 Generates candidate sequences by joining frequent sequences
(creates longer candidates).
 Prunes candidates using the Apriori principle (removes infrequent
subsequences).
 Scans the database to determine frequent sequences (counts
support).

PrefixSpan Algorithm (Prefix-projected Sequential Pattern

Mining)
 PrefixSpan (Prefix-projected Sequential pattern mining) is an algorithm
that finds frequent sequential patterns by growing prefixes and projecting
databases based on those prefixes, instead of generating and testing
candidate sequences like Apriori-based methods.
 No candidate generation, so it is faster.
 Works well with long sequences or large databases.
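The core projection step can be sketched for the simplified case of sequences of single items (the full algorithm also handles multi-item itemsets); the function name is my own.

```python
# Simplified sketch of PrefixSpan's database projection: for each
# sequence containing the prefix item, keep only the suffix after its
# first occurrence; sequences without the item are dropped.

def project(database, prefix_item):
    projected = []
    for seq in database:
        if prefix_item in seq:
            idx = seq.index(prefix_item)
            projected.append(seq[idx + 1:])
    return projected
```

For example, projecting [["a","b","c"], ["b","a","c"], ["b","c"]] on "a" keeps only the suffixes ["b","c"] and ["c"]; PrefixSpan then counts item supports in this smaller projected database to grow the pattern.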
 Example:
 Step 1: PrefixSpan first counts the support of each item by
scanning the database

 Step 2: Eliminate infrequent items

 Step 3: Find patterns starting with {a}


PrefixSpan does a database projection with {a}:
(i.e.,) keep only the sequences containing {a};
delete the first occurrence of {a} and everything that appears
before {a}.
Calculate Support

 Step 4: Find patterns starting with <{a}{c}>

This pattern is infrequent!

 Step 5: Find patterns starting with <{a}{b}>


 Step 6: Find patterns starting with <{a,b} {c}>

 Step 7: Find patterns starting with <{b} >


 Step 8: Find patterns starting with <{b} ,{c}>

 Step 9: Find patterns starting with <{c}>
