Web Mining Unit 1

The document provides an overview of web data mining, detailing the processes and techniques involved in data mining and web mining, including their applications and tasks. It explains the different categories of web mining such as web content, structure, and usage mining, as well as the concepts of association rules and frequent pattern mining. Additionally, it discusses various algorithms used for association rule learning and their applications across different industries.

Uploaded by merlyne.noel

UNIT- I INTRODUCTION TO WEB DATA MINING

DATA MINING
 The process of using software and algorithms to find patterns and relationships in a large
data set is called Data Mining.
 It is also called Knowledge Discovery in Data (KDD).
 Data Mining involves :
a) Machine Learning
b) Artificial Intelligence
c) Data Visualization
d) Statistics
e) Information Retrieval
f) Database
 Data Mining tasks are :
a) Supervised Learning (Classification)
b) Unsupervised Learning (Clustering)
c) Association rule mining
d) Sequential pattern mining
 Data Analysts ( Data Miners ) are professionals who collect, analyze and interpret data to
help businesses to make better decisions. For example,
a) What kind of customers should a business target in its next ad campaign?
b) Which age group is most vulnerable to a particular disease?
c) What behavioral patterns are connected to financial fraud?
 Data Mining involves 3 steps
a) Preprocessing
 Raw data contains irrelevant attributes, duplicates, noise, errors and outliers (a
statistical observation that is markedly different in value from the others in
the sample).
 Preprocessing tasks clean the data and maintain it in a
separate spreadsheet or through a programming language so that the
interpretations won’t be wrong.
b) Data Mining
 The processed data is fed to a mining algorithm which will produce
patterns or knowledge.
c) Post Processing
 Not all discovered patterns are useful
 This step will identify useful patterns with the help of various evaluation measures.
 Visualization techniques are used to make decisions.
 Features of Data Mining
a) It can analyze big datasets.
b) It finds patterns and connections in large databases easily.
c) It is used to predict future outcomes and trends.
d) It groups similar types of data together.
e) It provides useful information to make data-based decisions.
f) Data Mining uses many computers to process data faster.

WEB MINING
 Web Mining is the application of Data Mining techniques to automatically discover and
extract information from Web documents and services.
 Example:
 Analyze website traffic
 Analyze user interactions with a website
 How do users navigate the site?
 What content is most popular?
 Where are users dropping off?
 Web mining involves
 Data mining
 Artificial Intelligence
 Machine Learning
 Information Retrieval
 Statistics
Process of Web Mining

CATEGORIES OF WEB MINING

1. Web Content Mining


 Web content mining is the application of extracting useful information from the
content of web documents.
 Web content consists of several types of data – text, image, audio, video etc.
Content data is the group of facts that a web page is designed to convey.
 It can provide effective and interesting patterns about user needs.
 Text documents are related to text mining, machine learning and natural language
processing.
 This mining is also known as text mining. This type of mining performs scanning
and mining of the text, images and groups of web pages according to the content
of the input.

2. Web Structure Mining


 Web structure mining is the application of discovering structure information from
the web.
 The structure of the web graph consists of web pages as nodes, and hyperlinks as
edges connecting related pages.
 It uses graph theory to analyze the nodes and connections in the structure of a
website.
 It identifies relationships between web pages linked by information or direct link
connection.
 The unstructured web data from various websites is transformed into structured,
usable information that businesses can use for their competitive advantage.
 Web structure mining can be very useful to determine the connection between two
commercial websites.

3. Web Usage Mining


 Web usage mining is the application of identifying or discovering interesting
usage patterns from large data sets.
 Web Usage Mining is watching what people do on a website. It involves looking
at web activity logs, like page views, clicks, downloads, session durations, etc., to
understand how users interact with web applications. Thus it can be used to
understand user behavior, preferences, and patterns.
 In web usage mining, user access data on the web is collected in the form of
logs. So, web usage mining is also called log mining.
Algorithm used:
• Page rank algorithm
• HITS algorithm
• Weighted page rank algorithm
• Distance rank algorithm
• Weighted page content rank algorithm
• EigenRumor algorithm
• Time rank algorithm
• Query dependent ranking algorithm

COMPARISON BETWEEN DATA MINING AND WEB MINING

Parameters    Data Mining                              Web Mining

Definition    Data Mining is the process that          Web Mining is the application of data
              attempts to discover patterns and        mining techniques to automatically
              hidden knowledge in large data           discover and extract information from
              sets in any system.                      web documents.

Application   Data Mining is very useful for           Web Mining is very useful for a
              web page analysis.                       particular website and e-service.

Structure     Data Mining gets the information         Web Mining gets the information from
              from explicit structure.                 structured, unstructured and
                                                       semi-structured web pages.

Problem       Clustering, classification,              Web content mining, web structure
Type          regression, prediction,                  mining.
              optimization and control.

Skills        It includes approaches for data          It includes application-level knowledge,
              cleansing and machine learning           data engineering with mathematical
              algorithms, statistics and               modules like statistics and probability.
              probability.

Basic Concepts of Association Rule


 A powerful technique in data mining to discover interesting
relationships between different items or attributes within large
datasets.
 Aim: To identify patterns, correlations or dependencies among
different groups of items, called itemsets, in data, which can then be
used to predict the likelihood of certain items being purchased or used
together.
 One of the most important measures used in association rule mining is
LIFT, which provides a way to determine the strength and significance
of the discovered association rules.
 Association rules are represented as sets of items.
o Eg: {A,B,C} -> where A, B, C are different items in a dataset.
 The goal of association rule mining is to identify rules that have a high
level of support and confidence.
 Association rules are “if-then” statements.
o If an itemset A is present in a transaction, then item B is also
likely to be present.
 Antecedent : The “if” part where the condition is being tested.
 Consequent : The “then” part that denotes the outcome that occurs if
the condition is true.

Types of association rules in data mining


There are multiple types of association rules in data mining. They include the following:
 Generalized- Rules in this category are meant to be general examples of association
rules that provide a high-level overview of what these associations of data points look
like.
 Multilevel- Multilevel association rules separate data points into different levels of
importance known as levels of abstraction -- and distinguish between associations of
more important data points and ones of lower importance.
 Quantitative- This type of association rule is used to describe examples where
associations are made between numerical data points.
 Multi-relational- This type is more in-depth than traditional association rules that
consider relationships between single data points. Multi-relational rules are made across
multiple or multidimensional databases.
Itemset
 A set of items together is called an itemset.
 If an itemset has k items it is called a k-itemset.
 An itemset consists of one or more items. An itemset that occurs frequently is called a
frequent itemset.
 Thus frequent itemset mining is a data mining technique to identify the items that often
occur together.
o For Example, Bread and butter, Laptop and Antivirus software, etc

Frequent Pattern Mining (FPM)


 The frequent pattern mining algorithm is one of the most important techniques of data
mining to discover relationships between different items in a dataset.
These relationships are represented in the form of association rules. It helps to find the
regularities in data.
 FPM has many applications in the field of data analysis, software bugs, cross-marketing,
sale campaign analysis, market basket analysis, etc.
 Frequent itemsets discovered through Apriori have many applications in data mining
tasks, such as finding interesting patterns in the database, sequence mining and, most
important of them, mining of association rules.
 Association rules apply to supermarket transaction data, that is, to examine the customer
behavior in terms of the purchased products. Association rules describe how often the
items are purchased together.

Key Components of Association Rules


 Antecedent: The “if” part of the rule, representing the condition.
 Example: A customer buys bread.
 Consequent: The “then” part of the rule, representing the outcome.
 Example: The customer also buys butter.
 Association rules are derived through algorithms that evaluate the frequency and strength
of these relationships.
 They use metrics like support, confidence, and lift to measure the relevance and
reliability of discovered patterns.
 These rules have applications in various fields, such as retail, healthcare, and marketing,
where analyzing customer behavior or trends is critical for success.

Rule Evaluation Metrics


 Association rules are evaluated using key metrics that determine their relevance, strength,
and reliability.
 These metrics include support, confidence, and lift, which quantify the frequency and
strength of relationships between data items.
1. Support
 Support measures how frequently an itemset (both antecedent and consequent) appears in
the dataset. It provides an indication of how common a particular association is.
Formula:
Support(A -> B) = (Transactions containing both A and B) / (Total transactions)
 Example: If bread and butter appear together in 100 out of 1,000 transactions, the
support is 100 / 1,000 = 0.1 (10%).
 A higher support value indicates a more frequently occurring pattern in the dataset.

2. Confidence
 Confidence measures the likelihood of the consequent occurring given that the antecedent
has already occurred. It evaluates the reliability of the rule.
Formula:
Confidence(A -> B) = Support(A and B) / Support(A)

Example: If 70% of customers who buy bread also buy butter, the confidence is 0.7 (70%).

 Higher confidence suggests a stronger relationship between the antecedent and


consequent.
3. Lift
 Lift measures the strength of an association compared to its random occurrence in the
dataset.
 It identifies how much more likely the antecedent and consequent are to appear together
than independently.
Formula:
Lift(A -> B) = Confidence(A -> B) / Support(B)
 if Lift=1, two items are independent.
 if Lift>1, indicates positive association.
 if Lift<1, indicates negative association.

Example
 A lift value greater than 1 indicates a strong positive association, while a value equal to 1
suggests no association.
 For instance, if the lift is 1.5, it means the antecedent makes the consequent 1.5 times
more likely.
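As a minimal sketch, the three metrics above can be computed over a list of transactions represented as Python sets of items. The function names here are illustrative, not from the text.

```python
# Hedged sketch: support, confidence, and lift over a transaction list.
# Each transaction is a set of items.

def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Likelihood of the consequent given the antecedent:
    support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Strength of the rule relative to chance:
    confidence(A -> B) / support(B)."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))
```

A lift above 1 from these functions indicates the antecedent and consequent co-occur more often than independence would predict, matching the interpretation above.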
Association Rule – Working
 Association rule learning is a multi-step process designed to identify meaningful patterns
and relationships in large datasets. It involves two main stages:
1. Identifying Frequent Itemsets:
 The process begins by identifying frequent itemsets—combinations of items that
appear together in transactions with a frequency above a predefined threshold.
 Metrics like support are used to measure how often these itemsets occur in the
dataset. For example, a frequent itemset might reveal that bread and butter are
purchased together in 10% of transactions.
2. Generating Association Rules
 Once frequent itemsets are identified, association rules are generated.
 These rules take the form of if-then statements that describe relationships between
items (e.g., “If a customer buys bread, they are likely to buy butter”).
 Metrics such as confidence and lift are applied to evaluate the strength and
reliability of these rules.
Iterative Refinement
 The process is iterative, with thresholds for support and confidence adjusted to refine the
rules.
 This ensures that only the most significant and actionable rules are selected. For instance,
a rule with low confidence may be excluded from further analysis.
 Through this systematic approach, association rule learning uncovers valuable insights
from raw data, enabling organizations to make data-driven decisions.

Types of Association Rule Learning Algorithms


 Several algorithms are used for association rule learning, each with unique strengths and
applications. The three most commonly used algorithms are:
1. Apriori Algorithm
 The Apriori algorithm employs a breadth-first search approach to identify frequent
itemsets. It relies on the principle that all subsets of a frequent itemset must also be
frequent, reducing the search space.
 Advantage: Simple to implement and effective for small datasets with low
dimensionality.
 Limitation: Performance degrades significantly with large or dense datasets due to
repeated scanning of the database.
2. Eclat Algorithm
 The Eclat algorithm uses a depth-first search strategy to discover frequent itemsets.
 Instead of scanning the database multiple times, it represents transactions as vertical
itemsets and directly computes intersections.
 Advantage: Efficient for datasets with sparse data or where there are fewer
frequent itemsets.
3. FP-Growth Algorithm
 The FP-Growth (Frequent Pattern Growth) algorithm leverages a prefix-tree
structure called the FP-tree to represent transactional data compactly.
 Unlike Apriori, it avoids generating candidate itemsets explicitly, making it faster and
more efficient.
 Advantage: Significantly faster and more memory-efficient than Apriori,
especially for large datasets.

APRIORI PROPERTY / DOWNWARD CLOSURE PROPERTY

If an itemset appears frequently in a dataset, then all its subsets
must also be frequent.
Conversely, if an itemset is identified as infrequent, then its supersets
are also considered infrequent.
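This property is what lets Apriori prune its candidate itemsets. As an illustrative sketch (helper name is mine), a k-candidate is kept only if every one of its (k-1)-subsets was found frequent at the previous level:

```python
from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    """Downward-closure pruning: keep a candidate k-itemset only if
    all of its (k-1)-subsets are in the frequent (k-1)-itemsets."""
    frequent_prev = {frozenset(s) for s in frequent_prev}
    kept = []
    for cand in candidates:
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(cand, len(cand) - 1)):
            kept.append(cand)
    return kept
```

For example, with frequent 2-itemsets {A,B}, {A,C}, {B,C} and {A,D}, the candidate {A,B,C} survives, while {A,B,D} is pruned because its subset {B,D} is not frequent.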

Applications of Association Rules


 Association rules are widely applied across various industries to uncover patterns and
relationships in data, enabling better decision-making and operational efficiency.
1. Retail and Market Basket Analysis:
 Retailers use association rules to identify frequently purchased product
combinations, helping them optimize store layouts or create product bundles to
increase sales.
 Example: A supermarket discovers that customers who buy bread often purchase
butter and jam, leading to strategic placement of these items together.
2. Healthcare:
 In healthcare, association rules help discover co-occurrence patterns in symptoms,
aiding in diagnostic processes and treatment plans.
 Example: Identifying that patients with high blood pressure often have a higher risk
of developing diabetes can guide preventative care strategies.
3. E-Commerce and Recommendation Systems
 E-commerce platforms leverage association rules to build recommendation systems
that enhance user experiences and drive sales.
 Example: Amazon’s “Customers who bought this also bought” feature suggests
complementary products, boosting cross-selling opportunities.
4. Fraud Detection
o Association rules are used in financial services to identify unusual patterns in
transaction data, which can help detect fraudulent activities.
o Example: Flagging transactions that deviate significantly from established
spending patterns for further investigation.

Example of Association Rules


 Consider a small transaction dataset where customers purchase items like bread, butter,
and milk.
Dataset Example:

Transaction ID Items Purchased

1 Bread, Butter

2 Bread, Milk

3 Bread, Butter, Milk

4 Milk

5 Bread, Butter

Rule Discovery Process:


Rule Example: “If bread is purchased, then butter is likely to be purchased.”
1. Support Calculation:
Support = Transactions containing both bread and butter ÷ Total transactions
Support = 3 / 5 = 0.6 (60%)
2. Confidence Calculation:
Confidence = Support of bread and butter ÷ Support of bread
Confidence = 0.6 / 0.8 = 0.75 (75%)
3. Lift Calculation:
Lift = Confidence ÷ Support of butter
Lift = 0.75 / 0.6 = 1.25

 A lift value greater than 1 indicates a positive association between bread and butter.
 This example demonstrates how association rules are derived and evaluated, providing
actionable insights from transactional data.
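The same calculation can be reproduced directly from the five-transaction dataset above; the `sup` helper is my own shorthand:

```python
# The five transactions from the dataset example above, used to verify the
# metrics for the rule "if bread, then butter".
transactions = [
    {"bread", "butter"},          # T1
    {"bread", "milk"},            # T2
    {"bread", "butter", "milk"},  # T3
    {"milk"},                     # T4
    {"bread", "butter"},          # T5
]

def sup(items):
    """Support: fraction of transactions containing all given items."""
    return sum(items <= t for t in transactions) / len(transactions)

support_rule = sup({"bread", "butter"})          # 3/5 = 0.6
confidence_rule = support_rule / sup({"bread"})  # 0.6 / 0.8 = 0.75
lift_rule = confidence_rule / sup({"butter"})    # 0.75 / 0.6 = 1.25
```

The lift of 1.25 is above 1, confirming the positive association between bread and butter.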

Association rule algorithms


 Examples of popular data mining algorithms that use association rules include the
following:
1. AIS
o This algorithm generates and counts itemsets as it scans data.
o In transaction data, AIS determines which large itemsets contain a transaction.
o New candidate itemsets are created by extending the large itemsets with other
items in the transaction data.
2. Apriori
o The Apriori algorithm takes an iterative approach, where it scans a database to
apply association rules, focusing on identifying large itemsets with subsets of
similar features as potential candidates for these rules.
o Itemsets that aren't large are discarded, and the remaining itemsets are the
candidates the algorithm considers.
o The Apriori algorithm considers any subset of a frequent itemset to also be a
frequent itemset. With this approach, the algorithm reduces the number of
candidates being considered by only exploring itemsets with support counts
greater than the minimum support count.
3. SETM
o This algorithm also generates candidate itemsets as it scans a database, but it
accounts for the itemsets at the end of its scan.
o New candidate itemsets are generated the same way as with the AIS algorithm,
but the transaction ID of the generating transaction is saved with the candidate
itemset in a sequential data structure.
o At the end of the scan, the support count of candidate itemsets is created by
aggregating the sequential structure.
4. ECLAT
o This name stands for Equivalence Class Clustering and Bottom-up Lattice
Traversal. The ECLAT algorithm is a version of the Apriori algorithm that
explores complex classes of itemsets first and then repeatedly boils them down
and simplifies them.
5. FP-growth
o FP stands for frequent pattern. The FP-growth algorithm uses a tree structure,
called an FP-tree, to map out relationships between individual items to find the
most frequently recurring patterns.

Apriori Algorithm for Mining Association


The Apriori algorithm is an algorithm that is used in mining frequent itemsets and
relevant association rules. Generally, the Apriori algorithm operates on a database containing a
huge number of transactions. For example, the items customers buy at a Big Bazaar.

The Apriori algorithm helps customers to buy their products with ease and increases the sales
performance of the particular store.

Components of Apriori algorithm


The following three components comprise the Apriori algorithm.

 Support
 Confidence
 Lift

Application of the Apriori Algorithm:

 The Apriori algorithm has a very important contribution in various fields.

 This algorithm has gained a vital place in recent years.
 The Apriori algorithm is being used in different industries for data mining and data handling.
 The domains where the Apriori algorithm is helpful are discussed below.
1. Mobile e-commerce:
 Big data is capable of helping mobile e-commerce companies to provide a convenient,
simple, and customised shopping experience with the help of the Apriori algorithm. The
company's product recommendation accuracy increases, which brings an excellent
customer experience. Thus, the customer engagement as well as the sales of the company
increase.
2. Education:
 Researchers and students use this algorithm to store and analyse their data. The
educational institution stores and monitors students' data like age, gender, traits, grades,
characteristics, performance, parent details etc.
3. Forestry:
 Nowadays it is important to focus upon forestry and wildlife. Apriori algorithm is being
used to store, analyse, and manage the minute details of every flora and fauna of a
particular territory.
4. Medical:
 To treat an existing patient a doctor needs to retrieve lots of past data related to the
patient. In this case, hospitals use the Apriori algorithm generally to manage the database
of patients without mixing it with other patients.
5. Market Basket Analysis:
 Retailers can analyse customers' purchase patterns by monitoring the combinations of
products that they purchase together frequently. This information can be useful to
recommend such products together to increase sales. The Apriori algorithm is also used
here.
6. Improving Website's design and ranking:
 Using the Apriori algorithm we can analyse users' navigation patterns on websites.
Using such information we can improve website design and user experience.
7. Tour and travel:
 The Apriori algorithm analyses the booking patterns of tourists. This algorithm has a very
important role in the tourism sector.
Minimum Support : The minimum number of transactions in which an
itemset must appear for it to be considered frequent.

Example:
Find the frequent itemsets and generate association rules on this.
Assume that minimum support threshold (s = 33.33%) and minimum
confident threshold (c = 60%)

1. Generate Frequent 1-itemset

2. Generate Frequent 2-itemset

3. Generate Frequent 3-itemset

There is only one itemset with minimum support 2. So only one itemset is
frequent.

Frequent Itemset (I) = {Hot Dogs, Coke, Chips}


4. Generate Association rules

TO FORM ASSOCIATION RULE


For every non-empty proper subset S of I, the
association rule is
S → (I − S)


 Non Empty subsets are S= {{Hot Dogs},
{Coke},
{Chips},
{Hot Dogs, Coke},
{Hot Dogs, Chips},
{Coke, Chips}}
[Hot Dogs, Coke] => [Chips]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Hot Dogs, Coke)
           = 2/2 * 100 = 100%      // Selected
[Hot Dogs, Chips] => [Coke]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Hot Dogs, Chips)
           = 2/2 * 100 = 100%      // Selected
[Coke, Chips] => [Hot Dogs]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Coke, Chips)
           = 2/3 * 100 = 66.67%    // Selected
[Hot Dogs] => [Coke, Chips]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Hot Dogs)
           = 2/4 * 100 = 50%       // Rejected
[Coke] => [Hot Dogs, Chips]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Coke)
           = 2/3 * 100 = 66.67%    // Selected
[Chips] => [Hot Dogs, Coke]
confidence = sup(Hot Dogs, Coke, Chips) / sup(Chips)
           = 2/4 * 100 = 50%       // Rejected

There are four strong rules (confidence of at least the minimum
confidence of 60%):
 [Hot Dogs,Coke]=>[Chips]
 [Hot Dogs,Chips]=>[Coke]
 [Coke,Chips]=>[Hot Dogs]
 [Coke]=>[Hot Dogs,Chips]
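The rule-generation step above can be sketched by enumerating every non-empty proper subset S of I and keeping the rules whose confidence meets the 60% threshold. The support counts are read off the worked example; the dictionary layout and variable names are my own.

```python
from itertools import combinations

# Support counts from the worked Apriori example above.
sup = {
    frozenset({"Hot Dogs"}): 4,
    frozenset({"Coke"}): 3,
    frozenset({"Chips"}): 4,
    frozenset({"Hot Dogs", "Coke"}): 2,
    frozenset({"Hot Dogs", "Chips"}): 2,
    frozenset({"Coke", "Chips"}): 3,
    frozenset({"Hot Dogs", "Coke", "Chips"}): 2,
}

I = frozenset({"Hot Dogs", "Coke", "Chips"})
strong = []
for r in range(1, len(I)):                  # every non-empty proper subset S
    for combo in combinations(sorted(I), r):
        S = frozenset(combo)
        conf = sup[I] / sup[S]              # confidence of S -> (I - S)
        if conf >= 0.60:
            strong.append((S, I - S))
```

Running this yields exactly the four strong rules listed above.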
Data Formats for Association Rule Mining
1. Transaction Dataset Format (TID List Format)
The most common format is a transactional database, where each
transaction (e.g., a purchase) contains a list of items.
Example:
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Cola
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Cola

2. Flat File (List of Transactions)


This is a simplified list format without explicit TIDs:
Example:
Bread Milk
Bread Diaper Beer Eggs
Milk Diaper Beer Cola
Bread Milk Diaper Beer
Bread Milk Diaper Cola

3. Binary Matrix Format (Less Common)


A matrix where each row is a transaction, and each column is an item.
1 = present, 0 = absent.

TID  Bread  Milk  Diaper  Beer  Eggs  Cola

 1     1     1      0      0     0     0
 2     1     0      1      1     1     0
 3     0     1      1      1     0     1
 4     1     1      1      1     0     0
 5     1     1      1      0     0     1
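The TID-list format can be converted into the binary matrix format mechanically; the variable names in this sketch are my own.

```python
# Converting the TID-list dataset above into the binary matrix format:
# one row per transaction, one column per item, 1 = present, 0 = absent.
tid_lists = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Cola"},
}
items = ["Bread", "Milk", "Diaper", "Beer", "Eggs", "Cola"]

matrix = {tid: [1 if item in basket else 0 for item in items]
          for tid, basket in tid_lists.items()}
```

Row 1 of `matrix` comes out as [1, 1, 0, 0, 0, 0], matching the first row of the binary matrix table.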

Mining with multiple minimum supports


 Rare Item Problem
 The rare item problem occurs when infrequent but important items
are eliminated too early in the mining process because their support
is below the global minimum support threshold.
 Example:
Imagine a retail store dataset:
TID  Items
 1   Bread, Milk
 2   Bread, Milk, Diaper
 3   Milk, Chocolate, Caviar
 4   Bread, Caviar
 5   Milk, Caviar
 Suppose:
 Milk appears in 4/5 transactions → support = 80%
 Caviar appears in 3/5 transactions → support = 60%
 Set min_sup = 70%
 Then Caviar would be dropped, even though it might be:
• High-value
• Rare but very important (luxury product)
• Part of interesting combinations (e.g., "People who buy
chocolate also buy caviar")
 So in a rare item problem, valuable items get filtered out because
they're not frequent overall.
In fields like:
• Healthcare: rare symptoms or diseases might be critical
• Fraud detection: rare patterns may indicate fraud
• Retail: rare purchases might be high-margin luxury goods
Ignoring rare items can result in missing valuable
knowledge.

Cause                              Effect
Global min_support is too high     Rare but useful items are dropped
Global min_support is too low      Too many frequent itemsets (noise)

Solutions:
1. Multiple Minimum Supports (e.g., MSApriori)
Assign each item its own minimum support.
Rare items → lower MIS
Common items → higher MIS
2. Weighted Association Rules
Assign weights to items based on importance or utility, not just
frequency.
3. Utility Mining
Consider the profit or importance of an item, not just how often it
appears.

MS-Apriori Algorithm
 MS-Apriori is an algorithm to find frequent itemsets from data with
multiple minimum supports.
 With the regular Apriori algorithm, all frequent itemsets share a
single minimum support regardless of their frequency. This creates
two problems:
1. If the minimum support is very high, we miss all the rare
itemsets and end up with frequent itemsets which are very
obvious.
2. If the minimum support is very low, we have a problem of
combinatorial explosion, wherein we have a very vast amount of
itemsets which might not be practically useful.
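A minimal sketch of the MS-Apriori frequency test: an itemset is frequent when its support meets the lowest minimum item support (MIS) among its items. The MIS values here are illustrative, chosen to match the Caviar example above, and the function name is my own.

```python
# Per-item minimum supports (MIS): rare items get lower thresholds.
mis = {"Milk": 0.70, "Bread": 0.70, "Caviar": 0.20}

def is_frequent(itemset, support):
    """MS-Apriori test: support must reach the smallest MIS of the
    items in the itemset."""
    return support >= min(mis[item] for item in itemset)

# Caviar (support 60%) survives under its own low threshold, even though
# it would fail a single global min_sup of 70%.
```

This is how MS-Apriori avoids the rare item problem: Caviar is kept because 0.60 >= 0.20, while Milk still needs its higher 70% threshold.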

Basic Concepts of Sequential Patterns


 Sequential pattern mining is the process of finding frequent sequences in
a sequence database, where data is ordered (usually by time).
 It’s commonly used in customer behavior analysis, web clickstream
mining, medical records, etc.
 Sequential pattern is a mining technique used to discover the
relationships between occurrences of sequential events, mainly to
discover any specific order in the occurrences of these events.
 Sequence
 A sequence is an ordered list of itemsets.
 Each itemset is a collection of items/events that occur together.
 Sequence is denoted with angle brackets < >
 Example:
 Customers who buy a phone often buy a case and then
earbuds.
 Customers who buy shoes often buy socks and then shoe
polish.
 Sequence Database
 A sequence database is a set of sequences, usually from multiple
users or sessions.

 Sequential Pattern
 A sequential pattern is a sequence that appears frequently (i.e., its
support ≥ min_sup).
 Subsequence
 A sequence S1 is a subsequence of S2 if all items of S1 appear in S2
in the same order, but not necessarily consecutively.
 S1 = < (A), (C) > is a subsequence of S2 = < (A), (B, C), (D) >
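The subsequence relation can be checked with a simple greedy scan: each itemset of S1 must be contained in some later itemset of S2, in order. This sketch represents a sequence as a list of item sets; the function name is my own.

```python
def is_subsequence(s1, s2):
    """True if every itemset of s1 is contained in a distinct itemset of
    s2, in the same order (not necessarily consecutively)."""
    i = 0
    for itemset in s2:
        if i < len(s1) and set(s1[i]) <= set(itemset):
            i += 1
    return i == len(s1)
```

For the example above, `is_subsequence([{"A"}, {"C"}], [{"A"}, {"B", "C"}, {"D"}])` is true: (A) matches the first itemset and (C) is contained in (B, C).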
 Applications of sequential pattern mining
 Customer purchase behavior analysis (identifying common purchase
sequences over time)
 Web usage mining (analyzing user navigation patterns on websites)
 Biomedical sequence analysis (discovering patterns in DNA
sequences or medical events)
 Sequential Pattern Mining Algorithm Types
• Breadth-first search based
– GSP (Generalized Sequential Pattern) algorithm
– SPADE
• Depth-first search based
– PrefixSpan
– SPAM

Mining Sequential Patterns on GSP


 GSP (Generalized Sequential Pattern) algorithm performs a level-wise
(breadth-first) search within the prefix tree.
 Considers the order of events or items (temporal or positional
relationships).
 Useful for analyzing time-series data and user behavior (customer
purchase sequences, web navigation paths).
 It is an Apriori-based algorithm for sequential pattern mining
(generates and prunes candidates):
 Given the set of frequent sequences at level k, generate all possible
sequence extensions or candidates at level k + 1.
 Generates candidate sequences by joining frequent sequences
(creates longer candidates).
 Prunes candidates using the Apriori principle (removes infrequent
subsequences).
 Scans the database to determine frequent sequences (counts
support).

PrefixSpan Algorithm (Prefix-projected Sequential Pattern

Mining)
 PrefixSpan (Prefix-projected Sequential pattern mining) is an algorithm
that finds frequent sequential patterns by growing prefixes and projecting
databases based on those prefixes, instead of generating and testing
candidate sequences like Apriori-based methods.
 No candidate generation, so it is faster.
 Works well with long sequences or large databases.
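The core projection step can be sketched for the simplified case of sequences of single items (the full algorithm also handles multi-item itemsets); the function name is my own.

```python
# Simplified sketch of PrefixSpan's database projection: for each
# sequence containing the prefix item, keep only the suffix after its
# first occurrence; sequences without the item are dropped.

def project(database, prefix_item):
    projected = []
    for seq in database:
        if prefix_item in seq:
            idx = seq.index(prefix_item)
            projected.append(seq[idx + 1:])
    return projected
```

For example, projecting [["a","b","c"], ["b","a","c"], ["b","c"]] on "a" keeps only the suffixes ["b","c"] and ["c"]; PrefixSpan then counts item supports in this smaller projected database to grow the pattern.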
 Example:
 Step 1: PrefixSpan first counts the support of each item by
scanning the database

 Step 2: Eliminate infrequent items

 Step 3: Find patterns starting with {a}


PrefixSpan does a database projection with {a}:
(i.e.,) keep only the sequences containing {a};
delete the first occurrence of {a} and everything that appears
before {a}.
Calculate Support

 Step 4: Find patterns starting with <{a}{c}>

This pattern is infrequent!

 Step 5: Find patterns starting with <{a}{b}>


 Step 6: Find patterns starting with <{a,b} {c}>

 Step 7: Find patterns starting with <{b} >


 Step 8: Find patterns starting with <{b} ,{c}>

 Step 9: Find patterns starting with <{c}>
