DATA MINING NOTES
SYLLABUS
Pre-Requisites:
A course on “Database Management Systems”
Knowledge of probability and statistics
Course Objectives:
It presents methods for mining frequent patterns, associations, and correlations.
It then describes methods for data classification and prediction, and data–clustering
approaches.
It covers mining various types of data stores such as spatial, textual, multimedia, streams.
Course Outcomes:
Ability to understand the types of the data to be mined and present a general classification
of tasks and primitives to integrate a data mining system.
Apply preprocessing methods for any given raw data.
Extract interesting patterns from large amounts of data.
Discover the role played by data mining in various fields.
Choose and employ suitable data mining algorithms to build analytical applications
Evaluate the accuracy of supervised and unsupervised models and algorithms.
UNIT - I Data Mining: Data – Types of Data – Data Mining Functionalities – Interestingness Patterns – Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data Warehouse – Major Issues in Data Mining – Data Preprocessing.
UNIT - II Association Rule Mining: Mining Frequent Patterns–Associations and correlations –
Mining Methods– Mining Various kinds of Association Rules– Correlation Analysis– Constraint
based Association mining. Graph Pattern Mining, SPM.
UNIT - III Classification: Classification and Prediction – Basic concepts–Decision tree
induction–Bayesian classification, Rule–based classification, Lazy learner.
UNIT - IV Clustering and Applications: Cluster analysis–Types of Data in Cluster Analysis–
Categorization of Major Clustering Methods– Partitioning Methods, Hierarchical Methods–
Density–Based Methods, Grid–Based Methods, Outlier Analysis.
UNIT - V Advanced Concepts: Basic concepts in mining data streams–Mining Time–series
data––Mining sequence patterns in Transactional databases– Mining Object– Spatial–
Multimedia–Text and Web data – Spatial Data mining– Multimedia Data mining–Text Mining–
Mining the World Wide Web.
TEXT BOOKS: 1. Data Mining – Concepts and Techniques – Jiawei Han & Micheline Kamber,
3rd Edition Elsevier. 2. Data Mining Introductory and Advanced topics – Margaret H Dunham,
PEA.
UNIT-I
1. DATA MINING
DEFINITION 1: Data mining is the procedure of extracting useful
information from huge sets of data; it is also described as mining
knowledge from data.
DEFINITION 2: Data mining is the process of discovering interesting
patterns and knowledge from large amounts of data. The data sources
can include databases, data warehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
Data mining is also known as knowledge discovery from data, or
KDD.
Knowledge Discovery from Data (KDD):
The need for data mining is to extract useful information from large datasets
and use it for prediction and better decision-making. Nowadays, data
mining is used in almost every place where a large amount of data is stored
and processed.
Examples include the banking sector, market basket analysis, and network
intrusion detection. Data mining is also known as Knowledge Discovery from
Data, or KDD.
Knowledge Discovery from Data (KDD) Process:
KDD is a process that involves the extraction of useful, previously unknown,
and potentially valuable information from large datasets. The KDD process is
iterative and usually requires multiple passes over the following steps to
extract accurate knowledge from the data.
The following steps are included in KDD process:
1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Representation
Data Cleaning: Data cleaning is the removal of noisy, irrelevant, and
inconsistent data from the data collection. It handles missing values and
smooths noisy data, where noise is a random error or variance in a measured
variable. In this step, noise and inconsistent data are removed.
Data Integration: Data integration combines heterogeneous data from
multiple data sources into a common source (a data warehouse); i.e., in this
step, multiple data sources may be combined into a single data source.
A popular trend in the information industry is to perform data
cleaning and data integration as a data preprocessing step, where the
resulting data are stored in a data warehouse.
Data Selection: Data selection is defined as the process where data
relevant to the analysis is decided and retrieved from the data collection.
This step in the KDD process is identifying and selecting the relevant data for
analysis.
Data Transformation:
Data Transformation is defined as the process of transforming data into
appropriate form required by mining procedure. This step involves reducing
the data dimensionality, aggregating the data, normalizing it, and
discretizing it to prepare it for further analysis.
Data Mining:
This is the heart of the KDD process and involves applying various data
mining techniques to the transformed data to discover hidden patterns,
trends, relationships, and insights. A few of the most common data mining
techniques include clustering, classification, association rule mining, and
anomaly detection.
Pattern Evaluation:
After the data mining, the next step is to evaluate the discovered patterns to
determine their usefulness and relevance. This involves assessing the quality
of the patterns, evaluating their significance, and selecting the most
promising patterns for further analysis.
Knowledge Representation:
This step involves representing the knowledge extracted from the data in a
way humans can easily understand and use. This can be done through
visualizations, reports, or other forms of communication that provide
meaningful insights into the data.
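To make the flow of these steps concrete, the following minimal Python sketch (assuming the pandas library is available) walks a small, entirely hypothetical sales data set through cleaning, integration, selection, transformation, a trivial mining step, and reporting. The column names and values are invented for illustration only and do not come from the text above.

import pandas as pd

# Hypothetical raw data: two small sources standing in for a sales database and a customer file
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                      "amount": [1200, 300, None, 7500]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "age": [34, 51, 46],
                          "income": [40000, None, 90000]})

# 1. Data cleaning: fill in missing values
sales["amount"] = sales["amount"].fillna(sales["amount"].median())
customers["income"] = customers["income"].fillna(customers["income"].median())

# 2. Data integration: combine the two sources into one view
data = sales.merge(customers, on="customer_id")

# 3. Data selection: keep only the attributes relevant to the analysis
relevant = data[["age", "income", "amount"]]

# 4. Data transformation: normalize each attribute into the range 0.0 to 1.0
normalized = (relevant - relevant.min()) / (relevant.max() - relevant.min())

# 5. Data mining (a trivial stand-in): find purchases that are unusually large
pattern = data[data["amount"] > data["amount"].mean() + data["amount"].std()]

# 6-7. Pattern evaluation and knowledge representation: report the finding
print(pattern[["customer_id", "amount"]])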
2. TYPES OF DATA
What Kinds of Data Can Be Mined
As a general technology, data mining can be applied to any kind of data as
long as the data are meaningful for a target application.
The following are the most basic forms of data for mining.
Basic forms of data for mining
Database Data (or) Relational database
Data warehouse data
Transactional data
Other forms of data for mining
Multimedia Database
Spatial Database
World Wide Web
Text data (Flat File)
Time series database
Database Data (or) Relational database
A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set
of software programs to manage and access the data.
A relational database is a collection of tables, each of which is assigned a
unique name. Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows). Each tuple in a
relational table represents an object identified by a unique key and described
by a set of attribute values.
Example:
Data warehouse data
A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single site.
Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading, and periodic data refreshing.
A data warehouse is defined as the collection of data integrated from
multiple sources. Later this data can be mined for decision making.
A data warehouse is usually modelled by a multidimensional data structure,
called a data cube, in which each dimension corresponds to an attribute or a
set of attributes in the schema, and each cell stores the value of some
aggregate measure such as count or sum. A data cube provides a
multidimensional view of data and allows the precomputation and fast
access of summarized data.
Example:
Transactional data
A transactional database is a collection of records organized by time stamps,
dates, etc., where each record represents a transaction. In general, each record
in a transactional database captures a transaction, such as a customer's
purchase, a flight booking, or a user's clicks on a web page.
A transaction typically includes a unique transaction identity number (trans
ID) and a list of the items making up the transaction, such as the items
purchased in the transaction.
This type of database has the capability to roll back or undo an operation when
a transaction is not completed or committed, and it follows the ACID properties
of a DBMS.
Example:
TID Items
T1 Bread, Coke, Milk
T2 Popcorn, Bread
T3 Popcorn, Coke, Egg, Milk
T4 Popcorn, Bread, Egg, Milk
T5 Coke, Egg, Milk
Fig: Transactional data
Multimedia database
The multimedia databases are used to store multimedia data such as
images, animation, audio, video along with text. This data is stored in the
form of multiple file types
like .txt(text), .jpg(images), .swf(videos), .mp3(audio) etc.
Spatial database
A spatial database is a database that is enhanced to store and access spatial
data or data that defines a geometric space. These data are often associated
with geographic locations and features, or constructed features like cities.
Data on spatial databases are stored as coordinates, points, lines, polygons
and topology.
World Wide Web
The World Wide Web is a collection of documents and resources such as
audio, video, and text. These resources are identified by URLs and linked
together through HTML pages. Online shopping, job hunting, and research
are some of its uses.
It is the most heterogeneous repository, as it collects data from many
different sources, and it is dynamic in nature because the volume of data is
continuously increasing and changing.
Text data (Flat File)
Flat files are a type of structured data that are stored in a plain text format.
They are called “flat” because they have no hierarchical structure, unlike a
relational database table. Flat files typically consist of rows and columns of
data, with each row representing a single record and each column
representing a field or attribute within that record. They can be stored in
various formats such as CSV, tab-separated values (TSV) and fixed-width
format.
Flat files are data files in text or binary form with a structure that can be
easily extracted by data mining algorithms. Data stored in flat files have no
relationships or paths among them; for example, if a relational database is
stored in flat files, there will be no relations between the tables.
Example:
Time series database: Time-series data is a sequence of data points
collected over time intervals, allowing us to track changes over time. Time-
series data can track changes over milliseconds, days, or even years. A time
series database (TSDB) is a database optimized for time-stamped or time
series data. Time series data are simply measurements or events that are
tracked, monitored, down sampled, and aggregated over time. This could be
server metrics, application performance monitoring, network data, sensor
data, events, clicks, trades in a market, and many other types of analytics
data.
Example:
3. DATA MINING FUNCTIONALITIES
Data mining is important because there is so much data out there, and it's
impossible for people to look through it all by themselves. Data mining uses
various functionalities to analyze the data and find patterns, trends, and
other information that would be hard for people to find on their own. Data
mining functionalities are used to specify the kinds of patterns to be found in
data mining tasks. In general, such data mining tasks can be classified into
two categories: descriptive and predictive.
Descriptive data mining
Similarities and patterns in data may be discovered using descriptive data
mining. This kind of mining focuses on transforming raw data into
information that can be used in reports and analyses. It provides certain
knowledge about the data, for instance, count, average.
It gives information about what is happening inside the data without any
previous idea. It exhibits the common features in the data. In simple words,
you get to know the general properties of the data present in the database.
Predictive data mining
These kinds of mining tasks perform inference on the current data in order to
make predictions. This helps the developers in understanding the
characteristics that are not explicitly available. For instance, the prediction of
business analysis in the next quarter with the performance of the previous
quarters. In general, the predictive analysis predicts or infers the
characteristics with the previously available data.
The following are data mining functionalities:
Class/Concept Description (Characterization and
Discrimination)
Mining Frequent Patterns, Associations and Correlation
Classification and Regression for predictive Analysis
Cluster Analysis
Outlier Analysis
1.Class/Concept Description: Characterization and Discrimination
Data is associated with classes or concepts.
Class: A collection of things sharing a common attribute
Example: Classes of items – computers and printers
Concept: An abstract or general idea derived from specific
instances.
Example: Concepts of customers – big Spenders and budget
Spenders.
It can be useful to describe individual classes and concepts in summarized,
concise, and yet precise terms. Such descriptions of a class or a concept are
called class/concept descriptions. These descriptions can be derived
using data characterization and data discrimination, or both.
Data characterization
Data characterization is a summarization of the general characteristics or
features of a target class of data. Data summarization can be done based on
statistical measures and plots. The output of data characterization can be
presented in various forms it includes pie charts, bar charts, curves, and
multidimensional data cubes.
Example: A customer relationship manager at All Electronics may order the
following data mining task: Summarize the characteristics of customers who
spend more than $5000 a year at All Electronics. The result is a general
profile of these customers, such as that they are 40 to 50 years old,
employed, and have excellent credit ratings.
Data discrimination
Data discrimination is one of the functionalities of data mining. It compares
the data between the two classes. Generally, it maps the target class with a
predefined group or class. It compares and contrasts the characteristics of
the class with the predefined class using a set of rules called discriminant
rules.
Example: A customer relationship manager at All Electronics may want to
compare two groups of customers: those who shop for computer products
regularly (e.g., more than twice a month) and those who rarely shop for such
products (e.g., less than three times a year).
The resulting description provides a general comparative profile of these
customers, such as that 80% of the customers who frequently purchase
computer products are between 20 and 40 years old and have a university
education, whereas 60% of the customers who infrequently buy such
products are either seniors or youths and have no university degree.
2. Mining Frequent Patterns, Associations and Correlation
Frequent patterns, as the name suggests, are patterns that occur frequently
in data. There are many kinds of frequent patterns, including frequent item
sets, frequent sub-sequences (also known as sequential patterns), and
frequent substructures. A frequent item set typically refers to a set of items
that often appear together in a transactional data set—for example, milk and
bread, which are frequently bought together in grocery stores by many
customers. A frequently occurring subsequence, such as the pattern that
customers tend to purchase first a laptop, followed by a digital camera, and
then a memory card, is a (frequent) sequential pattern. A substructure can
refer to different structural forms (e.g., graphs, trees, or lattices) that may be
combined with item sets or subsequences. If a substructure occurs frequently,
it is called a (frequent) structured pattern. Mining frequent patterns leads to the discovery
of interesting associations and correlations within data.
Association Analysis
It is a way of identifying the relation between various items. Association
Analysis is a functionality of data mining. It relates two or more attributes of
the data. It discovers the relationship between the data and the rules that
are binding them. It is also known as Market Basket Analysis for its wide use
in retail sales.
EX: Suppose that, as a marketing manager at All Electronics, you want to
know which items are frequently purchased together (i.e., within the same
transaction). An example of such a rule, mined from the All Electronics
transactional database, is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence =
50%],
Where X is a variable representing a customer. A confidence, or certainty, of
50% means that if a customer buys a computer, there is a 50% chance that
she will buy software as well. A 1% support means that 1% of all the
transactions under analysis show that computer and software are purchased
together. This association rule involves a single attribute or predicate (i.e.,
buys) that repeats. Association rules that contain a single predicate are
referred to as single-dimensional association rules. Dropping the predicate
notation, the rule can be written simply as "computer ⇒ software [1%, 50%]".
Suppose, instead, that we are given the All Electronics relational database
related to purchases. A data mining system may find association rules like
age(X, “20...29”) ∧ income(X, “40K...49K”) ⇒ buys(X, “laptop”)
[support = 2%, confidence = 60%].
The rule indicates that of the All Electronics customers under study, 2% are
20 to 29 years old with an income of $40,000 to $49,000 and have
purchased a laptop (computer) at All Electronics. There is a 60% probability
that a customer in this age and income group will purchase a laptop. Note
that this is an association involving more than one attribute or predicate (i.e.,
age, income, and buys). Adopting the terminology used in multidimensional
databases, where each attribute is referred to as a dimension, the above rule
can be referred to as a multidimensional association rule.
Correlation Analysis: Correlation analysis is a mathematical technique that
shows how strongly a pair of attributes is related.
Example: Taller people tend to have more weight.
3. Classification and Regression for predictive Analysis
Classification is the process of finding a model (or function) that describes
and distinguishes data classes or concepts. The model is derived based on
the analysis of a set of training data (i.e., data objects for which the class
labels are known). It uses methods like IF-THEN, Decision trees or Neural
networks to predict a class or essentially classify a collection of items.
Classification is a supervised learning technique used to categorize data into
predefined classes or labels. A decision tree is a flowchart-like tree
structure, where each node denotes a test on an attribute value, each
branch represents an outcome of the test, and tree leaves represent classes
or class distributions. A neural network can be used to create a model that
learns to recognize patterns in the data. Regression analysis is a statistical
methodology that is most often used for numeric prediction.
Example:
Fig: IF-THEN Rule
Fig: Decision tree
Fig: Neural Networks
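As a small illustration of classification, the sketch below (assuming scikit-learn is available) trains a decision tree on labelled training data and then predicts class labels for unseen records. The iris data set merely stands in for any labelled data; it is not part of the notes above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Training data with known class labels (the iris data set stands in for any labelled data)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn a decision tree model from the training data
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Use the model to classify (predict the class labels of) unseen data
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))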
Prediction
Finding missing data in a database is very important for the accuracy of the
analysis. Prediction is one of the data mining functionalities that helps the
analyst estimate missing numeric values. If the missing value is a class label,
classification is used instead. Prediction is very important in business
intelligence and is very popular. One of the methods is to estimate missing
or unavailable data using prediction analysis.
Example:
4. Cluster Analysis
Clustering is an unsupervised learning technique that groups similar data
points together based on their features. The goal is to identify underlying
structures or patterns in the data. Some common clustering algorithms
include K-means, hierarchical clustering, and DBSCAN.
This data mining functionality is similar to classification, but in this case the
class labels are unknown. Similar objects are grouped into a cluster, and
objects in different clusters are highly dissimilar to one another.
Example1:
Example2:
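As a small illustration of clustering, the following sketch (assuming scikit-learn and NumPy are available) groups unlabelled points into two clusters with K-means; the points themselves are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled points: hypothetical (income, spending score) pairs with no class labels
points = np.array([[15, 39], [16, 81], [17, 6], [18, 77],
                   [90, 10], [88, 17], [86, 95], [87, 75]])

# Group similar points into 2 clusters based only on their features
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels:", kmeans.labels_)
print("Cluster centers:", kmeans.cluster_centers_)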
5.Outlier Analysis
When data appears that cannot be grouped into any of the classes, we use
outlier analysis. There will be occurrences of data that will have different
attributes/features to any of the other classes or clusters. These outstanding
data are called outliers. They are usually considered noise or exceptions, and
the analysis of these outliers is called outlier mining.
Outlier analysis is important to understand the quality of data. If there are
too many outliers, you cannot trust the data or draw patterns out of it.
Example1:
Example2:
4. INTERESTINGNESS PATTERNS
A data mining system has the potential to generate thousands or even
millions of patterns, or rules. Then “are all of the patterns
interesting?” Typically, not—only a small fraction of the patterns
potentially generated would be of interest to any given user.
This raises some serious questions for data mining. You may wonder,
1. What makes a pattern interesting?
2. Can a data mining system generate all the interesting patterns?
3. Can a data mining system generate only interesting patterns?
To answer the first question, a pattern is interesting if it is
1. easily understood by humans,
2. valid on new or test data with some degree of certainty,
3. potentially useful, and
4. Novel.
The second question, "Can a data mining system generate all the
interesting patterns?", refers to the completeness of a data mining
algorithm. It is often unrealistic and inefficient for data mining systems to
generate all the possible patterns. Instead, user-provided constraints and
interestingness measures should be used to focus the search. A data mining
algorithm is complete if it mines all interesting patterns.
Finally, the third question, "Can a data mining system generate only
interesting patterns?", is an optimization problem in data mining. It is
highly desirable for data mining systems to generate only interesting
patterns. An interesting pattern represents knowledge.
5. CLASSIFICATION OF DATA MINING SYSTEMS
Data Mining is considered as an interdisciplinary field. It includes a set of
various disciplines such as statistics, database systems, machine learning,
visualization, and information sciences. Classification of the data mining
system helps users to understand the system and match their requirements
with such systems.
Data mining discovers patterns and extracts useful information from large
datasets. Organizations need to analyze and interpret data using data
mining systems as data grows rapidly. With an exponential increase in data,
active data analysis is necessary to make sense of it all.
Data mining (DM) systems can be classified based on various factors.
Classification based on Types of Data Mined
Classification based on Type of knowledge Mined
Classification based on Type of Technique Utilized
Classification based on Application Domain
1. Classification based on Types of Data Mined:
A data mining system can be classified based on the type of data mined, the
data model used, or the application of the data.
For example: relational databases, transactional databases, multimedia
databases, textual data, the World Wide Web (WWW), etc.
2. Classification based on Type of knowledge Mined:
We can classify a data mining system according to the kind of knowledge
mined. It means the data mining system is classified based on functionalities
such as
Association Analysis
Classification
Prediction
Cluster Analysis
Characterization
Discrimination
3. Classification based on Type of Technique Utilized:
We can classify a data mining system according to the kind of techniques
used. We can describe these techniques according to the degree of user
interaction involved or the methods of analysis employed.
Data mining systems use various techniques, including Statistics, Machine
Learning, Database Systems, Information retrieval, Visualization, and pattern
recognition.
4. Classification based on Application Domain:
We can classify a data mining system according to the applications adapted.
These applications are as follows
Finance
Telecommunications
E-Commerce
Medical Sector
Stock Markets
6. DATA MINING TASK PRIMITIVES
A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during the mining
process to discover interesting patterns.
Here is the list of Data Mining Task Primitives
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
1. Set of task relevant data to be mined
This specifies the portions of the database or the set of data in which the
user is interested.
This portion includes the following
Database Attributes
Data Warehouse dimensions of interest
For example, suppose that you are a manager of All Electronics in charge of
sales in the United States and Canada. You would like to study the buying
trends of customers in Canada. Rather than mining the entire database, you
can specify that only the data relating to Canadian customer purchases,
along with the customer and item attributes of interest, be retrieved.
These are referred to as relevant attributes.
2. Kind of knowledge to be mined
This specifies the data mining functions to be performed, such as
Characterization& Discrimination
Association
Classification
Clustering
Prediction
Outlier analysis
For instance, if studying the buying habits of customers in Canada, you may
choose to mine associations between customer profiles and the items that
these customers like to buy.
3. Background knowledge to be used in discovery process
Users can specify background knowledge, or knowledge about the domain to
be mined. This knowledge is useful for guiding the knowledge discovery
process and for evaluating the patterns found, for example user beliefs about
relationships in the data.
There are several kinds of background knowledge. Concept hierarchies are a
popular form of background knowledge, which allow data to be mined at
multiple levels of abstraction.
Example: A concept hierarchy for the attribute (or dimension) age might
generalize the raw numeric values into ranges such as 20 to 39, 40 to 59, and
60 to 89, which in turn generalize to the concepts young, middle-aged, and
senior. The root node represents the most general abstraction level, denoted
as all.
4. Interestingness measures and thresholds for pattern evaluation
The Interestingness measures are used to separate interesting and
uninteresting patterns from the knowledge. They may be used to guide the
mining process, or after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.
For example, interesting measures for association rules include support and
confidence.
5. Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed.
Users can choose from different forms for knowledge presentation, such as
Rules, tables, reports, charts, graphs, decision trees, and cubes.
7. INTEGRATION OF DATA MINING
SYSTEM WITH A DATA WAREHOUSE
The data mining system is integrated with a database or data
warehouse system so that it can do its tasks in an effective mode. A data
mining system operates in an environment that needs to communicate with
other data systems like a Database or Data warehouse system.
There are different possible integration (coupling) schemes as follows:
No Coupling
Loose Coupling
Semi-Tight Coupling
Tight Coupling
No Coupling
No coupling means that a Data Mining system will not utilize any
function of a Data Base or Data Warehouse system.
It may fetch data from a particular source (such as a file system), process
data using some data mining algorithms, and then store the mining results in
another file.
Drawbacks of No Coupling
First, without using a Database/Data Warehouse system, a Data Mining
system may spend a substantial amount of time finding, collecting,
cleaning, and transforming data.
Second, there are many tested, scalable algorithms and data
structures implemented in Database and Data Warehouse systems;
without coupling, a Data Mining system cannot take advantage of them.
Loose Coupling
In this loose coupling, the data mining system uses some facilities /
services of a database or data warehouse system. The data is fetched from
a data repository managed by these (DB/DW) systems.
Data mining approaches are used to process the data and then the
processed data is saved either in a file or in a designated area in a database
or data warehouse.
Loose coupling is better than no coupling because it can fetch any portion of
data stored in Databases or Data Warehouses by using query processing,
indexing, and other system facilities.
Drawbacks of Loose Coupling
It is difficult for loose coupling to achieve high scalability and good
performance with large data sets.
Semi-Tight Coupling
Semi tight coupling means that besides linking a Data Mining system to a
Data Base/Data Warehouse system, efficient implementations of a few
essential data mining primitives can be provided in the DB/DW system.
These primitives can include sorting, indexing, aggregation, histogram
analysis, multi way join, and precomputation of some essential statistical
measures, such as sum, count, max, min, and standard deviation.
Advantage of Semi-Tight Coupling
This Coupling will enhance the performance of Data Mining systems
Tight Coupling
Tight coupling means that a Data Mining system is smoothly
integrated into the Data Base/Data Warehouse system. The data mining
subsystem is treated as one functional component of information system.
Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods
of a DB or DW system.
8. MAJOR ISSUES IN DATA MINING
Data mining, the process of extracting knowledge from data, has become
increasingly important as the amount of data generated by individuals,
organizations, and machines has grown exponentially. Data mining is not an
easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various
heterogeneous data sources.
The above factors may lead to some issues in data mining. These issues are
mainly divided into three categories, which are given below:
1. Mining Methodology and User Interaction
2. Performance Issues
3. Diverse Data Types Issues
Mining Methodology and User Interaction
It refers to the following kinds of issues
Mining different kinds of knowledge in databases − Different
users may be interested in different kinds of knowledge. Therefore, it is
necessary for data mining to cover a broad range of knowledge
discovery tasks.
Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because it allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
Data mining query languages and ad hoc data mining − Data
Mining Query language that allows the user to describe ad hoc mining
tasks, should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once
the patterns are discovered, they need to be expressed in high-level
languages and visual representations. These representations should
be easily understandable.
Handling noisy or incomplete data − Data cleaning methods
are required to handle noise and incomplete objects while mining
data regularities. Without such methods, the accuracy of the
discovered patterns will be poor.
Pattern evaluation − The patterns discovered may be uninteresting
because they represent common knowledge or lack novelty, so
interestingness measures are needed to evaluate them.
Performance Issues
There can be performance-related issues such as follows
Efficiency and scalability of data mining algorithms − In order to
effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, the wide distribution of data, and
the complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms. These algorithms
divide the data into partitions, which are processed in parallel.
Incremental algorithms update the mined knowledge without mining
the entire data again from scratch.
Diverse Data Types Issues
Handling of relational and complex types of data − The database
may contain complex data objects, multimedia data objects, spatial
data, temporal data, etc. It is not possible for one system to mine all
these kinds of data.
Mining information from heterogeneous databases and global
information systems − The data is available at different data sources
on a LAN or WAN. These data sources may be structured, semi-structured,
or unstructured. Therefore, mining knowledge from them adds
challenges to data mining.
9. DATA PREPROCESSING
What is Data Preprocessing?
Data preprocessing is a crucial step in data mining. It involves transforming
raw data into a clean, structured, and suitable format for mining. Proper data
preprocessing helps improve the quality of the data, enhances the
performance of algorithms, and ensures more accurate and reliable results.
Why Preprocess the Data?
In the real world, many databases and data warehouses
have noisy, missing, and inconsistent data due to their huge size. Low-quality
data leads to low-quality mining results.
Noisy: Containing errors or outliers. E.g., Salary = “-10”
Noisy data may come from
Human or computer error at data entry.
Errors in data transmission.
Missing: lacking certain attribute values or containing only aggregate
data. E.g., Occupation = “”
Missing (incomplete) data may come from:
“Not applicable” data value when collected.
Human/hardware/software problems.
Inconsistent: Data inconsistency meaning is that different versions of the
same data appear in different places. For example, the ZIP code is saved in
one table as 1234-567 numeric data format; while in another table it may
be represented in 1234567.
Inconsistent data may come from
Errors in data entry.
Merging data from different sources with varying formats.
Differences in the data collection process.
Data preprocessing is used to improve the quality of data and mining results.
And the goal of data preprocessing is to enhance the accuracy, efficiency,
and reliability of data mining algorithms.
Major Tasks in Data Preprocessing
Data preprocessing is an essential step in the knowledge discovery process,
because quality decisions must be based on quality data. And Data
Preprocessing involves Data Cleaning, Data Integration, Data Reduction and
Data Transformation.
Steps in Data Preprocessing
1. Data Cleaning
Data cleaning is a process that "cleans" the data by filling in missing
values, smoothing noisy data, identifying and removing outliers, and
removing inconsistencies in the data.
If users believe the data are dirty, they are unlikely to trust the results of any
data mining that has been applied.
Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values, smooth
out noise while identifying outliers, and correct inconsistencies in the data.
Missing Values
Imagine that you need to analyze All Electronics sales and customer data.
You note that many tuples have no recorded value for several attributes
such as customer income. How can you go about filling in the missing values
for this attribute? There are several methods to fill the missing values.
Those are,
a. Ignore the tuple: This is usually done when the class label is missing
(classification). This method is not very effective, unless the tuple
contains several attributes with missing values.
b. Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many
missing values.
c. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant such as a label like
“Unknown” or “- ∞ “.
d. Use the attribute mean or median to fill in the missing
value: Replace all missing values in the attribute by the mean or
median of that attribute values.
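The sketch below shows how methods (a), (c), and (d) might look in Python with pandas; the customer data and the constant used for (c) are hypothetical.

import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({"customer": ["A", "B", "C", "D"],
                   "income": [35000.0, None, 52000.0, None]})

# (a) Ignore the tuple: drop rows that have any missing value
dropped = df.dropna()

# (c) Use a global constant to fill in the missing value (here -1 stands in for "Unknown")
filled_constant = df["income"].fillna(-1)

# (d) Use the attribute mean or median to fill in the missing value
filled_mean = df["income"].fillna(df["income"].mean())      # mean of 35000 and 52000 = 43500
filled_median = df["income"].fillna(df["income"].median())  # median of 35000 and 52000 = 43500
print(filled_mean.tolist())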
Noisy Data:
Noise is a random error or variance in a measured variable. Data smoothing
techniques are used to eliminate noise and extract the useful patterns. The
different techniques used for data smoothing are:
a. Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into several “buckets,” or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing.
There are three kinds of binning. They are:
o Smoothing by Bin Means: In this method, each value in a bin is
replaced by the mean value of the bin. For example, the mean of
the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9.
o Smoothing by Bin Medians: In this method, each value in a bin is
replaced by the median value of the bin. For example, the
median of the values 4, 8, and 15 in Bin 1 is 8. Therefore, each
original value in this bin is replaced by the value 8.
o Smoothing by Bin Boundaries: In this method, the minimum and
maximum values in each bin are identified as the bin boundaries.
Each bin value is then replaced by the closest boundary value.
For example, in Bin 1 the boundaries are 4 and 15; the middle
value 8 is closer to 4, so it is replaced by 4.
Example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
b. Regression: Data smoothing can also be done by regression, a
technique that fits data values to a function and is used to predict
numeric values in a given data set. It analyses the relationship between
a target (dependent) variable and its predictor (independent) variables.
A small Python sketch covering binning and regression smoothing is
given after this list.
o Regression is a form of a supervised machine learning technique
that tries to predict any continuous valued attribute.
o Regression done in two ways; Linear regression involves finding
the “best” line to fit two attributes (or variables) so that one
attribute can be used to predict the other. Multiple linear
regression is an extension of linear regression, where more than
two attributes are involved and the data are fit to a
multidimensional surface.
c. Clustering: It helps in identifying outliers. Similar values are
organized into clusters, and values that fall outside the clusters
are known as outliers.
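The following Python sketch (assuming NumPy is available) reproduces the binning results of the worked price example above and then smooths the same values with linear regression; the predictor values used for the regression step are hypothetical.

import numpy as np

# Smoothing by binning: reproduce the worked price example above
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]   # equal-frequency bins of size 3
for b in bins:
    means = [round(float(np.mean(b)))] * len(b)               # smoothing by bin means
    medians = [int(np.median(b))] * len(b)                    # smoothing by bin medians
    low, high = b[0], b[-1]                                    # bin boundaries
    boundaries = [low if v - low <= high - v else high for v in b]  # smoothing by bin boundaries
    print(b, "->", means, medians, boundaries)

# Smoothing by (linear) regression: fit the "best" line and replace noisy values with fitted ones
x = np.arange(1, 10)                          # hypothetical predictor values
y = np.array(prices, dtype=float)             # treat the sorted prices as the noisy variable
slope, intercept = np.polyfit(x, y, deg=1)
smoothed = slope * x + intercept
print(np.round(smoothed, 1))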
2. Data Integration
Data integration is the process of combining data from multiple sources into
a single, unified view. This process involves identifying and accessing the
different data sources, mapping the data to a common format. Different data
sources may include multiple data cubes, databases, or flat files.
The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, in order to gain a more
complete and accurate understanding of the data.
Data integration strategy is typically described using a triple (G, S, M)
approach, where G denotes the global schema, S denotes the schema of the
heterogeneous data sources, and M represents the mapping between the
queries of the source and global schema.
Example: To understand the (G, S, M) approach, let us consider a data
integration scenario that aims to combine employee data from two different
HR databases, database A and database B. The global schema (G) would
define the unified view of employee data, including attributes like Employee
ID, Name, Department, and Salary.
In the schema of heterogeneous sources, database A (S1) might have
attributes like EmpID, FullName, Dept, and Pay, while database B's schema
(S2) might have attributes like ID, EmployeeName, DepartmentName, and
Wage. The mappings (M) would then define how the attributes in S1 and S2
map to the attributes in G, allowing for the integration of employee data
from both systems into the global schema.
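A minimal sketch of the (G, S, M) idea, assuming pandas: the two source extracts and their column names follow the hypothetical HR example above, and the two dictionaries play the role of the mappings M from the source schemas S1 and S2 to the global schema G.

import pandas as pd

# Hypothetical extracts from the two HR databases described above
db_a = pd.DataFrame({"EmpID": [1], "FullName": ["Asha"], "Dept": ["IT"], "Pay": [50000]})
db_b = pd.DataFrame({"ID": [2], "EmployeeName": ["Ravi"], "DepartmentName": ["HR"], "Wage": [42000]})

# M: mappings from each source schema (S1, S2) to the global schema G
map_a = {"EmpID": "EmployeeID", "FullName": "Name", "Dept": "Department", "Pay": "Salary"}
map_b = {"ID": "EmployeeID", "EmployeeName": "Name", "DepartmentName": "Department", "Wage": "Salary"}

# Apply the mappings and build the unified view under the global schema G
unified = pd.concat([db_a.rename(columns=map_a), db_b.rename(columns=map_b)], ignore_index=True)
print(unified)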
Issues in Data Integration
There are several issues that can arise when integrating data from multiple
sources, including:
a. Data Quality: Data from different sources may have varying levels of
accuracy, completeness, and consistency, which can lead to data
quality issues in the integrated data.
b. Data Semantics: Integrating data from different sources can be
challenging because the same data element may have different
meanings across sources.
c. Data Heterogeneity: Different sources may use different data formats,
structures, or schemas, making it difficult to combine and analyze the
data.
3. Data Reduction
Imagine that you have selected data from the All Electronics data warehouse
for analysis. The data set will likely be huge! Complex data analysis and
mining on huge amounts of data can take a long time, making such analysis
impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the
integrity of the original data. That is, mining on the reduced data set should
be more efficient yet produce the same (or almost the same) analytical
results.
In simple words, Data reduction is a technique used in data mining to reduce
the size of a dataset while still preserving the most important information.
This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of
irrelevant or redundant information.
There are several different data reduction techniques that can be used in
data mining, including:
a. Data Sampling: This technique involves selecting a subset of the data
to work with, rather than using the entire dataset. This can be useful
for reducing the size of a dataset while still preserving the overall
trends and patterns in the data.
b. Dimensionality Reduction: This technique involves reducing the
number of features in the dataset, either by removing features that are
not relevant or by combining multiple features into a single feature.
c. Data compression: This is the process of altering, encoding, or
transforming the structure of data in order to save space. By reducing
duplication and encoding data in binary form, data compression
creates a compact representation of information. It involves techniques
such as lossy or lossless compression to reduce the size of a dataset.
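A small sketch of sampling and a crude form of dimensionality reduction with pandas and NumPy; the data set and the decision that attribute f3 is irrelevant are hypothetical.

import numpy as np
import pandas as pd

# Hypothetical large data set: 100,000 rows with three numeric attributes
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100_000, 3)), columns=["f1", "f2", "f3"])

# (a) Data sampling: work with a 1% random sample instead of the full data set
sample = df.sample(frac=0.01, random_state=0)

# (b) Dimensionality reduction: drop an attribute judged irrelevant to the analysis
reduced = sample[["f1", "f2"]]

print(df.shape, "->", reduced.shape)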
4. Data Transformation
Data transformation in data mining refers to the process of converting raw
data into a format that is suitable for analysis and modelling. The goal of
data transformation is to prepare the data for data mining so that it can be
used to extract useful insights and knowledge.
Data transformation typically involves several steps, including:
1. Smoothing: It is a process used to remove noise from the dataset
using techniques such as binning, regression, and clustering.
2. Attribute construction (or feature construction): New attributes
are constructed and added from the given set of attributes to help the
mining process.
3. Aggregation: Summary or aggregation operations are applied to the
data. For example, daily sales data may be aggregated to compute
monthly and annual totals.
4. Data normalization: This process scales all data variables into a small
range, such as -1.0 to 1.0 or 0.0 to 1.0 (a small sketch is given after this list).
5. Generalization: It converts low-level data attributes to high-level data
attributes using a concept hierarchy. For example, age initially in
numerical form (e.g., 22) is converted into a categorical value (young, old).
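A minimal min-max normalization sketch in plain Python; the age values are hypothetical.

# Hypothetical age values scaled into the range 0.0 to 1.0 (min-max normalization)
ages = [22, 25, 30, 45, 60]
min_age, max_age = min(ages), max(ages)
normalized = [(v - min_age) / (max_age - min_age) for v in ages]
print([round(v, 2) for v in normalized])   # [0.0, 0.08, 0.21, 0.61, 1.0]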
Method Name | Irregularity Handled | Output
Data Cleaning | Missing, noisy, and inconsistent data | Quality data before integration
Data Integration | Different data sources (data cubes, databases, or flat files) | A unified view
Data Reduction | Huge amounts of data that make analysis impractical or infeasible | A reduced data set that maintains the integrity of the original
Data Transformation | Raw data | Data prepared for data mining
UNIT - II
Association Rule Mining
Mining Frequent Patterns
Associations and correlations
Mining Methods
Mining various kinds of Association Rules
Correlation Analysis
Constraint based Association mining.
Graph Pattern Mining, SPM.
1. MINING FREQUENT PATTERNS
Mining Frequent Patterns in Data Mining
Item Set:
An item set is a collection or set of items.
Examples:
{Computer, Printer, MSOffice} is a 3-item set
{Milk, Bread} is a 2-item set
Similarly, a set of k items is called a k-item set.
Frequent patterns
These are patterns that appear frequently in a data set. Patterns may be
item sets, or sub sequences.
Example: Transaction Database (Dataset)
TID Items
T1 Bread, Coke, Milk
T2 Popcorn, Bread
T3 Bread, Egg, Milk.
T4 Egg, Bread, Coke, Milk
A set of items, such as Milk & Bread that appear together in a
transaction data set (Also called as Frequent Item set).
Frequent item set mining leads to the discovery of associations and
correlations among items in large transactional (or) relational data
sets.
Finding frequent patterns plays an essential role in mining
associations, correlations, and many other interesting relationships
among data. Moreover, it helps in data classification, clustering,
and other data mining tasks.
2. ASSOCIATIONS AND
CORRELATIONS
Association rule mining (or) frequent item set mining finds
interesting associations and relationships (correlations) in large
transactional or relational data sets.
This rule shows how frequently an item set occurs in a transaction. A
typical example is market basket analysis.
Market basket analysis is one of the key techniques used by large
retailers to show associations between items. It allows retailers to
identify relationships between the items that people buy together
frequently.
This process analyzes customer buying habits by finding associations
between the different items that customers place in their “shopping
baskets”.
The discovery of these associations can help retailers develop
marketing strategies by gaining insight into which items are frequently
purchased together by customers.
For instance, if customers are buying milk, how likely are they to also
buy bread (and what kind of bread) on the same trip to the
supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.
Understanding these buying patterns can help to increase sales in several
ways. If there is a pair of items, X and Y, which are frequently bought
together:
Both X and Y can be placed on the same shelf, so that buyers of one
item would be prompted to buy the other.
Promotional discounts could be applied to just one out of the two
items.
Advertisements on X could be targeted at buyers who purchase Y.
X and Y could be combined into a new product, such as having Y in
flavours of X.
Association rule: If there is a pair of items, X and Y, that are frequently
bought together, then the association rule is represented as X ⇒ Y.
For example, the information that customers who
purchase computers also tend to buy antivirus software at the same time
is represented as
Computer ⇒ Antivirus_Software
Measures to discover interestingness of
association rules
Association rules analysis is a technique to discover how items are
associated with each other. There are three measures to discover the
interestingness of association rules. Those are:
Support: The support of an item / item set is the number of
transactions in which the item / item set appears, divided by the total
number of transactions.
Formula:
Support(A) = (Number of transactions containing A) / N
Support(A, B) = (Number of transactions containing both A and B) / N
Where A and B are items and N is the total number of transactions.
Example: Table-1 Example Transactions
TID Items
T1 Bread, Coke, Milk
T2 Popcorn, Bread
T3 Bread, Egg, Milk.
T4 Egg, Bread, Coke, Milk
T5 Egg, Apple
Example: Support of item Coke = 2 / 5 = 0.4 = 40% (Coke appears in transactions T1 and T4).
Example: Support of item set {Bread, Milk} = 3 / 5 = 0.6 = 60% (Bread and Milk appear together in T1, T3, and T4).
Confidence: This says how likely item B is purchased when item A is
purchased, expressed as {A → B}. The confidence of (A → B) is the number of
transactions in which both A and B appear, divided by the number of
transactions in which A appears.
Formula:
Confidence(A → B) = (Number of transactions containing both A and B) / (Number of transactions containing A) = Support(A, B) / Support(A)
Example: From Table-1, the confidence of {Bread → Milk} = 3 / 4 = 0.75 = 75%.
Lift: This says how likely item B is purchased when item A is purchased,
while taking into account how popular item B is, expressed as an association
rule {A → B}. The lift is a measure used to predict the performance of an
association rule (targeting model).
If lift value is:
Greater than 1 means that item B is likely to be bought if item A is
bought,
Less than 1 means that item B is unlikely to be bought if item A is
bought,
Equals to 1 means there is no association between items (A and B).
Formula:
Lift(A → B) = Support(A, B) / (Support(A) * Support(B))
Example: From Table-1, the lift of {Bread → Milk} = 0.6 / (0.8 * 0.6) = 1.25.
The lift value is greater than 1, which means that Milk is likely to be bought
when Bread is bought.
Example: To find Support, Confidence and Lift measures on the following
transactional data set.
Table-2: Example Transactions
TID Items
T1 Bread, Milk
T2 Bread, Diaper, Burger, Eggs
T3 Milk, Diaper, Burger, Coke
T4 Bread, Milk, Diaper, Burger
T5 Bread, Milk, Diaper, Coke
Number of transactions = 5.
Support:
1 – Item Set:
Support {Bread} = 4 / 5 = 0.8 = 80%
Support {Diaper} = 4 / 5 = 0.8 = 80%
Support {Milk} = 4 / 5 = 0.8 = 80%
Support {Burger} = 3 / 5 = 0.6 = 60%
Support {Coke} = 2 / 5 = 0.4 = 40%
Support {Eggs} = 1 / 5 = 0.2 = 20%
2 – Item Set:
Support {Bread, Milk} = 3 / 5 = 0.6 = 60%
Support {Milk, Diaper} = 3 / 5 = 0.6 = 60%
Support {Milk, Burger} = 2 / 5 = 0.4 = 40%
Support {Burger, Coke} = 1 / 5 = 0.2 = 20%
Support {Milk, Eggs} = 0 / 5 = 0.0 = 0%
3 – Item Set:
Support {Bread, Milk, Diaper} = 2 / 5 = 0.4 = 40%
Support {Milk, Diaper, Burger} = 2 / 5 = 0.4 = 40%
Confidence:
Confidence {Bread → Milk} = Support {Bread, Milk} / Support {Bread} = 0.6 / 0.8 = 0.75 = 75%
Confidence {Milk → Diaper} = Support {Milk, Diaper} / Support {Milk} = 0.6 / 0.8 = 0.75 = 75%
Confidence {Burger → Milk} = Support {Milk, Burger} / Support {Burger} = 0.4 / 0.6 ≈ 0.67 = 67%
Lift:
Lift {Bread → Milk} = Support {Bread, Milk} / (Support {Bread} * Support {Milk}) = 0.6 / (0.8 * 0.8) ≈ 0.94
Lift {Milk → Diaper} = Support {Milk, Diaper} / (Support {Milk} * Support {Diaper}) = 0.6 / (0.8 * 0.8) ≈ 0.94
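The support, confidence, and lift values above can be checked with a short Python sketch over the transactions of Table-2; only the standard library is used.

# Transactions of Table-2
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Burger", "Eggs"},
    {"Milk", "Diaper", "Burger", "Coke"},
    {"Bread", "Milk", "Diaper", "Burger"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def support(items):
    # Fraction of transactions that contain every item in the set
    return sum(items <= t for t in transactions) / N

def confidence(a, b):
    # support(A and B) / support(A)
    return support(a | b) / support(a)

def lift(a, b):
    # support(A and B) / (support(A) * support(B))
    return support(a | b) / (support(a) * support(b))

print(support({"Bread", "Milk"}))        # 0.6
print(confidence({"Bread"}, {"Milk"}))   # 0.75
print(lift({"Bread"}, {"Milk"}))         # 0.6 / (0.8 * 0.8) = 0.9375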
3. MINING METHODS
The most famous story about association rule mining is the “beer and
diaper.” Researchers discovered that customers who buy diapers also tend
to buy beer. This classic example shows that there might be many
interesting association rules hidden in our daily data.
Association rules help to predict the occurrence of one item based on the
occurrences of other items in a set of transactions.
Association rules Examples
People who buy bread will also buy milk; represented as{ bread →
milk }
People who buy milk will also buy eggs; represented as { milk →
eggs }
People who buy bread will also buy jam; represented as { bread →
jam }
Association rules discover the relationship between two or more attributes.
They are mainly of the form: if antecedent, then consequent. For example, a
supermarket sees that there are 200 customers on Friday evening. Out of
the 200 customers, 100 bought chicken, and out of the 100 customers who
bought chicken, 50 also bought onions. Thus, the association rule would be:
if customers buy chicken, then they buy onions too, with a support of
50/200 = 25% and a confidence of 50/100 = 50%.
Association rule mining is a technique to identify interesting relations
between different items. Association rule mining has to:
Find all the frequent itemsets.
Generate association rules from those frequent itemsets.
There are many methods or algorithms to perform Association Rule Mining or
Frequent Itemset Mining, those are:
Apriori algorithm
FP-Growth algorithm
Apriori algorithm
The Apriori algorithm is a classic and powerful tool in data mining used to
discover frequent itemsets and generate association rules. Imagine a grocery
store database with customer transactions. Apriori can help you find out
which items frequently appear together, revealing valuable insights like:
Customers buying bread often buy butter and milk too. (Frequent
itemset)
70% of people who purchase diapers also buy baby wipes.
(Association rule)
How Apriori algorithm works:
Bottom-up Approach: Starts with finding frequent single items, then
combines them to find frequent pairs, triplets, and so on.
Apriori Property: If a smaller itemset isn't frequent, none of its larger
versions can be either. This "prunes" the search space for efficiency.
Support and Confidence: Two key measures used to define how
often an itemset appears and how strong the association between
items is.
Limitations for Apriori algorithm
Can be computationally expensive for large datasets.
Sensitive to minimum support and confidence thresholds.
FP-Growth algorithm
FP-Growth stands for Frequent Pattern Growth, and it's a smarter sibling of
the Apriori algorithm for mining frequent itemsets in data. But instead of
brute force, it uses a clever strategy to avoid generating and testing tons of
candidate sets, making it much faster and more memory-efficient.
Here's its secret weapon:
Frequent Pattern Tree (FP-Tree): This special data structure
efficiently stores the frequent item sets and their relationships. Think
of it as a compressed and organized representation of your grocery
store database.
Pattern Fragment Growth: Instead of building candidate sets, FP-
Growth focuses on "growing" smaller frequent patterns (fragments) by
adding items at their frequent ends. This avoids the costly generation
and scanning of redundant patterns.
Advantages of FP-Growth over Apriori
Faster for large datasets: No more candidate explosions, just
targeted pattern growth.
Less memory required: The compact FP-Tree minimizes memory
usage.
More versatile: Can easily mine conditional frequent patterns without
building new trees.
When to Choose FP-Growth
If you're dealing with large datasets and want faster results.
If memory limitations are a concern.
If you need to mine conditional frequent patterns.
Remember: Both Apriori and FP-Growth have their strengths and
weaknesses. Choosing the right tool depends on your specific data and
needs.
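If a library such as mlxtend is available, both algorithms can be tried in a few lines. This is only a sketch under the assumption that mlxtend's frequent_patterns module provides apriori, fpgrowth, and association_rules with their usual signatures; the transactions are hypothetical.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["diapers", "baby wipes"],
                ["diapers", "baby wipes", "beer"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with either algorithm (FP-Growth is usually faster on large data)
frequent = fpgrowth(df, min_support=0.5, use_colnames=True)   # or: apriori(df, min_support=0.5, use_colnames=True)

# Association rules above a minimum confidence threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])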
Apriori algorithm
The Apriori algorithm was the first algorithm proposed for frequent itemset mining.
It was introduced by R. Agrawal and R. Srikant.
The name of the algorithm is Apriori because it uses prior knowledge of frequent itemset
properties.
Frequent Item Set
A frequent itemset is an itemset whose support value is greater than or equal to a
threshold value (minimum support).
Apriori algorithm uses frequent itemsets to generate association rules. To improve the
efficiency of level-wise generation of frequent itemsets, an important property is used
called Apriori property which helps by reducing the search space.
Apriori Property
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Steps in Apriori algorithm
The Apriori algorithm is a sequence of steps to be followed to find the frequent itemsets
in the given database. A minimum support threshold is given in the problem or is
assumed by the user.
The steps followed in the Apriori Algorithm of data mining are:
Join Step: This step generates candidate (K+1)-itemsets from the frequent K-itemsets by
joining each K-itemset with itself.
Prune Step: This step scans the count of each candidate itemset in the database. If a
candidate itemset does not meet the minimum support, it is regarded as infrequent and
removed. This step is performed to reduce the size of the candidate itemsets.
The join and prune steps are repeated iteratively until no new frequent itemsets can be
found.
Apriori Algorithm Example
Consider the following dataset; find the frequent item sets and generate association rules for them.
Assume that the minimum support threshold is s = 50% and the minimum confidence threshold is c = 80%.
Transaction List of items
T1 I1, I2, I3
T2 I2, I3, I4
T3 I4, I5
T4 I1, I2, I4
T5 I1, I2, I3, I5
T6 I1, I2, I3, I4
Solution
Finding frequent item sets:
Support threshold=50% ⇒ 0.5*6 = 3 ⇒ min_sup = 3
Step-1:
(i) Create a table containing support count of each item present in dataset – Called C1 (candidate set).
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
(ii) Prune Step: Compare each candidate item's support count with the minimum support
count. The above table shows that item I5 does not meet min_sup = 3, so it is removed;
only I1, I2, I3, and I4 meet the min_sup count.
This gives us the following item set L1.
Item Count
I1 4
I2 5
I3 4
I4 4
Step-2:
(i) Join step: Generate the candidate set C2 (2-itemsets) using L1, and find the occurrences of each
2-itemset in the given dataset.
Item Count
I1, I2 4
I1, I3 3
I1, I4 2
I2, I3 4
I2, I4 3
I3, I4 2
(ii) Prune Step: Compare each candidate itemset's support count with the minimum support
count. The above table shows that the itemsets {I1, I4} and {I3, I4} do not meet min_sup
= 3, so they are removed.
This gives us the following item set L2.
Item Count
I1, I2 4
I1, I3 3
I2, I3 4
I2, I4 3
Step-3:
(i) Join step: Generate the candidate set C3 (3-itemsets) using L2, and find the occurrences of each
3-itemset in the given dataset.
Item Count
I1, I2, I3 3
I1, I2, I4 2
I1, I3, I4 1
I2, I3, I4 2
(ii) Prune Step: Compare each candidate itemset's support count with the minimum support
count. The above table shows that the itemsets {I1, I2, I4}, {I1, I3, I4}, and {I2, I3, I4} do not
meet min_sup = 3, so they are removed. Only the itemset {I1, I2, I3} meets the min_sup
count.
Generate Association Rules:
Thus, we have discovered all the frequent itemsets. Now we need to generate strong
association rules (rules that satisfy the minimum confidence threshold) from the frequent
itemsets. For that, we need to calculate the confidence of each rule.
The given Confidence threshold is 80%.
All possible association rules from the frequent itemset {I1, I2, I3} are:
{I1, I2} ⇒ {I3}: Confidence = support{I1, I2, I3} / support{I1, I2} = (3/4) * 100 = 75% (Rejected)
{I1, I3} ⇒ {I2}: Confidence = support{I1, I2, I3} / support{I1, I3} = (3/3) * 100 = 100% (Selected)
{I2, I3} ⇒ {I1}: Confidence = support{I1, I2, I3} / support{I2, I3} = (3/4) * 100 = 75% (Rejected)
{I1} ⇒ {I2, I3}: Confidence = support{I1, I2, I3} / support{I1} = (3/4) * 100 = 75% (Rejected)
{I2} ⇒ {I1, I3}: Confidence = support{I1, I2, I3} / support{I2} = (3/5) * 100 = 60% (Rejected)
{I3} ⇒ {I1, I2}: Confidence = support{I1, I2, I3} / support{I3} = (3/4) * 100 = 75% (Rejected)
This shows that the association rule {I1, I3} ⇒ {I2} is strong if minimum confidence
threshold is 80%.
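The join-and-prune loop of the worked example can be reproduced with a short, unoptimized Python sketch; it prints the frequent itemsets L1, L2, and L3 found above together with their support counts.

from itertools import combinations

# Transactions from the worked example above; min_sup = 3 (50% of 6 transactions)
transactions = [
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"},
]
min_sup = 3

def count(itemset):
    # Support count: number of transactions containing the itemset
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if count({i}) >= min_sup}]

# Join and prune until no new frequent itemsets are found
k = 1
while frequent[-1]:
    prev = frequent[-1]
    # Join step: build candidate (k+1)-itemsets from frequent k-itemsets
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Prune step: all k-subsets must be frequent and the candidate must meet min_sup
    frequent.append({c for c in candidates
                     if all(frozenset(s) in prev for s in combinations(c, k))
                     and count(c) >= min_sup})
    k += 1

for level in frequent:
    for itemset in level:
        print(sorted(itemset), count(itemset))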
Exercise1: Apriori Algorithm
TID Items
T1 I1, I2, I5
T2 I2, I4
T3 I2, I3
T4 I1, I2, I4
T5 I1, I3
T6 I2, I3
T7 I1, I3
T8 I1, I2, I3, I5
T9 I1, I2, I3
Consider the above dataset; find the frequent item sets and generate association rules
for them. Assume that the minimum support count is 2 and the minimum confidence
threshold is c = 50%.
Exercise2: Apriori Algorithm
TID Items
T1 {milk, bread}
T2 {bread, sugar}
T3 {bread, butter}
T4 {milk, bread, sugar}
T5 {milk, bread, butter}
T6 {milk, bread, butter}
T7 {milk, sugar}
T8 {milk, sugar}
T9 {sugar, butter}
T10 {milk, sugar, butter}
T11 {milk, bread, butter}
Consider the above dataset; find the frequent item sets and generate association rules
for them. Assume that the minimum support count is 3 and the minimum confidence
threshold is c = 60%.