DATA MINING NOTES
SYLLABUS
Pre-Requisites:
A course on “Database Management Systems”
Knowledge of probability and statistics
Course Objectives:
It presents methods for mining frequent patterns, associations, and correlations.
It then describes methods for data classification and prediction, and data–clustering
approaches.
It covers mining various types of data stores such as spatial, textual, multimedia, streams.
Course Outcomes:
Ability to understand the types of the data to be mined and present a general classification
of tasks and primitives to integrate a data mining system.
Apply preprocessing methods for any given raw data.
Extract interesting patterns from large amounts of data.
Discover the role played by data mining in various fields.
Choose and employ suitable data mining algorithms to build analytical applications
Evaluate the accuracy of supervised and unsupervised models and algorithms.
UNIT - I Data Mining: Data – Types of Data – Data Mining Functionalities – Interestingness Patterns – Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data Warehouse – Major Issues in Data Mining – Data Preprocessing.
UNIT - II Association Rule Mining: Mining Frequent Patterns–Associations and correlations –
Mining Methods– Mining Various kinds of Association Rules– Correlation Analysis– Constraint
based Association mining. Graph Pattern Mining, SPM.
UNIT - III Classification: Classification and Prediction – Basic concepts–Decision tree
induction–Bayesian classification, Rule–based classification, Lazy learner.
UNIT - IV Clustering and Applications: Cluster analysis–Types of Data in Cluster Analysis–
Categorization of Major Clustering Methods– Partitioning Methods, Hierarchical Methods–
Density–Based Methods, Grid–Based Methods, Outlier Analysis.
UNIT - V Advanced Concepts: Basic concepts in mining data streams–Mining Time–series
data––Mining sequence patterns in Transactional databases– Mining Object– Spatial–
Multimedia–Text and Web data – Spatial Data mining– Multimedia Data mining–Text Mining–
Mining the World Wide Web.
TEXT BOOKS: 1. Data Mining – Concepts and Techniques – Jiawei Han & Micheline Kamber,
3rd Edition Elsevier. 2. Data Mining Introductory and Advanced topics – Margaret H Dunham,
PEA.
UNIT-I
1. DATA MINING
DEFINITION 1: Data mining is the procedure of extracting useful
information from huge sets of data; it is also described as mining
knowledge from data.
DEFINITION 2: Data mining is the process of discovering interesting
patterns and knowledge from large amounts of data. The data sources
can include databases, data warehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
Data mining is also known as knowledge discovery from data, or
KDD.
Knowledge Discovery from Data (KDD):
The need for data mining is to extract useful information from large datasets
and use it for prediction and better decision-making. Nowadays, data
mining is used in almost every place where a large amount of data is stored
and processed.
Examples include the banking sector, market basket analysis, and network
intrusion detection. Data mining is also known as Knowledge Discovery from
Data, or KDD.
Knowledge Discovery from Data (KDD) Process:
KDD is a process that involves the extraction of useful, previously unknown,
and potentially valuable information from large datasets. The KDD process is
iterative and usually requires multiple passes over the following steps to
extract accurate knowledge from the data.
The following steps are included in KDD process:
1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Representation
Data Cleaning: Data cleaning is the removal of noisy, irrelevant, and
inconsistent data from the data collection. It handles missing values and
smooths noisy data, where noise is a random error or variance in a measured
variable. In this step, noise and inconsistent data are removed.
Data Integration: Data integration combines heterogeneous data from
multiple data sources into a common source (a data warehouse); i.e., in this
step, multiple data sources may be combined into a single data source.
A popular trend in the information industry is to perform data
cleaning and data integration as a data preprocessing step, where the
resulting data are stored in a data warehouse.
Data Selection: Data selection is defined as the process where data
relevant to the analysis is decided and retrieved from the data collection.
This step in the KDD process is identifying and selecting the relevant data for
analysis.
Data Transformation:
Data Transformation is defined as the process of transforming data into
appropriate form required by mining procedure. This step involves reducing
the data dimensionality, aggregating the data, normalizing it, and
discretizing it to prepare it for further analysis.
Data Mining:
This is the heart of the KDD process and involves applying various data
mining techniques to the transformed data to discover hidden patterns,
trends, relationships, and insights. A few of the most common data mining
techniques include clustering, classification, association rule mining, and
anomaly detection.
Pattern Evaluation:
After the data mining, the next step is to evaluate the discovered patterns to
determine their usefulness and relevance. This involves assessing the quality
of the patterns, evaluating their significance, and selecting the most
promising patterns for further analysis.
Knowledge Representation:
This step involves representing the knowledge extracted from the data in a
way humans can easily understand and use. This can be done through
visualizations, reports, or other forms of communication that provide
meaningful insights into the data.
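To make the flow of these steps concrete, the following minimal Python sketch (assuming the pandas library is available) walks a small, entirely hypothetical sales data set through cleaning, integration, selection, transformation, a trivial mining step, and reporting. The column names and values are invented for illustration only and do not come from the text above.

import pandas as pd

# Hypothetical raw data: two small sources standing in for a sales database and a customer file
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                      "amount": [1200, 300, None, 7500]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "age": [34, 51, 46],
                          "income": [40000, None, 90000]})

# 1. Data cleaning: fill in missing values
sales["amount"] = sales["amount"].fillna(sales["amount"].median())
customers["income"] = customers["income"].fillna(customers["income"].median())

# 2. Data integration: combine the two sources into one view
data = sales.merge(customers, on="customer_id")

# 3. Data selection: keep only the attributes relevant to the analysis
relevant = data[["age", "income", "amount"]]

# 4. Data transformation: normalize each attribute into the range 0.0 to 1.0
normalized = (relevant - relevant.min()) / (relevant.max() - relevant.min())

# 5. Data mining (a trivial stand-in): find purchases that are unusually large
pattern = data[data["amount"] > data["amount"].mean() + data["amount"].std()]

# 6-7. Pattern evaluation and knowledge representation: report the finding
print(pattern[["customer_id", "amount"]])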
2. TYPES OF DATA
What Kinds of Data Can Be Mined
As a general technology, data mining can be applied to any kind of data as
long as the data are meaningful for a target application.
The following are the most basic forms of data for mining.
Basic forms of data for mining
Database Data (or) Relational database
Data warehouse data
Transactional data
Other forms of data for mining
Multimedia Database
Spatial Database
World Wide Web
Text data (Flat File)
Time series database
Database Data (or) Relational database
A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set
of software programs to manage and access the data.
A relational database is a collection of tables, each of which is assigned a
unique name. Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows). Each tuple in a
relational table represents an object identified by a unique key and described
by a set of attribute values.
Example:
Data warehouse data
A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single site.
Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading, and periodic data refreshing.
A data warehouse is defined as the collection of data integrated from
multiple sources. Later this data can be mined for decision making.
A data warehouse is usually modelled by a multidimensional data structure,
called a data cube, in which each dimension corresponds to an attribute or a
set of attributes in the schema, and each cell stores the value of some
aggregate measure such as count or sum. A data cube provides a
multidimensional view of data and allows the precomputation and fast
access of summarized data.
Example:
Transactional data
A transactional database is a collection of records organized by time stamps,
dates, etc., where each record represents a transaction. In general, each record
in a transactional database captures a transaction, such as a customer's
purchase, a flight booking, or a user's clicks on a web page.
A transaction typically includes a unique transaction identity number (trans
ID) and a list of the items making up the transaction, such as the items
purchased in the transaction.
This type of database has the capability to roll back or undo an operation when
a transaction is not completed or committed, and it follows the ACID properties
of a DBMS.
Example:
TID Items
T1 Bread, Coke, Milk
T2 Popcorn, Bread
T3 Popcorn, Coke, Egg, Milk
T4 Popcorn, Bread, Egg, Milk
T5 Coke, Egg, Milk
Fig: Transactional data
Multimedia database
The multimedia databases are used to store multimedia data such as
images, animation, audio, video along with text. This data is stored in the
form of multiple file types
like .txt(text), .jpg(images), .swf(videos), .mp3(audio) etc.
Spatial database
A spatial database is a database that is enhanced to store and access spatial
data or data that defines a geometric space. These data are often associated
with geographic locations and features, or constructed features like cities.
Data on spatial databases are stored as coordinates, points, lines, polygons
and topology.
World Wide Web
The World Wide Web is a collection of documents and resources such as
audio, video, and text. These resources are identified by URLs and linked
together through HTML pages. Online shopping, job hunting, and research
are some of its uses.
It is the most heterogeneous repository, as it collects data from many
different sources, and it is dynamic in nature because the volume of data is
continuously increasing and changing.
Text data (Flat File)
Flat files are a type of structured data that are stored in a plain text format.
They are called “flat” because they have no hierarchical structure, unlike a
relational database table. Flat files typically consist of rows and columns of
data, with each row representing a single record and each column
representing a field or attribute within that record. They can be stored in
various formats such as CSV, tab-separated values (TSV) and fixed-width
format.
Flat files are data files in text or binary form with a structure that can be
easily extracted by data mining algorithms. Data stored in flat files have no
relationships or paths among them; for example, if a relational database is
stored in flat files, there will be no relations between the tables.
Example:
Time series database: Time-series data is a sequence of data points
collected over time intervals, allowing us to track changes over time. Time-
series data can track changes over milliseconds, days, or even years. A time
series database (TSDB) is a database optimized for time-stamped or time
series data. Time series data are simply measurements or events that are
tracked, monitored, down sampled, and aggregated over time. This could be
server metrics, application performance monitoring, network data, sensor
data, events, clicks, trades in a market, and many other types of analytics
data.
Example:
3. DATA MINING FUNCTIONALITIES
Data mining is important because there is so much data out there, and it's
impossible for people to look through it all by themselves. Data mining uses
various functionalities to analyze the data and find patterns, trends, and
other information that would be hard for people to find on their own. Data
mining functionalities are used to specify the kinds of patterns to be found in
data mining tasks. In general, such data mining tasks can be classified into
two categories: descriptive and predictive.
Descriptive data mining
Similarities and patterns in data may be discovered using descriptive data
mining. This kind of mining focuses on transforming raw data into
information that can be used in reports and analyses. It provides certain
knowledge about the data, for instance, count, average.
It gives information about what is happening inside the data without any
previous idea. It exhibits the common features in the data. In simple words,
you get to know the general properties of the data present in the database.
Predictive data mining
These kinds of mining tasks perform inference on the current data in order to
make predictions. This helps the developers in understanding the
characteristics that are not explicitly available. For instance, the prediction of
business analysis in the next quarter with the performance of the previous
quarters. In general, the predictive analysis predicts or infers the
characteristics with the previously available data.
The following are data mining functionalities:
Class/Concept Description (Characterization and
Discrimination)
Mining Frequent Patterns, Associations and Correlation
Classification and Regression for predictive Analysis
Cluster Analysis
Outlier Analysis
1.Class/Concept Description: Characterization and Discrimination
Data is associated with classes or concepts.
Class: A collection of things sharing a common attribute
Example: Classes of items – computers and printers
Concept: An abstract or general idea derived from specific
instances.
Example: Concepts of customers – big Spenders and budget
Spenders.
It can be useful to describe individual classes and concepts in summarized,
concise, and yet precise terms. Such descriptions of a class or a concept are
called class/concept descriptions. These descriptions can be derived
using data characterization and data discrimination, or both.
Data characterization
Data characterization is a summarization of the general characteristics or
features of a target class of data. Data summarization can be done based on
statistical measures and plots. The output of data characterization can be
presented in various forms it includes pie charts, bar charts, curves, and
multidimensional data cubes.
Example: A customer relationship manager at All Electronics may order the
following data mining task: Summarize the characteristics of customers who
spend more than $5000 a year at All Electronics. The result is a general
profile of these customers, such as that they are 40 to 50 years old,
employed, and have excellent credit ratings.
Data discrimination
Data discrimination is one of the functionalities of data mining. It compares
the data between the two classes. Generally, it maps the target class with a
predefined group or class. It compares and contrasts the characteristics of
the class with the predefined class using a set of rules called discriminant
rules.
Example: A customer relationship manager at All Electronics may want to
compare two groups of customers: those who shop for computer products
regularly (e.g., more than twice a month) and those who rarely shop for such
products (e.g., less than three times a year).
The resulting description provides a general comparative profile of these
customers, such as that 80% of the customers who frequently purchase
computer products are between 20 and 40 years old and have a university
education, whereas 60% of the customers who infrequently buy such
products are either seniors or youths and have no university degree.
2. Mining Frequent Patterns, Associations and Correlation
Frequent patterns, as the name suggests, are patterns that occur frequently
in data. There are many kinds of frequent patterns, including frequent item
sets, frequent sub-sequences (also known as sequential patterns), and
frequent substructures. A frequent item set typically refers to a set of items
that often appear together in a transactional data set—for example, milk and
bread, which are frequently bought together in grocery stores by many
customers. A frequently occurring subsequence, such as the pattern that
customers tend to purchase first a laptop, followed by a digital camera, and
then a memory card, is a (frequent) sequential pattern. A substructure can
refer to different structural forms (e.g., graphs, trees, or lattices) that may be
combined with item sets or subsequences. If a substructure occurs frequently,
it is called a (frequent) structured pattern. Mining frequent patterns leads to the discovery
of interesting associations and correlations within data.
Association Analysis
It is a way of identifying the relation between various items. Association
Analysis is a functionality of data mining. It relates two or more attributes of
the data. It discovers the relationship between the data and the rules that
are binding them. It is also known as Market Basket Analysis for its wide use
in retail sales.
EX: Suppose that, as a marketing manager at All Electronics, you want to
know which items are frequently purchased together (i.e., within the same
transaction). An example of such a rule, mined from the All Electronics
transactional database, is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence =
50%],
Where X is a variable representing a customer. A confidence, or certainty, of
50% means that if a customer buys a computer, there is a 50% chance that
she will buy software as well. A 1% support means that 1% of all the
transactions under analysis show that computer and software are purchased
together. This association rule involves a single attribute or predicate (i.e.,
buys) that repeats. Association rules that contain a single predicate are
referred to as single-dimensional association rules. Dropping the predicate
notation, the rule can be written simply as "computer ⇒ software [1%, 50%]".
Suppose, instead, that we are given the All Electronics relational database
related to purchases. A data mining system may find association rules like
age(X, “20...29”) ∧ income(X, “40K...49K”) ⇒ buys(X, “laptop”)
[support = 2%, confidence = 60%].
The rule indicates that of the All Electronics customers under study, 2% are
20 to 29 years old with an income of $40,000 to $49,000 and have
purchased a laptop (computer) at All Electronics. There is a 60% probability
that a customer in this age and income group will purchase a laptop. Note
that this is an association involving more than one attribute or predicate (i.e.,
age, income, and buys). Adopting the terminology used in multidimensional
databases, where each attribute is referred to as a dimension, the above rule
can be referred to as a multidimensional association rule.
Correlation Analysis: Correlation analysis is a mathematical technique that
shows how strongly a pair of attributes is related.
Example: Taller people tend to have more weight.
3. Classification and Regression for predictive Analysis
Classification is the process of finding a model (or function) that describes
and distinguishes data classes or concepts. The model is derived based on
the analysis of a set of training data (i.e., data objects for which the class
labels are known). It uses methods like IF-THEN, Decision trees or Neural
networks to predict a class or essentially classify a collection of items.
Classification is a supervised learning technique used to categorize data into
predefined classes or labels. A decision tree is a flowchart-like tree
structure, where each node denotes a test on an attribute value, each
branch represents an outcome of the test, and tree leaves represent classes
or class distributions. A neural network can be used to create a model that
learns to recognize patterns in the data. Regression analysis is a statistical
methodology that is most often used for numeric prediction.
Example:
Fig: IF-THEN Rule
Fig: Decision tree
Fig: Neural Networks
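As a small illustration of classification, the sketch below (assuming scikit-learn is available) trains a decision tree on labelled training data and then predicts class labels for unseen records. The iris data set merely stands in for any labelled data; it is not part of the notes above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Training data with known class labels (the iris data set stands in for any labelled data)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn a decision tree model from the training data
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Use the model to classify (predict the class labels of) unseen data
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))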
Prediction
Finding missing data in a database is very important for the accuracy of the
analysis. Prediction is one of the data mining functionalities that helps the
analyst estimate missing numeric values. If the missing value is a class label,
classification is used instead. Prediction is very important in business
intelligence and is very popular. One of the methods is to estimate missing
or unavailable data using prediction analysis.
Example:
4. Cluster Analysis
Clustering is an unsupervised learning technique that groups similar data
points together based on their features. The goal is to identify underlying
structures or patterns in the data. Some common clustering algorithms
include K-means, hierarchical clustering, and DBSCAN.
This data mining functionality is similar to classification, but in this case the
class labels are unknown. Similar objects are grouped into a cluster, and
objects in different clusters are highly dissimilar to one another.
Example1:
Example2:
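As a small illustration of clustering, the following sketch (assuming scikit-learn and NumPy are available) groups unlabelled points into two clusters with K-means; the points themselves are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled points: hypothetical (income, spending score) pairs with no class labels
points = np.array([[15, 39], [16, 81], [17, 6], [18, 77],
                   [90, 10], [88, 17], [86, 95], [87, 75]])

# Group similar points into 2 clusters based only on their features
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels:", kmeans.labels_)
print("Cluster centers:", kmeans.cluster_centers_)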
5.Outlier Analysis
When data appears that cannot be grouped into any of the classes, we use
outlier analysis. There will be occurrences of data that will have different
attributes/features to any of the other classes or clusters. These outstanding
data are called outliers. They are usually considered noise or exceptions, and
the analysis of these outliers is called outlier mining.
Outlier analysis is important to understand the quality of data. If there are
too many outliers, you cannot trust the data or draw patterns out of it.
Example1:
Example2:
4. INTERESTINGNESS PATTERNS
A data mining system has the potential to generate thousands or even
millions of patterns, or rules. Then “are all of the patterns
interesting?” Typically, not—only a small fraction of the patterns
potentially generated would be of interest to any given user.
This raises some serious questions for data mining. You may wonder,
1. What makes a pattern interesting?
2. Can a data mining system generate all the interesting patterns?
3. Can a data mining system generate only interesting patterns?
To answer the first question, a pattern is interesting if it is
1. easily understood by humans,
2. valid on new or test data with some degree of certainty,
3. potentially useful, and
4. Novel.
The second question, "Can a data mining system generate all the
interesting patterns?", refers to the completeness of a data mining
algorithm. It is often unrealistic and inefficient for data mining systems to
generate all the possible patterns. Instead, user-provided constraints and
interestingness measures should be used to focus the search. A data mining
algorithm is complete if it mines all interesting patterns.
Finally, the third question, "Can a data mining system generate only
interesting patterns?", is an optimization problem in data mining. It is
highly desirable for data mining systems to generate only interesting
patterns. An interesting pattern represents knowledge.
5. CLASSIFICATION OF DATA MINING SYSTEMS
Data Mining is considered as an interdisciplinary field. It includes a set of
various disciplines such as statistics, database systems, machine learning,
visualization, and information sciences. Classification of the data mining
system helps users to understand the system and match their requirements
with such systems.
Data mining discovers patterns and extracts useful information from large
datasets. Organizations need to analyze and interpret data using data
mining systems as data grows rapidly. With an exponential increase in data,
active data analysis is necessary to make sense of it all.
Data mining (DM) systems can be classified based on various factors.
Classification based on Types of Data Mined
Classification based on Type of knowledge Mined
Classification based on Type of Technique Utilized
Classification based on Application Domain
1. Classification based on Types of Data Mined:
A data mining system can be classified based on the type of data mined, the
data model used, or the application of the data.
For example: relational databases, transactional databases, multimedia
databases, textual data, the World Wide Web (WWW), etc.
2. Classification based on Type of knowledge Mined:
We can classify a data mining system according to the kind of knowledge
mined. It means the data mining system is classified based on functionalities
such as
Association Analysis
Classification
Prediction
Cluster Analysis
Characterization
Discrimination
3. Classification based on Type of Technique Utilized:
We can classify a data mining system according to the kind of techniques
used. We can describe these techniques according to the degree of user
interaction involved or the methods of analysis employed.
Data mining systems use various techniques, including Statistics, Machine
Learning, Database Systems, Information retrieval, Visualization, and pattern
recognition.
4. Classification based on Application Domain:
We can classify a data mining system according to the applications adapted.
These applications are as follows
Finance
Telecommunications
E-Commerce
Medical Sector
Stock Markets
6. DATA MINING TASK PRIMITIVES
A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during the mining
process to discover interesting patterns.
Here is the list of Data Mining Task Primitives
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
1. Set of task relevant data to be mined
This specifies the portions of the database or the set of data in which the
user is interested.
This portion includes the following
Database Attributes
Data Warehouse dimensions of interest
For example, suppose that you are a manager of All Electronics in charge of
sales in the United States and Canada. You would like to study the buying
trends of customers in Canada. Rather than mining the entire database, you
can specify that only the data relating to Canadian customer purchases,
along with the customer and item attributes of interest, be retrieved.
These are referred to as relevant attributes.
2. Kind of knowledge to be mined
This specifies the data mining functions to be performed, such as
Characterization& Discrimination
Association
Classification
Clustering
Prediction
Outlier analysis
For instance, if studying the buying habits of customers in Canada, you may
choose to mine associations between customer profiles and the items that
these customers like to buy.
3. Background knowledge to be used in discovery process
Users can specify background knowledge, or knowledge about the domain to
be mined. This knowledge is useful for guiding the knowledge discovery
process and for evaluating the patterns found, for example user beliefs about
relationships in the data.
There are several kinds of background knowledge. Concept hierarchies are a
popular form of background knowledge, which allow data to be mined at
multiple levels of abstraction.
Example: A concept hierarchy for the attribute (or dimension) age might
generalize the raw numeric values into ranges such as 20 to 39, 40 to 59, and
60 to 89, which in turn generalize to the concepts young, middle-aged, and
senior. The root node represents the most general abstraction level, denoted
as all.
4. Interestingness measures and thresholds for pattern evaluation
The Interestingness measures are used to separate interesting and
uninteresting patterns from the knowledge. They may be used to guide the
mining process, or after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.
For example, interesting measures for association rules include support and
confidence.
5. Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed.
Users can choose from different forms for knowledge presentation, such as
Rules, tables, reports, charts, graphs, decision trees, and cubes.
7. INTEGRATION OF DATA MINING
SYSTEM WITH A DATA WAREHOUSE
The data mining system is integrated with a database or data
warehouse system so that it can do its tasks in an effective mode. A data
mining system operates in an environment that needs to communicate with
other data systems like a Database or Data warehouse system.
There are different possible integration (coupling) schemes as follows:
No Coupling
Loose Coupling
Semi-Tight Coupling
Tight Coupling
No Coupling
No coupling means that a Data Mining system will not utilize any
function of a Data Base or Data Warehouse system.
It may fetch data from a particular source (such as a file system), process
data using some data mining algorithms, and then store the mining results in
another file.
Drawbacks of No Coupling
First, without using a Database/Data Warehouse system, a Data Mining
system may spend a substantial amount of time finding, collecting,
cleaning, and transforming data.
Second, there are many tested, scalable algorithms and data
structures implemented in Database and Data Warehouse systems;
without coupling, a Data Mining system cannot take advantage of them.
Loose Coupling
In this loose coupling, the data mining system uses some facilities /
services of a database or data warehouse system. The data is fetched from
a data repository managed by these (DB/DW) systems.
Data mining approaches are used to process the data and then the
processed data is saved either in a file or in a designated area in a database
or data warehouse.
Loose coupling is better than no coupling because it can fetch any portion of
data stored in Databases or Data Warehouses by using query processing,
indexing, and other system facilities.
Drawbacks of Loose Coupling
It is difficult for loose coupling to achieve high scalability and good
performance with large data sets.
Semi-Tight Coupling
Semi tight coupling means that besides linking a Data Mining system to a
Data Base/Data Warehouse system, efficient implementations of a few
essential data mining primitives can be provided in the DB/DW system.
These primitives can include sorting, indexing, aggregation, histogram
analysis, multi way join, and precomputation of some essential statistical
measures, such as sum, count, max, min, and standard deviation.
Advantage of Semi-Tight Coupling
This Coupling will enhance the performance of Data Mining systems
Tight Coupling
Tight coupling means that a Data Mining system is smoothly
integrated into the Data Base/Data Warehouse system. The data mining
subsystem is treated as one functional component of information system.
Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods
of a DB or DW system.
8. MAJOR ISSUES IN DATA MINING
Data mining, the process of extracting knowledge from data, has become
increasingly important as the amount of data generated by individuals,
organizations, and machines has grown exponentially. Data mining is not an
easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various
heterogeneous data sources.
The above factors may lead to some issues in data mining. These issues are
mainly divided into three categories, which are given below:
1. Mining Methodology and User Interaction
2. Performance Issues
3. Diverse Data Types Issues
Mining Methodology and User Interaction
It refers to the following kinds of issues
Mining different kinds of knowledge in databases − Different
users may be interested in different kinds of knowledge. Therefore, it is
necessary for data mining to cover a broad range of knowledge
discovery tasks.
Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because it allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
Data mining query languages and ad hoc data mining − Data
Mining Query language that allows the user to describe ad hoc mining
tasks, should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once
the patterns are discovered, they need to be expressed in high-level
languages and visual representations. These representations should
be easily understandable.
Handling noisy or incomplete data − Data cleaning methods
are required to handle noise and incomplete objects while mining
data regularities. Without such methods, the accuracy of the
discovered patterns will be poor.
Pattern evaluation − The patterns discovered may be uninteresting
because they represent common knowledge or lack novelty, so
interestingness measures are needed to evaluate them.
Performance Issues
There can be performance-related issues such as follows
Efficiency and scalability of data mining algorithms − In order to
effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, the wide distribution of data, and
the complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms. These algorithms
divide the data into partitions, which are processed in parallel.
Incremental algorithms update the mined knowledge without mining
the entire data again from scratch.
Diverse Data Types Issues
Handling of relational and complex types of data − The database
may contain complex data objects, multimedia data objects, spatial
data, temporal data, etc. It is not possible for one system to mine all
these kinds of data.
Mining information from heterogeneous databases and global
information systems − The data is available at different data sources
on a LAN or WAN. These data sources may be structured, semi-structured,
or unstructured. Therefore, mining knowledge from them adds
challenges to data mining.
9. DATA PREPROCESSING
What is Data Preprocessing?
Data preprocessing is a crucial step in data mining. It involves transforming
raw data into a clean, structured, and suitable format for mining. Proper data
preprocessing helps improve the quality of the data, enhances the
performance of algorithms, and ensures more accurate and reliable results.
Why Preprocess the Data?
In the real world, many databases and data warehouses
have noisy, missing, and inconsistent data due to their huge size. Low-quality
data leads to low-quality mining results.
Noisy: Containing errors or outliers. E.g., Salary = “-10”
Noisy data may come from
Human or computer error at data entry.
Errors in data transmission.
Missing: lacking certain attribute values or containing only aggregate
data. E.g., Occupation = “”
Missing (incomplete) data may come from:
“Not applicable” data value when collected.
Human/hardware/software problems.
Inconsistent: Data inconsistency meaning is that different versions of the
same data appear in different places. For example, the ZIP code is saved in
one table as 1234-567 numeric data format; while in another table it may
be represented in 1234567.
Inconsistent data may come from
Errors in data entry.
Merging data from different sources with varying formats.
Differences in the data collection process.
Data preprocessing is used to improve the quality of data and mining results.
And the goal of data preprocessing is to enhance the accuracy, efficiency,
and reliability of data mining algorithms.
Major Tasks in Data Preprocessing
Data preprocessing is an essential step in the knowledge discovery process,
because quality decisions must be based on quality data. And Data
Preprocessing involves Data Cleaning, Data Integration, Data Reduction and
Data Transformation.
Steps in Data Preprocessing
1. Data Cleaning
Data cleaning is a process that "cleans" the data by filling in missing
values, smoothing noisy data, identifying and removing outliers, and
removing inconsistencies in the data.
If users believe the data are dirty, they are unlikely to trust the results of any
data mining that has been applied.
Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values, smooth
out noise while identifying outliers, and correct inconsistencies in the data.
Missing Values
Imagine that you need to analyze All Electronics sales and customer data.
You note that many tuples have no recorded value for several attributes
such as customer income. How can you go about filling in the missing values
for this attribute? There are several methods to fill the missing values.
Those are,
a. Ignore the tuple: This is usually done when the class label is missing
(classification). This method is not very effective, unless the tuple
contains several attributes with missing values.
b. Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many
missing values.
c. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant such as a label like
“Unknown” or “- ∞ “.
d. Use the attribute mean or median to fill in the missing
value: Replace all missing values in the attribute by the mean or
median of that attribute values.
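The sketch below shows how methods (a), (c), and (d) might look in Python with pandas; the customer data and the constant used for (c) are hypothetical.

import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({"customer": ["A", "B", "C", "D"],
                   "income": [35000.0, None, 52000.0, None]})

# (a) Ignore the tuple: drop rows that have any missing value
dropped = df.dropna()

# (c) Use a global constant to fill in the missing value (here -1 stands in for "Unknown")
filled_constant = df["income"].fillna(-1)

# (d) Use the attribute mean or median to fill in the missing value
filled_mean = df["income"].fillna(df["income"].mean())      # mean of 35000 and 52000 = 43500
filled_median = df["income"].fillna(df["income"].median())  # median of 35000 and 52000 = 43500
print(filled_mean.tolist())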
Noisy Data:
Noise is a random error or variance in a measured variable. Data smoothing
techniques are used to eliminate noise and extract the useful patterns. The
different techniques used for data smoothing are:
a. Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into several “buckets,” or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing.
There are three kinds of binning. They are:
o Smoothing by Bin Means: In this method, each value in a bin is
replaced by the mean value of the bin. For example, the mean of
the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9.
o Smoothing by Bin Medians: In this method, each value in a bin is
replaced by the median value of the bin. For example, the
median of the values 4, 8, and 15 in Bin 1 is 8. Therefore, each
original value in this bin is replaced by the value 8.
o Smoothing by Bin Boundaries: In this method, the minimum and
maximum values in each bin are identified as the bin boundaries.
Each bin value is then replaced by the closest boundary value.
For example, in Bin 1 the boundaries are 4 and 15; the middle
value 8 is closer to 4, so it is replaced by 4.
Example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
b. Regression: Data smoothing can also be done by regression, a
technique that fits data values to a function and is used to predict
numeric values in a given data set. It analyses the relationship between
a target (dependent) variable and its predictor (independent) variables.
A small Python sketch covering binning and regression smoothing is
given after this list.
o Regression is a form of a supervised machine learning technique
that tries to predict any continuous valued attribute.
o Regression done in two ways; Linear regression involves finding
the “best” line to fit two attributes (or variables) so that one
attribute can be used to predict the other. Multiple linear
regression is an extension of linear regression, where more than
two attributes are involved and the data are fit to a
multidimensional surface.
c. Clustering: It helps in identifying outliers. Similar values are
organized into clusters, and values that fall outside the clusters
are known as outliers.
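The following Python sketch (assuming NumPy is available) reproduces the binning results of the worked price example above and then smooths the same values with linear regression; the predictor values used for the regression step are hypothetical.

import numpy as np

# Smoothing by binning: reproduce the worked price example above
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]   # equal-frequency bins of size 3
for b in bins:
    means = [round(float(np.mean(b)))] * len(b)               # smoothing by bin means
    medians = [int(np.median(b))] * len(b)                    # smoothing by bin medians
    low, high = b[0], b[-1]                                    # bin boundaries
    boundaries = [low if v - low <= high - v else high for v in b]  # smoothing by bin boundaries
    print(b, "->", means, medians, boundaries)

# Smoothing by (linear) regression: fit the "best" line and replace noisy values with fitted ones
x = np.arange(1, 10)                          # hypothetical predictor values
y = np.array(prices, dtype=float)             # treat the sorted prices as the noisy variable
slope, intercept = np.polyfit(x, y, deg=1)
smoothed = slope * x + intercept
print(np.round(smoothed, 1))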
2. Data Integration
Data integration is the process of combining data from multiple sources into
a single, unified view. This process involves identifying and accessing the
different data sources, mapping the data to a common format. Different data
sources may include multiple data cubes, databases, or flat files.
The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, in order to gain a more
complete and accurate understanding of the data.
Data integration strategy is typically described using a triple (G, S, M)
approach, where G denotes the global schema, S denotes the schema of the
heterogeneous data sources, and M represents the mapping between the
queries of the source and global schema.
Example: To understand the (G, S, M) approach, let us consider a data
integration scenario that aims to combine employee data from two different
HR databases, database A and database B. The global schema (G) would
define the unified view of employee data, including attributes like Employee
ID, Name, Department, and Salary.
In the schema of heterogeneous sources, database A (S1) might have
attributes like EmpID, FullName, Dept, and Pay, while database B's schema
(S2) might have attributes like ID, EmployeeName, DepartmentName, and
Wage. The mappings (M) would then define how the attributes in S1 and S2
map to the attributes in G, allowing for the integration of employee data
from both systems into the global schema.
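A minimal sketch of the (G, S, M) idea, assuming pandas: the two source extracts and their column names follow the hypothetical HR example above, and the two dictionaries play the role of the mappings M from the source schemas S1 and S2 to the global schema G.

import pandas as pd

# Hypothetical extracts from the two HR databases described above
db_a = pd.DataFrame({"EmpID": [1], "FullName": ["Asha"], "Dept": ["IT"], "Pay": [50000]})
db_b = pd.DataFrame({"ID": [2], "EmployeeName": ["Ravi"], "DepartmentName": ["HR"], "Wage": [42000]})

# M: mappings from each source schema (S1, S2) to the global schema G
map_a = {"EmpID": "EmployeeID", "FullName": "Name", "Dept": "Department", "Pay": "Salary"}
map_b = {"ID": "EmployeeID", "EmployeeName": "Name", "DepartmentName": "Department", "Wage": "Salary"}

# Apply the mappings and build the unified view under the global schema G
unified = pd.concat([db_a.rename(columns=map_a), db_b.rename(columns=map_b)], ignore_index=True)
print(unified)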
Issues in Data Integration
There are several issues that can arise when integrating data from multiple
sources, including:
a. Data Quality: Data from different sources may have varying levels of
accuracy, completeness, and consistency, which can lead to data
quality issues in the integrated data.
b. Data Semantics: Integrating data from different sources can be
challenging because the same data element may have different
meanings across sources.
c. Data Heterogeneity: Different sources may use different data formats,
structures, or schemas, making it difficult to combine and analyze the
data.
3. Data Reduction
Imagine that you have selected data from the All Electronics data warehouse
for analysis. The data set will likely be huge! Complex data analysis and
mining on huge amounts of data can take a long time, making such analysis
impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the
integrity of the original data. That is, mining on the reduced data set should
be more efficient yet produce the same (or almost the same) analytical
results.
In simple words, Data reduction is a technique used in data mining to reduce
the size of a dataset while still preserving the most important information.
This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of
irrelevant or redundant information.
There are several different data reduction techniques that can be used in
data mining, including:
a. Data Sampling: This technique involves selecting a subset of the data
to work with, rather than using the entire dataset. This can be useful
for reducing the size of a dataset while still preserving the overall
trends and patterns in the data.
b. Dimensionality Reduction: This technique involves reducing the
number of features in the dataset, either by removing features that are
not relevant or by combining multiple features into a single feature.
c. Data compression: This is the process of altering, encoding, or
transforming the structure of data in order to save space. By reducing
duplication and encoding data in binary form, data compression
creates a compact representation of information. It involves techniques
such as lossy or lossless compression to reduce the size of a dataset.
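A small sketch of sampling and a crude form of dimensionality reduction with pandas and NumPy; the data set and the decision that attribute f3 is irrelevant are hypothetical.

import numpy as np
import pandas as pd

# Hypothetical large data set: 100,000 rows with three numeric attributes
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100_000, 3)), columns=["f1", "f2", "f3"])

# (a) Data sampling: work with a 1% random sample instead of the full data set
sample = df.sample(frac=0.01, random_state=0)

# (b) Dimensionality reduction: drop an attribute judged irrelevant to the analysis
reduced = sample[["f1", "f2"]]

print(df.shape, "->", reduced.shape)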
4. Data Transformation
Data transformation in data mining refers to the process of converting raw
data into a format that is suitable for analysis and modelling. The goal of
data transformation is to prepare the data for data mining so that it can be
used to extract useful insights and knowledge.
Data transformation typically involves several steps, including:
1. Smoothing: It is a process used to remove noise from the dataset
using techniques such as binning, regression, and clustering.
2. Attribute construction (or feature construction): New attributes
are constructed and added from the given set of attributes to help the
mining process.
3. Aggregation: Summary or aggregation operations are applied to the
data. For example, daily sales data may be aggregated to compute
monthly and annual totals.
4. Data normalization: This process scales all data variables into a small
range, such as -1.0 to 1.0 or 0.0 to 1.0 (a small sketch is given after this list).
5. Generalization: It converts low-level data attributes to high-level data
attributes using a concept hierarchy. For example, age initially in
numerical form (e.g., 22) is converted into a categorical value (young, old).
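A minimal min-max normalization sketch in plain Python; the age values are hypothetical.

# Hypothetical age values scaled into the range 0.0 to 1.0 (min-max normalization)
ages = [22, 25, 30, 45, 60]
min_age, max_age = min(ages), max(ages)
normalized = [(v - min_age) / (max_age - min_age) for v in ages]
print([round(v, 2) for v in normalized])   # [0.0, 0.08, 0.21, 0.61, 1.0]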
Method Name | Irregularity Handled | Output
Data Cleaning | Missing, noisy, and inconsistent data | Quality data before integration
Data Integration | Different data sources (data cubes, databases, or flat files) | A unified view
Data Reduction | Huge amounts of data that make analysis impractical or infeasible | A reduced data set that maintains the integrity of the original
Data Transformation | Raw data | Data prepared for data mining
UNIT - II
Association Rule Mining
Mining Frequent Patterns
Associations and correlations
Mining Methods
Mining various kinds of Association Rules
Correlation Analysis
Constraint based Association mining.
Graph Pattern Mining, SPM.
1. MINING FREQUENT PATTERNS
Mining Frequent Patterns in Data Mining
Item Set:
An item set is a collection or set of items.
Examples:
{Computer, Printer, MSOffice} is a 3-item set
{Milk, Bread} is a 2-item set
Similarly, a set of k items is called a k-item set.
Frequent patterns
These are patterns that appear frequently in a data set. Patterns may be
item sets, or sub sequences.
Example: Transaction Database (Dataset)
TID Items
T1 Bread, Coke, Milk
T2 Popcorn, Bread
T3 Bread, Egg, Milk.
T4 Egg, Bread, Coke, Milk
A set of items, such as Milk & Bread that appear together in a
transaction data set (Also called as Frequent Item set).
Frequent item set mining leads to the discovery of associations and
correlations among items in large transactional (or) relational data
sets.
Finding frequent patterns plays an essential role in mining
associations, correlations, and many other interesting relationships
among data. Moreover, it helps in data classification, clustering,
and other data mining tasks.
2. ASSOCIATIONS AND
CORRELATIONS
Association rule mining (or) frequent item set mining finds
interesting associations and relationships (correlations) in large
transactional or relational data sets.
This rule shows how frequently an item set occurs in a transaction. A
typical example is market basket analysis.
Market basket analysis is one of the key techniques used by large
retailers to show associations between items. It allows retailers to
identify relationships between the items that people buy together
frequently.
This process analyzes customer buying habits by finding associations
between the different items that customers place in their “shopping
baskets”.
The discovery of these associations can help retailers develop
marketing strategies by gaining insight into which items are frequently
purchased together by customers.
For instance, if customers are buying milk, how likely are they to also
buy bread (and what kind of bread) on the same trip to the
supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.
Understanding these buying patterns can help to increase sales in several
ways. If there is a pair of items, X and Y, which are frequently bought
together:
Both X and Y can be placed on the same shelf, so that buyers of one
item would be prompted to buy the other.
Promotional discounts could be applied to just one out of the two
items.
Advertisements on X could be targeted at buyers who purchase Y.
X and Y could be combined into a new product, such as having Y in
flavours of X.
Association rule: If there is a pair of items, X and Y, that are frequently
bought together, then the association rule is represented as X ⇒ Y.
For example, the information that customers who
purchase computers also tend to buy antivirus software at the same time
is represented as
Computer ⇒ Antivirus_Software
Measures to discover interestingness of
association rules
Association rules analysis is a technique to discover how items are
associated with each other. There are three measures to discover the
interestingness of association rules. Those are:
Support: The support of an item / item set is the number of
transactions in which the item / item set appears, divided by the total
number of transactions.
Formula:
Support(A) = (Number of transactions containing A) / N
Support(A, B) = (Number of transactions containing both A and B) / N
Where A and B are items and N is the total number of transactions.
Example: Table-1 Example Transactions
TID Items
T1 Bread, Coke, Milk
T2 Popcorn, Bread
T3 Bread, Egg, Milk.
T4 Egg, Bread, Coke, Milk
T5 Egg, Apple
Example: Support of item Coke = 2 / 5 = 0.4 = 40% (Coke appears in transactions T1 and T4).
Example: Support of item set {Bread, Milk} = 3 / 5 = 0.6 = 60% (Bread and Milk appear together in T1, T3, and T4).
Confidence: This says how likely item B is purchased when item A is
purchased, expressed as {A → B}. The confidence of (A → B) is the number of
transactions in which both A and B appear, divided by the number of
transactions in which A appears.
Formula:
Confidence(A → B) = (Number of transactions containing both A and B) / (Number of transactions containing A) = Support(A, B) / Support(A)
Example: From Table-1, the confidence of {Bread → Milk} = 3 / 4 = 0.75 = 75%.
Lift: This says how likely item B is purchased when item A is purchased,
while taking into account how popular item B is, expressed as an association
rule {A → B}. The lift is a measure used to predict the performance of an
association rule (targeting model).
If lift value is:
Greater than 1 means that item B is likely to be bought if item A is
bought,
Less than 1 means that item B is unlikely to be bought if item A is
bought,
Equals to 1 means there is no association between items (A and B).
Formula:
Lift(A → B) = Support(A, B) / (Support(A) * Support(B))
Example: From Table-1, the lift of {Bread → Milk} = 0.6 / (0.8 * 0.6) = 1.25.
The lift value is greater than 1, which means that Milk is likely to be bought
when Bread is bought.
Example: To find Support, Confidence and Lift measures on the following
transactional data set.
Table-2: Example Transactions
TID Items
T1 Bread, Milk
T2 Bread, Diaper, Burger, Eggs
T3 Milk, Diaper, Burger, Coke
T4 Bread, Milk, Diaper, Burger
T5 Bread, Milk, Diaper, Coke
Number of transactions = 5.
Support:
1 – Item Set:
Support {Bread} = 4 / 5 = 0.8 = 80%
Support {Diaper} = 4 / 5 = 0.8 = 80%
Support {Milk} = 4 / 5 = 0.8 = 80%
Support {Burger} = 3 / 5 = 0.6 = 60%
Support {Coke} = 2 / 5 = 0.4 = 40%
Support {Eggs} = 1 / 5 = 0.2 = 20%
2 – Item Set:
Support {Bread, Milk} = 3 / 5 = 0.6 = 60%
Support {Milk, Diaper} = 3 / 5 = 0.6 = 60%
Support {Milk, Burger} = 2 / 5 = 0.4 = 40%
Support {Burger, Coke} = 1 / 5 = 0.2 = 20%
Support {Milk, Eggs} = 0 / 5 = 0.0 = 0%
3 – Item Set:
Support {Bread, Milk, Diaper} = 2 / 5 = 0.4 = 40%
Support {Milk, Diaper, Burger} = 2 / 5 = 0.4 = 40%
Confidence:
Confidence {Bread → Milk} = Support {Bread, Milk} / Support {Bread} = 0.6 / 0.8 = 0.75 = 75%
Confidence {Milk → Diaper} = Support {Milk, Diaper} / Support {Milk} = 0.6 / 0.8 = 0.75 = 75%
Confidence {Burger → Milk} = Support {Milk, Burger} / Support {Burger} = 0.4 / 0.6 ≈ 0.67 = 67%
Lift:
Lift {Bread → Milk} = Support {Bread, Milk} / (Support {Bread} * Support {Milk}) = 0.6 / (0.8 * 0.8) ≈ 0.94
Lift {Milk → Diaper} = Support {Milk, Diaper} / (Support {Milk} * Support {Diaper}) = 0.6 / (0.8 * 0.8) ≈ 0.94
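The support, confidence, and lift values above can be checked with a short Python sketch over the transactions of Table-2; only the standard library is used.

# Transactions of Table-2
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Burger", "Eggs"},
    {"Milk", "Diaper", "Burger", "Coke"},
    {"Bread", "Milk", "Diaper", "Burger"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def support(items):
    # Fraction of transactions that contain every item in the set
    return sum(items <= t for t in transactions) / N

def confidence(a, b):
    # support(A and B) / support(A)
    return support(a | b) / support(a)

def lift(a, b):
    # support(A and B) / (support(A) * support(B))
    return support(a | b) / (support(a) * support(b))

print(support({"Bread", "Milk"}))        # 0.6
print(confidence({"Bread"}, {"Milk"}))   # 0.75
print(lift({"Bread"}, {"Milk"}))         # 0.6 / (0.8 * 0.8) = 0.9375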
3. MINING METHODS
The most famous story about association rule mining is the “beer and
diaper.” Researchers discovered that customers who buy diapers also tend
to buy beer. This classic example shows that there might be many
interesting association rules hidden in our daily data.
Association rules help to predict the occurrence of one item based on the
occurrences of other items in a set of transactions.
Association rules Examples
People who buy bread will also buy milk; represented as{ bread →
milk }
People who buy milk will also buy eggs; represented as { milk →
eggs }
People who buy bread will also buy jam; represented as { bread →
jam }
Association rules discover the relationship between two or more attributes.
They are mainly of the form: if antecedent, then consequent. For example, a
supermarket sees that there are 200 customers on Friday evening. Out of
the 200 customers, 100 bought chicken, and out of the 100 customers who
bought chicken, 50 also bought onions. Thus, the association rule would be:
if customers buy chicken, then they buy onions too, with a support of
50/200 = 25% and a confidence of 50/100 = 50%.
Association rule mining is a technique to identify interesting relations
between different items. Association rule mining has to:
Find all the frequent itemsets.
Generate association rules from those frequent itemsets.
There are many methods or algorithms to perform Association Rule Mining or
Frequent Itemset Mining, those are:
Apriori algorithm
FP-Growth algorithm
Apriori algorithm
The Apriori algorithm is a classic and powerful tool in data mining used to
discover frequent itemsets and generate association rules. Imagine a grocery
store database with customer transactions. Apriori can help you find out
which items frequently appear together, revealing valuable insights like:
Customers buying bread often buy butter and milk too. (Frequent
itemset)
70% of people who purchase diapers also buy baby wipes.
(Association rule)
How Apriori algorithm works:
Bottom-up Approach: Starts with finding frequent single items, then
combines them to find frequent pairs, triplets, and so on.
Apriori Property: If a smaller itemset isn't frequent, none of its larger
versions can be either. This "prunes" the search space for efficiency.
Support and Confidence: Two key measures used to define how
often an itemset appears and how strong the association between
items is.
Limitations for Apriori algorithm
Can be computationally expensive for large datasets.
Sensitive to minimum support and confidence thresholds.
FP-Growth algorithm
FP-Growth stands for Frequent Pattern Growth, and it's a smarter sibling of
the Apriori algorithm for mining frequent itemsets in data. But instead of
brute force, it uses a clever strategy to avoid generating and testing tons of
candidate sets, making it much faster and more memory-efficient.
Here's its secret weapon:
Frequent Pattern Tree (FP-Tree): This special data structure
efficiently stores the frequent item sets and their relationships. Think
of it as a compressed and organized representation of your grocery
store database.
Pattern Fragment Growth: Instead of building candidate sets, FP-
Growth focuses on "growing" smaller frequent patterns (fragments) by
adding items at their frequent ends. This avoids the costly generation
and scanning of redundant patterns.
Advantages of FP-Growth over Apriori
Faster for large datasets: No more candidate explosions, just
targeted pattern growth.
Less memory required: The compact FP-Tree minimizes memory
usage.
More versatile: Can easily mine conditional frequent patterns without
building new trees.
When to Choose FP-Growth
If you're dealing with large datasets and want faster results.
If memory limitations are a concern.
If you need to mine conditional frequent patterns.
Remember: Both Apriori and FP-Growth have their strengths and
weaknesses. Choosing the right tool depends on your specific data and
needs.
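If a library such as mlxtend is available, both algorithms can be tried in a few lines. This is only a sketch under the assumption that mlxtend's frequent_patterns module provides apriori, fpgrowth, and association_rules with their usual signatures; the transactions are hypothetical.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["diapers", "baby wipes"],
                ["diapers", "baby wipes", "beer"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with either algorithm (FP-Growth is usually faster on large data)
frequent = fpgrowth(df, min_support=0.5, use_colnames=True)   # or: apriori(df, min_support=0.5, use_colnames=True)

# Association rules above a minimum confidence threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])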
Apriori algorithm
The Apriori algorithm was the first algorithm proposed for frequent itemset mining.
It was introduced by R. Agrawal and R. Srikant.
The name of the algorithm is Apriori because it uses prior knowledge of frequent itemset
properties.
Frequent Item Set
A frequent itemset is an itemset whose support value is greater than or equal to a
threshold value (minimum support).
Apriori algorithm uses frequent itemsets to generate association rules. To improve the
efficiency of level-wise generation of frequent itemsets, an important property is used
called Apriori property which helps by reducing the search space.
Apriori Property
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Steps in Apriori algorithm
The Apriori algorithm is a sequence of steps to be followed to find the frequent itemsets
in the given database. A minimum support threshold is given in the problem or is
assumed by the user.
The steps followed in the Apriori Algorithm of data mining are:
Join Step: This step generates candidate (K+1)-itemsets from the frequent K-itemsets by
joining each K-itemset with itself.
Prune Step: This step scans the count of each candidate itemset in the database. If a
candidate itemset does not meet the minimum support, it is regarded as infrequent and
removed. This step is performed to reduce the size of the candidate itemsets.
The join and prune steps are repeated iteratively until no new frequent itemsets can be
found.
Apriori Algorithm Example
Consider the following dataset; find the frequent item sets and generate association rules for them.
Assume that the minimum support threshold is s = 50% and the minimum confidence threshold is c = 80%.
Transaction List of items
T1 I1, I2, I3
T2 I2, I3, I4
T3 I4, I5
T4 I1, I2, I4
T5 I1, I2, I3, I5
T6 I1, I2, I3, I4
Solution
Finding frequent item sets:
Support threshold=50% ⇒ 0.5*6 = 3 ⇒ min_sup = 3
Step-1:
(i) Create a table containing support count of each item present in dataset – Called C1 (candidate set).
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
(ii) Prune Step: Compare each candidate item's support count with the minimum support
count. The above table shows that item I5 does not meet min_sup = 3, so it is removed;
only I1, I2, I3, and I4 meet the min_sup count.
This gives us the following item set L1.
Item Count
I1 4
I2 5
I3 4
I4 4
Step-2:
(i) Join step: Generate the candidate set C2 (2-itemsets) using L1, and find the occurrences of each
2-itemset in the given dataset.
Item Count
I1, I2 4
I1, I3 3
I1, I4 2
I2, I3 4
I2, I4 3
I3, I4 2
(ii) Prune Step: Compare each candidate itemset's support count with the minimum support
count. The above table shows that the itemsets {I1, I4} and {I3, I4} do not meet min_sup
= 3, so they are removed.
This gives us the following item set L2.
Item Count
I1, I2 4
I1, I3 3
I2, I3 4
I2, I4 3
Step-3:
(i) Join step: Generate the candidate set C3 (3-itemsets) using L2, and find the occurrences of each
3-itemset in the given dataset.
Item Count
I1, I2, I3 3
I1, I2, I4 2
I1, I3, I4 1
I2, I3, I4 2
(ii) Prune Step: Compare each candidate itemset's support count with the minimum support
count. The above table shows that the itemsets {I1, I2, I4}, {I1, I3, I4}, and {I2, I3, I4} do not
meet min_sup = 3, so they are removed. Only the itemset {I1, I2, I3} meets the min_sup
count.
Generate Association Rules:
Thus, we have discovered all the frequent itemsets. Now we need to generate strong
association rules (rules that satisfy the minimum confidence threshold) from the frequent
itemsets. For that, we need to calculate the confidence of each rule.
The given Confidence threshold is 80%.
All possible association rules from the frequent itemset {I1, I2, I3} are:
{I1, I2} ⇒ {I3}: Confidence = support{I1, I2, I3} / support{I1, I2} = (3/4) * 100 = 75% (Rejected)
{I1, I3} ⇒ {I2}: Confidence = support{I1, I2, I3} / support{I1, I3} = (3/3) * 100 = 100% (Selected)
{I2, I3} ⇒ {I1}: Confidence = support{I1, I2, I3} / support{I2, I3} = (3/4) * 100 = 75% (Rejected)
{I1} ⇒ {I2, I3}: Confidence = support{I1, I2, I3} / support{I1} = (3/4) * 100 = 75% (Rejected)
{I2} ⇒ {I1, I3}: Confidence = support{I1, I2, I3} / support{I2} = (3/5) * 100 = 60% (Rejected)
{I3} ⇒ {I1, I2}: Confidence = support{I1, I2, I3} / support{I3} = (3/4) * 100 = 75% (Rejected)
This shows that the association rule {I1, I3} ⇒ {I2} is strong if minimum confidence
threshold is 80%.
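The join-and-prune loop of the worked example can be reproduced with a short, unoptimized Python sketch; it prints the frequent itemsets L1, L2, and L3 found above together with their support counts.

from itertools import combinations

# Transactions from the worked example above; min_sup = 3 (50% of 6 transactions)
transactions = [
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"},
]
min_sup = 3

def count(itemset):
    # Support count: number of transactions containing the itemset
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if count({i}) >= min_sup}]

# Join and prune until no new frequent itemsets are found
k = 1
while frequent[-1]:
    prev = frequent[-1]
    # Join step: build candidate (k+1)-itemsets from frequent k-itemsets
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Prune step: all k-subsets must be frequent and the candidate must meet min_sup
    frequent.append({c for c in candidates
                     if all(frozenset(s) in prev for s in combinations(c, k))
                     and count(c) >= min_sup})
    k += 1

for level in frequent:
    for itemset in level:
        print(sorted(itemset), count(itemset))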
Exercise1: Apriori Algorithm
TID Items
T1 I1, I2, I5
T2 I2, I4
T3 I2, I3
T4 I1, I2, I4
T5 I1, I3
T6 I2, I3
T7 I1, I3
T8 I1, I2, I3, I5
T9 I1, I2, I3
Consider the above dataset; find the frequent item sets and generate association rules
for them. Assume that the minimum support count is 2 and the minimum confidence
threshold is c = 50%.
Exercise2: Apriori Algorithm
TID Items
T1 {milk, bread}
T2 {bread, sugar}
T3 {bread, butter}
T4 {milk, bread, sugar}
T5 {milk, bread, butter}
T6 {milk, bread, butter}
T7 {milk, sugar}
T8 {milk, sugar}
T9 {sugar, butter}
T10 {milk, sugar, butter}
T11 {milk, bread, butter}
Consider the above dataset; find the frequent item sets and generate association rules
for them. Assume that the minimum support count is 3 and the minimum confidence
threshold is c = 60%.