Data Mining SSWT ZC 425
BITS Pilani M1: Introduction to Data Mining
Pilani|Dubai|Goa|Hyderabad
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Mining:
Concepts and Techniques
3
Text Book(s)
T1: Tan P. N., Steinbach M. & Kumar V., "Introduction to Data Mining", Pearson Education, 2014
https://www.ceom.ou.edu/media/docs/upload/Pang-Ning_Tan_Michael_Steinbach_Vipin_Kumar_-_Introduction_to_Data_Mining-Pe_NRDK4fi.pdf
Summary
8
Why Data Mining?
9
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
1950s-1990s, computational science
Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
1990-now, data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
10
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
11
Chapter 1. Introduction
Why Data Mining?
Summary
12
What Is Data Mining?
13
Knowledge Discovery (KDD) Process
This is a view from the typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
[Figure: KDD process: Databases → Data Cleaning → Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation]
14
Example: A Web Mining Framework
15
Data Mining in Business Intelligence
[Figure: pyramid of increasing potential to support business decisions, from Data Exploration (statistical summary, querying, and reporting) at the base, up through Data Mining, to Decision Making by the End User at the top]
17
KDD Process: A Typical View from ML and
Statistics
18
Example: Medical Data Mining
19
Chapter 1. Introduction
Why Data Mining?
Summary
20
Multi-Dimensional View of Data Mining
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous, etc.)
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics, etc.
Summary
22
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
23
Chapter 1. Introduction
Why Data Mining?
Summary
24
Data Mining Function: (1) Generalization
25
Data Mining Function: (2) Association and
Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together at Walmart?
Association, correlation vs. causality
A typical association rule: Diaper → Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large
datasets?
How to use such patterns for classification, clustering,
and other applications?
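As a rough illustration of how support and confidence are computed for a rule such as Diaper → Beer, the sketch below uses a tiny invented transaction list (item names and counts are assumptions made for this example, not from the slides):

# A minimal sketch of computing support and confidence for the rule {diaper} -> {beer}.
# The five transactions below are invented for illustration only.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"diaper", "beer"} <= t)    # transactions containing both items
antecedent = sum(1 for t in transactions if "diaper" in t)        # transactions containing the antecedent
support = both / n                  # fraction of all transactions containing diaper AND beer
confidence = both / antecedent      # fraction of diaper transactions that also contain beer
print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # support = 0.60, confidence = 0.75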
26
Data Mining Function: (3) Classification
27
Data Mining Function: (4) Cluster Analysis
28
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general
behavior of the data
Noise or exception? ― One person’s garbage could be another
person’s treasure
Methods: by product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
29
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g.,
memory cards
Periodicity analysis
Similarity-based analysis
30
Structure and Network Analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees, and substructures
Social and information network analysis: relationships among friends, family, classmates, …
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
31
Evaluation of Knowledge
Are all mined knowledge interesting?
One can mine a tremendous amount of “patterns” and knowledge
Some may fit only certain dimension space (time, location, …)
Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only
interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
…
32
Chapter 1. Introduction
Why Data Mining?
Summary
33
Data Mining: Confluence of Multiple Disciplines
Applications Visualization
Data Mining
34
Why Confluence of Multiple Disciplines?
Summary
36
Applications of Data Mining
Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
37
Chapter 1. Introduction
Why Data Mining?
Summary
38
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
39
Major Issues in Data Mining (2)
40
Chapter 1. Introduction
Why Data Mining?
Summary
41
A Brief History of Data Mining Society
43
Where to Find References? DBLP, CiteSeer, Google
Summary
45
Summary
Data mining: Discovering interesting patterns and knowledge from
massive amounts of data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining technologies and applications
Major issues in data mining
46
Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
B. Liu, Web Data Mining, Springer 2006.
T. M. Mitchell, Machine Learning, McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Pearson Addison-Wesley, 2005
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005
47
Queries ??
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Preprocessing
• Data Pre-processing: An Overview, Data Cleaning,
• Data Reduction: Overview of Data Reduction Strategies, PCA, Attribute Subset Selection, Histograms, Sampling;
• Discretization by Binning, Discretization by Histogram Analysis, Discretization by Cluster, Decision Tree, and Correlation Analyses.
Data Preprocessing 5
Data Mining
• What is Data Mining?
• Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD,
• while others view data mining as merely an essential step in the process of knowledge discovery.
Data Preprocessing 6
Knowledge Discovery Process
• The knowledge discovery process is shown in Figure below as an iterative sequence
Data Preprocessing 7
Major Tasks in Data Preprocessing
Data Preprocessing 8
Knowledge Discovery Process
Steps 1 through 4 are different forms of data preprocessing, where data are
prepared for mining.
The data mining step may interact with the user or a knowledge base.
The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base.
“Data mining is the process of discovering interesting patterns and knowledge from
large amounts of data. The data sources can include databases, data warehouses, the
Web, other information repositories, or data that are streamed into the system
dynamically” .
Data Preprocessing 10
Data Preprocessing
Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent
data due to their typically huge size and their origin from multiple, heterogeneous
sources.
Low-quality data will lead to low-quality mining results.
“How can the data be preprocessed in order to help improve the quality of the data
and, consequently, of the mining results?
How can the data be preprocessed so as to improve the efficiency and ease of the
mining process?”
There are several data preprocessing techniques.
Data cleaning can be applied to remove noise and correct inconsistencies in data.
Data Preprocessing 11
Data Preprocessing
Data Preprocessing
Data Preprocessing
Data integration merges data from multiple sources into a coherent data
store such as a data warehouse.
Data reduction can reduce data size by aggregating, eliminating redundant
features, or clustering.
There are many factors comprising data quality, including accuracy, completeness, consistency,
timeliness, believability, and interpretability.
Imagine that you are a manager at AllElectronics and have been charged with analyzing the company’s
data with respect to your branch’s sales.
You carefully inspect the company’s database and data warehouse, identifying and selecting the
attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis.
You notice that several of the attributes for various tuples have no recorded value.
Data Preprocessing 13
Data Preprocessing Steps
• Figure below summarizes the data preprocessing steps described here
Data Preprocessing 14
Data Cleaning
Data Preprocessing 15
Missing Values
Imagine that you need to analyze AllElectronics sales and customer data.
You note that many tuples have no recorded value for several attributes such as customer income. How
can you go about filling in the missing values for this attribute?
1. Ignore the tuple: This is usually done when the class label is missing. This method is not very effective,
unless the tuple contains several attributes with missing values. It is especially poor when the percentage
of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the
remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.
Data Preprocessing 16
Missing Values
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant such as a label like “Unknown” or −∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common—that of
“Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value: measures of central tendency, which indicate the “middle”
value of a data distribution. For normal (symmetric) data distributions, the mean
can be used, while skewed data distribution should employ the median.
Data Preprocessing 17
Missing Values
5. Use the attribute mean or median for all samples belonging to the same class as the
given tuple: For example, if classifying customers according to credit risk, we may replace
the missing value with the mean income value for customers in the same credit risk
category as that of the given tuple. If the data distribution for a given class is skewed, the
median value is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
Data Preprocessing 18
Data Quality: Why Preprocess the Data?
Data Preprocessing 19
Major Tasks in Data Preprocessing
Data Preprocessing 20
Data Reduction
Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results.
Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
Data Preprocessing 21
Data Reduction
Data reduction strategies
Dimensionality reduction, numerosity reduction, data compression
Data Preprocessing 22
Data Reduction
Dimensionality Reduction
Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering and outlier analysis, become less meaningful
Data Preprocessing 23
Data Reduction
Dimensionality Reduction
Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
Data Preprocessing 24
Data Reduction
Dimensionality Reduction
• Wavelet transforms
Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the line
Multiple regression
• Allows a response variable Y to be modeled as a linear function
of multidimensional feature vector
Data Preprocessing 26
Regression Analysis
• Regression analysis: a technique for the modeling and analysis of numerical data consisting of values of a dependent variable Y and of one or more independent variables X
Linear regression: Y = wX + b
Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand, applying the least-squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1X1 + b2X2
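A minimal sketch of fitting Y = wX + b with the least-squares criterion in NumPy; the x and y values below are illustrative, not taken from the slides:

import numpy as np

# Illustrative data; np.polyfit solves the least-squares problem for w and b.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
w, b = np.polyfit(x, y, deg=1)          # degree-1 fit: Y = w*X + b
print(f"Y = {w:.2f} * X + {b:.2f}")

# Multiple regression Y = b0 + b1*X1 + b2*X2 via least squares (here X2 = x**2, just for illustration).
X = np.column_stack([np.ones_like(x), x, x ** 2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", np.round(coeffs, 3))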
Data Preprocessing 28
Sampling: With or without Replacement
Raw Data
Data Preprocessing 29
Sampling: Cluster or Stratified Sampling
Data Preprocessing 30
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
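A small Python sketch that reproduces the equal-frequency binning and both smoothing variants above; the tie-breaking rule for bin boundaries is an assumption of this sketch:

# Equal-frequency binning of the sorted prices, then smoothing by bin means and bin boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [[min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
                 for b in bins]
print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]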
Data Preprocessing 31
Regression and Log-Linear Models
Linear regression: Y = wX + b
Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand, applying the least-squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1X1 + b2X2
Data Preprocessing 32
Correlation Analysis (Nominal Data)
• Χ² (chi-square) test: Χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ² value, the more likely the variables are related
• The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count
Data Preprocessing 33
Chi-Square Calculation: An Example
                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500
(Numbers in parentheses are expected counts.)
Χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• It shows that like_science_fiction and play_chess are correlated in the group
34
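The same statistic can be checked programmatically; a minimal sketch using SciPy (assuming scipy is available), with Yates' correction disabled so the result matches the hand calculation:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],      # like science fiction: play chess / not play chess
                     [50, 1000]])     # not like science fiction
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2))   # 507.93
print(expected)         # [[ 90. 360.]  [210. 840.]] -- the expected counts shown in parentheses above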
Co-Variance: An Example
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = E(A·B) − E(A)·E(B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
• Since Cov(A,B) > 0, the two stocks tend to rise together.
Data Preprocessing 37
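A quick NumPy check of the covariance computation above:

import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])
cov = np.mean(A * B) - np.mean(A) * np.mean(B)    # E(AB) - E(A)E(B), population form
print(cov)                                        # 4.0
print(np.cov(A, B, ddof=0)[0, 1])                 # same population covariance via np.cov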
Simple Discretization: Binning
Data Preprocessing 38
Discretization by Classification & Correlation Analysis
Data Preprocessing 39 39
Data Reduction : Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier analysis, becomes less
meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
Data Mining
40
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Principal Component Analysis (PCA)
Data Mining x1
41
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components)
that can be best used to represent data
• Normalize input data: Each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the
weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
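A minimal sketch of these steps with scikit-learn; the toy data matrix and the choice of k = 2 components are assumptions made for illustration:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative 6 x 3 data matrix (N = 6 vectors from n = 3 dimensions).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.3],
              [2.3, 2.7, 0.6]])

X_norm = StandardScaler().fit_transform(X)   # normalize: each attribute on the same scale
pca = PCA(n_components=2)                    # keep k = 2 orthonormal principal components
X_reduced = pca.fit_transform(X_norm)        # project the data onto those components
print(pca.explained_variance_ratio_)         # components sorted by decreasing "significance"
print(X_reduced.shape)                       # (6, 2): the reduced representation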
Data Mining
42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or more other
attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students' ID is often irrelevant to the task of predicting students' GPA
Data Mining
43
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Heuristic Search in Attribute Selection
Data Mining
44
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important information in a
data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation
• Attribute construction
• Combining features
• Data discretization
Data Mining
45 45
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Reduction: Numerosity Reduction
Data Mining
46
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
Data Mining
47
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank you
Data Mining
48
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Mining
BITS Pilani M3: Data Exploration
Pilani|Dubai|Goa|Hyderabad
The instructor is gratefully acknowledging the authors who made their course materials freely available online.
IMPORTANT Note to Students
It is important to know that merely logging in to the session does not guarantee attendance.
Once you join the session, continue till the end to be considered present in the class.
IMPORTANTLY, you need to make the class more interactive by responding to the Professor's queries during the session.
Whenever the Professor calls your number / name, you need to respond; otherwise you will be considered ABSENT.
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Recap
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Recap
Equal Depth (Frequency) Binning: Bins have an equal frequency.
Input:[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output: [5, 10, 11, 13] [15, 35, 50, 55] [72, 92, 204, 215]
Equal Width (Distance) Binning: bins have equal width; the bin ranges are defined as
[min, min + w], [min + w, min + 2w], …, [min + (n−1)w, min + nw], where
w = (max – min) / (no. of bins)
No. of bins = 3
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
min = 5 w = (215-5)/3 = 70
Output: [5, 10, 11, 13, 15, 35, 50, 55, 72] [92] [204, 215]
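A short sketch of equal-width binning that reproduces the recap example above; clamping the maximum value into the last bin is an implementation choice of this sketch:

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n_bins = 3
lo, hi = min(data), max(data)
w = (hi - lo) / n_bins                          # (215 - 5) / 3 = 70

bins = [[] for _ in range(n_bins)]
for v in data:
    idx = min(int((v - lo) // w), n_bins - 1)   # clamp the maximum value into the last bin
    bins[idx].append(v)
print(bins)   # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]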
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Recap : Exercise
Consider the following scholar data set made available to you for carrying out some
mining activity. List 4 potential issues with this dataset identified while preprocessing.
• Data matrix, e.g., numerical matrix, crosstabs
• Document data: text documents represented as term-frequency vectors (e.g., counts of words such as team, coach, play, ball, score, game, win, lost, timeout, season)
• Transaction data
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples , examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects
• columns -> attributes.
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes and their types-By set of possible values.
Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature
of a data object.
– E.g., customer _ID, name, address
Types of attributes
– Nominal :categories, states, or “names of things”
• Examples: ID numbers, eye color, zip codes, Hair_color = {auburn, black, blond, brown, grey,
red, white}
– Ordinal :Values have a meaningful order (ranking) but magnitude between successive values is
not known.
• Examples: rankings (e.g., taste of soft drink on a scale from 1-10), grades{A,A-,B,B-,C,C-
,D,E}
– Interval-scaled:
• No true zero-point
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio-scaled
• Inherent zero-point
• Examples: temperature in Kelvin, length, weight, elapsed time(e.g., time to run a race)
BITS Pilani, Pilani Campus
Attribute Types
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Discrete vs. Continuous Attributes :
by the Number of Values
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a
finite number of digits
• Continuous attributes are typically represented as floating-point variables
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise
Identify whether the attributes are categorical or numerical
1) Aircraft manufacturers
2) Enrolled in a university?
3) Number of students
4) Weight
5) Customer Satisfaction
6) Monthly Income (in Rs)
7) Passed Entrance Exam? (Yes/No)
8) GPA (Grade Point Average)
• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
  sample: x̄ = (1/n) Σ_{i=1..n} x_i        population: μ = Σx / N
• Weighted arithmetic mean:
  x̄ = Σ_{i=1..n} w_i x_i / Σ_{i=1..n} w_i
• Trimmed mean: chopping extreme values
• Median:
  • Middle value if odd number of values, or average of the middle two values otherwise
  • Estimated by interpolation (for grouped data):
    median = L1 + ((n/2 − (Σ freq)_l) / freq_median) × width
• Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula: mean − mode ≈ 3 × (mean − median)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric,
positively and negatively skewed data
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Graphic Displays of Basic Statistical Descriptions
• Quantile plot: each value xi is paired with fi indicating that approximately 100 fi %
of the data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariant distribution
against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the
plane
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise
Suppose that the data for analysis includes the attribute age. The
age values for the data tuples are (in increasing order) 13, 15, 16,
16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70, 87
a) Calculate mean, and mode
b) Calculate the 5-number summary
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Solution
Suppose that the data for analysis includes the attribute age. The
age values for the data tuples are (in increasing order) 13, 15, 16,
16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70, 87
a) Calculate mean, and mode
b) Calculate the 5-number summary
Mean = 32, Mode = 25, 35 (bimodal), Median = (25 + 30)/2 = 27.5
Q1 = 20, Q3 = 35 Min = 13, Max = 87
5-number summary : (13,20,27.5,35,87)
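A quick NumPy check of the exercise above (np.percentile uses linear interpolation, so its quartiles can differ slightly from the hand-computed Q1 = 20 and Q3 = 35):

import numpy as np
from statistics import multimode

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33,
       35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 87]
print(np.mean(age))                     # 32.0
print(np.median(age))                   # 27.5
print(multimode(age))                   # [25, 35]  (bimodal)
q1, q3 = np.percentile(age, [25, 75])   # interpolated quartiles
print(min(age), q1, np.median(age), q3, max(age))   # 5-number summary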
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Matrix and Dissimilarity Matrix
• Dissimilarity matrix
  • n data points, but registers only the distance
  • A triangular matrix
  • Single mode
  [ 0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    :       :       :
    d(n,1)  d(n,2)  ...  ...  0 ]
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Proximity Measure for Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green
(generalization of a binary attribute)
• Method 1: Simple matching
  • m: # of matches, p: total # of variables
  d(i, j) = (p − m) / p
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Proximity Measure for Binary Attributes
• A contingency table for binary data (objects i and j): q = # of attributes where both are 1, r = # where i is 1 and j is 0, s = # where i is 0 and j is 1, t = # where both are 0
• Distance measure for symmetric binary variables:
  d(i, j) = (r + s) / (q + r + s + t)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Dissimilarity between Binary Variables
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
where
s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|)
Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. The new range is [0.0, 1.0]. Apply min-max normalization to the value $73,600.
Solution (min-max normalization):
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
   = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Z-score normalization (with mean μ_A = 54,000 and standard deviation σ_A = 16,000):
v' = (v − μ_A) / σ_A = (73,600 − 54,000) / 16,000 = 1.225
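A small sketch of both normalizations as Python helpers (the function names min_max and z_score are just illustrative):

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v to the range [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization using the attribute mean and standard deviation."""
    return (v - mean_a) / std_a

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225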
Dissimilarity Matrix
(with Euclidean Distance)
x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance on Numeric Data: Minkowski Distance
d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + … + |x_ip − x_jp|^h)^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects,
and h is the order (the distance so defined is also called the L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are different
between two binary vectors
d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
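A minimal sketch of the Minkowski family with SciPy on two made-up points:

import numpy as np
from scipy.spatial import distance

x = np.array([1, 2, 0, 5])
y = np.array([4, 2, 3, 1])
print(distance.minkowski(x, y, p=1))   # h = 1: Manhattan (city block, L1 norm) distance
print(distance.minkowski(x, y, p=2))   # h = 2: Euclidean (L2 norm) distance
print(distance.chebyshev(x, y))        # h -> infinity: supremum (L_max) distance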
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example: Minkowski Distance (Dissimilarity Matrices)
Euclidean (L2) L2 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Supremum L x1 x2 x3 x4
x1 0
x2 3 0
x3 2 5 0
x4 3 1 5 0
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ordinal Variables
Calculate the dissimilarity matrix and similarity matrix for the ordinal attributes
• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal
  • Compute ranks r_if and map each rank to [0, 1]:
    z_if = (r_if − 1) / (M_f − 1)
  • Treat z_if as interval-scaled
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as keywords) or phrase in the document.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
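A quick NumPy check of the cosine similarity example above:

import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])
cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94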
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise
For the following vectors x and y, calculate the indicated similarity or distance
measures:
x = (1,1,1,1), y = (2,2,2,2): cosine, Euclidean
Solution: cos(x, y) = 1, Euclidean(x, y) = 2
For the following vectors x and y, calculate the indicated similarity or distance
measures:
x = (0,1,0,1), y = (1,0,1,0): cosine, correlation, Euclidean, Jaccard
Solution: cos(x, y) = 0, Corr(x, y) = −1, Euclidean(x, y) = 2, Jaccard(x, y) = 0
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Visualization
Data visualization is the art and practice of gathering, analyzing, and graphically representing
empirical information.
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Visualization
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Iris Sample Data Set
Versicolour
Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of
objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
Example: Petal Width (10 and 20 bins, respectively)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Scatter Plot Array of Iris Attributes
Data Mining
Source : Figure 3.4. Sepal length data from the Iris data set[ Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson Education ] BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stem and Leaf Plot
4 : 34444566667788888999999
5 : 0000000000111111111222234444445555555666666777777778888888999
6 : 000000111111222233333333344444445555566777777778889999
7 : 0122234677779
4 : 3444
4 : 566667788888999999
5 : 000000000011111111122223444444
5 : 5555555666666777777778888888999
6 : 00000011111122223333333334444444
6 : 5555566777777778889999
7 : 0122234
7 : 677779 Data Mining
Source : Figure 3.5., Stem and leaf plot for the sepal length from the Iris data set. BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example: Pie Chart
A pie chart that shows the distribution of Iris species in the Iris data set. In this case,
all three flower types have the same frequency.
• Used less frequently in technical publications because the size of relative areas can be hard to
judge.
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Scatterplot Matrices
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Other Visualization Techniques
Star Plots
– Similar approach to parallel coordinates, but axes
radiate from a central point
– The line connecting the values of an object is a
polygon
Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a
characteristic of a face
– The values of each attribute determine the appearance
of the corresponding facial characteristic
– Each object becomes a separate face
– Relies on human’s ability to distinguish faces
Chernoff Faces
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Stick Figure
A census data figure showing age,
income, gender, education
A 5-piece stick figure (1 body and
4 limbs w. different angle/length)
Age, income are indicated by
position of the figure.
Gender, education are indicated
by angle/length.
Visualization can show a texture
pattern
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise
You are analysing the behaviour of students across 10 variables like
attendance, assignment scores, participation, etc.
Which visualization technique would you choose: Bar chart, Chernoff
faces, or Pie chart, Scatter Plot? Justify your answer.
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Visualizing Complex Data and Relations
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
References and Textbooks
Classification Concepts
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Classification: Definition
Task:
Learn a model that maps each attribute set x into one of the predefined class labels y
• Prediction
• models continuous-valued functions, i.e., predicts unknown or
missing values
[Figure: classification framework. A Training Set of records with known class labels (e.g., Tid 3: No, Small, 70K, Class = No; Tid 6: No, Medium, 60K, Class = No) is used to learn a model; the model is then applied to a Test Set of records with unknown labels (e.g., Tid 11: No, Small, 55K, Class = ?; Tid 15: No, Large, 67K, Class = ?)]
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Lazy vs. Eager Learning
• Lazy vs. eager learning
• Lazy learning (e.g., instance-based learning): Simply stores training data (or only
minor processing) and waits until it is given a test tuple
• Eager learning (the methods mentioned in the next slide): Given a training set,
constructs a classification model before receiving new (e.g., test) data to classify
• Lazy: less time in training but more time in predicting
• Accuracy
• Lazy method effectively uses a richer hypothesis space since it uses many local linear
functions to form its implicit global approximation to the target function
• Eager: must commit to a single hypothesis that covers the entire instance space
• Rule-based Methods
• Neural Networks
- computational networks that simulate the decision process in neurons (networks of nerve cells)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree Representation
• Given a tuple for which the associated class label is unknown, the attribute values of the tuple are
tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
12 / 67
Decision Tree –Vertebrate Data Set
13 / 67
Decision Tree Example: Loan Borrower Classification Problem
Start from the root of the tree and test the attributes of the test record against the tree.
[Figure: decision tree traversal for a test record. MarSt = Married → NO; MarSt = Single, Divorced → test Income: < 80K → NO, > 80K → YES]
14
BITS Pilani, Pilani Campus
Decision Tree Example: Loan Borrower Classification Problem
Splitting Attributes
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes
[Figure: decision tree. Home Owner = Yes → NO; Home Owner = No → MarSt: Married → NO; Single, Divorced → Income: < 80K → NO, > 80K → YES]
Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ,SPRINT
17 / 67
General Structure of Hunt's Algorithm
Let Dt be the set of training records that are associated with node t.
[Table: the 10-record loan borrower data set shown above]
18 / 67
Decision Tree Construction: Hunt's Algorithm
[Figure: Hunt's algorithm applied to the loan borrower data. (a) a single leaf labeled Defaulted = No; (b) split on Home Owner (Yes / No); (c) the Home Owner = No branch is further split on Marital Status (Single, Divorced vs. Married); (d) the Single, Divorced branch is further split on Annual Income (< 80K → Defaulted = No, >= 80K → Defaulted = Yes)]
[Table: the 10-record loan borrower data set shown above]
BITS Pilani, Pilani Campus
Design Issues of Decision Tree Induction
• Splitting Criterion
Method for specifying test condition
Depends on attribute types-Binary, Nominal, Ordinal, Continuous
Measure for evaluating the goodness of a test condition
• Stopping Criterion
Stop splitting if all the records belong to the same class or have identical
attribute values
Early termination
20 / 67
Test Condition for Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g., Marital Status → {Single}, {Divorced}, {Married}
• Binary split: divides values into two subsets, e.g., Marital Status → {Married} vs. {Single, Divorced}, or {Single} vs. {Married, Divorced}, or {Single, Married} vs. {Divorced}
• For ordinal attributes, preserve the order property among attribute values: for Shirt Size, the grouping {Small, Large} vs. {Medium, Extra Large} violates the order property
21 / 67
Test Condition for Continuous Attributes
• Binary decision: consider all possible splits and find the best cut; can be more compute intensive
• Discretization: to form an ordinal categorical attribute
22 / 67
How to Determine the Best Split
Before Splitting:
10 records of class 0
10 records of class 1
23 / 67
How to Determine the Best Split
Impurity is a measure of the homogeneity of the labels at the node at hand.
Greedy approach: nodes with purer class distribution are preferred.
Need a measure of node impurity: e.g., a node with class counts (C0: 5, C1: 5) is highly impure, while a node with (C0: 9, C1: 1) has low impurity.
24 / 67
Selecting an Attribute Test Condition: Impurity Measure for a Single Node
Gini Index(t) = 1 − Σ_{i=0..c−1} p_i(t)²
Entropy(t) = − Σ_{i=0..c−1} p_i(t) log₂ p_i(t)
where p_i(t) is the frequency of class i at node t, and c is the total number of classes.
These measures seek to determine which variable would split the data so that the underlying child nodes are most homogeneous, or pure.
25 / 67
Selecting an Attribute Test Condition: Impurity Measure for a Single Node
Example: p(C1) = 0/6 = 0 and p(C2) = 6/6 = 1
Gini = 1 − (0)² − (1)² = 0
Entropy = −(0) log₂(0) − (1) log₂(1) = 0
Low Gini index and entropy are preferred: this node has low impurity.
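A small sketch of both impurity measures as Python helpers (the function names are illustrative); it reproduces the pure-node example above and the maximally impure two-class case:

import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    p = p[p > 0]                          # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(gini([0, 6]), entropy([0, 6]))      # 0.0 0.0  (pure node, low impurity)
print(gini([5, 5]), entropy([5, 5]))      # 0.5 1.0  (most impure two-class node)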
26 / 67
Selecting an Attribute Test Condition: Collective Impurity of Child Nodes
When a node p is split into k partitions (children):
GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)
Entropy_split = Σ_{i=1..k} (n_i / n) · Entropy(i)
where n_i = number of records at child i, and n = number of records at parent node p.
27 / 67
Selecting an Attribute Test Condition: Finding the Best Split
30 / 48
Decision Tree - Entropy
Step 3:
Choose attribute with the largest information gain as the decision node, divide the
dataset by its branches and repeat the same process on every branch.
Outlook has the max gain. So choose Outlook as the root decision point.
A branch with entropy of 0 is a leaf node.
A branch with entropy more than 0 needs further splitting.
33 / 48
Decision Tree - Entropy
Step 4: Repeat.
Step 4a: Consider only rows which have Outlook = Sunny.
Windy has the max gain; choose Windy as the next decision point.
34 / 48
Decision Tree - Entropy
Step 4b: Consider only rows which have Outlook = Rainy.
Step 5: Draw the final decision tree.
35 / 48
Splitting Criteria based on Classification Error
• Classification error at a node t: Error(t) = 1 − max_i P(i | t)
• Example: C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 − max(0, 1) = 1 − 1 = 0
• Example: C1 = 1, C2 = 5: Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
• Example: C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
• Maximum when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information
36
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Comparison among Splitting Criteria
For a 2-class problem:
37
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Dealing with continuous-valued attributes
• Given a continuous-valued attribute A,
dynamically define a new discrete-valued attribute Ac that partitions the continuous
attribute values into a discrete set of intervals
• Ac = True if A < c, False otherwise
• How to select the best value for the
threshold c?
38 / 67
Dealing with continuous-valued attributes
• How to select the best value for the threshold c?
• Example. Temperature in the example
Sort the examples according to Temperature
Temperature 60 65 67 68 70 71 72 72 75 80 81 83 85 85
PlayTennis Y N Y Y Y N Y Y N N Y Y N Y
39 / 67
Dealing with continuous-valued attributes
• How to select the best value for the threshold c?
60 65 67 68 70 71 72 72 75 80 81 83 85 85
The candidate thresholds identified are:
• 62.5 ,66,70.5,71.5,73.5,80.5,84 Y N Y Y Y N Y Y N N Y Y N Y
40 / 67
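A short sketch of how those candidate thresholds can be derived programmatically: sort by Temperature and take the midpoint wherever the class label changes between distinct neighbouring values:

temps  = [60, 65, 67, 68, 70, 71, 72, 72, 75, 80, 81, 83, 85, 85]
labels = ['Y','N','Y','Y','Y','N','Y','Y','N','N','Y','Y','N','Y']

candidates = []
for (t1, c1), (t2, c2) in zip(zip(temps, labels), zip(temps[1:], labels[1:])):
    if c1 != c2 and t1 != t2:             # class changes across two distinct values
        candidates.append((t1 + t2) / 2)
print(sorted(set(candidates)))            # [62.5, 66.0, 70.5, 71.5, 73.5, 80.5, 84.0]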
Dealing with continuous-valued attributes
41 / 67
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
• Early termination
42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree Based Classification
• Advantages:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many simple data sets
45 / 67
Decision Tree:Overfitting
46 / 67
Decision Tree :Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large
Decision Tree :Handling Overfitting
Pre-Pruning (Early Stopping Rule)
• Stop the algorithm before it becomes a fully-grown tree
• Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
More restrictive conditions:
• Stop if number of instances is less than some user-specified threshold
• Stop if expanding the current node does not improve impurity measures (e.g., Gini
or information gain).
• Stop if estimated generalization error falls below certain threshold
Decision Tree :Handling Overfitting
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Coverage and Accuracy
• Coverage of a rule: fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule: fraction of records that satisfy the antecedent that also satisfy the consequent of the rule
• Example rule on the 10-record loan data set: (Status = Single) → No
  Coverage = 40%, Accuracy = 50%
[Table: Tid, Refund, Marital Status, Taxable Income, Class (the loan borrower records)]
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How does a Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Characteristics of Rule -Based Classifier
• Exhaustive rules
• Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values
• Each record is covered by at least one rule
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
From Decision Trees To Rules
[Figure: decision tree. Refund = Yes → NO; Refund = No → Marital Status: {Single, Divorced} → Taxable Income (< 80K → NO, > 80K → YES); {Married} → NO]
Classification Rules:
(Refund = Yes) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
(Refund = No, Marital Status = {Married}) ==> No
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rules Can Be Simplified
[Figure: the same decision tree. Refund = Yes → NO; Refund = No → Marital Status: {Single, Divorced} → Taxable Income (< 80K → NO, > 80K → YES); {Married} → NO]
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ordered Rule Set
Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Ordering Schemes
• Rule-based ordering
• Individual rules are ranked based on their quality
• Class-based ordering
• Rules that belong to the same class appear together
Rule-based Ordering Class-based Ordering
(Refund=Yes) ==> No (Refund=Yes) ==> No
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Building Classification Rules
• Direct Method:
• Extract rules directly from data
• e.g.: RIPPER, CN2, Holte’s 1R
• Indirect Method:
• Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
• e.g: C4.5rules
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Direct Method: Sequential Covering
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Indirect Methods
P
No Yes
Q R Rule Set
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Bayes Classifier
• Bayes theorem:
  P(C | A) = P(A | C) · P(C) / P(A)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example of Bayes Theorem
• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20
P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Bayesian Classifiers
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Bayesian Classifiers
• Approach:
• compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem:
  P(C | A1 A2 … An) = P(A1 A2 … An | C) · P(C) / P(A1 A2 … An)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Naive Bayes Classifier
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Estimate Probabilities from Data?
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Estimate Probabilities from
Data?
• For continuous attributes:
• Discretize the range into bins
• one ordinal attribute per bin
• Too many bins – training records are too few to provide reliable probability for each interval
• Too few bins – some intervals may aggregate from different classes & we may miss the correct decision
boundary
• Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
• Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, can use it to estimate the conditional probability P(Ai|Ck)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Estimate Probabilities from Data?
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example of Naïve Bayes Classifier
Given a Test Record: X = (Refund = No, Married, Income = 120K)
P(X | Class = No) = P(Refund = No | Class = No) × P(Married | Class = No) × P(Income = 120K | Class = No)
                  = 4/7 × 4/7 × 0.0069 = 0.0023
[Table: the 10-record loan data set (Tid, Refund, Marital Status, Taxable Income, Evade)]
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Naive Bayes Classifier
Original: P(A_i | C) = N_ic / N_c
Laplace:  P(A_i | C) = (N_ic + 1) / (N_c + c)
where c is the number of classes
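A tiny sketch of the Laplace correction; the counts are invented, and the constant added in the denominator follows the slide's formulation (c = number of classes):

def conditional_prob(n_ic, n_c, c, laplace=True):
    """P(A_i | C): original estimate N_ic / N_c, or Laplace-corrected (N_ic + 1) / (N_c + c)."""
    if laplace:
        return (n_ic + 1) / (n_c + c)
    return n_ic / n_c

print(conditional_prob(0, 7, 3, laplace=False))   # 0.0 -> a single zero count wipes out the whole product
print(conditional_prob(0, 7, 3, laplace=True))    # 0.1 -> the smoothed estimate stays non-zero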
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example of Naïve Bayes Classifier
A: attributes, M: mammals, N: non-mammals
[Table: 20 vertebrates (human, python, salmon, whale, frog, komodo, bat, pigeon, cat, leopard shark, turtle, penguin, porcupine, eel, salamander, gila monster, platypus, owl, dolphin, eagle) with attributes Give Birth, Can Fly, Live in Water, Have Legs, and Class (mammals / non-mammals)]
P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027
Test record: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?
P(A | M) P(M) > P(A | N) P(N)
=> Mammals
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Naïve Bayes (Summary)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Artificial Neural Networks (ANN)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Artificial Neural Networks (ANN)
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How Does ANN Work?
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Artificial Neural Networks (ANN)
[Figure: a "black box" perceptron with input nodes X1, X2, X3, each connected to the output node Y with weight 0.3, and threshold t = 0.4]
X1 X2 X3 Y
1  0  0  0
1  0  1  1
1  1  0  1
1  1  1  1
0  0  1  0
0  1  0  0
0  1  1  1
0  0  0  0
Perceptron Model:
Y = I(Σ_i w_i X_i − t > 0), or
Y = sign(Σ_i w_i X_i − t)
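A quick NumPy check that the perceptron model with w = (0.3, 0.3, 0.3) and t = 0.4 reproduces the Y column of the truth table above:

import numpy as np

X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],[0,0,1],[0,1,0],[0,1,1],[0,0,0]])
w, t = np.array([0.3, 0.3, 0.3]), 0.4
y = (X @ w - t > 0).astype(int)           # Y = I(sum_i w_i X_i - t > 0)
print(y)                                  # [0 1 1 1 0 0 1 0] -- matches the Y column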
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
General Structure of ANN
x1 x2 x3 x4 x5
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Support Vector Machines
[Figure: two candidate linear decision boundaries B1 and B2 with their margin hyperplanes (b11, b12 and b21, b22); the boundary with the larger margin is preferred]
Decision boundary: w · x + b = 0
Margin hyperplanes: w · x + b = 1 and w · x + b = −1
f(x) = 1 if w · x + b ≥ 1, and −1 if w · x + b ≤ −1
Margin = 2 / ||w||
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Nonlinear Support Vector Machines
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Nonlinear Support Vector Machines
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ensemble Methods
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
General Idea
Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D
Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct, one per data set
Step 3: Combine the classifiers into a single classifier C*
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Why does it work?
Suppose there are 25 base classifiers, each with error rate ε = 0.35. If the base classifiers are independent, the ensemble errs only when a majority (13 or more) of them err:
Σ (i = 13 to 25) C(25, i) · ε^i · (1 − ε)^(25−i) ≈ 0.06
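The number 0.06 can be checked directly; the snippet below (a verification sketch, assuming the textbook's error rate of 0.35 per base classifier) sums the binomial probabilities of 13 or more of the 25 classifiers being wrong.

```python
# Probability that a majority of 25 independent base classifiers (error rate 0.35) err.
from math import comb

eps, n = 0.35, 25
ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(ensemble_error, 2))   # ~0.06, far below the individual error rate of 0.35
```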
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Mining
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
M5: Prediction
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Recap
CS1
Why Data Mining?
What Is Data Mining?
Applications of Data Mining
Example: Classification, Clustering, Association mining
CS2
Data Preprocessing Concepts
Data Cleaning, Integration, Reduction, Transformation and Discretization
Data preprocessing techniques
CS3
Data Description
Data Similarity/Dissimilarity
Data Visualization
CS4
Introduction to Classification
Decision Tree
Rule Based Classifier
Bayes Classifier, ANN, SVM, Ensemble Methods
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Agenda
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Introduction
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression for Prediction
• A regression task begins with a data set in which the target values are
known, e.g., a regression model that predicts house values could be
developed based on observed data for many houses over a period of time.
The data might track the age of the house, square footage, number of
rooms, taxes, school district, proximity to shopping centers, and so on.
House value would be the target, the other attributes would be the
predictors, and the data for each house would constitute a case.
• In the model build (training) process, a regression algorithm estimates the
value of the target as a function of the predictors for each case in the build
data. These relationships between predictors and target are summarized in
a model, which can then be applied to a different data set in which the
target values are unknown.
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression in general
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Linear regression in Economics
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prediction Techniques
• Regression analysis
• Linear and multiple regression
• Non-linear regression
• Other regression methods:
• Generalized linear model
• Poisson regression
• Log-linear models
• Regression trees
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
What is Linear
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression analysis
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Simple Linear Regression
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Which line "fits best"?
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Linear Regression Model
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Least Squares
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Estimating Model Parameter
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Coefficients
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Coefficient Equations
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example -1
Parameter Estimation
(10, 0.5) and (25, 1.20) are actual purchase amount and tax
amount pairs. Predict the amount of the tax on the purchase
of amount 55 using linear regression.
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Solution
Example -1
Parameter Estimation
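The solution slide itself carries only the figure; the least-squares arithmetic can be reconstructed as follows (a sketch, using only the two data points given above).

```python
# Least-squares line through (10, 0.5) and (25, 1.20); predict the tax for a purchase of 55.
xs, ys = [10, 25], [0.5, 1.20]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

print(round(b1, 4), round(b0, 4))   # slope ~0.0467, intercept ~0.0333
print(round(b0 + b1 * 55, 2))       # predicted tax ~2.6
```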
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example -2
Parameter Estimation
Hours Spent Studying | Exam Score
4 | 60
1 | 10
10 | 64
14 | 75
4 | 50
7 | 70
22 | 95
3 | 27
8 | 49
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example -2
Parameter Estimation
X - Hours Spent Studying
Y – Exam Score
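A quick verification sketch (not part of the deck) of the parameter estimates quoted on the Parameter Estimation slide below, using the ordinary least-squares formulas on the nine (X, Y) pairs above.

```python
# Least-squares fit of exam score (Y) on hours studied (X).
xs = [4, 1, 10, 14, 4, 7, 22, 3, 8]
ys = [60, 10, 64, 75, 50, 70, 95, 27, 49]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
print(round(b1, 1), round(b0, 1))   # slope ~3.3, intercept ~28.6
```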
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example -2
Prediction Error
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example -2
Parameter Estimation
Slope (𝜷𝟏)
• Exam score (Y) is expected to increase by 3.3 units for each 1-hour increase in study hours (X)
Intercept (𝜷𝟎)
• The predicted exam score (Y) is 28.6 when 0 hours are studied
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example -3
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example -3
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Nonlinear Regression
• Often the relationship between x and y
cannot be approximated with a straight
line. In this case, a nonlinear regression
technique may be used. Alternatively,
the data could be preprocessed to
make the relationship linear.
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into linear regression model.
For example,
y = w0 + w1·x + w2·x² + w3·x³
is convertible to linear form with the new variables x2 = x², x3 = x³:
y = w0 + w1·x + w2·x2 + w3·x3
• Other functions, such as the power function, can also be transformed to a linear model
• Some models are intractably nonlinear (e.g., a sum of exponential terms)
• least-squares estimates can still be obtained, but through extensive calculation on
more complex formulae
• Regression and model trees tend to be more accurate than linear regression when the data are
not represented well by a simple linear model
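As a sketch of the substitution idea (synthetic data, not from the slides): after introducing the new variables x2 = x² and x3 = x³, the cubic model is just an ordinary linear least-squares problem.

```python
# Fit y = w0 + w1*x + w2*x^2 + w3*x^3 by linear least squares after variable substitution.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1 + 2 * x - 0.5 * x**2 + 0.1 * x**3 + rng.normal(0, 0.2, x.size)   # synthetic data

# Design matrix with the substituted variables: columns 1, x, x^2, x^3
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 2))   # weights close to the generating values (1, 2, -0.5, 0.1)
```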
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multiple Linear Regression
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multiple Linear Regression - Example
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Regression Models
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
THANK YOU!
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Mining
BITS Pilani
Pilani|Dubai|Goa|Hyderabad M6: Association Analysis and Apriori Algorithm
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Association Analysis
Data Mining
3
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Analysis Contd.
Data Mining
4
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Market Basket Analysis
Data Mining
5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Mining
Data Mining
Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that do not satisfy the minsup and minconf
thresholds
Computationally prohibitive!
Data Mining
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Butter}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Data Mining
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
Data Mining
(Figure: itemset lattice over the items A, B, C, D, E — singletons A–E, pairs AB … DE, triples ABC … CDE, and so on.)

Support counting: each of the N transactions is matched against the M candidate itemsets (w denotes the transaction width).

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke
Data Mining
Data Mining
Apriori principle:
• If an itemset is frequent, then all of its subsets must also be
frequent
(Figure: itemset lattice over A, B, C, D, E — once an itemset, e.g. {A, B}, is found to be infrequent, all of its supersets up to ABCDE are pruned.)
Data Mining
Data Mining
Method:
1. Let k=1
2. Generate frequent itemsets of length 1
3. Repeat until no new frequent itemsets are identified
i. Generate length (k+1) candidate itemsets from length k
frequent itemsets
ii. Prune candidate itemsets containing subsets of length k that
are infrequent
iii. Count the support of each candidate by scanning the DB
iv. Eliminate candidates that are infrequent, leaving only those
that are frequent
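A compact sketch of this loop (illustrative, not the textbook's implementation) run on the five-transaction table shown earlier; minsup = 3 is an assumed threshold.

```python
# Apriori-style frequent itemset generation: join, prune, count, eliminate.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 3  # assumed minimum support count

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

frequent = [frozenset([i]) for i in set.union(*transactions)
            if support(frozenset([i])) >= minsup]                  # frequent 1-itemsets
all_frequent, k = list(frequent), 1
while frequent:
    # join: merge frequent k-itemsets into (k+1)-candidates
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    # prune: drop candidates having an infrequent k-subset
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k))}
    # count support and eliminate infrequent candidates
    frequent = [c for c in candidates if support(c) >= minsup]
    all_frequent.extend(frequent)
    k += 1

print(sorted(tuple(sorted(f)) for f in all_frequent))
```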
Data Mining
Data Mining
Given t = {1,2,3,5,6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3.
It is not possible to construct a 3-itemset that begins with items 5 or 6 because there
are only two items in t whose labels are greater than or equal to 5.
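This claim is easy to check mechanically (a quick sketch, not from the deck):

```python
# Every 3-itemset contained in t = {1, 2, 3, 5, 6} begins with item 1, 2 or 3.
from itertools import combinations

t = [1, 2, 3, 5, 6]
subsets = list(combinations(t, 3))
print(len(subsets), subsets)        # the ten 3-item subsets, in sorted order
print({s[0] for s in subsets})      # {1, 2, 3}: none begins with 5 or 6
```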
Data Mining
Data Mining
Data Mining
In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the transaction.
Data Mining
(Figure: candidate hash tree for the 15 candidate 3-itemsets {1,2,4}, {1,2,5}, {1,4,5}, {1,5,9}, {1,3,6}, {2,3,4}, {3,4,5}, {3,5,6}, {3,5,7}, {3,6,7}, {3,6,8}, {4,5,7}, {4,5,8}, {5,6,7}, {6,8,9}; the hash function sends items 1, 4, 7 to the left branch, 2, 5, 8 to the middle and 3, 6, 9 to the right. At this level the tree hashes on 1, 4 or 7.)
Data Mining
Hash Function and Candidate Hash Tree
(Figure: the same candidate hash tree, now hashing on item 2, 5 or 8 at the next level.)
Data Mining
(Figure: the same candidate hash tree, hashing on item 3, 6 or 9 at this level.)
Data Mining
Enumerating the subsets of transaction t = {1, 2, 3, 5, 6} level by level:
Level 1 (first item fixed): 1 {2,3,5,6}, 2 {3,5,6}, 3 {5,6}
Level 2 (second item fixed): 12 {3,5,6}, 13 {5,6}, 15 {6}, 23 {5,6}, 25 {6}, 35 {6}
Level 3 (subsets of 3 items): 123, 125, 126, 135, 136, 156, 235, 236, 256, 356
Data Mining
Hash Function
(Figure: transaction {1, 2, 3, 5, 6} is matched against the candidate hash tree — at the root it is split into the prefixes 1 + {2,3,5,6}, 2 + {3,5,6} and 3 + {5,6}, each hashed to the corresponding branch.)
Data Mining
(Figure: descending further, the prefixes are extended to 12 + {3,5,6}, 13 + {5,6} and 15 + {6} and hashed into the corresponding subtrees.)
Subset Operation Using Hash Tree
(Figure: the complete descent of transaction {1, 2, 3, 5, 6} through the candidate hash tree, visiting only the leaves reachable from the hashed prefixes.)
Match transaction against 11 out of 15 candidates
Factors Affecting Complexity
Data Mining
Procedure: for each frequent itemset l, generate all nonempty proper subsets s of l, and output the rule "s -> (l - s)" if support_count(l) / support_count(s) ≥ min_conf.
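A minimal sketch of this rule-generation step (the support counts below are illustrative and consistent with the small transaction table used earlier; min_conf is an assumed threshold):

```python
# Emit s -> (l - s) for every nonempty proper subset s of a frequent itemset l
# whenever support_count(l) / support_count(s) >= min_conf.
from itertools import combinations

support_count = {                                  # illustrative counts
    frozenset({"Milk"}): 4,
    frozenset({"Diaper"}): 4,
    frozenset({"Milk", "Diaper"}): 3,
}
l, min_conf = frozenset({"Milk", "Diaper"}), 0.7   # assumed min_conf

for r in range(1, len(l)):
    for s in map(frozenset, combinations(l, r)):
        conf = support_count[l] / support_count[s]
        if conf >= min_conf:
            print(f"{set(s)} -> {set(l - s)} (confidence = {conf:.2f})")
```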
Data Mining
Perform pruning again, this time to filter out all association rules whose confidence (%) is less than min_conf (70%).
Data Mining
Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Pilani Campus
Important Note to Students
Once you join the session, stay until the end to be marked present in the class.
Data Mining
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
Data Mining
Data Mining
7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Frequent Patterns
Without Candidate Generation
• Grow long patterns from short ones using local
frequent items
• “abc” is a frequent pattern
• Get all transactions having “abc”: DB|abc
• If "d" is a local frequent item in DB|abc, then "abcd" is a
frequent pattern
Data Mining
8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent Patterns(FP)-growth Algorithm
• Use a compressed representation of the database using
an FP-tree
• Once an FP-tree has been constructed, it uses a
recursive divide-and-conquer approach to mine the
frequent itemsets
1.Scan DB once, find frequent 1-itemset (single item
pattern)
2.Sort frequent items in frequency descending order,
f-list
3.Scan DB again, construct FP-tree
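Steps 1 and 2 can be sketched as below; the transactions are illustrative, chosen only so that they yield an f-list of the form f-c-a-b-m-p used on the next slide (the order among equally frequent items may differ).

```python
# FP-growth preprocessing: find frequent 1-itemsets, build the f-list, reorder transactions.
from collections import Counter

transactions = [                       # illustrative transaction database
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_support = 3

counts = Counter(item for t in transactions for item in t)
f_list = [item for item, c in counts.most_common() if c >= min_support]
print(f_list)                          # frequent items in descending frequency order

ordered = [[i for i in f_list if i in t] for t in transactions]
print(ordered)                         # transactions in f-list order, ready for FP-tree insertion
```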
Data Mining
min_support = 3 F-list=f-c-a-b-m-p
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Benefits of the FP-tree Structure
• Completeness
• Preserve complete information for frequent pattern
mining
• Never break a long pattern of any transaction
• Compactness
• Reduce irrelevant info—infrequent items are gone
• Items in frequency descending order: the more
frequently occurring, the more likely to be shared
• Never be larger than the original database (not
count node-links and the count field)
Data Mining 13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Partition Patterns and Databases
• Frequent patterns can be partitioned into subsets
according to f-list
• F-list=f-c-a-b-m-p
• Patterns containing p
• Patterns having m but no p
•…
• Patterns having c but not a, b, m, or p
• Pattern f
• Completeness and non-redundancy
Data Mining
14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
From Conditional Pattern-bases to Conditional FP-trees
Conditional pattern base of "cam": (f:3)
The cam-conditional FP-tree is the single-branch tree {} → f:3
Data Mining
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
A Special Case: Single Prefix Path in FP-tree
• Suppose a (conditional) FP-tree T has a shared single prefix-path P
• Mining can be decomposed into two parts:
• reduction of the single prefix path into one node
• concatenation of the mining results of the two parts
(Figure: the tree {} → a1:n1 → a2:n2 → a3:n3 with children b1:m1 and C1:k1 is decomposed into the single prefix path, reduced to one node r1, plus the residual tree {} → r1 with children b1:m1 and C1:k1.)
Data Mining
18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Why Is FP-Growth the Winner?
• Divide-and-conquer:
• decompose both the mining task and DB according
to the frequent patterns obtained so far
• leads to focused search of smaller databases
• Other factors
• no candidate generation, no candidate test
• compressed database: FP-tree structure
• no repeated scan of entire database
• basic ops—counting local freq items and building
sub FP-tree, no pattern search and matching
Data Mining
19
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Mining
Data Mining
Data Mining
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a
binary partitioning of a frequent itemset
Data Mining
• join(CD=>AB, BD=>AC) would produce the candidate rule D => ABC
Data Mining
Support distribution of a retail data set
Data Mining
Data Mining
Data Mining
      | Coffee | ¬Coffee | Total
Tea   | 15     | 5       | 20
¬Tea  | 75     | 5       | 80
Total | 90     | 10      | 100
Data Mining
Lift = P(Y | X) / P(Y)
Interest = P(X, Y) / (P(X) · P(Y))
Data Mining
      | Coffee | ¬Coffee | Total
Tea   | 15     | 5       | 20
¬Tea  | 75     | 5       | 80
Total | 90     | 10      | 100
Data Mining
Example 1:
      | Y  | ¬Y | Total
X     | 10 | 0  | 10
¬X    | 0  | 90 | 90
Total | 10 | 90 | 100
Lift = 0.1 / ((0.1)(0.1)) = 10

Example 2:
      | Y  | ¬Y | Total
X     | 90 | 0  | 90
¬X    | 0  | 10 | 10
Total | 90 | 10 | 100
Lift = 0.9 / ((0.9)(0.9)) ≈ 1.11
Statistical independence:
If P(X,Y)=P(X)P(Y) => Lift = 1
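A small sketch (not from the slides) computing lift directly from 2×2 contingency counts; it reproduces the Tea/Coffee value and the two examples above.

```python
# lift = P(Y | X) / P(Y) = (n_xy / n_x) / (n_y / n), from contingency-table counts.
def lift(n_xy, n_x, n_y, n):
    return (n_xy / n_x) / (n_y / n)

print(round(lift(15, 20, 90, 100), 3))   # Tea -> Coffee: 0.75 / 0.9 ~ 0.833 (< 1)
print(lift(10, 10, 10, 100))             # Example 1 above: lift = 10
print(round(lift(90, 90, 90, 100), 2))   # Example 2 above: lift ~ 1.11
```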
Data Mining
• Objective measure:
• Rank patterns based on statistics computed from data
• e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.)
• Subjective measure:
• Rank patterns according to the user's interpretation
• A pattern is subjectively interesting if it contradicts the expectation of a user
• A pattern is subjectively interesting if it is actionable
Data Mining
(Figure: expected patterns vs. unexpected patterns — discovered patterns are compared against the user's expectations.)
• Recap of CS06
• Mining Association Rules
• Bottleneck of Frequent-pattern Mining
• Mining Frequent Patterns Without Candidate Generation
• Frequent Patterns(FP)-growth Algorithm
• From Conditional Pattern-bases to Conditional FP-trees
• Association Rule Mining
• Rule Generation for Apriori Algorithm
• Effect of Support Distribution
• Multiple Minimum Support, Pattern Evaluation
Thank You!