
Data Mining
SSWT ZC 425

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

M1: Introduction to Data Mining

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Mining:
Concepts and Techniques

3
Text Book(s)

 T1: Tan P. N., Steinbach M. & Kumar V., "Introduction to Data Mining", Pearson Education, 2014
   https://www.ceom.ou.edu/media/docs/upload/Pang-Ning_Tan_Michael_Steinbach_Vipin_Kumar_-_Introduction_to_Data_Mining-Pe_NRDK4fi.pdf

 T2: Han J. & Kamber M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2012
   https://sves.org.in/ecap/Resources/_53.pdf
August 30, 2025 Data Mining: Concepts and Techniques 4
Reference Book(s) & other resources

 R1: Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner, by Vijay Kotu and Bala Deshpande, Morgan Kaufmann Publishers, 2015

 R2: Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, by Gary Miner et al., Academic Press, 2012

 R3: Recommender Systems for Learning, by Nikos Manouselis, Hendrik Drachsler, Katrien Verbert and Erik Duval, Springer, 2013
SSWT ZC 425 Data Mining

WASE 2022_SSWT ZC 425_ Data Mining


Handout



Data Mining Tools
 MonkeyLearn | No-code text mining tools
 RapidMiner | Drag and drop workflows or data mining in
Python
 Oracle Data Mining | Predictive data mining models
 IBM SPSS Modeler | A predictive analytics platform for data
scientists
 Weka | Open-source software for data mining
 Knime | Pre-built components for data mining projects
 H2O | Open-source library offering data mining in Python
 Orange | Open-source data mining toolbox
 Apache Mahout | Ideal for complex and large-scale data
mining
 SAS Enterprise Miner | Solve business problems with data
mining



Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
8
Why Data Mining?

 The Explosive Growth of Data: from terabytes to petabytes


 Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets

9
Evolution of Sciences
 Before 1600, empirical science
 1600-1950s, theoretical science
 Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
 1950s-1990s, computational science
 Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
 1990-now, data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally accessible
 Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!

 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
10
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems

11
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
12
What Is Data Mining?

 Data mining (knowledge discovery from data)


 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems

13
Knowledge Discovery (KDD) Process

 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process

[Figure: KDD process pipeline: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
14
Example: A Web Mining Framework

 Web mining usually involves


 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored into
knowledge-base

15
Data Mining in Business Intelligence

[Figure: BI pyramid, with increasing potential to support business decisions toward the top]
  Decision Making (End User)
  Data Presentation: Visualization Techniques (Business Analyst)
  Data Mining: Information Discovery (Data Analyst)
  Data Exploration: Statistical Summary, Querying, and Reporting
  Data Preprocessing/Integration, Data Warehouses (DBA)
  Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
16
Example: Mining vs. Data Exploration

 Business intelligence view


 Warehouse, data cube, reporting but not much mining
 Business objects vs. data mining tools
 Supply chain example: tools
 Data presentation
 Exploration

17
KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

 Data pre-processing: data integration, normalization, feature selection, dimension reduction
 Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
 Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

 This is a view from typical machine learning and statistics communities

18
Example: Medical Data Mining

 Health care & medical data mining have often
adopted such a view from statistics and machine
learning
 Preprocessing of the data (including feature
extraction and dimension reduction)
 Classification or/and clustering processes
 Post-processing for presentation

19
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
20
Multi-Dimensional View of Data Mining
 Data to be mined
 Database data (extended-relational, object-oriented, heterogeneous,

legacy), data warehouse, transactional data, stream, spatiotemporal,


time-series, sequence, text and web, multi-media, graphs & social
and information networks
 Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification,

clustering, trend/deviation, outlier analysis, etc.


 Descriptive vs. predictive data mining

 Multiple/integrated functions and mining at multiple levels

 Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning, statistics,

pattern recognition, visualization, high-performance, etc.


 Adopted by Applications
 Retail, telecommunication, banking, fraud analysis, bio-data mining,

stock market analysis, text mining, Web mining, etc.


21
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
22
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structure data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web

23
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
24
Data Mining Function: (1) Generalization

 Information integration and data warehouse construction


 Data cleaning, transformation, integration, and
multidimensional data model
 Data cube technology
 Scalable methods for computing (i.e., materializing)
multidimensional aggregates
 OLAP (online analytical processing)
 Multidimensional concept description: Characterization
and discrimination
 Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region

25
Data Mining Function: (2) Association and
Correlation Analysis
 Frequent patterns (or frequent itemsets)
 What items are frequently purchased together in your
Walmart?
 Association, correlation vs. causality
 A typical association rule
 Diaper  Beer [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly correlated?
 How to mine such patterns and rules efficiently in large
datasets?
 How to use such patterns for classification, clustering,
and other applications?
26
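The support and confidence attached to the Diaper ⇒ Beer rule above can be computed directly from transaction data. A minimal sketch in plain Python; the function name and the basket list are illustrative, not taken from the course material:

```python
# Support and confidence for an association rule lhs => rhs.
# support    = fraction of all transactions containing lhs union rhs
# confidence = fraction of lhs-transactions that also contain rhs
def rule_metrics(transactions, lhs, rhs):
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    n_both = sum(1 for t in transactions if (lhs | rhs) <= set(t))
    return n_both / len(transactions), n_both / n_lhs

baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
]
support, confidence = rule_metrics(baskets, {"diaper"}, {"beer"})
# support = 2/4 = 0.5, confidence = 2/3
```

Efficient miners (e.g., Apriori, FP-growth) avoid scanning all transactions per rule, but the metrics they report are exactly these ratios.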
Data Mining Function: (3) Classification

 Classification and label prediction


 Construct models (functions) based on some training examples
 Describe and distinguish classes or concepts for future prediction
 E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
 Predict some unknown class labels
 Typical methods
 Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
 Typical applications:
 Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …

27
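To make the "construct a model from training examples, then predict unknown labels" idea concrete, here is a sketch of a 1-nearest-neighbour classifier. It is not one of the methods named on the slide, but it shows the same supervised workflow; the toy credit-risk data is made up:

```python
import math

def predict_1nn(train, query):
    """train: list of (feature_vector, label) pairs.
    Returns the label of the closest training example
    under Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min((dist(f, query), lab) for f, lab in train)
    return label

# Illustrative training set: (income, savings) -> risk class.
train = [((1.0, 1.0), "low risk"), ((8.0, 9.0), "high risk")]
predict_1nn(train, (2.0, 2.0))  # -> "low risk"
```

Decision trees, naive Bayes, and SVMs replace the "memorize everything" step with an explicit learned model, but are used through the same train/predict interface.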
Data Mining Function: (4) Cluster Analysis

 Unsupervised learning (i.e., Class label is unknown)


 Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
 Principle: Maximizing intra-class similarity & minimizing
interclass similarity
 Many methods and applications

28
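The stated principle (maximize intra-cluster similarity, minimize inter-cluster similarity) can be sketched with a bare-bones k-means on 1-D data; the points and starting centroids are illustrative:

```python
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:  # assign each point to its nearest centroid
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

kmeans_1d([1.0, 2.0, 10.0, 11.0], [0.0, 5.0])  # -> [1.5, 10.5]
```

Real clustering libraries add smarter initialization and convergence checks, but this loop is the core of the partitioning approach.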
Data Mining Function: (5) Outlier Analysis
 Outlier analysis
 Outlier: A data object that does not comply with the general
behavior of the data
 Noise or exception? ― One person’s garbage could be another
person’s treasure
 Methods: by product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis

29
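Besides arising as a by-product of clustering or regression, outliers can be flagged by a simple statistical rule. A z-score sketch (threshold and data are illustrative):

```python
import statistics

def zscore_outliers(data, threshold=2.0):
    """Return values lying more than `threshold` population
    standard deviations away from the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) / sigma > threshold]

zscore_outliers([10, 11, 9, 10, 12, 50])  # -> [50]
```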
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
 Sequence, trend and evolution analysis
 Trend, time-series, and deviation analysis: e.g.,

regression and value prediction


 Sequential pattern mining

 e.g., first buy digital camera, then buy large SD

memory cards
 Periodicity analysis

 Motifs and biological sequence analysis

 Approximate and consecutive motifs

 Similarity-based analysis

 Mining data streams


 Ordered, time-varying, potentially infinite, data streams

30
Structure and Network Analysis

 Graph mining
 Finding frequent subgraphs (e.g., chemical compounds), trees

(XML), substructures (web fragments)


 Information network analysis
 Social networks: actors (objects, nodes) and relationships (edges)

 e.g., author networks in CS, terrorist networks

 Multiple heterogeneous networks

 A person could be in multiple information networks: friends,

family, classmates, …
 Links carry a lot of semantic information: Link mining

 Web mining
 Web is a big information network: from PageRank to Google

 Analysis of Web information networks

 Web community discovery, opinion mining, usage mining, …

31
Evaluation of Knowledge
 Are all mined knowledge interesting?
 One can mine a tremendous amount of “patterns” and knowledge
 Some may fit only certain dimension space (time, location, …)
 Some may not be representative, may be transient, …
 Evaluation of mined knowledge → directly mine only
interesting knowledge?
 Descriptive vs. predictive
 Coverage
 Typicality vs. novelty
 Accuracy
 Timeliness
 …
32
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
33
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Statistics, Visualization, Applications, Algorithms, Database Technology, and High-Performance Computing]

34
Why Confluence of Multiple Disciplines?

 Tremendous amount of data


 Algorithms must be highly scalable to handle, e.g., terabytes of
data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
35
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
36
Applications of Data Mining
 Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
 Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
 From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining

37
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
38
Major Issues in Data Mining (1)

 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results

39
Major Issues in Data Mining (2)

 Efficiency and Scalability


 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining

40
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
41
A Brief History of Data Mining Society

 1989 IJCAI Workshop on Knowledge Discovery in Databases


 Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
 1991-1994 Workshops on Knowledge Discovery in Databases
 Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
 1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
 Journal of Data Mining and Knowledge Discovery (1997)
 ACM SIGKDD conferences since 1998 and SIGKDD Explorations
 More conferences on data mining
 PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
 ACM Transactions on KDD starting in 2007
42
Conferences and Journals on Data Mining

 KDD Conferences
   ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
   SIAM Data Mining Conf. (SDM)
   (IEEE) Int. Conf. on Data Mining (ICDM)
   European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
   Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
   Int. Conf. on Web Search and Data Mining (WSDM)
 Other related conferences
   DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
   Web and IR conferences: WWW, SIGIR, WSDM
   ML conferences: ICML, NIPS
   PR conferences: CVPR, …
 Journals
   Data Mining and Knowledge Discovery (DAMI or DMKD)
   IEEE Trans. on Knowledge and Data Eng. (TKDE)
   KDD Explorations
   ACM Trans. on KDD

43
Where to Find References? DBLP, CiteSeer, Google

 Data mining and KDD (SIGKDD: CDROM)


 Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
 Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
 Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
 Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
 Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
 AI & Machine Learning
 Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
 Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems,
IEEE-PAMI, etc.
 Web and IR
 Conferences: SIGIR, WWW, CIKM, etc.
 Journals: WWW: Internet and Web Information Systems,
 Statistics
 Conferences: Joint Stat. Meeting, etc.
 Journals: Annals of statistics, etc.
 Visualization
 Conference proceedings: CHI, ACM-SIGGraph, etc.
 Journals: IEEE Trans. visualization and computer graphics, etc.
44
Chapter 1. Introduction
 Why Data Mining?

 What Is Data Mining?

 A Multi-Dimensional View of Data Mining

 What Kind of Data Can Be Mined?

 What Kinds of Patterns Can Be Mined?

 What Technologies Are Used?

 What Kinds of Applications Are Targeted?

 Major Issues in Data Mining

 A Brief History of Data Mining and Data Mining Society

 Summary
45
Summary
 Data mining: Discovering interesting patterns and knowledge from
massive amount of data
 A natural evolution of database technology, in great demand, with
wide applications
 A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
 Mining can be performed in a variety of data
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
 Data mining technologies and applications
 Major issues in data mining

46
Recommended Reference Books
 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan
Kaufmann, 2002
 R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
 U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
 J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
 D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
 T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
 B. Liu, Web Data Mining, Springer 2006.
 T. M. Mitchell, Machine Learning, McGraw Hill, 1997
 G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
 S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005

47
Queries ??



THANK YOU



Data Mining
M2: Data Preprocessing

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

2.1 Data Preprocessing Techniques

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Preprocessing
Data Pre-processing

• Data Pre-processing: An Overview, Data Cleaning,

• Data Reduction - Overview of Data Reduction Strategies, PCA, Attribute Subset Selection, Histograms, Sampling;

• Data Transformation and Data Discretization - Data Transformation by Normalization,

• Discretization by Binning, Discretization by Histogram Analysis, Discretization by Cluster, Decision Tree, and Correlation Analyses.
Data Preprocessing 5
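Of the techniques named above, discretization by equal-width binning is simple enough to sketch in a few lines; the bin count and values are illustrative:

```python
def equal_width_bins(values, k):
    """Map each value to a bin index 0..k-1 over k equal-width
    intervals; the top edge is folded into the last bin.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

equal_width_bins([0, 2, 4, 6, 8, 10], 2)  # -> [0, 0, 0, 1, 1, 1]
```

Equal-frequency binning, histogram analysis, and cluster-based discretization choose the cut points differently but produce the same kind of value-to-bin mapping.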
Data Mining

• What is Data Mining?

• Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD,

• while others view data mining as merely an essential step in the process of knowledge discovery.
Data Preprocessing 6
Knowledge Discovery Process
• The knowledge discovery process is shown in Figure below as an iterative sequence

Data Preprocessing 7
Major Tasks in Data Preprocessing

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: integration of multiple databases, data cubes, or files
Data reduction: dimensionality reduction
Data transformation and data discretization: normalization

Data Preprocessing 8
Knowledge Discovery Process

The knowledge discovery process has the following steps:


1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)
Data Preprocessing 9
Knowledge Discovery Process

Steps 1 through 4 are different forms of data preprocessing, where data are
prepared for mining.

The data mining step may interact with the user or a knowledge base.

The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base.

“Data mining is the process of discovering interesting patterns and knowledge from
large amounts of data. The data sources can include databases, data warehouses, the
Web, other information repositories, or data that are streamed into the system
dynamically” .
Data Preprocessing 10
Data Preprocessing

Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent
data due to their typically huge size and their origin from multiple, heterogeneous
sources.
Low-quality data will lead to low-quality mining results.
“How can the data be preprocessed in order to help improve the quality of the data
and, consequently, of the mining results?
How can the data be preprocessed so as to improve the efficiency and ease of the
mining process?”
There are several data preprocessing techniques.
Data cleaning can be applied to remove noise and correct inconsistencies in data.

Data Preprocessing 11
Data Preprocessing

Data integration merges data from multiple sources into a coherent data
store such as a data warehouse.

Data reduction can reduce data size by aggregating, eliminating redundant
features, or clustering.

 Data transformations (e.g., normalization) may be applied, where data are


scaled to fall within a smaller range like 0.0 to 1.0.

This can improve the accuracy and efficiency of mining algorithms.


Data Preprocessing 12
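The normalization transformation described above, rescaling an attribute into a smaller range such as [0.0, 1.0], can be sketched as min-max normalization; the sample values are made up:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

min_max_normalize([200, 300, 400])  # -> [0.0, 0.5, 1.0]
```

Other normalization schemes (z-score, decimal scaling) differ only in the rescaling formula.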
Data Quality: Why Preprocess the Data?
Data have quality if they satisfy the requirements of the intended use.

There are many factors comprising data quality, including accuracy, completeness, consistency,
timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company’s
data with respect to your branch’s sales.

You carefully inspect the company’s database and data warehouse, identifying and selecting the
attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis.

 You notice that several of the attributes for various tuples have no recorded value.

Data Preprocessing 13
Data Preprocessing Steps
• Figure below summarizes the data preprocessing steps described here

Data Preprocessing 14
Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent.

Data cleaning routines attempt to fill in missing values, smooth


out noise while identifying outliers, and correct inconsistencies in
the data.

Data Preprocessing 15
Missing Values

 Imagine that you need to analyze AllElectronics sales and customer data.

 You note that many tuples have no recorded value for several attributes such as customer income. How
can you go about filling in the missing values for this attribute?

 Let’s look at the following methods.

1. Ignore the tuple: This is usually done when the class label is missing. This method is not very effective,
unless the tuple contains several attributes with missing values. It is especially poor when the percentage
of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the
remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

Data Preprocessing 16
Missing Values

3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant such as a label like “Unknown” or −∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common—that of
“Unknown.” Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value: measures of central tendency, which indicate the “middle”
value of a data distribution. For normal (symmetric) data distributions, the mean
can be used, while skewed data distribution should employ the median.

Data Preprocessing 17
Missing Values

5. Use the attribute mean or median for all samples belonging to the same class as the
given tuple: For example, if classifying customers according to credit risk, we may replace
the missing value with the mean income value for customers in the same credit risk
category as that of the given tuple. If the data distribution for a given class is skewed, the
median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
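A minimal sketch of methods 4 and 5 in code: missing income values (marked `None`) are filled with the overall mean/median and with the class-wise mean. The income figures and risk labels below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical data: None marks a missing income entry
incomes = [30000, None, 45000, 52000, None, 61000]
risks = ["low", "low", "high", "high", "low", "high"]  # class labels

known = [v for v in incomes if v is not None]
fill_mean = [v if v is not None else mean(known) for v in incomes]
fill_median = [v if v is not None else median(known) for v in incomes]

# Method 5: use the mean of samples belonging to the same class
by_class = {}
for v, r in zip(incomes, risks):
    if v is not None:
        by_class.setdefault(r, []).append(v)
fill_by_class = [v if v is not None else mean(by_class[r])
                 for v, r in zip(incomes, risks)]
```

Here the overall mean of the known values is 47,000 and the median is 48,500, while the class-wise fill for a "low"-risk tuple uses only the "low"-risk incomes.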

Data Quality: Why Preprocess the Data?

Measures for data quality

Accuracy: correct or wrong, accurate or not

Completeness: not recorded, unavailable, …

Consistency: some modified but some not, dangling, …

Major Tasks in Data Preprocessing

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration: integration of multiple databases, data cubes, or files

Data reduction: dimensionality reduction, …

Data transformation and data discretization: normalization, …
Data Reduction
Data reduction: obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results.

Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.

Data Reduction
Data reduction strategies

Dimensionality reduction, Ex: remove unimportant attributes

• Wavelet transforms

• Principal Components Analysis (PCA)

• Feature subset selection, feature creation

Numerosity reduction (some simply call it: Data Reduction)

• Regression and Log-Linear Models

• Histograms, clustering, sampling

• Data cube aggregation

Data compression
Data Reduction
Dimensionality Reduction

Curse of dimensionality

• When dimensionality increases, data becomes increasingly sparse

• Density and distance between points, which is critical to clustering and outlier analysis, becomes less meaningful

Data Reduction
Dimensionality Reduction

Dimensionality reduction

• Avoid the curse of dimensionality

• Help eliminate irrelevant features and reduce noise

• Reduce time and space required in data mining

• Allow easier visualization

Data Reduction
Dimensionality Reduction

Dimensionality reduction techniques

• Wavelet transforms

• Principal Component Analysis

• Supervised and nonlinear techniques (Ex: feature selection)


Regression and Log-Linear Models

Linear regression
• Data modeled to fit a straight line

• Often uses the least-squares method to fit the line

Multiple regression
• Allows a response variable Y to be modeled as a linear function
of multidimensional feature vector

Regression Analysis
• Regression analysis: a technique for the modeling and analysis of numerical data consisting of values of a dependent variable (response variable or measurement) and of one or more independent variables (explanatory variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least squares method
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
(Figure omitted: sample points (X1, Y1) fitted by the line y = x + 1)
Regression and Log-Linear Models

Linear regression: Y = w X + b

Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand

Using the least squares criterion on the known values Y1, Y2, …,
X1, X2, …

Multiple regression: Y = b0 + b1 X1 + b2 X2

Many nonlinear functions can be transformed into the above
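As a minimal sketch of estimating w and b by the least squares criterion for the simple linear case, the snippet below uses the closed-form solution; the sample points are invented and chosen to lie exactly on y = 2x + 1.

```python
def least_squares(xs, ys):
    """Closed-form least-squares fit of y = w*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b

w, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])  # exact fit: w = 2, b = 1
```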

Sampling: With or without Replacement

(Figure omitted: samples drawn from the raw data with and without replacement)
Sampling: Cluster or Stratified Sampling

(Figure omitted: raw data and the resulting cluster/stratified sample)
Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
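The worked example above can be reproduced in code: this sketch partitions the sorted prices into equal-frequency bins of size 4, then applies both smoothing rules.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]
```

Running this reproduces the bins shown on the slide: means 9, 23, 29 and boundary-smoothed bins [4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34].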

Correlation Analysis (Nominal Data)

• χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ2 value, the more likely the variables are related

• The cells that contribute the most to the Χ2 value are those whose actual count is very
different from the expected count

• Correlation does not imply causality


• # of hospitals and # of car-theft in a city are correlated

• Both are causally linked to the third variable: population

Chi-Square Calculation: An Example
                             Play chess   Not play chess   Sum (row)
  Like science fiction       250 (90)      200 (360)         450
  Not like science fiction    50 (210)    1000 (840)        1050
  Sum (col.)                 300          1200              1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories)

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• It shows that like_science_fiction and play_chess are correlated in the group
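The same statistic can be checked in code; the expected counts are derived from the row and column totals of the 2×2 table above.

```python
observed = [[250, 200], [50, 1000]]          # rows: likes sci-fi / does not
rows = [sum(r) for r in observed]            # [450, 1050]
cols = [sum(c) for c in zip(*observed)]      # [300, 1200]
total = sum(rows)                            # 1500

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = rows[i] * cols[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected
# chi2 is about 507.93, matching the slide
```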

Co-Variance: An Example

• It can be simplified in computation as Cov(A, B) = E(A·B) − E(A)·E(B)

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).

• Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

• Thus, A and B rise together since Cov(A, B) > 0.
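The stock arithmetic above, recomputed in code using the simplified form Cov(A, B) = E(A·B) − E(A)·E(B):

```python
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

e_ab = sum(a * b for a, b in zip(A, B)) / n   # E(A*B) = 212/5 = 42.4
cov = e_ab - (sum(A) / n) * (sum(B) / n)      # 42.4 - 4 * 9.6 = 4.0
```

Since the covariance is positive, the two stocks tend to rise together, as the slide concludes.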


Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers

• Discretization: Divide the range of a continuous


attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification
Data Discretization Methods
• Typical methods: All the methods can be applied recursively
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-
up merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., 2) analysis (unsupervised, bottom-up
merge)

Simple Discretization: Binning

• Equal-width (distance) partitioning


• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well

• Equal-depth (frequency) partitioning


• Divides the range into N intervals, each containing approximately same
number of samples
• Good data scaling
• Managing categorical attributes can be tricky

Discretization by Classification & Correlation Analysis

• Classification (e.g., decision tree analysis)


• Supervised: Given class labels, e.g., cancerous vs. benign
• Using entropy to determine split point (discretization point)
• Top-down, recursive split
• Details to be covered in Chapter 7

• Correlation analysis (e.g., Chi-merge: χ2-based discretization)


• Supervised: use class information
• Bottom-up merge: find the best neighboring intervals (those having similar
distributions of classes, i.e., low χ2 values) to merge
• Merge performed recursively, until a predefined stopping condition

Data Reduction : Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier analysis, becomes less
meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)

Data Mining
40
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Principal Component Analysis (PCA)

• Find a projection that captures the largest amount of variation in data


• The original data are projected onto a much smaller space, resulting in dimensionality
reduction.
(Figure omitted: data points and their principal axes in the x1–x2 plane)
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components)
that can be best used to represent data
• Normalize input data: Each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the
weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
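The steps above can be sketched for a 2-D toy data set without any linear-algebra library, since the principal axis of a 2×2 covariance matrix has a closed form (angle θ with tan 2θ = 2·s_xy / (s_xx − s_yy)); the points below are invented for illustration.

```python
from math import atan2, cos, sin

# Toy 2-D data; PCA finds the single direction of largest variance
pts = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
       (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(pts)
mx = sum(p[0] for p in pts) / n
my = sum(p[1] for p in pts) / n
centered = [(x - mx, y - my) for x, y in pts]   # step 1: normalize/center

# Sample covariance matrix entries
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Closed-form major-axis angle of a 2x2 symmetric matrix
theta = 0.5 * atan2(2 * sxy, sxx - syy)
pc1 = (cos(theta), sin(theta))                  # first principal component
scores = [x * pc1[0] + y * pc1[1] for x, y in centered]  # 1-D projection
var_pc1 = sum(s * s for s in scores) / (n - 1)
```

By construction the variance captured along `pc1` is at least as large as the variance along either original attribute, which is exactly why dropping the weaker component loses little information.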

Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or more other
attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students' ID is often irrelevant to the task of predicting students' GPA

Heuristic Search in Attribute Selection

• There are 2d possible attribute combinations of d attributes


• Typical heuristic attribute selection methods:
• Best single attribute under the attribute independence assumption: choose
by significance tests
• Best step-wise feature selection:
• The best single-attribute is picked first
• Then next best attribute condition to the first, ...
• Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Best combined attribute selection and elimination
• Optimal branch and bound:
• Use attribute elimination and backtracking

Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important information in a
data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation
• Attribute construction
• Combining features
• Data discretization

Data Reduction: Numerosity Reduction

• Reduce data volume by choosing alternative, smaller forms of data


representation
• Parametric methods (e.g., regression)
• Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
• Ex.: Log-linear models—obtain value at a point in m-D space as the product
on appropriate marginal subspaces
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …

Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers

Thank you
Data Mining
BITS Pilani M3: Data Exploration
Pilani|Dubai|Goa|Hyderabad

The instructor is gratefully acknowledging the authors who made their course materials freely available online.
IMPORTANT Note to Students

 It is important to know that just logging in to the session does not guarantee
attendance.
 Once you join the session, continue till the end to be considered present in the
class.
 IMPORTANTLY, you need to make the class more interactive by responding to the
Professor's queries during the session.
 Whenever the Professor calls your number / name, you need to respond;
otherwise you will be marked ABSENT

BITS Pilani, Pilani Campus


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

3.1 Data Descriptions

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Recap

• What is DM and what is not DM

• Prediction methods and description methods with examples

• Data Preprocessing concepts

• Data preprocessing techniques

Recap
Equal Depth (Frequency) Binning: Bins have an equal frequency.

Input:[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output: [5, 10, 11, 13] [15, 35, 50, 55] [72, 92, 204, 215]

Equal Width (Distance) Binning: bins have equal width; the bin boundaries are defined as
[min + w], [min + 2w], …, [min + nw], where
w = (max − min) / (no. of bins)
No. of bins = 3

Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
min = 5 w = (215-5)/3 = 70
Output: [5, 10, 11, 13, 15, 35, 50, 55, 72] [92] [204, 215]
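The equal-width example above can be sketched in code: each value is assigned to the bin whose 70-wide interval contains it.

```python
vals = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
k = 3
lo = min(vals)
w = (max(vals) - lo) / k                 # (215 - 5) / 3 = 70

bins = [[] for _ in range(k)]
for v in vals:
    idx = min(int((v - lo) // w), k - 1)  # clamp max value into last bin
    bins[idx].append(v)
```

This reproduces the slide's output: nine values fall into [5, 75), one into [75, 145), and two into [145, 215].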

Recap : Exercise
Consider that following scholar data set made available to you for carrying out some
mining activity. List 4 potential issues with this dataset identified while preprocessing?

Sno Name Age DateOfJoining Branch DateOfBirth

1 Anu 34 15-Jan-2015 CSE Feb 24, 1997

2 Banu 33 27-Jan-2015 Mech Mar 27, 1998

3 Anu C 34 15-Jan-2015 CSE Feb 24, 1997

4 Chiru 32 30-Jan-2015 ECE Nov 25,1998

5 Amit 34 15-Jan-2015 NULL Feb 24, 1998

6 Arish 38 15/Jan/2017 ECE 24 Dec 1999

7 Sanu 34 15-Jan-2015 Civil Jan 1st, 1997

8 Naresh 98 15-Dec-2017 ECE 24 Dec 1999

9 Arush 28 15-Jan-2027 ECE 24 Dec 2019



Solution
Consider that following scholar data set made available to you for carrying out some
mining activity. List 4 potential issues with this dataset identified while preprocessing?

Sno Name Age DateOfJoining Branch DateOfBirth

1 Anu 34 15-Jan-2015 CSE Feb 24, 1997

2 Banu 33 27-Jan-2015 Mech Mar 27, 1998

3 Anu C 34 15-Jan-2015 CSE Feb 24, 1997

4 Chiru 32 30-Jan-2015 ECE Nov 25,1998

5 Amit 34 15-Jan-2015 NULL Feb 24, 1998

6 Arish 38 15/Jan/2017 ECE 24 Dec 1999

7 Sanu 34 15-Jan-2015 Civil Jan 1st, 1997

8 Naresh 98 15-Dec-2017 ECE 24 Dec 1999

9 Arush 28 15-Jan-2027 ECE 24 Dec 2019

Identified issues: (1) duplicate records: rows 1 and 3 describe the same student ("Anu" / "Anu C") with identical values; (2) missing value: Branch is NULL in row 5; (3) inconsistent date formats: 15-Jan-2015 vs. 15/Jan/2017, and "Feb 24, 1997" vs. "24 Dec 1999" vs. "Jan 1st, 1997"; (4) outliers / invalid values: Age 98 in row 8, and row 9 has a future DateOfJoining (15-Jan-2027) and a DateOfBirth (2019) inconsistent with Age 28.


Types of Data Sets
• Record
  • Relational records
  • Data matrix, e.g., numerical matrix, crosstabs
  • Document data: text documents, represented as term-frequency vectors
  • Transaction data
• Graph and network
  • World Wide Web
  • Social or information networks
  • Molecular structures
• Ordered
  • Video data: sequence of images
  • Temporal data: time-series
  • Sequential data: transaction sequences
  • Genetic sequence data
• Spatial, image and multimedia
  • Spatial data: maps
  • Image data: MRI scans
  • Video data: movies

Example term-frequency vectors (term columns, order lost in extraction: team, coach, play, ball, score, game, win, lost, timeout, season):
  Document 1: 3 0 5 0 2 6 0 2 0 2
  Document 2: 0 7 0 2 1 0 0 3 0 0
  Document 3: 0 1 0 0 1 2 2 0 3 0

Example transaction data:
  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk
Important Characteristics of Structured Data
• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Distribution
• Centrality and dispersion

Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples , examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects
• columns -> attributes.

Attributes and their types-By set of possible values.
Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature
of a data object.
– E.g., customer _ID, name, address
Types of attributes
– Nominal :categories, states, or “names of things”
• Examples: ID numbers, eye color, zip codes, Hair_color = {auburn, black, blond, brown, grey,
red, white}
– Ordinal :Values have a meaningful order (ranking) but magnitude between successive values is
not known.
• Examples: rankings (e.g., taste of soft drink on a scale from 1-10), grades{A,A-,B,B-,C,C-
,D,E}
– Interval-scaled:
• No true zero-point
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio-scaled
• Inherent zero-point
• Examples: temperature in Kelvin, length, weight, elapsed time(e.g., time to run a race)
Attribute Types
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)

Discrete vs. Continuous Attributes :
by the Number of Values
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a
finite number of digits
• Continuous attributes are typically represented as floating-point variables

Exercise

Classify the following attributes as binary, discrete, or continuous.


Also classify them as qualitative (nominal or ordinal) or quantitative
(interval or ratio).
• Time in terms of AM or PM: binary, qualitative, ordinal
• Angles as measured in degrees between 0° and 360°: continuous, quantitative, ratio
• Bronze, Silver, and Gold medals as awarded at the Olympics: discrete, qualitative, ordinal
• Colors of cars: discrete, qualitative, nominal

Exercise
Identify whether the attributes are categorical or numerical
1) Aircraft manufacturers
2) Enrolled in a university?
3) Number of students
4) Weight
5) Customer Satisfaction
6) Monthly Income (in Rs)
7) Passed Entrance Exam? (Yes/No)
8) GPA (Grade Point Average)



Solution
Identify whether the attributes are categorical or numerical
1) Aircraft manufacturers :Categorical(nominal)
2) Enrolled in a university? :Categorical(binary)
3) Number of students :Numerical(Discrete)
4) Weight :Numerical(Continuous)
5) Customer Satisfaction :Categorical(ordinal)
6) Monthly Income (in Rs) :Numerical (Continuous)
7) Passed Entrance Exam? (Yes/No): Categorical (Binary)
8) GPA (Grade Point Average) : Numerical (Continuous)



Basic Statistical Descriptions of Data

• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population), where n is the sample size and N is the population size:
  x̄ = (1/n) Σᵢ xᵢ      μ = (Σ x) / N
  • Weighted arithmetic mean: x̄ = (Σᵢ wᵢ xᵢ) / (Σᵢ wᵢ)
  • Trimmed mean: chopping extreme values
• Median:
  • Middle value if odd number of values, or average of the middle two values otherwise
  • Estimated by interpolation (for grouped data):
    median = L₁ + ((n/2 − (Σ freq)ₗ) / freq_median) × width,
    where L₁ is the lower boundary of the median interval
• Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula: mean − mode ≈ 3 × (mean − median)
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data

• Positively skewed: skewness > 0; symmetric: skewness = 0; negatively skewed: skewness < 0
(Figure omitted: the three distribution shapes)
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
• Outlier: usually, a value higher/lower than 1.5 x IQR
• Variance and standard deviation (sample: s, population: σ)
• Variance (algebraic, scalable computation):

  s² = (1/(n−1)) Σᵢ (xᵢ − x̄)² = (1/(n−1)) [Σᵢ xᵢ² − (1/n)(Σᵢ xᵢ)²]

  σ² = (1/N) Σᵢ (xᵢ − μ)² = (1/N) Σᵢ xᵢ² − μ²

• Standard deviation s (or σ) is the square root of variance s2 (or σ2)


Boxplot Analysis

• Five-number summary of a distribution


• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the box is
IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually

Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary

• Histogram: x-axis shows values, y-axis represents frequencies

• Quantile plot: each value xᵢ is paired with fᵢ indicating that approximately 100·fᵢ% of the data are ≤ xᵢ
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another

• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the
plane

Exercise

Suppose that the data for analysis includes the attribute age. The
age values for the data tuples are (in increasing order) 13, 15, 16,
16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70, 87
a) Calculate mean, and mode
b) Calculate the 5-number summary

Solution
Suppose that the data for analysis includes the attribute age. The
age values for the data tuples are (in increasing order) 13, 15, 16,
16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70, 87
a) Calculate mean, and mode
b) Calculate the 5-number summary
Mean = 32, Mode = 25 and 35 (bimodal), Median = (25 + 30)/2 = 27.5
Q1 = 20, Q3 = 35 Min = 13, Max = 87
5-number summary : (13,20,27.5,35,87)
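The solution can be verified in code. Note that quartile conventions differ: the slide reads Q1 and Q3 directly off the data, while the median-of-halves convention used below gives 20.5 and 35.5 rather than 20 and 35.

```python
from statistics import mean, median, multimode

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 87]

m = mean(ages)                 # 32
modes = multimode(ages)        # [25, 35] -> bimodal
med = median(ages)             # (25 + 30) / 2 = 27.5

half = len(ages) // 2
q1 = median(ages[:half])       # median of the lower half
q3 = median(ages[half:])       # median of the upper half
summary = (min(ages), q1, med, q3, max(ages))
```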
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

3.2 Data Similarity/Dissimilarity

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix

• Data matrix
  • n data points with p dimensions
  • Two modes

    [ x11 … x1f … x1p ]
    [  …   …   …   …  ]
    [ xi1 … xif … xip ]
    [  …   …   …   …  ]
    [ xn1 … xnf … xnp ]

• Dissimilarity matrix
  • n data points, but registers only the distance
  • A triangular matrix
  • Single mode

    [ 0                          ]
    [ d(2,1)  0                  ]
    [ d(3,1)  d(3,2)  0          ]
    [   :       :       :        ]
    [ d(n,1)  d(n,2)  …        0 ]

Proximity Measure for Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green
(generalization of a binary attribute)
• Method 1: Simple matching
• m: # of matches, p: total # of variables
  d(i, j) = (p − m) / p

• Method 2: Use a large number of binary attributes


• creating a new binary attribute for each of the M nominal states

Proximity Measure for Binary Attributes
• A contingency table for binary data (object i vs. object j): q = number of attributes where both objects are 1, r = where i is 1 and j is 0, s = where i is 0 and j is 1, t = where both are 0

• Distance measure for symmetric binary variables:
  d(i, j) = (r + s) / (q + r + s + t)

• Distance measure for asymmetric binary variables:
  d(i, j) = (r + s) / (q + r + s)

• Jaccard coefficient (similarity measure for asymmetric binary variables):
  sim_Jaccard(i, j) = q / (q + r + s)

• Note: the Jaccard coefficient is the same as "coherence"

Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Dissimilarity between Binary Variables
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

• Gender is a symmetric attribute


• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N 0
01
d ( jack , mary )   0.33
2 01
11
d ( jack , jim )   0.67
111
1 2
d ( jim , mary )   0.75
11 2
Standardizing Numeric Data
• Z-score: z = (x − μ) / σ
  • x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  • the distance between the raw score and the population mean in units of the standard deviation
  • negative when the raw score is below the mean, "+" when above
• An alternative way: calculate the mean absolute deviation
  s_f = (1/n)(|x1f − m_f| + |x2f − m_f| + … + |xnf − m_f|),
  where m_f = (1/n)(x1f + x2f + … + xnf)
  • standardized measure (z-score): z_if = (x_if − m_f) / s_f
• Using mean absolute deviation is more robust than using standard deviation
Normalisation-Exercise

Suppose that the minimum and maximum values for the attribute income are $12,000
and $98,000, respectively. The new range is [0.0,1.0]. Apply min-max normalization to
value of $73,600.

Solution:
v′ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716



Normalisation-Exercise
Suppose that the mean and standard deviation of the values for the attribute
income are $54,000 and $16,000, respectively. Apply z-score normalization
to the salary value of $73,600

v′ = (v − Ā) / σ_A

(73,600 − 54,000) / 16,000 = 1.225
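And the z-score version of the same exercise in code:

```python
def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

z = z_score(73600, 54000, 16000)   # 19600 / 16000 = 1.225
```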



Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point attribute1 attribute2
x1 1 2
x2 3 5
x3 2 0
x4 4 5

Dissimilarity Matrix
(with Euclidean Distance)
x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0

Distance on Numeric Data: Minkowski Distance

• Minkowski distance: a popular distance measure

  d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h)^(1/h),

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects,
and h is the order (the distance so defined is also called the L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
  • E.g., the Hamming distance: the number of bits that are different between two binary vectors

  d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|

• h = 2: (L2 norm) Euclidean distance

  d(i, j) = sqrt(|xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|²)

• h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  • This is the maximum difference between any component (attribute) of the vectors

Example: Minkowski Distance (Dissimilarity Matrices)

point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

Manhattan (L1)    x1     x2     x3     x4
x1                0
x2                5      0
x3                3      6      0
x4                6      1      7      0

Euclidean (L2)    x1     x2     x3     x4
x1                0
x2                3.61   0
x3                2.24   5.1    0
x4                4.24   1      5.39   0

Supremum (L∞)     x1     x2     x3     x4
x1                0
x2                3      0
x3                2      5      0
x4                3      1      5      0
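A small sketch reproducing one row of the matrices above for the pair x1 = (1, 2), x2 = (3, 5):

```python
def minkowski(p, q, h):
    """L-h norm distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

def supremum(p, q):
    """L-infinity (max) distance."""
    return max(abs(a - b) for a, b in zip(p, q))

x1, x2 = (1, 2), (3, 5)
l1 = minkowski(x1, x2, 1)    # |1-3| + |2-5| = 5
l2 = minkowski(x1, x2, 2)    # sqrt(4 + 9), about 3.61
linf = supremum(x1, x2)      # max(2, 3) = 3
```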
Ordinal Variables

• An ordinal variable can be discrete or continuous


• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace x_if by its rank r_if ∈ {1, …, M_f}
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

  z_if = (r_if − 1) / (M_f − 1)

• compute the dissimilarity using methods for interval-scaled variables
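The rank-to-[0, 1] mapping above can be sketched in a few lines; the three-state ordinal scale below is an invented example with M_f = 3.

```python
# Hypothetical ordered scale with M_f = 3 states
scale = ["fair", "good", "excellent"]
rank = {state: i + 1 for i, state in enumerate(scale)}  # r_if in {1..M_f}

# z_if = (r_if - 1) / (M_f - 1) maps the ranks onto [0, 1]
z = {state: (rank[state] - 1) / (len(scale) - 1) for state in scale}
```

The resulting z-values (0.0, 0.5, 1.0) can then be fed into any interval-scaled distance measure.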


Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise

Calculate the dissimilarity matrix and similarity matrix for the ordinal attributes



Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
• One may use a weighted formula to combine their effects:
d(i, j) = Σ_{f=1}^{p} δij(f) dij(f) / Σ_{f=1}^{p} δij(f)

• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute the rank rif and zif = (rif − 1) / (Mf − 1)
• Treat zif as interval-scaled
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as keywords) or phrase in the document.

• Other vector objects: gene features in micro-arrays, …


• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d
Example: Cosine Similarity

• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

where · indicates the vector dot product and ||d|| is the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
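The worked example can be checked with a short sketch:

```python
# Cosine similarity between two term-frequency vectors:
# cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)
import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))  # 0.94
```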

Exercise

For the following vectors x and y, calculate the indicated similarity or distance
measures:
x =(1,1,1,1), y=(2,2,2,2) cosine, Euclidian
Solution: cos(x, y) = 1, Euclidean(x, y) = 2



Exercise

For the following vectors x and y, calculate the indicated similarity or distance
measures:
x = (0,1,0,1), y = (1,0,1,0): Cosine, Correlation, Euclidean, Jaccard
Solution: cos(x, y) = 0, Corr(x, y) = −1, Euclidean(x, y) = 2, Jaccard(x, y) = 0



BITS Pilani
Pilani|Dubai|Goa|Hyderabad

3.3 Data Visualization

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Visualization
Data visualization is the art and practice of gathering, analyzing, and graphically representing
empirical information.

• Why data visualization?


• Gain insight into an information space by mapping data onto graphical primitives
• Provide qualitative overview of large data sets
• Search for patterns, trends, structure, irregularities, relationships among data
• Help find interesting regions and suitable parameters for further quantitative analysis
• Provide a visual proof of computer representations derived

Visualization

Another Definition: Visualization is the conversion of data into a visual


or tabular format so that the characteristics of the data and the
relationships among data items or attributes can be analyzed or
reported.

 Visualization of data is one of the most powerful and appealing


techniques for data exploration.
– Humans have a well developed ability to analyze large amounts of information
that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns



Visualization techniques
Based on
• Visualization of a small number of attributes,
• Stem and Leaf Plots
• Histograms
• Box Plots
• Pie charts
• Scatter Plots etc
• Visualization of data with spatial and/or temporal attributes,
• Contour Plots
• Surface Plots
• Visualization of data with many attributes
• Data Matrix
• Chernoff Faces
• Stick Figures

Iris Sample Data Set

 Many of the exploratory data techniques are illustrated


with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician Douglas Fisher
– Three flower types (classes):
 Setosa
 Virginica
 Versicolour
– Four (non-class) attributes:
 Sepal width and length
 Petal width and length

Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.
Visualization Techniques: Histograms

 Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of
objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
 Example: Petal Width (10 and 20 bins, respectively)



Visualization Techniques: Box Plots

 Box Plots
– Invented by J. Tukey
– Another way of displaying the distribution of data
– The figure shows the basic parts of a box plot: the box spans the 25th to 75th percentiles with a line at the 50th percentile; the whiskers extend to the 10th and 90th percentiles, and points beyond them are plotted as outliers



Example of Box Plots

 Box plots can be used to compare attributes



Scatter Plots

A scatter plot displays 2-D data points


using Cartesian coordinates.
A third dimension can be added using
different colors or shapes to represent
different data points
Through this visualization, in the
adjacent figure, we can see that points
of types "+" and "×" tend to be
colocated

Scatter Plot Array of Iris Attributes



Example : Stem and Leaf Plot
The set of integers shown below is the sepal length in centimeters
(multiplied by 10 to make the values integers) taken from the Iris data set.
The values have also been sorted.
43 44 44 44 45 46 46 46 46 47 47 48 48 48 48 48 49 49 49 49 49 49 50
50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 51 52 52 52 52 53
54 54 54 54 54 54 55 55 55 55 55 55 55 56 56 56 56 56 56 57 57 57 57
57 57 57 57 58 58 58 58 58 58 58 59 59 59 60 60 60 60 60 60 61 61 61
61 61 61 62 62 62 62 63 63 63 63 63 63 63 63 63 64 64 64 64 64 64 64
65 65 65 65 65 66 66 67 67 67 67 67 67 67 67 68 68 68 69 69 69 69 70
71 72 72 72 73 74 76 77 77 77 77 79

Source: Figure 3.4. Sepal length data from the Iris data set [Tan P. N., Steinbach M. & Kumar V., "Introduction to Data Mining", Pearson Education]
Stem and Leaf Plot
4 : 34444566667788888999999
5 : 0000000000111111111222234444445555555666666777777778888888999
6 : 000000111111222233333333344444445555566777777778889999
7 : 0122234677779

4 : 3444
4 : 566667788888999999
5 : 000000000011111111122223444444
5 : 5555555666666777777778888888999
6 : 00000011111122223333333334444444
6 : 5555566777777778889999
7 : 0122234
7 : 677779
Source: Figure 3.5. Stem and leaf plot for the sepal length from the Iris data set.
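A rough sketch of how such a one-digit-stem display can be built, shown here on the first few sepal-length values:

```python
# Build a stem-and-leaf display: the stem is the tens digit, the leaves are
# the sorted units digits of all values sharing that stem.
from collections import defaultdict

def stem_and_leaf(values):
    stems = defaultdict(list)
    for v in values:
        stems[v // 10].append(v % 10)
    return {s: ''.join(str(l) for l in sorted(leaves)) for s, leaves in stems.items()}

data = [43, 44, 44, 44, 45, 46, 46, 46, 46, 47, 47, 48, 48, 48, 48, 48, 49, 49]
plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(f"{stem} : {plot[stem]}")   # 4 : 344456666778888899
```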
Example: Pie Chart
A pie chart that shows the distribution of Iris species in the Iris data set. In this case,
all three flower types have the same frequency.

• Used less frequently in technical publications because the size of relative areas can be hard to
judge.

Scatterplot Matrices

The scatter-plot matrix is an extension to the scatter plot.

For k-dimensional data a minimum of (k² − k)/2 2-D scatter plots will be required.

There can be a maximum of k² 2-D plots.

In the adjoining figure, there are k² plots. Out of these, k are X-X plots, and all X-Y plots (where X, Y are distinct dimensions) are given in 2 orientations (X vs. Y and Y vs. X).
Example
The correlation matrix for the Iris data set.

The rows and columns are organized so that all the


flowers of a particular species are together

Other Visualization Techniques

 Star Plots
– Similar approach to parallel coordinates, but axes
radiate from a central point
– The line connecting the values of an object is a
polygon
 Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a
characteristic of a face
– The values of each attribute determine the appearance
of the corresponding facial characteristic
– Each object becomes a separate face
– Relies on human’s ability to distinguish faces
Chernoff Faces

• A way to display variables on a two-


dimensional surface, e.g., let x be
eyebrow slant, y be eye size, z be nose
length, etc.

• The figure shows faces produced


using 10 characteristics--head
eccentricity, eye size, eye spacing, eye
eccentricity, pupil size, eyebrow slant,
nose size, mouth shape, mouth size,
and mouth opening): Each assigned
one of 10 possible values.

Stick Figure
A census data figure showing age,
income, gender, education
A 5-piece stick figure (1 body and
4 limbs w. different angle/length)
Age, income are indicated by
position of the figure.
Gender, education are indicated
by angle/length.
Visualization can show a texture
pattern
Exercise
You are analysing the behaviour of students across 10 variables like
attendance, assignment scores, participation, etc.
Which visualization technique would you choose: Bar chart, Chernoff
faces, or Pie chart, Scatter Plot? Justify your answer.

Visualizing Complex Data and Relations

Most visualization techniques were


mainly for numeric data. Recently,
more and more non-numeric data,
such as text and social networks,
have become available.
Many people on the Web tag various
objects such as pictures, blog entries,
and product reviews.
A tag cloud is a visualization of
statistics of user-generated tags.
Often, in a tag cloud, tags are listed
alphabetically or in a user-preferred
order.
The importance of a tag is indicated
by font size or color
Visualization do’s and don’ts.
• ACCENT Principles The following are the ACCENT principles for effective graphical display put forth by D. A.
Burn (as adapted by Michael Friendly):
• Apprehension Ability to correctly perceive relations among variables. Does the graph maximize
apprehension of the relations among variables?
• Clarity Ability to visually distinguish all the elements of a graph. Are the most important elements or
relations visually most prominent?
• Consistency Ability to interpret a graph based on similarity to previous graphs. Are the elements, symbol
shapes, and colors consistent with their use in previous graphs?
• Efficiency Ability to portray a possibly complex relation in as simple a way as possible. Are the elements of
the graph economically used? Is the graph easy to interpret?
• Necessity The need for the graph, and the graphical elements. Is the graph a more useful way to represent
the data than alternatives (table, text)? Are all the graph elements necessary to convey the relations?
• Truthfulness Ability to determine the true value represented by any graphical element by its magnitude
relative to the implicit or explicit scale.
• Are the graph elements accurately positioned and scaled?
Tufte’s Guidelines Edward R. Tufte has also enumerated the following
principles for graphical excellence:
 Graphical excellence is the well-designed presentation of interesting data—a
matter of substance, of statistics, and of design.
 Graphical excellence consists of complex ideas communicated with clarity,
precision, and efficiency.
 Graphical excellence is that which gives to the viewer the greatest number of
ideas in the shortest time with the least ink in the smallest space.
 Graphical excellence is nearly always multivariate.
 And graphical excellence requires telling the truth about the data

References and Textbooks

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers



Data Mining
BITS Pilani M4: Classification
Pilani|Dubai|Goa|Hyderabad
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Classification Concepts

Classification: Definition

Given a collection of records (training set )


Each record is characterized by a tuple (x, y), where x is the attribute set and y is the
class label
x: attribute, predictor, independent variable, input
y: class, response, dependent variable, output

Task:
Learn a model that maps each attribute set x into one of the predefined class labels y



Classification
• Classification involves dividing up objects so that each is assigned to one of a number of mutually
exhaustive and exclusive categories known as classes
• Many practical decision-making tasks can be formulated as classification problems
‒ customers who are likely to buy or not buy a particular product in a supermarket
‒ people who are at high, medium or low risk of acquiring a certain illness
‒ student projects worthy of a distinction, merit, pass or fail grade
‒ objects on a radar display which correspond to vehicles, people, buildings or trees
‒ people who closely resemble, slightly resemble or do not resemble someone seen committing a crime
‒ houses that are likely to rise in value, fall in value or have an unchanged value in 12 months' time
‒ people who are at high, medium or low risk of a car accident in the next 12 months
‒ people who are likely to vote for each of a number of political parties (or none)
‒ the likelihood of rain the next day for a weather forecast (very likely, likely, unlikely, very unlikely).

Classification vs. Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data

• Prediction
• models continuous-valued functions, i.e., predicts unknown or
missing values

Supervised vs. Unsupervised Learning

• Supervised learning (classification)


• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of training data is unknown
• Given a set of measurements, observations, etc. with the aim of establishing
the existence of classes or clusters in the data

Classification — A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class
label attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the
model
• Test set is independent of training set, otherwise over-fitting will occur
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not
known

General Approach for Building
Classification Model

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a learning algorithm learns a model from the training set.

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the test set to predict the unknown class labels.

Lazy vs. Eager Learning
• Lazy vs. eager learning
• Lazy learning (e.g., instance-based learning): Simply stores training data (or only
minor processing) and waits until it is given a test tuple
• Eager learning (the methods mentioned in next slide): Given a set of training set,
constructs a classification model before receiving new (e.g., test) data to classify
• Lazy: less time in training but more time in predicting
• Accuracy
• Lazy method effectively uses a richer hypothesis space since it uses many local linear
functions to form its implicit global approximation to the target function
• Eager: must commit to a single hypothesis that covers the entire instance space

Instance-Based Methods
• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
• Locally weighted regression
• Constructs local approximation
• Case-based reasoning
• Uses symbolic representations and knowledge-based inference

Classification Techniques

• Decision Tree based Methods

• Rule-based Methods

• Neural Networks
- computational networks that simulate the decision process in neurons (networks of nerve cells)

• Naïve Bayes and Bayesian Belief Networks


- uses the probability theory to find the most likely of the possible classifications

• Support Vector Machines


- fits a boundary to a region of points that are all alike; uses the boundary to classify a new point

Decision Tree Representation

• Flowchart-like tree structure.


• The topmost node in a tree is the root
node.
• Each internal node denotes a test on an
attribute.
• Each branch represents an outcome of
the test.
• Each leaf node holds a class label.

• Given a tuple for which the associated class label is unknown, the attribute values of the tuple are
tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class prediction for that tuple.

Decision Tree –Vertebrate Data Set

Decision Tree Example
Loan Borrower Classification Problem

Start from the root of the tree.

Test Data:
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Tree: Home Owner: Yes → NO; No → MarSt: Married → NO; Single, Divorced → Income: < 80K → NO, > 80K → YES
The test record (Home Owner = No, Married) reaches the leaf NO, so Defaulted Borrower = No.
Decision Tree Example
Loan Borrower Classification Problem

Training Data:
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes  Single    125K  No
2   No   Married   100K  No
3   No   Single    70K   No
4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes

Model: Decision Tree (splitting attributes):
Home Owner: Yes → NO; No → MarSt: Married → NO; Single, Divorced → Income: < 80K → NO, > 80K → YES


Decision Tree: More Than One Tree

An alternative tree for the same training data:
MarSt: Married → NO; Single, Divorced → Home Owner: Yes → NO; No → Income: < 80K → NO, > 80K → YES

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes  Single    125K  No
2   No   Married   100K  No
3   No   Single    70K   No
4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes

There could be more than one tree that fits the same data!


How to Build a Decision Tree

Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ,SPRINT

General Structure of Hunt's Algorithm

Let Dt be the set of training records that are associated with a node t

General Procedure:
• If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
• Recursively apply the procedure to each subset.

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes  Single    125K  No
2   No   Married   100K  No
3   No   Single    70K   No
4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes
Decision Tree Construction: Hunt's Algorithm

Applied to the loan borrower training data, the tree grows in stages:
(a) A single leaf node labeled with the majority class: Defaulted = No.
(b) Split on Home Owner: Yes → Defaulted = No; No → Defaulted = No (still impure).
(c) The Home Owner = No branch is split on Marital Status: Married → Defaulted = No; Single, Divorced → Defaulted = Yes (still impure).
(d) The Single, Divorced branch is split on Annual Income: < 80K → Defaulted = No; >= 80K → Defaulted = Yes.
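The procedure can be sketched as a short recursive function. This is a simplified illustration, not the textbook's code: the tree is represented as nested dicts, and it splits on the attributes in the fixed order used on the slide (Home Owner, then Marital Status, then Income discretized at 80K) instead of choosing the best test greedily.

```python
# Simplified Hunt's procedure on the loan borrower data: a node is a leaf
# (class label) if its records are pure or no attributes remain; otherwise
# it is split on the next attribute and the procedure recurses.
from collections import Counter

records = [  # (Home Owner, Marital Status, Income>=80K, Defaulted)
    ('Yes', 'Single',   True,  'No'),  ('No', 'Married',  True,  'No'),
    ('No',  'Single',   False, 'No'),  ('Yes', 'Married', True,  'No'),
    ('No',  'Divorced', True,  'Yes'), ('No', 'Married',  False, 'No'),
    ('Yes', 'Divorced', True,  'No'),  ('No', 'Single',   True,  'Yes'),
    ('No',  'Married',  False, 'No'),  ('No', 'Single',   True,  'Yes'),
]
ATTRS = ['Home Owner', 'Marital Status', 'Income>=80K']

def hunt(data, attrs):
    labels = [r[-1] for r in data]
    if len(set(labels)) == 1 or not attrs:      # pure node, or no tests left
        return Counter(labels).most_common(1)[0][0]
    a = attrs[0]                                # fixed order (greedy choice omitted)
    i = ATTRS.index(a)
    return {(a, v): hunt([r for r in data if r[i] == v], attrs[1:])
            for v in set(r[i] for r in data)}

def predict(tree, record):
    while isinstance(tree, dict):
        (a, _), _ = next(iter(tree.items()))    # all keys of a node share one attribute
        tree = tree[(a, record[ATTRS.index(a)])]
    return tree

tree = hunt(records, ATTRS)
print(predict(tree, ('No', 'Single', True)))    # 'Yes'
```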
Design Issues of Decision Tree Induction

• Splitting Criterion
Method for specifying test condition
Depends on attribute types-Binary, Nominal, Ordinal, Continuous
Measure for evaluating the goodness of a test condition

• Stopping Criterion
Stop splitting if all the records belong to the same class or have identical
attribute values
Early termination

Test Condition for Nominal Attributes

• Multi-way split:
• Use as many partitions as distinct values, e.g., Marital Status → Single | Divorced | Married
• Binary split:
• Divides values into two subsets
• Preserve order property among attribute values
• Marital Status binary splits: {Married} vs. {Single, Divorced}, OR {Single} vs. {Married, Divorced}, OR {Single, Married} vs. {Divorced}
• Shirt Size: the grouping {Small, Large} vs. {Medium, Extra Large} violates the order property

Test Condition for Continuous Attributes

Discretization: to form an ordinal categorical attribute
Binary Decision: consider all possible splits and find the best cut; can be more compute intensive
How to Determine the Best Split
Before Splitting:
10 records of class 0
10 records of class 1

Which test condition is the best?

How to Determine the Best Split
Impurity is a measure of homogeneity of the labels at the node at hand
Greedy approach:
Nodes with purer class distribution are preferred
Need a measure of node impurity:

C0: 5        C0: 9
C1: 5        C1: 1

High degree of impurity. Relating it to information theory: low information content, e.g., "It rained heavily in Shillong yesterday."
Low degree of impurity. Relating it to information theory: high information content, e.g., "There was a heavy rainfall in Rajasthan last night."

Selecting an Attribute Test Condition:
Impurity Measure for a Single Node

Gini Index: Gini(t) = 1 − Σ_{i=0}^{c−1} [pi(t)]²

Entropy: Entropy(t) = − Σ_{i=0}^{c−1} pi(t) log₂ pi(t)

where pi(t) is the frequency of class i at node t, and c is the total number of classes

Measures seek to determine which variable would split the data so that the resulting child nodes are most homogeneous, or pure.

In other words, splitting on an attribute should reduce impurity.

Selecting an Attribute Test Condition:
Impurity Measure for a Single Node

p(C1) = 0/6 = 0 and p(C2) = 6/6 = 1
Gini = 1 − (0)² − (1)² = 0
Entropy = −(0) log₂(0) − (1) log₂(1) = 0
(low impurity: low entropy and Gini index preferred)

p(C1) = 1/6 and p(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278
Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.650

p(C1) = 3/6 and p(C2) = 3/6
Gini = 1 − (3/6)² − (3/6)² = 0.5
Entropy = −(3/6) log₂(3/6) − (3/6) log₂(3/6) = 1

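Both single-node measures can be computed directly; the sketch below reproduces the three cases above.

```python
# Gini index and entropy of a node given its per-class record counts.
import math

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(gini([0, 6]), 3), round(entropy([0, 6]), 3))   # 0.0 0.0  (pure node)
print(round(gini([1, 5]), 3), round(entropy([1, 5]), 3))   # 0.278 0.65
print(round(gini([3, 3]), 3), round(entropy([3, 3]), 3))   # 0.5 1.0  (maximally impure)
```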
Selecting an Attribute Test Condition:
Collective Impurity of Child Nodes
When a node 𝑝 is split into 𝑘 partitions (children)

GINI_split = Σ_{i=1}^{k} (ni / n) GINI(i)

Entropy_split = Σ_{i=1}^{k} (ni / n) Entropy(i)

where ni = number of records at child i, and n = number of records at parent node p

Selecting an Attribute Test Condition:
Finding the Best Split

• Compute impurity measure of the parent before splitting (P)


• Compute the collective impurity measure of children after splitting(M)
• Choose the attribute test condition that produces the highest gain
Gain = P – M
• or equivalently, lowest impurity measure after splitting (M)
• Information gain is the expected reduction in entropy caused by partitioning
the examples on an attribute.
• Maximizing the gain == minimizing the weighted average impurity measure of
children nodes
Gain_split = Entropy(p) − Σ_{i=1}^{k} (ni / n) Entropy(i)

Choose the split that achieves the most reduction (maximizes GAIN). Used in the ID3 and C4.5 decision tree algorithms.
Decision Tree - Entropy
Construct a decision tree for the given dataset using Information gain.
Outlook Temp Humidity Windy Play Golf
Rainy Hot High False No
Rainy Hot High True No
Overcast Hot High False Yes
Sunny Mild High False Yes
Sunny Cool Normal False Yes
Sunny Cool Normal True No
Overcast Cool Normal True Yes
Rainy Mild High False No
Rainy Cool Normal False Yes
Sunny Mild Normal False Yes
Rainy Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Sunny Mild High True No
Decision Tree - Entropy
(Steps 1 and 2 are shown as figures: compute the entropy of the full dataset, then the information gain of each candidate attribute.)
Decision Tree - Entropy
Step 3:
Choose attribute with the largest information gain as the decision node, divide the
dataset by its branches and repeat the same process on every branch.
Outlook has the max gain. So choose Outlook as the root decision point.
A branch with entropy of 0 is a leaf node.
A branch with entropy more than 0 needs further splitting.

Decision Tree - Entropy
Step 4: Repeat
Step 4a: Consider only rows which have Outlook = Sunny

Windy has the max gain. Choose Windy as the next
decision point.

Decision Tree - Entropy
Step 4b: Consider only rows which have Outlook = Rainy.
Step 5: Draw the final decision tree.

Humidity has the max gain. Choose Humidity


as the next decision point.

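The attribute choice in Step 3 can be verified with a short sketch; the dataset is transcribed from the table above.

```python
# Information gain of each attribute on the Play Golf dataset:
# Gain(A) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v).
import math
from collections import Counter

rows = [  # (Outlook, Temp, Humidity, Windy, PlayGolf)
    ('Rainy', 'Hot', 'High', False, 'No'),     ('Rainy', 'Hot', 'High', True, 'No'),
    ('Overcast', 'Hot', 'High', False, 'Yes'), ('Sunny', 'Mild', 'High', False, 'Yes'),
    ('Sunny', 'Cool', 'Normal', False, 'Yes'), ('Sunny', 'Cool', 'Normal', True, 'No'),
    ('Overcast', 'Cool', 'Normal', True, 'Yes'), ('Rainy', 'Mild', 'High', False, 'No'),
    ('Rainy', 'Cool', 'Normal', False, 'Yes'), ('Sunny', 'Mild', 'Normal', False, 'Yes'),
    ('Rainy', 'Mild', 'Normal', True, 'Yes'),  ('Overcast', 'Mild', 'High', True, 'Yes'),
    ('Overcast', 'Hot', 'Normal', False, 'Yes'), ('Sunny', 'Mild', 'High', True, 'No'),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_idx):
    labels = [r[-1] for r in rows]
    weighted = 0.0
    for v in set(r[attr_idx] for r in rows):
        subset = [r[-1] for r in rows if r[attr_idx] == v]
        weighted += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - weighted

names = ['Outlook', 'Temp', 'Humidity', 'Windy']
gains = {name: round(gain(rows, i), 3) for i, name in enumerate(names)}
print(gains)  # Outlook has the largest gain (0.247), so it is the root node
```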
Splitting Criteria based on Classification
Error
• Classification error at a node t: Error(t) = 1 − max_i P(i | t)

• Measures the misclassification error made by a node.
• Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information

Examples:
C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1; Error = 1 − max(0, 1) = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 − max(1/6, 5/6) = 1/6
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 − max(2/6, 4/6) = 1/3

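A minimal sketch of the misclassification-error measure, reproducing the three examples:

```python
# Classification error of a node: Error(t) = 1 - max_i P(i|t),
# computed from the per-class record counts at the node.
def classification_error(counts):
    n = sum(counts)
    return 1 - max(c / n for c in counts)

print(classification_error([0, 6]))            # 0.0
print(round(classification_error([1, 5]), 3))  # 0.167
print(round(classification_error([2, 4]), 3))  # 0.333
```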
Comparison among Splitting Criteria
For a 2-class problem:

Dealing with continuous-valued attributes
• Given a continuous-valued attribute A,
dynamically define new discrete valued
attribute Ac that partition the continuous
attribute value into a discrete set of intervals
• Ac = True if A < c, False otherwise
• How to select the best value for the
threshold c?

Dealing with continuous-valued attributes
• How to select the best value for the threshold c?
• Example. Temperature in the example
Sort the examples according to Temperature
Temperature 60 65 67 68 70 71 72 72 75 80 81 83 85 85
PlayTennis Y N Y Y Y N Y Y N N Y Y N Y

Determine candidate thresholds by averaging


consecutive values where there is a change in
classification:

Dealing with continuous-valued attributes
• How to select the best value for the threshold c?
60 65 67 68 70 71 72 72 75 80 81 83 85 85
Y  N  Y  Y  Y  N  Y  Y  N  N  Y  Y  N  Y
The candidate thresholds identified are: 62.5, 66, 70.5, 71.5, 73.5, 80.5, 84

Evaluate Candidate Thresholds


• Threshold = 62.5
• Group 1 (Temperature ≤ 62.5): [60, Yes]
• Group 2 (Temperature > 62.5): [65, No; 67, Yes; 68, Yes; 70, Yes; 71, No; 72, Yes; 75,
No; 80, No; 81, Yes; 83, Yes; 85, No]
• Calculate Entropy for Group 1 and Group 2 and their weighted Entropy, then find the
information gain.
Calculate Information gain
We will calculate the Information gain for each of these candidate thresholds .
Pick the one with the maximum gain.
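The threshold-candidate step can be sketched as follows; midpoints are taken between consecutive distinct values whose class labels differ:

```python
# Candidate split thresholds for a continuous attribute: midpoints between
# consecutive sorted values where the class label changes (equal adjacent
# values are skipped, since the midpoint would coincide with the value).
temps  = [60, 65, 67, 68, 70, 71, 72, 72, 75, 80, 81, 83, 85, 85]
labels = ['Y', 'N', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'N', 'Y']

def candidate_thresholds(values, labels):
    cands = []
    for i in range(len(values) - 1):
        if labels[i] != labels[i + 1] and values[i] != values[i + 1]:
            cands.append((values[i] + values[i + 1]) / 2)
    return sorted(set(cands))

print(candidate_thresholds(temps, labels))
# [62.5, 66.0, 70.5, 71.5, 73.5, 80.5, 84.0]
```

Each candidate is then evaluated by the information gain of the resulting binary split, and the best one is kept.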

Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

• Early termination

Decision Tree Based Classification

• Advantages:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many simple data sets



Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification

Underfitting vs. Overfitting


• Underfitting results in decision trees that are too simple to solve the problem. They may offer superior interpretability.
• Overfitting results in decision trees that are more complex than necessary
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways for estimating errors



Decision Tree: Overfitting

Training errors (apparent errors)--Errors committed on the training set

Test errors -- Errors committed on the test set

Generalization errors -- Expected error of a model over random selection of


records from same distribution

Decision Tree: Overfitting

Two class problem:


+ : 5200 instances
• 5000 instances generated from
a Gaussian centered at (10,10)
• 200 noisy instances added
o : 5200 instances
• Generated from a uniform
distribution

Decision Tree: Overfitting

Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large
Decision Tree: Handling Overfitting
Pre-Pruning (Early Stopping Rule)
• Stop the algorithm before it becomes a fully-grown tree
• Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
More restrictive conditions:
• Stop if number of instances is less than some user-specified threshold
• Stop if expanding the current node does not improve impurity measures (e.g., Gini
or information gain).
• Stop if estimated generalization error falls below certain threshold
Decision Tree: Handling Overfitting

Post Pruning :Idea


• Post construction, scan the tree bottom-up
• At every decision node
• Retain the attribute node & evaluate it against the prune set (validation set)
• Remove the attribute node & reevaluate it with the same prune set
• If there is a reduction in error, prune the node else retain the node
• Repeat this in other branches of the tree
Classification Technique: Rule-Based Classifier
• Classify records by using a collection of “if…then…” rules

• Rule: (Condition) → y, where
  • Condition is a conjunction of attribute tests
  • y is the class label
  • LHS: rule antecedent or condition
  • RHS: rule consequent
• Examples of classification rules:
  • (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  • (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

Data Mining
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
• A rule r covers an instance x if the attributes of the instance satisfy the
condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

The rule R1 covers a hawk => Bird


The rule R3 covers the grizzly bear => Mammal

Rule Coverage and Accuracy

• Coverage of a rule:
  • Fraction of records that satisfy the antecedent of a rule
• Accuracy of a rule:
  • Fraction of records that satisfy both the antecedent and consequent of a rule

Training data (Tid, Refund, Marital Status, Taxable Income, Class):
1  Yes  Single    125K  No
2  No   Married   100K  No
3  No   Single    70K   No
4  Yes  Married   120K  No
5  No   Divorced  95K   Yes
6  No   Married   60K   No
7  Yes  Divorced  220K  No
8  No   Single    85K   Yes
9  No   Married   75K   No
10 No   Single    90K   Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
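The two numbers above can be verified with a few lines of Python (a sketch; the records are the slide's table, with income in thousands):

```python
# Training records from the slide: (Tid, Refund, Marital Status, Taxable Income, Class)
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Rule: (Status=Single) -> No
covered = [r for r in records if r[2] == "Single"]   # records satisfying the antecedent
correct = [r for r in covered if r[4] == "No"]       # covered records whose class matches

coverage = len(covered) / len(records)   # fraction of all records covered
accuracy = len(correct) / len(covered)   # fraction of covered records classified correctly

print(coverage, accuracy)  # 0.4 and 0.5
```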
How does a Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules

Characteristics of Rule-Based Classifiers

• Mutually exclusive rules


• Classifier contains mutually exclusive rules if the rules are
independent of each other
• Every record is covered by at most one rule

• Exhaustive rules
• Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values
• Each record is covered by at least one rule

From Decision Trees To Rules

Classification Rules (one per root-to-leaf path of the decision tree on Refund, Marital Status, and Taxable Income):

(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree

Rules Can Be Simplified
(Figure: the same decision tree on Refund, Marital Status, and Taxable Income, shown alongside the ten-record training table.)

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Effect of Rule Simplification

• Rules are no longer mutually exclusive


• A record may trigger more than one rule
• Solution?
• Ordered rule set
• Unordered rule set – use voting schemes

• Rules are no longer exhaustive


• A record may not trigger any rules
• Solution?
• Use a default class

Ordered Rule Set

• Rules are rank ordered according to their priority


• An ordered rule set is known as a decision list
• When a test record is presented to the classifier
• It is assigned to the class label of the highest ranked rule it has triggered
• If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
Rule Ordering Schemes

• Rule-based ordering
• Individual rules are ranked based on their quality
• Class-based ordering
• Rules that belong to the same class appear together
Rule-based Ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Class-based Ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

Building Classification Rules

• Direct Method:
• Extract rules directly from data
• e.g.: RIPPER, CN2, Holte’s 1R

• Indirect Method:
• Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
• e.g: C4.5rules

Direct Method: Sequential Covering

1. Start from an empty rule


2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion is met

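A minimal sketch of this loop in Python, on a toy dataset, with a deliberately naive Learn-One-Rule that greedily picks the single best attribute test (real implementations such as RIPPER or CN2 grow full conjunctions and prune them):

```python
def learn_one_rule(records, target):
    """Greedily pick the single (attribute index, value) test with the best accuracy."""
    best = None
    for i in range(len(records[0][0])):               # each attribute position
        for v in {r[0][i] for r in records}:          # each value seen for it
            covered = [r for r in records if r[0][i] == v]
            acc = sum(1 for r in covered if r[1] == target) / len(covered)
            if best is None or acc > best[2]:
                best = ((i, v), covered, acc)
    return best

def sequential_covering(records, target):
    rules = []
    remaining = list(records)                          # 1. start from an empty rule set
    while any(r[1] == target for r in remaining):
        test, covered, acc = learn_one_rule(remaining, target)   # 2. grow a rule
        if acc < 1.0:                                  # stop if no clean rule remains
            break
        rules.append(test)
        remaining = [r for r in remaining if r not in covered]   # 3. remove covered records
    return rules                                       # 4. repeat until stopping criterion

# Toy data: ((Give Birth, Can Fly), class)
data = [(("no", "yes"), "bird"), (("no", "yes"), "bird"),
        (("yes", "no"), "mammal"), (("yes", "yes"), "mammal")]
rules = sequential_covering(data, "bird")
print(rules)  # one rule: attribute 0 (Give Birth) == "no"
```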
Indirect Methods

(Figure: decision tree that splits on P, then on Q or R, then on Q again.) The equivalent rule set:

r1: (P=No, Q=No) ==> -
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> -
r5: (P=Yes, R=Yes, Q=Yes) ==> +

Bayes Classifier

• A probabilistic framework for solving classification problems


• Conditional Probability:

  P(C | A) = P(A, C) / P(A)
  P(A | C) = P(A, C) / P(C)

• Bayes theorem:

  P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem

• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20

• If a patient has stiff neck, what’s the probability he/she has


meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002

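The arithmetic checks out:

```python
# Bayes theorem with the slide's numbers
p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50000          # prior P(meningitis)
p_s = 1 / 20             # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # 0.0002
```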
Bayesian Classifiers

• Consider each attribute and class label as random variables

• Given a record with attributes (A1, A2,…,An)


• Goal is to predict class C
• Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An )

• Can we estimate P(C| A1, A2,…,An ) directly from data?

Bayesian Classifiers

• Approach:
• compute the posterior probability P(C | A1, A2, …, An) for all values of C using the
Bayes theorem
  P(C | A1 A2 … An) = P(A1 A2 … An | C) P(C) / P(A1 A2 … An)

• Choose value of C that maximizes


P(C | A1, A2, …, An)

• Equivalent to choosing value of C that maximizes


P(A1, A2, …, An|C) P(C)

• How to estimate P(A1, A2, …, An | C )?

Naive Bayes Classifier

• Assume independence among attributes Ai when class is given:


• P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)

• Can estimate P(Ai| Cj) for all Ai and Cj.

• New point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal.

How to Estimate Probabilities from Data?

Training data (Tid, Refund, Marital Status, Taxable Income, Evade):
1  Yes  Single    125K  No
2  No   Married   100K  No
3  No   Single    70K   No
4  Yes  Married   120K  No
5  No   Divorced  95K   Yes
6  No   Married   60K   No
7  Yes  Divorced  220K  No
8  No   Single    85K   Yes
9  No   Married   75K   No
10 No   Single    90K   Yes

• Class prior: P(C) = Nc/N
  • e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nck
  • where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
• Examples:
  P(Status=Married | No) = 4/7
  P(Refund=Yes | Yes) = 0

How to Estimate Probabilities from
Data?
• For continuous attributes:
• Discretize the range into bins
• one ordinal attribute per bin
• Too many bins – training records are too few to provide reliable probability for each interval
• Too few bins – some intervals may aggregate from different classes & we may miss the correct decision
boundary
• Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
• Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, can use it to estimate the conditional probability P(Ai|Ck)

How to Estimate Probabilities from Data?

(Using the same ten-record training set as before.)

• Normal distribution:

  P(Ai | cj) = (1 / √(2π σij²)) · exp( −(Ai − μij)² / (2 σij²) )

• One for each (Ai, ci) pair

• For (Income, Class=No):
  • sample mean = 110
  • sample variance = 3179.16

  P(Income=120 | No) = (1 / (√(2π) × 56.38)) · exp( −(120 − 110)² / (2 × 3179) ) = 0.0069

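The density value can be reproduced by plugging the slide's estimates (mean 110, variance 3179.16) into the normal density:

```python
import math

def gaussian(x, mean, var):
    """Normal density used to estimate P(Ai | class) for a continuous attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

p = gaussian(120, 110, 3179.16)   # P(Income=120 | No)
print(p)  # about 0.0069-0.0070
```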
Example of Naïve Bayes Classifier
Given a test record: X = (Refund=No, Married, Income=120K)

P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
               = 4/7 × 4/7 × 0.0069 = 0.0023

P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
                = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X|No) P(No) > P(X|Yes) P(Yes), therefore P(No|X) > P(Yes|X)
=> Class = No

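The same decision, recomputed (0.0069 is the Gaussian density for Income=120K under Class=No from the previous slide):

```python
p_no = 7 / 10
p_yes = 3 / 10

# P(X | No) = P(Refund=No|No) * P(Married|No) * P(Income=120K|No)
p_x_given_no = (4 / 7) * (4 / 7) * 0.0069
# P(X | Yes): P(Married|Yes) = 0, so the whole product collapses to 0
p_x_given_yes = 1.0 * 0.0 * 1.2e-9

prediction = "No" if p_x_given_no * p_no > p_x_given_yes * p_yes else "Yes"
print(round(p_x_given_no, 4), prediction)  # 0.0023 No
```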
Naive Bayes Classifier

• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation is done with Laplacian correction:

  Original: P(Ai | C) = Nic / Nc
  Laplace:  P(Ai | C) = (Nic + 1) / (Nc + c)      (c: number of classes)

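A small numeric illustration, taking P(Refund=Yes | Yes) from the earlier table (Nic = 0 of Nc = 3 records) and c = 2 classes as the slide defines it:

```python
def laplace(n_ic, n_c, c):
    """Laplace-corrected estimate P(Ai | C) = (Nic + 1) / (Nc + c)."""
    return (n_ic + 1) / (n_c + c)

original = 0 / 3                 # a zero estimate wipes out the whole product
smoothed = laplace(0, 3, 2)      # c = number of classes, per the slide's formula
print(original, smoothed)  # 0.0 0.2
```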
Example of Naïve Bayes Classifier
Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

A: attributes, M: mammals, N: non-mammals

Test record: Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no, Class=?

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

P(A|M)P(M) > P(A|N)P(N) => Mammals
Naïve Bayes (Summary)

• Robust to isolated noise points

• Handle missing values by ignoring the instance during probability


estimate calculations

• Robust to irrelevant attributes

• Independence assumption may not hold for some attributes


• Use other techniques such as Bayesian Belief Networks (BBN)

Artificial Neural Networks (ANN)

• Artificial Neural Network (ANN) is a computational model inspired by


the way the human brain processes information.
• It is made up of interconnected nodes (neurons) organized in layers:
• Input Layer: Takes in raw data.
• Hidden Layers: Perform feature extraction.
• Output Layer: Produces the final prediction or result.

Artificial Neural Networks (ANN)

(Inputs X1, X2, X3 feed a black box that produces output Y.)

X1 X2 X3 | Y
1  0  0  | 0
1  0  1  | 1
1  1  0  | 1
1  1  1  | 1
0  0  1  | 0
0  1  0  | 0
0  1  1  | 1
0  0  0  | 0

Output Y is 1 if at least two of the three inputs are equal to 1.

How Does ANN Work?

• Each neuron receives inputs, multiplies them with weights, adds


a bias, and passes through an activation function.
• The network learns by adjusting weights using algorithms
like backpropagation and gradient descent.

• Basic Structure of ANN


Input Layer → Hidden Layer(s) → Output Layer
(x1, x2, ...) (h1, h2, ...) (y)

Artificial Neural Networks (ANN)

The same truth table is modeled by a perceptron: input nodes X1, X2, X3 each connect to the output node with weight 0.3, and the output node uses threshold t = 0.4:

  Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0)

  where I(z) = 1 if z is true, 0 otherwise
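This perceptron can be checked against all eight rows of the truth table ("at least two inputs are 1"):

```python
from itertools import product

def perceptron(x1, x2, x3, w=0.3, t=0.4):
    # Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)
    return 1 if w * x1 + w * x2 + w * x3 - t > 0 else 0

for x in product([0, 1], repeat=3):
    assert perceptron(*x) == (1 if sum(x) >= 2 else 0)
print("perceptron matches the truth table")
```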
Artificial Neural Networks (ANN)
• Model is an assembly of inter-connected nodes and weighted links
• Output node sums up each of its input values according to the weights of its links
• Compare output node against some threshold t

Perceptron Model:
  Y = I( Σi wi Xi − t > 0 )   or
  Y = sign( Σi wi Xi − t )
General Structure of ANN
(Figure: input layer x1..x5, hidden layer, output layer. Neuron i combines its inputs I1, I2, I3 with weights wi1, wi2, wi3 into Si, then applies the activation function g with threshold t to produce the output Oi = g(Si).)

Training ANN means learning the weights of the neurons

Support Vector Machines

• Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines

B1

• One Possible Solution


Support Vector Machines

B2

• Another possible solution


Support Vector Machines

B2

• Other possible solutions


Support Vector Machines
B1

B2

• Which one is better? B1 or B2?


• How do you define better?
Support Vector Machines
(Figure: boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; B1 has the wider margin.)

• Find the hyperplane that maximizes the margin => B1 is better than B2


Support Vector Machines
For boundary B1:

  w · x + b = 0        (decision boundary)
  w · x + b = +1       (margin hyperplane b11)
  w · x + b = −1       (margin hyperplane b12)

  f(x) = +1 if w · x + b ≥ 1
         −1 if w · x + b ≤ −1

  Margin = 2 / ||w||²
Nonlinear Support Vector Machines

• What if decision boundary is not linear?

Nonlinear Support Vector Machines

• Transform data into higher dimensional space

Ensemble Methods

• Construct a set of classifiers from the training data

• Predict class label of previously unseen records by aggregating


predictions made by multiple classifiers

General Idea

Original Training data: D

Step 1: Create Multiple Data Sets: D1, D2, ..., Dt-1, Dt
Step 2: Build Multiple Classifiers: C1, C2, ..., Ct-1, Ct
Step 3: Combine Classifiers: C*

Why does it work?

• Suppose there are 25 base classifiers


• Each classifier has error rate,  = 0.35
• Assume classifiers are independent
• Probability that the ensemble classifier makes a wrong
prediction:
  Σ(i=13 to 25) C(25, i) ε^i (1 − ε)^(25−i) = 0.06

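The 0.06 figure is the tail of a Binomial(25, 0.35) distribution, i.e. the probability that a majority (13 or more) of the 25 independent base classifiers are wrong:

```python
from math import comb

eps = 0.35   # error rate of each of the 25 independent base classifiers
# Ensemble (majority vote) errs when 13 or more base classifiers err
p_err = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
print(round(p_err, 2))  # 0.06
```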
Data Mining
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

5 Prediction

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Recap
CS1 CS4
Why Data Mining?
Introduction to Classification
What Is Data Mining? Introduction to Classification
Applications of Data Mining Decision Tree
Example: Classification, Clustering, Association mining Rule Based Classifier
Bayes Classifier, ANN, SVM, Ensemble Methods
CS2
Data Preprocessing Concepts
Data Cleaning, Integration, reduction, transformation and discretization
Data preprocessing techniques

CS3
Data Description
Data Similarity/Dissimilarity
Data Visualization

Agenda

• Prediction Vs. Classification


• Regression for Prediction
• Simple Linear Regression
• Parameter Estimation, Examples
• Non Linear Regression
• Multiple Linear Regression

Introduction

Prediction vs. Classification


• How is (Numerical) prediction similar to classification?
• construct a model
• use model to predict continuous or ordered value for a given input
• Difference between Prediction and classification
• Classification refers to predict categorical class label
• Prediction models continuous-valued functions
• Major method for prediction: Regression
• model the relationship between one or more independent or predictor variables and a
dependent or response variable
• Profit, sales, mortgage rates, house values, square footage, temperature, or distance could all
be predicted using regression techniques. For example, a regression model could be used to
predict the value of a house based on location, number of rooms, lot size, and other factors.

Applications

Prediction vs. Classification

Regression for Prediction
• A regression task begins with a data set in which the target values are
known, e.g., a regression model that predicts house values could be
developed based on observed data for many houses over a period of time.
The data might track the age of the house, square footage, number of
rooms, taxes, school district, proximity to shopping centers, and so on.
House value would be the target, the other attributes would be the
predictors, and the data for each house would constitute a case.
• In the model build (training) process, a regression algorithm estimates the
value of the target as a function of the predictors for each case in the build
data. These relationships between predictors and target are summarized in
a model, which can then be applied to a different data set in which the
target values are unknown

Regression in general

Linear regression in Economics

Prediction Techniques

• Regression analysis
• Linear and multiple regression
• Non-linear regression
• Other regression methods:
• Generalized linear model
• Poisson regression
• Log-linear models
• Regression trees

Linear Regression

What is Linear

Regression analysis

• Regression analysis seeks to determine the values of parameters for


a function that cause the function to best fit a set of data
observations that you provide.

• Regression is the process of estimating the value of a continuous


target (y) as a function (F) of one or more predictors (x1 , x2 , ..., xn),
a set of parameters (w1 , w2 , ..., wn), and a measure of error (e).
y = F(x,w) + e

• The predictors can be understood as independent variables (x1 , x2 ,


..., xn) and the target as a dependent variable (Y).
Regression analysis

• The error, also called the residual, is the difference between


the expected and predicted value of the dependent variable.

• The process of training a regression model involves finding the


parameter values that minimize a measure of the error, such as the mean of squared errors (MSE) or the mean of absolute errors (MAE).

Simple Linear Regression

Which line “fits best”

Linear Regression Model

Least Squares

Estimating Model Parameter

Coefficients

Coefficient Equations

Example -1

Parameter Estimation

(10, 0.5) and (25, 1.20) are actual purchase amount and tax
amount pairs. Predict the amount of the tax on the purchase
of amount 55 using linear regression.

Solution
Example -1

Parameter Estimation

(10, 0.5) and (25, 1.20) are actual purchase


amount and tax amount pairs. Predict the
amount of the tax on the purchase of amount
55 using linear regression.

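The same answer in code; with only two observations the least-squares line passes through both points, so the fit and the prediction are exact:

```python
xs, ys = [10, 25], [0.5, 1.20]   # (purchase amount, tax amount) pairs
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# b1 = S_xy / S_xx, b0 = mean_y - b1 * mean_x
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x

tax_55 = b0 + b1 * 55            # predicted tax on a purchase of 55
print(round(tax_55, 2))  # 2.6
```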
Example -2

Parameter Estimation
Hours Spent Exam
Studying Scores
4 60
1 10
10 64
14 75
4 50
7 70
22 95
3 27
8 49

Example -2

Parameter Estimation
X - Hours Spent Studying
Y – Exam Score

Example -2

Prediction Error

Example -2

Parameter Estimation

Slope (𝜷𝟏)
• Exam score (Y) is expected to increase by 3.3 units for each 1-hour increase in study hours (X)

Intercept (𝜷𝟎)
• Exam score (Y) is 28.6 when you study 0 hours

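The slope and intercept quoted above can be reproduced from the nine (hours, score) pairs:

```python
hours  = [4, 1, 10, 14, 4, 7, 22, 3, 8]
scores = [60, 10, 64, 75, 50, 70, 95, 27, 49]
n = len(hours)
mx, my = sum(hours) / n, sum(scores) / n

# Least-squares estimates: b1 = S_xy / S_xx, b0 = mean_y - b1 * mean_x
b1 = sum((x - mx) * (y - my) for x, y in zip(hours, scores)) / \
     sum((x - mx) ** 2 for x in hours)
b0 = my - b1 * mx

print(round(b1, 1), round(b0, 1))  # 3.3 28.6
```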
Example -3

Parameter Estimation – Self Practice


x 2 4 6 8
y 3 7 5 10

Construct the table and find the value

Example -3

Parameter Estimation – Self Practice (Solution)

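One way to work the exercise, using the summation form of the least-squares coefficients:

```python
xs, ys = [2, 4, 6, 8], [3, 7, 5, 10]
n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2), b0 = (Sy - b1*Sx) / n
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

print(b1, b0)  # 0.95 and 1.5, i.e. y = 1.5 + 0.95 x
```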
Nonlinear Regression
• Often the relationship between x and y
cannot be approximated with a straight
line. In this case, a nonlinear regression
technique may be used. Alternatively,
the data could be preprocessed to
make the relationship linear.

• Nonlinear regression models define y


as a function of x using an equation
that is more complicated than the
linear regression equation

Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into a linear regression model. For example,
  y = w0 + w1 x + w2 x² + w3 x³
  is convertible to linear form with new variables x2 = x², x3 = x³:
  y = w0 + w1 x + w2 x2 + w3 x3
• Other functions, such as the power function, can also be transformed to a linear model
• Some models are intractably nonlinear (e.g., sum of exponential terms)
  • possible to obtain least-squares estimates through extensive calculation on more complex formulae

Regression Trees and Model Trees
• Regression tree: proposed in CART system (Breiman et al. 1984)
• CART: Classification And Regression Trees
• Each leaf stores a continuous-valued prediction
• It is the average value of the predicted attribute for the training tuples that reach the leaf

• Model tree: proposed by Quinlan (1992)


• Each leaf holds a regression model—a multivariate linear equation for the predicted attribute
• A more general case than regression tree

• Regression and model trees tend to be more accurate than linear regression when the data are
not represented well by a simple linear model

Multiple Linear Regression

Multiple Linear Regression

Multiple Linear Regression - Example

Types of Regression Models

THANK YOU!

Data Mining

BITS Pilani
Pilani|Dubai|Goa|Hyderabad M6: Association Analysis and Apriori Algorithm
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Association Analysis Concepts


and Apriori Algorithm

Association Analysis

• Association analysis measures the strength of co-


occurrence between one item and another.
• The objective of this class of data mining algorithms is not
to predict an occurrence of an item, like classification or
regression do, but to find usable patterns in the co-
occurrences of the items.
• Association rules learning is a branch of an unsupervised
learning process that discovers hidden patterns in data, in
the form of easily recognizable rules.

Association Analysis Contd.

• Association algorithms are widely used in retail analysis


of transactions, recommendation engines, and online
clickstream analysis across web pages.
• One of the popular applications of this technique is called
market basket analysis, which finds co-occurrences of one
retail item with another item within the same retail
purchase transaction.
• retailer can take advantage of this association for bundle
pricing, product placement, and even shelf space
optimization within the store layout.

Market Basket Analysis

Association Rule Mining

• Given a set of transactions, find rules that will predict the


occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!



Definition: Frequent Itemset
• Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
• k-itemset
  • An itemset that contains k items
• Support count ()
  • Frequency of occurrence of an itemset
  • E.g. ({Milk, Bread, Diaper}) = 2
• Support (s)
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  • An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke


Definition: Association Rule
 Association Rule
  – An implication expression of the form X  Y, where X and Y are itemsets
  – Example: {Milk, Diaper}  {Beer}

 Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper}  {Beer}
s = (Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = (Milk, Diaper, Beer) / (Milk, Diaper) = 2/3  0.67
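The support and confidence computations above can be checked with a few lines of Python. This is a minimal sketch over the five transactions from the slide; the helper name `support_count` is illustrative, not from the deck:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    # sigma(X): number of transactions containing every item of X
    return sum(itemset <= t for t in transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y) / len(transactions)  # 2/5 = 0.4
c = support_count(X | Y) / support_count(X)   # 2/3, about 0.67
print(s, round(c, 2))  # 0.4 0.67
```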


Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold

Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that do not satisfy the minsup and minconf
thresholds
 Computationally prohibitive!



Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk,Diaper}  {Beer} (s=0.4, c=0.67)
{Milk,Beer}  {Diaper} (s=0.4, c=1.0)
{Diaper,Beer}  {Milk} (s=0.4, c=0.67)
{Beer}  {Milk,Diaper} (s=0.4, c=0.67)
{Diaper}  {Milk,Beer} (s=0.4, c=0.5)
{Milk}  {Diaper,Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements


Mining Association Rules

• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support  minsup

2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive



Frequent Itemset Generation
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE


ABCDE

Given d items, there are 2^d possible candidate itemsets


Frequent Itemset Generation
• Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Count the support of each candidate by scanning the database
• Match each transaction against every candidate
• Complexity ~ O(NMw) => Expensive since M = 2^d !!!

Transactions List of
Candidates
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
N 3 Milk, Diaper, Beer, Coke M
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
w



Frequent Itemset Generation Strategies

Reduce the number of candidates (M)


• Complete search: M = 2^d
• Use pruning techniques to reduce M

Reduce the number of transactions (N)


• Reduce size of N as the size of itemset increases
• Used by DHP ( Direct Hashing and Pruning)and vertical-based
mining algorithms

Reduce the number of comparisons (NM)


• Use efficient data structures to store the candidates or
transactions
• No need to match every candidate against every transaction



Reducing Number of Candidates

Apriori principle:
• If an itemset is frequent, then all of its subsets must also be
frequent

Apriori principle holds due to the following property of the


support measure:
• Support of an itemset never exceeds the support of its
subsets
• This is known as the anti-monotone property of support


Illustrating Apriori Principle null

A B C D E

AB AC AD AE BC BD BE CD CE DE

Found to be
Infrequent
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

Pruned
ABCDE
supersets


Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3

Minimum Support = 3

Triplets (3-itemsets): no 3-itemsets are frequent

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

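The candidate counts quoted on the slide are easy to verify directly; this quick check is an addition, not part of the original deck:

```python
from math import comb

# Without pruning: all 1-, 2-, and 3-itemsets over the 6 items
brute_force = comb(6, 1) + comb(6, 2) + comb(6, 3)  # 6 + 15 + 20 = 41

# With support-based pruning, only 6 + 6 + 1 candidates are counted
pruned = 6 + 6 + 1

print(brute_force, pruned)  # 41 13
```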


Apriori Algorithm

Method:

1. Let k=1
2. Generate frequent itemsets of length 1
3. Repeat until no new frequent itemsets are identified
i. Generate length (k+1) candidate itemsets from length k
frequent itemsets
ii. Prune candidate itemsets containing subsets of length k that
are infrequent
iii. Count the support of each candidate by scanning the DB
iv. Eliminate candidates that are infrequent, leaving only those
that are frequent



Reducing Number of Comparisons
Candidate counting:
• Scan the database of transactions to determine the support of each candidate
itemset
• To reduce the number of comparisons, store the candidates in a hash
structure
• Instead of matching each transaction against every candidate, match it
against candidates contained in the hashed buckets
Transactions Hash Structure
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
N 3 Milk, Diaper, Beer, Coke k
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Buckets



Support Counting

 Given t = {1,2,3,5,6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3.
 It is not possible to construct a 3-itemset that begins with items 5 or 6 because there
are only two items in t whose labels are greater than or equal to 5.

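This observation can be verified by enumerating the 3-subsets of the transaction (a quick illustration added here, not from the original deck):

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)
subsets = list(combinations(t, 3))
print(len(subsets))  # 10 three-itemsets in total
# Every 3-subset, in sorted order, begins with item 1, 2, or 3
print(all(s[0] in (1, 2, 3) for s in subsets))  # True
```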


Support Counting Using a Hash Tree

In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree.

During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets.

That way, instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket.



Support Counting Using a Hash Tree



Hashing a transaction at the root node of a hash
tree

In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the transaction.



Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:


{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5
6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• A hash function, e.g. h(p) = (p - 1) mod 3
• Max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

Hash function 234


3,6,9 567
1,4,7 367
145 345 356
2,5,8 136 368
357
124 689
457 125 159
458


Association Rule Discovery: Hash tree

Hash Function Candidate Hash Tree

1,4,7 3,6,9

2,5,8

234
567

145 136
345 356 367
Hash on
357 368
1, 4 or 7
124 159 689
125
457 458



Association Rule Discovery: Hash tree Contd.

Hash Function
Candidate Hash Tree

1,4,7 3,6,9

2,5,8

234
567

145 136
Hash on 345 356 367
2, 5 or 8 357 368
124 159 689
125
457 458


Association Rule Discovery: Hash tree Contd.

Hash Function Candidate Hash Tree

1,4,7 3,6,9

2,5,8

234
567

145 136
345 356 367
Hash on
357 368
3, 6 or 9
124 159 689
125
457 458



Subset Operation

Given a transaction t, what are Transaction, t


the possible subsets of size 3?
1 2 3 5 6

Level 1
1 2 3 5 6 2 3 5 6 3 5 6

Level 2

12 3 5 6 13 5 6 15 6 23 5 6 25 6 35 6

123
135 235
125 156 256 356
136 236
126

Level 3: Subsets of 3 items


Subset Operation Using Hash Tree

Hash Function
1 2 3 5 6 transaction

1+ 2356
2+ 356 1,4,7 3,6,9

2,5,8
3+ 56

234
567

145 136
345 356 367
357 368
124 159 689
125
457 458


Subset Operation Using Hash Tree
Hash Function
1 2 3 5 6 transaction

1+ 2356
2+ 356 1,4,7 3,6,9
12+ 356 2,5,8
3+ 56
13+ 56
234
15+ 6 567

145 136
345 356 367
357 368
124 159 689
125
457 458
Match transaction against 11 out of 15 candidates
Factors Affecting Complexity

Choice of minimum support threshold


• lowering support threshold results in more frequent itemsets
• this may increase number of candidates and max length of frequent
itemsets
Dimensionality (number of items) of the data set
• more space is needed to store support count of each item
• if number of frequent items also increases, both computation and I/O
costs may also increase
Size of database
• since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
Average transaction width
• transaction width increases with denser data sets
• This may increase max length of frequent itemsets and traversals of
hash tree (number of subsets in a transaction increases with its width)



Rule Generation

Procedure:

• For each frequent itemset l, generate all nonempty proper subsets of l.

• For every nonempty proper subset s of l, output the rule "s  (l - s)" if

  supportcount(l) / supportcount(s) >= minconf,

  where minconf is the minimum confidence threshold.

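The procedure above can be sketched directly in Python; the function and variable names here are illustrative, not from the deck, and the transactions are the slide's market-basket example:

```python
from itertools import combinations

def rules_from_itemset(l, transactions, minconf):
    """Emit s -> (l - s) for each nonempty proper subset s of frequent
    itemset l whose confidence sigma(l)/sigma(s) meets minconf."""
    def sigma(x):
        # support count of itemset x
        return sum(set(x) <= t for t in transactions)
    rules = []
    for k in range(1, len(l)):
        for s in combinations(sorted(l), k):
            conf = sigma(l) / sigma(s)
            if conf >= minconf:
                rules.append((set(s), set(l) - set(s), round(conf, 2)))
    return rules

T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
print(rules_from_itemset({"Bread", "Milk"}, T, 0.7))
# Both Bread -> Milk and Milk -> Bread pass, each with confidence 0.75
```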


Rule Generation Contd.
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

For the itemset {Bread, Milk}, some association rules are:
i) Bread  Milk, conf = Supcount({Bread, Milk}) / Supcount({Bread}) = (3/4)*100 = 75%
ii) Milk  Bread, conf = Supcount({Bread, Milk}) / Supcount({Milk}) = (3/4)*100 = 75%

If the minconf threshold is 70%, then both of these rules are strong association rules.


Rule Generation Contd.



Examples
On
Apriori Algorithm



Example-1



Example-2

Given Support Threshold (min_supp) = 30% and Confidence Threshold (min_conf) = 70%, determine the frequent itemsets and the strong association rules using the Apriori Algorithm.



Example-2 Contd..
Step 1:
i) Create 1-itemset candidates and calculate support for all the items.
ii) Perform pruning to create the L1 frequent itemset. In pruning, we filter out all items with support less than the min_supp value (30% = 3).



Step 2:
i) Create 2-itemset candidates from the L1 frequent itemset and calculate support for all of them.
ii) Perform pruning to create the L2 frequent itemset. As before, filter out all itemsets with support less than the min_supp value (30% = 3).



Step 3:
i) Create 3-itemset candidates from the L2 frequent itemset and calculate support for all of them.
ii) Perform pruning to create the L3 frequent itemset. Here, however, support is below the min_supp value (3) for all the 3-itemsets, so we cannot go any further and must generate association rules from the L2 frequent itemset only.



Example-2 Contd..

We need to calculate the Confidence for all combinations of items in the L2 frequent itemset.



Example-2 Contd..

Perform pruning again, this time filtering out all association rules with Confidence (%) less than min_conf (70%).



Example-3

Find the frequent itemsets in this dataset. Assume a minimum support count of s = 3.

Frequent Itemset (I) = {Coke, Chips}




Thank You



Data Mining
M7: Association Analysis and Apriori Algorithm – Part 2

Dr. Rama Satish K V


Guest Faculty, WILP- BITS,
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

CS-07: FP-Tree Techniques and Association Rule Formation

BITS Pilani, Pilani Campus
Important Note to Students

 It is important to know that just logging in to the session does not guarantee attendance.

 Once you join the session, continue till the end to be considered present in the class.

 IMPORTANTLY, you need to make the class more interactive by responding to the Professor's queries during the session.

 Answering Poll / Quiz questions during the class is mandatory to get attendance.

 Attend the Class Quiz at the end of the session.

 Whenever the Professor calls your number / name, you need to respond; otherwise you will be marked ABSENT.



Session Agenda
• Recap of CS06
• Mining Association Rules
• Bottleneck of Frequent-pattern Mining
• Mining Frequent Patterns Without Candidate Generation
• Frequent Patterns(FP)-growth Algorithm
• From Conditional Pattern-bases to Conditional FP-trees
• Association Rule Mining
• Rule Generation for Apriori Algorithm
• Effect of Support Distribution
• Multiple Minimum Support, Pattern Evaluation



Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support  minsup

2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset

• Frequent item set generation is computationally


expensive



Bottleneck of Frequent-pattern Mining

• Multiple database scans are costly


• Mining long patterns needs many passes of scanning and generates lots of candidates
• To find frequent itemset i1i2…i100
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27×10^30 !
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation?

Mining Frequent Patterns
Without Candidate Generation
• Grow long patterns from short ones using local
frequent items
• “abc” is a frequent pattern
• Get all transactions having “abc”: DB|abc
• “d” is a local frequent item in DB|abc  abcd is a
frequent pattern

Frequent Patterns(FP)-growth Algorithm
• Use a compressed representation of the database using
an FP-tree
• Once an FP-tree has been constructed, it uses a
recursive divide-and-conquer approach to mine the
frequent itemsets
1.Scan DB once, find frequent 1-itemset (single item
pattern)
2.Sort frequent items in frequency descending order,
f-list
3.Scan DB again, construct FP-tree

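Steps 1 and 2, plus the transaction reordering used when building the tree in step 3, can be sketched as follows. The tie order within equal counts is taken from the slide's f-list, since ties can legitimately be broken either way:

```python
from collections import Counter

# Transactions from the slide; min_support = 3
DB = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("faeclpmn")]

# Step 1: scan once, count items, keep the frequent ones
counts = Counter(i for t in DB for i in t)
frequent = {i for i, c in counts.items() if c >= 3}
print(sorted(frequent))  # ['a', 'b', 'c', 'f', 'm', 'p']

# Step 2: f-list in descending count order (ties broken as on the slide)
flist = "fcabmp"

# Step 3 (prefix): rewrite each transaction as its ordered frequent items
ordered = ["".join(i for i in flist if i in t) for t in DB]
print(ordered)  # ['fcamp', 'fcabm', 'fb', 'cbp', 'fcamp']
```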


Construct FP-tree from a Transaction Database

min_support = 3 F-list=f-c-a-b-m-p

TID Items bought (ordered) frequent items


100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}

f=4 a=3 c=4 d=1


e=1 g=1 h=1 i=1
j=1 k=1 l=2 m=3
n=1 o=2 p=3 s=1


Construct FP-tree from a Transaction Database
{}
Header Table
min_support = 3
Item frequency head f:4 c:1
F-list=f-c-a-b-m-p f 4
c 4 c:3 b:1 b:1
a 3
b 3 a:3 p:1
m 3
p 3
m:2 b:1

TID Items bought (ordered) frequent items p:2 m:1


100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
Find Patterns Having P From P-conditional Database

• Starting at the frequent item header table in the FP-tree


• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base
{}
Header Table
Conditional pattern bases
Item frequency head f:4 c:1
item cond. pattern base
f 4
c 4 c:3 b:1 b:1 c f:3
a 3 a fc:3
b 3 a:3 p:1
m 3 b fca:1, f:1, c:1
p 3 m:2 b:1 m fca:2, fcab:1
p fcam:2, cb:1
p:2 m:1
Benefits of the FP-tree Structure

• Completeness
• Preserve complete information for frequent pattern
mining
• Never break a long pattern of any transaction
• Compactness
• Reduce irrelevant info—infrequent items are gone
• Items in frequency descending order: the more
frequently occurring, the more likely to be shared
• Never be larger than the original database (not
count node-links and the count field)

Partition Patterns and Databases
• Frequent patterns can be partitioned into subsets
according to f-list
• F-list=f-c-a-b-m-p
• Patterns containing p
• Patterns having m but no p
•…
• Patterns having c but no a nor b, m, p
• Pattern f
• Completeness and non-redundancy

From Conditional Pattern-bases to Conditional FP-trees

• For each pattern-base


• Accumulate the count for each item in the base
• Construct the FP-tree for the frequent items of the
pattern base
m-conditional pattern base:
{} fca:2, fcab:1
Header Table
Item frequency head All frequent
f:4 c:1 patterns relate to m
f 4 {}
c 4 c:3 b:1 b:1 m,

a 3  fm, cm, am,
f:3
b 3 a:3 p:1 fcm, fam, cam,
m 3 c:3 fcam
p 3 m:2 b:1
p:2 m:1 a:3
m-conditional FP-tree
Recursion: Mining Each Conditional FP-tree
{}

Cond. pattern base of “am”: (fc:3) f:3


{}
c:3
f:3
am-conditional FP-tree
c:3 {}

a:3 Cond. pattern base of “cm”: (f:3)


f:3
m-conditional FP-tree
cm-conditional FP-tree

{}
Cond. pattern base of “cam”: (f:3)
f:3
cam-conditional FP-tree
A Special Case: Single Prefix Path in FP-tree
• Suppose a (conditional) FP-tree T has a shared single
prefix-path P
• Mining can be decomposed into two parts
{} • Reduction of the single prefix path into one node
• Concatenation of the mining results of the two parts
a1:n1
a2:n2

a3:n3 {} r1

a1:n1
b1:m1 C1:k1  r1 =
a2:n2
+ b1:m1 C1:k1

C2:k2 C3:k3 a3:n3 C2:k2 C3:k3




Mining Frequent Patterns With FP-trees

• Idea: Frequent pattern growth


• Recursively grow frequent patterns by pattern and
database partition
• Method
• For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
• Repeat the process on each newly created conditional
FP-tree
• Until the resulting FP-tree is empty, or it contains
only one path—single path will generate all the
combinations of its sub-paths, each of which is a
frequent pattern

Why Is FP-Growth the Winner?

• Divide-and-conquer:
• decompose both the mining task and DB according
to the frequent patterns obtained so far
• leads to focused search of smaller databases
• Other factors
• no candidate generation, no candidate test
• compressed database: FP-tree structure
• no repeated scan of entire database
• basic ops—counting local freq items and building
sub FP-tree, no pattern search and matching

Association Rule Mining

• Given a set of transactions, find rules that will predict


the occurrence of an item based on the occurrences of
other items in the transaction

Market-Basket transactions Example of Association Rules

TID Items {Diaper}  {Butter},


1 Bread, Milk {Milk, Bread}  {Eggs, Coke},
2 Bread, Diaper, Butter, Eggs {Butter, Bread}  {Milk},
3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke



Definition: Association Rule
 Association Rule
  – An implication expression of the form X  Y, where X and Y are itemsets
  – Example: {Milk, Diaper}  {Butter}

 Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Eggs
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper}  {Butter}
s = (Milk, Diaper, Butter) / |T| = 2/5 = 0.4
c = (Milk, Diaper, Butter) / (Milk, Diaper) = 2/3  0.67



Mining Association Rules

• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support 
minsup

2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a
binary partitioning of a frequent itemset

• Frequent itemset generation is still


computationally expensive


Rule Generation

• Given a frequent itemset L, find all non-empty subsets f  L such that f  (L – f) satisfies the minimum confidence requirement
• If {A,B,C,D} is a frequent itemset, candidate rules:
  ABC  D, ABD  C, ACD  B, BCD  A,
  A  BCD, B  ACD, C  ABD, D  ABC,
  AB  CD, AC  BD, AD  BC, BC  AD,
  BD  AC, CD  AB
• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L   and   L)



Rule Generation

• How to efficiently generate rules from frequent itemsets?
• In general, confidence does not have an anti-monotone property:
  c(ABC  D) can be larger or smaller than c(AB  D)
• But confidence of rules generated from the same itemset has an anti-monotone property
  • e.g., L = {A,B,C,D}:
    c(ABC  D) ≥ c(AB  CD) ≥ c(A  BCD)
• Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule


Rule Generation for Apriori Algorithm
Lattice of rules
ABCD=>{ }
Low
Confidence
Rule
BCD=>A ACD=>B ABD=>C ABC=>D

CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD


Pruned
Rules


Rule Generation with Apriori Algorithm

• Candidate rule is generated by merging two rules that


share the same prefix
in the rule consequent CD=>AB BD=>AC

• join(CD=>AB,BD=>AC)
would produce the candidate
rule D => ABC

• Prune rule D=>ABC if its D=>ABC


subset AD=>BC does not have
high confidence



Effect of Support Distribution

• Many real data sets have skewed support distribution

Support
distribution of a
retail data set



Effect of Support Distribution

• How to set the appropriate minsup threshold?


• If minsup is set too high, we could miss itemsets
involving interesting rare items (e.g., expensive
products)

• If minsup is set too low, it is computationally


expensive and the number of itemsets is very large

• Using a single minimum support threshold may not be


effective



Multiple Minimum Support

• How to apply multiple minimum supports?


• MS(i): minimum support for item i
• e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
• MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli))
= 0.1%

• Challenge: Support is no longer anti-monotone


• Suppose: Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5%

• {Milk,Coke} is infrequent but {Milk,Coke,Broccoli}


is frequent
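The itemset-level MS(·) rule can be written down directly. A small sketch using the example values from the slide (percent units):

```python
# Item-specific minimum supports (percent), from the slide
MS = {"Milk": 5.0, "Coke": 3.0, "Broccoli": 0.1, "Salmon": 0.5}

def ms(itemset):
    # MS of an itemset = the smallest MS among its items
    return min(MS[i] for i in itemset)

print(ms({"Milk", "Broccoli"}))  # 0.1
```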


Pattern Evaluation
• Association rule algorithms tend to produce too many
rules
• many of them are uninteresting or redundant
• Redundant if {A,B,C}  {D} and {A,B}  {D}
have same support & confidence

• Interestingness measures can be used to prune/rank


the derived patterns

• In the original formulation of association rules,


support & confidence are the only measures used



Computing Interestingness Measure
• Given a rule X  Y, information needed to compute rule
interestingness can be obtained from a contingency table
Contingency table for X  Y
Y Y f11: support of X and Y
X f11 f10 f1+ f10: support of X and Y
X f01 f00 fo+ f01: support of X and Y
f+1 f+0 |T| f00: support of X and Y

Used to define various measures


 support, confidence, lift, Gini,
J-measure, etc.


Drawback of Confidence

         Coffee   ¬Coffee
Tea        15        5      20
¬Tea       75        5      80
           90       10     100

Association Rule: Tea  Coffee

Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
 Although confidence is high, the rule is misleading
 P(Coffee|¬Tea) = 75/80 = 0.9375


Statistical Independence

• Population of 1000 students


• 600 students know how to swim (S)
• 700 students know how to bike (B)
• 420 students know how to swim and bike (S,B)

• P(SB) = 420/1000 = 0.42


• P(S)  P(B) = 0.6  0.7 = 0.42

• P(SB) = P(S)  P(B) => Statistical independence


• P(SB) > P(S)  P(B) => Positively correlated
• P(SB) < P(S)  P(B) => Negatively correlated

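With the raw counts, the independence check P(S∧B) = P(S)·P(B) can be done exactly in integer arithmetic (cross-multiplying avoids float rounding; a small check added here, not from the deck):

```python
n, s, b, sb = 1000, 600, 700, 420  # students: total, swim, bike, both

# P(S and B) == P(S) * P(B)  <=>  sb * n == s * b  (cross-multiplied)
print(sb * n == s * b)  # True -> statistically independent
```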


Statistical-based Measures

• Measures that take into account statistical dependence:

  Lift = P(Y|X) / P(Y)

  Interest = P(X,Y) / (P(X) P(Y))



Example: Lift/Interest

         Coffee   ¬Coffee
Tea        15        5      20
¬Tea       75        5      80
           90       10     100

Association Rule: Tea  Coffee

Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9
 Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

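The lift computation for the Tea  Coffee rule follows directly from the contingency counts:

```python
# Counts: (Tea, Coffee), (Tea, no Coffee), (no Tea, Coffee), (no Tea, no Coffee)
f11, f10, f01, f00 = 15, 5, 75, 5
n = f11 + f10 + f01 + f00

conf = f11 / (f11 + f10)     # P(Coffee | Tea) = 0.75
p_coffee = (f11 + f01) / n   # P(Coffee) = 0.9
lift = conf / p_coffee
print(round(lift, 4))  # 0.8333 -> < 1, negatively associated
```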


Drawback of Lift & Interest

      Y    ¬Y               Y    ¬Y
X    10     0    10    X   90     0    90
¬X    0    90    90    ¬X   0    10    10
     10    90   100        90    10   100

Lift = 0.1 / ((0.1)(0.1)) = 10        Lift = 0.9 / ((0.9)(0.9)) = 1.11

Statistical independence:
If P(X,Y)=P(X)P(Y) => Lift = 1



Subjective Interestingness Measure

• Objective measure:
• Rank patterns based on statistics computed from data
• e.g., 21 measures of association (support, confidence,
Laplace, Gini, mutual information, Jaccard, etc).

• Subjective measure:
• Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts
the
expectation of a user
• A pattern is subjectively interesting if it is
actionable


Interestingness via Unexpectedness

• Need to model expectation of users (domain knowledge)

+ Pattern expected to be frequent

- Pattern expected to be infrequent


Pattern found to be frequent

Pattern found to be infrequent

+ - Expected Patterns

- + Unexpected Patterns

• Need to combine expectation of users with evidence from data (i.e., extracted patterns)


Class Summary

• Recap of CS06
• Mining Association Rules
• Bottleneck of Frequent-pattern Mining
• Mining Frequent Patterns Without Candidate Generation
• Frequent Patterns(FP)-growth Algorithm
• From Conditional Pattern-bases to Conditional FP-trees
• Association Rule Mining
• Rule Generation for Apriori Algorithm
• Effect of Support Distribution
• Multiple Minimum Support, Pattern Evaluation

Revision CS01 – CS07




Thank You!
