CH 1

Uploaded by

S SMRITI

Data Mining

An Introduction

Introduction
• Organizations now collect huge amounts of data, but extracting meaningful
insights is challenging.

• Traditional analysis methods often fall short because of the size, type, or
complexity of the data.

• Data mining combines traditional analysis with advanced algorithms to
handle large and complex datasets.

• It enables new ways of analyzing both emerging and existing types of data.

Large-scale Data is Everywhere!

▪ There has been enormous data growth in both commercial and scientific
databases due to advances in data generation and collection technologies.

▪ New mantra
▪ Gather whatever data you can whenever and wherever possible.

▪ Expectations
▪ Gathered data will have value either for the purpose collected or for a
purpose not envisioned.

[Figure panels: Cyber Security, E-Commerce, Traffic Patterns, Social Networking: Twitter, Sensor Networks, Computational Simulations]

01/17/2018 Introduction to Data Mining, 2nd Edition 3

Business
• Retailers collect real-time customer data through technologies such as
barcodes and RFID (Radio Frequency Identification) to improve decision-making.

• Lots of data is being collected and warehoused
• Web data
• Yahoo has petabytes of web data
• Facebook has billions of active users
• Purchases at department/grocery stores, e-commerce
• Amazon handles millions of visits/day
• Bank/credit card transactions

• Competitive pressure is strong
• Provide better, customized services for an edge (e.g., in Customer Relationship Management)

• Data mining enables business intelligence applications like customer profiling,
marketing, and fraud detection.

Medicine, Science and Engineering
• Researchers in fields like medicine and climate science generate massive,
complex datasets (e.g., high-throughput biological data) that traditional
methods struggle to analyze.

• Data mining techniques
• Uncover patterns
• Answer critical questions
• Help in predictions

[Figures: fMRI Data from Brain; Surface Temperature of Earth]

Medicine, Science and Engineering
• Data collected and stored at enormous speeds
• Remote sensors on a satellite
• NASA EOSDIS archives petabytes of earth science data per year
• Telescopes scanning the skies
• Sky survey data
• Scientific simulations
• Terabytes of data generated in a few hours

• Data mining helps scientists
• In automated analysis of massive datasets
• In hypothesis formation

[Figure: Sky Survey Data]

Medicine, Science and Engineering (contd.)
• Advances like microarray technology allow biologists to analyze thousands of
genes simultaneously, offering insights into gene functions and disease links.

[Figure: Gene Expression Data]

• Because such data is complex and high-dimensional, data mining is essential
for tasks like
• Protein structure prediction
• Pathway modeling

What (is not) / (is) Data Mining?

● What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

● What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

What is Data Mining? - Many Definitions

• Non-trivial extraction of implicit, previously unknown and potentially
useful information from data

• Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns

Data Mining and Knowledge Discovery
• Data mining is an integral part of knowledge discovery in databases (KDD),
the overall process of converting raw data into useful information.

• The KDD process consists of a series of transformation steps, from data
preprocessing to postprocessing of data mining results.

• Input
• Stored in a variety of formats (flat files, spreadsheets, or relational tables)
• May reside in a centralized data repository or be distributed across multiple sites

• Preprocessing - transforms the raw input data into an appropriate format for
subsequent analysis.
• Steps:
• Fusing data from multiple sources
• Cleaning data to remove noise and duplicate observations
• Selecting records and features that are relevant to the data mining task at hand
• Because data can be collected and stored in many ways, data preprocessing is
the most laborious and time-consuming step in the overall knowledge discovery
process.

• “Closing the loop” refers to integrating data mining results into decision
support systems.
• Example - in business applications, insights from data mining results can be
integrated with campaign management tools so that effective marketing
promotions can be conducted and tested.
• Such integration requires a postprocessing step that ensures that only valid
and useful results are incorporated into the decision support system.
• One example of postprocessing is visualization, which allows analysts to
explore the data and the data mining results from a variety of viewpoints.
• Statistical measures or hypothesis testing methods can also be applied during
postprocessing to eliminate spurious data mining results.
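
The preprocessing steps listed above (fusing, cleaning, selecting) can be sketched in a few lines. This is only a minimal illustration, not a prescribed pipeline: the function name `preprocess`, the field names, and the toy records are hypothetical, and treating a missing value as noise to be dropped is an assumption made for the example.

```python
from typing import Dict, List

Record = Dict[str, object]


def preprocess(sources: List[List[Record]], features: List[str]) -> List[Record]:
    """Fuse, clean, and select: three preprocessing steps in miniature."""
    # 1. Fuse: concatenate the records from every source.
    fused = [record for source in sources for record in source]

    cleaned: List[Record] = []
    seen = set()
    for record in fused:
        # 2a. Clean (assumption): drop records missing a needed feature.
        if any(record.get(name) is None for name in features):
            continue
        # 3. Select: keep only the features relevant to the task.
        selected = {name: record[name] for name in features}
        # 2b. Clean: drop duplicate observations.
        key = tuple(selected[name] for name in features)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(selected)
    return cleaned


# Hypothetical toy data: two sources that partly describe the same customers.
store_a = [
    {"customer": "c1", "age": 34, "spend": 120.0, "clerk": "x"},
    {"customer": "c2", "age": None, "spend": 80.0, "clerk": "y"},
]
store_b = [
    {"customer": "c1", "age": 34, "spend": 120.0, "clerk": "z"},
    {"customer": "c3", "age": 51, "spend": 42.5, "clerk": "z"},
]

result = preprocess([store_a, store_b], ["customer", "age", "spend"])
print(result)
```

In this toy run the record for `c2` is dropped because its age is missing, the second copy of `c1` is dropped as a duplicate, and the irrelevant `clerk` field is removed from every record.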
Motivating Challenges
• Scalability
• The ability to handle very large datasets efficiently.
• To achieve this, algorithms use smart strategies like sampling, parallel
processing, or special data structures.

• High Dimensionality
• High dimensionality refers to datasets with a large number of features or
attributes, which is common in fields like bioinformatics or time-series
analysis.
• Traditional techniques often struggle with such data, as computational
complexity increases with the number of dimensions.
• Heterogeneous and Complex Data
• Includes different types like text, images, graphs, or time series, making
analysis more challenging.
• Modern data mining techniques must handle diverse formats and
relationships such as sequences, structures, and hierarchies.

• Data Ownership and Distribution
• Refers to situations where data is spread across multiple locations or
organizations.
• This requires distributed data mining techniques that reduce communication,
combine results efficiently, and ensure data security.
• Non-traditional Analysis
• Data mining automates the generation and testing of many hypotheses,
unlike the manual, hypothesis-driven approach of traditional statistics.
• It also works with opportunistic and complex datasets that aren't from
controlled experiments.
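
Sampling, one of the scalability strategies named above, can be illustrated with reservoir sampling. The slides do not name a specific sampling method, so this choice is an assumption made for illustration; the function name `reservoir_sample` and the fixed seed are hypothetical. The sketch keeps a uniform random sample of `k` items while reading the data once, without holding the whole dataset in memory.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

If the stream has fewer than `k` items, the function simply returns all of them.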

The Origins of Data Mining
• Data mining originated by combining methods from statistics and AI to
analyze large, complex datasets.

• It adopted techniques from fields like optimization and information
retrieval to improve efficiency and scalability.

• Database systems support data mining by enabling efficient storage,
indexing, and querying of large datasets.

• High-performance and distributed computing help manage massive data sizes
and enable analysis across multiple locations.

Data mining as a confluence of many disciplines
• Draws ideas from machine learning/AI, pattern recognition, statistics, and
database systems

Data Mining Tasks
• Prediction Tasks
• Use some variables to predict unknown or future values of other variables.
• The attribute to be predicted is known as the target or dependent variable
• The attributes used for making the prediction are known as the
explanatory or independent variables.

• Description Tasks
• Find human-interpretable patterns that describe the data and summarize
the underlying relationships in it.
• Often exploratory in nature
• Frequently require postprocessing techniques to validate and explain the
results

Data Mining Tasks …
[Figure: four of the core data mining tasks applied to data: clustering, predictive modeling, association rules, and anomaly detection]

Predictive modeling
• Predictive modeling refers to the task of building a model for the
target variable as a function of the explanatory variables.

• There are two types of predictive modeling tasks:
• Classification - used for discrete target variables
• Regression - used for continuous target variables

Predictive modeling - Examples
• Classification - Predicting whether a Web user will make a purchase
at an online bookstore is a classification task because the target
variable is binary-valued.

• Regression - Forecasting the future price of a stock is a regression
task because price is a continuous-valued attribute.

• The goal of both tasks is to learn a model that minimizes the error
between the predicted and true values of the target variable.
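
The "error between the predicted and true values" can be made concrete with two common measures. The slides do not say which error measure is meant, so the choices below (misclassification rate for classification, mean squared error for regression) are assumptions, and all values are made-up toy numbers.

```python
from typing import Sequence


def error_rate(y_true: Sequence[object], y_pred: Sequence[object]) -> float:
    """Fraction of predictions that differ from the true class label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between predicted and true values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification: will the Web user make a purchase? (toy labels)
bought = ["yes", "no", "no", "yes"]
predicted_bought = ["yes", "no", "yes", "yes"]
classification_error = error_rate(bought, predicted_bought)

# Regression: future stock price (toy values)
price = [10.0, 12.0, 11.0]
predicted_price = [10.5, 11.0, 11.0]
regression_error = mean_squared_error(price, predicted_price)

print(classification_error, regression_error)
```

One of the four toy class predictions is wrong, so the error rate is 0.25; the squared price errors are 0.25, 1.0, and 0.0, which average to about 0.417.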

Predictive Modeling: Classification
• Find a model for the class attribute as a function of the values of other
attributes

[Figures: predicting the type of a flower; a model for predicting creditworthiness, with the class attribute marked]

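Classification Example
[Figure: a table with two categorical attributes, one quantitative attribute, and a class attribute is split into a training set and a test set; a classifier is learned from the training set to produce a model]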
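
The training set / test set workflow in the figure can be sketched with a very simple classifier. A 1-nearest-neighbor rule is used here purely because it is short; the slides do not prescribe a particular classifier. The numeric features (income in thousands, years in current job), the labels, and all values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]


def predict_1nn(training_set: List[Example], x: Sequence[float]) -> str:
    """Predict the class of x as the class of its closest training example."""
    nearest = min(training_set, key=lambda example: math.dist(example[0], x))
    return nearest[1]


def accuracy(training_set: List[Example], test_set: List[Example]) -> float:
    """Fraction of test examples whose predicted class matches the true class."""
    correct = sum(
        1 for x, label in test_set if predict_1nn(training_set, x) == label
    )
    return correct / len(test_set)


# Hypothetical labelled records: (income in thousands, years in job) -> creditworthy?
training_set: List[Example] = [
    ((25.0, 1.0), "no"),
    ((30.0, 2.0), "no"),
    ((80.0, 10.0), "yes"),
    ((95.0, 7.0), "yes"),
]
test_set: List[Example] = [
    ((28.0, 1.0), "no"),
    ((90.0, 9.0), "yes"),
    ((50.0, 3.0), "yes"),
]

test_accuracy = accuracy(training_set, test_set)
print(test_accuracy)
```

The third test record lies closer to a "no" training example, so it is misclassified and the model scores two out of three on the held-out test set. Evaluating on records that were not used for learning is the point of keeping a separate test set.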

Examples of Classification Task

• Classifying credit card transactions as legitimate or fraudulent

• Classifying land covers (water bodies, urban areas, forests, etc.) using
satellite data

• Identifying intruders in cyberspace

• Predicting tumor cells as benign or malignant

• Classifying secondary structures of protein as alpha-helix, beta-sheet,
or random coil

Classification – Application - Fraud Detection

• Goal: Predict fraudulent cases in credit card transactions.

• Approach:
• Use credit card transactions and information about the account holder as attributes.
• When the customer buys, what they buy, how often they pay on time, etc.
• Label past transactions as fraudulent or fair. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.

Classifying Galaxies
Courtesy: http://aps.umn.edu

Class:
• Stages of formation (early, intermediate, late)

Attributes:
• Image features
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB

Regression
• Predict the value of a given continuous-valued variable based on the
values of other variables, assuming a linear or nonlinear model of
dependency.
• Extensively studied in the statistics and neural network fields.
• Examples:
• Predicting sales amounts of a new product based on advertising
expenditure.
• Predicting wind velocities as a function of temperature, humidity,
air pressure, etc.
• Time series prediction of stock market indices.
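
A linear model of dependency can be sketched with ordinary least squares on a single explanatory variable. This is one simple instance of regression, chosen for brevity; the function name `fit_line` and the advertising/sales figures are made up for illustration, and the toy points were picked to lie exactly on a line.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales amount (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept  # forecast for a new expenditure of 5
print(slope, intercept, predicted_sales)
```

Because the toy points satisfy sales = 2 × advertising + 1 exactly, the fit recovers a slope of 2 and an intercept of 1, and the forecast for an expenditure of 5 is 11. Real data would leave some residual error.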
Association Analysis
• Association analysis is used to discover patterns that describe
strongly associated features in the data.

• The discovered patterns are typically represented in the form of
implication rules or feature subsets.

Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items
from a given collection
• Produce dependency rules that predict the occurrence of an item based on
occurrences of other items.

Association Analysis: Applications
• Market-basket analysis
• Rules are used for sales promotion, shelf management, and inventory
management

• Medical Informatics
• Rules are used to find combinations of patient symptoms and test results
associated with certain diseases

Association Rule Discovery: Application - Market Basket Analysis

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
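
How strongly a rule such as {Milk} --> {Coke} holds is usually quantified with support and confidence. These two measures are not defined in the slides, so their use here is an assumption, and the five transactions below are a made-up toy basket table rather than data from the source.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def count(transactions: List[Transaction], itemset: FrozenSet[str]) -> int:
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)


def confidence(
    transactions: List[Transaction],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> float:
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = count(transactions, antecedent | consequent)
    return both / count(transactions, antecedent)


# Hypothetical market-basket transactions.
baskets: List[Transaction] = [
    frozenset({"Bread", "Coke", "Milk"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Coke", "Diaper", "Milk"}),
    frozenset({"Beer", "Bread", "Diaper", "Milk"}),
    frozenset({"Coke", "Diaper", "Milk"}),
]

milk_coke_support = support(baskets, frozenset({"Milk", "Coke"}))
milk_coke_confidence = confidence(baskets, frozenset({"Milk"}), frozenset({"Coke"}))
diaper_milk_beer_confidence = confidence(
    baskets, frozenset({"Diaper", "Milk"}), frozenset({"Beer"})
)
print(milk_coke_support, milk_coke_confidence, diaper_milk_beer_confidence)
```

In this toy table, Milk and Coke appear together in 3 of 5 baskets (support 0.6), and 3 of the 4 baskets containing Milk also contain Coke (confidence 0.75). For {Diaper, Milk} --> {Beer}, 2 of the 3 baskets containing both Diaper and Milk also contain Beer.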

Clustering

• Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in
other groups

[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
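
The clustering objective above can be checked numerically by comparing two groupings of the same points. This sketch does not perform clustering; it only measures how well a given grouping meets the objective. The function name `mean_distances`, the use of Euclidean distance, and the four toy points are assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def mean_distances(
    points: List[Sequence[float]], labels: List[int]
) -> Tuple[float, float]:
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra: List[float] = []
    inter: List[float] = []
    for i, j in combinations(range(len(points)), 2):
        distance = math.dist(points[i], points[j])
        # Pairs in the same cluster count as intra, all others as inter.
        if labels[i] == labels[j]:
            intra.append(distance)
        else:
            inter.append(distance)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good_intra, good_inter = mean_distances(points, [0, 0, 1, 1])
bad_intra, bad_inter = mean_distances(points, [0, 1, 0, 1])
print(good_intra, good_inter, bad_intra, bad_inter)
```

The grouping that pairs nearby points has a smaller mean intra-cluster distance and a larger mean inter-cluster distance than the grouping that pairs distant points, which is what the objective asks for.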

Clustering: Application 2
• Document Clustering:

• Goal: To find groups of documents that are similar to each other based on
the important terms appearing in them.
• Approach: Identify frequently occurring terms in each document, form a
similarity measure based on the frequencies of different terms, and use it
to cluster.
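
A similarity measure based on term frequencies can be sketched with cosine similarity over word counts. Cosine similarity is one common choice, used here as an assumption since the slides do not name a measure; the three short "documents" are made up, and no filtering for important terms is attempted.

```python
import math
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents returned for the query "Amazon".
doc_forest_1 = term_frequencies("amazon rainforest river wildlife")
doc_forest_2 = term_frequencies("amazon rainforest trees river")
doc_shop = term_frequencies("amazon online shopping delivery")

forest_similarity = cosine_similarity(doc_forest_1, doc_forest_2)
mixed_similarity = cosine_similarity(doc_forest_1, doc_shop)
print(forest_similarity, mixed_similarity)
```

The two rainforest documents share three of their four terms and score about 0.75, while the rainforest and shopping documents share only "amazon" and score about 0.25. A clustering algorithm would use such scores to place the first two documents in the same group.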

Anomaly Detection
• Anomaly detection is the task of identifying observations whose
characteristics are significantly different from the rest of the data.

• Such observations are known as anomalies or outliers.

• The goal of an anomaly detection algorithm is to discover the real
anomalies and avoid falsely labeling normal objects as anomalous.

• A good anomaly detector has
• a high detection rate
• a low false alarm rate.
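
The two quality measures can be computed from ground-truth labels and a detector's flags. The slides do not define them formally, so the definitions below are assumptions: detection rate is taken as the fraction of real anomalies that are flagged, and false alarm rate as the fraction of normal objects that are flagged. The ten labels are toy values.

```python
from typing import Sequence, Tuple


def detection_and_false_alarm_rates(
    is_anomaly: Sequence[bool], flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Return (detection rate, false alarm rate) for a detector's output."""
    true_alarms = sum(1 for a, f in zip(is_anomaly, flagged) if a and f)
    false_alarms = sum(1 for a, f in zip(is_anomaly, flagged) if not a and f)
    anomalies = sum(1 for a in is_anomaly if a)
    normals = len(is_anomaly) - anomalies
    return true_alarms / anomalies, false_alarms / normals


# Hypothetical ground truth (8 normal objects, 2 anomalies) and detector output.
truth = [False, False, False, False, False, False, False, False, True, True]
flags = [False, False, False, True, False, False, False, False, True, False]

detection_rate, false_alarm_rate = detection_and_false_alarm_rates(truth, flags)
print(detection_rate, false_alarm_rate)
```

Here the detector finds one of the two real anomalies (detection rate 0.5) and wrongly flags one of the eight normal objects (false alarm rate 0.125).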
Deviation/Anomaly/Change Detection

• Detect significant deviations from normal behavior

• Applications:
• Credit card fraud detection
• Network intrusion detection
• Identifying anomalous behavior from sensor networks for monitoring and surveillance
• Detecting changes in the global forest cover
