Data Mining
An Introduction
Introduction
• Organizations now collect huge amounts of data, but extracting meaningful
insights is challenging.
• Traditional analysis methods often fall short due to the size, type, or
complexity of the data.
• Data mining combines traditional analysis with advanced algorithms to
handle large and complex datasets.
• It enables new ways of analyzing both emerging and existing types of data.
Large-scale Data is Everywhere!
▪ There has been enormous data growth in both
commercial and scientific databases due to
advances in data generation and collection
technologies
▪ New mantra
▪ Gather whatever data you can whenever and
wherever possible.
▪ Expectations
▪ Gathered data will have value either for the
purpose collected or for a purpose not
envisioned.
[Figure panels: Cyber Security, E-Commerce, Traffic Patterns, Social Networking: Twitter, Sensor Networks, Computational Simulations]
01/17/2018 Introduction to Data Mining, 2nd Edition 3
Business
• Retailers collect real-time customer data to improve decision-making, using technologies like
• Barcodes
• RFID (Radio Frequency Identification)
• Lots of data is being collected and warehoused
• Web data
• Yahoo has petabytes of web data
• Facebook has billions of active users
• Purchases at department/grocery stores, e-commerce
• Amazon handles millions of visits/day
• Bank/Credit Card transactions
• Competitive Pressure is Strong
• Provide better, customized services for an edge (e.g. in Customer Relationship Management)
• Data mining enables business intelligence applications like customer profiling,
marketing, and fraud detection.
Medicine, Science and Engineering
• Researchers in fields like medicine and climate science
• Generate massive, complex datasets
• High-throughput biological data
• Traditional methods struggle to analyze these datasets.
• Data mining techniques
• Uncover patterns
• Answer critical questions
• Help in predictions
[Figures: fMRI Data from Brain; Surface Temperature of Earth]
Medicine, Science and Engineering
• Data collected and stored at enormous speeds
• Remote sensors on a satellite
• NASA EOSDIS archives petabytes of earth science data per year
• Telescopes scanning the skies
• Sky survey data
• Scientific simulations
• Terabytes of data generated in a few hours
• Data mining helps scientists
• In automated analysis of massive datasets
• In hypothesis formation
[Figure: Sky Survey Data]
Medicine, Science and Engineering – Contd.
• Advances like microarray technology allow biologists to
• Analyze thousands of genes simultaneously
• Gain insights into gene functions and disease links
[Figure: Gene Expression Data]
• Because such data is complex and high-dimensional, data mining is
essential for tasks like
• Protein structure prediction
• Pathway modeling
What (is not) /(is) Data Mining?
● What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
● What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien,
O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according
to their context (e.g., Amazon rainforest, Amazon.com)
What is Data Mining? - Many Definitions
• Non-trivial extraction of implicit, previously unknown and potentially
useful information from data
• Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns
Data Mining and Knowledge Discovery
• Data mining is an integral part of knowledge discovery in
databases (KDD)
• The overall process of converting raw data into useful
information.
Series of transformation steps
• The KDD process consists of a series of transformation steps, from data
preprocessing to postprocessing of data mining results.
• Input
• Stored in a variety of formats (flat files, spreadsheets, or relational tables)
• May reside in a centralized data repository or be distributed across multiple sites.
• Preprocessing - transforms the raw input data into an appropriate format for
subsequent analysis.
• Steps include fusing data from multiple sources, cleaning data to remove noise and
duplicate observations, and selecting records and features that are relevant to the
data mining task at hand.
• Because data can be collected and stored in many ways, data preprocessing is
often the most laborious and time-consuming step in the overall
knowledge discovery process.
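The three preprocessing steps above (fuse, clean, select) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed pipeline: the two data sources, the field names, and the rule that a missing amount counts as noise are all assumptions made for the example.

```python
# Hypothetical raw inputs from two sources; every field name is illustrative.
store_sales = [
    {"customer_id": 1, "item": "milk", "amount": 2.5, "cashier": "A"},
    {"customer_id": 2, "item": "bread", "amount": 1.8, "cashier": "B"},
    {"customer_id": 2, "item": "bread", "amount": 1.8, "cashier": "B"},  # duplicate
]
web_sales = [
    {"customer_id": 3, "item": "beer", "amount": None, "cashier": None},  # noisy record
    {"customer_id": 1, "item": "diapers", "amount": 9.0, "cashier": None},
]


def preprocess(sources, features):
    """Fuse sources, drop noisy and duplicate records, keep selected features."""
    # Step 1: fuse data from multiple sources into one list.
    fused = [record for source in sources for record in source]

    # Step 2: clean. A missing amount is treated as noise (an assumption),
    # and exact duplicates are skipped.
    cleaned, seen = [], set()
    for record in fused:
        if record["amount"] is None:
            continue
        key = tuple(sorted(record.items(), key=lambda pair: pair[0]))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)

    # Step 3: select only the features relevant to the task at hand.
    return [{name: record[name] for name in features} for record in cleaned]


records = preprocess([store_sales, web_sales], ["customer_id", "item", "amount"])
for row in records:
    print(row)
```

Real projects usually need far more cleaning logic than this, which is why preprocessing tends to dominate the effort.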
• "Closing the loop" refers to integrating data mining results
into decision support systems.
• Example - In business applications
• Insights from data mining results can be integrated with campaign management
tools so that effective marketing promotions can be conducted and tested.
• Such integration requires a postprocessing step that ensures that
only valid and useful results are incorporated into the decision
support system.
• One example of postprocessing is visualization, which allows analysts to explore the
data and the data mining results from a variety of viewpoints.
• Statistical measures or hypothesis testing methods can also be
applied during postprocessing to eliminate spurious data mining
results.
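As one possible illustration of hypothesis testing during postprocessing, the sketch below runs an exact permutation test on the difference between two group means. The slides do not name a specific test, so the choice of test, the function name, and the spending figures are assumptions for this example only.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    extreme = 0
    total = 0
    # Enumerate every way of relabeling the pooled values into two groups.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical pattern 1: customers who saw a promotion spend much more.
p_strong = permutation_p_value([10, 11, 12, 13], [1, 2, 3, 4])
# Hypothetical pattern 2: a difference that could easily arise by chance.
p_weak = permutation_p_value([5, 6, 7, 8], [5, 6, 7, 9])

print(f"strong pattern p-value: {p_strong:.4f}")
print(f"weak pattern p-value:   {p_weak:.4f}")
```

A pattern with a large p-value is a candidate for removal before results reach the decision support system. Exhaustive enumeration only works for small groups; larger ones would need random permutations instead.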
Motivating Challenges
• Scalability
• The ability to handle very large datasets efficiently.
• To achieve this, algorithms use smart strategies like sampling, parallel
processing, or special data structures.
• High Dimensionality
• High dimensionality refers to datasets with a large number of features or
attributes, which is common in fields like bioinformatics or time-series
analysis.
• Traditional techniques often struggle with such data, as computational
complexity increases with the number of dimensions.
• Heterogeneous and Complex Data
• Includes different types like text, images, graphs, or time series, making
analysis more challenging.
• Modern data mining techniques must handle diverse formats and
relationships such as sequences, structures, and hierarchies.
• Data Ownership and Distribution
• Refers to situations where data is spread across multiple locations or
organizations.
• This requires distributed data mining techniques that reduce communication,
combine results efficiently, and ensure data security.
• Non-traditional Analysis
• Data mining automates the generation and testing of many hypotheses,
unlike the manual, hypothesis-driven approach of traditional statistics.
• It also works with opportunistic and complex datasets that aren't from
controlled experiments.
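Sampling, mentioned under Scalability, can be illustrated with reservoir sampling, which draws a fixed-size uniform sample from a stream too large to hold in memory. This is one sampling strategy among many, chosen here only as an example; the function name and the fixed seed are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# The stream is consumed one item at a time; only k items are ever stored.
sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

An algorithm can then be run on the sample instead of the full dataset, trading some accuracy for a large reduction in cost.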
The Origins of Data Mining
• Data mining originated by combining methods from statistics and AI to
analyze large, complex datasets.
• It adopted techniques from fields like optimization and information
retrieval to improve efficiency and scalability.
• Database systems support data mining by enabling
• efficient storage
• Indexing
• querying of large datasets.
• High-performance and distributed computing help
• manage massive data sizes
• enable analysis across multiple locations.
[Figure: Data mining as a confluence of many disciplines. Draws ideas from
machine learning/AI, pattern recognition, statistics, and database systems.]
Data Mining Tasks
• Prediction Tasks
• Use some variables to predict unknown or future values of other variables.
• The attribute to be predicted is known as the target or dependent variable
• The attributes used for making the prediction are known as the
explanatory or independent variables.
• Description Tasks
• Find human-interpretable patterns that describe the data and summarize
its underlying relationships.
• Often exploratory in nature
• Frequently require postprocessing techniques to validate and explain the
results
Data Mining Tasks … Four of the core data mining tasks
[Figure: a data table surrounded by the four core tasks: Clustering, Predictive Modeling, Association Rules, Anomaly Detection]
Predictive modeling
• Predictive modeling refers to the task of building a model for the
target variable as a function of the explanatory variables.
• There are two types of predictive modeling tasks:
• Classification - used for discrete target variables.
• Regression - used for continuous target variables.
Predictive modeling
• Example
• Classification - Predicting whether a Web user will make a purchase
at an online bookstore is a classification task because the target
variable is binary-valued.
• Regression - Forecasting the future price of a stock is a regression
task because price is a continuous-valued attribute.
• The goal of both tasks is to learn a model that minimizes the error
between the predicted and true values of the target variable.
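The "error between the predicted and true values" is measured differently for the two tasks. A minimal sketch, assuming misclassification rate for classification and mean squared error for regression (common choices, though the slides do not fix a particular measure); all data values are made up.

```python
def misclassification_rate(true_labels, predicted_labels):
    """Fraction of records whose predicted class differs from the true class."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def mean_squared_error(true_values, predicted_values):
    """Average squared difference between predicted and true values."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared) / len(squared)


# Classification: will a Web user make a purchase? (hypothetical labels)
purchase_true = ["buy", "no", "no", "buy"]
purchase_pred = ["buy", "no", "buy", "buy"]
error_rate = misclassification_rate(purchase_true, purchase_pred)

# Regression: future stock price (hypothetical prices)
price_true = [10.0, 12.0, 11.0]
price_pred = [11.0, 12.0, 9.0]
mse = mean_squared_error(price_true, price_pred)

print(f"misclassification rate: {error_rate:.2f}")
print(f"mean squared error:     {mse:.2f}")
```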
Predicting the Type of a Flower
Predictive Modeling: Classification
• Find a model for the class attribute as a function of the
values of other attributes
[Figure: Model for predicting credit worthiness, with the class attribute labeled]
Classification Example
[Figure: a training set with categorical, categorical, and quantitative attributes plus a class column is used to learn a model (classifier), which is then applied to a test set]
Examples of Classification Task
• Classifying credit card transactions as legitimate or fraudulent
• Classifying land covers (water bodies, urban areas, forests, etc.) using
satellite data
• Identifying intruders in cyberspace
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet,
or random coil
Classification – Application - Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
• Use credit card transactions and information about the account holder as attributes.
• When does a customer buy, what do they buy, how often do they pay on
time, etc.
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.
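The approach above (label past transactions, learn a model, apply it to new transactions) can be sketched with a deliberately tiny model: a single threshold on the transaction amount. This is a toy stand-in, not a realistic fraud detector, which would use many attributes; the amounts, labels, and function names are hypothetical.

```python
def learn_threshold(amounts, labels):
    """Pick the amount threshold that misclassifies the fewest training rows."""
    best_threshold = None
    best_errors = len(labels) + 1
    for threshold in sorted(set(amounts)):
        # Rule under test: amounts at or above the threshold are fraud.
        errors = sum(
            (amount >= threshold) != (label == "fraud")
            for amount, label in zip(amounts, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def classify(amount, threshold):
    """Apply the learned model to a new transaction."""
    return "fraud" if amount >= threshold else "fair"


# Labeled past transactions form the training set (made-up values).
train_amounts = [12, 25, 40, 900, 1200, 30]
train_labels = ["fair", "fair", "fair", "fraud", "fraud", "fair"]

threshold = learn_threshold(train_amounts, train_labels)
print("learned threshold:", threshold)
print("new transaction of 1500:", classify(1500, threshold))
print("new transaction of 20:", classify(20, threshold))
```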
Classifying Galaxies
Courtesy: http://aps.umn.edu
Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features,
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Regression
• Predict the value of a given continuous-valued variable based on the
values of other variables, assuming a linear or nonlinear model of
dependency.
• Extensively studied in the statistics and neural network fields.
• Examples:
• Predicting sales amounts of new product based on advertising
expenditure.
• Predicting wind velocities as a function of temperature, humidity,
air pressure, etc.
• Time series prediction of stock market indices.
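For the linear case, a one-variable model can be fit by ordinary least squares. The sketch below uses the advertising example from the list above; the numbers are invented and lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales, exactly y = 2x + 1.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 5.0 + intercept  # predict sales for a new expenditure

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.2f}")
```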
Association Analysis
• Association analysis is used to discover patterns that describe
strongly associated features in the data.
• The discovered patterns are typically represented in the form of
implication rules or feature subsets.
Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items
from a given collection
• Produce dependency rules which will predict occurrence of an item based on
occurrences of other items.
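A rule's strength is commonly summarized by its support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The text above does not name these measures, so treat them as a standard convention rather than the slides' definition. The five transactions below are illustrative data.

```python
# Illustrative market-basket records; each record is a set of items.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, records):
    """Number of records that contain every item in the itemset."""
    return sum(itemset <= record for record in records)


def support(itemset, records):
    """Fraction of records that contain the itemset."""
    return count(itemset, records) / len(records)


def confidence(antecedent, consequent, records):
    """Of the records containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, records) / count(antecedent, records)


rule_support = support({"Milk", "Coke"}, transactions)
rule_confidence = confidence({"Milk"}, {"Coke"}, transactions)
print(f"Milk -> Coke: support={rule_support:.2f}, confidence={rule_confidence:.2f}")

beer_confidence = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
print(f"Diaper, Milk -> Beer: confidence={beer_confidence:.2f}")
```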
Association Analysis: Applications
• Market-basket analysis
• Rules are used for sales promotion, shelf management, and inventory
management
• Medical Informatics
• Rules are used to find combinations of patient symptoms and test results
associated with certain diseases
Association Rule Discovery: Application - Market Basket Analysis
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Clustering
• Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in
other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
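One widely used way to find such groups is k-means, sketched below for 2-D points. The slides define clustering in general and do not prescribe this algorithm; the points and the starting centroids are made up, and the starting centroids are fixed by hand so the run is deterministic.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Points within each resulting group are close to their own centroid and far from the other one, which is the intra-cluster versus inter-cluster contrast described above.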
Clustering: Application 2
• Document Clustering:
• Goal: To find groups of documents that are similar to each other based on
the important terms appearing in them.
• Approach: To identify frequently occurring terms in each document. Form a
similarity measure based on the frequencies of different terms. Use it to
cluster.
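The similarity measure in this approach can be sketched with term counts and cosine similarity. Cosine similarity is one common choice, not the only one, and the three short "documents" are invented to echo the Amazon search example.

```python
from collections import Counter
from math import sqrt


def term_frequencies(text):
    """Count how often each term occurs in a document."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a if term in tf_b)
    norm_a = sqrt(sum(v * v for v in tf_a.values()))
    norm_b = sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents returned for the query "Amazon".
docs = {
    "rainforest": "amazon rainforest trees river wildlife",
    "river": "amazon river rainforest basin",
    "retail": "amazon online store orders shipping",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_nature = cosine_similarity(tf["rainforest"], tf["river"])
sim_mixed = cosine_similarity(tf["rainforest"], tf["retail"])
print(f"rainforest vs river:  {sim_nature:.2f}")
print(f"rainforest vs retail: {sim_mixed:.2f}")
```

A clustering algorithm would then group the two nature documents together and leave the retail document apart, since the first pair scores much higher.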
Anomaly Detection
• Anomaly detection - the task of identifying observations whose characteristics
are significantly different from the rest of the data.
• Such observations are known as anomalies or outliers
• The goal of an anomaly detection algorithm is to discover the real
anomalies and avoid falsely labeling normal objects as anomalous.
• A good anomaly detector has
• a high detection rate
• a low false alarm rate.
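Detection rate and false alarm rate can be computed once a detector's flags are compared against known labels. The sketch below uses a simple z-score detector as a stand-in; the threshold of 2.0, the sensor readings, and the ground-truth labels are all assumptions for illustration.

```python
from statistics import mean, pstdev


def z_score_anomalies(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


def detection_and_false_alarm_rates(true_flags, predicted_flags):
    """Detection rate = flagged anomalies / all anomalies;
    false alarm rate = flagged normal objects / all normal objects."""
    pairs = list(zip(true_flags, predicted_flags))
    hits = sum(t and p for t, p in pairs)
    false_alarms = sum((not t) and p for t, p in pairs)
    anomalies = sum(true_flags)
    normals = len(true_flags) - anomalies
    return hits / anomalies, false_alarms / normals


# Hypothetical sensor readings with one known anomaly (the value 50).
readings = [10, 11, 9, 10, 12, 10, 11, 50, 9, 10]
truth = [False] * 7 + [True] + [False] * 2

flags = z_score_anomalies(readings)
detection_rate, false_alarm_rate = detection_and_false_alarm_rates(truth, flags)
print(f"detection rate={detection_rate:.2f}, false alarm rate={false_alarm_rate:.2f}")
```

Lowering the threshold typically raises both rates, so the two goals have to be balanced against each other.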
Deviation/Anomaly/Change Detection
• Detect significant deviations from normal behavior
• Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection
• Identify anomalous behavior from sensor networks
for monitoring and surveillance.
• Detecting changes in the global forest cover.