CH 1

Uploaded by

S SMRITI

Data Mining

An Introduction
Introduction
• Organizations now collect huge amounts of data, but extracting meaningful
insights is challenging.

• Traditional analysis methods often fall short due to the size, type, or
complexity of the data.

• Data mining combines traditional analysis with advanced algorithms to
handle large and complex datasets.

• It enables new ways of analyzing both emerging and existing types of data.
Large-scale Data is Everywhere!

▪ There has been enormous data growth in both commercial and scientific
databases due to advances in data generation and collection technologies.

▪ New mantra
▪ Gather whatever data you can whenever and wherever possible.

▪ Expectations
▪ Gathered data will have value either for the purpose collected or for a
purpose not envisioned.

[Figure panels: Cyber Security, E-Commerce, Traffic Patterns, Social Networking: Twitter, Sensor Networks, Computational Simulations]

01/17/2018 Introduction to Data Mining, 2nd Edition 3


Business
• Retailers collect real-time customer data through technologies like
barcodes and RFID (Radio Frequency Identification) to improve decision-making.

• Lots of data is being collected and warehoused
• Web data
• Yahoo has petabytes of web data
• Facebook has billions of active users
• Purchases at department/grocery stores, e-commerce
• Amazon handles millions of visits/day
• Bank/credit card transactions
• Competitive pressure is strong
• Provide better, customized services for an edge (e.g. in Customer Relationship Management)

• Data mining enables business intelligence applications like customer profiling,
marketing, and fraud detection.
Medicine, Science and Engineering
• Researchers in fields like medicine and climate science
• Generate massive, complex datasets
• High-throughput biological data
• Traditional methods struggle to analyze such data.

• Data mining techniques
• Uncover patterns
• Answer critical questions
• Help in predictions

[Figures: fMRI Data from Brain; Surface Temperature of Earth]
Medicine, Science and Engineering
• Data collected and stored at enormous speeds
• Remote sensors on a satellite
• NASA EOSDIS archives over petabytes of earth science data / year
• Telescopes scanning the skies
• Sky survey data
• Scientific simulations
• Terabytes of data generated in a few hours

• Data mining helps scientists
• In automated analysis of massive datasets
• In hypothesis formation

[Figure: Sky Survey Data]



Medicine, Science and Engineering – Contd..
• Advances like microarray technology allow biologists to analyze
thousands of genes simultaneously, offering insights into gene functions
and disease links.

• Due to the complex and high-dimensional nature of such data, data mining is
essential for tasks like
• Protein structure prediction
• Pathway modeling

[Figure: Gene Expression Data]
What (is not) /(is) Data Mining?

● What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”

● What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien,
O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according
to their context (e.g., Amazon rainforest, Amazon.com)



What is Data Mining? - Many Definitions

• Non-trivial extraction of implicit, previously unknown and potentially
useful information from data

• Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns



Data Mining and Knowledge Discovery
• Data mining is an integral part of knowledge discovery in
databases (KDD)
• KDD is the overall process of converting raw data into useful
information.



Series of transformation steps
• From data preprocessing to postprocessing of data mining results.



• KDD consists of a series of transformation steps
• From data preprocessing to postprocessing of data mining results.

• Input
• Stored in a variety of formats (flat files, spreadsheets, or relational tables)
• May reside in a centralized data repository or be distributed across multiple sites.

• Preprocessing - transforms the raw input data into an appropriate format for
subsequent analysis.
• Steps
• Fusing data from multiple sources
• Cleaning data to remove noise and duplicate observations
• Selecting records and features that are relevant to the data mining task at
hand.

• Because data can be collected and stored in many ways, data preprocessing is
the most laborious and time-consuming step in the overall knowledge
discovery process.
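
The three preprocessing steps above (fusing, cleaning, selecting) can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout, the field names, the two "stores", and the helper names fuse, clean and select are all hypothetical, and treating a missing value as noise is an assumption made for the example.

```python
def fuse(*sources):
    """Fuse records from several sources into one list."""
    fused = []
    for source in sources:
        fused.extend(source)
    return fused


def clean(records, required):
    """Drop records with missing required fields, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        if any(record.get(field) is None for field in required):
            continue  # assumption: a missing value counts as noise here
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # duplicate observation
        seen.add(key)
        cleaned.append(record)
    return cleaned


def select(records, features, keep=lambda record: True):
    """Keep only relevant records and project them onto the chosen features."""
    return [{f: r[f] for f in features} for r in records if keep(r)]


# Hypothetical data collected at two sites.
store_a = [
    {"customer": "c1", "item": "milk", "amount": 2.5, "clerk": "x"},
    {"customer": "c2", "item": "bread", "amount": None, "clerk": "y"},
]
store_b = [
    {"customer": "c1", "item": "milk", "amount": 2.5, "clerk": "x"},
    {"customer": "c3", "item": "beer", "amount": 7.0, "clerk": "z"},
]

prepared = select(
    clean(fuse(store_a, store_b), required=["amount"]),
    features=["customer", "item", "amount"],
)
print(prepared)
```

The record with a missing amount and the repeated record are dropped, and the "clerk" field is discarded as irrelevant to the (hypothetical) mining task.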
• "Closing the loop" refers to integrating data mining results
into decision support systems.
• Example - in business applications
• Insights from data mining results can be integrated with campaign management
tools so that effective marketing promotions can be conducted and tested.
• Such integration requires a postprocessing step that ensures that
only valid and useful results are incorporated into the decision
support system.
• An example of postprocessing is visualization, which allows analysts to explore the
data and the data mining results from a variety of viewpoints.
• Statistical measures or hypothesis testing methods can also be
applied during postprocessing to eliminate spurious data mining
results.
Motivating Challenges
• Scalability
• The ability to handle very large datasets efficiently.
• To achieve this, algorithms use smart strategies like sampling, parallel
processing, or special data structures.

• High Dimensionality
• High dimensionality refers to datasets with a large number of features or
attributes, which is common in fields like bioinformatics or time-series
analysis.
• Traditional techniques often struggle with such data, as computational
complexity increases with the number of dimensions.
• Heterogeneous and Complex Data
• Includes different types like text, images, graphs, or time series, making
analysis more challenging.
• Modern data mining techniques must handle diverse formats and
relationships such as sequences, structures, and hierarchies.

• Data Ownership and Distribution
• Refers to situations where data is spread across multiple locations or
organizations.
• This requires distributed data mining techniques that reduce communication,
combine results efficiently, and ensure data security.
• Non-traditional Analysis
• Data mining automates the generation and testing of many hypotheses,
unlike the manual, hypothesis-driven approach of traditional statistics.
• It also works with opportunistic and complex datasets that aren't from
controlled experiments.
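
Sampling is one of the scalability strategies listed above. As a hedged illustration, the sketch below uses reservoir sampling, a technique the slides do not name, to draw a fixed-size uniform sample from a stream too large to hold in memory. The function name and the seed parameter are assumptions made for the example.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5, seed=0)
print(sample)
```

Only k items are kept in memory at any time, which is why this kind of sampling suits very large datasets.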
The Origins of Data Mining
• Data mining originated by combining methods from statistics and AI to
analyze large, complex datasets.

• It adopted techniques from fields like optimization and information
retrieval to improve efficiency and scalability.

• Database systems support data mining by enabling
• efficient storage
• indexing
• querying of large datasets.

• High-performance and distributed computing help
• manage massive data sizes
• enable analysis across multiple locations.
Data mining as a confluence of many disciplines
• Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
Data Mining Tasks
• Prediction Tasks
• Use some variables to predict unknown or future values of other variables.
• The attribute to be predicted is known as the target or dependent variable
• The attributes used for making the prediction are known as the
explanatory or independent variables.

• Description Tasks
• Find human-interpretable patterns that describe the data and summarize
the underlying relationships in it.
• Often exploratory in nature
• Frequently require postprocessing techniques to validate and explain the
results
Data Mining Tasks … Four of the core data mining tasks

[Figure: Data at the center, surrounded by the four tasks: Clustering, Predictive Modeling, Association Rules, Anomaly Detection]
Predictive modeling
• Predictive modeling refers to the task of building a model for the
target variable as a function of the explanatory variables.

• There are two types of predictive modeling tasks:
• Classification - used for discrete target variables,
• Regression - used for continuous target variables.
Predictive modeling
• Example
• Classification - Predicting whether a Web user will make a purchase
at an online bookstore is a classification task because the target
variable is binary-valued.

• Regression - Forecasting the future price of a stock is a regression
task because price is a continuous-valued attribute.

• The goal of both tasks is to learn a model that minimizes the error
between the predicted and true values of the target variable.
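
The "error between the predicted and true values" is measured differently for the two tasks. The sketch below assumes two common choices that the slides do not specify: misclassification rate for classification and mean squared error for regression. All labels and values are hypothetical.

```python
def misclassification_rate(true_labels, predicted_labels):
    """Fraction of records whose predicted class differs from the true class."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def mean_squared_error(true_values, predicted_values):
    """Average squared difference between predicted and true values."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared) / len(squared)


# Classification: will the Web user make a purchase? (hypothetical labels)
buy_true = ["yes", "no", "no", "yes"]
buy_pred = ["yes", "no", "yes", "yes"]
print(misclassification_rate(buy_true, buy_pred))

# Regression: stock price forecasts (hypothetical values)
price_true = [10.0, 12.0, 11.0]
price_pred = [11.0, 12.0, 9.0]
print(mean_squared_error(price_true, price_pred))
```

A learning algorithm for either task would search for the model that makes its chosen error measure as small as possible on the data.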
Predicting the Type of a Flower
Predictive Modeling: Classification
• Find a model for class attribute as a function of the
values of other attributes

[Figure: Model for predicting credit worthiness, with the Class column marked]



Classification Example
[Figure: table with two categorical attributes, one quantitative attribute, and a class column. The Training Set is used to learn a classifier (Model), which is then applied to the Test Set.]



Examples of Classification Task

• Classifying credit card transactions as legitimate or fraudulent

• Classifying land covers (water bodies, urban areas, forests, etc.) using
satellite data

• Identifying intruders in the cyberspace

• Predicting tumor cells as benign or malignant

• Classifying secondary structures of protein as alpha-helix, beta-sheet,
or random coil



Classification – Application - Fraud Detection

• Goal: Predict fraudulent cases in credit card transactions.

• Approach:
• Use credit card transactions and the information on the account holder as attributes.
• When the customer buys, what the customer buys, how often the customer pays on
time, etc.
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.
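
The approach above (label past transactions, learn a model, apply it to new transactions) can be sketched with a very small classifier. The slides do not say which classifier to use, so this example assumes a k-nearest-neighbor vote. The two attributes, their scaling, and every transaction in the history are hypothetical.

```python
from collections import Counter
import math


def knn_predict(training_set, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest labeled records."""
    by_distance = sorted(
        training_set,
        key=lambda record: math.dist(record[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled history:
# (purchase amount in $100s, hour of day / 24) -> class attribute
history = [
    ((0.2, 0.50), "fair"),
    ((0.3, 0.55), "fair"),
    ((0.5, 0.75), "fair"),
    ((0.4, 0.40), "fair"),
    ((9.0, 0.10), "fraud"),
    ((8.5, 0.15), "fraud"),
    ((9.5, 0.05), "fraud"),
]

# Apply the learned "model" to new transactions on the account.
print(knn_predict(history, (8.8, 0.12)))
print(knn_predict(history, (0.35, 0.60)))
```

In practice fraud is rare, so real training sets are heavily imbalanced; the balanced toy history here is chosen only to keep the example short.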



Classifying Galaxies
Courtesy: http://aps.umn.edu

Class:
• Stages of Formation (Early, Intermediate, Late)

Attributes:
• Image features,
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB



Regression
• Predict a value of a given continuous valued variable based on the
values of other variables, assuming a linear or nonlinear model of
dependency.
• Extensively studied in the statistics and neural network fields.
• Examples:
• Predicting sales amounts of a new product based on advertising
expenditure.
• Predicting wind velocities as a function of temperature, humidity,
air pressure, etc.
• Time series prediction of stock market indices.
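
As a minimal sketch of the first example, assuming a linear model of dependency with a single explanatory variable, an ordinary least squares line can be fitted in closed form. The advertising and sales figures are hypothetical, and fit_line is a name chosen for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising spend and resulting sales, both in $1000s.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

intercept, slope = fit_line(advertising, sales)
predicted = intercept + slope * 6.0  # predicted sales for a spend of 6
print(intercept, slope, predicted)
```

The toy data lie exactly on a line, so the fit is perfect; with real data the fitted line is the one that minimizes the sum of squared errors.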
Association Analysis
• Association analysis is used to discover patterns that describe
strongly associated features in the data.

• The discovered patterns are typically represented in the form of
implication rules or feature subsets.
Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items
from a given collection

• Produce dependency rules which will predict occurrence of an item based on
occurrences of other items.



Association Analysis: Applications
• Market-basket analysis
• Rules are used for sales promotion, shelf management, and inventory
management

• Medical Informatics
• Rules are used to find combinations of patient symptoms and test results
associated with certain diseases



Association Rule Discovery: Application
-Market Basket Analysis

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
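
How strongly a rule such as {Diaper, Milk} --> {Beer} holds is usually judged by its support and confidence. These two measures are standard in association analysis but are not defined on this slide, so the sketch below is an assumption about how the rules would be evaluated. The baskets are hypothetical and are not the slide's original transaction table.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(itemset <= set(t) for t in transactions)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(
        sorted(lhs), "-->", sorted(rhs),
        "support", support(baskets, lhs | rhs),
        "confidence", confidence(baskets, lhs, rhs),
    )
```

A rule discovery algorithm would keep only the rules whose support and confidence exceed user-chosen thresholds.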



Clustering

• Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in
other groups
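
One common way to find such groups is k-means, which the slide does not name; the sketch below assumes it purely for illustration. It uses Euclidean distance and, to stay deterministic, takes the first k points as initial centroids. The points are hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Points within each resulting cluster are close to one another and far from the points of the other cluster. The result of k-means depends on the initial centroids, so real uses typically try several initializations.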

[Figure: Intra-cluster distances are minimized; inter-cluster distances are maximized]



Clustering: Application 2
• Document Clustering:

• Goal: To find groups of documents that are similar to each other based on
the important terms appearing in them.
• Approach: Identify frequently occurring terms in each document. Form a
similarity measure based on the frequencies of different terms. Use it to
cluster.
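
The similarity measure in the approach above can be sketched as follows. Cosine similarity over raw term counts is assumed here as one reasonable choice, since the slide does not fix a particular measure. The three documents are hypothetical.

```python
from collections import Counter
import math


def term_frequencies(text):
    """Count how often each lowercase word occurs in a document."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count ** 2 for count in tf_a.values()))
    norm_b = math.sqrt(sum(count ** 2 for count in tf_b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical search results for "Amazon".
docs = {
    "d1": "amazon rainforest river forest species",
    "d2": "amazon river basin rainforest climate",
    "d3": "amazon online store orders shipping",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(sim_12, sim_13)
```

The two rainforest documents score higher with each other than with the shopping document, so a clustering algorithm using this measure would tend to group them together.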



Anomaly Detection
• Anomaly detection - Task of identifying observations whose characteristics
are significantly different from the rest of the data.

• Such observations are known as anomalies or outliers.

• The goal of an anomaly detection algorithm is to discover the real
anomalies and avoid falsely labeling normal objects as anomalous.

• A good anomaly detector has
• high detection rate
• low false alarm rate.
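
The two quality criteria can be made concrete with a small sketch. The detector below assumes a simple z-score rule, which is one of many possible detectors and is not prescribed by the slides. The readings, the ground-truth labels, and the threshold of 2.0 are hypothetical.

```python
import statistics


def zscore_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [abs(v - mean) / spread > threshold for v in values]


def detection_and_false_alarm_rates(flags, truth):
    """Detection rate = flagged anomalies / true anomalies;
    false alarm rate = flagged normal objects / normal objects."""
    detected = sum(f and t for f, t in zip(flags, truth))
    false_alarms = sum(f and not t for f, t in zip(flags, truth))
    anomalies = sum(truth)
    normals = len(truth) - anomalies
    return detected / anomalies, false_alarms / normals


# Hypothetical sensor readings; the last one is a known anomaly.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 10.0, 25.0]
truth = [False] * 9 + [True]

flags = zscore_anomalies(readings)
rates = detection_and_false_alarm_rates(flags, truth)
print(flags)
print(rates)
```

Lowering the threshold raises the detection rate but also the false alarm rate, which is the trade-off a good detector has to balance.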
Deviation/Anomaly/Change Detection

• Detect significant deviations from normal behavior

• Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection
• Identify anomalous behavior from sensor networks
for monitoring and surveillance.
• Detecting changes in the global forest cover.

