
02 DM BI Data Mining

The document discusses the importance of data mining in uncovering hidden patterns and insights from large datasets, facilitating data-driven decision-making, and enhancing customer understanding. It outlines the Knowledge Discovery in Data (KDD) process, which includes steps like data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Additionally, it highlights various data mining functionalities, applications, and the challenges faced in the field.

Uploaded by

batch0406sem
Data Mining and Business Intelligence
Module 2: Data Mining
Created/Adopted/Modified for
Data Mining and Business Intelligence – MCA II Semester
Vidya Vikas Institute of Engineering & Technology
Mysore
2023-24
GPD
Why Data Mining?
 Data Mining
 Enables knowledge discovery by uncovering hidden patterns and insights in large datasets.
 Supports data-driven decision-making across various domains.
 Enhances customer understanding and enables personalized marketing.
 Helps detect fraud and manage risks effectively.
 Facilitates market and competitive analysis for strategic decision-making.
 Optimizes business processes by identifying inefficiencies.
 Contributes to scientific research and advancements across diverse fields.
What Is Data Mining?
Data Mining (Knowledge Discovery from Data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
Knowledge Discovery (KDD)
KDD Process
 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Knowledge Discovery from Data (KDD) Process
 Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or KDD,
while others view data mining as merely an essential step in the
process of knowledge discovery.
 The knowledge discovery process is an iterative sequence of the
following 7 steps:
Knowledge Discovery from Data (KDD) Process
 1. Data cleaning (to remove noise and inconsistent data)
 2. Data integration (where multiple data sources may be
combined)
 3. Data selection (where data relevant to the analysis task are
retrieved from the database)
 4. Data transformation (where data are transformed and
consolidated into forms appropriate for mining by performing
summary or aggregation operations)
Knowledge Discovery from Data (KDD) Process
 5. Data mining (an essential process where intelligent methods
are applied to extract data patterns)
 6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on interestingness measures)
 7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined
knowledge to users)
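The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the two store data sets, the function names, and the 50% support threshold are assumptions made for the example, not part of the slides, and the "mining" step is reduced to simple item counting.

```python
from collections import Counter

# Hypothetical raw sources: (customer, item, amount). None marks a missing value.
store_a = [("c1", "bread", 2.0), ("c1", "milk", 1.5), ("c2", "bread", None)]
store_b = [("c2", "milk", 1.4), ("c3", "bread", 2.1), ("c3", "milk", 1.6)]

def clean(rows):
    """Step 1, data cleaning: drop rows whose amount is missing."""
    return [r for r in rows if r[2] is not None]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [row for src in sources for row in src]

def select(rows):
    """Step 3, data selection: keep only the fields relevant to the task."""
    return [(cust, item) for cust, item, _ in rows]

def transform(pairs):
    """Step 4, data transformation: consolidate rows into one basket per customer."""
    baskets = {}
    for cust, item in pairs:
        baskets.setdefault(cust, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5, data mining: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support=0.5):
    """Step 6, pattern evaluation: keep items whose support meets the threshold."""
    return {i: c / n_baskets for i, c in counts.items() if c / n_baskets >= min_support}

def present(patterns):
    """Step 7, knowledge presentation: render patterns as readable lines."""
    return [f"{item}: support {s:.0%}" for item, s in sorted(patterns.items())]

rows = integrate(clean(store_a), clean(store_b))
baskets = transform(select(rows))
patterns = evaluate(mine(baskets), len(baskets))
report = present(patterns)
print("\n".join(report))
```

In practice the process is iterative, so the output of the evaluation step often sends the analyst back to an earlier step with a different selection or transformation.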
Data Mining or KDD ?
 The term data mining is often used to refer to the entire knowledge
discovery process.
 Therefore, we adopt a broad view of data mining functionality:
 Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
 The data sources can include databases, data warehouses, the Web,
other information repositories, or data that are streamed into the
system dynamically.
Data Mining in Business Intelligence
[Figure: pyramid of layers, listed here from top to bottom. The potential to support business decisions increases toward the top; the role typically responsible for each layer is given in parentheses.]
 Decision Making (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
 Data Preprocessing/Integration, Data Warehouses (DBA)
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
KDD Process: A View from ML and Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
 Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
 Data Mining: pattern discovery, classification, clustering, outlier analysis, …
 Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
 This is a view from typical machine learning and statistics communities
Data Mining: On What Kinds of Data?
 Data Mining is performed on different kinds of data :

1. Database
2. Data Warehouse
3. Transactional Database
4. Other Kinds of Data
 1. Database-oriented data sets and applications
 Relational database, Object-relational databases,
Heterogeneous databases and legacy databases
Data Mining: On What Kinds of Data?
 2. Data Warehouse - A data warehouse is usually
modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an
attribute or a set of attributes in the schema, and each
cell stores the value of some aggregate measure such as
count or sum(sales amount).
 A data cube provides a multidimensional view of data
and allows the precomputation and fast access of
summarized data.
[Figure: data cube example]
Data Mining: On What Kinds of Data?
 2. Data Warehouse
 Provides multidimensional data views and
precomputation of summarized data.
 OLAP (OnLine Analytical Processing) operations
make use of background knowledge regarding
the domain of the data being studied to
allow the presentation of data at
different levels of abstraction.
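A data cube and a roll-up can be sketched with plain dictionaries. The sales facts, the dimension names, and the `aggregate` helper below are hypothetical and chosen only for illustration; a real warehouse would precompute and store such aggregates rather than rebuild them on each call.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, quarter, item, sales_amount)
facts = [
    ("Mysore", "Q1", "phone", 100),
    ("Mysore", "Q1", "laptop", 300),
    ("Mysore", "Q2", "phone", 150),
    ("Bengaluru", "Q1", "phone", 200),
    ("Bengaluru", "Q2", "laptop", 400),
]
DIMENSIONS = ("city", "quarter", "item")

def aggregate(facts, keep):
    """Return sum(sales_amount) per cell, grouped by the dimensions in `keep`.

    Every dimension that is not listed in `keep` is summed away.
    """
    idx = [DIMENSIONS.index(d) for d in keep]
    cells = defaultdict(int)
    for row in facts:
        cells[tuple(row[i] for i in idx)] += row[3]
    return dict(cells)

by_city_quarter = aggregate(facts, ("city", "quarter"))  # a 2-D view of the cube
by_city = aggregate(facts, ("city",))                    # roll-up: quarter summed away
total = aggregate(facts, ())                             # apex cell: grand total

print(by_city_quarter)
print(by_city)
print(total)
```

Each dictionary key plays the role of a cube cell, and the stored number is the aggregate measure sum(sales amount). Rolling up from (city, quarter) to (city) presents the same data at a coarser level of abstraction.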
Data Mining: On What Kinds of Data?
 3. A transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web
page.
 A transaction typically includes a unique transaction identity
number (trans ID) and a list of the items making up the transaction,
such as the items purchased in the transaction.
 A transactional database may have additional tables, which contain other
information related to the transactions, such as item description, information
about the salesperson or the branch, and so on.
Data Mining: On What Kinds of Data?
 4. Other kinds of data : Advanced data sets and advanced
applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-
sequences)
 Structure data, graphs, social networks and information
networks
 Spatial data and spatio-temporal data
 Multimedia database, Text databases, The World-Wide Web
What Kinds of Patterns Can Be Mined?
 There are a number of data mining functionalities.
 These include (1) characterization and discrimination; (2) the mining of
frequent patterns, associations, and correlations; (3) classification and
regression; (4) clustering analysis; and (5) outlier analysis.
 Data mining functionalities are used to specify the kinds of patterns to be
found in data mining tasks
 Such tasks can be classified into two categories:
 Descriptive mining tasks characterize properties of the data in a target data set.
 Predictive mining tasks perform induction on the current data in order to make predictions.
Data Mining Functionalities (Tasks)
 Descriptive Mining
 Goal: Find human-interpretable patterns that describe the data.
 Example: Which products are often bought together?
 Predictive Mining
 Goal: Use some variables (observations from the past) to predict unknown or future values of other variables.
 Example: Will a person click on an online advertisement, given her browsing history?
Data Mining & Machine Learning
 Machine Learning Terminology
 Descriptive Mining == Unsupervised Learning
 Predictive Mining == Supervised Learning
 Machine learning investigates how computers can learn (or improve their performance) based on data.
 A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data.
Data Mining & Machine Learning
 Data Mining Tasks & Applications :

1. Classification [Predictive]
2. Cluster Analysis [Descriptive]
3. Association Analysis [Descriptive]
4. Regression Analysis [Predictive]
5. Sequential Pattern Discovery [Descriptive]
6. Deviation Detection (Anomaly Detection) [Predictive]
Data Mining Functions: (1) Characterization and Discrimination
 Data characterization is a summarization of the general characteristics or features of a
target class of data.
 For example : To study the characteristics of software products with sales that increased
by 10% in the previous year.
 The data cube-based OLAP roll-up operation can be used
to perform user-controlled data summarization along a
specified dimension.
 Data discrimination is a comparison of the general features
of the target class data objects against the general features
of objects from one or multiple contrasting classes.
 For example, a user may want to compare the general features
of software products with sales that increased by 10% last
year against those with sales that decreased by at least 30%
during the same period.
Data Mining Functions: (2) Pattern Discovery
 Frequent patterns (or frequent itemsets)
 What items are frequently purchased together in your Supermarket?
 Association and Correlation Analysis
 A typical association rule
 Nail polish ⇒ Eyeliner [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly correlated?
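As a small sketch of frequent pattern discovery, the snippet below counts every pair of items that occurs together and keeps the pairs seen in at least `min_count` transactions. The five baskets are the ones from the transaction table in the record-data slide; the function name and the threshold of 3 are assumptions for illustration. This brute-force enumeration is not one of the scalable algorithms used on real data.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def frequent_pairs(transactions, min_count):
    """Count co-occurring item pairs and keep those seen at least min_count times."""
    counts = Counter()
    for basket in transactions:
        # sorted() gives each pair one canonical order, e.g. ("Coke", "Milk")
        counts.update(combinations(sorted(basket), 2))
    return {pair: c for pair, c in counts.items() if c >= min_count}

pairs = frequent_pairs(transactions, min_count=3)
print(pairs)
```

With these baskets only (Coke, Milk) and (Diaper, Milk) reach the threshold; every other pair appears in at most two transactions.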
Data Mining Functions: (3) Classification & Regression
 Classification and label prediction. Supervised Learning
 Construct models (functions) based on some training examples
 Describe and distinguish classes or concepts for future prediction
 Ex. 1. Classify countries based on (climate)
 Ex. 2. Classify cars based on (mileage)
 Predict some unknown class labels
 Typical methods
 Decision trees, naïve Bayesian classification,
support vector machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, …
 Typical applications:
 Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
Data Mining Functions: (3) Classification & Regression
 A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
 A neural network is a collection of neuron-like processing units with weighted connections between the units.
 Regression analysis is a statistical methodology that is most often used for numeric prediction.
 Regression models continuous-valued functions.
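The two ideas above can be sketched briefly. The tree below is written by hand rather than learned from training data, and its attributes (`mileage_kmpl`, `price_lakh`), thresholds, and class labels are hypothetical. The regression part fits a straight line by ordinary least squares, which is one common choice for numeric prediction and an assumption here.

```python
# Internal nodes test an attribute; leaves carry a class label.
tree = {
    "attribute": "mileage_kmpl",
    "threshold": 18,
    "below": {"label": "low economy"},
    "at_or_above": {
        "attribute": "price_lakh",
        "threshold": 10,
        "below": {"label": "economy"},
        "at_or_above": {"label": "premium economy"},
    },
}

def classify(node, record):
    """Follow one branch per test until a leaf is reached."""
    while "label" not in node:
        below = record[node["attribute"]] < node["threshold"]
        node = node["below"] if below else node["at_or_above"]
    return node["label"]

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - b * mean_x, b

car_class = classify(tree, {"mileage_kmpl": 22, "price_lakh": 7})
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(car_class)      # economy
print(a + b * 5)      # predicted value at x = 5
```

Classification returns a discrete label, while the fitted line returns a continuous value for any x, which is the distinction the slide draws between the two.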
Data Mining Functions: (4) Cluster Analysis
 Unsupervised learning (i.e., Class
label is unknown)
 Group data to form new categories
(i.e., clusters), e.g., cluster houses to
find distribution patterns
 Principle: Maximizing intra-class
similarity & minimizing interclass
similarity
 Many methods and applications
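One widely used clustering method is k-means, shown here as a minimal sketch. The slides do not name a specific algorithm, so the choice of k-means, the six 2-D points, and the hand-picked starting centers are all assumptions for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centers, max_iter=100):
    """Alternate between assigning points and recomputing centers until stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical house locations (x, y)
houses = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centers, clusters = kmeans(houses, centers=[houses[0], houses[3]])
print(clusters)
```

No class labels are supplied; the two groups emerge only from the distances, so points inside a cluster are close to each other and far from the other cluster.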
Data Mining Functions: (5) Outlier Analysis
 Outlier analysis
 Outlier: A data object that does not comply with the general
behavior of the data
 Noise or exception? One person’s garbage could be another person’s treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
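A very simple way to flag outliers is a z-score rule: mark values that lie more than a chosen number of standard deviations from the mean. This rule is an assumption for illustration and is not one of the methods listed above; the transaction amounts and the threshold of 2 are hypothetical.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold * std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical card transaction amounts
amounts = [10, 12, 11, 13, 12, 11, 95]
flagged = z_score_outliers(amounts)
print(flagged)  # [95]
```

Whether 95 is noise to discard or a fraud case worth investigating depends on the application, which is the point of the "garbage or treasure" remark.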
Data Mining Functions: (6) Time and Ordering:
Sequential Pattern, Trend and Evolution Analysis
 Sequence, trend and evolution analysis
 Trend, time-series, and deviation analysis
 e.g., regression and value prediction
 Sequential pattern mining
 e.g., buy digital camera, then buy large memory cards
 Periodicity analysis
 Motifs and biological sequence analysis
 Approximate and consecutive motifs
 Similarity-based analysis
 Mining data streams
 Ordered, time-varying, potentially infinite, data streams
Data Mining Functions: (7) Structure and Network Analysis
 Graph mining
 Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
 Information network analysis
 Social networks: actors (objects, nodes) and relationships (edges)
 e.g., author networks in CS, terrorist networks
 Multiple heterogeneous networks
 A person could be in multiple information networks: friends, family, classmates, …
 Links carry a lot of semantic information: Link mining
 Web mining
 Web is a big information network: from PageRank to Google
 Analysis of Web information networks
 Web community discovery, opinion mining, usage mining, …
Is All Mined Knowledge Interesting?
 A data mining system has the potential to generate thousands or even millions
of patterns, or rules.
 1. Are all of the patterns interesting?
 No. Only a small fraction would actually be of interest to a given user.
 2. What makes a pattern interesting?
 A pattern is interesting if it is (1) easily understood by humans, (2) valid on
new or test data with some degree of certainty, (3) potentially useful, and (4)
novel.
 A pattern is also interesting if it validates a hypothesis that the user sought to
confirm.

An interesting pattern represents knowledge.


Is All Mined Knowledge Interesting?
Objective measures of pattern interestingness:
 Consider an association rule X ⇒ Y.
 Support: the percentage of transactions in a transaction database that the given rule satisfies. This is the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y.
support(X ⇒ Y) = P(X ∪ Y)
 Confidence: assesses the degree of certainty of the detected association. This is the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y.
confidence(X ⇒ Y) = P(Y | X)
 Example: Idly Rice ⇒ Uddina Bele [0.4%, 80%] (support, confidence). Here 0.4% of all transactions contain both items, and 80% of the transactions that contain Idly Rice also contain Uddina Bele.
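Support and confidence follow directly from their definitions. The sketch below uses the five baskets from the transaction table in the record-data slide; the function names and the rule Milk ⇒ Coke are chosen only for illustration.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, x, y):
    """P(Y | X): among transactions containing X, the fraction also containing Y."""
    x, y = set(x), set(y)
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    return n_xy / n_x if n_x else 0.0

s = support(transactions, {"Milk", "Coke"})
c = confidence(transactions, {"Milk"}, {"Coke"})
print(f"Milk => Coke [{s:.0%}, {c:.0%}]")  # Milk => Coke [60%, 75%]
```

Milk and Coke appear together in 3 of the 5 baskets (support 60%), and 3 of the 4 baskets that contain Milk also contain Coke (confidence 75%).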
Is All Mined Knowledge Interesting?
 3. Can a data mining system generate all of the interesting
patterns?
 Refers to the completeness of a data mining algorithm.
 It is unrealistic and inefficient for data mining systems to generate
all possible patterns.
 Association rule mining is an example where the use of constraints
and interestingness measures can ensure the completeness of
mining.
Is All Mined Knowledge Interesting?
 4. Can the system generate only the interesting ones?
 This is an optimization problem in data mining.
 It is highly desirable for data mining systems to generate only
interesting patterns.
 Progress has been made in this direction; however, such
optimization remains a challenging issue in data mining.
Data Mining: Which Technologies are Used?
Why Confluence of Multiple Disciplines?
 Tremendous amount of data
 Algorithms must be scalable to handle big data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social and information networks
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
Applications of Data Mining
 Web page analysis: classification, clustering, ranking
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 Biological and medical data analysis
 Data mining and software engineering
 Data mining and text analysis
 Data mining and social and information network analysis
 Built-in (invisible data mining) functions in Google, MS, Yahoo!, LinkedIn, Facebook, …
 Major dedicated data mining systems/tools
 SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools
Major Issues in Data Mining (1)
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results
Major Issues in Data Mining (2)
 Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
Summary
 Data mining: Discovering interesting patterns and knowledge from massive amounts of data
 A natural evolution of science and information technology, in great demand, with wide
applications
 A KDD process includes data cleaning, data integration, data selection, transformation, data
mining, pattern evaluation, and knowledge presentation
 Mining can be performed on a variety of data
 Data mining functionalities: characterization, discrimination, association, classification,
clustering, trend and outlier analysis, etc.
 Data mining technologies and applications
 Major issues in data mining
Data, Types of Data, Datasets
Data
 Data sets differ in a number of ways
– Quantitative or Qualitative (Nominal, ordinal, interval, ratio)
– Binary or Discrete or Continuous
– Asymmetric, ordered, sequential, time-series, etc.
 The type of data determines which tools and techniques can be used to analyze that data.
What is Data?
 A data set is a collection of data objects. Data objects are described by their attributes.
 An attribute is a property or characteristic of an object that may vary from one object to another or from one time to another
– Examples: eye color of a person, temperature, etc.
 A collection of attributes describes an object
 In the table below, each row is an object and each column is an attribute:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Objects and Attributes

 Data objects: other names
– Record, point, vector, pattern, event, case, sample, entity.
 Attributes: other names
– Variable, characteristic, field, feature, dimension.
Attribute Values

 Attribute values are numbers or symbols assigned to an attribute for a particular object
 Distinction between attributes and attribute values
– The same attribute can be mapped to different attribute values
 Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
 Example: attribute values for ID and age are integers
– But the properties of an attribute can differ from the properties of the values used to represent it
Attributes and Measurement

 A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
 The process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object.
Types of Attributes or
Levels of Measurements
 There are different types of attributes
– Nominal
 You can categorize (distinct) your data by labelling them in mutually exclusive groups,
but there is no order between the categories.
– Ordinal
 You can categorize and rank (order) your data in an order, but you cannot say anything
about the intervals between the rankings.
– Interval
 You can categorize, rank, and infer equal intervals (differences are meaningful) between
neighboring data points, but there is no true zero point.
– Ratio
 You can categorize, rank, and infer equal intervals between neighboring data points, and
there is a true zero point (ratio).
Types of Attributes or Levels of Measurements
 Nominal : You can categorize (distinct) your data by labelling them in mutually exclusive groups, but there
is no order between the categories.
– Examples: ID numbers, eye color, zip codes, City of birth, Gender, Ethnicity, Car brands, Marital status
 Ordinal : You can categorize and rank (order) your data in an order, but you cannot say anything about the
intervals between the rankings.
– Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium,
short}, Top 5 Olympic medallists, Language ability (e.g., beginner, intermediate, fluent), Likert-type
questions (e.g., very dissatisfied to very satisfied)

 Interval : You can categorize, rank, and infer equal intervals (differences are meaningful) between
neighboring data points, but there is no true zero point.
– Examples: calendar dates, temperatures in Celsius or Fahrenheit, Test scores (e.g., IQ or exams).

 Ratio : You can categorize, rank, and infer equal intervals between neighboring data points, and there is a
true zero point (ratio).
– Examples: temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race), height, age, weight.
Properties of Attribute Values

 The type of an attribute depends on which of the following properties/operations it possesses:
– Distinctness: = ≠
– Order: < >
– Differences are meaningful: + -
– Ratios are meaningful: * /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences
– Ratio attribute: all 4 properties/operations
Attribute Type | Description | Examples | Operations
Categorical (Qualitative)
Nominal | Nominal attribute values only distinguish. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | Ordinal attribute values also order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Numeric (Quantitative)
Interval | For interval attributes, differences between values are meaningful. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, current | geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens.
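The link between attribute type and permissible summary statistic can be illustrated with Python's standard `statistics` module. The sample values below are hypothetical, and `statistics.geometric_mean` assumes Python 3.8 or later.

```python
import statistics

# Nominal: only equality is meaningful, so the mode is a valid summary.
eye_color = ["brown", "blue", "brown", "green", "brown"]
most_common = statistics.mode(eye_color)

# Ordinal: order is meaningful, so the median is valid (computed on rank positions).
ORDER = ["good", "better", "best"]
ratings = ["good", "best", "better", "best", "good"]
median_rating = ORDER[statistics.median_low(ORDER.index(r) for r in ratings)]

# Interval: differences are meaningful, so the arithmetic mean is valid.
temp_celsius = [20.0, 22.0, 19.0, 23.0]
mean_temp = statistics.mean(temp_celsius)

# Ratio: ratios are meaningful, so the geometric mean is valid.
lengths_m = [2.0, 8.0]
geo_mean = statistics.geometric_mean(lengths_m)

print(most_common, median_rating, mean_temp, geo_mean)
```

Each statistic also remains valid for the types below it in the table: a ratio attribute has a meaningful mode, median, and mean, but a nominal attribute has only a mode.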
Types of Attributes by Number of Values

 Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
 Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.

Asymmetric Attributes

 Only presence (a non-zero attribute value) is regarded as important; such data are typically sparse
 Words present in documents
 Items present in customer transactions
 Example: a student object has an attribute recording whether the student has arrears (back papers). The possible values are 1 (true) and 0 (false).
– Only a few students will have arrears, so for any comparison it is meaningful to consider the 1s and ignore the 0s, since most students will have 0.
 Consider only the presence of an attribute, not its absence.
Asymmetric Attributes

 Binary attributes where only non-zero values are important are called asymmetric binary attributes.
 Useful in association analysis.
 Asymmetric discrete and asymmetric continuous attributes are also possible.
 If we met a friend in the grocery store, would we ever say the following?
“I see our purchases are very similar since we didn’t buy most of the same things.”
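The grocery-store remark can be made concrete by comparing two similarity measures on binary purchase vectors. The simple matching coefficient counts shared absences as agreement, while the Jaccard coefficient ignores them. Neither measure is named in the slides, so treating them as the relevant pair is an assumption, and the ten-item store and both baskets are hypothetical.

```python
def simple_matching(a, b):
    """Fraction of positions where the two binary vectors agree (1-1 or 0-0)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def jaccard(a, b):
    """Shared presences divided by positions where at least one vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0  # convention chosen here for all-zero input

# One position per product in the store: 1 = bought, 0 = not bought
me     = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
friend = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]

smc = simple_matching(me, friend)
jac = jaccard(me, friend)
print(smc, jac)
```

The simple matching score is 0.8 only because both shoppers skipped the same seven products; the Jaccard score of 1/3 reflects that just one of the three products bought by either shopper was bought by both.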
Types of data sets
 There are many different types of data sets, and new types keep popping up. We will group the types into 3 groups:
– Record
– Graph
– Ordered
 But first, we will look at some characteristics that apply to data sets:
– Dimensionality
– Sparsity
– Resolution
Types of Data Sets
Important Characteristics of Data

Dimensionality (number of attributes that objects in the data set possess)
 Data with a small number of dimensions tends to be qualitatively different from moderate- or high-dimensional data.
 High-dimensional data brings a number of challenges.
 Curse of dimensionality: the difficulty associated with analysing high-dimensional data.

Sparsity
 Only presence counts, as with asymmetric attributes
 Advantage: only non-zero values need to be stored and processed

Size
 The type of analysis may depend on the size of the data
Types of Data Sets
Important Characteristics of Data

Resolution
 Properties of data are different at different levels of resolution
 Patterns depend on the scale
Types of data sets
 Record
– Data Matrix
– Sparse Data Matrix (Document Data)
– Transaction Data
 Graph
– Data with Relationship among Objects (WWW)
– Data with Objects that are Graphs (Molecule)
 Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

Types of Data Sets: (1) Record Data
 Relational records
 Relational tables, highly structured
 Data matrix, e.g., numerical matrix, crosstabs

 Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

 Document data: Term-frequency vector (matrix) of text documents

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
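A term-frequency vector can be built by counting how often each vocabulary term occurs in a document. The sketch below assumes a fixed vocabulary modeled on the column labels of the slide's matrix and a naive whitespace tokenizer; the sample sentence is hypothetical.

```python
from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_frequency_vector(text, vocabulary):
    """Count occurrences of each vocabulary term in the text (one matrix row)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

doc = "Coach said the team will play the game and the team will win"
vector = term_frequency_vector(doc, vocabulary)

# Sparse form: store only the non-zero entries.
sparse = {term: n for term, n in zip(vocabulary, vector) if n}

print(vector)
print(sparse)
```

Most entries of such a vector are zero for any realistic vocabulary, which is why document data is usually stored as a sparse data matrix.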


Types of Data Sets: (2) Graphs and Networks
 Transportation network

 World Wide Web

 Molecular Structures

 Social or information networks


Types of Data Sets: (3) Ordered Data
 Video data: sequence of images

 Temporal data: time-series

 Sequential Data: transaction sequences


 Genetic sequence data
Ordered Data – Time Series Data
 A special type of sequential data. Each record is a time series – a
series of measurements taken over time.
 Temporal Auto-correlation (if 2 measurements are close in time,
then their values are often similar).
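Temporal auto-correlation can be measured by correlating a series with a copy of itself shifted by one time step. The sketch below uses a common sample estimator of the lag-1 autocorrelation; the two temperature series are hypothetical.

```python
import statistics

def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    mean = statistics.fmean(series)
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean)
        for i in range(len(series) - lag)
    )
    return covariance_sum / variance_sum

smooth = [20, 21, 22, 23, 24, 25, 24, 23, 22, 21]   # neighbours are similar
jumpy = [20, 25, 20, 25, 20, 25, 20, 25, 20, 25]    # neighbours always differ

r_smooth = autocorrelation(smooth)
r_jumpy = autocorrelation(jumpy)
print(round(r_smooth, 3), round(r_jumpy, 3))
```

A value near +1 means measurements close in time tend to be similar, as in the smooth series; the alternating series gives a strongly negative value instead.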

Types of Data Sets: Spatial, image and multimedia Data
 Spatial data: maps

 Image data:

 Video data:
Spatial Data
 Objects have spatial attributes (like positions or areas).
 Spatial auto-correlation: objects that are physically close tend to be similar in other ways also.
 Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Handling Non-Record Data

 Most data mining algorithms are designed for record data or its variations.
 What to do with non-record data?
– Extract features from data objects and use these features to create a record corresponding to each object.
 This works well for some cases.
 In some other cases, this type of representation does not capture all information about the data.
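Feature extraction from non-record data can be sketched by turning a graph into a fixed-length record. The molecule graphs and the three features chosen here (node count, edge count, maximum degree) are hypothetical and only illustrate the idea.

```python
# Hypothetical molecule graphs as adjacency lists: atom -> bonded atoms
water = {"O": ["H1", "H2"], "H1": ["O"], "H2": ["O"]}
methane = {
    "C": ["H1", "H2", "H3", "H4"],
    "H1": ["C"], "H2": ["C"], "H3": ["C"], "H4": ["C"],
}

def graph_to_record(name, graph):
    """Summarize an undirected graph as one record with fixed attributes."""
    degrees = [len(neighbours) for neighbours in graph.values()]
    return {
        "name": name,
        "n_nodes": len(graph),
        "n_edges": sum(degrees) // 2,  # each edge is listed from both endpoints
        "max_degree": max(degrees),
    }

records = [graph_to_record("water", water), graph_to_record("methane", methane)]
for record in records:
    print(record)
```

The resulting records fit any algorithm that expects record data, but they no longer say which atom is bonded to which, so some information in the original graph is lost.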
