Data Mining and Business Intelligence
Module 2: Data Mining
Created/Adopted/Modified for
Data Mining and Business Intelligence – MCA II Semester
Vidya Vikas Institute of Engineering & Technology
Mysore
2023-24
GPD
Why Data Mining?
KDD Process
Figure: KDD process (databases, data cleaning, data integration, data mining, knowledge discovery)
This is the view held by the typical database systems
and data warehousing communities.
Data mining plays an essential role in the
knowledge discovery process.
Knowledge Discovery from Data (KDD) Process
Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or KDD,
while others view data mining as merely an essential step in the
process of knowledge discovery.
The knowledge discovery process is an iterative sequence of the
following 7 steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis task are
retrieved from the database)
4. Data transformation (where data are transformed and
consolidated into forms appropriate for mining by performing
summary or aggregation operations)
5. Data mining (an essential process where intelligent methods
are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined
knowledge to users)
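The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch, not a prescribed implementation: the two store data sets, the function names, and the pair-counting "mining" step are all assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources of purchase records: (transaction id, item, quantity).
store_a = [(1, "bread", 1), (1, "milk", 2), (2, "bread", 1), (2, None, 1)]
store_b = [(3, "Milk ", 1), (3, "bread", 1), (4, "beer", -1), (4, "milk", 1)]

def clean(rows):
    """Step 1: drop noisy rows (missing item, non-positive quantity), normalize names."""
    return [(tid, item.strip().lower(), qty)
            for tid, item, qty in rows
            if item is not None and qty > 0]

def integrate(*sources):
    """Step 2: combine several data sources into one list of rows."""
    return [row for source in sources for row in source]

def select(rows, wanted_items):
    """Step 3: keep only the rows relevant to the analysis task."""
    return [row for row in rows if row[1] in wanted_items]

def transform(rows):
    """Step 4: consolidate rows into one set of items per transaction."""
    baskets = {}
    for tid, item, _qty in rows:
        baskets.setdefault(tid, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count how often each pair of items occurs in the same transaction."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

def evaluate(pair_counts, min_count):
    """Step 6: keep only pairs that meet a simple interestingness threshold."""
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

def present(patterns):
    """Step 7: render the mined knowledge in a readable form."""
    return [f"{a} and {b} bought together in {n} transactions"
            for (a, b), n in sorted(patterns.items())]

rows = integrate(clean(store_a), clean(store_b))
rows = select(rows, {"bread", "milk", "beer"})
baskets = transform(rows)
patterns = evaluate(mine(baskets), min_count=2)
report = present(patterns)
print(report)
```

In practice the process is iterative: a poor result at the evaluation step usually sends the analyst back to cleaning, selection, or transformation.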
Data Mining or KDD ?
The term data mining is often used to refer to the entire knowledge
discovery process.
Therefore, we adopt a broad view of data mining functionality:
Data Exploration
Statistical Summary, Querying, and Reporting
Data Mining: On What Kinds of Data?
1. Database
2. Data Warehouse
3. Transactional Database
4. Other Kinds of Data
1. Database-oriented data sets and applications
Relational databases, object-relational databases,
heterogeneous databases and legacy databases
2. Data Warehouse - A data warehouse is usually
modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an
attribute or a set of attributes in the schema, and each
cell stores the value of some aggregate measure such as
count or sum(sales amount).
A data cube provides a multidimensional view of data
and allows the precomputation and fast access of
summarized data.
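As a rough illustration of the idea, the sketch below precomputes sum(sales amount) for every cell of a three-dimensional cube and then aggregates one or more dimensions away. The sales facts, the dimension names (quarter, city, item), and the helper names `build_cube` and `roll_up` are assumptions invented for the example.

```python
from collections import defaultdict

# Hypothetical sales facts: (quarter, city, item, sales amount).
facts = [
    ("Q1", "Mysore", "phone", 100),
    ("Q1", "Mysore", "laptop", 250),
    ("Q1", "Bengaluru", "phone", 300),
    ("Q2", "Mysore", "phone", 150),
    ("Q2", "Bengaluru", "laptop", 400),
]

def build_cube(rows):
    """Precompute sum(sales amount) for every cell (quarter, city, item)."""
    cube = defaultdict(int)
    for quarter, city, item, amount in rows:
        cube[(quarter, city, item)] += amount
    return dict(cube)

def roll_up(cube, keep):
    """Aggregate over every dimension except the positions listed in `keep`."""
    summary = defaultdict(int)
    for cell, total in cube.items():
        summary[tuple(cell[i] for i in keep)] += total
    return dict(summary)

cube = build_cube(facts)
by_quarter_city = roll_up(cube, keep=(0, 1))  # the item dimension is aggregated away
by_quarter = roll_up(cube, keep=(0,))         # city and item are aggregated away
print(by_quarter_city)
print(by_quarter)
```

Because the cell totals are computed once, each summary is answered from the stored aggregates rather than from the raw facts, which is the "precomputation and fast access" mentioned above.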
Figure: data cube example
A data warehouse provides multidimensional data views and
precomputation of summarized data.
OLAP (online analytical processing) operations
make use of background knowledge regarding
the domain of the data being studied to
allow the presentation of data at
different levels of abstraction.
3. A transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web
page.
A transaction typically includes a unique transaction identity
number (trans ID) and a list of the items making up the transaction,
such as the items purchased in the transaction.
A transactional database may have additional tables, which contain other
information related to the transactions, such as item description, information
about the salesperson or the branch, and so on.
4. Other kinds of data: advanced data sets and advanced
applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and information
networks
Spatial data and spatio-temporal data
Multimedia databases, text databases, the World Wide Web
What Kinds of Patterns Can Be Mined?
There are a number of data mining functionalities.
These include (1) characterization and discrimination; (2) the mining of
frequent patterns, associations, and correlations; (3) classification and
regression; (4) clustering analysis; and (5) outlier analysis.
Data mining functionalities are used to specify the kinds of patterns to be
found in data mining tasks.
Such tasks can be classified into two categories, predictive and descriptive:
1. Classification [Predictive]
2. Cluster Analysis [Descriptive]
3. Association Analysis [Descriptive]
4. Regression Analysis [Predictive]
5. Sequential Pattern Discovery [Descriptive]
6. Deviation Detection (Anomaly Detection) [Predictive]
Data Mining Functions: (1) Characterization and Discrimination
Data characterization is a summarization of the general characteristics or features of a
target class of data.
For example: studying the characteristics of software products whose sales increased
by 10% in the previous year.
The data cube-based OLAP roll-up operation can be used
to perform user-controlled data summarization along a
specified dimension.
Data discrimination is a comparison of the general features
of the target class data objects against the general features
of objects from one or multiple contrasting classes.
For example, a user may want to compare the general features
of software products with sales that increased by 10% last
year against those with sales that decreased by at least 30%
during the same period.
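A minimal sketch of both ideas, assuming a made-up product list and a made-up `summarize` helper: characterization summarizes the target class, and discrimination places that summary next to the summary of a contrasting class. The thresholds (at least +10% and at most -30%) mirror the example above.

```python
from statistics import mean

# Hypothetical products: (name, category, price, sales change in % over last year).
products = [
    ("EditPro", "productivity", 40, 12),
    ("NoteMax", "productivity", 30, 10),
    ("GameX", "games", 60, -35),
    ("PuzzleY", "games", 20, -30),
    ("CalcZ", "productivity", 25, 15),
]

def summarize(rows):
    """Characterization: summarize the general features of one class of products."""
    categories = [category for _name, category, _price, _change in rows]
    return {
        "count": len(rows),
        "mean_price": mean(price for _name, _category, price, _change in rows),
        "most_common_category": max(set(categories), key=categories.count),
    }

# Target class: sales increased by at least 10%.
target = [p for p in products if p[3] >= 10]
# Contrasting class: sales decreased by at least 30%.
contrast = [p for p in products if p[3] <= -30]

# Discrimination: compare the two summaries side by side.
target_profile = summarize(target)
contrast_profile = summarize(contrast)
print(target_profile)
print(contrast_profile)
```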
Data Mining Functions: (2) Pattern Discovery
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your Supermarket?
Association and Correlation Analysis
Support: the percentage of transactions in a transaction database that the given rule
satisfies. This is the probability P(X ∪ Y),
where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of
itemsets X and Y.
support(X ⇒ Y) = P(X ∪ Y)
Confidence: assesses the degree of certainty of the detected association. This is the
conditional probability P(Y | X), that is, the probability that a transaction containing
X also contains Y.
confidence(X ⇒ Y) = P(Y | X)
Example: Idly Rice ⇒ Uddina Bele [0.4%, 80%] (support, confidence)
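The two measures can be computed directly from their definitions. The sketch below uses a small toy transaction list and two example rules chosen only for illustration; the function names `support` and `confidence` are likewise assumptions for the example.

```python
# Toy transaction database: each transaction is the set of items purchased.
transactions = [
    {"bread", "coke", "milk"},
    {"beer", "bread"},
    {"beer", "coke", "diaper", "milk"},
    {"beer", "bread", "diaper", "milk"},
    {"coke", "diaper", "milk"},
]

def support(transactions, x, y):
    """support(X => Y): fraction of transactions containing every item of X and Y."""
    both = x | y
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """confidence(X => Y) = P(Y | X): among transactions containing X, the fraction containing Y."""
    containing_x = [t for t in transactions if x <= t]
    if not containing_x:
        return 0.0
    return sum(y <= t for t in containing_x) / len(containing_x)

s1 = support(transactions, {"diaper"}, {"milk"})
c1 = confidence(transactions, {"diaper"}, {"milk"})
s2 = support(transactions, {"milk"}, {"coke"})
c2 = confidence(transactions, {"milk"}, {"coke"})
print(f"diaper => milk [{s1:.0%}, {c1:.0%}]")
print(f"milk => coke [{s2:.0%}, {c2:.0%}]")
```

Both rules have the same support here, but the first holds in every transaction that contains its left-hand side while the second does not, which is what confidence is meant to capture.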
Is All Mined Knowledge Interesting?
3. Can a data mining system generate all of the interesting
patterns?
This refers to the completeness of a data mining algorithm.
It is unrealistic and inefficient for data mining systems to generate
all possible patterns.
Association rule mining is an example where the use of constraints
and interestingness measures can ensure the completeness of
mining.
4. Can the system generate only the interesting ones?
This is an optimization problem in data mining.
It is highly desirable for data mining systems to generate only
interesting patterns.
Progress has been made in this direction; however, such
optimization remains a challenging issue in data mining.
Data Mining: Which Technologies are Used?
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be scalable to handle big data
High-dimensionality of data
Microarray data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social and information networks
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Applications of Data Mining
Web page analysis: classification, clustering, ranking
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis
Data mining and software engineering
Data mining and text analysis
Data mining and social and information network analysis
Built-in (invisible data mining) functions in Google, MS, Yahoo!, LinkedIn, Facebook, …
Major dedicated data mining systems/tools
(e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools)
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
Summary
Data mining: discovering interesting patterns and knowledge from massive amounts of data
A natural evolution of science and information technology, in great demand, with wide
applications
A KDD process includes data cleaning, data integration, data selection, transformation, data
mining, pattern evaluation, and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association, classification,
clustering, trend and outlier analysis, etc.
Data mining technologies and applications
Major issues in data mining
Data, Types of Data, Datasets
Data
What is Data?
A data set is a collection of data objects.
Data objects are described by their attributes.
An attribute is a property or characteristic of an object that may vary from one object to
another or from one time to another.
– Examples: eye color of a person, temperature, etc.
A collection of attributes describes an object.
In the table below, rows are objects and columns are attributes.
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Objects and Attributes
Attribute Values
Types of Attributes or
Levels of Measurements
There are different types of attributes:
– Nominal
You can categorize your data by labelling them in distinct, mutually exclusive groups,
but there is no order between the categories.
– Ordinal
You can categorize and rank (order) your data, but you cannot say anything
about the intervals between the rankings.
– Interval
You can categorize, rank, and infer equal intervals (differences are meaningful) between
neighboring data points, but there is no true zero point.
– Ratio
You can categorize, rank, and infer equal intervals between neighboring data points, and
there is a true zero point (ratios are meaningful).
Nominal: You can categorize your data by labelling them in distinct, mutually exclusive groups, but there
is no order between the categories.
– Examples: ID numbers, eye color, zip codes, city of birth, gender, ethnicity, car brands, marital status
Ordinal: You can categorize and rank (order) your data, but you cannot say anything about the
intervals between the rankings.
– Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium,
short}, top 5 Olympic medallists, language ability (e.g., beginner, intermediate, fluent), Likert-type
questions (e.g., very dissatisfied to very satisfied)
Interval: You can categorize, rank, and infer equal intervals (differences are meaningful) between
neighboring data points, but there is no true zero point.
– Examples: calendar dates, temperatures in Celsius or Fahrenheit, test scores (e.g., IQ or exams).
Ratio: You can categorize, rank, and infer equal intervals between neighboring data points, and there is a
true zero point (ratios are meaningful).
– Examples: temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race), height, age,
weight.
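The difference between the interval and ratio levels can be checked with a short calculation. In this sketch the two temperature readings are arbitrary illustrative values: differences agree across the Celsius and Kelvin scales, while ratios only make sense on the Kelvin scale, which has a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15

# Two illustrative readings in degrees Celsius.
cool_c, warm_c = 10.0, 20.0
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios depend on where zero sits, so only the Kelvin ratio is meaningful.
ratio_c = warm_c / cool_c
ratio_k = warm_k / cool_k

print(f"difference: {diff_c:.2f} (Celsius) vs {diff_k:.2f} (Kelvin)")
print(f"ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

The Celsius ratio of 2 does not mean "twice as hot"; on the Kelvin scale the warmer reading is only a few percent higher.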
Properties of Attribute Values
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes
If we met a friend in the grocery store, would we ever say the
following?
“I see our purchases are very similar since we didn’t buy most of the same
things.”
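One way to make the grocery-store point concrete is to compare two similarity measures on binary purchase vectors. The slides do not name these measures; the simple matching coefficient and the Jaccard coefficient are brought in here purely as an illustration, and the two baskets are made up.

```python
def simple_matching(a, b):
    """Fraction of positions where the two binary vectors agree (0-0 and 1-1 both count)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """Fraction of shared presences among positions where at least one vector has a 1."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0

# Hypothetical baskets over 10 store items: 1 = bought, 0 = not bought.
basket_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
basket_b = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

smc = simple_matching(basket_a, basket_b)
jac = jaccard(basket_a, basket_b)
print(f"simple matching: {smc:.2f}, Jaccard: {jac:.2f}")
```

The two shoppers bought nothing in common, yet simple matching rates them as fairly similar because it counts the items neither of them bought. Jaccard ignores those 0-0 matches, which is the behaviour wanted for asymmetric attributes.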
Types of data sets
There are many different types of data sets, and new types keep popping
up. We will group the types into three groups:
● Record
● Graph
● Ordered
But first, we will look at some characteristics that apply to data sets:
● Dimensionality
● Sparsity
● Resolution
Types of Data Sets
Important Characteristics of Data
● Dimensionality (the number of attributes that the objects in the data set
possess)
Data with a small number of dimensions tends to be qualitatively different from
moderate- or high-dimensional data.
High-dimensional data brings a number of challenges.
Curse of dimensionality: the difficulty associated with analysing high-dimensional
data.
● Sparsity
Only presence counts, as with asymmetric attributes.
Advantage: only non-zero values need to be stored and processed.
● Size
The type of analysis may depend on the size of the data.
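The storage and processing advantage of sparsity can be sketched with a dictionary that keeps only the non-zero entries of a vector. The two count vectors and the helper names `to_sparse` and `sparse_dot` are assumptions for the example, not part of any particular library.

```python
def to_sparse(vector):
    """Keep only the non-zero entries, stored as {position: value}."""
    return {i: v for i, v in enumerate(vector) if v != 0}

def sparse_dot(a, b):
    """Dot product that only visits positions stored in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

# Two illustrative count vectors with many zero entries.
dense_1 = [3, 0, 5, 0, 2, 6, 0, 2, 0, 2]
dense_2 = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]

doc1 = to_sparse(dense_1)
doc2 = to_sparse(dense_2)
dot = sparse_dot(doc1, doc2)

print(doc1)
print(doc2)
print(dot)
```

With real document or transaction data, where most entries are zero, this representation can be far smaller than the dense one, and operations such as the dot product skip the zeros entirely.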
● Resolution
Properties of data are different at different levels of resolution.
Patterns depend on the scale.
● Perspective
Types of data sets
Record
– Data Matrix
– Sparse Data Matrix (Document Data)
– Transaction Data
Graph
– Data with Relationship among Objects (WWW)
– Data with Objects that are Graphs (Molecule)
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Types of Data Sets: (1) Record Data
Relational records
Relational tables, highly structured
Data matrix, e.g., numerical matrix, crosstabs
Document data (term counts per document):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Molecular Structures
Types of Data Sets: Spatial, Image and Multimedia Data
Spatial data: maps
Image data:
Video data:
Spatial data example: average monthly temperature of land and ocean
Handling Non-Record Data
Most data mining algorithms are designed for record data or its
variations.
What to do with non-record data?
– Extract features from data objects and use these features to create a
record corresponding to each object.
This works well for some cases.
In some other cases, this type of representation does not capture all
information about the data.
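As one possible illustration of feature extraction, the sketch below turns free-text documents (non-record data) into fixed-length records of word counts. The two sentences, the four-word vocabulary, and the helper name `to_record` are assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical text documents.
documents = {
    "doc1": "The team lost the game after a late timeout.",
    "doc2": "The coach praised the team; the team won the game.",
}

# The features to extract: one count per vocabulary word (an illustrative choice).
vocabulary = ["team", "coach", "game", "timeout"]

def to_record(text, vocabulary):
    """Extract one fixed-length record: the count of each vocabulary word in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return [words[term] for term in vocabulary]

records = {name: to_record(text, vocabulary) for name, text in documents.items()}
for name, record in records.items():
    print(name, record)
```

The resulting records can be fed to algorithms designed for record data, but word order and any word outside the vocabulary are lost, which is an instance of a representation that does not capture all information about the data.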