DMDW Unit1

Data mining is the process of extracting knowledge from large datasets, and it involves several steps including data selection, preprocessing, transformation, mining, pattern evaluation, knowledge representation, and deployment. The KDD (Knowledge Discovery in Databases) process is a structured approach to data mining that aims to find meaningful patterns and insights. Data mining functionalities include classification, prediction, association analysis, and clustering, while challenges include data quality, scalability, privacy, and interpretability.


DMDW-UNIT1

Q) What is data mining? Explain the steps involved in the data mining process.
Or
What is KDD? What are the steps involved in KDD?
Data mining refers to extracting or “mining” knowledge from large amounts
of data. Data mining is the process of analyzing enormous amounts of
information and datasets, extracting (or “mining”) useful intelligence to help
organizations solve problems, predict trends, reduce risks, and find new
opportunities.
KDD is referred to as Knowledge Discovery in Database and is defined as a
method of finding, transforming, and refining meaningful data and patterns
from a raw database in order to be utilized in different domains or
applications.
KDD Process in Data Mining
The KDD process in data mining is a multi-step process that involves various
stages to extract useful knowledge from large datasets. The following are
the main steps involved in the KDD process -
Data Selection - The first step in the KDD process is identifying and selecting
the relevant data for analysis. This involves choosing the relevant data
sources, such as databases, data warehouses, and data streams, and
determining which data is required for the analysis.
Data Preprocessing - After selecting the data, the next step is data
preprocessing. This step involves cleaning the data, removing outliers, and
removing missing, inconsistent, or irrelevant data. This step is critical, as the
data quality can significantly impact the accuracy and effectiveness of the
analysis.
Data Transformation - Once the data is preprocessed, the next step is to
transform it into a format that data mining techniques can analyze. This step
involves reducing the data dimensionality, aggregating the data, normalizing
it, and discretizing it to prepare it for further analysis.


Data Mining - This is the heart of the KDD process and involves applying
various data mining techniques to the transformed data to discover hidden
patterns, trends, relationships, and insights. A few of the most common data
mining techniques include clustering, classification, association rule mining,
and anomaly detection.
Pattern Evaluation - After the data mining, the next step is to evaluate the
discovered patterns to determine their usefulness and relevance. This
involves assessing the quality of the patterns, evaluating their significance,
and selecting the most promising patterns for further analysis.
Knowledge Representation - This step involves representing the knowledge
extracted from the data in a way humans can easily understand and use.
This can be done through visualizations, reports, or other forms of
communication that provide meaningful insights into the data.
Deployment - The final step in the KDD process is to deploy the knowledge
and insights gained from the data mining process to practical applications.
This involves integrating the knowledge into decision-making processes or
other applications to improve organizational efficiency and effectiveness.
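As an illustration only, the steps above could look roughly like the following Python sketch (pandas and scikit-learn are assumed tool choices, and the toy customer data is made up):

# A compact, hypothetical walk-through of the KDD steps.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Data selection: keep only the task-relevant attributes (toy data, illustrative).
raw = pd.DataFrame({
    "age":       [25, 38, 45, 23, 51, 33, None, 29],
    "income":    [30, 62, 75, 21, 88, 49, 55, -1],   # in thousands; -1 is an entry error
    "purchases": [2, 7, 9, 1, 11, 5, 6, 3],
})
data = raw[["age", "income", "purchases"]]

# 2. Data preprocessing: remove records with missing or clearly invalid values.
data = data.dropna()
data = data[data["income"] > 0]

# 3. Data transformation: normalize the attributes so they are on comparable scales.
X = StandardScaler().fit_transform(data)

# 4. Data mining: discover groups of similar customers with clustering.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# 5. Pattern evaluation / knowledge representation: inspect the discovered clusters.
print(pd.Series(model.labels_).value_counts())
print(model.cluster_centers_)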

Q) Explain the functionalities of Data mining.


Classification
Classification is a supervised learning technique used to categorize data into
predefined classes or labels. It involves training a model using a labeled
dataset, which is then used to predict the class of new, unlabeled data.

Common algorithms used for classification include decision trees, support vector machines, and neural networks.
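For illustration, a minimal sketch of classification with a decision tree (scikit-learn is an assumed tool choice; the toy labeled data is made up):

from sklearn.tree import DecisionTreeClassifier

# Toy labeled dataset (illustrative): [age, income] -> buys_computer (1 = yes, 0 = no)
X_train = [[25, 30000], [42, 60000], [35, 45000], [23, 20000], [51, 80000]]
y_train = [0, 1, 1, 0, 1]

# Train the model on the labeled data.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predict the class of a new, unlabeled tuple.
print(clf.predict([[30, 40000]]))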
Prediction
Prediction in data mining is the process of using historical data and patterns
to make informed estimates about future or missing data values.
Estimating missing values is very important for the accuracy of the analysis, and prediction is one of the data mining functionalities that helps the analyst find missing numeric values.
Class Description
This is one of the data mining functionalities that is used to associate data
with a class or concept. There are two key categories in this context - data
characterization and data discrimination.
Data characterization
Data characterization is the process of summarizing general characteristics
or features of a class of data by defining the target class through specific
rules.
Data Discrimination - Data discrimination, in contrast, is concerned with
distinguishing between different classes or categories within a dataset. It
aims to find features or patterns that can effectively separate one class from
another.
Association Analysis
It examines the group of items that commonly appear together within a
transactional dataset. This technique is often called Market Basket Analysis
due to its prevalent application in the retail industry. To establish association
rules, two key parameters are employed:
Support - This parameter identifies the frequency of occurrence of a
particular item set within the database.

Confidence - Confidence represents the conditional probability that an item will appear in a transaction, given the occurrence of another item in the same transaction.
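A minimal sketch of how support and confidence could be computed for a rule such as bread => butter on a made-up set of transactions (the items and counts are purely illustrative):

# Toy transactional dataset (illustrative).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

antecedent, consequent = {"bread"}, {"butter"}
n = len(transactions)

# Support: fraction of transactions containing both itemsets.
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
support = both / n

# Confidence: conditional probability of the consequent given the antecedent.
antecedent_count = sum(1 for t in transactions if antecedent <= t)
confidence = both / antecedent_count

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.60, confidence = 0.75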
Cluster Analysis
Cluster analysis in data mining is a method used to group similar data points together based on their inherent characteristics or attributes. It aims to discover patterns and relationships within data by identifying clusters, or groups, of data points that share common features.
Outlier Analysis
Outlier analysis in data mining is the process of identifying and examining
data points that significantly deviate from the expected or normal patterns
within a dataset. These outliers, often anomalies or exceptions, may hold
valuable information or indicate errors, fraud, or rare events.
Correlation Analysis
Correlation analysis in data mining involves examining the statistical
relationships between two or more variables within a dataset. It quantifies
the degree to which changes in one variable are associated with changes in
another, providing insights into their interdependence.
One of the most common examples of such attributes is height and weight.
Researchers often use these two variables to find if there is any relationship
between them.
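As a small sketch (with made-up height and weight values), the Pearson correlation coefficient between two numeric attributes can be computed directly:

import numpy as np

# Illustrative height (cm) and weight (kg) measurements.
height = np.array([150, 160, 165, 170, 180, 185])
weight = np.array([50, 56, 63, 66, 75, 82])

# Pearson correlation: +1 strong positive, 0 none, -1 strong negative.
r = np.corrcoef(height, weight)[0, 1]
print(round(r, 3))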
Q) What are the Challenges and Limitations of Data Mining?
Below are some challenges and Limitations of Data Mining:
Data Quality: Poor data quality, including missing values, inconsistent
formatting, and errors, can lead to inaccurate or unreliable results
Scalability: Large datasets require significant computational resources and
efficient algorithms to process and analyze
Privacy and Security: Data mining raises concerns regarding data privacy
and security, as sensitive information may be exposed or misused

Interpretability: Complex models may be difficult to understand and interpret, making it challenging to explain their predictions and results.
Q) Discuss the classification of data mining systems.
Data mining is an interdisciplinary field in which several other fields interconnect, including database systems, statistics, machine learning, visualization, and information science.

Database Data
A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set
of software programs to manage and access the data.
A relational database is a collection of tables, each of which is assigned a
unique name. Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows).
Each tuple in a relational table represents an object identified by a unique
key and described by a set of attribute values.

A relational database for AllElectronics: the company is described by the following relation tables: customer, item, employee, and branch. The headers of these tables are shown in the figure.

Relational data can be accessed by database queries written in a relational query language (e.g., SQL) or with the assistance of graphical user interfaces. When mining relational databases, we can go further by searching for trends or data patterns.
Statistics
A statistical model is a set of mathematical functions that describe the
behavior of the objects in a target class in terms of random variables and
their associated probability distributions. Statistical models are widely used
to model data and data classes
Machine Learning
Machine learning investigates how computers can learn (or improve their
performance) based on data. They can be classified as
Supervised Learning
In supervised learning, the machine learns under supervision: a model is trained on a labeled dataset and then used to make predictions. A labeled dataset is one where you already know the target answer.

Unsupervised Learning:
In unsupervised learning, the machine uses unlabeled data and learns on its own without any supervision. The machine tries to find patterns in the unlabeled data and gives a response.

Semi-supervised learning
Semi-supervised learning is a class of machine learning techniques that
make use of both labeled and unlabeled examples when learning a model. In
one approach, labeled examples are used to learn class models and
unlabeled examples are used to refine the boundaries between classes.
Data Warehouses
A data warehouse integrates data originating from multiple sources and
various timeframes. It consolidates data in multidimensional space to form
partially materialized data cubes. The data cube model not only facilitates
OLAP in multidimensional databases but also promotes multidimensional
data mining.


Information Retrieval
Information retrieval (IR) is the science of searching for documents or
information in documents. Documents can be text or multimedia, and may
reside on the Web.
Q) Explain the data mining task primitives.
A data mining task can be specified in the form of a data mining query, which is input to the data mining system.
A data mining query is defined in terms of data mining task primitives.
The data mining primitives specify the following:
1. The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest.
2. The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
Example:
The search for association rules is confined to those matching a given meta-rule, such as
age(X, "30..39") ^ income(X, "40K..49K") => buys(X, "VCR") [2.2%, 60%]
and
occupation(X, "student") ^ age(X, "20..29") => buys(X, "computer") [1.4%, 70%]
The former rule states that customers in their thirties, with an annual income of between 40K and 49K, are likely (with 60% confidence) to purchase a VCR, and such cases represent about 2.2% of the total number of transactions.


3. The background knowledge to be used in the discovery process:


The background knowledge allows data to be mined at multiple levels of abstraction. Concept hierarchies, for example, are one form of background knowledge that allows data to be mined at multiple levels of abstraction.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. A concept hierarchy for the dimension location is shown in the figure, mapping low-level concepts (i.e., cities) to more general concepts (i.e., countries).
The concept hierarchy consists of four levels (levels 0 through 3). In our example, level 1 represents the concept country, while levels 2 and 3 represent the concepts province_or_state and city, respectively.
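As a rough sketch, such a hierarchy can be represented as a set of mappings (the city and province names below are only illustrative):

# Illustrative concept hierarchy for the dimension "location":
# city (level 3) -> province_or_state (level 2) -> country (level 1).
city_to_province = {"Vancouver": "British Columbia", "Toronto": "Ontario", "Chicago": "Illinois"}
province_to_country = {"British Columbia": "Canada", "Ontario": "Canada", "Illinois": "USA"}

def generalize(city):
    """Map a low-level concept (a city) up to a higher-level concept (its country)."""
    province = city_to_province[city]
    return province_to_country[province]

print(generalize("Vancouver"))  # Canada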

4. Interestingness measures:
Different kinds of knowledge may have different interestingness measures. These measures may be used to guide the mining process and to evaluate the discovered patterns. For example, interestingness measures for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
Simplicity: A factor contributing to the interestingness of a pattern is the
pattern’s overall simplicity for human comprehension.
Certainty (Confidence): A certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence. Given a set of task-relevant data tuples, the confidence of "A => B" is defined as
Confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
Utility (Support): The potential usefulness of a pattern is a factor defining
its interestingness. It can be estimated by a utility function, such as support.
The support of an association pattern refers to the percentage of task-
relevant data tuples (or transactions) for which the pattern is true.
Utility (support): usefulness of a pattern.
Support(A => B) = (# tuples containing both A and B) / (total # of tuples)
5. The expected representation for visualizing the discovered
patterns
This refers to the form in which discovered patterns are to be displayed,
which may include rules, tables, cross tabs, charts, graphs, decision trees,
cubes, or other visual representations.
Users must be able to specify the forms of presentation to be used for
displaying the discovered patterns. Some representation forms may be
better suited than others for particular kinds of knowledge.
Q) Explain the Data Integration of a Data Mining System with a
Database or a Data Warehouse System.
Integration Of A Data Mining System With A Database Or Data Warehouse
System
When a data mining (DM) system is coupled with a database (DB) or data warehouse (DW) system, possible integration schemes include no coupling, loose coupling, semitight coupling, and tight coupling. We examine each of these schemes, as follows:
1. No coupling: No coupling means that a DM system will not utilize any function
of a DB or DW system. It may fetch data from a particular source (such as a file
system), process data using some data mining algorithms, and then store the
mining results in another file.


2. Loose coupling: Loose coupling means that a DM system will use some
facilities of a DB or DW system, fetching data from a data repository managed by
these systems, performing data mining, and then storing the mining results either
in a file or in a designated place in a database or data Warehouse.
3. Semitight coupling: Semitight coupling means that besides linking a DM
system to a DB/DW system, efficient implementations of a few essential data
mining primitives (identified by the analysis of frequently encountered data mining
functions) can be provided in the DB/DW system.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system.

Q) Explain the Major Issues in Data Mining


Data mining is not an easy task, as the algorithms used can get very
complex and data is not always available at one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. Here we will discuss the major issues regarding −
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues
The following diagram describes the major issues.


Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because it allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
 Incorporation of background knowledge − To guide discovery
process and to express the discovered patterns, the background
knowledge can be used.
 Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered may be uninteresting because they represent common knowledge or lack novelty, so evaluating the interestingness of discovered patterns is itself an issue.
Performance Issues
There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms − In order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − The
factors such as huge size of databases, wide distribution of data, and
complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms.
Diverse Data Types Issues
 Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global
information systems − The data is available at different data
sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore mining the knowledge from them adds challenges to data mining.
Q) Explain the CRISP-DM model?
Or
Discuss the different phases of the data mining life cycle.
CRISP-DM, which stands for Cross-Industry Standard Process for Data
Mining, is an industry-proven way to guide your data mining efforts.

As a methodology, it includes descriptions of the typical phases of a project,


the tasks involved with each phase, and an explanation of the relationships
between these tasks.
As a process model, CRISP-DM provides an overview of the data mining life
cycle.

It has six sequential phases:


1. Business understanding:
The Business Understanding phase focuses on understanding the objectives
and requirements of the project. It has the following objectives:

a. Determine business objectives: You should first "thoroughly understand, from a business perspective, what the customer really wants to accomplish".
b. Assess situation: Determine resources availability, project
requirements, assess risks and contingencies, and conduct a cost-benefit
analysis.
c. Determine data mining goals: In addition to defining the business
objectives, you should also define what success looks like from a technical
data mining perspective.
d. Produce project plan: Select technologies and tools and define
detailed plans for each project phase.
2. Data Understanding:
Next is the Data Understanding phase. Adding to the foundation of Business
Understanding, it drives the focus to identify, collect, and analyze the data
sets that can help you accomplish the project goals. This phase also has four
tasks:
a. Collect initial data: Acquire the necessary data and (if necessary)
load it into your analysis tool.
b. Describe data: Examine the data and document its surface
properties like data format, number of records, or field identities.
c. Explore data: Dig deeper into the data. Query it, visualize it, and
identify relationships among the data.
d. Verify data quality: How clean/dirty is the data? Document any
quality issues.
3. Data Preparation:
This phase, which is often referred to as “data munging”, prepares the final
data set(s) for modeling. It has five tasks:
a. Select data: Determine which data sets will be used and document
reasons for inclusion/exclusion.


b. Clean data: Often this is the lengthiest task. Without it, you’ll likely
fall victim to garbage-in, garbage-out. A common practice during this task is
to correct, impute, or remove erroneous values.
c. Construct data: Derive new attributes that will be helpful. For
example, derive someone’s body mass index from height and weight fields.
d. Integrate data: Create new data sets by combining data from
multiple sources.
e. Format data: Re-format data as necessary. For example, you might
convert string values that store numbers to numeric values so that you can
perform mathematical operations.
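A tiny sketch of such a re-formatting step, assuming pandas and a made-up price column:

import pandas as pd

# Illustrative data: the "price" column stores numbers as strings.
df = pd.DataFrame({"price": ["10.5", "20.0", "15.25"]})

# Re-format: convert the string values to numeric so arithmetic is possible.
df["price"] = pd.to_numeric(df["price"])
print(df["price"].mean())  # 15.25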
4. Modeling:
This phase has four tasks:
a. Select modeling techniques: Determine which algorithms to try.
b. Generate test design: You might need to split the data into training, test, and validation sets.
c. Build model: This might just be executing a few lines of code.
d. Assess model: Generally, multiple models are competing against each
other, and the data scientist needs to interpret the model results based on
domain knowledge, the pre-defined success criteria, and the test design.
5. Evaluation
This phase has three tasks:
a. Evaluate results: Do the models meet the business success criteria?
Which one(s) should we approve for the business?
b. Review process: Review the work accomplished.
c. Determine next steps: Based on the previous three tasks, determine whether to proceed to deployment or to iterate further.
6. Deployment

"Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise."
Q) Discuss data preprocessing.
Data preprocessing is an important step in the data mining process. It refers
to the cleaning, transforming, and integrating of data in order to make it
ready for analysis.
In data preprocessing we focus on data quality, which is very important for data analysis.
Data Quality:
There are many factors comprising data quality: accuracy, completeness, consistency, timeliness, believability, and interpretability.
Inaccurate, incomplete, and inconsistent data are common properties of real-world databases and data warehouses. There are many possible reasons for inaccurate data, including human or computer errors occurring at data entry. Users may also purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information.
Errors in data transmission can also occur due to technology
limitations such as limited buffer size. Incorrect data may also result from
inconsistencies in data codes or inconsistent formats for input fields (e.g.
Date format). Duplicate tuples also require data cleaning.
Incomplete data can occur for a number of reasons. Attributes of
interest may not be available such as customer information for sales
transaction data. Other data may not be included because they were not
considered important at the time of entry. Data that were inconsistent with
other recorded data may be deleted.
Timeliness also affects data quality. Suppose monthly sales bonuses are distributed to the top sales representatives at AllElectronics, and several sales representatives fail to submit their sales records on time at the end of the month. For a period of time following each month, the data stored in the database is incomplete. Month-end data that are not updated in a timely fashion have a negative impact on data quality.
Two others factors affecting data quality are believability and
interpretability. Believability reflects how much the data are trusted by
users, while interpretability reflects how easy the data are understood.
Suppose that a database at one point had several errors, all of which have since been corrected. The past errors caused many problems, however, so users may no longer trust the data.
Q) What are the Data cleaning methods in Data mining?
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning methods attempt to fill in missing values, smooth out noise by identifying outliers, and correct inconsistencies in the data.
In this section we discuss the basic methods for data cleaning.
Data cleaning methods:
Missing values:
The following are the basic methods for filling the missing values.
1. Ignore the tuple: This is usually done when the class label is missing
2. Fill in the missing value manually: In general this approach is time
consuming and may not be feasible given a large data set with many
missing values.
3. Use a global constant to fill in the missing value: replace all missing attribute values by the same constant, such as a label like "unknown" or ∞.
4. Use the attribute mean to fill in the missing value

5. Use the attribute mean for all samples belonging to the same class to fill
in the missing value
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference based tools using Bayesian formalism
or decision tree.


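For example, methods 4 and 5 above could be sketched as follows with pandas (the class labels and income values are made up for illustration):

import pandas as pd

# Toy dataset with a missing income value.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [30000, None, 50000, 54000],
})

# Method 4: fill with the overall attribute mean.
overall = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean of the samples belonging to the same class.
by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(overall.tolist())
print(by_class.tolist())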

Noisy data
Noise is a random error or variance in a measured variable. The following are methods to handle noisy data.
a. Binning
The binning method can be used for smoothing the data, which is often full of noise. Data smoothing is a preprocessing technique that applies an algorithm to remove the noise from the data set.
Unsorted data for price in dollars
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First of all, sort the data
After Sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smoothing the data by equal frequency bins
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26,
Bin 3: 27, 30, 30, 34
Smoothing by bin means
For Bin 1: (8 + 9 + 15 + 16) / 4 = 12
Bin 1 = 12, 12, 12, 12
For Bin 2: (21 + 21 + 24 + 26) / 4 = 23
Bin 2 = 23, 23, 23, 23

For Bin 3: (27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30
Bin 3 = 30, 30, 30, 30
Smoothing by bin boundaries
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26,
Bin 3: 27, 30, 30, 34
Pick the minimum and maximum value of each bin; these become the bin boundaries. Each remaining (middle) value in the bin is then replaced by whichever boundary value it is closer to.
Smooth data after bin Boundary
Before bin Boundary: Bin 1: 8, 9, 15, 16
Here, 8 is the minimum value and 16 is the maximum value. 9 is closer to 8, so 9 is treated as 8; 15 is closer to 16 and farther from 8, so 15 is treated as 16.
After bin Boundary: Bin 1: 8, 8, 16, 16
Before bin Boundary: Bin 2: 21, 21, 24, 26,
After bin Boundary: Bin 2: 21, 21, 26, 26,
Before bin Boundary: Bin 3: 27, 30, 30, 34
After bin Boundary: Bin 3: 27, 27, 27, 34
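The binning example above can be reproduced with a short Python sketch (equal-frequency partitioning, then smoothing by bin means and by bin boundaries; the tie-breaking rule for boundary smoothing is an assumption):

prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
prices.sort()

# Equal-frequency (equal-depth) partitioning into 3 bins of 4 values each.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: replace each value by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: replace each value by the nearer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)       # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(by_means)   # [[12, 12, 12, 12], [23, 23, 23, 23], [30, 30, 30, 30]]
print(by_bounds)  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]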


b. Regression:
Data smoothing can also be done by regression, a technique that conforms
data values to a function. Linear regression involves finding the “best” line to
fit two attributes (or variables) so that one attribute can be used to predict
the other. Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to a
multidimensional surface.
c. Outlier analysis
Outliers may be detected by clustering, for example, where similar values
are organized into groups, or “clusters.” Intuitively, values that fall outside
of the set of clusters may be considered outliers
d. Data cleaning as a process:
To handle inconsistent data, data cleaning is carried out as a process. The first step in this process is discrepancy detection.
Discrepancies occur due to badly designed data entry forms, human errors, and data decay (outdated values). They can be detected with the help of metadata, and also by examining measures of central tendency.
The data should also be examined with regard to "unique rules", "consecutive rules", and "null rules".
A “unique rule” says that each value of given attribute must be different
from all other values for that attribute.
A “consecutive rule” says that there can be no missing values between
lowest and highest values for the attribute and that all values must also be
unique.
A “null rule” specifies the use of blanks, question marks and special
characters may indicate the null condition.
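A small sketch of how the unique rule and null rule might be checked with pandas (the table, column names, and null markers are illustrative assumptions):

import pandas as pd

# Illustrative customer table.
df = pd.DataFrame({"cust_id": [1, 2, 2, 4], "phone": ["555-0101", None, "?", "555-0404"]})

# Unique rule: every cust_id value must differ from all others.
violations_unique = df[df["cust_id"].duplicated(keep=False)]

# Null rule: blanks, question marks, etc. are treated as null markers.
nulls = df["phone"].isin(["", "?"]) | df["phone"].isna()

print(violations_unique)
print(nulls.tolist())  # [False, True, True, False]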
Q) Write about data integration?
Data integration merges data from several heterogeneous sources to attain
meaningful data. The sources may include several databases, multiple files, or data cubes. The integrated data must be free from inconsistencies, discrepancies, and redundancies.
Entity Identification Problem:
Since the data is unified from heterogeneous sources, how can we match up the real-world entities across the data? For example, suppose we have customer data from two different data sources: an entity from one data source has "customer_id" while the entity from the other data source has "customer_number". How would the data analyst know that these two attributes refer to the same entity?
Here, schema integration can be achieved using the metadata of each attribute. The metadata of an attribute includes its name, its meaning in the particular scenario, its data type, the range of values it can accept, and the rules it follows for null, blank, or zero values. Analyzing this metadata information will prevent errors in schema integration.
Redundancy and Correlation Analysis:
Redundancy is another important issue in data integration. An attribute (e.g., annual revenue) may be redundant if it can be "derived" from another attribute or set of attributes.
Some redundancies can be detected by correlation analysis. Given two
attributes such analysis can measure how strongly one attribute implies the
other based on available data.
For example:
Correlation Test for Nominal Data:
For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a1, a2, ..., ac, and B has r distinct values, namely b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows.
Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj). Each possible (Ai, Bj) joint event has its own cell (or slot) in the table. The χ² value (also known as the Pearson χ² statistic) is computed as
χ² = Σi Σj (oij − eij)² / eij
where oij is the observed frequency (actual count) of the joint event (Ai, Bj) and eij is the expected frequency, eij = (count(A = ai) × count(B = bj)) / n, with n the total number of data tuples.
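A sketch of the χ² test on a small contingency table, using scipy as an assumed tool (the counts are illustrative):

from scipy.stats import chi2_contingency

# Illustrative contingency table: rows = gender, columns = preferred reading.
observed = [[250, 200],    # e.g., male:   fiction, non-fiction
            [50, 1000]]    # e.g., female: fiction, non-fiction

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 1), "correlated" if p_value < 0.05 else "independent")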

Tuple Duplication:
In addition to detecting redundancies between attributes, duplication should
also be detected at the tuple level (e.g., where there are two or more
identical tuples for a given unique data entry). Inconsistencies often arise
between various duplicates, due to inaccurate data entry or updating some
but not all data occurrences.
Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value
conflicts. For example, for the real-world entity, attribute values from
different sources may differ.
This may be due to differences in representation. For instance, a weight attribute may be stored in metric units in one system and British imperial units in another, or the price of a hotel room may be represented in different currencies in different cities. These kinds of issues are detected and resolved during data integration.


Q) What are the data reduction techniques?


Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume. Data reduction techniques preserve the integrity of the data while reducing it.
Methods of data reduction:
Wavelet transforms:
The discrete wavelet transform (DWT) is a linear signal processing technique. It transforms a data vector X into a numerically different vector X' of wavelet coefficients.
Consider a vector X with n dimensions; applying the wavelet transformation produces a vector X' of wavelet coefficients that also has n dimensions. Although X' has the same length as X, it can be truncated: only the coefficients that satisfy a user-specified constraint (threshold) are retained, and the rest are set to 0, giving a compressed approximation of the data.
For example, consider the vector X = 56, 40, 8, 24, 48, 48, 40, 16. Note that the length of the vector must be a power of 2; if it is not, pad it with 0s. At each step, adjacent pairs (a, b) are replaced by their average (a + b)/2, and the detail coefficient (a - b)/2 is recorded:

Resolution 8: data = [56, 40, 8, 24, 48, 48, 40, 16]
Resolution 4: averages = [48, 16, 48, 28], DWT coefficients = [8, -8, 0, 12]
Resolution 2: averages = [32, 38], DWT coefficients = [16, 10]
Resolution 1: average = [35], DWT coefficient = [-3]

When the resolution level reaches 1, the recursive process stops, and the full set of DWT coefficients is
[35, -3, 16, 10, 8, -8, 0, 12]
Applying a user-specified threshold (constraint) that discards values below 0, each negative value is replaced by 0, so the final coefficients are
[35, 0, 16, 10, 8, 0, 0, 12]
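The averaging-and-differencing steps above can be reproduced with a small recursive sketch (a simplified Haar-style transform written for this example, not a full DWT implementation):

def haar_dwt(values):
    """Recursively replace pairs by their averages, collecting the detail coefficients."""
    if len(values) == 1:
        return values
    averages = [(a + b) / 2 for a, b in zip(values[::2], values[1::2])]
    details = [(a - b) / 2 for a, b in zip(values[::2], values[1::2])]
    return haar_dwt(averages) + details

X = [56, 40, 8, 24, 48, 48, 40, 16]
coeffs = haar_dwt(X)
print(coeffs)  # [35.0, -3.0, 16.0, 10.0, 8.0, -8.0, 0.0, 12.0]

# Thresholding: discard (zero out) coefficients below 0, as in the example above.
print([c if c >= 0 else 0 for c in coeffs])  # [35.0, 0, 16.0, 10.0, 8.0, 0, 0, 12.0]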
Attribute subset selection
The data set may have a large number of attributes, but some of those attributes can be irrelevant or redundant. The goal of attribute subset selection is to find a minimum set of attributes such that dropping the irrelevant attributes does not significantly affect the utility of the data, while the cost of data analysis is reduced.
Methods of Attribute Subset Selection-
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.
All the above methods are greedy approaches for attribute subset selection.
1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attributes are chosen (those having the minimum p-value) and added to the minimal set. In each iteration, one attribute is added to the reduced set.
2. Stepwise Backward Elimination: Here all the attributes are considered in the initial set of attributes. In each iteration, one attribute whose p-value is higher than the significance level is eliminated from the set of attributes.
3. Combination of Forward Selection and Backward Elimination: The
stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most common technique, generally used for attribute selection.
4. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure in which each internal node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each leaf node represents a class prediction. Attributes that do not appear in the tree are considered irrelevant and hence discarded.
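As a sketch of stepwise forward selection, scikit-learn's SequentialFeatureSelector can be used (an assumed tool choice; note that it adds attributes greedily based on cross-validated model score rather than p-values, and the synthetic data is illustrative):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic dataset with 8 attributes, only some of which are informative.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

# Stepwise forward selection: start from an empty set and greedily add attributes.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=3, direction="forward")
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes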

(Figure: heuristic methods for attribute subset selection.)
Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters.
Similarity is defined in terms of how "close" the objects are in space, based on a distance function. The quality of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, defined as the average distance of each cluster object from the cluster centroid.
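As a sketch (illustrative 2-D points; KMeans is an assumed algorithm choice), the tuples in each cluster can be represented by the cluster centroid, and cluster quality can be summarized by the diameter and the average centroid distance:

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

# Illustrative 2-D data tuples treated as objects.
X = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The cluster centroids can be stored as a reduced representation of the data.
print(km.cluster_centers_)

# Cluster quality: diameter (max pairwise distance) and average centroid distance.
for k, center in enumerate(km.cluster_centers_):
    members = X[km.labels_ == k]
    diameter = pdist(members).max()
    avg_centroid_dist = np.linalg.norm(members - center, axis=1).mean()
    print(k, round(diameter, 2), round(avg_centroid_dist, 2))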

Data Cube aggregation


Data cubes store multidimensional aggregated (summarized) information.
Each cell holds an aggregate data value, corresponding to the data point in
Multidimensional space.
Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining.
The lowest level of a data cube (base cuboid)
 The cube created at the lowest level of abstraction is referred to as the base
cuboid.
 The aggregated data for an individual entity of interest
For example:

Let's consider the situation of a company's data. This data consists of the AllElectronics sales per quarter for the years 2008 to 2010. Generally, we are interested in the annual sales (total per year) rather than the total per quarter. Thus the data can be reduced so that the resulting data summarize the total sales per year instead of per quarter.
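A sketch of this roll-up with pandas (the quarterly sales figures are made up):

import pandas as pd

# Illustrative quarterly sales for AllElectronics (values are made up).
sales = pd.DataFrame({
    "year":    [2008]*4 + [2009]*4 + [2010]*4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "amount":  [224, 408, 350, 586, 330, 416, 390, 620, 312, 470, 512, 580],
})

# Aggregate from the quarter level up to the year level (data cube roll-up).
annual = sales.groupby("year")["amount"].sum()
print(annual)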

Histograms:
Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, referred to as buckets or bins.
If each bucket represents a single attribute value/frequency pair, the buckets are called "singleton buckets".


The following figure shows a histogram for the data using singleton buckets. To further reduce the data, it is common to have each bucket denote a continuous value range for the given attribute.

In the following figure each bucket represents a different $10 range for
price.

Buckets are determined and attribute values are partitioned based on the following rules:
Equal-width: In an equal-width histogram, the width of each bucket range is uniform.
Equal-frequency: In an equal-frequency histogram, the buckets are created so that the frequency of each bucket is roughly constant.
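Both bucket types can be illustrated with numpy, reusing the price data from the binning example (the bucket counts and boundaries shown are for illustration only):

import numpy as np

prices = np.array([8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34])

# Equal-width buckets: each bucket spans the same value range.
counts, edges = np.histogram(prices, bins=3)
print(edges)   # uniform-width bucket boundaries
print(counts)  # frequency of each bucket

# Equal-frequency buckets: boundaries chosen so each bucket holds roughly the same number of values.
quantile_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print(quantile_edges)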
Regression and Log linear Models
Regression is a data mining technique used to predict a range of numeric
values (also called continuous values), given a particular dataset. For
example, regression might be used to predict the cost of a product or
service, given other variables.
Regression can be simple linear regression or multiple linear regression. When there is only a single independent attribute, the regression model is called simple linear regression; if there are multiple independent attributes, the model is called multiple linear regression.
In linear regression, the data are modeled to fit a straight line. For example, a random variable y can be modeled as a linear function of another random variable x with the equation
y = ax + b
where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively.
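For example, the regression coefficients a and b can be estimated from data with a least-squares fit (numpy is an assumed tool; the x and y values are made up):

import numpy as np

# Illustrative data roughly following y = 2x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares fit of a straight line y = a*x + b.
a, b = np.polyfit(x, y, deg=1)
print(round(a, 2), round(b, 2))  # slope and y-intercept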
