CM5107
Unit 3. Data Preprocessing
3.1 Need of preprocessing
3.2 Tasks in Data Preprocessing
3.3 Data Cleaning
3.3.1 Missing Values
3.3.2 Noisy Data
3.3.3 Data Cleaning as a Process
3.4 Data Integration
3.4.1 Entity Identification Problem
3.4.2 Redundancy and Correlation Analysis
3.4.3 Tuple Duplication
3.4.4 Data Value Conflict Detection and Resolution
3.5 Data Reduction
3.5.1 Wavelet Transforms
3.5.2 Principal Component Analysis
3.5.3 Attribute Subset Selection
3.5.4 Parametric Data Reduction
3.5.5 Histograms
3.5.6 Clustering
***************************************************************************
Need of preprocessing in short
• Incomplete, noisy, and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
• Data have quality if they satisfy the requirements of the intended use.
• There are many factors comprising data quality, including accuracy, completeness,
consistency, timeliness, believability, and interpretability.
• Real-world data tend to be dirty, incomplete, and inconsistent.
• Data preprocessing techniques can improve the quality of the data, thereby improving the
accuracy and efficiency of the subsequent mining process.
• Data pre-processing is an important step in the knowledge discovery process, since
quality decisions must be based on quality data.
• Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can
lead to huge payoffs for decision making.
Need of preprocessing in detail
• Incomplete, noisy, and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
• Data have quality if they satisfy the requirements of the intended use.
• There are many factors comprising data quality, including accuracy, completeness,
consistency, timeliness, believability, and interpretability.
• Incomplete data can occur for a number of reasons:
1. Attributes of interest may not always be available
2. Relevant data may not be recorded due to misunderstanding, or because of equipment
malfunctions.
3. Data that were inconsistent with other recorded data may have been deleted.
4. Missing data, particularly for tuples with missing values for some attributes, may need
to be inferred.
5. The data collection instruments used may be faulty.
6. There may have been human or computer errors occurring at data entry. Errors in data
transmission can also occur.
7. There may be technology limitations, such as limited buffer size for coordinating
synchronized data transfer and consumption.
Why is Data Preprocessing Important?
• Preprocessing of data is mainly to check the data quality.
• The quality can be checked by the following:
o Accuracy: To check whether the data entered is correct or not.
o Completeness: To check whether all required data is recorded and available.
o Consistency: To check whether the same data kept in different places matches.
o Timeliness: The data should be updated correctly.
o Believability: The data should be trustable.
o Interpretability: The understandability of the data.
Tasks in Data Preprocessing
1. Data cleaning: data cleaning routines fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies.
2. Data integration: the process of merging data from multiple databases, data cubes, or files.
Attributes representing the same concept may have different names in different databases,
causing inconsistencies and redundancies.
3. Data transformation: operations such as normalization and aggregation are additional data
preprocessing procedures that contribute toward the success of the mining process (a
normalization sketch follows this list).
4. Data reduction: obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
5. Data discretization: part of data reduction, replacing numerical attributes with nominal
ones.
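
As a small illustration of the normalization operation mentioned in task 3, the following sketch
rescales a numeric attribute to the [0, 1] range using min-max normalization. The income values
are made-up sample data used only for illustration.

# Minimal sketch of min-max normalization (data transformation task).
# The income values below are made-up sample data for illustration only.
incomes = [30000, 56000, 72000, 41000, 98000]

lo, hi = min(incomes), max(incomes)
# Rescale each value v to (v - min) / (max - min), mapping the range onto [0, 1].
normalized = [(v - lo) / (hi - lo) for v in incomes]

print(normalized)  # e.g. 30000 -> 0.0, 98000 -> 1.0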
Data Cleaning
1. Missing Values
2. Noisy Data
3. Inconsistent Data
4. Data Cleaning as a Process
Data Cleaning
• Data cleaning is a technique that deals with detecting and removing inconsistencies and
errors from the data in order to obtain better-quality data.
• It involves identifying and correcting errors or inconsistencies in the data, such as missing
values, outliers, and duplicates.
• Various techniques can be used for data cleaning, such as imputation, removal, and
transformation.
• Data cleaning is performed as a data preprocessing step while preparing the data for a
data warehouse; good-quality data must pass a set of quality criteria.
• Those criteria include: accuracy, integrity, completeness, validity, consistency,
uniformity, density, and uniqueness.
• Real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
Missing Values
• Many tuples have no recorded value for some attributes, such as customer income, so we
need to fill in the missing values for these attributes.
• The following methods are useful for handling missing values:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective unless the
tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing values manually: This approach is time consuming and may not be
feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "unknown" or −∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of customers is $56,000. Use this value to replace the missing value
for income.
5. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in the data set, a decision
tree can be constructed to predict the missing values for income (see the imputation
sketch after this list).
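
The sketch below illustrates three of the strategies above (ignoring the tuple, filling with a
global constant, and filling with the attribute mean) on a small, hypothetical customer table;
the column names and values are invented for illustration, and pandas is assumed to be available.

# Sketch of three missing-value strategies using pandas; the DataFrame and
# its 'income' column are hypothetical examples, not data from the text.
import pandas as pd
import numpy as np

df = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                   "income": [48000, np.nan, 56000, np.nan]})

# 1. Ignore the tuple: drop rows where 'income' is missing.
dropped = df.dropna(subset=["income"])

# 3. Use a global constant: fill every missing income with a sentinel value such as -1.
constant_filled = df.fillna({"income": -1})

# 4. Use the attribute mean: replace missing incomes with the mean of the observed ones.
mean_filled = df.fillna({"income": df["income"].mean()})

print(mean_filled)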
Noisy Data
• Noise is a random error or variance in a measured variable.
• Noisy data is meaningless data that cannot be interpreted by machines.
• It can be generated due to faulty data collection instruments, data entry errors, etc.
• It can be handled in the following ways:
1. Binning Method: (Dividing data into clearly defined segments or bins)
• This method works on sorted data in order to smooth it.
• The whole data is divided into segments of equal size and then various methods
are performed to complete the task.
• Each segment is handled separately.
• One can replace all the data in a segment by its mean, or the segment boundary values
can be used to smooth the data (see the binning sketch after this list).
2. Regression: (Fitting data into multiple or linear regression functions)
• Here data can be made smooth by fitting it to a regression function.
• The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
3. Clustering: (Putting data into bundles of similar data)
• This approach groups the similar data in a cluster.
• Values that fall outside of the clusters may be considered outliers.
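
The following sketch shows equal-frequency binning with smoothing by bin means, one of the
smoothing options described above. The price values and the bin size of three are made-up
choices for illustration.

# Sketch of equal-frequency binning with smoothing by bin means.
# The price values are made-up sample data.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3  # three values per bin (equal-frequency / equal-depth bins)

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # Replace every value in the bin by the bin mean (smoothing by bin means).
    smoothed.extend([round(mean, 1)] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]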
Inconsistent Data
• Data inconsistency is a situation where there are multiple tables within a database that
deal with the same data but may receive it from different inputs.
• Inconsistencies may exist in the data stored in transaction records.
• Inconsistencies arise during data entry, from violated functional dependencies between
attributes, and from missing values.
• The inconsistencies can be detected and corrected either manually or with knowledge
engineering tools.
• Inconsistent data can lead to incorrect analysis, decision-making, and outcomes.
• For example:
1. One employee may enter customer addresses as "block 1/23," while another may
use "block 1-23" (see the normalization sketch after these examples).
2. if a company's financial records are inaccurate, it may result in incorrect tax
filings or legal compliance issues.
3. Incomplete data can lead to incorrect business decisions, missed opportunities, and
customer frustration.
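
One simple way to reduce such format inconsistencies is to map the conflicting variants onto a
single canonical form before comparing records. The sketch below normalizes the hypothetical
"block 1/23" and "block 1-23" address variants with a regular expression; the pattern is only an
illustrative assumption, not a general address cleaner.

# Sketch: normalizing inconsistent address formats onto one canonical form.
import re

def normalize_address(addr: str) -> str:
    # Rewrite "block 1/23" or "block 1-23" as "block 1-23" (one convention).
    return re.sub(r"block\s*(\d+)\s*[/-]\s*(\d+)", r"block \1-\2", addr.lower())

print(normalize_address("Block 1/23"))   # block 1-23
print(normalize_address("block 1-23"))   # block 1-23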
Data Cleaning as a Process
Data cleaning as a process is done in the following two steps:
a) Discrepancy detection
b) Data transformations
a) Discrepancy detection
• The first step in data cleaning is discrepancy detection.
• It uses knowledge of the metadata and examines the following rules for detecting
discrepancies:
1. Unique rules- each value of the given attribute must be different from all other
values for that attribute.
2. Consecutive rules – there can be no missing values between the lowest and highest
values for the attribute, and all values must also be unique.
3. Null rules – specify the use of blanks, question marks, special characters, or other
strings that may indicate the null condition.
• Discrepancy detection Tools:
1. Data scrubbing tools – use simple domain knowledge (e.g., knowledge of postal
addresses and spell-checking) to detect errors and make corrections in the data.
2. Data auditing tools – analyze the data to discover rules and relationships, and detect
data that violate such conditions (a rule-checking sketch follows).
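
The sketch below checks two of these rules (the unique rule and the null rule) on a hypothetical
list of attribute values; real data scrubbing and auditing tools automate such checks at a much
larger scale, and the null markers used here are assumptions for illustration.

# Sketch: checking the unique rule and the null rule on one attribute's values.
# The values list and the null markers are hypothetical examples.
from collections import Counter

values = ["C101", "C102", "C102", "?", "", "C104"]
NULL_MARKERS = {"", "?", "unknown", None}   # strings treated as null per the null rule

# Unique rule: every non-null value must occur exactly once.
counts = Counter(v for v in values if v not in NULL_MARKERS)
duplicates = [v for v, c in counts.items() if c > 1]

# Null rule: report how many entries use a null marker.
nulls = [v for v in values if v in NULL_MARKERS]

print("duplicate values:", duplicates)   # ['C102']
print("null entries:", len(nulls))       # 2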
b) Data transformations
• This is the second step in data cleaning as a process.
• After detecting discrepancies, we need to define and apply (a series of)
transformations to correct them.
• Data Transformations Tools:
1. Data migration tools – allow simple transformations to be specified, such as
replacing the string "gender" with "sex".
2. ETL (Extraction/Transformation/Loading) tools – allow users to specify
transformations through a graphical user interface (GUI).
Data Integration
1. Entity Identification Problem
2. Redundancy and Correlation Analysis
3. Tuple Duplication
4. Data Value Conflict Detection and Resolution
Data Integration
• Data integration is a data preprocessing technique that takes data from one or more
sources and maps it, field by field, onto a new data structure.
• The idea is to merge the data from multiple sources into a coherent data store. Data
mining often requires data integration – the merging of data from multiple data stores
into a coherent store, as in data warehousing.
• Data may be distributed over different databases or data warehouses.
• Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. These sources may include multiple databases, data cubes, or
flat files.
• Techniques such as record linkage and data fusion can be used for data integration. There
may also be a need to enhance the data with additional (external) data. Issues such as the
entity identification problem arise during integration.
• Issues in Data Integration
a) Schema integration & object matching.
b) Redundancy.
c) Detection & Resolution of data value conflict
Issues in Data Integration
• Schema integration and object matching: How can the data analyst or the computer be
sure that customer id in one database and customer number in another refer to the
same attribute?
• Redundancy: An attribute (such as annual revenue, for instance) may be redundant if it
can be derived from another attribute or set of attributes. Inconsistencies in attribute or
dimension naming can also cause redundancies in the resulting data set.
• Detection and resolution of data value conflicts: For the same real-world entity, attribute
values from different sources may differ.
Entity Identification Problem
• There are a number of issues to consider during data integration.
• Schema integration and object matching can be tricky.
• Integrate metadata from different sources.
• Matching real-world entities from multiple data sources is referred to as the entity
identification problem.
• Entity identification problem: identify real-world entities from multiple data sources, e.g.,
A.cust_id = B.cust_number.
• For example, how can the data analyst or the computer be sure that cust_id in one database
and cust_number in another refer to the same attribute?
• Examples of metadata for each attribute include the name, meaning, data type, and range
of values permitted for the attribute, and null rules for handling blank, zero, or null
values. Such metadata can be used to help avoid errors in schema integration.
• The metadata may also be used to help transform the data (e.g., where data codes for
pay_type in one database may be "H" and "S" but 1 and 2 in another; see the mapping
sketch below). Hence, this step also relates to data cleaning.
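
A minimal sketch of using such metadata during integration: the mapping below converts the
hypothetical pay_type codes "H" and "S" from one source into the 1 and 2 used by the other, so
that both sources share one encoding. The record layout is invented for illustration.

# Sketch: using metadata to reconcile coded values across two sources.
# The pay_type codes and records are hypothetical examples based on the scenario above.
PAY_TYPE_MAP = {"H": 1, "S": 2}   # source A codes -> source B codes

source_a_records = [{"emp_id": 10, "pay_type": "H"},
                    {"emp_id": 11, "pay_type": "S"}]

# Transform source A records so that pay_type uses source B's encoding.
unified = [{**rec, "pay_type": PAY_TYPE_MAP[rec["pay_type"]]} for rec in source_a_records]

print(unified)  # [{'emp_id': 10, 'pay_type': 1}, {'emp_id': 11, 'pay_type': 2}]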
Redundancy and Correlation Analysis
• Redundancy is another important issue in data integration.
• Redundant data often occur when integrating multiple databases.
• An attribute (such as annual revenue, for instance) may be redundant if it can be derived
or obtained from another attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause redundancies in the
resulting data set.
• Given two attributes, correlation analysis can measure how strongly one attribute implies
the other, based on the available data.
• Redundant data often occur when integrating multiple databases:
o Object identification: The same attribute or object may have different names in
different databases.
o Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue.
Handling Redundancy in Data Integration
• Redundant attributes may be detected by correlation analysis.
• Careful integration of the data from multiple sources may help reduce/avoid redundancies
and inconsistencies and improve mining speed and quality.
• For numerical attributes, we can evaluate/measure the correlation (linear relationship)
between two attributes, A and B, by computing the correlation coefficient. This is
correlation analysis.
• Ex– We have a data set having three attributes- person_name, is_male, is_female.
o is_male is 1 if the corresponding person is a male else it is 0 .
o is_female is 1 if the corresponding person is a female else it is 0.
• The following correlation analysis methods are used to detect redundancies:
o Χ2 Test/Correlation Analysis (Categorical Data) (Used for nominal Data or
categorical or qualitative data)
o Correlation coefficient and covariance/Correlation Analysis (Numerical Data) (Used
for numeric Data or quantitative data)
Χ2 Test/Correlation Analysis (Categorical Data)
• Also called the χ² (chi-square) test.
• It is used for qualitative (nominal or categorical) data.
• Suppose we have two attributes, X and Y, in the data set. To represent the counts of their
joint values, we build a contingency table.
• The formula used for the χ² test is
χ² = Σ (Observed − Expected)² / Expected, summed over all cells of the contingency table,
• where the observed values are the actual counts,
• and the expected values are the counts for the joint events computed from the contingency
table under the assumption that X and Y are independent.
• The χ² test examines the hypothesis that X and Y are independent.
• If this hypothesis can be rejected, we can conclude that X and Y are statistically related to
each other, and one of them (either X or Y) may be treated as redundant.
• The larger the χ² value, the more likely the variables are related (a worked sketch follows).
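
A small worked sketch of the χ² test using SciPy's chi2_contingency; the 2×2 contingency table
(e.g., gender vs. preferred reading) contains made-up counts, and a very small p-value would lead
us to reject the independence hypothesis.

# Sketch: chi-square (χ²) test of independence on a hypothetical contingency table.
from scipy.stats import chi2_contingency

# Rows: gender (male, female); columns: preferred reading (fiction, non_fiction).
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi2 =", round(chi2, 1), "p-value =", p_value)

# A very small p-value means the independence hypothesis is rejected,
# i.e. the two attributes are statistically related (one may be redundant).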
Correlation coefficient & covariance
• Used for numeric or quantitative data – correlation analysis (numerical data).
• In this test, the relation between attributes A and B is computed with Pearson's
product-moment coefficient, also called the correlation coefficient.
• A correlation coefficient measures the extent to which the value of one variable changes
with another. It is also called Pearson’s product moment correlation coefficient
(PPMCC/PCC).
• PPMCC or PCC is a measure of the linear relationship between two variables that have
been measured on interval or ratio scales.
• There are several different correlation coefficients, each of which is appropriate for
different types of data. The most common is the Pearson r, used for continuous variables.
• It is a statistic that measures the degree to which one variable varies in tandem with
another. It ranges from -1 to +1.
• rA,B > 0 &+1 correlation means that as one variable rises, the other rises proportionally;
• rA,B < 0 & -1 correlation means that as one rises, the other falls proportionally.
• rA,B = 0 correlation means there is no relationship between the movements of the two
variables.
• The formula used to calculate the correlation coefficient for numeric data is
rA,B = Σ (ai − Ā)(bi − B̄) / (n · σA · σB),
• where n = number of tuples, ai and bi = the values of A and B in tuple i, Ā and B̄ = the
mean values of A and B, and σA and σB = the standard deviations of A and B.
• The greater the absolute value of the correlation coefficient, the more strongly the
attributes are correlated with each other, and one of them (either A or B) can be removed
as redundant. If the correlation coefficient is zero, the attributes are independent (there is
no linear relationship between them). If the correlation coefficient is negative, the
attributes are negatively correlated: as the value of one attribute increases, the value of the
other attribute decreases (see the numeric sketch below).
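
A short sketch computing the Pearson correlation coefficient for two numeric attributes, first
directly from the formula above and then with NumPy's built-in routine; the two value vectors are
made-up sample data.

# Sketch: Pearson correlation coefficient between two numeric attributes.
import numpy as np

a = np.array([6, 8, 10, 14, 18], dtype=float)    # e.g. years of experience (sample data)
b = np.array([30, 36, 40, 52, 60], dtype=float)  # e.g. salary in $1000s (sample data)

# Direct implementation of rA,B = sum((a - mean_a) * (b - mean_b)) / (n * std_a * std_b).
n = len(a)
r_manual = np.sum((a - a.mean()) * (b - b.mean())) / (n * a.std() * b.std())

# The same value from NumPy's built-in correlation matrix.
r_numpy = np.corrcoef(a, b)[0, 1]

print(round(r_manual, 4), round(r_numpy, 4))  # both close to +1: strong positive correlation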
Why Correlation Analysis is Important
• Correlation analysis can reveal meaningful relationships between different metrics or
groups of metrics.
• Information about those connections can provide new insights and reveal
interdependencies, even if the metrics come from different parts of the business.
• Suppose there is a strong correlation between two variables or metrics, and one of them is
observed behaving in a particular way.
• In that case, you can conclude that the other one is likely being affected similarly.
• This helps group related metrics together to reduce the need for individual data
processing.
• Advantages of data redundancy include:
a) Increased data availability and reliability, as there are multiple copies of the data that
can be used in case the primary copy is lost or becomes unavailable.
b) Improved data integrity, as multiple copies of the data can be compared to detect and
correct errors.
c) Increased fault tolerance, as the system can continue to function even if one copy of
the data is lost or corrupted.
• Disadvantages of data redundancy include:
a) Increased storage requirements, as multiple copies of the data must be maintained.
b) Increased complexity of the system, as managing multiple copies of the data can be
difficult and time-consuming.
c) Increased risk of data inconsistencies, as multiple copies of the data may become out
of sync if updates are not properly propagated to all copies.
d) Reduced performance, as the system may have to perform additional work to maintain
and access multiple copies of the data.
Tuple Duplication
• In addition to detecting redundancies between attributes, duplication should also be
detected at the tuple level (e.g., where there are two or more identical tuples for a given
unique data entry case).
• The use of de-normalized tables (often done to improve performance by avoiding joins) is
another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data entry or
updating some but not all data occurrences.
• For example, if a purchase order database contains attributes for the purchaser's name and
address instead of a key to this information in a purchaser database, discrepancies can
occur, such as the same purchaser's name appearing with different addresses within the
purchase order database.
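
A brief sketch of detecting and removing duplicate tuples with pandas; the purchase-order rows
are hypothetical and chosen so that one row is an exact duplicate of another.

# Sketch: detecting duplicate tuples (identical rows) with pandas.
# The purchase-order rows below are hypothetical sample data.
import pandas as pd

orders = pd.DataFrame({
    "purchaser": ["A. Shah", "A. Shah", "B. Lee"],
    "address":   ["12 Park Rd", "12 Park Rd", "7 Hill St"],
    "amount":    [120, 120, 300],
})

print(orders.duplicated())           # True for the second, identical row
deduplicated = orders.drop_duplicates()
print(deduplicated)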
Data Value Conflict Detection and Resolution
• Data integration also involves the detection and resolution of data value conflicts.
• For example, for the same real-world entity, attribute values from different sources may
differ.
• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• This may be due to differences in representation, scaling, or encoding.
• Attributes may also differ in their abstraction level, where an attribute in one system is
recorded at, say, a lower level of abstraction than the "same" attribute in another.
• For example, the total_sales in one database may refer to one branch of All_Electronics,
while an attribute of the same name in another database may refer to the total sales for
All_Electronics stores in a given region.
• Possible reasons for data value conflicts: different representations, different scales, e.g.,
metric vs. British units.
• Examples,
o a weight attribute may be stored in metric units in one system and British imperial
units in another (see the conversion sketch after these examples).
o For a hotel chain, the price of rooms in different cities may involve not only different
currencies but also different services (e.g., free breakfast) and taxes.
o When exchanging information between schools, for example, each school may have
its own curriculum and grading scheme
o One university may adopt a quarter system, offer three courses on database systems,
and assign grades from A+ to F, whereas another may adopt a semester system, offer
two courses on databases, and assign grades from 1 to 10.
o It is difficult to work out precise course-to-grade transformation rules between the two
universities, making information exchange difficult.
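
A minimal sketch of resolving the weight-unit conflict mentioned above by converting every record
to one unit before integration; the records and attribute names are hypothetical, while the
pounds-to-kilograms factor is the standard conversion.

# Sketch: resolving a unit conflict by converting all weights to kilograms.
# Source A stores weight in kilograms, source B in pounds (hypothetical records).
LB_TO_KG = 0.45359237  # standard conversion factor

source_a = [{"item": "monitor", "weight_kg": 5.2}]
source_b = [{"item": "printer", "weight_lb": 11.0}]

unified = list(source_a)
for rec in source_b:
    unified.append({"item": rec["item"],
                    "weight_kg": round(rec["weight_lb"] * LB_TO_KG, 2)})

print(unified)  # both records now use the same unit and attribute name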
**********************************************************************************