CM5107
Unit 3. Data Preprocessing
3.1 Need of preprocessing
3.2 Tasks in Data Preprocessing
3.3 Data Cleaning
3.3.1 Missing Values
3.3.2 Noisy Data
3.3.3 Data Cleaning as a Process
3.4 Data Integration
3.4.1 Entity Identification Problem
3.4.2 Redundancy and Correlation Analysis
3.4.3 Tuple Duplication
3.4.4 Data Value Conflict Detection and Resolution
3.5 Data Reduction
3.5.1 Wavelet Transforms
3.5.2 Principal Component Analysis
3.5.3 Attribute Subset Selection
3.5.4 Parametric Data Reduction
3.5.5 Histograms
3.5.6 Clustering
***************************************************************************
Need of preprocessing in short
• Incomplete, noisy, and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
• Data have quality if they satisfy the requirements of the intended use.
• There are many factors comprising data quality, including accuracy, completeness,
consistency, timeliness, believability, and interpretability.
• Real-world data tend to be dirty, incomplete, and inconsistent.
• Data preprocessing techniques can improve the quality of the data, thereby improving the
accuracy and efficiency of the subsequent mining process.
• Data pre-processing is an important step in the knowledge discovery process, since
quality decisions must be based on quality data.
• Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can
lead to huge payoffs for decision making.
Need of preprocessing in detail
• Incomplete, noisy, and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
• Data have quality if they satisfy the requirements of the intended use.
• There are many factors comprising data quality, including accuracy, completeness,
consistency, timeliness, believability, and interpretability.
• Incomplete data can occur for a number of reasons:
1. Attributes of interest may not always be available
2. Relevant data may not be recorded due to misunderstanding, or because of equipment
malfunctions.
3. Data that were inconsistent with other recorded data may have been deleted.
4. Missing data, particularly for tuples with missing values for some attributes, may need
to be inferred.
5. The data collection instruments used may be faulty.
6. There may have been human or computer errors occurring at data entry. Errors in data
transmission can also occur.
7. There may be technology limitations, such as limited buffer size for coordinating
synchronized data transfer and consumption.
Why is Data Preprocessing Important?
• Preprocessing of data is mainly to check the data quality.
• The quality can be checked by the following:
o Accuracy: To check whether the data entered is correct or not.
o Completeness: To check whether all required data is recorded and available.
o Consistency: To check whether the same data kept in different places matches.
o Timeliness: The data should be updated correctly.
o Believability: The data should be trustable.
o Interpretability: The understandability of the data.
Tasks in Data Preprocessing
1. Data cleaning: data cleaning routines fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies.
2. Data integration: the process of merging data from multiple databases, data cubes, or files.
Attributes representing the same concept may have different names in different databases,
causing inconsistencies and redundancies.
3. Data transformation: operations such as normalization and aggregation are additional data
preprocessing procedures that contribute toward the success of the mining process (a
normalization sketch follows this list).
4. Data reduction: obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
5. Data discretization: part of data reduction, replacing numerical attributes with nominal
ones.
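
As a small illustration of the normalization operation mentioned in task 3, the following sketch
rescales a numeric attribute to the [0, 1] range using min-max normalization. The income values
are made-up sample data used only for illustration.

# Minimal sketch of min-max normalization (data transformation task).
# The income values below are made-up sample data for illustration only.
incomes = [30000, 56000, 72000, 41000, 98000]

lo, hi = min(incomes), max(incomes)
# Rescale each value v to (v - min) / (max - min), mapping the range onto [0, 1].
normalized = [(v - lo) / (hi - lo) for v in incomes]

print(normalized)  # e.g. 30000 -> 0.0, 98000 -> 1.0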
Data Cleaning
1. Missing Values
2. Noisy Data
3. Inconsistent Data
4. Data Cleaning as a Process
Data Cleaning
• Data cleaning is a technique that deals with detecting and removing inconsistencies and
errors from the data in order to obtain better-quality data.
• It involves identifying and correcting errors or inconsistencies in the data, such as missing
values, outliers, and duplicates.
• Various techniques can be used for data cleaning, such as imputation, removal, and
transformation.
• Data cleaning is performed as a data preprocessing step while preparing the data for a
data warehouse; good-quality data must pass a set of quality criteria.
• Those criteria include: accuracy, integrity, completeness, validity, consistency,
uniformity, density, and uniqueness.
• Real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
Missing Values
• Many tuples have no recorded value for some attributes, such as customer income, so we
need to fill in the missing values for these attributes.
• The following methods are useful for handling missing values:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective unless the
tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing values manually: This approach is time consuming and may not be
feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "unknown" or −∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of customers is $56,000. Use this value to replace the missing value
for income.
5. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in the data set, a decision
tree can be constructed to predict the missing values for income (see the imputation
sketch after this list).
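
The sketch below illustrates three of the strategies above (ignoring the tuple, filling with a
global constant, and filling with the attribute mean) on a small, hypothetical customer table;
the column names and values are invented for illustration, and pandas is assumed to be available.

# Sketch of three missing-value strategies using pandas; the DataFrame and
# its 'income' column are hypothetical examples, not data from the text.
import pandas as pd
import numpy as np

df = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                   "income": [48000, np.nan, 56000, np.nan]})

# 1. Ignore the tuple: drop rows where 'income' is missing.
dropped = df.dropna(subset=["income"])

# 3. Use a global constant: fill every missing income with a sentinel value such as -1.
constant_filled = df.fillna({"income": -1})

# 4. Use the attribute mean: replace missing incomes with the mean of the observed ones.
mean_filled = df.fillna({"income": df["income"].mean()})

print(mean_filled)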
Noisy Data
• Noise is a random error or variance in a measured variable.
• Noisy data is meaningless data that cannot be interpreted by machines.
• It can be generated due to faulty data collection instruments, data entry errors, etc.
• It can be handled in the following ways:
1. Binning Method: (Dividing data into clearly defined segments or bins)
• This method works on sorted data in order to smooth it.
• The whole data is divided into segments of equal size and then various methods
are performed to complete the task.
• Each segment is handled separately.
• One can replace all the data in a segment by its mean, or the segment boundary values
can be used to smooth the data (see the binning sketch after this list).
2. Regression: (Fitting data into multiple or linear regression functions)
• Here data can be made smooth by fitting it to a regression function.
• The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
3. Clustering: (Putting data into bundles of similar data)
• This approach groups the similar data in a cluster.
• Values that fall outside of the clusters may be considered outliers.
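
The following sketch shows equal-frequency binning with smoothing by bin means, one of the
smoothing options described above. The price values and the bin size of three are made-up
choices for illustration.

# Sketch of equal-frequency binning with smoothing by bin means.
# The price values are made-up sample data.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3  # three values per bin (equal-frequency / equal-depth bins)

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # Replace every value in the bin by the bin mean (smoothing by bin means).
    smoothed.extend([round(mean, 1)] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]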
Inconsistent Data
• Data inconsistency is a situation where there are multiple tables within a database that
deal with the same data but may receive it from different inputs.
• Inconsistencies may exist in the data stored in transaction records.
• Inconsistencies arise during data entry, from violated functional dependencies between
attributes, and from missing values.
• The inconsistencies can be detected and corrected either manually or with knowledge
engineering tools.
• Inconsistent data can lead to incorrect analysis, decision-making, and outcomes.
• For example:
1. One employee may enter customer addresses as "block 1/23," while another may
use "block 1-23" (see the normalization sketch after these examples).
2. if a company's financial records are inaccurate, it may result in incorrect tax
filings or legal compliance issues.
3. Incomplete data can lead to incorrect business decisions, missed opportunities, and
customer frustration.
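
One simple way to reduce such format inconsistencies is to map the conflicting variants onto a
single canonical form before comparing records. The sketch below normalizes the hypothetical
"block 1/23" and "block 1-23" address variants with a regular expression; the pattern is only an
illustrative assumption, not a general address cleaner.

# Sketch: normalizing inconsistent address formats onto one canonical form.
import re

def normalize_address(addr: str) -> str:
    # Rewrite "block 1/23" or "block 1-23" as "block 1-23" (one convention).
    return re.sub(r"block\s*(\d+)\s*[/-]\s*(\d+)", r"block \1-\2", addr.lower())

print(normalize_address("Block 1/23"))   # block 1-23
print(normalize_address("block 1-23"))   # block 1-23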
Data Cleaning as a Process
Data cleaning as a process is done in the following two steps:
a) Discrepancy detection
b) Data transformations
a) Discrepancy detection
• The first step in data cleaning is discrepancy detection.
• It uses knowledge of the metadata and examines the following rules for detecting
discrepancies:
1. Unique rules- each value of the given attribute must be different from all other
values for that attribute.
2. Consecutive rules – there can be no missing values between the lowest and highest
values for the attribute, and all values must also be unique.
3. Null rules – specify the use of blanks, question marks, special characters, or other
strings that may indicate the null condition.
• Discrepancy detection Tools:
1. Data scrubbing tools – use simple domain knowledge (e.g., knowledge of postal
addresses and spell-checking) to detect errors and make corrections in the data.
2. Data auditing tools – analyze the data to discover rules and relationships, and detect
data that violate such conditions (a rule-checking sketch follows).
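
The sketch below checks two of these rules (the unique rule and the null rule) on a hypothetical
list of attribute values; real data scrubbing and auditing tools automate such checks at a much
larger scale, and the null markers used here are assumptions for illustration.

# Sketch: checking the unique rule and the null rule on one attribute's values.
# The values list and the null markers are hypothetical examples.
from collections import Counter

values = ["C101", "C102", "C102", "?", "", "C104"]
NULL_MARKERS = {"", "?", "unknown", None}   # strings treated as null per the null rule

# Unique rule: every non-null value must occur exactly once.
counts = Counter(v for v in values if v not in NULL_MARKERS)
duplicates = [v for v, c in counts.items() if c > 1]

# Null rule: report how many entries use a null marker.
nulls = [v for v in values if v in NULL_MARKERS]

print("duplicate values:", duplicates)   # ['C102']
print("null entries:", len(nulls))       # 2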
b) Data transformations
• This is the second step in data cleaning as a process.
• After detecting discrepancies, we need to define and apply (a series of)
transformations to correct them.
• Data Transformations Tools:
1. Data migration tools – allow simple transformations to be specified, such as
replacing the string "gender" with "sex".
2. ETL (Extraction/Transformation/Loading) tools – allow users to specify
transformations through a graphical user interface (GUI).
Data Integration
1. Entity Identification Problem
2. Redundancy and Correlation Analysis
3. Tuple Duplication
4. Data Value Conflict Detection and Resolution
Data Integration
• Data integration is a data preprocessing technique that takes data from one or more
sources and maps it, field by field, onto a new data structure.
• The idea is to merge the data from multiple sources into a coherent data store. Data
mining often requires data integration – the merging of data from multiple data stores
into a coherent store, as in data warehousing.
• Data may be distributed over different databases or data warehouses.
• Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. These sources may include multiple databases, data cubes, or
flat files.
• Techniques such as record linkage and data fusion can be used for data integration. There
may also be a need to enhance the data with additional (external) data. Issues such as the
entity identification problem arise during integration.
• Issues in Data Integration
a) Schema integration & object matching.
b) Redundancy.
c) Detection & Resolution of data value conflict
Issues in Data Integration
• Schema integration and object matching: How can the data analyst or the computer be
sure that customer id in one database and customer number in another refer to the
same attribute?
• Redundancy: An attribute (such as annual revenue, for instance) may be redundant if it
can be derived from another attribute or set of attributes. Inconsistencies in attribute or
dimension naming can also cause redundancies in the resulting data set.
• Detection and resolution of data value conflicts: For the same real-world entity, attribute
values from different sources may differ.
Entity Identification Problem
• There are a number of issues to consider during data integration.
• Schema integration and object matching can be tricky.
• Integrate metadata from different sources.
• Matching real-world entities from multiple data sources is referred to as the entity
identification problem.
• Entity identification problem: identify real-world entities from multiple data sources, e.g.,
A.cust_id = B.cust_number.
• For example, how can the data analyst or the computer be sure that cust_id in one database
and cust_number in another refer to the same attribute?
• Examples of metadata for each attribute include the name, meaning, data type, and range
of values permitted for the attribute, and null rules for handling blank, zero, or null
values. Such metadata can be used to help avoid errors in schema integration.
• The metadata may also be used to help transform the data (e.g., where data codes for
pay_type in one database may be "H" and "S" but 1 and 2 in another; see the mapping
sketch below). Hence, this step also relates to data cleaning.
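
A minimal sketch of using such metadata during integration: the mapping below converts the
hypothetical pay_type codes "H" and "S" from one source into the 1 and 2 used by the other, so
that both sources share one encoding. The record layout is invented for illustration.

# Sketch: using metadata to reconcile coded values across two sources.
# The pay_type codes and records are hypothetical examples based on the scenario above.
PAY_TYPE_MAP = {"H": 1, "S": 2}   # source A codes -> source B codes

source_a_records = [{"emp_id": 10, "pay_type": "H"},
                    {"emp_id": 11, "pay_type": "S"}]

# Transform source A records so that pay_type uses source B's encoding.
unified = [{**rec, "pay_type": PAY_TYPE_MAP[rec["pay_type"]]} for rec in source_a_records]

print(unified)  # [{'emp_id': 10, 'pay_type': 1}, {'emp_id': 11, 'pay_type': 2}]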
Redundancy and Correlation Analysis
• Redundancy is another important issue in data integration.
• Redundant data often occur when integrating multiple databases.
• An attribute (such as annual revenue, for instance) may be redundant if it can be derived
or obtained from another attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause redundancies in the
resulting data set.
• Given two attributes, correlation analysis can measure how strongly one attribute implies
the other, based on the available data.
• Redundant data often occur when integrating multiple databases:
o Object identification: The same attribute or object may have different names in
different databases.
o Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue.
Handling Redundancy in Data Integration
• Redundant attributes may be detected by correlation analysis.
• Careful integration of the data from multiple sources may help reduce/avoid redundancies
and inconsistencies and improve mining speed and quality.
• For numerical attributes, we can evaluate/measure the correlation (linear relationship)
between two attributes, A and B, by computing the correlation coefficient. This is
correlation analysis.
• Ex– We have a data set having three attributes- person_name, is_male, is_female.
o is_male is 1 if the corresponding person is a male else it is 0 .
o is_female is 1 if the corresponding person is a female else it is 0.
• The following correlation analysis methods are used to detect redundancies:
o Χ2 Test/Correlation Analysis (Categorical Data) (Used for nominal Data or
categorical or qualitative data)
o Correlation coefficient and covariance/Correlation Analysis (Numerical Data) (Used
for numeric Data or quantitative data)
Χ2 Test/Correlation Analysis (Categorical Data)
• Also called the χ² (chi-square) test.
• It is used for qualitative (nominal or categorical) data.
• Suppose we have two attributes, X and Y, in the data set. To represent the counts of their
joint values, we build a contingency table.
• The formula used for the χ² test is
χ² = Σ (Observed − Expected)² / Expected, summed over all cells of the contingency table,
• where the observed values are the actual counts,
• and the expected values are the counts for the joint events computed from the contingency
table under the assumption that X and Y are independent.
• The χ² test examines the hypothesis that X and Y are independent.
• If this hypothesis can be rejected, we can conclude that X and Y are statistically related to
each other, and one of them (either X or Y) may be treated as redundant.
• The larger the χ² value, the more likely the variables are related (a worked sketch follows).
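
A small worked sketch of the χ² test using SciPy's chi2_contingency; the 2×2 contingency table
(e.g., gender vs. preferred reading) contains made-up counts, and a very small p-value would lead
us to reject the independence hypothesis.

# Sketch: chi-square (χ²) test of independence on a hypothetical contingency table.
from scipy.stats import chi2_contingency

# Rows: gender (male, female); columns: preferred reading (fiction, non_fiction).
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi2 =", round(chi2, 1), "p-value =", p_value)

# A very small p-value means the independence hypothesis is rejected,
# i.e. the two attributes are statistically related (one may be redundant).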
Correlation coefficient & covariance
• Used for numeric or quantitative data – correlation analysis (numerical data).
• In this test, the relation between attributes A and B is computed with Pearson's
product-moment coefficient, also called the correlation coefficient.
• A correlation coefficient measures the extent to which the value of one variable changes
with another. It is also called Pearson’s product moment correlation coefficient
(PPMCC/PCC).
• PPMCC or PCC is a measure of the linear relationship between two variables that have
been measured on interval or ratio scales.
• There are several different correlation coefficients, each of which is appropriate for
different types of data. The most common is the Pearson r, used for continuous variables.
• It is a statistic that measures the degree to which one variable varies in tandem with
another. It ranges from -1 to +1.
• rA,B > 0 &+1 correlation means that as one variable rises, the other rises proportionally;
• rA,B < 0 & -1 correlation means that as one rises, the other falls proportionally.
• rA,B = 0 correlation means there is no relationship between the movements of the two
variables.
• The formula used to calculate the correlation coefficient for numeric data is
rA,B = Σ (ai − Ā)(bi − B̄) / (n · σA · σB),
• where n = number of tuples, ai and bi = the values of A and B in tuple i, Ā and B̄ = the
mean values of A and B, and σA and σB = the standard deviations of A and B.
• The greater the absolute value of the correlation coefficient, the more strongly the
attributes are correlated with each other, and one of them (either A or B) can be removed
as redundant. If the correlation coefficient is zero, the attributes are independent (there is
no linear relationship between them). If the correlation coefficient is negative, the
attributes are negatively correlated: as the value of one attribute increases, the value of the
other attribute decreases (see the numeric sketch below).
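
A short sketch computing the Pearson correlation coefficient for two numeric attributes, first
directly from the formula above and then with NumPy's built-in routine; the two value vectors are
made-up sample data.

# Sketch: Pearson correlation coefficient between two numeric attributes.
import numpy as np

a = np.array([6, 8, 10, 14, 18], dtype=float)    # e.g. years of experience (sample data)
b = np.array([30, 36, 40, 52, 60], dtype=float)  # e.g. salary in $1000s (sample data)

# Direct implementation of rA,B = sum((a - mean_a) * (b - mean_b)) / (n * std_a * std_b).
n = len(a)
r_manual = np.sum((a - a.mean()) * (b - b.mean())) / (n * a.std() * b.std())

# The same value from NumPy's built-in correlation matrix.
r_numpy = np.corrcoef(a, b)[0, 1]

print(round(r_manual, 4), round(r_numpy, 4))  # both close to +1: strong positive correlation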
Why Correlation Analysis is Important
• Correlation analysis can reveal meaningful relationships between different metrics or
groups of metrics.
• Information about those connections can provide new insights and reveal
interdependencies, even if the metrics come from different parts of the business.
• Suppose there is a strong correlation between two variables or metrics, and one of them is
observed behaving in a particular way.
• In that case, you can conclude that the other one is likely being affected similarly.
• This helps group related metrics together to reduce the need for individual data
processing.
• Advantages of data redundancy include:
a) Increased data availability and reliability, as there are multiple copies of the data that
can be used in case the primary copy is lost or becomes unavailable.
b) Improved data integrity, as multiple copies of the data can be compared to detect and
correct errors.
c) Increased fault tolerance, as the system can continue to function even if one copy of
the data is lost or corrupted.
• Disadvantages of data redundancy include:
a) Increased storage requirements, as multiple copies of the data must be maintained.
b) Increased complexity of the system, as managing multiple copies of the data can be
difficult and time-consuming.
c) Increased risk of data inconsistencies, as multiple copies of the data may become out
of sync if updates are not properly propagated to all copies.
d) Reduced performance, as the system may have to perform additional work to maintain
and access multiple copies of the data.
Tuple Duplication
• In addition to detecting redundancies between attributes, duplication should also be
detected at the tuple level (e.g., where there are two or more identical tuples for a given
unique data entry case).
• The use of de-normalized tables (often done to improve performance by avoiding joins) is
another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data entry or
updating some but not all data occurrences.
• For example, if a purchase order database contains attributes for the purchaser's name and
address instead of a key to this information in a purchaser database, discrepancies can
occur, such as the same purchaser's name appearing with different addresses within the
purchase order database.
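
A brief sketch of detecting and removing duplicate tuples with pandas; the purchase-order rows
are hypothetical and chosen so that one row is an exact duplicate of another.

# Sketch: detecting duplicate tuples (identical rows) with pandas.
# The purchase-order rows below are hypothetical sample data.
import pandas as pd

orders = pd.DataFrame({
    "purchaser": ["A. Shah", "A. Shah", "B. Lee"],
    "address":   ["12 Park Rd", "12 Park Rd", "7 Hill St"],
    "amount":    [120, 120, 300],
})

print(orders.duplicated())           # True for the second, identical row
deduplicated = orders.drop_duplicates()
print(deduplicated)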
Data Value Conflict Detection and Resolution
• Data integration also involves the detection and resolution of data value conflicts.
• For example, for the same real-world entity, attribute values from different sources may
differ.
• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• This may be due to differences in representation, scaling, or encoding.
• Attributes may also differ in their abstraction level, where an attribute in one system is
recorded at, say, a lower level of abstraction than the "same" attribute in another.
• For example, the total_sales in one database may refer to one branch of All_Electronics,
while an attribute of the same name in another database may refer to the total sales for
All_Electronics stores in a given region.
• Possible reasons for data value conflicts: different representations, different scales, e.g.,
metric vs. British units.
• Examples,
o a weight attribute may be stored in metric units in one system and British imperial
units in another (see the conversion sketch after these examples).
o For a hotel chain, the price of rooms in different cities may involve not only different
currencies but also different services (e.g., free breakfast) and taxes.
o When exchanging information between schools, for example, each school may have
its own curriculum and grading scheme
o One university may adopt a quarter system, offer three courses on database systems,
and assign grades from A+ to F, whereas another may adopt a semester system, offer
two courses on databases, and assign grades from 1 to 10.
o It is difficult to work out precise course-to-grade transformation rules between the two
universities, making information exchange difficult.
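
A minimal sketch of resolving the weight-unit conflict mentioned above by converting every record
to one unit before integration; the records and attribute names are hypothetical, while the
pounds-to-kilograms factor is the standard conversion.

# Sketch: resolving a unit conflict by converting all weights to kilograms.
# Source A stores weight in kilograms, source B in pounds (hypothetical records).
LB_TO_KG = 0.45359237  # standard conversion factor

source_a = [{"item": "monitor", "weight_kg": 5.2}]
source_b = [{"item": "printer", "weight_lb": 11.0}]

unified = list(source_a)
for rec in source_b:
    unified.append({"item": rec["item"],
                    "weight_kg": round(rec["weight_lb"] * LB_TO_KG, 2)})

print(unified)  # both records now use the same unit and attribute name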
**********************************************************************************