0% found this document useful (0 votes)
56 views13 pages

Preprocessing

Data preprocessing involves cleaning, transforming, and reducing raw data to prepare it for modeling. The key steps are data cleaning to handle missing values, noisy data, and inconsistencies; data integration and transformation through normalization, aggregation, and dimensionality reduction; and data reduction to obtain a smaller yet representative version of the dataset to speed up analysis. The overall goal is to improve data quality for obtaining high quality modeling results.

Uploaded by

mba20238
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
56 views13 pages

Preprocessing

Data preprocessing involves cleaning, transforming, and reducing raw data to prepare it for modeling. The key steps are data cleaning to handle missing values, noisy data, and inconsistencies; data integration and transformation through normalization, aggregation, and dimensionality reduction; and data reduction to obtain a smaller yet representative version of the dataset to speed up analysis. The overall goal is to improve data quality for obtaining high quality modeling results.

Uploaded by

mba20238
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 13

Data Preprocessing

Why Data Preprocessing?


• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, files, or notes
• Data transformation
– Normalization (scaling to a specific range)
– Aggregation
Data Cleaning

• Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Source of data
• Most mention.
• Mean/average .
• Sign ?
• Delete
• Smoothing.
Noisy Data
• Q: What is noise?
• A: Random error in a measured variable.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which requires data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin
boundaries, etc.
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer and
human inspection
– detect suspicious values and check manually
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic using various tools
– To detect violation of known functional
dependencies and data constraints
– To correct redundant data
Data Transformation
• Smoothing: remove noise from data (binning,
clustering, regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small,
specified range
Data Reduction
• Problem:
Data Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set

• Solution?
– Data reduction…
Data Reduction
•Obtains a reduced representation of the data
set that is much smaller in volume but yet
produces the same (or almost the same)
analytical results
•Data reduction strategies
–Data cube aggregation
–Dimensionality reduction
–Data compression
–Discretization and concept hierarchy generation

You might also like