DATA (PRE-)PROCESSING
In the Previous Class,
• We discussed various types of data, with examples
In this Class,
• We focus on data pre-processing – “an important
  milestone of the Data Mining Process”
Data analysis pipeline
 Mining is not the only step in the analysis process
 Preprocessing: real data is noisy, incomplete and inconsistent.
   Data cleaning is required to make sense of the data
    Techniques: Sampling, Dimensionality Reduction, Feature Selection.
 Post-Processing: Make the data actionable and useful to the user
    Statistical analysis of importance & Visualization.
Why Preprocess the Data
Measures for Data Quality: A Multidimensional View
Accuracy: Correct or Wrong, Accurate or Not
Completeness: Not recorded, unavailable,…
Consistency: some entries modified but others not, ...
Timeliness: Timely update?
Believability: how much can the data be trusted to be correct?
Interpretability: how easily the data can be understood?
Why Data Preprocessing?
• Data in the real world is dirty
   • incomplete: lacking attribute values, lacking certain attributes
     of interest, or containing only aggregate data
   • noisy: containing errors or outliers
   • inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
   • Quality decisions must be based on quality data
   • Data warehouse needs consistent integration of quality data
   • Required for both OLAP and Data Mining!
Why Can Data Be Incomplete?
  Attributes of interest are not available (e.g., customer
  information for sales transaction data)
  Data were not considered important at the time of
  transactions, so they were not recorded!
  Data were not recorded because of misunderstandings or
  malfunctions
  Data may have been recorded and later deleted!
  Missing/unknown values for some data
Attribute Values
Data is described using attribute values
Attribute values are numbers or symbols assigned to an attribute
 Distinction between attributes and attribute values
    Same attribute can be mapped to different attribute values
      Example: height can be measured in feet or meters
 Different attributes can be mapped to the same set of
values
     Example: Attribute values for ID and age are integers
 But properties of attribute values can be different
 ID has no limit but age has a maximum and minimum value
Types of Attributes
There are different types of attributes
 Nominal
   Examples: ID numbers, eye color, zip codes
 Ordinal
   Examples: rankings (e.g., taste of potato chips on a
     scale from 1-10), grades, height in {tall, medium,
     short}
 Interval
   Examples: calendar dates
 Ratio
   Examples: length, time, counts
Discrete and Continuous Attributes
 Discrete Attribute
    Has only a finite or countably infinite set of values
    Examples: zip codes, counts, or the set of words in a
    collection of documents
    Often represented as integer variables.
 Continuous Attribute
    Has real numbers as attribute values
    Examples: temperature, height, or weight.
    Practically, real values can only be measured and
    represented using a finite number of digits.
Data Preprocessing
Major Tasks in Data Preprocessing
   • Data cleaning
      • Fill in missing values, smooth noisy data, identify or remove outliers
        (outliers = exceptions!), and resolve inconsistencies
   • Data integration
      • Integration of multiple databases, data cubes, or files
   • Data transformation
      • Normalization and aggregation
   • Data reduction
      • Obtains reduced representation in volume but produces the same or
        similar analytical results
   • Data discretization
      • Part of data reduction but with particular importance, especially for
        numerical data
Forms of data preprocessing
   Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g.,
instrument faults, human or computer errors, transmission errors
       • Data cleaning tasks
           • Fill in missing values
           • Identify outliers and smooth out noisy data
           • Correct inconsistent data
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the
   task is classification) – not effective when the percentage of missing values per
   attribute varies considerably.
 • Fill in the missing value manually: tedious + infeasible?
 • Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
 • Use the attribute mean to fill in the missing value
 • Use the attribute mean for all samples belonging to the same class to fill in the
   missing value: smarter
 • Use the most probable value to fill in the missing value: inference-based such
   as Bayesian formula or decision tree
How to Handle Missing Data? (Example)
               Age   Income Religion Gender
               23    24,200   Muslim      M
               39    ?        Christian   F
               45    45,390   ?           F
 Fill missing values using aggregate functions (e.g., average) or probabilistic estimates
 on global value distribution
 E.g., put the average income here, or put the most probable income based on the fact
 that the person is 39 years old
 E.g., put the most frequent religion here
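A minimal pandas sketch of the two fills described above (attribute mean for a numeric column, most frequent value for a categorical one); the DataFrame below simply mirrors the toy table and is an illustrative assumption, not part of the lecture material.

    import pandas as pd

    # Toy data mirroring the table above; None marks the missing values.
    df = pd.DataFrame({
        "Age":      [23, 39, 45],
        "Income":   [24200, None, 45390],
        "Religion": ["Muslim", "Christian", None],
        "Gender":   ["M", "F", "F"],
    })

    # Numeric attribute: fill with the attribute mean (a global aggregate).
    df["Income"] = df["Income"].fillna(df["Income"].mean())

    # Categorical attribute: fill with the most frequent value
    # (the mode; ties are broken by order).
    df["Religion"] = df["Religion"].fillna(df["Religion"].mode()[0])

    print(df)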
Data Quality
 Data has attribute values
 Then,
 How good is our data with respect to these attribute
 values?
Data Quality
 Examples of data quality problems:
   Noise and outliers
   Missing values
   Duplicate data
Data Quality: Noise
   Noise refers to the modification of original values
   Examples: distortion of a person’s voice when talking on a poor
   phone connection
Data Quality: Outliers
 Outliers are data objects with characteristics that are
considerably different than most of the other data objects in the
data set
Data Quality: Missing Values
      Reasons for missing values
     • Information is not collected
       (e.g., people decline to give their age and weight)
     • Attributes may not be applicable to all cases (e.g.,
       annual income is not applicable to children)
  • Handling missing values
    •Eliminate Data Objects
    •Estimate Missing Values
    •Ignore the Missing Value During Analysis
    •Replace with all possible values (weighted by their
    probabilities)
Data Quality: Duplicate Data
Data sets may include data objects that are duplicates, or
almost duplicates, of one another
 Major issue when merging data from heterogeneous
sources
    Examples:
      Same person with multiple email addresses
   Data cleaning
   Process of dealing with duplicate-data issues
Data Quality: Handle Noise (Binning)
 Binning
    sort data and partition into (equi-depth) bins
    smooth by bin means, bin medians, bin boundaries, etc.
 Regression
    smooth by fitting a regression function
 Clustering
    detect and remove outliers
 Combined computer and human inspection
    detect suspicious values automatically and have them
    checked by a human
Data Quality: Handle Noise (Binning)
 Equal-width binning
    Divides the range into N intervals of equal size;
    width of intervals: W = (B – A)/N
    Simple, but outliers may dominate the result
 Equal-depth binning
    Divides the range into N intervals,
    each containing approximately the same number of records
    Skewed data is handled well
Data Quality: Handle Noise (Binning)
     Example: Sorted price values 4, 8, 9, 15, 21, 21, 24, 25, 26,
        28, 29, 34
        • Partition into three (equi-depth) bins
        - Bin 1: 4, 8, 9, 15
        - Bin 2: 21, 21, 24, 25
        - Bin 3: 26, 28, 29, 34
        • Smoothing by bin means
         - Bin 1: 9, 9, 9, 9
         - Bin 2: 23, 23, 23, 23
         - Bin 3: 29, 29, 29, 29
        • Smoothing by bin boundaries
        - Bin 1: 4, 4, 4, 15
        - Bin 2: 21, 21, 25, 25
        - Bin 3: 26, 26, 26, 34
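A small Python sketch reproducing the equal-depth partitioning and smoothing-by-means steps of the example above; the bin depth of 4 is taken from the example, and this is a sketch rather than a general implementation.

    # Sorted price values from the example above.
    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

    # Equal-depth (equi-frequency) partitioning: 3 bins, 4 values each.
    depth = len(prices) // 3
    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # Smoothing by bin means: replace every value by its bin's (rounded) mean.
    smoothed = [[round(sum(b) / len(b)) for _ in b] for b in bins]

    print(bins)      # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
    print(smoothed)  # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]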
Data Quality: Handle Noise (Regression)
•Replace noisy or missing values by
predicted values
•Requires model of attribute
dependencies (maybe wrong!)
•Can be used for data smoothing or
for handling missing data
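A minimal sketch of regression-based smoothing, assuming one clean attribute x and one noisy attribute y and fitting a simple linear model with NumPy; the numbers are illustrative only, not data from the lecture.

    import numpy as np

    # Hypothetical paired attributes: x is assumed clean, y is noisy.
    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.7, 30.0])   # 30.0 looks like noise

    # Fit a simple linear regression y ~ a*x + b.
    a, b = np.polyfit(x, y, deg=1)

    # Smooth: replace each y by its predicted value on the fitted line.
    y_smoothed = a * x + b
    print(y_smoothed)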
Data Integration
The process of combining multiple sources into a single dataset. The Data
integration process is one of the main components in data management.
Data integration:
   Combines data from multiple sources into a coherent store
Schema integration integrate metadata from different sources
   metadata: data about the data (i.e., data descriptors)
Entity identification problem: identify real-world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts: for the same real-
world entity, attribute values from different sources are
different (e.g., J. D. Smith and John Smith may refer to the same
person)
   possible reasons: different representations, different scales,
   e.g., metric vs. British units (inches vs. cm)
Handling Redundant Data in Data Integration
      • Redundant data often occur when integrating multiple databases
        • The same attribute may have different names in different
         databases
        • One attribute may be a “derived” attribute in another table, e.g.,
         annual revenue
      • Redundant data may be detected by correlation analysis (see the sketch below)
     • Careful integration of the data from multiple sources may help
      reduce/avoid redundancies and inconsistencies and improve mining
      speed and quality
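A hedged sketch of detecting redundancy by correlation analysis with pandas; the column names (monthly_revenue, annual_revenue, num_employees) and the 0.95 threshold are illustrative assumptions, not part of the lecture.

    import pandas as pd

    # Hypothetical integrated table: annual_revenue is derived from
    # monthly_revenue, so the two attributes are redundant.
    df = pd.DataFrame({
        "monthly_revenue": [10, 20, 15, 30, 25],
        "annual_revenue":  [120, 240, 180, 360, 300],
        "num_employees":   [3, 9, 4, 12, 7],
    })

    # Pearson correlation matrix; values near +/-1 suggest redundancy.
    corr = df.corr()
    print(corr)

    # Flag attribute pairs whose absolute correlation exceeds a threshold.
    threshold = 0.95
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                print("possibly redundant:", cols[i], "and", cols[j])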
Data Transformation
 • Smoothing: remove noise from data
 • Aggregation: summarization, data cube construction
 • Generalization: concept hierarchy climbing
 • Normalization: scaled to fall within a small, specified range
    • min-max normalization
    • z-score normalization
    • normalization by decimal scaling
 • Attribute/feature construction
    • New attributes constructed from the given ones
Normalization: Why normalization?
  • Speeds up some learning techniques (e.g., neural networks)
  • Helps prevent attributes with large ranges from outweighing ones
    with small ranges
    • Example:
       • income has range 3000-200000
       • age has range 10-80
       • gender has domain M/F
Data Transformation
Data has attribute values
Then,
Can we compare these attribute values?
For example, compare the following two pairs of records:
(1) (5.9 ft, 50 kg)
(2) (4.6 ft, 55 kg)
       vs.
(3) (5.9 ft, 50 kg)
(4) (5.6 ft, 56 kg)
We need data transformation to make records with different
dimensions (attributes) comparable ...
Data Transformation Techniques
Normalization: scaled to fall within a small,
specified range.
    min-max normalization
    z-score normalization
    normalization by decimal scaling
   Centralization:
     Based on fitting a distribution to the data
     Distance function between distributions
   KL Distance
   Mean Centering
Data Transformation: Normalization
Example: Data Transformation
- Assume min and max values for height and weight.
- Now, apply min-max normalization to both attributes, as given
  below
(1) (5.9 ft, 50 kg)
(2) (4.6 ft, 55 kg)
       vs.
(3) (5.9 ft, 50 kg)
(4) (5.6 ft, 56 kg)
- Compare your results...
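One possible way to carry out this exercise in Python, assuming illustrative min/max bounds for height (4–7 ft) and weight (40–100 kg); these bounds are assumptions, not values given in the slides.

    # Assumed attribute ranges (illustrative only).
    H_MIN, H_MAX = 4.0, 7.0     # height in ft
    W_MIN, W_MAX = 40.0, 100.0  # weight in kg

    def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
        # Min-max normalization: map v from [lo, hi] onto [new_lo, new_hi].
        return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

    records = [(5.9, 50), (4.6, 55), (5.9, 50), (5.6, 56)]
    normalized = [(min_max(h, H_MIN, H_MAX), min_max(w, W_MIN, W_MAX))
                  for h, w in records]

    for raw, norm in zip(records, normalized):
        print(raw, "->", tuple(round(x, 2) for x in norm))

    # After normalization both attributes lie in [0, 1], so differences in
    # height and weight contribute on a comparable scale.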
Data Transformation: Aggregation
  Combining two or more attributes (or objects) into a
single attribute (or object)
 Purpose
 Data reduction
    Reduce the number of attributes or objects
 Change of scale
    Cities aggregated into regions, states, countries,
   etc
 More “stable” data
    Aggregated data tends to have less variability
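A minimal pandas sketch of aggregation as a change of scale (city-level rows rolled up to one row per region); the city/region/sales columns and values are hypothetical.

    import pandas as pd

    # Hypothetical city-level sales records.
    sales = pd.DataFrame({
        "city":   ["Pune", "Mumbai", "Nagpur", "Lyon", "Paris"],
        "region": ["Maharashtra", "Maharashtra", "Maharashtra", "France", "France"],
        "sales":  [120, 300, 80, 150, 400],
    })

    # Change of scale: aggregate city-level rows into one row per region.
    by_region = sales.groupby("region", as_index=False)["sales"].sum()
    print(by_region)   # fewer objects, coarser (more stable) values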
Data Transformation: Discretization
 Motivation for Discretization
 Some data mining algorithms only accept
categorical attributes
 May improve understandability of patterns
Data Transformation: Discretization
  Task
 • Reduce the number of values for a given
    continuous attribute by partitioning the range of
    the attribute into intervals
 • Interval labels replace actual attribute values
  Methods
 • Binning (as explained earlier)
 • Cluster analysis (will be discussed later)
 • Entropy-based Discretization (Supervised)
Simple Discretization Methods: Binning
     Equal-width (distance) partitioning:
        Divides the range into N intervals of equal size: uniform grid
        If A and B are the lowest and highest values of the attribute, the
        width of intervals will be: W = (B – A)/N
        The most straightforward, but outliers may dominate the presentation
        Skewed data is not handled well
     Equal-depth (frequency) partitioning:
        Divides the range into N intervals, each containing approximately
        the same number of samples
        Good data scaling
        Managing categorical attributes can be tricky
 Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining
may take a very long time to run on the complete data set
• Data reduction
   • Obtains a reduced representation of the data set that is much
     smaller in volume but yet produces the same (or almost the same)
     analytical results
• Data reduction strategies
   • Data cube aggregation
   • Dimensionality reduction
   • Data compression
   • Numerosity reduction
   • Discretization and concept hierarchy generation
Techniques of Data Reduction
  Techniques or methods of data reduction in data mining, such
  as
    Dimensionality Reduction
• The number of random variables or
  attributes is reduced so that the
  dimensionality of the data set is
  reduced.
• Attributes of the data are combined and
  merged without losing their original
  characteristics. This also helps reduce
  storage space and computation time.
Numerosity Reduction:
Reduce the volume of data
The representation of the data is made smaller by reducing its volume,
while preserving the essential information.
• Parametric methods
   • Assume the data fits some model, estimate model parameters, store
     only the parameters, and discard the data (except possible outliers)
   • Log-linear models: obtain the value at a point in m-D space as the
     product of values on appropriate marginal subspaces
• Non-parametric methods
   • Do not assume models
   • Major families: histograms, clustering, sampling
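As a quick illustration of a non-parametric method, a simple random sample can stand in for the full data set; the sketch below uses pandas and an arbitrary 1% sampling fraction (both the table and the fraction are assumptions).

    import pandas as pd

    # Hypothetical large table.
    df = pd.DataFrame({"value": range(1_000_000)})

    # Simple random sample without replacement: keep 1% of the rows.
    sample = df.sample(frac=0.01, random_state=42)
    print(len(sample))   # 10,000 rows represent the full data set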
Data Cube Aggregation
• The lowest level of a data cube
   • the aggregated data for an individual entity of interest
   • e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
   • Further reduce the size of data to deal with
• Reference appropriate levels
   • Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using
 data cube, when possible
Data Compression
[Figure: the original data can either be compressed losslessly (fully
 recoverable) or approximated by lossy compression.]
Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each
  bucket
• Can be constructed optimally in one dimension using dynamic
  programming
• Related to quantization problems
[Figure: example histogram over buckets ranging from 10000 to 90000.]
Histogram types
• Equal-width histograms:
   • It divides the range into N intervals of equal size
• Equal-depth (frequency) partitioning:
   • It divides the range into N intervals, each containing approximately same number of
    samples
• V-optimal:
   • It considers all histogram types for a given number of buckets and chooses the one
    with the least variance.
• MaxDiff:
   • After sorting the data to be approximated, it defines the borders of the buckets at
    points where the adjacent values have the maximum difference
      • Example: split 1,1,4,5,5,7,9,14,16,18,27,30,30,32 to three buckets
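A short Python sketch of the MaxDiff idea applied to the example above: the two largest gaps between adjacent sorted values become the bucket borders. This is a sketch of the border-selection step only, not a reference implementation.

    # Sorted values from the example above; target: 3 buckets.
    data = [1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32]
    n_buckets = 3

    # Differences between adjacent values; the largest (n_buckets - 1)
    # gaps define the bucket borders.
    gaps = [(data[i + 1] - data[i], i) for i in range(len(data) - 1)]
    borders = sorted(i for _, i in sorted(gaps, reverse=True)[:n_buckets - 1])

    buckets, start = [], 0
    for b in borders:
        buckets.append(data[start:b + 1])
        start = b + 1
    buckets.append(data[start:])

    print(buckets)  # [[1, 1, 4, 5, 5, 7, 9], [14, 16, 18], [27, 30, 30, 32]]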
       Clustering
• Partitions data set into clusters, and models it by one representative
 from each cluster
• Can be very effective if data is clustered but not if data is “smeared”
• There are many choices of clustering definitions and clustering
 algorithms
           Cluster Analysis
[Figure: salary vs. age scatter plot; the distance between points in the
 same cluster should be small, while an outlier lies far from every cluster.]
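A hedged sketch of clustering-based reduction with scikit-learn's KMeans: the data set is modelled by one representative (the centroid) per cluster. The synthetic age/salary data and the choice of k = 2 are assumptions made for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical 2-D data (age, salary) drawn from two groups.
    rng = np.random.default_rng(0)
    data = np.vstack([
        rng.normal(loc=(25, 30_000), scale=(2, 2_000), size=(100, 2)),
        rng.normal(loc=(45, 80_000), scale=(3, 5_000), size=(100, 2)),
    ])

    # Model the data by one representative (centroid) per cluster.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    representatives = kmeans.cluster_centers_
    print(representatives)   # 200 points reduced to 2 representatives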
Hierarchical Reduction
  • Use multi-resolution structure with different degrees of reduction
  • Hierarchical clustering is often performed but tends to define
   partitions of data sets rather than “clusters”
  • Parametric methods are usually not amenable to hierarchical
   representation
  • Hierarchical aggregation
     • An index tree hierarchically divides a data set into partitions by value range
       of some attributes
     • Each partition can be considered as a bucket
     • Thus an index tree with aggregates stored at each node is a hierarchical
       histogram
Discretization
   • Three types of attributes:
     • Nominal — values from an unordered set
     • Ordinal — values from an ordered set
     • Continuous — real numbers
   • Discretization:
     • divide the range of a continuous attribute into intervals
     • why?
        • Some classification algorithms only accept categorical
          attributes.
        • Reduce data size by discretization
        • Prepare for further analysis
Discretization and Concept hierarchy
   • Discretization
      • reduce the number of values for a given continuous attribute
       by dividing the range of the attribute into intervals. Interval
       labels can then be used to replace actual data values.
   • Concept hierarchies
      • reduce the data by collecting and replacing low level concepts
       (such as numeric values for the attribute age) by higher level
       concepts (such as young, middle-aged, or senior).
Discretization and concept hierarchy
generation for numeric data
• Binning/Smoothing
• Histogram analysis
• Clustering analysis
• Entropy-based discretization
• Segmentation by natural partitioning
Entropy-Based Discretization
      • Entropy of an interval S1 whose samples fall into m classes:
                 Ent(S1) = – Σ_{i=1..m} p_i log2(p_i)
      • Given a set of samples S, if S is partitioned into two
        intervals S1 and S2 using boundary T, the expected information
        requirement after partitioning is
                 I(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
      • The boundary that minimizes I(S, T) (equivalently, maximizes the
        information gain Ent(S) – I(S, T)) over all possible boundaries is
        selected as a binary discretization.
      • The process is recursively applied to the partitions obtained
        until some stopping criterion is met, e.g., the information gain
        drops below a threshold δ:
                 Ent(S) – I(S, T) < δ
      • Experiments show that it may reduce data size and improve
       classification accuracy
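A small Python sketch of the boundary-selection step: it evaluates every candidate boundary T (midpoints between adjacent sorted values) and keeps the one minimizing I(S, T). The toy values and class labels are assumed for illustration, and the recursion and stopping criterion are omitted.

    import math
    from collections import Counter

    def entropy(labels):
        # Ent(S) = -sum_i p_i * log2(p_i) over the class distribution of S.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_boundary(values, labels):
        # Pick the boundary T minimizing
        # I(S,T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2),
        # i.e. maximizing the information gain Ent(S) - I(S,T).
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = None
        for i in range(1, n):
            t = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for _, lab in pairs[:i]]
            right = [lab for _, lab in pairs[i:]]
            info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
            if best is None or info < best[1]:
                best = (t, info)
        return best

    # Hypothetical attribute values with class labels.
    values = [1, 2, 3, 10, 11, 12]
    labels = ["no", "no", "no", "yes", "yes", "yes"]
    t, info = best_boundary(values, labels)
    print("boundary:", t, "I(S,T):", info, "gain:", entropy(labels) - info)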
Segmentation by natural partitioning
   Users often like to see numerical ranges partitioned into
    relatively uniform, easy-to-read intervals that appear intuitive or
    “natural”. E.g., [50-60] better than [51.223-60.812]
       The 3-4-5 rule can be used to segment numerical data into
       relatively uniform, “natural” intervals.
       * If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit,
          partition the range into 3 equiwidth intervals for 3,6,9 or 2-3-2 for 7
       * If it covers 2, 4, or 8 distinct values at the most significant digit, partition the
          range into 4 equiwidth intervals
       * If it covers 1, 5, or 10 distinct values at the most significant digit, partition the
          range into 5 equiwidth intervals
     The rule can be recursively applied for the resulting intervals
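A loose Python sketch of the interval-count logic of the 3-4-5 rule (how many equi-width intervals to use, plus the 2-3-2 case for 7 distinct values); rounding the range to the most significant digit is a simplifying assumption here, and the recursive refinement of sub-intervals is omitted.

    import math

    def three_four_five(low, high):
        # Magnitude of the most significant digit of the range.
        msd = 10 ** math.floor(math.log10(high - low))
        lo, hi = math.floor(low / msd) * msd, math.ceil(high / msd) * msd
        distinct = round((hi - lo) / msd)   # distinct values at the msd

        if distinct == 7:                   # 2-3-2 split
            return [lo, lo + 2 * msd, lo + 5 * msd, hi]
        if distinct in (3, 6, 9):
            n = 3
        elif distinct in (2, 4, 8):
            n = 4
        else:                               # 1, 5, or 10 distinct values
            n = 5
        width = (hi - lo) / n
        return [lo + i * width for i in range(n + 1)]

    print(three_four_five(0, 9000))  # [0.0, 3000.0, 6000.0, 9000.0]: 9 values -> 3 intervals
    print(three_four_five(0, 7000))  # [0, 2000, 5000, 7000]: 7 values -> 2-3-2 split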
Python - Data Wrangling
Data wrangling is the process of cleaning and unifying messy and complex data sets
for easy access and analysis
Working with raw data sucks.
•   Data comes in all shapes and sizes – CSV files, PDFs, stone
    tablets, .jpg…
•   Different files have different formatting – spaces instead of
    NULLs, extra rows
•   “Dirty” data – unwanted anomalies – duplicates
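A minimal pandas clean-up sketch addressing exactly these problems (blank strings instead of NULLs, empty extra rows, duplicates); the tiny DataFrame and its columns are made up for illustration.

    import numpy as np
    import pandas as pd

    # Hypothetical messy CSV-like data: blank strings instead of NULLs,
    # an empty extra row, and a duplicate record.
    df = pd.DataFrame({
        "name":  ["Alice", "Bob", " ", "Bob", "Carol"],
        "email": ["a@x.com", "b@x.com", " ", "b@x.com", ""],
    })

    # Treat blank / whitespace-only strings as missing values.
    df = df.replace(r"^\s*$", np.nan, regex=True)

    # Drop rows that are entirely empty, then drop exact duplicates.
    df = df.dropna(how="all").drop_duplicates()
    print(df)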
Principal Component Analysis
Principal Component Analysis, or PCA, is a
dimensionality-reduction method that is often used to
reduce the dimensionality of large data sets, by
transforming a large set of variables into a smaller one
that still contains most of the information in the large
set.
Principal Component Analysis or
Karhunen-Loève (K-L) method
• Given N data vectors from k-dimensions, find c <= k
 orthogonal vectors that can be best used to represent data
  • The original data set is reduced to one consisting of N data vectors
    on c principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal
 component vectors
• Works for numeric data only
• Used when the number of dimensions is large
Principal Component Analysis
   X1, X2: original axes (attributes)   X2
   Y1,Y2: principal components
                                                          Y1
                  Y2
                                                   significant component
                                                   (high variance)
                                                                X1
  Order principal components by significance and eliminate weaker ones
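A hedged scikit-learn sketch of PCA as described above: N data vectors in k = 3 dimensions are reduced to c = 2 principal components ordered by explained variance. The synthetic, correlated data is an assumption made for illustration.

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data: N = 200 vectors in k = 3 dimensions (numeric only);
    # the second dimension is strongly correlated with the first.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(200, 1))
    X = np.hstack([base,
                   2 * base + 0.1 * rng.normal(size=(200, 1)),
                   rng.normal(size=(200, 1))])

    # Keep c = 2 orthogonal principal components ordered by variance.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(pca.explained_variance_ratio_)  # significance of each component
    print(X_reduced.shape)                # (200, 2): reduced representation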