UNIT-2 Data Preprocessing
Lecture                  Topic
**********************************************
Lecture-13               Why preprocess the data?
Lecture-14               Data cleaning
Lecture-15               Data integration and transformation
Lecture-16               Data reduction
Lecture-17               Discretization and concept
                         hierarchy generation
       Lecture-13
Why preprocess the data?
   Data in the real world is:
   
       incomplete: lacking attribute values, lacking certain
       attributes of interest, or containing only aggregate
       data
   
       noisy: containing errors or outliers
   
       inconsistent: containing discrepancies in codes or
       names
   No quality data, no quality mining results!
   
       Quality decisions must be based on quality data
   
       Data warehouse needs consistent integration of
       quality data
Multi-Dimensional Measure of Data Quality
   A well-accepted multidimensional view:
   
       Accuracy
   
       Completeness
   
       Consistency
   
       Timeliness
   
       Believability
   
       Value added
   
       Interpretability
   
       Accessibility
   Broad categories:
   
       intrinsic, contextual, representational, and
       accessibility.
      Major Tasks in Data Preprocessing
Data cleaning
    Fill in missing values, smooth noisy data, identify or remove
    outliers, and resolve inconsistencies
Data integration
    Integration of multiple databases, data cubes, or files
Data transformation
    Normalization and aggregation
Data reduction
    Obtains a reduced representation that is much smaller in
    volume but produces the same or similar analytical results
Data discretization
    Part of data reduction but with particular importance, especially
    for numerical data
Forms of data preprocessing
    (figure: data cleaning, data integration, data transformation,
    and data reduction as successive preprocessing forms)
 Lecture-14
Data cleaning
Data cleaning tasks
    Fill in missing values
    Identify outliers and smooth out noisy data
    Correct inconsistent data
        Missing Data
Data is not always available
    E.g., many tuples have no recorded value for several
    attributes, such as customer income in sales data
Missing data may be due to
    equipment malfunction
    inconsistent with other recorded data and thus deleted
    data not entered due to misunderstanding
    certain data may not be considered important at the time of
    entry
    history or changes of the data were not recorded
Missing data may need to be inferred.
   How to Handle Missing Data?
Ignore the tuple: usually done when the class
label is missing
Fill in the missing value manually
Use a global constant to fill in the missing value,
e.g., “unknown”
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the
same class to fill in the missing value
Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision
tree
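The fill strategies above are easy to sketch with pandas; a minimal example (the table, column names, and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sales data: 'income' contains missing values (NaN).
df = pd.DataFrame({
    "cust_class": ["A", "A", "B", "B", "B"],
    "income":     [30000.0, np.nan, 52000.0, np.nan, 48000.0],
})

# Global constant: fill every missing value with the label "unknown".
const_filled = df["income"].astype(object).fillna("unknown")

# Attribute mean: fill with the overall mean income.
mean_filled = df["income"].fillna(df["income"].mean())

# Class-conditional mean: fill with the mean income of the tuple's class.
class_filled = df.groupby("cust_class")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(mean_filled.tolist())   # NaNs replaced by the overall mean
print(class_filled.tolist())  # NaNs replaced by per-class means
```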
          Noisy Data
Noise: random error or variance in a measured
variable
Incorrect attribute values may be due to
    faulty data collection instruments
    data entry problems
    data transmission problems
    technology limitation
    inconsistency in naming convention
Other data problems which require data cleaning
    duplicate records
    incomplete data
    inconsistent data
      How to Handle Noisy Data?
Binning method:
    first sort data and partition into (equal-
    frequency) bins
    then one can smooth by bin means, smooth
    by bin median, smooth by bin boundaries
Clustering
    detect and remove outliers
Regression
    smooth by fitting the data to a regression
    function, e.g., linear regression
    Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
    It divides the range into N intervals of equal size:
    uniform grid
    if A and B are the lowest and highest values of the
    attribute, the width of intervals will be: W = (B-A)/N.
    The most straightforward
    But outliers may dominate presentation
    Skewed data is not handled well.
Equal-depth (frequency) partitioning:
    It divides the range into N intervals, each containing
    approximately the same number of samples
    Good data scaling
    Managing categorical attributes can be tricky.
    Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
  26, 28, 29, 34
* Partition into (equi-depth) bins:
   - Bin 1: 4, 8, 9, 15
   - Bin 2: 21, 21, 24, 25
   - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
   - Bin 1: 9, 9, 9, 9
   - Bin 2: 23, 23, 23, 23
   - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
   - Bin 1: 4, 4, 4, 15
   - Bin 2: 21, 21, 25, 25
   - Bin 3: 26, 26, 26, 34
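The same smoothing can be reproduced in NumPy; a minimal sketch of the price example above (note the slide rounds the bin means 22.75 and 29.25 to 23 and 29):

```python
import numpy as np

prices = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.split(prices, 3)  # equal-depth partitioning: 4 values per bin

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: each value moves to the nearer boundary.
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max())
             for b in bins]

print(by_means)   # [9 9 9 9], [22.75 ...], [29.25 ...]
print(by_bounds)  # [4 4 4 15], [21 21 25 25], [26 26 26 34]
```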
Cluster Analysis
    Organize values into clusters; values falling outside the
    set of clusters may be treated as outliers and removed.
    (figure: clustered 2-D data with outliers lying outside the clusters)
Regression
    (figure: noisy value Y1 at X1 smoothed to Y1′ by fitting the
    regression line y = x + 1)
    Lecture-15
Data integration and
   transformation
            Data Integration
Data integration:
    combines data from multiple sources into a coherent
    store
Schema integration
    integrate metadata from different sources
    Entity identification problem: identify real-world entities
    from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
    for the same real world entity, attribute values from
    different sources are different
    possible reasons: different representations, different
    scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration
Redundant data often occur when integrating
multiple databases
    The same attribute may have different names
    in different databases
    One attribute may be a “derived” attribute in
    another table, e.g., annual revenue
Redundancy can often be detected by
correlation analysis
Careful integration of the data from
multiple sources may help reduce/avoid
redundancies and inconsistencies and
improve mining speed and quality
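A minimal sketch of correlation analysis for redundancy detection; the two revenue attributes are hypothetical, with one derived from the other:

```python
import numpy as np

rng = np.random.default_rng(0)
monthly_rev = rng.normal(50_000, 8_000, size=200)
annual_rev = 12 * monthly_rev + rng.normal(0, 5_000, size=200)  # derived
unrelated = rng.normal(0, 1, size=200)

# Pearson correlation: |r| near 1 flags a likely redundant attribute pair.
r_pair = np.corrcoef(monthly_rev, annual_rev)[0, 1]
r_none = np.corrcoef(monthly_rev, unrelated)[0, 1]
print(f"monthly vs annual:    r = {r_pair:.3f}")  # close to 1 -> redundant
print(f"monthly vs unrelated: r = {r_none:.3f}")  # close to 0 -> keep both
```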
       Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube
construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small,
specified range
    min-max normalization
    z-score normalization
    normalization by decimal scaling
Attribute/feature construction
    New attributes constructed from the given ones
 Data Transformation: Normalization
min-max normalization
    $v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
z-score normalization
    $v' = \frac{v - \text{mean}_A}{\text{stand\_dev}_A}$
normalization by decimal scaling
    $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
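The three normalizations, sketched in NumPy on a small hypothetical attribute:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to a new range [new_min, new_max], here [0, 1].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: smallest integer j such that max(|v'|) < 1.
j = 0
while np.abs(v / 10**j).max() >= 1:
    j += 1
v_decimal = v / 10**j   # here j = 4, since 1000 / 10**3 is not < 1

print(v_minmax, v_zscore, v_decimal, sep="\n")
```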
 Lecture-16
Data reduction
A warehouse may store terabytes of data, so
complex data analysis/mining may take a very
long time to run on the complete data set
Data reduction
    Obtains a reduced representation of the data set
    that is much smaller in volume but yet produces
    the same (or almost the same) analytical results
       Data Reduction Strategies
    Data cube aggregation
    Attribute subset selection
    Dimensionality reduction
    Numerosity reduction
    Discretization and concept hierarchy
    generation
              Data Cube Aggregation
The lowest level of a data cube
    the aggregated data for an individual entity of interest
    e.g., a customer in a phone calling data warehouse.
Multiple levels of aggregation in data cubes
    Further reduce the size of data to deal with
Reference appropriate levels
    Use the smallest representation which is enough to
    solve the task
Queries regarding aggregated information should
be answered using the data cube, when possible
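Climbing one aggregation level can be sketched with a pandas groupby; the quarterly sales figures below are hypothetical, rolled up from eight tuples to two annual totals:

```python
import pandas as pd

# Hypothetical quarterly sales (the lowest cube level stored here).
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 311, 402, 390, 600],
})

# Aggregate to the coarser 'year' level: 8 tuples -> 2 tuples.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```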
            Dimensionality Reduction
Feature selection (attribute subset selection):
    Select a minimum set of features such that the
    probability distribution of different classes given the
    values for those features is as close as possible to the
    original distribution given the values of all features
    reduces the number of patterns in the mining result, making them easier to understand
Heuristic methods
    step-wise forward selection
    step-wise backward elimination
    combining forward selection and backward elimination
    decision-tree induction
Wavelet Transforms
    (figure: Haar-2 and Daubechies-4 wavelet families)
 Discrete wavelet transform (DWT): linear signal
 processing
 Compressed approximation: store only a small fraction of
 the strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better
 lossy compression, localized in space
 Method:
 
     Length, L, must be an integer power of 2 (padding with 0s, when
     necessary)
 
     Each transform has 2 functions: smoothing, difference
 
      Applies to pairs of data, resulting in two sets of data of length L/2
  
      Applies the two functions recursively until the desired length is reached
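A minimal sketch of the method for the Haar case: the two functions are the pairwise average (smoothing) and the pairwise half-difference, applied recursively to the smoothed half (the input series is a hypothetical example):

```python
import numpy as np

def haar_dwt(x):
    # Length must be an integer power of 2 (pad with zeros otherwise).
    x = np.asarray(x, dtype=float)
    if len(x) == 1:
        return x
    smooth = (x[0::2] + x[1::2]) / 2   # smoothing function, length L/2
    detail = (x[0::2] - x[1::2]) / 2   # difference function, length L/2
    # Recurse on the smoothed half until one coefficient remains.
    return np.concatenate([haar_dwt(smooth), detail])

coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)  # most of the signal sits in a few large coefficients
# Compression: keep only the strongest coefficients, zero out the rest.
```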
       Principal Component Analysis
Given N data vectors in k dimensions, find c ≤ k
orthogonal vectors that can best be used
to represent data
    The original data set is reduced to one consisting of N
    data vectors on c principal components (reduced
    dimensions)
Each data vector is a linear combination of the c
principal component vectors
Works for numeric data only
Used when the number of dimensions is large
Principal Component Analysis
    (figure: 2-D data in the X1-X2 plane with principal components
    Y1 and Y2; Y1 points along the direction of greatest variance)
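A minimal NumPy sketch of PCA via the eigenvectors of the covariance matrix, on hypothetical data with k = 5 dimensions reduced to c = 2:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # N = 200, k = 5

Xc = X - X.mean(axis=0)                 # center the data
cov = (Xc.T @ Xc) / (len(Xc) - 1)       # k x k covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenpairs, ascending eigenvalues

c = 2                                   # keep the top c components
top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]
X_reduced = Xc @ top                    # N vectors on c principal components
print(X_reduced.shape)                  # (200, 2)
```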
   Attribute subset selection
Attribute subset selection reduces the data
set size by removing irrelevant or
redundant attributes.
The goal is to find a minimum set of attributes.
Uses basic heuristic methods of attribute
selection
Example of Decision Tree Induction
    Initial attribute set:
    {A1, A2, A3, A4, A5, A6}

                  A4?
                /      \
             A1?        A6?
            /    \     /    \
      Class 1 Class 2 Class 1 Class 2

>   Reduced attribute set: {A1, A4, A6}
       Numerosity Reduction
Parametric methods
    Assume the data fits some model, estimate model
    parameters, store only the parameters, and discard
    the data (except possible outliers)
    Log-linear models: obtain the value at a point in m-D
    space as a product over appropriate marginal
    subspaces
Non-parametric methods
    Do not assume models
    Major families: histograms, clustering, sampling
        Regression and Log-Linear Models
Linear regression: Data are modeled to fit a
straight line
    Often uses the least-square method to fit the line
Multiple regression: allows a response variable
Y to be modeled as a linear function of a
multidimensional feature vector
Log-linear model: approximates discrete
multidimensional probability distributions
     Regression Analysis and Log-Linear
                Models
Linear regression: $Y = \alpha + \beta X$
    Two parameters, $\alpha$ and $\beta$, specify the line and are to
    be estimated by using the data at hand,
    applying the least-squares criterion to the known values of
    Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
    Many nonlinear functions can be transformed into the
    above.
Log-linear models:
    The multi-way table of joint probabilities is
    approximated by a product of lower-order tables.
    Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
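A minimal sketch of the least-squares estimates of α and β, on hypothetical data generated around the line y = x + 1 from the earlier regression figure:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 1.0 * x + rng.normal(0, 0.5, size=50)  # y = x + 1 plus noise

# Closed-form least-squares estimates of the two line parameters.
beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()
print(f"fitted line: Y = {alpha:.2f} + {beta:.2f} X")  # roughly Y = 1 + 1 X
```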
                   Histograms
A popular data reduction technique
Divide data into buckets and store the
average (sum) for each bucket
Can be constructed optimally in one
dimension using dynamic programming
Related to quantization problems.
    (figure: equal-width histogram of prices with buckets
    from 10,000 to 90,000)
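A sketch of an equal-width histogram that stores only a count and sum per bucket; the price data and the 10,000-wide buckets are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
prices = rng.integers(1, 100_001, size=1_000)  # hypothetical price data

# Equal-width buckets of width 10,000; keep only (count, sum) per bucket.
edges = np.arange(0, 110_000, 10_000)
counts, _ = np.histogram(prices, bins=edges)
sums, _ = np.histogram(prices, bins=edges, weights=prices)

for lo, hi, n, s in zip(edges[:-1], edges[1:], counts, sums):
    mean = s / n if n else float("nan")
    print(f"[{lo:6d}, {hi:6d}): count={n:4d}  bucket mean={mean:9.1f}")
```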
                   Clustering
Partition the data set into clusters; then one can store
the cluster representations only
Can be very effective if data is clustered but not if data
is “smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms.
                    Sampling
Allows a large data set to be represented
by a much smaller random sample of the data.
Let a large data set D contain N tuples.
Methods to reduce data set D:
    Simple random sample without replacement
    (SRSWOR)
    Simple random sample with replacement
    (SRSWR)
    Cluster sample
    Stratified sample
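A NumPy sketch of three of these methods, treating a hypothetical array of tuple IDs as D:

```python
import numpy as np

rng = np.random.default_rng(4)
D = np.arange(10_000)   # stand-in for a data set of N = 10,000 tuples
n = 100                 # desired sample size

srswor = rng.choice(D, size=n, replace=False)  # SRSWOR
srswr = rng.choice(D, size=n, replace=True)    # SRSWR

# Stratified sample: draw proportionally from each stratum
# (here a hypothetical two-valued stratum label).
strata = D % 2
stratified = np.concatenate([
    rng.choice(D[strata == s], size=n // 2, replace=False)
    for s in np.unique(strata)])
print(len(srswor), len(srswr), len(stratified))  # 100 100 100
```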
 Sampling
    (figure: SRSWOR and SRSWR samples drawn from the raw data,
    i.e., simple random sampling without and with replacement)
             Sampling
    (figure: cluster/stratified sample drawn from the raw data)
       Lecture-17
Discretization and concept
   hierarchy generation
                      Discretization
Three types of attributes:
    Nominal — values from an unordered set
    Ordinal — values from an ordered set
    Continuous — real numbers
Discretization: divide the range of a continuous
attribute into intervals
    Some classification algorithms only accept
    categorical attributes.
    Reduce data size by discretization
    Prepare for further analysis
 Discretization and Concept hierarchy
Discretization
    reduce the number of values for a given continuous
    attribute by dividing the range of the attribute into
    intervals. Interval labels can then be used to replace
    actual data values.
Concept hierarchies
    reduce the data by collecting and replacing low level
    concepts (such as numeric values for the attribute
    age) by higher level concepts (such as young,
    middle-aged, or senior).
  Discretization and concept hierarchy
      generation for numeric data
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Discretization by intuitive partitioning
         Entropy-Based Discretization
Given a set of samples S, if S is partitioned into
two intervals S1 and S2 using boundary T, the
entropy after partitioning is
    $E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$
The boundary that minimizes the entropy function
over all possible boundaries is selected as a
binary discretization.
The process is recursively applied to the partitions
obtained until some stopping criterion is met, e.g.,
    $\mathrm{Ent}(S) - E(T, S) < \delta$
Experiments show that it may reduce data size
and improve classification accuracy.
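A minimal sketch of a single binary split: scan the candidate boundaries T and keep the one minimizing E(S, T). The attribute values and class labels are hypothetical, and the full method would recurse on both intervals:

```python
import numpy as np

def ent(labels):
    # Shannon entropy of a class-label array.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_boundary(values, labels):
    # Try each midpoint between consecutive sorted values as T.
    order = np.argsort(values)
    v, y = values[order], labels[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        # Weighted entropy E(S, T) of the two resulting intervals.
        e = (i * ent(y[:i]) + (len(v) - i) * ent(y[i:])) / len(v)
        if e < best_e:
            best_t, best_e = (v[i] + v[i - 1]) / 2, e
    return best_t, best_e

age = np.array([23.0, 25, 31, 35, 40, 45, 52, 60])
buys = np.array([0, 0, 0, 1, 1, 1, 0, 0])
print(best_boundary(age, buys))  # boundary T with the minimum E(S, T)
```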
      Discretization by intuitive partitioning
  3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at the most
    significant digit, partition the range into 3 equal-width
    intervals
* If it covers 2, 4, or 8 distinct values at the most significant
    digit, partition the range into 4 intervals
* If it covers 1, 5, or 10 distinct values at the most
    significant digit, partition the range into 5 intervals
                Example of 3-4-5 rule
Step 1: For profit, Min = -$351, Low (5th percentile) = -$159,
    High (95th percentile) = $1,838, Max = $4,700.
Step 2: msd = 1,000; rounding Low and High to the msd gives
    Low' = -$1,000 and High' = $2,000, i.e., the range (-$1,000 - $2,000).
Step 3: This range covers 3 distinct values at the msd, so it is
    partitioned into 3 equal-width intervals: (-$1,000 - $0],
    ($0 - $1,000], and ($1,000 - $2,000].
Step 4: Adjusting for the actual Min and Max, the first interval
    shrinks to (-$400 - $0] and a new interval ($2,000 - $5,000] is
    added to cover Max. Each interval is then recursively partitioned
    by the same rule: (-$400 - $0] into 4 sub-intervals of $100;
    ($0 - $1,000] and ($1,000 - $2,000] into 5 sub-intervals of $200
    each; and ($2,000 - $5,000] into 3 sub-intervals of $1,000.
Concept hierarchy generation for categorical
                   data
 Specification of a partial ordering of attributes
 explicitly at the schema level by users or experts
 Specification of a portion of a hierarchy by
 explicit data grouping
 Specification of a set of attributes, but not of
 their partial ordering
 Specification of only a partial set of attributes
     Specification of a set of attributes
Concept hierarchy can be automatically
 generated based on the number of distinct
 values per attribute in the given attribute set.
 The attribute with the most distinct values is
 placed at the lowest level of the hierarchy.
        country                          15 distinct values
        province_or_state                65 distinct values
        city                          3,567 distinct values
        street                      674,339 distinct values
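This heuristic is easy to sketch with pandas: count the distinct values per attribute and order them ascending (the small location table below is hypothetical):

```python
import pandas as pd

# Hypothetical location attributes.
loc = pd.DataFrame({
    "country": ["USA", "USA", "USA", "Canada", "Canada"],
    "state":   ["NY", "NY", "CA", "ON", "BC"],
    "city":    ["NYC", "Buffalo", "LA", "Toronto", "Vancouver"],
})

# Fewest distinct values -> top of the hierarchy, most -> bottom.
order = loc.nunique().sort_values().index.tolist()
print(" < ".join(order))  # country < state < city
```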