Data Pre-Processing-II
(Data Integration, Data Transformation)
Dr. JASMEET SINGH
ASSISTANT PROFESSOR, CSED
TIET, PATIALA
Data Integration
◦ Data Integration: It is the process of merging the data from
 multiple sources into a coherent data store.
 e.g. Collection of banking data from different banks at data stores of RBI
 Issues in data integration
 • Schema integration and feature matching
 • Detection and resolution of data value conflicts
 • Redundant features
  Data Integration
  Schema integration and feature matching:
The same feature may appear under different names across sources, e.g. one bank's table stores the customer identifier as Cust.id while another uses Cust.no, although both hold the same columns (Name, Age, DoB); such features must be matched to a single integrated schema.
                              "Carefully analyze the metadata"
Data Integration
 Detection and resolution of data value conflicts:
For the same feature, different sources may store conflicting representations, e.g. the Price of a product recorded in $ in one source, Rs in another, and pounds in a third; when integrating research data on the prices of essential products, these values must be converted to a common unit.
                              "Carefully analyze the metadata"
Data Integration
Redundant features: They are unwanted features that add no new information, typically because they can be derived from other features.
 To deal with redundant features, correlation analysis is performed; the correlation coefficient is denoted by r.
 A threshold on r is decided to find redundant features.
 A scatter plot of two features indicates whether r is +ve, r is -ve, or r is zero.
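As a rough sketch of correlation analysis, the Pearson coefficient r between two feature columns can be computed directly. The height columns below are an illustrative (hypothetical) example, not from the slides:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: the same height recorded in cm and in inches.
height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
r = pearson_r(height_cm, height_in)
print(r > 0.99)  # r close to +1: one of the two features is redundant
```

A feature whose |r| with another feature exceeds the chosen threshold (e.g. 0.9, a common but problem-dependent choice) can be dropped with little loss of information.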
Data Transformation
 Data Transformation: It is the process of transforming or consolidating the data into a form suitable for machine learning algorithms.
 Major techniques of data transformation are:
 • Normalization/Scaling
 • Aggregation
 • Generalization
 • Feature construction
Data Transformation- Scaling
Scaling/Normalization: It is the technique of mapping the numerical feature values of any range into a specific smaller range, e.g. [0, 1] or [-1, 1].
Popular methods of Normalization are:-
•   Min-Max method
•   Mean normalization
•   Z score method
•   Decimal scaling method
•   Log Transformer
•   MaxAbs Transformer
•   InterQuartile Scaler / Robust Scaler
Data Transformation- Scaling Contd
Min-Max method:

X' = (X - X_min) / (X_max - X_min)

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
X_min and X_max are the minimum and maximum values of the feature vector corresponding to X.

e.g. Min-max normalization maps X = [2, 47, 90, 18, 5] to X' = [0, 0.511, 1, 0.182, 0.034].
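A minimal Python sketch of the min-max method, applied to the example column X = [2, 47, 90, 18, 5]:

```python
def min_max_scale(values):
    """Map each value into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in min_max_scale(X)])
# -> [0.0, 0.511, 1.0, 0.182, 0.034]
```

The minimum always maps to 0 and the maximum to 1; note that a constant column (max = min) would divide by zero and needs special handling.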
  Data Transformation- Scaling Contd
Mean normalization:

X' = (X - X̄) / (X_max - X_min)

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
X̄ is the mean of the feature vector corresponding to X
X_min and X_max are the minimum and maximum values of the feature vector corresponding to X.

e.g. Mean normalization maps X = [2, 47, 90, 18, 5] to X' = [-0.345, 0.166, 0.655, -0.164, -0.311].
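The mean-normalization mapping, sketched in Python for the same example column:

```python
def mean_normalize(values):
    """Map values via (x - mean) / (max - min); the result is centred near 0."""
    mean = sum(values) / len(values)
    rng = max(values) - min(values)
    return [(x - mean) / rng for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in mean_normalize(X)])
# -> [-0.345, 0.166, 0.655, -0.164, -0.311]
```

Unlike min-max scaling, values below the mean become negative, which centres the feature around zero.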
Data Transformation- Scaling Contd…
Z Score method:

X' = (X - X̄) / σ

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
X̄ and σ are the mean and standard deviation of the feature vector corresponding to X.

e.g. Z score normalization maps X = [2, 47, 90, 18, 5] to X' = [-0.826, 0.397, 1.566, -0.391, -0.745].
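A sketch of the Z score method. Note one assumption: the slide's example numbers are reproduced only when σ is computed as the sample standard deviation (dividing by n - 1):

```python
import math

def z_score(values):
    """Map values via (x - mean) / std; std here uses n - 1
    (sample standard deviation), matching the example above."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / std for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in z_score(X)])
# -> [-0.826, 0.397, 1.566, -0.391, -0.745]
```

The transformed feature has mean 0 and unit standard deviation, which is useful when features have very different scales.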
Data Transformation- Scaling Contd…
Decimal scaling method:

X' = X / 10^j

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
j is the smallest integer such that max(|X'|) < 1, i.e. the count of digits in the largest absolute value of the feature vector corresponding to X.

e.g. Decimal scaling normalization maps X = [2, 47, 90, 18, 5] (j = 2) to X' = [0.02, 0.47, 0.9, 0.18, 0.05].
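Decimal scaling as a small sketch; the loop finds the smallest j that brings every scaled magnitude below 1:

```python
def decimal_scale(values):
    """Divide by 10^j, where j is the smallest integer making every
    |scaled value| < 1 (here j = 2, since max |x| is the 2-digit 90)."""
    j = 0
    while max(abs(x) for x in values) / (10 ** j) >= 1:
        j += 1
    return [x / (10 ** j) for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 2) for v in decimal_scale(X)])
# -> [0.02, 0.47, 0.9, 0.18, 0.05]
```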
    Data Transformation- Scaling Contd…
Log Transformer:

X' = log(X)   (natural logarithm)

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
It is primarily used to convert a skewed distribution into a normal/less-skewed distribution.
The log operation has a dual role:
• Reducing the impact of too-low values
• Reducing the impact of too-high values

e.g. Log scaling maps X = [2, 47, 90, 18, 5] to X' = [0.693147, 3.850148, 4.499810, 2.890372, 1.609438].
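The log transform, sketched with the natural logarithm (which reproduces the example values above):

```python
import math

def log_transform(values):
    """Natural-log transform; compresses large values, reducing right skew.
    Inputs must be positive (use log1p or a shift for zeros/negatives)."""
    return [math.log(x) for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 6) for v in log_transform(X)])
# -> [0.693147, 3.850148, 4.49981, 2.890372, 1.609438]
```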
 Data Transformation – Scaling Contd…
MaxAbs Scaler:

X' = X / max(|X|)

It divides each value in the column by the maximum of the absolute values in that column.
This operation scales the data into the range [-1, 1].

e.g. MaxAbs normalization maps X = [100, -263, -2000, 18, 5] to X' = [0.05, -0.1315, -1, 0.009, 0.0025].
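A minimal sketch of MaxAbs scaling on the example column:

```python
def max_abs_scale(values):
    """Divide every value by the maximum absolute value of the column,
    scaling the data into the range [-1, 1]."""
    m = max(abs(x) for x in values)
    return [x / m for x in values]

X = [100, -263, -2000, 18, 5]
print([round(v, 4) for v in max_abs_scale(X)])
# -> [0.05, -0.1315, -1.0, 0.009, 0.0025]
```

Because no shifting is involved, zeros stay zero, which makes this scaler a common choice for sparse data.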
       Data Transformation- Scaling Contd
Interquartile/Robust normalization:

X' = (X - X_median) / IQR

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
X_median is the median and IQR = Q3 - Q1 is the interquartile range of the feature vector corresponding to X.

• The mean, maximum and minimum values of a column are all sensitive to outliers.
• If there are too many outliers in the data, they will influence the mean and the max or min value.
• Thus, even if we scale such data using the above methods, we cannot guarantee a balanced, near-normal distribution; the median and IQR, by contrast, are robust to outliers.

e.g. Robust normalization maps X = [2, 47, 90, 18, 5] to X' = [-0.381, 0.690, 1.714, 0, -0.310].
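A sketch of robust scaling; one assumption is that Q1 and Q3 are computed with linear interpolation (numpy's default percentile convention), which reproduces the example values above:

```python
def robust_scale(values):
    """Map values via (x - median) / IQR; Q1 and Q3 use linear
    interpolation between sorted data points."""
    s = sorted(values)

    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + frac * (s[hi] - s[lo])

    median = quantile(0.5)
    iqr = quantile(0.75) - quantile(0.25)
    return [(x - median) / iqr for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in robust_scale(X)])
# -> [-0.381, 0.69, 1.714, 0.0, -0.31]
```

The median maps to 0, and a handful of extreme outliers barely moves the median or the IQR, unlike the mean/min/max used by the earlier methods.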
Data Transformation- Aggregation
Aggregation: aggregated values are taken in order to put the data in a better perspective.
e.g. in the case of transactional data, the day-to-day sales of a product at various store locations can be aggregated store-wise over months or years in order to be analysed for decision making.
 Benefits of aggregation
 • Reduces the memory consumed in storing large numbers of data records.
 • Provides a more stable view than individual data objects.
Data Transformation- Generalization
Generalization: The data is generalized from low-level concepts to higher-order concepts using concept hierarchies.
e.g. categorical attributes like street can be generalized to higher-order concepts like city or country.
   “The decision of generalization level depends on the problem
                            statement”
Data Transformation- Feature Construction
 Feature construction involves transforming a given set of input features to generate a new set of more powerful features.
 e.g. features like mobile number and landline number can be combined into a new feature, contact number.
 Features like apartment length and breadth can be converted into apartment area.
Data Transformation- Feature Construction
 There are certain situations where feature construction is an essential activity
 before we can train a machine learning model.
 These situations are:
  When features have categorical value and machine learning needs numeric
   value inputs.
    Label Encoding
    One-Hot Encoding
   Dummy Encoding
  When features have numeric (continuous) value and need to be converted to
  ordinal values.
   Rank according to numerical continuous values
  When text-specific feature construction needs to be done.
   Bag-of-words
   Tf-idf
    Word Embeddings
Data Transformation- Feature Construction
 Label Encoding
This approach is very simple: it involves converting each value in a column to a number.
• Depending upon the data values and the type of data, label encoding induces a new problem, since it uses number sequencing.
• The problem with using numbers is that they introduce a relation/comparison between the categories.
• The algorithm might misunderstand that the data has some kind of hierarchy/order 0 < 1 < 2 … < 6 and might give 6× more weight to the 'Cable' bridge type than to 'Arch'.
Data Transformation- Feature Construction
 Label Encoding
 ◦ Let’s consider another column named ‘Safety
   Level’.
 ◦ Performing label encoding on this column also induces order/precedence in the numbers, but in the right way.
 ◦ Here the numerical order does not look out of place, and it makes sense if the algorithm interprets the safety order as 0 < 1 < 2 < 3 < 4, i.e. none < low < medium < high < very high.
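A sketch of label encoding for the ordinal 'Safety Level' column (the sample rows are hypothetical); fixing the category order explicitly preserves the intended none < low < medium < high < very high ranking:

```python
# Explicit category order: none < low < medium < high < very high.
order = ['none', 'low', 'medium', 'high', 'very high']
mapping = {label: code for code, label in enumerate(order)}

safety = ['low', 'none', 'very high', 'medium', 'high']  # hypothetical column
encoded = [mapping[v] for v in safety]
print(encoded)  # -> [1, 0, 4, 2, 3]
```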
Data Transformation- Feature Construction
 One-Hot Encoding
  Though label encoding is straightforward, it has the disadvantage that the numeric values can be misinterpreted by algorithms.
  The ordering issue is addressed in another common alternative approach called ‘One-
   Hot Encoding’.
  In this strategy, each category value is converted into a new column and assigned a 1
   or 0 (notation for true/false) value to the column.
  It does have the downside of adding more columns to the data set.
  It can cause the number of columns to expand greatly if you have many unique values
   in a category column.
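One-hot encoding can be sketched as follows; the bridge-type values echo the slide's example, and sorting the categories alphabetically is just one possible column order:

```python
def one_hot(values):
    """One binary column per distinct category: N columns for N categories."""
    cats = sorted(set(values))
    return cats, [[1 if v == c else 0 for c in cats] for v in values]

bridge_type = ['Arch', 'Cable', 'Arch', 'Beam']
cats, rows = one_hot(bridge_type)
print(cats)  # -> ['Arch', 'Beam', 'Cable']
print(rows)  # -> [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Each row has exactly one 1, so no ordering between categories is implied, at the cost of one extra column per category.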
  Data Transformation- Feature Construction
 One-Hot Encoding
Data Transformation- Feature Construction
 Dummy Encoding
  The dummy coding scheme is similar to one-hot encoding.
  This categorical data encoding method transforms the categorical variable into a set of binary variables (also known as dummy variables).
  In the case of one-hot encoding, for N categories in a variable, it uses N binary variables.
  Dummy encoding is a small improvement over one-hot encoding: it uses N-1 features to represent N labels/categories.
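Dummy encoding differs from the one-hot sketch only in dropping one category; which category is dropped is a free choice (here, the first in sorted order):

```python
def dummy_encode(values):
    """Like one-hot, but drops the first category: N - 1 columns for N
    categories; the dropped category is represented by the all-zeros row."""
    cats = sorted(set(values))
    kept = cats[1:]  # the first category becomes the implicit baseline
    return kept, [[1 if v == c else 0 for c in kept] for v in values]

bridge_type = ['Arch', 'Cable', 'Arch', 'Beam']
kept, rows = dummy_encode(bridge_type)
print(kept)  # -> ['Beam', 'Cable']
print(rows)  # -> [[0, 0], [0, 1], [0, 0], [1, 0]]
```

'Arch' is recoverable as the all-zeros row, so no information is lost despite the smaller representation.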
  Data Transformation- Feature Construction
 Dummy Encoding
Data Transformation- Feature Construction
              Text-specific Features - BoW, TF-IDF
 Document A: The Car Is Driven On The Road
 Document B: The Truck is Driven on the highway
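As a minimal sketch, the bag-of-words representation of the two documents counts each vocabulary word per document (TF-IDF would then reweight these counts by inverse document frequency):

```python
def bag_of_words(docs):
    """Count-based bag-of-words vectors over a shared, sorted vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    return vocab, [[tokens.count(w) for w in vocab] for tokens in tokenized]

doc_a = "The Car Is Driven On The Road"
doc_b = "The Truck is Driven on the highway"
vocab, vectors = bag_of_words([doc_a, doc_b])
print(vocab)
# -> ['car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck']
print(vectors)
# -> [[1, 1, 0, 1, 1, 1, 2, 0], [0, 1, 1, 1, 1, 0, 2, 1]]
```

Only the words 'car'/'road' vs 'truck'/'highway' differ between the two vectors, which is exactly the signal TF-IDF would amplify relative to the shared words like 'the'.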