
Data Pre-Processing-II

(Data Integration, Data Transformation)

Dr. Jasmeet Singh
Assistant Professor, CSED
TIET, Patiala
Data Integration
◦ Data Integration: It is the process of merging data from
multiple sources into a coherent data store.
e.g. collection of banking data from different banks into the data stores of the RBI

Issues in data integration


• Schema integration and feature
matching
• Detection and resolution of data value
conflicts.
• Redundant features
Data Integration
Schema integration and feature matching:

◦ The same real-world attribute may be named differently across sources. For
example, one source's customer table has the schema (Cust.id, Name, Age, DoB)
while another uses (Cust.no, Name, Age, DoB); the merged store must recognize
that Cust.id and Cust.no refer to the same attribute and map both to a single
schema (Cust.id, Name, Age, DoB).

“Carefully analyse the metadata”


Data Integration
Detection and resolution of data value conflicts:

◦ The same attribute may be recorded differently in different sources. For
example, research data for the price of essential products may come from three
sources with the same schema (Product, Year, Price), but with Price recorded
in $ in one, in Rs in another, and in pounds in the third; the values must be
converted to a common unit before merging.

“Carefully analyse the metadata”


Data Integration
Redundant features: They are unwanted features that duplicate, or can be
derived from, other features.
 To deal with redundant features, correlation analysis is
performed. The correlation coefficient is denoted by r.
 A threshold on |r| is decided to flag redundant features.
 r > 0 indicates positive correlation, r < 0 negative correlation,
and r = 0 no linear correlation.
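As a sketch of how correlation analysis can flag redundant features (the feature names and the 0.9 threshold below are illustrative assumptions, not from the slides):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: height in cm vs. height in inches carry the
# same information, so |r| is close to 1 and one feature can be dropped.
height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
r = pearson_r(height_cm, height_in)
print(r > 0.9)  # True: above the chosen threshold, so flagged as redundant
```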


Data Transformation
Data Transformation: It is the process of transforming or
consolidating data into a form suitable for machine learning
algorithms.

Major techniques of data transformation are:


• Normalization/Scaling
• Aggregation
• Generalization
• Feature construction
Data Transformation- Scaling
Scaling/Normalization: It is the technique of mapping the numerical
feature values of any range into a specific smaller range, e.g. 0 to 1
or -1 to 1.

Popular methods of Normalization are:


• Min-Max method
• Mean normalization
• Z score method
• Decimal scaling method
• Log Transformer
• MaxAbs Transformer
• InterQuartile Scaler / Robust Scaler
Data Transformation- Scaling Contd
Min-Max method:

X_new = (X - X_min) / (X_max - X_min)

Example: X = [2, 47, 90, 18, 5]
Min-max normalization gives X_new = [0, 0.511, 1, 0.182, 0.034]

Where,
X_new is the mapped value
X is the data value to be mapped into the specific range
X_min and X_max are the minimum and maximum values of the
feature vector corresponding to X.
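The slide's example can be reproduced with a minimal pure-Python sketch (in practice one would typically use scikit-learn's MinMaxScaler):

```python
def min_max(x):
    """Map each value of x linearly into [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in min_max(X)])  # → [0.0, 0.511, 1.0, 0.182, 0.034]
```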
Data Transformation- Scaling Contd
Mean normalization:

X_new = (X - X̄) / (X_max - X_min)

Example: X = [2, 47, 90, 18, 5], mean X̄ = 32.4
Mean normalization gives X_new = [-0.345, 0.166, 0.655, -0.164, -0.311]

Where,
X_new is the mapped value
X is the data value to be mapped into the specific range
X̄ is the mean of the feature vector corresponding to X
X_min and X_max are the minimum and maximum values of the
feature vector corresponding to X.
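The same example as a pure-Python sketch (centering on the mean, then dividing by the range):

```python
def mean_normalize(x):
    """Center on the mean, then divide by the range (max - min)."""
    mu = sum(x) / len(x)
    rng = max(x) - min(x)
    return [(v - mu) / rng for v in x]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in mean_normalize(X)])  # → [-0.345, 0.166, 0.655, -0.164, -0.311]
```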
Data Transformation- Scaling Contd…
Z Score method:

X_new = (X - X̄) / σ

Example: X = [2, 47, 90, 18, 5], X̄ = 32.4, σ ≈ 36.787 (sample standard deviation)
Z score normalization gives X_new = [-0.826, 0.397, 1.566, -0.391, -0.745]

Where,
X_new is the mapped value
X is the data value to be mapped into the specific range
X̄ and σ are the mean and standard deviation of the
feature vector corresponding to X.
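A sketch reproducing the slide's numbers; note the slide's values imply the sample standard deviation (dividing by n - 1), whereas scikit-learn's StandardScaler uses the population form (dividing by n):

```python
import math

def z_score(x):
    """Standardize using the sample standard deviation (n - 1), matching the slide."""
    n = len(x)
    mu = sum(x) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / (n - 1))
    return [(v - mu) / sd for v in x]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in z_score(X)])  # → [-0.826, 0.397, 1.566, -0.391, -0.745]
```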
Data Transformation- Scaling Contd…
Decimal scaling method:

X_new = X / 10^j

Example: X = [2, 47, 90, 18, 5]; the largest magnitude, 90, has 2 digits, so j = 2
Decimal scaling normalization gives X_new = [0.02, 0.47, 0.9, 0.18, 0.05]

Where,
X_new is the mapped value
X is the data value to be mapped into the specific range
j is the maximum of the count of digits in the minimum and maximum
values of the feature vector corresponding to X.
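A minimal sketch of decimal scaling, assuming integer-valued features as in the slide's example:

```python
def decimal_scale(x):
    """Divide by 10^j, where j is the digit count of the largest magnitude."""
    j = len(str(max(abs(int(v)) for v in x)))
    return [v / 10 ** j for v in x]

X = [2, 47, 90, 18, 5]
print(decimal_scale(X))  # → [0.02, 0.47, 0.9, 0.18, 0.05]
```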
Data Transformation- Scaling Contd…
Log Transformer:

X_new = ln(X)

Example: X = [2, 47, 90, 18, 5]
Log scaling gives X_new = [0.693147, 3.850148, 4.499810, 2.890372, 1.609438]

Where,
X_new is the mapped value
X is the data value to be mapped into the specific range

It is primarily used to convert a skewed distribution to a
normal/less-skewed distribution.
The log operation plays a dual role:
• Compressing very large values, reducing the impact of too-high values
• Spreading out very small values, reducing right skew.
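The slide's values correspond to the natural log, as a sketch shows (for data containing zeros, log1p, i.e. log(1 + x), is a common alternative):

```python
import math

X = [2, 47, 90, 18, 5]
# Natural log, matching the slide's example values.
X_log = [math.log(v) for v in X]
print([round(v, 6) for v in X_log])  # → [0.693147, 3.850148, 4.49981, 2.890372, 1.609438]
```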
Data Transformation – Scaling Contd…
MaxAbs Scaler:

X_new = X / max(|X|)

◦ It first takes the absolute value of each value in the column and
then takes the maximum of those.
◦ This operation scales the data into the range [-1, 1].

Example: X = [100, -263, -2000, 18, 5]; max |X| = 2000
MaxAbs normalization gives X_new = [0.05, -0.1315, -1, 0.009, 0.0025]
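A pure-Python sketch of the same computation (scikit-learn's MaxAbsScaler does this per column):

```python
def max_abs_scale(x):
    """Divide each value by the largest absolute value, mapping into [-1, 1]."""
    m = max(abs(v) for v in x)
    return [v / m for v in x]

X = [100, -263, -2000, 18, 5]
print(max_abs_scale(X))  # → [0.05, -0.1315, -1.0, 0.009, 0.0025]
```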
Data Transformation- Scaling Contd
Interquartile/Robust normalization:

X_new = (X - median) / IQR, where IQR = Q3 - Q1

Where,
X_new is the mapped value
X is the data value to be mapped into the specific range

• The mean, maximum and minimum values of a column are all
sensitive to outliers.
• If there are too many outliers in the data, they will
influence the mean and the max value or the min value.
• Thus, even if we scale such data using the above methods, we
cannot guarantee balanced data with a normal distribution; the
robust scaler avoids this by using the outlier-insensitive
median and IQR instead.

Example: X = [2, 47, 90, 18, 5]; median = 18, Q1 = 5, Q3 = 47, IQR = 42
Robust normalization gives
X_new = [-0.38095238, 0.69047619, 1.71428571, 0, -0.30952381]
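A sketch reproducing the slide's values, using linearly interpolated quartiles (the convention NumPy and scikit-learn's RobustScaler use by default; other quartile conventions give slightly different numbers):

```python
def quantile(sorted_x, q):
    """Linearly interpolated quantile over an already-sorted list."""
    pos = q * (len(sorted_x) - 1)
    lo = int(pos)
    frac = pos - lo
    if lo + 1 < len(sorted_x):
        return sorted_x[lo] * (1 - frac) + sorted_x[lo + 1] * frac
    return sorted_x[lo]

def robust_scale(x):
    """Subtract the median, divide by the interquartile range Q3 - Q1."""
    s = sorted(x)
    q1, med, q3 = quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75)
    return [(v - med) / (q3 - q1) for v in x]

X = [2, 47, 90, 18, 5]
print([round(v, 8) for v in robust_scale(X)])  # → [-0.38095238, 0.69047619, 1.71428571, 0.0, -0.30952381]
```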
Data Transformation- Aggregation
Aggregation: take aggregated values in order to put the
data in a better perspective.

e.g. in the case of transactional data, the day-to-day sales of a product at
various store locations can be aggregated store-wise over months or years
in order to be analyzed for decision making.

Benefits of aggregation
• Reduces the memory needed to store large data records.
• Provides a more stable viewpoint than individual data objects.
Data Transformation- Aggregation Contd..
Data Transformation- Generalization
Generalization: The data is generalized from low-level to
higher-order concepts using concept hierarchies.
e.g. a categorical attribute like street can be generalized to higher-order
concepts like city or country.

“The choice of generalization level depends on the problem
statement”
Data Transformation- Feature Construction
 Feature construction involves transforming a given set of input features to generate
a new set of more powerful features.

 e.g. features like mobile number and landline number may be combined into a
new feature, contact number.

 Features like apartment length and breadth can be combined into apartment area.
Data Transformation- Feature Construction
 There are certain situations where feature construction is an essential activity
before we can train a machine learning model.
 These situations are:
When features have categorical value and machine learning needs numeric
value inputs.
 Label Encoding
 One-Hot Encoding
Dummy Encoding
 When features have numeric (continuous) value and need to be converted to
ordinal values.
Rank according to numerical continuous values
 When text-specific feature construction needs to be done.
Bag-of-words
Tf-idf
 Word Embeddings
Data Transformation- Feature Construction
 Label Encoding
This approach is very simple: it involves converting each value in a column
to a number.
 Depending upon the data values and the type of data,
label encoding induces a new problem, since it uses
number sequencing.
 The problem with using numbers is that they
introduce a relation/comparison between categories.
 The algorithm might misunderstand that the data has
some kind of hierarchy/order 0 < 1 < 2 … < 6 and
might give 6× more weight to the ‘Cable’ bridge type
than to ‘Arch’ in its calculations.
Data Transformation- Feature Construction
 Label Encoding
◦ Let’s consider another column named ‘Safety
Level’.
◦ Performing label encoding of this column also
induces order/precedence in the numbers, but in the
right way.
◦ Here the numerical order does not look out of
place, and it makes sense if the algorithm
interprets the safety order as 0 < 1 < 2 < 3 < 4, i.e.
none < low < medium < high < very high.
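A minimal sketch of label encoding for this ordinal case; the category names come from the slide, and supplying the order explicitly (as scikit-learn's OrdinalEncoder allows via its categories argument) is what makes the codes meaningful:

```python
# Ordinal feature: supply the category order explicitly so codes respect it.
order = ["none", "low", "medium", "high", "very high"]
encode = {c: i for i, c in enumerate(order)}

safety = ["low", "very high", "none", "medium"]
print([encode[s] for s in safety])  # → [1, 4, 0, 2]

# For nominal features (e.g. bridge types), automatic alphabetical codes
# impose an order that does not exist in the data -- the problem above.
```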
Data Transformation- Feature Construction
 One-Hot Encoding
 Though label encoding is straightforward, it has the disadvantage that the numeric
values can be misinterpreted by algorithms.
 The ordering issue is addressed in another common alternative approach called ‘One-
Hot Encoding’.
 In this strategy, each category value is converted into a new column, and a 1
or 0 (notation for true/false) is assigned to the column.
 It does have the downside of adding more columns to the data set.
 It can cause the number of columns to expand greatly if a category column has
many unique values.
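A pure-Python sketch of the strategy, using the bridge-type categories from the earlier example (scikit-learn's OneHotEncoder and pandas' get_dummies are the usual tools):

```python
def one_hot(values):
    """One column per category; 1 marks the row's category, 0 elsewhere."""
    cats = sorted(set(values))
    rows = [[1 if v == c else 0 for c in cats] for v in values]
    return cats, rows

cats, rows = one_hot(["Arch", "Cable", "Arch"])
print(cats)  # → ['Arch', 'Cable']
print(rows)  # → [[1, 0], [0, 1], [1, 0]]
```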
Data Transformation- Feature Construction
 One-Hot Encoding
Data Transformation- Feature Construction
 Dummy Encoding
 The dummy coding scheme is similar to one-hot encoding.
 This categorical data encoding method transforms the categorical variable into
a set of binary variables (also known as dummy variables).
 In the case of one-hot encoding, a variable with N categories uses N binary
variables.
 Dummy encoding is a small improvement over one-hot encoding: it uses
N-1 features to represent N labels/categories.
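A sketch of dummy encoding, dropping the first category so that N categories need only N-1 columns (the dropped category is represented by all zeros; the bridge-type values are illustrative):

```python
def dummy_encode(values):
    """Like one-hot, but drop the first category: N categories -> N-1 columns."""
    cats = sorted(set(values))
    keep = cats[1:]  # cats[0] becomes the all-zeros baseline
    rows = [[1 if v == c else 0 for c in keep] for v in values]
    return keep, rows

cols, rows = dummy_encode(["Arch", "Beam", "Cable"])
print(cols)  # → ['Beam', 'Cable']
print(rows)  # → [[0, 0], [1, 0], [0, 1]]
```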
Data Transformation- Feature Construction
 Dummy Encoding
Data Transformation- Feature Construction
Text specific Features - BoW, TF-IDF
Document A: The car is driven on the road
Document B: The truck is driven on the highway
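For documents A and B above, Bag-of-Words and TF-IDF can be sketched as follows. The convention assumed here (tf = count / document length, idf = log(N / document frequency)) is one common variant; libraries such as scikit-learn's TfidfVectorizer use a slightly smoothed formula:

```python
import math

docs = {
    "A": "the car is driven on the road",
    "B": "the truck is driven on the highway",
}
tokens = {d: text.split() for d, text in docs.items()}
vocab = sorted({w for t in tokens.values() for w in t})

# Bag-of-Words: raw term counts per document over the shared vocabulary
bow = {d: [t.count(w) for w in vocab] for d, t in tokens.items()}

# TF-IDF: tf = count / doc length, idf = log(N / document frequency)
N = len(docs)
df = {w: sum(w in t for t in tokens.values()) for w in vocab}
tfidf = {d: [(t.count(w) / len(t)) * math.log(N / df[w]) for w in vocab]
         for d, t in tokens.items()}

print(vocab)
# Words shared by both documents (e.g. "the") get idf = log(2/2) = 0, so their
# tf-idf weight is 0; distinguishing words like "car"/"truck" get positive weights.
```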
