Data Pre-Processing-II
(Data Integration, Data Transformation)
Dr. JASMEET SINGH
ASSISTANT PROFESSOR, CSED
TIET, PATIALA
Data Integration
◦ Data Integration: It is the process of merging the data from
 multiple sources into a coherent data store.
 e.g. Collection of banking data from different banks at data stores of RBI
 Issues in data integration
 • Schema integration and feature matching
 • Detection and resolution of data value conflicts
 • Redundant features
  Data Integration
  Schema integration and feature matching:
The same feature may appear under different names across sources, e.g. one bank's table stores the customer identifier as Cust.id while another uses Cust.no, although both hold the same columns (Name, Age, DoB); such features must be matched to a single integrated schema.
                              "Carefully analyze the metadata"
Data Integration
 Detection and resolution of data value conflicts:
For the same feature, different sources may store conflicting representations, e.g. the Price of a product recorded in $ in one source, Rs in another, and pounds in a third; when integrating research data on the prices of essential products, these values must be converted to a common unit.
                              "Carefully analyze the metadata"
Data Integration
Redundant features: They are unwanted features that add no new information, typically because they can be derived from other features.
 To deal with redundant features, correlation analysis is performed; the correlation coefficient is denoted by r.
 A threshold on r is decided to find redundant features.
 A scatter plot of two features indicates whether r is +ve, r is -ve, or r is zero.
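As a rough sketch of correlation analysis, the Pearson coefficient r between two feature columns can be computed directly. The height columns below are an illustrative (hypothetical) example, not from the slides:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: the same height recorded in cm and in inches.
height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
r = pearson_r(height_cm, height_in)
print(r > 0.99)  # r close to +1: one of the two features is redundant
```

A feature whose |r| with another feature exceeds the chosen threshold (e.g. 0.9, a common but problem-dependent choice) can be dropped with little loss of information.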
Data Transformation
 Data Transformation: It is the process of transforming or consolidating the data into a form suitable for machine learning algorithms.
 Major techniques of data transformation are:
 • Normalization/Scaling
 • Aggregation
 • Generalization
 • Feature construction
Data Transformation- Scaling
Scaling/Normalization: It is the technique of mapping the numerical feature values of any range into a specific smaller range, e.g. [0, 1] or [-1, 1].
Popular methods of Normalization are:-
•   Min-Max method
•   Mean normalization
•   Z score method
•   Decimal scaling method
•   Log Transformer
•   MaxAbs Transformer
•   InterQuartile Scaler / Robust Scaler
Data Transformation- Scaling Contd
Min-Max method:

X' = (X - X_min) / (X_max - X_min)

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
X_min and X_max are the minimum and maximum values of the feature vector corresponding to X.

e.g. Min-max normalization maps X = [2, 47, 90, 18, 5] to X' = [0, 0.511, 1, 0.182, 0.034].
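A minimal Python sketch of the min-max method, applied to the example column X = [2, 47, 90, 18, 5]:

```python
def min_max_scale(values):
    """Map each value into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in min_max_scale(X)])
# -> [0.0, 0.511, 1.0, 0.182, 0.034]
```

The minimum always maps to 0 and the maximum to 1; note that a constant column (max = min) would divide by zero and needs special handling.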
  Data Transformation- Scaling Contd
Mean normalization:

X' = (X - X̄) / (X_max - X_min)

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
X̄ is the mean of the feature vector corresponding to X
X_min and X_max are the minimum and maximum values of the feature vector corresponding to X.

e.g. Mean normalization maps X = [2, 47, 90, 18, 5] to X' = [-0.345, 0.166, 0.655, -0.164, -0.311].
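The mean-normalization mapping, sketched in Python for the same example column:

```python
def mean_normalize(values):
    """Map values via (x - mean) / (max - min); the result is centred near 0."""
    mean = sum(values) / len(values)
    rng = max(values) - min(values)
    return [(x - mean) / rng for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in mean_normalize(X)])
# -> [-0.345, 0.166, 0.655, -0.164, -0.311]
```

Unlike min-max scaling, values below the mean become negative, which centres the feature around zero.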
Data Transformation- Scaling Contd…
Z Score method:

X' = (X - X̄) / σ

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
X̄ and σ are the mean and standard deviation of the feature vector corresponding to X.

e.g. Z score normalization maps X = [2, 47, 90, 18, 5] to X' = [-0.826, 0.397, 1.566, -0.391, -0.745].
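A sketch of the Z score method. Note one assumption: the slide's example numbers are reproduced only when σ is computed as the sample standard deviation (dividing by n - 1):

```python
import math

def z_score(values):
    """Map values via (x - mean) / std; std here uses n - 1
    (sample standard deviation), matching the example above."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / std for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in z_score(X)])
# -> [-0.826, 0.397, 1.566, -0.391, -0.745]
```

The transformed feature has mean 0 and unit standard deviation, which is useful when features have very different scales.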
Data Transformation- Scaling Contd…
Decimal scaling method:

X' = X / 10^j

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
j is the smallest integer such that max(|X'|) < 1, i.e. the count of digits in the largest absolute value of the feature vector corresponding to X.

e.g. Decimal scaling normalization maps X = [2, 47, 90, 18, 5] (j = 2) to X' = [0.02, 0.47, 0.9, 0.18, 0.05].
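Decimal scaling as a small sketch; the loop finds the smallest j that brings every scaled magnitude below 1:

```python
def decimal_scale(values):
    """Divide by 10^j, where j is the smallest integer making every
    |scaled value| < 1 (here j = 2, since max |x| is the 2-digit 90)."""
    j = 0
    while max(abs(x) for x in values) / (10 ** j) >= 1:
        j += 1
    return [x / (10 ** j) for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 2) for v in decimal_scale(X)])
# -> [0.02, 0.47, 0.9, 0.18, 0.05]
```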
    Data Transformation- Scaling Contd…
Log Transformer:

X' = log(X)   (natural logarithm)

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
It is primarily used to convert a skewed distribution into a normal/less-skewed distribution.
The log operation has a dual role:
• Reducing the impact of too-low values
• Reducing the impact of too-high values

e.g. Log scaling maps X = [2, 47, 90, 18, 5] to X' = [0.693147, 3.850148, 4.499810, 2.890372, 1.609438].
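The log transform, sketched with the natural logarithm (which reproduces the example values above):

```python
import math

def log_transform(values):
    """Natural-log transform; compresses large values, reducing right skew.
    Inputs must be positive (use log1p or a shift for zeros/negatives)."""
    return [math.log(x) for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 6) for v in log_transform(X)])
# -> [0.693147, 3.850148, 4.49981, 2.890372, 1.609438]
```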
 Data Transformation – Scaling Contd…
MaxAbs Scaler:

X' = X / max(|X|)

It divides each value in the column by the maximum of the absolute values in that column.
This operation scales the data into the range [-1, 1].

e.g. MaxAbs normalization maps X = [100, -263, -2000, 18, 5] to X' = [0.05, -0.1315, -1, 0.009, 0.0025].
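A minimal sketch of MaxAbs scaling on the example column:

```python
def max_abs_scale(values):
    """Divide every value by the maximum absolute value of the column,
    scaling the data into the range [-1, 1]."""
    m = max(abs(x) for x in values)
    return [x / m for x in values]

X = [100, -263, -2000, 18, 5]
print([round(v, 4) for v in max_abs_scale(X)])
# -> [0.05, -0.1315, -1.0, 0.009, 0.0025]
```

Because no shifting is involved, zeros stay zero, which makes this scaler a common choice for sparse data.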
       Data Transformation- Scaling Contd
Interquartile/Robust normalization:

X' = (X - X_median) / IQR

Where,
X' is the mapped value
X is the data value to be mapped into the specific range
X_median is the median and IQR = Q3 - Q1 is the interquartile range of the feature vector corresponding to X.

• The mean, maximum and minimum values of a column are all sensitive to outliers.
• If there are too many outliers in the data, they will influence the mean and the max or min value.
• Thus, even if we scale such data using the above methods, we cannot guarantee a balanced, near-normal distribution; the median and IQR, by contrast, are robust to outliers.

e.g. Robust normalization maps X = [2, 47, 90, 18, 5] to X' = [-0.381, 0.690, 1.714, 0, -0.310].
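A sketch of robust scaling; one assumption is that Q1 and Q3 are computed with linear interpolation (numpy's default percentile convention), which reproduces the example values above:

```python
def robust_scale(values):
    """Map values via (x - median) / IQR; Q1 and Q3 use linear
    interpolation between sorted data points."""
    s = sorted(values)

    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + frac * (s[hi] - s[lo])

    median = quantile(0.5)
    iqr = quantile(0.75) - quantile(0.25)
    return [(x - median) / iqr for x in values]

X = [2, 47, 90, 18, 5]
print([round(v, 3) for v in robust_scale(X)])
# -> [-0.381, 0.69, 1.714, 0.0, -0.31]
```

The median maps to 0, and a handful of extreme outliers barely moves the median or the IQR, unlike the mean/min/max used by the earlier methods.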
Data Transformation- Aggregation
Aggregation: aggregated values are taken in order to put the data in a better perspective.
e.g. in the case of transactional data, the day-to-day sales of a product at various store locations can be aggregated store-wise over months or years in order to be analysed for decision making.
 Benefits of aggregation
 • Reduces the memory consumed in storing large numbers of data records.
 • Provides a more stable view than individual data objects.
Data Transformation- Generalization
Generalization: The data is generalized from low-level concepts to higher-order concepts using concept hierarchies.
e.g. categorical attributes like street can be generalized to higher-order concepts like city or country.
   “The decision of generalization level depends on the problem
                            statement”
Data Transformation- Feature Construction
 Feature construction involves transforming a given set of input features to generate a new set of more powerful features.
 e.g. features like mobile number and landline number can be combined into a new feature, contact number.
 Features like apartment length and breadth can be converted into apartment area.
Data Transformation- Feature Construction
 There are certain situations where feature construction is an essential activity
 before we can train a machine learning model.
 These situations are:
  When features have categorical value and machine learning needs numeric
   value inputs.
    Label Encoding
    One-Hot Encoding
   Dummy Encoding
  When features have numeric (continuous) value and need to be converted to
  ordinal values.
   Rank according to numerical continuous values
  When text-specific feature construction needs to be done.
   Bag-of-words
   Tf-idf
    Word Embeddings
Data Transformation- Feature Construction
 Label Encoding
This approach is very simple: it involves converting each value in a column to a number.
• Depending upon the data values and the type of data, label encoding induces a new problem, since it uses number sequencing.
• The problem with using numbers is that they introduce a relation/comparison between the categories.
• The algorithm might misunderstand that the data has some kind of hierarchy/order 0 < 1 < 2 … < 6 and might give 6× more weight to the 'Cable' bridge type than to 'Arch'.
Data Transformation- Feature Construction
 Label Encoding
 ◦ Let’s consider another column named ‘Safety
   Level’.
 ◦ Performing label encoding on this column also induces order/precedence in the numbers, but in the right way.
 ◦ Here the numerical order does not look out of place, and it makes sense if the algorithm interprets the safety order as 0 < 1 < 2 < 3 < 4, i.e. none < low < medium < high < very high.
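A sketch of label encoding for the ordinal 'Safety Level' column (the sample rows are hypothetical); fixing the category order explicitly preserves the intended none < low < medium < high < very high ranking:

```python
# Explicit category order: none < low < medium < high < very high.
order = ['none', 'low', 'medium', 'high', 'very high']
mapping = {label: code for code, label in enumerate(order)}

safety = ['low', 'none', 'very high', 'medium', 'high']  # hypothetical column
encoded = [mapping[v] for v in safety]
print(encoded)  # -> [1, 0, 4, 2, 3]
```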
Data Transformation- Feature Construction
 One-Hot Encoding
  Though label encoding is straightforward, it has the disadvantage that the numeric values can be misinterpreted by algorithms.
  The ordering issue is addressed in another common alternative approach called ‘One-
   Hot Encoding’.
  In this strategy, each category value is converted into a new column and assigned a 1
   or 0 (notation for true/false) value to the column.
  It does have the downside of adding more columns to the data set.
  It can cause the number of columns to expand greatly if you have many unique values
   in a category column.
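One-hot encoding can be sketched as follows; the bridge-type values echo the slide's example, and sorting the categories alphabetically is just one possible column order:

```python
def one_hot(values):
    """One binary column per distinct category: N columns for N categories."""
    cats = sorted(set(values))
    return cats, [[1 if v == c else 0 for c in cats] for v in values]

bridge_type = ['Arch', 'Cable', 'Arch', 'Beam']
cats, rows = one_hot(bridge_type)
print(cats)  # -> ['Arch', 'Beam', 'Cable']
print(rows)  # -> [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Each row has exactly one 1, so no ordering between categories is implied, at the cost of one extra column per category.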
  Data Transformation- Feature Construction
 One-Hot Encoding
Data Transformation- Feature Construction
 Dummy Encoding
  The dummy coding scheme is similar to one-hot encoding.
  This categorical data encoding method transforms the categorical variable into a set of binary variables (also known as dummy variables).
  In the case of one-hot encoding, for N categories in a variable, it uses N binary variables.
  Dummy encoding is a small improvement over one-hot encoding: it uses N-1 features to represent N labels/categories.
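Dummy encoding differs from the one-hot sketch only in dropping one category; which category is dropped is a free choice (here, the first in sorted order):

```python
def dummy_encode(values):
    """Like one-hot, but drops the first category: N - 1 columns for N
    categories; the dropped category is represented by the all-zeros row."""
    cats = sorted(set(values))
    kept = cats[1:]  # the first category becomes the implicit baseline
    return kept, [[1 if v == c else 0 for c in kept] for v in values]

bridge_type = ['Arch', 'Cable', 'Arch', 'Beam']
kept, rows = dummy_encode(bridge_type)
print(kept)  # -> ['Beam', 'Cable']
print(rows)  # -> [[0, 0], [0, 1], [0, 0], [1, 0]]
```

'Arch' is recoverable as the all-zeros row, so no information is lost despite the smaller representation.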
  Data Transformation- Feature Construction
 Dummy Encoding
Data Transformation- Feature Construction
              Text-specific Features - BoW, TF-IDF
 Document A: The Car Is Driven On The Road
 Document B: The Truck is Driven on the highway
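As a minimal sketch, the bag-of-words representation of the two documents counts each vocabulary word per document (TF-IDF would then reweight these counts by inverse document frequency):

```python
def bag_of_words(docs):
    """Count-based bag-of-words vectors over a shared, sorted vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    return vocab, [[tokens.count(w) for w in vocab] for tokens in tokenized]

doc_a = "The Car Is Driven On The Road"
doc_b = "The Truck is Driven on the highway"
vocab, vectors = bag_of_words([doc_a, doc_b])
print(vocab)
# -> ['car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck']
print(vectors)
# -> [[1, 1, 0, 1, 1, 1, 2, 0], [0, 1, 1, 1, 1, 0, 2, 1]]
```

Only the words 'car'/'road' vs 'truck'/'highway' differ between the two vectors, which is exactly the signal TF-IDF would amplify relative to the shared words like 'the'.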