DATA (PRE-)PROCESSING
In the Previous Class,
• We discussed various types of data, with examples
In this Class,
• We focus on data pre-processing – “an important
  milestone of the Data Mining Process”
Data analysis pipeline
 Mining is not the only step in the analysis process
 Preprocessing: real data is noisy, incomplete and inconsistent.
   Data cleaning is required to make sense of the data
    Techniques: Sampling, Dimensionality Reduction, Feature Selection.
 Post-Processing: Make the data actionable and useful to the user
    Statistical analysis of importance & Visualization.
Why Preprocess the Data
Measures for Data Quality: A Multidimensional View
Accuracy: Correct or Wrong, Accurate or Not
Completeness: Not recorded, unavailable,…
Consistency: some entries modified but others not, ...
Timeliness: Timely update?
Believability: how much can the data be trusted to be correct?
Interpretability: how easily the data can be understood?
Why Data Preprocessing?
• Data in the real world is dirty
   • incomplete: lacking attribute values, lacking certain attributes
     of interest, or containing only aggregate data
   • noisy: containing errors or outliers
   • inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
   • Quality decisions must be based on quality data
   • Data warehouse needs consistent integration of quality data
   • Required for both OLAP and Data Mining!
Why Can Data Be Incomplete?
  Attributes of interest are not available (e.g., customer
  information for sales transaction data)
  Data were not considered important at the time of
  transactions, so they were not recorded!
  Data were not recorded because of misunderstandings or
  malfunctions
  Data may have been recorded and later deleted!
  Missing/unknown values for some data
Attribute Values
Data is described using attribute values
Attribute values are numbers or symbols assigned to an attribute
 Distinction between attributes and attribute values
    Same attribute can be mapped to different attribute values
      Example: height can be measured in feet or meters
 Different attributes can be mapped to the same set of
values
     Example: Attribute values for ID and age are integers
 But properties of attribute values can be different
 ID has no limit but age has a maximum and minimum value
Types of Attributes
There are different types of attributes
 Nominal
   Examples: ID numbers, eye color, zip codes
 Ordinal
   Examples: rankings (e.g., taste of potato chips on a
     scale from 1-10), grades, height in {tall, medium,
     short}
 Interval
   Examples: calendar dates
 Ratio
   Examples: length, time, counts
Discrete and Continuous Attributes
 Discrete Attribute
    Has only a finite or countably infinite set of values
    Examples: zip codes, counts, or the set of words in a
    collection of documents
    Often represented as integer variables.
 Continuous Attribute
    Has real numbers as attribute values
    Examples: temperature, height, or weight.
    Practically, real values can only be measured and
    represented using a finite number of digits.
Data Preprocessing
Major Tasks in Data Preprocessing
   • Data cleaning
      • Fill in missing values, smooth noisy data, identify or remove outliers
        (outliers = exceptions!), and resolve inconsistencies
   • Data integration
      • Integration of multiple databases, data cubes, or files
   • Data transformation
      • Normalization and aggregation
   • Data reduction
      • Obtains reduced representation in volume but produces the same or
        similar analytical results
   • Data discretization
      • Part of data reduction but with particular importance, especially for
        numerical data
Forms of data preprocessing
   Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g.,
instrument faults, human or computer errors, transmission errors
       • Data cleaning tasks
           • Fill in missing values
           • Identify outliers and smooth out noisy data
           • Correct inconsistent data
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the
   task is classification) – not effective when the percentage of missing values per
   attribute varies considerably.
 • Fill in the missing value manually: tedious + infeasible?
 • Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
 • Use the attribute mean to fill in the missing value
 • Use the attribute mean for all samples belonging to the same class to fill in the
   missing value: smarter
 • Use the most probable value to fill in the missing value: inference-based such
   as Bayesian formula or decision tree
How to Handle Missing Data? (Example)
               Age   Income Religion Gender
               23    24,200   Muslim      M
               39    ?        Christian   F
               45    45,390   ?           F
 Fill missing values using aggregate functions (e.g., average) or probabilistic estimates
 on global value distribution
 E.g., put the average income here, or put the most probable income based on the fact
 that the person is 39 years old
 E.g., put the most frequent religion here
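A minimal pandas sketch of the two fills described above (attribute mean for a numeric column, most frequent value for a categorical one); the DataFrame below simply mirrors the toy table and is an illustrative assumption, not part of the lecture material.

    import pandas as pd

    # Toy data mirroring the table above; None marks the missing values.
    df = pd.DataFrame({
        "Age":      [23, 39, 45],
        "Income":   [24200, None, 45390],
        "Religion": ["Muslim", "Christian", None],
        "Gender":   ["M", "F", "F"],
    })

    # Numeric attribute: fill with the attribute mean (a global aggregate).
    df["Income"] = df["Income"].fillna(df["Income"].mean())

    # Categorical attribute: fill with the most frequent value
    # (the mode; ties are broken by order).
    df["Religion"] = df["Religion"].fillna(df["Religion"].mode()[0])

    print(df)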
Data Quality
 Data has attribute values
 Then,
 How good is our data with respect to these attribute
 values?
Data Quality
 Examples of data quality problems:
   Noise and outliers
   Missing values
   Duplicate data
Data Quality: Noise
   Noise refers to the modification of original values
   Examples: distortion of a person’s voice when talking on a poor
   phone connection
Data Quality: Outliers
 Outliers are data objects with characteristics that are
considerably different than most of the other data objects in the
data set
Data Quality: Missing Values
      Reasons for missing values
     • Information is not collected
       (e.g., people decline to give their age and weight)
     • Attributes may not be applicable to all cases (e.g.,
       annual income is not applicable to children)
  • Handling missing values
    •Eliminate Data Objects
    •Estimate Missing Values
    •Ignore the Missing Value During Analysis
    •Replace with all possible values (weighted by their
    probabilities)
Data Quality: Duplicate Data
Data sets may include data objects that are duplicates, or
almost duplicates, of one another
 Major issue when merging data from heterogeneous
sources
    Examples:
      Same person with multiple email addresses
   Data cleaning
   Process of dealing with duplicate-data issues
Data Quality: Handle Noise (Binning)
 Binning
    sort data and partition into (equi-depth) bins
    smooth by bin means, bin medians, bin boundaries, etc.
 Regression
    smooth by fitting a regression function
 Clustering
    detect and remove outliers
 Combined computer and human inspection
    detect suspicious values automatically and have them
    checked by a human
Data Quality: Handle Noise (Binning)
 Equal-width binning
    Divides the range into N intervals of equal size;
    width of intervals: W = (B – A)/N
    Simple, but outliers may dominate the result
 Equal-depth binning
    Divides the range into N intervals,
    each containing approximately the same number of records
    Skewed data is handled well
Data Quality: Handle Noise (Binning)
     Example: Sorted price values 4, 8, 9, 15, 21, 21, 24, 25, 26,
        28, 29, 34
        • Partition into three (equi-depth) bins
        - Bin 1: 4, 8, 9, 15
        - Bin 2: 21, 21, 24, 25
        - Bin 3: 26, 28, 29, 34
        • Smoothing by bin means
         - Bin 1: 9, 9, 9, 9
         - Bin 2: 23, 23, 23, 23
         - Bin 3: 29, 29, 29, 29
        • Smoothing by bin boundaries
        - Bin 1: 4, 4, 4, 15
        - Bin 2: 21, 21, 25, 25
        - Bin 3: 26, 26, 26, 34
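A small Python sketch reproducing the equal-depth partitioning and smoothing-by-means steps of the example above; the bin depth of 4 is taken from the example, and this is a sketch rather than a general implementation.

    # Sorted price values from the example above.
    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

    # Equal-depth (equi-frequency) partitioning: 3 bins, 4 values each.
    depth = len(prices) // 3
    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # Smoothing by bin means: replace every value by its bin's (rounded) mean.
    smoothed = [[round(sum(b) / len(b)) for _ in b] for b in bins]

    print(bins)      # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
    print(smoothed)  # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]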
Data Quality: Handle Noise (Regression)
•Replace noisy or missing values by
predicted values
•Requires model of attribute
dependencies (maybe wrong!)
•Can be used for data smoothing or
for handling missing data
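A minimal sketch of regression-based smoothing, assuming one clean attribute x and one noisy attribute y and fitting a simple linear model with NumPy; the numbers are illustrative only, not data from the lecture.

    import numpy as np

    # Hypothetical paired attributes: x is assumed clean, y is noisy.
    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.7, 30.0])   # 30.0 looks like noise

    # Fit a simple linear regression y ~ a*x + b.
    a, b = np.polyfit(x, y, deg=1)

    # Smooth: replace each y by its predicted value on the fitted line.
    y_smoothed = a * x + b
    print(y_smoothed)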
Data Integration
The process of combining multiple sources into a single dataset. The Data
integration process is one of the main components in data management.
Data integration:
   Combines data from multiple sources into a coherent store
Schema integration integrate metadata from different sources
   metadata: data about the data (i.e., data descriptors)
Entity identification problem: identify real-world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts: for the same real-
world entity, attribute values from different sources are
different (e.g., J. D. Smith and John Smith may refer to the same
person)
   possible reasons: different representations, different scales,
   e.g., metric vs. British units (inches vs. cm)
Handling Redundant Data in Data Integration
      • Redundant data often occur when integrating multiple databases
        • The same attribute may have different names in different
         databases
        • One attribute may be a “derived” attribute in another table, e.g.,
         annual revenue
      • Redundant data may be detected by correlation analysis (see the sketch below)
     • Careful integration of the data from multiple sources may help
      reduce/avoid redundancies and inconsistencies and improve mining
      speed and quality
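A hedged sketch of detecting redundancy by correlation analysis with pandas; the column names (monthly_revenue, annual_revenue, num_employees) and the 0.95 threshold are illustrative assumptions, not part of the lecture.

    import pandas as pd

    # Hypothetical integrated table: annual_revenue is derived from
    # monthly_revenue, so the two attributes are redundant.
    df = pd.DataFrame({
        "monthly_revenue": [10, 20, 15, 30, 25],
        "annual_revenue":  [120, 240, 180, 360, 300],
        "num_employees":   [3, 9, 4, 12, 7],
    })

    # Pearson correlation matrix; values near +/-1 suggest redundancy.
    corr = df.corr()
    print(corr)

    # Flag attribute pairs whose absolute correlation exceeds a threshold.
    threshold = 0.95
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                print("possibly redundant:", cols[i], "and", cols[j])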
Data Transformation
 • Smoothing: remove noise from data
 • Aggregation: summarization, data cube construction
 • Generalization: concept hierarchy climbing
 • Normalization: scaled to fall within a small, specified range
    • min-max normalization
    • z-score normalization
    • normalization by decimal scaling
 • Attribute/feature construction
    • New attributes constructed from the given ones
Normalization: Why normalization?
  • Speeds up some learning techniques (e.g., neural networks)
  • Helps prevent attributes with large ranges from outweighing ones
    with small ranges
    • Example:
       • income has range 3000-200000
       • age has range 10-80
       • gender has domain M/F
Data Transformation
Data has attribute values
Then,
Can we compare these attribute values?
For example, compare the following two pairs of records:
(1) (5.9 ft, 50 kg)
(2) (4.6 ft, 55 kg)
       vs.
(3) (5.9 ft, 50 kg)
(4) (5.6 ft, 56 kg)
We need data transformation to make records with different
dimensions (attributes) comparable ...
Data Transformation Techniques
Normalization: scaled to fall within a small,
specified range.
    min-max normalization
    z-score normalization
    normalization by decimal scaling
   Centralization:
     Based on fitting a distribution to the data
     Distance function between distributions
   KL Distance
   Mean Centering
Data Transformation: Normalization
Example: Data Transformation
- Assume min and max values for height and weight.
- Now, apply min-max normalization to both attributes, as given
  below
(1) (5.9 ft, 50 kg)
(2) (4.6 ft, 55 kg)
       vs.
(3) (5.9 ft, 50 kg)
(4) (5.6 ft, 56 kg)
- Compare your results...
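One possible way to carry out this exercise in Python, assuming illustrative min/max bounds for height (4–7 ft) and weight (40–100 kg); these bounds are assumptions, not values given in the slides.

    # Assumed attribute ranges (illustrative only).
    H_MIN, H_MAX = 4.0, 7.0     # height in ft
    W_MIN, W_MAX = 40.0, 100.0  # weight in kg

    def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
        # Min-max normalization: map v from [lo, hi] onto [new_lo, new_hi].
        return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

    records = [(5.9, 50), (4.6, 55), (5.9, 50), (5.6, 56)]
    normalized = [(min_max(h, H_MIN, H_MAX), min_max(w, W_MIN, W_MAX))
                  for h, w in records]

    for raw, norm in zip(records, normalized):
        print(raw, "->", tuple(round(x, 2) for x in norm))

    # After normalization both attributes lie in [0, 1], so differences in
    # height and weight contribute on a comparable scale.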
Data Transformation: Aggregation
  Combining two or more attributes (or objects) into a
single attribute (or object)
 Purpose
 Data reduction
    Reduce the number of attributes or objects
 Change of scale
    Cities aggregated into regions, states, countries,
   etc
 More “stable” data
    Aggregated data tends to have less variability
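A minimal pandas sketch of aggregation as a change of scale (city-level rows rolled up to one row per region); the city/region/sales columns and values are hypothetical.

    import pandas as pd

    # Hypothetical city-level sales records.
    sales = pd.DataFrame({
        "city":   ["Pune", "Mumbai", "Nagpur", "Lyon", "Paris"],
        "region": ["Maharashtra", "Maharashtra", "Maharashtra", "France", "France"],
        "sales":  [120, 300, 80, 150, 400],
    })

    # Change of scale: aggregate city-level rows into one row per region.
    by_region = sales.groupby("region", as_index=False)["sales"].sum()
    print(by_region)   # fewer objects, coarser (more stable) values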
Data Transformation: Discretization
 Motivation for Discretization
 Some data mining algorithms only accept
categorical attributes
 May improve understandability of patterns
Data Transformation: Discretization
  Task
 • Reduce the number of values for a given
    continuous attribute by partitioning the range of
    the attribute into intervals
 • Interval labels replace actual attribute values
  Methods
 • Binning (as explained earlier)
 • Cluster analysis (will be discussed later)
 • Entropy-based Discretization (Supervised)
Simple Discretization Methods: Binning
     Equal-width (distance) partitioning:
        Divides the range into N intervals of equal size: uniform grid
        If A and B are the lowest and highest values of the attribute, the
        width of intervals will be: W = (B – A)/N
        The most straightforward, but outliers may dominate the presentation
        Skewed data is not handled well
     Equal-depth (frequency) partitioning:
        Divides the range into N intervals, each containing approximately
        the same number of samples
        Good data scaling
        Managing categorical attributes can be tricky
 Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining
may take a very long time to run on the complete data set
• Data reduction
   • Obtains a reduced representation of the data set that is much
     smaller in volume but yet produces the same (or almost the same)
     analytical results
• Data reduction strategies
   • Data cube aggregation
   • Dimensionality reduction
   • Data compression
   • Numerosity reduction
   • Discretization and concept hierarchy generation
Techniques of Data Reduction
  Techniques or methods of data reduction in data mining, such
  as
    Dimensionality Reduction
• The number of random variables or
  attributes is reduced so that the
  dimensionality of the data set is
  reduced.
• Attributes of the data are combined and
  merged without losing their original
  characteristics. This also helps reduce
  storage space and computation time.
Numerosity Reduction:
Reduce the volume of data
The representation of the data is made smaller by reducing its volume,
while preserving the essential information.
• Parametric methods
   • Assume the data fits some model, estimate model parameters, store
     only the parameters, and discard the data (except possible outliers)
   • Log-linear models: obtain the value at a point in m-D space as the
     product of values on appropriate marginal subspaces
• Non-parametric methods
   • Do not assume models
   • Major families: histograms, clustering, sampling
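As a quick illustration of a non-parametric method, a simple random sample can stand in for the full data set; the sketch below uses pandas and an arbitrary 1% sampling fraction (both the table and the fraction are assumptions).

    import pandas as pd

    # Hypothetical large table.
    df = pd.DataFrame({"value": range(1_000_000)})

    # Simple random sample without replacement: keep 1% of the rows.
    sample = df.sample(frac=0.01, random_state=42)
    print(len(sample))   # 10,000 rows represent the full data set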
Data Cube Aggregation
• The lowest level of a data cube
   • the aggregated data for an individual entity of interest
   • e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
   • Further reduce the size of data to deal with
• Reference appropriate levels
   • Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using
 data cube, when possible
Data Compression
[Figure: the original data can either be compressed losslessly (fully
 recoverable) or approximated by lossy compression.]
Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each
  bucket
• Can be constructed optimally in one dimension using dynamic
  programming
• Related to quantization problems
[Figure: example histogram over buckets ranging from 10000 to 90000.]
Histogram types
• Equal-width histograms:
   • It divides the range into N intervals of equal size
• Equal-depth (frequency) partitioning:
   • It divides the range into N intervals, each containing approximately same number of
    samples
• V-optimal:
   • It considers all histogram types for a given number of buckets and chooses the one
    with the least variance.
• MaxDiff:
   • After sorting the data to be approximated, it defines the borders of the buckets at
    points where the adjacent values have the maximum difference
      • Example: split 1,1,4,5,5,7,9,14,16,18,27,30,30,32 to three buckets
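A short Python sketch of the MaxDiff idea applied to the example above: the two largest gaps between adjacent sorted values become the bucket borders. This is a sketch of the border-selection step only, not a reference implementation.

    # Sorted values from the example above; target: 3 buckets.
    data = [1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32]
    n_buckets = 3

    # Differences between adjacent values; the largest (n_buckets - 1)
    # gaps define the bucket borders.
    gaps = [(data[i + 1] - data[i], i) for i in range(len(data) - 1)]
    borders = sorted(i for _, i in sorted(gaps, reverse=True)[:n_buckets - 1])

    buckets, start = [], 0
    for b in borders:
        buckets.append(data[start:b + 1])
        start = b + 1
    buckets.append(data[start:])

    print(buckets)  # [[1, 1, 4, 5, 5, 7, 9], [14, 16, 18], [27, 30, 30, 32]]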
       Clustering
• Partitions data set into clusters, and models it by one representative
 from each cluster
• Can be very effective if data is clustered but not if data is “smeared”
• There are many choices of clustering definitions and clustering
 algorithms
           Cluster Analysis
[Figure: salary vs. age scatter plot; the distance between points in the
 same cluster should be small, while an outlier lies far from every cluster.]
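A hedged sketch of clustering-based reduction with scikit-learn's KMeans: the data set is modelled by one representative (the centroid) per cluster. The synthetic age/salary data and the choice of k = 2 are assumptions made for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical 2-D data (age, salary) drawn from two groups.
    rng = np.random.default_rng(0)
    data = np.vstack([
        rng.normal(loc=(25, 30_000), scale=(2, 2_000), size=(100, 2)),
        rng.normal(loc=(45, 80_000), scale=(3, 5_000), size=(100, 2)),
    ])

    # Model the data by one representative (centroid) per cluster.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    representatives = kmeans.cluster_centers_
    print(representatives)   # 200 points reduced to 2 representatives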
Hierarchical Reduction
  • Use multi-resolution structure with different degrees of reduction
  • Hierarchical clustering is often performed but tends to define
   partitions of data sets rather than “clusters”
  • Parametric methods are usually not amenable to hierarchical
   representation
  • Hierarchical aggregation
     • An index tree hierarchically divides a data set into partitions by value range
       of some attributes
     • Each partition can be considered as a bucket
     • Thus an index tree with aggregates stored at each node is a hierarchical
       histogram
Discretization
   • Three types of attributes:
     • Nominal — values from an unordered set
     • Ordinal — values from an ordered set
     • Continuous — real numbers
   • Discretization:
     • divide the range of a continuous attribute into intervals
     • why?
        • Some classification algorithms only accept categorical
          attributes.
        • Reduce data size by discretization
        • Prepare for further analysis
Discretization and Concept hierarchy
   • Discretization
      • reduce the number of values for a given continuous attribute
       by dividing the range of the attribute into intervals. Interval
       labels can then be used to replace actual data values.
   • Concept hierarchies
      • reduce the data by collecting and replacing low level concepts
       (such as numeric values for the attribute age) by higher level
       concepts (such as young, middle-aged, or senior).
Discretization and concept hierarchy
generation for numeric data
• Binning/Smoothing
• Histogram analysis
• Clustering analysis
• Entropy-based discretization
• Segmentation by natural partitioning
Entropy-Based Discretization
      • Entropy of an interval S1 whose samples fall into m classes:
                 Ent(S1) = – Σ_{i=1..m} p_i log2(p_i)
      • Given a set of samples S, if S is partitioned into two
        intervals S1 and S2 using boundary T, the expected information
        requirement after partitioning is
                 I(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
      • The boundary that minimizes I(S, T) (equivalently, maximizes the
        information gain Ent(S) – I(S, T)) over all possible boundaries is
        selected as a binary discretization.
      • The process is recursively applied to the partitions obtained
        until some stopping criterion is met, e.g., the information gain
        drops below a threshold δ:
                 Ent(S) – I(S, T) < δ
      • Experiments show that it may reduce data size and improve
       classification accuracy
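A small Python sketch of the boundary-selection step: it evaluates every candidate boundary T (midpoints between adjacent sorted values) and keeps the one minimizing I(S, T). The toy values and class labels are assumed for illustration, and the recursion and stopping criterion are omitted.

    import math
    from collections import Counter

    def entropy(labels):
        # Ent(S) = -sum_i p_i * log2(p_i) over the class distribution of S.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_boundary(values, labels):
        # Pick the boundary T minimizing
        # I(S,T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2),
        # i.e. maximizing the information gain Ent(S) - I(S,T).
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = None
        for i in range(1, n):
            t = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for _, lab in pairs[:i]]
            right = [lab for _, lab in pairs[i:]]
            info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
            if best is None or info < best[1]:
                best = (t, info)
        return best

    # Hypothetical attribute values with class labels.
    values = [1, 2, 3, 10, 11, 12]
    labels = ["no", "no", "no", "yes", "yes", "yes"]
    t, info = best_boundary(values, labels)
    print("boundary:", t, "I(S,T):", info, "gain:", entropy(labels) - info)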
Segmentation by natural partitioning
   Users often like to see numerical ranges partitioned into
    relatively uniform, easy-to-read intervals that appear intuitive or
    “natural”. E.g., [50-60] better than [51.223-60.812]
       The 3-4-5 rule can be used to segment numerical data into
       relatively uniform, “natural” intervals.
       * If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit,
          partition the range into 3 equiwidth intervals for 3,6,9 or 2-3-2 for 7
       * If it covers 2, 4, or 8 distinct values at the most significant digit, partition the
          range into 4 equiwidth intervals
       * If it covers 1, 5, or 10 distinct values at the most significant digit, partition the
          range into 5 equiwidth intervals
     The rule can be recursively applied for the resulting intervals
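A loose Python sketch of the interval-count logic of the 3-4-5 rule (how many equi-width intervals to use, plus the 2-3-2 case for 7 distinct values); rounding the range to the most significant digit is a simplifying assumption here, and the recursive refinement of sub-intervals is omitted.

    import math

    def three_four_five(low, high):
        # Magnitude of the most significant digit of the range.
        msd = 10 ** math.floor(math.log10(high - low))
        lo, hi = math.floor(low / msd) * msd, math.ceil(high / msd) * msd
        distinct = round((hi - lo) / msd)   # distinct values at the msd

        if distinct == 7:                   # 2-3-2 split
            return [lo, lo + 2 * msd, lo + 5 * msd, hi]
        if distinct in (3, 6, 9):
            n = 3
        elif distinct in (2, 4, 8):
            n = 4
        else:                               # 1, 5, or 10 distinct values
            n = 5
        width = (hi - lo) / n
        return [lo + i * width for i in range(n + 1)]

    print(three_four_five(0, 9000))  # [0.0, 3000.0, 6000.0, 9000.0]: 9 values -> 3 intervals
    print(three_four_five(0, 7000))  # [0, 2000, 5000, 7000]: 7 values -> 2-3-2 split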
Python - Data Wrangling
Data wrangling is the process of cleaning and unifying messy and complex data sets
for easy access and analysis
Working with raw data sucks.
•   Data comes in all shapes and sizes – CSV files, PDFs, stone
    tablets, .jpg…
•   Different files have different formatting – spaces instead of
    NULLs, extra rows
•   “Dirty” data – unwanted anomalies – duplicates
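A minimal pandas clean-up sketch addressing exactly these problems (blank strings instead of NULLs, empty extra rows, duplicates); the tiny DataFrame and its columns are made up for illustration.

    import numpy as np
    import pandas as pd

    # Hypothetical messy CSV-like data: blank strings instead of NULLs,
    # an empty extra row, and a duplicate record.
    df = pd.DataFrame({
        "name":  ["Alice", "Bob", " ", "Bob", "Carol"],
        "email": ["a@x.com", "b@x.com", " ", "b@x.com", ""],
    })

    # Treat blank / whitespace-only strings as missing values.
    df = df.replace(r"^\s*$", np.nan, regex=True)

    # Drop rows that are entirely empty, then drop exact duplicates.
    df = df.dropna(how="all").drop_duplicates()
    print(df)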
Principal Component Analysis
Principal Component Analysis, or PCA, is a
dimensionality-reduction method that is often used to
reduce the dimensionality of large data sets, by
transforming a large set of variables into a smaller one
that still contains most of the information in the large
set.
Principal Component Analysis or
Karhunen-Loève (K-L) method
• Given N data vectors from k-dimensions, find c <= k
 orthogonal vectors that can be best used to represent data
  • The original data set is reduced to one consisting of N data vectors
    on c principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal
 component vectors
• Works for numeric data only
• Used when the number of dimensions is large
Principal Component Analysis
   X1, X2: original axes (attributes)   X2
   Y1,Y2: principal components
                                                          Y1
                  Y2
                                                   significant component
                                                   (high variance)
                                                                X1
  Order principal components by significance and eliminate weaker ones
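A hedged scikit-learn sketch of PCA as described above: N data vectors in k = 3 dimensions are reduced to c = 2 principal components ordered by explained variance. The synthetic, correlated data is an assumption made for illustration.

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data: N = 200 vectors in k = 3 dimensions (numeric only);
    # the second dimension is strongly correlated with the first.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(200, 1))
    X = np.hstack([base,
                   2 * base + 0.1 * rng.normal(size=(200, 1)),
                   rng.normal(size=(200, 1))])

    # Keep c = 2 orthogonal principal components ordered by variance.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(pca.explained_variance_ratio_)  # significance of each component
    print(X_reduced.shape)                # (200, 2): reduced representation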