UNIT-2 Data Preprocessing
Lecture                  Topic
**********************************************
Lecture-13               Why preprocess the data?
Lecture-14               Data cleaning
Lecture-15               Data integration and transformation
Lecture-16               Data reduction
Lecture-17               Discretization and concept
                         hierarchy generation
       Lecture-13
Why preprocess the data?
   Data in the real world is:
   
       incomplete: lacking attribute values, lacking certain
       attributes of interest, or containing only aggregate
       data
   
       noisy: containing errors or outliers
   
       inconsistent: containing discrepancies in codes or
       names
   No quality data, no quality mining results!
   
       Quality decisions must be based on quality data
   
       Data warehouse needs consistent integration of
       quality data
Multi-Dimensional Measure of Data Quality
   A well-accepted multidimensional view:
   
       Accuracy
   
       Completeness
   
       Consistency
   
       Timeliness
   
       Believability
   
       Value added
   
       Interpretability
   
       Accessibility
   Broad categories:
   
       intrinsic, contextual, representational, and
       accessibility.
      Major Tasks in Data Preprocessing
Data cleaning
    Fill in missing values, smooth noisy data, identify or remove
    outliers, and resolve inconsistencies
Data integration
    Integration of multiple databases, data cubes, or files
Data transformation
    Normalization and aggregation
Data reduction
    Obtains a reduced representation that is much smaller in
    volume but produces the same or similar analytical results
Data discretization
    Part of data reduction but with particular importance, especially
    for numerical data
Forms of data preprocessing
    (figure: data cleaning, data integration, data transformation,
    and data reduction as successive preprocessing forms)
 Lecture-14
Data cleaning
Data cleaning tasks
    Fill in missing values
    Identify outliers and smooth out noisy data
    Correct inconsistent data
        Missing Data
Data is not always available
    E.g., many tuples have no recorded value for several
    attributes, such as customer income in sales data
Missing data may be due to
    equipment malfunction
    inconsistent with other recorded data and thus deleted
    data not entered due to misunderstanding
    certain data may not be considered important at the time of
    entry
    history or changes of the data were not recorded
Missing data may need to be inferred.
   How to Handle Missing Data?
Ignore the tuple: usually done when the class
label is missing
Fill in the missing value manually
Use a global constant to fill in the missing value,
e.g., “unknown”
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the
same class to fill in the missing value
Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision
tree
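The fill strategies above are easy to sketch with pandas; a minimal example (the table, column names, and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sales data: 'income' contains missing values (NaN).
df = pd.DataFrame({
    "cust_class": ["A", "A", "B", "B", "B"],
    "income":     [30000.0, np.nan, 52000.0, np.nan, 48000.0],
})

# Global constant: fill every missing value with the label "unknown".
const_filled = df["income"].astype(object).fillna("unknown")

# Attribute mean: fill with the overall mean income.
mean_filled = df["income"].fillna(df["income"].mean())

# Class-conditional mean: fill with the mean income of the tuple's class.
class_filled = df.groupby("cust_class")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(mean_filled.tolist())   # NaNs replaced by the overall mean
print(class_filled.tolist())  # NaNs replaced by per-class means
```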
          Noisy Data
Noise: random error or variance in a measured
variable
Incorrect attribute values may be due to
    faulty data collection instruments
    data entry problems
    data transmission problems
    technology limitation
    inconsistency in naming convention
Other data problems which require data cleaning
    duplicate records
    incomplete data
    inconsistent data
      How to Handle Noisy Data?
Binning method:
    first sort data and partition into (equal-
    frequency) bins
    then one can smooth by bin means, smooth
    by bin median, smooth by bin boundaries
Clustering
    detect and remove outliers
Regression
    smooth by fitting the data to a regression
    function, e.g., linear regression
    Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
    It divides the range into N intervals of equal size:
    uniform grid
    if A and B are the lowest and highest values of the
    attribute, the width of intervals will be: W = (B-A)/N.
    The most straightforward
    But outliers may dominate presentation
    Skewed data is not handled well.
Equal-depth (frequency) partitioning:
    It divides the range into N intervals, each containing
    approximately the same number of samples
    Good data scaling
    Managing categorical attributes can be tricky.
    Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
  26, 28, 29, 34
* Partition into (equi-depth) bins:
   - Bin 1: 4, 8, 9, 15
   - Bin 2: 21, 21, 24, 25
   - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
   - Bin 1: 9, 9, 9, 9
   - Bin 2: 23, 23, 23, 23
   - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
   - Bin 1: 4, 4, 4, 15
   - Bin 2: 21, 21, 25, 25
   - Bin 3: 26, 26, 26, 34
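The same smoothing can be reproduced in NumPy; a minimal sketch of the price example above (note the slide rounds the bin means 22.75 and 29.25 to 23 and 29):

```python
import numpy as np

prices = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.split(prices, 3)  # equal-depth partitioning: 4 values per bin

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: each value moves to the nearer boundary.
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max())
             for b in bins]

print(by_means)   # [9 9 9 9], [22.75 ...], [29.25 ...]
print(by_bounds)  # [4 4 4 15], [21 21 25 25], [26 26 26 34]
```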
Cluster Analysis
    Organize values into clusters; values falling outside the
    set of clusters may be treated as outliers and removed.
    (figure: clustered 2-D data with outliers lying outside the clusters)
Regression
    (figure: noisy value Y1 at X1 smoothed to Y1′ by fitting the
    regression line y = x + 1)
    Lecture-15
Data integration and
   transformation
            Data Integration
Data integration:
    combines data from multiple sources into a coherent
    store
Schema integration
    integrate metadata from different sources
    Entity identification problem: identify real-world entities
    from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
    for the same real world entity, attribute values from
    different sources are different
    possible reasons: different representations, different
    scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration
Redundant data often occur when integrating
multiple databases
    The same attribute may have different names
    in different databases
    One attribute may be a “derived” attribute in
    another table, e.g., annual revenue
Redundancy can often be detected by
correlation analysis
Careful integration of the data from
multiple sources may help reduce/avoid
redundancies and inconsistencies and
improve mining speed and quality
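A minimal sketch of correlation analysis for redundancy detection; the two revenue attributes are hypothetical, with one derived from the other:

```python
import numpy as np

rng = np.random.default_rng(0)
monthly_rev = rng.normal(50_000, 8_000, size=200)
annual_rev = 12 * monthly_rev + rng.normal(0, 5_000, size=200)  # derived
unrelated = rng.normal(0, 1, size=200)

# Pearson correlation: |r| near 1 flags a likely redundant attribute pair.
r_pair = np.corrcoef(monthly_rev, annual_rev)[0, 1]
r_none = np.corrcoef(monthly_rev, unrelated)[0, 1]
print(f"monthly vs annual:    r = {r_pair:.3f}")  # close to 1 -> redundant
print(f"monthly vs unrelated: r = {r_none:.3f}")  # close to 0 -> keep both
```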
       Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube
construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small,
specified range
    min-max normalization
    z-score normalization
    normalization by decimal scaling
Attribute/feature construction
    New attributes constructed from the given ones
 Data Transformation: Normalization
min-max normalization
    $v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
z-score normalization
    $v' = \frac{v - \text{mean}_A}{\text{stand\_dev}_A}$
normalization by decimal scaling
    $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
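The three normalizations, sketched in NumPy on a small hypothetical attribute:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to a new range [new_min, new_max], here [0, 1].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: smallest integer j such that max(|v'|) < 1.
j = 0
while np.abs(v / 10**j).max() >= 1:
    j += 1
v_decimal = v / 10**j   # here j = 4, since 1000 / 10**3 is not < 1

print(v_minmax, v_zscore, v_decimal, sep="\n")
```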
 Lecture-16
Data reduction
A warehouse may store terabytes of data, so
complex data analysis/mining may take a very
long time to run on the complete data set
Data reduction
    Obtains a reduced representation of the data set
    that is much smaller in volume but yet produces
    the same (or almost the same) analytical results
       Data Reduction Strategies
    Data cube aggregation
    Attribute subset selection
    Dimensionality reduction
    Numerosity reduction
    Discretization and concept hierarchy
    generation
              Data Cube Aggregation
The lowest level of a data cube
    the aggregated data for an individual entity of interest
    e.g., a customer in a phone calling data warehouse.
Multiple levels of aggregation in data cubes
    Further reduce the size of data to deal with
Reference appropriate levels
    Use the smallest representation which is enough to
    solve the task
Queries regarding aggregated information should
be answered using the data cube, when possible
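Climbing one aggregation level can be sketched with a pandas groupby; the quarterly sales figures below are hypothetical, rolled up from eight tuples to two annual totals:

```python
import pandas as pd

# Hypothetical quarterly sales (the lowest cube level stored here).
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 311, 402, 390, 600],
})

# Aggregate to the coarser 'year' level: 8 tuples -> 2 tuples.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```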
            Dimensionality Reduction
Feature selection (attribute subset selection):
    Select a minimum set of features such that the
    probability distribution of different classes given the
    values for those features is as close as possible to the
    original distribution given the values of all features
    reduces the number of patterns in the mining result, making them easier to understand
Heuristic methods
    step-wise forward selection
    step-wise backward elimination
    combining forward selection and backward elimination
    decision-tree induction
Wavelet Transforms
    (figure: Haar-2 and Daubechies-4 wavelet families)
 Discrete wavelet transform (DWT): linear signal
 processing
 Compressed approximation: store only a small fraction of
 the strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better
 lossy compression, localized in space
 Method:
 
     Length, L, must be an integer power of 2 (padding with 0s, when
     necessary)
 
     Each transform has 2 functions: smoothing, difference
 
      Applies to pairs of data, resulting in two sets of data of length L/2
  
      Applies the two functions recursively until the desired length is reached
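A minimal sketch of the method for the Haar case: the two functions are the pairwise average (smoothing) and the pairwise half-difference, applied recursively to the smoothed half (the input series is a hypothetical example):

```python
import numpy as np

def haar_dwt(x):
    # Length must be an integer power of 2 (pad with zeros otherwise).
    x = np.asarray(x, dtype=float)
    if len(x) == 1:
        return x
    smooth = (x[0::2] + x[1::2]) / 2   # smoothing function, length L/2
    detail = (x[0::2] - x[1::2]) / 2   # difference function, length L/2
    # Recurse on the smoothed half until one coefficient remains.
    return np.concatenate([haar_dwt(smooth), detail])

coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)  # most of the signal sits in a few large coefficients
# Compression: keep only the strongest coefficients, zero out the rest.
```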
       Principal Component Analysis
Given N data vectors in k dimensions, find c ≤ k
orthogonal vectors that can best be used
to represent data
    The original data set is reduced to one consisting of N
    data vectors on c principal components (reduced
    dimensions)
Each data vector is a linear combination of the c
principal component vectors
Works for numeric data only
Used when the number of dimensions is large
Principal Component Analysis
    (figure: 2-D data in the X1-X2 plane with principal components
    Y1 and Y2; Y1 points along the direction of greatest variance)
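A minimal NumPy sketch of PCA via the eigenvectors of the covariance matrix, on hypothetical data with k = 5 dimensions reduced to c = 2:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # N = 200, k = 5

Xc = X - X.mean(axis=0)                 # center the data
cov = (Xc.T @ Xc) / (len(Xc) - 1)       # k x k covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenpairs, ascending eigenvalues

c = 2                                   # keep the top c components
top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]
X_reduced = Xc @ top                    # N vectors on c principal components
print(X_reduced.shape)                  # (200, 2)
```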
   Attribute subset selection
Attribute subset selection reduces the data
set size by removing irrelevant or
redundant attributes.
The goal is to find a minimum set of attributes.
Uses basic heuristic methods of attribute
selection
Example of Decision Tree Induction
    Initial attribute set:
    {A1, A2, A3, A4, A5, A6}

                  A4?
                /      \
             A1?        A6?
            /    \     /    \
      Class 1 Class 2 Class 1 Class 2

>   Reduced attribute set: {A1, A4, A6}
       Numerosity Reduction
Parametric methods
    Assume the data fits some model, estimate model
    parameters, store only the parameters, and discard
    the data (except possible outliers)
    Log-linear models: obtain the value at a point in m-D
    space as a product over appropriate marginal
    subspaces
Non-parametric methods
    Do not assume models
    Major families: histograms, clustering, sampling
        Regression and Log-Linear Models
Linear regression: Data are modeled to fit a
straight line
    Often uses the least-square method to fit the line
Multiple regression: allows a response variable
Y to be modeled as a linear function of a
multidimensional feature vector
Log-linear model: approximates discrete
multidimensional probability distributions
     Regression Analysis and Log-Linear
                Models
Linear regression: $Y = \alpha + \beta X$
    Two parameters, $\alpha$ and $\beta$, specify the line and are to
    be estimated by using the data at hand,
    applying the least-squares criterion to the known values of
    Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
    Many nonlinear functions can be transformed into the
    above.
Log-linear models:
    The multi-way table of joint probabilities is
    approximated by a product of lower-order tables.
    Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
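A minimal sketch of the least-squares estimates of α and β, on hypothetical data generated around the line y = x + 1 from the earlier regression figure:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 1.0 * x + rng.normal(0, 0.5, size=50)  # y = x + 1 plus noise

# Closed-form least-squares estimates of the two line parameters.
beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()
print(f"fitted line: Y = {alpha:.2f} + {beta:.2f} X")  # roughly Y = 1 + 1 X
```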
                   Histograms
A popular data reduction technique
Divide data into buckets and store the
average (sum) for each bucket
Can be constructed optimally in one
dimension using dynamic programming
Related to quantization problems.
    (figure: equal-width histogram of prices with buckets
    from 10,000 to 90,000)
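A sketch of an equal-width histogram that stores only a count and sum per bucket; the price data and the 10,000-wide buckets are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
prices = rng.integers(1, 100_001, size=1_000)  # hypothetical price data

# Equal-width buckets of width 10,000; keep only (count, sum) per bucket.
edges = np.arange(0, 110_000, 10_000)
counts, _ = np.histogram(prices, bins=edges)
sums, _ = np.histogram(prices, bins=edges, weights=prices)

for lo, hi, n, s in zip(edges[:-1], edges[1:], counts, sums):
    mean = s / n if n else float("nan")
    print(f"[{lo:6d}, {hi:6d}): count={n:4d}  bucket mean={mean:9.1f}")
```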
                   Clustering
Partition the data set into clusters; then one can store
the cluster representations only
Can be very effective if data is clustered but not if data
is “smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms.
                    Sampling
Allows a large data set to be represented
by a much smaller random sample of the data.
Let a large data set D contain N tuples.
Methods to reduce data set D:
    Simple random sample without replacement
    (SRSWOR)
    Simple random sample with replacement
    (SRSWR)
    Cluster sample
    Stratified sample
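A NumPy sketch of three of these methods, treating a hypothetical array of tuple IDs as D:

```python
import numpy as np

rng = np.random.default_rng(4)
D = np.arange(10_000)   # stand-in for a data set of N = 10,000 tuples
n = 100                 # desired sample size

srswor = rng.choice(D, size=n, replace=False)  # SRSWOR
srswr = rng.choice(D, size=n, replace=True)    # SRSWR

# Stratified sample: draw proportionally from each stratum
# (here a hypothetical two-valued stratum label).
strata = D % 2
stratified = np.concatenate([
    rng.choice(D[strata == s], size=n // 2, replace=False)
    for s in np.unique(strata)])
print(len(srswor), len(srswr), len(stratified))  # 100 100 100
```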
 Sampling
    (figure: SRSWOR and SRSWR samples drawn from the raw data,
    i.e., simple random sampling without and with replacement)
             Sampling
    (figure: cluster/stratified sample drawn from the raw data)
       Lecture-17
Discretization and concept
   hierarchy generation
                      Discretization
Three types of attributes:
    Nominal — values from an unordered set
    Ordinal — values from an ordered set
    Continuous — real numbers
Discretization: divide the range of a continuous
attribute into intervals
    Some classification algorithms only accept
    categorical attributes.
    Reduce data size by discretization
    Prepare for further analysis
 Discretization and Concept hierarchy
Discretization
    reduce the number of values for a given continuous
    attribute by dividing the range of the attribute into
    intervals. Interval labels can then be used to replace
    actual data values.
Concept hierarchies
    reduce the data by collecting and replacing low level
    concepts (such as numeric values for the attribute
    age) by higher level concepts (such as young,
    middle-aged, or senior).
  Discretization and concept hierarchy
      generation for numeric data
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Discretization by intuitive partitioning
         Entropy-Based Discretization
Given a set of samples S, if S is partitioned into
two intervals S1 and S2 using boundary T, the
entropy after partitioning is
    $E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$
The boundary that minimizes the entropy function
over all possible boundaries is selected as a
binary discretization.
The process is recursively applied to the partitions
obtained until some stopping criterion is met, e.g.,
    $\mathrm{Ent}(S) - E(T, S) < \delta$
Experiments show that it may reduce data size
and improve classification accuracy.
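A minimal sketch of a single binary split: scan the candidate boundaries T and keep the one minimizing E(S, T). The attribute values and class labels are hypothetical, and the full method would recurse on both intervals:

```python
import numpy as np

def ent(labels):
    # Shannon entropy of a class-label array.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_boundary(values, labels):
    # Try each midpoint between consecutive sorted values as T.
    order = np.argsort(values)
    v, y = values[order], labels[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        # Weighted entropy E(S, T) of the two resulting intervals.
        e = (i * ent(y[:i]) + (len(v) - i) * ent(y[i:])) / len(v)
        if e < best_e:
            best_t, best_e = (v[i] + v[i - 1]) / 2, e
    return best_t, best_e

age = np.array([23.0, 25, 31, 35, 40, 45, 52, 60])
buys = np.array([0, 0, 0, 1, 1, 1, 0, 0])
print(best_boundary(age, buys))  # boundary T with the minimum E(S, T)
```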
      Discretization by intuitive partitioning
  3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at the most
    significant digit, partition the range into 3 equal-width
    intervals
* If it covers 2, 4, or 8 distinct values at the most significant
    digit, partition the range into 4 intervals
* If it covers 1, 5, or 10 distinct values at the most
    significant digit, partition the range into 5 intervals
                Example of 3-4-5 rule
Step 1: For profit, Min = -$351, Low (5th percentile) = -$159,
    High (95th percentile) = $1,838, Max = $4,700.
Step 2: msd = 1,000; rounding Low and High to the msd gives
    Low' = -$1,000 and High' = $2,000, i.e., the range (-$1,000 - $2,000).
Step 3: This range covers 3 distinct values at the msd, so it is
    partitioned into 3 equal-width intervals: (-$1,000 - $0],
    ($0 - $1,000], and ($1,000 - $2,000].
Step 4: Adjusting for the actual Min and Max, the first interval
    shrinks to (-$400 - $0] and a new interval ($2,000 - $5,000] is
    added to cover Max. Each interval is then recursively partitioned
    by the same rule: (-$400 - $0] into 4 sub-intervals of $100;
    ($0 - $1,000] and ($1,000 - $2,000] into 5 sub-intervals of $200
    each; and ($2,000 - $5,000] into 3 sub-intervals of $1,000.
Concept hierarchy generation for categorical
                   data
 Specification of a partial ordering of attributes
 explicitly at the schema level by users or experts
 Specification of a portion of a hierarchy by
 explicit data grouping
 Specification of a set of attributes, but not of
 their partial ordering
 Specification of only a partial set of attributes
     Specification of a set of attributes
Concept hierarchy can be automatically
 generated based on the number of distinct
 values per attribute in the given attribute set.
 The attribute with the most distinct values is
 placed at the lowest level of the hierarchy.
        country                          15 distinct values
        province_or_state                65 distinct values
        city                          3,567 distinct values
        street                      674,339 distinct values
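This heuristic is easy to sketch with pandas: count the distinct values per attribute and order them ascending (the small location table below is hypothetical):

```python
import pandas as pd

# Hypothetical location attributes.
loc = pd.DataFrame({
    "country": ["USA", "USA", "USA", "Canada", "Canada"],
    "state":   ["NY", "NY", "CA", "ON", "BC"],
    "city":    ["NYC", "Buffalo", "LA", "Toronto", "Vancouver"],
})

# Fewest distinct values -> top of the hierarchy, most -> bottom.
order = loc.nunique().sort_values().index.tolist()
print(" < ".join(order))  # country < state < city
```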