UNIT-II
Know the Data and Data Preprocessing
Know the Data and Data Preprocessing: Data Objects and
attribute types, Basic statistical description of Data, Data preprocessing,
Data cleaning, Data Integration and Data reduction. Main Approaches
for Dimensionality Reduction, Projection, Manifold Learning, PCA.
Insufficient Quantity of Training Data, Nonrepresentative Training Data,
Poor-Quality Data, Irrelevant Features, Overfitting the Training Data,
Underfitting the Training Data, Stepping Back, Testing and Validating.
                               Data Object
❑Data sets are made up of data objects.
❑A data object represents an entity.
❑Examples:
   ❑   sales database: customers, store items, sales
   ❑   medical database: patients, treatments
   ❑   university database: students, professors, courses
❑Also called samples, examples, instances, data points, objects, or data tuples.
❑Data objects are described by attributes.
                              Data Objects
❑An attribute is a property or characteristic or feature of a data object.
  ❑ Examples: eye color of a person, temperature, etc.
❑Attribute is also known as variable, field, characteristic, or feature
❑A collection of attributes describe an object.
❑Attribute values are numbers or symbols assigned to an attribute
❑Database rows → data objects; columns → attributes.
                                     Attributes
❑ Attribute (or dimensions, features, variables): a data field, representing a
  characteristic or feature of a data object.
      ❑E.g., customer_ID, name, address
❑ Attribute values are numbers or symbols assigned to an attribute.
❑ Distinction between attributes and attribute values
   ❑ Same attribute can be mapped to different attribute values
       ❑Example: height can be measured in feet or meters
   ❑ Different attributes can be mapped to the same set of values
       ❑Example: Attribute values for ID and age are integers
       ❑But properties of attribute values can be different; ID has no limit, but
        age has a maximum and minimum value
                     Attribute Types
❖NOMINAL ( “relating to names”)
❖BINARY (only two categories or states)
❖ORDINAL (Order or Ranking)
❖NUMERIC (Measurable quantity)
❖DISCRETE
❖CONTINUOUS
                            Attribute Types
❑Categorical (Qualitative)
  ❑   Nominal and Ordinal attributes are collectively referred to as
      categorical or qualitative attributes.
❑Numeric (Quantitative)
  ❑   Interval and Ratio are collectively referred to as quantitative
      or numeric attributes.
❑Discrete vs Continuous attributes
                              Attribute Types
Nominal: categories, states, or “names of things”, “Symbols”.
◼   Hair_color = {auburn, black, blond, brown, grey, red, white}
◼   Marital status, occupation, ID numbers, zip codes
Binary
◼   Nominal attribute with only 2 states (0 and 1)
◼   Symmetric binary: both outcomes equally important
       e.g., gender
◼   Asymmetric binary: outcomes not equally important.
       e.g., medical test (positive vs. negative)
       Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
◼   Values have a meaningful order (ranking) but magnitude between successive values
    is not known.
◼   Size = {small, medium, large}, grades, army rankings
Numeric: a quantity (integer or real-valued)
Interval
        Measured on a scale of equal-sized units
        Values have order
        ◼ E.g., temperature in C˚ or F˚, calendar dates
        No true zero-point
Ratio
        Inherent zero-point
        We can speak of values as being an order of magnitude larger
        than the unit of measurement (10 K is twice as high as 5 K).
         ◼ e.g., temperature in Kelvin, length, counts, monetary
           quantities
                               Attribute Types
                      (Discrete vs. Continuous Attributes)
Discrete Attribute
◼   Has only a finite or countably infinite set of values
          ◼ zip codes, profession, or the set of words in a collection of documents
◼   Sometimes, represented as integer variables
◼   Note: Binary attributes are a special case of discrete attributes
◼   Binary attributes where only non-zero values are important are called asymmetric binary
    attributes.
Continuous Attribute
◼   Has real numbers as attribute values
          ◼ Temperature, height, or weight
◼   Practically, real values can only be measured and represented using a finite number of
    digits
◼   Continuous attributes are typically represented as floating-point variables
            Basic statistical description of data
❑Basic statistical descriptions can be used to
 identify properties of the data and highlight
 which data values should be treated as noise
 or outliers.
❑For data preprocessing tasks, we want to learn
 about data characteristics regarding both
 central tendency and dispersion of the data.
Measures of central tendency include mean, median, mode,
and midrange.
Measures of data dispersion include quartiles, interquartile
range (IQR), and variance.
These descriptive statistics are of great help in understanding
the distribution of the data.
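As a quick illustration (not part of the original slides), these measures can be computed with NumPy and SciPy on a small hypothetical sample; the values in `data` are made up for the example, and a recent SciPy (≥ 1.9) is assumed for the `keepdims` argument:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of incomes (in thousands)
data = np.array([30, 36, 47, 50, 52, 52, 52, 60, 63, 70, 70, 110])

# Central tendency
mean = data.mean()                              # arithmetic mean
median = np.median(data)                        # middle value
mode = stats.mode(data, keepdims=False).mode    # most frequent value
midrange = (data.min() + data.max()) / 2        # average of min and max

# Dispersion
q1, q3 = np.percentile(data, [25, 75])          # first and third quartiles
iqr = q3 - q1                                   # interquartile range
variance = data.var(ddof=1)                     # sample variance

print(mean, median, mode, midrange, iqr, variance)
```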
Symmetric vs. Skewed Data
[Figure: median, mean, and mode of symmetric, positively skewed, and
negatively skewed data. In a symmetric distribution the three coincide; in
a positively skewed one the mode lies below the median and mean, and in a
negatively skewed one the order is reversed.]
Dispersion
Dispersion measures the extent to which the items vary from a central
value. It is also called spread, scatter, or variability.
                     Data Preprocessing
Data Preprocessing: An Overview
◼   Data Quality
◼   Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable
• Consistency: some modified but some not, dangling, …
• Timeliness: timely updates?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?
Major Techniques / Tasks in Data Preprocessing
Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
  and resolve inconsistencies
Data integration
• Integration of multiple databases, data cubes, or files
Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
Data transformation and data discretization
• Normalization
• Concept hierarchy generation
[Figure: forms of data preprocessing: cleaning, integration, reduction,
transformation.]
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data,
e.g., instrument faults, human or computer error, transmission errors.
Incomplete:
• lacking attribute values, lacking certain attributes of interest, or
  containing only aggregate data
• e.g., Occupation=“ ” (missing data)
Noisy:
• containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
Inconsistent:
• containing discrepancies in codes or names
• e.g., Age=“42”, Birthday=“20/03/2010”
• was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
Intentional (e.g., disguised missing data):
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
Data is not always available
 • E.g., many tuples have no recorded value for several attributes, such as
   customer income in sales data
Missing data may be due to
 • Equipment malfunction
 • Inconsistency with other recorded data, leading to deletion
 • Data not entered due to misunderstanding
 • Certain data not being considered important at the time of entry
 • History or changes of the data not being registered
Missing data may need to be inferred
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing
classification). Not effective when the percentage of missing values per
attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based, such as a Bayesian formula or
  decision tree
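A minimal pandas sketch of the automatic fill-in strategies above; the table, column names, and the sentinel constant are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical customer table with missing income values
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 30_000, np.nan, 34_000],
})

# Global constant: a sentinel standing in for "unknown"
df["income_const"] = df["income"].fillna(-1)

# Attribute mean over all tuples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean per class: the "smarter" variant above
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```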
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
 • faulty data collection instruments
 • data entry problems
 • data transmission problems
 • technology limitations
 • inconsistency in naming conventions
Other data problems which require data cleaning
 • duplicate records
 • incomplete data
 • inconsistent data
How to Handle Noisy Data?
Binning
 • First sort the data and partition it into (equal-frequency) bins
 • Then smooth by bin means, bin medians, or bin boundaries, etc.
   (see the sketch below)
Regression
 • Data can also be smoothed by fitting regression functions
 • Linear regression, multiple linear regression
Clustering
 • Place data elements into groups of similar values (clusters)
 • Detect and remove outliers
Combined computer and human inspection
 • Detect suspicious values and check by human (e.g., deal with possible
   outliers)
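The binning variants can be sketched with NumPy; the nine sorted prices and the bin depth of 3 are a hypothetical example:

```python
import numpy as np

# Hypothetical sorted prices, partitioned into 3 equal-frequency bins
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = prices.reshape(3, 3)                    # depth-3 bins

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)     # [9, 9, 9, 22, ..., 29]

# Smoothing by bin boundaries: each value snaps to the nearest bin edge
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)
print(by_bounds.ravel())
```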
Data Integration
Data integration:
 • Combines data from multiple sources into a coherent store
Entity identification problem:
 • Identify real-world entities from multiple data sources,
   e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts:
 • For the same real-world entity, attribute values from different sources
   may differ
 • Possible reasons: different representations, different scales,
   e.g., metric vs. British units
Redundancy:
 • Object identification
 • Derivable data
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases
• Object identification: the same attribute or object may have different
  names in different databases
• Derivable data: one attribute may be a “derived” attribute in another
  table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis and
covariance analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining speed
and quality
Χ² Correlation Test (Nominal Data)
Χ² (chi-square) test:

    Χ² = Σ (Observed − Expected)² / Expected

The larger the Χ² value, the more likely the variables are related
The cells that contribute the most to the Χ² value are those whose actual
count is very different from the expected count
Correlation does not imply causality
◼   # of hospitals and # of car thefts in a city are correlated
◼   Both are causally linked to a third variable: population
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

Χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated from the data distribution in the two categories):

    Χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360
         + (1000 − 840)²/840 = 507.93

It shows that like_science_fiction and play_chess are correlated in the
group.
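The same table can be checked with SciPy's `chi2_contingency` (a real SciPy function; `correction=False` disables Yates' continuity correction so the result matches the hand calculation):

```python
from scipy.stats import chi2_contingency
import numpy as np

# Observed contingency table from the slide
observed = np.array([[250, 200],      # like science fiction
                     [50, 1000]])     # do not like science fiction

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)        # [[ 90. 360.] [210. 840.]] -- matches the slide
print(round(chi2, 2))  # 507.93
```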
Data Reduction Strategies
Data reduction: a process that reduces the volume of the original data and
represents it in a much smaller volume.
Why data reduction?
• Increase storage efficiency / reduce storage cost
• Performance: complex data analysis may take a very long time to run on
  the complete data set
Data reduction strategies
• Dimensionality reduction (remove unimportant attributes)
  • Wavelet transforms
  • Principal Components Analysis (PCA)
  • Feature subset selection, feature creation
• Numerosity reduction (some simply call it: data reduction)
  • Regression and log-linear models
  • Histograms, clustering, sampling
  • Data cube aggregation
• Data compression
Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning algorithm used
for dimensionality reduction in machine learning.
It is a statistical process that converts observations of correlated
features into a set of linearly uncorrelated features with the help of an
orthogonal transformation. These new transformed features are called the
principal components.
It is a popular tool for exploratory data analysis and predictive
modeling. It draws out strong patterns in the given dataset by projecting
it onto the directions of maximum variance.
PCA generally tries to find a lower-dimensional surface onto which to
project the high-dimensional data.
[Figure: 2-D points in the (x1, x2) plane projected onto their first
principal component.]
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance
matrix.
Step 4: Sort the eigenvalues and their corresponding eigenvectors.
Step 5: Pick the top k eigenvalues and form a matrix of their
eigenvectors.
Step 6: Transform the original matrix.
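A minimal NumPy sketch of the six steps on hypothetical random data (step 1 only centers the data here; dividing by the standard deviation as well would give full standardization):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 3-D data
X = rng.normal(size=(100, 3)) @ np.array([[2, 0, 0],
                                          [1, 1, 0],
                                          [0, 0, 0.1]])

# Step 1: standardize (here: center each feature)
Xc = X - X.mean(axis=0)

# Step 2: covariance matrix of the features
cov = np.cov(Xc, rowvar=False)

# Step 3: eigenvalues and eigenvectors (symmetric matrix -> eigh)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort by decreasing eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: keep the top k eigenvectors as the projection matrix
k = 2
W = eigvecs[:, :k]

# Step 6: transform the original (centered) data
Z = Xc @ W
print(Z.shape)   # (100, 2)
```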
Regression Analysis
Regression analysis:
◼   Regression analysis is a statistical method to model the relationship
    between a dependent (target) variable and one or more independent
    (predictor) variables.
◼   More specifically, regression analysis helps us understand how the
    value of the dependent variable changes with respect to one
    independent variable when the other independent variables are held
    fixed.
◼   It predicts continuous/real values such as temperature, age, salary,
    house price, etc.
The parameters are estimated so as to give a “best fit” of the data.
Used for prediction (including forecasting of time-series data),
inference, hypothesis testing, and modeling of causal relationships.
[Figure: fitted line y = x + 1 through points (X1, Y1), with Y1′ the
predicted value at X1.]
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
◼   Two regression coefficients, w and b, specify the line and are to be
    estimated by using the data at hand
◼   Uses the least-squares criterion on the known values of Y1, Y2, …,
    X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
◼   Many nonlinear functions can be transformed into the above
Log-linear models:
◼   Approximate discrete multidimensional probability distributions
◼   Estimate the probability of each point (tuple) in a multi-dimensional
    space for a set of discretized attributes, based on a smaller subset
    of dimensional combinations
◼   Useful for dimensionality reduction and data smoothing
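A sketch of linear regression as a numerosity-reduction device, on hypothetical points lying near the slide's line y = x + 1; only the two coefficients (w, b) need to be stored instead of all the points:

```python
import numpy as np

# Hypothetical (x, y) data roughly following y = x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 6.1])

# Least-squares estimates of w (slope) and b (intercept)
w, b = np.polyfit(x, y, deg=1)
print(w, b)          # close to 1 and 1

# The data can now be approximated by the fitted line alone
y_hat = w * x + b
```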
Histogram Analysis
[Figure: equal-width histogram with buckets of width 10,000 from 10,000
to 100,000 and counts on the y-axis from 0 to 40.]
Divide data into buckets and store the average (or sum) for each bucket
Partitioning rules:
◼   Equal-width: equal bucket range
◼   Equal-frequency (or equal-depth): equal bucket counts
Clustering
Partition the data set into clusters based on similarity, and store only
the cluster representation (e.g., centroid and diameter)
Can be very effective if the data is clustered, but not if the data is
“smeared”
Can use hierarchical clustering and store the result in multi-dimensional
index tree structures
There are many choices of clustering definitions and clustering algorithms
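A sketch with scikit-learn's `KMeans`; the data and the diameter definition (twice the maximum distance from a member to its centroid) are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical 2-D data with three natural groups
X = np.vstack([rng.normal(c, 0.3, size=(500, 2)) for c in (0, 3, 6)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Store only the cluster representation: centroid + diameter per cluster,
# reducing 1500 points to 3 summary records
for label in range(3):
    members = X[km.labels_ == label]
    centroid = members.mean(axis=0)
    diameter = 2 * np.linalg.norm(members - centroid, axis=1).max()
    print(label, centroid.round(2), round(diameter, 2))
```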
Sampling
Sampling: obtaining a small sample s to represent the whole data set N
Allows a mining algorithm to run with complexity that is potentially
sub-linear in the size of the data
Key principle: choose a representative subset of the data
◼   Simple random sampling may perform very poorly in the presence of skew
◼   Develop adaptive sampling methods, e.g., stratified sampling
Note: sampling may not reduce database I/Os (a page is read at a time)
Types of Sampling
Simple random sampling
• There is an equal probability of selecting any particular item
Sampling without replacement
• Once an object is selected, it is removed from the population
Sampling with replacement
• A selected object is not removed from the population
Stratified sampling
• Partition the data set, and draw samples from each partition
  (proportionally, i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data
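A pandas sketch of the sampling schemes above; the column names and strata are hypothetical, while `DataFrame.sample` and `GroupBy.sample` are real pandas methods:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "stratum": rng.choice(["young", "middle", "senior"],
                          size=1000, p=[0.2, 0.5, 0.3]),
})

# Simple random sampling without replacement (SRSWOR)
srswor = df.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR)
srswr = df.sample(n=100, replace=True, random_state=0)

# Stratified sampling: ~10% from each stratum, preserving proportions
stratified = (df.groupby("stratum", group_keys=False)
                .sample(frac=0.1, random_state=0))
print(stratified["stratum"].value_counts())
```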
Sampling: With or without Replacement
[Figure: raw data sampled with replacement (SRSWR) and without
replacement (SRSWOR).]
Sampling: Cluster or Stratified Sampling
[Figure: raw data beside a cluster/stratified sample.]
What Is Wavelet Transform?
Decomposes a signal into different frequency sub-bands
◼   Applicable to n-dimensional signals
Data are transformed so as to preserve the relative distance between
objects at different levels of resolution
Allows natural clusters to become more distinguishable
Used for image compression
Wavelet Transformation
Discrete wavelet transform (DWT) for linear signal processing and
multi-resolution analysis (e.g., Haar-2, Daubechies-4 wavelets)
Compressed approximation: store only a small fraction of the strongest
wavelet coefficients
Similar to the discrete Fourier transform (DFT), but better lossy
compression, localized in space
Method:
 ▪   The length, L, must be an integer power of 2 (pad with 0s when
     necessary)
 ▪   Each transform has 2 functions: smoothing and difference
 ▪   Applied to pairs of data, resulting in two sets of data of length L/2
 ▪   The two functions are applied recursively until the desired length is
     reached
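A minimal sketch of the method for the Haar case: the smoothing function is the pairwise average, the difference function is half the pairwise difference, applied recursively. The function name and the example signal are assumptions for illustration:

```python
import numpy as np

def haar_dwt(signal):
    """Recursive Haar DWT: pairwise averages (smoothing) and halved
    pairwise differences (detail), applied until one smooth coefficient
    remains. Length must be a power of 2 (pad with zeros otherwise)."""
    s = np.asarray(signal, dtype=float)
    coeffs = []
    while len(s) > 1:
        avg = (s[0::2] + s[1::2]) / 2      # smoothing function
        diff = (s[0::2] - s[1::2]) / 2     # difference function
        coeffs.append(diff)                # keep detail coefficients
        s = avg                            # recurse on the smooth half
    return s, coeffs[::-1]                 # coarsest details first

smooth, details = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(smooth, details)   # [2.75], then details at each resolution level
```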
Why Wavelet Transform?
Uses hat-shaped filters
• Emphasizes regions where points cluster
• Suppresses weaker information at their boundaries
Effective removal of outliers
• Insensitive to noise, insensitive to input order
Multi-resolution
• Detects arbitrarily shaped clusters at different scales
Only applicable to low-dimensional data
Efficient: complexity O(N)
Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values, i.e., each old value can be identified with
one of the new values.
Data transformation is the process of converting data from one format or
structure into another format or structure.
• Smoothing: remove noise from the data
• Attribute/feature construction: new attributes constructed from the
  given ones
• Aggregation: summarization, data cube construction
• Normalization: scaled to fall within a smaller, specified range
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
• Discretization: concept hierarchy climbing
Normalization
Min-max normalization: to [new_minA, new_maxA]

    v′ = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA

◼   Ex. Let income range from $12,000 to $98,000 be normalized to
    [0.0, 1.0]. Then $73,600 is mapped to
    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μA: mean, σA: standard deviation):

    v′ = (v − μA) / σA

◼   Ex. Let μ = 54,000, σ = 16,000. Then
    (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:

    v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
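All three normalizations in NumPy, reusing the slide's numbers (the decimal-scaling line assumes the maximum absolute value is not an exact power of 10):

```python
import numpy as np

v = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
minmax = (v - v.min()) / (v.max() - v.min()) * (1.0 - 0.0) + 0.0

# Z-score normalization (using the slide's mu and sigma)
mu, sigma = 54_000, 16_000
zscore = (v - mu) / sigma

# Decimal scaling: divide by 10**j so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal = v / 10**j

print(minmax[2], zscore[2], decimal)   # 0.716..., 1.225, v / 100000
```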
Discretization
Three types of attributes:
• Nominal: values from an unordered set, e.g., color, profession
• Ordinal: values from an ordered set, e.g., military or academic rank
• Numeric: real numbers, e.g., integer or real values
Discretization: divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Reduce data size by discretization
• Discretization can be performed recursively on an attribute
• Prepares for further analysis, e.g., classification
Data Discretization Methods
All of the following methods can be applied recursively:
• Binning: top-down split, unsupervised
• Histogram analysis: top-down split, unsupervised
• Clustering analysis: unsupervised, top-down split or bottom-up merge
• Decision-tree analysis: supervised, top-down split
• Correlation (e.g., Χ²) analysis: unsupervised, bottom-up merge
Simple Discretization: Binning
Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: a uniform grid
• If A and B are the lowest and highest values of the attribute, the
  width of the intervals will be W = (B − A) / N
• The most straightforward, but outliers may dominate the presentation
• Skewed data is not handled well
Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately the
  same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
Equal-width vs. Equal-depth Binning
[Figure: the same data partitioned into equal-width and equal-depth bins;
a sketch with pandas follows.]
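A pandas sketch contrasting the two; the price list is hypothetical, while `pd.cut` (equal-width) and `pd.qcut` (equal-depth) are real pandas functions:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width: 3 bins of equal range, W = (34 - 4) / 3 = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 bins with ~equal numbers of samples
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```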
Challenges of ML
• Insufficient quantity of training data
• Nonrepresentative training data
• Poor quality of data
• Irrelevant features
• Overfitting the training data
• Underfitting the training data
• Data mismatch
• Hyperparameter tuning and model selection
• Testing and validating
• Stepping back
BY PUNNA RAO