Data Preprocessing
Learning Objectives
Upon completion, you will be able to:
● Explain data pre-processing tasks.
● Illustrate methods to handle missing values and noisy data.
● Explain the importance of outlier removal and redundant data removal from datasets.
● List the methods for dimensionality reduction and numerosity reduction.
● Define data discretization and its methods.
● Explain data transformation and the importance of normalization.
● Demonstrate typical data pre-processing tasks in Python.
Data Quality and Data Format
Overview
Agenda
In this session, we will discuss:
● Concepts of Data Pre-processing:
○ Data Quality
○ Data Formats
○ Major Tasks in Data Pre-processing
Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
● Accuracy: are recorded values correct or incorrect, accurate or not?
● Completeness: values not recorded or unavailable, missing values, important variables not included.
● Consistency: dangling references; some features are modified while related features are not.
● Interpretability: how easily the data can be understood; cryptic codes as variable names or coded values, semantic ambiguity in nominal values.
● Timeliness: is the data updated in a timely manner?
● Believability: how trustworthy the data is, as perceived by the end user.
● Evaluate all of the above to assess the data’s fitness for the task.
Data Formats: Tidy Data
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
        Var 1   Var 2   …      …        Var n
Obs 1   2.3     34      Yes    123.45   0.3
Obs 2   3.6     23      No     567.34   0.7
…
Obs n   5.6     56      No     112.7    0.56

● Provides a standard way of structuring a dataset.
● Makes it easier to extract the variables needed for analysis.
Data Formats: Wide Format vs Long Format
● “wide” format: consider the variables “Math” and “English”

Name   Math   English
Anna   86     90
John   43     75
Cath   80     82

● “long” format: consider the variable “Subject”

Name   Subject   Grade
Anna   Math      86
Anna   English   90
John   Math      43
John   English   75
Cath   Math      80
Cath   English   82
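The same reshaping can be done programmatically. Below is a minimal pandas sketch (not part of the original slides) that converts the grade table above between wide and long format:

import pandas as pd

# Wide format: one column per subject
wide = pd.DataFrame({
    "Name": ["Anna", "John", "Cath"],
    "Math": [86, 43, 80],
    "English": [90, 75, 82],
})

# Wide -> long: every (Name, Subject) pair becomes its own observation row
long = wide.melt(id_vars="Name", var_name="Subject", value_name="Grade")

# Long -> wide again
wide_again = long.pivot(index="Name", columns="Subject", values="Grade")
print(long)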
Data Pre-processing: Major Tasks
● Data cleaning
○ Handle missing values and noisy data, resolve inconsistencies, and identify or remove outliers
● Data integration
○ Integration of multiple databases, data cubes, or files
● Data reduction
○ Dimensionality reduction (PCA)
○ Numerosity reduction
● Data transformation
○ Normalization
○ Data discretization
○ Concept hierarchy generation
Data Pre-processing: Major Tasks
Task                        Methods
Missing values              Imputation, regression
Noisy data                  Binning, histogram analysis, regression, clustering, classification
Outliers                    Box plots, regression, clustering
Redundancy                  Correlation/covariance analysis
Dimensionality reduction    PCA, feature selection
Numerosity reduction        Sampling, histograms, clustering, data compression
Data discretization         Binning, concept hierarchy
Scale differences           Data normalization
Summary
In this session, we discussed:
● Data quality: format, accuracy, completeness, consistency, timeliness, believability, interpretability.
● Tidy data provides a standard way of structuring a dataset.
● Major pre-processing tasks: data cleaning, data integration, data reduction, and data transformation.
Tasks and Methods:
Missing Values and Noisy Data
Agenda
In this session, we will discuss:
● Different Tasks and Methods
○ Missing values
○ How to handle missing data?
○ Simple Linear Regression
○ Multiple Linear Regression
○ Noisy data
Missing Values
● Empty cells or cells filled with “NA”-like tokens.
● Semantics of missing data
○ An empty data cell could mean:
■ Value exists:
● Value is available but not recorded, e.g., due to human error
○ Negative findings are left empty (e.g., negative values of asymmetric binary variables)
● Value is not available (e.g., I don’t know my grandpa’s birthday)
■ Value does not exist:
● Absence of a value (I don’t have a middle name)
● Not applicable (I don’t have a tail)
○ Different semantics should be encoded as different values: NA (not applicable), Missing (applicable but not available), etc.
How to Handle Missing Data?
● Ignore tuples with missing values
○ Usually done when the class label is missing (in classification).
○ Not effective when the percentage of missing information varies greatly per attribute, resulting in a large number of tuples being excluded from analyses.
● Fill in the missing values manually: a major feasibility issue.
● Replace empty cells with “NA”, “Missing”, etc. For more, see
https://support.datacite.org/docs/schema-values-unknown-information-v42
How to Handle Missing Data?
● Fill in automatically (imputation) with:
○ A global constant, e.g., NA. Not ideal, but often done.
○ The attribute mean/median/mode.
○ The mean/median/mode of all data objects in the same class (smarter).
○ The most probable value: regression- or inference-based methods such as Bayesian inference or decision trees. Often the best option, but is it problem-free? See the sketch after this list.
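A minimal pandas sketch of these imputation options (not part of the original slides; the column names and values are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "income": [50000, None, 30000, None, 34000],
})

# Global constant
df["income_const"] = df["income"].fillna(-1)

# Attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Class-conditional mean: the "smarter" variant above
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)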
Simple Linear Regression
● A statistical method that summarizes and studies the relationships between two continuous
(quantitative) variables
○ Independent (predictor) variable: x = height
○ Dependent (response) variable: y = weight
● Goal: find the best straight line that fits the data
○ y = bx + a
● Method: find a and b that minimize the objective function: the sum of squared residuals, Σᵢ (yᵢ − (b·xᵢ + a))².
● How good is the fit? Use the coefficient of determination (R², where 1 is a perfect fit) or, with multiple predictors, the adjusted R².
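A minimal NumPy sketch of fitting and scoring a simple linear regression (the height/weight numbers are hypothetical):

import numpy as np

x = np.array([60.0, 62, 65, 68, 70, 72])       # height (inches)
y = np.array([115.0, 120, 135, 150, 160, 172])  # weight (lbs)

# Least-squares fit: minimizes the sum of squared residuals
b, a = np.polyfit(x, y, deg=1)

# Coefficient of determination R^2
y_hat = b * x + a
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(f"y = {b:.2f}x + {a:.2f}, R^2 = {1 - ss_res / ss_tot:.3f}")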
Simple Linear Regression
[Figure: scatter plot of weight (lbs) against height (inches), with the fitted line y = bx + a. For the point (55, 100), the residual is r = 100 − 150 = −50.]
‘r’ here shows a residual: the difference between the true value and the predicted value.
Multiple Linear Regression
● Multiple linear regression: more than one independent variable; X and beta are vectors.
● Tips on choosing the best model:
○ http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-choose-the-best-regression-model
● Uses (see the sketch after this list):
○ missing values: use predicted values to replace missing values.
○ data smoothing: use predicted values to replace the original data.
○ data reduction: save only the function, parameters, and outliers (not the original data for the predicted dimensions).
○ outlier detection: identify (visualize) data points that are far away from the predicted values.
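A minimal scikit-learn sketch of regression-based imputation (the data is hypothetical): fit a multiple linear regression on the complete rows, then predict the missing response values.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 4.9, 11.2, 10.8, np.nan])  # last value is missing

known = ~np.isnan(y)
model = LinearRegression().fit(X[known], y[known])

# Replace the missing value with the model's prediction
y[~known] = model.predict(X[~known])
print(y)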
Noisy Data: Noise
● Noise has two main sources:
○ Implicit inaccuracies caused by measuring devices
○ Random errors caused by human errors or other issues
● Noise can occur in attribute names and attribute values, including class labels
Noisy Data: How to Handle Noisy Data?
● Binning/Histogram analysis
○ First, sort data and partition it into (e.g., equal-frequency) bins.
○ Then smooth by bin means, smooth by bin median, or smooth by bin borders.
● Regression
○ Smooth by fitting data into the regression functions
● Clustering
○ Smooth data by cluster centers
○ Detect and remove outliers/errors
Noisy Data: How to Handle Noisy Data?
● Truncation
○ Truncate the least significant digits in a real number
● Combined computer and human inspection
○ Detect suspicious values automatically and have humans check them
Noisy Data: Smooth by Binning
● Divide sorted data into bins.
● Partitioning rules:
○ Equal-width: equal bin range
○ Equal-frequency (or equal-depth): equal # of
data points in the bins
● For data smoothing/discretization, replace each value with the bin mean, median, etc., or with the bin label.
● In effect, this also reduces the number of distinct data values (the cardinality of the variable).
Noisy Data: Equal-Width Binning
● Equal-width (interval) partitioning
○ Divides the range into N bins of equal width.
○ If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.
○ In practice, the Freedman–Diaconis rule works well (other rules exist); see the sketch below:
■ W = 2 × IQR × n^(−1/3), with N = (B − A)/W
○ The most straightforward approach, but outliers may dominate the presentation.
○ Skewed data is not handled well.
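A minimal NumPy sketch of the Freedman–Diaconis rule on hypothetical data; NumPy's histogram function also implements the rule directly via bins="fd".

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=1000)

# W = 2 * IQR * n^(-1/3)
q75, q25 = np.percentile(data, [75, 25])
width = 2 * (q75 - q25) * len(data) ** (-1 / 3)
n_bins = int(np.ceil((data.max() - data.min()) / width))

counts, edges = np.histogram(data, bins="fd")
print(n_bins, len(edges) - 1)  # the two bin counts agree (up to rounding)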
Noisy Data: Equal-Depth Binning
● Equal-depth (count, frequency) partitioning
○ Divides the entire range into N bins of equal number of data points.
○ Good data scaling with varied bin width
Noisy Data: Example Equal-Depth Binning for Data Smoothing
● Sorted data for the price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
○ Partition into equal-frequency (equi-depth) bins:
■ Bin 1: 4, 8, 9, 15
■ Bin 2: 21, 21, 24, 25
■ Bin 3: 26, 28, 29, 34
○ Smoothing by bin boundaries:
■ Bin 1: 4, 4, 4, 15
■ Bin 2: 21, 21, 25, 25
■ Bin 3: 26, 26, 26, 34
○ Smoothing by bin means:
■ Bin 1: 9, 9, 9, 9
■ Bin 2: 23, 23, 23, 23
■ Bin 3: 29, 29, 29, 29
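A minimal pandas sketch (not from the original slides) reproducing the smoothing-by-bin-means step above with equal-frequency bins:

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency (equal-depth) partitioning into 3 bins
bins = pd.qcut(prices, q=3, labels=False)

# Smooth each value by its bin mean
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.round().astype(int).tolist())
# [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]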
Noisy Data: Clustering
● Partition continuous, discrete, or mixed datasets into clusters based on similarity (distance).
○ There are many choices of distance functions, clustering definitions, and clustering algorithms.
● Can be used for smoothing noisy data, outlier detection, numerosity reduction, and data discretization; see the sketch after this list.
○ Data smoothing/discretization: take cluster means,
median, etc.
○ Data reduction: store cluster representation only
○ Outlier detection: visualize data points far away
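A minimal scikit-learn sketch of smoothing by cluster centers, reusing the price data from the binning example:

import numpy as np
from sklearn.cluster import KMeans

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
X = values.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Replace each value with the center of its cluster
smoothed = km.cluster_centers_[km.labels_].ravel()
print(np.round(smoothed, 2))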
Noisy Data: Clustering
● Can be very useful if the data is clustered, but may not be effective if the data is “splattered.”
● Clustering can be hierarchical, and clusters can be stored in multi-dimensional index tree structures.
● A non-parametric method: no distributional assumptions. Let the data tell the story.
Summary
In this session, we discussed:
● Empty cells or cells filled with “NA”-like tokens are referred to as missing data.
● Noisy data can include implicit errors introduced by measurement tools, such as different types of sensors, as well as random errors.
● There are different ways to handle missing data and noisy data, including various imputation methods and data smoothing methods.
Tasks and Methods:
Outliers and Data Redundancy
Agenda
In this session, we will discuss:
● Tasks and Methods
○ Outliers
○ Data Redundancy
Outliers: Outlier Detection
● Exploratory data analysis:
○ Data summary plots – boxplots
○ Histogram analysis
● Regression
○ Data points that do not fit the fitted model are flagged as outliers.
● Clustering
○ Outliers form small, distant clusters, or are not included in any cluster.
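A minimal NumPy sketch of boxplot-style (Tukey fence) outlier detection on hypothetical data:

import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 95])  # 95 looks suspect

# Values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR fall outside the boxplot whiskers
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(data[(data < lower) | (data > upper)])  # flags 95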
Data Redundancy: Data Integration
● Data integration:
○ Data from multiple sources is combined into a coherent store.
● Database schema integration
○ Challenging; metadata originating from the various sources must be examined carefully.
● Data redundancy, e.g., the entity identification problem:
○ Identify real-world entities across a variety of data sources.
● Detecting and resolving data value conflicts and scale differences:
○ Attribute values from different sources may differ for the same real-world item.
○ Possible reasons: different representations (e.g., date, GPA) or different scales (e.g., metric vs. British units).
Data Redundancy: Handling Redundancy in Data Integration
● Redundant attributes may be detected by correlation analysis or covariance analysis.
● Redundant attributes should be removed
● Attributes that are correlated but not redundant should often be kept.
● Careful integration of data from various sources may aid in the reduction/avoidance of redundancies and inconsistencies, as well as the improvement of mining speed and quality.
Data Redundancy: Correlation Analysis (Nominal Data)
                               Play chess [c1]   Not play chess [c2]   Sum (row)
Like science fiction [r1]          250 (90)           200 (360)         450 [R=r1]
Not like science fiction [r2]       50 (210)         1000 (840)        1050 [R=r2]
Sum (col.)                         300 [C=c1]        1200 [C=c2]       1500 [n]

Observed counts are shown with expected counts in parentheses, where expected = (row total × column total) / n.
Data Redundancy: Chi-Square Calculation
                           Play chess   Not play chess   Sum (row)
Like science fiction          250 (90)       200 (360)       450
Not like science fiction       50 (210)     1000 (840)      1050
Sum (col.)                        300            1200        1500

● H0: A and B are not correlated. alpha = 0.001
● Χ² (chi-square) value calculation:
Χ² = Σ (observed − expected)² / expected
   = (250 − 90)²/90 + (200 − 360)²/360 + (50 − 210)²/210 + (1000 − 840)²/840 = 507.93
● Using the Χ² table (next slide), we find the critical value = 10.828 for this alpha and d.f. = 1.
● Since Χ² = 507.93 > 10.828, reject H0: A and B are correlated.
● Most tests will give you a p-value; if p-value < alpha, reject H0.
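A minimal SciPy sketch verifying the test on the slide's contingency table:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],
                     [50, 1000]])

# correction=False matches the plain chi-square formula used above
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)   # ~507.93, df = 1
print(expected)    # [[90, 360], [210, 840]]
print(p < 0.001)   # True: reject H0, the attributes are correlated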
Data Redundancy: Critical Value Table
[Table: critical values of the chi-square distribution with d degrees of freedom.]
Data Redundancy: Correlation Analysis (Numeric Data)
● Correlation coefficient (also called Pearson’s product moment coefficient), in [−1, 1]:
r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n·σA·σB) = (Σᵢ aᵢbᵢ − n·Ā·B̄) / (n·σA·σB)
○ where n is the number of tuples, Ā and B̄ are the respective means of A and B,
○ σA and σB are the respective standard deviations of A and B,
○ Σᵢ aᵢbᵢ is the sum of the AB cross-product.
● If r(A,B) > 0, A and B are positively linearly correlated (A’s values increase as B’s do). The higher the value of r(A,B), the stronger the correlation.
● r(A,B) = 0: not linearly correlated; the variables may still be associated in other ways.
● r(A,B) < 0: negatively linearly correlated.
Data Redundancy: Visually Evaluating Correlation
[Figure: scatter plots showing Pearson coefficients ranging from −1 to 1.]
Data Redundancy: Covariance (Numeric Data)
● Covariance is similar to correlation:
Cov(A,B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n = E(A·B) − Ā·B̄
● Contrast with the correlation coefficient:
r(A,B) = Cov(A,B) / (σA·σB)
○ where n is the number of tuples, Ā and B̄ are the respective means or expected values (E) of A and B, and σA and σB are the respective standard deviations of A and B.
Data Redundancy: Covariance (Numeric Data)
● Positive covariance: Cov(A,B) > 0, indicating that A and B both tend to be larger than their expected values.
● Negative covariance: Cov(A,B) < 0, indicating that the two variables change in different directions: one is larger and the other is smaller than its expected value.
● Independence implies Cov(A,B) = 0, but the reverse is not true:
○ Some pairs of random variables have a covariance of zero yet are not independent. A covariance of 0 implies independence only under certain additional conditions (for example, when the data follow multivariate normal distributions).
Data Redundancy: Covariance: An Example
● Suppose two stocks A and B have the following values in one week: (2,5), (3, 8), (5, 10), (4,
11), (6, 14).
● Question: Do the prices of A and B rise or fall together?
● E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
● E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
● Cov(A,B) = (2x5+3x8+5x10+4x11+6x14)/5 - 4 x 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
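A minimal NumPy sketch verifying the stock example (and the related Pearson correlation):

import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

# Population covariance E(A*B) - E(A)*E(B), dividing by n as on the slide
print(np.mean(A * B) - A.mean() * B.mean())   # 4.0

# np.cov divides by n-1 by default; bias=True divides by n
print(np.cov(A, B, bias=True)[0, 1])          # 4.0
print(np.corrcoef(A, B)[0, 1])                # Pearson r ~ 0.94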
Summary
In this session, we discussed:
● Outliers can be detected with boxplots, histogram analysis, regression, and clustering.
● Data redundancy occurs mostly because of data integration, and redundant attributes may be
detected by correlation or covariance analysis.
● Redundant attributes should be removed.
● Correlated attributes are often useful in mining tasks.
Tasks and Methods:
Dimensionality Reduction and Numerosity Reduction
Agenda
In this session, we will discuss:
● Tasks and Methods
○ Dimensionality reduction
○ Curse of Dimensionality and data sparseness
○ PCA – Principal Component Analysis
○ Numerosity reduction and random sampling methods
Data Reduction Strategies
● Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but
yet produces the same (or almost the same) analytical results.
● Why data reduction? — A database/data warehouse may store terabytes of data. Complex data
analysis may take a long time on the complete data set.
Data Reduction Strategies
● Data reduction strategies
○ Dimensionality reduction, e.g., removing or merging attributes
■ Principal Components Analysis (PCA).
■ Feature subset selection, feature creation
○ Numerosity reduction (reduce data volume, use smaller forms of data representation)
■ Regression
■ Histograms/binning, clustering, sampling
■ Data cube aggregation
○ Data compression
Dimensionality Reduction: Curse of Dimensionality
● Curse of dimensionality
○ As the dimensionality of the features in a dataset increases, the data becomes increasingly sparse in the feature space.
○ Density and distance between points, which are critical for clustering and outlier analysis, become less meaningful.
○ The number of possible subspace combinations grows exponentially.
● Dimensionality reduction
○ Avoid the curse of dimensionality by reducing the number of features.
○ Dimensionality reduction helps eliminate irrelevant features and reduce noise.
○ Reduces the time and space required for data mining.
○ Makes the data easier to visualize.
● Dimensionality reduction techniques
○ Principal Component Analysis (PCA)
○ Supervised techniques
○ Nonlinear techniques and feature selection
Curse of Dimensionality: Sparseness
[Figure: three scatter-plot panels.
(1) A single feature does not result in a perfect separation of our training data.
(2) Adding a second feature still does not result in a linearly separable classification problem.
(3) Adding a third feature results in a linearly separable classification problem in our training data.]
Sparseness: More Training Data Needed
Sparseness -> Everything is Equal-Distanced
● With increased dimensionality, the hypersphere occupies only a very small
portion of the search space; all training examples are essentially located in
the corners.
● When dim -> infinity, all training examples are at the same distance from all
other examples.
Principal Component Analysis (PCA): Numeric Data
● Finds the projection that captures the largest amount of variance in the data.
● The original data can be projected onto a much smaller space, which reduces dimensionality while preserving variability. We find the eigenvectors (“characteristic” vectors) of the covariance matrix, and these eigenvectors define the new space.
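A minimal scikit-learn sketch of PCA; the iris data is assumed here because the loadings on the next slide reference its attributes:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # 150 x 4 numeric matrix

# Standardize first so no attribute dominates because of its scale
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)  # 150 x 2 projection

print(pca.explained_variance_ratio_)  # fraction of variance per component
print(pca.components_)                # eigenvector loadings per PC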
Principal Component Analysis (PCA)
● The first three PCs capture 75% of the original variance, based on the loadings.
● Component values are weighted sums of the original dimensions.
● Comp1 = 0.361*Sepal.Length + 0.867*Petal.Length + 0.358*Petal.Width
● Subsequent analysis uses the reduced representation/dimensions.
Numerosity Reduction
● Reduce the size of data volume by choosing alternative smaller forms of data representation.
● Parametric methods (example: regression)
○ Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
● Non-parametric methods
○ Do not assume parameterized probability distributions.
○ Major families: histograms/binning, clustering, sampling, …
Sampling
● Obtaining a small sample “s” to represent the whole data set “N”.
● Also used in sampling training and test examples.
● Allow mining algorithms to run at a complexity that is possibly sub-linear to data size.
● Key principle: choose a representative subset of the data.
○ In skewed datasets, simple random sampling may perform poorly.
○ Develop adaptive sampling methods, e.g., stratified sampling.
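A minimal pandas sketch contrasting simple random sampling with stratified sampling on a hypothetical skewed dataset:

import pandas as pd

df = pd.DataFrame({
    "cls": ["A"] * 90 + ["B"] * 10,  # class "B" is rare
    "value": range(100),
})

# Simple random sample: may under-represent the rare class
srs = df.sample(frac=0.2, random_state=0)

# Stratified sample: 20% drawn from each class separately
strat = df.groupby("cls", group_keys=False).sample(frac=0.2, random_state=0)
print(srs["cls"].value_counts().to_dict(), strat["cls"].value_counts().to_dict())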
Summary
In this session, we discussed,
● Data reduction obtains a reduced representation of the data set that is much smaller in volume but
yet produces the same (or almost the same) analytical results.
● Data reduction can be done by:
○ Dimensionality reduction: removing unimportant attributes.
○ Numerosity reduction: reducing data volume by using smaller forms of data representation.
○ Data compression
● Sampling is about obtaining a small sample s to represent the whole data set N.
Tasks and Methods:
Data Transformation
Agenda
In this session, we will discuss:
● Tasks and Methods
○ Data transformation: Normalization
○ Data discretization methods
○ Concept Hierarchy generation
Data Transformation
● Data are transformed or consolidated into forms appropriate for mining.
● Methods
○ Smoothing: Remove noise from data
○ Attribute / feature construction
■ New attributes constructed from the given ones
○ Aggregation: Data cube construction, summarization
○ Normalization: Scaled to fall within a smaller, specified range for more meaningful comparison
■ min-max normalization
■ z-score normalization
■ normalization by decimal scaling
○ Discretization: Concept hierarchy climbing
Normalization
● Min-max normalization to [new_minA, new_maxA]:
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
○ Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716.
● Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ) / σ
○ Ex. Let μ = 54,000 and σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225.
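A minimal NumPy sketch reproducing both examples:

import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])  # hypothetical values

# Min-max normalization to [0.0, 1.0]
min_max = (income - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0
print(round(min_max[2], 3))  # 73,600 -> 0.716

# Z-score normalization with mu = 54,000, sigma = 16,000
z = (income - 54000) / 16000
print(z[2])  # 73,600 -> 1.225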
Normalization
● Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) ≤ 1
○ Ex. (50, 20) -> (0.5, 0.2) with j = 2
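A minimal NumPy sketch of decimal scaling (assumes at least one non-zero value):

import numpy as np

v = np.array([50.0, 20.0])

# Smallest integer j such that max(|v / 10**j|) <= 1
j = int(np.ceil(np.log10(np.abs(v).max())))
print(j, v / 10 ** j)  # j = 2, [0.5 0.2]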
Data Discretization
Discretization: divide the range of a continuous attribute into intervals.
● Actual data values are replaced with interval labels.
● Reduces attribute cardinality.
● Handles outliers and skewed data.
● Can be supervised or unsupervised.
● Prepares data for further analysis, e.g., classification.
Data Discretization Methods
Typical methods:
All the methods mentioned below can be applied recursively.
● Histogram and Binning analysis
○ Top-down split
○ Unsupervised
● Clustering analysis (unsupervised, top-down split, or bottom-up merge)
● Classification analysis, e.g., decision-tree (supervised, top-down split)
● Correlation (e.g., χ2) analysis, e.g., ChiMerge (supervised, bottom-up merge)
Discretization by Correlation Analysis
Correlation analysis (e.g., Chi-merge: χ2-based discretization)
● Exploit the correlation between intervals and class labels.
● "Interval – Class” contingency tables
● If two adjacent intervals have low χ² values (they are less correlated with the class labels), merge them to form a larger interval (keeping them separate offers no more information on how to classify objects).
● Merge performed recursively until a predefined stopping condition is met.
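A minimal SciPy sketch of one ChiMerge step, testing the first pair of adjacent single-sample intervals from the worked example that follows:

import numpy as np
from scipy.stats import chi2_contingency

# Rows: adjacent intervals {2,5} and {5,7.5}; columns: class counts K=1, K=2
table = np.array([[0, 1],
                  [1, 0]])

chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(chi2)  # 2.0

# ChiMerge merges the pair when chi2 is below the threshold
if chi2 < 2.7024:  # critical value at alpha = 0.1, df = 1
    print("merge the two intervals")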
Chi-Merge Discretization Example
● A statistical approach to data discretization.
● Discretizes the data based on class labels, using the chi-square approach.
● F: attribute
● K: class label

Sample   F    K
1         1   1
2         3   2
3         7   1
4         8   1
5         9   1
6        11   2
7        23   2
8        37   1
9        39   2
10       45   1
11       46   1
12       59   1
Chi-Merge Discretization Example
● Sort the attribute you want to discretize (here, attribute F).
● Begin by placing each unique value of the attribute in its own interval.

Sample   F    K   Interval
1         1   1   {0,2}
2         3   2   {2,5}
3         7   1   {5,7.5}
4         8   1   {7.5,8.5}
5         9   1   {8.5,10}
6        11   2   {10,17}
7        23   2   {17,30}
8        37   1   {30,38}
9        39   2   {38,42}
10       45   1   {42,45.5}
11       46   1   {45.5,52}
12       59   1   {52,60}
Chi-Merge Discretization Example
● Calculate the chi-square test on every pair of adjacent intervals.
● Interval/class contingency tables, e.g., for samples 2 & 3 and for samples 3 & 4:

Sample   K=1   K=2   Total
2          0     1       1
3          1     0       1
Total      1     1       2

Sample   K=1   K=2   Total
3          1     0       1
4          1     0       1
Total      2     0       2
Chi-Merge Discretization Example
For samples 2 & 3:

Sample   K=1   K=2   Total
2          0     1       1
3          1     0       1
Total      1     1       2

Expected counts (row total × column total / n):
E11 = (1/2)·1 = 0.5    E12 = (1/2)·1 = 0.5
E21 = (1/2)·1 = 0.5    E22 = (1/2)·1 = 0.5

Χ² = (0 − 0.5)²/0.5 + (1 − 0.5)²/0.5 + (1 − 0.5)²/0.5 + (0 − 0.5)²/0.5 = 2

For samples 3 & 4:

Sample   K=1   K=2   Total
3          1     0       1
4          1     0       1
Total      2     0       2

E11 = (1/2)·2 = 1    E12 = (1/2)·0 = 0
E21 = (1/2)·2 = 1    E22 = (1/2)·0 = 0

Χ² = (1 − 1)²/1 + (1 − 1)²/1 = 0   (terms with an expected count of 0 contribute 0)

At significance level 0.1 with df = 1, the Χ² critical value from the chi-square distribution is 2.7024. Both values are below the critical value: the intervals are not correlated with the class label and can be merged.
Chi-Merge Discretization Example
● Calculate the chi-square values for all pairs of adjacent intervals.
● Merge the intervals with the smallest chi-square values.

Sample   F    K   Interval     Χ² (vs. next interval)
1         1   1   {0,2}        2
2         3   2   {2,5}        2
3         7   1   {5,7.5}      0
4         8   1   {7.5,8.5}    0
5         9   1   {8.5,10}     2
6        11   2   {10,17}      0
7        23   2   {17,30}      2
8        37   1   {30,38}      2
9        39   2   {38,42}      2
10       45   1   {42,45.5}    0
11       46   1   {45.5,52}    0
12       59   1   {52,60}      —
Chi-Merge Discretization Example
Repeat: keep merging the intervals with small Χ² values until all Χ² > 2.7024.

Sample   F    K   Interval   Χ² (vs. next interval)
1         1   1   {0,2}      2
2         3   2   {2,5}      4
3         7   1   {5,10}
4         8   1   {5,10}
5         9   1   {5,10}     5
6        11   2   {10,30}
7        23   2   {10,30}    3
8        37   1   {30,38}    2
9        39   2   {38,42}    4
10       45   1   {42,60}
11       46   1   {42,60}
12       59   1   {42,60}    —
Chi-Merge Discretization Example
Sample   F    K   Interval   Χ² (vs. next interval)
1         1   1   {0,10}
2         3   2   {0,10}
3         7   1   {0,10}
4         8   1   {0,10}
5         9   1   {0,10}     2.72
6        11   2   {10,42}
7        23   2   {10,42}
8        37   1   {10,42}
9        39   2   {10,42}    3.93
10       45   1   {42,60}
11       46   1   {42,60}
12       59   1   {42,60}    —

● End: there are no more adjacent intervals with Χ² < 2.7024.
● The resulting intervals are correlated with the class labels.
Concept Hierarchy Generation
● A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is typically associated with each dimension in a data warehouse.
● In data warehouses, concept hierarchies enable drilling and rolling to view data at various granularities.
● Concept hierarchy generation
○ Specified by domain experts, taxonomies/thesauri/ontologies
○ Generated from data sets (for some simple, specific cases)
■ Discretization for numerical or ordinal data
■ Frequency counts for categorical data (limited cases)
○ Concept hierarchy learning
■ Natural language processing and ML approaches
Concept Hierarchy Generation for Nominal Data
● Specification of a partial/total ordering of attributes explicitly at the schema level by users or
experts.
○ street < city < state < country
● Specification of a hierarchy for a set of values by explicit data grouping.
○ {Urbana, Champaign, Chicago} < Illinois
● Specification of only a partial set of attributes.
○ E.g., only street < city, not others
● Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct
values.
○ E.g. for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
● Some hierarchies can be built automatically based on a study of the number of distinct values for
each attribute in the data collection.
○ The attribute with the most distinct values is placed at the bottom of the hierarchy.
○ Exceptions exist, e.g., the time hierarchy weekday, month, quarter, year.

country                 15 distinct values
province_or_state      365 distinct values
city                 3,567 distinct values
street             674,339 distinct values
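A minimal pandas sketch of ordering attributes by their distinct-value counts (the location table is hypothetical):

import pandas as pd

df = pd.DataFrame({
    "street": ["1 A St", "2 B St", "3 C St", "4 D St", "5 E St"],
    "city": ["Tucson", "Tucson", "Phoenix", "Toronto", "Montreal"],
    "province_or_state": ["AZ", "AZ", "AZ", "ON", "QC"],
    "country": ["US", "US", "US", "Canada", "Canada"],
})

# Fewer distinct values -> higher in the generated hierarchy
print(df.nunique().sort_values())  # country < province_or_state < city < street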
Summary
In this session, we discussed:
● Normalization – The data is scaled to fall within a smaller, specified range for more meaningful
comparison.
● Discretization divides the range of a continuous attribute into intervals.
● A worked ChiMerge discretization example.
● Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated
with each dimension in a data warehouse.
● Concept hierarchy generation for nominal data
Learning Outcomes
You should now be able to:
● Apply data pre-processing tasks and methods to prepare data for a data mining task.
● Summarize the importance of outlier removal and redundant data removal from data sets.
● Explain the methods for dimensionality reduction and numerosity reduction.
● Implement data transformation strategies, such as normalization, discretization, and concept hierarchy generation.
● Perform typical data pre-processing tasks in Python.
Thank you!