
Data Mining (DM)

2101CS521

Unit-2
Data Pre-Processing

Prof. Jayesh D. Vagadiya


Computer Engineering
Department
Darshan Institute of Engineering & Technology, Rajkot
jayesh.vagadiya@darshan.ac.in
9537133260
Topics to be covered
• Why pre-process data?
• Data cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretization
Why pre-process data?
 Data pre-processing is a data mining technique that involves transforming
raw data (real world data) into an understandable format.
 Real-world data is often incomplete, inconsistent, lacking in certain
behaviors or trends and likely to contain many errors.
 Incomplete: Missing attribute values, lack of certain attributes of interest, or
containing only aggregate data.
 E.g. Occupation = “ ”
 Noisy: Containing errors or outliers.
 E.g. Salary = “abcxy”
 Inconsistent: Containing discrepancies in codes or names.
 E.g. “Gujarat” & “Gujrat” (common mistakes like spelling, grammar, articles)
Why pre-process data? (Cont.)
 This reflects the principle of Garbage In, Garbage Out (GIGO).

 Quality decisions must be based on quality data.


 Duplicate or missing data may cause incorrect or even misleading
statistics.
 Data preprocessing prepares raw data for further processing.
Data preparation, cleaning and transformation constitute the majority of the work (around 90%) in data mining.
Data Cleaning
1. Fill in missing values
2. Identify outliers and smooth out noisy data
3. Correct inconsistent data
4. Resolve redundancy caused by data integration
1) Fill in missing values (Data Cleaning)
 Ignore the tuple (record/row):
• Usually done when class label is missing.
• This means removing the entire row from the dataset if any of its values are missing.
• This approach can be useful when the number of missing values is relatively small,
and removing a few tuples does not significantly impact the overall data analysis.
 Fill missing value manually:
• In general, this approach is time consuming and may not be feasible given a large
data set with many missing values.
 Use a global constant to fill in the missing value
• Replace all missing attribute values by the same constant such as a label like
“Unknown” or −∞ .
 Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value :
• For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median.
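A minimal sketch of the constant and central-tendency fill strategies using pandas; the DataFrame and its column values are hypothetical, not taken from the slides:

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing values (NaN / None)
df = pd.DataFrame({
    "Occupation": ["Engineer", None, "Teacher", "Engineer"],
    "Salary": [52000, 61000, np.nan, 58000],
})

# Global constant for a categorical attribute
df["Occupation"] = df["Occupation"].fillna("Unknown")

# Central tendency for a numeric attribute:
# mean for symmetric distributions, median for skewed ones
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

print(df)
```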
1) Fill in missing values (Cont.) (Data Cleaning)
 Use the attribute mean or median for all samples belonging to
the same class as the given tuple:
• To apply this, calculate the mean or median of all samples belonging to the same class as the given tuple and use the resulting value as the replacement.
 Use the most probable value to fill in the missing value:
• This may be determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction
2) Identify outliers and smooth out noisy data (Data Cleaning)
There are three data smoothing techniques as follows..
1. Binning :
 Binning methods smooth a sorted data value by consulting its “neighborhood” that
is, the values around it.
2. Regression :
 It conforms data values to a function.
 Linear regression involves finding the “best” line to fit two attributes (or variables) so
that one attribute can be used to predict the other.
3. Outlier analysis :
 Outliers may be detected by clustering for example, where similar values are
organized into groups or “clusters”.
 In this, values that fall outside of the set of clusters may be considered as outliers.
1. Binning Method (Data Cleaning)
 Binning method is a top-down splitting technique based on a specified
number of bins.
 In this method the data is first sorted and then the sorted values are
distributed into a number of buckets or bins.
 For example, attribute values can be discretized (separated) by applying
equal-width or equal-frequency binning, and then replacing each value by
the bin mean, median or boundaries.
 It can be applied recursively to the resulting partitions to generate
concept hierarchies.
 It is used to minimize the effects of small observation errors.
1. Binning Method (Cont.) (Data Cleaning)
There are basically two types of binning approaches..
1. Equal width (or distance) binning :
 The simplest binning approach is to partition the range of the variable into k equal-
width intervals.
The interval width is simply the range [Min, Max] of the variable divided by N:
 Width = (Max – Min) / N (N = number of bins)
 Example
 Data: 5,10,11,13,15, 35, 50, 55, 72, 92, 204, 215
 As per above formula we have Max=215, Min=5, Number of Bins=3, so 215-5 =
210, 210/3 = 70
 70+5=75 (from 5 to 75) = Bin 1: 5,10,11,13,15, 35, 50, 55, 72
 70+75=145 (from 75 to 145) = Bin 2: 92
 70+145=215 (from 145 to 215) = Bin 3: 204, 215

2. Equal depth (or frequency) binning :
 In equal-frequency binning we divide the range [Min, Max] of the variable into intervals that each contain approximately the same number of values.
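A short sketch of both partitioning rules with pandas, reusing the example values above (`cut` gives equal-width bins, `qcut` gives equal-depth bins):

```python
import pandas as pd

data = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-width binning: 3 intervals, each of width (215 - 5) / 3 = 70
equal_width = pd.cut(data, bins=3)

# Equal-depth (frequency) binning: 3 bins with roughly 4 values each
equal_depth = pd.qcut(data, q=3)

print(pd.DataFrame({"value": data,
                    "equal_width_bin": equal_width,
                    "equal_depth_bin": equal_depth}))
```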
1. Binning Method (Cont.) (Data Cleaning)
 Bin Operations
1. Smoothing by bin means
 In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin.
2. Smoothing by bin median
 In this method each bin value is replaced by its bin median value.
3. Smoothing by bin boundary
 In smoothing by bin boundaries, the minimum and maximum values in a given bin
are identified as the bin boundaries.
 Each bin value is then replaced by the closest boundary value.
Binning Method Example – {Bin Means} (Data Cleaning)
 Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Step: 1
 Partition into equal-depth bins (3 bins of 4 values each):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
 Step: 2
• Smoothing by bin means:
Bin means: (4 + 8 + 9 + 15)/4 = 9, (21 + 21 + 24 + 25)/4 ≈ 23, (26 + 28 + 29 + 34)/4 ≈ 29
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Binning Method Example – {Bin Boundaries} (Data Cleaning)
 Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Step: 1
 Partition into equal-depth bins (3 bins of 4 values each):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
 Step: 2
• Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
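A small sketch reproducing both worked examples: the sorted values are split into three equal-depth bins and smoothed by bin means and by bin boundaries (rounding the means as the slides do):

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(np.sort(data), 3)              # 3 equal-depth bins of 4 values

for b in bins:
    means = np.full_like(b, round(b.mean()))         # smoothing by bin means
    lo, hi = b.min(), b.max()
    boundaries = np.where(b - lo <= hi - b, lo, hi)  # replace by the closest boundary
    print(b, "-> means:", means, "| boundaries:", boundaries)
```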
2. Regression (Data Cleaning)
 Data smoothing can also be done by regression, a technique that conforms data values to a function.
 Regression analysis is a way to find trends in data; it mathematically describes the relationship between the independent variables and the dependent variable.
 It can be divided into two categories:
1. Linear regression :
 It involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.
 The analysis uses a single x variable for each dependent “y” variable, for example: (x1, Y1).
2. Multiple linear regression :
 An extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
 It uses multiple “x” variables for each dependent “y” variable, for example: ((x1)1, (x2)1, (x3)1, Y1).
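A brief sketch of smoothing by linear regression with numpy; the x and y values here are made-up illustrations, not from the slides:

```python
import numpy as np

# Hypothetical noisy observations of y against a single predictor x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Find the "best" straight line y = slope * x + intercept (least squares)
slope, intercept = np.polyfit(x, y, deg=1)

# Conform the data values to the fitted function (smoothed y values)
y_smoothed = slope * x + intercept
print(round(slope, 3), round(intercept, 3))
print(y_smoothed)
```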
3. Clustering (Data Cleaning)
 Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense) to each other than to those in other groups
(clusters).
 Cluster analysis as such is not an automatic task, but an iterative process
of knowledge discovery or interactive multi-objective optimization that
involves trial and failure.
 It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
(Figure: a scatter plot of clusters with their cluster centers; points falling outside all clusters are the outliers.)
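A rough sketch (assuming scikit-learn is available) of flagging outliers as points that lie unusually far from their assigned cluster center; the synthetic data and the 3-standard-deviation threshold are illustrative choices only:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two dense clusters plus two obvious outliers
points = np.vstack([rng.normal(0, 0.5, (20, 2)),
                    rng.normal(5, 0.5, (20, 2)),
                    [[15, 15], [-10, 12]]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Distance of each point to the center of its assigned cluster
dist = np.linalg.norm(points - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# One simple rule: points much farther from their center than typical are outliers
threshold = dist.mean() + 3 * dist.std()
print("Outliers:\n", points[dist > threshold])
```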
Correct Inconsistent Data (Data Cleaning)
 With larger datasets, it can be difficult to find all of the inconsistencies.
 Inconsistencies typically appear as discrepancies in codes or names.
 Common mistakes such as spelling, grammar or article errors can be corrected manually or with dedicated tools.
Resolve redundancy caused by data integration (Data Cleaning)
 Data redundancy occurs in database systems which have a field that
is repeated in two or more tables.
 When customer data is duplicated and attached to each product bought, the copies can drift apart; this kind of redundancy leads to inconsistency.
 As a result, the entity "customer" might appear with different values.
 Database normalization prevents redundancy and makes the best
possible usage of storage.
 The proper use of foreign keys can minimize data redundancy and
reduce the chance of destructive anomalies appearing.
Data Integration
 Combines data from multiple
sources into a coherent store.
 Careful integration can help
reduce and avoid redundancies
and inconsistencies in the
resulting data set.
 Schema integration: e.g., A.cust-id ≡ B.cust#
Entity Identification Problem (Data Integration)
 How can real-world entities from different data sources be matched?


 This is referred to as the entity identification problem.
 Example:
 Do the customer id in one database and the cust number in another refer to the same attribute?
 When matching attributes from one database to another during
integration, special attention must be paid to the structure of the data.
 This is to ensure that any attribute functional dependencies and
referential constraints in the source system match those in the target
system.
 Example:
 Imagine there are two separate systems: System A and System B. In System A,
discounts are applied to the entire order, whereas in System B, discounts are applied
to each individual line item within the order.
 Items in the target system may end up being improperly discounted.
Redundancy and Correlation Analysis (Data Integration)
 Redundancy is another important issue in data integration. An attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from another attribute or set of attributes.
 This redundancy can consume additional storage space, introduce
inconsistencies, and complicate data management processes.
 Example:
 if annual revenue can be obtained by summing up monthly revenue values, storing
both the individual monthly revenue and the derived annual revenue would result in
redundant data.
 Some redundancies can be detected by correlation analysis.
 We can evaluate the correlation between two attributes, A and B, by
computing the correlation coefficient.
Correlation Coefficient (Data Integration)
 The correlation coefficient is a statistical measure that assesses the strength and direction of the relationship between two variables.
 RA,B = Σi (ai − Ā)(bi − B̄) / (n · σA · σB)
 where n is the number of tuples
 ai and bi are the respective values of A and B in tuple i
 Ā and B̄ are the respective mean values of A and B
 σA and σB are the respective standard deviations of A and B
 The value of correlation is −1 ≤ RA,B ≤ +1.
 If RA,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.
 The higher the value, the stronger the correlation.
 A higher value may indicate that A (or B) may be removed as a redundancy.
 If the resulting value is equal to 0, then A and B are independent and there is no correlation between them.
Correlation Coefficient (Cont.) (Data Integration)
 If the resulting value is less than 0, then A and B are negatively correlated.
 where the values of one attribute increase as the values of the other
attribute decrease.
 Scatter plots can also be used to view correlations between attributes
 Positive Correlation :
 Let's consider the relationship between the number of hours studied and the test
scores obtained by a group of students.
 Negative Correlation:
 Suppose we examine the relationship between temperature and sales of ice cream.
 No Correlation
 Consider the relationship between shoe size and intelligence.
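A quick sketch computing RA,B with numpy for a hypothetical hours-studied vs. test-score example, following the formula above:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)        # attribute A
score = np.array([52, 55, 61, 64, 70, 75], dtype=float)  # attribute B

n = len(hours)
# R_A,B = sum((a_i - mean_A) * (b_i - mean_B)) / (n * std_A * std_B)
r = ((hours - hours.mean()) * (score - score.mean())).sum() / (
    n * hours.std() * score.std())
print(r)                                # close to +1: strong positive correlation

# Cross-check with numpy's built-in Pearson correlation
print(np.corrcoef(hours, score)[0, 1])
```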
Data Transformation
 A function that maps the entire set of values of a given attribute to a new set of replacement values, so that each old value can be identified with one of the new values.
 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 Min-max normalization
 Z-score normalization
 Normalization by decimal scaling
 Discretization: Concept hierarchy climbing
1. Min-Max Normalization (Data Transformation)
 Min-max is a technique that helps to normalize the data.
 It will scale the data between 0 and 1, or within another specified range [NewMin, NewMax].
 Formula: v’ = ((v − Min) / (Max − Min)) × (NewMax − NewMin) + NewMin
 Example
 Given data: 16, 20, 30, 40
 Min: minimum value = 16
 Max: maximum value = 40
 V: the respective value of the attribute; in our example V1 = 16, V2 = 20, V3 = 30, V4 = 40
 NewMax = 1
 NewMin = 0
1. Min-Max Normalization (Cont.) (Data Transformation)
Example (using v’ = ((v − Min) / (Max − Min)) × (NewMax − NewMin) + NewMin):
For Age 16: v’ = (16 − 16)/(40 − 16) × (1 − 0) + 0 = 0/24 × 1 = 0
For Age 20: v’ = (20 − 16)/(40 − 16) × (1 − 0) + 0 = 4/24 × 1 = 0.16
For Age 30: v’ = (30 − 16)/(40 − 16) × (1 − 0) + 0 = 14/24 × 1 = 0.58
For Age 40: v’ = (40 − 16)/(40 − 16) × (1 − 0) + 0 = 24/24 × 1 = 1

Age | After min-max normalization
16  | 0
20  | 0.16
30  | 0.58
40  | 1
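A minimal sketch of the min-max formula applied to the same Age values (NewMin = 0, NewMax = 1):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values into [new_min, new_max] with min-max normalization."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

ages = [16, 20, 30, 40]
print(min_max(ages))   # [0.0, 0.166..., 0.583..., 1.0]
```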
2. Decimal Scaling (Data Transformation)
 In this technique we move the decimal point of the values of the attribute.
 How far the decimal point moves depends on the maximum absolute value among all values of the attribute.
 A value V of attribute A can be normalized by the following formula: V’ = V / 10^j, where j is the smallest integer such that Max(|V’|) < 1.
 Example (attribute CGPA):
CGPA | Formula | After decimal scaling
2    | 2 / 10  | 0.2
3    | 3 / 10  | 0.3
 We check the maximum value among our attribute CGPA; the maximum value is 3, so we convert by dividing by 10. Why 10?
 We count the total digits in our maximum value, write 1, and then append that many zeros.
 Here 3 is the maximum value and it has only one digit, so we put one zero after 1, giving 10 (i.e., j = 1).
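A minimal sketch of decimal scaling: j is increased until the largest absolute value drops below 1, then every value is divided by 10^j:

```python
def decimal_scaling(values):
    """Normalize by v / 10**j, using the smallest j with max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:   # grow j until the largest value is below 1
        j += 1
    return [v / 10 ** j for v in values], j

print(decimal_scaling([2, 3]))        # ([0.2, 0.3], 1)
print(decimal_scaling([45, 734]))     # ([0.045, 0.734], 3)
```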
3. Z-Score Normalization (Data Transformation)
 It is also called zero-mean normalization.
 The essence of this technique is transforming the data by converting the values to a common scale where the mean equals zero and the standard deviation equals one.
 To find z-score values:
v’ = (v − μA) / σA
where μ is the mean and σ is the standard deviation of attribute A.
 Example
 Let μ = 54,000 and σ = 16,000. Find the z-score for 73,600:
(73,600 − 54,000) / 16,000 = 1.225
 Z-score for 73,600: 1.225
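A one-function sketch of the z-score computation, reproducing the example above (μ = 54,000, σ = 16,000):

```python
def z_score(v, mean, std):
    """Number of standard deviations that v lies above (+) or below (-) the mean."""
    return (v - mean) / std

print(z_score(73_600, mean=54_000, std=16_000))   # 1.225
```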
3. Z-Score Normalization (Cont.) (Data Transformation)
 These z-scores represent the number of standard deviations that each data point is away from the mean of the distribution.
 A positive z-score indicates that the data point is above the mean, while a
negative z-score indicates it is below the mean.
Data Reduction
 Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet closely
maintains the integrity of the original data
 That is, mining on the reduced data set should be more efficient yet
produce the same (or almost the same) analytical results
 Dimensionality reduction:
 Dimensionality reduction techniques aim to reduce the number of variables or
features in a dataset while preserving the important information.
 Dimensionality reduction methods include wavelet transforms and principal
components analysis.
 Numerosity reduction:
 Numerosity reduction techniques focus on reducing the number of instances or data
points in a dataset while maintaining the representativeness of the data.
 Data compression:
 Data compression techniques are used to reduce the size of a dataset by encoding
the data in a more compact form.
Data Reduction (Cont.)
(Figure: with lossless compression the original data can be fully reconstructed from the compressed data; with lossy compression only an approximation of the original data can be reconstructed.)
Principal Components Analysis (Data Reduction)
 The number of input variables or features for a dataset is referred to as its
dimensionality.
 PCA transforms high-dimensional data into a lower-dimensional subspace
while preserving the most important information.
 Formally, PCA is a statistical technique for reducing the dimensionality of a
dataset.
 This is accomplished by linearly transforming the data into a
new coordinate system where (most of) the variation in the data can be
described with fewer dimensions than the initial data.
 Dimensionality reduction refers to techniques that reduce the number of
input variables in a dataset.
 Example
 Dimensional reduction can be discussed through a simple e-mail classification
problem, where we need to classify whether the e-mail is spam or not.
 This can involve a large number of features, such as whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, etc.
Principal Components Analysis (Cont.) (Data Reduction)
Ref: “Principal Component Analysis,” https://setosa.io/ev/principal-component-analysis/
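A brief sketch (assuming scikit-learn is available) of projecting a small synthetic 3-feature dataset onto its first two principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 100 samples with 3 features; the first two are strongly correlated
x = rng.normal(size=(100, 1))
data = np.hstack([x,
                  2 * x + rng.normal(scale=0.1, size=(100, 1)),
                  rng.normal(size=(100, 1))])

pca = PCA(n_components=2)             # keep the 2 directions with the most variance
reduced = pca.fit_transform(data)     # shape (100, 2)

print(reduced.shape)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component
```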

Attribute Subset Selection (Data Reduction)
 Attribute subset selection is a technique used to reduce the dimensionality
of a dataset by selecting a relevant subset of attributes (features) while
discarding the irrelevant or redundant ones.
 Attributes may be redundant features or irrelevant features.
 Example: Consider the data below for predicting the 5th-semester SPI of a given student.

Student Id | Name | Roll No | Sem Fee | Year Fee | Sem 1 SPI | Sem 2 SPI | Sem 3 SPI | Sem 4 SPI
1          | ABC  | 101     | 30,000  | 60,000   | 7.5       | 7.6       | 8.9       | 5.6
2          | XYZ  | 102     | 30,000  | 60,000   | 4.5       | 6.7       | 4.3       | 6.7

Here attributes such as Name and Roll No are irrelevant features for this prediction, while Sem Fee and Year Fee are redundant features (one can be derived from the other).
Attribute Subset Selection (Cont.) (Data Reduction)
 Heuristic methods for attribute subset selection include the following techniques.
 Stepwise forward selection:
 The procedure starts with an empty set of attributes. At each step, the best of the remaining original attributes is added to the set.
 Stepwise backwards elimination:
 The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
 Combined stepwise forward selection and backwards elimination:
 At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
 Decision tree induction:
 It is generally used for classification.
 It uses a flowchart-like structure.
 All the attributes not appearing in the tree are assumed to be irrelevant.
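A sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector (assuming scikit-learn ≥ 0.24); the estimator and the number of features to keep are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from an empty set and greedily add the best remaining attribute
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",              # "backward" gives stepwise backward elimination
)
selector.fit(X, y)
print(selector.get_support())         # boolean mask over the original attributes
```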
Decision Tree Induction (Data Reduction)
Initial attribute set: {A1, A2, A3, A4, A5, A6}
 The inner node represents an attribute.
 An edge represents a test on the attribute.
 A leaf represents one of the classes.
(Figure: a decision tree that splits on A4 at the root, then on A1 and A6, with leaves labelled Class 1 and Class 2.)
Reduced attribute set: {A1, A4, A6}
Histograms (Data Reduction)
 A histogram is a useful visualization tool for data reduction as it provides a
clear representation of the distribution of a dataset.
 Histograms help in identifying patterns, outliers, and the general shape of the data, which can be useful in making decisions about data reduction techniques.
 A histogram for an attribute, A, partitions the data distribution of A into
disjoint subsets, referred to as buckets or bins.
 There are several partitioning rules, including the following:
1. Equal width (or distance) binning :
 The simplest binning approach is to partition the range of the variable into k equal-
width intervals.
The interval width is simply the range [Min, Max] of the variable divided by N:
 Width = (Max – Min) / N (N = number of bins)

2. Equal depth (or frequency) binning :
 In equal-frequency binning we divide the range [Min, Max] of the variable into intervals that each contain approximately the same number of values.
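A quick sketch building an equal-width histogram of the earlier example data with numpy:

```python
import numpy as np

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# Equal-width buckets: 3 bins over the range [5, 215]
counts, edges = np.histogram(data, bins=3)
print(counts)   # values per bucket: [9 1 2]
print(edges)    # bucket boundaries: [  5.  75. 145. 215.]
```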
Clustering (Data Reduction)
 Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense) to each other than to those in other groups
(clusters).
 Cluster analysis as such is not an automatic task, but an iterative process
of knowledge discovery or interactive multi-objective optimization that
involves trial and failure.
 In data reduction, the cluster representations of the data are used to replace the actual data.
(Figure: clusters shown with their cluster centers; points lying outside the clusters are outliers.)
Sampling (Data Reduction)
 Sampling is a data reduction technique that involves selecting a subset of
data from a larger dataset to represent the whole data.
 The objective of sampling is to reduce the size of the dataset while
preserving its important characteristics and properties.
1. Simple random sample without replacement (SRSWOR):
 Data is randomly selected from the original dataset, and once a data point is
selected, it is not returned to the dataset before selecting the next data point.
 No data point is selected more than once.
(Figure: a sample is drawn from the data points 1-5 with no point selected more than once.)
Sampling (Cont.) (Data Reduction)
2. Simple random sample with replacement (SRSWR):
 Data is randomly selected from the original dataset, and after each data point is
selected, it is returned to the dataset before selecting the next data point.
 it is possible for a data point to be selected more than once in the sample.
(Figure: a sample is drawn from the data points 1-5 in which a point may appear more than once.)
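A small sketch of both simple random sampling schemes with pandas; the data points 1-5 and the sample sizes are illustrative:

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])

# SRSWOR: each data point can be chosen at most once
srswor = data.sample(n=2, replace=False, random_state=0)

# SRSWR: a data point may be chosen more than once
srswr = data.sample(n=3, replace=True, random_state=0)

print(srswor.tolist())
print(srswr.tolist())
```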
Sampling (Cont.) (Data Reduction)
3. Cluster sample:
 It partitions the data into a number of clusters and then selects clusters using the SRSWOR technique.
4. Stratified sample:
 A sampling method known as stratified sampling divides the population into subgroups or strata according to specific characteristics.
 A random sample is then taken from each stratum.
(Figure: the population is divided into strata and a simple random sample (SRS) is drawn from each stratum.)
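A sketch of stratified sampling with pandas (assuming pandas ≥ 1.1 for group-wise sampling); the stratum labels and the sampling fraction are illustrative:

```python
import pandas as pd

population = pd.DataFrame({
    "stratum": ["A"] * 6 + ["B"] * 4,
    "value":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# Draw a 50% simple random sample from every stratum
stratified = population.groupby("stratum").sample(frac=0.5, random_state=0)
print(stratified)
```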
Data Discretization
 It transforms continuous or numerical data into discrete intervals or categories.
 It involves dividing the data values into bins or intervals, which can simplify data analysis and reduce the impact of noise.
Age values: 10, 22, 23, 41, 50, 60, 70, 90, 100
Age 10-22  → Young
Age 23-70  → Mature
Age 71-100 → Senior
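A short sketch using pandas.cut to map the Age values above into the three labelled intervals:

```python
import pandas as pd

ages = pd.Series([10, 22, 23, 41, 50, 60, 70, 90, 100])

categories = pd.cut(ages,
                    bins=[9, 22, 70, 100],              # (9, 22], (22, 70], (70, 100]
                    labels=["Young", "Mature", "Senior"])
print(pd.DataFrame({"Age": ages, "Category": categories}))
```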
Data Discretization
 Discretization can make complex data more understandable and
interpretable.
 Some algorithms require categorical data or work better with discrete values; for these, we discretize the data.
 In some cases, discretization can be used as a privacy-enhancing
technique to prevent the disclosure of sensitive information.
Data Discretization Techniques
 Discretization by Binning:
 For example, attribute values can be discretized by applying equal-width or equal-
frequency binning, and then replacing each bin value by the bin mean or median as
in smoothing by bin means or smoothing by bin medians, respectively.
 Discretization by Histogram Analysis:
 A histogram is a useful visualization tool for data reduction as it provides a clear
representation of the distribution of a dataset.
 Discretization by Clustering:
 Using clustering algorithms (e.g., k-means) to group similar data points into bins.
 Discretization by Decision trees :
 Using decision trees to find optimal split points for discretization.
 Discretization by correlation :
 When considering data discretization, correlation can help identify patterns and
relationships between variables that might guide the creation of meaningful bins.
Concept Hierarchy
 A concept hierarchy in data mining refers to the arrangement of data into
a tree-like structure, with each level of the hierarchy representing a
concept that is more general than the one below it.
 This hierarchical data organization enables more efficient and effective
data analysis, as well as the capacity to drill down to more specific levels
of detail as necessary.
country 15 distinct values

state 365 distinct values

city 3567 distinct values

street 674,339 distinct values
Concept Hierarchy (Cont.)
(Figure: a concept hierarchy for Location. Location branches into USA and India; USA into New York and Illinois; India into Gujarat and UP; New York into Utica and Albany; Illinois into Joliet and Elgin; Gujarat into Surat and Rajkot; UP into Lucknow and Noida.)

Questions
1. What is data preprocessing?
2. Explain data cleaning with an example.
3. Explain different ways to fill missing values in a dataset.
4. Explain the binning method with an example.
5. Explain how we can identify outliers and smooth out noisy data.
6. Explain data integration.
7. Explain data transformation with an example.
8. Explain min-max, decimal scaling and z-score with examples.
9. Explain data reduction with an example.
10. Explain data sampling techniques for data reduction.
11. What is data discretization? Explain techniques for data discretization.
12. Explain concept hierarchy.