
Data Mining (DM)

2101CS521

Unit-2
Data Pre-Processing

Prof. Jayesh D. Vagadiya


Computer Engineering
Department
Darshan Institute of Engineering & Technology, Rajkot
jayesh.vagadiya@darshan.ac.in
9537133260
Topics to be covered
• Why pre-process data?
• Data cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretization
Why pre-process data?
 Data pre-processing is a data mining technique that involves transforming
raw data (real world data) into an understandable format.
 Real-world data is often incomplete, inconsistent, lacking in certain
behaviors or trends and likely to contain many errors.
 Incomplete: Missing attribute values, lack of certain attributes of interest, or
containing only aggregate data.
 E.g. Occupation = “ ”
 Noisy: Containing errors or outliers.
 E.g. Salary = “abcxy”
 Inconsistent: Containing discrepancies in codes or names.
 E.g. “Gujarat” & “Gujrat” (common mistakes like spelling, grammar, articles)
Why pre-process data? (Cont.)
 This reflects the principle of Garbage In, Garbage Out (GIGO).

 Quality decisions must be based on quality data.


 Duplicate or missing data may cause incorrect or even misleading
statistics.
 Data preprocessing prepares raw data for further processing.
Data preparation, cleaning and transformation constitute the majority of the work (around 90%) in data mining.
Data Cleaning
1. Fill in missing values
2. Identify outliers and smooth out noisy data
3. Correct inconsistent data
4. Resolve redundancy caused by data integration
1) Fill in missing values (Data Cleaning)
 Ignore the tuple (record/row):
• Usually done when class label is missing.
• This means removing the entire row from the dataset if any of its values are missing.
• This approach can be useful when the number of missing values is relatively small,
and removing a few tuples does not significantly impact the overall data analysis.
 Fill missing value manually:
• In general, this approach is time consuming and may not be feasible given a large
data set with many missing values.
 Use a global constant to fill in the missing value
• Replace all missing attribute values by the same constant such as a label like
“Unknown” or −∞ .
 Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value :
• For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median.
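A minimal sketch of the constant and central-tendency fill strategies using pandas; the DataFrame and its column values are hypothetical, not taken from the slides:

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing values (NaN / None)
df = pd.DataFrame({
    "Occupation": ["Engineer", None, "Teacher", "Engineer"],
    "Salary": [52000, 61000, np.nan, 58000],
})

# Global constant for a categorical attribute
df["Occupation"] = df["Occupation"].fillna("Unknown")

# Central tendency for a numeric attribute:
# mean for symmetric distributions, median for skewed ones
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

print(df)
```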
1) Fill in missing values (Cont.) (Data Cleaning)
 Use the attribute mean or median for all samples belonging to
the same class as the given tuple:
• To apply this, calculate the mean or median of all samples belonging to the same class as the given tuple and use the resulting value as the replacement.
 Use the most probable value to fill in the missing value:
• This may be determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction
2) Identify outliers and smooth out noisy data (Data Cleaning)
There are three data smoothing techniques as follows..
1. Binning :
 Binning methods smooth a sorted data value by consulting its “neighborhood” that
is, the values around it.
2. Regression :
 It conforms data values to a function.
 Linear regression involves finding the “best” line to fit two attributes (or variables) so
that one attribute can be used to predict the other.
3. Outlier analysis :
 Outliers may be detected by clustering for example, where similar values are
organized into groups or “clusters”.
 In this, values that fall outside of the set of clusters may be considered as outliers.
1. Binning Method (Data Cleaning)
 Binning method is a top-down splitting technique based on a specified
number of bins.
 In this method the data is first sorted and then the sorted values are
distributed into a number of buckets or bins.
 For example, attribute values can be discretized (separated) by applying
equal-width or equal-frequency binning, and then replacing each value by
the bin mean, median or boundaries.
 It can be applied recursively to the resulting partitions to generate
concept hierarchies.
 It is used to minimize the effects of small observation errors.
1. Binning Method (Cont.) (Data Cleaning)
There are basically two types of binning approaches..
1. Equal width (or distance) binning :
 The simplest binning approach is to partition the range of the variable into k equal-
width intervals.
The interval width is simply the range [Min, Max] of the variable divided by N:
 Width = (Max – Min) / N (N = number of bins)
 Example
 Data: 5,10,11,13,15, 35, 50, 55, 72, 92, 204, 215
 As per above formula we have Max=215, Min=5, Number of Bins=3, so 215-5 =
210, 210/3 = 70
 70+5=75 (from 5 to 75) = Bin 1: 5,10,11,13,15, 35, 50, 55, 72
 70+75=145 (from 75 to 145) = Bin 2: 92
 70+145=215 (from 145 to 215) = Bin 3: 204, 215

2. Equal depth (or frequency) binning :
 In equal-frequency binning we divide the range [Min, Max] of the variable into intervals that each contain approximately the same number of values.
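A short sketch of both partitioning rules with pandas, reusing the example values above (`cut` gives equal-width bins, `qcut` gives equal-depth bins):

```python
import pandas as pd

data = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-width binning: 3 intervals, each of width (215 - 5) / 3 = 70
equal_width = pd.cut(data, bins=3)

# Equal-depth (frequency) binning: 3 bins with roughly 4 values each
equal_depth = pd.qcut(data, q=3)

print(pd.DataFrame({"value": data,
                    "equal_width_bin": equal_width,
                    "equal_depth_bin": equal_depth}))
```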
1. Binning Method (Cont.) (Data Cleaning)
 Bin Operations
1. Smoothing by bin means
 In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin.
2. Smoothing by bin median
 In this method each bin value is replaced by its bin median value.
3. Smoothing by bin boundary
 In smoothing by bin boundaries, the minimum and maximum values in a given bin
are identified as the bin boundaries.
 Each bin value is then replaced by the closest boundary value.
Binning Method Example – {Bin Means} (Data Cleaning)
 Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Step: 1
 Partition into equal-depth bins (3 bins of 4 values each):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
 Step: 2
• Smoothing by bin means:
Bin means: (4 + 8 + 9 + 15)/4 = 9, (21 + 21 + 24 + 25)/4 ≈ 23, (26 + 28 + 29 + 34)/4 ≈ 29
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Binning Method Example – {Bin Boundaries} (Data Cleaning)
 Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Step: 1
 Partition into equal-depth bins (3 bins of 4 values each):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
 Step: 2
• Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
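A small sketch reproducing both worked examples: the sorted values are split into three equal-depth bins and smoothed by bin means and by bin boundaries (rounding the means as the slides do):

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(np.sort(data), 3)              # 3 equal-depth bins of 4 values

for b in bins:
    means = np.full_like(b, round(b.mean()))         # smoothing by bin means
    lo, hi = b.min(), b.max()
    boundaries = np.where(b - lo <= hi - b, lo, hi)  # replace by the closest boundary
    print(b, "-> means:", means, "| boundaries:", boundaries)
```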
2. Regression (Data Cleaning)
 Data smoothing can also be done by regression, a technique that conforms data values to a function.
 Regression analysis is a way to find trends in data; it mathematically describes the relationship between the independent variables and the dependent variable.
 It can be divided into two categories:
1. Linear regression :
 It involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.
 The analysis uses a single x variable for each dependent “y” variable, for example: (x1, Y1).
2. Multiple linear regression :
 An extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
 It uses multiple “x” variables for each dependent “y” variable, for example: ((x1)1, (x2)1, (x3)1, Y1).
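A brief sketch of smoothing by linear regression with numpy; the x and y values here are made-up illustrations, not from the slides:

```python
import numpy as np

# Hypothetical noisy observations of y against a single predictor x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Find the "best" straight line y = slope * x + intercept (least squares)
slope, intercept = np.polyfit(x, y, deg=1)

# Conform the data values to the fitted function (smoothed y values)
y_smoothed = slope * x + intercept
print(round(slope, 3), round(intercept, 3))
print(y_smoothed)
```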
3. Clustering (Data Cleaning)
 Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense) to each other than to those in other groups
(clusters).
 Cluster analysis as such is not an automatic task, but an iterative process
of knowledge discovery or interactive multi-objective optimization that
involves trial and failure.
 It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
(Figure: a scatter plot of clusters with their cluster centers; points falling outside all clusters are the outliers.)
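A rough sketch (assuming scikit-learn is available) of flagging outliers as points that lie unusually far from their assigned cluster center; the synthetic data and the 3-standard-deviation threshold are illustrative choices only:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two dense clusters plus two obvious outliers
points = np.vstack([rng.normal(0, 0.5, (20, 2)),
                    rng.normal(5, 0.5, (20, 2)),
                    [[15, 15], [-10, 12]]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Distance of each point to the center of its assigned cluster
dist = np.linalg.norm(points - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# One simple rule: points much farther from their center than typical are outliers
threshold = dist.mean() + 3 * dist.std()
print("Outliers:\n", points[dist > threshold])
```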
Correct Inconsistent Data (Data Cleaning)
 With larger datasets, it can be difficult to find all of the inconsistencies.
 Inconsistencies typically appear as discrepancies in codes or names.
 Common mistakes such as spelling, grammar or article errors can be corrected manually or with dedicated tools.
Resolve redundancy caused by data integration (Data Cleaning)
 Data redundancy occurs in database systems which have a field that
is repeated in two or more tables.
 When customer data is duplicated and attached to each product bought, the copies can drift apart; this kind of redundancy leads to inconsistency.
 As a result, the entity "customer" might appear with different values.
 Database normalization prevents redundancy and makes the best
possible usage of storage.
 The proper use of foreign keys can minimize data redundancy and
reduce the chance of destructive anomalies appearing.
Data Integration
 Combines data from multiple
sources into a coherent store.
 Careful integration can help
reduce and avoid redundancies
and inconsistencies in the
resulting data set.
 Schema integration: e.g., A.cust-id ≡ B.cust#
Entity Identification Problem (Data Integration)
 How can real-world entities from different data sources be matched?


 This is referred to as the entity identification problem.
 Example:
 Do the customer id in one database and the cust number in another refer to the same attribute?
 When matching attributes from one database to another during
integration, special attention must be paid to the structure of the data.
 This is to ensure that any attribute functional dependencies and
referential constraints in the source system match those in the target
system.
 Example:
 Imagine there are two separate systems: System A and System B. In System A,
discounts are applied to the entire order, whereas in System B, discounts are applied
to each individual line item within the order.
 Items in the target system may end up being improperly discounted.
Redundancy and Correlation Analysis (Data Integration)
 Redundancy is another important issue in data integration. An attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from another attribute or set of attributes.
 This redundancy can consume additional storage space, introduce
inconsistencies, and complicate data management processes.
 Example:
 if annual revenue can be obtained by summing up monthly revenue values, storing
both the individual monthly revenue and the derived annual revenue would result in
redundant data.
 Some redundancies can be detected by correlation analysis.
 We can evaluate the correlation between two attributes, A and B, by
computing the correlation coefficient.
Correlation Coefficient (Data Integration)
 The correlation coefficient is a statistical measure that assesses the strength and direction of the relationship between two variables.
 RA,B = Σi (ai − Ā)(bi − B̄) / (n · σA · σB)
 where n is the number of tuples
 ai and bi are the respective values of A and B in tuple i
 Ā and B̄ are the respective mean values of A and B
 σA and σB are the respective standard deviations of A and B
 The value of correlation is −1 ≤ RA,B ≤ +1.
 If RA,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.
 The higher the value, the stronger the correlation.
 A higher value may indicate that A (or B) may be removed as a redundancy.
 If the resulting value is equal to 0, then A and B are independent and there is no correlation between them.
Correlation Coefficient (Cont.) (Data Integration)
 If the resulting value is less than 0, then A and B are negatively correlated.
 where the values of one attribute increase as the values of the other
attribute decrease.
 Scatter plots can also be used to view correlations between attributes
 Positive Correlation :
 Let's consider the relationship between the number of hours studied and the test
scores obtained by a group of students.
 Negative Correlation:
 Suppose we examine the relationship between temperature and sales of ice cream.
 No Correlation
 Consider the relationship between shoe size and intelligence.
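A quick sketch computing RA,B with numpy for a hypothetical hours-studied vs. test-score example, following the formula above:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)        # attribute A
score = np.array([52, 55, 61, 64, 70, 75], dtype=float)  # attribute B

n = len(hours)
# R_A,B = sum((a_i - mean_A) * (b_i - mean_B)) / (n * std_A * std_B)
r = ((hours - hours.mean()) * (score - score.mean())).sum() / (
    n * hours.std() * score.std())
print(r)                                # close to +1: strong positive correlation

# Cross-check with numpy's built-in Pearson correlation
print(np.corrcoef(hours, score)[0, 1])
```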
Data Transformation
 A function that maps the entire set of values of a given attribute to a new set of replacement values, so that each old value can be identified with one of the new values.
 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 Min-max normalization
 Z-score normalization
 Normalization by decimal scaling
 Discretization: Concept hierarchy climbing
1. Min-Max Normalization (Data Transformation)
 Min-max is a technique that helps to normalize the data.
 It will scale the data between 0 and 1, or within another specified range [NewMin, NewMax].
 Formula: v’ = ((v − Min) / (Max − Min)) × (NewMax − NewMin) + NewMin
 Example
 Given data: 16, 20, 30, 40
 Min: minimum value = 16
 Max: maximum value = 40
 V: the respective value of the attribute; in our example V1 = 16, V2 = 20, V3 = 30, V4 = 40
 NewMax = 1
 NewMin = 0
1. Min-Max Normalization (Cont.) (Data Transformation)
Example (using v’ = ((v − Min) / (Max − Min)) × (NewMax − NewMin) + NewMin):
For Age 16: v’ = (16 − 16)/(40 − 16) × (1 − 0) + 0 = 0/24 × 1 = 0
For Age 20: v’ = (20 − 16)/(40 − 16) × (1 − 0) + 0 = 4/24 × 1 = 0.16
For Age 30: v’ = (30 − 16)/(40 − 16) × (1 − 0) + 0 = 14/24 × 1 = 0.58
For Age 40: v’ = (40 − 16)/(40 − 16) × (1 − 0) + 0 = 24/24 × 1 = 1

Age | After min-max normalization
16  | 0
20  | 0.16
30  | 0.58
40  | 1
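A minimal sketch of the min-max formula applied to the same Age values (NewMin = 0, NewMax = 1):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values into [new_min, new_max] with min-max normalization."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

ages = [16, 20, 30, 40]
print(min_max(ages))   # [0.0, 0.166..., 0.583..., 1.0]
```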
2. Decimal Scaling (Data Transformation)
 In this technique we move the decimal point of the values of the attribute.
 How far the decimal point moves depends on the maximum absolute value among all values of the attribute.
 A value V of attribute A can be normalized by the following formula: V’ = V / 10^j, where j is the smallest integer such that Max(|V’|) < 1.
 Example (attribute CGPA):
CGPA | Formula | After decimal scaling
2    | 2 / 10  | 0.2
3    | 3 / 10  | 0.3
 We check the maximum value among our attribute CGPA; the maximum value is 3, so we convert by dividing by 10. Why 10?
 We count the total digits in our maximum value, write 1, and then append that many zeros.
 Here 3 is the maximum value and it has only one digit, so we put one zero after 1, giving 10 (i.e., j = 1).
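A minimal sketch of decimal scaling: j is increased until the largest absolute value drops below 1, then every value is divided by 10^j:

```python
def decimal_scaling(values):
    """Normalize by v / 10**j, using the smallest j with max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:   # grow j until the largest value is below 1
        j += 1
    return [v / 10 ** j for v in values], j

print(decimal_scaling([2, 3]))        # ([0.2, 0.3], 1)
print(decimal_scaling([45, 734]))     # ([0.045, 0.734], 3)
```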
3. Z-Score Normalization (Data Transformation)
 It is also called zero-mean normalization.
 The essence of this technique is transforming the data by converting the values to a common scale where the mean equals zero and the standard deviation equals one.
 To find z-score values:
v’ = (v − μA) / σA
where μ is the mean and σ is the standard deviation of attribute A.
 Example
 Let μ = 54,000 and σ = 16,000. Find the z-score for 73,600:
(73,600 − 54,000) / 16,000 = 1.225
 Z-score for 73,600: 1.225
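A one-function sketch of the z-score computation, reproducing the example above (μ = 54,000, σ = 16,000):

```python
def z_score(v, mean, std):
    """Number of standard deviations that v lies above (+) or below (-) the mean."""
    return (v - mean) / std

print(z_score(73_600, mean=54_000, std=16_000))   # 1.225
```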
3. Z-Score Normalization (Cont.) (Data Transformation)
 These z-scores represent the number of standard deviations that each data point is away from the mean of the distribution.
 A positive z-score indicates that the data point is above the mean, while a
negative z-score indicates it is below the mean.
Data Reduction
 Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet closely
maintains the integrity of the original data
 That is, mining on the reduced data set should be more efficient yet
produce the same (or almost the same) analytical results
 Dimensionality reduction:
 Dimensionality reduction techniques aim to reduce the number of variables or
features in a dataset while preserving the important information.
 Dimensionality reduction methods include wavelet transforms and principal
components analysis.
 Numerosity reduction:
 Numerosity reduction techniques focus on reducing the number of instances or data
points in a dataset while maintaining the representativeness of the data.
 Data compression:
 Data compression techniques are used to reduce the size of a dataset by encoding
the data in a more compact form.
Data Reduction (Cont.)
(Figure: with lossless compression the original data can be fully reconstructed from the compressed data; with lossy compression only an approximation of the original data can be reconstructed.)
Principal Components Analysis (Data Reduction)
 The number of input variables or features for a dataset is referred to as its
dimensionality.
 PCA transforms high-dimensional data into a lower-dimensional subspace
while preserving the most important information.
 Formally, PCA is a statistical technique for reducing the dimensionality of a
dataset.
 This is accomplished by linearly transforming the data into a
new coordinate system where (most of) the variation in the data can be
described with fewer dimensions than the initial data.
 Dimensionality reduction refers to techniques that reduce the number of
input variables in a dataset.
 Example
 Dimensional reduction can be discussed through a simple e-mail classification
problem, where we need to classify whether the e-mail is spam or not.
 This can involve a large number of features, such as whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, etc.
Principal Components Analysis (Cont.) (Data Reduction)
Ref: “Principal Component Analysis,” https://setosa.io/ev/principal-component-analysis/
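A brief sketch (assuming scikit-learn is available) of projecting a small synthetic 3-feature dataset onto its first two principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 100 samples with 3 features; the first two are strongly correlated
x = rng.normal(size=(100, 1))
data = np.hstack([x,
                  2 * x + rng.normal(scale=0.1, size=(100, 1)),
                  rng.normal(size=(100, 1))])

pca = PCA(n_components=2)             # keep the 2 directions with the most variance
reduced = pca.fit_transform(data)     # shape (100, 2)

print(reduced.shape)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component
```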

Attribute Subset Selection (Data Reduction)
 Attribute subset selection is a technique used to reduce the dimensionality
of a dataset by selecting a relevant subset of attributes (features) while
discarding the irrelevant or redundant ones.
 Attributes may be redundant features or irrelevant features.
 Example: Consider the data below for predicting the 5th-semester SPI of a given student.

Student Id | Name | Roll No | Sem Fee | Year Fee | Sem 1 SPI | Sem 2 SPI | Sem 3 SPI | Sem 4 SPI
1          | ABC  | 101     | 30,000  | 60,000   | 7.5       | 7.6       | 8.9       | 5.6
2          | XYZ  | 102     | 30,000  | 60,000   | 4.5       | 6.7       | 4.3       | 6.7

Here attributes such as Name and Roll No are irrelevant features for this prediction, while Sem Fee and Year Fee are redundant features (one can be derived from the other).
Attribute Subset Selection (Cont.) (Data Reduction)
 Heuristic methods for attribute subset selection include the following techniques.
 Stepwise forward selection:
 The procedure starts with an empty set of attributes. At each step, the best of the remaining original attributes is added to the set.
 Stepwise backwards elimination:
 The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
 Combined stepwise forward selection and backwards elimination:
 At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
 Decision tree induction:
 It is generally used for classification.
 It uses a flowchart-like structure.
 All the attributes not appearing in the tree are assumed to be irrelevant.
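A sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector (assuming scikit-learn ≥ 0.24); the estimator and the number of features to keep are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from an empty set and greedily add the best remaining attribute
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",              # "backward" gives stepwise backward elimination
)
selector.fit(X, y)
print(selector.get_support())         # boolean mask over the original attributes
```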
Decision Tree Induction (Data Reduction)
Initial attribute set: {A1, A2, A3, A4, A5, A6}
 The inner node represents an attribute.
 An edge represents a test on the attribute.
 A leaf represents one of the classes.
(Figure: a decision tree that splits on A4 at the root, then on A1 and A6, with leaves labelled Class 1 and Class 2.)
Reduced attribute set: {A1, A4, A6}
Histograms (Data Reduction)
 A histogram is a useful visualization tool for data reduction as it provides a
clear representation of the distribution of a dataset.
 Histograms help in identifying patterns, outliers, and the general shape of the data, which can be useful in making decisions about data reduction techniques.
 A histogram for an attribute, A, partitions the data distribution of A into
disjoint subsets, referred to as buckets or bins.
 There are several partitioning rules, including the following:
1. Equal width (or distance) binning :
 The simplest binning approach is to partition the range of the variable into k equal-
width intervals.
The interval width is simply the range [Min, Max] of the variable divided by N:
 Width = (Max – Min) / N (N = number of bins)

2. Equal depth (or frequency) binning :
 In equal-frequency binning we divide the range [Min, Max] of the variable into intervals that each contain approximately the same number of values.
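A quick sketch building an equal-width histogram of the earlier example data with numpy:

```python
import numpy as np

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# Equal-width buckets: 3 bins over the range [5, 215]
counts, edges = np.histogram(data, bins=3)
print(counts)   # values per bucket: [9 1 2]
print(edges)    # bucket boundaries: [  5.  75. 145. 215.]
```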
Clustering (Data Reduction)
 Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense) to each other than to those in other groups
(clusters).
 Cluster analysis as such is not an automatic task, but an iterative process
of knowledge discovery or interactive multi-objective optimization that
involves trial and failure.
 In data reduction, the cluster representations of the data are used to replace the actual data.
(Figure: clusters shown with their cluster centers; points lying outside the clusters are outliers.)
Sampling (Data Reduction)
 Sampling is a data reduction technique that involves selecting a subset of
data from a larger dataset to represent the whole data.
 The objective of sampling is to reduce the size of the dataset while
preserving its important characteristics and properties.
1. Simple random sample without replacement (SRSWOR):
 Data is randomly selected from the original dataset, and once a data point is
selected, it is not returned to the dataset before selecting the next data point.
 No data point is selected more than once.
(Figure: a sample is drawn from the data points 1-5 with no point selected more than once.)
Sampling (Cont.) (Data Reduction)
2. Simple random sample with replacement (SRSWR):
 Data is randomly selected from the original dataset, and after each data point is
selected, it is returned to the dataset before selecting the next data point.
 it is possible for a data point to be selected more than once in the sample.
(Figure: a sample is drawn from the data points 1-5 in which a point may appear more than once.)
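A small sketch of both simple random sampling schemes with pandas; the data points 1-5 and the sample sizes are illustrative:

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])

# SRSWOR: each data point can be chosen at most once
srswor = data.sample(n=2, replace=False, random_state=0)

# SRSWR: a data point may be chosen more than once
srswr = data.sample(n=3, replace=True, random_state=0)

print(srswor.tolist())
print(srswr.tolist())
```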
Sampling (Cont.) (Data Reduction)
3. Cluster sample:
 It partitions the data into a number of clusters and then selects clusters using the SRSWOR technique.
4. Stratified sample:
 A sampling method known as stratified sampling divides the population into subgroups or strata according to specific characteristics.
 A random sample is then taken from each stratum.
(Figure: the population is divided into strata and a simple random sample (SRS) is drawn from each stratum.)
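A sketch of stratified sampling with pandas (assuming pandas ≥ 1.1 for group-wise sampling); the stratum labels and the sampling fraction are illustrative:

```python
import pandas as pd

population = pd.DataFrame({
    "stratum": ["A"] * 6 + ["B"] * 4,
    "value":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# Draw a 50% simple random sample from every stratum
stratified = population.groupby("stratum").sample(frac=0.5, random_state=0)
print(stratified)
```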
Data Discretization
 It transforms continuous or numerical data into discrete intervals or categories.
 It involves dividing the data values into bins or intervals, which can simplify data analysis and reduce the impact of noise.
Age values: 10, 22, 23, 41, 50, 60, 70, 90, 100
Age 10-22  → Young
Age 23-70  → Mature
Age 71-100 → Senior
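A short sketch using pandas.cut to map the Age values above into the three labelled intervals:

```python
import pandas as pd

ages = pd.Series([10, 22, 23, 41, 50, 60, 70, 90, 100])

categories = pd.cut(ages,
                    bins=[9, 22, 70, 100],              # (9, 22], (22, 70], (70, 100]
                    labels=["Young", "Mature", "Senior"])
print(pd.DataFrame({"Age": ages, "Category": categories}))
```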
Data Discretization
 Discretization can make complex data more understandable and
interpretable.
 Some algorithms require categorical data or work better with discrete values; for these, we discretize the data.
 In some cases, discretization can be used as a privacy-enhancing
technique to prevent the disclosure of sensitive information.
Data Discretization Techniques
 Discretization by Binning:
 For example, attribute values can be discretized by applying equal-width or equal-
frequency binning, and then replacing each bin value by the bin mean or median as
in smoothing by bin means or smoothing by bin medians, respectively.
 Discretization by Histogram Analysis:
 A histogram is a useful visualization tool for data reduction as it provides a clear
representation of the distribution of a dataset.
 Discretization by Clustering:
 Using clustering algorithms (e.g., k-means) to group similar data points into bins.
 Discretization by Decision trees :
 Using decision trees to find optimal split points for discretization.
 Discretization by correlation :
 When considering data discretization, correlation can help identify patterns and
relationships between variables that might guide the creation of meaningful bins.
Concept Hierarchy
 A concept hierarchy in data mining refers to the arrangement of data into
a tree-like structure, with each level of the hierarchy representing a
concept that is more general than the one below it.
 This hierarchical data organization enables more efficient and effective
data analysis, as well as the capacity to drill down to more specific levels
of detail as necessary.
country 15 distinct values

state 365 distinct values

city 3567 distinct values

street 674,339 distinct values
Concept Hierarchy (Cont.)
(Figure: a concept hierarchy for Location. Location branches into USA and India; USA into New York and Illinois; India into Gujarat and UP; New York into Utica and Albany; Illinois into Joliet and Elgin; Gujarat into Surat and Rajkot; UP into Lucknow and Noida.)

Questions
1. What is data preprocessing?
2. Explain data cleaning with an example.
3. Explain different ways to fill missing values in a dataset.
4. Explain the binning method with an example.
5. Explain how we can identify outliers and smooth out noisy data.
6. Explain data integration.
7. Explain data transformation with an example.
8. Explain min-max, decimal scaling and z-score with examples.
9. Explain data reduction with an example.
10. Explain data sampling techniques for data reduction.
11. What is data discretization? Explain techniques for data discretization.
12. Explain concept hierarchy.