Data Science and Data Analytics
Data Cleaning and Transformation
What is Data Cleaning?
Data cleaning involves identifying and correcting (or removing) errors
and inconsistencies in data to improve its quality and prepare it for
analysis.
Objectives of Data Cleaning:
➢ Improve Data Accuracy: Ensure that the data correctly reflects
real-world values.
➢ Ensure Data Consistency: Maintain uniformity across datasets to
avoid errors caused by inconsistent formats.
➢ Handle Missing Data: Address gaps in data to prevent bias or loss of
information during analysis.
➢ Remove Irrelevant Data: Eliminate unnecessary data to focus on
relevant information for the analysis.
➢ Detect and Correct Outliers: Identify and address extreme values
that may skew analysis.
What is Data Cleaning?
Objectives of Data Cleaning (cont):
➢ Ensure Data Completeness: Verify that all necessary data points
are present to avoid gaps in analysis.
➢ Maintain Data Integrity: Ensure data conforms to defined rules or
constraints.
➢ Reduce Redundancy: Eliminate duplicate records to reduce data
size and improve analysis accuracy.
➢ Minimize Bias: Ensure that data cleaning does not introduce or
perpetuate bias.
Data Cleaning Techniques
Data Filtering involves removing irrelevant or unnecessary data from a
dataset to reduce noise and focus on the most relevant information.
Data Deduplication involves eliminating duplicate records from a
dataset, ensuring that each record is unique.
Data Imputation entails replacing missing or null values with
estimated values to maintain data integrity.
Data Standardization involves putting all data into a common format
to facilitate comparison and analysis.
Data Normalization is the process of adjusting the values of numeric
data to a common scale without distorting differences in the ranges of
values.
Data Transformation involves modifying existing data to make it more
suitable for analysis or modeling.
Data Cleaning Techniques
Outlier Detection is the process of identifying and managing values
that significantly deviate from the rest of the data, often by treating
or removing them.
Data Validation aims to check if data adheres to defined rules and
constraints, identifying and correcting inconsistencies.
Data Encoding involves converting categorical data into a numerical
format to make it compatible with machine learning algorithms.
Data Aggregation entails grouping data by category, time period, or
another criterion to obtain summarized statistics.
Data Sampling is the process of selecting a representative subset of
data to expedite analysis while preserving data integrity.
Data Filtering
Purpose of Data Filtering
➢ Eliminate noise and irrelevant information.
➢ Reduce computational complexity by working with smaller, cleaner
datasets.
➢ Focus analysis on data that aligns with specific goals or criteria.
➢ Improve the accuracy of predictive models and analytical outcomes.
How Data Filtering Works
➢ Define Filtering Criteria: Establish rules or conditions to identify which
data points to include or exclude.
➢ Apply Filters: Use software tools or programming libraries to implement
these criteria.
➢ Verify Results: Review the filtered dataset to ensure no essential
information is lost.
Data Filtering Example
Original data:
  Customer ID | Purchase Amount | Location
  101         | $120            | NYC
  102         | $85             | LAC
  103         | $95             | NYC
Filtered data (Location = NYC):
  Customer ID | Purchase Amount | Location
  101         | $120            | NYC
  103         | $95             | NYC
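A minimal pandas sketch of this filter; the column names are taken from the example table above:

```python
import pandas as pd

# Example data from the slide above.
df = pd.DataFrame({
    "Customer ID": [101, 102, 103],
    "Purchase Amount": [120, 85, 95],
    "Location": ["NYC", "LAC", "NYC"],
})

# Define the filtering criterion and apply it: keep only NYC purchases.
nyc_only = df[df["Location"] == "NYC"]

# Verify the result: only rows 101 and 103 remain.
print(nyc_only)
```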
Data Deduplication
Purpose of Data Deduplication
➢ Avoids storing multiple copies of the same data.
➢ Prevents inflated metrics caused by duplicate entries.
➢ Eliminates bias or inaccuracies in data-driven insights.
➢ Reduces processing time for large datasets.
How Data Deduplication Works
➢ Identify Duplicate Records:
• Compare records using unique identifiers (e.g., IDs, email addresses).
• Check for similarities in attributes like names, dates, or addresses.
➢ Evaluate the Data: Decide which record to keep (e.g., most recent or most
complete).
➢ Remove or Merge Records: Delete duplicate entries or combine data into a single
record.
➢ Validate Results: Ensure no useful information is accidentally deleted.
Data Deduplication Example
Original data:
  Customer ID | Name          | Email                | Birth Date
  101         | John Smith    | john@gmail.com       | 10.10.2005
  102         | John A. Smith | John.Smith@gmail.com | 10.10.2006
  103         | John Smith    | jsmith@gmail.com     | 10.10.2005
  104         | John Smith    | john@gmail.com       | 10.10.2005
Deduplicated data (104 removed as an exact duplicate of 101):
  Customer ID | Name          | Email                | Birth Date
  101         | John Smith    | john@gmail.com       | 10.10.2005
  102         | John A. Smith | John.Smith@gmail.com | 10.10.2006
  103         | John Smith    | jsmith@gmail.com     | 10.10.2005
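A short pandas sketch of this deduplication, under the simplifying assumption that two rows are duplicates only when Name, Email, and Birth Date all match exactly (as with records 101 and 104):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer ID": [101, 102, 103, 104],
    "Name": ["John Smith", "John A. Smith", "John Smith", "John Smith"],
    "Email": ["john@gmail.com", "John.Smith@gmail.com", "jsmith@gmail.com", "john@gmail.com"],
    "Birth Date": ["10.10.2005", "10.10.2006", "10.10.2005", "10.10.2005"],
})

# Identify duplicates on the descriptive columns (Customer ID always differs,
# so it is excluded) and keep the first occurrence of each duplicate group.
deduped = df.drop_duplicates(subset=["Name", "Email", "Birth Date"], keep="first")
print(deduped)  # record 104 is dropped; 101, 102, and 103 remain
```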
Types of Data Imputation Methods
Simple Imputation Techniques:
➢ Mean Imputation: Replacing missing values with the mean of the non-
missing values in the same column.
• Example: If the column values are [10, 20, NA, 40], the mean (10 + 20 + 40)/3
= 23.33 replaces the missing value.
➢ Median Imputation: Replacing missing values with the median of the
column. Useful when the data has outliers.
➢ Mode Imputation: Replacing missing categorical values with the most
frequent category (mode).
• Example: For the column ['A', 'B', 'A', NA, 'A'], replace NA with 'A'.
Domain-Specific or Logical Imputation: Using domain knowledge to
infer missing values.
• Example: If a survey has missing values for gender and most respondents
are female, assign "Female" logically.
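A minimal pandas sketch of the three simple techniques, reusing the example columns above:

```python
import pandas as pd

num = pd.Series([10, 20, None, 40])          # numeric column with a missing value
cat = pd.Series(["A", "B", "A", None, "A"])  # categorical column with a missing value

mean_imputed = num.fillna(num.mean())      # mean imputation: NA -> 23.33
median_imputed = num.fillna(num.median())  # median imputation: NA -> 20.0
mode_imputed = cat.fillna(cat.mode()[0])   # mode imputation: NA -> 'A'

print(mean_imputed.tolist())
print(median_imputed.tolist())
print(mode_imputed.tolist())
```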
Types of Data Imputation Methods
Advanced Imputation Techniques:
➢ Regression Imputation: Predicting the missing value using a regression
model built with other related variables in the dataset.
➢ K-Nearest Neighbors (KNN) Imputation: Replacing missing values by
averaging the values of the nearest neighbors in the feature space.
➢ Multiple Imputation: Generating multiple possible values for the missing
data and combining the results using statistical models.
Time Series-Specific Imputation:
➢ Forward Fill: Replace missing values with the last observed value.
• Example: For [100, NA, NA, 120], replace each NA with 100.
➢ Backward Fill: Replace missing values with the next observed value.
• Example: For [NA, NA, 120, 130], replace NA with 120.
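A brief sketch of forward/backward fill on the example series, plus a KNN imputation on a small made-up two-column table:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Time series-specific imputation on the example series.
s = pd.Series([100, None, None, 120])
forward_filled = s.ffill()    # [100, 100, 100, 120]
backward_filled = s.bfill()   # [100, 120, 120, 120]

# KNN imputation: the missing value is replaced by the average of its
# nearest neighbours in feature space (toy, made-up data).
X = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0], "f2": [2.0, None, 6.0, 8.0]})
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_imputed)
```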
Data Standardization
Importance of Data Standardization
➢ Improves Accuracy: Ensures consistent interpretation of data.
➢ Enables Integration: Makes it easier to combine datasets from multiple sources.
➢ Enhances Machine Learning: Algorithms require standardized data for optimal
performance.
➢ Facilitates Comparisons: Allows meaningful comparisons across observations.
Steps in Data Standardization
➢ Identify Inconsistent Formats: Detect fields with different formats (e.g., date,
text, currency).
➢ Define a Standard Format: Establish a consistent format for each data type (e.g.,
"YYYY-MM-DD" for dates, lowercase for text).
➢ Apply the Transformation: Convert non-standard data into the defined format.
➢ Validate Standardization: Check for remaining inconsistencies or errors after
transformation.
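A minimal pandas sketch of these steps, assuming a hypothetical column of DD.MM.YYYY dates (matching the date style used in the examples above) and inconsistently cased city names:

```python
import pandas as pd

# Hypothetical columns with inconsistent formatting (names are illustrative).
df = pd.DataFrame({
    "signup_date": ["10.10.2005", "03.11.2005", "25.12.2005"],  # DD.MM.YYYY
    "city": ["NYC", "nyc ", " New York"],
})

# Define a standard format and apply the transformation:
# ISO "YYYY-MM-DD" for dates, trimmed lowercase for text.
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%d.%m.%Y").dt.strftime("%Y-%m-%d")
df["city"] = df["city"].str.strip().str.lower()

# Validate: every date now matches the standard pattern.
assert df["signup_date"].str.match(r"\d{4}-\d{2}-\d{2}").all()
print(df)
```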
Why Normalize Data?
Improve Model Performance: Many machine learning algorithms, such as k-
nearest neighbors (KNN), support vector machines (SVM), and linear
regression, perform better when the data is normalized because these
algorithms use distance measures or optimization methods that are sensitive
to the magnitude of values.
Prevent Dominance of Larger Scales: Without normalization, features with
larger ranges (like income or age) might dominate over features with smaller
ranges (like rating or number of children), leading to biased or inaccurate
results.
Enhance Convergence in Optimization: Models using gradient-based
optimization (like neural networks) may converge faster when features are
normalized because it helps the gradient descent process behave more
smoothly.
Methods of Data Normalization
Min-Max Normalization rescales the data to a fixed range, usually [0, 1].
It adjusts the data so that the minimum value becomes 0, and the maximum
value becomes 1, maintaining the proportion between other values.
  Data | Normalized Data
  10   | 0.00
  20   | 0.25
  30   | 0.50
  40   | 0.75
  50   | 1.00
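A small scikit-learn sketch reproducing the table above; min-max normalization computes x' = (x - min) / (max - min):

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Rescale to [0, 1]: the minimum becomes 0 and the maximum becomes 1.
scaled = MinMaxScaler().fit_transform(data)
print(scaled.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```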
Methods of Data Normalization
Z-Score Normalization (Standardization) transforms the data to have a mean
of 0 and a standard deviation of 1.
It is especially useful when the data follows a Gaussian distribution.
  Data | Normalized Data
  10   | -1.41
  20   | -0.71
  30   | 0.00
  40   | 0.71
  50   | 1.41
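The same column standardized with scikit-learn; z = (x - mean) / std, where StandardScaler uses the population standard deviation (sqrt(200) ≈ 14.14 here):

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Transform to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(data)
print(scaled.ravel().round(2))  # [-1.41 -0.71  0.    0.71  1.41]
```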
When to Use Each Method
Min-Max Normalization:
➢ Ideal for algorithms that require data within a fixed range, like KNN, Neural
Networks, and SVM.
➢ Best when you know your data has a fixed or known range (e.g., pixel values in
images, ratings on a scale of 1-5).
Z-Score Normalization:
➢ Preferred when the data is not restricted to a fixed range and has a Gaussian
distribution.
➢ Useful for algorithms that assume normally distributed data, like Linear Regression,
Logistic Regression, and Principal Component Analysis (PCA).
Methods of Data Normalization
Robust Scaling is a data preprocessing technique used to normalize features
by removing the median and scaling the data according to the interquartile
range (IQR).
It is particularly effective for datasets containing outliers, as it is less
sensitive to extreme values compared to methods like standardization (z-
score normalization) or min-max scaling.
Methods of Data Normalization
When to Use Robust Scaling
➢ Presence of Outliers
➢ Skewed Data
When Not to Use Robust Scaling
➢ No Outliers: If the data does not have significant outliers, standardization
or min-max scaling may suffice and are simpler.
➢ Small Datasets: If there is insufficient data, the median and IQR may not
be representative, leading to biased scaling.
➢ Data Without a Well-Defined Median: Robust scaling might not be
effective for datasets where median and quartiles are not meaningful,
such as categorical data.
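A brief RobustScaler sketch on a made-up column containing one extreme value; robust scaling computes x' = (x - median) / IQR, so the outlier barely affects how the ordinary values are scaled:

```python
from sklearn.preprocessing import RobustScaler
import numpy as np

# Illustrative column with a single extreme value (1000).
data = np.array([[10.0], [12.0], [15.0], [18.0], [20.0], [22.0], [1000.0]])

# Centre on the median and scale by the interquartile range.
scaled = RobustScaler().fit_transform(data)
print(scaled.ravel().round(2))
```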
Data Encoding Methods
Label Encoding (Integer Encoding) converts each category into a
unique integer value. This method is simple, but the assigned integers
imply an ordering that does not exist, which can mislead algorithms
that treat numeric values as ordinal.
➢ Example: Red → 0, Blue → 1, Green → 2
Ordinal Encoding is used when the categories have a natural order or
ranking. It assigns integers to categories based on their order.
➢ Example: For a "Size" column with values Small, Medium, and Large, the
ordinal encoding would be:
➢ Small → 0, Medium → 1, Large → 2
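A short scikit-learn sketch of both encodings. Note that LabelEncoder assigns codes alphabetically, so its integers differ from the Red → 0, Blue → 1, Green → 2 example above, while OrdinalEncoder lets us state the Small < Medium < Large order explicitly:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

colors = pd.Series(["Red", "Blue", "Green", "Red"])
sizes = pd.DataFrame({"Size": ["Small", "Large", "Medium", "Small"]})

# Label encoding: one arbitrary integer per category (alphabetical order).
label_codes = LabelEncoder().fit_transform(colors)

# Ordinal encoding with an explicit, meaningful category order.
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = ordinal.fit_transform(sizes)  # Small -> 0, Medium -> 1, Large -> 2

print(label_codes, size_codes.ravel())
```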
Data Encoding Methods
One-Hot Encoding creates binary columns for each category in the feature.
Each category is represented as a vector with a '1' for the category it
represents and '0' for all others.
➢ Example:
  Color | Red | Green | Blue
  Red   | 1   | 0     | 0
  Green | 0   | 1     | 0
  Blue  | 0   | 0     | 1
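A one-call pandas version of the table above (the generated columns come out in alphabetical order):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df, columns=["Color"], dtype=int)
print(one_hot)
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
```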
Data Encoding Methods
Dummy encoding is a technique for converting categorical variables into
numerical values, similar to one-hot encoding, but with a slight variation.
In dummy encoding, one category is dropped from the encoding, and the
remaining categories are represented using binary (0 or 1) variables.
Dummy encoding avoids the dummy variable trap, which occurs when all
categories are encoded, leading to perfect multicollinearity in regression
models.
➢ Example:
  Color | Red | Green
  Red   | 1   | 0
  Green | 0   | 1
  Blue  | 0   | 0
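The same pandas call with drop_first=True produces the dummy-encoded table; the alphabetically first category (Blue) is dropped and becomes the all-zero row:

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# Dummy encoding: drop one category to avoid the dummy variable trap.
dummies = pd.get_dummies(df, columns=["Color"], drop_first=True, dtype=int)
print(dummies)
#    Color_Green  Color_Red
# 0            0          1
# 1            1          0
# 2            0          0
```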
Outlier Detection
Z-Score (Standard Score) Method measures how far a data point is from the
mean in terms of standard deviations.
A common threshold for outliers is z > 3 or z < -3.
Interquartile Range (IQR) Method detects outliers based on the spread of the
middle 50% of the data.
Steps:
➢ Calculate Q1 (25th percentile) and Q3 (75th percentile).
➢ Compute IQR = Q3 - Q1.
➢ Define outlier thresholds:
• Lower bound = Q1 - 1.5 * IQR
• Upper bound = Q3 + 1.5 * IQR
➢ Values outside these bounds are outliers.
Outlier Detection
Dataset: [10, 12, 15, 18, 20, 22, 50]
Using Z-Score Method:
➢ Mean = 21 and Standard Deviation ≈ 12.48
➢ Z(50) = (50 - 21) / 12.48 ≈ 2.32 (Not an outlier)
Using IQR Method:
➢ Q1 = 13.5, Q3 = 21, IQR = 21 - 13.5 = 7.5
➢ Outlier thresholds:
➢ Lower bound = 13.5 - (1.5 * 7.5) = 2.25
➢ Upper bound = 21 + (1.5 * 7.5) = 32.25
➢ 50 > 32.25, so it is an outlier
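A small NumPy check of both calculations on the same dataset:

```python
import numpy as np

data = np.array([10, 12, 15, 18, 20, 22, 50])

# Z-score method (population standard deviation).
z = (data - data.mean()) / data.std()
print(z.round(2))  # z(50) ~ 2.32, below the |z| > 3 threshold

# IQR method.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # [50] flagged as an outlier
```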
Outlier Detection
Machine Learning Methods
➢ Isolation Forest: Constructs decision trees to isolate outliers (see the sketch below).
➢ DBSCAN: A clustering algorithm that treats low-density regions as outliers.
➢ One-Class SVM: Learns a decision boundary to separate normal data from
outliers.
Statistical Methods
➢ Grubbs' Test: Identifies one outlier at a time by testing if the most
extreme value is an outlier.
➢ Dixon's Q Test: For small datasets, tests whether the extreme value in the
data is an outlier.
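A minimal Isolation Forest sketch on the same small dataset; a real application would use far more observations, and the contamination value here is an assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Reuse the small example dataset as a single feature.
X = np.array([10, 12, 15, 18, 20, 22, 50]).reshape(-1, 1)

# Isolation Forest labels isolated points -1 and the rest 1.
labels = IsolationForest(contamination=0.15, random_state=0).fit_predict(X)
print(labels)  # the last point (50) is the one most likely to be labelled -1
```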