Data Science and Data Analytics
Data Cleaning and Transformation
What is Data Cleaning?
Data cleaning involves identifying and correcting (or removing) errors
and inconsistencies in data to improve its quality and prepare it for
analysis.
Objectives of Data Cleaning:
➢ Improve Data Accuracy: Ensure that the data correctly reflects
real-world values.
➢ Ensure Data Consistency: Maintain uniformity across datasets to
avoid errors caused by inconsistent formats.
➢ Handle Missing Data: Address gaps in data to prevent bias or loss of
information during analysis.
➢ Remove Irrelevant Data: Eliminate unnecessary data to focus on
relevant information for the analysis.
➢ Detect and Correct Outliers: Identify and address extreme values
that may skew analysis.
What is Data Cleaning?
Objectives of Data Cleaning (cont):
➢ Ensure Data Completeness: Verify that all necessary data points
are present to avoid gaps in analysis.
➢ Maintain Data Integrity: Ensure data conforms to defined rules or
constraints.
➢ Reduce Redundancy: Eliminate duplicate records to reduce data
size and improve analysis accuracy.
➢ Minimize Bias: Ensure that data cleaning does not introduce or
perpetuate bias.
Data Cleaning Techniques
Data Filtering involves removing irrelevant or unnecessary data from a
dataset to reduce noise and focus on the most relevant information.
Data Deduplication involves eliminating duplicate records from a
dataset, ensuring that each record is unique.
Data Imputation entails replacing missing or null values with
estimated values to maintain data integrity.
Data Standardization involves putting all data into a common format
to facilitate comparison and analysis.
Data Normalization is the process of adjusting the values of numeric
data to a common scale without distorting differences in the ranges of
values.
Data Transformation involves modifying existing data to make it more
suitable for analysis or modeling.
Data Cleaning Techniques
Outlier Detection is the process of identifying and managing values
that significantly deviate from the rest of the data, often by treating
or removing them.
Data Validation aims to check if data adheres to defined rules and
constraints, identifying and correcting inconsistencies.
Data Encoding involves converting categorical data into a numerical
format to make it compatible with machine learning algorithms.
Data Aggregation entails grouping data by category, time period, or
another criterion to obtain summarized statistics.
Data Sampling is the process of selecting a representative subset of
data to expedite analysis while preserving data integrity.
Data Filtering
Purpose of Data Filtering
➢ Eliminate noise and irrelevant information.
➢ Reduce computational complexity by working with smaller, cleaner
datasets.
➢ Focus analysis on data that aligns with specific goals or criteria.
➢ Improve the accuracy of predictive models and analytical outcomes.
How Data Filtering Works
➢ Define Filtering Criteria: Establish rules or conditions to identify which
data points to include or exclude.
➢ Apply Filters: Use software tools or programming libraries to implement
these criteria.
➢ Verify Results: Review the filtered dataset to ensure no essential
information is lost.
Data Filtering Example
Original data:
  Customer ID | Purchase Amount | Location
  101         | $120            | NYC
  102         | $85             | LAC
  103         | $95             | NYC
Filtered data (Location = NYC):
  Customer ID | Purchase Amount | Location
  101         | $120            | NYC
  103         | $95             | NYC
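A minimal pandas sketch of this filter; the column names are taken from the example table above:

```python
import pandas as pd

# Example data from the slide above.
df = pd.DataFrame({
    "Customer ID": [101, 102, 103],
    "Purchase Amount": [120, 85, 95],
    "Location": ["NYC", "LAC", "NYC"],
})

# Define the filtering criterion and apply it: keep only NYC purchases.
nyc_only = df[df["Location"] == "NYC"]

# Verify the result: only rows 101 and 103 remain.
print(nyc_only)
```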
Data Deduplication
Purpose of Data Deduplication
➢ Avoids storing multiple copies of the same data.
➢ Prevents inflated metrics caused by duplicate entries.
➢ Eliminates bias or inaccuracies in data-driven insights.
➢ Reduces processing time for large datasets.
How Data Deduplication Works
➢ Identify Duplicate Records:
• Compare records using unique identifiers (e.g., IDs, email addresses).
• Check for similarities in attributes like names, dates, or addresses.
➢ Evaluate the Data: Decide which record to keep (e.g., most recent or most
complete).
➢ Remove or Merge Records: Delete duplicate entries or combine data into a single
record.
➢ Validate Results: Ensure no useful information is accidentally deleted.
Data Deduplication Example
Original data:
  Customer ID | Name          | Email                | Birth Date
  101         | John Smith    | john@gmail.com       | 10.10.2005
  102         | John A. Smith | John.Smith@gmail.com | 10.10.2006
  103         | John Smith    | jsmith@gmail.com     | 10.10.2005
  104         | John Smith    | john@gmail.com       | 10.10.2005
Deduplicated data (104 removed as an exact duplicate of 101):
  Customer ID | Name          | Email                | Birth Date
  101         | John Smith    | john@gmail.com       | 10.10.2005
  102         | John A. Smith | John.Smith@gmail.com | 10.10.2006
  103         | John Smith    | jsmith@gmail.com     | 10.10.2005
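A short pandas sketch of this deduplication, under the simplifying assumption that two rows are duplicates only when Name, Email, and Birth Date all match exactly (as with records 101 and 104):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer ID": [101, 102, 103, 104],
    "Name": ["John Smith", "John A. Smith", "John Smith", "John Smith"],
    "Email": ["john@gmail.com", "John.Smith@gmail.com", "jsmith@gmail.com", "john@gmail.com"],
    "Birth Date": ["10.10.2005", "10.10.2006", "10.10.2005", "10.10.2005"],
})

# Identify duplicates on the descriptive columns (Customer ID always differs,
# so it is excluded) and keep the first occurrence of each duplicate group.
deduped = df.drop_duplicates(subset=["Name", "Email", "Birth Date"], keep="first")
print(deduped)  # record 104 is dropped; 101, 102, and 103 remain
```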
Types of Data Imputation Methods
Simple Imputation Techniques:
➢ Mean Imputation: Replacing missing values with the mean of the non-
missing values in the same column.
• Example: If the column values are [10, 20, NA, 40], the mean (10 + 20 + 40)/3
= 23.33 replaces the missing value.
➢ Median Imputation: Replacing missing values with the median of the
column. Useful when the data has outliers.
➢ Mode Imputation: Replacing missing categorical values with the most
frequent category (mode).
• Example: For the column ['A', 'B', 'A', NA, 'A'], replace NA with 'A'.
Domain-Specific or Logical Imputation: Using domain knowledge to
infer missing values.
• Example: If a survey has missing values for gender and most respondents
are female, assign "Female" logically.
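A minimal pandas sketch of the three simple techniques, reusing the example columns above:

```python
import pandas as pd

num = pd.Series([10, 20, None, 40])          # numeric column with a missing value
cat = pd.Series(["A", "B", "A", None, "A"])  # categorical column with a missing value

mean_imputed = num.fillna(num.mean())      # mean imputation: NA -> 23.33
median_imputed = num.fillna(num.median())  # median imputation: NA -> 20.0
mode_imputed = cat.fillna(cat.mode()[0])   # mode imputation: NA -> 'A'

print(mean_imputed.tolist())
print(median_imputed.tolist())
print(mode_imputed.tolist())
```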
Types of Data Imputation Methods
Advanced Imputation Techniques:
➢ Regression Imputation: Predicting the missing value using a regression
model built with other related variables in the dataset.
➢ K-Nearest Neighbors (KNN) Imputation: Replacing missing values by
averaging the values of the nearest neighbors in the feature space.
➢ Multiple Imputation: Generating multiple possible values for the missing
data and combining the results using statistical models.
Time Series-Specific Imputation:
➢ Forward Fill: Replace missing values with the last observed value.
• Example: For [100, NA, NA, 120], replace each NA with 100.
➢ Backward Fill: Replace missing values with the next observed value.
• Example: For [NA, NA, 120, 130], replace NA with 120.
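A brief sketch of forward/backward fill on the example series, plus a KNN imputation on a small made-up two-column table:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Time series-specific imputation on the example series.
s = pd.Series([100, None, None, 120])
forward_filled = s.ffill()    # [100, 100, 100, 120]
backward_filled = s.bfill()   # [100, 120, 120, 120]

# KNN imputation: the missing value is replaced by the average of its
# nearest neighbours in feature space (toy, made-up data).
X = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0], "f2": [2.0, None, 6.0, 8.0]})
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_imputed)
```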
Data Standardization
Importance of Data Standardization
➢ Improves Accuracy: Ensures consistent interpretation of data.
➢ Enables Integration: Makes it easier to combine datasets from multiple sources.
➢ Enhances Machine Learning: Algorithms require standardized data for optimal
performance.
➢ Facilitates Comparisons: Allows meaningful comparisons across observations.
Steps in Data Standardization
➢ Identify Inconsistent Formats: Detect fields with different formats (e.g., date,
text, currency).
➢ Define a Standard Format: Establish a consistent format for each data type (e.g.,
"YYYY-MM-DD" for dates, lowercase for text).
➢ Apply the Transformation: Convert non-standard data into the defined format.
➢ Validate Standardization: Check for remaining inconsistencies or errors after
transformation.
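A minimal pandas sketch of these steps, assuming a hypothetical column of DD.MM.YYYY dates (matching the date style used in the examples above) and inconsistently cased city names:

```python
import pandas as pd

# Hypothetical columns with inconsistent formatting (names are illustrative).
df = pd.DataFrame({
    "signup_date": ["10.10.2005", "03.11.2005", "25.12.2005"],  # DD.MM.YYYY
    "city": ["NYC", "nyc ", " New York"],
})

# Define a standard format and apply the transformation:
# ISO "YYYY-MM-DD" for dates, trimmed lowercase for text.
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%d.%m.%Y").dt.strftime("%Y-%m-%d")
df["city"] = df["city"].str.strip().str.lower()

# Validate: every date now matches the standard pattern.
assert df["signup_date"].str.match(r"\d{4}-\d{2}-\d{2}").all()
print(df)
```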
Why Normalize Data?
Improve Model Performance: Many machine learning algorithms, such as k-
nearest neighbors (KNN), support vector machines (SVM), and linear
regression, perform better when the data is normalized because these
algorithms use distance measures or optimization methods that are sensitive
to the magnitude of values.
Prevent Dominance of Larger Scales: Without normalization, features with
larger ranges (like income or age) might dominate over features with smaller
ranges (like rating or number of children), leading to biased or inaccurate
results.
Enhance Convergence in Optimization: Models using gradient-based
optimization (like neural networks) may converge faster when features are
normalized because it helps the gradient descent process behave more
smoothly.
Methods of Data Normalization
Min-Max Normalization rescales the data to a fixed range, usually [0, 1].
It adjusts the data so that the minimum value becomes 0, and the maximum
value becomes 1, maintaining the proportion between other values.
  Data | Normalized Data
  10   | 0.00
  20   | 0.25
  30   | 0.50
  40   | 0.75
  50   | 1.00
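A small scikit-learn sketch reproducing the table above; min-max normalization computes x' = (x - min) / (max - min):

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Rescale to [0, 1]: the minimum becomes 0 and the maximum becomes 1.
scaled = MinMaxScaler().fit_transform(data)
print(scaled.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```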
Methods of Data Normalization
Z-Score Normalization (Standardization) transforms the data to have a mean
of 0 and a standard deviation of 1.
It is especially useful when the data follows a Gaussian distribution.
  Data | Normalized Data
  10   | -1.41
  20   | -0.71
  30   | 0.00
  40   | 0.71
  50   | 1.41
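The same column standardized with scikit-learn; z = (x - mean) / std, where StandardScaler uses the population standard deviation (sqrt(200) ≈ 14.14 here):

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Transform to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(data)
print(scaled.ravel().round(2))  # [-1.41 -0.71  0.    0.71  1.41]
```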
When to Use Each Method
Min-Max Normalization:
➢ Ideal for algorithms that require data within a fixed range, like KNN, Neural
Networks, and SVM.
➢ Best when you know your data has a fixed or known range (e.g., pixel values in
images, ratings on a scale of 1-5).
Z-Score Normalization:
➢ Preferred when the data is not restricted to a fixed range and has a Gaussian
distribution.
➢ Useful for algorithms that assume normally distributed data, like Linear Regression,
Logistic Regression, and Principal Component Analysis (PCA).
Methods of Data Normalization
Robust Scaling is a data preprocessing technique used to normalize features
by removing the median and scaling the data according to the interquartile
range (IQR).
It is particularly effective for datasets containing outliers, as it is less
sensitive to extreme values compared to methods like standardization (z-
score normalization) or min-max scaling.
Methods of Data Normalization
When to Use Robust Scaling
➢ Presence of Outliers
➢ Skewed Data
When Not to Use Robust Scaling
➢ No Outliers: If the data does not have significant outliers, standardization
or min-max scaling may suffice and are simpler.
➢ Small Datasets: If there is insufficient data, the median and IQR may not
be representative, leading to biased scaling.
➢ Data Without a Well-Defined Median: Robust scaling might not be
effective for datasets where median and quartiles are not meaningful,
such as categorical data.
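A brief RobustScaler sketch on a made-up column containing one extreme value; robust scaling computes x' = (x - median) / IQR, so the outlier barely affects how the ordinary values are scaled:

```python
from sklearn.preprocessing import RobustScaler
import numpy as np

# Illustrative column with a single extreme value (1000).
data = np.array([[10.0], [12.0], [15.0], [18.0], [20.0], [22.0], [1000.0]])

# Centre on the median and scale by the interquartile range.
scaled = RobustScaler().fit_transform(data)
print(scaled.ravel().round(2))
```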
Data Encoding Methods
Label Encoding (Integer Encoding) converts each category into a
unique integer value. This method is simple, but the assigned integers
imply an ordering that does not exist, which can mislead algorithms
that treat numeric values as ordinal.
➢ Example: Red → 0, Blue → 1, Green → 2
Ordinal Encoding is used when the categories have a natural order or
ranking. It assigns integers to categories based on their order.
➢ Example: For a "Size" column with values Small, Medium, and Large, the
ordinal encoding would be:
➢ Small → 0, Medium → 1, Large → 2
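A short scikit-learn sketch of both encodings. Note that LabelEncoder assigns codes alphabetically, so its integers differ from the Red → 0, Blue → 1, Green → 2 example above, while OrdinalEncoder lets us state the Small < Medium < Large order explicitly:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

colors = pd.Series(["Red", "Blue", "Green", "Red"])
sizes = pd.DataFrame({"Size": ["Small", "Large", "Medium", "Small"]})

# Label encoding: one arbitrary integer per category (alphabetical order).
label_codes = LabelEncoder().fit_transform(colors)

# Ordinal encoding with an explicit, meaningful category order.
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = ordinal.fit_transform(sizes)  # Small -> 0, Medium -> 1, Large -> 2

print(label_codes, size_codes.ravel())
```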
Data Encoding Methods
One-Hot Encoding creates binary columns for each category in the feature.
Each category is represented as a vector with a '1' for the category it
represents and '0' for all others.
➢ Example:
  Color | Red | Green | Blue
  Red   | 1   | 0     | 0
  Green | 0   | 1     | 0
  Blue  | 0   | 0     | 1
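A one-call pandas version of the table above (the generated columns come out in alphabetical order):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df, columns=["Color"], dtype=int)
print(one_hot)
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
```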
Data Encoding Methods
Dummy encoding is a technique for converting categorical variables into
numerical values, similar to one-hot encoding, but with a slight variation.
In dummy encoding, one category is dropped from the encoding, and the
remaining categories are represented using binary (0 or 1) variables.
Dummy encoding avoids the dummy variable trap, which occurs when all
categories are encoded, leading to perfect multicollinearity in regression
models.
➢ Example:
  Color | Red | Green
  Red   | 1   | 0
  Green | 0   | 1
  Blue  | 0   | 0
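The same pandas call with drop_first=True produces the dummy-encoded table; the alphabetically first category (Blue) is dropped and becomes the all-zero row:

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# Dummy encoding: drop one category to avoid the dummy variable trap.
dummies = pd.get_dummies(df, columns=["Color"], drop_first=True, dtype=int)
print(dummies)
#    Color_Green  Color_Red
# 0            0          1
# 1            1          0
# 2            0          0
```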
Outlier Detection
Z-Score (Standard Score) Method measures how far a data point is from the
mean in terms of standard deviations.
A common threshold for outliers is z > 3 or z < -3.
Interquartile Range (IQR) Method detects outliers based on the spread of the
middle 50% of the data.
Steps:
➢ Calculate Q1 (25th percentile) and Q3 (75th percentile).
➢ Compute IQR = Q3 - Q1.
➢ Define outlier thresholds:
• Lower bound = Q1 - 1.5 * IQR
• Upper bound = Q3 + 1.5 * IQR
➢ Values outside these bounds are outliers.
Outlier Detection
Dataset: [10, 12, 15, 18, 20, 22, 50]
Using Z-Score Method:
➢ Mean = 21 and Standard Deviation ≈ 12.48
➢ Z(50) = (50 - 21) / 12.48 ≈ 2.32 (Not an outlier)
Using IQR Method:
➢ Q1 = 13.5, Q3 = 21, IQR = 21 - 13.5 = 7.5
➢ Outlier thresholds:
➢ Lower bound = 13.5 - (1.5 * 7.5) = 2.25
➢ Upper bound = 21 + (1.5 * 7.5) = 32.25
➢ 50 > 32.25, so it is an outlier
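A small NumPy check of both calculations on the same dataset:

```python
import numpy as np

data = np.array([10, 12, 15, 18, 20, 22, 50])

# Z-score method (population standard deviation).
z = (data - data.mean()) / data.std()
print(z.round(2))  # z(50) ~ 2.32, below the |z| > 3 threshold

# IQR method.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # [50] flagged as an outlier
```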
Outlier Detection
Machine Learning Methods
➢ Isolation Forest: Constructs decision trees to isolate outliers (see the sketch below).
➢ DBSCAN: A clustering algorithm that treats low-density regions as outliers.
➢ One-Class SVM: Learns a decision boundary to separate normal data from
outliers.
Statistical Methods
➢ Grubbs' Test: Identifies one outlier at a time by testing if the most
extreme value is an outlier.
➢ Dixon's Q Test: For small datasets, tests whether the extreme value in the
data is an outlier.
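A minimal Isolation Forest sketch on the same small dataset; a real application would use far more observations, and the contamination value here is an assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Reuse the small example dataset as a single feature.
X = np.array([10, 12, 15, 18, 20, 22, 50]).reshape(-1, 1)

# Isolation Forest labels isolated points -1 and the rest 1.
labels = IsolationForest(contamination=0.15, random_state=0).fit_predict(X)
print(labels)  # the last point (50) is the one most likely to be labelled -1
```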