Unit 2 Notes

The document discusses the importance of handling missing values and outliers in datasets, outlining various types of missing data and techniques for imputation, such as deletion, mean imputation, and advanced methods like KNN and EM algorithms. It also covers the identification and management of outliers, including statistical methods and machine learning approaches for detection and handling. Best practices for data type conversion and ensuring consistency in datasets are highlighted to improve data quality and analysis accuracy.


1. Understanding Missing Values


Missing values in a dataset can arise due to several reasons:

●​ Human error: Inaccurate data entry or missing responses in surveys.


●​ Technical issues: Sensor failures, data corruption.
●​ Intentional omissions: Participants may intentionally skip questions (e.g., sensitive information).

Proper handling of missing values is crucial as they can:

●​ Bias analysis results.


●​ Reduce model accuracy.
●​ Impact data integrity.

2. Types of Missing Data


Missing Completely at Random (MCAR):

o​ Missing values are independent of any variables.


o Example: Sensor failures occurring at random intervals.
o​ Solution: Deletion or imputation.

Missing at Random (MAR):

o​ Missing values depend on other observed variables, but not on the missing variable itself.
o​ Example: Older participants skipping online forms due to unfamiliarity with technology.
o​ Solution: Statistical imputation (mean, regression, etc.).

Missing Not at Random (MNAR):

o​ Missing values depend on the variable with missing data itself.


o​ Example: People with lower income might avoid answering salary questions.
o​ Solution: Complex modeling techniques (e.g., MICE, EM algorithm).

3. Techniques to Handle Missing Values


A. Deletion Techniques
Listwise Deletion (Complete Case Analysis):

o​ Removes rows with any missing values.


o​ Pros: Simple to implement.
o​ Cons: Risk of losing a significant portion of data, leading to bias if data is not MCAR.
o​ Suitable for: Large datasets with sparse missing values.

Column Deletion:

o​ Removes entire columns with high proportions of missing values.


o​ Pros: Useful when the column adds little value to analysis.
o​ Cons: Loss of potentially useful information.

B. Simple Imputation Techniques


Filling with a Constant Value:

o​ Replace missing values with a fixed constant (e.g., 0, Unknown).


o​ Pros: Useful for categorical data or placeholders.
o​ Cons: May introduce artificial patterns.

Mean Imputation:

o​ Replace missing values with the mean of the column.


o​ Pros: Retains the column's overall central tendency.
o Cons: Reduces variability; assumes data is MCAR.

Median Imputation:

o​ Replace missing values with the median of the column.


o​ Pros: Robust to outliers compared to mean imputation.
o​ Cons: May not capture the true distribution of the data.

Mode Imputation:

o​ Replace missing values with the most frequent value in the column.
o​ Pros: Works well for categorical data.
o​ Cons: May distort the distribution if the mode is over-represented.
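As an illustration of the simple imputation techniques above, here is a minimal pandas sketch; the DataFrame, column names, and fill values are hypothetical, and pandas is assumed to be available:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "city": ["Pune", "Mumbai", None, "Pune", "Pune"],
})

df["age_mean"] = df["age"].fillna(df["age"].mean())        # mean imputation
df["age_median"] = df["age"].fillna(df["age"].median())    # median imputation
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])  # mode imputation (categorical)
df["city_const"] = df["city"].fillna("Unknown")            # constant placeholder
```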

C. Advanced Imputation Techniques


K-Nearest Neighbors (KNN) Imputation:

o​ Finds the k closest rows (neighbors) and imputes the missing value as the mean (or weighted mean)
of the neighbors.
o​ Pros:

▪​ Retains relationships between variables.


▪​ Suitable for numerical data with clusters.

o​ Cons:

▪​ Computationally expensive for large datasets.


▪​ Sensitive to scaling and choice of k.
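A minimal sketch of KNN imputation, assuming scikit-learn's KNNImputer is available; the array, the choice of k = 2, and the omission of scaling are illustrative assumptions only:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# KNN imputation is sensitive to feature scale; standardize first in practice.
imputer = KNNImputer(n_neighbors=2)   # k is a tunable choice
X_imputed = imputer.fit_transform(X)  # missing cells replaced by neighbour means
```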

Iterative Imputation (e.g., MICE):

o​ Uses a regression model to predict missing values iteratively based on other variables.
o​ Each feature with missing values is modeled as a dependent variable.
o​ Pros:

▪​ Handles complex relationships between variables.


▪​ Provides multiple imputations to reflect uncertainty.

o​ Cons:

▪​ Computationally intensive.
▪​ Requires more expertise to implement effectively.
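A minimal sketch of iterative (MICE-style) imputation using scikit-learn's IterativeImputer, assuming a recent scikit-learn version; the data and parameters are illustrative:

```python
import numpy as np
# IterativeImputer is still flagged experimental, so this import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, 9.0],
              [np.nan, 8.0, 1.0]])

# Each feature with missing values is regressed on the other features, iteratively.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```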

Expectation-Maximization (EM) Algorithm:

o​ Iteratively estimates missing data by finding the most likely values based on probability distributions.
o The Expectation-Maximization (EM) algorithm is an iterative optimization
method for finding maximum likelihood or maximum a posteriori estimates of
parameters in statistical models that involve unobserved latent variables. It is
commonly used for latent variable models and can handle missing data. It
alternates between an expectation step (E-step) and a maximization step
(M-step), iteratively improving the model fit.
o​ Pros:

▪​ Works well for data that follows known distributions.


▪​ Accounts for uncertainty.

o​ Cons:

▪​ Assumes the data fits a specific distribution.


▪​ Computational complexity.
D. Custom and Domain-Specific Techniques
Forward-Fill and Backward-Fill:

o Forward Fill (ffill): Propagates the last observed value forward until the next
non-missing value. Useful when you assume the missing value should carry
forward the last observed value.


o Backward Fill (bfill): Propagates the next observed value backward to fill earlier
gaps. Useful when you assume future data points can be used to fill earlier gaps.

o​ Pros:

▪​ Useful for time-series data.


▪ Preserves temporal relationships.

o​ Cons:

▪​ May propagate incorrect or stale values.
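A minimal pandas sketch of forward- and backward-fill on a time series; the dates and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
s = pd.Series([10.0, np.nan, np.nan, 14.0, np.nan],
              index=pd.date_range("2023-01-01", periods=5, freq="D"))

filled_forward = s.ffill()   # carries 10.0 forward into the two gaps after it
filled_backward = s.bfill()  # pulls 14.0 backward into the gaps before it
```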

Interpolation:

o​ Estimates missing values using interpolation techniques (e.g., linear, spline).


o​ Pros:

▪​ Captures trends in numerical data.


▪​ Suitable for time-series or sequential data.

o​ Cons:

▪​ Assumes smooth trends, which may not always hold.
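A short pandas sketch of interpolation; linear interpolation works out of the box, while the spline variant assumes SciPy is installed:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

linear = s.interpolate(method="linear")           # fills 2.0, 3.0 and 5.0 on straight lines
spline = s.interpolate(method="spline", order=2)  # smoother curve; requires SciPy
```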

Domain Knowledge-Based Imputation:

o​ Use domain expertise to fill missing values (e.g., replace missing ages in a medical dataset with
typical values based on the condition).
o​ Pros:

▪​ Highly tailored to the data.

o​ Cons:

▪​ Requires extensive domain expertise.

4. Evaluating the Effectiveness of Imputation


●​ Visual Inspection: Compare distributions before and after imputation.
●​ Model Performance: Train predictive models and evaluate their performance.
●​ Multiple Imputations: Compare results from different imputation methods to assess variability.

5. Best Practices
●​ Analyze the nature and pattern of missing data before selecting a technique.
●​ Use visualization (e.g., heatmaps, as sketched below) to understand missing data patterns.
●​ Avoid over-reliance on deletion techniques unless missing data is sparse.
●​ Consider advanced imputation for datasets with complex interdependencies.
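A small sketch of the heatmap idea mentioned above, assuming seaborn and matplotlib are available and using a hypothetical DataFrame:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset with scattered missing entries
df = pd.DataFrame({"a": [1, np.nan, 3, np.nan],
                   "b": [np.nan, 2, 3, 4],
                   "c": [1, 2, np.nan, 4]})

sns.heatmap(df.isna(), cbar=False)  # missing cells show up as bright bands
plt.title("Missing-value pattern")
plt.show()
```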

Alternative notes on the same topic:

Introduction to Missing Values
●​ Definition: Missing Values (MVs) are incomplete entries in datasets due to errors during recording, manual
data entry, or incorrect measurements.
●​ Impact of MVs:
1.​ Loss of efficiency in data analysis.
2.​ Increased difficulty in handling and analyzing datasets.
3.​ Bias in results, reducing generalizability​​.

Techniques for Handling Missing Values


1. Case Deletion

●​ Description: Discards instances with missing values.


●​ Advantages: Unbiased parameter estimates under the MCAR (Missing Completely at Random) assumption.
●​ Limitations:
o​ Reduces dataset size significantly.
o​ Introduces bias if data is not MCAR​.

2. Single Imputation

●​ Overview: Missing values are filled with a single estimate.


●​ Methods:

1.​ Mean/Mode/Median Substitution:

▪​ Replace MVs with the column mean (numerical) or mode (categorical).


▪​ Advantages: Simple and widely used.
▪​ Limitation: Reduces variability and introduces bias in correlations.

2.​ Hot Deck Imputation:

▪​ A random similar instance from the dataset fills the MV.


▪​ Modern versions cluster data before selecting a replacement.

3.​ Cold Deck Imputation:

▪​ Replacements come from external datasets​​.

3. Multiple Imputation

●​ Description: Generates multiple datasets by imputing MVs differently in each.


●​ Approach:
o​ Each dataset is analyzed separately.
o​ Final analysis combines results.
●​ Advantage: Accounts for uncertainty in imputation.
●​ Common Frameworks:

o​ Multivariate Imputation by Chained Equations (MICE).


o​ Expectation-Maximization (EM) algorithm​​.

4. Maximum Likelihood Imputation

●​ Principle: Assumes data follows a known probability distribution.


●​ Steps:
o​ Estimate distribution parameters (e.g., mean, variance).
o​ Use parameters to impute MVs.

●​ Challenges: Complex when distributions are multi-dimensional or irregular​​.

5. Machine Learning-Based Imputation

●​ Overview: Employs predictive models for imputation.


●​ Methods:

1.​ K-Nearest Neighbors (KNN) Imputation:

▪​ Finds the K nearest neighbors based on non-missing attributes.


▪​ Imputes MVs as averages or modes of neighbors.
▪​ Limitation: Sensitive to K-value and data scaling​​.

2.​ Support Vector Machine Imputation (SVMI):

▪​ Treats missing attributes as targets in an SVM regression.

3.​ Fuzzy K-Means Clustering:

▪​ Uses membership values for clusters to determine imputed values.


▪​ Suitable for datasets with overlapping clusters​​.

6. Bayesian and Statistical Methods

●​ Bayesian Networks: Models dependencies between variables to infer MVs.


●​ Expectation-Maximization (EM):

o​ Iteratively estimates distribution parameters and imputes MVs.


o​ Advantage: Works well with Gaussian distributions​.

7. Recent Advances

●​ Hybrid Models:

o​ Combine multiple methods, e.g., fuzzy clustering + regression.

●​ Ensemble Methods:

o​ Use algorithms like Random Forests for robust imputation.

●​ Neural Network Approaches:

o​ Predict missing values using deep learning models​​.

Guidelines for Choosing an Imputation Method

1.​ Data Type: Consider the nature of the dataset (e.g., numerical, categorical).
2.​ Percentage of MVs:

o​ Low percentages: Simpler methods like mean imputation are sufficient.


o​ High percentages: Advanced methods (e.g., ML-based) may be required.

3.​ Data Correlations: Use advanced techniques when attributes have strong interdependencies.
4.​ Computational Resources: Complex methods (e.g., Bayesian, ML-based) require more resources
1. Identifying and Handling Outliers
1.1 What are Outliers?
An outlier is a data point that significantly deviates from the rest of the data. It can be
either much higher or much lower than the other data points, and its presence can have a
significant impact on the results of machine learning algorithms. They can be caused by
measurement or execution errors. The analysis of outlier data is referred to as outlier
analysis or outlier mining. They may arise due to:
●​ Data entry errors (typos, incorrect values)
●​ Measurement errors (sensor malfunctions, incorrect readings)
●​ Natural variability (genuine but rare occurrences)

Types of Outliers

There are two main types of outliers:


●​ Global outliers: Global outliers are isolated data points that are far away from the
main body of the data. They are often easy to identify and remove.
●​ Contextual outliers: Contextual outliers are data points that are unusual in a specific
context but may not be outliers in a different context. They are often more difficult to
identify and may require additional information or domain knowledge to determine
their significance.

Algorithm (cluster-based outlier detection)

1.​ Calculate the mean of each cluster


2.​ Initialize the Threshold value
3.​ Calculate the distance of the test data from each cluster mean
4.​ Find the nearest cluster to the test data
5.​ If (Distance > Threshold), flag the test data point as an outlier.
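A rough sketch of the cluster-distance idea above, using scikit-learn's KMeans; the data, number of clusters, and threshold value are all assumptions chosen for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))               # main body of the data
test_points = np.array([[0.1, 0.2], [8.0, 8.0]])  # the second point lies far away

# Step 1: cluster means; Step 2: threshold (tuned per dataset)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
threshold = 3.0

# Steps 3-5: distance to every cluster mean, nearest cluster, compare with threshold
dists = np.linalg.norm(test_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
nearest_dist = dists.min(axis=1)
is_outlier = nearest_dist > threshold   # e.g. [False, True]
```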

1.2 Outlier Detection Methods in Machine Learning


Outlier detection plays a crucial role in ensuring the quality and accuracy of machine
learning models. By identifying and removing or handling outliers effectively, we can
prevent them from biasing the model, reducing its performance, and hindering its
interpretability. Here’s an overview of various outlier detection methods:

1. Statistical Methods:

●​ Z-Score (Standard Score Method): Calculates how many standard deviations a data
point is from the mean and flags values whose Z-scores exceed a certain threshold
(typically 3 or -3).

o Formula: Z = (X − μ) / σ, where X is the data point, μ is the mean, and σ is the
standard deviation.
o If |Z| > 3, the value is considered an outlier.

●​ Interquartile Range (IQR): Uses quartiles to detect outliers, where Q1 and Q3 are the
first and third quartiles.

o Formula: IQR = Q3 − Q1.
o Outliers are values outside [Q1 − k × IQR, Q3 + k × IQR], where k is a factor
(typically 1.5).
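A minimal NumPy sketch of both statistical rules, on synthetic data with one injected extreme value:

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.append(rng.normal(50, 5, size=200), [120.0])  # 120 is an injected outlier

# Z-score rule: |Z| > 3
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]
```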

Machine Learning Approaches

●​ Clustering-based Outlier Detection


o​ Uses algorithms like DBSCAN to find points that do not belong to any dense cluster.
●​ Isolation Forest

o​ Identifies anomalies by randomly selecting features and checking data point isolation.
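A short sketch of both machine learning approaches using scikit-learn; the contamination fraction and DBSCAN parameters are illustrative assumptions that need tuning in practice:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])  # one far-away point

# Isolation Forest: predicted label -1 marks anomalies
iso_labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)

# DBSCAN: points that belong to no dense cluster get the noise label -1
db_labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)
outlier_mask = db_labels == -1
```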

1.3 Techniques for Handling Outliers in Machine Learning


Outliers, data points that significantly deviate from the majority, can have detrimental
effects on machine learning models. To address this, several techniques can be employed
to handle outliers effectively:

1. Removal:

●​ This involves identifying and removing outliers from the dataset before training the
model. Common methods include:
o​ Thresholding: Outliers are identified as data points exceeding a certain
threshold (e.g., Z-score > 3).
o​ Distance-based methods: Outliers are identified based on their distance from
their nearest neighbors.
o​ Clustering: Outliers are identified as points not belonging to any cluster or
belonging to very small clusters.
2. Transformation:

●​ This involves transforming the data to reduce the influence of outliers. Common
methods include:
o​ Scaling: Standardizing or normalizing the data to have a mean of zero and a
standard deviation of one.
o​ Winsorization: Replacing outlier values with the nearest non-outlier value.
o​ Log transformation: Applying a logarithmic transformation to compress the
data and reduce the impact of extreme values.
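A brief NumPy sketch of two of these transformations; the 5th/95th percentile limits for winsorization are an assumed choice:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 250.0])  # 250 is an extreme value

# Winsorization: clip values to chosen percentile limits
low, high = np.percentile(x, [5, 95])
x_winsorized = np.clip(x, low, high)

# Log transform: compresses large values (log1p also handles zeros safely)
x_logged = np.log1p(x)
```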

3. Robust Estimation:

●​ This involves using algorithms that are less sensitive to outliers. Some examples
include:
o​ Robust regression: Algorithms like L1-regularized regression or Huber
regression are less influenced by outliers than least squares regression.
o​ M-estimators: These algorithms estimate the model parameters based on a
robust objective function that down-weights the influence of outliers.
o​ Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are
less susceptible to the presence of outliers than K-means clustering.

4. Modeling Outliers:

●​ This involves explicitly modeling the outliers as a separate group. This can be done
by:
o​ Adding a separate feature: Create a new feature indicating whether a data
point is an outlier or not.
o​ Using a mixture model: Train a model that assumes the data comes from a
mixture of multiple distributions, where one distribution represents the
outliers.

2. Data Types Conversion and Consistency


2.1 Why Convert Data Types?
●​ Data from multiple sources may use different formats (e.g., “Male” vs. “M”).
●​ Machine learning algorithms often require numerical input.
●​ Ensures data consistency and reduces errors in analysis.

2.2 Common Data Conversions


Numerical to Categorical (Discretization)

●​ Example: Converting age into bins (0-18 = "Child", 19-59 = "Adult", 60+ = "Senior").
●​ Methods:

o​ Equal-width binning: Divides range into equal intervals.


o​ Equal-frequency binning: Each bin has an equal number of observations.
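A minimal pandas sketch of discretization, reusing the age-group example above; pd.cut gives equal-width (or custom) bins and pd.qcut gives equal-frequency bins:

```python
import pandas as pd

ages = pd.Series([5, 12, 17, 25, 34, 42, 58, 63, 71, 80])

# Custom bins matching the age-group example
groups = pd.cut(ages, bins=[0, 18, 59, 120], labels=["Child", "Adult", "Senior"])

# Equal-width binning: 3 intervals of equal width
equal_width = pd.cut(ages, bins=3)

# Equal-frequency binning: 3 bins with roughly the same number of observations
equal_freq = pd.qcut(ages, q=3)
```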

Categorical to Numerical

●​ One-Hot Encoding: Converts categories into binary columns.

o​ Example: "Red", "Blue", "Green" → (1,0,0), (0,1,0), (0,0,1).

●​ Label Encoding: Assigns numerical values to categories.


o​ Example: "Low" → 0, "Medium" → 1, "High" → 2.
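A small pandas sketch of both encodings; the columns and the ordinal mapping are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["Red", "Blue", "Green", "Red"],
                   "size": ["Low", "High", "Medium", "Low"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Label encoding via an explicit ordinal mapping
size_map = {"Low": 0, "Medium": 1, "High": 2}
df["size_encoded"] = df["size"].map(size_map)
```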
Text to Numeric Conversion

●​ TF-IDF (Term Frequency-Inverse Document Frequency)


●​ Word Embeddings (Word2Vec, GloVe, BERT)
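A minimal TF-IDF sketch, assuming scikit-learn; the tiny corpus is invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data cleaning removes noise",
        "missing data requires imputation",
        "outliers distort data analysis"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse document-term matrix
terms = vectorizer.get_feature_names_out()     # vocabulary learned from the corpus
```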

Date Formatting

●​ Converting "12/31/2023" → "2023-12-31" for consistency.


●​ Extracting features like day of the week, month, year.
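A short pandas sketch of date parsing and feature extraction; the input format string is an assumption about the source data:

```python
import pandas as pd

dates = pd.Series(["12/31/2023", "01/15/2024"])

parsed = pd.to_datetime(dates, format="%m/%d/%Y")
iso = parsed.dt.strftime("%Y-%m-%d")   # "2023-12-31", "2024-01-15"

# Derived features
day_of_week = parsed.dt.day_name()
month = parsed.dt.month
year = parsed.dt.year
```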

3. Dealing with Missing Values and Noisy Data


3.1 Types of Missing Data
●​ MCAR (Missing Completely at Random) – No pattern, data is randomly missing.
●​ MAR (Missing at Random) – Missing values depend on observed variables.
●​ MNAR (Missing Not at Random) – Data is missing due to specific reasons.

3.2 Handling Missing Data


1. Deletion

●​ Remove rows with missing values (only if missing data is small).

2. Imputation Methods

●​ Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode.
●​ K-Nearest Neighbors (KNN) Imputation: Finds similar instances and imputes missing values based on
neighbors.
●​ Machine Learning-based Imputation:

o​ Regression models (predict missing values based on other features).


o​ Multiple Imputation (generates several plausible values and averages them).

3.3 Dealing with Noisy Data


●​ Smoothing Techniques:

o​ Binning: Grouping values into bins and replacing them with bin means.
o​ Regression: Fitting a function to the data to remove noise.
o​ Clustering: Replacing noisy points with cluster centroids.
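A minimal sketch of smoothing by bin means with pandas; the values and the choice of three equal-frequency bins are illustrative:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Group values into equal-frequency bins, then replace each value with its bin mean
bins = pd.qcut(values, q=3)
smoothed = values.groupby(bins).transform("mean")
```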

4. Data Integration

4.1 Data Integration

Definition
Data integration is the process of combining information from different sources into a unified
dataset. Common techniques include schema mapping (standardizing column names), record
linkage (matching records across datasets using unique identifiers), and fuzzy matching (using
algorithms like Jaccard similarity or Levenshtein distance). A poorly executed integration
process can lead to:

●​ Redundancies → increases data size and slows down processing.


●​ Inconsistencies → reduces accuracy in data mining (DM) models.
●​ Schema mismatches → different structures across datasets require reconciliation.

Challenges in Data Integration


●​ Schema Matching: Different sources may use different formats, names, or structures, making it difficult to
align datasets.
●​ Tuple Duplication: Records from different sources may represent the same entity but have different values.
●​ Redundant Attributes: Some attributes may provide duplicate information.
●​ Data Correlation: Overlapping information may exist across multiple fields.

Solutions & Approaches


●​ Data Mapping: A structured framework defining how instances should be arranged.
●​ In-Database Mining vs. External Processing:

o​ In-Database Mining: Directly processes data within a database (avoids extraction but requires
repeated preprocessing).
o​ External Processing: Extracting data into a separate file for preprocessing before applying DM
techniques. This approach is usually faster and more flexible.

4.2 Finding Redundant Attributes


Why Remove Redundant Attributes?
●​ Improves model performance by reducing dimensionality.
●​ Reduces computational complexity.

Detection Methods

●​ Correlation Analysis: Correlation is a statistical measure of the extent to which two
variables are linearly related (i.e., change together at a constant rate).

o High correlation (e.g., Pearson correlation > 0.9) suggests redundancy.

●​ Feature Importance from Machine Learning Models:

o​ Decision Trees, Random Forests, and LASSO regression can identify unimportant features.

●​ Principal Component Analysis (PCA):

o​ Reduces redundant attributes by transforming data into uncorrelated components.
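A small sketch of correlation-based redundancy detection with pandas; the columns are hypothetical (height in two different units is deliberately redundant), and 0.9 is the assumed threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=100)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,            # redundant: same quantity, different unit
    "weight_kg": rng.normal(70, 8, size=100),
})

corr = df.corr().abs()
# Keep only the upper triangle so each attribute pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]  # ['height_in']
```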

Definition
A redundant attribute is one that can be derived from another attribute or set of attributes.
Redundancy increases data size, processing time, and can cause overfitting.
Causes of Redundant Attributes
●​ Naming inconsistencies (e.g., “Salary” vs. “Income”).
●​ Attributes storing correlated information (e.g., “Age” and “Year of Birth”).
●​ Measurement units (e.g., “Height in cm” vs. “Height in inches”).

Methods for Detecting Redundancy


1. Correlation Analysis

Measures the dependency between attributes to determine whether one attribute can be removed.
Types of correlation: positive linear (the y-axis variable increases as the x-axis variable
increases), negative linear, non-linear (curvilinear), and no correlation.
2.​ χ² (Chi-Squared) Correlation Test (for Categorical Data)

Pearson's chi-square test evaluates the relationship between categorical variables by comparing
observed frequencies with expected frequencies. It is used when attributes are nominal
(categorical). Key terms:

●​ Observed frequencies: the actual counts of occurrences in each category.
●​ Expected frequencies: the counts expected if the null hypothesis is true.
●​ Chi-square statistic: a measure of how much the observed frequencies deviate from the
expected frequencies.
●​ P-value: the probability of observing the data assuming the null hypothesis is true; it
indicates the strength of the evidence against the null hypothesis.
●​ Null hypothesis (H₀): a statistical assumption about a population parameter (here, that the
two attributes are independent) that is taken to be true until there is sufficient evidence to
reject it.
●​ Alternative hypothesis (H₁): the competing hypothesis; hypothesis testing evaluates the
alternative hypothesis against the null hypothesis.

●​ Steps:

1.​ Create a contingency table for the two attributes (A and B): a table showing the distribution of
one variable in rows and the other in columns, used to study the correlation between them.
2.​ Compute the χ² statistic:

χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ,   where eᵢⱼ = (count(A = aᵢ) × count(B = bⱼ)) / m

▪​ oᵢⱼ = observed frequency of (Aᵢ, Bⱼ)
▪​ eᵢⱼ = expected frequency of (Aᵢ, Bⱼ)
▪​ m = total number of instances

3.​ Compare the χ² value with a threshold from a chi-square statistical table (for the appropriate
degrees of freedom).
4.​ If χ² is significant, the attributes are correlated, and one of them may be redundant.
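A minimal sketch of the χ² test using SciPy on a hypothetical contingency table:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical attributes A and B
df = pd.DataFrame({
    "A": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
    "B": ["x",   "x",   "y",  "y",  "x",   "y",  "y",   "x"],
})

table = pd.crosstab(df["A"], df["B"])              # contingency table of observed counts
chi2, p_value, dof, expected = chi2_contingency(table)

# A small p-value (e.g. < 0.05) suggests the attributes are correlated,
# so one of them may be redundant.
```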

3. Pearson Correlation Coefficient (for Numerical Data)


Measures the strength and direction of a linear relationship between two numerical attributes.

●​ Values Interpretation:

o​ r > 0 → Positive correlation (both increase together).


o​ r < 0 → Negative correlation (one increases, the other decreases).
o​ r = 0 → No correlation.

●​ High correlation (|r| ≈ 1) means one attribute might be removed.


4. Covariance Analysis
Indicates how two numerical attributes vary together.

●​ Cov(A, B) > 0 → Attributes increase/decrease together.


●​ Cov(A, B) < 0 → One increases while the other decreases.
●​ Cov(A, B) = 0 → No relationship, but does not guarantee independence.

4.3 Detecting Tuple Duplication and Inconsistency


Definition
Duplicate tuples occur when the same entity appears multiple times in the dataset. Inconsistencies
arise when data values conflict due to input errors or variations across sources.
Causes of Duplication & Inconsistency
●​ Denormalized tables (to optimize joins) may repeat records.
●​ Data entry errors (e.g., typos in unique identifiers).
●​ Different measurement units (e.g., kg vs. lbs).
●​ Variations in naming conventions (e.g., “John Doe” vs. “Doe, John”).

Techniques for Detecting Duplicate Tuples


1. String-Based Similarity Measures (for Nominal Attributes)
●​ Edit Distance (Levenshtein Distance): Minimum number of character operations (insert, delete, replace)
needed to transform one string into another.
●​ Affine Gap Distance: Extends edit distance to account for truncation errors.
●​ Jaro Distance: Measures common characters and transpositions between two strings.
●​ Q-Gram Similarity: Splits strings into substrings of length q and compares them.
●​ Token-Based Similarity: Used when word order varies (e.g., “Smith, John” vs. “John Smith”).
●​ Phonetic Matching: Algorithms like Metaphone and ONCA compare strings based on pronunciation.
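A compact pure-Python sketch of the Levenshtein (edit) distance using dynamic programming; the sample strings are illustrative:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    dp = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ca != cb))  # substitution (free if characters match)
            prev = cur
    return dp[-1]

print(edit_distance("Smith", "Smyth"))         # 1
print(edit_distance("John Doe", "Doe, John"))  # larger, since word order differs
```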

2. Numeric Similarity Measures


●​ Encoding numbers as strings for similarity checks.
●​ Range-based thresholding (e.g., values within 1% are considered duplicates).
●​ Cosine Similarity for Numeric Data: Measures the cosine of the angle between vectors.

4.4 Approaches for Detecting Duplicate Instances


1. Probabilistic Approaches
●​ Fellegi-Sunter Model (Bayesian Inference):

o​ Models duplicate detection as a probability estimation problem.


o​ Uses Expectation-Maximization (EM) to estimate parameters.
o​ Bayes Decision Rule determines if an instance is unique or duplicated.

2. Supervised & Semi-Supervised ML Approaches


●​ Classification Models: Decision Trees (CART), SVM, etc.
●​ Clustering-Based Methods:

o​ Graph partitioning groups similar instances.


o​ Clustering bootstrapping refines duplicate detection over iterations.
3. Distance-Based Approaches
●​ Uses similarity metrics (edit distance, weighted modifications, etc.).
●​ Ranking Similar Instances: Sorts and selects least duplicated tuples.

4. Unsupervised Clustering Techniques


●​ Hierarchical Graph Models: Represent attributes as “match/no-match” binary variables.
●​ Density-Based Clustering (DBSCAN): Groups similar instances.

Key Takeaways
✔ Data Integration is critical for ensuring consistency and accuracy in data mining.​
✔ Redundant attributes increase processing time and cause overfitting. Correlation and
covariance analysis help detect redundancy.​
✔ Duplicate tuples and inconsistencies arise from various sources, including data entry errors,
different measurement units, and schema mismatches.​
✔ Various approaches exist for duplicate detection, including probabilistic models, supervised
ML, and distance-based techniques.


6. Detecting Tuple Duplication and Inconsistency


6.1 Types of Duplicates
●​ Exact Duplicates: Identical records appearing multiple times.
●​ Near Duplicates: Slight variations in records due to typos or inconsistent formatting.

6.2 Detecting Duplicate Tuples


●​ Exact Matching: Checking for identical values.
●​ Fuzzy Matching: Using algorithms like:

o​ Levenshtein Distance (measures character differences).


o​ Jaccard Similarity (measures common elements).
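A short sketch combining exact and fuzzy duplicate checks; pandas handles exact matches after basic standardization, while difflib (standard library) gives a simple similarity ratio. The names and the 0.85 threshold are assumptions:

```python
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({"name": ["John Doe", "john doe ", "Jane Roe", "John Doe"]})

# Exact duplicates after standardization (lowercase, trimmed spaces)
clean = df["name"].str.lower().str.strip()
exact_dupes = df[clean.duplicated(keep=False)]

# Near duplicates via a similarity ratio
def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(similar("Jon Doe", "John Doe"))  # True: likely the same entity with a typo
```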

6.3 Handling Duplicates


●​ Standardization: Convert text to lowercase, remove extra spaces.
●​ Merging Records: Combine records based on shared attributes.
●​ Choosing the Most Reliable Source: Use data from authoritative sources.

Summary of Key Techniques


●​ Outlier Detection: Z-score, IQR, DBSCAN, Isolation Forest
●​ Data Conversion: One-hot encoding, label encoding, binning, TF-IDF
●​ Missing Data: Mean/median imputation, KNN, machine learning-based imputation
●​ Noise Reduction: Binning, smoothing, regression, clustering
●​ Data Integration: Schema mapping, record linkage, fuzzy matching
●​ Redundant Attributes: Correlation analysis, feature importance, PCA
●​ Duplicate Detection: Exact matching, Levenshtein distance, Jaccard similarity
