Unit 2 Notes

The document discusses the importance of handling missing values and outliers in datasets, outlining various types of missing data and techniques for imputation, such as deletion, mean imputation, and advanced methods like KNN and EM algorithms. It also covers the identification and management of outliers, including statistical methods and machine learning approaches for detection and handling. Best practices for data type conversion and ensuring consistency in datasets are highlighted to improve data quality and analysis accuracy.


1. Understanding Missing Values


Missing values in a dataset can arise due to several reasons:

●​ Human error: Inaccurate data entry or missing responses in surveys.


●​ Technical issues: Sensor failures, data corruption.
●​ Intentional omissions: Participants may intentionally skip questions (e.g., sensitive information).

Proper handling of missing values is crucial as they can:

●​ Bias analysis results.


●​ Reduce model accuracy.
●​ Impact data integrity.

2. Types of Missing Data


Missing Completely at Random (MCAR):

o​ Missing values are independent of any variables.


o Example: Sensor failures occurring at random intervals.
o​ Solution: Deletion or imputation.

Missing at Random (MAR):

o​ Missing values depend on other observed variables, but not on the missing variable itself.
o​ Example: Older participants skipping online forms due to unfamiliarity with technology.
o​ Solution: Statistical imputation (mean, regression, etc.).

Missing Not at Random (MNAR):

o​ Missing values depend on the variable with missing data itself.


o​ Example: People with lower income might avoid answering salary questions.
o​ Solution: Complex modeling techniques (e.g., MICE, EM algorithm).

3. Techniques to Handle Missing Values


A. Deletion Techniques
Listwise Deletion (Complete Case Analysis):

o​ Removes rows with any missing values.


o​ Pros: Simple to implement.
o​ Cons: Risk of losing a significant portion of data, leading to bias if data is not MCAR.
o​ Suitable for: Large datasets with sparse missing values.

Column Deletion:

o​ Removes entire columns with high proportions of missing values.


o​ Pros: Useful when the column adds little value to analysis.
o​ Cons: Loss of potentially useful information.

B. Simple Imputation Techniques


Filling with a Constant Value:

o​ Replace missing values with a fixed constant (e.g., 0, Unknown).


o​ Pros: Useful for categorical data or placeholders.
o​ Cons: May introduce artificial patterns.

Mean Imputation:

o​ Replace missing values with the mean of the column.


o​ Pros: Retains the column's overall central tendency.
o Cons: Reduces variability; assumes data is MCAR.

Median Imputation:

o​ Replace missing values with the median of the column.


o​ Pros: Robust to outliers compared to mean imputation.
o​ Cons: May not capture the true distribution of the data.

Mode Imputation:

o​ Replace missing values with the most frequent value in the column.
o​ Pros: Works well for categorical data.
o​ Cons: May distort the distribution if the mode is over-represented.
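As an illustration of the simple imputation techniques above, here is a minimal pandas sketch; the DataFrame, column names, and fill values are hypothetical, and pandas is assumed to be available:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "city": ["Pune", "Mumbai", None, "Pune", "Pune"],
})

df["age_mean"] = df["age"].fillna(df["age"].mean())        # mean imputation
df["age_median"] = df["age"].fillna(df["age"].median())    # median imputation
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])  # mode imputation (categorical)
df["city_const"] = df["city"].fillna("Unknown")            # constant placeholder
```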

C. Advanced Imputation Techniques


K-Nearest Neighbors (KNN) Imputation:

o​ Finds the k closest rows (neighbors) and imputes the missing value as the mean (or weighted mean)
of the neighbors.
o​ Pros:

▪​ Retains relationships between variables.


▪​ Suitable for numerical data with clusters.

o​ Cons:

▪​ Computationally expensive for large datasets.


▪​ Sensitive to scaling and choice of k.
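A minimal sketch of KNN imputation, assuming scikit-learn's KNNImputer is available; the array, the choice of k = 2, and the omission of scaling are illustrative assumptions only:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# KNN imputation is sensitive to feature scale; standardize first in practice.
imputer = KNNImputer(n_neighbors=2)   # k is a tunable choice
X_imputed = imputer.fit_transform(X)  # missing cells replaced by neighbour means
```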

Iterative Imputation (e.g., MICE):

o​ Uses a regression model to predict missing values iteratively based on other variables.
o​ Each feature with missing values is modeled as a dependent variable.
o​ Pros:

▪​ Handles complex relationships between variables.


▪​ Provides multiple imputations to reflect uncertainty.

o​ Cons:

▪​ Computationally intensive.
▪​ Requires more expertise to implement effectively.
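A minimal sketch of iterative (MICE-style) imputation using scikit-learn's IterativeImputer, assuming a recent scikit-learn version; the data and parameters are illustrative:

```python
import numpy as np
# IterativeImputer is still flagged experimental, so this import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, 9.0],
              [np.nan, 8.0, 1.0]])

# Each feature with missing values is regressed on the other features, iteratively.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```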

Expectation-Maximization (EM) Algorithm:

o​ Iteratively estimates missing data by finding the most likely values based on probability distributions.
o The Expectation-Maximization (EM) algorithm is an iterative optimization
method for finding maximum likelihood or maximum a posteriori estimates of
parameters in statistical models that involve unobserved latent variables. It is
commonly used for latent variable models and can handle missing data. It
alternates between an expectation step (E-step) and a maximization step
(M-step), iteratively improving the model fit.
o​ Pros:

▪​ Works well for data that follows known distributions.


▪​ Accounts for uncertainty.

o​ Cons:

▪​ Assumes the data fits a specific distribution.


▪​ Computational complexity.
D. Custom and Domain-Specific Techniques
Forward-Fill and Backward-Fill:

o Forward Fill (ffill): Propagates the last observed value forward until the next
non-missing value. Useful when you assume the missing value should carry
forward the last observed value.


o Backward Fill (bfill): Propagates the next observed value backward to fill earlier
gaps. Useful when you assume future data points can be used to fill earlier gaps.

o​ Pros:

▪​ Useful for time-series data.


▪ Preserves temporal relationships.

o​ Cons:

▪​ May propagate incorrect or stale values.
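A minimal pandas sketch of forward- and backward-fill on a time series; the dates and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
s = pd.Series([10.0, np.nan, np.nan, 14.0, np.nan],
              index=pd.date_range("2023-01-01", periods=5, freq="D"))

filled_forward = s.ffill()   # carries 10.0 forward into the two gaps after it
filled_backward = s.bfill()  # pulls 14.0 backward into the gaps before it
```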

Interpolation:

o​ Estimates missing values using interpolation techniques (e.g., linear, spline).


o​ Pros:

▪​ Captures trends in numerical data.


▪​ Suitable for time-series or sequential data.

o​ Cons:

▪​ Assumes smooth trends, which may not always hold.
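A short pandas sketch of interpolation; linear interpolation works out of the box, while the spline variant assumes SciPy is installed:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

linear = s.interpolate(method="linear")           # fills 2.0, 3.0 and 5.0 on straight lines
spline = s.interpolate(method="spline", order=2)  # smoother curve; requires SciPy
```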

Domain Knowledge-Based Imputation:

o​ Use domain expertise to fill missing values (e.g., replace missing ages in a medical dataset with
typical values based on the condition).
o​ Pros:

▪​ Highly tailored to the data.

o​ Cons:

▪​ Requires extensive domain expertise.

4. Evaluating the Effectiveness of Imputation


●​ Visual Inspection: Compare distributions before and after imputation.
●​ Model Performance: Train predictive models and evaluate their performance.
●​ Multiple Imputations: Compare results from different imputation methods to assess variability.

5. Best Practices
●​ Analyze the nature and pattern of missing data before selecting a technique.
●​ Use visualization (e.g., heatmaps, as sketched below) to understand missing data patterns.
●​ Avoid over-reliance on deletion techniques unless missing data is sparse.
●​ Consider advanced imputation for datasets with complex interdependencies.
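A small sketch of the heatmap idea mentioned above, assuming seaborn and matplotlib are available and using a hypothetical DataFrame:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset with scattered missing entries
df = pd.DataFrame({"a": [1, np.nan, 3, np.nan],
                   "b": [np.nan, 2, 3, 4],
                   "c": [1, 2, np.nan, 4]})

sns.heatmap(df.isna(), cbar=False)  # missing cells show up as bright bands
plt.title("Missing-value pattern")
plt.show()
```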

Alternative notes on the same topic:

Introduction to Missing Values
●​ Definition: Missing Values (MVs) are incomplete entries in datasets due to errors during recording, manual
data entry, or incorrect measurements.
●​ Impact of MVs:
1.​ Loss of efficiency in data analysis.
2.​ Increased difficulty in handling and analyzing datasets.
3.​ Bias in results, reducing generalizability​​.

Techniques for Handling Missing Values


1. Case Deletion

●​ Description: Discards instances with missing values.


●​ Advantages: Unbiased parameter estimates under the MCAR (Missing Completely at Random) assumption.
●​ Limitations:
o​ Reduces dataset size significantly.
o​ Introduces bias if data is not MCAR​.

2. Single Imputation

●​ Overview: Missing values are filled with a single estimate.


●​ Methods:

1.​ Mean/Mode/Median Substitution:

▪​ Replace MVs with the column mean (numerical) or mode (categorical).


▪​ Advantages: Simple and widely used.
▪​ Limitation: Reduces variability and introduces bias in correlations.

2.​ Hot Deck Imputation:

▪​ A random similar instance from the dataset fills the MV.


▪​ Modern versions cluster data before selecting a replacement.

3.​ Cold Deck Imputation:

▪​ Replacements come from external datasets​​.

3. Multiple Imputation

●​ Description: Generates multiple datasets by imputing MVs differently in each.


●​ Approach:
o​ Each dataset is analyzed separately.
o​ Final analysis combines results.
●​ Advantage: Accounts for uncertainty in imputation.
●​ Common Frameworks:

o​ Multivariate Imputation by Chained Equations (MICE).


o​ Expectation-Maximization (EM) algorithm​​.

4. Maximum Likelihood Imputation

●​ Principle: Assumes data follows a known probability distribution.


●​ Steps:
o​ Estimate distribution parameters (e.g., mean, variance).
o​ Use parameters to impute MVs.

●​ Challenges: Complex when distributions are multi-dimensional or irregular​​.

5. Machine Learning-Based Imputation

●​ Overview: Employs predictive models for imputation.


●​ Methods:

1.​ K-Nearest Neighbors (KNN) Imputation:

▪​ Finds the K nearest neighbors based on non-missing attributes.


▪​ Imputes MVs as averages or modes of neighbors.
▪​ Limitation: Sensitive to K-value and data scaling​​.

2.​ Support Vector Machine Imputation (SVMI):

▪​ Treats missing attributes as targets in an SVM regression.

3.​ Fuzzy K-Means Clustering:

▪​ Uses membership values for clusters to determine imputed values.


▪​ Suitable for datasets with overlapping clusters​​.

6. Bayesian and Statistical Methods

●​ Bayesian Networks: Models dependencies between variables to infer MVs.


●​ Expectation-Maximization (EM):

o​ Iteratively estimates distribution parameters and imputes MVs.


o​ Advantage: Works well with Gaussian distributions​.

7. Recent Advances

●​ Hybrid Models:

o​ Combine multiple methods, e.g., fuzzy clustering + regression.

●​ Ensemble Methods:

o​ Use algorithms like Random Forests for robust imputation.

●​ Neural Network Approaches:

o​ Predict missing values using deep learning models​​.

Guidelines for Choosing an Imputation Method

1.​ Data Type: Consider the nature of the dataset (e.g., numerical, categorical).
2.​ Percentage of MVs:

o​ Low percentages: Simpler methods like mean imputation are sufficient.


o​ High percentages: Advanced methods (e.g., ML-based) may be required.

3.​ Data Correlations: Use advanced techniques when attributes have strong interdependencies.
4.​ Computational Resources: Complex methods (e.g., Bayesian, ML-based) require more resources
1. Identifying and Handling Outliers
1.1 What are Outliers?
An outlier is a data point that significantly deviates from the rest of the data. It can be
either much higher or much lower than the other data points, and its presence can have a
significant impact on the results of machine learning algorithms. They can be caused by
measurement or execution errors. The analysis of outlier data is referred to as outlier
analysis or outlier mining. They may arise due to:
●​ Data entry errors (typos, incorrect values)
●​ Measurement errors (sensor malfunctions, incorrect readings)
●​ Natural variability (genuine but rare occurrences)

Types of Outliers

There are two main types of outliers:


●​ Global outliers: Global outliers are isolated data points that are far away from the
main body of the data. They are often easy to identify and remove.
●​ Contextual outliers: Contextual outliers are data points that are unusual in a specific
context but may not be outliers in a different context. They are often more difficult to
identify and may require additional information or domain knowledge to determine
their significance.

Algorithm (cluster-based outlier detection)

1.​ Calculate the mean of each cluster


2.​ Initialize the Threshold value
3.​ Calculate the distance of the test data from each cluster mean
4.​ Find the nearest cluster to the test data
5.​ If (Distance > Threshold), flag the test data point as an outlier.
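A rough sketch of the cluster-distance idea above, using scikit-learn's KMeans; the data, number of clusters, and threshold value are all assumptions chosen for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))               # main body of the data
test_points = np.array([[0.1, 0.2], [8.0, 8.0]])  # the second point lies far away

# Step 1: cluster means; Step 2: threshold (tuned per dataset)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
threshold = 3.0

# Steps 3-5: distance to every cluster mean, nearest cluster, compare with threshold
dists = np.linalg.norm(test_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
nearest_dist = dists.min(axis=1)
is_outlier = nearest_dist > threshold   # e.g. [False, True]
```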

1.2 Outlier Detection Methods in Machine Learning


Outlier detection plays a crucial role in ensuring the quality and accuracy of machine
learning models. By identifying and removing or handling outliers effectively, we can
prevent them from biasing the model, reducing its performance, and hindering its
interpretability. Here’s an overview of various outlier detection methods:

1. Statistical Methods:

●​ Z-Score (Standard Score Method): Calculates how many standard deviations a data
point is from the mean and flags values whose Z-scores exceed a certain threshold
(typically 3 or -3).

o Formula: Z = (X − μ) / σ, where X is the data point, μ is the mean, and σ is the
standard deviation.
o If |Z| > 3, the value is considered an outlier.

●​ Interquartile Range (IQR): Uses quartiles to detect outliers, where Q1 and Q3 are the
first and third quartiles.

o Formula: IQR = Q3 − Q1.
o Outliers are values outside [Q1 − k × IQR, Q3 + k × IQR], where k is a factor
(typically 1.5).
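A minimal NumPy sketch of both statistical rules, on synthetic data with one injected extreme value:

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.append(rng.normal(50, 5, size=200), [120.0])  # 120 is an injected outlier

# Z-score rule: |Z| > 3
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]
```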

Machine Learning Approaches

●​ Clustering-based Outlier Detection


o​ Uses algorithms like DBSCAN to find points that do not belong to any dense cluster.
●​ Isolation Forest

o​ Identifies anomalies by randomly selecting features and checking data point isolation.
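A short sketch of both machine learning approaches using scikit-learn; the contamination fraction and DBSCAN parameters are illustrative assumptions that need tuning in practice:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])  # one far-away point

# Isolation Forest: predicted label -1 marks anomalies
iso_labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)

# DBSCAN: points that belong to no dense cluster get the noise label -1
db_labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)
outlier_mask = db_labels == -1
```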

1.3 Techniques for Handling Outliers in Machine Learning


Outliers, data points that significantly deviate from the majority, can have detrimental
effects on machine learning models. To address this, several techniques can be employed
to handle outliers effectively:

1. Removal:

●​ This involves identifying and removing outliers from the dataset before training the
model. Common methods include:
o​ Thresholding: Outliers are identified as data points exceeding a certain
threshold (e.g., Z-score > 3).
o​ Distance-based methods: Outliers are identified based on their distance from
their nearest neighbors.
o​ Clustering: Outliers are identified as points not belonging to any cluster or
belonging to very small clusters.
2. Transformation:

●​ This involves transforming the data to reduce the influence of outliers. Common
methods include:
o​ Scaling: Standardizing or normalizing the data to have a mean of zero and a
standard deviation of one.
o​ Winsorization: Replacing outlier values with the nearest non-outlier value.
o​ Log transformation: Applying a logarithmic transformation to compress the
data and reduce the impact of extreme values.
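A brief NumPy sketch of two of these transformations; the 5th/95th percentile limits for winsorization are an assumed choice:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 250.0])  # 250 is an extreme value

# Winsorization: clip values to chosen percentile limits
low, high = np.percentile(x, [5, 95])
x_winsorized = np.clip(x, low, high)

# Log transform: compresses large values (log1p also handles zeros safely)
x_logged = np.log1p(x)
```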

3. Robust Estimation:

●​ This involves using algorithms that are less sensitive to outliers. Some examples
include:
o​ Robust regression: Algorithms like L1-regularized regression or Huber
regression are less influenced by outliers than least squares regression.
o​ M-estimators: These algorithms estimate the model parameters based on a
robust objective function that down-weights the influence of outliers.
o​ Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are
less susceptible to the presence of outliers than K-means clustering.

4. Modeling Outliers:

●​ This involves explicitly modeling the outliers as a separate group. This can be done
by:
o​ Adding a separate feature: Create a new feature indicating whether a data
point is an outlier or not.
o​ Using a mixture model: Train a model that assumes the data comes from a
mixture of multiple distributions, where one distribution represents the
outliers.

2. Data Types Conversion and Consistency


2.1 Why Convert Data Types?
●​ Data from multiple sources may use different formats (e.g., “Male” vs. “M”).
●​ Machine learning algorithms often require numerical input.
●​ Ensures data consistency and reduces errors in analysis.

2.2 Common Data Conversions


Numerical to Categorical (Discretization)

●​ Example: Converting age into bins (0-18 = "Child", 19-59 = "Adult", 60+ = "Senior").
●​ Methods:

o​ Equal-width binning: Divides range into equal intervals.


o​ Equal-frequency binning: Each bin has an equal number of observations.
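A minimal pandas sketch of discretization, reusing the age-group example above; pd.cut gives equal-width (or custom) bins and pd.qcut gives equal-frequency bins:

```python
import pandas as pd

ages = pd.Series([5, 12, 17, 25, 34, 42, 58, 63, 71, 80])

# Custom bins matching the age-group example
groups = pd.cut(ages, bins=[0, 18, 59, 120], labels=["Child", "Adult", "Senior"])

# Equal-width binning: 3 intervals of equal width
equal_width = pd.cut(ages, bins=3)

# Equal-frequency binning: 3 bins with roughly the same number of observations
equal_freq = pd.qcut(ages, q=3)
```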

Categorical to Numerical

●​ One-Hot Encoding: Converts categories into binary columns.

o​ Example: "Red", "Blue", "Green" → (1,0,0), (0,1,0), (0,0,1).

●​ Label Encoding: Assigns numerical values to categories.


o​ Example: "Low" → 0, "Medium" → 1, "High" → 2.
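A small pandas sketch of both encodings; the columns and the ordinal mapping are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["Red", "Blue", "Green", "Red"],
                   "size": ["Low", "High", "Medium", "Low"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Label encoding via an explicit ordinal mapping
size_map = {"Low": 0, "Medium": 1, "High": 2}
df["size_encoded"] = df["size"].map(size_map)
```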
Text to Numeric Conversion

●​ TF-IDF (Term Frequency-Inverse Document Frequency)


●​ Word Embeddings (Word2Vec, GloVe, BERT)
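A minimal TF-IDF sketch, assuming scikit-learn; the tiny corpus is invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data cleaning removes noise",
        "missing data requires imputation",
        "outliers distort data analysis"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse document-term matrix
terms = vectorizer.get_feature_names_out()     # vocabulary learned from the corpus
```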

Date Formatting

●​ Converting "12/31/2023" → "2023-12-31" for consistency.


●​ Extracting features like day of the week, month, year.
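A short pandas sketch of date parsing and feature extraction; the input format string is an assumption about the source data:

```python
import pandas as pd

dates = pd.Series(["12/31/2023", "01/15/2024"])

parsed = pd.to_datetime(dates, format="%m/%d/%Y")
iso = parsed.dt.strftime("%Y-%m-%d")   # "2023-12-31", "2024-01-15"

# Derived features
day_of_week = parsed.dt.day_name()
month = parsed.dt.month
year = parsed.dt.year
```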

3. Dealing with Missing Values and Noisy Data


3.1 Types of Missing Data
●​ MCAR (Missing Completely at Random) – No pattern, data is randomly missing.
●​ MAR (Missing at Random) – Missing values depend on observed variables.
●​ MNAR (Missing Not at Random) – Data is missing due to specific reasons.

3.2 Handling Missing Data


1. Deletion

●​ Remove rows with missing values (only if missing data is small).

2. Imputation Methods

●​ Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode.
●​ K-Nearest Neighbors (KNN) Imputation: Finds similar instances and imputes missing values based on
neighbors.
●​ Machine Learning-based Imputation:

o​ Regression models (predict missing values based on other features).


o​ Multiple Imputation (generates several plausible values and averages them).

3.3 Dealing with Noisy Data


●​ Smoothing Techniques:

o​ Binning: Grouping values into bins and replacing them with bin means.
o​ Regression: Fitting a function to the data to remove noise.
o​ Clustering: Replacing noisy points with cluster centroids.
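A minimal sketch of smoothing by bin means with pandas; the values and the choice of three equal-frequency bins are illustrative:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Group values into equal-frequency bins, then replace each value with its bin mean
bins = pd.qcut(values, q=3)
smoothed = values.groupby(bins).transform("mean")
```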

4. Data Integration

4.1 Data Integration

Definition
Data integration is the process of combining information from different sources into a unified
dataset. Common techniques include schema mapping (standardizing column names), record
linkage (matching records across datasets using unique identifiers), and fuzzy matching (using
algorithms like Jaccard similarity or Levenshtein distance). A poorly executed integration
process can lead to:

●​ Redundancies → increases data size and slows down processing.


●​ Inconsistencies → reduces accuracy in data mining (DM) models.
●​ Schema mismatches → different structures across datasets require reconciliation.

Challenges in Data Integration


●​ Schema Matching: Different sources may use different formats, names, or structures, making it difficult to
align datasets.
●​ Tuple Duplication: Records from different sources may represent the same entity but have different values.
●​ Redundant Attributes: Some attributes may provide duplicate information.
●​ Data Correlation: Overlapping information may exist across multiple fields.

Solutions & Approaches


●​ Data Mapping: A structured framework defining how instances should be arranged.
●​ In-Database Mining vs. External Processing:

o​ In-Database Mining: Directly processes data within a database (avoids extraction but requires
repeated preprocessing).
o​ External Processing: Extracting data into a separate file for preprocessing before applying DM
techniques. This approach is usually faster and more flexible.

4.2 Finding Redundant Attributes


Why Remove Redundant Attributes?
●​ Improves model performance by reducing dimensionality.
●​ Reduces computational complexity.

Detection Methods

●​ Correlation Analysis: Correlation is a statistical measure of the extent to which two
variables are linearly related (i.e., change together at a constant rate).

o High correlation (e.g., Pearson correlation > 0.9) suggests redundancy.

●​ Feature Importance from Machine Learning Models:

o​ Decision Trees, Random Forests, and LASSO regression can identify unimportant features.

●​ Principal Component Analysis (PCA):

o​ Reduces redundant attributes by transforming data into uncorrelated components.
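A small sketch of correlation-based redundancy detection with pandas; the columns are hypothetical (height in two different units is deliberately redundant), and 0.9 is the assumed threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=100)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,            # redundant: same quantity, different unit
    "weight_kg": rng.normal(70, 8, size=100),
})

corr = df.corr().abs()
# Keep only the upper triangle so each attribute pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]  # ['height_in']
```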

Definition
A redundant attribute is one that can be derived from another attribute or set of attributes.
Redundancy increases data size, processing time, and can cause overfitting.
Causes of Redundant Attributes
●​ Naming inconsistencies (e.g., “Salary” vs. “Income”).
●​ Attributes storing correlated information (e.g., “Age” and “Year of Birth”).
●​ Measurement units (e.g., “Height in cm” vs. “Height in inches”).

Methods for Detecting Redundancy


1. Correlation Analysis

Measures the dependency between attributes to determine whether one attribute can be removed.
Types of correlation: positive linear (the y-axis variable increases as the x-axis variable
increases), negative linear, non-linear (curvilinear), and no correlation.
2.​ χ² (Chi-Squared) Correlation Test (for Categorical Data)

Pearson's chi-square test evaluates the relationship between categorical variables by comparing
observed frequencies with expected frequencies. It is used when attributes are nominal
(categorical). Key terms:

●​ Observed frequencies: the actual counts of occurrences in each category.
●​ Expected frequencies: the counts expected if the null hypothesis is true.
●​ Chi-square statistic: a measure of how much the observed frequencies deviate from the
expected frequencies.
●​ P-value: the probability of observing the data assuming the null hypothesis is true; it
indicates the strength of the evidence against the null hypothesis.
●​ Null hypothesis (H₀): a statistical assumption about a population parameter (here, that the
two attributes are independent) that is taken to be true until there is sufficient evidence to
reject it.
●​ Alternative hypothesis (H₁): the competing hypothesis; hypothesis testing evaluates the
alternative hypothesis against the null hypothesis.

●​ Steps:

1.​ Create a contingency table for the two attributes (A and B): a table showing the distribution of
one variable in rows and the other in columns, used to study the correlation between them.
2.​ Compute the χ² statistic:

χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ,   where eᵢⱼ = (count(A = aᵢ) × count(B = bⱼ)) / m

▪​ oᵢⱼ = observed frequency of (Aᵢ, Bⱼ)
▪​ eᵢⱼ = expected frequency of (Aᵢ, Bⱼ)
▪​ m = total number of instances

3.​ Compare the χ² value with a threshold from a chi-square statistical table (for the appropriate
degrees of freedom).
4.​ If χ² is significant, the attributes are correlated, and one of them may be redundant.
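A minimal sketch of the χ² test using SciPy on a hypothetical contingency table:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical attributes A and B
df = pd.DataFrame({
    "A": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
    "B": ["x",   "x",   "y",  "y",  "x",   "y",  "y",   "x"],
})

table = pd.crosstab(df["A"], df["B"])              # contingency table of observed counts
chi2, p_value, dof, expected = chi2_contingency(table)

# A small p-value (e.g. < 0.05) suggests the attributes are correlated,
# so one of them may be redundant.
```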

3. Pearson Correlation Coefficient (for Numerical Data)


Measures the strength and direction of a linear relationship between two numerical attributes.

●​ Values Interpretation:

o​ r > 0 → Positive correlation (both increase together).


o​ r < 0 → Negative correlation (one increases, the other decreases).
o​ r = 0 → No correlation.

●​ High correlation (|r| ≈ 1) means one attribute might be removed.


4. Covariance Analysis
Indicates how two numerical attributes vary together.

●​ Cov(A, B) > 0 → Attributes increase/decrease together.


●​ Cov(A, B) < 0 → One increases while the other decreases.
●​ Cov(A, B) = 0 → No relationship, but does not guarantee independence.

4.3 Detecting Tuple Duplication and Inconsistency


Definition
Duplicate tuples occur when the same entity appears multiple times in the dataset. Inconsistencies
arise when data values conflict due to input errors or variations across sources.
Causes of Duplication & Inconsistency
●​ Denormalized tables (to optimize joins) may repeat records.
●​ Data entry errors (e.g., typos in unique identifiers).
●​ Different measurement units (e.g., kg vs. lbs).
●​ Variations in naming conventions (e.g., “John Doe” vs. “Doe, John”).

Techniques for Detecting Duplicate Tuples


1. String-Based Similarity Measures (for Nominal Attributes)
●​ Edit Distance (Levenshtein Distance): Minimum number of character operations (insert, delete, replace)
needed to transform one string into another.
●​ Affine Gap Distance: Extends edit distance to account for truncation errors.
●​ Jaro Distance: Measures common characters and transpositions between two strings.
●​ Q-Gram Similarity: Splits strings into substrings of length q and compares them.
●​ Token-Based Similarity: Used when word order varies (e.g., “Smith, John” vs. “John Smith”).
●​ Phonetic Matching: Algorithms like Metaphone and ONCA compare strings based on pronunciation.
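A compact pure-Python sketch of the Levenshtein (edit) distance using dynamic programming; the sample strings are illustrative:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    dp = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ca != cb))  # substitution (free if characters match)
            prev = cur
    return dp[-1]

print(edit_distance("Smith", "Smyth"))         # 1
print(edit_distance("John Doe", "Doe, John"))  # larger, since word order differs
```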

2. Numeric Similarity Measures


●​ Encoding numbers as strings for similarity checks.
●​ Range-based thresholding (e.g., values within 1% are considered duplicates).
●​ Cosine Similarity for Numeric Data: Measures the cosine of the angle between vectors.

4.4 Approaches for Detecting Duplicate Instances


1. Probabilistic Approaches
●​ Fellegi-Sunter Model (Bayesian Inference):

o​ Models duplicate detection as a probability estimation problem.


o​ Uses Expectation-Maximization (EM) to estimate parameters.
o​ Bayes Decision Rule determines if an instance is unique or duplicated.

2. Supervised & Semi-Supervised ML Approaches


●​ Classification Models: Decision Trees (CART), SVM, etc.
●​ Clustering-Based Methods:

o​ Graph partitioning groups similar instances.


o​ Clustering bootstrapping refines duplicate detection over iterations.
3. Distance-Based Approaches
●​ Uses similarity metrics (edit distance, weighted modifications, etc.).
●​ Ranking Similar Instances: Sorts and selects least duplicated tuples.

4. Unsupervised Clustering Techniques


●​ Hierarchical Graph Models: Represent attributes as “match/no-match” binary variables.
●​ Density-Based Clustering (DBSCAN): Groups similar instances.

Key Takeaways
✔ Data Integration is critical for ensuring consistency and accuracy in data mining.​
✔ Redundant attributes increase processing time and cause overfitting. Correlation and
covariance analysis help detect redundancy.​
✔ Duplicate tuples and inconsistencies arise from various sources, including data entry errors,
different measurement units, and schema mismatches.​
✔ Various approaches exist for duplicate detection, including probabilistic models, supervised
ML, and distance-based techniques.


6. Detecting Tuple Duplication and Inconsistency


6.1 Types of Duplicates
●​ Exact Duplicates: Identical records appearing multiple times.
●​ Near Duplicates: Slight variations in records due to typos or inconsistent formatting.

6.2 Detecting Duplicate Tuples


●​ Exact Matching: Checking for identical values.
●​ Fuzzy Matching: Using algorithms like:

o​ Levenshtein Distance (measures character differences).


o​ Jaccard Similarity (measures common elements).
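A short sketch combining exact and fuzzy duplicate checks; pandas handles exact matches after basic standardization, while difflib (standard library) gives a simple similarity ratio. The names and the 0.85 threshold are assumptions:

```python
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({"name": ["John Doe", "john doe ", "Jane Roe", "John Doe"]})

# Exact duplicates after standardization (lowercase, trimmed spaces)
clean = df["name"].str.lower().str.strip()
exact_dupes = df[clean.duplicated(keep=False)]

# Near duplicates via a similarity ratio
def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(similar("Jon Doe", "John Doe"))  # True: likely the same entity with a typo
```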

6.3 Handling Duplicates


●​ Standardization: Convert text to lowercase, remove extra spaces.
●​ Merging Records: Combine records based on shared attributes.
●​ Choosing the Most Reliable Source: Use data from authoritative sources.

Summary of Key Techniques


●​ Outlier Detection: Z-score, IQR, DBSCAN, Isolation Forest
●​ Data Conversion: One-hot encoding, label encoding, binning, TF-IDF
●​ Missing Data: Mean/median imputation, KNN, machine learning-based imputation
●​ Noise Reduction: Binning, smoothing, regression, clustering
●​ Data Integration: Schema mapping, record linkage, fuzzy matching
●​ Redundant Attributes: Correlation analysis, feature importance, PCA
●​ Duplicate Detection: Exact matching, Levenshtein distance, Jaccard similarity
