
Extracting Knowledge from Data

Data Preparation, Enrichment, Encoding, and Standardization

Presented by: Bejaoui Ahmed


Plan

• Why is Data Preparation Important?
• Data Preparation and Cleaning
• Data Enrichment
• Data Encoding
• Data Standardization
• Data Normalization
• Challenges in Data Preparation
• Future Trends

Introduction

Extracting knowledge from data involves going beyond basic analysis; it
requires that data be carefully prepared, enriched, encoded, and
standardized. This process improves data quality, increases model
accuracy, and enhances decision-making. Today, we'll explore key steps
like data cleaning, enrichment, encoding, and standardization.

Why is Data Preparation Important?

Data often comes in raw form with inconsistencies, missing values, and errors.
Properly prepared data:

 Increases model accuracy: clean data improves prediction outcomes.

 Saves time and resources: reduces the need for troubleshooting during analysis.

 Prevents misleading results: inaccurate or unclean data can distort findings and decisions.

Data Preparation and Cleaning

1. Handling Missing Data
Use deletion to remove incomplete entries, or imputation to fill gaps with statistical estimates, balancing data integrity and completeness.

2. Handling Outliers
Use statistical methods like the IQR (interquartile range) or Z-score to identify extreme values, then treat outliers by removing, transforming, or replacing them as appropriate based on domain knowledge.

3. Data Consistency
Ensure uniform formats (e.g., dates, currencies) across the dataset.

4. Removing Duplicates
Identify and eliminate duplicate records that may distort analysis results.
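The steps above can be sketched with pandas; the toy DataFrame, the choice of median imputation, and the 1.5 × IQR fence are illustrative assumptions (step 3, consistency, would amount to e.g. parsing all dates with a single pd.to_datetime call):

```python
import pandas as pd

# Toy dataset: one missing age, one extreme age, one duplicate row
df = pd.DataFrame({
    "age":  [25.0, None, 31.0, 25.0, 120.0],
    "city": ["Tunis", "Sfax", "Tunis", "Tunis", "Sfax"],
})

# 1. Missing data: impute with the median (deletion would use df.dropna())
df["age"] = df["age"].fillna(df["age"].median())

# 2. Outliers: keep only values inside the 1.5 * IQR fence
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 4. Duplicates: drop exact repeated rows
df = df.drop_duplicates()
```

Here the age 120 falls outside the IQR fence and is dropped, and one of the two identical (25, "Tunis") rows is removed, leaving three clean records.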

Data Enrichment
Adding new relevant data to enhance the existing dataset and improve
analysis.
Types of Data Enrichment:
External Data: adding information from other sources (e.g., social media, weather data).
Feature Engineering: creating new features from the existing data (e.g., combining date and time into one feature).

Benefits:
Enriched data provides deeper insights.
Improves model performance by adding relevant context or features.
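The date-and-time example can be sketched in pandas; the two-row log is an assumed illustration:

```python
import pandas as pd

# Toy log with separate date and time columns (assumed data)
df = pd.DataFrame({
    "date": ["2024-05-01", "2024-05-04"],
    "time": ["08:30", "17:45"],
})

# Feature engineering: combine date and time into one timestamp feature
df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"])

# The combined feature then yields further derived features
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5  # 5 = Saturday
```

Neither the raw date nor the raw time column alone could express "weekend evening"; the engineered timestamp makes such patterns available to a model.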

Data Encoding
Converting categorical (non-numerical) data into numerical form so that
machine learning algorithms can use it.
Techniques:
Label Encoding:
• Assigns an integer to each category.
• Example: "Red" = 1, "Green" = 2, "Blue" = 3. Best suited to ordinal data, since the assigned integers imply an order.

One-Hot Encoding:
• Creates binary columns for each category.
• Example: "Color" column with values "Red," "Green," "Blue" becomes three binary
columns.
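Both techniques can be sketched with pandas; the color data and the specific integer mapping are assumptions for illustration:

```python
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue", "Green"], name="Color")

# Label encoding: one integer per category (mapping chosen for illustration)
label_map = {"Red": 1, "Green": 2, "Blue": 3}
labels = colors.map(label_map)

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(colors, prefix="Color")
```

One-hot encoding avoids the artificial ordering that label encoding imposes, at the cost of one extra column per category.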

Data Encoding
 Frequency Encoding:
Replaces categories with their frequency in the dataset.
Example:
A column with colors: "Red," "Green," "Blue" becomes "Red" = 50%,
"Green" = 30%, "Blue" = 20%.
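A minimal sketch of frequency encoding in pandas, using an assumed sample that matches the 50% / 30% / 20% split above:

```python
import pandas as pd

# 50% Red, 30% Green, 20% Blue, matching the example above
colors = pd.Series(["Red"] * 5 + ["Green"] * 3 + ["Blue"] * 2)

# Replace each category with its relative frequency in the dataset
freq = colors.value_counts(normalize=True)
encoded = colors.map(freq)
```

This keeps a single numeric column regardless of how many categories exist, which makes it attractive for high-cardinality features.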

Data Standardization
Rescaling data so that it has a mean of zero and a standard deviation of
one.
Why It’s Important:
Algorithms like k-Means, SVM(Support Vector Machine), and Gradient
Descent are sensitive to data scaling.
Standardization ensures that large-scale features don’t dominate smaller-
scale features.

Example of Data Standardization
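A worked sketch of z-score standardization in plain Python; the sample values are assumed for illustration:

```python
import statistics

values = [10, 20, 30, 40, 50]  # assumed sample values

mean = statistics.fmean(values)   # 30.0
std = statistics.pstdev(values)   # population standard deviation

# Standardize: subtract the mean, divide by the standard deviation
standardized = [(x - mean) / std for x in values]
```

The result has mean 0 and standard deviation 1, so features measured on very different scales become directly comparable.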

Data Normalization

Rescaling data to a range between 0 and 1 without changing the shape of
its distribution.
When to Use:
• Preferred when working with algorithms that rely on distances, such as
k-NN or neural networks.

Example of Data Normalization
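A worked sketch of min-max normalization; the sample values are assumed for illustration:

```python
values = [10, 20, 30, 40, 50]  # assumed sample values

# Min-max normalization: rescale to the range [0, 1]
lo, hi = min(values), max(values)
normalized = [(x - lo) / (hi - lo) for x in values]
```

The smallest value maps to 0 and the largest to 1, while the relative spacing between values, and hence the shape of the distribution, is preserved.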

Challenges in Data Preparation

• High Dimensionality:
Datasets with many features can lead to overfitting or long processing times.
• Incomplete or Inconsistent External Data:
Data enrichment may introduce inconsistencies or new missing values.
• Complexity in Encoding:
Some categorical features have too many levels, making encoding
computationally expensive.

Future Trends

Automated Data Cleaning (AutoML): Uses AI to automatically clean and prepare data, saving time and improving data quality.

Data-Centric AI: Prioritizes data quality improvements over model tuning, ensuring better model performance from well-prepared data.

Real-Time Data Preparation: Enables on-the-fly data cleaning and transformation, essential for streaming analytics and IoT.

Synthetic Data Generation: Creates artificial, privacy-safe data to supplement real datasets, improving model training without compromising sensitive information.

Conclusion
Data preparation, enrichment, encoding, and standardization are
foundational to effective data analysis and machine learning.
Prioritizing these steps ensures cleaner, more consistent data and
enhances model performance.

References
•Aggarwal, C. C. (2015). Data Mining: The Textbook. Springer.

•Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques. Morgan Kaufmann.

•Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media.

•Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.

