
Data Pre-processing

Md. Apu Hosen
Lecturer
Dept. of CSE, NUBTK
Introduction
• Data preprocessing and feature engineering are both crucial steps in preparing
data for machine learning models, but they serve different purposes and involve
distinct tasks.
• The main goal of data preprocessing is to clean and organize raw data into a
format suitable for machine learning algorithms.
• Feature engineering is about creating new features, transforming existing ones,
and selecting the most relevant features to enhance the model's performance.
Data Pre-Processing
• Data preprocessing is the process of preparing raw data and making it suitable
for a machine learning model.
• It is the first and most crucial step in creating a machine learning model.
• When starting a machine learning project, we do not always come across clean,
well-formatted data.
• Before performing any operation on data, it must be cleaned and put into a
consistent format.
• Data preprocessing is the set of tasks we use to do this.
Why do we need Data Preprocessing?
• Real-world data generally contains noise and missing values, and may be in an
unusable format that cannot be fed directly to machine learning models.
• Data preprocessing is the required task of cleaning the data and making it
suitable for a machine learning model; it also increases the accuracy and
efficiency of the model.
Data Preprocessing Techniques
Now that you know more about the data preprocessing phase and why it’s
important, let’s look at the main techniques to apply to the data, making it more
usable for our future work. The techniques that we’ll explore are:
1. Data Cleaning
2. Sampling
3. Imbalanced Data Handling
4. Dimensionality Reduction
5. Feature Engineering
1. Data Cleaning
• One of the most important aspects of the data preprocessing phase is detecting
and fixing bad or inaccurate observations in your dataset in order to improve
its quality.
• This technique involves identifying:
⁃ Incomplete data
⁃ Inaccurate data
⁃ Duplicated data
⁃ Null values in the data
• After identifying these issues, you will need to either correct or delete them,
as in the sketch below.
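To make this concrete, here is a minimal sketch in Python using pandas; the DataFrame, its column names, and the cleaning rules (e.g., a valid age range of 0 to 120) are hypothetical illustrations, not part of the original slides:

import pandas as pd
import numpy as np

# Hypothetical raw data containing the issues listed above
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 31, 140],           # a null value and an impossible age
    "salary": [50000, 60000, 55000, 55000, 52000],
})

df = df.drop_duplicates()                           # duplicated data
df["age"] = df["age"].fillna(df["age"].median())    # null / incomplete data
df = df[df["age"].between(0, 120)]                  # inaccurate data (impossible ages)
print(df)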
2. Sampling
• Here’s a scenario I’m sure you are familiar with. You download a relatively big
dataset and are excited to get started with analyzing it and building your
machine-learning model. And snap – your machine gives an “out of memory”
error while trying to load the dataset.
• It’s happened to the best of us. It’s one of the biggest hurdles we face in data
science – dealing with massive amounts of data on computationally limited
machines (not all of us have Google’s resource power!).
• So how can we overcome this perennial problem? Is there a way to pick a
subset of the data and analyze that – and that can be a good representation of
the entire dataset?
• Yes! And that method is called sampling.
2. Sampling
• Sampling is a method that allows us to get information about the population
based on the statistics from a subset of the population (sample), without
having to investigate every individual.
Sampling Technique
There are various types of sampling techniques, and the choice of a particular
method depends on the specific requirements of the machine learning task. Here
are some common sampling techniques:
1. Random Sampling: In random sampling, data points are selected randomly
from the dataset. This method gives each data point an equal chance of being
included in the sample.
2. Stratified Sampling: Stratified sampling involves dividing the dataset into
different strata or groups based on certain characteristics. Then, random samples
are taken from each stratum. This ensures that the sample is representative of the
distribution of the characteristics in the overall dataset.
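Here is a minimal sketch of both techniques with pandas; the 900/100 label split and the column names are made up for illustration:

import pandas as pd

# Hypothetical dataset with a 90/10 class split
df = pd.DataFrame({"feature": range(1000), "label": [0] * 900 + [1] * 100})

# Random sampling: every row has an equal chance of selection
random_sample = df.sample(n=100, random_state=42)

# Stratified sampling: take 10% from each label group, preserving the 90/10 ratio
stratified_sample = df.groupby("label", group_keys=False).sample(frac=0.1, random_state=42)
print(stratified_sample["label"].value_counts())   # 90 zeros, 10 ones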
Sampling Technique
3. Under-sampling: Under-sampling is a technique used to address class
imbalance in a dataset by reducing the number of instances in the majority
class. It balances the class distribution and prevents the model from being
biased toward the majority class.
4. Over-sampling: Over-sampling is a technique used to address class imbalance
by increasing the number of instances in the minority class. It provides the
model with more examples of the minority class, improving its ability to learn
and make accurate predictions for that class. Common techniques include
Random Oversampling, SMOTE, ADASYN, and MSMOTE.
5. Bootstrapping: Bootstrapping involves creating multiple random samples with
replacement from the original dataset. The sketch after this list illustrates all
three resampling ideas.
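This is a minimal sketch assuming the imbalanced-learn (imblearn) and scikit-learn packages are installed; the 90/10 synthetic dataset is purely illustrative:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced dataset: roughly 90% majority, 10% minority
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:      ", Counter(y))

# Under-sampling: shrink the majority class to match the minority class
X_u, y_u = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("under-sampled: ", Counter(y_u))

# Over-sampling with SMOTE: synthesize new minority-class instances
X_o, y_o = SMOTE(random_state=42).fit_resample(X, y)
print("over-sampled:  ", Counter(y_o))

# Bootstrapping: draw n rows with replacement from the original data
X_b, y_b = resample(X, y, replace=True, n_samples=len(y), random_state=42)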
3. Imbalanced Data Handling
• Imbalanced data refers to a situation where the distribution of classes in a
dataset is not equal.
• In other words, one class (the minority class) has significantly fewer instances
than another class (the majority class).
• This imbalance can pose challenges for machine learning algorithms, especially
when the goal is to build a model that can accurately predict or classify
instances from the minority class.
• Dealing with imbalanced data is an important aspect of data preprocessing. Here
are some common strategies to address imbalanced data:
⁃ Oversampling the Minority Class
⁃ Under-sampling the Majority Class
⁃ Using Different Evaluation Metrics (a sketch of this follows below)
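For the evaluation-metrics point, here is a hedged sketch assuming scikit-learn; it also uses class weighting (a related mitigation, not one of the three strategies listed above) so that the report has a reasonably balanced model to evaluate:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic dataset where 95% of instances belong to one class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes minority-class mistakes more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Per-class precision, recall, and F1 are more informative than raw accuracy
print(classification_report(y_te, clf.predict(X_te)))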
4. Dimensionality Reduction
• Dimensionality reduction is a technique used in machine learning and data
analysis to reduce the number of input features (dimensions) in a dataset.
• The primary goal is to simplify the dataset while retaining as much relevant
information as possible.
• High-dimensional data, where the number of features is large, can present
challenges such as increased computational complexity, the risk of overfitting,
and difficulty in visualization.
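To make this concrete, here is a hedged sketch assuming scikit-learn: PCA compresses the 64 pixel features of the digits dataset while retaining 95% of the variance.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)     # 64 pixel features per image
pca = PCA(n_components=0.95)            # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # far fewer columns than the original 64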
5. Feature Engineering
• Feature engineering is the pre-processing step of machine learning that
transforms raw data into features which can be used to create a predictive
model using machine learning or statistical modelling.
• In other words, it is the process of selecting, extracting, and transforming the most
relevant features from the available data to build more accurate and efficient
machine learning models.
• Feature engineering in machine learning aims to improve the performance of
models.
What is Feature?
• Features are the attributes or dimensions of your data that contain relevant
information and are crucial for the model to learn patterns and relationships
and to make decisions.
• In short, a feature is an individual measurable property within a recorded dataset.
• A model for predicting the size of a shirt for a person may have features such as
age, gender, height, weight, etc.
Common Feature Types
• Numerical Features: Values with numeric types (int, float, etc.). Examples: age,
salary, height.
• Categorical Features: Features that can take one of a limited number of values.
Examples: gender (male, female, X), color (red, blue, green).
• Ordinal Features: Categorical features that have a clear ordering. Examples: T-
shirt size (S, M, L, XL).
• Binary Features: A special case of categorical features with only two categories.
Examples: is_smoker (yes, no), has_subscription (true, false).
• Text Features: Features that contain textual data. Textual data typically requires
special preprocessing steps (like tokenization) to transform it into a format suitable
for machine learning models.
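Most models expect numeric input, so categorical, ordinal, and binary features are usually encoded first. Below is a minimal pandas sketch on a hypothetical table (text features are omitted, since they need tokenization as noted above):

import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 41],                 # numerical
    "color": ["red", "blue", "green"],   # categorical
    "size": ["S", "L", "M"],             # ordinal
    "is_smoker": [True, False, False],   # binary
})

# Categorical -> one-hot columns; ordinal -> ordered integer codes (S=0, M=1, L=2, XL=3)
df = pd.get_dummies(df, columns=["color"])
df["size"] = pd.Categorical(df["size"], categories=["S", "M", "L", "XL"], ordered=True).codes
print(df)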
5. Feature Engineering
• The success of machine learning models heavily depends on the quality of the
features used to train them.
• Feature engineering involves a set of techniques that enable us to create new
features by combining or transforming the existing ones.
• These techniques help to highlight the most important patterns and relationships in
the data, which in turn helps the machine learning model to learn from the data
more effectively.
5. Feature Engineering
Process Involved in Feature Engineering
• Feature engineering in machine learning mainly consists of five processes:
a) Feature Creation
b) Feature Transformation
c) Feature Extraction
d) Feature Selection
e) Feature Scaling
• It is an iterative process that requires experimentation and testing to find the best
combination of features for a given problem.
a. Feature Creation
• Feature Creation is the process of generating new features based on domain
knowledge or by observing patterns in the data.
• Feature creation is finding the most useful variables to be used in a predictive
model.
• The process is subjective, and it requires human creativity and intervention.
• New features are created by combining existing features using operations such
as addition, subtraction, and ratios, and these new features offer great flexibility.
Types of Feature Creation
1. Domain-Specific: Creating new features based on domain knowledge, such as
creating features based on business rules or industry standards.
2. Data-Driven: Creating new features by observing patterns in the data, such as
calculating aggregations or creating interaction features.
3. Synthetic: Generating new features by combining existing features or
synthesizing new data points.
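The sketch below illustrates all three types on a made-up sales table; every column name and derived feature here is hypothetical:

import pandas as pd

df = pd.DataFrame({
    "total_price": [120.0, 80.0, 300.0],
    "quantity": [4, 2, 10],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-06-20", "2024-02-11"]),
})

# Domain-specific: unit price is a standard business quantity
df["unit_price"] = df["total_price"] / df["quantity"]

# Data-driven: a date component observed to matter in the data
df["signup_month"] = df["signup_date"].dt.month

# Synthetic: an interaction feature combining two existing columns
df["price_x_quantity"] = df["total_price"] * df["quantity"]
print(df)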
b. Feature Transformation
• Feature transformation is the process of converting features into a common
structure or scale.
• It involves converting raw data into a format suitable for analysis or machine
learning.
• For example, it ensures that the model can flexibly take input from a variety of
data sources and that all variables are on the same scale, making the model
easier to understand.
• It improves the model's accuracy and keeps all features within an acceptable
range, avoiding computational errors.
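As one small example of a transformation (assuming numpy and pandas, with made-up income values), a log transform pulls a heavily skewed feature onto a scale comparable with other variables:

import numpy as np
import pandas as pd

# Hypothetical, heavily skewed feature
df = pd.DataFrame({"income": [20_000, 45_000, 1_200_000]})

# log1p compresses large values while keeping the ordering intact
df["log_income"] = np.log1p(df["income"])
print(df)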
c. Feature Extraction
• Feature extraction is a process in machine learning and data analysis that
involves identifying and extracting relevant features from raw data.
• These features are later used to create a more informative dataset, which can
be further utilized for various tasks such as classification, prediction, and
clustering.
• Feature extraction aims to reduce the dimensionality, complexity, and
redundancy of the data, while preserving the essential information and
relationships.
c. Feature Extraction
• Here are some common methods used for feature extraction:
⁃ Principal Component Analysis (PCA)
⁃ Linear Discriminant Analysis (LDA)
⁃ Independent Component Analysis (ICA)
⁃ Feature Hashing
⁃ Kernel Principal Component Analysis (Kernel PCA)
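As a sketch of one of these methods (assuming scikit-learn), LDA projects the four iris measurements onto two components that best separate the three classes:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                 # 4 raw measurements per flower
lda = LinearDiscriminantAnalysis(n_components=2)  # at most (n_classes - 1) components
X_new = lda.fit_transform(X, y)
print(X.shape, "->", X_new.shape)                 # (150, 4) -> (150, 2)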
d. Feature Selection
• Feature Selection is the process of selecting a subset of relevant features
from the dataset to be used in a machine-learning model.
• It is an important step in the feature engineering process as it can have a
significant impact on the model’s performance.
• Common feature selection methods are:
⁃ Filter Methods
⁃ Wrapper Methods
⁃ Embedded Methods
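For instance, a filter method can be sketched with scikit-learn's SelectKBest, which scores each feature independently with an ANOVA F-test and keeps the top k (k=10 here is an arbitrary choice):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)          # 30 features
selector = SelectKBest(score_func=f_classif, k=10)  # score each feature independently
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)              # (569, 30) -> (569, 10)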
e. Feature Scaling or Feature Normalization

• Feature Scaling is the process of transforming the features so that they have a
similar scale.
• The main objective of feature scaling is to bring all features to a similar scale,
preventing one feature from dominating the others due to differences in their
magnitudes.
• There are several methods for feature scaling, and the choice of method often
depends on the characteristics of the data and the requirements of the machine
learning algorithm. Here are some common techniques:
⁃ Min-Max Scaling (Normalization)
⁃ Standardization (Z-score normalization)
⁃ Robust Scaling
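The sketch below (assuming scikit-learn) applies all three scalers to a tiny made-up matrix whose two columns differ greatly in magnitude:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10_000.0]])

print(MinMaxScaler().fit_transform(X))    # maps each feature into [0, 1]
print(StandardScaler().fit_transform(X))  # zero mean, unit variance per feature
print(RobustScaler().fit_transform(X))    # median/IQR based, less sensitive to outliers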
Any Questions?
