
Data Pre-processing

Md. Apu Hosen
Lecturer
Dept. of CSE, NUBTK
Introduction
• Data preprocessing and feature engineering are both crucial steps in preparing
data for machine learning models, but they serve different purposes and involve
distinct tasks.
• The main goal of data preprocessing is to clean and organize raw data into a
format suitable for machine learning algorithms.
• Feature engineering is about creating new features, transforming existing ones,
and selecting the most relevant features to enhance the model's performance.
Data Pre-Processing
• Data preprocessing is the process of preparing raw data and making it suitable
for a machine learning model.
• It is the first and most crucial step in creating a machine learning model.
• When starting a machine learning project, we do not always come across clean,
well-formatted data.
• Before performing any operation on data, it must be cleaned and put into a
consistent format.
• Data preprocessing is the set of tasks we use to do this.
Why do we need Data Preprocessing?
• Real-world data generally contains noise and missing values, and may be in an
unusable format that cannot be fed directly to machine learning models.
• Data preprocessing is the required task of cleaning the data and making it
suitable for a machine learning model; it also increases the accuracy and
efficiency of the model.
Data Preprocessing Techniques
Now that you know more about the data preprocessing phase and why it’s
important, let’s look at the main techniques to apply to the data, making it more
usable for our future work. The techniques that we’ll explore are:
1. Data Cleaning
2. Sampling
3. Imbalanced Data Handling
4. Dimensionality Reduction
5. Feature Engineering
1. Data Cleaning
• One of the most important aspects of the data preprocessing phase is detecting
and fixing bad or inaccurate observations in your dataset in order to improve
its quality.
• This technique involves identifying:
⁃ Incomplete data
⁃ Inaccurate data
⁃ Duplicated data
⁃ Null values in the data
• After identifying these issues, you will need to either correct or delete them,
as in the sketch below.
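To make this concrete, here is a minimal sketch in Python using pandas; the DataFrame, its column names, and the cleaning rules (e.g., a valid age range of 0 to 120) are hypothetical illustrations, not part of the original slides:

import pandas as pd
import numpy as np

# Hypothetical raw data containing the issues listed above
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 31, 140],           # a null value and an impossible age
    "salary": [50000, 60000, 55000, 55000, 52000],
})

df = df.drop_duplicates()                           # duplicated data
df["age"] = df["age"].fillna(df["age"].median())    # null / incomplete data
df = df[df["age"].between(0, 120)]                  # inaccurate data (impossible ages)
print(df)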
2. Sampling
• Here’s a scenario I’m sure you are familiar with. You download a relatively big
dataset and are excited to get started with analyzing it and building your
machine-learning model. And snap – your machine gives an “out of memory”
error while trying to load the dataset.
• It’s happened to the best of us. It’s one of the biggest hurdles we face in data
science – dealing with massive amounts of data on computationally limited
machines (not all of us have Google’s resource power!).
• So how can we overcome this perennial problem? Is there a way to pick a
subset of the data and analyze that – and that can be a good representation of
the entire dataset?
• Yes! And that method is called sampling.
2. Sampling
• Sampling is a method that allows us to get information about the population
based on the statistics from a subset of the population (sample), without
having to investigate every individual.
Sampling Technique
There are various types of sampling techniques, and the choice of a particular
method depends on the specific requirements of the machine learning task. Here
are some common sampling techniques:
1. Random Sampling: In random sampling, data points are selected randomly
from the dataset. This method gives each data point an equal chance of being
included in the sample.
2. Stratified Sampling: Stratified sampling involves dividing the dataset into
different strata or groups based on certain characteristics. Then, random samples
are taken from each stratum. This ensures that the sample is representative of the
distribution of the characteristics in the overall dataset.
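Here is a minimal sketch of both techniques with pandas; the 900/100 label split and the column names are made up for illustration:

import pandas as pd

# Hypothetical dataset with a 90/10 class split
df = pd.DataFrame({"feature": range(1000), "label": [0] * 900 + [1] * 100})

# Random sampling: every row has an equal chance of selection
random_sample = df.sample(n=100, random_state=42)

# Stratified sampling: take 10% from each label group, preserving the 90/10 ratio
stratified_sample = df.groupby("label", group_keys=False).sample(frac=0.1, random_state=42)
print(stratified_sample["label"].value_counts())   # 90 zeros, 10 ones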
Sampling Technique
3. Under-sampling: Under-sampling is a technique used to address class
imbalance in a dataset by reducing the number of instances in the majority
class. It balances the class distribution and prevents the model from being
biased toward the majority class.
4. Over-sampling: Over-sampling is a technique used to address class imbalance
by increasing the number of instances in the minority class. It provides the
model with more examples of the minority class, improving its ability to learn
and make accurate predictions for that class. Common techniques include
Random Oversampling, SMOTE, ADASYN, and MSMOTE.
5. Bootstrapping: Bootstrapping involves creating multiple random samples with
replacement from the original dataset. The sketch after this list illustrates all
three resampling ideas.
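This is a minimal sketch assuming the imbalanced-learn (imblearn) and scikit-learn packages are installed; the 90/10 synthetic dataset is purely illustrative:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced dataset: roughly 90% majority, 10% minority
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:      ", Counter(y))

# Under-sampling: shrink the majority class to match the minority class
X_u, y_u = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("under-sampled: ", Counter(y_u))

# Over-sampling with SMOTE: synthesize new minority-class instances
X_o, y_o = SMOTE(random_state=42).fit_resample(X, y)
print("over-sampled:  ", Counter(y_o))

# Bootstrapping: draw n rows with replacement from the original data
X_b, y_b = resample(X, y, replace=True, n_samples=len(y), random_state=42)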
3. Imbalanced Data Handling
• Imbalanced data refers to a situation where the distribution of classes in a
dataset is not equal.
• In other words, one class (the minority class) has significantly fewer instances
than another class (the majority class).
• This imbalance can pose challenges for machine learning algorithms, especially
when the goal is to build a model that can accurately predict or classify
instances from the minority class.
• Dealing with imbalanced data is an important aspect of data preprocessing. Here
are some common strategies to address imbalanced data:
⁃ Oversampling the Minority Class
⁃ Under-sampling the Majority Class
⁃ Using Different Evaluation Metrics (a sketch of this follows below)
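For the evaluation-metrics point, here is a hedged sketch assuming scikit-learn; it also uses class weighting (a related mitigation, not one of the three strategies listed above) so that the report has a reasonably balanced model to evaluate:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic dataset where 95% of instances belong to one class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes minority-class mistakes more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Per-class precision, recall, and F1 are more informative than raw accuracy
print(classification_report(y_te, clf.predict(X_te)))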
4. Dimensionality Reduction
• Dimensionality reduction is a technique used in machine learning and data
analysis to reduce the number of input features (dimensions) in a dataset.
• The primary goal is to simplify the dataset while retaining as much relevant
information as possible.
• High-dimensional data, where the number of features is large, can present
challenges such as increased computational complexity, the risk of overfitting,
and difficulty in visualization.
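To make this concrete, here is a hedged sketch assuming scikit-learn: PCA compresses the 64 pixel features of the digits dataset while retaining 95% of the variance.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)     # 64 pixel features per image
pca = PCA(n_components=0.95)            # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # far fewer columns than the original 64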
5. Feature Engineering
• Feature engineering is the pre-processing step of machine learning that
transforms raw data into features which can be used to create a predictive
model using machine learning or statistical modelling.
• In other words, it is the process of selecting, extracting, and transforming the most
relevant features from the available data to build more accurate and efficient
machine learning models.
• Feature engineering in machine learning aims to improve the performance of
models.
What is Feature?
• Features are the attributes or dimensions of your data that contain relevant
information and are crucial for the model to learn patterns and relationships
and to make decisions.
• In short, a feature is an individual measurable property within a recorded dataset.
• A model for predicting the size of a shirt for a person may have features such as
age, gender, height, weight, etc.
Common Feature Types
• Numerical Features: Values with numeric types (int, float, etc.). Examples: age,
salary, height.
• Categorical Features: Features that can take one of a limited number of values.
Examples: gender (male, female, X), color (red, blue, green).
• Ordinal Features: Categorical features that have a clear ordering. Examples: T-
shirt size (S, M, L, XL).
• Binary Features: A special case of categorical features with only two categories.
Examples: is_smoker (yes, no), has_subscription (true, false).
• Text Features: Features that contain textual data. Textual data typically requires
special preprocessing steps (like tokenization) to transform it into a format suitable
for machine learning models.
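Most models expect numeric input, so categorical, ordinal, and binary features are usually encoded first. Below is a minimal pandas sketch on a hypothetical table (text features are omitted, since they need tokenization as noted above):

import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 41],                 # numerical
    "color": ["red", "blue", "green"],   # categorical
    "size": ["S", "L", "M"],             # ordinal
    "is_smoker": [True, False, False],   # binary
})

# Categorical -> one-hot columns; ordinal -> ordered integer codes (S=0, M=1, L=2, XL=3)
df = pd.get_dummies(df, columns=["color"])
df["size"] = pd.Categorical(df["size"], categories=["S", "M", "L", "XL"], ordered=True).codes
print(df)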
5. Feature Engineering
• The success of machine learning models heavily depends on the quality of the
features used to train them.
• Feature engineering involves a set of techniques that enable us to create new
features by combining or transforming the existing ones.
• These techniques help to highlight the most important patterns and relationships in
the data, which in turn helps the machine learning model to learn from the data
more effectively.
5. Feature Engineering
Process Involved in Feature Engineering
• Feature engineering in machine learning mainly consists of five processes:
a) Feature Creation
b) Feature Transformation
c) Feature Extraction
d) Feature Selection
e) Feature Scaling
• It is an iterative process that requires experimentation and testing to find the best
combination of features for a given problem.
a. Feature Creation
• Feature Creation is the process of generating new features based on domain
knowledge or by observing patterns in the data.
• Feature creation is finding the most useful variables to be used in a predictive
model.
• The process is subjective, and it requires human creativity and intervention.
• New features are created by combining existing features using operations such
as addition, subtraction, and ratios, and these new features offer great flexibility.
Types of Feature Creation
1. Domain-Specific: Creating new features based on domain knowledge, such as
creating features based on business rules or industry standards.
2. Data-Driven: Creating new features by observing patterns in the data, such as
calculating aggregations or creating interaction features.
3. Synthetic: Generating new features by combining existing features or
synthesizing new data points.
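The sketch below illustrates all three types on a made-up sales table; every column name and derived feature here is hypothetical:

import pandas as pd

df = pd.DataFrame({
    "total_price": [120.0, 80.0, 300.0],
    "quantity": [4, 2, 10],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-06-20", "2024-02-11"]),
})

# Domain-specific: unit price is a standard business quantity
df["unit_price"] = df["total_price"] / df["quantity"]

# Data-driven: a date component observed to matter in the data
df["signup_month"] = df["signup_date"].dt.month

# Synthetic: an interaction feature combining two existing columns
df["price_x_quantity"] = df["total_price"] * df["quantity"]
print(df)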
b. Feature Transformation
• Feature transformation is the process of converting features into a common
structure or scale.
• It involves converting raw data into a format suitable for analysis or machine
learning.
• For example, it ensures that the model can flexibly take input from a variety of
data sources and that all variables are on the same scale, making the model
easier to understand.
• It improves the model's accuracy and keeps all features within an acceptable
range, avoiding computational errors.
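As one small example of a transformation (assuming numpy and pandas, with made-up income values), a log transform pulls a heavily skewed feature onto a scale comparable with other variables:

import numpy as np
import pandas as pd

# Hypothetical, heavily skewed feature
df = pd.DataFrame({"income": [20_000, 45_000, 1_200_000]})

# log1p compresses large values while keeping the ordering intact
df["log_income"] = np.log1p(df["income"])
print(df)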
c. Feature Extraction
• Feature extraction is a process in machine learning and data analysis that
involves identifying and extracting relevant features from raw data.
• These features are later used to create a more informative dataset, which can
be further utilized for various tasks such as classification, prediction, and
clustering.
• Feature extraction aims to reduce the dimensionality, complexity, and
redundancy of the data, while preserving the essential information and
relationships.
c. Feature Extraction
• Here are some common methods used for feature extraction:
⁃ Principal Component Analysis (PCA)
⁃ Linear Discriminant Analysis (LDA)
⁃ Independent Component Analysis (ICA)
⁃ Feature Hashing
⁃ Kernel Principal Component Analysis (Kernel PCA)
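As a sketch of one of these methods (assuming scikit-learn), LDA projects the four iris measurements onto two components that best separate the three classes:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                 # 4 raw measurements per flower
lda = LinearDiscriminantAnalysis(n_components=2)  # at most (n_classes - 1) components
X_new = lda.fit_transform(X, y)
print(X.shape, "->", X_new.shape)                 # (150, 4) -> (150, 2)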
d. Feature Selection
• Feature Selection is the process of selecting a subset of relevant features
from the dataset to be used in a machine-learning model.
• It is an important step in the feature engineering process as it can have a
significant impact on the model’s performance.
• Common feature selection methods are:
⁃ Filter Methods
⁃ Wrapper Methods
⁃ Embedded Methods
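For instance, a filter method can be sketched with scikit-learn's SelectKBest, which scores each feature independently with an ANOVA F-test and keeps the top k (k=10 here is an arbitrary choice):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)          # 30 features
selector = SelectKBest(score_func=f_classif, k=10)  # score each feature independently
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)              # (569, 30) -> (569, 10)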
e. Feature Scaling or Feature Normalization

• Feature Scaling is the process of transforming the features so that they have a
similar scale.
• The main objective of feature scaling is to bring all features to a similar scale,
preventing one feature from dominating the others due to differences in their
magnitudes.
• There are several methods for feature scaling, and the choice of method often
depends on the characteristics of the data and the requirements of the machine
learning algorithm. Here are some common techniques:
⁃ Min-Max Scaling (Normalization)
⁃ Standardization (Z-score normalization)
⁃ Robust Scaling
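The sketch below (assuming scikit-learn) applies all three scalers to a tiny made-up matrix whose two columns differ greatly in magnitude:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10_000.0]])

print(MinMaxScaler().fit_transform(X))    # maps each feature into [0, 1]
print(StandardScaler().fit_transform(X))  # zero mean, unit variance per feature
print(RobustScaler().fit_transform(X))    # median/IQR based, less sensitive to outliers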
Any Questions?
