
Data Preprocessing and Cleaning

Data preprocessing involves transforming a raw dataset into an understandable format. It is a fundamental stage in data mining that improves the quality and efficiency of later analysis, and the preprocessing methods chosen directly affect the outcomes of any analytic algorithm.
1. DATA CLEANING
The first step of data preprocessing is data cleaning. Most of the data we work with today is not clean and requires a substantial amount of cleaning. Some datasets have missing values and some contain junk data. If these missing values and inconsistencies are not handled properly, our model will not give accurate results.
So, before getting into the nitty-gritty details of data cleaning, let's build a high-level understanding of the problems we face with real-world data.
MISSING VALUES:
Handling missing values is crucial when building a model. If they are not handled properly, they can break the model completely by making it predict inaccurately. Let's check the example below to understand this better.
The dataset below can be used to predict the graduate admission of students. However, it has some missing values that are critical for predicting their admissions.

As you can see above, some of the records are missing the GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA and Research values, which are important features for predicting whether a student will be admitted.
HANDLING MISSING DATA:
In order to build a robust model that handles complex tasks, we need to handle missing data efficiently. There are many ways of handling missing data; some of them are as follows:
METHOD 1. REMOVING THE DATA
The first step is to check whether the dataset has any missing values, since a model cannot accept missing values. One common and easy method of handling them is to delete the entire row if it contains any missing value, or to delete an entire column if 70 to 75% of its data is missing. This percentage limit is not fixed and mostly depends on the kind of data we are dealing with and what kind of features are in the dataset.
The advantage of this method is that it is a quick way of fixing the missing-values issue. But it is not always the go-to method, as you may end up losing critical information by deleting rows or features.
1. Load the dataset
In [1]:

import pandas as pd
import numpy as np
In [2]:

df = pd.read_csv("Banking_Marketing.csv")
In [3]:

df.head()

2. Check the datatype for each column

df.dtypes

age float64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration float64
campaign int64
pdays int64
previous int64
poutcome object
emp_var_rate float64
cons_price_idx float64
cons_conf_idx float64
euribor3m float64
nr_employed float64
y int64
dtype: object

3. Finding the missing values in each column

df.isna().sum()
age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

4. Dropping/Deleting the rows containing the missing values.

datadrop = df.dropna()

5. Checking to see the NA's after deletion of the rows.

datadrop.isna().sum()
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
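The steps above drop rows with dropna(). The column-level rule mentioned at the start of Method 1 (dropping a column when roughly 70 to 75% of its values are missing) can be sketched as follows; this is a minimal illustration on the same Banking_Marketing.csv file, and the 70% threshold is only an example, not a fixed rule.

import pandas as pd

df = pd.read_csv("Banking_Marketing.csv")

# Fraction of missing values per column
missing_ratio = df.isna().mean()

# Drop columns where more than ~70% of the values are missing
threshold = 0.70
cols_to_drop = missing_ratio[missing_ratio > threshold].index
datadrop_cols = df.drop(columns=cols_to_drop)

print("Dropped columns:", list(cols_to_drop))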

METHOD 2: MEAN/MEDIAN/MODE IMPUTATION


In this method we use the mean, median, or mode to replace missing values.
1. For numerical data, we can compute the column's mean or median and use the result to replace missing values.
2. For categorical (non-numerical) data, we can compute the column's mode and use it to replace missing values.
This process is known as mean/median/mode imputation.
The advantage of this method is that we don't remove any data, which prevents data loss.
The drawback is that you don't know how accurate the mean, median, or mode will be as a replacement in a given situation.
Method 2: Imputing Missing Data
1. Loading the data
In [2]:

import pandas as pd
import numpy as np
In [3]:

df = pd.read_csv("Banking_Marketing.csv")
In [4]:

df.head()

df.isna().sum()
age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

2. Calculate the mean of the age column


In [6]:

age_mean = df.age.mean()

In [7]:
print("Mean of age column: ",age_mean)
Mean of age column: 40.023812413525256

3. Impute the missing data in the age column with the mean age value
In [8]:
df.age.fillna(age_mean,inplace=True)

df.isna().sum()

age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

4. Checking all the records in the dataset for which the 'duration' column is NA
df[df['duration'].isnull()]

5. Sort the values of duration in descending order to inspect the largest values


In [11]:

df['duration'].sort_values(ascending=False).head()
Out[11]:
7802 4918.0
18610 4199.0
32880 3785.0
1974 3643.0
10633 3631.0
Name: duration, dtype: float64

6. Calculate the median value of the duration column

duration_med= df.duration.median()

print("The median of duration is: ",duration_med)

The median of duration is: 180.0

7. Imputing the median value of duration to all the NA fields in duration


df.duration.fillna(duration_med,inplace=True)
df.isna().sum()

age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

8. In the above steps, both the 'duration' and 'age' columns were numerical, so we used the mean and median to impute the missing values. However, 'contact' is a categorical column, so we will use the mode here to impute its missing values.

df['contact'].unique()

out:
array(['cellular', 'telephone', nan], dtype=object)

contact_mode = df.contact.mode()[0]
print("Mode for contact: ",contact_mode)

Mode for contact: cellular

9. Impute the mode value of 'contact' into all the NA fields in 'contact'


df.contact.fillna(contact_mode,inplace=True)
In [20]:
df.isna().sum()
Out[20]:
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
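The walkthrough above imputes each column manually with fillna. As an alternative sketch (not part of the original steps), scikit-learn's SimpleImputer applies the same mean and most-frequent strategies in one place; it assumes scikit-learn is installed and uses the same Banking_Marketing.csv file.

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("Banking_Marketing.csv")

# Mean imputation for a numerical column
num_imputer = SimpleImputer(strategy='mean')
df[['age']] = num_imputer.fit_transform(df[['age']])

# Most-frequent (mode) imputation for a categorical column
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['contact']] = cat_imputer.fit_transform(df[['contact']])

print(df[['age', 'contact']].isna().sum())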
Understanding Data Processing

Data processing is the task of converting data from a given form into a much more usable and desired form, i.e. making it more meaningful and informative. Using machine learning algorithms, mathematical modeling, and statistical knowledge, this entire process can be automated. The output of this process can take any desired form, such as graphs, videos, charts, tables, or images, depending on the task being performed and the requirements of the machine. This might seem simple, but for massive organizations like Twitter and Facebook, administrative bodies like Parliament and UNESCO, and health-sector organizations, this entire process needs to be performed in a very structured manner. So, the steps to perform are as follows:
Data processing is a crucial step in the machine learning (ML)
pipeline, as it prepares the data for use in building and training ML
models. The goal of data processing is to clean, transform, and
prepare the data in a format that is suitable for modeling.
The main steps involved in data processing typically
include:
1. Data collection: This is the process of gathering data from various sources, such as sensors, databases, or other systems. The data may be structured or unstructured, and may come in various formats such as text, images, or audio.
2. Data preprocessing: This step involves cleaning, filtering, and transforming the data to make it suitable for further analysis. This may include removing missing values, scaling or normalizing the data, or converting it to a different format (a short scaling sketch follows this list).
3. Data analysis: In this step, the data is analyzed using various techniques such as statistical analysis, machine learning algorithms, or data visualization. The goal of this step is to derive insights or knowledge from the data.
4. Data interpretation: This step involves interpreting the results of the data analysis and drawing conclusions based on the insights gained. It may also involve presenting the findings in a clear and concise manner, such as through reports, dashboards, or other visualizations.
5. Data storage and management: Once the data has been processed and analyzed, it must be stored and managed in a way that is secure and easily accessible. This may involve storing the data in a database, cloud storage, or other systems, and implementing backup and recovery strategies to protect against data loss.
6. Data visualization and reporting: Finally, the results of the data analysis are presented to stakeholders in a format that is easily understandable and actionable. This may involve creating visualizations, reports, or dashboards that highlight key findings and trends in the data.
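As a small illustration of the scaling and normalization mentioned in step 2, the sketch below standardizes and min-max normalizes a single made-up numerical feature with scikit-learn; the values are hypothetical and only for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Illustrative numerical feature (hypothetical values)
X = np.array([[18.0], [25.0], [40.0], [60.0], [95.0]])

# Standardization: rescale to zero mean and unit variance
print(StandardScaler().fit_transform(X).ravel())

# Min-max normalization: rescale to the [0, 1] range
print(MinMaxScaler().fit_transform(X).ravel())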

• Collection:
The most crucial step when starting with ML is to have data of good quality and accuracy. Data can be collected from any authenticated source like data.gov.in, Kaggle or the UCI dataset repository. For example, while preparing for a competitive exam, students study from the best study material they can access so that they learn the best and obtain the best results. In the same way, high-quality and accurate data will make the learning process of the model easier and better, and at testing time the model will yield state-of-the-art results.
A huge amount of capital, time and resources is consumed in collecting data. Organizations or researchers have to decide what kind of data they need to execute their tasks or research.
Example: building a facial expression recognizer needs numerous images covering a variety of human expressions. Good data ensures that the results of the model are valid and can be trusted.

• Preparation:
The collected data can be in a raw form which can't be directly fed to the machine. So, this is the process of collecting datasets from different sources, analyzing them, and then constructing a new dataset for further processing and exploration. This preparation can be performed either manually or automatically. Data can also be converted into numeric form, which speeds up the model's learning.
Example: an image can be converted to a matrix of N x N dimensions, where the value of each cell is an image pixel (see the sketch after this list).
• Input:
The prepared data may still be in a form that is not machine-readable, so conversion algorithms are needed to convert it into a readable form. Executing this task requires high computation and accuracy.
Example: data can be collected through sources like the MNIST digit dataset (images), Twitter comments, audio files, and video clips.
• Processing:
This is the stage where algorithms and ML techniques are required to perform the instructions provided over a large volume of data with accuracy and optimal computation.
• Output:
In this stage, results are produced by the machine in a meaningful manner which can be inferred easily by the user. Output can be in the form of reports, graphs, videos, etc.
• Storage:
This is the final step, in which the obtained output, the data model, and all other useful information are saved for future use.
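As a minimal sketch of the Preparation example above (an image as an N x N matrix of pixel values), the snippet below uses scikit-learn's built-in digits dataset; the choice of dataset is only an assumption for illustration, and any image source would do.

from sklearn.datasets import load_digits

# Each digit image is stored as an 8 x 8 matrix of pixel intensities
digits = load_digits()
image = digits.images[0]
print(image.shape)  # (8, 8)

# Scaling pixel values to the [0, 1] range can speed up a model's learning
image_scaled = image / image.max()
print(image_scaled)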

Advantages of data processing in Machine Learning:


1. Improved model performance: Data processing helps
improve the performance of the ML model by cleaning and
transforming the data into a format that is suitable for
modeling.
2. Better representation of the data: Data processing allows
the data to be transformed into a format that better
represents the underlying relationships and patterns in the
data, making it easier for the ML model to learn from the
data.
3. Increased accuracy: Data processing helps ensure that the
data is accurate, consistent, and free of errors, which can
help improve the accuracy of the ML model.

Disadvantages of data processing in Machine Learning:

1. Time-consuming: Data processing can be a time-consuming task, especially for large and complex datasets.
2. Error-prone: Data processing can be error-prone, as it
involves transforming and cleaning the data, which can
result in the loss of important information or the
introduction of new errors.
3. Limited understanding of the data: Data processing can
lead to a limited understanding of the data, as the
transformed data may not be representative of the
underlying relationships and patterns in the data.

Overview of Data Cleaning


Data cleaning is one of the important parts of machine learning. It
plays a significant part in building a model. It surely isn’t the
fanciest part of machine learning and at the same time, there
aren’t any hidden tricks or secrets to uncover. However, the
success or failure of a project relies on proper data cleaning.
Professional data scientists usually invest a very large portion of
their time in this step because of the belief that “Better data
beats fancier algorithms”.
If we have a well-cleaned dataset, there is a good chance of achieving good results even with simple algorithms, which can be very beneficial, especially in terms of computation when the dataset is large. Obviously, different types of data will require different types of cleaning. However, this systematic approach can always serve as a good starting point.
Steps Involved in Data Cleaning
Data cleaning is a crucial step in the machine learning (ML)
pipeline, as it involves identifying and removing any missing,
duplicate, or irrelevant data. The goal of data cleaning is to ensure
that the data is accurate, consistent, and free of errors, as incorrect
or inconsistent data can negatively impact the performance of the
ML model.
Data cleaning, also known as data cleansing or data
preprocessing, is a crucial step in the data science pipeline that
involves identifying and correcting or removing errors,
inconsistencies, and inaccuracies in the data to improve its quality
and usability. Data cleaning is essential because raw data is often
noisy, incomplete, and inconsistent, which can negatively impact
the accuracy and reliability of the insights derived from it.

The following are the most common steps involved in data cleaning:

• Import the necessary libraries
• Load the dataset
• Check the data information using df.info()
Processing CSV Data
import pandas as pd
import numpy as np

# Load the dataset


df = pd.read_csv('input.csv')
df.head()

id name salary start_date dept
0 1 Rick 623.30 2012-01-01 IT
1 2 Dan 515.20 2013-09-23 Operations
2 3 Tusar 611.00 2014-11-15 IT
3 4 Ryan 729.00 2014-05-11 HR
4 5 Gary 843.25 2015-03-27 Finance
5 6 Rasmi 578.00 2013-05-21 IT
6 7 Pranab 632.80 2013-07-30 Operations
7 8 Guru 722.50 2014-06-17 Finance
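The step list above also mentions df.info(); a quick structural check on the same frame (the exact output depends on the file) could look like this:

# Column names, non-null counts, dtypes and memory usage
df.info()

# Number of rows and columns
print(df.shape)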

Reading Specific Rows

The read_csv function of the pandas library can also be used to read specific rows for a given column. We slice the result returned by read_csv, as shown below, to get the first 5 rows of the column named salary.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Slice the result for the first 5 rows
print(data[0:5]['salary'])
When we execute the above code, it produces the following
result.
0 623.30
1 515.20
2 611.00
3 729.00
4 843.25
Name: salary, dtype: float64
Reading Specific Columns
The read_csv function of the pandas library can also be used to read specific columns. We use the multi-axes indexer .loc[] for this purpose. Here we display the salary and name columns for all the rows.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function
print(data.loc[:, ['salary', 'name']])

When we execute the above code, it produces the following result.

salary name
0 623.30 Rick
1 515.20 Dan
2 611.00 Tusar
3 729.00 Ryan
4 843.25 Gary
5 578.00 Rasmi
6 632.80 Pranab
7 722.50 Guru
Reading Specific Columns and Rows
The read_csv function of the pandas library can also be used to read specific columns and specific rows. We use the multi-axes indexer .loc[] for this purpose. Here we display the salary and name columns for a selection of rows.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function
print(data.loc[[1, 3, 5], ['salary', 'name']])
When we execute the above code, it produces the following
result.
salary name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi
Reading Specific Columns for a Range of Rows
The read_csv function of the pandas library can also be used to read specific columns and a range of rows. We use the multi-axes indexer .loc[] for this purpose. Here we display the salary and name columns for rows 2 through 6.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function
print(data.loc[2:6, ['salary', 'name']])
When we execute the above code, it produces the following
result.
salary name
2 611.00 Tusar
3 729.00 Ryan
4 843.25 Gary
5 578.00 Rasmi
6 632.80 Pranab
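As a side note (not shown in the examples above), read_csv can also restrict what is read at load time through its usecols and nrows parameters; a minimal sketch with the same hypothetical 'path/input.csv' file:

import pandas as pd

# Read only the salary and name columns, and only the first 5 rows
data = pd.read_csv('path/input.csv', usecols=['salary', 'name'], nrows=5)
print(data)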

Detect and Remove the Outliers using Python

An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and removing them is the same as removing any other data item from a pandas DataFrame.

Dataset Used For Outlier Detection

The dataset used in this article is the Diabetes dataset, and it is preloaded in the sklearn library.
Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset


diabetics = load_diabetes()

# Create the dataframe


column_name = diabetics.feature_names
df_diabetics = pd.DataFrame(diabetics.data)
df_diabetics.columns = column_name
df_diabetics.head()
Outliers can be detected using visualization, implementing
mathematical formulas on the dataset, or using the statistical
approach. All of these are discussed below.

Outliers Visualization

Visualizing Outliers Using Box Plot

A box plot, also known as a whisker plot, displays a summary of a set of data values, with properties such as the minimum, first quartile, median, third quartile and maximum. In the box plot, a box is drawn from the first quartile to the third quartile, and a vertical line goes through the box at the median. Here the x-axis denotes the data being plotted while the y-axis shows the frequency distribution.
A box plot is primarily used to indicate whether a distribution is skewed and whether there are potential unusual observations (also called outliers) present in the data set. Box plots are also very useful when large numbers of data sets are involved or compared.

Parts of Box Plots

A box plot is described by the minimum, maximum, first quartile, third quartile, median and any outliers.
Minimum: The minimum value in the given dataset

First Quartile (Q1): The first quartile is the median of the lower half of
the data set.

Median: The median is the middle value of the dataset, which divides the
given dataset into two equal parts. The median is considered as the
second quartile.

Third Quartile (Q3): The third quartile is the median of the upper half of
the data.

Maximum: The maximum value in the given dataset.

Apart from these five terms, the other terms used in the box plot are:

Interquartile Range (IQR): The difference between the third quartile


and first quartile is known as the interquartile range. (i.e.) IQR = Q3-Q1

Outlier: The data that falls on the far left or right side of the ordered data
is tested to be the outliers. Generally, the outliers fall more than the
specified distance from the first and third quartile.

(i.e.) Outliers are greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR.

import seaborn as sns


sns.boxplot(df_diabetics['bmi'])
Outliers present in the bmi column

In the above graph, we can see a few values at the high end of the bmi column acting as outliers. The code below lists the positions of the points with bmi above 0.12.
# Position of the Outlier
import numpy as np
print(np.where(df_diabetics['bmi']>0.12))

output:

(array([ 32, 145, 256, 262, 366, 367, 405]),)

Visualizing Outliers Using a Scatter Plot

A scatter plot is used when you have paired numerical data, when your dependent variable has multiple values for each value of the independent variable, or when you are trying to determine the relationship between two variables. In the process of using a scatter plot, it can also serve for outlier detection.

# Scatter plot
fig, ax = plt.subplots(figsize = (6,4))
ax.scatter(df_diabetics['bmi'],df_diabetics['bp'])

# x-axis label
ax.set_xlabel('(body mass index of people)')

# y-axis label
ax.set_ylabel('(bp of the people )')
plt.show()

Looking at the graph, we can see that most of the data points lie in the bottom-left region, while a few points sit far away towards the top-right corner of the graph. Those points in the top-right corner can be regarded as outliers. As an approximation, the code below flags them using thresholds on the bmi and bp columns.
Outliers in BMI and BP Column Combined
Python3
# Position of the Outlier
print(np.where((df_diabetics['bmi']>0.12) &
(df_diabetics['bp']<0.8)))

Output:
(array([ 32, 145, 256, 262, 366, 367, 405]),)

IQR (Inter Quartile Range)

The Inter Quartile Range approach to finding outliers is the most commonly used and most trusted approach in the research field.
IQR = Quartile3 – Quartile1

Python3
# IQR

Q1 = np.percentile(df_diabetics['bmi'], 25,
method='midpoint')

Q3 = np.percentile(df_diabetics['bmi'], 75,
method='midpoint')

IQR = Q3 - Q1

print(IQR)

Output:
0.06520763046978838
Syntax: numpy.percentile(arr, n, axis=None, out=None)
Parameters:
arr : input array.
n : percentile value.
To flag outliers, base values are defined above and below the dataset's normal range, namely the upper and lower bounds (a 1.5*IQR margin is used):
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
In this formula, the IQR is scaled by a factor of 1.5; for a Gaussian distribution, these bounds sit roughly 2.7 standard deviations from the mean.

Python3
# Above the upper bound
upper = Q3 + 1.5*IQR
upper_array = np.array(df_diabetics['bmi'] >= upper)
print("Upper Bound:", upper)
print(upper_array.sum())

# Below the lower bound
lower = Q1 - 1.5*IQR
lower_array = np.array(df_diabetics['bmi'] <= lower)
print("Lower Bound:", lower)
print(lower_array.sum())

Output:
Upper Bound: 0.12879000811776306
3
Lower Bound: -0.13204051376139045
0

Removing the outliers

To remove an outlier, one must follow the same process as removing any entry from the dataset, using its exact position, because all of the detection methods above return the list of data items that satisfy the outlier definition for that method.
dataframe.drop(row_index, inplace=True)
The above code can be used to drop a row from the dataset given the row index to be dropped. inplace=True tells pandas to make the change in the original DataFrame. row_index can be a single value, a list of values, or a NumPy array, but it must be one-dimensional.

Example (assuming outlier_rows holds the row positions returned by np.where above):
df_diabetics.drop(outlier_rows, inplace=True)
Full Code: Detecting the outliers using IQR and removing them.
Python3
# Importing

import sklearn

from sklearn.datasets import load_diabetes

import pandas as pd

import numpy as np

# Load the dataset


diabetes = load_diabetes()

# Create the dataframe

column_name = diabetes.feature_names

df_diabetes = pd.DataFrame(diabetes.data)

df_diabetes.columns = column_name

df_diabetes.head()

print("Old Shape: ", df_diabetes.shape)

''' Detection '''

# IQR

# Calculate the upper and lower limits

Q1 = df_diabetes['bmi'].quantile(0.25)

Q3 = df_diabetes['bmi'].quantile(0.75)

IQR = Q3 - Q1

lower = Q1 - 1.5*IQR

upper = Q3 + 1.5*IQR

# Create arrays of Boolean values indicating the outlier rows

upper_array = np.where(df_diabetes['bmi']>=upper)[0]

lower_array = np.where(df_diabetes['bmi']<=lower)[0]

# Removing the outliers

df_diabetes.drop(index=upper_array, inplace=True)

df_diabetes.drop(index=lower_array, inplace=True)

# Print the new shape of the DataFrame

print("New Shape: ", df_diabetes.shape)

Output:
Old Shape: (442, 10)
New Shape: (439, 10)
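The statistical approach mentioned earlier (alongside visualization and the IQR rule) can be sketched with Z-scores; here |z| > 3 is used as a common, but not universal, cut-off, and df_diabetes is the DataFrame built in the code above.

import numpy as np

# Z-score of each bmi value: distance from the column mean in standard deviations
bmi = df_diabetes['bmi']
z_scores = (bmi - bmi.mean()) / bmi.std()

# Flag rows whose absolute Z-score exceeds 3 as potential outliers
outlier_positions = np.where(np.abs(z_scores) > 3)[0]
print("Potential outliers (row positions):", outlier_positions)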
