
Data Preprocessing and Cleaning

Data preprocessing involves transforming a raw dataset into an understandable format. It is a fundamental stage in data mining that improves the quality and efficiency of later analysis, and the preprocessing methods chosen directly affect the outcomes of any analytic algorithm.
1. DATA CLEANING
The first step of data preprocessing is data cleaning. Most of the data we work with today is not clean and requires a substantial amount of cleaning. Some datasets have missing values and some contain junk data. If these missing values and inconsistencies are not handled properly, our model will not give accurate results.
So, before getting into the nitty-gritty details of data cleaning, let's build a high-level understanding of the problems we face with real-world data.
MISSING VALUES:
Handling missing values is crucial when building a model. If they are not handled properly, they can break the model completely by making it predict inaccurately. Let's check the example below to understand this better.
The dataset below can be used to predict the graduate admission of students. However, it has some missing values that are critical for predicting their admissions.

As you can see above, some of the records are missing the GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA and Research values, which are important features for predicting whether a student will be admitted.
HANDLING MISSING DATA:
In order to build a robust model that handles complex tasks, we need to handle missing data efficiently. There are many ways of handling missing data; some of them are as follows:
METHOD 1. REMOVING THE DATA
The first step is to check whether the dataset has any missing values, since a model cannot accept missing values. One common and easy method of handling them is to delete the entire row if it contains any missing value, or to delete an entire column if 70 to 75% of its data is missing. This percentage limit is not fixed and mostly depends on the kind of data we are dealing with and what kind of features are in the dataset.
The advantage of this method is that it is a quick way of fixing the missing-values issue. But it is not always the go-to method, as you may end up losing critical information by deleting rows or features.
1. Load the dataset
In [1]:

import pandas as pd
import numpy as np
In [2]:

df = pd.read_csv("Banking_Marketing.csv")
In [3]:

df.head()

2. Check the datatype for each column

df.dtypes

age float64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration float64
campaign int64
pdays int64
previous int64
poutcome object
emp_var_rate float64
cons_price_idx float64
cons_conf_idx float64
euribor3m float64
nr_employed float64
y int64
dtype: object

3. Finding the missing values in each column

df.isna().sum()
age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

4. Dropping/Deleting the rows containing the missing values.

datadrop = df.dropna()

5. Checking to see the NA's after deletion of the rows.

datadrop.isna().sum()
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
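The steps above drop rows with dropna(). The column-level rule mentioned at the start of Method 1 (dropping a column when roughly 70 to 75% of its values are missing) can be sketched as follows; this is a minimal illustration on the same Banking_Marketing.csv file, and the 70% threshold is only an example, not a fixed rule.

import pandas as pd

df = pd.read_csv("Banking_Marketing.csv")

# Fraction of missing values per column
missing_ratio = df.isna().mean()

# Drop columns where more than ~70% of the values are missing
threshold = 0.70
cols_to_drop = missing_ratio[missing_ratio > threshold].index
datadrop_cols = df.drop(columns=cols_to_drop)

print("Dropped columns:", list(cols_to_drop))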

METHOD 2: MEAN/MEDIAN/MODE IMPUTATION


In this method we use the mean, median, or mode to replace missing values.
1. For numerical data, we can compute the column's mean or median and use the result to replace missing values.
2. For categorical (non-numerical) data, we can compute the column's mode and use it to replace missing values.
This process is known as mean/median/mode imputation.
The advantage of this method is that we don't remove any data, which prevents data loss.
The drawback is that you don't know how accurate the mean, median, or mode will be as a replacement in a given situation.
Method 2: Imputing Missing Data
1. Loading the data
In [2]:

import pandas as pd
import numpy as np
In [3]:

df = pd.read_csv("Banking_Marketing.csv")
In [4]:

df.head()

df.isna().sum()
age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

2. Calculate the mean of the age column


In [6]:

age_mean = df.age.mean()

In [7]:
print("Mean of age column: ",age_mean)
Mean of age column: 40.023812413525256

3. Impute the missing data in the age column with the mean age value
In [8]:
df.age.fillna(age_mean,inplace=True)

df.isna().sum()

age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

4. Checking all the records in the dataset for which the 'duration' column is NA
df[df['duration'].isnull()]

5. Sort the values of duration in descending order to inspect the largest values


In [11]:

df['duration'].sort_values(ascending=False).head()
Out[11]:
7802 4918.0
18610 4199.0
32880 3785.0
1974 3643.0
10633 3631.0
Name: duration, dtype: float64

6. Calculate the median value of the duration column

duration_med= df.duration.median()

print("The median of duration is: ",duration_med)

The median of duration is: 180.0

7. Imputing the median value of duration to all the NA fields in duration


df.duration.fillna(duration_med,inplace=True)
df.isna().sum()

age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

8. In the above steps, both the 'duration' and 'age' columns were numerical, so we used the mean and median to impute the missing values. However, 'contact' is a categorical column, so we will use the mode here to impute its missing values.

df['contact'].unique()

out:
array(['cellular', 'telephone', nan], dtype=object)

contact_mode = df.contact.mode()[0]
print("Mode for contact: ",contact_mode)

Mode for contact: cellular

9. Impute the mode value of 'contact' into all the NA fields in 'contact'


df.contact.fillna(contact_mode,inplace=True)
In [20]:
df.isna().sum()
Out[20]:
age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64
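The walkthrough above imputes each column manually with fillna. As an alternative sketch (not part of the original steps), scikit-learn's SimpleImputer applies the same mean and most-frequent strategies in one place; it assumes scikit-learn is installed and uses the same Banking_Marketing.csv file.

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("Banking_Marketing.csv")

# Mean imputation for a numerical column
num_imputer = SimpleImputer(strategy='mean')
df[['age']] = num_imputer.fit_transform(df[['age']])

# Most-frequent (mode) imputation for a categorical column
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['contact']] = cat_imputer.fit_transform(df[['contact']])

print(df[['age', 'contact']].isna().sum())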
Understanding Data Processing

Data processing is the task of converting data from a given form into a much more usable and desired form, i.e. making it more meaningful and informative. Using machine learning algorithms, mathematical modeling, and statistical knowledge, this entire process can be automated. The output of this process can take any desired form, such as graphs, videos, charts, tables, or images, depending on the task being performed and the requirements of the machine. This might seem simple, but for massive organizations like Twitter and Facebook, administrative bodies like Parliament and UNESCO, and health-sector organizations, this entire process needs to be performed in a very structured manner. So, the steps to perform are as follows:
Data processing is a crucial step in the machine learning (ML)
pipeline, as it prepares the data for use in building and training ML
models. The goal of data processing is to clean, transform, and
prepare the data in a format that is suitable for modeling.
The main steps involved in data processing typically
include:
1. Data collection: This is the process of gathering data from various sources, such as sensors, databases, or other systems. The data may be structured or unstructured, and may come in various formats such as text, images, or audio.
2. Data preprocessing: This step involves cleaning, filtering, and transforming the data to make it suitable for further analysis. This may include removing missing values, scaling or normalizing the data, or converting it to a different format (a short scaling sketch follows this list).
3. Data analysis: In this step, the data is analyzed using various techniques such as statistical analysis, machine learning algorithms, or data visualization. The goal of this step is to derive insights or knowledge from the data.
4. Data interpretation: This step involves interpreting the results of the data analysis and drawing conclusions based on the insights gained. It may also involve presenting the findings in a clear and concise manner, such as through reports, dashboards, or other visualizations.
5. Data storage and management: Once the data has been processed and analyzed, it must be stored and managed in a way that is secure and easily accessible. This may involve storing the data in a database, cloud storage, or other systems, and implementing backup and recovery strategies to protect against data loss.
6. Data visualization and reporting: Finally, the results of the data analysis are presented to stakeholders in a format that is easily understandable and actionable. This may involve creating visualizations, reports, or dashboards that highlight key findings and trends in the data.
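As a small illustration of the scaling and normalization mentioned in step 2, the sketch below standardizes and min-max normalizes a single made-up numerical feature with scikit-learn; the values are hypothetical and only for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Illustrative numerical feature (hypothetical values)
X = np.array([[18.0], [25.0], [40.0], [60.0], [95.0]])

# Standardization: rescale to zero mean and unit variance
print(StandardScaler().fit_transform(X).ravel())

# Min-max normalization: rescale to the [0, 1] range
print(MinMaxScaler().fit_transform(X).ravel())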

• Collection:
The most crucial step when starting with ML is to have data of good quality and accuracy. Data can be collected from any authenticated source like data.gov.in, Kaggle or the UCI dataset repository. For example, while preparing for a competitive exam, students study from the best study material they can access so that they learn the best and obtain the best results. In the same way, high-quality and accurate data will make the learning process of the model easier and better, and at testing time the model will yield state-of-the-art results.
A huge amount of capital, time and resources is consumed in collecting data. Organizations or researchers have to decide what kind of data they need to execute their tasks or research.
Example: building a facial expression recognizer needs numerous images covering a variety of human expressions. Good data ensures that the results of the model are valid and can be trusted.

• Preparation:
The collected data can be in a raw form which can't be directly fed to the machine. So, this is the process of collecting datasets from different sources, analyzing them, and then constructing a new dataset for further processing and exploration. This preparation can be performed either manually or automatically. Data can also be converted into numeric form, which speeds up the model's learning.
Example: an image can be converted to a matrix of N x N dimensions, where the value of each cell is an image pixel (see the sketch after this list).
• Input:
The prepared data may still be in a form that is not machine-readable, so conversion algorithms are needed to convert it into a readable form. Executing this task requires high computation and accuracy.
Example: data can be collected through sources like the MNIST digit dataset (images), Twitter comments, audio files, and video clips.
• Processing:
This is the stage where algorithms and ML techniques are required to perform the instructions provided over a large volume of data with accuracy and optimal computation.
• Output:
In this stage, results are produced by the machine in a meaningful manner which can be inferred easily by the user. Output can be in the form of reports, graphs, videos, etc.
• Storage:
This is the final step, in which the obtained output, the data model, and all other useful information are saved for future use.
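As a minimal sketch of the Preparation example above (an image as an N x N matrix of pixel values), the snippet below uses scikit-learn's built-in digits dataset; the choice of dataset is only an assumption for illustration, and any image source would do.

from sklearn.datasets import load_digits

# Each digit image is stored as an 8 x 8 matrix of pixel intensities
digits = load_digits()
image = digits.images[0]
print(image.shape)  # (8, 8)

# Scaling pixel values to the [0, 1] range can speed up a model's learning
image_scaled = image / image.max()
print(image_scaled)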

Advantages of data processing in Machine Learning:


1. Improved model performance: Data processing helps
improve the performance of the ML model by cleaning and
transforming the data into a format that is suitable for
modeling.
2. Better representation of the data: Data processing allows
the data to be transformed into a format that better
represents the underlying relationships and patterns in the
data, making it easier for the ML model to learn from the
data.
3. Increased accuracy: Data processing helps ensure that the
data is accurate, consistent, and free of errors, which can
help improve the accuracy of the ML model.

Disadvantages of data processing in Machine Learning:

1. Time-consuming: Data processing can be a time-consuming task, especially for large and complex datasets.
2. Error-prone: Data processing can be error-prone, as it
involves transforming and cleaning the data, which can
result in the loss of important information or the
introduction of new errors.
3. Limited understanding of the data: Data processing can
lead to a limited understanding of the data, as the
transformed data may not be representative of the
underlying relationships and patterns in the data.

Overview of Data Cleaning


Data cleaning is one of the important parts of machine learning. It
plays a significant part in building a model. It surely isn’t the
fanciest part of machine learning and at the same time, there
aren’t any hidden tricks or secrets to uncover. However, the
success or failure of a project relies on proper data cleaning.
Professional data scientists usually invest a very large portion of
their time in this step because of the belief that “Better data
beats fancier algorithms”.
If we have a well-cleaned dataset, there is a good chance of achieving good results even with simple algorithms, which can be very beneficial, especially in terms of computation when the dataset is large. Obviously, different types of data will require different types of cleaning. However, this systematic approach can always serve as a good starting point.
Steps Involved in Data Cleaning
Data cleaning is a crucial step in the machine learning (ML)
pipeline, as it involves identifying and removing any missing,
duplicate, or irrelevant data. The goal of data cleaning is to ensure
that the data is accurate, consistent, and free of errors, as incorrect
or inconsistent data can negatively impact the performance of the
ML model.
Data cleaning, also known as data cleansing or data
preprocessing, is a crucial step in the data science pipeline that
involves identifying and correcting or removing errors,
inconsistencies, and inaccuracies in the data to improve its quality
and usability. Data cleaning is essential because raw data is often
noisy, incomplete, and inconsistent, which can negatively impact
the accuracy and reliability of the insights derived from it.

The following are the most common steps involved in data cleaning:

• Import the necessary libraries
• Load the dataset
• Check the data information using df.info()
Processing CSV Data
import pandas as pd
import numpy as np

# Load the dataset


df = pd.read_csv('input.csv')
df.head()

id name salary start_date dept
0 1 Rick 623.30 2012-01-01 IT
1 2 Dan 515.20 2013-09-23 Operations
2 3 Tusar 611.00 2014-11-15 IT
3 4 Ryan 729.00 2014-05-11 HR
4 5 Gary 843.25 2015-03-27 Finance
5 6 Rasmi 578.00 2013-05-21 IT
6 7 Pranab 632.80 2013-07-30 Operations
7 8 Guru 722.50 2014-06-17 Finance
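The step list above also mentions df.info(); a quick structural check on the same frame (the exact output depends on the file) could look like this:

# Column names, non-null counts, dtypes and memory usage
df.info()

# Number of rows and columns
print(df.shape)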

Reading Specific Rows

The read_csv function of the pandas library can also be used to read specific rows for a given column. We slice the result returned by read_csv, as shown below, to get the first 5 rows of the column named salary.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Slice the result for the first 5 rows
print(data[0:5]['salary'])
When we execute the above code, it produces the following
result.
0 623.30
1 515.20
2 611.00
3 729.00
4 843.25
Name: salary, dtype: float64
Reading Specific Columns
The read_csv function of the pandas library can also be used to read specific columns. We use the multi-axes indexer .loc[] for this purpose. Here we display the salary and name columns for all the rows.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function
print(data.loc[:, ['salary', 'name']])

When we execute the above code, it produces the following result.

salary name
0 623.30 Rick
1 515.20 Dan
2 611.00 Tusar
3 729.00 Ryan
4 843.25 Gary
5 578.00 Rasmi
6 632.80 Pranab
7 722.50 Guru
Reading Specific Columns and Rows
The read_csv function of the pandas library can also be used to read specific columns and specific rows. We use the multi-axes indexer .loc[] for this purpose. Here we display the salary and name columns for a selection of rows.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function
print(data.loc[[1, 3, 5], ['salary', 'name']])
When we execute the above code, it produces the following
result.
salary name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi
Reading Specific Columns for a Range of Rows
The read_csv function of the pandas library can also be used to read specific columns and a range of rows. We use the multi-axes indexer .loc[] for this purpose. Here we display the salary and name columns for rows 2 through 6.
import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function
print(data.loc[2:6, ['salary', 'name']])
When we execute the above code, it produces the following
result.
salary name
2 611.00 Tusar
3 729.00 Ryan
4 843.25 Gary
5 578.00 Rasmi
6 632.80 Pranab
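As a side note (not shown in the examples above), read_csv can also restrict what is read at load time through its usecols and nrows parameters; a minimal sketch with the same hypothetical 'path/input.csv' file:

import pandas as pd

# Read only the salary and name columns, and only the first 5 rows
data = pd.read_csv('path/input.csv', usecols=['salary', 'name'], nrows=5)
print(data)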

Detect and Remove the Outliers using Python

An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and removing them is the same as removing any other data item from a pandas DataFrame.

Dataset Used For Outlier Detection

The dataset used in this article is the Diabetes dataset, and it is preloaded in the sklearn library.
Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset


diabetics = load_diabetes()

# Create the dataframe


column_name = diabetics.feature_names
df_diabetics = pd.DataFrame(diabetics.data)
df_diabetics.columns = column_name
df_diabetics.head()
Outliers can be detected using visualization, implementing
mathematical formulas on the dataset, or using the statistical
approach. All of these are discussed below.

Outliers Visualization

Visualizing Outliers Using Box Plot

A box plot, also known as a whisker plot, displays a summary of a set of data values, with properties such as the minimum, first quartile, median, third quartile and maximum. In the box plot, a box is drawn from the first quartile to the third quartile, and a vertical line goes through the box at the median. Here the x-axis denotes the data being plotted while the y-axis shows the frequency distribution.
A box plot is primarily used to indicate whether a distribution is skewed and whether there are potential unusual observations (also called outliers) present in the data set. Box plots are also very useful when large numbers of data sets are involved or compared.

Parts of Box Plots

A box plot is described by the minimum, maximum, first quartile, third quartile, median and any outliers.
Minimum: The minimum value in the given dataset

First Quartile (Q1): The first quartile is the median of the lower half of
the data set.

Median: The median is the middle value of the dataset, which divides the
given dataset into two equal parts. The median is considered as the
second quartile.

Third Quartile (Q3): The third quartile is the median of the upper half of
the data.

Maximum: The maximum value in the given dataset.

Apart from these five terms, the other terms used in the box plot are:

Interquartile Range (IQR): The difference between the third quartile


and first quartile is known as the interquartile range. (i.e.) IQR = Q3-Q1

Outlier: The data that falls on the far left or right side of the ordered data
is tested to be the outliers. Generally, the outliers fall more than the
specified distance from the first and third quartile.

(i.e.) Outliers are greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR.

import seaborn as sns


sns.boxplot(df_diabetics['bmi'])
Outliers present in the bmi column

In the above graph, we can see a few values at the high end of the bmi column acting as outliers. The code below lists the positions of the points with bmi above 0.12.
# Position of the Outlier
import numpy as np
print(np.where(df_diabetics['bmi']>0.12))

output:

(array([ 32, 145, 256, 262, 366, 367, 405]),)

Visualizing Outliers Using a Scatter Plot

A scatter plot is used when you have paired numerical data, when your dependent variable has multiple values for each value of the independent variable, or when you are trying to determine the relationship between two variables. In the process of using a scatter plot, it can also serve for outlier detection.

# Scatter plot
fig, ax = plt.subplots(figsize = (6,4))
ax.scatter(df_diabetics['bmi'],df_diabetics['bp'])

# x-axis label
ax.set_xlabel('(body mass index of people)')

# y-axis label
ax.set_ylabel('(bp of the people )')
plt.show()

Looking at the graph, we can see that most of the data points lie in the bottom-left region, while a few points sit far away towards the top-right corner of the graph. Those points in the top-right corner can be regarded as outliers. As an approximation, the code below flags them using thresholds on the bmi and bp columns.
Outliers in BMI and BP Column Combined
Python3
# Position of the Outlier
print(np.where((df_diabetics['bmi']>0.12) &
(df_diabetics['bp']<0.8)))

Output:
(array([ 32, 145, 256, 262, 366, 367, 405]),)

IQR (Inter Quartile Range)

The Inter Quartile Range approach to finding outliers is the most commonly used and most trusted approach in the research field.
IQR = Quartile3 – Quartile1

Python3
# IQR

Q1 = np.percentile(df_diabetics['bmi'], 25,
method='midpoint')

Q3 = np.percentile(df_diabetics['bmi'], 75,
method='midpoint')

IQR = Q3 - Q1

print(IQR)

Output:
0.06520763046978838
Syntax: numpy.percentile(arr, n, axis=None, out=None)
Parameters:
arr : input array.
n : percentile value.
To flag outliers, base values are defined above and below the dataset's normal range, namely the upper and lower bounds (a 1.5*IQR margin is used):
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
In this formula, the IQR is scaled by a factor of 1.5; for a Gaussian distribution, these bounds sit roughly 2.7 standard deviations from the mean.

Python3
# Above the upper bound
upper = Q3 + 1.5*IQR
upper_array = np.array(df_diabetics['bmi'] >= upper)
print("Upper Bound:", upper)
print(upper_array.sum())

# Below the lower bound
lower = Q1 - 1.5*IQR
lower_array = np.array(df_diabetics['bmi'] <= lower)
print("Lower Bound:", lower)
print(lower_array.sum())

Output:
Upper Bound: 0.12879000811776306
3
Lower Bound: -0.13204051376139045
0

Removing the outliers

To remove an outlier, one must follow the same process as removing any entry from the dataset, using its exact position, because all of the detection methods above return the list of data items that satisfy the outlier definition for that method.
dataframe.drop(row_index, inplace=True)
The above code can be used to drop a row from the dataset given the row index to be dropped. inplace=True tells pandas to make the change in the original DataFrame. row_index can be a single value, a list of values, or a NumPy array, but it must be one-dimensional.

Example (assuming outlier_rows holds the row positions returned by np.where above):
df_diabetics.drop(outlier_rows, inplace=True)
Full Code: Detecting the outliers using IQR and removing them.
Python3
# Importing

import sklearn

from sklearn.datasets import load_diabetes

import pandas as pd

import numpy as np

# Load the dataset


diabetes = load_diabetes()

# Create the dataframe

column_name = diabetes.feature_names

df_diabetes = pd.DataFrame(diabetes.data)

df_diabetes.columns = column_name

df_diabetes.head()

print("Old Shape: ", df_diabetes.shape)

''' Detection '''

# IQR

# Calculate the upper and lower limits

Q1 = df_diabetes['bmi'].quantile(0.25)

Q3 = df_diabetes['bmi'].quantile(0.75)

IQR = Q3 - Q1

lower = Q1 - 1.5*IQR

upper = Q3 + 1.5*IQR

# Create arrays of Boolean values indicating the outlier rows

upper_array = np.where(df_diabetes['bmi']>=upper)[0]

lower_array = np.where(df_diabetes['bmi']<=lower)[0]

# Removing the outliers

df_diabetes.drop(index=upper_array, inplace=True)

df_diabetes.drop(index=lower_array, inplace=True)

# Print the new shape of the DataFrame

print("New Shape: ", df_diabetes.shape)

Output:
Old Shape: (442, 10)
New Shape: (439, 10)
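The statistical approach mentioned earlier (alongside visualization and the IQR rule) can be sketched with Z-scores; here |z| > 3 is used as a common, but not universal, cut-off, and df_diabetes is the DataFrame built in the code above.

import numpy as np

# Z-score of each bmi value: distance from the column mean in standard deviations
bmi = df_diabetes['bmi']
z_scores = (bmi - bmi.mean()) / bmi.std()

# Flag rows whose absolute Z-score exceeds 3 as potential outliers
outlier_positions = np.where(np.abs(z_scores) > 3)[0]
print("Potential outliers (row positions):", outlier_positions)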
