Handling Outliers
May 30, 2022
[18]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
[6]: data=pd.read_excel("S:\\Data Science\\Projects\\Datasets\\Term Deposit\\train.xlsx")
data.head()
[6]: age job marital education default balance housing loan \
0 58 management married tertiary no 2143 yes no
1 44 technician single secondary no 29 yes no
2 33 entrepreneur married secondary no 2 yes yes
3 47 blue-collar married unknown no 1506 yes no
4 33 unknown single unknown no 1 no no
contact day month duration campaign pdays previous poutcome y
0 unknown 5 may 261 1 -1 0 unknown no
1 unknown 5 may 151 1 -1 0 unknown no
2 unknown 5 may 76 1 -1 0 unknown no
3 unknown 5 may 92 1 -1 0 unknown no
4 unknown 5 may 198 1 -1 0 unknown no
1 There are different techniques to detect outliers
1.1 1- Z Score
1.2 Z = (Observation - Mean) / Standard Deviation
1.2.1 The Z score is not advisable if the data is skewed.
We calculate the Z score for each observation; if the Z score is greater than 3 or less than -3, we
classify that observation as an outlier. In other words, any value more than 3 standard deviations
above or below the mean is treated as an outlier.
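As a minimal sketch of the rule above, the Z score can be computed for every observation and compared against the threshold of 3. The numbers below are made up for illustration (they are not the term deposit data), and the names `values`, `z_scores` and `outliers` are hypothetical.

```python
import pandas as pd

# Synthetic values for illustration only (not the term deposit dataset).
values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
                    20, 12, 14, 16, 18, 15, 13, 17, 15, 90])

# Z = (Observation - Mean) / Standard Deviation, computed for every row.
z_scores = (values - values.mean()) / values.std()

# Keep only the observations whose Z score is above 3 or below -3.
outliers = values[z_scores.abs() > 3]
print(outliers)
```

Here only the value 90 lies more than 3 standard deviations from the mean, so it is the only observation flagged.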
[10]: sns.displot(x='age',data=data)
[10]: <seaborn.axisgrid.FacetGrid at 0x1f9b2f74af0>
The data is not normally distributed and is right skewed
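The skew seen in a plot can also be checked numerically. The sketch below uses a synthetic right-skewed sample (an assumption, not the age column) and the pandas `skew()` method; a positive value, together with a mean larger than the median, points to right skew.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample for illustration (not the age column).
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(scale=10.0, size=1000))

# A positive sample skewness indicates a longer right tail.
skewness = sample.skew()

# In right-skewed data the mean is typically pulled above the median.
mean_above_median = sample.mean() > sample.median()

print(f"skewness: {skewness:.2f}")
print("mean > median:", mean_above_median)
```

The same two checks could be run on `data.age` to back up the visual impression from the plot.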
2 The following methods are used when the data is normally distributed
[4]: high_limit=data.age.mean()+3*data.age.std()
low_limit=data.age.mean()-3*data.age.std()
[5]: print(high_limit,'\n')
print(low_limit)
72.79249633725466
9.079924091402077
[6]: data['age'].describe()
[6]: count 45211.000000
mean 40.936210
std 10.618762
min 18.000000
25% 33.000000
50% 39.000000
75% 48.000000
max 95.000000
Name: age, dtype: float64
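The summary above can be compared with the 3 standard deviation limits by hand. The short sketch below simply re-uses the mean, std, min and max printed by `describe()`; the variable names are hypothetical.

```python
# Figures copied from the describe() output for the age column.
mean, std = 40.936210, 10.618762
age_min, age_max = 18.0, 95.0

# 3 standard deviation limits around the mean.
high_limit = mean + 3 * std
low_limit = mean - 3 * std

# Compare the extremes of the column with the limits.
has_high_outliers = age_max > high_limit
has_low_outliers = age_min < low_limit

print(round(high_limit, 2), round(low_limit, 2))
print("values above the high limit:", has_high_outliers)
print("values below the low limit:", has_low_outliers)
```

Since the maximum age (95) is above the high limit while the minimum age (18) is well above the low limit, this rule only flags values on the upper side of the age column.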
[20]: sns.boxplot(y='age',data=data) # a boxplot draws outliers as points beyond the whiskers
[20]: <AxesSubplot:ylabel='age'>
3 A very common method to cap outliers is the quantile method.
4 We cap the values between the 1st and 99th percentile values.
5 This method can be used even if the data is skewed.
[21]: # We have to set the limits.
# The lower limit will be the 1st percentile value.
# The upper limit will be the 99th percentile value.
6 We are treating the age column.
[22]: lower_limit=data.age.quantile(0.01)
upper_limit=data.age.quantile(0.99)
print(lower_limit)
print(upper_limit)
23.0
71.0
Now that we have our lower and upper limits, we have to cap the data between these two values.
This way the outliers are capped, which we can check with the help of a boxplot.
7 To cap the values we can use two methods.
8 1- np.where()
9 2- clip()
We will use both methods one by one.
10 1- np.where()
[24]: data.age=np.where(data.age<lower_limit,lower_limit,np.where(data.age>upper_limit,upper_limit,data.age))
data.age.head()
[24]: 0 58.0
1 44.0
2 33.0
3 47.0
4 33.0
Name: age, dtype: float64
We have capped the values. Now let’s check for outliers with the help of a boxplot.
[26]: sns.boxplot(y='age',data=data)
[26]: <AxesSubplot:ylabel='age'>
Now we can compare this boxplot with the one we created earlier and see that we have treated the
outliers.
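Besides the boxplot, the capping can also be checked with numbers. The sketch below repeats the np.where() step on synthetic ages (an assumption, not the real column), keeps the original values in a separate Series, and counts how many values were changed; `ages`, `capped` and `n_changed` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic ages for illustration only (not the term deposit dataset).
rng = np.random.default_rng(42)
ages = pd.Series(rng.integers(18, 96, size=1000), name="age")

lower_limit = ages.quantile(0.01)
upper_limit = ages.quantile(0.99)

# Same nested np.where() capping, stored separately so the
# original values stay available for comparison.
capped = pd.Series(
    np.where(ages < lower_limit, lower_limit,
             np.where(ages > upper_limit, upper_limit, ages)),
    index=ages.index, name="age")

# Every capped value must lie within the two limits.
all_within_limits = capped.between(lower_limit, upper_limit).all()

# Number of observations that were moved to a limit.
n_changed = int((capped != ages).sum())

print("all values within limits:", all_within_limits)
print("values changed by capping:", n_changed)
```

Keeping the capped result in a new variable, rather than overwriting the column, also avoids having to reload the data before trying another method.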
11 2- clip()
To use clip(), let’s import the data again, then check for and cap the outliers.
[28]: data=pd.read_excel("S:\\Data Science\\Projects\\Datasets\\Term Deposit\\train.xlsx")
data.head()
[28]: age job marital education default balance housing loan \
0 58 management married tertiary no 2143 yes no
1 44 technician single secondary no 29 yes no
2 33 entrepreneur married secondary no 2 yes yes
3 47 blue-collar married unknown no 1506 yes no
4 33 unknown single unknown no 1 no no
contact day month duration campaign pdays previous poutcome y
0 unknown 5 may 261 1 -1 0 unknown no
1 unknown 5 may 151 1 -1 0 unknown no
2 unknown 5 may 76 1 -1 0 unknown no
3 unknown 5 may 92 1 -1 0 unknown no
4 unknown 5 may 198 1 -1 0 unknown no
[31]: sns.boxplot(y='age',data=data)
[31]: <AxesSubplot:ylabel='age'>
Here we can see that there are a lot of outliers in the age column.
[32]: data.age=data.age.clip(lower=data.age.quantile(0.01),upper=data.age.quantile(0.99))
data.head()
[32]: age job marital education default balance housing loan \
0 58 management married tertiary no 2143 yes no
1 44 technician single secondary no 29 yes no
2 33 entrepreneur married secondary no 2 yes yes
3 47 blue-collar married unknown no 1506 yes no
4 33 unknown single unknown no 1 no no
contact day month duration campaign pdays previous poutcome y
0 unknown 5 may 261 1 -1 0 unknown no
1 unknown 5 may 151 1 -1 0 unknown no
2 unknown 5 may 76 1 -1 0 unknown no
3 unknown 5 may 92 1 -1 0 unknown no
4 unknown 5 may 198 1 -1 0 unknown no
Now let’s create a boxplot again to check whether the outliers have been capped or not.
[33]: sns.boxplot(y='age',data=data)
[33]: <AxesSubplot:ylabel='age'>
Here we can see that we have successfully capped the outliers.
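Both methods should produce the same capped column when they are given the same limits. The sketch below checks this on synthetic values (an assumption, not the real age column); `capped_where` and `capped_clip` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic values for illustration only (not the term deposit dataset).
rng = np.random.default_rng(1)
values = pd.Series(rng.normal(loc=40, scale=10, size=500), name="age")

lower_limit = values.quantile(0.01)
upper_limit = values.quantile(0.99)

# Method 1: nested np.where(), which returns a NumPy array.
capped_where = np.where(values < lower_limit, lower_limit,
                        np.where(values > upper_limit, upper_limit, values))

# Method 2: Series.clip(), which returns a pandas Series.
capped_clip = values.clip(lower=lower_limit, upper=upper_limit)

# Compare the two results element by element.
same_result = np.array_equal(capped_where, capped_clip.to_numpy())
print("np.where and clip agree:", same_result)
```

Since the results match, the choice between the two is mostly a matter of readability; clip() states the intent in a single call.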
12 Thank you