Outlier Detection and Capping

There are several techniques to detect and handle outliers in a dataset. The document discusses and demonstrates 1) using z-scores to identify outliers more than 3 standard deviations from the mean, 2) capping outlier values between the 1st and 99th percentiles to remove their influence, and 3) two methods for capping outliers in Python - using np.where() to replace values below/above thresholds and clip() to restrict values within a given range. Boxplots are used before and after handling outliers to visualize the impact of these outlier treatment steps.

Uploaded by

santro985


Handling Outliers

May 30, 2022

[18]: import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

[6]: data=pd.read_excel("S:\\Data Science\\Projects\\Datasets\\Term Deposit\\train.xlsx")

data.head()

[6]:    age           job  marital  education default  balance housing loan  \
     0   58    management  married   tertiary      no     2143     yes   no
     1   44    technician   single  secondary      no       29     yes   no
     2   33  entrepreneur  married  secondary      no        2     yes  yes
     3   47   blue-collar  married    unknown      no     1506     yes   no
     4   33       unknown   single    unknown      no        1      no   no

        contact  day month  duration  campaign  pdays  previous poutcome   y
     0  unknown    5   may       261         1     -1         0  unknown  no
     1  unknown    5   may       151         1     -1         0  unknown  no
     2  unknown    5   may        76         1     -1         0  unknown  no
     3  unknown    5   may        92         1     -1         0  unknown  no
     4  unknown    5   may       198         1     -1         0  unknown  no

1 There are different techniques to detect outliers

1.1 1- Z Score
1.2 Z = (Observation - Mean) / Standard Deviation
1.2.1 Z Score is not advisable if the data is skewed.
We calculate the Z score for each observation, and if Z > 3 or Z < -3 we classify that
observation as an outlier. In other words, any value more than 3 standard deviations above
or below the mean is treated as an outlier.
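As an illustration of the rule above, here is a minimal sketch on made-up data (the `ages` values are hypothetical and are not taken from the term deposit dataset). It computes a Z score per observation and flags those beyond 3 standard deviations.

```python
import pandas as pd

# Hypothetical toy data: 29 typical ages plus one extreme value.
ages = pd.Series([30 + (i % 10) for i in range(29)] + [120], name="age")

# Z = (observation - mean) / standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
z_scores = (ages - ages.mean()) / ages.std()

# Flag observations more than 3 standard deviations from the mean.
outlier_mask = z_scores.abs() > 3
outliers = ages[outlier_mask]
print(outliers)
```

Only the extreme value is flagged here. On a strongly skewed column the mean and standard deviation are themselves pulled by the tail, which is why the rule is less reliable in that case.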
[10]: sns.displot(x='age',data=data)

[10]: <seaborn.axisgrid.FacetGrid at 0x1f9b2f74af0>

The data is not normally distributed; it is right-skewed.
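Skewness can also be checked numerically rather than only by eye. The sketch below uses a small made-up sample (hypothetical values, not the real age column) and pandas' `Series.skew()`; a positive value suggests a longer right tail.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are moderate, a few are large.
ages = pd.Series([25, 28, 30, 31, 33, 35, 38, 40, 45, 60, 85], name="age")

skewness = ages.skew()      # sample skewness; > 0 suggests right skew
mean_age = ages.mean()
median_age = ages.median()

# For right-skewed data the mean is typically pulled above the median.
print(skewness, mean_age, median_age)
```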

2 The following methods are used when the data is normally distributed
[4]: high_limit=data.age.mean()+3*data.age.std()
low_limit=data.age.mean()-3*data.age.std()

[5]: print(high_limit,'\n')
print(low_limit)

72.79249633725466

9.079924091402077

[6]: data['age'].describe()

[6]: count 45211.000000

mean 40.936210
std 10.618762
min 18.000000
25% 33.000000
50% 39.000000
75% 48.000000
max 95.000000
Name: age, dtype: float64
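The 3-standard-deviation limits printed earlier can be reproduced by hand from the mean and std reported by describe(). The short check below hard-codes those summary values (copied from the output above) instead of reading the file.

```python
# Summary statistics copied from the describe() output above.
mean_age = 40.936210
std_age = 10.618762
min_age, max_age = 18.0, 95.0

high_limit = mean_age + 3 * std_age
low_limit = mean_age - 3 * std_age
print(round(high_limit, 4), round(low_limit, 4))

# The lower limit falls below the youngest age in the data,
# so only the upper tail can be flagged by this rule.
print(min_age < low_limit)   # is any age below the lower limit?
print(max_age > high_limit)  # is any age above the upper limit?
```

The results agree with the printed limits of roughly 72.79 and 9.08. Since the minimum age is 18, the Z score rule can only flag ages in the upper tail of this column.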

[20]: sns.boxplot(y='age',data=data) # boxplots are used to plot outliers

[20]: <AxesSubplot:ylabel='age'>
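For reference, a boxplot marks points using the interquartile range (IQR) rather than Z scores. Assuming seaborn's default whisker setting (`whis=1.5`, the matplotlib default), points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are drawn individually. A rough sketch using the quartiles from the describe() output above:

```python
# Quartiles copied from the describe() output above.
q1, q3 = 33.0, 48.0
iqr = q3 - q1

# Assumed default whisker rule: 1.5 * IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)
```

Under that assumption the fences are 10.5 and 70.5. The minimum age of 18 is inside the lower fence, so only ages above 70.5 would appear as individual points.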

3 A very common method to cap outliers is the quantile method.

4 We cap the values between the 1st percentile and the 99th percentile.

5 This method can be used even if the data is skewed.

[21]: # We have to set the limits.
# The lower limit will be the 1st percentile value.
# The upper limit will be the 99th percentile value.

6 We are treating the column age.
[22]: lower_limit=data.age.quantile(0.01)
upper_limit=data.age.quantile(0.99)
print(lower_limit)
print(upper_limit)

23.0
71.0
Now that we have our lower and upper limits, we have to cap the data between these two
values. Values outside the limits are replaced by the limits themselves (no rows are dropped), and we can check the result with the help of a boxplot.
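The capping step can be sketched as a small reusable function. The name `cap_by_quantiles` and the toy values below are hypothetical, introduced only for illustration; the function simply wraps `quantile()` and `clip()`.

```python
import numpy as np
import pandas as pd

def cap_by_quantiles(series, lower_q=0.01, upper_q=0.99):
    """Return a copy of `series` capped at the given quantiles (hypothetical helper)."""
    lower_limit = series.quantile(lower_q)
    upper_limit = series.quantile(upper_q)
    return series.clip(lower=lower_limit, upper=upper_limit)

# Hypothetical toy column: the values 1.0 to 100.0.
ages = pd.Series(np.arange(1, 101), dtype=float, name="age")
capped = cap_by_quantiles(ages)

# Capping keeps every row; only values outside the limits change.
print(len(ages), len(capped))
print(capped.min(), capped.max())
```

The row count is unchanged, and the new minimum and maximum equal the 1st and 99th percentile values of the original column.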

7 To cap the values we can use two methods.

8 1- np.where()

9 2- clip()
We will use both methods one by one.

10 1- np.where()
[24]: data.age=np.where(data.age<lower_limit,lower_limit,np.where(data.age>upper_limit,upper_limit,data.age))

data.age.head()

[24]: 0 58.0
1 44.0
2 33.0
3 47.0
4 33.0
Name: age, dtype: float64

We have capped the values. Now let’s check the outliers with the help of a boxplot.
[26]: sns.boxplot(y='age',data=data)

[26]: <AxesSubplot:ylabel='age'>

Comparing this boxplot with the one we created earlier, we can see that we have treated
the outliers.

11 2- clip()
To use clip(), let’s load the data again and then check and cap the outliers.

[28]: data=pd.read_excel("S:\\Data Science\\Projects\\Datasets\\Term Deposit\\train.xlsx")

data.head()

[28]: age job marital education default balance housing loan \

contact day month duration campaign pdays previous poutcome y

0 unknown 5 may 261 1 -1 0 unknown no
1 unknown 5 may 151 1 -1 0 unknown no
2 unknown 5 may 76 1 -1 0 unknown no
3 unknown 5 may 92 1 -1 0 unknown no
4 unknown 5 may 198 1 -1 0 unknown no

[31]: sns.boxplot(y='age',data=data)

[31]: <AxesSubplot:ylabel='age'>

Here we can see that there are a lot of outliers in the age column.
[32]: data.age=data.age.clip(lower=data.age.quantile(0.01),upper=data.age.quantile(0.99))

data.head()

[32]: age job marital education default balance housing loan \

contact day month duration campaign pdays previous poutcome y

0 unknown 5 may 261 1 -1 0 unknown no
1 unknown 5 may 151 1 -1 0 unknown no
2 unknown 5 may 76 1 -1 0 unknown no
3 unknown 5 may 92 1 -1 0 unknown no
4 unknown 5 may 198 1 -1 0 unknown no

Now let’s create a boxplot again to check whether the outliers have been capped or not.
[33]: sns.boxplot(y='age',data=data)

[33]: <AxesSubplot:ylabel='age'>

Here we can see that we have successfully capped the outliers.
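Both approaches should give the same capped column. The sketch below checks this on a small made-up sample (hypothetical values, not the real dataset) and counts how many values the capping changed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, stored as floats.
ages = pd.Series([18, 22, 25, 33, 39, 48, 60, 71, 88, 95], dtype=float, name="age")
lower_limit = ages.quantile(0.01)
upper_limit = ages.quantile(0.99)

# Method 1: nested np.where(), as in the notebook.
capped_where = pd.Series(
    np.where(ages < lower_limit, lower_limit,
             np.where(ages > upper_limit, upper_limit, ages)),
    index=ages.index,
    name="age",
)

# Method 2: Series.clip().
capped_clip = ages.clip(lower=lower_limit, upper=upper_limit)

same = bool(np.allclose(capped_where, capped_clip))
n_changed = int((capped_clip != ages).sum())
print(same, n_changed)
```

On this sample only the smallest and largest values are changed. clip() expresses the same operation in a single call, which is usually easier to read than the nested np.where().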

12 Thank you
