Just Give me the Codes
Lecture 5: Data Preprocessing II
GOALS
• Normality (multivariate/bivariate/univariate distribution)
• Outlier detection and removal
Recap & Step 25
From last lecture:
Created a new df 'Norway'
From 'Norway', created yet another df 'selection' (3 numerical variables)
This lecture (Step 25)
Warnings are a nuisance
Follow Step 25 to view the dimensions of 'selection' and to suppress warnings (see the sketch below)
Place a hash (#) before warnings.filterwarnings() should you wish to read the warnings
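A minimal sketch of Step 25, assuming the 'selection' DataFrame from the last lecture is in scope:

    import warnings

    # Suppress warning messages for the rest of the notebook;
    # prefix this line with a # should you wish to read the warnings
    warnings.filterwarnings('ignore')

    # Dimensions (rows, columns) of the 'selection' DataFrame
    print(selection.shape)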
Normality – The Assumption of Normality
Many statistical tests require the data to follow a normal distribution
This is referred to as the Assumption of Normality
It is especially critical for sample sizes < 30
Choose an appropriate statistical test for your sample size
Example: sample size 23
Royston test for multivariate normality:
Fail to reject the null hypothesis at the 5% level (data is consistent with a multivariate normal distribution)
Henze-Zirkler test for multivariate normality:
Reject the null hypothesis at the 5% level (data does not follow a multivariate normal distribution)
Refer to the links at the end of the lecture for more information on normality tests
Step 26 – Install and import Pingouin
Python's standard libraries offer limited support for multivariate normality tests
The pingouin package covers univariate, bivariate and multivariate normality tests
Follow Step 26 to install and import the pingouin package (see the sketch below)
Refer to the links at the end of the lecture for more information on pingouin
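A minimal sketch of Step 26; the exact install command may differ in your environment:

    # Install pingouin once, e.g. from a notebook cell:
    #   !pip install pingouin

    import pingouin as pg

    print(pg.__version__)   # confirm the package imported correctly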
Step 27 – Shapiro-Wilk normality test
The null hypothesis for the Shapiro-Wilk normality test states that the data is normally distributed
The alternative hypothesis states that the data is not normally distributed
Follow Step 27 to determine the normality of each variable (see the sketch below)
All 3 numerical variables have univariate normal distributions at significance level 0.05; however, when testing for multivariate normality it is good practice to also visualize your dataset to diagnose any deviation from multivariate normality
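A minimal sketch of Step 27 using pingouin's normality() function, assuming 'selection' holds the three numerical variables:

    import pingouin as pg

    # Shapiro-Wilk test on each numerical column of 'selection';
    # the output holds the W statistic, the p-value and a True/False 'normal' flag
    print(pg.normality(selection, method='shapiro', alpha=0.05))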
Step 28 – Visually inspect TFR_foreign
One could additionally place a k after -o to render black markers, for example:
plt.plot(a, fit, '-ok')
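The exact figure from Step 28 is not reproduced here; one common way to sketch such an inspection, assuming 'a' holds the sorted values and 'fit' a normal density fitted to them:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Sort TFR_foreign and overlay a normal density fitted to its mean and
    # standard deviation on top of the histogram
    a = np.sort(selection['TFR_foreign'])
    fit = stats.norm.pdf(a, a.mean(), a.std())

    plt.hist(selection['TFR_foreign'], density=True, alpha=0.5)
    plt.plot(a, fit, '-o')    # '-ok' would render the markers in black
    plt.title('TFR_foreign')
    plt.show()

Steps 29 and 30 repeat the same inspection for TFR_native and Overall_TFR.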
Step 29 – Visually inspect TFR_native
Step 30 – Visually inspect Overall_TFR
Step 31 – Skewness & Kurtosis
Measuring skewness:
skewness = 0 : normally distributed (symmetrical distribution)
skewness > 0 : longer right tail; the mass of the distribution is concentrated on the left of the figure
skewness < 0 : longer left tail; the mass of the distribution is concentrated on the right of the figure
Measuring kurtosis (excess kurtosis, so a normal distribution scores 0):
kurtosis = 0 : consistent with a normal distribution
kurtosis > 0 : the distribution's tails are heavier than those of a normal distribution
kurtosis < 0 : the distribution's tails are lighter than those of a normal distribution
Results from Step 31 show TFR_foreign to be moderately skewed, whilst TFR_native and Overall_TFR are fairly symmetrical. TFR_foreign is heavy-tailed whilst TFR_native is light-tailed. Overall_TFR has a kurtosis value consistent with a normal distribution.
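A minimal sketch of how Step 31 could compute these values with pandas, which reports excess kurtosis (Fisher's definition):

    # Skewness and kurtosis per column; with Fisher's definition a
    # normal distribution scores 0 on both measures
    print(selection.skew())
    print(selection.kurtosis())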
Step 32: Multivariate normal distribution
The null hypothesis for the Henze-Zirkler multivariate normality test states that the data follows a multivariate normal distribution
The alternative hypothesis states that the data does not follow a multivariate normal distribution
The dataset ('selection') does NOT have a multivariate normal distribution at significance level 0.05
We will try removing one of the variables (first establishing bivariate normality between each variable pair)
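A minimal sketch of the Step 32 test using pingouin:

    import pingouin as pg

    # Henze-Zirkler test across all three columns of 'selection';
    # returns the HZ statistic, the p-value and a True/False 'normal' flag
    print(pg.multivariate_normality(selection, alpha=0.05))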
Steps 33-34: Bivariate normal distribution
TFR_foreign and Overall_TFR: DO NOT satisfy the bivariate normality assumption at the 0.05 significance level
TFR_foreign and TFR_native: satisfy the bivariate normality assumption at the 0.05 significance level
TFR_native and Overall_TFR: satisfy the bivariate normality assumption at the 0.05 significance level
In case you weren’t aware:
Bivariate = 2 variables
Multivariate = 2 or more variables
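One way Steps 33-34 could loop over the variable pairs, reusing pingouin's Henze-Zirkler test on two columns at a time:

    import pingouin as pg
    from itertools import combinations

    # Henze-Zirkler test on each pair of columns (bivariate normality)
    for col_a, col_b in combinations(selection.columns, 2):
        result = pg.multivariate_normality(selection[[col_a, col_b]], alpha=0.05)
        print(col_a, '&', col_b, '->', result)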
Steps 35-36: IQR and outliers
Follow Steps 35-36 to determine the IQR for each column and, accordingly, the number of outliers per column (see the sketch below)
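A minimal sketch of the 1.5 × IQR rule used in Steps 35-36, assuming the computation is done directly on 'selection':

    # Flag values outside the fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    Q1 = selection.quantile(0.25)
    Q3 = selection.quantile(0.75)
    IQR = Q3 - Q1

    outlier_mask = (selection < Q1 - 1.5 * IQR) | (selection > Q3 + 1.5 * IQR)

    print(IQR)                 # IQR per column
    print(outlier_mask.sum())  # number of outliers per column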
Step 37 – Position of outliers
Follow Step 37 to view the positions of the outliers (see the sketch below)
Remember, index 0 is position 1; therefore the outliers are at positions 13 and 23
Steps 35-37 checked for univariate outliers
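A possible follow-up for Step 37, reusing the outlier_mask from the previous sketch to print the positions:

    import numpy as np

    # Zero-based row indices of the flagged outliers (index 0 is position 1)
    for col in selection.columns:
        print(col, np.where(outlier_mask[col])[0])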
Step 38 – Concat() & import seaborn
Concatenating in this direction:
x→y
y→z
x→z
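A sketch of how Step 38 might build the pairwise DataFrames with concat(), assuming x, y and z stand for the three numerical columns; the g, h and i names follow the next slide, and pd.concat(..., axis=1) joins the columns side by side:

    import pandas as pd
    import seaborn as sns

    # Build the three pairwise DataFrames used for the PairGrid plots
    g = pd.concat([selection['TFR_foreign'], selection['TFR_native']], axis=1)
    h = pd.concat([selection['TFR_native'], selection['Overall_TFR']], axis=1)
    i = pd.concat([selection['TFR_foreign'], selection['Overall_TFR']], axis=1)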
Steps 39-41: PairGrid
Bivariate relationships:
g = TFR_foreign & TFR_native
h = TFR_native & Overall_TFR
i = TFR_foreign & Overall_TFR
Steps 39-41: PairGrid plots
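One possible way Steps 39-41 could render these plots with seaborn's PairGrid, assuming g is one of the pairwise DataFrames from Step 38 (the same applies to h and i):

    import seaborn as sns
    import matplotlib.pyplot as plt

    # PairGrid for one variable pair: histograms on the diagonal,
    # scatter plots off the diagonal
    grid = sns.PairGrid(g)
    grid.map_diag(sns.histplot)
    grid.map_offdiag(sns.scatterplot)
    plt.show()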
Step 42 – Pearson’s correlation coefficient
The pairwise_corr() function is part of the pingouin package
Hypotheses for Pearson's correlation coefficient:
NULL: No linear relationship exists
ALTERNATIVE: A linear relationship does exist
There IS a significant linear relationship between TFR_native and Overall_TFR
There is NO significant linear relationship between TFR_foreign & Overall_TFR
There is NO significant linear relationship between TFR_foreign & TFR_native
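A minimal sketch of the Step 42 call, assuming the full 'selection' DataFrame is passed to pairwise_corr():

    import pingouin as pg

    # Pearson correlation for every pair of columns; 'r' holds the
    # coefficient and 'p-unc' the uncorrected p-value
    print(pg.pairwise_corr(selection, method='pearson'))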
What does all this mean?
TFR_foreign is the source of the outliers
Deleting the outliers would reduce the dataset by roughly 20%
The options are deleting, imputing or transforming
There is no guarantee that all outliers are removed at the first attempt
You may then be faced with deleting or imputing new outliers
The originality of the dataset would be reduced even further
There is no universal method for outlier detection and removal; the choice comes with experience
Steps 43-47: If the 5 outliers were deleted…
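A sketch of how Steps 43-47 might drop the flagged rows, reusing outlier_mask from the IQR sketch above; the no_outliers name follows the slides:

    # Drop every row containing at least one flagged outlier
    no_outliers = selection[~outlier_mask.any(axis=1)].reset_index(drop=True)

    print(no_outliers.shape)   # roughly 20% fewer rows than 'selection'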
Steps 48-52 – Imputing outliers with the median & new MVN test
Note: a p-value > 0.05 for the 'impute' and 'no_outliers' datasets does not imply the absence of outliers
The outlier tests need to be conducted again (see the sketch below)
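A sketch of how Steps 48-52 might impute the outliers with the column median and repeat the MVN test, reusing outlier_mask and no_outliers from the earlier sketches; the impute name follows the slides:

    import pingouin as pg

    # Replace each flagged outlier with its column median, then re-run the
    # Henze-Zirkler test on both cleaned datasets
    impute = selection.copy()
    for col in impute.columns:
        impute.loc[outlier_mask[col], col] = impute[col].median()

    print(pg.multivariate_normality(impute, alpha=0.05))
    print(pg.multivariate_normality(no_outliers, alpha=0.05))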
End of Lecture 5
Well done! You have gained intermediate skills in Data Preprocessing!
Where to go from here? Lecture 6 of course! But things to consider:
Read up on normality tests
Read up on Pingouin
A great place to start:
Link to Pingouin: https://pingouin-stats.org/index.html
Pingouin univariate normality: https://pingouin-stats.org/generated/pingouin.normality.html
Pingouin multivariate normality: https://pingouin-stats.org/generated/pingouin.multivariate_normality.html
Link to article on Normality tests: https://www.nrc.gov/docs/ML1714/ML17143A100.pdf