Just Give me the Codes
Lecture 5: Data Preprocessing II
GOALS
• Normality (multivariate/bivariate/univariate distribution)
• Outlier detection and removal
Recap & Step 25
From last lecture:
Created a new df 'Norway'
From 'Norway', created yet another df 'selection' (3 numerical variables)
This lecture (Step 25)
Warnings are a nuisance
Follow Step 25 to view the dimensions of 'selection' and to suppress warnings (see the sketch below)
Place a hash (#) before warnings.filterwarnings() should you wish to read the warnings
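A minimal sketch of Step 25, assuming the 'selection' DataFrame from the last lecture is in scope:

    import warnings

    # Suppress warning messages for the rest of the notebook;
    # prefix this line with a # should you wish to read the warnings
    warnings.filterwarnings('ignore')

    # Dimensions (rows, columns) of the 'selection' DataFrame
    print(selection.shape)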
Normality – The Assumption of Normality
Many statistical tests require the data to follow a normal distribution
This is referred to as the Assumption of Normality
It is especially critical for sample sizes < 30
Choose an appropriate statistical test for your sample size
Example: sample size 23
Royston test for multivariate normality:
Fail to reject the null hypothesis at the 5% level (data is consistent with a multivariate normal distribution)
Henze-Zirkler test for multivariate normality:
Reject the null hypothesis at the 5% level (data does not follow a multivariate normal distribution)
Refer to the links at the end of the lecture for more information on normality tests
Step 26 – Install and import Pingouin
Python's standard libraries offer limited support for multivariate normality tests
The pingouin package covers univariate, bivariate and multivariate normality tests
Follow Step 26 to install and import the pingouin package (see the sketch below)
Refer to the links at the end of the lecture for more information on pingouin
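A minimal sketch of Step 26; the exact install command may differ in your environment:

    # Install pingouin once, e.g. from a notebook cell:
    #   !pip install pingouin

    import pingouin as pg

    print(pg.__version__)   # confirm the package imported correctly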
Step 27 – Shapiro-Wilk normality test
The null hypothesis for the Shapiro-Wilk normality test states that the data is normally distributed
The alternative hypothesis states that the data is not normally distributed
Follow Step 27 to determine the normality of each variable (see the sketch below)
All 3 numerical variables have univariate normal distributions at significance level 0.05; however, when testing for multivariate normality it is good practice to also visualize your dataset to diagnose any deviation from multivariate normality
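A minimal sketch of Step 27 using pingouin's normality() function, assuming 'selection' holds the three numerical variables:

    import pingouin as pg

    # Shapiro-Wilk test on each numerical column of 'selection';
    # the output holds the W statistic, the p-value and a True/False 'normal' flag
    print(pg.normality(selection, method='shapiro', alpha=0.05))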
Step 28 – Visually inspect TFR_foreign
One could additionally place a k after -o to render black markers, for example:
plt.plot(a, fit, '-ok')
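The exact figure from Step 28 is not reproduced here; one common way to sketch such an inspection, assuming 'a' holds the sorted values and 'fit' a normal density fitted to them:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Sort TFR_foreign and overlay a normal density fitted to its mean and
    # standard deviation on top of the histogram
    a = np.sort(selection['TFR_foreign'])
    fit = stats.norm.pdf(a, a.mean(), a.std())

    plt.hist(selection['TFR_foreign'], density=True, alpha=0.5)
    plt.plot(a, fit, '-o')    # '-ok' would render the markers in black
    plt.title('TFR_foreign')
    plt.show()

Steps 29 and 30 repeat the same inspection for TFR_native and Overall_TFR.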
Step 29 – Visually inspect TFR_native
Step 30 – Visually inspect Overall_TFR
Step 31 – Skewness & Kurtosis
Measuring skewness:
skewness = 0 : normally distributed (symmetrical distribution)
skewness > 0 : longer right tail; the mass of the distribution is concentrated on the left of the figure
skewness < 0 : longer left tail; the mass of the distribution is concentrated on the right of the figure
Measuring kurtosis (excess kurtosis, so a normal distribution scores 0):
kurtosis = 0 : consistent with a normal distribution
kurtosis > 0 : the distribution's tails are heavier than those of a normal distribution
kurtosis < 0 : the distribution's tails are lighter than those of a normal distribution
Results from Step 31 show TFR_foreign to be moderately skewed, whilst TFR_native and Overall_TFR are fairly symmetrical. TFR_foreign is heavy-tailed whilst TFR_native is light-tailed. Overall_TFR has a kurtosis value consistent with a normal distribution.
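A minimal sketch of how Step 31 could compute these values with pandas, which reports excess kurtosis (Fisher's definition):

    # Skewness and kurtosis per column; with Fisher's definition a
    # normal distribution scores 0 on both measures
    print(selection.skew())
    print(selection.kurtosis())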
Step 32: Multivariate normal distribution
The null hypothesis for the Henze-Zirkler multivariate normality test states that the data follows a multivariate normal distribution
The alternative hypothesis states that the data does not follow a multivariate normal distribution
The dataset ('selection') does NOT have a multivariate normal distribution at significance level 0.05
We will try removing one of the variables (first establishing bivariate normality between each variable pair)
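A minimal sketch of the Step 32 test using pingouin:

    import pingouin as pg

    # Henze-Zirkler test across all three columns of 'selection';
    # returns the HZ statistic, the p-value and a True/False 'normal' flag
    print(pg.multivariate_normality(selection, alpha=0.05))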
Steps 33-34: Bivariate normal distribution
TFR_foreign and Overall_TFR: DO NOT satisfy the bivariate normality assumption at the 0.05 significance level
TFR_foreign and TFR_native: satisfy the bivariate normality assumption at the 0.05 significance level
TFR_native and Overall_TFR: satisfy the bivariate normality assumption at the 0.05 significance level
In case you weren’t aware:
Bivariate = 2 variables
Multivariate = 2 or more variables
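One way Steps 33-34 could loop over the variable pairs, reusing pingouin's Henze-Zirkler test on two columns at a time:

    import pingouin as pg
    from itertools import combinations

    # Henze-Zirkler test on each pair of columns (bivariate normality)
    for col_a, col_b in combinations(selection.columns, 2):
        result = pg.multivariate_normality(selection[[col_a, col_b]], alpha=0.05)
        print(col_a, '&', col_b, '->', result)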
Steps 35-36: IQR and outliers
Follow Steps 35-36 to determine the IQR for each column and, accordingly, the number of outliers per column (see the sketch below)
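A minimal sketch of the 1.5 × IQR rule used in Steps 35-36, assuming the computation is done directly on 'selection':

    # Flag values outside the fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    Q1 = selection.quantile(0.25)
    Q3 = selection.quantile(0.75)
    IQR = Q3 - Q1

    outlier_mask = (selection < Q1 - 1.5 * IQR) | (selection > Q3 + 1.5 * IQR)

    print(IQR)                 # IQR per column
    print(outlier_mask.sum())  # number of outliers per column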
Step 37 – Position of outliers
Follow Step 37 to view the positions of the outliers (see the sketch below)
Remember, index 0 is position 1; therefore the outliers are at positions 13 and 23
Steps 35-37 checked for univariate outliers
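A possible follow-up for Step 37, reusing the outlier_mask from the previous sketch to print the positions:

    import numpy as np

    # Zero-based row indices of the flagged outliers (index 0 is position 1)
    for col in selection.columns:
        print(col, np.where(outlier_mask[col])[0])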
Step 38 – Concat() & import seaborn
Concatenating in this direction:
x→y
y→z
x→z
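A sketch of how Step 38 might build the pairwise DataFrames with concat(), assuming x, y and z stand for the three numerical columns; the g, h and i names follow the next slide, and pd.concat(..., axis=1) joins the columns side by side:

    import pandas as pd
    import seaborn as sns

    # Build the three pairwise DataFrames used for the PairGrid plots
    g = pd.concat([selection['TFR_foreign'], selection['TFR_native']], axis=1)
    h = pd.concat([selection['TFR_native'], selection['Overall_TFR']], axis=1)
    i = pd.concat([selection['TFR_foreign'], selection['Overall_TFR']], axis=1)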
Steps 39-41: PairGrid
Bivariate relationships:
g = TFR_foreign & TFR_native
h = TFR_native & Overall_TFR
i = TFR_foreign & Overall_TFR
Steps 39-41: PairGrid plots
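One possible way Steps 39-41 could render these plots with seaborn's PairGrid, assuming g is one of the pairwise DataFrames from Step 38 (the same applies to h and i):

    import seaborn as sns
    import matplotlib.pyplot as plt

    # PairGrid for one variable pair: histograms on the diagonal,
    # scatter plots off the diagonal
    grid = sns.PairGrid(g)
    grid.map_diag(sns.histplot)
    grid.map_offdiag(sns.scatterplot)
    plt.show()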
Step 42 – Pearson’s correlation coefficient
The pairwise_corr() function is part of the pingouin package
Hypotheses for Pearson's correlation coefficient:
NULL: No linear relationship exists
ALTERNATIVE: A linear relationship does exist
There IS a significant linear relationship between TFR_native and Overall_TFR
There is NO significant linear relationship between TFR_foreign & Overall_TFR
There is NO significant linear relationship between TFR_foreign & TFR_native
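A minimal sketch of the Step 42 call, assuming the full 'selection' DataFrame is passed to pairwise_corr():

    import pingouin as pg

    # Pearson correlation for every pair of columns; 'r' holds the
    # coefficient and 'p-unc' the uncorrected p-value
    print(pg.pairwise_corr(selection, method='pearson'))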
What does all this mean?
TFR_foreign is the source of the outliers
Deleting the outliers would reduce the dataset by roughly 20%
The options are deleting, imputing or transforming
There is no guarantee that all outliers are removed at the first attempt
You may then be faced with deleting or imputing new outliers
The originality of the dataset would be reduced even further
There is no universal method for outlier detection and removal; the choice comes with experience
Steps 43-47: If the 5 outliers were deleted…
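A sketch of how Steps 43-47 might drop the flagged rows, reusing outlier_mask from the IQR sketch above; the no_outliers name follows the slides:

    # Drop every row containing at least one flagged outlier
    no_outliers = selection[~outlier_mask.any(axis=1)].reset_index(drop=True)

    print(no_outliers.shape)   # roughly 20% fewer rows than 'selection'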
Steps 48-52 – Imputing outliers with the median & new MVN test
Note: a p-value > 0.05 for the 'impute' and 'no_outliers' datasets does not imply the absence of outliers
The outlier tests need to be conducted again (see the sketch below)
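A sketch of how Steps 48-52 might impute the outliers with the column median and repeat the MVN test, reusing outlier_mask and no_outliers from the earlier sketches; the impute name follows the slides:

    import pingouin as pg

    # Replace each flagged outlier with its column median, then re-run the
    # Henze-Zirkler test on both cleaned datasets
    impute = selection.copy()
    for col in impute.columns:
        impute.loc[outlier_mask[col], col] = impute[col].median()

    print(pg.multivariate_normality(impute, alpha=0.05))
    print(pg.multivariate_normality(no_outliers, alpha=0.05))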
End of Lecture 5
Well done! You have gained intermediate skills in Data Preprocessing!
Where to go from here? Lecture 6 of course! But things to consider:
Read up on normality tests
Read up on Pingouin
A great place to start:
Link to Pingouin: https://pingouin-stats.org/index.html
Pingouin univariate normality: https://pingouin-stats.org/generated/pingouin.normality.html
Pingouin multivariate normality: https://pingouin-stats.org/generated/pingouin.multivariate_normality.html
Link to article on Normality tests: https://www.nrc.gov/docs/ML1714/ML17143A100.pdf