Class Information
AE 5418 Statistical Hydrology
Engr. Dr. Muhammad Ajmal
Associate Professor
PhD in Water Resources & Environmental Engineering
Civil & Environmental Engineering, Hanyang University, South Korea
Agricultural Engineering Department
University of Engineering & Technology Peshawar, Pakistan
Statistical Hydrology
Data Transformation
04. Data Transformation
What is a data transformation?
➢ Many statistical methods require that the numeric variables
we are working with be approximately normally distributed.
➢ A data transformation is a mathematical function that is applied to
every observation of a given variable:
y = f(x)
⚫ x represents the original variable, y is the transformed variable, and f
is the mathematical function applied to the data.
⚫ Transforming a variable can change its distribution from a
skewed distribution to an approximately normal one (bell-shaped, symmetric
about its center).
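As a minimal sketch of y = f(x), the snippet below applies f = log10 to every observation of a right-skewed sample and compares the skewness before and after. The sample is hypothetical (drawn from a lognormal distribution with assumed parameters), and `sample_skewness` is an illustrative helper, not a function from any particular package.

```python
import math
import random
import statistics

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (close to 0 for a symmetric sample)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample (assumed lognormal, for illustration only).
rng = random.Random(42)
x = [rng.lognormvariate(3.0, 1.0) for _ in range(2000)]

# y = f(x) with f = log10, applied to every observation.
y = [math.log10(v) for v in x]

skew_x = sample_skewness(x)
skew_y = sample_skewness(y)
print(f"skewness of x:        {skew_x:.2f}")
print(f"skewness of log10(x): {skew_y:.2f}")
```

The raw sample should show strong positive skewness, while the logged sample should be close to zero.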
Why data transformation?
Fig.: Scatterplots in which the areas of the sovereign states and dependent
territories of the world are plotted on the vertical axis against their populations
on the horizontal axis.
❖ Fig. (a) uses the raw data. In Fig. (b), both the area and population data have been
transformed using the logarithm function. (Source: Wikipedia)
▣ Data Transformation
⚫ Transformations are used for three purposes:
① To make data more symmetric
② To make data more linear, and
③ To make data more constant in variance
✓ In order to make an asymmetric distribution become more
symmetric, the data can be transformed or re-expressed into new
units.
✓ The new units alter the distances between observations on a line
plot.
✓ The effect is to either expand or contract the distances to extreme
observations on one side of the median making it look more like
the other side.
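The effect on distances can be checked with a few numbers. The sketch below uses five hypothetical observations (not data from the course) and measures how far the minimum and the maximum lie from the median, first in the original units and then after a log10 transformation.

```python
import math
import statistics

# Hypothetical right-skewed observations (e.g. concentrations in mg/L).
raw = [1.0, 10.0, 100.0, 1000.0, 10000.0]
logged = [math.log10(v) for v in raw]

def distances_to_extremes(values):
    """Return (median - minimum, maximum - median)."""
    med = statistics.median(values)
    return med - min(values), max(values) - med

raw_low, raw_high = distances_to_extremes(raw)
log_low, log_high = distances_to_extremes(logged)
print("raw units:  ", raw_low, raw_high)
print("log10 units:", log_low, log_high)
```

In the original units the upper extreme is roughly a hundred times further from the median than the lower extreme. In log10 units the two distances are about equal, so the long upper side has been contracted to look like the lower side.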
✓ The most commonly used transformation in water resources is the
logarithm.
✓ Logs of water resources data (for example, stream discharge,
hydraulic conductivity, and sediment concentration) are often
taken before statistical analyses are performed.
✓ In most instances, the purpose of data transformation is not
merely to obtain a normal distribution from a non-normal one, but
to meet the assumptions of a statistical test or procedure
(parametric or non-parametric).
✓ If the data do not meet the assumptions of a given test or
procedure, and the problem appears to stem from the distribution of a
variable we are using, then we often try transformations. Alternatively,
we can try a different test or procedure that has
different assumptions or is more robust.
✓ Normality can be assessed with different statistical tests,
each of which reports a p-value. If the p-value is small (most often
p < 0.05 or p < 0.01), the hypothesis of normality is rejected and
the data are treated as non-normal.
✓ To decide how to transform a variable, you might find
"Tukey's ladder" to be a useful search term, as the great
mathematician John Wilder Tukey (1915-2000) created an
ordered list of transformations to bring skewed distributions
toward normality.
✓ In simple cases, it might make sense to use a test that converts
the raw values to ranks (as many nonparametric tests do) and so
sidesteps some of the problems that a skewed distribution may
cause for a parametric test.
✓ If you need something more complex, such as multiple
regression, a Tukey-style transformation may help you meet the
requirements for the residuals that you cannot meet with the
original, untransformed variable.
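One reason rank-based tests sidestep skewness can be sketched directly: ranks depend only on the ordering of the observations, so an order-preserving transformation such as the logarithm leaves them unchanged. The flows below are hypothetical, and `ranks` is an illustrative helper that averages the ranks of tied values.

```python
import math

def ranks(values):
    """Rank observations from 1 (smallest) to n, averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

# Hypothetical right-skewed flows with one extreme value and one tie.
flows = [12.0, 15.0, 15.0, 22.0, 40.0, 950.0]
raw_ranks = ranks(flows)
log_ranks = ranks([math.log10(v) for v in flows])
print(raw_ranks)
print("ranks unchanged by log10:", raw_ranks == log_ranks)
```

The extreme value 950 simply receives the highest rank, so it cannot dominate a rank-based statistic the way it can dominate a mean.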
Tukey’s Ladder of Powers for Transformation
Here X represents our variable of interest. We are going to consider this
variable raised to a power λ, i.e. X^λ.
We go up the ladder to remove left skewness and down the ladder to
remove right skewness. The further a rung is from the middle, the bigger its impact.
UP (left skewed)
X^4
X^3
X^2
X (middle rung: no transformation, λ = 1)
X^(1/2)
X^(1/3)
log10 X (think of this as X^0)
X^(-1)
a further negative power of X
DOWN (right skewed)
Fig.: Tukey’s Ladder of Powers
✓ To remove right skewness, we typically take the square root,
cube root, logarithm, or reciprocal of a variable, i.e.
V^0.5, V^0.333, log10(V) (think of this as V^0), V^(-1), etc.
✓ To remove left skewness, we raise the variable to a power
greater than 1, such as squaring or cubing the values, i.e. V^2,
V^3, etc.
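The ladder can be walked programmatically. The sketch below tries a handful of rungs on a hypothetical right-skewed sample and reports the one with the smallest absolute skewness. The rung list, the sample, and the "least skewed" selection rule are assumptions made for illustration; in practice the choice should also be checked graphically.

```python
import math
import random
import statistics

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def tukey_transform(values, power):
    """Apply one rung of the ladder; power 0 is taken as log10.

    Assumes every value is strictly positive.
    """
    if power == 0:
        return [math.log10(v) for v in values]
    return [v ** power for v in values]

# Illustrative subset of rungs, from the top of the ladder to the bottom.
LADDER = [3, 2, 1, 0.5, 1 / 3, 0, -1]

def least_skewed_power(values, ladder=LADDER):
    """Return the rung whose transformed data have the smallest |skewness|."""
    return min(ladder, key=lambda p: abs(sample_skewness(tukey_transform(values, p))))

# Hypothetical right-skewed variable V (assumed lognormal).
rng = random.Random(7)
v = [rng.lognormvariate(2.0, 0.8) for _ in range(2000)]

for power in LADDER:
    print(f"power {power:6.3f}: skewness {sample_skewness(tukey_transform(v, power)):7.2f}")

best_power = least_skewed_power(v)
print("least skewed rung:", best_power)
```

For this sample the skewness shrinks as we step down from V toward log10(V) and grows again at V^(-1), which illustrates that going too far down the ladder over-corrects.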
Transformations to Achieve Normality
⚫ How can we determine if observations are normally distributed?
⚫ Graphical examination
✓ Frequency plot (histogram)
✓ Boxplot
✓ Normal quantile-quantile plot (QQ-plot)
⚫ Goodness of fit tests
✓ Chi-Square Test
✓ Shapiro-Wilk Test
✓ Kolmogorov-Smirnov Test
✓ Anderson-Darling Test
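Each goodness-of-fit test returns a p-value that is compared with a chosen significance level. As a minimal sketch of that decision, the code below uses the Jarque-Bera test, which is not one of the tests listed above; it is swapped in only because its asymptotic p-value has a closed form that needs no statistical library. The sample is hypothetical, and the asymptotic p-value is an approximation that is less reliable for small samples.

```python
import math
import random
import statistics

def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for H0: the data are normal."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under H0, JB is asymptotically chi-square with 2 degrees of freedom,
    # whose upper-tail probability is exactly exp(-jb / 2).
    return jb, math.exp(-jb / 2.0)

# Hypothetical right-skewed sample (assumed lognormal) and its log10 transform.
rng = random.Random(1)
raw = [rng.lognormvariate(3.0, 1.0) for _ in range(500)]
logged = [math.log10(v) for v in raw]

alpha = 0.05
jb_raw, p_raw = jarque_bera(raw)
jb_log, p_log = jarque_bera(logged)
for label, p in (("raw", p_raw), ("log10", p_log)):
    verdict = "reject normality" if p < alpha else "do not reject normality"
    print(f"{label:6s} p = {p:.4f} -> {verdict}")
```

The raw sample should be rejected. The logged sample usually is not, although any single sample can fail a test by chance, so the result should be read together with a histogram or Q-Q plot.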
How to Express a Distribution
Cumulative Density
Probability Density
Which method conveys the
information best to you?
Probability Plot Equation
Transformations to Achieve Normality
⚫ Original and Transformed Data
Q-Q Plot for Normally Distributed Data
Q-Q Plot for Left Skewed Data
Q-Q Plot for Right Skewed Data
Q-Q Plot for Leptokurtic (high peak) and Low Spread Data
Q-Q Plot for Platykurtic (low peak) and More Spread Data
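The points of a normal Q-Q plot can be computed without plotting software. The sketch below pairs each sorted observation with the standard normal quantile at plotting position (i - 0.5) / n, which is one common convention assumed here, and summarizes straightness with the correlation between the two columns. Both samples are hypothetical.

```python
import random
import statistics

def normal_qq_points(values):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    n = len(values)
    std_normal = statistics.NormalDist()
    ordered = sorted(values)
    return [(std_normal.inv_cdf((i - 0.5) / n), obs)
            for i, obs in enumerate(ordered, start=1)]

def qq_correlation(values):
    """Pearson correlation of the Q-Q points (near 1 when they fall on a line)."""
    points = normal_qq_points(values)
    mean_t = statistics.fmean(t for t, _ in points)
    mean_o = statistics.fmean(o for _, o in points)
    cov = sum((t - mean_t) * (o - mean_o) for t, o in points)
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    var_o = sum((o - mean_o) ** 2 for _, o in points)
    return cov / (var_t * var_o) ** 0.5

rng = random.Random(3)
normal_sample = [rng.gauss(50.0, 10.0) for _ in range(500)]       # roughly normal
skewed_sample = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]  # right skewed

r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"Q-Q correlation, normal sample:       {r_normal:.3f}")
print(f"Q-Q correlation, right-skewed sample: {r_skewed:.3f}")
```

A correlation close to 1 corresponds to points lying along a straight line. The right-skewed sample gives a clearly lower value because its largest observations bend away from the line.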
⚫ Some Models with Transformed Data
Homework
What is a normality test? Why is it conducted? Using any software,
apply a normality test to an example data set from water resources and
discuss the results in terms of normality or non-normality. Also, which
techniques would be suitable to normalize the data?
Questions?