
Data Transformation

The document discusses data transformation in statistical hydrology, emphasizing its importance for achieving normal distribution in variables. It outlines various transformation techniques, particularly logarithmic transformations, and introduces Tukey's Ladder of Powers for addressing skewness in data. Additionally, it highlights methods for assessing normality, including graphical examinations and goodness-of-fit tests.

Class Information

AE 5418 Statistical Hydrology

Engr. Dr. Muhammad Ajmal


Associate Professor
PhD in Water Resources & Environmental Engineering
Civil & Environmental Engineering, Hanyang University, South Korea

Agricultural Engineering Department


University of Engineering & Technology Peshawar, Pakistan

Statistical Hydrology

Data Transformation

04. Data Transformation
What is a data transformation?
➢ Many statistical methods require that the numeric variables
we are working with have an approximate normal distribution.

➢ A data transformation is a mathematical function that is applied
to all the observations of a given variable:

y = f(x)
⚫ x represents the original variable, y is the transformed variable, and f
is a mathematical function that is applied to the data.

⚫ Transformation of a variable can change its distribution from a
skewed distribution to a normal distribution (bell-shaped, symmetric
about its center).
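A minimal sketch of y = f(x) applied to every observation, assuming a synthetic right-skewed sample (the lognormal "discharge" values, the seed, and the helper names `skewness` and `transform` are illustrative and not part of the lecture). It compares the skewness before and after taking log10.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def transform(values, f):
    """Apply y = f(x) to every observation of a variable."""
    return [f(v) for v in values]


# Assumption: synthetic lognormal values standing in for stream discharge.
rng = random.Random(42)
x = [rng.lognormvariate(3.0, 1.0) for _ in range(2000)]
y = transform(x, math.log10)

skew_x = skewness(x)
skew_y = skewness(y)
print(f"skewness before: {skew_x:.2f}, after log10: {skew_y:.2f}")
```

A skewness near zero after the transformation suggests a roughly symmetric distribution; it does not by itself prove normality.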

Why data transformation?


Fig.: Scatterplots, panels (a) and (b), in which the areas of the sovereign states and dependent
territories in the world are plotted on the vertical axis against their populations
on the horizontal axis.
❖ Fig. (a) uses the raw data. In Fig. (b), both the area and population data have been
transformed using the logarithm function. (Source: Wikipedia)

▣ Data Transformation
⚫ Transformations are used for three purposes:
① To make data more symmetric,
② To make data more linear, and
③ To make data more constant in variance.

✓ To make an asymmetric distribution more symmetric, the data can be
transformed or re-expressed in new units.
✓ The new units alter the distances between observations on a line
plot.
✓ The effect is to either expand or contract the distances to extreme
observations on one side of the median, making it look more like
the other side.
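The second purpose (making data more linear) can be sketched as follows, assuming a hypothetical power-law relationship with multiplicative scatter; the coefficients, the seed, and the `pearson_r` helper are illustrative assumptions. The Pearson correlation is used here only as a rough indicator of linearity.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient, used as a rough measure of linearity."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


# Assumption: y = 2 * x^1.5 with lognormal (multiplicative) scatter,
# and x spread over three orders of magnitude.
rng = random.Random(7)
x = [10 ** rng.uniform(0, 3) for _ in range(500)]
y = [2.0 * v ** 1.5 * rng.lognormvariate(0.0, 0.5) for v in x]

r_raw = pearson_r(x, y)
r_log = pearson_r([math.log10(v) for v in x], [math.log10(v) for v in y])
print(f"r on raw data: {r_raw:.3f}, r on log-log data: {r_log:.3f}")
```

Taking logs of both variables turns the power law into a straight line and the multiplicative scatter into additive scatter of roughly constant spread, which also relates to the third purpose.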

✓ The most commonly used transformation in water resources is the
logarithm.
✓ Logs of water resources data (for example, stream discharge,
hydraulic conductivity, and sediment concentration) are often
taken before statistical analyses are performed.


✓ In most instances, the purpose of data transformation is not
merely to turn a non-normal distribution into a normal one, but to
meet the assumptions of a statistical test or procedure
(parametric or non-parametric).

✓ If the data do not meet the assumptions of a given test or
procedure, and the problem appears to be due to the distribution of
a variable we are using, we often try transformations.
Alternatively, we can try a different test or procedure that has
different assumptions or is more robust.


✓ Normality can be assessed with statistical tests, using the
p-value. If the p-value is small (most commonly p < 0.05 or
p < 0.01), the hypothesis of normality is rejected and the data are
treated as non-normal.

✓ To decide how to transform a variable, "Tukey's ladder" is a
useful search term: the mathematician John Wilder Tukey
(1915-2000) created an ordered list of transformations to bring
skewed distributions toward normality.
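The p-value decision rule described above can be sketched with a test that is easy to compute by hand. Note the substitution: this uses the Jarque-Bera test, which is not one of the tests named in this lecture, chosen only because its p-value has a closed form (chi-square with 2 degrees of freedom). The approximation is asymptotic, and the synthetic "flows" sample and seed are assumptions.

```python
import math
import random
import statistics


def jarque_bera(values):
    """Return (statistic, p_value) for the Jarque-Bera normality test.

    The p-value uses the asymptotic chi-square distribution with 2 degrees
    of freedom, whose survival function is exp(-x / 2).
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    statistic = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return statistic, math.exp(-statistic / 2.0)


# Assumption: synthetic right-skewed sample standing in for observed flows.
rng = random.Random(1)
flows = [rng.lognormvariate(2.0, 0.8) for _ in range(300)]

alpha = 0.05
jb_raw, p_raw = jarque_bera(flows)
jb_log, p_log = jarque_bera([math.log(v) for v in flows])
for label, p in (("raw", p_raw), ("log", p_log)):
    verdict = "reject normality" if p < alpha else "no evidence against normality"
    print(f"{label}: p = {p:.4f} -> {verdict}")
```

A large p-value only means the test found no evidence against normality; it does not confirm that the data are normal.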


✓ In simple cases, it might make sense to use a test that converts
the raw values to ranks (as many nonparametric tests do) and so
sidesteps some of the problems that a skewed distribution may
cause for a parametric test.

✓ If you need something more complex, such as multiple
regression, a Tukey-style transformation may help you meet
requirements on the residuals that you cannot meet with the
original, untransformed variable.
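The conversion of raw values to ranks mentioned above can be sketched as follows. Averaging the ranks of tied values is a common convention assumed here; the `average_ranks` helper and the peak-flow numbers are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


# Hypothetical annual peak flows with one extreme event.
peaks = [12.0, 15.0, 15.0, 18.0, 950.0]
ranks = average_ranks(peaks)
print(ranks)  # [1.0, 2.5, 2.5, 4.0, 5.0]
```

The extreme value 950.0 simply becomes rank 5, so it no longer dominates the analysis the way it would on the raw scale.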

Tukey’s Ladder of Powers for Transformation
Here X represents our variable of interest. We are going to consider this
variable raised to a power λ, i.e. X^λ.

We go up the ladder to remove left skewness and down the ladder to
remove right skewness.

UP (left skewed; bigger impact toward the top)
X^4
X^3
X^2
X (middle rung: no transformation, λ = 1)
√X
∛X
log10 X (think of this as X^0)
X^-1
X^-2
DOWN (right skewed; bigger impact toward the bottom)

Fig.: Tukey’s Ladder of Powers


✓ To remove right skewness, we typically take the square root,
cube root, logarithm, or reciprocal of a variable, etc., i.e.
V^0.5, V^0.333, log10(V) (think of V^0), V^-1, etc.

✓ To remove left skewness, we raise the variable to a power
greater than 1, such as squaring or cubing the values, i.e. V^2,
V^3, etc.
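One way to walk the ladder is to try each rung and keep the power whose transformed values have the smallest absolute skewness. This is a sketch under assumptions: the selection rule, the helper names, the list of rungs, and the synthetic sample are illustrative, not a prescribed procedure from the lecture.

```python
import math
import random
import statistics

# Rungs taken from the powers listed above; 0 stands for log10.
LADDER = [3, 2, 1, 0.5, 1 / 3, 0, -1]


def skewness(values):
    """Moment-based skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def ladder_transform(values, power):
    """Raise positive values to `power`; the power-0 rung is log10."""
    if any(v <= 0 for v in values):
        raise ValueError("ladder transformations here require positive values")
    if power == 0:
        return [math.log10(v) for v in values]
    return [v ** power for v in values]


def best_power(values, ladder=LADDER):
    """Pick the rung that leaves the least absolute skewness."""
    return min(ladder, key=lambda p: abs(skewness(ladder_transform(values, p))))


# Assumption: synthetic right-skewed (lognormal) sample.
rng = random.Random(3)
v = [rng.lognormvariate(1.0, 1.0) for _ in range(2000)]

chosen = best_power(v)
print("chosen power:", chosen)
```

For a right-skewed sample the chosen rung lies below the middle rung (power 1). Negative powers reverse the ordering of the observations, which matters when interpreting the transformed values.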

Transformations to Achieve Normality
⚫ How can we determine if observations are normally distributed?

⚫ Graphical examination
✓ Frequency plot (histogram)
✓ Boxplot
✓ Normal quantile-quantile plot (QQ-plot)

⚫ Goodness of fit tests
✓ Chi-Square Test
✓ Shapiro-Wilk Test
✓ Kolmogorov-Smirnov Test
✓ Anderson-Darling Test
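The normal Q-Q plot from the graphical list above can be sketched by pairing sorted observations with standard normal quantiles. Assumptions: the Weibull plotting position i/(n+1) is one common choice among several, the correlation of the Q-Q points is used only as a rough summary of straightness, and the sample and helper names are illustrative.

```python
import math
import random
from statistics import NormalDist


def normal_qq_points(values):
    """Return (theoretical, observed) quantile lists for a normal Q-Q plot."""
    n = len(values)
    std_normal = NormalDist()
    # Weibull plotting position i / (n + 1); other positions are possible.
    theoretical = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return theoretical, sorted(values)


def qq_correlation(values):
    """Correlation of the Q-Q points; values near 1 suggest a straight line."""
    t, s = normal_qq_points(values)
    n = len(t)
    mt = sum(t) / n
    ms = sum(s) / n
    cov = sum((a - mt) * (b - ms) for a, b in zip(t, s))
    var_t = sum((a - mt) ** 2 for a in t)
    var_s = sum((b - ms) ** 2 for b in s)
    return cov / math.sqrt(var_t * var_s)


# Assumption: synthetic right-skewed sample.
rng = random.Random(11)
sample = [rng.lognormvariate(2.0, 1.0) for _ in range(500)]

r_qq_raw = qq_correlation(sample)
r_qq_log = qq_correlation([math.log(v) for v in sample])
print(f"Q-Q correlation raw: {r_qq_raw:.3f}, log: {r_qq_log:.3f}")
```

The two lists returned by `normal_qq_points` can be passed to any plotting tool as x and y coordinates; points falling close to a straight line indicate approximate normality.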

How to Express a Distribution

[Figure: the same distribution shown four ways: Cumulative Density, Probability Density, Probability Plot, Equation]

Which method conveys the information best to you?

Transformations to Achieve Normality
⚫ Original and Transformed Data

Q-Q Plot for Normally Distributed Data

Q-Q Plot for Left Skewed Data

Q-Q Plot for Right Skewed Data

Q-Q Plot for Leptokurtic (high peak) and Low Spread Data

Q-Q Plot for Platykurtic (low peak) and More Spread Data

⚫ Some Models with Transformed Data

Homework

What is a normality test? Why is it conducted? Using any software,
analyze an example dataset from water resources and discuss the
results in terms of its normality or non-normality. Also, which
techniques would be suitable to normalize it?

Questions?
