
Biometry - Chapter 1

1. This document provides an introduction to descriptive statistics and measures of central tendency including the mean, median, and mode. 2. Descriptive statistics are used to describe and summarize data through methods like calculating averages, variance, and percentiles. Inferential statistics are used to make conclusions about a population based on a sample. 3. Measures of central tendency indicate the central or typical value of a data set. Common measures include the mean, which is the average, the median, which is the middle value, and the mode, which is the most frequently occurring value. The appropriate measure depends on the characteristics of the data.

DEPARTMENT OF AGRICULTURE

DIPLOMA IN AGRICULTURE (D3AGRC)

BIOMETRY – BID250S

2022

INTRODUCTION TO APPLIED STATISTICS

COMPILED BY: A. ADETUNJI, H. THERON, G. SCHOLTZ and O. SINDESI

CHAPTER 1
DESCRIPTION OF DATA

Layout
1. Measures of centrality:
arithmetic mean; mode, and median
2. Measures of skewness
3. Measures of peakedness
4. Measures of dispersion:
range; quartiles; deciles; percentiles; inter/semi-inter quartile range;
variance; standard deviation

INTRODUCTION

Statistics is one of the most important branches of mathematics. It deals with the
collection and analysis of numeric data and is widely used in research. It is used for
describing properties of data (descriptive statistics) and drawing conclusions about a
population based on information in a sample (inferential statistics). A population is the
set of all possible values for a given problem. A population can also be described as
the collection of all individuals or items under consideration in a statistical study. The
number of items in a population is conventionally denoted by 'N', while 'n' denotes the
number of items in a sample.

In most cases, it is not possible to measure each individual in the population.
Therefore, a sample is taken, which is a subset of the population. That is, a sample is
that part of the population from which information is collected. A good sample must be
unbiased, random and representative of the population.
There are two major types of statistics, both of which have the goal of discovering
hidden trends, patterns and relationships in the data:
1) Descriptive statistics consists of methods for organizing and summarizing
information. It includes the construction of graphs, charts, and tables, and the
calculation of various descriptive measures such as averages, measures of variation,
and percentiles.
2) Inferential statistics is concerned with using sample data to draw conclusions about
a population. It includes methods such as point estimation, interval estimation and
hypothesis testing, which are all based on probability theory.

Descriptive statistics analyses the past by describing the basic features of data. It is
based on the calculation of measures of central tendency (mean, median, mode),
frequency (count, percentage) and variability (maximum, minimum, range, quartiles,
variance). Inferential statistics aims at building predictive models to understand the
trend of a given phenomenon. It includes the following types of analysis: hypothesis
testing (ANOVA, t-test, …) and confidence interval estimation.

Figure 1: Population and Sample

Statistics works with data. Data is a set of variables that can be quantitative or
qualitative. Understanding the difference between quantitative and qualitative data is
very important because they are treated and analysed in different ways. Quantitative
data can be expressed as numbers; they can therefore be measured, counted and
analysed through statistical calculations. Qualitative data cannot be measured through
standard computation techniques, because they describe qualities or categories, such
as feelings, sensations and experiences. Qualitative data can be used to understand
the context around a given phenomenon and to discover new aspects.

Figure 2: Quantitative data vs qualitative data

In this chapter, we will discuss and explore some basic statistics and their uses using
grouped and ungrouped data. Grouped data means the data (or information) given
in the form of class intervals such as 0-20, 20-40 and so on. Ungrouped data is defined
as the data given as individual points (i.e. values or numbers) such as 15, 63, 34, 20,
25, and so on.

1. Measures of centrality
Measures of centrality are descriptive measures that indicate where the centre or the
most typical value of the variable lies in the collected set of measurements. These are
also classed as summary statistics. The mean (often called the average) is most likely
the measure of central tendency that most people are familiar with, but there are
others, such as the median and the mode.

The mean, median and mode are all valid measures of central tendency, but under
different conditions, some measures of central tendency become more appropriate to
use than others.

1.1. Arithmetic mean


The mean (or average) is the most popular and well-known measure of central
tendency. It can be used with both discrete and continuous data, although it is most
often used with continuous data. The mean is equal to the sum of all the values in
the data set divided by the number of values in the data set. So, if we
have n values (number of individuals) in a data set and they have values x1, x2, …, xn,
the sample mean, usually denoted by x̅ (pronounced "x bar"), is given by formula 1:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

This formula is usually written in a slightly different manner using the Greek letter ∑,
pronounced "sigma", which means "sum of...". This gives formula 2:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The above formula (formula 2) refers to the sample mean and not the population mean.
This is because, in statistics, samples and populations have very different meanings
and these differences are very important, even if, in the case of the mean, they are
calculated in the same way. To acknowledge that we are calculating the population
mean and not the sample mean, we use the Greek lower case letter "mu", denoted as
µ, in formula 3, where N is the number of values in the population:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

When not to use the mean

The mean has one main disadvantage: it is susceptible to the influence of outliers.
Outliers are values that are unusual compared to the rest of the data set by being
especially small or large in numerical value. For example, consider the wages of staff
at a factory below:

Staff 1 2 3 4 5 6 7 8 9 10

Salary R15k R18k R16k R14k R15k R15k R12k R17k R90k R95k

The mean salary for these ten staff is R30.7k. However, inspecting the raw data
suggests that this mean value might not be the best way to accurately reflect the typical
salary of a worker, as most workers have salaries in the R12k to R18k range. The
mean is being skewed by the two large salaries. Therefore, in this situation, we would
likely be deceived if we use the mean as a measure of centrality.
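
As a quick check of the salary example, the short Python sketch below compares the mean
with the median (the middle value, discussed in Section 1.3) of the ten salaries. The
variable names are illustrative only, and only the Python standard library is assumed.

```python
from statistics import mean, median

# Salaries of the ten staff members, in thousands of rand (from the table above)
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = mean(salaries_k)      # pulled upwards by the two large salaries
median_salary = median(salaries_k)  # middle of the sorted values

print(f"Mean:   R{mean_salary:.1f}k")
print(f"Median: R{median_salary:.1f}k")
```

The median falls inside the R12k to R18k range where most salaries lie, while the mean
does not.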
Example for calculating mean in ungrouped data:
Table 1: Ungrouped Swiss chard plant height data
Week   Treatment   Swiss chard plant height (mm)
1      T1R1Z0Sa    115
1      T1R1Z0Sb    100
1      T1R1Z0Sc    67
1      T1R2Z0Sa    76
1      T1R2Z0Sb    86
1      T1R2Z0Sc    161
1      T1R3Z0Sa    171
1      T1R3Z0Sb    93
1      T1R3Z0Sc    98
1      T1R4Z0Sa    102
1      T1R4Z0Sb    132
1      T1R4Z0Sc    145
1      T1R5Z0Sa    122
1      T1R5Z0Sb    129
1      T1R5Z0Sc    142
1      T1R6Z0Sa    111
1      T1R6Z0Sb    119
1      T1R6Z0Sc    104

Above are 18 observations of Swiss chard plant height taken in the first week of
data collection in a scientific experiment.

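The arithmetic mean is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

or, simply put,

$$\bar{x} = \frac{115 + 100 + 67 + \dots + 104}{18} = \frac{2073}{18} \approx 115.17 \text{ mm}$$

The average plant height is therefore 115.17 mm (note the unit).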
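
The same calculation can be sketched in a few lines of Python. This is only an
illustration of the arithmetic; the variable names are not part of the course material.

```python
# Swiss chard plant heights (mm) from Table 1
heights_mm = [115, 100, 67, 76, 86, 161, 171, 93, 98,
              102, 132, 145, 122, 129, 142, 111, 119, 104]

n = len(heights_mm)        # number of observations
total = sum(heights_mm)    # sum of all x_i
mean_height = total / n    # arithmetic mean

print(n, total, round(mean_height, 2))
```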

Example (grouped data):

The above example is for ungrouped data. If there are many observations, then the
addition involved in finding $\sum_{i=1}^{n} x_i$ becomes tedious and it is advantageous
to classify the data before finding the mean.
A mean can also be determined for data that are grouped or placed in intervals. Unlike
listed data, the individual values for grouped data are not available, and you are not
able to calculate their sum. To calculate the mean of grouped data, the first step
is to determine the midpoint of each interval or class. These midpoints must then
be multiplied by the frequencies of the corresponding classes. The sum of the products
divided by the total number of values will be the value of the mean.

Table 2: Grouped Swiss chard plant height data
Plant height class (mm)   Number of observations (f)   Midpoint (x)   f·x
60-69                     1                            64.5           64.5
70-79                     1                            74.5           74.5
80-89                     1                            84.5           84.5
90-99                     2                            94.5           189
100-109                   3                            104.5          313.5
110-119                   3                            114.5          343.5
120-129                   2                            124.5          249
130-139                   1                            134.5          134.5
140-149                   2                            144.5          289
150-159                   0                            154.5          0
160-169                   1                            164.5          164.5
170-180                   1                            175            175
Total                     18                                          2081.5

NB: The mean of grouped data is the sum of the products of each class frequency (f)
and its class midpoint (x), divided by the sum of the frequencies (n). The steps are:
1) Find the midpoint (x) of each interval or class.
2) Multiply each midpoint by the frequency of the corresponding class (f × x = f.x).
3) Find the sum of the f.x values (2081.5 mm).
4) Divide the sum by the total number of observations (18 observations):
   2081.5/18 ≈ 115.64 mm

The formula described above is written as follows:

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{\sum f_i x_i}{n}$$

Note that the arithmetic mean of the grouped data is not exactly the same as that for
the ungrouped data. Since we let a single class midpoint represent all the
determinations that are grouped in that class, we lose information. The arithmetic
mean is thus not the exact mean but an approximation.
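
The grouped calculation, and the size of the approximation error, can be checked with
the minimal Python sketch below. The midpoints and frequencies are copied from Table 2;
the variable names are illustrative.

```python
# (class midpoint, frequency) pairs from Table 2
classes = [(64.5, 1), (74.5, 1), (84.5, 1), (94.5, 2), (104.5, 3), (114.5, 3),
           (124.5, 2), (134.5, 1), (144.5, 2), (154.5, 0), (164.5, 1), (175.0, 1)]

n = sum(f for _, f in classes)            # total number of observations
sum_fx = sum(x * f for x, f in classes)   # sum of midpoint * frequency
grouped_mean = sum_fx / n

ungrouped_mean = 2073 / 18   # mean of the raw data in Table 1

print(n, sum_fx, round(grouped_mean, 2))
print("difference:", round(grouped_mean - ungrouped_mean, 2), "mm")
```

The difference is under half a millimetre here, which shows how little information is
lost by grouping for this particular data set.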

• The arithmetic mean takes every observation into account and is the most often
  used of all the measures of central tendency.
• The reliability of this measure is, however, suspect where extreme values are
  found in the dataset (the average of 3, 4 and 5 is 4, while that of 3, 4, 5 and 36 is 12;
  the latter is not representative of the majority of the observations).

1.2. Mode
The mode of a set of data is the value that occurs most frequently in the set. The mode
is not often determined for ungrouped data, although it is possible where the dataset
is small by ordering the data from the smallest to the largest value. The mode can
easily be determined when the data has been grouped into a frequency distribution,
either directly from the frequency distribution or from the histogram of absolute
frequencies.
If two or more values appear with the same highest frequency, each is a mode. The
downside to using the mode as a measure of central tendency is that a set of data may
have no mode, or it may have more than one mode. In contrast, the same set of data
will always have only one mean and only one median.

• The word modal is often used when referring to the mode of a data set.
• If a data set has only one value that occurs most often, the set is
  called unimodal.
• A data set that has two values that occur with the same greatest frequency is
  referred to as bimodal.
• When a set of data has more than two values that occur with the same greatest
  frequency, the set is called multimodal.

NB: When determining the mode of a data set, calculations are not required, but keen
observation is a must. The mode is a measure of central tendency that is simple to
locate, but it is not used much in practical applications.

The data grouped in Table 2 is bimodal, with the modal classes being 100-109 mm
and 110-119 mm, each containing 3 observations.
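
Finding the modal class or classes amounts to looking for the highest frequency. A
minimal Python sketch, using the frequencies of Table 2 and illustrative names:

```python
# Frequency table from Table 2: class label -> number of observations
frequencies = {
    "60-69": 1, "70-79": 1, "80-89": 1, "90-99": 2,
    "100-109": 3, "110-119": 3, "120-129": 2, "130-139": 1,
    "140-149": 2, "150-159": 0, "160-169": 1, "170-180": 1,
}

highest = max(frequencies.values())
# Keep every class that reaches the highest frequency (there may be several)
modal_classes = [label for label, f in frequencies.items() if f == highest]

print(highest, modal_classes)
```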

Figure 3: Shows an example of two modes (bimodal)

The mode will not represent the central tendency of the data if an atypical value occurs
the most frequently in a dataset. The shape of the frequency distribution should therefore
first be examined before selecting the mode as a central measurement. The mode is
also unsuitable as a single central value when the frequency distribution is bimodal
(when there are two modes). An advantage of the mode, however, is that it is not
affected by extreme values, unlike the arithmetic mean.

1.3. Median
The median is the middle value of a dataset listed in ascending order (i.e., from
smallest to largest value). It divides the lower half of the dataset from the upper half.
To find the median of ungrouped data, the dataset must first be arranged from the
smallest to the largest value.
When the number of observations (n) is odd, the centre value, i.e. the ((n + 1)/2)th
value, is the median.
When n is even, the median is the average of the (n/2)th value and the value directly
on its right, i.e. the (n/2 + 1)th value.

Table 3: Ungrouped Swiss chard plant height data (arranged in ascending order)
Week   Treatment   Swiss chard plant height (mm)
1      T1R1Z0Sc    67
1      T1R2Z0Sa    76
1      T1R2Z0Sb    86
1      T1R3Z0Sb    93
1      T1R3Z0Sc    98
1      T1R1Z0Sb    100
1      T1R4Z0Sa    102
1      T1R6Z0Sc    104
1      T1R6Z0Sa    111
1      T1R1Z0Sa    115
1      T1R6Z0Sb    119
1      T1R5Z0Sa    122
1      T1R5Z0Sb    129
1      T1R4Z0Sb    132
1      T1R5Z0Sc    142
1      T1R4Z0Sc    145
1      T1R2Z0Sc    161
1      T1R3Z0Sa    171

Example (Ungrouped data):
• The data in Table 3 is arranged in ascending order and the number of
  observations is even (n = 18). Thus the median is halfway between the 9th and
  10th values, 111 mm and 115 mm.
• To find this, we add the two numbers and divide their sum by 2, i.e.
  (111 + 115)/2 = 113 mm.
• Had we, for instance, omitted the last value in the data set, we would have an
  odd number of observations (n = 17), in which case the middle (9th) value would
  be 111 mm.
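
The odd and even cases can be written out as a small Python function. This is a sketch
for illustration; the function name median_of is hypothetical.

```python
def median_of(values):
    """Return the median of a list of numbers."""
    ordered = sorted(values)          # arrange from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # odd n: the single centre value
    # even n: average of the two centre values
    return (ordered[middle - 1] + ordered[middle]) / 2

# Swiss chard plant heights (mm) from Table 1
heights_mm = [115, 100, 67, 76, 86, 161, 171, 93, 98,
              102, 132, 145, 122, 129, 142, 111, 119, 104]

median_all = median_of(heights_mm)                # n = 18
median_17 = median_of(sorted(heights_mm)[:-1])    # largest value omitted, n = 17

print(median_all, median_17)
```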

NB: The median is a "central" value – there are as many values greater than it as
there are less than it. An advantage of the median is that it is unaffected by extreme
values and is useful where the distribution of the variable is severely skewed.

Which measure to choose?


The mode should be used as the measure of centre for a qualitative variable. When
the variable is quantitative with a symmetric distribution, the mean is a proper
measure of centre. For quantitative variables with a skewed distribution, the median
is a good choice for the measure of centre. This is because the mean can be highly
influenced by an observation that falls far from the rest of the data, called an outlier.

It should be noted that the sample mode, the sample median and the sample mean of
the variable in question have corresponding population measures of centre, i.e., we
can assume that the variable in question also has a population mode, a population
median and a population mean, which are all unknown. The sample mode, the sample
median and the sample mean can then be used to estimate the values of these
corresponding unknown population values.

1.4. Measures of skewness
Another measure can be used to describe the shape of the distribution curve of a
variable. This measure is called the measure of skewness. Skewness refers to the
degree of deviation from symmetry. In simple words, skewness measures how much
the probability distribution of a random variable deviates from a symmetrical
distribution such as the normal distribution. The normal distribution is a probability
distribution without any skewness.

Figure 4 below shows a symmetrical distribution (essentially a normal distribution); it is
symmetrical on both sides of the dashed line. Apart from this, there are two types of
skewness:
1) Positive Skewness
2) Negative Skewness
A probability distribution with its tail on the right side is positively skewed, and one
with its tail on the left side is negatively skewed.

Figure 4: Skewness

Symmetrical distribution
This type of distribution has a central peak and mirror images on either side of the
central value. In this case, the arithmetic mean, the median and the mode are the
same, and the value of skewness is zero, as it is for a normal distribution. In reality,
no real-world data has a perfectly normal distribution, so the value of skewness is
rarely exactly zero; it is nearly zero. The value of zero is nevertheless used as a
reference for determining the skewness of a distribution.

Figure 5: Symmetrical or normal distribution

NB: A distribution, or data set, is symmetric if it looks the same to the left and right of
the centre point.

Skewed to the right (positively skewed)


This type of distribution has a few relatively large values in the dataset; thus the
distribution curve has a "long tail" on the right-hand (positive) side. The mean is
the measure of central tendency that is the most affected by this and hence lies the
furthest to the right. The median is not as strongly affected, while the mode still lies at
the peak of the distribution. If the mean is greater than the median, then the distribution
is generally skewed to the right.

Figure 6: Positively skewed

Skewed to the left (negatively skewed)
This type of distribution has a few relatively small values in the dataset; thus the
distribution curve has a "long tail" on the left-hand (negative) side. The mean will
therefore move the furthest to the left, the median not as much, while the mode will
remain at the peak of the curve.

Figure 7: Negatively skewed

When a dataset is strongly skewed, it is often best to choose the median as the measure
of central tendency. It is not as strongly affected by extreme values as the arithmetic
mean, while the frequency of one specific value does not play such a large role as in
the case of the mode. If the mean is less than the median, then the distribution is
generally said to be skewed to the left.

NB: A negative value for skew (Sk) indicates skewness to the left and a positive value
skewness to the right. The closer the value is to 0, the more symmetrical the data.
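
These notes do not state which formula is used for Sk. Purely as an illustration, the
Python sketch below assumes Pearson's second coefficient of skewness,
Sk = 3 × (mean − median) / s, where s is the sample standard deviation (Section 1.6).
Other definitions of skewness exist and give different numerical values, although the
sign is interpreted in the same way. The function name is hypothetical.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """Pearson's second coefficient of skewness: 3 * (mean - median) / s.

    Assumed here for illustration; the notes do not specify a formula for Sk.
    """
    return 3 * (mean(values) - median(values)) / stdev(values)

# Salaries (R thousands) from the example in Section 1.1: two very large values
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

# A perfectly symmetrical data set for comparison
symmetric = [1, 2, 3, 4, 5]

sk_salaries = pearson_skewness(salaries_k)
sk_symmetric = pearson_skewness(symmetric)

print(round(sk_salaries, 2), sk_symmetric)
```

The salaries give a positive value (mean greater than median, long tail to the right),
while the symmetrical data set gives zero.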

1.5. Measures of peakedness
The formula used to calculate peakedness (kurtosis) is very complex. Therefore,
peakedness is usually assessed by inspecting a distribution graph (e.g. a frequency
polygon).
• A highly peaked graph (heavy concentration of observations around the central
  value) is called leptokurtic.
• A moderately peaked distribution is called mesokurtic.
• A flat distribution (observations are widely spread around the central value) is
  called platykurtic.

Figure 8: Distribution graph for Peakedness

1.6. Measures of dispersion (variation)


Measures of dispersion refer to the extent to which the observations of a variable are
spread around a central value. The figure below illustrates three different cases where
the central value is the same, but with very different measures of dispersion. Measures
of dispersion supply useful information regarding the reliability and representativeness
of the central value. Widely dispersed data indicates that the central value is not very
reliable or representative of the dataset. Closely spaced data on the other hand
indicates that the central value is a good, reliable representative of the data.

Figure 9: Measures of dispersion

Just as there are several different measures of centre, there are also several different
measures of variation. Below, we will examine four of the most frequently used
measures of variation: the sample range, the sample interquartile range, the sample
variance and the sample standard deviation. Measures of variation are mostly used
only for quantitative variables.

Range
The range is the simplest statistic which indicates the spread of a set of numbers.
It is simply equal to the difference between the lowest and highest value in a sample.
The range has the serious disadvantage that it only uses the extreme two values of
the sample and ignores the information contained in the rest of the sample.

Range = Max − Min
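
As a minimal illustration of this formula, the Python sketch below computes the range
of the Swiss chard plant heights from Table 1 (variable names are illustrative):

```python
# Swiss chard plant heights (mm) from Table 1
heights_mm = [115, 100, 67, 76, 86, 161, 171, 93, 98,
              102, 132, 145, 122, 129, 142, 111, 119, 104]

# Range = Max - Min; only the two extreme values are used
data_range = max(heights_mm) - min(heights_mm)

print(max(heights_mm), min(heights_mm), data_range)
```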

• Think tank: 7 participants in a bike race had the following finishing
  times in minutes: 28, 22, 26, 29, 21, 23, 24. What is the range?

Quartiles; Deciles and Percentiles
The median divides the set of data into two equal parts. Similarly, the set can be
divided into four equal parts by the quartiles Q1, Q2 and Q3, into ten equal parts by
the deciles D1, D2, …, D9, or into one hundred equal parts by the percentiles
P1, P2, …, P99.
[The median is thus equal to Q2, D5 and P50.]

Figure 10: Quartiles, deciles and percentiles

• Looking at the last example, we can find the 10th percentile by first putting our
  data in order.
• The 10th percentile is the value below which 10% of the data points are found.
• In this case, it's 1.
• With a similar method, you can find the 90th percentile (below which 90% of
  your data is found); in this case, it's 99.
• Using the same example, had we been asked to find the second quartile (Q2),
  the fifth decile (D5) or the 50th percentile (P50), we would find that
  Q2 = D5 = P50 and that they are all equal to the median.

Think tank: What is the median of the example in question?

Q1 = P25 is the value below which 25% of the measured values fall
Q3 = P75 is the value below which 75% of the measured values fall
P10 = D1 is the value below which 10% of the measured values fall
P90 = D9 is the value below which 90% of the measured values fall

A useful measure of dispersion is the interquartile range. It represents the middle 50%
of the observations, i.e. (Q3-Q1) or (P75-P25). The semi-interquartile range can also
be used. This is half of the interquartile range, i.e. 0.5(Q3-Q1) or 0.5(P75-P25).
Variations of the above measures may also be used - the middle 80% of the
observations will be (P90-P10), etc.
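
Several conventions exist for locating quartiles and percentiles in a small sample, so
different textbooks and software packages can give slightly different values. The Python
sketch below is one possible illustration: it assumes the default ("exclusive") method of
the standard-library function statistics.quantiles (Python 3.8 or later) and applies it
to the Swiss chard heights of Table 1. Values obtained by hand with another convention
may differ somewhat.

```python
from statistics import median, quantiles

# Swiss chard plant heights (mm) from Table 1
heights_mm = [115, 100, 67, 76, 86, 161, 171, 93, 98,
              102, 132, 145, 122, 129, 142, 111, 119, 104]

q1, q2, q3 = quantiles(heights_mm, n=4)      # 3 cut points -> 4 equal parts
deciles = quantiles(heights_mm, n=10)        # D1 .. D9
percentiles = quantiles(heights_mm, n=100)   # P1 .. P99

iqr = q3 - q1                                  # interquartile range (Q3 - Q1)
semi_iqr = iqr / 2                             # semi-interquartile range
middle_80 = percentiles[89] - percentiles[9]   # P90 - P10

# Q2, D5 and P50 all coincide with the median
print(q2, deciles[4], percentiles[49], median(heights_mm))
print(q1, q3, iqr, semi_iqr, middle_80)
```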
Example: 7 participants in a bike race had the following finishing times in minutes:
28, 22, 26, 29, 21, 23, 24. What is the semi-interquartile range?

Variance
The most reliable measure of dispersion/variability is one that takes all the
observations into account and is based on the average deviation from the central
value. Since the variance possesses both these characteristics, it is the most
commonly used measure of dispersion/variability in statistical analysis.

The variance is the sum of the squared differences between each sample value and
the sample mean, divided by (n - 1). Using the ∑ notation:

$$S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$$

Example 1: Ungrouped data

Table 6: Ages (years) of 7 second-hand cars.

Age (x)     x − x̅     (x − x̅)²
13           1          1
7           -5         25
10          -2          4
15           3          9
12           0          0
18           6         36
9           -3          9
∑x = 84                ∑(x − x̅)² = 84

n = 7; x̅ = 84/7 = 12

$$S^2 = \frac{84}{6} = 14 \text{ years}^2$$

(note the unit used)
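
The steps of this worked example can be traced in a short Python sketch (variable
names are illustrative):

```python
ages = [13, 7, 10, 15, 12, 18, 9]   # ages (years) of the 7 cars in Table 6

n = len(ages)
mean_age = sum(ages) / n                              # 84 / 7
squared_devs = [(x - mean_age) ** 2 for x in ages]    # (x - mean)^2 for each car
sum_squared_devs = sum(squared_devs)
variance = sum_squared_devs / (n - 1)                 # divide by n - 1, not n

print(mean_age, sum_squared_devs, variance)
```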

Example 2: 7 participants in a bike race had the following finishing times in minutes:
28, 22, 26, 29, 21, 23, 24. Find the variance of the observations.

Since the variance is a measure of the average squared deviation from the arithmetic
mean, it is expressed in squared units. Consequently, its practical application is often
problematic, and a measure is needed that is expressed in the same unit as the original
observations, namely the standard deviation.

Standard deviation

• The standard deviation (S) is simply the square root of the variance.
• It measures, roughly, the average distance of the data values from the mean.
• It cannot be negative, and it is zero only when all the data entries are the same.
• It is greatly affected by outliers.

$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$

Therefore, the standard deviation for the car ages in Table 6 (variance 14 years²) is
√14 ≈ 3.74 years (note the unit used).
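
As a quick check, the Python sketch below takes the square root of the variance found
for the car ages and compares it with the sample standard deviation computed directly
by the standard library (names are illustrative):

```python
from math import sqrt
from statistics import stdev

ages = [13, 7, 10, 15, 12, 18, 9]   # ages (years) of the 7 cars in Table 6

variance = 14                        # years^2, from the variance example
sd_from_variance = sqrt(variance)    # square root of the variance
sd_direct = stdev(ages)              # sample standard deviation (n - 1 divisor)

print(round(sd_from_variance, 2), round(sd_direct, 2))
```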

Interpretation of the standard deviation

The standard deviation is a relatively constant measure of dispersion around the
central value. If the frequency distribution is approximately normal (symmetrical and
bell-shaped), about 68% of all the observations will fall within one standard deviation
of the mean. Likewise, about 95.4% of all the observations will fall within two standard
deviations of the mean, and about 99.7% within three standard deviations.
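
To see how this interpretation plays out on real data, the Python sketch below counts
how many of the 18 Swiss chard heights from Table 1 lie within one, two and three
standard deviations of their mean. It is only an illustration: with so few observations
the shares can match the percentages above only roughly. The helper name count_within
is hypothetical.

```python
from statistics import mean, stdev

# Swiss chard plant heights (mm) from Table 1
heights_mm = [115, 100, 67, 76, 86, 161, 171, 93, 98,
              102, 132, 145, 122, 129, 142, 111, 119, 104]

centre = mean(heights_mm)   # arithmetic mean
s = stdev(heights_mm)       # sample standard deviation

def count_within(k):
    """Count observations lying within k standard deviations of the mean."""
    return sum(1 for x in heights_mm if abs(x - centre) <= k * s)

for k in (1, 2, 3):
    share = count_within(k) / len(heights_mm)
    print(f"within {k} SD: {count_within(k)} of {len(heights_mm)} ({share:.0%})")
```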

Coefficient of variation

Sometimes it is preferable to express the variation in the data set as a percentage of
the mean. For this purpose the coefficient of variation is used:

$$V = \frac{S}{\bar{x}}$$

For example, for a data set with a standard deviation of 2.55 ppm and a mean of
60.45 ppm, this formula gives

$$V = \frac{2.55}{60.45} \approx 0.04$$

Therefore, the average of the data set is 60.45 ppm with a variation of about 4%.
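
The same calculation as a minimal Python sketch, using the two numbers from the
example (the variable names are illustrative):

```python
s = 2.55        # standard deviation (ppm), from the example
x_bar = 60.45   # mean (ppm), from the example

cv = s / x_bar           # coefficient of variation as a proportion
cv_percent = cv * 100    # the same value expressed as a percentage

print(round(cv, 2), round(cv_percent, 1))
```

Because the standard deviation and the mean share the same unit, the coefficient of
variation has no unit, which makes it convenient for comparing data sets measured on
different scales.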

Note: The more variation there is in the observed values, the larger the standard
deviation for the variable in question. Thus the standard deviation satisfies the basic
criterion for a measure of variation and, as noted above, it is widely used as a measure
of variation. However, the standard deviation does have its drawbacks. For instance,
its value can be strongly affected by a few extreme observations.
