Data Analysis
1. INTRODUCTION
(a) Data related to the state of the atmospheric environment. These include
observations of rainfall, sunshine, solar radiation, air temperature, humidity,
and wind speed and direction;
(b) Data related to the state of the soil environment. These include observations
of soil moisture, i.e., the soil water reservoir for plant growth and development.
The amount of water available depends on the effectiveness of precipitation or
irrigation, and on the soil’s physical properties and depth. The rate of water
loss from the soil depends on the climate, the soil’s physical properties, and
the root system of the plant community. Erosion by wind and water depends
on weather factors and vegetative cover;
(f) Information related to the distribution of weather and agricultural crops, and
geographical information, including digital maps.
(a) Original data files, which may be used for reference purposes (the daily
register of observations, etc.), should be stored at the observation site; this
applies equally to atmospheric, biological, crop, or soil data;
(b) The most frequently used data should be collected at national or regional
agrometeorological centers and reside in host servers for network accessibility.
However, this may not always be practical since unique agrometeorological
data are often collected by stations or laboratories under the control of
different authorities (meteorological services, agricultural services,
universities, research institutes). Steps should, therefore, be taken to ensure
that possible users are aware of the existence of such data, either through some
form of data library or computerized documentation, and that appropriate data
exchange mechanisms are available to access and share these data;
(c) Data resulting from special studies should be stored at the place where the
research work is undertaken, but it would be advantageous to arrange for
exchanges of data between centers carrying out similar research work. At the
same time, the existence of these data should be publicized at the national
level and possibly at the international level, if appropriate, especially in the
case of longer series of special observations;
include meteorological, phenological, edaphic, and agronomic information. Database
management and processing, quality control, archiving, timely access, and
dissemination are all important components that make the information valuable and
useful in agricultural research and operational programs.
Having been stored in a data center, the data are disseminated to users. Great
strides have been made in the automation age to make more data products available to
the user community. The introduction of electronic transfer of data files via Internet
using file transfer protocol (FTP) and the World Wide Web (WWW) has advanced this
information transfer process to a new level. The WWW allows users to access text,
images, and even sound files that can be linked together electronically. The WWW’s
attributes include the flexibility to handle a wide range of data presentation methods
and the capability to reach a large audience. Developing countries have some access
to this type of electronic information, but limitations still exist in the development of
their own electronically accessible databases. These limitations will diminish as the
cost of technology decreases and its availability increases.
(a) The data, such as the daily register of observations and charts of recording
instruments, should be carefully preserved as permanent records. They should be
readily identifiable and include the place, date, time of each observation, and the units
used.
(b) These basic data should be sent to analysis centers for operational uses, e.g.,
local agricultural weather forecasts, agricultural meteorological information services,
plant-protection treatment, and irrigation. Summaries (weekly, 10-day, or monthly)
of these data should be compiled regularly from the daily register of observations
according to the demands of users and then distributed to interested agencies and
users.
(c) The data should be recorded in a standard format so that they can be readily
transferred to data centers for subsequent automatic processing; observers should
therefore record all measurements according to agreed rules. The data can be
transferred to data centers in many ways, such as by mail, telephone, fax, the
Internet, or communications satellite, of which the Internet and satellite links are
the most efficient. On reaching the data centers, the data should be identified and
processed by special programs to make them readily usable by other users.
and the current Climatological Guide. Every measurement must be checked to
ascertain whether it is reasonable. If a value is unreasonable, it should be
corrected immediately. After being scrutinized, the data can be processed further
for different purposes.
Large amounts of data are typically required for processing, analysis, and
dissemination. It is extremely important that data be in a format that is both easily
accessible and user friendly. This is particularly true as more data become available
in electronic form. Some software stores and processes data in a common form that
can be disseminated to many users; one example is NetCDF (network Common Data
Form), a machine-independent data format together with a library that implements
array-oriented data access (Sivakumar et al., 2000). The NetCDF software was
developed at the Unidata Program Center in Boulder, Colorado, USA. The freely
available source can be obtained by anonymous FTP from
ftp://ftp.unidata.ucar.edu/pub/netcdf/ or from other mirror sites.
The NetCDF package supports the creation, access, and sharing of scientific data.
It is particularly useful at sites with a mixture of computers connected by a network.
Data stored on one computer may be read directly from another without explicit
conversion. The NetCDF library generalizes access to scientific data so that the
methods for storing and accessing data are independent of the computer architecture
and the applications being used. Standardized data access facilitates the sharing of
data. Since the NetCDF package is quite general, a wide variety of analysis and
display applications can use it. The NetCDF software and documentation may be
obtained from the NetCDF website at http://www.unidata.ucar.edu/packages/netcdf/.
3. DISTRIBUTION OF DATA
(a) Raw or partially processed operational data supplied after only a short delay
(rainfall, potential evapotranspiration, water balance, sums of temperature). These
may be distributed:
Researchers invariably know exactly what agrometeorological data they require
for specific statistical analyses, modeling, or other analytical studies. Many
agricultural users are often not only unaware of the actual scope of the
agrometeorological services available, but have only a vague idea of the data they
really need. Frequent contact between agrometeorologists and professional
agriculturists, and enquiries through professional associations and among
agriculturists themselves or visiting professional websites, can help enormously to
improve the awareness of data needs. Sivakumar (1998) presents a broad overview
of user requirements for agrometeorological services. On that basis, better
application of the type and amount of useful agrometeorological data available, and
the selection of the type of data to be systematically distributed, can be established. For
example, when both the climatic regions and the areas in which different crops are
grown are well defined, an agrometeorological analysis can illustrate which crops are
most suited to each climate zone. This type of analysis can also show which crops
can be adapted to changing climatic and agronomic conditions. When such analyses
are required by agricultural users, they can be distributed by geographic region,
crop region, or climatic region.
4. DATABASE MANAGEMENT
system with the following considerations:
(b) The outputs must be adapted for an operational database in order to support
specific agrometeorological applications at a national/regional/global level; and,
A personal computer (PC) can produce products formatted for easy reading and
presentation, generated through simple word processors, databases, or spreadsheet
applications. However, some careful thought needs to be given to what type of
product is needed, what the product looks like, and what it contains, before the
database delivery design is finalized. The greatest difficulty often encountered is
how to treat missing data or information (WMO-TD N° 1236, 2004). This process is
even more complicated when data from several different data sets such as climatic and
agricultural data are combined. Some software for database management, especially
the software for climatic database management, provide convenient tools for
agrometeorological database management.
CLICOM provides tools to describe and manage the climatological network (i.e.,
stations, observations, instruments, etc.). It offers procedures for key entry,
checking and archiving of climate data, and for computing and analyzing the data.
Typical standard outputs
include monthly or 10-day data from daily data; statistics such as means, maximums,
minimums, and standard deviations; tables; and graphs. Other products, requiring
more elaborate data processing, include water balance monitoring, estimation of
missing precipitation data, calculation of return periods, and preparation of the
CLIMAT message.
inadequate. However, CLICOM systems are beginning to yield positive results and
there is a growing recognition of the operational applications of CLICOM.
There are a number of constraints that have been identified over time and
recognized for possible improvement in future versions of the CLICOM system.
Among the technical limitations, the list includes (Motha, 2000):
For the agronomic and natural components in agrometeorology, these tools have
taken the name Land Information Systems (LIS) (Sivakumar et al., 2000). In both
GIS and LIS, the key components are the same: i.e., hardware, software, data,
techniques, and technicians. However, LIS requires detailed information on the
environmental elements such as meteorological parameters, vegetation, soil, and water.
The final product of LIS is often the result of a combination of numerous complex
informative layers, whose precision is fundamental for the reliability of the whole
system.
commonly used. The two parameters, a and b, are directly related to the average
amount of precipitation per wet day. They can, therefore, be determined from the
monthly means of the number of rainy days per month and the amount of
precipitation per month, which are obtained either from compilations of climate
normals or from interpolated surfaces.
5. AGROMETEOROLOGICAL INFORMATION
The following are some of the more frequent types of information which can be
derived from the basic data:
(a) Air temperature
(i) Temperature probabilities;
(ii) Chilling hours;
(iii) Degree days;
(iv) Hours or days above or below selected temperatures;
(v) Interdiurnal variability;
(vi) Maximum and minimum temperature statistics; and,
(vii) Growing season statistics, i.e., the dates when threshold values of
temperature for the growth of various kinds of crops begin and end.
(b) Precipitation
(i) Probability of specified amount during a period;
(ii) Number of days with specified amounts of precipitation;
(iii) Probabilities of thundershowers;
(iv) Duration and amount of snow cover;
(v) Date of beginning and ending of snow cover; and,
(vi) Probability of extreme precipitation amounts.
(c) Wind
(i) Wind rose;
(ii) Maximum wind, average wind speed;
(iii) Diurnal variation; and,
(iv) Hours of wind less than selected speed.
(e) Humidity
(i) Probability of specified relative humidity; and,
(ii) Duration of specified threshold of humidity with time.
(g) Dew
(i) Duration and amount of dew;
(ii) Diurnal variation of dew;
(iii) Association of dew with vegetative wetting; and,
(iv) Probability of dew formation with season.
(v) Leaf area index;
(vi) Above ground biomass;
(vii) Crop canopy temperature;
(viii) Leaf temperature; and,
(ix) Crop root length.
The remarks set out here are intended to be supplementary to Chapter 5, "The
use of statistics in climatology", of the WMO Guide to Climatological Practices and
to WMO Technical Note No. 81, “Some methods of climatological analysis”, which
contain advice generally appropriate and applicable to agricultural climatology.
It must not be forgotten that advice on long-term agricultural planning, on the
selection of the most suitable farming enterprise, on the provision of proper
equipment, and on the introduction of protective measures against severe weather
conditions all depend to some extent on the quality of the climatological analyses of
the agroclimatic and related data, and, hence, on the statistical methods on which
these analyses are based. Another point which needs to be stressed is that one is
often obliged to compare measurements of the physical environment with biological
data, which are often difficult to quantify.
provide a user-friendly interface with self-prompting analysis selection dialogs.
Many software packages include electronic manuals which provide extensive
explanations of analysis options with examples and comprehensive statistical advice.
Some commercial packages are rather expensive, but there are free statistical
analysis packages which can be downloaded from the web or made
available upon request. One example of freely available software is INSTAT, which
was developed with applications in agrometeorology in mind. It is a general purpose
statistics package for PCs which was developed by the Statistical Service Centre of
the University of Reading, England. It uses a simple command language to process
and analyze data. The documentation and software can be downloaded from the web.
Data for analysis can be entered into a table or copied and pasted from the clipboard.
If CLICOM is used as the database management software, then INSTAT, which was
designed for use with CLICOM, can readily be used to extract the data and perform
statistical analyses. INSTAT can be used to calculate simple descriptive statistics
including: minimum and maximum values, range, mean, standard deviation, median,
lower quartile, upper quartile, skewness, and kurtosis. It can be used to calculate
probabilities and percentiles for standard distributions, normal scores, t-tests and
confidence intervals, chi-square tests, and non-parametric statistics. It can be used to
plot data, for regression and correlation analysis and analysis of time series.
INSTAT is designed to provide a range of climate analyses. It has commands for
10-day, monthly, and yearly statistics. It calculates water balance from rainfall and
evaporation, start of rains, degree days, wind direction frequencies, spell lengths,
potential ET according to Penman, and crop performance index according to FAO
methodology. The usefulness of INSTAT for agroclimatic analysis is illustrated in
the publication on the Agroclimatology of West Africa: Niger. The major part of the
analysis reported in the bulletin was carried out using INSTAT.
(a) For the purpose of meeting national and regional requirements, studies on a
macroclimatic scale are useful, and may be based mainly on data from
synoptic stations. For some atmospheric parameters with little spatial
variation--e.g., duration of sunshine over a week or 10-day period--such an
analysis is found to be satisfactory;
(b) In order to plan the activities of an agricultural undertaking, or group of
undertakings, it is essential, however, to change over to the mesoclimatic or
topoclimatic scale, i.e., to take into account local geomorphological features
and to use data from an observational network with a finer mesh. These
complementary climatological series of data may be for much shorter periods
than those used for macroclimatic analyses, provided they can be related to
some long reference series;
(c) For bioclimatic research, the physical environment should be studied at the
level of the plant or animal or the pathogenic colony itself. Obtaining
information about radiation energy, moisture, and chemical exchanges
involves handling measurements on the much finer scale of microclimatology.
(d) For research on the impacts of a changing climate, long-term historical series
and future climate scenarios should be assembled and extrapolated.
The climatic elements do not act independently on the biological life-cycle of
living things: an analytical study of their individual effects is often illusory; handling
them all simultaneously, however, requires considerable data and complex statistical
treatment. It is often better to try to combine several factors into single agroclimatic
indices, considered as complex parameters, which can be compared more easily with
biological data.
Group   Group boundaries   Class limits   Mid-mark xi   Frequency fi   Cumulative     Relative cumulative
                           or interval                                 frequency Fi   frequency (%)
1       879.5-1029.5       880-1029       954.5         2              2              4
2       1029.5-1179.5      1030-1179      1104.5        8              10             20
3       1179.5-1329.5      1180-1329      1254.5        15             25             50
4       1329.5-1479.5      1330-1479      1404.5        4              29             58
5       1479.5-1629.5      1480-1629      1554.5        10             39             78
6       1629.5-1779.5      1630-1779      1704.5        8              47             94
7       1779.5-1929.5      1780-1929      1854.5        2              49             98
8       1929.5-2079.5      1930-2079      2004.5        0              49             98
9       2079.5-2229.5      2080-2229      2154.5        1              50             100
Total:                                                  50
The table has a column showing the limits defining the classes and another column
giving the lower and upper class boundaries, which in turn give rise to the class
widths or class intervals; yet another column gives the mid-marks of the classes, and
another gives the totals of the tally, known as the group or class frequencies.
A further column has entries known as the cumulative frequencies. They are
obtained from the frequency column by entering the number of observations with
values less than or equal to the upper class boundary of each group.
The pattern of frequencies obtained by arranging data into classes is called the
frequency distribution of the sample. The probability of finding an observation in a
class can be obtained by dividing the frequency for the class by the total number of
observations. A frequency distribution can be represented graphically with a
two-dimensional histogram, where the heights of the columns in the graph are
proportional to the class frequencies.
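A frequency distribution of the kind shown above can be sketched in a few lines of Python; the class scheme mirrors the table (lower boundary 879.5, class width 150), while the sample rainfall values are invented for illustration:

```python
# Sketch: frequency distribution with cumulative and relative cumulative
# frequencies. Class scheme mirrors the table above (boundary 879.5,
# width 150); the rainfall values themselves are invented.
def frequency_table(data, lower, width, n_classes):
    fi = [0] * n_classes
    for x in data:
        k = int((x - lower) // width)
        if 0 <= k < n_classes:
            fi[k] += 1
    rows, cum = [], 0
    for k, f in enumerate(fi):
        lo = lower + k * width
        cum += f
        # (boundaries, mid-mark, frequency, cumulative, relative cumulative %)
        rows.append((lo, lo + width, lo + width / 2, f, cum, 100 * cum / len(data)))
    return rows

rainfall = [900, 1100, 1150, 1250, 1300, 1500, 1700, 2100]
for lo, hi, mid, f, cum, rel in frequency_table(rainfall, 879.5, 150, 9):
    print(f"{lo:6.1f}-{hi:6.1f}  mid {mid:6.1f}  f={f}  F={cum}  {rel:5.1f}%")
```

Dividing each class frequency by the total number of observations gives the empirical probability of that class, exactly as described above.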
Frequency distribution groupings have the disadvantage that certain information is
lost: individual values can no longer be distinguished within a class, such as the
exact value of the highest observation in its class.
a standard deviation σ . Any set of data that tends to give rise to a normal curve is
said to be normally distributed. The normal distribution is completely characterized
by its mean and standard deviation. Sample statistics are functions of observed
values that are used to infer something about the population from which the values are
drawn. The sample mean and sample variance, for instance, can be used as
estimates of population mean and population variance, respectively, provided the
relationship between these sample statistics and the populations from which the
samples are drawn is known. In general, the sampling distribution of means is less
spread out than the parent population. This fact is embodied in the central limit
theorem which states that if random samples of size n are drawn from a large
population (hypothetically infinite) which has mean µ and standard deviation σ,
then the theoretical sampling distribution of X̄ has mean µ and standard deviation
σ/√n. The theoretical sampling distribution of X̄ can be closely approximated by
the corresponding Normal curve if n is large. Even for quite small samples,
particularly if we know that the parent population is itself approximately Normal, we
can confidently apply the theorem. If we are not sure that the parent population is
Normal, we should, as a rule, restrict ourselves to applying the theorem to samples of
size ≥ 30 .
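The theorem can be checked empirically. The sketch below draws repeated samples from a skewed (exponential) population with σ = 1 and confirms that the spread of the sample means is close to σ/√n; the sample size and number of trials are arbitrary choices:

```python
import random
import statistics

# Empirical check of the central limit theorem: sample means from a
# skewed Exponential(1) population (mu = sigma = 1) have a standard
# deviation close to sigma / sqrt(n).
random.seed(42)
n, trials = 36, 5000
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]
observed = statistics.stdev(means)
print(observed, 1.0 / n ** 0.5)   # both close to 1/6
```

Even though the parent population here is far from Normal, a histogram of `means` would already look bell-shaped at n = 36.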
The standard deviation of a sampling distribution is often called the standard error of
the sample statistic concerned. Thus σ_X̄ = σ/√n is the standard error of X̄.
In order to compare different distributions having different means and different
standard deviations with one another they need to be transformed. One way would
be to center them about the same mean by subtracting the mean from each observation
in each of the populations. This will move each of the distributions along the scale
until they are centered about zero, which is the mean of all transformed distributions.
However, each distribution will still maintain a different bell-shape.
The z-score
A further transformation is done by subtracting the mean of the distribution from each
observation and dividing by the standard deviation of the distribution, a procedure
known as standardization. The result is a variable Z, known as a z-score and having
the standard normal form:

z = (X − µ)/σ

This will give identical bell-shaped curves with a Normal distribution around zero
mean and a standard deviation equal to unity.
The z-scale is a horizontal scale set up for any given normal curve with some mean
µ and some standard deviation σ. On this scale, the mean is marked 0 and the unit
measure is taken to be σ, the particular standard deviation of the normal curve in
question. A raw score X can be converted into a z-score by the above formula.
For example, with µ = 80 and σ = 4, to convert the raw score X = 85 into a z-score,
we write

z = (X − µ)/σ = (85 − 80)/4 = 5/4 = 1.25.
The meaning here is that the X-score lies 1.25 standard deviations (five units) to the
right of the mean. If we compute the z-score equivalent of X = 74, we get

z = (X − µ)/σ = (74 − 80)/4 = −6/4 = −1.5.

The meaning of this negative z-score is that the original X-score 74 lies one and
one-half standard deviations (that is, six units) to the left of the mean. A z-score tells
how many standard deviations removed from the mean the original X-score is, to the
right (if z is positive) or to the left (if z is negative).
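Standardization is one line of code; a minimal sketch reproducing the two worked examples (µ = 80, σ = 4):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

print(z_score(85, 80, 4))   # 1.25
print(z_score(74, 80, 4))   # -1.5
```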
There are many different normal curves due to the different means and standard
deviations. However, for a fixed mean µ and a fixed standard deviation σ , there is
exactly one normal curve having that mean and that standard deviation.
convert the x-value into a z-score; the tabulated number is the desired area. If z
turns out to be negative, just look it up as if it were positive. If the data are
normally distributed, then about 68 percent of the data in the series will fall within
±1σ of the mean, that is, z = ±1. Likewise, about 95 percent of the data fall within
±2σ of the mean (z = ±2), and about 99.7 percent within ±3σ of the mean (z = ±3).
Suppose, for example, that the heights of rice stalks in a paddy field are normally
distributed with mean µ = 38 cm and standard deviation σ = 4.5 cm, and we wish to
find the probability that the height of a stalk taken at random will be between 35 and
40 cm. To solve this problem, we must find the area under a portion of the
appropriate normal curve, between x = 35 and x = 40 (see Figure 12-15). It is
necessary to convert these x-values into z-scores as follows.
For X = 35: z = (X − µ)/σ = (35 − 38)/4.5 = −3/4.5 ≅ −0.67.

For X = 40: z = (X − µ)/σ = (40 − 38)/4.5 = 2/4.5 ≅ 0.44.
From the mean out to z=0.44, the area (on the right side of the region in Figure 12-15)
from Table 2 is 0.1700.
From the mean out to z=-0.67, the area (on the left side of the region in Figure 12-15)
from Table 2 is 0.2486.
It is clear from Figure 12-15 that the shaded area on both sides must be counted, and
so we add areas to obtain 0.1700+0.2486=0.4186.
Thus, the probability that a stalk chosen at random will have height X between 35 and
40 cm is 0.4186. In other words, we would expect 41.86% of the paddy field’s rice
stalks to have heights in that range.
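Instead of table look-up, the same area can be computed from the normal cumulative distribution function; a sketch using the error function from the standard library, with µ = 38 cm and σ = 4.5 cm as above:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Cumulative probability of the normal distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# P(35 < X < 40) for stalk heights with mu = 38 cm, sigma = 4.5 cm
p = normal_cdf(40, 38, 4.5) - normal_cdf(35, 38, 4.5)
print(round(p, 4))   # close to the 0.4186 obtained from the rounded table values
```

The small difference from the tabulated answer comes only from rounding z to two decimals before the table look-up.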
Elements that are not normally distributed may easily be transformed mathematically
to the normal distribution, an operation known as normalization. Among the
moderate normalizing operators are the square root, the cube root, and the logarithm
for positively skewed data such as rainfall. The transformation reduces the higher
values by proportionally greater amounts than smaller values.
For example, for a measurement of 3.0 cm in a series with mean 14.2 cm and
standard deviation 4.7 cm:

Z = (3.0 − 14.2)/4.7 ≈ −2.4.

The probability of finding a value smaller than −2.4 standard deviations is the
cumulative probability to this point; from our table, we can see that it is 0.0082,
which is very small indeed. Now, what is the probability of finding one longer than
20 cm? Again, converting to standard normal form:

Z = (20.0 − 14.2)/4.7 ≈ 1.2
Because the total area under the normal distribution curve is 1.00, the probability of
obtaining a measurement of 1.2 standard deviations or greater than the mean is the
same as 1.0 minus the cumulative probability of obtaining anything smaller.
The Standard Normal Distribution Table gives the cumulative probability up to 1.2,
which is 0.8849. Therefore, the probability of finding a specimen longer than 20 cm
is 1.0000 − 0.8849 = 0.1151, or slightly more than one chance in ten. Now,
compute the probability of finding, at random, a stalk whose length falls in the size
range from 15 to 20 cm:
for 15 cm: Z = (15.0 − 14.2)/4.7 ≈ 0.2

for 20 cm: Z = (20.0 − 14.2)/4.7 ≈ 1.2
The Gumbel double exponential distribution is the one most used for describing
extreme values. An event which has occurred m times in a long series of n
independent trials, one per year say, has an estimated probability p = m/n;
conversely, the average interval between recurrences of the event during a long
period would be n/m; this is defined as the return period T, where:

T = 1/p
For example, if there is a 5 percent chance that an event will occur in any one year,
then its probability of occurrence is p = 0.05. This corresponds to an event
occurring, on average, five times in 100 years, i.e., a return period of once in 20
years. It means that over a long period of, say, 200 years, ten events of equal or
greater magnitude would be expected to occur.
For a valid application of extreme value analysis, two conditions must be met:
First, the data must be independent, i.e., the occurrence of one extreme is not linked to
the next.
Second, the data series must be trend free and reasonably long, usually containing
not fewer than 15 values.
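The return-period arithmetic above is easy to sketch; it is also worth noting that an event with a 20-year return period is far from certain to appear in any particular 20-year window (the helper names below are illustrative):

```python
def return_period(m, n):
    """Return period T = 1/p for an event observed m times in n independent years."""
    return n / m

def prob_at_least_one(p, years):
    """Chance of at least one occurrence in a span of independent years."""
    return 1.0 - (1.0 - p) ** years

print(return_period(5, 100))         # 20.0 -> a 20-year event (p = 0.05)
print(prob_at_least_one(0.05, 20))   # only about 0.64
```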
(a) Threshold values of daily maximum and minimum temperatures, which can be
used to estimate the risk of excessive heat or frost and the duration of this risk;
(b) Threshold values of ten-day water deficits, taking into account the reserves in
the soil. The quantity of water required for irrigation can then be estimated.
(c) Threshold values of relative humidities from hour or 3-hour observations.
µ – population mean
X̄w – weighted mean
X̄h – harmonic mean
Me – median
Mo – mode
such cases, adjustments are applied to the series so as to fill in any gaps (see WMO
Technical Note No. 81). Sivakumar et al. (1993) illustrate the application of
INSTAT in calculating descriptive statistics for climate data and discuss the usefulness
of the statistics for assessing agricultural potential. They produce tables for
available stations of monthly mean, standard deviation, maximum and minimum for
rainfall amounts, and for the number of rainy days. Descriptive statistics are also
presented for maximum and minimum air temperatures.
(a) The arithmetic mean is the most used measure of central tendency, defined as:

X̄ = (1/n) Σ xi,   i = 1, 2, ..., n    (1)

This means adding all data in a series and dividing their sum by the number of data.
The mean of the annual precipitation series from Table 1 is:

X̄ = Σ xi / n = 69449/50 = 1388.9 mm    (2)
The arithmetic mean may be computed using other labour saving methods such as the
grouped data technique (WMO-No.100), which estimates the mean from the average
of the products of class frequencies and their mid-points.
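As a sketch of the grouped-data technique, the mean can be estimated from the mid-marks and frequencies of the frequency table above; note how the estimate differs from the exact mean of 1388.9 mm, an illustration of the information lost by grouping:

```python
# Grouped-data estimate of the mean: sum of (mid-mark x frequency)
# divided by the total frequency, using the classes of the table above.
mid_marks = [954.5, 1104.5, 1254.5, 1404.5, 1554.5, 1704.5, 1854.5, 2004.5, 2154.5]
freqs     = [2, 8, 15, 4, 10, 8, 2, 0, 1]

grouped_mean = sum(m * f for m, f in zip(mid_marks, freqs)) / sum(freqs)
print(grouped_mean)   # estimate from 50 grouped observations, in mm
```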
Another version of the mean is the weighted mean, which takes into account the
relative importance of each variate by assigning it a weight. An example of the
weighted mean is when making areal averages such as yields, population densities or
areal rainfall over non-uniform surfaces. The value for each sub-division of the area
is multiplied by the sub-division area, and then the sum of the products is divided by
the total area. The formula for the weighted mean is expressed as:
X̄w = Σ ni xi / Σ ni    (3)

For example, the average yield of maize for the five districts in Ruvuma Region of
Tanzania was, respectively, 1.5, 2.0, 1.8, 1.3, and 1.9 tons per hectare. The
respective areas under maize were 3000, 7000, 2000, 5000, and 4000 hectares.
Substituting n1 = 3000, n2 = 7000, n3 = 2000, n4 = 5000, and n5 = 4000 into
equation (3), we obtain the overall mean yield of maize for these 21,000 hectares of
land:

X̄w = [3000(1.5) + 7000(2.0) + 2000(1.8) + 5000(1.3) + 4000(1.9)] / 21000
    = 36200/21000 ≈ 1.7 tons per hectare
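Equation (3) applied to the maize example can be checked directly; recomputing the sum of products for these figures gives 36 200, i.e., about 1.72 tons per hectare:

```python
# Weighted mean yield: each district's yield weighted by its area, as in eq. (3).
yields = [1.5, 2.0, 1.8, 1.3, 1.9]        # tons per hectare
areas  = [3000, 7000, 2000, 5000, 4000]   # hectares

x_w = sum(y * a for y, a in zip(yields, areas)) / sum(areas)
print(round(x_w, 2))   # 1.72
```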
Another measure of the mean is the harmonic mean, defined as n divided by the sum
of the reciprocals (multiplicative inverses) of the numbers:

X̄h = n / Σ (1/Xi)

If five sprinklers can individually water a garden in 4, 5, 2, 6, and 3 hours,
respectively, then X̄h = 5/(1/4 + 1/5 + 1/2 + 1/6 + 1/3) ≈ 3.45 hours, and the time
required for all sprinklers working together to water the garden is

t = X̄h / n ≈ 0.69 hours, i.e., about 41 minutes.
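The standard library provides a harmonic mean directly; the sketch below reworks the sprinkler example (working through the arithmetic gives roughly 41 minutes for all five together):

```python
from statistics import harmonic_mean

times = [4, 5, 2, 6, 3]                 # hours for each sprinkler alone
h = harmonic_mean(times)                # about 3.45 hours
together = h / len(times)               # combined time, in hours
print(round(h, 2), round(together * 60, 1))   # hours, minutes
```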
6.5. Fractiles
Fractiles such as quartiles, quintiles, and deciles are obtained by first ranking the
data in ascending order and then counting an appropriate fraction of the (n + 1)
positions in the series. For quartiles, we divide n + 1 by four, for deciles by ten,
and for percentiles by a hundred. Thus if n = 50, the first decile is the
(1/10)(n + 1)th, or 5.1th, observation in the ascending order, and the 7th decile is
the (7/10)(n + 1)th, or 35.7th, observation. Interpolation is required between
observations. The median is the 50th percentile. It is also the fifth decile and
the second quartile. It lies in the third quintile. In agrometeorology, the first
decile is that value below which one-tenth of the data fall and above which
nine-tenths lie.
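The (n + 1) ranking rule with interpolation corresponds to the 'exclusive' method of the standard library's quantiles function; a sketch with an illustrative series of 50 values:

```python
from statistics import quantiles

# For the series 1..50, the first decile is the 5.1th ranked value and
# the 7th decile the 35.7th, as in the (n + 1) rule described above.
data = list(range(1, 51))
deciles = quantiles(data, n=10, method="exclusive")
print(deciles[0], deciles[6])   # 5.1 and 35.7
```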
S = √[ Σ (xi − x̄)² / (n − 1) ]

It has the same units as the mean; together they may be used to make precise
probability statements about the occurrence of certain values of a climatological
series. The influence of the actual magnitude of the mean can easily be eliminated
by expressing S as a percentage of the mean to obtain a dimensionless quantity
called the coefficient of variation:

Cv = (S / x̄) × 100
For comparing different places, Cv provides a measure of relative variability for
such elements as total precipitation.
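A sketch of the coefficient of variation for comparing rainfall variability between stations; the annual totals below are invented for illustration:

```python
from statistics import mean, stdev

def coefficient_of_variation(series):
    """Sample standard deviation as a percentage of the mean (dimensionless)."""
    return stdev(series) / mean(series) * 100.0

annual_rain = [1200, 950, 1400, 1100, 1300]   # illustrative totals, mm
cv = coefficient_of_variation(annual_rain)
print(round(cv, 1))   # percent
```

Because Cv is dimensionless, stations with very different mean rainfall can be compared on the same footing.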
7. DECISION MAKING
Two main lines of attack on the problem of statistical inference are available. One is
to devise sample statistics which may be regarded as being suitable estimators of
corresponding population parameters. For example, we may use the sample mean
X as an estimator of the population mean µ , or else we may use the sample
median Me . Statistical estimation theory deals with the issue of selecting best
estimators.
Once the null hypothesis has been clearly defined, we may calculate what kind of
samples to expect under the supposition that it is true. Then if we draw a random
sample, and if it differs markedly in some respect from what we expect, we say that
the observed difference is significant; and we are inclined to reject the null hypothesis
and accept the alternative hypothesis. If the difference observed is not too large, we
might accept the null hypothesis; or we might call for more statistical data before
coming to a decision. We can make the decision in a hypothesis test depending upon
a random variable known as a test statistic, such as z-score used in finding confidence
intervals, and we can specify critical values of this, which can be used to indicate not
only whether a sample difference is significant but also the strength of the
significance.
Null Ho: p=0.5 (i.e. the coin is fair)
We call the probability of wrongly rejecting a null hypothesis the level of significance
( α ) of the test. We select the value for α first, before carrying out any
experiments; the values most commonly used by statisticians are 0.05, 0.01, and 0.001.
The level of significance α = 0.05 means that our test procedure has only 5 chances in
100 of leading us to decide that the coin is biased if in fact it is not.
It is fairly clear that if bias exists, a large sample will have more chance of
demonstrating its existence than a small one. And so, we should make n as large as
possible, especially if we are concerned with demonstrating a small amount of bias.
Cost of experimentation, time involved in sampling, necessity of maintaining
statistically constant conditions, amount of inherent random variation, and possible
consequences of making wrong decisions are among the considerations on which the
sizes of sample to be drawn depend.
We can make the decision in a hypothesis test depending upon a random variable
known as a test statistic such as z or t as used in finding confidence intervals. Its
sampling distribution, under the assumption that Ho is true, must be known. It can
be normal, binomial, or other sampling distributions.
Assuming that the null hypothesis is true, and bearing in mind the chosen values of n
and α, we now calculate an acceptance region of values for the test statistic.
Values outside this region form the rejection region. The acceptance region is so
chosen that if a value of the test statistic, obtained from the data of a sample, fails to
fall inside it, then the assumption that Ho is true must be strongly doubted. In
general, we have a test statistic X, whose sampling distribution, defined by certain
parameters such as η and σ , is known. The values of the parameters are specified
in the null hypothesis Ho. From integral tables of the sampling distribution we
obtain critical values X1, X2 such that
P[X1 < X < X2] = 1 − α.
These determine an acceptance region, which gives a test for the null hypothesis at the
appropriate level of significance ( α ).
The general decision rule, or test of hypothesis, may now be stated as follows:
(a) Reject Ho if the sample value of X lies outside the acceptance region [X1, X2];
(b) Accept Ho if the sample value of X lies in the acceptance region [X1, X2].
(Sometimes, especially if the sample size is small, or if X is close to one of the critical
values X1 and X2, the decision to accept Ho is deferred until more data are collected.)
The n trials of the experiment may now be carried out, and from the results, the value
of the chosen test statistic may be calculated. The decision rule described in Step 6
may then be applied. Note: All statistical test procedures should be carefully
formulated before experiments are carried out. The test statistic, the level of
significance, and whether a one- or two-tailed test is required, must be decided before
any sample data is looked at. To switch tests in mid-stream, as it were, leads to
invalid probability statements about the decisions made.
If the critical region occupies both extremes of the test distribution, it is called a
two-tailed test. If the critical region occurs only at high or low values of the test
statistic, such a test is called one-tailed.
This leads to a two-tailed test. The critical region containing 5% of the area of the
normal distribution is split into two equal parts, each containing 2.5% of the total area.
If the computed value of Z falls into the left-hand region, the sample came from a
population having a smaller mean than our known population. Conversely, if it falls
into the right-hand region, the mean of the sample’s parent population is larger than
the mean of the known population. From the standardized normal distribution table
(Table.), we find that approximately 2.5% of the area of the curve is to the left of a Z
value of −1.96 and 97.5% of the area of the curve is to the left of +1.96.
Each characteristic of a population, such as its mean µ, is called a parameter of the
population, while each of the sample characteristics, such as the sample mean X̄ and
sample standard deviation S, is called a sample statistic.
Any one of the statistics mean, median, mode, and mid-interquartile range would
seem to be suitable for use as estimators of the population mean µ . In order to pick
out the best estimator of a parameter out of a set of estimators, three important
desirable properties should be considered. These are unbiasedness, efficiency, and
consistency.
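The meaning of unbiasedness can be illustrated with a small simulation (not from the text; the population and sample sizes are arbitrary choices): the sample variance with divisor n systematically underestimates σ², while the divisor n − 1 removes the bias.

```python
# Illustration: with samples of size n from a population of variance 4,
# the divisor-n variance estimator is biased low by the factor (n-1)/n,
# while the divisor-(n-1) estimator is unbiased.
import random

random.seed(42)
TRUE_VAR = 4.0              # population: Normal(0, sd = 2)
n, reps = 5, 5000

biased_sum = unbiased_sum = 0.0
for _ in range(reps):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    biased_sum += ss / n            # divisor n: biased downwards
    unbiased_sum += ss / (n - 1)    # divisor n-1: unbiased

biased_avg = biased_sum / reps
unbiased_avg = unbiased_sum / reps
print(biased_avg, unbiased_avg)     # biased_avg is close to 0.8 * TRUE_VAR
```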
A sample statistic b has a sampling distribution, with mean E(b) = β and standard
deviation S.D.(b) = σb. Here the parameter β is unknown and our purpose is to
estimate it. Using the remarkable fact that many sample statistics we use in practice
have a Normal or approximately Normal sampling distribution, we can obtain from
tables of the Normal integral the probability that a particular sample will provide a
value of b within a given interval (β − d) to (β + d).
This is indicated in the diagram below. Conversely, for a given amount of
probability, we can deduce the value d. For example, for 0.95 probability, we know
from standard Normal tables that d/σb = 1.96. In other words, the probability that a
particular sample value b lies within 1.96σb of β is 0.95; inverting this statement,
we get the 95% confidence interval for β, namely the interval [b − 1.96σb, b + 1.96σb].
The multiplier 1.96, the z-score, is the number obtained from tables of the sampling
distribution of b. This z-score is chosen so that the desired percentage confidence may
be assigned to the interval; it is now called the confidence coefficient, or sometimes the
critical value.
The end points of a confidence interval are known as the lower and upper confidence
limits. The probable error of estimate is half the interval length of the 50%
confidence interval, i.e., 0.674σb.
The most commonly required point and interval estimates are for means, proportions,
differences between two means, and standard deviations. The following table gives
all the formulae needed for these estimates. The reader should note the standard
form b ± z·σb for each of the confidence interval estimators.
For the formulae to be valid, sampling must be random and the samples must be
independent. In some cases, σ b will be known from prior information. Then, the
sample estimator will not be used. In each of the confidence interval formulae, the
confidence coefficient z may be found from tables of the Normal integral for any
desired degree of confidence. This will give exact results if the populations from
which the sampling is done are Normal; otherwise, the errors introduced will be small
if n is reasonably large (n ≥ 30). A brief table of values of z is as follows:
Confidence level:         50%    60%    80%    86.8%  90%    92%    93.4%  94.2%  95%    95.6%  96%    97.4%  98%
Confidence coefficient z: 0.674  0.84   1.28   1.50   1.645  1.75   1.84   1.90   1.96   2.01   2.05   2.23   2.33
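The z values in this table come from the inverse of the Normal integral: for a confidence level C, z cuts off the upper (1 − C)/2 tail. A short sketch using Python's standard library reproduces several table entries:

```python
# Confidence coefficients from the inverse Normal integral:
# z = Phi^{-1}((1 + C) / 2) for confidence level C.
from statistics import NormalDist

std_normal = NormalDist()   # standard Normal: mean 0, sd 1

for level in (0.50, 0.80, 0.90, 0.95, 0.98):
    z = std_normal.inv_cdf((1 + level) / 2)
    print(f"{level:.0%} confidence -> z = {z:.3f}")
```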
What should we do when samples are small? It is clear that the smaller the sample,
the smaller amount of confidence we can place on a particular interval estimate.
Alternatively, for a given degree of confidence, the interval quoted must be wider than
for larger samples. To bring this about, we must have a confidence coefficient which
depends upon n. We shall use the letter t for this coefficient, and give confidence
interval formulae for the population mean µ , and for the difference of two population
means.
The reader will note that these are the same as for large samples, except that t replaces
z. When the sample estimators for σx̄ and σx̄1−x̄2 are used, the correct values
for t are obtained from what is called the Student t-distribution. For convenience,
they are related not directly to sample sizes, but to a number known as 'degrees of
freedom'; we shall denote this by υ.
Degrees of freedom υ:  3     4     5     7     9     10    15    20    25    30
90%:                   2.35  2.13  2.02  1.89  1.83  1.81  1.75  1.72  1.71  1.70
95%:                   3.18  2.78  2.57  2.36  2.26  2.23  2.13  2.09  2.06  2.04
99%:                   5.84  4.60  4.03  3.50  3.25  3.17  2.95  2.85  2.79  2.75
Z = (X̄ − µ0) / (σ/√n)

The observations in the sample were selected randomly from a Normal population
whose variance is known.
A random sample size n is drawn from a Normal population having unknown mean
µ and known Standard Deviation σ . The objective is to test the hypothesis
Ho: µ = µ′; i.e., the assumption that the population mean has value µ′.

The variate Z = (X̄ − µ′) / (σ/√n) has a standard Normal distribution if Ho is true.
We may use Z (or X̄) as the test statistic.
Example 1.
Suppose the shelf life of 1 liter bottles of pasteurized milk is guaranteed to be at least
400 days, with standard deviation 60 days. If a sample of 25 bottles is randomly
chosen from a production batch, and after testing, the sample mean shelf life is found
to be 375 days, should the batch be rejected as not meeting the guarantee?
Null hypothesis Ho: µ = 400.
Alternative hypothesis H1: µ < 400 (one-sided: we are only interested in whether or
not the shelf life falls short of the guarantee).

We use Z = (X̄ − 400) / (60/√25) as the test statistic.
Step 5. For a one-tailed test, standard normal tables give Z = −1.65 as the lowest value
to be allowed before Ho must be rejected, at the 5% significance level. The
acceptance region is therefore [−1.65, ∞).
(a) Reject the production batch if the value of z calculated from the sample is less
than -1.65.
(b) Accept the batch otherwise.
Z = (375 − 400) / (60/√25) = −25/12 ≈ −2.083.

Decision: since −2.083 < −1.65, Z falls outside the acceptance region, so we reject Ho
at the 5% level and reject the production batch.
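The whole one-sided test for the milk example can be sketched in a few lines of Python (the critical value −1.645 is the lower 5% point of the standard Normal):

```python
# One-sided z test for the shelf-life example: Ho: mu = 400 against
# H1: mu < 400, with sigma = 60 known and n = 25.
from math import sqrt

mu0, sigma, n = 400.0, 60.0, 25
x_bar = 375.0                       # observed sample mean (days)
z_crit = -1.645                     # lower 5% point of the standard Normal

z = (x_bar - mu0) / (sigma / sqrt(n))
reject = z < z_crit                 # True -> reject the batch

print(f"z = {z:.3f}, reject Ho: {reject}")
```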
Example 2.
From the data given, it is clear that the heat treated seeds had an earlier start in growth.
However, we may consider the wider question as to whether heat treated seeds are
significantly faster germinating generally than untreated seeds.
Step 1. Let µA, µB be the germination period population means for heat-treated and
untreated seeds, respectively.
Null hypothesis Ho: µA = µB.
Alternative hypothesis H1: µA > µB.
We were asked specifically whether the heat treated seeds were faster germinating
than the untreated seeds, so we use the one-sided alternative hypothesis.
We are not given any information other than the two sample means. Even if we were
told the individual students’ results, we could not use the paired comparison test –
there would be no possible reason for linking the results in pairs.
σ′ = √(σ²/nA + σ²/nB)

z = ((x̄A − x̄B) − (µA − µB)) / σ′.

Under Ho (µA = µB), this reduces to

z = (x̄A − x̄B) / σ′.
(a) If the sample value of z>1.65, conclude that heat treated seeds germinate
significantly earlier (at the 5% level) than untreated seeds.
(b) If Z ≤ 1.65 , the germination rates of both heat treated and untreated seeds
may well be the same.
Therefore σ′ = σ·√(1/nA + 1/nB) = 12·√(1/30 + 1/36) ≅ 2.96

z = (x̄A − x̄B)/σ′ = (52 − 47)/2.96 ≅ 1.69.
Decision: The heat treated seed is just significantly earlier germinating at the 5% level
than the untreated seed.
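A sketch of this two-sample calculation, using the sample values given above (with σ = 12 assumed known and common to both groups, nA = 30, nB = 36):

```python
# Two-sample z test for the germination example: common known sigma.
from math import sqrt

x_a, x_b = 52.0, 47.0               # sample means for the two groups
sigma, n_a, n_b = 12.0, 30, 36

se = sigma * sqrt(1 / n_a + 1 / n_b)   # standard error of the difference
z = (x_a - x_b) / se

print(f"se = {se:.3f}, z = {z:.3f}")   # z just exceeds the 1.65 cut-off
```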
The uncertainty introduced into estimates based on samples can be accounted for by
using a probability distribution which has a wider spread than the normal distribution.
One such distribution is the t-distribution, which is similar to the normal distribution,
but dependent on the size of sample taken. When the number of observations in the
sample is infinite, the t-distribution and the normal distribution are identical. Tables
of the t-distribution and other sample based distributions are used in exactly the same
manner as tables of the cumulative standard normal distribution, except that two
entries are necessary to find a probability in the table. The two entries are the
desired level of significance ( α ) and the degrees of freedom ( υ ) defined as the
number of observations in the sample minus the number of parameters estimated from
the sample.
Then for the test statistic we use t = (X̄ − µ0) / (S/√n), which has the Student
t-distribution with n − 1 degrees of freedom.
Example 1
A farmer was found to be selling pumpkins that looked like ordinary pumpkins except
that these were very large, the average diameter for a sample of ten being 30.0 cm.
The mean and standard deviation of the diameter of ordinary pumpkins are 14.2 cm
and 4.7 cm, respectively. It is intended to test whether the pumpkins that the farmer
is selling are ordinary pumpkins.
We hypothesize that the mean of the population from which the farmer's pumpkins
were taken is the same as the mean of the ordinary pumpkins by the null hypothesis
H0: µ1 = µ0
against the alternative hypothesis
H1: µ1 ≠ µ0
stating that the mean of the population from which the sample was drawn does not
equal the specified population mean. If the two parent populations are not the same,
we must conclude that the pumpkins the farmer was selling were not drawn from the
ordinary pumpkin population, but from some other variety. We need to specify a
level of probability of correctness, or level of significance, denoted by α. Let us take
a probability level of 5%; we are willing to risk rejecting the hypothesis when it is
correct 5 times out of 100 trials. We must also know the variance of the population
against which we are checking. We may now set up a formal statistical test in the
following manner:
H0: µ1 = µ0
H1: µ1 ≠ µ0
α = 0.05

Z = (X̄ − µ0) / (σ/√n)
Working through the pumpkin example, the outline takes the following form:
1. H0: µ = 14.2 cm
   H1: µ ≠ 14.2 cm
2. α level = 0.05
3. Z = (30 − 14.2) / (4.7/√10) = 10.6

The computed test value of 10.6 exceeds 1.96, so we conclude that the means of the
two populations are not equal, and the farmer's pumpkins must represent some
variety other than ordinary pumpkins.
Combining these pairs of estimates yields single unbiased estimates of µ and σ².
The process of combining estimates from two or more samples is known as pooling.
The correct ways to pool unbiased estimates of means and variances, to yield single
unbiased estimates, are
Means: µ̂ = (n1·x̄1 + n2·x̄2) / (n1 + n2)

Variances: σ̂² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)
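A minimal sketch of these pooling formulae (the sample summaries in the usage lines are hypothetical):

```python
# Pooling sketch: single unbiased estimates of mu and sigma^2 from
# two independent samples drawn from the same population.
def pooled_mean(n1, x1, n2, x2):
    return (n1 * x1 + n2 * x2) / (n1 + n2)

def pooled_variance(n1, s1_sq, n2, s2_sq):
    # Weighted by degrees of freedom, divisor n1 + n2 - 2.
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Hypothetical samples: n=6, mean 9.0, var 0.04 and n=11, mean 8.8, var 0.09.
print(pooled_mean(6, 9.0, 11, 8.8))
print(pooled_variance(6, 0.04, 11, 0.09))
```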
Example:
A soil scientist made six determinations of the strength of dilute sulfuric acid. His
results showed a mean strength of 9.234 with standard deviation 0.12. Using acid
from another bottle, he made eleven determinations, which showed mean strength
8.86 with standard deviation 0.21. Obtain 95% confidence limits for the difference
in mean strengths of the acids in the two bottles. Could the bottles have been filled
from the same source?
Working:
The difference in sample means is 9.234 − 8.86 = 0.374.

σ̂x̄1−x̄2 = s·√(1/n1 + 1/n2) = s·√(1/6 + 1/11),

where s² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2),

and so s = 0.2782.

Therefore

σ̂x̄1−x̄2 = 0.2782·√(17/66) = 0.141196
With 15 degrees of freedom, the confidence coefficient is t=2.13 for 95% confidence.
Therefore the required limits for µ1 − µ2 are

(x̄1 − x̄2) ± t·σ̂x̄1−x̄2 = 0.374 ± 2.13 × 0.141196 = 0.374 ± 0.300748

Thus, the 95% confidence limits for the difference in mean strengths of the acids in
the two bottles are 0.0733 and 0.6747: we are 95% confident that the difference lies
between these values. Since this interval does not include zero, it is unlikely that the
two bottles were filled from the same source.
7.9. The Paired Comparison Test and The Difference Between Two Means Test
The yields from two varieties of wheat were compared. The wheat was planted on
25 test plots. Each plot was divided into two equal parts, one part was chosen
randomly and planted with the first variety, and the other part was planted with the
second variety of wheat. This process was repeated for all the 25 plots. When the
crop yields were measured, the difference in yields from each plot was recorded (2nd
variety minus first variety). The sample mean plot yield difference was found to be
3.5 ton/ha, and the variance of these differences was calculated to be 16 (ton/ha)².
(a) Does the 2nd variety produce significantly higher yields than the 1st variety?
(b) Test the hypothesis that the population mean plot yield difference is as high as
5 tons/ha.
(c) Obtain 95% confidence limits for the population mean plot yield difference.
It is clear that there is a good deal of variation in yields from plot to plot. This
variation tends to confound the main issue, which is to determine whether yields are
increased by using a second variety. This has been eliminated by considering only
the change in yields for each plot. If the second variety has no effect, the average
change will be zero.
Data of this kind, where results are combined in pairs, each pair arising from one
experimental unit or having some clear reason for being linked in this way, are
analyzed by the paired comparison test. Each pair provides a single comparison as a
measure of the effect of the treatment applied (e.g., growing a different variety). Let
D denote the difference in a given pair of results. D will have Normal distribution
with mean µ and Standard Deviation σ (both the parameters are unknown in this
case).
Step 1.
Null hypothesis Ho: µ = 0 (i.e., the yields of the two wheat varieties are the same).
Alternative hypothesis H1: µ > 0 (i.e., second variety yields are higher than first
variety yields).
The parameters of a population are rarely known. In our case, σ is not given, so
we must estimate it from the sample data.
Step 5. Acceptance region: the critical level of t at the 0.05 level of significance (one-
tailed test) is the same as the upper 90% confidence coefficient given in the t table
above. With 24 degrees of freedom, this value is 1.71. The acceptance region is,
therefore, all values of t from −∞ to 1.71.
(a) If the value of t calculated from the sample is greater than 1.71, we may
conclude that the second wheat variety gives higher yields than the first variety.
(b) If the value of t<1.71, we may not reject (at the 5% level) the hypothesis that
the observed increases in yield in the second wheat variety were due to chance
variation in the experiment.
From the sample data, t = (D̄ − 0)/(S/√n) = (3.5 − 0)/(4/√25) ≈ 4.38

Decision: since 4.38 > 1.71, we conclude at the 5% level that the second variety
produces significantly higher yields than the 1st variety.
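The paired comparison test can be sketched as follows; the plot differences below are hypothetical illustration values, with n − 1 = 9 degrees of freedom and the one-tailed 5% critical value 1.83 taken from the t table above:

```python
# Paired-comparison sketch with hypothetical plot differences
# (2nd variety minus 1st); Ho: mu_D = 0 against H1: mu_D > 0.
from math import sqrt
from statistics import mean, stdev

d = [2.1, -0.4, 3.0, 1.2, 0.8, 2.5, -1.1, 1.9, 0.6, 2.2]  # hypothetical

n = len(d)
d_bar = mean(d)
s = stdev(d)                        # sample SD of the differences
t = (d_bar - 0) / (s / sqrt(n))

# Critical t for a one-tailed 5% test with 9 degrees of freedom: 1.83.
print(f"t = {t:.3f}, reject Ho: {t > 1.83}")
```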
A sampling result, which is frequently used in inference tests, is one concerning the
distribution of the difference in means of independent samples drawn from two
different populations. Let a random sample of size n1 be drawn from a population
having mean µX and Standard Deviation σX; and let an independent sample of size
n2 be drawn from another population having mean µY and Standard Deviation σY.
Consider the random variable D = X̄ − Ȳ; i.e., the difference in means of the two
samples. The theorem states that

Var(D) = σX²/n1 + σY²/n2
It seems reasonable that the sample variances will range more from trial to trial if the
number of observations used in their calculation is small. Therefore, the shape of the
F-distribution would be expected to change with changes in sample size. The
degrees of freedom idea comes to mind, except in this situation the F-distribution is
dependent on two values of γ, one associated with each variance in the ratio. Since
the F- ratio is the ratio of two positive numbers, the F- distribution cannot be negative.
If the samples are large, the average of the ratios should be close to 1.0.
We may hypothesize that two samples are drawn from populations having equal
variances. After computing the F- ratio, we then can ascertain the probability of
obtaining, by chance, that specific value from two samples from one normal
population.
If it is unlikely that such a ratio could be obtained, we regard this as indicating that the
samples come from different populations having different variances.
For any pair of variances, two ratios can be computed (S1²/S2² and S2²/S1²).
If we arbitrarily decide that the larger variance will always be placed in the numerator,
the ratio will always be greater than 1.0 and the statistical tests can be simplified.
Only one-tailed tests need be utilized, and the alternative hypothesis actually is a
statement that the absolute difference between the two sample variances is greater
than expected if the population variances are equal. This is shown in figure XX, a
typical F-distribution curve in which the critical region or area of rejection has been
shaded.
Figure: A typical F-distribution with γ1 = 10 and γ2 = 25 degrees of freedom, with the
critical region (shown by shading) containing 5% of the area under the curve.
Critical value of F = 2.24.
The variances of the two samples may be computed by (3.8), and then the F-ratio
between the two may be calculated by

F = S1² / S2²

where S1² is the larger variance and S2² is the smaller. We now are testing the
hypothesis

H0: σ1² = σ2²
against H1: σ1² ≠ σ2²
The null hypothesis states that the parent populations of the two samples have equal
variances: the alternative hypothesis states that they do not. Degrees of freedom
associated with this test are (n1 − 1) for γ1 and (n2 − 1) for γ2. The critical value
of F with γ1 = 9 and γ2 = 9 degrees of freedom and a level of significance of 5%
is 3.18. The value of F calculated from (3.26) will fall into one of the two areas shown
in Fig 3.23. If the calculated value of F exceeds 3.18, the null hypothesis is rejected
and we conclude that the variation in porosity is not the same in the two groups. If
the calculated value is less than 3.18, we have no evidence for concluding that the
variances are different (i.e., at α = 0.05 we cannot reject the hypothesis that the
variances are the same).
The next step in the procedure is to test equality of means. The appropriate test is
(3.23)
t = (x̄1 − x̄2) / (Sp·√(1/n1 + 1/n2))

where the quantity Sp is the pooled estimate of the population standard deviation,
based on both samples. The estimate is found from the pooled estimated variance,
given by

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)
where the subscripts refer, respectively, to the sample from area A and area B of the
district.
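The two-step procedure (F test for equality of variances, then the pooled t test for equality of means) can be sketched as follows; the sample summaries are hypothetical:

```python
# Sketch of the two-step procedure: F test for equal variances, then
# the pooled t test for equal means. Sample summaries are hypothetical.
from math import sqrt

n1, x1, s1_sq = 10, 14.6, 2.8    # area A (hypothetical summaries)
n2, x2, s2_sq = 10, 12.9, 1.9    # area B (hypothetical summaries)

# Step 1: F ratio, larger variance in the numerator.
f = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)
equal_var = f < 3.18             # critical F(9, 9) at the 5% level

# Step 2: pooled t test (valid when the variances may be assumed equal).
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
t = (x1 - x2) / (sqrt(sp_sq) * sqrt(1 / n1 + 1 / n2))

print(f"F = {f:.2f}, equal variances assumed: {equal_var}, t = {t:.2f}")
```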
because for many of them (climatic factors in particular) it is impossible to design
accurate experiments, since their occurrence cannot be controlled. There are two
sets of circumstances in which, more particularly, the correlation and simple
regression method can be used:
(a) In completing climatological series having gaps. Comparisons of data for different
atmospheric elements (e.g. precipitation, evapotranspiration, duration of sunshine)
allow estimates of the missing data to be made from the other measured elements;
(b) In comparing climatological data and biological or agronomical data, e.g., yields,
quality of crops (sugar content, weight of dry matter, etc.).
If the number of pairs is small, the sample correlation coefficient between the two
series is subject to large random errors, and in these cases numerically large
coefficients may not be significant.
t = r·√((n − 2) / (1 − r²))

and t is compared to the tabulated value of Student's t with n − 2 degrees of freedom.
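A sketch of this significance test (the values r = 0.6 and n = 12 are hypothetical):

```python
# Significance of a sample correlation coefficient:
# t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom.
from math import sqrt

def corr_t(r, n):
    return r * sqrt((n - 2) / (1 - r * r))

# Hypothetical example: r = 0.6 computed from n = 12 pairs.
print(round(corr_t(0.6, 12), 3))   # -> 2.372
```

Comparing 2.372 with the tabulated two-sided 95% value for υ = 10 (2.23 in the t table above) shows this correlation significant at the 5% level.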
7.12.2. Regression
After the strength of the relationship between two or more variables has been
quantified, the next logical step is to find out how to predict specific values of one
variable in terms of another. This is done by regression models. A simple linear
regression model is of the form:

Y = a + bX
The least squares criterion requires that the line be chosen to fit the data so that the
sum of the squares of the vertical deviations separating the points from the line will be
a minimum.
The recommended formulae for estimating the two sample coefficients by least
squares are:

b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²  and  a = ȳ − b·x̄
Example
Angstrom’s formula:
R/RA = a + b n/N
is used to estimate the global radiation at surface level (R) from the radiation at the
upper limit of the atmosphere (RA), the actual hours of bright sunshine (n), and the
day length (N). RA and N are taken from appropriate tables or computed; n is an
observational value obtained from the Campbell–Stokes sunshine recorder.
(latitude 3˚ 14S, longitude 27˚ 17E, elevation 1250m)
Month   n/N (X)   R/RA (Y)
J       0.660     0.620
F       0.647     0.578
M       0.536     0.504
A       0.366     0.395
M       0.251     0.368
J       0.319     0.399
J       0.310     0.395
A       0.409     0.442
S       0.448     0.515
O       0.542     0.537
N       0.514     0.503
D       0.602     0.582
The regression explains r² = 95% of the variance of R/RA, and is significant at better
than p = 0.01.
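The fit can be reproduced from the twelve tabulated values with an ordinary least-squares sketch (the resulting coefficients are simply what these twelve points yield, not recommended Angstrom coefficients for other sites):

```python
# Least-squares fit of Angstrom's formula R/RA = a + b*(n/N) to the
# twelve monthly values tabulated above.
x = [.660, .647, .536, .366, .251, .319, .310, .409, .448, .542, .514, .602]
y = [.620, .578, .504, .395, .368, .399, .395, .442, .515, .537, .503, .582]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

b = sxy / sxx                   # slope
a = y_bar - b * x_bar           # intercept
r_sq = sxy ** 2 / (sxx * syy)   # coefficient of determination

print(f"a = {a:.3f}, b = {b:.3f}, r^2 = {r_sq:.2f}")   # r^2 close to 0.95
```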
There are cases where a scatter diagram suggests that the relationship between
variables is not linear. This can be turned into a linear regression by taking
logarithms of the relationship if it is exponential, or by taking reciprocals if it is
quadratic, etc. For example, when the saturation vapour pressure is plotted against
temperature, the curve suggests that a function like y = p·e^(bX) could probably be
used to describe the relationship. This is turned into the linear regression
ln(y) = ln(p) + bX, where X is the temperature and y is the saturation vapour
pressure. An expression of the form y = aX² can be turned into a linear form by
taking the reciprocal: 1/y = X⁻²/a.
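The logarithmic transformation can be sketched with synthetic data (the coefficients p and b below are arbitrary, chosen only to resemble a vapour-pressure curve):

```python
# Linearizing y = p * exp(b*X): taking logs gives ln(y) = ln(p) + b*X,
# an ordinary linear regression of ln(y) on X.
from math import exp, log

p_true, b_true = 0.61, 0.067          # hypothetical coefficients
temps = [0, 5, 10, 15, 20, 25, 30]    # X: temperature
es = [p_true * exp(b_true * t) for t in temps]   # y: noiseless synthetic data

# Linear fit of ln(y) on X recovers ln(p) and b exactly here.
ln_y = [log(v) for v in es]
n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(ln_y) / n
b = sum((x - x_bar) * (v - y_bar) for x, v in zip(temps, ln_y)) / \
    sum((x - x_bar) ** 2 for x in temps)
ln_p = y_bar - b * x_bar

print(round(b, 4), round(exp(ln_p), 4))
```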
47
2000
Tons/ha 1000
Linear trend
500 line
0
y = 35.263x + 979.04
19 1
19 4
19 7
19 0
19 3
19 6
79
6
6
6
7
7
7
19
R2 = 0.5344
48
month, the potential evapotranspiration of a certain month, or the difference between
precipitation and potential evapotranspiration for a given month.
In stepwise regression, a simple linear regression for the yield is constructed on each
of the variables and their coefficients of determination found. The variable that
produces the largest r2 statistic is selected. Additional variables are then brought in
one by one and subjected to a multivariate regression with the best variable to see
how much that variable would contribute to the model if it were to be included. This
is done by calculating the F statistic for each variable. The variable with the largest
F statistic that is significant at the specified significance level for entry is included in
the multivariate regression model. Other variables are then brought into the model
one by one. If the partial F statistic of a variable is not significant at the specified
level for staying in the regression model, it is left out. Only those variables that have
produced significant F statistics are included in the final regression. A more detailed
explanation can be found in Draper and Smith (1981).
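The first step of this procedure, selecting the single variable with the largest r², can be sketched as follows (the yield and predictor values are hypothetical; a full stepwise procedure would also compute partial F statistics for entry and removal):

```python
# First step of stepwise regression: fit a simple linear regression of
# yield on each candidate variable and select the largest r^2.
def r_squared(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

yield_t = [2.1, 2.6, 1.9, 3.0, 2.4, 2.8]          # hypothetical yields
candidates = {                                     # hypothetical predictors
    "rainfall": [410, 505, 380, 560, 450, 530],
    "pet":      [130, 121, 138, 118, 126, 120],
}

best = max(candidates, key=lambda k: r_squared(candidates[k], yield_t))
print(best, round(r_squared(candidates[best], yield_t), 3))
```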
and crop cultivars.
Data are commonly collected as time series, that is, observations made on the same
variable at repeated points in time. INSTAT provides facilities for descriptive
analysis and display of such data. The goals of time series analysis include
identifying the nature of the phenomenon represented by the sequence of observations
and predicting future values of the times series. Moving averages are frequently
used to 'smooth' a time series so that trends and other patterns are seen more easily.
Sivakumar et al. (1993) present a number of graphs showing the five-year moving
averages of monthly and annual rainfall at selected sites in Niger. Most time series
can be described in terms of trend and seasonality. When trends, seasonal or other
deterministic patterns, have been identified and removed from a series, the interest
focuses on the random component. Standard techniques can be used to look at its
distribution. The feature of special interest, resulting from the time series nature of
the data, is the extent to which consecutive observations are related. A useful
summary is provided by the sample autocorrelations at various lags, the
autocorrelation at lag m being the correlation between observations m time units apart.
In simple applications this is probably most useful for determining whether the
assumption of independence of successive observations used in many elementary
analyses is valid. The autocorrelations also give an indication of whether more
advanced modeling methods are likely to be helpful. The cross correlation function
provides a summary of the relationship between two series from which all trend and
seasonal patterns have been removed. The lag m cross correlation is defined as the
correlation between x and y lagged by m units.
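Both operations, smoothing by a moving average and computing the lag-m autocorrelation, can be sketched in a few lines (the rainfall series is hypothetical):

```python
# Moving average and lag-m autocorrelation for a short (hypothetical)
# annual rainfall series.
def moving_average(series, window=5):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def autocorr(series, lag):
    n = len(series)
    m = sum(series) / n
    c0 = sum((v - m) ** 2 for v in series)
    cm = sum((series[i] - m) * (series[i + lag] - m)
             for i in range(n - lag))
    return cm / c0

rain = [612, 540, 705, 660, 498, 580, 622, 710, 565, 630, 688, 552]
print(moving_average(rain)[:3])
print(round(autocorr(rain, 1), 3))
```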
More than any other user of climatic data, the agrometeorologist may be tempted
to search for climatic periodicities, which would provide a basis for the management
of agricultural production. It should be noted that the Guide to Climatological
Practices (section 5.3) is more than cautious with regard to such periodicities and that,
although they may be of theoretical interest, they have been found to be unreliable,
having amplitudes which are too small for any practical conclusions to be drawn.
8. PUBLICATION OF RESULTS
8.2 Tables
Numerical tables of frequencies, averages, distribution parameters, return periods
of events, etc., should state clearly:
8.4 Graphs
Graphs are used to show, in a concise format, the information contained in
numerical tables. They are a useful adjunct to the tables themselves and facilitate
the comparison of results. Cumulative frequency curves, histograms, and
climograms give a better overall picture than the multiplicity of numerical data
obtained by statistical analysis. The scales used on the graph must be specified and
their graduations should be shown. Publications intended for wide distribution
among agricultural users should not have complicated scales (e.g., logarithmic,
Gaussian, etc.) with which the users may be unfamiliar, and which might lead to
serious errors in interpreting the data. Furthermore, giving too much information on
the same graph and using complicated conventional symbols should be avoided.
8.5 Maps
To present concisely the results of agroclimatological analysis covering an area or
region, it is often better to draw isopleths or color classification from the data plotted
at specific points. The interpolation between the various locations can be used in a
digital map plotted by special plotting tools such as Graph, Grids, Surfer, and GIS.
Many climatic parameters useful to agriculture can be shown in this way, for example:
Depending on the scale adopted, this type of supplementary chart can be drawn
more or less taking geomorphological factors into account. However, the users of
the charts should be made aware of their generalized nature and, to interpret them
usefully, should know that corrections for local conditions must be made. This is
particularly important for hilly regions.
following guidelines are suggested. For a complete discussion on the matter, the
readers are referred to WMO/TD No. 1108 and to Paw U and Davis (2000).
First, it is essential to determine who the Users are. One category of Users may be
farmers who need daily information to assist them in day-to-day activities such as
sowing, spraying, and irrigating. Another category may be more interested in
long-term agricultural decisions such as crop adaptation to weather patterns, or
marketing decisions, or modelling.
Second, the Users’ requirements must be clearly established, so that the most
appropriate information is provided. This is possible only after discussion with them.
In most cases, they do not have a clear picture of the type of information which is best
suited for their purpose; here, the role of the Agrometeorologist is crucial.
Fourth, it is very important to consider the cost of the Agmet Bulletin that is proposed
to the Users, especially in developing countries where the financial burden is growing
worse.
Data in Pentads
The above Agmet Bulletin (Table 1) was developed to cater for all crops, ranging
from tomatoes to sugarcane. It is issued on a half-monthly basis and is sent to the
Users by post and is also available on the website. Bearing in mind the time taken to
collect the data, the Bulletin would not reach the Users before the 20th at the earliest.
To provide farmers (tomato growers, for example) with data relevant to
their day-to-day activities, the Agmet Bulletin is supplemented by daily values of
rainfall and maximum and minimum temperatures, which are broadcast on radio and
television. Of course, data relevant to different geographical localities can be
included.
In AgMet Bulletins, extreme weather events, which are masked by the averaging
procedure involved in the calculation of the pentad, must be highlighted, possibly in
the form of a footnote, to draw the attention of the Users. For example, from Table 1,
it can be seen that during the period 6-15 July, the maximum temperature was below
the normal by not more than 1.8°C. In fact, during the period 9-12 July, the maximum
temperature was below the normal by 2.8 to 3.0°C; this can be of importance to both
animals and plants.
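The masking effect described above can be sketched numerically. The sketch below uses hypothetical daily maximum temperatures and normals (the figures merely mimic the 6-15 July case from Table 1); the 2.5°C flagging threshold is an assumption for illustration, not a published criterion.

```python
# Illustrative sketch (hypothetical data): pentad averaging can hide daily
# extremes, so flag days whose deviation from normal exceeds a threshold
# even when the pentad mean deviation looks modest.

# Daily maximum temperatures (°C) for 6-15 July and their long-term normal.
tmax = [29.1, 28.8, 28.5, 26.2, 26.0, 26.1, 26.2, 28.9, 29.0, 29.2]
normal = [29.0] * 10

deviations = [t - n for t, n in zip(tmax, normal)]

# Pentad (5-day) mean deviations: these stay above -1.3 °C.
pentads = [sum(deviations[i:i + 5]) / 5 for i in range(0, len(deviations), 5)]

# Daily extremes the pentad means would mask (threshold is an assumption).
THRESHOLD = 2.5  # °C below normal
flagged = [(day + 6, d) for day, d in enumerate(deviations) if d <= -THRESHOLD]

print("Pentad mean deviations:", [round(p, 2) for p in pentads])
print("Days to footnote:", [day for day, _ in flagged])
```

With these invented figures, the two pentad mean deviations are about -1.3°C and -1.1°C, yet days 9 to 12 individually sit 2.8 to 3.0°C below normal and would be footnoted.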
The presentation of data in this format, together with the broadcast of daily values on
the radio and on television, is very effective. It can be used by farmers interested in
day-to-day activities and by research workers and model builders. It is suitable for
all types of crops, ranging from tomatoes and lettuce to sugar cane and other
deep-rooted crops.
The bulletin should include daily data, 10-day (dekadal) means or totals, and the
deviation or percentage from average. For parameters such as maximum and minimum
temperature and maximum and minimum relative humidity, the absolute extreme values
for the dekad, based on a long series of years, are also recommended.
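The dekadal quantities recommended above can be sketched as follows; the daily rainfall figures and the long-term dekadal averages are invented for illustration.

```python
# Hypothetical sketch: dekadal (10-day) rainfall totals and their percent
# of the long-term average, as a bulletin might tabulate them.

daily_rain = [0.0, 2.4, 0.0, 0.0, 11.2, 0.5, 0.0, 3.1, 0.0, 0.0,   # dekad 1
              0.0, 0.0, 8.6, 1.2, 0.0, 0.0, 0.0, 0.0, 4.4, 0.0]    # dekad 2
dekad_normals = [25.0, 20.0]  # assumed long-term dekadal averages (mm)

dekad_totals = [sum(daily_rain[i:i + 10]) for i in range(0, len(daily_rain), 10)]
percent_of_average = [round(100 * t / n, 1)
                      for t, n in zip(dekad_totals, dekad_normals)]

for k, (total, pct) in enumerate(zip(dekad_totals, percent_of_average), start=1):
    print(f"Dekad {k}: {total:.1f} mm ({pct}% of average)")
```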
and,
k) Number of hours below 0°C.
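Item (k) above is straightforward to derive from hourly observations; the sketch below uses invented hourly temperatures.

```python
# Minimal sketch (hypothetical hourly data): counting the number of hours
# with air temperature below 0 °C, as in item (k) of the bulletin contents.

hourly_temp = [1.2, 0.4, -0.3, -1.1, -0.8, 0.1, 1.5, 2.0]  # °C, one value per hour

hours_below_freezing = sum(1 for t in hourly_temp if t < 0.0)
print(hours_below_freezing)  # 3 hours below 0 °C in this sample
```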
This Weather Outlook, based on model output received early on Thursday the 11th
from WWC, was released in the afternoon of the 11th of December, sent to the Users
through the Farming Centres by e-mail, and posted on the website. The Outlook was
broadcast neither on radio nor on television. The issue of such a Weather Outlook is
important, but it must be carefully planned; otherwise, it can lead to financial losses,
as shown below.
Little rainfall was observed during the first two pentads of December 2003 and
farmers were starting to get worried. The indication that significant rain was
expected on Sunday the 14th (Table 2) had given great hope to the farmers and,
because it was a weekend, they made plans on Friday to do some field work on
Saturday and on Monday. Such plans are costly because they imply the booking of
manpower and transportation, the buying of fertilizers, and so on. But model output
received on Friday the 12th indicated that the probability of rain during the
following five days was negligible, and in fact it was not before the 31st of December
that significant rainfall was observed.
Here, it is not the validity of the Weather Outlook that is questioned. The point to be
noted is that no update of the Outlook could reach the farmers because the Farming
Centres were closed for the weekend. If, besides being sent by e-mail and posted on
the website, the Outlook had been broadcast on radio and television, the updated
version would have reached the farmers and appropriate measures could have been taken.
To avoid similar incidents, it is advisable to decide in advance on the methods of
dissemination of information.
Seasonal Forecast
An extract of a seasonal forecast issued during the first half of October 2003 for
summer 2003-2004, for a country situated in the southern hemisphere (summer in
that country is from November to April), is shown: “The rainfall season may begin
by November. The summer cumulative rainfall amount is expected to reach the
long-term mean of 1400 millimeters. Heavy rainfall is expected in January and
February 2004.” This seasonal forecast was published in the newspapers and read
out on television.
The question is: who is qualified to interpret and use this forecast? Can it be
misleading to farmers? To show the problems that such a forecast can create, real
data for the period October 2003 to January 2004 are presented in Table 3 for an
agricultural area.
Rainfall Amounts in millimeters (mm)

                October 2003   November 2003   December 2003   January 2004
First Half           1.8             4.1             5.2           176.4
Second Half         12.8            35.7            12.8           154.1

Table 3: Rainfall amounts recorded over an agricultural area during the period
October 2003 to January 2004. Out of the 35.7 mm of rainfall recorded during the
second half of November, 35.0 mm fell during the period 16-25.
Given that October and the first half of November 2003 were relatively dry, that a
significant amount of rainfall was recorded during the second half of November, and
that the seasonal forecast opted for normal rainfall during summer with the rainfall
season possibly starting in November, the farmers thought that the rainy season had
set in. Most of them started planting their crops during the last pentad of November.
Unfortunately, the rainfall during the second half of November was a false signal:
December was relatively dry, and the rainy season started only in January 2004.
To avoid seasonal forecasts falling into the wrong hands, it is not advisable to have
them published in the newspapers; these seasonal forecasts must be sent to specialists
who are trained to interpret them, and should be supplemented by short-range weather
forecasts.
Sooner or later, the financial situation in the SPC will not be able to sustain the
issue of costly AgMet Bulletins by local personnel. So, in these SPC, the
agrometeorologists must think carefully about the cost-benefit of the AgMet Bulletin,
especially when developed countries are getting ready to propose their services for
free (for how long will these be free?).
Already, shipping bulletins, cyclone warnings, and aviation forecasts are being offered
free of charge on a global scale by a few developed countries. But how long will these
services remain free? Sooner or later, the SPC will have to pay for them. It is
therefore very important to keep the cost of the AgMet Bulletin to a minimum.
References

Gumbel, E.J., 1959: Statistics of Extremes. Columbia University Press, New York,
375 pp.

Paw U, K.T., and Davis, C.A. (Eds.-in-Chief), 2000: Agricultural and Forest
Meteorology, Vol. 103. Elsevier.

Sivakumar, M.V.K., De, U.S., Simharay, K.C., and Rajeevan, M. (Eds.), 1998: User
Requirements for Agrometeorological Services. Proceedings of an International
Workshop held at Pune, India, 10-14 November 1997.

Thom, H.C.S., 1966: Some Methods of Climatological Analysis. WMO Technical Note
No. 81, Geneva, Switzerland.

Wieringa, J., and Lomas, J., 2001: Lecture Notes for Training Agricultural
Meteorological Personnel. WMO-No. 551, Geneva, Switzerland.

Wijngaard, J.B., Klein Tank, A.M.G., and Können, G.P., 2003: Homogeneity of 20th
century European daily temperature and precipitation series. International Journal
of Climatology, 23, 679-692.