AAE-223: Statistics for Economists 2
Lecture Notes 2
                          Assa Mulagha-Maganga
          Dept of Agricultural and Applied Economics, LUANAR
    Department of Mathematical Sciences (Statistics), Chancellor College
                              Summer 2022 (27 Oct 2022)
2     Analysis of variance
This lecture is the first in a series of lectures devoted to linear models. The topic of
this chapter, analysis of variance, provides a methodology for partitioning the total variance
computed from a data set into components, each of which represents the amount of the
total variance that can be attributed to a specific source of variation. The results of this
partitioning can then be used to estimate and test hypotheses about population variances and
means. In this chapter we focus our attention on hypothesis testing of means. Specifically,
we discuss the testing of differences among means when there is interest in more than two
populations or two or more variables. The techniques discussed in this chapter are widely
used in business or health sciences.
2.1    Introduction to analysis of variance
In this unit you will learn what analysis of variance is, and when and how it is applied in
real-life scenarios. Analysis of variance is abbreviated as ANOVA. We will look at the two
common ways in which ANOVA is conducted: one-way ANOVA and two-way ANOVA.
The analysis of variance is used to test hypotheses about more than two populations (or
groups), for example, to compare the means of more than two groups or populations
(m populations). Whereas the chi-square test can be used to test the differences among
several population proportions, the analysis of variance can be used to test the differences
among several population means. The null hypothesis is that the several population means
are mutually equal. The sampling procedure used is that several independent random sam-
ples are collected, one for each of the data categories (treatment levels). The assumption
underlying the use of the analysis of variance is that the several sample means were obtained
from normally distributed populations having the same variance.
                                                                Statistics for Economists 2
The probability distribution used in this unit is the F distribution. It was named to honor Sir
Ronald Fisher, one of the founders of modern-day statistics. This probability distribution is
used as the distribution of the test statistic for several situations. It is used to test whether
two samples are from populations having equal variances, and it is also applied when we
want to compare several population means simultaneously.
What are the characteristics of the F distribution?
  1. The F distribution is continuous. This means that it can assume an infinite number
     of values between zero and positive infinity.
  2. The F distribution cannot be negative. The smallest value F can assume is 0.
  3. It is positively skewed. The long tail of the distribution is to the right-hand side. As
     the number of degrees of freedom increases in both the numerator and denominator
     the distribution approaches a normal distribution.
  4. It is asymptotic. As the values of X increase, the F curve approaches the X-axis but
     never touches it. This is similar to the behavior of the normal distribution.
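The characteristics listed above can be illustrated with a small simulation. This is a minimal sketch, not part of the original notes: it assumes that the ratio of two sample variances drawn from the same normal population follows an F distribution, and the sample sizes (5 and 11, giving 4 and 10 degrees of freedom), the seed, and the function name `variance_ratio` are illustrative choices.

```python
import random
import statistics

def variance_ratio(rng, n1, n2):
    """Ratio of two sample variances drawn from the same normal population."""
    sample1 = [rng.gauss(0.0, 1.0) for _ in range(n1)]
    sample2 = [rng.gauss(0.0, 1.0) for _ in range(n2)]
    return statistics.variance(sample1) / statistics.variance(sample2)

rng = random.Random(2022)  # fixed seed so the sketch is repeatable

# Sample sizes 5 and 11 give 4 and 10 degrees of freedom (illustrative choice).
ratios = [variance_ratio(rng, 5, 11) for _ in range(5000)]

smallest = min(ratios)                      # a variance ratio cannot be negative
median_ratio = statistics.median(ratios)
mean_ratio = statistics.mean(ratios)        # pulled above the median by the right tail

print("smallest simulated ratio:", round(smallest, 4))
print("median:", round(median_ratio, 3))
print("mean:  ", round(mean_ratio, 3))
```

A mean that sits above the median is what a positively skewed distribution with a long right-hand tail looks like in sample form.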
To use ANOVA, we assume the following:
  1. The populations follow the normal distribution.
  2. The populations have equal standard deviations (σ).
  3. The populations are independent.
When these conditions are met, F is used as the distribution of the test statistic.
2.2    The ANOVA Test
How does the ANOVA test work? Recall that we want to determine whether the various
sample means came from a single population or populations with different means. We
actually compare these sample means through their variances. To explain, recall that
earlier we listed the assumptions required for ANOVA. One of those assumptions was that
the standard deviations of the various normal populations had to be the same. We take
advantage of this requirement in the ANOVA test. The underlying strategy is to estimate
the population variance (standard deviation squared) two ways and then find the ratio of
these two estimates. If this ratio is about 1, then logically the two estimates are the same,
and we conclude that the population means are the same. If the ratio is quite different from
1, then we conclude that the population means are not the same. The F distribution serves
as a referee by indicating when the ratio of the sample variances is too much greater than 1
to have occurred by chance.
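The strategy described above can be sketched with a simulation. This is a hedged illustration, not the author's procedure: it assumes equal sample sizes, so that the variance can be estimated once from the spread of the sample means (n times the variance of the means) and once from the average of the sample variances. The population means, sample size, seed, and function names are all hypothetical.

```python
import random
import statistics

def variance_ratio(samples):
    """Between-sample variance estimate divided by the within-sample estimate.

    Assumes every sample has the same size n.
    """
    n = len(samples[0])
    sample_means = [statistics.mean(s) for s in samples]
    between = n * statistics.variance(sample_means)   # uses only the sample means
    within = statistics.mean([statistics.variance(s) for s in samples])
    return between / within

def average_ratio(rng, population_means, n=10, reps=500):
    """Average the ratio over many repeated experiments (sigma = 1 throughout)."""
    ratios = []
    for _ in range(reps):
        samples = [[rng.gauss(mu, 1.0) for _ in range(n)] for mu in population_means]
        ratios.append(variance_ratio(samples))
    return statistics.mean(ratios)

rng = random.Random(223)  # fixed seed so the sketch is repeatable
ratio_equal = average_ratio(rng, [50.0, 50.0, 50.0])     # population means all equal
ratio_unequal = average_ratio(rng, [50.0, 52.0, 54.0])   # population means differ

print("average ratio, equal means:  ", round(ratio_equal, 2))
print("average ratio, unequal means:", round(ratio_unequal, 2))
```

When the population means are equal the ratio stays close to 1. When they differ, the between-sample estimate is inflated and the ratio becomes much larger than 1.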
  By Assa Mulagha-Maganga, Lilongwe University of Agriculture and Natural Resources 2
2.3    One-way analysis of variance
The one-way analysis of variance procedure is concerned with testing the difference among
k sample means when the subjects are assigned randomly to each of the several treatment
groups. The following notation will be used when samples from k different normally dis-
tributed populations having equal population variances are selected in order to test for the
equality of the means of the k populations:
The sample size, sample mean, and sample variance for the ith population are represented
by $n_i$, $\bar{x}_i$, and $s_i^2$, respectively. The total sample size is $n = n_1 + n_2 + \cdots + n_k$. The overall
mean for all n sample values is represented by $\bar{x}$. The population mean for the ith population
is represented by $\mu_i$ and the standard deviation for the ith population is represented by $\sigma_i$.
The between-samples variation is measured by the between-treatments mean square and is
represented by MSTR. The expression for MSTR is:
$$MSTR = \frac{SSTR}{k-1}$$
The numerator, SSTR, is called the treatment sum of squares and is computed as:
$$SSTR = n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + \cdots + n_k(\bar{x}_k - \bar{x})^2$$
The within-samples variation is measured by the error mean square and is represented by
MSE. The expression for MSE is:
$$MSE = \frac{SSE}{n-k}$$
The numerator, SSE, is called the error sum of squares and is computed using the formula
below, where $s_1^2, s_2^2, \ldots, s_k^2$ are the sample variances:
$$SSE = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2$$
The denominator k − 1 is called the degrees of freedom for treatments, and the denominator
n − k is called the degrees of freedom for error.
The sum of the treatment sum of squares and the error sum of squares is called the total sum
of squares. It is represented by SST and is given by SST = SSTR + SSE. The total sum of
squares may also be computed directly as
$$SST = \sum (x - \bar{x})^2$$
where the sum is over all n sample values. The degrees of freedom for the total is n − 1.
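The formulas for SSTR, SSE, SST, MSTR, and MSE can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `one_way_anova` and the toy data are assumptions made for the sketch and do not come from the notes.

```python
from statistics import mean, variance

def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a list of samples."""
    k = len(groups)                                   # number of treatments
    n = sum(len(g) for g in groups)                   # total sample size
    grand_mean = sum(sum(g) for g in groups) / n      # overall mean of all n values

    # Treatment sum of squares: n_i * (group mean - overall mean)^2, summed over groups.
    sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error sum of squares: (n_i - 1) * s_i^2, summed over groups.
    sse = sum((len(g) - 1) * variance(g) for g in groups)

    mstr = sstr / (k - 1)
    mse = sse / (n - k)
    return {"SSTR": sstr, "SSE": sse, "SST": sstr + sse,
            "MSTR": mstr, "MSE": mse, "F": mstr / mse}

# Hypothetical toy data: three samples of three values each.
toy = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
result = one_way_anova(toy)

# SST computed directly from all values should equal SSTR + SSE.
all_values = [x for g in toy for x in g]
sst_direct = sum((x - mean(all_values)) ** 2 for x in all_values)

print(result)
print("SST computed directly:", sst_direct)
```

For the toy data the group means are 2, 3, and 7 and the overall mean is 4, so SSTR = 42, SSE = 6, and the direct total of 48 agrees with SSTR + SSE.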
The results of the computations in the preceding sections are usually displayed
in a one-way ANOVA table. The general structure of the one-way ANOVA table is given as:
Table 2.1: ANOVA Table
                      Source      df      SS     MS = SS/df            F statistic
                    Treatment    k-1    SSTR     MSTR = SSTR/(k-1)     F = MSTR/MSE
                      Error      n-k     SSE     MSE = SSE/(n-k)
                       Total     n-1     SST
What are the steps for testing the equality of means using the one-way ANOVA procedure?
Step 1: State the null and alternative hypotheses as follows:
H0 : µ1 = µ2 = · · · = µk
Ha : Not all k means are equal.
Step 2: Use the F distribution table and the level of significance, α, to determine the
rejection region.
Step 3: Build the ANOVA table, and from the table determine the computed value of the
F ratio.
Step 4: State your conclusion. The null hypothesis is rejected if the computed value of the
test statistic falls in the rejection region. Otherwise, the null hypothesis is not rejected.
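Steps 2 to 4 amount to comparing the computed F ratio with a tabulated critical value. The sketch below shows that comparison. It is an illustration only: the dictionary holds just the three α = 0.05 critical values quoted in the worked examples of these notes, and the names `CRITICAL_F_05` and `f_test_decision` are hypothetical.

```python
# Critical F values for alpha = 0.05 quoted in these notes,
# keyed by (numerator df, denominator df).
CRITICAL_F_05 = {
    (2, 12): 3.89,
    (4, 8): 3.84,
    (2, 8): 4.46,
}

def f_test_decision(f_computed, df_treatment, df_error, table=CRITICAL_F_05):
    """Reject H0 only when the computed F falls in the rejection region."""
    critical = table[(df_treatment, df_error)]
    if f_computed > critical:
        return "reject H0"
    return "do not reject H0"

# A computed F of 3.35 with 2 and 12 degrees of freedom is below 3.89.
decision_small = f_test_decision(3.35, 2, 12)
# A computed F of 40 with 2 and 8 degrees of freedom is above 4.46.
decision_large = f_test_decision(40, 2, 8)

print(decision_small)
print(decision_large)
```

For other degrees of freedom or significance levels the critical value still has to be read from an F-distribution table.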
Example 7.1
Fifteen students at LUANAR are randomly assigned to three different schools, all of which
are concerned with developing a specified level of skill in agricultural economics. The
achievement test scores at the conclusion of the instructional unit are reported in Table 7.2,
along with the mean score for each school. Use the analysis of variance procedure described
above to test the null hypothesis that the three sample means were obtained from the same
population, using the 5 percent level of significance for the test.
Table 7.2: Achievement test scores by school
             Schools               Test scores                    Total score     Mean score
       Bunda Campus (A1)        86   79   81   70   84               400             80
        City Campus (A2)        90   76   88   82   89               425             85
            ODL (A3)            82   68   73   71   81               375             75
Solution
Step 1
H0 : µ1 = µ2 = µ3
H1 : Not all of µ1 , µ2 , µ3 are equal
Step 2
Critical F (df = k − 1, n − k; α = 0.05) = F (2, 12; α = 0.05) = 3.89
This critical value is obtained from the F-distribution table at the 0.05 (5%) level of
significance, as shown in Figure 1.
                                   Figure 1:
Step 3
The overall mean of all 15 test scores is
$$\bar{x} = \frac{\sum x}{n} = \frac{1200}{15} = 80$$
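The mean of sample 1 is
$$\bar{x}_1 = \frac{\sum x_{1i}}{n_1} = \frac{86+79+81+70+84}{5} = 80$$
The mean of sample 2 is
$$\bar{x}_2 = \frac{\sum x_{2i}}{n_2} = \frac{90+76+88+82+89}{5} = 85$$
The mean of sample 3 is
$$\bar{x}_3 = \frac{\sum x_{3i}}{n_3} = \frac{82+68+73+71+81}{5} = 75$$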
Therefore, using the formula for MSTR, we get
$$MSTR = \frac{SSTR}{k-1} = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + n_3(\bar{x}_3 - \bar{x})^2}{k-1}$$
$$= \frac{5(80-80)^2 + 5(85-80)^2 + 5(75-80)^2}{3-1} = \frac{250}{2} = 125$$
From the above, SSTR is 250 and k − 1 is 2.
The variance for each of the three samples is
$$s_1^2 = \frac{(86-80)^2 + (79-80)^2 + (81-80)^2 + (70-80)^2 + (84-80)^2}{5-1} = 38.5$$
$$s_2^2 = \frac{(90-85)^2 + (76-85)^2 + (88-85)^2 + (82-85)^2 + (89-85)^2}{5-1} = 35.0$$
$$s_3^2 = \frac{(82-75)^2 + (68-75)^2 + (73-75)^2 + (71-75)^2 + (81-75)^2}{5-1} = 38.5$$
Then
$$MSE = \frac{SSE}{n-k} = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + (n_3-1)s_3^2}{n-k}$$
$$= \frac{(4)(38.5) + (4)(35.0) + (4)(38.5)}{15-3} = \frac{448}{12} = 37.33$$
From the above, SSE is 448 and n − k is 12.
The computed test statistic is F = MSTR/MSE = 125/37.33 = 3.35.
Step 4
Because the calculated F statistic of 3.35 is not greater than the critical F value of 3.89,
the null hypothesis that the mean test scores for the three schools in the population are all
mutually equal cannot be rejected at the 5 percent level of significance.
This information can be presented in an ANOVA table as:
               Source        df          SS     MS = SS/df           F statistic
             Treatment     3-1 = 2      250    250/2 = 125       F = 125/37.33 = 3.35
               Error      15-3 = 12     448    448/12 = 37.33
                Total     15-1 = 14     698
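The figures in this worked example can be checked with a short script. The sketch below uses only the Python standard library and the test scores from Table 7.2; the variable names are illustrative.

```python
from statistics import mean, variance

# Test scores from Table 7.2, one list per school.
scores = {
    "Bunda Campus": [86, 79, 81, 70, 84],
    "City Campus": [90, 76, 88, 82, 89],
    "ODL": [82, 68, 73, 71, 81],
}
groups = list(scores.values())

k = len(groups)                                   # 3 schools
n = sum(len(g) for g in groups)                   # 15 students
grand_mean = sum(sum(g) for g in groups) / n      # overall mean of all scores

# Treatment and error sums of squares, following the formulas for SSTR and SSE.
sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
sse = sum((len(g) - 1) * variance(g) for g in groups)

mstr = sstr / (k - 1)
mse = sse / (n - k)
f_ratio = mstr / mse

print("SSTR =", sstr, " SSE =", sse, " SST =", sstr + sse)
print("MSTR =", mstr, " MSE =", round(mse, 2), " F =", round(f_ratio, 2))
```

The script reproduces SSTR = 250, SSE = 448, SST = 698, and an F ratio of about 3.35, which is below the critical value of 3.89.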
Practical Activity
  1. Define the meaning of the terms response variable, factor, treatments, and experimental
     units.
     Solution
       •   factor = Independent variables in a designed experiment
       •   treatments = Values of a factor (or combination of factors)
       •   experimental units = The entities to which the treatments are assigned
       •   response variable = The variable of interest in an experiment; dependent variable
  2. Explain the assumptions that must be satisfied in order to validly use the one-way
     ANOVA formulas.
     Solution
       • Constant variance, normality, independence
  3. Explain the difference between the between-treatment variability and the within-
     treatment variability when performing a one-way ANOVA.
     Solution
       • SSTR = variability among the sample treatment means (between-treatment)
       • SSE = variability within each sample (within-treatment)
  4. Explain why we conduct pairwise comparisons of treatment means.
     Solution
       • If the one-way ANOVA F test leads us to conclude that at least two of the treat-
         ment means differ, then we wish to investigate which of the treatment means
         differ and we wish to estimate how large the differences are.
  5. A consumer preference study compares the effects of three different bottle designs (A,
     B, and C) on sales of a popular fabric softener. A completely randomized design is
     employed. Specifically, 15 supermarkets of equal sales potential are selected, and 5 of
     these supermarkets are randomly assigned to each bottle design. The number of bottles
     sold in 24 hours at each supermarket is recorded. The data obtained are displayed in
     Table 1 below. Let µA , µB , and µC represent mean daily sales using bottle designs A, B,
     and C, respectively. Test the null hypothesis that µA , µB , and µC are equal by setting
     α = .05. That is, test for statistically significant differences between these treatment
     means at the .05 level of significance. Based on this test, can we conclude that bottle
     designs A, B, and C have different effects on mean daily sales?
                                     Table 1: Bottle designs
                                            A    B    C
                                           16 33 23
                                           18 31 27
                                           19 37 21
                                           17 29 28
                                           13 34 25
Solution
   • F = 43.36, Reject H0 : bottle design does have an impact on sales.
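A worked sketch of how the stated F value can be obtained from the bottle-design data with the one-way ANOVA formulas of Section 2.3 is given below. Here k = 3 and n = 15, and intermediate values are rounded.

```latex
\begin{align*}
\bar{x}_A &= 16.6, \quad \bar{x}_B = 32.8, \quad \bar{x}_C = 24.8, \quad
  \bar{x} = \frac{371}{15} \approx 24.73 \\
SSTR &= 5(16.6 - 24.73)^2 + 5(32.8 - 24.73)^2 + 5(24.8 - 24.73)^2 \approx 656.13 \\
SSE &= 21.2 + 36.8 + 32.8 = 90.8 \\
F &= \frac{SSTR/(k-1)}{SSE/(n-k)} = \frac{656.13/2}{90.8/12}
   \approx \frac{328.07}{7.567} \approx 43.36
\end{align*}
```

The three terms in SSE are the sums of squared deviations within designs A, B, and C. Since 43.36 is far greater than the critical value F(2, 12; α = 0.05) = 3.89, H0 is rejected.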
2.4    Two-way analysis of variance
Many response variables are affected by more than one factor. Because of this we must
often conduct experiments in which we study the effects of several factors on the response.
In this section we consider studying the effects of two factors on a response variable. The
two-way ANOVA compares the mean differences between groups that have been split on
two independent variables (called factors). The primary purpose of a two-way ANOVA is to
understand if there is an interaction between the two independent variables on the dependent
variable.
Suppose that an agricultural experiment consists of examining the yields per acre of 4 differ-
ent varieties of wheat, where each variety is grown on 5 different plots of land. Thus, a total
of 20 plots are needed. It is convenient in such a case to combine the plots into blocks, say 4
plots to a block, with a different variety of wheat grown on each plot within a block. Thus
5 blocks would be required here. In this case, there are two classifications, or factors, since
there may be differences in yield per acre due to (1) the particular type of wheat grown or
(2) the particular block used (which may involve different soil fertility, etc.).
By analogy with the agricultural experiment example, we often refer to the two factors in an
experiment as treatments and blocks, but of course we could simply refer to them as factor
1 and factor 2.
Assuming that we have a treatments and b blocks, we construct the table below, in which
there is one experimental value (such as yield per acre) corresponding to each treatment
and block. For treatment j and block k, we denote this value by $x_{jk}$. The mean of the entries
in the jth row is denoted by $\bar{x}_{j.}$, where j = 1, . . . , a, while the mean of the entries in the kth
column is denoted by $\bar{x}_{.k}$, where k = 1, . . . , b. The overall, or grand, mean is denoted by $\bar{\bar{x}}$.
In symbols, this is shown as:
$$\bar{x}_{j.} = \frac{1}{b}\sum_{k=1}^{b} x_{jk}$$
$$\bar{x}_{.k} = \frac{1}{a}\sum_{j=1}^{a} x_{jk}$$
$$\bar{\bar{x}} = \frac{1}{ab}\sum_{j=1}^{a}\sum_{k=1}^{b} x_{jk}$$
                        Block
                 1      2     ···     b      Row mean
 Treatment 1    x11    x12    ···    x1b       x̄1.
 Treatment 2    x21    x22    ···    x2b       x̄2.
     ···        ···    ···    ···    ···       ···
 Treatment a    xa1    xa2    ···    xab       x̄a.
 Column mean    x̄.1   x̄.2   ···    x̄.b
The ANOVA procedure for a two-factor factorial experiment partitions the total sum of
squares (SSTO) into four components: the factor 1 sum of squares– SS(1), the factor 2 sum
of squares–SS(2), the interaction sum of squares–SS(int), and the error sum of squares–SSE.
The formula for this partitioning is as follows:
                         SST O = SS(1) + SS(2) + SS(int) + SSE
But in this lesson we will for now ignore the interaction: hence,
                                    SST O = SS(1) + SS(2) + SSE
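The partition without interaction can be sketched as follows. This is a minimal illustration assuming one observation per treatment-block cell, as in the table above; the function name `two_way_anova` and the toy data are hypothetical.

```python
def two_way_anova(table):
    """Sums of squares for a two-factor layout with one observation per cell.

    table[j][k] is the observation for treatment (row) j and block (column) k.
    """
    a = len(table)         # number of treatments (rows)
    b = len(table[0])      # number of blocks (columns)
    grand_mean = sum(sum(row) for row in table) / (a * b)
    row_means = [sum(row) / b for row in table]
    col_means = [sum(table[j][k] for j in range(a)) / a for k in range(b)]

    ssto = sum((x - grand_mean) ** 2 for row in table for x in row)
    ss1 = b * sum((m - grand_mean) ** 2 for m in row_means)   # factor 1 (rows)
    ss2 = a * sum((m - grand_mean) ** 2 for m in col_means)   # factor 2 (columns)
    sse = ssto - ss1 - ss2                                    # what is left over
    return {"SSTO": ssto, "SS(1)": ss1, "SS(2)": ss2, "SSE": sse}

# Hypothetical toy data: 2 treatments by 2 blocks.
toy = [[1, 2],
       [3, 6]]
parts = two_way_anova(toy)
print(parts)
```

For the toy data the grand mean is 3, the row means are 1.5 and 4.5, and the column means are 2 and 4, so SSTO = 14 splits into SS(1) = 9, SS(2) = 4, and SSE = 1.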
Example
Table 2.2 gives the daily earnings (in thousands of MK) of fresh graduates with
bachelor’s degrees from 5 colleges and for 3 class rankings at graduation. Test at the 5%
level of significance that the means are identical (a) for college populations and (b) for
class-ranking populations.
Table 2.2
 Class rank     Bunda    Chanco    Poly    Medicine    Nursing    Sample mean
   Top           20        18       16        14         12           16
  Middle         19        16       13        12         10           14
  Bottom         18        14       10        10          8           12
  Sample mean    19        16       13        12         10           14
Solution
    (a) Step 1: The hypotheses to be tested for the colleges are
                                     H0 : µ1 = µ2 = µ3 = µ4 = µ5
                               H1 : Not all of µ1 , µ2 , µ3 , µ4 , µ5 are equal
    where µ1 , . . . , µ5 are the means of the five college (factor b, column) populations.
    We define each of the sums of squares and show how they are calculated for the
    earnings data as follows (note that a = 3 class ranks and b = 5 colleges).
    Step 2: Calculate SSTO, which measures the total amount of variability:
$$\begin{aligned}
SSTO &= \sum_{j=1}^{a}\sum_{k=1}^{b}(x_{jk} - \bar{\bar{x}})^2 \\
&= (20-14)^2 + (18-14)^2 + (16-14)^2 + (14-14)^2 + (12-14)^2 \\
&\quad + (19-14)^2 + (16-14)^2 + (13-14)^2 + (12-14)^2 + (10-14)^2 \\
&\quad + (18-14)^2 + (14-14)^2 + (10-14)^2 + (10-14)^2 + (8-14)^2 \\
&= 36 + 16 + 4 + 0 + 4 + 25 + 4 + 1 + 4 + 16 + 16 + 0 + 16 + 16 + 36 \\
&= 194
\end{aligned}$$
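    Step 3: Calculate SS(a), which measures the amount of variability due to the different
    levels of factor a (class rank):
$$\begin{aligned}
SS(a) &= b\sum_{j=1}^{a}(\bar{x}_{j.} - \bar{\bar{x}})^2 \\
&= b\left[(\bar{x}_{1.} - \bar{\bar{x}})^2 + (\bar{x}_{2.} - \bar{\bar{x}})^2 + (\bar{x}_{3.} - \bar{\bar{x}})^2\right] \\
&= 5\left[(16-14)^2 + (14-14)^2 + (12-14)^2\right] \\
&= 5(4 + 0 + 4) \\
&= 40
\end{aligned}$$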
    Step 4: Calculate SS(b), which measures the amount of variability due to the different
    levels of factor b (colleges):
$$\begin{aligned}
SS(b) &= a\sum_{k=1}^{b}(\bar{x}_{.k} - \bar{\bar{x}})^2 \\
&= a\left[(\bar{x}_{.1} - \bar{\bar{x}})^2 + (\bar{x}_{.2} - \bar{\bar{x}})^2 + (\bar{x}_{.3} - \bar{\bar{x}})^2 + (\bar{x}_{.4} - \bar{\bar{x}})^2 + (\bar{x}_{.5} - \bar{\bar{x}})^2\right] \\
&= 3\left[(19-14)^2 + (16-14)^2 + (13-14)^2 + (12-14)^2 + (10-14)^2\right] \\
&= 3(25 + 4 + 1 + 4 + 16) \\
&= 150
\end{aligned}$$
     Step 5: Calculate SSE, which measures the amount of variability due to the error:
$$SSE = SSTO - SS(a) - SS(b) = 194 - 40 - 150 = 4$$
     These results are summarized in Table 2.3. For the colleges, the calculated value is
     F = MS(b)/MSE = 37.5/0.5 = 75. From the F distribution table, the critical value is
     F = 3.84 for degrees of freedom 4 and 8 and α = 0.05. Since the calculated F = 75
     exceeds 3.84, we reject H0 and accept H1 , that the population means of fresh
     graduates’ earnings for the 5 colleges are different.
     Table 2.3 Two-Factor ANOVA Table for First-Year Earnings
      Variation                          Sum of squares   Degrees of freedom    Mean square             F
      Explained by schools (factor b,    SS(b) = 150      b-1 = 4               MS(b) = 150/4 = 37.5    MS(b)/MSE = 75
        between columns)
      Explained by ranking (factor a,    SS(a) = 40       a-1 = 2               MS(a) = 40/2 = 20       MS(a)/MSE = 40
        between rows)
      Error or unexplained               SSE = 4          (a-1)(b-1) = 8        MSE = 4/8 = 0.5
      Total                              SSTO = 194       ab-1 = 14
     (b) The hypotheses to be tested for class rankings are
H0 : µ1 = µ2 = µ3
H1 : Not all of µ1 , µ2 , µ3 are equal
where µ1 , µ2 , µ3 are the means of the three class-ranking (factor a, row) populations. From
Table 2.3, the calculated value is F = MS(a)/MSE = 20/0.5 = 40. Since this is larger than the
tabular value of F = 4.46 for df 2 and 8 and α = 0.05, we reject H0 and accept H1 , that the
population means of first-year earnings for the 3 class rankings are different. Thus, the type
of school and class ranking are both statistically significant at the 5% level in explaining
differences in first-year earnings. The preceding analysis implicitly assumes that the effects
of the two factors are additive (i.e., there is no interaction between them).
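The sums of squares and F ratios in this example can be checked with a short script. This is a sketch using only the Python standard library, with illustrative variable names. The Nursing column is entered as 12, 10, 8, the values consistent with the row means (16, 14, 12) and with the squared deviations listed in Step 2.

```python
# Earnings by class rank (rows) and college (columns:
# Bunda, Chanco, Poly, Medicine, Nursing).
earnings = {
    "Top":    [20, 18, 16, 14, 12],
    "Middle": [19, 16, 13, 12, 10],
    "Bottom": [18, 14, 10, 10, 8],
}
rows = list(earnings.values())
a, b = len(rows), len(rows[0])          # a = 3 class ranks, b = 5 colleges

grand_mean = sum(sum(r) for r in rows) / (a * b)
row_means = [sum(r) / b for r in rows]
col_means = [sum(col) / a for col in zip(*rows)]

ssto = sum((x - grand_mean) ** 2 for r in rows for x in r)
ss_rank = b * sum((m - grand_mean) ** 2 for m in row_means)      # factor a (rows)
ss_college = a * sum((m - grand_mean) ** 2 for m in col_means)   # factor b (columns)
sse = ssto - ss_rank - ss_college

mse = sse / ((a - 1) * (b - 1))
f_college = (ss_college / (b - 1)) / mse
f_rank = (ss_rank / (a - 1)) / mse

print("SSTO =", ssto, " SS(a) =", ss_rank, " SS(b) =", ss_college, " SSE =", sse)
print("F for colleges =", f_college, " F for class rank =", f_rank)
```

The script gives SSTO = 194, SS(a) = 40, SS(b) = 150, and SSE = 4, with F ratios of 75 for colleges and 40 for class rank. Both exceed their critical values of 3.84 and 4.46.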
Activity:
  1. Table 2.4 gives the km per litre of petrol for 4 different filling stations in Lilongwe for
     5 days. Assume that the km per litre for each filling station is normally distributed
     with equal variance. Should the hypothesis of equal population means be accepted or
     rejected at the 5% level of significance?
      Table 2.4
      Filling station 1   Filling station 2   Filling station 3    Filling station 4
              12                  12                  16                   17
              11                  14                  14                   15
              12                  13                  15                   17
              13                  15                  13                   16
              11                  14                  14                   18
     Answer: Rejected
  2. Table 2.5 gives the miles per litre of petrol for each of 4 different filling stations and
     3 types of car (heavy, medium, and light) in a completely randomized design. Should
     the hypothesis be accepted at the 1% level of significance that the population means
     are the same for each (a) filling station? (b) Type of car?
     Table 2.5
      Type of     Filling station 1   Filling station 2   Filling station 3   Filling station 4
        Car
      Heavy               8                   9                   9                    10
      Medium              16                  15                  18                   17
       Light              24                  26                  28                   30
     Answer: (a) Yes (b) No
  3. Table 2.6 gives groundnut sales data for each of 3 different packagings and 4 different
     varieties of groundnuts in a completely randomized design. Should the hypothesis be
     accepted at the 5% level of significance that the population means are the same for
     each (a) packaging? (b) variety?
     Table 2.6 Groundnut sales for 3 package wrappings and 4 varieties
                      Packaging 1     Packaging 2    Packaging 3
       Manipinta          87              78             90
      Chalimbana          79              79             84
        Kalisere          83              81             91
         CG7              85              83             89
     Answer: (a) No (b) Yes
  4. Table below gives the outputs of an experimental farm that used each of four fertilizers
     and three pesticides such that each plot of land had an equal probability of receiving
     each fertilizer-pesticide combination (completely randomized design).
   Output with 4 fertilizers and 3 pesticides
                  Fertilizer 1   Fertilizer 2   Fertilizer 3   Fertilizer 4
    Pesticide 1       21             12              9              6
    Pesticide 2       13             10              8              5
    Pesticide 3        8              8              7              1
        a. Find the average output for each fertilizer $\bar{X}_{.j}$, for each pesticide $\bar{X}_{i.}$, and for
        the sample as a whole $\bar{\bar{X}}$.
        b. Find the total sum of squares, SST, the sum of squares for fertilizer or factor
        A, SSA, for pesticides or factor B, SSB, and for the error or unexplained residual,
        SSE.
        c. Find the degrees of freedom for SSA, SSB, SSE, and SST.
        d. Find MSA, MSB, MSE, MSA/MSE, and MSB/MSE.