Keselman 1977
Assessing whether K group means differ from one another is a very frequent concern of psychological researchers. The analysis of variance (ANOVA) F test is a popular statistical procedure for assessing group differences; however, when K > 2, a significant F test would have to be probed further in order to locate specific differences among the group means. Tukey's (Note 1) multiple comparison procedure (MCP), also popularly referred to as the Tukey wholly significant difference (WSD) test (Miller, 1966), is a prevalently cited method when the researcher's multiple comparison hypotheses are for pairwise differences and the rate of Type I error is to be controlled for the set of all possible pairwise contrasts (e.g., Games, 1971b; Glass & Stanley, 1973; Winer, 1971). That is, the error rate is controlled experimentwise (Kirk, 1968; Ryan, 1959) or familywise (Games, 1971b; Miller, 1966), in that a Type I error for the set occurs when any one of the pairwise contrasts is falsely rejected.

Interest in the operating characteristics of the WSD test is evident from the articles published in the behavioral science and statistical journals (Carmer & Swanson, 1973; Cicchetti, 1972; Einot & Gabriel, 1975; Games, 1971a, 1971b; Games & Howell, 1976; Hochberg, 1976; Howell & Games, 1973, 1974; Keselman, 1976; Keselman, Murray, & Rogan, 1976; Keselman & Toothaker, 1974; Keselman, Toothaker, & Shooter, 1975; Petrinovich & Hardyck, 1969; Rogan, Keselman, & Breen, in press; Smith, 1971; Spjøtvoll & Stoline, 1973; Ury & Wiggins, 1975; Rogan & Keselman, Note 2). Many of these articles have considered the effect of violating the assumptions under which the WSD test was derived. The importance of such studies to psychological researchers relates to the validity of using the test with data obtained in "real life" research settings, which will seldom satisfy the requirements of the derivation. The purpose of the present article is to review the literature pertaining to the WSD test and its derivational assumptions, with the primary intention of integrating in one paper recently published modifications that greatly enhance the versatility of the test.

Definition of a Common Test Statistic

The WSD test, like the ANOVA F test, assumes that the observations ($i = 1, \ldots, n_k$) of each of the K populations are independently and normally distributed with equal variances ($\sigma_k^2 = \sigma_{k'}^2 = \sigma^2$). In addition, the method was derived under the restriction that the variances of the sample means, $\sigma^2/n$, be equal; therefore, each sample mean must be based on an equal number of observations, n.

The WSD procedure rejects the hypothesis that $\mu_k - \mu_{k'} = 0$, $k \neq k'$, when the absolute

The research for this article was supported by the Research Board of the University of Manitoba. The authors gratefully acknowledge the many helpful comments provided by the reviewers and the associate editor.
Requests for reprints should be sent to H. J. Keselman, Department of Psychology, University of Manitoba, Winnipeg, Manitoba, R3T 2N2, Canada.
TUKEY MULTIPLE COMPARISON TEST 1051
Table 1
Estimated Standard Errors and Critical Values of the Multiple Comparison Procedures, Using Student's t as the Test Statistic

Harmonic mean
Kramer
Spjøtvoll & Stoline
Hochberg: [max($s_k^2/n_k$) + max ...
Behrens-Fisher: $(s_k^2/n_k + s_{k'}^2/n_{k'})^{1/2}$
Scheffé: [(K - 1) ...

Note. min = minimum; max = maximum; $\nu_w$ indicates the Welch (1949) solution for error df (see Footnote 2); $F_{(\alpha; K-1, N-K)}$ is the upper 100α percent point of the Snedecor F distribution with parameters K - 1 and N - K.
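As a hedged illustration of how the standard-error definitions summarized in Table 1 differ, the following Python sketch computes the estimated standard error of a pairwise contrast for four of the procedures. The formulas are assumptions reconstructed from the verbal descriptions in the text (the pooled variance with both sample sizes for Kramer and Scheffé, the harmonic mean of all K sample sizes, the smaller sample size in both positions for Spjøtvoll and Stoline, and separate variances for Behrens-Fisher); they are not a transcription of the table, and every function name is hypothetical.

```python
from math import sqrt


def pooled_variance(variances, sizes):
    """Pooled within-cell variance s^2 on N - K degrees of freedom."""
    numerator = sum((n - 1) * v for v, n in zip(variances, sizes))
    return numerator / (sum(sizes) - len(sizes))


def se_kramer(s2, n_k, n_kp):
    """Assumed Kramer (and Scheffe) form: pooled s^2 with both sample sizes."""
    return sqrt(s2 * (1.0 / n_k + 1.0 / n_kp))


def se_harmonic_mean(s2, sizes):
    """Assumed harmonic mean form: every n replaced by the harmonic mean of all K sizes."""
    n_tilde = len(sizes) / sum(1.0 / n for n in sizes)
    return sqrt(2.0 * s2 / n_tilde)


def se_spjotvoll_stoline(s2, n_k, n_kp):
    """Assumed Spjotvoll-Stoline form: the smaller n is used in both positions."""
    return sqrt(2.0 * s2 / min(n_k, n_kp))


def se_behrens_fisher(v_k, n_k, v_kp, n_kp):
    """Behrens-Fisher form: separate (unpooled) variance estimates."""
    return sqrt(v_k / n_k + v_kp / n_kp)


# Illustrative (made-up) sample variances and sizes for K = 3 groups.
variances = [4.0, 9.0, 16.0]
sizes = [10, 15, 20]
s2 = pooled_variance(variances, sizes)
print(se_kramer(s2, 10, 20), se_spjotvoll_stoline(s2, 10, 20))
```

Under these assumed forms, the Spjøtvoll and Stoline standard error can never be smaller than the Kramer standard error for the same pair, which agrees with the expectation that the former yields the more conservative test.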
were combined with various patterns of heterogeneous variances. Combining unequal variances and unequal sample sizes affected the harmonic mean procedure in a manner similar to the effect of these assumption violations on the ANOVA F test (Scheffé, 1959). The rates of Type I error were less than the nominal value when the smallest sample was paired with the smallest variance (conservative test), but exceeded the nominal value when the smallest sample was combined with the largest variance (liberal test).

Similar results were obtained by Petrinovich and Hardyck (1969) and Keselman and Toothaker (1974), who examined the harmonic mean procedure, and by Keselman, Toothaker, and Shooter (1975), who compared the harmonic mean and Kramer procedures for the combined effects of unequal group sizes, variance heterogeneity, and nonnormality. In addition, both procedures were generally robust with respect to nonnormality.

Although the above investigations have examined the combined effects of unequal sample sizes and unequal variances on the empirical probability of a Type I error, other than Howell and Games (1973), none have attempted to quantify numerically and to vary systematically the degree of heterogeneity. This concern is of paramount importance, since Box (1954) has shown that in situations of unequal variances and unequal sample sizes, the degree of heterogeneity, as indexed by a coefficient of variance variation, is the major determinant of the extent of bias in the estimation of significance.¹ Consequently, Rogan, Keselman, and Breen (in press) systematically evaluated the effects of varying degrees of heterogeneity resulting from conditions of unequal variances and unequal sample sizes on the accuracy of the harmonic mean (Winer, 1971), the Kramer (1956), and the Miller (1966) unequal group forms of the WSD test when sampling from a normal and nonnormal distribution. The discrepancies between the empirical and nominal significance rates of Type I error were found to vary markedly as a function of the degree of heterogeneity. The Kramer unequal group form was recommended, as it consistently resulted in empirical Type I rates of error deviating less from the nominal significance level than did either of the other two unequal $n_k$ forms. Also of importance was the finding that the rate of Type I error was seriously inflated, though sample sizes were equal, when the degree of variance heterogeneity was large.

In summary, the studies comparing the harmonic mean (Winer, 1971), the Kramer (1956), and the Miller (1966) modifications have clearly favored the use of the Kramer procedure. However, the Kramer modification is

¹ The degree of variance heterogeneity present in an experimental paradigm can be indexed by a coefficient of variance variation given by Box (1954). The coefficient of variance variation, C, is the standard deviation of the K unequal variances divided by the average of the K variances, $\bar{\sigma}_{\cdot}^2$.
nonetheless affected by combining unequal sample sizes with heterogeneous variances. Specifically, the modification provides a conservative test when the smallest sample is obtained from the population with the smallest variance, and a liberal test when the smallest sample is paired with the largest population variance (Keselman, Toothaker, & Shooter, 1975; Rogan, Keselman, & Breen, in press).

Recently, Spjøtvoll and Stoline (1973) have presented a modified form of the WSD test, which, in its mathematical derivation, is applicable to the unequal sample case. Their modification uses only the smaller of the two $n_k$s and, since the procedure assumes homogeneity of variance, the pooled within-cell estimate, $s^2$, of error variability. By their criterion, a contrast must exceed the 100α percent point of the augmented Studentized range distribution, $q'$ (see Scheffé, 1959), with parameters K and N - K. If K > 2 and α < .05, the values of $q_\alpha$ will be good approximations for the untabled $q'_\alpha$ values (Spjøtvoll & Stoline, 1973; Ury & Wiggins, 1975). Because the smallest sample size is used in both $n_k$ positions in the t statistic (see Table 1), the procedure should tend to inflate $SE(\bar{X}_k - \bar{X}_{k'})$ and consequently should provide a conservative test.

Another modification of the WSD test has been offered by Hochberg (1976), who has extended the Spjøtvoll and Stoline (1973) procedure to include the case of heterogeneous variances. Thus, the Hochberg form of the WSD test is not restricted to equal sample sizes, nor do the population variances have to be assumed to be equal. The Hochberg modification uses the largest of the two standard error estimates of $\bar{X}_k$ and $\bar{X}_{k'}$ ($s_k^2/n_k$ and $s_{k'}^2/n_{k'}$), where $s_k^2 = \sum_i (X_{ik} - \bar{X}_k)^2/(n_k - 1)$. Like the method of Spjøtvoll and Stoline (1973), this procedure should provide an inflated estimate of error variability, and consequently, a conservative test. A pairwise contrast must exceed the 100α percent point of the augmented Studentized range distribution $q'_{(\alpha; K, N-K)}$ to be statistically significant. The values of $q'_{(\alpha; K, N-K)}$ are again well approximated by the values of $q_{(\alpha; K, N-K)}$ if K > 3 and α < .05 (Hochberg, 1976). The estimated standard error and critical value of t are enumerated in Table 1.

The latest modification of the WSD test has been proposed by Howell and Games (1974) and Games and Howell (1976). They suggest adopting the Behrens-Fisher statistic with Welch's (1949) approximate t solution for df.² Since it has been found that the Behrens-Fisher solution satisfactorily controls the rate of Type I error on any one contrast when sample sizes and variances are unequal and also provides a powerful statistical test (Mehta & Srinivasan, 1970; Wang, 1971), these authors have suggested its use in testing the multiple comparison null hypotheses by noting that the familywise rate of Type I error can be controlled merely by using the WSD criterion of significance.

The availability of the Spjøtvoll and Stoline (1973), Hochberg (1976), and Behrens-Fisher modifications substantially enhances the applicability of the WSD test to psychological data. Prior to these presentations, psychological researchers might have been utilizing the harmonic mean approximation recommended by Winer (1971), since the Kramer (1956) procedure has only recently been popularized in the psychological literature, but, more likely, they were using the Scheffé (1959) MCP (which for pairwise contrasts can also be expressed as a t statistic), since its applicability for unequal sample sizes has been widely discussed (e.g., Glass & Stanley, 1970; Hays, 1972; Petrinovich & Hardyck, 1969). The choice between the Kramer and the other modifications including the Scheffé procedure can, in part, be facilitated by examining the information contained in Table 1.

Inspection of the denominators of the common test statistics and their critical values indicates that the MCPs differ either in their definition of $SE(\bar{X}_k - \bar{X}_{k'})$ and/or in their critical values. Consequently some differences between the methods are predictable. For example, it can be seen that the Kramer and Scheffé procedures are identical, yet the rate of Type I error and the power will be lower when using the Scheffé MCP, since it sets a larger critical value. It is also expected that the modifications

² $\nu_w$, the Welch degrees of freedom, is given by

$$\nu_w = \frac{\left(s_k^2/n_k + s_{k'}^2/n_{k'}\right)^2}{\dfrac{\left(s_k^2/n_k\right)^2}{n_k - 1} + \dfrac{\left(s_{k'}^2/n_{k'}\right)^2}{n_{k'} - 1}}.$$
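To make the Behrens-Fisher approach with Welch degrees of freedom concrete, the Python sketch below computes the t statistic and $\nu_w$ for a single pairwise contrast. It assumes the usual Welch-Satterthwaite form of the degrees of freedom, with $n_k - 1$ and $n_{k'} - 1$ in the denominator, and it assumes that |t| is compared with the Studentized range critical value divided by the square root of 2. The function names are hypothetical, and the critical value q must be supplied by the user from a Studentized range table with parameters K and $\nu_w$.

```python
from math import sqrt


def welch_df(v_k, n_k, v_kp, n_kp):
    """Welch approximate degrees of freedom for two independent samples."""
    a = v_k / n_k
    b = v_kp / n_kp
    return (a + b) ** 2 / (a ** 2 / (n_k - 1) + b ** 2 / (n_kp - 1))


def behrens_fisher_t(mean_k, v_k, n_k, mean_kp, v_kp, n_kp):
    """t statistic using separate (unpooled) variance estimates."""
    return (mean_k - mean_kp) / sqrt(v_k / n_k + v_kp / n_kp)


def exceeds_wsd_criterion(t, q_critical):
    """Compare |t| with q / sqrt(2); q_critical comes from a table lookup."""
    return abs(t) >= q_critical / sqrt(2.0)


# Illustrative (made-up) summary statistics for one pair of groups.
t = behrens_fisher_t(12.0, 4.0, 10, 10.0, 16.0, 20)
df = welch_df(4.0, 10, 16.0, 20)
print(t, df)
```

With equal variances and equal sample sizes, this form of $\nu_w$ reduces to $n_k + n_{k'} - 2$, and in general it lies between the smaller of $n_k - 1$ and $n_{k'} - 1$ and that upper limit.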
1054 H. J. KESELMAN AND JOANNE C. ROGAN
given by Spjøtvoll and Stoline (1973) and Hochberg (1976) will be more conservative than the Kramer method and therefore less powerful as well. However, a uniformly preferable choice is still not apparent. That is, factors such as the degree of variance heterogeneity, the extent of sample size imbalance, the number of treatment levels, the shape of the population, the magnitude of the nonnull treatment effects, can affect the rates of Type I error and power characteristics, particularly the magnitude of differences. Therefore, empirical investigation is necessary to supplement the mathematical comparisons in order to attempt a specification of "the recommended procedure."

Given the assumptions that psychological data rarely, if ever, satisfy the homogeneity-of-variance assumption and that even when the data conform to this assumption the researcher will rarely have sufficient evidence concerning the population variances, investigations pertaining to the MCPs that do not require homogeneity of variance would be most instrumental in delineating the best method. Investigations presented by Rogan and Keselman (Note 2) and Games and Howell (1976) are most relevant to our search for the best method.

Rogan and Keselman (Note 2) investigated the hypothesis suggested by Hochberg (1976) that his modification would be more powerful than Scheffé's (1959) method for pairwise contrasts when heterogeneity is low and sample size is large. These authors collected empirical rates of Type I error and power for the Kramer (1956) and Spjøtvoll and Stoline (1973) methods as well. The study assessed the magnitude of Type I error and power differences due to (a) degree of variance heterogeneity, (b) degree of sample size imbalance, (c) pattern of variance heterogeneity, (d) pattern of mean differences, (e) form of distribution, and (f) direction of pairing of unequal sample sizes and heterogeneous variances. It was found that when the smallest sample was paired with the smallest variance, the rates of Type I error for all procedures were less than the nominal significance level, and the Kramer and particularly the Hochberg procedures were more powerful than the Scheffé test. For those conditions in which the smallest samples were paired with the largest variances, the Hochberg modification maintained the rate of Type I error well below the nominal significance level. On the other hand, the Kramer values exceeded the nominal value at every degree of heterogeneity investigated, whereas the Spjøtvoll and Stoline (1973) and the Scheffé (1959) tests remained robust, except for the two largest degrees of heterogeneity. The inflated rates of error for the three procedures ranged from 5.7%-21.1%, 8.1%-13.9%, and 6.6%-16.4%, respectively. Also, only for the two largest degrees of heterogeneity were the Kramer and Scheffé tests sufficiently more powerful than the Hochberg procedure, as hypothesized by Hochberg (1976).

Games and Howell (1976), in their simulation study, compared the Behrens-Fisher, Kramer, and multiple t test methods for variance heterogeneity, sample size inequality, and pairings of unequal sample sizes and heterogeneous variances. In addition to collecting WSD familywise rates of Type I error, they obtained Type I error and power per comparison rates (rates that are set on just a single contrast; see Games, 1971b) on three pairwise differences. Their data indicated that only the Behrens-Fisher approach satisfactorily controls both the familywise and per comparison rates of Type I error. The familywise rates were within two percent of the five-percent nominal value, regardless of whether the smallest sample was associated with the smallest or largest variance. The per comparison estimates indicated that only the Behrens-Fisher solution controlled the rate of error on each contrast, whereas the per comparison probability of a Type I error substantially varied from contrast to contrast for the Kramer and multiple t solutions. Given the satisfactory control of Type I errors and the expected different power values due to the Type I error differences, of paramount concern to Games and Howell (1976) in their power analysis was an examination of the Behrens-Fisher solution when the derivational assumptions of the WSD method were satisfied. That is, Games and Howell (1976) investigated the effect of using the Behrens-Fisher solution when the usual WSD test would be most powerful, in order to determine whether power differences would be of a magnitude that would not favor adopting
the uniform rule that irrespective of the state of the population variances, one should always use the Behrens-Fisher solution. The power differences were minimal (3%-4%) and therefore Games and Howell (1976) recommended the Behrens-Fisher solution.

Conclusions and Recommendations

The data from these two studies indicate that only the Hochberg and Behrens-Fisher modifications satisfactorily control the rate of Type I error at the nominal significance level in the presence of pairings of unequal sample sizes and heterogeneous variances. In terms of absolute deviations, the Behrens-Fisher values were much closer to the nominal value. Moreover, the mathematical and empirical information pertaining to power characteristics undoubtedly favors the Behrens-Fisher modification. Based on the cited literature, we too recommend the Behrens-Fisher solution with the Welch (1949) approximate t solution for df, referring the value of |t| to the Tukey familywise criterion of significance, $q_{(\alpha; K, \nu_w)}/\sqrt{2}$.³

³ An empirical investigation was later conducted and the data support the recommendation. The results can be obtained from the authors.

Reference Notes

1. Tukey, J. W. The problem of multiple comparisons. Unpublished manuscript, Princeton University, 1953.
2. Rogan, J. C., & Keselman, H. J. The effects of sample size unbalance and variance heterogeneity on the extended-Tukey and Scheffé multiple comparison tests. Paper presented at the meeting of the American Educational Research Association, New York, April 1977.

References

Box, G. E. P. Some theorems on quadratic forms applied in the study of analysis of variance problems. I. Effect of inequality of variance in the one-way classification. Annals of Mathematical Statistics, 1954, 25, 290-302.
Carmer, S. G., & Swanson, M. R. An evaluation of ten multiple comparison procedures by Monte Carlo methods. Journal of the American Statistical Association, 1973, 68, 66-74.
Cicchetti, D. Extension of multiple range tests to interaction tables in the analysis of variance: A rapid approximate solution. Psychological Bulletin, 1972, 77, 405-408.
Einot, I., & Gabriel, K. R. A study of the powers of several methods of multiple comparisons. Journal of the American Statistical Association, 1975, 70, 574-583.
Games, P. A. Inverse relation between the risks of Type I and Type II errors and suggestions for the unequal n case in multiple comparisons. Psychological Bulletin, 1971, 75, 97-102. (a)
Games, P. A. Multiple comparisons of means. American Educational Research Journal, 1971, 8, 531-565. (b)
Games, P. A., & Howell, J. F. Pairwise multiple comparison procedures with unequal N's and/or variances: A Monte Carlo study. Journal of Educational Statistics, 1976, 1, 113-125.
Glass, G. V., & Stanley, J. C. Statistical methods in education and psychology. Englewood Cliffs, N.J.: Prentice-Hall, 1970.
Hays, W. L. Statistics for the social sciences (2nd ed.). Toronto, Canada: Holt, Rinehart & Winston, 1972.
Hochberg, Y. A modification of the T-method of multiple comparisons for a one-way layout with unequal variances. Journal of the American Statistical Association, 1976, 71, 200-203.
Howell, J. F., & Games, P. A. The robustness of the analysis of variance and the Tukey WSD test under various patterns of heterogeneous variances. Journal of Experimental Education, 1973, 41, 33-37.
Howell, J. F., & Games, P. A. The effects of variance heterogeneity on simultaneous multiple-comparison procedures with equal sample size. British Journal of Mathematical and Statistical Psychology, 1974, 27, 72-81.
Keselman, H. J. A power investigation of the Tukey multiple comparison statistic. Educational and Psychological Measurement, 1976, 36, 97-104.
Keselman, H. J., Murray, R., & Rogan, J. Effect of very unequal group sizes on Tukey's multiple comparison test. Educational and Psychological Measurement, 1976, 36, 263-270.
Keselman, H. J., & Toothaker, L. E. Comparison of Tukey's T-method and Scheffé's S-method for various numbers of all possible differences of averages contrasts under violation of assumptions. Educational and Psychological Measurement, 1974, 34, 511-519.
Keselman, H. J., Toothaker, L. E., & Shooter, M. An evaluation of two unequal $n_k$ forms of the Tukey multiple comparison statistic. Journal of the American Statistical Association, 1975, 70, 584-587.
Kirk, R. E. Experimental design: Procedures for the behavioral sciences. Belmont, Calif.: Brooks/Cole, 1968.
Kramer, C. Y. Extension of multiple range tests to group means with unequal numbers of replications. Biometrics, 1956, 12, 307-310.
Mehta, J. S., & Srinivasan, R. On the Behrens-Fisher problem. Biometrika, 1970, 57, 649-655.
Miller, R. G., Jr. Simultaneous statistical inference. New York: McGraw-Hill, 1966.
Petrinovich, L. F., & Hardyck, C. D. Error rates for multiple comparison methods: Some evidence concerning the frequency of erroneous conclusions. Psychological Bulletin, 1969, 71, 43-54.
Rogan, J. C., Keselman, H. J., & Breen, L. J. Assumption violations and rates of Type I error for the Tukey multiple comparison test: A review and empirical investigation via a coefficient of variance variation. Journal of Experimental Education, in press.
Ryan, T. A. Multiple comparisons in psychological research. Psychological Bulletin, 1959, 56, 26-47.
Scheffé, H. The analysis of variance. New York: Wiley, 1959.
Smith, R. A. The effect of unequal group size on Tukey's HSD procedure. Psychometrika, 1971, 36, 31-34.
Spjøtvoll, E., & Stoline, M. R. An extension of the T-method of multiple comparison to include the cases with unequal sample sizes. Journal of the American Statistical Association, 1973, 68, 975-978.
Steel, R. G. D., & Torrie, J. H. Principles and procedures of statistics. New York: McGraw-Hill, 1966.
Ury, H. K., & Wiggins, A. D. A comparison of three procedures for multiple comparisons among means. British Journal of Mathematical and Statistical Psychology, 1975, 28, 88-102.
Wang, Y. Y. Probabilities of the Type I errors of the Welch tests for the Behrens-Fisher problem. Journal of the American Statistical Association, 1971, 66, 605-608.
Welch, B. L. Further note on Mrs. Aspin's tables and on certain approximations to the tabled functions. Biometrika, 1949, 36, 293-296.
Winer, B. J. Statistical principles in experimental design (2nd ed.). Toronto, Canada: McGraw-Hill, 1971.

Received June 7, 1976