7 Chance Variation
Mark Elwood
https://doi.org/10.1093/acprof:oso/9780198529552.001.0001
Published: 2007. Online ISBN: 9780191723865. Print ISBN: 9780198529552.
  Abstract
  This chapter discusses the effects of chance variation that can be assessed by applying statistical tests.
  It is divided into three parts: the application of statistical tests and confidence limits to a simple 2 × 2
  table; applications to stratified and matched studies, and to multivariate analysis; and life-table
  methods for the consideration of the timing of outcome events in a cohort study or intervention study.
  Self-test questions are provided at the end of the chapter.
  Keywords: variation, statistical tests, confidence limits, multivariate analysis, stratification, life-table
  methods
  Subject: Public Health, Epidemiology
     Although men flatter themselves with their great actions, they are not so often the result of a great design
     as of chance.
In this chapter, we will discuss the effects of chance variation that can be assessed by applying statistical
tests. The chapter falls into three parts: the application of statistical tests and confidence limits to a simple 2
× 2 table; applications to stratified and matched studies, and to multivariate analysis; and life-table
methods for the consideration of the timing of outcome events in a cohort study or intervention study.
Reference will be made to the Appendix, in which summaries of the statistical methods presented in this
book are given. Appendix Table 15 shows the relationships between the value of a statistic and the
probability or P-value. The conversion between statistical results and probability values can also be done
using a Microsoft Excel spreadsheet, and useful commands are listed in Appendix Table 16.
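For readers working outside Excel, the same conversions can be done in a few lines of Python; the sketch below uses the scipy library (an assumption of this illustration, not part of the book's appendices) to mirror the kind of commands listed in Appendix Table 16.

# Sketch: statistic-to-P-value conversions, analogous to the Excel
# commands of Appendix Table 16 (scipy is assumed to be available).
from scipy import stats

z = 2.80                                 # a standardized normal deviate
p_from_z = 2 * (1 - stats.norm.cdf(z))   # two-sided P, like =2*(1-NORMSDIST(2.8))

chi2 = 7.84                              # a chi-squared statistic on 1 d.f.
p_from_chi2 = stats.chi2.sf(chi2, df=1)  # upper-tail P, like =CHIDIST(7.84,1)

print(f"P from Z = 2.80:     {p_from_z:.4f}")      # about 0.0051
print(f"P from chi2 = 7.84:  {p_from_chi2:.4f}")   # about 0.0051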
         Part 1. Statistical tests and confidence limits in a simple table
         The third non-causal explanation for an association is that it is due to chance variation, or random number
         variation, the fall of the dice, bad luck, or whatever synonym you prefer. The science of statistics is
         concerned with measuring the likelihood (or probability) that a given set of results has been produced by this
         mechanism. In this chapter we shall look at the probability of chance variation being responsible for an
         observed association. This chapter is designed to act as a bridge between the rest of the text and a
         conventional basic course or text in biostatistics.
The main objective of this section is to see how the results of statistical tests fit in with other considerations
in assessing causality. These principles can be appreciated using simple examples, but also apply to
results which use much more complex statistics. Many important papers now published use complex
statistical methods that will not be familiar to the general reader. However, working through some simpler
and widely used statistical methods can clarify most of the general principles of interpretation of statistical
tests. Often, application of the simpler methods can greatly help understanding.
         We shall also present in more detail some relatively simple statistical methods, particularly a test for
         variation in a 2 × 2 table (the Mantel-Haenszel statistic) and variations of it that can deal with cohort data
         using person-time denominators, life-table methods, and matched studies. Because of the wide application
         and the excellent performance of these statistics, even in comparison with much more complex types, we
will present these in sufficient detail for readers to be able to apply such statistics to their own work and to
         the key results from published papers.
         So far in this text we have discussed the measurement of the association between an exposure and an
         outcome, in terms of the relative risk, odds ratio, or attributable risk. We have discussed how to judge
         whether the estimate of association is acceptable, as being reasonably free of bias and confounding. The
         order of consideration of non-causal explanations, observation bias, confounding, and chance, is
         important. If there is substantial observation bias, there is no point in adding a statistical analysis to a
         biased result. The assessment of bias depends on consideration of the design and conduct of the study. Some
         analysis methods may help to deal with bias, for example by restricting the study to certain subgroups,
         using a particular comparison group, and so on. Once observation bias has been dealt with, it is appropriate
         to consider confounding. We may have a situation in which there could be severe confounding by a factor
         not included in the study design, in which case further data manipulation is not helpful. More frequently,
there may be confounding which can be dealt with by stratification or multivariate analysis, or has been
         dealt with by randomization, matching, or restriction in the study design. These steps reviewed so far take
         us to the point of having an estimate of the association, which is, in our best judgement, not compromised
         by bias and is adjusted as far as possible for confounding. It is this estimate of the association on which we
         now concentrate and to which we can now apply statistical tests.
         Discrete versus continuous measures
         The statistical methods described here are limited to those applicable to discrete measures of exposure and
         of outcome, i.e. two or a small number of categories. This is for two main reasons. We are concerned mainly
         with disease causation and with the evaluation of interventions. In most applications, the outcome
         measures are qualitative: the onset of disease, death, recurrence, recovery, return to work, and so on. Even
         where the biological issues are quantitative, the practical issues are often qualitative, and it may be
appropriate to convert continuous data to a discrete form. For example, in a comparison of agents used to
control high blood sugar, the analysis may be based on a comparison of the change in blood glucose levels in
each group.
         The second reason is that introductory statistics courses and texts emphasize methods of dealing with
         continuous data: the normal distribution, t-tests, regression, analysis of variance, and so on. Thus methods
         applicable to discrete variables may be less familiar. Standard statistical texts should be consulted with
         regard to the analysis of data using continuous variables.
The statistical method the reader is most likely to be familiar with is that of significance testing. The
question is: is the difference in outcome between the two groups of subjects larger than we would expect to
occur purely by chance? Consider a simple intervention study (Ex. 7.1).
Ex. 7.1
This study gives the success rate of the new treatment as 20 per cent. Even accepting the study design as
being perfect with no bias, we would not interpret this as meaning that for all similar groups of subjects
exposed to this intervention, the success rate would be 20.00 per cent. The 200 subjects chosen are a sample
from the uncounted total of all possible subjects who could be given that intervention, and 20 per cent is the
estimate of the success rate in that total group of potential subjects. On the basis of pure chance, we should
understand that the next sample of 200 subjects would be likely to give a slightly different result. However,
20 per cent is our best estimate of the true success rate. Similarly, in the comparison group, our best
estimate is 10 per cent. The significance testing technique tests how likely it is that a difference as large
as the one we have seen (or larger) could occur purely by chance, if the true situation is that both the
intervention and the comparison groups have the same true success rate. This would occur on the null
hypothesis that the effect of the intervention is no different from that of the comparison therapy. This is
sometimes referred to as the concept that the two groups of subjects are independent samples drawn from
the same population. Therefore statistical tests test the hypothesis that the true success rate in the two
groups is the same, and that the observed differences are produced purely by chance variation around that
common value. Our best estimate of the common value of the success rate is based on all the subjects in the
study, and therefore is 15 per cent in this example.
The chi-squared test (Ex. 7.2a) applied to these data gives a value of 7.84, for which the probability is
less than 5 per cent (0.05), and so is conventionally accepted as ‘statistically significant’.
Ex. 7.2
Two appropriate statistical tests applied to a 2 × 2 table: arithmetic examples use the data shown in Ex. 7.1
Another commonly used test is a test of difference in proportions, testing whether the 20 per cent success rate
in the intervention group is different from the 10 per cent success rate in the control group (Ex. 7.2b). This
formulation yields a standardized normal deviate of 2.80. This also corresponds to a two-sided P-value of
between 0.01 and 0.001 (Appendix Table 15, p. 000), or using Excel we can enter =2*(1−normsdist(2.8)) and
this yields 0.005 (Appendix Table 16).
We have applied two statistical tests to the same data. They should give the same results. In fact, the chi-
squared statistic on one degree of freedom is the square of the normal deviate: 2.8² = 7.84. The cut-off point
at 5 per cent significance for χ² on one degree of freedom is 3.84, which is the square of 1.96, the cut-off for
the normal deviate. As an alternative to calculating the chi-squared statistic from the formula shown in Ex.
7.2, its square root, the chi statistic, can be calculated and looked up in tables of the normal deviate, which
are often more detailed than tables of the chi-squared statistic. This relationship between χ² and a
standardized normal deviate holds only for the situation where χ² has one degree of freedom.
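Both tests are easily reproduced by computer. The short Python sketch below applies them to the Ex. 7.1 data, with the counts (40/200 successes with the new treatment, 20/200 with the comparison) reconstructed from the 20 and 10 per cent rates quoted above; the scipy library is an assumption of the illustration.

import math
from scipy import stats

a, n1 = 40, 200        # successes, total: intervention group (20 per cent)
c, n0 = 20, 200        # successes, total: comparison group (10 per cent)

# (a) Chi-squared test on the 2 x 2 table, no continuity correction
table = [[a, n1 - a], [c, n0 - c]]
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(f"chi-squared = {chi2:.2f}, P = {p:.4f}")     # 7.84, 0.0051

# (b) Test of the difference between two proportions
p1, p0 = a / n1, c / n0
p_common = (a + c) / (n1 + n0)                      # pooled estimate, 0.15
se = math.sqrt(p_common * (1 - p_common) * (1 / n1 + 1 / n0))
z = (p1 - p0) / se
print(f"normal deviate = {z:.2f}")                  # 2.80; note z**2 = 7.84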
Studies which produce results that can be simplified to the 2 × 2 format exemplified here can be dealt with
using either of these techniques, and the statistical methods apply to both cohort and case–control designs.
The statistic tests the departure of the data from the null hypothesis of no association; therefore it can be
regarded as a test of the significance of a risk difference (attributable risk) from the null value of 0, or of the
difference of the risk ratio (relative risk or odds ratio) from the null value of 1.

The tests used above are two-sided tests; that is, they estimate the probability that if the null hypothesis
were true, a difference would occur which would be as large as or larger than that observed in either
direction, i.e. the intervention group having either the higher or the lower success rate. A one-sided test
assesses the probability of a difference as large as or larger than that observed in one specified direction only.
The expected value in each cell in a 2 × 2 table need not be a whole number; it has a continuous distribution.
However, the observed values must be whole numbers. In calculating the chi-squared statistic or normal
deviate, a ‘continuity correction’ which allows for this can be used (see Ex. 7.2c); it is sometimes called a
Yates’ correction after the statistician Frank Yates who introduced it. Its effect is to reduce the calculated
statistic somewhat; this reduction is greater when the number of observations in the table is small.
However, its use is controversial; some statisticians argue that the use of a continuity correction gives a
more accurate estimation of the P-value [1], while others disagree [2]. With reasonably large numbers, the
continuity correction will make very little difference. For the data in Ex. 7.2, the continuity corrected χ²
statistic is 7.08, compared with the uncorrected value of 7.84, corresponding to P-values of 0.0078 and
0.0051, respectively (Appendix Table 16). The test of comparison of proportions makes an assumption of
having reasonable numbers of subjects and gives the square root of the uncorrected χ² value.
Thus the issue of whether to use a continuity correction is related to how P-values are interpreted. If we
reduce the stress on particular cut-off values of the P-value, such as 0.05, the problem is put into
perspective. If the use of a continuity correction changes a result from being less than to being greater than
0.05, this merely shows that the true probability value is very close to 0.05 and should be interpreted
accordingly. Also, if the difference is substantial, it means the number of observations is small, and a better
solution is to use an exact probability test. One of these, the Fisher test for 2 × 2 tables, is described in
Appendix Table 4, but others are available for other situations and statistical advice should be sought. Many
computer programs provide exact tests. An approximate rule is that χ² statistics become unreliable where
any expected numbers in the tables are less than 5. This often occurs if tables with many cells are
generated, and the solution may be to combine some categories. The same consideration applies where
different results arise from different, but applicable, statistical tests, or even from the same test performed
using a different calculator or computer program. All these serve to emphasize that we should guard against
over-interpreting the precise value of the P-value, using it instead as a general measure of probability.
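The effect of the continuity correction on these data can be seen directly; in the sketch below, scipy's correction argument (an assumption of the illustration) switches the Yates' correction on or off.

from scipy import stats

table = [[40, 160], [20, 180]]          # the Ex. 7.1 / Ex. 7.2 data
for corrected in (False, True):
    chi2, p, _, _ = stats.chi2_contingency(table, correction=corrected)
    print(f"correction={corrected}: chi2 = {chi2:.2f}, P = {p:.4f}")
# correction=False: chi2 = 7.84, P = 0.0051
# correction=True:  chi2 = 7.08, P = 0.0078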
Many statistical tests are derived from the principle that if we calculate the difference between an observed
value a and its expected value on the null hypothesis E, square that, and divide it by the variance V of the
observed value, the quantity (the statistic) resulting will follow a χ² distribution on one degree of freedom
(1 d.f.), i.e.

χ² = (a − E)²/V.

As we shall see, this general formula can be applied to many different situations, using appropriate
calculations to obtain E and V.
The χ² distribution on 1 d.f. is simply related to the normal distribution. If a variable χ follows a normal
distribution, its square χ² will follow a χ² distribution on 1 d.f. Thus an equivalent formula to that given
above is

χ = √(χ²) = (a − E)/√V.

That is, the difference between the observed value a and its expected value E, divided by the standard
deviation of a (i.e. the square root of the variance) gives a normal deviate, often referred to as chi, χ, or Z.
A most useful test, the Mantel-Haenszel test, for a 2 × 2 table is shown in Ex. 7.3 [3]. It is of the above form,
with the χ² statistic being given by the squared difference between one value in the table and its expected
value, divided by the variance of the observed value. The value a is usually taken as the number of exposed
cases; the expected number is derived simply from the totals in the margins of the table as N₁M₁/T. The
variance is calculated from a formula based on the ‘hypergeometric’ distribution, also shown in Ex. 7.3.
Further explanation is not essential here; it applies to a table where the ‘marginal totals’ are fixed, and this
assumption is also made in the Fisher test and the usual χ² tests. From this, we can calculate χ² or χ using the
formulae given above. For a continuity correction, we reduce the absolute value a − E by 0.5, before
squaring. For the table in Ex. 7.1,
Ex. 7.3
         The Mantel-Haenszel test for a 2 × 2 table with reasonable numbers. This is applicable to both cohort and case–control data
observed value a = 40

expected value E = N₁M₁/T = 200 × 60/400 = 30

variance of a, V = N₁N₀M₁M₀/[T²(T − 1)] = (200 × 200 × 60 × 340)/(400² × 399) = 12.78

Then

χ² = (a − E)²/V = (40 − 30)²/12.78 = 7.82

χ = 2.80
and from Appendix Table 15 or 16, P = 0.005 (two-sided). These values are almost identical to those given by
the tests used in Ex. 7.2. If a continuity correction is used, χ² = 7.06, P = 0.008.
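A minimal sketch of this calculation in Python, reproducing the worked figures above from the marginal totals:

import math

def mantel_haenszel_2x2(a, b, c, d, continuity=False):
    # a = exposed cases; the 2 x 2 table has rows (a, b) and (c, d)
    n1, n0 = a + b, c + d                        # row totals
    m1, m0 = a + c, b + d                        # column totals
    t = n1 + n0                                  # grand total
    e = n1 * m1 / t                              # expected value of a
    v = n1 * n0 * m1 * m0 / (t**2 * (t - 1))     # hypergeometric variance
    diff = abs(a - e) - (0.5 if continuity else 0.0)
    return e, v, diff**2 / v                     # E, V, chi-squared (1 d.f.)

e, v, chi2 = mantel_haenszel_2x2(40, 160, 20, 180)
print(f"E = {e:.0f}, V = {v:.2f}, chi2 = {chi2:.2f}, chi = {math.sqrt(chi2):.2f}")
# E = 30, V = 12.78, chi2 = 7.82, chi = 2.80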
This test, as shown in Ex. 7.3, was first applied to case–control studies [3], and is also applicable to cohort
studies and surveys using count data. A very similar formula, shown in Ex. 7.4, is used for cohort data with a
person-time denominator. Later in this chapter we will discuss how this basic test, with some variations,
can be applied to stratified data, to matched studies, and to cohort studies taking into account follow-up
time.
Ex. 7.4
         Person-time data: for cohort data using person-time denominators, formulae very similar to those shown in Ex. 7.3 are
         applicable. For a continuity correction, subtract 0.5 from the absolute value of a − E
These formulae are ‘asymptotic’, i.e. they are derived by making assumptions which are valid only where
reasonable numbers of observations are available; a guide to ‘reasonable’ would be that the smallest
expected number in the 2 × 2 table on which the result is based should be greater than five. This limitation
does not apply to stratified subtables, which can be smaller if only the summary estimates are to be used.
Where numbers of observations are smaller, ‘exact’ tests should be used, such as Fisher’s test for a simple 2
× 2 table; there are several other exact tests for different situations [1,4].
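Exact tests are widely available in software; as a sketch, scipy's Fisher test (an assumption of this illustration) applied to a small hypothetical table:

from scipy import stats

table = [[7, 3], [2, 8]]   # hypothetical small table; expected numbers below 5
odds_ratio, p_two_sided = stats.fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, exact two-sided P = {p_two_sided:.3f}")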
         The concept of precision; confidence limits
These tests of significance yield a single value, which is the probability of a difference the same as or larger
than that observed in the study occurring purely by chance, on the null hypothesis that the outcome is
the same in each of the groups being compared. Two sets of data representing trials of therapy are shown in
Ex. 7.5; in both, the success rate is twice as high with the new therapy. In study A, the difference seen would
have occurred purely by chance about once in 200 occasions (P = 0.005). In study B, it would occur purely by
chance on about 14 per cent of occasions (P = 0.14). Using the conventional P = 0.05 cut-off, the result in A is
significant, whereas that in B is non-significant.
Ex. 7.5

Two simple comparative studies: hypothetical data. Statistics calculated without continuity corrections. With a continuity
correction they are 7.08 and 1.35, respectively.
Reporting results by concentrating only on significance tests and P-values at best makes poor use of the
data, and at worst can be misleading. A superficial review of these two studies might conclude that they are
inconsistent, as study A shows a significant benefit while study B shows no significant difference. In fact,
the two studies are consistent, showing exactly the same advantage for the new therapy. It is the precision
of the studies, not the estimate of effect, which differs.
The way to avoid this dependency on an arbitrary cut-off, and to use the information in the study more
fully, is to calculate confidence limits for the result rather than the P-value. The concept is as follows. Any
one study provides one estimate of the association; for example, the relative risk shown in the study is an
estimate of the true relative risk. This estimate has variability; another study of the same design would give
a different estimate of this relative risk. What we are interested in is the true relative risk in the population
from which these samples of participants have been drawn. Ex. 7.6 shows a diagrammatic representation of
the statistical test applied to the data from study B in Ex. 7.5. In hypothesis testing, a specific value is stated
as the prior hypothesis, usually as here a relative risk of 1, representing the null hypothesis of no association.
The statistical test then assesses if the observed value of the relative risk is consistent or inconsistent with
this prior value, using a preset level of probability. To be able to do this we must hypothesize what the true
relative risk in the underlying population is, and how the risks observed in a multitude of small samples will
be distributed.
         Ex. 7.6
Exhibit 7.6, part A, shows the expected distribution of the observed relative risk in a multitude of samples of
the same size as the one we have used, on the assumption that the null hypothesis applies and the true
relative risk is 1.0. Therefore the distribution is centred on the null hypothesis value of 1.0. It has a normal
distribution, but the scale of relative risk is logarithmic. This is because relative risk is a ratio measurement;
relative risk values of 2.0 and 0.5 are different from the null value of 1.0 by different amounts on an
arithmetic scale, but by the same amount on a logarithmic scale. The width of the normal curve is
determined by the standard deviation of these estimates, which is estimated by the standard deviation of
the observed relative risk value. We shall come to the issue of how we can calculate that in due course. Note
that in the diagram the observed value of relative risk, 2.0, lies within the central part of the distribution and
therefore is a value quite ‘likely’ to occur in taking a sample from such a distribution centred on 1.0. An
arbitrary value of 5 per cent probability has been used as a definition of ‘likely’. We conclude that although a
relative risk of 2.0 has been observed, the relative risk in the population from which the sample was drawn is
quite likely to be 1; this is what we mean when we say that the result is not statistically significant.
However, it is clear that the distribution could be moved and centred on many other values, while still
keeping our observed value in the central ‘acceptable’ region. If we take successively lower values for the
population relative risk as the hypothesis, this is equivalent to moving the distribution to the left. The
standard deviation does not change, since it does not depend on the central value. We can move the
distribution to the left until it reaches the point shown in Ex. 7.6, part B. If we moved it any further, the
observed value would be moved into the region defining values with a probability less than 0.05, such that
we should reject the hypothesis that the observed relative risk is consistent with the hypothesized value of
the relative risk in the population. The value of the centre of the distribution, when the observed relative
risk is at this cut-off point, is then the lowest ‘acceptable’ value for the population relative risk; it is referred
to as the lower 95 per cent confidence limit. Similarly, moving the distribution to the right gives
successively higher values of the population relative risk, until we reach a value such that the observed
relative risk reaches the critical point in the lower tail of the distribution (Ex. 7.6, part C). This gives the
upper limit of the ‘acceptable’ population relative risk, the upper 95 per cent confidence limit. Any value
between the two limits defined in this way is ‘acceptable’ as a value for the relative risk in the population
from which our sample is drawn. In this way we can define limits for the relative risk, and we can be 95 per
cent confident that these limits will include the true value.
While the use of these diagrams can aid in the understanding of confidence limits, it does not provide a
practical method for calculating them. However, the basic formulae can be derived from the diagram. Given
reasonable numbers in the samples studied, these distributions are normal, and therefore the points which
define the 5 per cent rejection region are located 1.96 standard deviations from the mean. This standard
deviation is equal to the standard deviation of the logarithm of the estimated relative risk. We use natural
(base e) logarithms. Thus, if we know the logarithm of the observed relative risk (ln RR) and its standard
deviation (dev ln RR), we can calculate the 95 per cent confidence limits as follows:

95 per cent confidence limits of ln RR = (ln RR − 1.96 dev ln RR) and (ln RR + 1.96 dev ln RR);

the confidence limits for RR are then the exponentials (antilogs) of these two values.
Of course, limits other than 95 per cent can be calculated using different values of the normal deviate
corresponding to that proportion, which can be obtained from a table of the normal distribution such as that
given later in Ex. 7.15 or Appendix Tables 15 or 16. Thus, for 99 per cent two-sided confidence limits, we use
a value of 2.58 instead of 1.96. Just as we have discussed one-sided and two-sided statistical tests, we can
use one-sided or two-sided confidence limits, with the same issues being involved. The above example is
based on two-sided limits.
Ex. 7.15
Normal deviates corresponding to frequently used values for significance levels (Zα) and power (Zβ); and table of K, where
K = (Zα + Zβ)². The value of Zβ is the normal deviate corresponding to the one-sided test for (1 − power)
         The calculation of dev ln RR, the standard deviation of the logarithm of the relative risk, is not always
         simple. Formulae for the calculation of standard deviations for various study designs are given in Appendix
         Tables 1–3.
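As a sketch of this calculation, the code below computes 95 per cent limits for the relative risk in study A of Ex. 7.5; the standard error formula used is the usual one for risk (cumulative incidence) data, given here as an assumed stand-in for the design-specific formulae of Appendix Tables 1–3.

import math

a, n1 = 40, 200        # successes / total, intervention (Ex. 7.5, study A)
c, n0 = 20, 200        # successes / total, comparison

rr = (a / n1) / (c / n0)                           # relative risk = 2.0
ln_rr = math.log(rr)
dev_ln_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)     # standard deviation of ln RR

lower = math.exp(ln_rr - 1.96 * dev_ln_rr)
upper = math.exp(ln_rr + 1.96 * dev_ln_rr)
print(f"RR = {rr:.2f}, 95% limits = {lower:.2f} to {upper:.2f}")
# about 1.21 to 3.30, close to the test-based limits of 1.23 to 3.25 derived later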
In presenting study results, confidence limits are much more informative than P-values. Many leading
clinical and epidemiological journals will no longer accept papers using only P-values. Confidence limits are
particularly useful when the result of the study is non-significant. The presentation of confidence limits
may guard against the common fallacy of interpreting studies like study B as showing ‘no difference’
between the intervention and control group. While the overall result is consistent with the null hypothesis,
the confidence limits show that the study is also consistent with a relative risk of 5.0, a considerable effect. A
better interpretation of study B is that, while it has demonstrated a relative risk of 2.0, the 95 per cent
confidence limits show that the result is consistent with anything from a small detrimental effect of the
intervention to a large beneficial effect, and therefore the study is inconclusive rather than demonstrating
that the intervention has no effect.
         Approximate test-based limits
In addition to the above procedures, a simple method which allows the standard deviation, and therefore
confidence limits, to be calculated from a test statistic is sometimes useful [5], but it should be seen only as
an approximate method (Ex. 7.7). It is derived from the logic that, for a normally distributed variable,
the difference between the observed value and the expected value, divided by the standard deviation (the
square root of the variance V), gives a normal deviate. This normal deviate is calculated as the test statistic χ
or √(χ²) from any of the tests we have described; the χ statistic without a continuity correction should be used.
Thus the Mantel-Haenszel analysis will give us ln RR, the logarithm of the RR (or OR), and χ. The expected
value of RR or OR is 1, and so the expected value of its logarithm is zero. Therefore dev ln RR can be
calculated as follows:

dev ln RR = ln RR/χ.

Thus for the data shown in Ex. 7.5, study A, the standard deviation is ln 2.0/2.80 = 0.248,
which gives limits of 1.23 and 3.25. As we already know from the P-value, these limits for study A do not
include the value 1.0. For study B the limits are 0.79 and 5.10, and these are shown in Ex. 7.6; they do include
the null hypothesis value.
The test-based method is reasonably accurate for relative risks and odds ratios close to 1 (but not exactly 1,
where it gives no result), which are based on fairly large numbers of observations. It can be applied both to a
single table and to a stratified data set, as it is based on the Mantel-Haenszel estimates of odds ratio or
relative risk, and the summary chi statistic. These confidence limits are derived from exactly the same
information as went into the P-values, and are only approximations. The calculation uses an assumption of
equivalence of two different statistics, which only applies under the null hypothesis and gives reasonable
results only when the odds ratio or relative risks are reasonably close to the null, between 0.2 and 5.0. The
method is even less reliable with risk difference measures [6]. It is better to use the formulae for standard
deviation of ln RR and the confidence limits that are given in Appendix Tables 1–3, and such calculations are
available in many computer programs.
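A sketch of the test-based method, reproducing the limits quoted above; the chi value for study B is back-calculated from its reported P-value of 0.14, so the figures are approximate.

import math

def test_based_limits(rr, chi, z=1.96):
    ln_rr = math.log(rr)
    dev = ln_rr / chi                    # test-based standard deviation of ln RR
    return math.exp(ln_rr - z * dev), math.exp(ln_rr + z * dev)

print(test_based_limits(2.0, 2.80))      # study A: about (1.23, 3.25)
print(test_based_limits(2.0, 1.48))      # study B (chi from P = 0.14): about (0.80, 5.01)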
Part 2. Statistical methods in more complex situations: Analyses with
stratification

The use of stratification to control for a confounding variable was discussed in Chapter 6. The Mantel-
Haenszel test is easily applied to stratified data from any type of study. The test can be regarded as a test of
the summary relative risk or odds ratio estimate.
We recall that the Mantel-Haenszel χ² test for a 2 × 2 table is calculated as (Ex. 7.3)

χ² = (a − E)²/V

where a is one of the values in the table, such as the number of exposed cases, E is its expected value, and V
is its variance. If the data are stratified into several 2 × 2 tables, a χ² test for the association after
stratification is given by

χ² = (Σᵢaᵢ − ΣᵢEᵢ)²/ΣᵢVᵢ

where Σᵢaᵢ means ‘the sum of the values a in each table, represented by a₁, a₂, a₃, … over I tables, where I
is the number of tables’, i.e. the values of a, E, and V are calculated in each table and summed, and χ² is
calculated using the three summations; it still has one degree of freedom irrespective of the number of
subtables.
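A minimal sketch of the stratified calculation: a, E, and V are accumulated over the strata before the statistic is formed.

def mh_stratified(tables):
    # tables: a list of (a, b, c, d) 2 x 2 tables, one per stratum
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        n1, n0 = a + b, c + d
        m1, m0 = a + c, b + d
        t = n1 + n0
        sum_a += a
        sum_e += n1 * m1 / t
        sum_v += n1 * n0 * m1 * m0 / (t**2 * (t - 1))
    return (sum_a - sum_e)**2 / sum_v    # summary chi-squared, 1 d.f.

# Hypothetical two-stratum example (not the Ex. 7.8 data):
print(f"summary chi2 = {mh_stratified([(20, 80, 10, 90), (15, 35, 10, 40)]):.2f}")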
An application to a simple case–control study is shown in Ex. 7.8; these data were shown previously in Ex.
6.23 and relate to the association between melanoma and sunburn history. The single-factor table shows an
odds ratio of 1.40, which is statistically significant: χ² = 4.94, P = 0.03 (Appendix Table 15, p. 552). However,
part of the association is produced by confounding by tendency to sunburn. Adjustment for this by
stratification gives an odds ratio of 1.19, which is not statistically significant; the summary χ² = 1.16, P = 0.3,
indicating that this or a more extreme result would occur in nearly one of three studies of this size if the null
hypothesis were true.
Ex. 7.8
Calculation of the Mantel-Haenszel χ² statistic from the case–control data shown in Ex. 6.23, p. 194.
         The same formula is applicable to cohort or intervention studies with observations on individuals, as shown
         in Appendix Table 2. For person-time data from a cohort study, the values of Ei and Vi are calculated using a
         formula adapted from that in Ex. 7.4, which is shown in Appendix Table 3.
         Assumptions in stratification: interaction
The calculation of a summary measure of odds ratio or relative risk implies that this single estimate applies
to all the different strata. This is usually a reasonable a priori assumption, but it should be checked by
examining the stratum-specific ratios. While substantial variation in stratum-specific odds ratios may be
due only to small numbers, a summary estimate may disguise important real variation. With an ordered
confounder, such as age, there may be a regular trend in the odds ratio estimates over strata of the
confounder.
A χ² test of the homogeneity of the odds ratios over the strata is given by

χ² = Σᵢ (Lᵢ − Lₛ)²/Vᵢ

where Lᵢ and Vᵢ are the log odds ratio and its variance for each subtable, and Lₛ is the log of the summary
odds ratio (e.g. the Mantel-Haenszel odds ratio). Given I subtables, χ² has I − 1 degrees of freedom. The
formulae for the variances in the different study designs are given in the Appendix. This is a test of the
hypothesis that the odds ratios are homogeneous. Other tests can assess a linear trend in the odds ratios
over the strata, which may sometimes be more relevant. However, substantial numbers of observations are
required to detect heterogeneity in stratum-specific relative risks.
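A sketch of such a heterogeneity test is given below; the variance formula for the log odds ratio of each case–control subtable is the usual Woolf formula, an assumed stand-in for the design-specific formulae in the Appendix, and the three subtables are hypothetical.

import math
from scipy import stats

def heterogeneity_chi2(tables, ln_or_summary):
    # tables: list of (a, b, c, d); ln_or_summary: log of the summary OR
    chi2 = 0.0
    for a, b, c, d in tables:
        l_i = math.log((a * d) / (b * c))        # log odds ratio of subtable
        v_i = 1/a + 1/b + 1/c + 1/d              # Woolf variance of ln OR
        chi2 += (l_i - ln_or_summary)**2 / v_i
    return chi2, len(tables) - 1                 # statistic and its d.f.

tables = [(30, 10, 10, 60), (25, 15, 12, 50), (20, 20, 15, 45)]
chi2, df = heterogeneity_chi2(tables, math.log(6.3))
print(f"chi2 = {chi2:.1f} on {df} d.f., P = {stats.chi2.sf(chi2, df):.3f}")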
Exhibit 7.9 shows an example of variation of the odds ratio estimate. In this study 217 patients with in situ
carcinoma of the cervix were compared with 243 population-based controls [7]. A significant association
between the disease and smoking was seen, with an odds ratio of 6.6, which is highly significant (Table A).
After subdivision into three categories by age, the Mantel-Haenszel age-adjusted odds ratio is 6.3, only
slightly reduced compared with the crude odds ratio, and has 95 per cent confidence limits of 4.1 to 9.5. Thus
age is not a major confounder, as is expected because frequency matching on age has been done. However,
the odds ratios in each of the three strata of age are considerably different, being 27.9, 5.9, and 2.8 in
successively older age groups; all these odds ratios are significantly different from 1, but the confidence
limits show that the odds ratio in the youngest age group is significantly different from the overall Mantel-
Haenszel estimate, and also significantly different from the other odds ratios. The chi-squared statistic
testing for heterogeneity of the odds ratio, calculated as above, is 12.1 on 2 d.f., P = 0.01, and so there is
statistically significant variation in the stratum-specific odds ratios. The summary odds ratio of 6.3 is not
an adequate description of the relationship of smoking with disease, and so in this study the results for the
different age groups should be shown and discussed. Further stratification to control for religious affiliation
and number of sexual partners reduced all the odds ratios, but the significant variation in odds ratios in
different age groups remained.
         Ex. 7.9
An example of interaction, defined here as non-homogeneity of the odds ratio in different categories of a third, modifying, variable.
This analysis assumes a multiplicative model, as discussed in Chapter 3; the relative risk estimates are
assumed to be constant. This multiplicative model is assumed in techniques such as the Mantel-Haenszel
estimate of odds ratio, and in many multivariate analyses, such as those using the logistic regression model.
The attractiveness of this model arises from both its mathematical advantages, as it is a relatively simple
model, and the empirical evidence that it appears to apply to a wide range of biomedical applications.

The situations where such a model can be tested are those in which there are two or more strong risk
factors, and enough subjects with the various combinations of categories of these risk factors can be
identified and their risks assessed directly. Important examples in medicine which have shown that a
multiplicative model appears to be an appropriate description of the natural state of affairs include the
interactions between several major risk factors for coronary heart disease, as shown in the Framingham
Study and other major prospective studies [8,9], and the interactions between factors relating to several
cancers, such as smoking and asbestos exposure in relation to lung cancer [10] and several others.
There are also some examples where this model does not seem to be appropriate. One relatively simple
alternative, also described in Chapter 3, is an additive model in which the risk of disease in subjects exposed
to more than one factor is the sum of the excess risks conferred by each of the factors. Such a model implies
that the absolute excess risk associated with one factor will be constant in different categories of other
factors. This means that the relative risk or odds ratio must vary. For example, in Ex. 7.9 the odds ratio
varies with age. However, we know that the baseline risk of in situ carcinoma of the cervix increases with age
over the age range given in non-smokers; therefore it is possible that the decrease in odds ratio with
increase in age may reflect a relatively constant attributable risk associated with smoking in different age
groups. We cannot directly test this hypothesis from this case–control study alone, although it could be
assessed by using information on the absolute risk in different age categories relevant to the same
population. The appropriateness of these different models, and their relevance in both public health and
biological terms, has been much discussed. The terms ‘interaction’, ‘synergism’, or ‘antagonism’ can all be
used in these contexts, but they are meaningful only if the underlying model regarded as the non-
interaction situation is described. Caution in the acceptance and interpretation of interaction is desirable, as
it is difficult to detect unless the risk factors concerned have large effects and subjects representing a wide
range of categories of the joint distribution of the factors involved are available. In the analysis, it is
appropriate not only to show whether a model such as a multiplicative model fits the data, but also that
alternative models do not.
         Ordered exposure variables and tests of trend
Frequently the outcome variable has only two categories, while the exposure or intervention variable may
have a number of ordered categories. For example, the incidence of lung cancer, a yes/no outcome variable,
may be compared in a number of groups of individuals categorized by different levels of smoking, or the
survival of a group of patients, a yes/no outcome variable, may be described in terms of a graded measure of
the severity of the disease.
To compute relative risks or odds ratios with several exposure categories, an arbitrary category, usually the
unexposed or lowest exposure group, is chosen as the reference category, and each other category is compared with it.
Data from a case–control study assessing risk factors for twin births [12] are shown in Ex. 7.10. The data
show the numbers of cases (twin births) and controls (single births) by maternal parity, in four groups. An
ordinary or global chi-squared statistic can be calculated for this 2 × 4 table; it has three degrees of freedom
and tests the hypothesis of homogeneity, i.e. that the ratio of cases to controls and therefore the odds ratio
is the same in each category. The result is χ² = 91.2, P < 0.001 (Appendix Table 15 or 16). This statistic would
be the same if the same data were in a different order, for example if the odds ratios with increasing parity
were 1.0, 1.84, 1.36, and 1.17, although such an irregular pattern would make a direct effect of parity less
plausible. It is more informative to apply a linear trend test. Using the formula given in Appendix Table 7,
with scores of 0, 1, 2, and 3 for parity, the trend test yields χ² = 88.0, d.f. = 1, P = 0.0001. The deviation from
the linear trend is given by the difference between these, giving χ² = 3.2, d.f. = 2, P = 0.2; this is non-
significant. The interpretation is that there is a positive association between the frequency of twin births
and greater parity of the mother, which is consistent with a linear trend.
Ex. 7.10
         Application of a test for trend to an ordered variable. Data from Elwood [12]. Formulae given in Appendix Table 7, p. 530.
Such trend tests should be used with caution. Particularly where there are only a few categories of the
exposure variable, a trend can be fitted and may be significant even if inspection of the data shows no
regular pattern over the ordered exposure categories. The tests are appropriate primarily where there is an a
priori hypothesis of an approximately linear relationship between odds ratio or relative risk and the ordered
exposure variable. In such circumstances, the test can be more powerful than the standard chi-squared test
for homogeneity in an n × 2 table.

The Mantel trend test can deal easily with stratified data, so that the Mantel-Haenszel estimator of the odds
ratio for each category can be calculated after stratification, and the test then assesses a linear trend in these
adjusted estimates.
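As a sketch, the trend statistic can be computed as below; the formula is the usual Mantel extension test, given as an assumed rendering of the version in Appendix Table 7, and the cell counts are hypothetical since the raw Ex. 7.10 data are not reproduced here.

def trend_chi2(cases, controls, scores):
    # Mantel extension test for linear trend over ordered categories (1 d.f.)
    m1, m0 = sum(cases), sum(controls)
    n = [a + c for a, c in zip(cases, controls)]          # category totals
    t = m1 + m0
    sum_sn = sum(s * ni for s, ni in zip(scores, n))
    sum_s2n = sum(s * s * ni for s, ni in zip(scores, n))
    o = sum(s * a for s, a in zip(scores, cases))         # observed score sum
    e = m1 * sum_sn / t                                   # its expected value
    v = m1 * m0 * (t * sum_s2n - sum_sn**2) / (t**2 * (t - 1))
    return (o - e)**2 / v

chi2 = trend_chi2([50, 80, 90, 100], [150, 140, 110, 90], [0, 1, 2, 3])
print(f"trend chi2 = {chi2:.1f} on 1 d.f.")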
As shown in Chapter 6, in a case–control study controls can be chosen to be matched to the cases on certain
confounding factors, and this technique will control confounding only if the appropriate matched
analysis is used, based on the matched sets of cases and controls. For matched case–control data with
one control per case, the resultant analysis is simple, and the appropriate statistical test is McNemar’s chi-
squared test [13], which is shown in Ex. 7.11. This test is a special application of the Mantel-Haenszel test
for subtables each representing one matched pair. Note that for the calculation of both the odds ratio and
the statistic, the only contributors are the pairs which are disparate in exposure, i.e. the pairs where the case
was exposed but the control was not, and those where the control was exposed and the case was not. On the
null hypothesis the numbers of each of these will be the same.
Ex. 7.11
                                                                                                2
         Statistical test for 1:1 matched case–control studies: note that both the odds ratio and χ depend only on the numbers of
         discordant pairs. Data from Ex. 6.26, p. 199.
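The arithmetic of McNemar's test is short enough to sketch directly; the discordant-pair counts below are hypothetical, not the Ex. 6.26 data.

r = 30   # pairs with the case exposed and the control not (hypothetical)
s = 15   # pairs with the control exposed and the case not (hypothetical)

odds_ratio = r / s                                 # matched-pair odds ratio
chi2 = (r - s)**2 / (r + s)                        # McNemar chi-squared, 1 d.f.
chi2_corrected = (abs(r - s) - 1)**2 / (r + s)     # with continuity correction
print(f"OR = {odds_ratio:.1f}, chi2 = {chi2:.2f}, corrected = {chi2_corrected:.2f}")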
For a matched case–control study in which a fixed ratio other than one to one is used, such as two or three
controls per case, formulae for the calculation of odds ratio estimates and statistical tests are given in
Appendix Table 6. Matched cohort studies will not be discussed in detail, as they are less common. The
measures of association (relative and attributable risk) are calculated in the same way as for an unmatched
cohort study. For statistical tests, the methods for matched case–control data are applicable, although
usually very similar results are given by analysing the data in the unmatched format.
         Statistical tests in multivariate analysis
An introduction to multivariate analysis in the control of confounding was given in Chapter 6, and the
multiple logistic model was described. In this model, each independent variable represents an exposure
factor. For binary factors (those with only two categories), the odds ratio is estimated as the exponential of
the coefficient for that variable, if appropriate coding has been used. There are two main ways in which
statistical tests are applied to such models. One is to estimate the significance of each coefficient by taking
the ratio of its value to that of its standard error as a standardized normal deviate. The other is to assess the
goodness of fit of an entire model to the data set, comparing models by calculating a statistic representing
the difference in deviance between them.
The statistical aspects of the multivariate analysis that was previously presented in Chapter 6 (Ex. 6.27) are
shown in Ex. 7.12. In this case–control study involving 83 cases and 83 controls [14], the overall model with
no independent variables fitted has n − 1 = 165 d.f. and a deviance of 230.1. The results of fitting variables
singly into the model are equivalent to those from cross-tabulations. Fitting the two binary variables, which
together represent the number of moles, gives a change in deviance of 38.1 on 2 d.f., P < 0.001. Similarly, for
sunburn, coded as one binary variable, the chi-squared statistic is 13.0 on 1 d.f., P < 0.001. Therefore both
factors show significant associations in single-factor analyses. These chi-squared statistics are the same as
those obtained from cross-tabulations (with no continuity correction). The coefficient from the model
where only one factor is fitted gives the crude odds ratio associated with that variable, and most computer
programs for this analysis produce a standard error estimate of this coefficient, which can be used to
calculate confidence limits.
Ex. 7.12
Multivariate analysis: the analysis previously shown in Ex. 6.27, fitting a multiple logistic model to data from a case–control study
of 83 patients with melanoma and 83 controls. Coeff, fitted coefficient; Std error, standard error of coefficient; OR, odds ratio =
exponential (coefficient); Limits, exp(coeff ± 1.96 × std error); deviance, log likelihood statistic.
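As a sketch of how such a model is fitted and read in practice, the code below uses the statsmodels library on synthetic data (both are assumptions of this illustration, not the Ex. 7.12 analysis): the odds ratio is the exponential of the fitted coefficient, and the limits are exp(coeff ± 1.96 × std error).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 166                                    # e.g. 83 cases and 83 controls
exposure = rng.integers(0, 2, size=n)      # binary exposure, coded 0/1
confounder = rng.integers(0, 2, size=n)    # binary confounder, coded 0/1
logit = -0.5 + 0.9 * exposure + 0.4 * confounder
outcome = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

X = sm.add_constant(np.column_stack([exposure, confounder]))
fit = sm.Logit(outcome, X).fit(disp=0)

coef, se = fit.params[1], fit.bse[1]       # coefficient for the exposure
print(f"OR = {np.exp(coef):.2f}, 95% limits = "
      f"{np.exp(coef - 1.96 * se):.2f} to {np.exp(coef + 1.96 * se):.2f}")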
In this analysis there were three other relevant factors (number of freckles in three categories, hair colour in
three categories, and usual skin reaction to sun exposure in four categories). Therefore the full model fitted
had 10 variables representing the number of moles, sunburn history, and the three other
factors, and this full model had a deviance of 157.7 with n − 10 − 1 = 155 d.f. The results from this full model
give coefficients whose exponentials give the odds ratio of each variable, controlled for the presence of all
the other variables in the model. The overall effect of a factor represented by a number of variables is best
judged by comparing the deviance of the full model, with all factors fitted, with that of the model with that
one variable removed. This change in deviance shows the effect of that single factor given that all the other
factors are included. Thus, as shown in Ex. 7.12, the factor of number of moles remains highly significant in
the presence of all other factors, as removing it from the full model gives a significant increase in the
deviance.
Multivariate analysis for individually matched studies uses ‘conditional’ models which take account of the
matching, i.e. they consider matched sets of cases and controls as sets. They require skilled application. The
results are presented in the same way as has been shown, with the same interpretation of the coefficients.
Further information is given in the literature [4,15].
Some general points about the interpretation of published results can be emphasized. First, let us consider
the situation where the results are reported as ‘statistically significant’. To interpret this, we must first
consider whether the results reported are free from problems of observation bias and confounding. The
statistical significance of the result is in itself no protection against these problems. Indeed, an easy way to
produce highly significant results is to use a design which is open to severe observation bias; for example,
biased recall between cases and controls in a retrospective study, or an intervention study using a subjective
outcome measure made by someone involved in the intervention being assessed. When these issues have
been dealt with, the next step is to know whether the statistical methods used are appropriate and correctly
applied.
The issue of multiple testing poses a particular problem in interpretation. The familiar statistical tests such
as those that have been described above are designed for hypothesis testing, to be applied to one particular
result that has arisen in the course of a study designed to test that association. Where a study produces a
large number of associations, such as a study of patient prognosis which assesses 20 factors and uses
conventional 5 per cent significance level tests, we expect at least one of these factors to appear as
statistically significant even if none of them in truth is related to the outcome. Greater problems arise in
observational studies where very large numbers of factors can be assessed, or in explorations of very large
data sets. For example, comparisons of all causes of mortality with occupational categories using death
registrations and census data may involve comparisons of perhaps 100 categories of causes of death with
several hundred possible occupations. Genetic and proteomic studies may assess fairly small numbers of
subjects, applying very large banks of markers, which will produce many associations by chance alone.
Special statistical methods, some of which are quite complex, have been developed for these situations. In
intervention trials, repeated testing may be planned, as discussed below.
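The arithmetic behind the multiple-testing problem is simple; assuming 20 independent tests at the 5 per cent level and no true associations:

# With 20 independent tests at the 5 per cent level and no true effects,
# the chance of at least one 'significant' result is substantial.
p_at_least_one = 1 - 0.95**20
print(f"P(at least one significant result in 20 tests) = {p_at_least_one:.2f}")
# about 0.64; the expected number of 'significant' results is 20 x 0.05 = 1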
The issue of to what extent the existence of other comparisons should be taken into account when assessing
a particular result is one of considerable dispute. It is valuable to distinguish between results from
hypothesis testing studies that have been specifically set up to test a particular hypothesis, and those from
studies in which a large number of associations have been examined. The latter studies may be considered
as having a hypothesis generation function, and the validity of particular results from the study will be
uncertain until confirmatory evidence is available from further work. A further issue in the interpretation of
positive results is publication bias, which is discussed in the next chapter.
This 23 per cent reduction in deaths is obviously clinically important, and is statistically significant. With
these data, relative risks could equally easily be used.
Ex. 7.13
Dangers of subgroup analysis: data from a large randomized trial of aspirin in treatment of a suspected acute myocardial
infarction, showing total results, and a subgroup analysis by prior myocardial infarction, one of 26 subgroup analyses done. Also
shown is an analysis of astrological birth sign. From ISIS-2 (Second International Study of Infarct Survival) Collaborative Group
[16].
A small number of subgroup analyses were planned in the protocol to the study, but many others were
carried out, comparing groups subdivided by sex, age groups, different levels of blood pressure and heart
rate, with or without cardiograph abnormalities, and so on. One result was that no beneficial effect of
aspirin was seen in patients who had had a previous myocardial infarction (odds ratio 1.02), and the effect
was correspondingly greater in those who had not had a previous infarction (odds ratio 0.71). This
difference is substantial, the confidence limits do not overlap, and a statistical test of heterogeneity shows
that the difference in effects is highly significant (P = 0.002). This would seem to be clinically important, as
it suggests that aspirin is a useful therapy only for patients who have no previous history of myocardial
infarction. However, this subgroup analysis was unplanned and was one of 26 such analyses, of which this
was the only one which showed a statistically significant variation. Thus, despite its statistical significance,
this is likely to be a chance finding, and in the discussion in the paper the investigators suggest that it is
implausible, given other knowledge of the effects of aspirin on heart disease. To demonstrate the dangers of
unplanned subgroup analyses, the investigators also published the results from a subdivision of patients by
astrological birth sign. As shown in Ex. 7.13, this showed that patients with Gemini or Libra birth signs
showed no benefit of aspirin (odds ratio 1.09), and the benefit was confined to those with other birth signs
(odds ratio 0.72). This difference is also statistically significant, with a similar P-value for heterogeneity as
was found in the other subdivision (P = 0.002).
Therefore, to be reliable, a subgroup analysis should be defined in advance, before the data are examined,
giving a specific hypothesis that can be tested. In contrast, the subgroup analysis by birth sign used one of
many possible ways of grouping birth signs, and this choice was influenced by the data. Such analyses have
been referred to as post hoc, after the fact, data-driven, or the result of data dredging or fishing.
In intervention trials, it is often desirable to monitor the results at regular intervals, or continuously, so that
the trial can be stopped as soon as a definitive result is obtained, or when it is clear that no significant
difference will be seen. This introduces a problem of multiple testing, and specific statistical methods have
been developed, such as sequential trial designs [17]. An example of these is shown in the trial discussed
in Chapter 11. It also requires an independent data-monitoring group, as the investigators themselves should
not know the ongoing results of the trial. Most major trials have an independent data-monitoring group,
which will assess interim results and decide if the trial should be continued or stopped earlier than
anticipated. This is only one of their functions; they act as an independent assessment group on all aspects
of the trial design, and particularly the collection, coding, input, and interpretation of the data. They
contribute to statistical and clinical decision-making, and to quality control including ethical safeguards.
The statistical methods used take into account the fact that the data are being examined on many occasions.
For example, during the randomized trial of clofibrate in the treatment of heart disease, referred to in
Chapter 6 (Ex. 6.14), on three occasions during the monitoring of the trial the death rate was lower in the
treatment than in the placebo group, and would have been significantly lower if tested by the routine
statistical test using a 5 per cent cut-off. However, the difference was not significant when assessed by
methods appropriate for repeated monitoring of a trial, and so the trial was continued; the final results
showed no significant difference between the group randomized to clofibrate and the comparison group.
Several major randomized trials have been stopped early or modified after such monitoring. In the trial of prevention of neural tube defects discussed in detail in Chapter 11, the results passed the level of significance demanded by a sequential test when only about two-thirds of the originally planned number of participants had been enrolled, and the trial was terminated early. The recent studies of beta-carotene are another good example. As was discussed in Chapter 6, case–control and cohort studies had shown associations between high intakes or high blood levels of beta-carotene and lower cancer and heart disease rates, and these results were supported by the antioxidant properties of beta-carotene, protecting against DNA damage and inhibiting carcinogens in experimental situations. A trial of various nutritional interventions, including beta-carotene, was carried out in almost 30 000 subjects in rural China, and demonstrated a 13 per cent reduction in total cancer mortality [18]. However, a randomized trial in a well-nourished lower-risk group, 22 000 American doctors, showed no effect [19], and a trial of 29 000 male smokers in Finland showed an 18 per cent increase in lung cancer incidence and an 8 per cent increase in total mortality in those randomized to beta-carotene [20]. Then, in a large trial in the USA involving 18 000 subjects who were smokers or ex-asbestos workers, interim results assessed by the monitoring committee showed a 28 per cent increase in lung cancer incidence and a 17 per cent increase in total mortality [21]. The
p. 256 trial was terminated 20 months early, and the use of beta-carotene supplementation as one part of a further trial involving female health professionals in the USA was also stopped.
Interpreting non-significant results

The interpretation of results that are reported as not showing statistical significance also raises several issues. Again, we must consider first the issues of observation bias and confounding, assessing whether there are problems that could make the observed result smaller than the true result; these include the problem of random error. Again, we must assess if the statistical methods used are appropriate and correctly applied, although the issue of multiple testing is not important here. The main issue in interpreting non-significant results is to what extent we can accept that there is no true difference between the groups being compared. A type 2, or beta, error is made if the result is non-significant when in truth there is a difference between the groups.

Go back to Ex. 7.5 (p. 235) and look again at study B. This result is rather unsatisfactory. From 72 subjects, there is a higher success rate in the intervention than in the comparison group, but this effect is not statistically significant, and the confidence limits show that we cannot confidently decide whether the intervention is beneficial or not. The basic problem is that the study is too small. It does not provide a definite answer. In technical terms, the study lacks power. To avoid committing ourselves to performing studies like study B, it would be helpful to be able to predict the power of a study. We shall now present some fairly simple mathematical aspects of this, but readers who wish to avoid them may go on to the section on dealing with a statistician (p. 264).
p. 257
Factors affecting the power of a study

Several factors affect the power of a study (Ex. 7.14). The first is the strength of the association, for example the difference in outcome rates between the two groups, or the relative risk or odds ratio. The larger the true difference, the easier it will be to detect.
Ex. 7.14
p. 258 The next factors to be considered are the significance level and the power. The significance level is the cut-off point that will be used to determine whether the association found is regarded as statistically significant. It is most frequently set at the P = 0.05 level, using a two-sided test. The significance level, or alpha (α), is the frequency with which a 'significant' result occurs when in truth there is no difference; this type of error is called an alpha (α), or type 1, error. In setting a significance level, we are deciding what risk we will accept that the study will show an (incorrect) significant result when the true situation is that there is no difference. In some circumstances the direction of the effect may be regarded as fixed, and tests which assess only effects in one direction (one-sided tests) may be used. If we apply a 5 per cent one-sided test rather than a 5 per cent two-sided test, fewer subjects will be required.
The power of a study is its ability to demonstrate an association, given that the association exists. The frequency with which we will see no significant difference, if in truth there is a difference, is referred to as beta (β); this is a beta, or type 2, error. A more powerful study is one that is less likely to miss an effect; the power of the study is 1 − β. If in reality a true association is present, our ability to recognize it will be greater with a larger sample, a stronger association, and a higher frequency of outcome. A typical value of power is 80 per cent; the study is designed so that the chance of detecting a true difference is 80 per cent, and we accept that we will miss the true difference in 20 per cent of instances. For an exploratory study we may decide that missing a true difference (i.e. obtaining a false-negative result) is unimportant, and we may be content with a lower power, requiring fewer subjects. If we need to be confident that we will not miss a true effect, we may need a power of 90 per cent or more, which will require many more subjects.
For both the significance level (α) and the power (β = 1 − power), the corresponding normal deviates (Zα and Zβ) go into the formula, usually in the term (Zα + Zβ)², which we call K for convenience. The deviates for commonly used values of α and β, and the corresponding values of K, are given in Ex. 7.15.
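Since Zα and Zβ are simply standard normal deviates, K can also be computed directly rather than read from the table. A minimal sketch in Python (scipy is an assumed tool; the book itself uses tables and Excel):

from scipy.stats import norm

def K(alpha, power, two_sided=True):
    # K = (Z_alpha + Z_beta)^2, the multiplier in the sample size formulae
    z_alpha = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)            # beta = 1 - power
    return (z_alpha + z_beta) ** 2

print(round(K(0.05, 0.80, two_sided=False), 1))  # 6.2: one-sided 5%, 80% power
print(round(K(0.05, 0.90), 1))                   # 10.5: two-sided 5%, 90% power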
These four factors are the components of the formulae generally used for sample size determination, which will be described below. Two other factors also affect the necessary size. If confounder control by stratification or multivariate methods is to be used, the study size required is increased. However, an individually matched study, where the analysis will use the matched sets, may need a smaller sample size for the same power. Non-differential misclassification, as described in Chapter 5, will lead to a larger sample size being needed. This effect can be assessed by calculating the observable odds ratio or risk
p. 259 difference after allowing for misclassification, as shown in Chapter 5, and using it for the determination of the size of the study needed. Methods of determining study size with allowance for these various factors are available, which is why the formulae given below should be used only as a guide, and expert advice should be sought for any major study. Finally, the study size given in the formulae refers to the number of study participants in each group being compared who complete the study and provide full data for analysis. In designing a study, allowance has to be made for non-response and possible incompleteness of data.
The power of a study is increased by increasing the number of subjects in the study, in the intervention or case group, in the comparison group, or in both. When a similar effort is required to enrol a test or a comparison subject, studies with equal numbers in each group are optimal. A slight variation of this may be made on the basis of the projected effects: the information in the study depends on the number of subjects with the outcome in a cohort study, or the number with the exposure in a case–control study, and
p. 260 slight modifications of the ratio may be made to design a study where the number of outcome events or exposed subjects is likely to be the same in each group. For example, more than half the patients in a trial may be allocated to the new therapy, on the basis that the anticipated benefit will yield equal numbers of deaths in each group. It may be easier to increase the size of the comparison group than of the intervention or case group because more potential subjects are available, data on them have already been collected, or the number of exposed subjects or cases is fixed. The power of the study is increased by increasing the number of controls, but the benefit diminishes as the ratio of controls to cases rises, as shown later in this section.
From the four parameters of the expected frequency of the outcome in the control group, the difference in outcome rates, the significance level, and the power, a calculation of the sample size to satisfy those criteria can be made. Some appropriate formulae are illustrated in Ex. 7.16. The formulae can obviously be used in other ways. It is often useful to calculate the power of the study from the sample size readily available, to show if a proposed investigation is worthwhile or if more ambitious methods need to be used. The following examples show the application of these formulae.
(1) For a clinical trial. Suppose that the mortality rate in 2 years on conventional therapy is 40 per cent, and a new therapy would be useful if the rate fell to 30 per cent. How many patients do we need? Setting a significance level of 0.05 one-sided, and a power of 80 per cent, yields K = 6.2 and n = 279, i.e. 279 subjects in each group. If we were content to detect a larger difference, such as 20 per cent mortality with the new therapy, then n = 62; note the large change in n for a substantial change in (p1 − p2).
(2) For an epidemiological cohort study. We wish to test whether the breast cancer rate is increased in oral contraceptive users, and estimate a 10-year cumulative incidence in unexposed women of 0.01 (1 per cent). We set significance at 0.05 two-sided and power at 90 per cent, and wish to be able to detect a doubling of risk to 0.02; hence K = 10.5 and n = 3098.
(3) A colleague hopes that a new therapy will increase the proportion of patients recovering from his current 40 per cent to 60 per cent. He sees 100 patients each year whom he would like to enter into a trial. Is it worth it?
Set significance at 0.05 one-sided. Hence Zα = 1.64; p1 = 0.4, p2 = 0.6, n = 50; therefore Zβ = 2.04 −
p. 261 1.64 = 0.40. From Ex. 7.15, the power is less
p. 262 than 70 per cent; for more accuracy, use Excel to give the one-sided probability of this value of 0.40 in the normal distribution (Appendix Table 16). This yields 0.35, i.e. a power of 1 − 0.35, or 65 per cent. His study would miss a difference of the size given on one occasion out of three, and is too weak. For adequate power (e.g. 80 per cent), he would need 74 patients in each group; for 90 per cent, 103 patients. Thus a 2-year accrual period would be likely to be satisfactory if all subjects seen could be entered into the study.
(4) In a case–control study, we wish to be able to detect a doubling of risk (odds ratio = 2) associated with a factor which is present in 10 per cent of the normal population from which the control series will be drawn. Hence p2 = 0.1, and p1 is given by Ex. 7.16 as 0.182; if power = 80 per cent and the significance level = 0.05 two-sided, K = 7.9 and n = 280. We need approximately 300 cases and 300 controls. If controls are easily found, we might wish to try a study with, say, three controls per case. Keeping the other parameters the same, p = 0.14, q = 0.86, and n = 190; a study design with 200 cases and 600 controls would give similar power to one with 300 cases and 300 controls.
Ex. 7.16
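The equal-groups formula of Ex. 7.16 is easily coded as a cross-check on the worked examples above. A sketch in Python (scipy is an assumed tool; answers can differ from the text by a subject or two because the text rounds K to one decimal place):

from math import ceil
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    # Equal-groups formula of Ex. 7.16: n = K (p1 q1 + p2 q2) / (p1 - p2)^2
    z_a = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    k = (z_a + z_b) ** 2
    return ceil(k * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

print(n_per_group(0.40, 0.30, two_sided=False))   # example 1: 279
print(n_per_group(0.02, 0.01, power=0.90))        # example 2: ~3098
print(n_per_group(0.182, 0.10))                   # example 4: ~280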
The formulae given here are fairly simple. More complex sample size calculations are available in standard texts, websites, and computer programs [22–24]. The justification for presenting only a simple formulation here is that the general reader should use such formulae only as a broad guide. In practice, many quantities are unknown when a study is being contemplated: not only the likely difference between the outcomes, but the extent of loss of information through drop-outs or missing data, and the extent to which stratification or other techniques, which will reduce the power of the study, will have to be used. The main usefulness of these sample size formulae is to indicate a minimum value for the number of subjects necessary for a particular study, or conversely the approximate power that is achievable with the numbers available. Such calculations may show that the number of subjects needed far exceeds those readily available, or conversely that the power of the study is very low, perhaps 50 per cent or less. Such results should be taken to indicate that the study as envisaged is not a worthwhile endeavour and a different approach is necessary, such as moving from a single-centre to a multi-centre study, or addressing the question on a different set of subjects or in a different way. We need to emphasize that these calculations take only statistical sampling variation into account. There is no allowance for incomplete records, drop-outs, or errors in ascertainment, and no allowance for the requirements of more detailed analysis to adjust for confounding or assess results in subgroups.
p. 263 It is often helpful to compare the power produced by studies with different numbers of subjects, and to calculate a 'power curve', i.e. a graph showing the relationship between sample size and power within the other constraints of the study. This will help in choosing the most efficient design, which is the one that gives the most information for the least cost in subjects, time, and finances.
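Such a power curve requires only the sample size formula rearranged to give power as a function of n. A sketch for the clinical trial of example 3 above (illustrative Python, with scipy assumed):

from scipy.stats import norm

def power(n, p1, p2, alpha=0.05, two_sided=True):
    # Approximate power of a study with n subjects in each of two groups
    z_a = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_b = (n * (p1 - p2) ** 2 / (p1 * (1 - p1) + p2 * (1 - p2))) ** 0.5 - z_a
    return norm.cdf(z_b)

# A simple power curve for example 3 (40 vs 60 per cent, one-sided 5%):
for n in (25, 50, 74, 103, 150):
    print(n, round(power(n, 0.4, 0.6, two_sided=False), 2))
# n = 50 gives ~0.65 and n = 74 gives ~0.80, as in the worked example.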
The major properties of the formulae are useful guides (Ex. 7.16). The number of patients required is inversely proportional to the square of the difference in outcome rates; in other words, if the difference to be detected is halved, the number of subjects required will be four times as large. The biological effect of a factor will be expected to give a certain ratio of outcome events; the difference in numbers corresponding to a fixed ratio will be greater if the frequency of the outcome is higher, up to 50 per cent. Given that a new intervention reduces mortality by a third, it is easier to detect a difference between 45 and 30 per cent mortality than between 15 and 10 per cent; often a study design can be made more efficient by selecting subjects who have a high risk of the outcome under investigation. There is little room to manoeuvre in terms of significance levels, and one should not be overly tempted to use one-sided significance levels unless these are clearly indicated. The remaining factor is the relationship between the number of subjects and power. A study with 80 per cent power requires about twice as many subjects as one with 50 per cent power, and a very powerful study with 95 per cent power requires about 60 per cent more subjects than one of 80 per cent power.
As mentioned earlier, when a similar effort is required to enrol a test or a comparison subject, studies with equal numbers in each group are optimal, but otherwise studies with more controls than cases can be used. Exhibit 7.16 shows that the power of a study with n subjects in each group is equalled by one with c controls per test subject and n[1 + (1/c)]/2 test subjects; therefore an alternative to finding 100 cases and 100 controls is to use, say, two controls per case and find 75 cases and 150 controls. A little arithmetic will show the decreasing benefit of increasing the ratio of controls to cases; unless data for controls are very easy to obtain, for example by being available on a computer file, it is rarely worth using ratios greater than 5:1.
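That 'little arithmetic' takes only a few lines. A sketch using the n[1 + (1/c)]/2 relationship, with a hypothetical baseline design of 100 cases and 100 controls:

from math import ceil

# Cases needed with c controls per case to match the power of a design
# with 100 cases and 100 controls (Ex. 7.16: cases = n[1 + (1/c)]/2)
n = 100
for c in range(1, 7):
    cases = ceil(n * (1 + 1 / c) / 2)
    print(f"{c} control(s) per case: {cases} cases, {cases * c} controls")
# 2:1 needs 75 cases, 3:1 needs 67, 5:1 needs 60: the saving in cases
# shrinks while the number of controls grows rapidly.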
It is a requirement of the scientific review process for research funding that investigators show that their
p. 264 proposed study will have adequate power to answer the question posed.
         Dealing with a statistician about study design
When designing a major study, statistical help should be sought with regard to sample size. However, the non-statistical reader should be prepared for the questions which the statistician will ask, which amount in principle to a prediction of the results of the study. In cohort studies and trials, we need to predict the frequency of the outcome in the control group and the size of the difference between the control and exposed groups which we regard as worth detecting. To arrive at this judgement it is worth thinking in operational terms: if the frequency of the outcome in the comparison series, such as patients treated with conventional therapy, is a certain percentage, to what would this have to change before we would wish to employ the new therapy? For a case–control design, the statistician will want to know the likely frequency of exposure in the comparison group, which may be obtained from the literature or a pilot study, and the size of the association to be detected, best expressed as the odds ratio. The remaining questions are to define the significance level and the desired power of the study. For the first, there is rarely a good reason to stray from the convention of the 5 per cent level. The question of one-sided and two-sided tests has been alluded to. As to power, a frequently used starting point is 80 per cent in an exploratory study; in a study to reassess a finding, or in other circumstances where it is important not to miss a true association, higher power is desirable.
So far in this text we have presented the results of cohort and intervention studies in terms of the numbers of subjects with and without the outcome in question ('count data'), or the numbers of subjects experiencing the outcome divided by the total person-years of observation ('person-time data'). These methods are valid only if the outcome can be adequately expressed in this simple form, which is true only if the follow-up time period is the same for the different groups of subjects being compared, or if the risk of the outcome is constant over the time course of the study. Life-table methods of analysis were developed to overcome these difficulties. They are used in intervention trials and cohort studies where the risk of the outcome events varies with the time interval since the intervention or exposure, and where the period of follow-up varies for different individuals in the study. All aspects of the scheme for critical appraisal described in this text are applicable to these studies.
p. 265
         Survival curves
In cohort studies where the period of follow-up of individuals varies, the number of outcome events must be related to the person-time experience during which an event could occur. In the person-time type of analysis shown in Chapter 3 we assumed that the risk of the outcome was constant over time, and therefore groups of subjects could be adequately compared by looking only at their total person-time experience and the number of events occurring. In many situations this is manifestly not true. For example, after an acute myocardial infarction, the risk of death is initially high and then decreases rapidly. If we compare 50 patients who have each been followed up for 1 year with 600 patients who have each been followed up for 1 month, the total person-time experience is the same (50 person-years), but the short follow-up samples only the early high-risk period, and a simple comparison of events per person-year would be misleading. A survival curve, which shows the proportion of subjects surviving against time since entry, displays the full pattern of risk; an example is shown in Ex. 7.17.
Ex. 7.17
Survival of 596 patients with breast cancer from date of confirmed diagnosis, calculated by the product-limit method. For this graph, only deaths from breast cancer were counted; the few other deaths were not, the observations being censored at that point
To construct such a curve, we need to know for each individual in the study the starting date of the period of observation, the finishing date of this period, and the outcome for that individual. The starting date in a clinical situation may be defined as the date of diagnosis, the date of first treatment, or, in a randomized study, the date of randomization. In an epidemiological cohort study the starting date may be the date of first exposure to the factor under investigation, such as working in a particular job or using a drug.
In a simple example assessing survival in a group of patients, the survival curve is simply the proportion surviving at various points in time. The 'curve' has, in fact, a step-type pattern; if we start with 10 patients
p. 266 the survival rate is 100 per cent until one dies, then it remains at 90 per cent until another dies, and so on. As the number of patients enrolled in the study increases, each step becomes smaller and the curve appears smoother. Usually we do not know all the dates of death, as some patients will still be alive at the time of analysis and others may have been lost to follow-up, so that their current status is unknown. The observations on these patients are censored. Censoring also occurs if the particular endpoint under study is precluded by another event; for example, if we wish to count only deaths from a certain cause, other deaths produce censored observations. We shall describe two commonly used methods for constructing a survival curve: the product-limit method (also known as the Kaplan-Meier estimate) and the actuarial or life-table method.
Consider the data in Ex. 7.18, which shows survival times for 20 subjects; at this point ignore the division into the two treatment groups.
Ex. 7.18
         Hypothetical survival data arranged in order of increasing survival time for a clinical trial with 10 patients in each of two
         treatment groups. Note: Trials of this size are not recommended; this is for explanation only
The calculations are shown in Ex. 7.19. Two deaths occurred in the first month, so the probability of surviving that month was 18/20, or 0.9, and the survival at 1 month, i.e. at the end of the first month, was 0.9. No deaths occurred during the second month, and during the third month there was one censored
p. 268 observation but still no deaths, so the cumulative survival by the end of the third month was 0.9 × 1 × 1 = 0.9. During the fourth month one death occurred; because only 17 subjects were under follow-up at the time, the probability of survival was 16/17, or 0.94, and the cumulative survival at the end of the fourth month was 0.94 × 0.9 = 0.85. The survival curve changes only when a death occurs; hence a calculation need be made only when this happens, and each relevant interval ends when one or more deaths occur. The resultant step-type curve is shown in Ex. 7.20.
Ex. 7.19
Ex. 7.20
Survival curves for the data shown in Ex. 7.18. The step pattern is the product-limit survival curve calculated in Ex. 7.19. The
dotted line is the actuarial curve calculated in Ex. 7.21
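The product-limit calculation is mechanical enough to sketch in code. This is an illustration only, not the book's software; the data reproduce just the first steps of the worked example (two deaths among 20 subjects in month 1, a censored observation in month 3, and a death among 17 in month 4), with the remaining subjects treated as censored later:

def product_limit(times, events):
    # Product-limit (Kaplan-Meier) survival estimate.
    # times  : follow-up time for each subject
    # events : 1 if the subject died, 0 if the observation was censored
    # Returns a list of (time, cumulative survival) steps.
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(e for tt, e in data if tt == t)
        n_leaving = sum(1 for tt, _ in data if tt == t)
        if deaths:
            surv *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, surv))
        n_at_risk -= n_leaving
        i += n_leaving
    return curve

times  = [1, 1, 3, 4] + [24] * 16    # simplified hypothetical data
events = [1, 1, 0, 1] + [0] * 16
print(product_limit(times, events))  # [(1, 0.9), (4, 0.847...)]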
The estimation of survival by this method not only allows for censoring, but gives an efficient estimate, since in each section of the follow-up period the information available on all subjects observed during that interval is used. The adjustment made for censored observations assumes that they are censored randomly, i.e. we assume that the subjects whose data are censored at a certain time have a subsequent course similar to that of the subjects remaining under follow-up. If this is not so, bias will result; for example, if those lost to follow-up do worse than those followed, perhaps because very ill subjects no longer attend for treatment, the calculated survival curve will overestimate the actual survival which would be correctly calculated if full information were available. The product-limit method gives the survival curve that is the most likely estimate, in statistical terminology the 'maximum likelihood' estimate, of the true survival curve.
p. 269
         The actuarial or life-table method
Because hand calculation of the product-limit estimate is tedious with a large set of data, the actuarial or life-table method can be used. This gives an approximation to the product-limit estimate, with the number of calculations being reduced by grouping the data into suitable intervals of follow-up, such as 1 year or 3 months. Calculations for the data used in Ex. 7.18, with a 6-month interval, are shown in Ex. 7.21. For each time interval we calculate the probability of surviving that interval, which is given as 1 minus the probability of dying during that interval. The probability of dying is given by the number of deaths during the interval divided by the number of subjects under follow-up during that interval. This denominator is usually taken as the number under follow-up at the start of the interval minus half of those whose observations are censored during the interval, on the assumption that censoring occurs evenly over the interval.
Ex. 7.21
Calculation of the actuarial estimate of the survival curve from the data in Ex. 7.18, using 6-month intervals. This example is to show the method; the actuarial method is not advisable with such small numbers, but is useful with large sets of data
         Survival curves are usually shown with an arithmetic horizontal scale, and either an arithmetic or a
         logarithmic vertical scale. If the risk of death during each section of the follow-up period is in fact the same,
         a survival curve with an arithmetic vertical scale will show an exponential decreasing curve, while one with
         a logarithmic scale will show a straight line. Product-limit estimates, being actual calculations of survival
         rate at every event, are best shown by stepped patterns, while actuarial estimates are an approximation and
         can reasonably be shown as a series of points connected by straight lines. The intervals chosen for actuarial
         calculation need not be of equal length and are often chosen on a convenient arbitrary basis, such as annual
         intervals; but they should be short enough to keep the number of censored observations in any interval
         relatively small.
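The actuarial calculation, with the conventional half-withdrawal adjustment described above, can be sketched in the same style (illustrative code; the data are the same simplified set used in the product-limit sketch):

def actuarial(times, events, width):
    # Actuarial (life-table) survival estimate with fixed interval width.
    # Censored subjects are assumed to be at risk for half of the interval
    # in which they leave (the usual half-withdrawal adjustment).
    n = len(times)
    surv, curve, start = 1.0, [], 0
    while n > 0:
        end = start + width
        deaths = sum(1 for t, e in zip(times, events)
                     if start <= t < end and e == 1)
        withdrawn = sum(1 for t, e in zip(times, events)
                        if start <= t < end and e == 0)
        at_risk = n - withdrawn / 2          # effective denominator
        if at_risk > 0:
            surv *= 1 - deaths / at_risk
        curve.append((end, surv))
        n -= deaths + withdrawn
        start = end
    return curve

times  = [1, 1, 3, 4] + [24] * 16
events = [1, 1, 0, 1] + [0] * 16
print(actuarial(times, events, width=6))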
Accuracy of estimates
The standard error of the survival rate Si at the end of a given interval is given by Greenwood's formula:

    SE(Si) = Si √[ Σi di / {ni (ni − di)} ]

p. 271 using the notation shown in Ex. 7.19, with ni subjects under follow-up and di deaths in interval i. The summation Σi is over all the intervals up to and including the last interval i. This formula is applicable to both product-limit and actuarial calculations and, although approximate, is adequate for most purposes. In the usual way, the 95 per cent two-sided confidence limits are given as the estimate Si ± 1.96 × standard error.
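For the small worked example, the standard error and confidence limits can be computed directly (an illustrative sketch; the (di, ni) pairs are taken from the product-limit calculation above):

from math import sqrt

def greenwood_se(s, intervals):
    # Greenwood standard error of a survival estimate s; 'intervals' is a
    # list of (deaths, number at risk) pairs for every interval up to and
    # including the one at which s is estimated.
    return s * sqrt(sum(d / (n * (n - d)) for d, n in intervals))

# Survival at 4 months in the worked example: S = 0.9 * 16/17
s = 0.9 * 16 / 17
se = greenwood_se(s, [(2, 20), (1, 17)])
print(f"S = {s:.3f}, SE = {se:.3f}, "
      f"95% CI {s - 1.96 * se:.3f} to {s + 1.96 * se:.3f}")
# With only 20 subjects the upper limit exceeds 1, a reminder that the
# approximation is poor in small samples.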
         Analyses of survival for groups of patients are often reported in terms of relative survival, which is total
         survival divided by the survival expected in a group of subjects of the same age and gender composition in
         that country and at that time. The expected values are derived from life tables routinely produced from vital
         statistics data. This is a way of taking account of the proportion of deaths that will be due to causes other
         than the disease of interest.
For example, suppose that the survival at 5 years after diagnosis for a group of men diagnosed with prostate cancer at age 70, calculated by the product-limit method shown in this chapter, is 72 per cent. This estimate has variation depending on the number of men assessed, and so the 95 per cent confidence limits might be from 68 to 76 per cent. From life tables, the expected survival over 5 years of a group of 70-year-old men is 83 per cent. We can regard this as being without sampling variation because it is based on the whole community and on large numbers. Thus the relative survival for the group is 0.72/0.83, i.e. 87 per cent, with confidence limits obtained by dividing the upper and lower limits by the same factor of 0.83, giving limits of 82 to 92 per cent. Similarly, the overall 5-year survival for a group of women with breast
         cancer diagnosed at age 60 might be 79 per cent, and the expected survival of a group of women of this age
         is 96 per cent, giving a relative survival of 82 per cent. Although the total survival of the men with prostate
         cancer was lower than that of the women with breast cancer (72 per cent compared with 79 per cent), their
         relative survival was higher (87 per cent compared with 82 per cent) as a greater proportion of their total
         mortality was due to the expected mortality in a group of that age and sex. The relative survival is taken as
an indicator of the disease-specific survival, i.e. the survival and corresponding mortality specifically due to prostate cancer or breast cancer, respectively.
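The relative survival arithmetic in this example is a one-line calculation:

# Relative survival for the prostate cancer example: observed survival
# divided by the expected survival from national life tables (the
# expected value is treated as free of sampling error).
observed, lower, upper = 0.72, 0.68, 0.76
expected = 0.83
print(f"relative survival = {observed / expected:.2f} "
      f"(95% CI {lower / expected:.2f} to {upper / expected:.2f})")
# -> 0.87 (0.82 to 0.92), as in the text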
Alternatively, disease-specific survival can be calculated by the methods already shown. Using the product-limit method, each death from a specific disease, such as prostate cancer or breast cancer, is counted as an event, and deaths from any other cause are regarded as censoring. The danger with this analysis is that it is
p. 272 open to biases in the allocation of deaths; there is always the suspicion that subjects who have a diagnosis of a severe disease will tend to be certified as dying from that disease, even if their cause of death is in fact independent of it.
In clinical trials and cohort studies with much smaller numbers of individuals, disease-specific analyses may be appropriate, as attention can be paid to possible biases in outcome assessment and steps taken to ensure that the attribution of the cause of death is not biased by the exposure or intervention status of the subject. Thus an independent panel, blind to the treatment allocation in a trial, may review the clinical and pathological information to classify causes of death. Where disease-specific survival is regarded as the prime endpoint, it is always useful to do an analysis of total survival in addition.
To compare the survival experiences of two (or more) groups of subjects, we could select an arbitrary point in time and compare the survival at that point using a simple 2 × 2 table of deaths versus survivors, or using the formula given above for the standard error of the actuarial estimate at that point. Such a comparison uses only part of the information in the data, and the choice of the time point is arbitrary and may be influenced by the results themselves.

To avoid both these difficulties, a non-parametric test is often used. Such a test compares the order in which the outcome events occur in each group of subjects, without reference to the precise timing. Thus the tests evaluate the whole curve, but make no assumptions about its shape.
Although there are several different non-parametric tests, one method is simple and at the same time powerful and appropriate over a wide range of circumstances [25,26]. This is the log-rank test, and it will come as no surprise to the reader that this is an application of the Mantel-Haenszel test discussed earlier. At each point in time at which a death occurs, a 2 × 2 table showing the number of deaths and the total number of subjects under follow-up is created, as shown in Ex. 7.22. For each such table, the
p. 273 observed deaths in one group, the expected deaths, and the variance of the expected number are calculated in the same way as was shown in Ex. 7.3. These quantities are summed over all the tables, and the Mantel-Haenszel statistic is derived from the summations. The calculation for the data given in Ex. 7.18, comparing the two treatment groups, is shown in Ex. 7.23.
         Ex. 7.22
Ex. 7.23
Calculation of the log-rank statistic from the data in Ex. 7.18: the quantities ai, Ei, and Vi are calculated for each time period i; a time period ends when a death occurs, or when two or more deaths occur together. The formula is shown in Ex. 7.22
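The construction is easy to express in code. The sketch below is illustrative; the data are hypothetical stand-ins, as the actual values of Ex. 7.18 appear only in the exhibit:

def logrank_components(times, events, groups):
    # Observed deaths (O), expected deaths (E), and variance (V) for
    # group A, accumulated over every distinct death time: the
    # Mantel-Haenszel construction of the log-rank test (Ex. 7.22).
    O = E = V = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = [(tt, ee, gg) for tt, ee, gg in zip(times, events, groups)
                   if tt >= t]
        n = len(at_risk)
        n_a = sum(1 for _, _, gg in at_risk if gg == 'A')
        d = sum(1 for tt, ee, _ in at_risk if tt == t and ee)
        d_a = sum(1 for tt, ee, gg in at_risk if tt == t and ee and gg == 'A')
        O += d_a
        E += d * n_a / n
        if n > 1:
            V += d * (n - d) * n_a * (n - n_a) / (n ** 2 * (n - 1))
    return O, E, V

# Hypothetical data in the layout of Ex. 7.18 (not the exhibit's values):
times  = [1, 1, 4, 5, 8, 2, 6, 9, 9, 12]
events = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]
groups = ['A'] * 5 + ['B'] * 5
O, E, V = logrank_components(times, events, groups)
print(f"O = {O}, E = {E:.2f}, O/E = {O / E:.2f}, "
      f"chi-squared = {(O - E) ** 2 / V:.2f}")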
The log-rank test calculations produce the observed to expected ratio for each group, which relates the number of deaths observed during the follow-up period to the number expected on the null hypothesis that the survival curve for that group is the same as that for the combined data. A subgroup of subjects with a mortality rate higher than that of the whole group will have an observed to expected ratio greater than 1. Thus, in the data used in Ex. 7.23, in group A more deaths are observed (7) than expected (3.49). Examination of the survival curves, calculated for each group by the method shown in Ex. 7.19, shows that there are more early deaths in group A.
The log-rank method can be expanded to look at a number of different categories of subjects, which might correspond to graded severities of disease, stages of disease, dose regimens of treatment, and so on.
p. 274 Comparison between the curves can be based either on an overall chi-squared statistic with (n − 1) degrees of freedom, where n is the number of groups compared, which assesses the total variation from the null hypothesis that survival is the same in each group, or on a trend statistic on one degree of freedom which assesses the linear component of a trend in survival experience over the ordered groups.
         Thus, Ex. 7.24 shows a further analysis of the survival of the breast cancer patients whose overall survival
         was shown earlier in Ex. 7.17. The patients have been divided into four groups in terms of the measurement
         of oestrogen receptor (ER) concentration in the tumour [27]. The graph shows that survival is better for
         those with a higher concentration of oestrogen receptors. The log-rank analysis for the four groups gives an
overall χ² of 60.4 on 3 d.f. (P < 0.0001), and the trend statistic, assessing whether survival changes in a linear fashion with a change in receptor concentration, gives a χ² of 48.6 on 1 d.f. (P < 0.0001).
Ex. 7.24
p. 275
         Control for confounding
Since the log-rank method of analysis is derived from the Mantel-Haenszel statistical method, it can easily be adapted to the consideration of confounding variables. The logic is the same as that with the simpler types of data analysis already discussed, in that the experience of the groups to be compared is examined within categories of the confounding factor, by generating survival curves for the groups to be compared for each subcategory of the confounder. The log-rank statistic allows the data to be summarized over levels of the confounding factor, giving an overall observed to expected ratio and an overall chi-squared statistic which assesses the difference between the groups being compared after control for the confounder. For example, the data shown in Ex. 7.24 show that survival in breast cancer patients is related to oestrogen receptor concentration. It is important to know whether this effect is independent of the effect of other
p. 276 prognostic factors such as staging. Therefore a stratified analysis was done, in which the effect of oestrogen receptor status was assessed within clinical stage categories and the log-rank statistic was calculated over the strata. This showed that the observed to expected ratios were little changed; the ratio for subjects with the highest receptor concentration was 0.44 before control for clinical stage and 0.45 after such control, and the trend statistic was 48.6 before controlling for stage and 47.3 afterwards. This shows that the prognostic effect of oestrogen receptor quantity is independent of clinical stage [27].
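Expressed in code, the stratified analysis simply accumulates the log-rank quantities across strata. This sketch reuses the logrank_components() function from the log-rank sketch above; the stratum data are hypothetical:

# Stratified log-rank: O, E, and V are computed within each stratum of
# the confounder and summed before forming the Mantel-Haenszel statistic.
stage_1 = ([1, 3, 5, 7, 9, 11], [1, 1, 0, 1, 1, 0],
           ['A', 'A', 'A', 'B', 'B', 'B'])
stage_2 = ([2, 4, 6, 8, 10, 12], [1, 1, 1, 1, 0, 0],
           ['A', 'A', 'A', 'B', 'B', 'B'])

O = E = V = 0.0
for times, events, groups in (stage_1, stage_2):
    o, e, v = logrank_components(times, events, groups)
    O, E, V = O + o, E + e, V + v

chi2 = (O - E) ** 2 / V       # 1 d.f.: group effect adjusted for stage
print(f"adjusted O/E = {O / E:.2f}, chi-squared = {chi2:.2f}")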
Multivariate analysis
Beyond this type of analysis, multivariate methods can be applied to survival data. Such methods are similar in principle to the logistic regression model described earlier in this chapter. A powerful method with wide applicability is the proportional hazards model developed by Cox [28]. This model allows the survival curve to take any form, i.e. the hazard function, or instantaneous mortality rate, is allowed to vary with the time from the start of follow-up. However, it is assumed that the relative effect of any factor on the hazard function will be constant over time. Thus, if the hazard function at time t in the referent group of subjects is expressed as λ0,t, the function for subjects with a factor x1 is given by

    λ1,t = λ0,t exp(b1x1)

where b1 is the coefficient related to x1. If x1 has the values 0 or 1, the exponential of b1 will be the ratio λ1,t/λ0,t, i.e. the ratio of the hazard functions, which is a relative risk. With multiple factors, the model becomes

    λt = λ0,t exp(b1x1 + b2x2 + … + bkxk)
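In practice such models are fitted by software. A minimal sketch using the open-source lifelines package for Python (an assumed tool; the book does not prescribe software, and the data frame and column names here are hypothetical):

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    'time':  [5, 8, 12, 3, 9, 14, 7, 11],   # follow-up time
    'event': [1, 1, 0, 1, 1, 0, 1, 0],      # 1 = death, 0 = censored
    'x1':    [1, 1, 1, 1, 0, 0, 0, 0],      # exposure indicator
})
cph = CoxPHFitter()
cph.fit(df, duration_col='time', event_col='event')
print(np.exp(cph.params_))   # exp(b1), the hazard ratio for x1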
As stated at the beginning of the chapter, this review of key statistical methods emphasizes that statistical
methods are one component of the process of assessing causation, along with the consideration of
observation bias and confounding. In subsequent chapters, we will consolidate and develop this theme.
p. 277
         Self-test questions (answers on p. 501)
Q7.1 Two studies of the effects of a new treatment on recovery from a disease are available. The first study gives a relative risk of 2.2, with a chi-square statistic on one degree of freedom of 9.6. The second study gives a relative risk of 4.6, with a chi-square statistic of 1.8. How would you interpret these results? (Use Appendix Table 15, or the Excel functions shown in Appendix Table 16.)
Q7.2 In a study of injury in different work groups, there were 16 injuries in 1 year in a group of 1000 employees working on special processes, while for a larger group of 10 000 workers there were 40 injuries in the same year. Is the difference in injury rates statistically significant?
Q7.3 In a case–control study, 52 of 400 controls are exposed to the agent in question, compared with 34 of 100 cases. Calculate the odds ratio, the Mantel-Haenszel chi-square statistic, and the 95 per cent two-sided confidence limits. Use Ex. 7.3 and Appendix Table 1.
Q7.4 A cohort study gives a result showing a relative risk of 2.25 with 95 per cent two-sided confidence limits of 0.88 to 5.75. How do you interpret this?
Q7.5 The background rate of smoking cessation over 1 year is assumed to be 10 per cent. A specific programme is thought worthwhile if it would double this to 20 per cent. It is assumed that the cessation programme could not reduce the cessation rate. You want to design a study which gives high confidence (90 per cent) that you can detect an effect of this magnitude. How many subjects do you need? Use Exs. 7.15 and 7.16.
           Q7.6 If the answer from the last question suggests a number beyond your resources, how could you
                  design the study to use a smaller number?
Q7.7 You want to set up a case–control study to assess a factor that you expect to be present in 10 per cent of the control sample. You want to be able to detect a relative risk of 2, with 80 per cent power and using a two-sided 0.05 significance level. Cases and controls are equally difficult to recruit. What is the minimum number of subjects you require?
           Q7.8 If in the case–control study described in Q7.7, controls are much easier to recruit than cases, how
                  many cases are needed if four controls per case are recruited?
p. 278 Q7.9 In an individually matched case–control study there are 200 pairs in which the case is exposed and the control is not, and 50 where the control is exposed and the case is not. What is the odds ratio, the chi-square statistic, and the approximate 99 per cent two-sided confidence limits? Use Exs. 7.11 and 7.7 (pp. 248 and 239).
Q7.10 In a case–control study, 50 of 100 cases and 20 of 100 controls in men have the exposure of interest. In women, 60 of 100 cases and 10 of 100 controls are exposed. What are the results? Are the results different for men and women?
Q7.11 From the data in Ex. 7.18, calculate the survival curve for each of groups A and B. What is the survival rate and the 95 per cent confidence limits for each group at 18 months? Do these results agree with the log-rank statistic calculated in Ex. 7.23?
References
1. Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research (4th edn). Oxford: Blackwell Scientific, 2002. doi:10.1002/9780470773666
2. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods. Belmont, CA: Lifetime Learning Publications, 1982.
4. Rothman KJ, Greenland S. Modern Epidemiology (2nd edn). Philadelphia, PA: Lippincott–Raven, 1998.
5. Miettinen O. Estimability and estimation in case-referent studies. Am J Epidemiol 1976; 103: 226–235.
6. Greenland S. A counterexample to the test-based principle of setting confidence limits. Am J Epidemiol 1984; 120: 4–7.
7. Lyon JL, Gardner JW, West DW, Stanish WM, Hebertson RM. Smoking and carcinoma in situ of the uterine cervix. Am J Public Health 1983; 73: 558–562. doi:10.2105/AJPH.73.5.558
8. Truett J, Cornfield J, Kannel W. A multivariate analysis of the risk of coronary heart disease in Framingham. J Chronic Dis 1967; 20: 511–524. doi:10.1016/0021-9681(67)90082-3
9. Pooling Project Research Group. Relationship of blood pressure, serum cholesterol, smoking habit, relative weight and ECG abnormalities to incidence of major coronary events: final report of the Pooling Project. J Chronic Dis 1978; 31: 201–306. doi:10.1016/0021-9681(78)90073-5
10. Saracci R. Asbestos and lung cancer: an analysis of the epidemiological evidence on the asbestos–smoking interaction. Int J Cancer 1977; 20: 323–331. doi:10.1002/ijc.2910200302
11. Mantel N. Chi-square tests with one degree of freedom; extensions of the Mantel–Haenszel procedure. Am Stat Assoc J 1963; 58: 690–700. doi:10.2307/2282717
12. Elwood JM. Maternal and environmental factors affecting twin births in Canadian cities. Br J Obstet Gynaecol 1978; 85: 351–358. doi:10.1111/j.1471-0528.1978.tb14893.x
13. McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 1947; 12: 153–157. doi:10.1007/BF02295996
p. 279
14. Elwood JM, Williamson C, Stapleton PJ. Malignant melanoma in relation to moles, pigmentation, and exposure to fluorescent and other lighting sources. Br J Cancer 1986; 53: 65–74. doi:10.1038/bjc.1986.10
15. Clayton D, Hills M. Statistical Models in Epidemiology. Oxford: Oxford Scientific Publications, 1993.
16. ISIS-2 (Second International Study of Infarct Survival) Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. Lancet 1988; 332: 349–360. doi:10.1016/S0140-6736(88)92833-4
18. Blot WJ, Li J-Y, Taylor PR, et al. Nutrition intervention trials in Linxian, China: supplementation with specific vitamin/mineral combinations, cancer incidence, and disease-specific mortality in the general population. J Natl Cancer Inst 1993; 85: 1483–1492. doi:10.1093/jnci/85.18.1483
19. Hennekens CH, Buring JE, Manson JE, et al. Lack of effect of long-term supplementation with beta carotene on the incidence of malignant neoplasms and cardiovascular disease. N Engl J Med 1996; 334: 1145–1149. doi:10.1056/NEJM199605023341801
20. The Alpha-Tocopherol, Beta Carotene Cancer Prevention Study Group. The effect of vitamin E and beta carotene on the incidence of lung cancer and other cancers in male smokers. N Engl J Med 1994; 330: 1029–1035. doi:10.1056/NEJM199404143301501
21. Omenn GS, Goodman GE, Thornquist MD, et al. Effects of a combination of beta carotene and vitamin A on lung cancer and cardiovascular disease. N Engl J Med 1996; 334: 1150–1155. doi:10.1056/NEJM199605023341802
22. Pocock SJ. Clinical Trials: A Practical Approach. Chichester: John Wiley, 1983.
23. Machin D, Day S, Green SB. Textbook of Clinical Trials (2nd edn). Chichester: John Wiley, 2006. doi:10.1002/9780470010167
24. Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions (3rd edn). New York: John Wiley, 2003. doi:10.1002/0471445428
25. Mantel N. Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep 1966; 50: 163–170.
26. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II: analysis and examples. Br J Cancer 1977; 35: 1–39. doi:10.1038/bjc.1977.1
27. Godolphin W, Elwood JM, Spinelli JJ. Estrogen receptor quantitation and staging as complementary prognostic indicators in breast cancer: a study of 583 patients. Int J Cancer 1981; 28: 667–683. doi:10.1002/ijc.2910280604
p. 280
28. Cox DR. Regression models and life-tables. J R Statist Soc Series B 1972; 34: 187–220.