Br. J. Surg. 1991, Vol. 78, July, 782-784

Statistical guidelines for The British Journal of Surgery

G. D. Murray
Medical Statistics Unit, Department of Surgery, Western Infirmary, Glasgow G11 6NT, UK
Correspondence to: Dr G. D. Murray

A set of statistical guidelines is presented which should be followed by authors preparing manuscripts for submission to The British Journal of Surgery. Simple points relating to presentation are explained in full, and more complex issues are highlighted and discussed in an accompanying review article.
In 1988 I published a paper in The British Journal of Surgery which reviewed the standard of statistical presentation of articles in three consecutive issues of the Journal1. The review was relatively critical, highlighting problems in the areas of study design, presentation of data, statistical analysis and interpretation. Since that time I have served as the statistical consultant to the Journal, refereeing a proportion but by no means all of the manuscripts submitted for publication. Many of the manuscripts received by the Journal still suffer from major statistical flaws2, and a proportion of these flaws continue to evade the referees' and editors' scrutiny and appear in print. Some of these problems are relatively minor sins of omission, where the analysis is almost certainly valid but where important details which would allow the reader to check the validity are ignored or else stated ambiguously; for example, the '±' symbol often appears in manuscripts with no indication of whether a standard deviation or a standard error is being reported. However, a proportion of the problems are more serious sins of commission, where inappropriate study design, analysis or interpretation throws considerable doubt onto the authors' conclusions.
In this article I give a set of guidelines for the preparation of manuscripts which I hope will help to reduce the large number of sins of omission. It is debatable to what extent a set of guidelines can help authors to avoid the sins of commission. Medical statistics is a subtle blend of methodology, experience and common sense (not unlike medicine!), and it is not simple to capture these nuances in a succinct set of guidelines. It is possible, however, to warn against the more common bad practices that appear, and indicate the situations where even the most numerate of clinicians should consider seeking professional statistical advice. In the guidelines I have attempted to distinguish between the relatively simple points of presentation, which are set out in some detail, and the more fundamental issues which are explored in an accompanying paper. The latter issues are more related to good research methodology, and these frequently give rise to problems in manuscripts submitted to The British Journal of Surgery.
While I have set out these guidelines with my experience of reading manuscripts and papers from The British Journal of Surgery very much to the fore, I have to acknowledge a debt of gratitude to the authors of two excellent and more general sets of guidelines for medical authors which have been published recently. Bailar and Mosteller3 have published a paper amplifying and explaining the extremely terse section on statistics in the Uniform requirements for manuscripts submitted to biomedical journals prepared by the International Committee of Medical Journal Editors4, and Altman et al. have recently published an updated set of guidelines5 based on a set published earlier by the same authors in The British Medical Journal6. For readers looking for a more comprehensive discussion of many of these issues I would recommend the excellent and eminently readable textbook by Bland7.
Guidelines

These are set out under the headings of my earlier review1, namely design, presentation, analysis and interpretation. To a large extent these areas overlap; for example, the interpretation may depend crucially on study design. However, any repetition should only serve to amplify the important points.
Design

Set out clearly the purpose of the study, identifying the a priori primary and secondary hypotheses8.

Set out the reason for the choice of study design in terms of the purpose of the study8.

Set out the reason for the choice of sample size, which should wherever possible be based on a power calculation. Report in addition whether the sample size was determined in advance or was chosen on the basis of repeated analysis of accumulating data8.
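As an illustration of the sort of calculation involved, the following sketch uses the Python statsmodels library to find the group size needed for a two sample comparison of means; the effect size, significance level and power shown are assumptions chosen purely for demonstration, not recommendations.

```python
# Illustrative power calculation for a two-sample comparison of means.
# The effect size (0.5 SD), significance level and power are assumptions
# chosen purely for demonstration; substitute values appropriate to the study.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,   # difference of 0.5 standard deviations
                                   alpha=0.05,        # two-sided type I error rate
                                   power=0.80)        # 80 per cent power
print(f"Approximately {n_per_group:.0f} subjects are needed in each group")
```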
Explain what measures were taken to control bias, whether this involved randomization, matching or an 'adjusted' statistical analysis.

In a randomized study, set out the actual mechanism that was used, and describe any form of blinding that was employed. Many studies are described as being randomized when in fact allocation was based on unit number or date of entry to the study. Such techniques do not constitute randomization, and can introduce serious bias between study groups.
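One acceptable mechanism is a computer-generated allocation list. The sketch below builds a minimal permuted-block list in Python; the block size, arm labels and seed are illustrative assumptions only.

```python
# Minimal sketch of a permuted-block randomization list (block size 4),
# so that allocation is genuinely random rather than based on unit number
# or date of entry. Block size, arm labels and seed are illustrative assumptions.
import random

def block_randomization(n_subjects, block_size=4, arms=("A", "B"), seed=1234):
    """Return a treatment allocation list built from randomly permuted blocks."""
    rng = random.Random(seed)          # fixed seed so the list can be reproduced and audited
    allocations = []
    while len(allocations) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)             # permute treatments within each block
        allocations.extend(block)
    return allocations[:n_subjects]

print(block_randomization(10))         # e.g. ['B', 'A', 'A', 'B', ...]
```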
Presentation of data: graphical

Wherever space permits, use a graphical presentation to illustrate the main findings of a study.

When the sample sizes are not too large, base the graph on the individual data points.

When 'before' and 'after' data are available on the same individual, either plot out both sets of data with lines linking values for the same individual9, or else plot the changes.

If the data being presented are highly skewed, then consider using a transformation, such as plotting the data on a logarithmic scale.

If an analysis of the association between two continuous variables is central to the paper, then always include a scatter plot of the two variables plotted against each other.

When plotting survival curves or life tables, always indicate the number of individuals at risk at the various times of follow-up.

Avoid the use of 'error bars' showing one standard deviation or standard error (see below).
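The sketch below combines two of these points: individual 'before' and 'after' values joined by lines, drawn on a logarithmic scale because the data are skewed. It uses Python with matplotlib, and the data values are invented purely for illustration.

```python
# Minimal sketch of a 'before'/'after' plot with lines linking the values for
# each individual, drawn on a logarithmic scale because the data are skewed.
# The data values here are invented purely for illustration.
import matplotlib.pyplot as plt

before = [12, 45, 8, 150, 30, 22]
after = [9, 30, 7, 95, 25, 14]

for b, a in zip(before, after):
    plt.plot([0, 1], [b, a], marker="o", color="black")   # one line per individual

plt.xticks([0, 1], ["Before", "After"])
plt.yscale("log")                                          # logarithmic scale for skewed data
plt.ylabel("Measurement (log scale)")
plt.show()
```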
Presentation of data: numerical

The use of a mean and standard deviation to describe a distribution is only appropriate when the distribution is (roughly) symmetrical or 'normal'.

For a skewed distribution, the median is better than the mean as a measure of the centre of the distribution.

For a skewed distribution, the interquartile range is better than the standard deviation as a measure of spread. Another possibility is to use the range itself, but this suffers from the disadvantages that it depends on the sample size, and is based on the most extreme (and hence perhaps suspect) observations.
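A minimal sketch of the recommended summary for a skewed variable, with invented data, is as follows.

```python
# Summarizing a skewed variable: median and interquartile range rather than
# mean and standard deviation. The data are invented for illustration.
import numpy as np

hospital_stay = np.array([3, 4, 4, 5, 5, 6, 7, 9, 14, 28])   # days; right-skewed

median = np.median(hospital_stay)
q1, q3 = np.percentile(hospital_stay, [25, 75])
print(f"Median {median:.0f} days (interquartile range {q1:.0f}-{q3:.0f} days)")
```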
Do not confuse the standard deviation and the standard error. Standard deviation is a measure of spread. For a variable with a normal distribution, roughly 95 per cent of the values will lie within two standard deviations of the mean. This range of mean plus or minus two standard deviations is the 95 per cent normal range, and does not depend on the sample size. Standard error is a measure of the precision of an estimated parameter, frequently an estimated mean. The precision of an estimate depends on the size of the sample on which the estimate is based, so that a small sample leads to a large standard error and a large sample leads to a small standard error. Under many common circumstances, one can be 95 per cent confident that the true value of a population parameter lies within two standard errors of its estimated value. This range of estimate plus or minus two standard errors is the 95 per cent confidence interval for the parameter. For technical details of how to obtain a confidence interval in general, see Gardner and Altman10.
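The distinction can be made concrete with a short sketch; the data are illustrative only, and the factor of two is the rough approximation used above rather than an exact critical value.

```python
# Sketch distinguishing the standard deviation (describes spread; gives the
# 95 per cent normal range) from the standard error (describes the precision of
# the mean; gives the 95 per cent confidence interval). Data are illustrative only.
import numpy as np

x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9, 5.5, 5.4])
n = len(x)
mean = x.mean()
sd = x.std(ddof=1)            # sample standard deviation
se = sd / np.sqrt(n)          # standard error of the mean

print(f"95 per cent normal range:       {mean - 2*sd:.2f} to {mean + 2*sd:.2f}")
print(f"95 per cent confidence interval: {mean - 2*se:.2f} to {mean + 2*se:.2f}")
```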
Avoid the use of the '±' symbol. The symbol is both confusing, because it is often unclear whether a standard deviation or a standard error is being given, and misleading, because it implies that the range of a value plus or minus one standard deviation or error has some special significance. From the above it should be clear that in general plus or minus two standard deviations or errors should be used: the former for a 95 per cent normal range (for a normal population) and the latter, under many circumstances, for a 95 per cent confidence interval.
Avoid spurious precision. Derived quantities are frequently reported to many decimal places, e.g. 't = 3.5426' or 'P = 0.2564' or '40.74 per cent' (derived from 11 of 27). Such spurious precision detracts from the clarity of an exposition, and at least to me always suggests an uncritical transcription of a result from a computer printout to a manuscript. It is a clue to a statistical referee to look more critically at other aspects of the manuscript.

Percentages based on sample sizes of below 100 should always be given as integers, and for larger samples it suffices to give percentages to one decimal place. A sample mean should normally be given to one more significant figure than was used for the original measurements, and a standard deviation or error should be given to at most two more significant figures than was used for the original measurements.
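As a small illustration of sensible rounding, the following sketch (with invented numbers) reports a percentage and a P value without spurious precision.

```python
# Avoiding spurious precision when reporting derived quantities.
# The raw numbers are invented for illustration.
deaths, n = 11, 27
p_value = 0.2564

print(f"{100 * deaths / n:.0f} per cent ({deaths} of {n})")   # '41 per cent (11 of 27)' since n < 100
print(f"P = {p_value:.2f}")                                   # 'P = 0.26' rather than 'P = 0.2564'
```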
Analysis

Account for all subjects recruited into the study, and attempt to assess any bias which might have been introduced through poor response rates or drop-outs and losses to follow-up.

Explain clearly which methods were used for which analyses. Simple, standard methods do not need to be explained in detail, but it is important to know, for example, whether the 't test' mentioned in the methods section was paired or unpaired. Sometimes it is unclear which analyses were used for which comparisons, for example if both the t test and the Mann-Whitney test are mentioned in the methods section. This must be made explicit. Indeed it is often difficult to know where to set out the statistical methods, as a complex analysis does not fit in well with the conventional 'materials and methods, results and discussion' structure of a medical paper. For example, in a randomized study, a chance imbalance might be discovered in the age distribution of the study groups which needed to be incorporated into the analysis by way of a comparison 'adjusted' for age. How does one justify the method before mentioning the 'result' of the age imbalance which required the adjustment? My own preference in such circumstances is to give a broad overview in the methods section and then to set out the statistical methods in detail in the results section, as each is used for the first time.
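The following sketch, using scipy with invented data, makes the distinction explicit: a paired t test for 'before' and 'after' measurements on the same individuals, and an unpaired t test or Mann-Whitney U test for two independent groups.

```python
# Making explicit which test was used for which comparison: a paired t test for
# 'before'/'after' data on the same individuals; an unpaired (two sample) t test
# or a Mann-Whitney U test for two independent groups. Data are illustrative only.
from scipy import stats

before = [140, 152, 138, 147, 160, 155]
after = [132, 148, 135, 139, 151, 150]      # same six individuals
group_a = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3]
group_b = [5.9, 6.1, 5.4, 6.3, 5.8, 6.0]    # independent group

t_paired, p_paired = stats.ttest_rel(before, after)
t_unpaired, p_unpaired = stats.ttest_ind(group_a, group_b)
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b)

print(f"Paired t test:   t = {t_paired:.2f}, P = {p_paired:.2f}")
print(f"Unpaired t test: t = {t_unpaired:.2f}, P = {p_unpaired:.2f}")
print(f"Mann-Whitney:    U = {u_stat:.0f}, P = {p_mw:.2f}")
```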
Be aware of the assumptions underlying different tests, and do not make any assumptions which cannot be supported by the data. All statistical procedures are based on a number of underlying assumptions. Most procedures assume that the observations being analysed are independent, an assumption which would be violated for example in a study where three tissue samples were collected for each subject8. The two sample t test, as well as an assumption of independence, requires the distributions being compared to be (roughly) normal, and requires the spread (standard deviation) of the two groups to be similar.
Strive to limit the number of statistical tests which are performed, and if multiple comparisons are unavoidable, use an appropriate adjustment to control the 'type I' (false positive) error8. Every time that a hypothesis ('significance') test is performed, there is a risk of reaching a false positive conclusion. With the conventional threshold of P = 0.05, this risk is 5 per cent, i.e. one in 20. If a large number of tests are performed, it therefore becomes very likely that some false positive results will be obtained, which makes it difficult to interpret the results of multiple tests. The best approach is to identify a priori a primary hypothesis, and to give little weight to any secondary comparisons that are made. Such secondary findings might well generate interesting hypotheses to be pursued in further studies, but they cannot be regarded as being conclusive. If it is not possible to identify a primary hypothesis, an alternative is to apply a correction for the multiple comparisons, and to demand a far more extreme threshold for the P value before claiming that a difference has been established. For example, with the Bonferroni correction, the P value is multiplied by the number of comparisons being made, and a finding is only accepted as being 'significant' if this adjusted value lies below 0.05. However, this leads to tests which have very low sensitivity, and represents a suboptimal approach to research.
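A minimal sketch of the Bonferroni correction just described, with illustrative P values, follows.

```python
# Bonferroni correction as described above: each P value is multiplied by the
# number of comparisons and judged against 0.05. The P values are illustrative.
p_values = [0.012, 0.030, 0.004, 0.210, 0.048]
k = len(p_values)

for p in p_values:
    p_adjusted = min(1.0, p * k)            # adjusted P values are capped at 1
    verdict = "significant" if p_adjusted < 0.05 else "not significant"
    print(f"P = {p:.3f} -> adjusted P = {p_adjusted:.3f} ({verdict})")
```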
When reporting the results of a statistical test, quote the value of the test statistic as well as the P value. For example, 't = 1.34, 16 d.f., P = 0.20' might be reported. Wherever possible give the actual P value, to two decimal places, rather than quoting, say, P < 0.05. In particular, if the result is non-significant, always quote the actual P value rather than stating 'n.s.'. If the result is highly significant, it is sufficient to state that P < 0.001 rather than quoting, say, P = 0.00004.

Wherever possible present the results of a statistical comparison by reporting a confidence interval as well as or in place of the P value10.
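The sketch below, with invented data, produces a report in the form recommended above: the test statistic, degrees of freedom, the actual P value and a 95 per cent confidence interval for the difference in means, here based on the pooled standard error of a two sample t test.

```python
# Reporting a two sample comparison with the test statistic, degrees of freedom,
# actual P value and a 95 per cent confidence interval for the difference in
# means (using the pooled standard error). Data are illustrative only.
import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9, 5.5])
b = np.array([5.9, 6.1, 5.4, 6.3, 5.8, 6.0, 5.7, 6.2, 5.6])

t_stat, p_value = stats.ttest_ind(a, b)
df = len(a) + len(b) - 2

diff = a.mean() - b.mean()
pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / df
se_diff = np.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
t_crit = stats.t.ppf(0.975, df)

print(f"t = {t_stat:.2f}, {df} d.f., P = {p_value:.3f}")
print(f"Difference in means {diff:.2f} "
      f"(95 per cent CI {diff - t_crit * se_diff:.2f} to {diff + t_crit * se_diff:.2f})")
```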
Do not use statistical tests to compare the baseline characteristics of study groups, but rather use adjusted analyses to investigate the effects of such imbalances11. A minor imbalance in a key prognostic factor can have a profound effect on a treatment comparison, even when the imbalance falls well short of 'significance'. Also, a vast imbalance in a factor which is not related to outcome is of no relevance, even if the imbalance is highly significant.

Do not investigate 'subgroup' effects by literally making comparisons within subgroups of subjects. Test instead for an interaction, i.e. a treatment effect which varies systematically with the value of a characteristic such as age or stage of disease8.
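One way to test for such an interaction is to include a treatment-by-covariate term in a regression model. The sketch below uses the statsmodels formula interface in Python; the data frame, variable names and values are assumptions made purely for illustration.

```python
# Testing for an interaction (a treatment effect that varies with age) rather
# than running separate tests within age subgroups. The data frame and variable
# names here are illustrative assumptions only.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "outcome":   [4.2, 5.1, 3.9, 6.0, 5.5, 4.8, 6.2, 5.9, 4.4, 5.0],
    "treatment": [0, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    "age":       [45, 62, 51, 70, 66, 48, 73, 69, 55, 58],
})

model = smf.ols("outcome ~ treatment * age", data=df).fit()
print(model.summary())          # the 'treatment:age' term tests the interaction
```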
Interpretation

Do not confuse statistical significance with clinical relevance. Use confidence intervals to assess clinical relevance, especially when interpreting a 'negative' finding8.

Do not place undue emphasis on a secondary analysis, especially if the analysis was suggested by an inspection of the data8.

Be very wary of implying causality in anything other than a controlled randomized trial. Association does not imply causation.

Avoid any imprecise use of technical terms. For example, do not discuss the 'correlation' between two variables when what is meant is 'association'.

Point out and discuss any weaknesses in study design or execution. An honest evaluation of a study carries more weight than an attempt at concealment!

References

1. Murray GD. The task of a statistical referee. Br J Surg 1988; 75: 664-7.
2. Evans M. Presentation of manuscripts for publication in the British Journal of Surgery. Br J Surg 1989; 76: 1311-15.
3. Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals: amplifications and explanations. Ann Intern Med 1988; 108: 266-73.
4. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. Br Med J 1988; 296: 401-5.
5. Altman DG, Gore SM, Gardner MJ, Pocock SJ. Statistical guidelines for contributors to medical journals. In: Gardner MJ, Altman DG, eds. Statistics with Confidence. London: British Medical Journal Publications, 1989: 83-100.
6. Altman DG, Gore SM, Gardner MJ, Pocock SJ. Statistical guidelines for contributors to medical journals. Br Med J 1983; 286: 1489-93.
7. Bland M. An Introduction to Medical Statistics. Oxford: Oxford University Press, 1987.
8. Murray GD. Statistical aspects of research methodology. Br J Surg 1991; 78: 779-83.
9. Matthews JNS, Altman DG, Campbell MJ, Royston P. Analysis of serial measurements in medical research. Br Med J 1990; 300: 230-5.
10. Gardner MJ, Altman DG, eds. Statistics with Confidence. London: British Medical Journal Publications, 1989.
11. Altman DG, Dore CJ. Randomisation and baseline comparisons in clinical trials. Lancet 1990; 335: 149-53.

Paper accepted 20 January 1991