
Probable Error of a Correlation Coefficient

Author(s): Student
Source: Biometrika, Sep., 1908, Vol. 6, No. 2/3 (Sep., 1908), pp. 302-310
Published by: Oxford University Press on behalf of Biometrika Trust

Stable URL: https://www.jstor.org/stable/2331474

PROBABLE ERROR OF A CORRELATION
COEFFICIENT.

By STUDENT.

At the discussion of Mr R. H. Hooker's recent paper "The correlation of the
weather and crops" (Journ. Royal Stat. Soc. 1907) Dr Shaw made an enquiry
as to the significance of correlation coefficients derived from small numbers
of cases.

His question was answered by Messrs Yule and Hooker and Professor Edgeworth,
all of whom considered that Mr Hooker was probably safe in taking .50 as his
limit of significance for a sample of 21. They did not, however, answer Dr Shaw's
question in any more general way. Now Mr Hooker is not the only statistician
who is forced to work with very small samples, and until Dr Shaw's question has
been properly answered the results of such investigations lack the criterion which
would enable us to make full use of them. The present paper, which is an account
of some sampling experiments, has two objects: (1) to throw some light by empirical
methods on the problem itself, (2) to endeavour to interest mathematicians who
have both time and ability to solve it.

Before proceeding further, it may be as well to state the problem which occurs
in practice, for it is often confused with other allied questions.

A random sample has been obtained from an indefinitely large* population
and r† calculated between two variable characters of the individuals composing the
sample. We require the probability that R for the population from which the sample
is drawn shall lie between any given limits.

* Note that the indefinitely large population need not actually exist. In Mr Hooker's case his
sample was 21 years of farming under modern conditions in England, and included all the years about
which information was obtainable. Probably it could not actually have been made much larger
without loss of homogeneity, due to the mixing with farming under conditions not modern; but one
can imagine the population indefinitely increased and the 21 years to be a sample from this.
† Throughout the rest of this paper "r" is written for the correlation coefficient of a sample and R
for the correlation coefficient of a population.

It is clear that in order to solve this problem we must know two things: (1) the
distribution of values of r derived from samples of a population which has a given
R, and (2) the a priori probability that R for the population lies between any given
limits. Now (2) can hardly ever be known, so that some arbitrary assumption
must in general be made; when we know (1) it will be time enough to discuss
what will be the best assumption to make, but meanwhile I may suggest two
more or less obvious distributions. The first is that any value is equally likely
between +1 and −1, and the second that the probability that x is the value is
proportional to $1 - x^2$: this I think is more in accordance with ordinary experience:
the distribution of a priori probability would then be expressed by the
equation $y = \tfrac{3}{4}(1 - x^2)$.

But whatever assumption be made, it will be necessary to know (1), so that
the solution really turns on the distribution of r for samples drawn from the same
population. Now this has been determined for large samples with as much accuracy
as is required, for Pearson and Filon (Phil. Trans. Vol. 191 A, p. 229 et seq.) showed
that the standard deviation is $\dfrac{1 - r^2}{\sqrt{n}}$, and of course for large samples the distribution
is sure to be practically normal unless r is very close to unity. But their method
involves approximations which are not legitimate when the sample is small.
Besides this the distribution is not then normal, so that even if we had the standard
deviation a great deal would still remain unknown.

In order to throw some light on this question I took a correlation table*
containing 3000 cases of stature and length of left middle finger of criminals,
and proceeded to draw samples of four from this population†. This gave me
750 values of r for a population whose real correlation was .66. By taking the
statures of one sample with the middle finger lengths of the next sample I was
enabled to get 750 values of r for a population whose real correlation was zero.
Next I combined each of the samples of four with the tenth sample before it and
with the tenth sample after it, thus obtaining two sets of 750‡ values from samples
of 8, with real correlation .66 and zero.
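
The sampling scheme just described can be imitated by simulation. The sketch below is an illustration only and is not the author's procedure: it assumes a bivariate normal population in place of Macdonnell's grouped table, draws every sample independently, and the names `draw_pair`, `sample_correlations` and `mean_sd` are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def draw_pair(R, rng):
    """One (x, y) pair from a standard bivariate normal with correlation R.

    Assumption: a normal population stands in for the 3000 measured criminals.
    """
    u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return u, R * u + math.sqrt(1.0 - R * R) * v


def sample_correlations(R, n, trials, seed=1908):
    """Values of r from `trials` independent samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        xs, ys = zip(*(draw_pair(R, rng) for _ in range(n)))
        values.append(pearson_r(xs, ys))
    return values


def mean_sd(values):
    """Mean and standard deviation (divisor N) of a list of numbers."""
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


if __name__ == "__main__":
    for R in (0.0, 0.66):
        for n in (4, 8):
            m, s = mean_sd(sample_correlations(R, n, trials=20000))
            print(f"R = {R:.2f}, n = {n}: mean r = {m:.3f}, S.D. = {s:.3f}")
```

Because the simulated measurements are continuous, the grouping effects of a tabulated population (lumps at zero, indeterminate samples) do not arise here.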

Besides this empirical work it is possible to calculate a priori the distribution
for samples of two as follows.

For clearly the only values possible are +1 and −1, since two points must
always lie on the regression line which joins them§.

Next consider the correlation between the difference between the values of one
character in two successive individuals, and the difference between the values of
the other character in the same individuals. It is well known to be the same as
that between the values themselves, if the individuals be in random order.

* Biometrika, Vol. I. p. 219. W. R. Macdonnell.
† Biometrika, Vol. VI. p. 13. Student.
‡ Not strictly independent, but practically sufficiently nearly so. This method was adopted in order
to save arithmetic.
§ There are of course indeterminate cases when the values are the same for one character, but they
become rarer as we decrease the unit of grouping until with an infinitesimal unit of grouping the
statement in the text is true.

Also, if an indefinitely large number of such differences be taken, it is clear
that the means of the distributions will have the value zero. Hence, if the
correlation be determined from a fourfold division through zero we can apply
Mr Sheppard's* result that if A and B be the numbers in the large and the
small divisions of the table respectively, $\cos\dfrac{\pi B}{A + B} = R$, where R is the correlation
of the original system.

But if a pair of individuals whose difference falls in either of the small divisions
be considered to be a random sample of 2, their r will be found to be −1, while
that of a pair whose difference falls in one of the large divisions is +1. Hence
the distribution of r for samples of 2 is AN at +1, and BN at −1, where A + B = 1
and $B = \dfrac{\cos^{-1} R}{\pi}$.

When R = 0, there is of course even division, half the values being at +1 and half at
−1; when R = .66, $B = \dfrac{\cos^{-1} .66}{\pi} = .271$, therefore A = .729, and the mean is at
.729 − .271 = .458. The S.D. = $\sqrt{1 - (.458)^2}$ = .889. It is noteworthy that the mean
value is considerably less than R.

I have dealt with the cases of samples of 2 at some length, because it is possible
that this limiting value of the distribution, with its mean of $\dfrac{2}{\pi}\sin^{-1} R$ and its second
moment coefficient of $1 - \left(\dfrac{2}{\pi}\sin^{-1} R\right)^2$, may furnish a clue to the distribution when
n is greater than 2.
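
The arithmetic for samples of 2 can be checked in a few lines. This is a minimal sketch under the text's own relations (B = cos⁻¹R / π, A = 1 − B); the function name `samples_of_two` is hypothetical.

```python
import math


def samples_of_two(R):
    """Return (A, B, mean, sd) of the two-point distribution of r when n = 2."""
    B = math.acos(R) / math.pi        # proportion of pairs giving r = -1
    A = 1.0 - B                       # proportion of pairs giving r = +1
    mean = A - B                      # algebraically (2 / pi) * asin(R)
    sd = math.sqrt(1.0 - mean ** 2)   # r**2 is always 1, so mu2 = 1 - mean**2
    return A, B, mean, sd


A, B, mean, sd = samples_of_two(0.66)
print(f"A = {A:.3f}, B = {B:.3f}, mean = {mean:.3f}, S.D. = {sd:.3f}")
```

Carried to more decimals the mean comes out a little above the printed .458, which is the difference of the rounded values .729 and .271.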

Besides these series, I have another shorter one of 100 values of r from samples
of 30, when the real value is .66. The distributions of the various trials are given
in the table.

Several peculiarities will be noticed which are due to the effects of grouping,
particularly in the samples of 4. Firstly, there is a lump at zero; with such small
numbers zero is not an uncommon value of the product moment and then, whatever
the values of the standard deviations, r = 0.

Next there are five indeterminate cases in each of the distributions for samples
of 4. These are due to the whole sample falling in the same group for one variable.
In such a case, both the Standard Deviation and the product moment vanish and
r is indeterminate.

Lastly, with such small samples one cannot use Sheppard's corrections for the
Standard Deviations, as r often becomes greater than unity. So I did not use
the corrections except in the case of the samples of 30, yet on the whole the values
of the Standard Deviations are no doubt too large. This does not much affect the
values of r in the neighbourhood of zero, but there is a tendency for larger values
to come too low, so that there is a deficiency of cases towards 1 and −1. This
introduces an error into the Standard Deviation of all the series to some extent,
but of course the mean is unaltered when there is no correlation. The series for
samples of 4 are affected more than those from samples of 8, as the mean Standard
Deviation of samples of 4 is the smaller, so that the unit of grouping is comparatively
larger.

* Phil. Trans. A. Vol. cxcii. p. 141.

[Table: distributions of r in the five series (samples of 4 and of 8 with real correlation zero and .66, and samples of 30 with real correlation .66). The entries are not legible in this copy.]

The moment coefficients of the five distributions were determined, and the
following values found*:

                            Mean     S.D.     μ2        μ3          μ4         β1       β2
Samples of 4  (R = 0)       -        .5512    .3038     -           .1768      -        1.918
Samples of 8  (R = 0)       -        .3731    .1392     -           .0454      -        2.336
Samples of 4  (R = .66)     .5609    .4680    .2190     −.1570      .2152      2.245    4.489
Samples of 8  (R = .66)     .6139    .2684    .07202    −.02634     .02714     1.857    5.232
Samples of 30 (R = .66)     .661     .1001    .01003    −.000882    .000461    .7713    4.580

Considering first the "no correlation" distributions I attempted to fit a Pearson
curve to the first of them. As might be expected, the range proved limited and
as symmetry had been assumed in calculating the moments, a Type II curve
resulted. The equation was $y = y_0\left(1 - \dfrac{x^2}{1.076}\right)^{.272}$, the range of which is 2.074.

Now the real range is clearly 2, and only a very small alteration in $\beta_2$ is
required to make the value of the index zero. Consequently the equation
$y = y_0 (1 - x^2)^0$ was suggested. This means an even distribution of r between 1
and −1, with S.D. = .5774 ± .010 vice .5512 actual, $\mu_2$ = .3333 ± .0116 vice .3038,
$\mu_4$ = .2000 ± .016 vice .1768 and $\beta_2$ = 1.800 ± .12 vice 1.918, all values as close as
could perhaps be expected considering that the grouping must make both
$\mu_2$ and $\mu_4$ too low.

Working from $y = y_0 (1 - x^2)^0$ for samples of 4 I guessed the formula
$y = y_0 (1 - x^2)^{\frac{n-4}{2}}$ and proceeded to calculate the moments.


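By using the transformation $x = \sin\theta$ we get $y = y_0 \cos^{n-4}\theta$, $dx = \cos\theta\,d\theta$,

$$2\int_0^1 y\,dx = 2y_0\int_0^{\pi/2} \cos^{n-3}\theta\,d\theta,$$

$$2\int_0^1 x^2 y\,dx = 2y_0\int_0^{\pi/2} \cos^{n-3}\theta\,d\theta - 2y_0\int_0^{\pi/2} \cos^{n-1}\theta\,d\theta,$$

and so on.
Whence

$$\mu_2 = \frac{1}{n-1}, \qquad \mu_4 = \frac{3}{(n-1)(n+1)}, \qquad \beta_2 = \frac{3(n-1)}{n+1} = 3 - \frac{6}{n+1}.$$

* In the cases of no correlation the moments were t[...] distribution.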


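Putting n = 8 we get the equation $y = y_0 (1 - x^2)^2$, and

$\mu_2 = \frac{1}{7}$ = .1429 ± .0050 instead of .1392 actual,
$\mu_4 = \frac{1}{21}$ = .0476 ± .0038 instead of .0454 actual,
S.D. = .3780 ± .0066 instead of .3731 actual,
$\beta_2 = 3 - \frac{6}{9}$ = 2.333 ± .012 instead of 2.336 actual.

The equation calculated from the actual moments is $y = y_0\left(1 - \dfrac{x^2}{.9802}\right)^{2.021}$, whence
the calculated range is 1.98, whereas it is known to be 2.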
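
The moment formulas and the two fitted Type II curves can be checked numerically. The sketch below is illustrative: it integrates $y = y_0(1 - x^2)^{(n-4)/2}$ by Simpson's rule, and fits the Type II curve from $\mu_2$ and $\beta_2$ with the relations $a^2 = 2\mu_2\beta_2/(3 - \beta_2)$ and $m = (5\beta_2 - 9)/(2(3 - \beta_2))$ used elsewhere in this paper for the curve $y = y_0(1 - x^2/a^2)^m$. The function names are hypothetical.

```python
import math


def simpson(f, a, b, steps=2000):
    """Composite Simpson rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def numeric_moments(n):
    """mu2, mu4, beta2 of y = y0 (1 - x^2)^((n - 4) / 2) on [-1, 1], n >= 4."""
    power = (n - 4) / 2

    def y(x):
        return max(0.0, 1.0 - x * x) ** power

    area = simpson(y, -1.0, 1.0)
    mu2 = simpson(lambda x: x ** 2 * y(x), -1.0, 1.0) / area
    mu4 = simpson(lambda x: x ** 4 * y(x), -1.0, 1.0) / area
    return mu2, mu4, mu4 / mu2 ** 2


def closed_form_moments(n):
    """The closed forms 1/(n-1), 3/((n-1)(n+1)), 3 - 6/(n+1)."""
    return 1 / (n - 1), 3 / ((n - 1) * (n + 1)), 3 - 6 / (n + 1)


def type_ii_fit(mu2, beta2):
    """a^2 and index m of the symmetrical curve y = y0 (1 - x^2/a^2)^m."""
    a2 = 2 * mu2 * beta2 / (3 - beta2)
    m = (5 * beta2 - 9) / (2 * (3 - beta2))
    return a2, m


if __name__ == "__main__":
    for n in (4, 8):
        print(n, numeric_moments(n), closed_form_moments(n))
    # Observed moments for the two "no correlation" series.
    for label, mu2, beta2 in (("samples of 4", 0.3038, 1.918),
                              ("samples of 8", 0.1392, 2.336)):
        a2, m = type_ii_fit(mu2, beta2)
        print(f"{label}: a^2 = {a2:.4f}, m = {m:.3f}, range = {2 * math.sqrt(a2):.3f}")
```

Small differences in the last figure from the printed constants are to be expected, since the observed moments are entered here only to four places.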
The following tables compare the actual distributions with those calculated from
the equations.

Distribution of r from samples of 4 compared with the equation $y = y_0 (1 - x^2)^0$

(Thirteen compartments, numbered in the order printed; the class limits of r are not legible in this copy.)

Compartment    1     2      3     4     5    6    7    8    9    10   11    12     13
Actual         64    45½    55½   67    59   62   63   58   60   64   51½   41½    54
Calculated     65    56     56    56    56   56   56   56   56   56   56    56     65
Difference     −1    −10½   −½    +11   +3   +6   +7   +2   +4   +8   −4½   −14½   −11

From this we get $\chi^2$ = 13.30, P = .34. It will however be noticed that the grouping
has caused all the middle compartments to contain more than the calculated number, as
pointed out above.

Distribution of r from samples of 8 compared with the equation $y = \dfrac{750 \times 15}{16}(1 - x^2)^2$

(Thirteen compartments, numbered in the order printed; the class limits of r are not legible in this copy.)

Compartment    1     2     3    4    5    6      7     8      9      10   11    12    13
Actual         2     27    44   60   96   114½   103   85     98½    65   37½   14½   3
Calculated     4½    20½   43   67   87   100½   105   100½   87     67   43    20½   4½
Difference     −2½   +6½   +1   −7   +9   +14    −2    −15½   +11½   −2   −5½   −6    −1½

whence $\chi^2$ = 13.94, P = .30.
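
The two goodness-of-fit figures can be recomputed from the tabulated frequencies. The sketch below rests on two assumptions that are not stated in the paper: that the half-units lost in this copy are restored so that the actual frequencies agree with the printed differences and totals, and that P is read with 12 degrees of freedom (13 compartments less one). The names `chi_square` and `upper_tail_even_df` are hypothetical.

```python
import math

# Frequencies as read from the two tables (half-units restored, see above).
ACTUAL_4 = [64, 45.5, 55.5, 67, 59, 62, 63, 58, 60, 64, 51.5, 41.5, 54]
CALC_4 = [65] + [56] * 11 + [65]
ACTUAL_8 = [2, 27, 44, 60, 96, 114.5, 103, 85, 98.5, 65, 37.5, 14.5, 3]
CALC_8 = [4.5, 20.5, 43, 67, 87, 100.5, 105, 100.5, 87, 67, 43, 20.5, 4.5]


def chi_square(actual, calculated):
    """Pearson's chi-square: sum of (actual - calculated)^2 / calculated."""
    return sum((a - c) ** 2 / c for a, c in zip(actual, calculated))


def upper_tail_even_df(x, df):
    """P(chi-square > x) for an even number of degrees of freedom.

    Uses the finite series exp(-x/2) * sum_{j < df/2} (x/2)^j / j!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    for label, actual, calc in (("samples of 4", ACTUAL_4, CALC_4),
                                ("samples of 8", ACTUAL_8, CALC_8)):
        x = chi_square(actual, calc)
        p = upper_tail_even_df(x, df=len(actual) - 1)
        print(f"{label}: chi-square = {x:.2f}, P = {p:.2f}")
```

The figure for samples of 8 need not agree with the printed one to the second decimal, since the calculated frequencies are given only to the nearest half.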



In this case the grouping has had less influence and the largest contributions
to $\chi^2$ (in the second, sixth, eighth, and twelfth compartments) are due to
differences of opposite sign on opposite sides, and may therefore be supposed to be
entirely due to random sampling.

My equation then fits the two series of empirical results about as well as could
be expected. I will now show that it is in accordance with the two theoretical
cases n "large" and n = 2, for $\sigma = \dfrac{1}{\sqrt{n-1}}$, which approximates sufficiently closely to
Pearson and Filon's $\dfrac{1}{\sqrt{n}}$ when r = 0 and n is large. Also when n is large $\beta_2 = 3 - \dfrac{6}{n+1}$
becomes 3 and the distribution is normal.

And if n = 2, the equation becomes $y = y_0 (1 - x^2)^{-1}$ *, where

$$y_0 = \frac{N}{2\int_0^1 (1 - x^2)^{-1}\,dx}.$$

Put $x = \sin\theta$. Then $dx = \cos\theta\,d\theta$,

$$y_0 = \frac{N}{2\int_0^{\pi/2} \sec\theta\,d\theta} = \frac{N}{\infty} = 0,$$

i.e. there is no frequency except where $(1 - x^2)^{-1}$ is infinite; all the frequency is
equally divided between x = 1 and x = −1, which we know to be actually the case.
Consequently I believe that the equation $y = y_0 (1 - x^2)^{\frac{n-4}{2}}$ probably represents the
theoretical distribution of r when samples of n are drawn from a normally distributed
population with no correlation. Even if it does not do so, I am sure that it
will give a close approximation to it.

Let us consider Mr Hooker's limit of .50 in the light of this equation. For
21 cases the equation becomes $y = y_0 \cos^{17}\theta$, with $x = \sin\theta$, and the proportion of the area lying
beyond x = ± .50 will be

$$\frac{\int_{\pi/6}^{\pi/2} \cos^{18}\theta\,d\theta}{\int_0^{\pi/2} \cos^{18}\theta\,d\theta}.$$

I find this to be .02099, or we may expect to find one case in 50 occurring
outside the limits ± .50 when there is no correlation and the sample numbers 21.
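
The proportion for Mr Hooker's case can be reproduced by numerical integration. This is a minimal sketch, assuming the proposed curve $y = y_0(1 - x^2)^{(n-4)/2}$ and the substitution $x = \sin\theta$, so that the integrand is $\cos^{n-3}\theta$; Simpson's rule is used in place of whatever method of quadrature the author employed, and the name `tail_proportion` is hypothetical.

```python
import math


def simpson(f, a, b, steps=2000):
    """Composite Simpson rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def tail_proportion(n, limit):
    """Proportion of the curve lying beyond +limit and -limit together.

    With x = sin(theta), y dx = y0 * cos(theta)**(n - 3) d(theta); the curve is
    symmetrical, so the ratio of the two half-range integrals is the answer.
    """
    k = n - 3

    def integrand(theta):
        return math.cos(theta) ** k

    lower = math.asin(limit)
    return (simpson(integrand, lower, math.pi / 2)
            / simpson(integrand, 0.0, math.pi / 2))


p = tail_proportion(21, 0.50)
print(f"proportion beyond +/- .50 for n = 21: {p:.5f}")
```

The same function gives the corresponding proportion for any other sample size or limit, for example `tail_proportion(10, 0.50)`.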

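* If a Pearson curve be fitted to the distribution whose moment coefficients are $\mu_2 = 1 = \mu_4$ and
$\mu_3 = 0$ we have $\beta_2 = 1$, $\beta_1 = 0$, hence the curve must be of Type II, and the equation is given by
$y = y_0\left(1 - \dfrac{x^2}{a^2}\right)^m$ where $a^2 = \dfrac{2\mu_2\beta_2}{3 - \beta_2} = 1$ and $m = \dfrac{5\beta_2 - 9}{2(3 - \beta_2)} = -1$, or $y = y_0 (1 - x^2)^{-1}$,
agreeing with the general formula.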


When however there is correlation, I cannot suggest an equation which will
accord with the facts, but as I have spent a good deal of time over the problem
I will point out some of the necessities of the case.

(1) With small samples the mean value certainly lies nearer to zero than the real
value of R, e.g.

samples of 2: mean at $\dfrac{2}{\pi}\sin^{-1} R$,
samples of 4 (real value .66): mean at .561* ± .011,
samples of 8 (real value .66): mean at .614† ± .0065.

But with samples of 30 (real value .66) mean at .6609 ± .0067 shows that the mean
value approaches the real value comparatively rapidly.

(2) The standard deviation is larger than accords with Pearson and Filon's formula
even if we give r the mean value for samples of the size taken, e.g. for
samples of 2,

$$\text{S.D.} = \sqrt{1 - \left(\frac{2}{\pi}\sin^{-1} R\right)^2}.$$

For samples of 4, calculated‡ .3957 ± .0069; actual .4680,
for samples of 8, calculated .2355 ± .0041; actual .2684.

But samples of 30, calculated .1046 ± .0018, actual .1001, again show that with
samples as large as 30 the ordinary formula is justified.
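
The "calculated" standard deviations above can be recovered from the observed means. The following sketch is a check only. It assumes the divisor $\sqrt{n-1}$, because that is what reproduces the three printed figures; the exact form of the formula is not legible in this copy, and the name `ordinary_sd` is hypothetical.

```python
import math


def ordinary_sd(r, n):
    """Standard deviation of r by the large-sample formula.

    Assumption: the divisor is sqrt(n - 1); with sqrt(n) the printed
    values .3957, .2355 and .1046 are not reproduced.
    """
    return (1.0 - r * r) / math.sqrt(n - 1)


# (sample size, observed mean r, observed S.D.) for the series with R = .66
SERIES = [(4, 0.5609, 0.4680), (8, 0.6139, 0.2684), (30, 0.6609, 0.1001)]

for n, mean_r, actual_sd in SERIES:
    calc = ordinary_sd(mean_r, n)
    print(f"n = {n:2d}: calculated {calc:.4f}, actual {actual_sd:.4f}, "
          f"excess {actual_sd - calc:+.4f}")
```

For samples of 4 and 8 the actual value exceeds the calculated one, while for samples of 30 the two nearly agree, which is the point made in the text.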

(3) When there was no correlation the range found by fitting a Pearson curve
to the distribution was accurately 2 in the theoretical case of samples of two, and
well within the probable error for empirical distributions of samples of 4 and 8.
But when we have correlation this process does not give the range closely for the
empirical distribution (samples of 4 give 2.137, samples of 8 2.699, samples of
30 infinity) and the range calculated from samples of 2, which is

$$\frac{2\sqrt{4 + 3\mu_2 + 18\mu_2^2 - 9\mu_2^3}}{3 + \mu_2}$$

(where $\mu_2 = 1 - \left(\dfrac{2}{\pi}\sin^{-1} R\right)^2$), is always less than 2 except in the case $\mu_2 = 1$,
i.e. when there is no correlation.

Hence the distribution probably cannot be represented by any of Prof. Pearson's
types of frequency curve unless R = 0.
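
The statement about the range for samples of 2 can be tried numerically. The sketch below is an illustration under a stated assumption: the range expression is taken as $2\sqrt{4 + 3\mu_2 + 18\mu_2^2 - 9\mu_2^3}/(3 + \mu_2)$, as far as the printed formula can be read, with $\mu_2 = 1 - \left(\frac{2}{\pi}\sin^{-1}R\right)^2$. The function names are hypothetical.

```python
import math


def mu2_samples_of_two(R):
    """Second moment coefficient of r about its mean for samples of 2."""
    return 1.0 - ((2.0 / math.pi) * math.asin(R)) ** 2


def fitted_range(mu2):
    """Range of the fitted curve, using the expression as read (an assumption)."""
    return 2.0 * math.sqrt(4 + 3 * mu2 + 18 * mu2 ** 2 - 9 * mu2 ** 3) / (3 + mu2)


for R in (0.0, 0.2, 0.4, 0.66, 0.9):
    mu2 = mu2_samples_of_two(R)
    print(f"R = {R:.2f}: mu2 = {mu2:.4f}, range = {fitted_range(mu2):.4f}")
```

On this reading the range equals 2 only at R = 0 and falls short of 2 for every other value tried, in agreement with the text.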

(4) The distribution is skew with a tail towards zero.

* The value must be slightly larger than this (perhaps even by .03) as Sheppard's corrections were
not used.
† Again higher, but not by more than .02.
‡ $\dfrac{1 - r^2}{\sqrt{n-1}}$, where r is taken as the mean value for the size of the sample. If we took the real
value R, the difference would be even greater.

(5) To sum up: If $y = \phi(x, R, n)$ be the equation, it must satisfy the following
requirements. If R = 1, 1 is the only value of x which gives a value of y other
than zero. If n = 2, ±1 are the only values of x to do so. If R = 0 the equation
probably reduces to $y = y_0 (1 - x^2)^{\frac{n-4}{2}}$.

Conclusions.

It has been shown that when there is no correlation between two normally
distributed variables $y = y_0 (1 - x^2)^{\frac{n-4}{2}}$ gives fairly closely the distribution of r found
from samples of n.

Next, the general problem has been stated and three distributions of r have
been given which show the sort of variation which occurs. I hope they may serve
as illustrations for the successful solver of the problem.
