Understanding Correlation Basics
In many situations two variables x and y are related in such a way that a change in one variable
influences a change in the other. Such a relationship is called correlation (or covariation).
Correlation describes the linear relationship between two variables x and y.
If x and y increase or decrease together, the correlation is said to be positive. If an increase
(or decrease) in x corresponds to a decrease (or increase) in y, the correlation is said to be
negative. If no such relationship is indicated between the variables x and y, they are said to be
uncorrelated.
Let $x_1, x_2, x_3, \ldots, x_n$ be n values of variable x and $y_1, y_2, y_3, \ldots, y_n$ be the
corresponding values of variable y. Then the measure of correlation (called the coefficient of
correlation) is defined by the relation
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n\,\sigma_x \sigma_y}$$ ---(i)
where $\bar{x}$ is the mean of the x-series, $\bar{y}$ is the mean of the y-series, $\sigma_x$ is the
standard deviation of the x-series and $\sigma_y$ is the standard deviation of the y-series.
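Let $X_i = x_i - \bar{x}$ be the deviation from the mean $\bar{x}$ and $Y_i = y_i - \bar{y}$ the
deviation from the mean $\bar{y}$. Then we can write (i) as
$$r = \frac{\sum X_i Y_i}{n\,\sigma_x \sigma_y}$$ ---(ii)
or
$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$ ---(iii)
where $\sigma_{xy} = \frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum X_i Y_i$ is called
the covariance of x and y.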
Note:
(1) $\bar{x} = \frac{1}{n}\sum x_i$, $\bar{y} = \frac{1}{n}\sum y_i$
(2) $\sigma_x^2 = \frac{1}{n}\sum (x_i - \bar{x})^2 = \frac{1}{n}\sum X_i^2$ and
$\sigma_y^2 = \frac{1}{n}\sum (y_i - \bar{y})^2 = \frac{1}{n}\sum Y_i^2$ are called the variances of x
and y respectively.
(3) The coefficient of correlation is also known as Karl Pearson’s coefficient of correlation.
(4) $-1 \le r \le 1$
(5) If $r = \pm 1$, we say that x and y are perfectly correlated.
(6) If $r = 0$, we say that x and y are non-correlated (uncorrelated).
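As a minimal sketch of definitions (i) to (iii), the following Python function computes the covariance and the coefficient of correlation with the divisor n, as in these notes. The function name `covariance_and_r` and the sample data are illustrative assumptions, not part of the original text.

```python
from math import sqrt


def covariance_and_r(xs, ys):
    """Return (covariance, r) using the divide-by-n formulas of the notes."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Deviations from the means: X_i = x_i - x_bar, Y_i = y_i - y_bar
    X = [x - x_bar for x in xs]
    Y = [y - y_bar for y in ys]
    cov = sum(a * b for a, b in zip(X, Y)) / n      # sigma_xy
    sigma_x = sqrt(sum(a * a for a in X) / n)
    sigma_y = sqrt(sum(b * b for b in Y) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("r is undefined when one series is constant")
    return cov, cov / (sigma_x * sigma_y)           # formula (iii)


# Hypothetical data: y = 2x + 1 is an exact increasing linear relation,
# so r should equal 1 up to floating-point rounding.
cov, r = covariance_and_r([1, 2, 3, 4], [3, 5, 7, 9])
print(cov, r)
```

Reversing the order of one list turns the relation into a decreasing one, and r then comes out as -1, which matches notes (4) and (5).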
Problems
1. The following table gives the age (in years) of 10 couples. Calculate the covariance and the
coefficient of correlation between these ages.
age of husband (x): 23 27 28 29 30 31 33 35 36 39
age of wife (y): 18 22 23 24 25 26 28 29 30 32
Solution: $n = 10$, $\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{10}(311) = 31.1$,
$\bar{y} = \frac{1}{n}\sum y_i = \frac{1}{10}(257) = 25.7$

$x_i$    $y_i$    $X_i = x_i - \bar{x}$    $Y_i = y_i - \bar{y}$    $X_i^2$    $Y_i^2$    $X_i Y_i$
23       18       -8.1                     -7.7                     65.61      59.29      62.37
27       22       -4.1                     -3.7                     16.81      13.69      15.17
28       23       -3.1                     -2.7                      9.61       7.29       8.37
29       24       -2.1                     -1.7                      4.41       2.89       3.57
30       25       -1.1                     -0.7                      1.21       0.49       0.77
31       26       -0.1                      0.3                      0.01       0.09      -0.03
33       28        1.9                      2.3                      3.61       5.29       4.37
35       29        3.9                      3.3                     15.21      10.89      12.87
36       30        4.9                      4.3                     24.01      18.49      21.07
39       32        7.9                      6.3                     62.41      39.69      49.77
Total                                                              202.9      158.1      178.3

$\sigma_x^2 = \frac{1}{n}\sum X_i^2 = \frac{1}{10}(202.9) = 20.29 \Rightarrow \sigma_x = 4.5044$
$\sigma_y^2 = \frac{1}{n}\sum Y_i^2 = \frac{1}{10}(158.1) = 15.81 \Rightarrow \sigma_y = 3.9762$
Covariance of x and y is $\sigma_{xy} = \frac{1}{n}\sum X_i Y_i = \frac{1}{10}(178.3) = 17.83$
The coefficient of correlation between x and y is
$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{17.83}{(4.5044)(3.9762)} = 0.9955$
So the correlation is almost perfect, i.e., in the given data, the ages of husbands and wives are
almost perfectly correlated.
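The hand computation for Problem 1 can be cross-checked with a short script. This is only a sketch: it uses `mean` and `pstdev` (the divide-by-n standard deviation) from Python's standard `statistics` module, and the variable names are chosen here for illustration.

```python
from statistics import mean, pstdev

# Ages from Problem 1
husband = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]
wife = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]

n = len(husband)
x_bar, y_bar = mean(husband), mean(wife)

# Covariance with divisor n, as in formula (iii)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(husband, wife)) / n

# pstdev is the population standard deviation (divisor n), matching the notes
r = cov_xy / (pstdev(husband) * pstdev(wife))

print(round(x_bar, 1), round(y_bar, 1), round(cov_xy, 2), round(r, 4))
```

The rounded values agree with the means, covariance and coefficient of correlation obtained in the worked solution.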
2. Find the coefficient of correlation between industrial production and export (both in crore tons)
using the following data:
production (x): 55 56 58 59 60 60 62
export (y): 35 38 38 39 44 43 45
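Solution: $n = 7$, $\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{7}(410) = 58.5714$,
$\bar{y} = \frac{1}{n}\sum y_i = \frac{1}{7}(282) = 40.2857$

$x_i$    $y_i$    $X_i = x_i - \bar{x}$    $Y_i = y_i - \bar{y}$    $X_i^2$    $Y_i^2$    $X_i Y_i$
55       35       -3.5714                  -5.2857                  12.7549    27.9386    18.8773
56       38       -2.5714                  -2.2857                   6.6121     5.2244     5.8774
58       38       -0.5714                  -2.2857                   0.3265     5.2244     1.306
59       39        0.4286                  -1.2857                   0.1837     1.653     -0.5511
60       44        1.4286                   3.7143                   2.0409    13.796      5.3062
60       43        1.4286                   2.7143                   2.0409     7.3674     3.8776
62       45        3.4286                   4.7143                  11.7553    22.2246    16.1634
Total                                                               35.7143    83.4284    50.8568

$\sigma_x^2 = \frac{1}{n}\sum X_i^2 = \frac{1}{7}(35.7143) = 5.102 \Rightarrow \sigma_x = 2.2588$
$\sigma_y^2 = \frac{1}{n}\sum Y_i^2 = \frac{1}{7}(83.4284) = 11.9183 \Rightarrow \sigma_y = 3.4523$
Covariance of x and y is $\sigma_{xy} = \frac{1}{n}\sum X_i Y_i = \frac{1}{7}(50.8568) = 7.2653$
The coefficient of correlation between x and y is
$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{7.2653}{(2.2588)(3.4523)} = 0.9317$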
3. Psychological tests of intelligence and computational ability were applied to ten children.
Following is the record showing intelligence ratio (IR) and ability ratio (AR). Calculate the
coefficient of correlation.
IR (x): 105 104 102 101 100 99 98 96 95 94
AR (y): 101 103 100 98 95 96 104 97 97 96
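Solution: $n = 10$, $\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{10}(994) = 99.4$,
$\bar{y} = \frac{1}{n}\sum y_i = \frac{1}{10}(987) = 98.7$

$x_i$    $y_i$    $X_i = x_i - \bar{x}$    $Y_i = y_i - \bar{y}$    $X_i^2$    $Y_i^2$    $X_i Y_i$
105      101       5.6                      2.3                     31.36       5.29      12.88
104      103       4.6                      4.3                     21.16      18.49      19.78
102      100       2.6                      1.3                      6.76       1.69       3.38
101       98       1.6                     -0.7                      2.56       0.49      -1.12
100       95       0.6                     -3.7                      0.36      13.69      -2.22
 99       96      -0.4                     -2.7                      0.16       7.29       1.08
 98      104      -1.4                      5.3                      1.96      28.09      -7.42
 96       97      -3.4                     -1.7                     11.56       2.89       5.78
 95       97      -4.4                     -1.7                     19.36       2.89       7.48
 94       96      -5.4                     -2.7                     29.16       7.29      14.58
Total                                                              124.4       88.1       54.2

$\sigma_x^2 = \frac{1}{n}\sum X_i^2 = \frac{1}{10}(124.4) = 12.44 \Rightarrow \sigma_x = 3.527$
$\sigma_y^2 = \frac{1}{n}\sum Y_i^2 = \frac{1}{10}(88.1) = 8.81 \Rightarrow \sigma_y = 2.9682$
Covariance of x and y is $\sigma_{xy} = \frac{1}{n}\sum X_i Y_i = \frac{1}{10}(54.2) = 5.42$
The coefficient of correlation between x and y is
$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{5.42}{(3.527)(2.9682)} = 0.5177$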
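Solution: $n = 6$, $\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{6}(120) = 20$,
$\bar{y} = \frac{1}{n}\sum y_i = \frac{1}{6}(126) = 21$

$x_i$    $y_i$    $X_i = x_i - \bar{x}$    $Y_i = y_i - \bar{y}$    $X_i^2$    $Y_i^2$    $X_i Y_i$
10       18       -10                      -3                       100          9         30
14       12        -6                      -9                        36         81         54
18       24        -2                       3                         4          9         -6
22        6         2                     -15                         4        225        -30
26       30         6                       9                        36         81         54
30       36        10                      15                       100        225        150
Total                                                               280        630        252

$\sigma_x^2 = \frac{1}{n}\sum X_i^2 = \frac{1}{6}(280) = 46.6667 \Rightarrow \sigma_x = 6.8313$
$\sigma_y^2 = \frac{1}{n}\sum Y_i^2 = \frac{1}{6}(630) = 105 \Rightarrow \sigma_y = 10.247$
Covariance of x and y is $\sigma_{xy} = \frac{1}{n}\sum X_i Y_i = \frac{1}{6}(252) = 42$
The coefficient of correlation between x and y is
$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{42}{(6.8313)(10.247)} = 0.6$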
Alternate formula to compute coefficient of correlation (r)
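If $z = ax + by$ and r is the coefficient of correlation between x and y, show that
$$\sigma_z^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2abr\,\sigma_x\sigma_y \quad \text{or} \quad r = \frac{\sigma_z^2 - a^2\sigma_x^2 - b^2\sigma_y^2}{2ab\,\sigma_x\sigma_y}.$$
Since $\bar{z} = a\bar{x} + b\bar{y}$, consider
$$z_i - \bar{z} = a(x_i - \bar{x}) + b(y_i - \bar{y})$$
$$(z_i - \bar{z})^2 = a^2(x_i - \bar{x})^2 + b^2(y_i - \bar{y})^2 + 2ab(x_i - \bar{x})(y_i - \bar{y})$$
$$\sum (z_i - \bar{z})^2 = a^2\sum (x_i - \bar{x})^2 + b^2\sum (y_i - \bar{y})^2 + 2ab\sum (x_i - \bar{x})(y_i - \bar{y})$$
Using the definitions of variance and relation (i),
$$n\sigma_z^2 = a^2 n\sigma_x^2 + b^2 n\sigma_y^2 + 2abr\,n\,\sigma_x\sigma_y$$
$$\sigma_z^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2abr\,\sigma_x\sigma_y \quad \text{or} \quad r = \frac{\sigma_z^2 - a^2\sigma_x^2 - b^2\sigma_y^2}{2ab\,\sigma_x\sigma_y}.$$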
Note:
1. $\sigma_{ax+by}^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2abr\,\sigma_x\sigma_y$
4. The above formula is useful for computing $r$, $\sigma_x$ and $\sigma_y$ when the variance of
$ax + by$ is given (as in the following problems).
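The identity for the variance of $z = ax + by$ can be checked numerically. The sketch below reuses the six data pairs from the last worked correlation problem (for which r = 0.6); the coefficients `a = 2`, `b = -3` and the helper names `pvar` and `pcorr` are hypothetical choices made only for this illustration.

```python
from math import sqrt


def pvar(values):
    """Variance with divisor n, as used throughout these notes."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def pcorr(xs, ys):
    """Coefficient of correlation r = sigma_xy / (sigma_x * sigma_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cov / sqrt(pvar(xs) * pvar(ys))


xs = [10, 14, 18, 22, 26, 30]
ys = [18, 12, 24, 6, 30, 36]
a, b = 2, -3  # hypothetical coefficients; any non-zero pair works

zs = [a * x + b * y for x, y in zip(xs, ys)]
r = pcorr(xs, ys)

lhs = pvar(zs)  # sigma_z^2 computed directly from the z-series
rhs = a * a * pvar(xs) + b * b * pvar(ys) + 2 * a * b * r * sqrt(pvar(xs)) * sqrt(pvar(ys))
print(lhs, rhs)
```

Both sides agree up to floating-point rounding. A negative b covers combinations such as x - y without needing a separate formula.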
Problems
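1. If the variables x and y are such that (i) $x + y$ has variance 15, (ii) $x - y$ has variance 11 and
(iii) $2x + y$ has variance 29, find $\sigma_x$, $\sigma_y$ and the coefficient of correlation between x and y.
Applying the formula above with $(a, b) = (1, 1)$, $(1, -1)$ and $(2, 1)$ gives
$\sigma_x^2 + \sigma_y^2 + 2r\sigma_x\sigma_y = 15$, $\sigma_x^2 + \sigma_y^2 - 2r\sigma_x\sigma_y = 11$ and
$4\sigma_x^2 + \sigma_y^2 + 4r\sigma_x\sigma_y = 29$. Adding the first two gives $\sigma_x^2 + \sigma_y^2 = 13$,
subtracting them gives $r\sigma_x\sigma_y = 1$, and the third then gives $4\sigma_x^2 + \sigma_y^2 = 25$.
Hence $\sigma_x^2 = 4$ and $\sigma_y^2 = 9$, i.e. $\sigma_x = 2$ and $\sigma_y = 3$.
From (ii), we get
$r = \frac{\sigma_x^2 + \sigma_y^2 - 11}{2\sigma_x\sigma_y} = \frac{4 + 9 - 11}{2(2)(3)} = \frac{1}{6} = 0.1667$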
2. The standard deviations of x and y are 2 and 3 respectively. If the coefficient of correlation
between x and y is 0.4, find the standard deviations of $x + y$ and $x - y$.
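One way to work Problem 2 is to apply the variance formula for $ax + by$ directly. The sketch below assumes the two combinations asked for are x + y and x - y (the operators are not legible in the source), and the function name `sd_linear_combination` is made up for this example.

```python
from math import sqrt


def sd_linear_combination(a, b, sigma_x, sigma_y, r):
    """Standard deviation of a*x + b*y from sigma_x, sigma_y and r."""
    var_z = (a * a * sigma_x ** 2
             + b * b * sigma_y ** 2
             + 2 * a * b * r * sigma_x * sigma_y)
    return sqrt(var_z)


# Given in Problem 2: sigma_x = 2, sigma_y = 3, r = 0.4
sd_sum = sd_linear_combination(1, 1, 2, 3, 0.4)    # x + y: variance 4 + 9 + 4.8 = 17.8
sd_diff = sd_linear_combination(1, -1, 2, 3, 0.4)  # x - y: variance 4 + 9 - 4.8 = 8.2
print(round(sd_sum, 4), round(sd_diff, 4))
```

Under that assumption the standard deviations are the square roots of 17.8 and 8.2, roughly 4.22 and 2.86.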
Rank Correlation
A group of n individuals may be arranged in order of merit with respect to some
characteristic. The same group would, in general, give different orders for different characteristics.
Considering the orders corresponding to two characteristics A and B, the correlation between these
n pairs of ranks is called the rank correlation in the characteristics A and B for that group of
individuals.
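Let $x_i$, $y_i$ be the ranks of the ith individual in A and B respectively. Assuming that no two
individuals are bracketed equal in either case, each of the variables takes the values 1, 2, 3, …, n,
and we have
$$\bar{x} = \bar{y} = \frac{1 + 2 + 3 + \cdots + n}{n} = \frac{n(n+1)}{2n} = \frac{n+1}{2}$$
If $X_i$, $Y_i$ are the deviations of $x_i$, $y_i$ from their means, then
$$\sum X_i^2 = \sum (x_i - \bar{x})^2 = \sum x_i^2 + n\bar{x}^2 - 2\bar{x}\sum x_i$$
$$= \frac{n(n+1)(2n+1)}{6} + \frac{n(n+1)^2}{4} - 2\cdot\frac{n+1}{2}\cdot\frac{n(n+1)}{2}$$
$$= \frac{n(n+1)(2n+1)}{6} + \frac{n(n+1)^2}{4} - \frac{n(n+1)^2}{2} = \frac{1}{12}(n^3 - n)$$
Similarly, $\sum Y_i^2 = \frac{1}{12}(n^3 - n)$.
Now let $d_i = x_i - y_i$. Since $\bar{x} = \bar{y}$, we have $d_i = X_i - Y_i$, so that
$\sum d_i^2 = \sum X_i^2 + \sum Y_i^2 - 2\sum X_i Y_i$, i.e.
$$\sum X_i Y_i = \frac{1}{2}\left(\sum X_i^2 + \sum Y_i^2 - \sum d_i^2\right)$$
$$\sum X_i Y_i = \frac{1}{2}\left[\frac{1}{12}(n^3 - n) + \frac{1}{12}(n^3 - n) - \sum d_i^2\right] = \frac{1}{12}(n^3 - n) - \frac{1}{2}\sum d_i^2$$
Also,
$$\sigma_x^2 = \frac{1}{n}\sum X_i^2 = \frac{1}{12n}(n^3 - n) \quad \text{and} \quad \sigma_y^2 = \frac{1}{n}\sum Y_i^2 = \frac{1}{12n}(n^3 - n)$$
Hence the coefficient of correlation between these variables is
$$r = \frac{\sum X_i Y_i}{n\,\sigma_x\sigma_y} = \frac{\frac{1}{12}(n^3 - n) - \frac{1}{2}\sum d_i^2}{n\cdot\frac{1}{12n}(n^3 - n)} = \frac{\frac{1}{12}(n^3 - n) - \frac{1}{2}\sum d_i^2}{\frac{1}{12}(n^3 - n)} = 1 - \frac{6\sum d_i^2}{n^3 - n}$$
This is called the rank correlation coefficient and is denoted by $\rho$.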
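The closed form $\rho = 1 - 6\sum d_i^2 / (n^3 - n)$ translates directly into code. The following is a minimal sketch that assumes untied ranks 1, 2, …, n, as in the derivation; the function name `rank_correlation` and the five-item example rankings are hypothetical.

```python
def rank_correlation(rank_a, rank_b):
    """Rank correlation coefficient for two untied rankings of the same n items."""
    n = len(rank_a)
    if n < 2 or n != len(rank_b):
        raise ValueError("need two rankings of the same length n >= 2")
    expected = list(range(1, n + 1))
    # The closed form relies on each ranking being a permutation of 1..n (no ties)
    if sorted(rank_a) != expected or sorted(rank_b) != expected:
        raise ValueError("each ranking must use the ranks 1..n exactly once")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n ** 3 - n)


# Hypothetical rankings of five items
same = rank_correlation([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])      # complete agreement
opposite = rank_correlation([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])  # complete disagreement
print(same, opposite)
```

Identical rankings give 1 and exactly reversed rankings give -1, the two extremes of the coefficient.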
Problems
1. Ten competitors in a beauty contest are ranked by two judges A and B in the following order:
ID No. of competitors 1 2 3 4 5 6 7 8 9 10
Judge A 1 6 5 10 3 2 4 9 7 8
Judge B 6 4 9 8 1 2 3 10 5 7
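With $\sum d_i^2 = 14$ and $n^3 - n = 504$ (which corresponds to $n = 8$),
$$\rho = 1 - \frac{6\sum d_i^2}{n^3 - n} = 1 - \frac{6(14)}{504} = 0.8333.$$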
3. Ten competitors in a beauty contest are ranked by three judges A, B and C in the following order:
ID No. of competitors 1 2 3 4 5 6 7 8 9 10
Judge A 1 6 5 10 3 2 4 9 7 8
Judge B 3 5 8 4 7 10 2 1 6 9
Judge C 6 4 9 8 1 2 3 10 5 7
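Let x, y and z denote the ranks given by judges A, B and C, and let $d_1$, $d_2$, $d_3$ be the rank
differences for the pairs (x, y), (y, z) and (z, x). With $n = 10$, $n^3 - n = 990$ and
$$\rho(x, y) = 1 - \frac{6\sum d_1^2}{n^3 - n} = 1 - \frac{6(200)}{990} = -0.2121$$
$$\rho(y, z) = 1 - \frac{6\sum d_2^2}{n^3 - n} = 1 - \frac{6(214)}{990} = -0.29697$$
$$\rho(z, x) = 1 - \frac{6\sum d_3^2}{n^3 - n} = 1 - \frac{6(60)}{990} = 0.636364$$
Since $\rho(z, x)$ is maximum, the pair of judges A and C have the nearest common approach.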
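The three-judge comparison can be reproduced with a few lines of Python. This is a sketch only: it takes x, y, z to be the ranks given by judges A, B and C, and the helper name `rho` and the dictionary layout are choices made here for illustration.

```python
from itertools import combinations


def rho(rank_a, rank_b):
    """Rank correlation coefficient 1 - 6*sum(d^2)/(n^3 - n) for untied ranks."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n ** 3 - n)


# Ranks awarded to competitors 1..10 by each judge (from the problem statement)
judges = {
    "A": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "B": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "C": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# Coefficient for every pair of judges
results = {
    (p, q): rho(judges[p], judges[q])
    for p, q in combinations(judges, 2)
}
best_pair = max(results, key=results.get)

for pair, value in results.items():
    print(pair, round(value, 4))
print("closest pair:", best_pair)
```

The pairs (A, B) and (B, C) come out negative and (A, C) is the only positive value, so A and C are the judges whose rankings agree most closely.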