Unit 14
Objectives
After completing this unit, you should be able to:
• understand the meaning of correlation
• compute the correlation coefficient between two variables from sample
observations
• test for the significance of the correlation coefficient
• identify confidence limits for the population correlation coefficient from
the observed sample correlation coefficient
• compute the rank correlation coefficient when rankings rather than actual
values for variables are known
• appreciate some practical applications of correlation
• become aware of the concept of auto-correlation and its application in
time series analysis.
Structure
14.1 Introduction
14.2 The Correlation Coefficient
14.3 Testing for the Significance of the Correlation Coefficient
14.4 Rank Correlation
14.5 Practical Applications of Correlation
14.6 Auto-correlation and Time Series Analysis
14.7 Summary
14.8 Self-assessment Exercises
14.9 Key Words
14.10 Further Readings
14.1 INTRODUCTION
We often encounter situations where data appears as pairs of figures relating
to two variables. A correlation problem considers the joint variation of two
measurements neither of which is restricted by the experimenter. The
regression problem, which is treated in Unit 15, considers the frequency
distributions of one variable (called the dependent variable) when another
(independent variable) is held fixed at each of several levels.
Examples of correlation problems are found in the study of the relationship
between IQ and aggregate percentage marks obtained by a person in SSC
examination, blood pressure and metabolism or the relation between height
and weight of individuals. In these examples both variables are observed as
Year    Advertisement expenditure (X)    Sales (Y)
1988    50                               700
1987    50                               650
1986    50                               600
1985    40                               500
1984    30                               450
1983    20                               400
1982    20                               300
1981    15                               250
1980    10                               210
1979    5                                200
Figure I: Scatter Diagram
The scatter diagram may exhibit different kinds of patterns. Some typical
patterns indicating different correlations between two variables are shown in
Figure II.
What we shall study next is a precise, quantitative measure of the degree
of association between two variables: the correlation coefficient.
Figure II: Different Types of Association Between Variables
where
$x = X - \bar{X}$ = deviation of a particular X value from the mean $\bar{X}$
$y = Y - \bar{Y}$ = deviation of a particular Y value from the mean $\bar{Y}$
Equation (14.2) can be derived from equation (14.1) by substituting for $\sigma_X$
and $\sigma_Y$ as follows:
$$\sigma_X = \sqrt{\frac{1}{n}\sum (X - \bar{X})^2} \quad \text{and} \quad \sigma_Y = \sqrt{\frac{1}{n}\sum (Y - \bar{Y})^2} \qquad (14.3)$$
Activity A
Suggest five pairs of variables which you expect to be positively correlated.
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
Activity B
Suggest five pairs of variables which you expect to be negatively correlated.
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
A Sample Calculation: Taking as an illustration the data of advertisement
expenditure (X) and sales (Y) of a company for the 10-year period shown in
Table 1, we proceed to determine the correlation coefficient between these
variables.
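Computations are conveniently carried out as shown in Table 2.
Table 2: Calculation of Correlation Coefficient
Sl.No.   X   Y   $x = X - \bar{X}$   $y = Y - \bar{Y}$   $x^2$   $y^2$   $xy$
$$\bar{X} = \frac{290}{10} = 29$$
$$\bar{Y} = \frac{4260}{10} = 426$$
$$\therefore r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{28310}{\sqrt{2740 \times 306840}} = 0.976$$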
This value of r (= 0.976) indicates a high degree of association between the
variables X and Y. For this particular problem, it indicates that an increase in
advertisement expenditure is likely to yield higher sales.
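The sample calculation above can be checked with a short script. This is a minimal sketch, assuming the deviation form r = Σxy / √(Σx² Σy²) used in the worked result; the function name `pearson_r` is only an illustrative choice, and the data are the ten (X, Y) pairs of Table 1.

```python
import math

# Table 1 data: advertisement expenditure (X) and sales (Y), 1988 back to 1979.
X = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
Y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]


def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the actual means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [v - mean_x for v in xs]          # x = X - mean of X
    dy = [v - mean_y for v in ys]          # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


r = pearson_r(X, Y)
print(round(r, 3))
```

Rounded to three decimals, the script gives the same value of r as the hand calculation.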
You may have noticed that in carrying out calculations for the correlation
coefficient in Table 2, large values for $x^2$ and $y^2$ resulted in a great
computational burden. The computations can be simplified by calculating the
deviations of the observations from an assumed average rather than the
actual average, and also scaling these deviations conveniently. To
illustrate this short cut procedure, let us compute the correlation coefficient
for the same data. We shall take U to be the deviation of X values from the
assumed mean of 30 divided by 5. Similarly, V represents the deviation of Y
values from the assumed mean of 400 divided by 10.
The computations are shown in Table 3.
Table 3: Short cut Procedure for Calculation of Correlation Coefficient
S.No.   X    Y     U    V     UV    $U^2$   $V^2$
1.      50   700   4    30    120   16      900
2.      50   650   4    25    100   16      625
3.      50   600   4    20    80    16      400
4.      40   500   2    10    20    4       100
5.      30   450   0    5     0     0       25
6.      20   400   -2   0     0     4       0
7.      20   300   -2   -10   20    4       100
8.      15   250   -3   -15   45    9       225
9.      10   210   -4   -19   76    16      361
10.     5    200   -5   -20   100   25      400
Total              -2   26    561   110     3,136
$$r = \frac{\sum UV - \frac{(\sum U)(\sum V)}{n}}{\sqrt{\sum U^2 - \frac{(\sum U)^2}{n}} \; \sqrt{\sum V^2 - \frac{(\sum V)^2}{n}}}$$
$$r = \frac{561 - \frac{(-2)(26)}{10}}{\sqrt{110 - \frac{(-2)^2}{10}} \; \sqrt{3136 - \frac{(26)^2}{10}}} = \frac{566.2}{10.47 \times 55.39} = 0.976$$
We thus obtain the same result as before.
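The short cut procedure can also be sketched in code. The following is an illustrative sketch, not a prescribed routine: it forms U and V with the assumed means (30 and 400) and scaling factors (5 and 10) stated in the text, then applies the raw-sum formula. The name `shortcut_r` is hypothetical.

```python
import math

# Same data as Table 1 / Table 3.
X = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
Y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]

# Deviations from the assumed means, scaled as described in the text.
U = [(x - 30) / 5 for x in X]
V = [(y - 400) / 10 for y in Y]


def shortcut_r(us, vs):
    """Correlation coefficient from raw sums of the coded values U and V."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    sum_v2 = sum(v * v for v in vs)
    numerator = sum_uv - sum_u * sum_v / n
    denominator = (math.sqrt(sum_u2 - sum_u ** 2 / n)
                   * math.sqrt(sum_v2 - sum_v ** 2 / n))
    return numerator / denominator


print(sum(U), sum(V))                 # column totals of Table 3
print(round(shortcut_r(U, V), 3))
```

Because the function takes the coded values as arguments, other assumed means or scaling factors can be tried by changing only the two lines that define U and V.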
Activity C
Use the short cut procedure to obtain the value of correlation coefficient in
the above example using scaling factor 10 and 100 for X and Y respectively.
(That is, the deviation from the assumed mean is to be divided by 10 for X
values and by 100 for Y values.)
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
Once r has been calculated, the chart can be used to determine the upper and
lower values of the interval for the sample size used. In this chart the range of
unknown values of ρ is shown on the vertical scale, while the sample r values
are shown on the horizontal axis, with a number of curves for selected sample
sizes. Notice that for every sample size there are two curves. To read the 95%
confidence limits for an observed sample correlation coefficient of 0.8 for a
sample of size 10, we simply look along the horizontal axis for a value of 0.8
(the sample correlation coefficient) and construct a vertical line from there till
it intersects the first curve for n = 10. This happens at ρ = 0.2. This is the
lower limit of the confidence interval. Extending the vertical line upwards, it
again intersects the second n = 10 curve at ρ = 0.92, which represents the upper
confidence limit. Thus the 95% confidence interval for the population
Rank for variable X:   1   2   3   4   5   6   7   8    9   10
Rank for variable Y:   3   1   4   2   6   9   8   10   5   7
$$t = r_s \sqrt{\frac{n-2}{1 - r_s^2}} = 0.697 \sqrt{\frac{10-2}{1-(0.697)^2}} = 2.75$$
Referring to the table of the t-distribution for n-2 = 8 degrees of freedom, the
critical value for t at a 5% level of significance is 2.306. Since the calculated
value of t is higher than the table value, we reject the null hypothesis
concluding that the performances in Mathematics and Physics are closely
associated.
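The rank correlation and its t statistic can be reproduced with a few lines of code. This sketch assumes the standard Spearman formula for untied ranks, r_s = 1 − 6Σd²/(n(n² − 1)), and uses the two rows of ranks tabulated above; the function names are illustrative only.

```python
import math

# Ranks of the ten individuals on the two variables, as tabulated above.
rank_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rank_y = [3, 1, 4, 2, 6, 9, 8, 10, 5, 7]


def spearman_rs(rx, ry):
    """Spearman's rank correlation for untied ranks."""
    n = len(rx)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


rs = spearman_rs(rank_x, rank_y)
t = t_statistic(rs, len(rank_x))
print(round(rs, 3), round(t, 2))
```

The printed t value is then compared with the table value of 2.306 for 8 degrees of freedom, as in the text.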
When two or more items have the same rank, a correction has to be applied to
$\sum d_i^2$. For example, if the ranks of X are 1, 2, 3, 3, 5, ... showing that there are
two items with the same 3rd rank, then instead of writing 3, we write 3½ for
each so that the sum of these items is 7 and the mean of the ranks is
unaffected. But in such cases the standard deviation is affected, and therefore
a correction is required. For this, $\sum d_i^2$ is increased by $(t^3 - t)/12$ for each
tie, where t is the number of items in each tie.
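The treatment of ties described above can be sketched as follows. This is a minimal illustration, assuming tied items are first written with the same lowest rank (as in 1, 2, 3, 3, 5); the helper names `average_ranks` and `tie_correction` are hypothetical, and only the first five ranks of the example in the text are used.

```python
from collections import Counter


def average_ranks(ranks):
    """Replace each tied rank by the mean of the positions the tie occupies."""
    counts = Counter(ranks)
    # A tie of t items starting at rank r occupies positions r .. r + t - 1,
    # whose mean is r + (t - 1) / 2.
    return [r + (counts[r] - 1) / 2 for r in ranks]


def tie_correction(ranks):
    """Total amount to add to the sum of squared rank differences:
    (t^3 - t) / 12 for every group of t tied items."""
    counts = Counter(ranks)
    return sum((t ** 3 - t) / 12 for t in counts.values() if t > 1)


ranks_x = [1, 2, 3, 3, 5]          # first five ranks from the example in the text
print(average_ranks(ranks_x))      # [1.0, 2.0, 3.5, 3.5, 5.0]
print(tie_correction(ranks_x))     # 0.5
```

The averaged ranks still sum to 14.5, the same total as the untied positions 1, 2, 3, 4 and 5 less the rank not listed, so the mean rank is unaffected while the correction term accounts for the reduced spread.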
Activity D
Suppose the ranks in Table 4 were tied as follows: Individuals 3 and 4 both
ranked 3rd in Maths and individuals 6, 7 and 8 ranked 8th in Physics.
Assuming that other rankings remain unaltered, compute the value of
Spearman's rank correlation.
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
Correlation analysis is used as a starting point for selecting useful
independent variables for regression analysis. For instance, a construction
company could identify factors like
• population
• construction employment
• building permits issued last year
which it feels would affect its sales for the current year.
These and other factors that may be identified could be checked for mutual
correlation by computing the correlation coefficient of each pair of variables
from the given historical data (this kind of analysis is easily done by using an
appropriate routine on a computer). Only variables having a high correlation
with the yearly sales could be singled out for inclusion in a regression model.
Correlation is also used in factor analysis, wherein attempts are made to
resolve a large set of measured variables in terms of relatively few new
categories, known as factors. The results could be useful in the following
three ways:
1) to reveal the underlying or latent factors that determine the relationship
between the observed data,
2) to make evident relationships between data that had been obscured before
such analysis, and
3) to provide a classification scheme when data scored on various rating
scales have to be grouped together.
Another major application of correlation is in forecasting with the help of
time series models. In using past data (which is often a time series of the
variable of interest available at equal time intervals) one has to identify the
trend, seasonality and random pattern in the data before an appropriate
forecasting model can be built. The notion of auto-correlation and plots of
auto-correlation for various time lags help one to identify the nature of the
underlying process. Details of time series analysis are discussed in Unit 20.
However, some fundamental concepts of auto-correlation and its use for time
series analysis are outlined below.
One could construct from one variable another time-lagged variable which is
twelve periods removed. If the data consists of monthly figures, a
twelve-month time lag will show how values of the same month but of different
years correlate with each other. If the auto-correlation coefficient is positive,
it implies that there is a seasonal pattern of twelve months' duration. On the
other hand, a near-zero auto-correlation indicates the absence of a seasonal
pattern. Similarly, if there is a trend in the data, values next to each other will
be related, in the sense that if one increases, the other too will tend to increase in
order to maintain the trend. Finally, in the case of completely random data, all
auto-correlations will tend to zero (or not be significantly different from zero).
The formula for the auto-correlation coefficient at time lag k is:
$$r_k = \frac{\sum_{t=1}^{n-k} (X_t - \bar{X})(X_{t+k} - \bar{X})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}$$
where
$r_k$ denotes the auto-correlation coefficient for time lag k,
k denotes the length of the time lag,
n is the number of observations,
$X_t$ is the value of the variable at time t, and
$\bar{X}$ is the mean of all the data.
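The formula above can be sketched as a small function. This is only an illustration: the name `autocorrelation` is hypothetical, and the short series used to exercise it is made-up data with a steady upward trend, not the series of Figure IV.

```python
def autocorrelation(series, k):
    """Auto-correlation coefficient r_k of a series at time lag k."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    # Numerator: products of deviations k periods apart, t = 1 .. n - k.
    numerator = sum(dev[t] * dev[t + k] for t in range(n - k))
    # Denominator: sum of squared deviations over all n observations.
    denominator = sum(d * d for d in dev)
    return numerator / denominator


# Hypothetical trending series, used only to exercise the function.
trend = [2, 4, 6, 8, 10, 12]
for lag in (1, 2, 3):
    print(lag, round(autocorrelation(trend, lag), 3))
```

For a series like this one, the lag 1 coefficient is the largest, which matches the remark that neighbouring values move together when a trend is present.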
Using the data of Figure IV the calculations can be illustrated.
$$\bar{X} = \frac{13 + 8 + 15 + \cdots + 12}{10} = \frac{100}{10} = 10$$
$$r_1 = \frac{(13-10)(8-10) + (8-10)(15-10) + \cdots + (14-10)(12-10)}{(13-10)^2 + (8-10)^2 + \cdots + (14-10)^2 + (12-10)^2} = \frac{-27}{144} = -0.188$$
For k = 2, the calculation is as follows:
$$r_2 = \frac{\sum_{t=1}^{8} (X_t - 10)(X_{t+2} - 10)}{\sum_{t=1}^{10} (X_t - 10)^2}$$
14.7 SUMMARY
In this unit the concept of correlation or the association between two
variables has been discussed. A scatter plot of the variables may suggest that
the two variables are related but the value of the Pearson correlation
coefficient r quantifies this association. The correlation coefficient r may
assume values between -1 and 1. The sign indicates whether the association
is direct (+ve) or inverse (-ve). A numerical value of r equal to unity indicates
perfect association while a value of zero indicates no association.
Tests for significance of the correlation coefficient have been described.
Spearman's rank correlation for data with ranks is outlined. Applications of
correlation in identifying relevant variables for regression, factor analysis and
in forecasting using time series have been highlighted. Finally the concept of
auto-correlation is defined and illustrated for use in time series analysis.
c) What are the 95% confidence limits for the population correlation
coefficient?
d) Test the significance of the correlation coefficient using a t-test at a
significance level of 5%.
3) The following data pertains to length of service (in years) and the annual
income for a sample of ten employees of an industry:
Length of service in years (X)    Annual income in thousand rupees (Y)
6 14
8 17
9 15
10 18
11 16
12 22
14 26
16 25
18 30
20 34
Compute the correlation coefficient between X and Y and test its
significance at levels of 0.01 and 0.05.
4) Twelve salesmen are ranked for efficiency and the length of service as
below:
Salesman Efficiency (X) Length of Service (Y)
A 1 2
B 2 1
C 3 5
D 5 3
E 5 9
F 5 7
G 7 7
H 8 6
I 9 4
J 10 11
K 11 10
L 12 11
�
for n data points.
Scatter Diagram: An ungrouped plot of two variables, on the X and Y axes.
Time Lag: The length of time between two time periods, generally used in time
series where one may test, for instance, how values of periods 1, 2, 3, 4
correlate with values of periods 4, 5, 6, 7 (time lag 3 periods).
Time-Series: Set of observations at equal time intervals which may form the
basis of future forecasting.