Regression and Correlation
Introduction
By this time our reader is confident in statistical data analysis, even if at a rudimentary
level. We now go a step farther and try to understand how to test the statistical significance of
data at the same time as testing the relationship between any two variables of our investigation. Once
we know how variables are related to one another, we will then look for methods of
measuring the strength of their relationship. In the following discussion we shall also learn
how to predict the value of one variable given the trend of the other variable. The two related statistics
which assist us in all these tasks are called Regression and Correlation.
If one variable changes and influences another, the influencing variable is called the
independent variable. The variable being influenced is the dependent variable, because its size and
behaviour depend on the independent variable. The independent variable is usually called an
exogenous variable, because its magnitude is determined by factors outside the control of the
investigator, while the dependent variable is an endogenous variable: its value depends
on the vicissitudes of the experiment at hand, within the model under the control of the
researcher.
In referring to the relationship between the dependent variable and the independent
variable, we always say that the dependent variable is a function of the independent variable. The
dependent variable is always denoted by the capital letter Y, and the independent variable by the
capital letter X. (Remember not to confuse these with lower-case letters, because the lower-case
letters mean other things, as we shall see later.) Therefore, in symbolic terms we write:

Y is a function of X, or Y = f(X).

Whenever Y changes with each change in X we say that there is a functional relationship between
Y and X.
Assumptions of the Linear Regression/Correlation model
1. There must be two populations, and each of these must contain members of one
variable at a time, varying from the smallest member to the largest member. One
population comprises the independent variable and the other the dependent
variable.
2. The observed values at each level (or value) of the independent variable are one
selection out of many which could have been observed and obtained. We say that each
observation of the independent variable is stochastic, meaning that it is probabilistic and could occur
by chance. This fact does not affect the model very much because, after all, the independent
variable is exogenous.
3. Since the independent variable is stochastic, the dependent variable is also stochastic.
This fact is of great interest to the observers and analysts, and forms the basis of all
analysis using these two statistics. The stochastic nature of the dependent variable lies within
the model or the experiment, because it is the subject matter of the investigations and the
analyses of researchers under any specific circumstances.
4. The relationship being investigated between two variables is assumed to be linear.
This assumption will be relaxed later on, when we shall be dealing with non-linear
regression and correlation in succeeding chapters.
5. Each value of the dependent variable resulting from the influence of the independent
variable is random: it is one of the very many near-equal values which could have resulted from
the same level (or value) of the independent variable.
6. Both populations are stochastic, and also normal. In that connection, they are
regarded as bi-variate normal.
The name Regression was invented by Sir Francis Galton (1822 - 1911) who, while
studying the natural build of men, observed that the heights of fathers are related to those of their
sons. Taking the heights of the fathers as the independent variable, he observed that the heights
of the sons tend to follow the trend of the heights of the fathers: the heights of
the sons regressed about the heights of the fathers. Soon the term came to mean that any dependent
variable regresses on its independent variable.
In this discussion we are interested in knowing how the values of Y regress with the values
of X. This is what we have called the functional relationship between the values of Y and those of
X. The explicit regression equation which we shall be studying is Y = a + b X.
In this equation, "a" is the intercept: loosely, the value at which the dependent
variable might be found before the independent variable begins to exert its influence. This is not
exactly the case, but we state it this way for purposes of understanding. The value "b" is called
the regression coefficient. When evaluated, b records the rate of change of the dependent variable
with changing values of the independent variable; when the functional relationship is plotted on
a graph, "b" is the magnitude of the slope of this Regression Line.
Most of us have plotted graphs of variables in an attempt to investigate their relationship.
If the independent variable is positioned along the horizontal axis and the dependent variable along
the vertical axis, the stochastic nature of the dependent variable makes the observations
scatter on the graph. This scattering of the plotted values is the so-called scatter diagram, or
Scattergram for short. Closely related variables show scatter diagrams whose points tend to
regress in one direction, either positive or negative. Points for unrelated variables do not show
any trend at all. See Figure 7 - 1.
Since the scatter diagram is the plot of the actual values of Y which have been observed
for every value of X, the locus of the conditional means of Y can be approximated by eye through
the scatter diagram. In our earlier classes the teachers might have told us to observe the dots or
crosses on the scatter diagram and try to fit the curve by eye. However, this is unsatisfactory
because it is not accurate enough. Nowadays there are precise mathematical methods and computer
packages for plotting this line with a high degree of accuracy, together with various measures of
how accurate the estimate is. The statistical procedure we are about to learn helps us
understand these computations and assess their accuracy and efficacy.
Figure 7 - 1 : Scatter Diagrams can take any of these forms
In a scatter diagram, the least-squares line lies exactly in the center of all the dots or the
crosses which may happen to be regressing in any specific direction (see Figure 7 - 1). The
distances between this line and all the dots which lie above it balance the distances to those which
lie below it, so that the line passes exactly through the middle of the scatter. This is why it
is called the conditional mean of Y. The dots on the scatter diagram are the observed values of Y,
while the points along the line give, for every value of X, the corresponding mean value of Y.
The differences between the higher points and the conditional mean line are called positive
deviations, and those between the lower points and the conditional mean line are called negative
deviations. Now let us use a simple
example to concretize what we have just said.
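The balancing property just described can be demonstrated numerically. The short sketch below is purely illustrative and not part of the textbook's own workings: the x and y values are invented numbers, and numpy's polyfit routine is used as the least-squares fitting tool. The deviations above and below the fitted line cancel out, which is the sense in which the line is the conditional mean of Y.

```python
import numpy as np

# Hypothetical illustrative data: any roughly linear scatter will do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit the least-squares line y = a + b*x (np.polyfit returns the slope first).
b, a = np.polyfit(x, y, deg=1)

fitted = a + b * x            # the conditional mean of Y at each X
deviations = y - fitted       # positive deviations balance the negative ones

print("sum of deviations:", round(deviations.sum(), 10))   # essentially zero
```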
Example
Alexander ole Mbatian is a maize farmer in Maela area of Narok. He records the maize
yields in debes ( Tins, equivalents of English Bushels) per hectare for various amounts of a
certain type of fertilizer which he used in kilograms per hectare for each of the ten years from
1991 to 2000. The values on the table are plotted on the scatter diagram which appears as
Figure 7 - 2. It appears that the relationship between the number of debes produced and the amount
of fertilizer applied on his farm is approximately linear, and the points look as though they fall on a
straight line. Plot the scatter diagram with the amounts of fertilizer (in kilograms per
hectare) as the independent variable, and the maize yield per hectare as the dependent
variable.
Solution
Now we need a step-by-step method of computing the various coefficients which are
used in estimating the position of the regression line. For this estimate we first of all need to
estimate the slope of the regression line using this expression:

$$ b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} $$
Year    n     X     Y
1991    1     6    40
1992    2    10    44
1993    3    12    46
1994    4    14    48
1995    5    16    52
1996    6    18    58
1997    7    22    60
1998    8    24    68
1999    9    26    74
2000   10    32    80
Where:
$X_i$ = each observation of the variable X; in this case each value of the fertilizer in
kilograms used for every hectare.
$\bar{X}$ = the mean value of X.
$Y_i$ = each observation of the variable Y; in this case each value of the maize yield in
debes per hectare.
$\bar{Y}$ = the mean value of Y.
n = the number of paired observations (here n = 10).
The Regression/Correlation statistic involves learning how to evaluate the b-coefficient using the
equation

$$ b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, $$

and how to compute the various results which can be obtained from this evaluation.
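Written as code, the deviation-form formula for b is a short ratio of sums. The sketch below is a minimal illustration (not part of the textbook's own workings) that builds the deviation quantities which Table 7 - 2 will tabulate, using the fertilizer and yield figures from the table above.

```python
import numpy as np

# Fertilizer (kg/ha) and maize yield (debes/ha) from the table above.
X = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)
Y = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)

x = X - X.mean()                     # deviations x_i = X_i - X-bar
y = Y - Y.mean()                     # deviations y_i = Y_i - Y-bar

sum_x2 = (x ** 2).sum()              # sum of x_i squared      -> 576
sum_xy = (x * y).sum()               # sum of x_i times y_i    -> 956

b = sum_xy / sum_x2                  # regression coefficient  -> about 1.66
print(sum_x2, sum_xy, round(b, 2))
```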
Figure 7 - 2 : The Scatter Diagram of Maize Produced with Fertilizer Used
The steps which we shall discuss involve the analysis of the various parts of this equation
and performing the instructions in the equation to obtain the value of the coefficient "b". Once this
coefficient has been obtained, the corresponding coefficient "a" in the regression equation follows
easily. To simplify the notation, we write the deviations from the means in lower case:

$$ X_i - \bar{X} = x_i \qquad\text{and}\qquad Y_i - \bar{Y} = y_i . $$

The numerator of the b expression is therefore $\sum x_i y_i$ and the denominator is $\sum x_i^2$.
These are the values for X, Y and XY in deviation form, and the summation sign is of
course an instruction to add all the values involved. Accordingly, the equation for the b-coefficient
is:

$$ b = \frac{\sum x_i y_i}{\sum x_i^2} . $$

Use Table 7 - 2 and fill in the values to calculate b.
TABLE 7 - 2 : CALCULATIONS TO ESTIMATE THE REGRESSION EQUATION FOR THE MAIZE
PRODUCED (DEBES) WITH AMOUNTS OF FERTILIZER USED

Year (n)   Xi: Fertilizer (kg/ha)   Yi: Yield (debes/ha)    xi     yi     xi²    xi·yi
 1                  6                      40              -12    -17     144     204
 2                 10                      44               -8    -13      64     104
 3                 12                      46               -6    -11      36      66
 4                 14                      48               -4     -9      16      36
 5                 16                      52               -2     -5       4      10
 6                 18                      58                0      1       0       0
 7                 22                      60                4      3      16      12
 8                 24                      68                6     11      36      66
 9                 26                      74                8     17      64     136
10                 32                      80               14     23     196     322

Means: $\bar{X} = 18$, $\bar{Y} = 57$.  Totals: $\sum x_i^2 = 576$, $\sum x_i y_i = 956$.
Solution (continued)

Using the values in Table 7 - 2, the solution for the b-coefficient is sought in the
following manner:

$$ b = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{956}{576} = 1.66 . $$

This is the slope of the regression line. Then the value of "a", which statisticians also call "$b_0$"
(because it is, in theory, the value of the regression line when X = 0), is found as:

$$ a = \bar{Y} - b\bar{X} = 57 - 1.66(18) = 57 - 29.88 = 27.12 . $$

This is the Y-intercept. The estimated regression equation is therefore:

$$ \hat{Y}_i = 27.12 + 1.66 X_i . $$
The meaning of this equation is that if we are given any value of fertilizer application $X_i$ by
Ole Mbatian, we can estimate for him how much maize (in debes per hectare) he can expect from
that level of fertilizer application using this regression equation. Assume that he chooses to apply 18
kilograms of fertilizer per hectare. The maize yield he expects during a normal season (everything
else, like rainfall, soil conditions and other climatic variables, remaining constant) estimated
using the regression equation will be:

$$ \hat{Y}_i = 27.12 + 1.66(18) = 27.12 + 29.88 = 57 \text{ debes per hectare.} $$

The symbols for the calculated values of Y, as opposed to the observed values of Y, vary
from textbook to textbook. In your Textbook (King'oriah 2004) we use "$Y_c$" or
"$Y_{calculated}$". Here we are using "$\hat{Y}_i$", pronounced "Y-hat". Other books use "$Y_e$", meaning
"$Y_{estimated}$", and so on; it does not make any difference.
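As a quick check on the arithmetic, the small sketch below (illustrative only, not part of the textbook) reproduces the estimated equation and the prediction for 18 kilograms of fertilizer per hectare.

```python
import numpy as np

# Fertilizer (kg/ha) and maize yield (debes/ha) from Mr. Mbatian's records.
X = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)
Y = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)

x, y = X - X.mean(), Y - Y.mean()
b = (x * y).sum() / (x ** 2).sum()     # slope, about 1.66
a = Y.mean() - b * X.mean()            # intercept, about 27.12

y_hat_18 = a + b * 18.0                # expected yield at 18 kg of fertilizer per ha
print(round(a, 2), round(b, 2), round(y_hat_18, 1))   # about 27.12, 1.66 and 57.0
```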
The accuracy of these estimates can be assessed by extending Table 7 - 2 and comparing the
estimated values of Y with the actual values of Y to find each deviation of the actual value from
the estimated value which lies on the regression line. All the deviations are then squared and the
sum of squared errors of all the deviations is calculated. Once this is done, the variance of the
constants and the standard error of estimate of either of the two constants can easily be computed.
Let us now carry out this exercise to demonstrate the process. In Table 7 - 3, the estimated values
of Y are in the column labeled $\hat{Y}_i$. The deviations of the observed values of Y from the
estimated values of Y are in the column labeled "$e_i$", and their squared values are found in the
column labeled "$e_i^2$". These, and a few others in this table, are the calculations required to
compute the standard errors of estimating the constants $b_0 = a$ and $b_1 = b$.
The estimated values $\hat{Y}_i$ are found in Table 7 - 3 in the third column from the right. These
have been computed using the equation $\hat{Y}_i = 27.12 + 1.66 X_i$: simply substitute the
observed values of $X_i$ into the equation and solve it to find the estimated values of
$\hat{Y}_i$. Other deviation figures in this table will be used later in our analysis. The variance of the
intercept is found using the equation:
$$ s_{b_0}^2 = \frac{\sum e_i^2}{n - k} \cdot \frac{\sum X_i^2}{n \sum x_i^2} $$

where $\sum e_i^2 = 47.3056$ is the sum of squared errors from Table 7 - 3, and k is the number
of constants estimated (here k = 2).
Other values can be found in Table 7 - 3. Let us now use the equation :-
$$ s_{b_0}^2 = \frac{\sum e_i^2}{n - k} \cdot \frac{\sum X_i^2}{n \sum x_i^2} = \frac{47.3056}{10 - 2} \cdot \frac{3816}{10(576)} = 3.92 $$
$$ s_{b_1}^2 = \frac{\sum e_i^2}{(n - k)\sum x_i^2} = \frac{47.3056}{(10 - 2)(576)} = 0.01 $$
Having found the variances of these constants, their standard errors are obviously the square roots
of these figures: $s_{b_0} = \sqrt{3.92} = 1.98$ and $s_{b_1} = \sqrt{0.01} = 0.1$.
Let us now test how many standard errors each of the two constants lies away from zero.
This means we compute each of their t-values and compare them with the critical t
at the 5% alpha level. If the t-values resulting from these constants exceed the critical value of
t, we conclude that each of them is significant. The calculated t-values for these parameters are:
$$ t_0 = \frac{b_0 - \beta_0}{s_{b_0}} = \frac{27.12 - 0}{1.98} = 13.7 $$

$$ t_1 = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{1.66 - 0}{0.1} = 16.6 $$
Since both calculated t-values exceed the critical value t = 2.306 with 8 degrees of freedom at the
5% level of significance, we conclude that both the intercept and the slope are significant at the 5% level.
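The variance and t-value calculations above can be reproduced in a few lines. The sketch below is illustrative only (not from the textbook); it uses the same maize data and the k = 2 estimated constants referred to in the text.

```python
import numpy as np

X = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)   # fertilizer
Y = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)  # yield

x = X - X.mean()
b1 = (x * (Y - Y.mean())).sum() / (x ** 2).sum()   # slope b
b0 = Y.mean() - b1 * X.mean()                      # intercept a

e = Y - (b0 + b1 * X)                 # residuals e_i
n, k = len(X), 2                      # k = number of constants estimated
mse = (e ** 2).sum() / (n - k)        # sum(e_i^2) / (n - k)

var_b0 = mse * (X ** 2).sum() / (n * (x ** 2).sum())   # variance of the intercept
var_b1 = mse / (x ** 2).sum()                          # variance of the slope

t0 = b0 / np.sqrt(var_b0)     # about 13.7
t1 = b1 / np.sqrt(var_b1)     # about 16.4 (16.6 in the text, which uses rounded figures)
print(round(t0, 1), round(t1, 1))
```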
The strength of this relationship is measured by the Coefficient of Determination, $R^2$, which
compares the sum of squared errors between the predicted values $\hat{Y}_i$ and the observed values
of Y with the sum of squared deviations between the actual observed values of Y and their mean:

$$ R^2 = 1 - \frac{\sum e_i^2}{\sum y_i^2} = 1 - \frac{47.31}{1634} = 1 - 0.0290 = 0.9710 $$
This is a very strong relationship. About 97.1% of the changes in the maize yield (in debes per
hectare) within Mr. Mbatian's farm are explained by the quantities of fertilizer per hectare applied on
his farm. It also means that the regression equation which we have defined as
$\hat{Y}_i = 27.12 + 1.66 X_i$ explains about 97.1% of the variation in output. The remaining
2.9% or thereabouts is explained by other environmental factors on his farm which
have not been captured in the model.
Figure 7 - 2 : The estimated regression line
In any analysis of this kind, the strength of the relationship between X and Y is measured
by the size of the Coefficient of Determination. This coefficient varies between zero (no
relationship at all) and 1.0000 (a perfect relationship). The example from Mr.
Mbatian's farm is one of a near-perfect relationship, with the observed data
clinging very closely to the regression line that we have constructed, as shown in Figure 7 - 2.
The other value which is used very frequently for theoretical work in Statistics is the
Correlation Coefficient. This is sometimes called Pearson's Product-Moment Correlation Coefficient,
after its originator, Prof. Karl Pearson (1857 - 1936). He also devised the Chi-Square statistic
and many other analytical techniques while he was working at the Galton Laboratory of the
University of London. From the computations above you will appreciate how closely he is
associated with the measures which we have just considered.
The Correlation Coefficient is the square root of the Coefficient of determination.
We shall use both measures extensively in Biostatistics from now on. Let us now compute the
Correlation Coefficient.
$$ r = \sqrt{R^2} = \sqrt{1 - \frac{\sum e_i^2}{\sum y_i^2}} = \sqrt{1 - 0.0290} = \sqrt{0.9710} = 0.9854 $$
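A quick numerical check of the coefficient of determination and the correlation coefficient, again using the same maize data (an illustrative sketch only):

```python
import numpy as np

X = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)   # fertilizer
Y = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)  # yield

x, y = X - X.mean(), Y - Y.mean()
b = (x * y).sum() / (x ** 2).sum()
a = Y.mean() - b * X.mean()

e = Y - (a + b * X)                              # residuals
r_squared = 1 - (e ** 2).sum() / (y ** 2).sum()  # coefficient of determination
r = np.sqrt(r_squared)                           # correlation coefficient

print(round(r_squared, 4), round(r, 4))          # about 0.9710 and 0.9854
```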
The measure is useful in determining the nature of the slope of the regression line. A negative
relationship has a negatively sloping regression line and a negative Correlation Coefficient. A
positive relationship has a positive Correlation Coefficient. In our case the measure is positive.
This means that the more of this kind of fertilizer is applied per hectare on Mr. Mbatian's
farm, the more maize yield (in debes per hectare) he realizes at the end of each season.
Computation Example
Example
The following data had been obtained for the time required by a drug quality control
department to inspect outgoing drug tablets for various percentages of those tablets found
defective.
Percent defective:              17    9   12    7    8   10   14   18   19    6
Inspection time (minutes):      48   50   43   36   45   49   55   63   55   36
(a) Find the estimated Regression line $\hat{Y}_i = a + b X_i$.
(b) Determine the sum of deviations about this line for each of the ten observations.
(c) Test the Null Hypothesis that change in inspection time has no significant effect on the
percentage rate of the drug tablets found defective using analysis of variance.
(d) Use any other test statistic to test the significance of the correlation coefficient.
Solution
1. The relevant data and preliminary computations are arranged in Table 7 - 4.
2. The following simple formulae assist in the solution of problems of this kind.
We already know that the deviations of X and Y from their means are defined in the
following manner:

$$ X_i - \bar{X} = x \qquad\text{and}\qquad Y_i - \bar{Y} = y $$
3. The shortcut computations will make use of these deviation formulas to compute various
figures which ultimately lead to the definition of the regression equation $\hat{Y}_i = a + b X_i$.

(a) To find the sum of squared deviations of all the observations of X from the mean
value $\bar{X}$, and the sum of squared deviations of Y from $\bar{Y}$, we use the following
shortcut expressions:

$$ \sum x^2 = \sum X_i^2 - \frac{(\sum X)^2}{n}, \qquad \sum y^2 = \sum Y_i^2 - \frac{(\sum Y)^2}{n} $$

(b) The sum of the cross-product deviations between X and Y is found using the
expression:

$$ \sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n} $$
(c) If we remember these three equations, we have at our disposal a very
powerful tool for the fast computation of the regression coefficients.
To find the slope coefficient "b" we use the results of the expressions in (a) and (b) above:

$$ b = \frac{\sum xy}{\sum x^2} $$

Then the "a" coefficient can be found easily through the equation:

$$ a = \bar{Y} - b\bar{X} $$

and the correlation coefficient follows from the same quantities:

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} $$
4. Applying these shortcut expressions to the data in Table 7 - 4, with X as the inspection
time and Y as the percentage found defective:

$$ \sum x^2 = \sum X_i^2 - \frac{(\sum X)^2}{n} = 23690 - \frac{480^2}{10} = 650 $$

$$ \sum y^2 = \sum Y_i^2 - \frac{(\sum Y)^2}{n} = 1644 - \frac{120^2}{10} = 204 $$

$$ \sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 6049 - \frac{(480)(120)}{10} = 289 $$

$$ b = \frac{\sum xy}{\sum x^2} = \frac{289}{650} = 0.445 $$

$$ a = \bar{Y} - b\bar{X} = 12 - 0.445(48) = 12 - 21.36 = -9.36 $$
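These shortcut computations can be verified with a few lines of code. The helper below is a sketch only; the function name shortcut_regression is ours, not the textbook's. Applied to the drug-inspection data it reproduces the sums and coefficients just obtained.

```python
import numpy as np

def shortcut_regression(X, Y):
    """Shortcut (raw-sum) formulas for simple linear regression and correlation."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n = len(X)
    sum_x2 = (X ** 2).sum() - X.sum() ** 2 / n        # sum of squared deviations of X
    sum_y2 = (Y ** 2).sum() - Y.sum() ** 2 / n        # sum of squared deviations of Y
    sum_xy = (X * Y).sum() - X.sum() * Y.sum() / n    # sum of cross-product deviations
    b = sum_xy / sum_x2                               # slope
    a = Y.mean() - b * X.mean()                       # intercept
    r = sum_xy / np.sqrt(sum_x2 * sum_y2)             # correlation coefficient
    return sum_x2, sum_y2, sum_xy, b, a, r

# Drug-inspection example: X = inspection time (minutes), Y = percent defective.
X = [48, 50, 43, 36, 45, 49, 55, 63, 55, 36]
Y = [17, 9, 12, 7, 8, 10, 14, 18, 19, 6]
print(shortcut_regression(X, Y))   # 650.0, 204.0, 289.0, ~0.445, ~-9.34, ~0.79
```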
5. The values of the Regression Equation are relevant in further tabulation of the figures
which will assist us to derive further tests. They are recorded in the third column from the left
of Table 7 - 5.
Using the data in Table 7 - 5, you can see how quickly we have been able to compute the
important measures which take a lot of time to compute under ordinary circumstances. Obviously,
it is faster by computer, since there are proprietary packages designed for this kind of
work. However, for learning and examination purposes, this method has a lot of appeal: one is able
to move quickly and learn quickly at the same time.
6. It is now very easy to compute the standard error of estimate, which helps us to build a
probability model along the regression equation so that we can see how our regression line fits as
an estimator of our actual field situation. What we need now is to find a method of calculating the
standard error of estimate. Given all the data in Table 7 - 5 we can use the expression below to
find the standard error of estimate of the regression equation:

$$ \hat{\sigma}_Y = \sqrt{\frac{\sum D_i^2}{n - 2}} = \sqrt{\frac{75.5243}{10 - 2}} = \sqrt{9.4405375} = 3.073 $$
If we state the hypothesis that there is no significant difference between the observed values and
the calculated values of Y for each value of X we can build a two tail t distribution model centered
on the regression line. This is done by choosing the Confidence level C = 0.95, and a 0.05
alpha level. Since we need the upper and lower tails on both sides of the regression equation, we
divide the alpha level by two to obtain 0.025 on either side.
For the 0.05 alpha level, we obtain the appropriate value from the usual t-tables on page
498 of your textbook (King'oriah, 2004). We find that our t-probability model on both sides of our
regression equation is built by the following critical value of t, using the two-tail model.
Remember that this table in your Textbook has two alternatives, the two-tail and the single-tail
alternative. We use the columns indicated by the second row of the table for the two-tail model
and find that:

$$ t_{0.05,\ 8\ d.f.} = \pm 2.306 $$
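Where a statistical package is available, the same critical value can be obtained without the printed table. The sketch below is illustrative only and uses SciPy, which the textbook itself does not require.

```python
from scipy.stats import t

# Two-tail test at alpha = 0.05 with n - 2 = 8 degrees of freedom:
# place alpha/2 = 0.025 in each tail and read the upper percentage point.
t_critical = t.ppf(1 - 0.025, df=8)
print(round(t_critical, 3))        # about 2.306
```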
We now have a probability model which states that for any observed value of Y to belong to the
population of all those values estimated by the regression line, it should not lie more than 2.306
standard errors of estimate on either side of the regression line.
The t-value from the table can be useful if it is possible to compute the t-position for every
observation. This is done by asking ourselves how many standard errors of estimate each
observation lies away from the regression line. We therefore need a formula for computing the
actual number of standard errors for each observation. The observed values and the calculated values
are available in Table 7 - 5.
The expression for computing the individual t-value for each observation is

$$ t_i = \frac{D_i}{\hat{\sigma}_Y}, \qquad\text{where } D_i = Y_i - \hat{Y}_i . $$

Using this expression all the observations can be located and their
distance on either side of the regression line can be calculated. This is done in Table 7 - 7.
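The same screening of observations can be scripted. The sketch below is an illustrative check on this example: it recomputes the regression for the drug-inspection data, obtains the standard error of estimate, and expresses each deviation Di as a number of standard errors from the line.

```python
import numpy as np

# Drug-inspection data: X = inspection time (minutes), Y = percent defective.
X = np.array([48, 50, 43, 36, 45, 49, 55, 63, 55, 36], dtype=float)
Y = np.array([17, 9, 12, 7, 8, 10, 14, 18, 19, 6], dtype=float)

x = X - X.mean()
b = (x * (Y - Y.mean())).sum() / (x ** 2).sum()
a = Y.mean() - b * X.mean()

D = Y - (a + b * X)                               # deviations D_i = Y_i - Y-hat_i
se_est = np.sqrt((D ** 2).sum() / (len(X) - 2))   # standard error of estimate, ~3.07
t_values = D / se_est                             # standard errors away from the line

print(round(se_est, 3))
print(np.round(t_values, 2))      # every value should lie within +/- 2.306
```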
From the t-values in the right-most column of Table 7 - 7, we find that there is not a single
observation which lies more than the critical value of t = ±2.306 standard errors of estimate away
from the regression line.
This tells us that there is a very good relationship between X and Y. This relationship, as
outlined by the regression equation, did not come about by chance. The regression equation is a very
good predictor of what is actually happening in the field. This also means that the parameters
of the regression line which we have computed represent the actual situation regarding the
changes in Y which are caused by the changes in X. We can confidently say, at the 95% confidence
level, that the actual percentage of defective tablets found on the production line depends on the
inspection time in minutes. We may want to instruct our quality control staff to be more vigilant
with the inspection so that our drug product may have as few defective tablets as is humanly and
technically possible.
(i) $H_0 : Y_1 = Y_2 = \dots = Y_n$ (no change is recorded in variable Y as a result of the
changes in the values of X).
4. (a) The total error in F-tests comprises the explained variation and the unexplained
variation. This is the sum of squared differences between every observed value of the
dependent variable and the mean of the whole string of observations of the dependent
variable.
In our example SS is the sum of squared differences which are recorded in Table 7 - 7 as
$\sum D^2 = 75.5143$. The explained portion of this error, as a proportion out of 1.0, can be
found from the expression $SS_T = r^2 \sum D^2$.
The value of $r^2$ is easily obtainable from the values already computed:

$$ r^2 = \frac{(\sum xy)^2}{\sum x^2 \sum y^2} = \frac{289^2}{(650)(204)} = 0.630 $$
$$ SS = \sum D^2 = 75.5143 $$

$$ SS_T = r^2 \sum D^2 = 0.630 \times 75.5143 = 47.574009 $$
5. Degrees of Freedom
Total degrees of freedom, arising from the total variation of Y caused by the whole
environment:
SS d.f. = n - 1 = 10 - 1 = 9 d.f.
SS d.f. = 9 d.f.
Unexplained degrees of freedom are lost through the estimation of the two parameters of the
regression; here we lose two degrees of freedom:
SSE d.f. = n - 2 = 10 - 2 = 8 d.f.
SSE d.f. = 8 d.f.
6. Study the summary table carefully. You will find that the computations included in its matrix are
in fact the systematic steps in computing the F-statistic.
The critical value of F is obtained from page 490 of your textbook at the 5% significance level and
[1, n - 2] degrees of freedom:

$$ F_{0.05,\,1,\,n-2} = F_{0.05,\,1,\,8} = 5.3172 $$
7. Compare this $F_{0.05,\,1,\,8} = 5.3172$ with the calculated F-value, $F_c = 13.586$. You
will find that we are justified in rejecting the null hypothesis that the changes in the values
of X do not have any effect on the changes in the values of Y at the 5% significance level. In
our example this means that the more watchful the quality control staff are, the more defective
tablets they can detect, at the 5% significance level.
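For completeness, the sketch below reproduces the F-test numerically using the conventional ANOVA decomposition for simple regression (explained and unexplained sums of squares on 1 and n - 2 degrees of freedom). Allowing for rounding, the resulting F agrees with the chapter's Fc of about 13.6 and comfortably exceeds the tabled critical value of 5.32. It is an illustrative check only, not the textbook's own tabulation.

```python
import numpy as np

# Drug-inspection data: X = inspection time (minutes), Y = percent defective.
X = np.array([48, 50, 43, 36, 45, 49, 55, 63, 55, 36], dtype=float)
Y = np.array([17, 9, 12, 7, 8, 10, 14, 18, 19, 6], dtype=float)

n = len(X)
x, y = X - X.mean(), Y - Y.mean()
b = (x * y).sum() / (x ** 2).sum()
a = Y.mean() - b * X.mean()

sse = ((Y - (a + b * X)) ** 2).sum()   # unexplained (residual) sum of squares
sst = (y ** 2).sum()                   # total sum of squares
ssr = sst - sse                        # explained (regression) sum of squares

F = (ssr / 1) / (sse / (n - 2))        # F-ratio with [1, n - 2] degrees of freedom
print(round(F, 2))                     # about 13.6; tabled F(0.05, 1, 8) = 5.32
```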
Activity
Do all the necessary peripheral reading on this subject and attempt as many
examples in your Textbook as possible. Try to offer interpretations of the calculation
results as we have done in this chapter. It is only with constant practice that one can master
these techniques properly, apply them with confidence, and interpret most of the data which
come from proprietary computer packages.
EXERCISES
1. For a long time scholars have postulated that the predominance of agricultural labor
force in any country is an indication of the dependence of that country on primary modes of
production, and as a consequence, the per-capita income of each country that has a preponderance
of agricultural labor force should be low. A low level of agricultural labor force in any country
would then indicate high per-capita income levels for that country.
Using the data given below and the linear regression/correlation model determine what
percentage of per-capita income is determined by agricultural labor force.
Agricultural labor force (millions):    9  10   8   7  10   4   5   5   6   8   7   4   9   5   8
Per capita income (US $ 00):            6   8   8   7   7  12   9   8   9  10  10  11   9  10  11
2. An urban sociologist practicing within Nairobi has done a middle-income survey,
comprising a sample of 15 households, with a view to determining whether the level of education
of any middle-income head of household within this City determines the annual income of
their families. The following is the result of his findings :-
Education level:                    7  12   8  12  14   9  18  14   8  12  17  10  16  10  13
Annual income (K.Shs. 100,000):    18  32  28  24  22  32  36  26  26  28  28  32  30  20  18
(i) Compute your bi-variate correlation coefficient and the coefficient of determination.
( ii ) Use the shortcut formulae for Regression/correlation analysis to compute the
equation of the regression line of income (Y) on education level ( X ).
( iii ) By what means can the sociologist confirm that there is indeed a relationship?
(Here you should describe any one of the methods of testing the significance of your
statistic.)
Couple 1 2 3 4 5 6 7 8 9 10 11 12
Y 4 3 0 2 4 3 0 4 3 1 3 1
X 3 3 0 2 2 3 0 3 2 1 3 2