
LINEAR REGRESSION AND CORRELATION

Introduction
By this time our reader is confident in statistical data analysis, even if at a rudimentary
level. We now go a step farther and try to understand how to test the statistical significance of
data while at the same time testing the relationship between any two variables of our investigation.
Knowing how variables are related with one another, we will then look for methods of
measuring the strength of their relationship. In the following discussion we shall also learn
how to predict the value of one variable given the trend of the other variable. The two related statistics
which assist us in all these tasks are called Regression and Correlation.
If any variable changes and influences another, the influencing variable is called an
independent variable. The variable being influenced is the dependent variable, because its size and
effects depend on the independent variable. The independent variable is usually called an
exogenous variable, because its magnitude is decided by factors which are outside the
control of the investigator, while the dependent variable is an endogenous variable. Its value depends
on the vicissitudes of the experiment at hand; it depends on the model under the control of the
researcher.
In referring to the relationship between the dependent variable and the independent
variable, we always say that the dependent variable is a function of the independent variable. The
dependent variable is always denoted by the capital letter Y, and the independent variable by the
capital letter X. (Remember not to confuse these with lower case letters, because the lower
case letters mean other things, as we shall see later.) Therefore, in symbolic terms we write:

Y is a function of X :  Y = f(X),

meaning that the values of Y depend on the values of X.

Whenever Y changes with each change in X we say that there is a functional relationship between
Y and X.
Assumptions of the Linear Regression/Correlation model
1. There must be two populations, and each of these must contain members of one
variable at a time, varying from the smallest member to the largest member. One
population comprises the independent variable and the other the dependent
variable.
2. The observed values at each level or each value of the independent variable are one
selection out of many which could have been observed and obtained. We say that each
observation of the independent variable is Stochastic, meaning probabilistic: it could occur
by chance. This fact does not affect the model very much because, after all, the independent
variable is exogenous.
3. Since the independent variable is stochastic, the dependent variable is also stochastic.
This fact is of great interest to observers and analysts, and forms the basis of all
analysis using these two statistics. The stochastic nature of the dependent variable lies within
the model or the experiment, because it is the subject matter of the investigations and the
analyses of researchers under any specific circumstances.
4. The relationship being investigated between the two variables is assumed to be linear.
This assumption will be relaxed later on, when we deal with non-linear
regression and correlation in the succeeding chapters.
5. Each value of the dependent variable resulting from the influence of the independent
variable is random: it is one of the very many near-equal values which could have resulted from the effect
of the same level (or value) of the independent variable.
6. Both populations are stochastic, and also normal. In that connection, they are
regarded as bi-variate normal.

Regression Equation of the Linear Form

The name Regression was invented by Sir Francis Galton (1822 - 1911) who, when
studying the natural build of men, observed that the heights of fathers are related to those of their
sons. Taking the heights of the fathers as the independent variable, he observed that the heights
of their sons tend to follow the trends of the heights of the fathers: the heights of
the sons regressed about the heights of the fathers. Soon it came to mean that any dependent
variable regresses with its independent variable.
In this discussion we are interested in knowing how the values of Y regress with the values
of X. This is what we have called the functional relationship between the values of Y and those of
X. The explicit regression equation which we shall be studying is Y = a + b X.
In this equation, “ a ” marks the intercept, the level at which the dependent
variable would stand before the independent variable exerts any influence. This is not
exactly the case in practice, but we state it this way for the purposes of understanding. The value “ b ” is called
the regression coefficient. When evaluated, b records the rate of change of the dependent variable
with the changing values of the independent variable. The nature of this rate of change is that when
the functional relationship is plotted on a graph, “ b ” is the magnitude of the slope of this
Regression Line.
Most of us have plotted graphs of variables in an attempt to investigate their relationship.
If the independent variable is positioned along the horizontal axis and the dependent variable along
the vertical axis, the stochastic nature of the dependent variable makes the observations
take a scatter on the graph. This scattering of the plotted values
is the so-called scatter diagram, or in short the Scattergram. Closely related variables show scatter
diagrams whose points tend to regress in one direction, either positive or
negative. Unrelated points do not show any trend at all. See Figure 7 - 1.

The Linear Least Squares Line

Since the scatter diagram is the plot of the actual values of Y which have been observed to exist
for every value of X, the locus of the conditional means of Y can be approximated by eye through
the scatter diagram. In our earlier classes the teachers might have told us to observe the dots or
crosses on the scatter diagram and try to fit the curve by eye. However, this is unsatisfactory because it is
not accurate enough. Nowadays there are accurate mathematical methods and computer packages
for plotting this line with a great degree of estimation accuracy, giving various values
for checking that the estimate is accurate. The statistical algorithm we are about to learn helps us
understand these computations and assess their accuracy and efficacy.
Figure 7 - 1 : Scatter Diagrams can take any of these forms

In a scatter diagram, the least squares line lies exactly in the center of all the dots or the
crosses which may happen to be regressing in any specific direction (see Figure
7 - 1). The distances between this line and all the dots in the scattergram which lie above the
line balance those which lie below it, and the line lies exactly in the middle. This is why it
is called the conditional mean of Y. The dots on the scatter diagram are the observed values of Y,
and the points along the line are those values which, for every dot position on the scattergram, define
the mean value of Y for the corresponding values of X. The differences between the higher
points and the conditional mean line are called the positive deviations, and those between the lower
points and the conditional mean line are called the negative deviations. Now let us use a simple
example to concretize what we have just said.

Example
Alexander ole Mbatian is a maize farmer in Maela area of Narok. He records the maize
yields in debes (tins, equivalents of English bushels) per hectare for various amounts of a
certain type of fertilizer which he used in kilograms per hectare for each of the ten years from
1991 to 2000. The values on the table are plotted on the scatter diagram which appears as
Figure 7 - 2. It appears that the relationship between the number of debes produced and the amounts
of fertilizer applied on his farm is approximately linear, and the points look like they fall on a
straight line. Plot the scatter diagram with the amounts of fertilizer (in kilograms per
hectare) as the independent variable, and the maize yield per hectare as the dependent
variable.

Solution
Now we need a step-by-step method of computing the various coefficients which are
used in estimating the position of the regression line. For this estimate we first of all need to
estimate the slope of the regression line using this expression:

b = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)²

TABLE 7 - 1 : DIFFERENT QUANTITIES OF MAIZE PRODUCED FOR VARYING
AMOUNTS OF FERTILIZER PER HECTARE WHICH IS USED ON THE PLOTS

Year    n    X    Y
1991    1    6    40
1992    2   10    44
1993    3   12    46
1994    4   14    48
1995    5   16    52
1996    6   18    58
1997    7   22    60
1998    8   24    68
1999    9   26    74
2000   10   32    80

Where:
Xᵢ = each observation of the variable X; in this case each value of the fertilizer in
kilograms used for every hectare.
X̄ = the mean value of X.
Yᵢ = each observation of the variable Y; in this case each value of the maize yield in
debes per hectare.
Ȳ = the mean value of Y.
Σ = the usual summation sign (summation over i = 1 to n), shown in the abbreviated form.

The Regression/Correlation statistic involves learning how to evaluate the b-coefficient using the
equation b = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)², and how to compute the various results which can be obtained from
this evaluation.
Figure 7 - 2 : The Scatter Diagram of Maize Produced with Fertilizer Used

The steps which we shall discuss involve analyzing the various parts of this equation
and performing the instructions in the equation to obtain the value of the coefficient “ b ”. Once this
coefficient has been obtained, the corresponding other coefficient in the regression equation, “ a ”,
is easily obtained because it can be expressed as a = Ȳ - bX̄. In addition, other values
which will help us in our analysis will be sought and learned. Table 7 - 2 is the tool we shall use
to evaluate the equation for the coefficient b.
For the estimation of the b-coefficient we use Table 7 - 2 to assist us in the analysis. We
must now learn to show the deviations using lower case symbols, such that

Xᵢ - X̄ = xᵢ   and   Yᵢ - Ȳ = yᵢ.

The numerator of the b expression is therefore Σxᵢyᵢ, and the denominator of the same expression is Σxᵢ².

These are the values of X, Y and XY in deviation form, and the summation sign is of
course an instruction to add all the values involved. Accordingly, the equation for the b-coefficient
is b = Σxᵢyᵢ / Σxᵢ². Use Table 7 - 2 and fill in the values to calculate b.
TABLE 7 - 2 : CALCULATIONS TO ESTIMATE THE REGRESSION EQUATION FOR THE MAIZE
PRODUCED (DEBES) WITH AMOUNTS OF FERTILIZER USED

YEAR   FERTILIZER Xᵢ    YIELD Yᵢ   xᵢ = Xᵢ - X̄   yᵢ = Yᵢ - Ȳ    xᵢ²    xᵢyᵢ
       (KG. PER HA.)   (DEBES)
1            6            40          - 12          - 17        144     204
2           10            44           - 8          - 13         64     104
3           12            46           - 6          - 11         36      66
4           14            48           - 4           - 9         16      36
5           16            52           - 2           - 5          4      10
6           18            58             0             1          0       0
7           22            60             4             3         16      12
8           24            68             6            11         36      66
9           26            74             8            17         64     136
10          32            80            14            23        196     322
TOTAL      180           570             0             0        576     956
Means:  X̄ = 18,  Ȳ = 57
Solution (Continued).
Using the values in Table 7 - 2, the solution for the b-coefficient is sought in the
following manner:

b = Σxᵢyᵢ / Σxᵢ² = 956 / 576 = 1.66    This is the slope of the regression line.

Then the value of “ a ”, which statisticians also call “ b₀ ” (because it is theoretically the
estimate of the value of Y at the initial condition where X is zero), is calculated as:

a = Ȳ - bX̄ = 57 - 1.66 × 18 = 57 - 29.88 = 27.12    This is the Y-intercept.

The estimated regression equation is therefore:

Ŷᵢ = 27.12 + 1.66 Xᵢ

The meaning of this equation is that if we are given any value of fertilizer application “ Xᵢ ” by
Ole Mbatian, we can estimate for him how much maize he can expect (in debes per hectare) at
that level of fertilizer application using this regression equation. Assume that he chooses to apply 18
kilograms of fertilizer per hectare. The maize yield he expects during a normal season (everything
else, like rainfall, soil conditions and other climatic variables, remaining constant) estimated
using the regression equation will be:

Ŷᵢ = 27.12 + 1.66 Xᵢ = 27.12 + 1.66 × 18 = 57.

The symbols for the calculated values of Y, as opposed to the observed values of Y, vary
from textbook to textbook. In your Textbook (King’oriah 2004) we use “ Yc ” for
“ Y-calculated ”. Here we are using “ Ŷᵢ ”, pronounced “ Y-hat ”. Other books use “ Ye ”, meaning
“ Y-estimated ”, and so on; it does not make any difference.
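To see all of these computations in one place, here is a minimal Python sketch (our own illustration, not from the Textbook; the variable names are ours) that reproduces the arithmetic of Table 7 - 2 and the fitted line:

```python
# Minimal sketch: least-squares slope and intercept for Ole Mbatian's data.
# Data are from Table 7 - 1; names are our own.

X = [6, 10, 12, 14, 16, 18, 22, 24, 26, 32]   # fertilizer, kg per hectare
Y = [40, 44, 46, 48, 52, 58, 60, 68, 74, 80]  # maize yield, debes per hectare

n = len(X)
x_bar = sum(X) / n                 # 18
y_bar = sum(Y) / n                 # 57

# Deviations from the means: x_i = X_i - X_bar, y_i = Y_i - Y_bar
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))   # 956
sum_xx = sum((x - x_bar) ** 2 for x in X)                       # 576

b = sum_xy / sum_xx                # 956 / 576, about 1.66 (slope)
a = y_bar - b * x_bar              # 57 - 1.66 * 18, about 27.12 (intercept)

print(f"Y_hat = {a:.2f} + {b:.2f} X")            # Y_hat = 27.12 + 1.66 X
print(f"Yield at 18 kg/ha: {a + b * 18:.0f}")    # about 57 debes
```

Running the sketch confirms the hand computation: the slope and intercept agree with Table 7 - 2, and the prediction at 18 kg per hectare lands on the mean yield of 57 debes, as it must, since the least squares line passes through the point (X̄, Ȳ).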

Tests of Significance for Regression Coefficients


Once the regression equation has been obtained, we need to test whether the equation
constants “ a ” and “ b ” are significant. This means we need to know whether or not they could have
occurred by mere chance. In order to do this we need to look for the variance of both parameters and
their standard errors of estimate. The variance of “ a ”, also called the variance of b₀, can be
estimated by extending Table 7 - 2 and using the estimated values of Y against the actual values
of Y to find each deviation of the actual value from the estimated value which lies on the
regression line. All the deviations are then squared, and the sum of squared errors of all the
deviations is calculated. Once this is done, the variance of each of the two constants and its standard error
of estimate can easily be computed. Let us now carry out this
exercise to demonstrate the process. In Table 7 - 3, the estimated values of Y are in the column
labeled Ŷᵢ. The deviations of the observed values of Y from the estimated values of Y are in the
column labeled eᵢ. Their squared values are found in the column labeled eᵢ². These, and a few
others in this table, are the calculations required to calculate the standard errors used in estimating the
constants b₀ = a and b₁ = b.

The estimated values Ŷᵢ are found in Table 7 - 3, in the third column from the right. These
have been computed using the equation Ŷᵢ = 27.12 + 1.66 Xᵢ: simply substitute the
observed values of Xᵢ into the equation and solve to find the estimated values of
Ŷᵢ. Other deviation figures in this table will be used later in our analysis. The variance of the
intercept is found using the equation shown below Table 7 - 3.

TABLE 7 - 3 : COMPUTED VALUES OF Y AND THE ASSOCIATED DEVIATIONS (ERRORS)

YEAR    Xᵢ    Xᵢ²     Y     xᵢ²     yᵢ     yᵢ²     Ŷᵢ       eᵢ        eᵢ²
1        6     36    40    144    - 17    289    37.08     2.92     8.5264
2       10    100    44     64    - 13    169    43.72     0.28     0.0784
3       12    144    46     36    - 11    121    47.04    - 1.04    1.0816
4       14    196    48     16     - 9     81    50.36    - 2.36    5.5696
5       16    256    52      4     - 5     25    53.68    - 1.68    2.8224
6       18    324    58      0       1      1    57.00     1.00     1.0000
7       22    484    60     16       3      9    63.64    - 3.64   13.2496
8       24    576    68     36      11    121    66.96     1.04     1.0816
9       26    676    74     64      17    289    70.28     3.72    13.8384
10      32   1024    80    196      23    529    80.24    - 0.24    0.0576
TOTAL  180   3816   570    576           1634                      47.3056

s²_b₀ = [Σeᵢ² / (n - k)] × [ΣXᵢ² / (n Σxᵢ²)]

In this equation, n = number of observations, and

k = the degrees of freedom lost through the estimation of the two parameters (here k = 2).

Other values can be found in Table 7 - 3. Let us now use the equation:

s²_b₀ = [Σeᵢ² / (n - k)] × [ΣXᵢ² / (n Σxᵢ²)] = [47.3056 / (10 - 2)] × [3816 / (10 × 576)] = 3.92

s²_b₁ = [Σeᵢ² / (n - k)] / Σxᵢ² = [47.3056 / (10 - 2)] / 576 = 0.01

Having found the variances of these constants, their standard errors are obviously the square roots
of these figures:

s_b₀ = √3.92 = 1.98        s_b₁ = √0.01 = 0.10

Let us now test how many standard errors each of the two constants lies away from zero.
This means we compute each of their t-values and compare them with the critical t
at the 5% alpha level. If the t-values resulting from these constants exceed the expected critical value of
t, then we conclude that each of them is significant. The calculated t-values for these parameters are:

t₀ = (b₀ - 0) / s_b₀ = (27.12 - 0) / 1.98 = 13.7

t₁ = (b₁ - 0) / s_b₁ = 1.66 / 0.10 = 16.6

Since both exceed t = 2.306, with 8 degrees of freedom at the 5% level of significance, we conclude
that both the intercept and the slope are significant at the 5% level.
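These tests can also be sketched in Python. The following is our own illustration of the variance and t-value computations above; the small difference from the textbook figure for t₁ (about 16.4 rather than 16.6) arises because the text rounds the standard error to 0.10:

```python
# Sketch: variances, standard errors and t-values for b0 = a and b1 = b.
import math

X = [6, 10, 12, 14, 16, 18, 22, 24, 26, 32]
Y = [40, 44, 46, 48, 52, 58, 60, 68, 74, 80]
n, k = len(X), 2                   # k = number of estimated parameters (a and b)
a, b = 27.12, 1.66                 # coefficients found above

e = [y - (a + b * x) for x, y in zip(X, Y)]        # residuals e_i
sse = sum(ei ** 2 for ei in e)                     # about 47.31
x_bar = sum(X) / n
sum_xx = sum((x - x_bar) ** 2 for x in X)          # 576
sum_X2 = sum(x ** 2 for x in X)                    # 3816

var_b0 = (sse / (n - k)) * (sum_X2 / (n * sum_xx))   # about 3.92
var_b1 = (sse / (n - k)) / sum_xx                    # about 0.01

t0 = a / math.sqrt(var_b0)    # about 13.7
t1 = b / math.sqrt(var_b1)    # about 16.4 (16.6 in the text, with s_b1 rounded to 0.10)
print(t0 > 2.306 and t1 > 2.306)   # True: both significant at the 5% level
```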

The Coefficient of Determination and the Correlation Coefficient


Using this maize-fertilizer example, some measure of the strength of the relationship can be
derived from the data in Table 7 - 3. This measure of the strength of relationship is known as the
Coefficient of Determination, from the fact that it determines how much of the variation in one data
series is accounted for by the other series, the independent variable. We begin by computing the
coefficient of non-determination, which is the ratio of the sum of squared errors between the
predicted values Ŷᵢ and the observed values Y, to the sum of squared errors between the actual observed
values of Y and the mean Ȳ.

The sum of squared errors between Ŷᵢ and Y:  Σeᵢ² = 47.3056

The sum of squared errors between Y and Ȳ:  Σyᵢ² = 1634

The coefficient of non-determination = Σeᵢ² / Σyᵢ² = 47.31 / 1634 = 0.0290

This coefficient of non-determination is the proportion, or the probability, of the variation
between the two variables X and Y which is not explained by the changes in the independent
variable X. The Coefficient of Determination is the complement of this coefficient of non-
determination. It is the proportion, or the probability, of the variation between X and Y which is
explained by the changes in the independent variable X. Therefore, the Coefficient of
Determination “ R² ” is calculated using the following technique:

R² = 1 - (Σeᵢ² / Σyᵢ²) = 1 - (47.31 / 1634) = 1 - 0.0290 = 0.9710

This is a very strong relationship. About 97.1% of the changes in the maize yield (in debes per
hectare) within Mr. Mbatian’s farm is explained by the quantities of fertilizer per hectare applied on
his farm. It also means that the regression equation which we have defined as
Ŷᵢ = 27.12 + 1.66 Xᵢ explains about 97.1% of the variation in output. The remaining 2.9% or
thereabouts is explained by other environmental factors on his farm which
have not been captured in the model.
Figure 7 - 2 : The estimated regression line

In any analysis of this kind, the strength of the relationship between X and Y is measured
by means of the size of the Coefficient of Determination. This coefficient varies between zero, the value
of no relationship at all, and 1.0000, the value of a perfect relationship. The example we have on Mr.
Mbatian’s farm is that of a near-perfect relationship, which actually shows in the observed data
clinging very closely to the regression line that we have constructed, as shown in Figure 7 - 2.
The other value which is used very frequently for theoretical work in Statistics is the
Correlation Coefficient. This is sometimes called Pearson’s Product Moment of Correlation,
after its discoverer, Prof. Karl Pearson (1857 - 1936). He also invented the Chi-Square statistic
and many other analytical techniques while he was working at the Galton Laboratory of the
University of London. From the computation above you will appreciate that he is the inventor
of many of the measures which we have just considered.
The Correlation Coefficient is the square root of the Coefficient of determination.
We shall use both measures extensively in Biostatistics from now on. Let us now compute the
Correlation Coefficient.

r = √R² = √[1 - (Σeᵢ² / Σyᵢ²)] = √(1 - 0.0290) = √0.9710

r = √0.9710 = 0.9854

The measure is useful in determining the nature of the slope of the regression line. A negative
relationship has a negatively sloping regression line and a negative Correlation Coefficient. A
positive relationship has a positive Correlation Coefficient. In our case the measure is positive.
This means that the more of this kind of fertilizer per hectare is applied on Mr. Mbatian’s
farm, the more maize yield in terms of debes per hectare he realizes at the end of each season.
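As a quick machine check, both coefficients can be computed directly from the sums of squares in Table 7 - 3. This Python fragment is our own sketch of the arithmetic, with our own variable names:

```python
# Sketch: coefficients of determination and correlation from Table 7 - 3.
import math

sum_e2 = 47.3056    # sum of squared errors about the regression line
sum_y2 = 1634       # sum of squared deviations of Y about its mean

non_det = sum_e2 / sum_y2      # coefficient of non-determination, about 0.0290
r_squared = 1 - non_det        # about 0.9710
r = math.sqrt(r_squared)       # about 0.9854; its sign follows the slope b

print(f"R^2 = {r_squared:.4f}, r = {r:.4f}")
```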

Computation Example

Having discussed the theory involved in the computation of the various coefficients of
regression and correlation, we need to try an example to illustrate the techniques of computing the
relevant coefficients quickly and efficiently.

Example
The following data have been obtained for the time required by a drug quality control
department to inspect outgoing drug tablets, for various percentages of those tablets found
defective.

Percent defective:            17   9  12   7   8  10  14  18  19   6
Inspection time (minutes):    48  50  43  36  45  49  55  63  55  36
(a) Find the estimated regression line Ŷᵢ = a + b Xᵢ.
(b) Determine the sum of deviations about this line for each of the ten observations.
(c) Test the null hypothesis that change in inspection time has no significant effect on the
percentage of the drug tablets found defective, using analysis of variance.
(d) Use any other test statistic to test the significance of the correlation coefficient.

Solution
1. The relevant data and preliminary computations are arranged in Table 7 - 4.
2. The following simple formulae assist in the solution of this kind of problem.
We already know that the deviations of X and Y from their means are defined in the
following manner:

Xᵢ - X̄ = x    and    Yᵢ - Ȳ = y

3. The shortcut computations will make use of these deviation formulas to compute various
figures which ultimately lead to the definition of the regression equation Ŷᵢ = a + b Xᵢ.
(a) To find the sum of squared deviations of all the observations of X from the mean
value X̄, and those of the squared deviations of Y from Ȳ, we use the following
shortcut expressions:

Σx² = ΣXᵢ² - (ΣX)²/n ,    Σy² = ΣYᵢ² - (ΣY)²/n

(b) The deviations of the cross multiplication between X and Y are found using the
expression:

Σxy = ΣXY - (ΣX)(ΣY)/n

TABLE 7 - 4 : PERCENT OF TABLETS FOUND DEFECTIVE FOR INSPECTION TIME IN MINUTES

Observation   Time in Minutes (X)   Percent Found Defective (Y)     X²      Y²      XY
1                   48                        17                   2304     289     816
2                   50                         9                   2500      81     450
3                   43                        12                   1849     144     516
4                   36                         7                   1296      49     252
5                   45                         8                   2025      64     360
6                   49                        10                   2401     100     490
7                   55                        14                   3025     196     770
8                   63                        18                   3969     324    1134
9                   55                        19                   3025     361    1045
10                  36                         6                   1296      36     216
Total              480                       120                  23690    1644    6049
Means              X̄ = 48                    Ȳ = 12

(c) If we remember these three equations, we have at our disposal a very
powerful tool for the fast computation of the regression coefficients.
To find the slope coefficient “ b ” we use the results of the expressions in ( a ) and ( b )
above:

b = Σxy / Σx²

Then the “ a ” coefficient can be found easily through the equation:

a = Ȳ - bX̄

Then the correlation coefficient is found using the following expression:

r = Σxy / √(Σx² Σy²)
4. The figures to be used in these computations are found in Table 7 - 4.
Whenever you are faced with a problem of this nature, it is prudent to tabulate
your data as in Table 7 - 4, and then follow this with the computations
using these simple formulas. We now demonstrate the immense power which is
available in the memorization of the simple formulas we have shown in ( a ), ( b ) and
( c ) above:

Σx² = ΣXᵢ² - (ΣX)²/n = 23690 - 480²/10 = 650

Σy² = ΣYᵢ² - (ΣY)²/n = 1644 - 120²/10 = 204

Σxy = ΣXY - (ΣX)(ΣY)/n = 6049 - (480 × 120)/10 = 289

b = Σxy / Σx² = 289 / 650 = 0.445

a = Ȳ - bX̄ = 12 - 0.445 × 48 = - 9.36

Accordingly, the regression equation is:

Ŷᵢ = a + b Xᵢ = - 9.36 + 0.445 X.

5. The values of the regression equation are relevant in the further tabulation of the figures
which will assist us to derive further tests. They are recorded in the third column from the left
of Table 7 - 5.
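For readers who prefer to check the shortcut formulas by machine, here is a short Python sketch (our own, with our own variable names) of steps 3 and 4; the intercept comes out at about -9.34 when the slope is carried at full precision, against -9.36 in the hand computation with b rounded to 0.445:

```python
# Sketch: shortcut computations for the drug-inspection example (Table 7 - 4).
import math

X = [48, 50, 43, 36, 45, 49, 55, 63, 55, 36]   # inspection time, minutes
Y = [17, 9, 12, 7, 8, 10, 14, 18, 19, 6]       # percent found defective
n = len(X)

sum_x2 = sum(x * x for x in X) - sum(X) ** 2 / n                  # 650
sum_y2 = sum(y * y for y in Y) - sum(Y) ** 2 / n                  # 204
sum_xy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n   # 289

b = sum_xy / sum_x2                       # about 0.445
a = sum(Y) / n - b * sum(X) / n           # about -9.34 (-9.36 with b rounded)
r = sum_xy / math.sqrt(sum_x2 * sum_y2)   # about 0.794, so r^2 is about 0.630

print(f"Y_hat = {a:.2f} + {b:.3f} X,  r = {r:.3f}")
```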

TABLE 7 - 5 : COMPUTATION OF DEVIATIONS AND THE SUM OF SQUARED DEVIATIONS

Time in Minutes (X)   Percent Found Defective (Y)     Ŷᵢ      D = Y - Ŷᵢ       D²
48                            17                     12.00       5.00       25.0000
50                             9                     12.89     - 3.89       15.1321
43                            12                      9.78       2.22        4.9284
36                             7                      6.66       0.34        0.1156
45                             8                     10.67     - 2.67        7.1289
49                            10                     12.45     - 2.45        6.0025
55                            14                     15.12     - 1.12        1.2544
63                            18                     18.68     - 0.68        0.4624
55                            19                     15.12       3.88       15.0544
36                             6                      6.66     - 0.66        0.4356
Sum of squared deviations: ΣD² = 75.5143

Using the data in Table 7 - 5, you can see how fast we have been able to compute the
important measures which take a lot of time to compute under ordinary circumstances. Obviously,
it is faster by computer, since there are proprietary packages which are designed for this kind of
work. However, for learning and examination purposes, this method has a lot of appeal. One is able
to move quickly, and to learn quickly at the same time.
6. It is now very easy to compute the standard error of estimate, which helps us to build a
probability model along the regression equation so that we can see how well our regression line fits as
an estimator of our actual field situation. What we need now is a method of calculating the
standard error of estimate. Given all the data in Table 7 - 5, we can use the expression below to
find the standard error of estimate of the regression equation.

Standard error of estimate:  σ_Y = √[ΣD² / (n - 2)] = √[75.5143 / (10 - 2)] = √9.4393 ≈ 3.07

If we state the hypothesis that there is no significant difference between the observed values and
the calculated values of Y for each value of X, we can build a two-tail t-distribution model centered
on the regression line. This is done by choosing the confidence level C = 0.95, and a 0.05
alpha level. Since we need the upper and lower tails on both sides of the regression equation, we
divide the alpha level by two to obtain 0.025 on either side.
For the 0.05 alpha level, we obtain the appropriate value from the usual t-tables on page
498 of your textbook (King’oriah, 2004). We find that our t-probability model on both sides of our
regression equation is built by the following critical value of t, using the two-tail model.
Remember this table in your Textbook has two alternatives, the two-tail and the single-tail
alternative. We use the columns indicated by the second row of the table for the two-tail model
and find that:

t(α, 10 - 2) = t(0.05, 8) = 2.306

We now have a probability model which states that, for any observed value of Y to belong to the
population of all those values estimated by the regression line, it should not lie more than 2.306
standard errors of estimate on either side of the regression line.

The t-value from the table can be useful if it is possible to compute the t-position for every
observation. This is done by asking ourselves how many standard errors of estimate each
observation lies away from the regression line. We therefore need a formula for computing the
actual number of standard errors for each observation. The observed values and the calculated values
are available in Table 7 - 5. The expression for computing the individual t-value for each observation is:

tᵢ = Dᵢ / σ_Y = (Yᵢ - Ŷᵢ) / σ_Y

Using this expression, all the observations can be located and their
distance on either side of the regression line can be calculated. This is done in Table 7 - 7.

TABLE 7 - 7 : COMPUTATION OF DEVIATIONS, THE SUM OF SQUARED DEVIATIONS AND
T-VALUES FOR EACH OBSERVATION

Time in Minutes (X)   Percent Found Defective (Y)     Ŷᵢ      D = Y - Ŷᵢ       D²       t = Dᵢ/σ_Y
48                            17                     12.00       5.00       25.0000      1.6270
50                             9                     12.89     - 3.89       15.1321    - 1.2659
43                            12                      9.78       2.22        4.9284      0.7220
36                             7                      6.66       0.34        0.1156      0.1106
45                             8                     10.67     - 2.67        7.1289    - 0.8689
49                            10                     12.45     - 2.45        6.0025    - 0.7973
55                            14                     15.12     - 1.12        1.2544    - 0.3645
63                            18                     18.68     - 0.68        0.4624    - 0.2213
55                            19                     15.12       3.88       15.0544      1.2626
36                             6                      6.66     - 0.66        0.4356    - 0.2148

Sum of squared deviations: ΣD² = 75.5143

From the t-values in the right-most column of Table 7 - 7, we find that there is not a single
observation which lies more than the expected critical value of

t(α, 10 - 2) = t(0.05, 8) = 2.306

from the regression line. This tells us that there is a very good relationship between X and Y. This relationship, as
outlined by the regression equation, did not come about by chance. The regression equation is a very
good predictor of what is actually happening in the field. This also means that whatever parameters
of the regression line we have computed, they represent the actual situation regarding the
changes in Y which are caused by the changes in X. We can confidently say, at the 95% confidence
level, that the actual percentage of defective tablets which is found on the production line depends on the
inspection time in minutes. We may want to instruct our quality control staff to be more vigilant
with the inspection so that our drug product may have as few defective tablets as humanly and
technically possible.
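The per-observation screening of Table 7 - 7 is easy to automate. The sketch below (our own illustration, with our own names) flags any observation lying more than the critical 2.306 standard errors of estimate from the regression line:

```python
# Sketch: standard error of estimate and per-observation t-values (Table 7 - 7).
import math

X = [48, 50, 43, 36, 45, 49, 55, 63, 55, 36]
Y = [17, 9, 12, 7, 8, 10, 14, 18, 19, 6]
a, b = -9.36, 0.445                 # regression coefficients from above
n = len(X)

D = [y - (a + b * x) for x, y in zip(X, Y)]             # deviations Y - Y_hat
sigma_Y = math.sqrt(sum(d * d for d in D) / (n - 2))    # about 3.07

t_values = [d / sigma_Y for d in D]
critical_t = 2.306                  # two-tail t at alpha = 0.05, 8 d.f.
outliers = [x for x, t in zip(X, t_values) if abs(t) > critical_t]
print(outliers)                     # []: no observation lies beyond 2.306 sigma
```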

Analysis of Variance for Regression and Correlation


Statisticians are not content with only finding the a priori values of statistical computations.
They are always keen on making doubly sure that what they report is not due to mere chance.
Another tool which they employ for the purpose of data verification is what we learned in Chapter
Six. This is what we shall call the F-test in our discussion.
In regression analysis we are also interested in changes within the dependent variable
which are caused by each change in the independent variable. Actually the string of values of the
independent variable is analogous to Treatment which we learned in Analysis of Variance. Each
position or observation of the independent variable is a treatment, and we are interested to know
the impact of each one of them on the magnitude of the value of the dependent variable each time.
Analysis of variance for the Regression/Correlation operates at the highest level of
measurement (the ratio level) while the other statistic which we considered in Chapter Six operates
at all the other lower levels of measurement.
Use of analysis of variance in regression/correlation analysis tests the null hypothesis,
at whatever confidence level, that there is no linear relationship between the independent variable
and what we take to be the dependent variable. The null hypothesis is that the variation in the
dependent variable happened by chance, and is not due to the effects of the independent variable.
The alternative hypothesis is that what has been discovered in the initial stages of
regression/correlation analysis has not happened by chance: the relationship is statistically
significant. Therefore, to use the F-test we assume:
1. A normal distribution for the values of Y for each changing value of X. Any
observed value of Y is just one of the many which could have been observed. This
means that the values of Y are stochastic about the regression line.
2. All the values of the independent variable X are stochastic as well, and therefore the
distribution is bi-variate normal.
( a ) The null hypothesis is that there is no relationship between X and Y.
( b ) Also there is no change in Y resulting from any change in X.
3. In symbolic terms, the null and the alternative hypotheses of the regression/correlation
analysis could be stated in the following manner to reflect all the assumptions we have made :-

(i) H : 
0 Y1   Y 2  .....   Y n  No change is recorded in variable Y as a result of the

changing levels of the variable X .


( ii ) H : 
A Y1   Y 2  .....   Y n  there is some statistically significant change

recorded in variable Y as a result of the changing levels of the variable X .

4. ( a ) The total error in F-tests comprises the explained variation and the unexplained
variation. It is the sum of squared differences between every observed value of the
dependent variable and the mean of the whole string of observations of the dependent
variable.

SS = TOTAL ERROR = EXPLAINED ERROR + UNEXPLAINED ERROR

( b ) The error caused by each observation, which we regard as an individual
treatment, is what is regarded as the explained error.

SST = EXPLAINED ERROR = VARIATION IN Y CAUSED BY EACH VALUE OF X

( c ) The residual error is the unexplained variation due to random
circumstances.

SSE = TOTAL ERROR - EXPLAINED ERROR
SSE = SS - SST

In our example, SS is the sum of squared differences recorded in Table 7 - 7 as
ΣD² = 75.5143. The explained portion of this error, as a proportion out of 1.0, can be
computed by multiplying this raw figure by the coefficient of determination:

SST = r² ΣD²

The value of r² is easily obtainable from the values already computed for this example.
Mathematically, it is expressed as:

r² = (Σxy)² / (Σx² Σy²)

Using our data, Σxy = 289, so (Σxy)² = 289², and:

r² = 289² / (650 × 204) = 0.630

This is the coefficient of determination, which indicates the probability, or
proportion, of the total variation explained by the X-variable, or the treatment.

Therefore, the explained error is:

SS = ΣD² = 75.5143

SST = r² ΣD² = 0.630 × 75.5143 = 47.574009

The error due to chance is:

SSE = SS - SST = 75.5143 - 47.574009 = 27.940291

5. Degrees of Freedom
The total degrees of freedom are due to the total variation of Y caused by the whole
environment:

SS d.f. = n - 1 = 10 - 1 = 9 d.f.

The unexplained degrees of freedom are lost due to the estimation of parameters in two
dimensions. Here we lose two degrees of freedom:

SSE d.f. = n - 2 = 10 - 2 = 8 d.f.

The treatment degrees of freedom are the difference between the total degrees of
freedom and the unexplained degrees of freedom:

SST d.f. = SS d.f. - SSE d.f. = (10 - 1) - (10 - 2) = 9 - 8 = 1 d.f.
6. This means that we have all the important ingredients for computing the calculated F-
statistic and for obtaining the critical values from the F-tables, as we have done before. Observe
the following data and pay meticulous attention to the accompanying discussion, because at the
end of all this we shall come to an important summary. The summary of all that we have
obtained so far can be recorded in the table which is suitable for all types of
analysis of variance, called the ANOVA table.

TABLE 7 - 8 : ANOVA TABLE FOR A BI-VARIATE REGRESSION ANALYSIS

SOURCE OF VARIATION       SUM OF SQUARES          DEGREES OF FREEDOM    VARIANCES (MEAN SQUARES)        CALCULATED Fc
Total                     SS = ΣD² = 75.5143      n - 1 = 9 d.f.
Explained by treatment    SST = 47.574009         1 d.f.                MST = 47.574009 / 1 = 47.57     Fc = MST / MSE = 47.57 / 3.493 = 13.62
Unexplained               SSE = 27.940291         10 - 2 = 8 d.f.       MSE = 27.940291 / 8 = 3.493

Study the summary table carefully. You will find that the computations included in its matrix are
actually the systematic steps in computing the F-statistic.
The critical value of F is obtained from page 490 at the 5% significance level and [1, n - 2] degrees
of freedom:

F(0.05; 1, n - 2) = F(0.05; 1, 8) = 5.3172

F0.05, 1, 8  5. 3172

7. Compare this F0.05, 1, 8  5. 3172 to the calculated F-value. Fc  13.586 . You

will find that we are justified to reject the null hypothesis that the changes is the values
of X do not have any effect on the changes of the values of Y at 5% significance level. In
our example this means the more watchful the quality control staff are, the more they can
detect the defective tablets at 5% significance level.
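The whole ANOVA computation condenses into a few lines. The following Python sketch (our own illustration, with our own names) reproduces Table 7 - 8 at full precision:

```python
# Sketch: ANOVA for the bi-variate regression (Table 7 - 8).
ss = 75.5143                          # total sum of squared deviations (Table 7 - 7)
r_squared = 289 ** 2 / (650 * 204)    # coefficient of determination, about 0.630

sst = r_squared * ss                  # explained (treatment) sum of squares, about 47.57
sse = ss - sst                        # unexplained sum of squares, about 27.94

mst = sst / 1                         # treatment mean square, 1 d.f.
mse = sse / 8                         # error mean square, n - 2 = 8 d.f.
F_c = mst / mse                       # about 13.6

print(F_c > 5.3172)                   # True: reject H0 at the 5% level
```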
Activity
Do all the necessary peripheral reading on this subject and attempt as many
examples in your Textbook as possible. Try to offer interpretations of the calculation
results as we have done in this chapter. It is only with constant exercise that one can master
these techniques properly, be able to apply them with confidence, and be able to interpret
most data which comes from proprietary computer packages.

EXERCISES

1. For a long time scholars have postulated that the predominance of agricultural labor
force in any country is an indication of the dependence of that country on primary modes of
production, and as a consequence, the per-capita income of each country that has a preponderance
of agricultural labor force should be low. A low level of agricultural labor force in any country
would then indicate high per- capita income levels for that country.
Using the data given below and the linear regression/correlation model determine what
percentage of per-capita income is determined by agricultural labor force.

Agricultural labor force (Millions):   9  10   8   7  10   4   5   5   6   8   7   4   9   5   8
Per capita income (US $ 00):           6   8   8   7   7  12   9   8   9  10  10  11   9  10  11

2. An urban sociologist practicing within Nairobi has done a middle- income survey,
comprising a sample of 15 households, with a view to determining whether the level of education
of any middle-income head of household within this City determines the annual income of
their families. The following is the result of his findings :-

Education level:                       7  12   8  12  14   9  18  14   8  12  17  10  16  10  13
Annual income (K.Shs 100,000):        18  32  28  24  22  32  36  26  26  28  28  32  30  20  18

(i) Compute your bi-variate correlation coefficient and the coefficient of determination.
( ii ) Use the shortcut formulae for Regression/correlation analysis to compute the
equation of the regression line of income (Y) on education level ( X ).
( iii ) By what means can the sociologist confirm that there is indeed a relationship?
(Here you should describe any one of the methods of testing the significance of your
statistic.)

3. A survey of 12 couples is done on the number of children they have (Y) as compared to
the number of children they had stated previously they would have liked to have (X).
(a) Find the regression equation on this phenomenon, computing all the appropriate
regression coefficients.
(b) What is the correlation coefficient, the Coefficient of determination, Coefficient of non-
determination and that of Alienation with regard to this experiment? What is your interpretation
on all these?

Couple 1 2 3 4 5 6 7 8 9 10 11 12
Y 4 3 0 2 4 3 0 4 3 1 3 1
X 3 3 0 2 2 3 0 3 2 1 3 2
