LINEAR REGRESSION
A mathematical equation that allows us to predict the values of one dependent variable from
known values of one or more independent variables is called a regression equation. The term
regression equation is derived from the original heredity studies made by Francis Galton. In his
study, he observed that the heights of the sons of tall fathers, over successive generations, regressed
toward the mean height of the population. In other words, sons of unusually tall fathers tend to be
shorter than their fathers, and sons of unusually short fathers tend to be taller than their fathers. Today
the term regression is applied to all types of prediction problems and does not necessarily imply a
regression toward a population mean.
       In the study of linear regression, we consider the problem of estimating or predicting the
value of a dependent variable Y on the basis of a known measurement of an independent and
frequently controlled variable X.
        Using a scatter diagram, we can determine if the two variables are linearly related to some
extent. Once a reasonable linear relationship has been ascertained, we usually express it
mathematically by a straight-line equation called the linear regression line. The linear regression
line is written using the slope-intercept form

$$\hat{y} = a + bx$$

where the constants a and b represent the y-intercept and slope, respectively. The symbol $\hat{y}$ is used
here to distinguish between the value given by the regression line and an actual observed value y for
some value of x.
        Once the point estimates a and b are determined from the sample data, the linear regression
line can be used to predict the value $\hat{y}$ corresponding to any given value x.
Estimation of Parameters. Given the sample $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$, the least-squares estimates
of the parameters in the regression line

$$\hat{y} = a + bx$$

are obtained from the formulas

$$b = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}$$

and

$$a = \bar{y} - b\bar{x}$$
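The two formulas translate directly into a few lines of code. The following is a minimal sketch in plain Python, not a reference implementation; the function name `least_squares_line` and the four sanity-check points (chosen to lie exactly on y = 1 + 2x) are illustrative assumptions, not part of the notes.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the fitted line y_hat = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    # Slope: b = [n*Sum(xy) - Sum(x)*Sum(y)] / [n*Sum(x^2) - (Sum(x))^2]
    b = (n * sum_xy - sum_x * sum_y) / denom
    # Intercept: a = y_bar - b * x_bar
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Sanity check on hypothetical points that lie exactly on y = 1 + 2x.
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Because the check points are exactly linear, the fitted intercept and slope should recover 1 and 2. The guard on `denom` reflects the fact that the slope formula breaks down when every x value is the same.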
Example 1. Consider the following data:

        x     1     2     3     4     5     6
        y     6     4     3     5     4     2

   (a) Find the equation of the regression line.
   (b) Graph the line on a scatter diagram.
   (c) Find the point estimate of $\mu_{Y|4}$, the mean of Y when x = 4.
       Solution:

                    x_i       y_i     x_i y_i     x_i^2     y_i^2
                     1         6         6          1        36
                     2         4         8          4        16
                     3         3         9          9         9
                     4         5        20         16        25
                     5         4        20         25        16
                     6         2        12         36         4
        Total       21        24        75         91       106

        (a) $n = 6$, $\sum x_i = 21$, $\sum y_i = 24$, $\sum x_i y_i = 75$, $\sum x_i^2 = 91$, $\sum y_i^2 = 106$,
            $\bar{y} = 4$ and $\bar{x} = 3.5$
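Substituting these values in the formula for b, we get

$$b = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2} = \frac{6(75) - (21)(24)}{6(91) - (21)^2} = \frac{450 - 504}{546 - 441} = \frac{-54}{105} = -0.5143$$

$$b = -0.514$$

$$a = \bar{y} - b\bar{x} = 4 - (-0.514)(3.5) = 4 + 1.799 = 5.799$$

$$\hat{y} = a + bx = 5.799 - 0.514x$$

This is the regression line.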
(b) [Figure: scatter diagram of the data, with the X axis marked from 1 to 8 and the Y axis marked from 1 to 7.]
        Since the slope of the regression line is negative, $\hat{y}$ decreases as x increases.
(c) $\hat{y} = a + bx = 5.799 + (-0.514)(4) = 5.799 - 2.056 = 3.743$
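The hand computation in this example can be checked mechanically. The sketch below rebuilds the column totals of the solution table from the six data pairs and then evaluates b, a and the prediction at x = 4; the variable names are illustrative only.

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 4, 3, 5, 4, 2]
n = len(xs)

# Column totals, as in the "Total" row of the solution table.
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Least-squares slope and intercept, kept at full precision.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Point estimate of the mean of Y when x = 4.
y_hat_4 = a + b * 4

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(b, 4), round(a, 4), round(y_hat_4, 4))
```

Carrying the unrounded slope -54/105 through the calculation gives an intercept of 5.8 rather than 5.799. The small difference comes only from rounding b to 0.514 before computing a, and the prediction at x = 4 still rounds to 3.743.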
LINEAR CORRELATION
        We shall consider here the problem of measuring the relationship between two variables X
and Y rather than predicting a value of Y from a knowledge of the independent variable X. For
example, if X represents the amount of money spent yearly on advertising by a retail merchandising
firm and Y represents their total yearly sales, we might ask whether a decrease in advertising is likely
to be accompanied by a decrease in the yearly sales.
        Correlation analysis attempts to measure the strength of such relationships between two
variables by means of a single number called a correlation coefficient.
        A linear correlation coefficient is defined to be a measure of the linear relationship between
the two random variables X and Y, and is denoted by r. It measures the extent to which
the points cluster about a straight line. By constructing a scatter diagram for the n pairs of
measurements $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$ in our random sample (as in the graph below), we are able
to draw certain conclusions concerning r. If the points follow closely a straight line of positive slope
as in (a), we have a high positive correlation between the two variables. On the other hand, if the
points follow closely a straight line of negative slope as in (b), we have a high negative correlation
between the two variables. The correlation between the two variables decreases numerically as the
scattering of points from a straight line increases. If the points follow a strictly random pattern as in
(c) below, we have zero correlation and conclude that no linear relationship exists between X and Y.
[Figure: four scatter diagrams of Y against X. (a) Points clustered about a straight line of positive slope. (b) Points clustered about a straight line of negative slope. (c) Points in a random pattern. (d) Points following a quadratic curve.]
        The correlation coefficient between two variables is a measure of their linear relationship, and
a value of $r = 0$ implies a lack of linearity and not a lack of association. Hence, if a strong quadratic
relationship exists between X and Y as indicated in (d), we still obtain a zero correlation even though
there is a strong nonlinear relationship.
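A small numerical illustration makes this point concrete. In the sketch below, the five points are a hypothetical data set chosen so that y is exactly x squared; the helper name `pearson_r` is likewise an illustrative assumption. It uses the usual computational form of the sample correlation coefficient.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data: y depends on x exactly, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(r)
```

Here both the sum of x and the sum of xy are zero, so the numerator vanishes and r is zero even though y is completely determined by x.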
     The most widely used measure of linear correlation between two variables is called the
PEARSON PRODUCT-MOMENT CORRELATION COEFFICIENT, or simply the SAMPLE
CORRELATION COEFFICIENT, and is denoted by r.
        The measure of linear relationship between two variables X and Y is estimated by the sample
correlation coefficient r, where

$$r = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}} = b\,\frac{S_x}{S_y}$$
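Here $S_x^2$ and $S_y^2$ are the sample variances of x and y, and SSE is the sum of the squared deviations of
the observed values of y from the regression line. Since $SSE = (n-1)(S_y^2 - b^2 S_x^2)$, dividing both
sides of the equation by $(n-1)S_y^2$ gives $SSE / [(n-1)S_y^2] = 1 - b^2 S_x^2 / S_y^2 = 1 - r^2$, and we obtain
the relation

$$r^2 = 1 - \frac{SSE}{(n-1)S_y^2}$$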
Since SSE and $S_y^2$ are always nonnegative, $r^2$ must be between 0 and 1.
Consequently r must range from -1 to +1. A value of r = -1 will occur when SSE = 0 and all
points lie exactly on a straight line having a negative slope. If all points lie exactly on a straight line
having a positive slope, once again SSE = 0 and we obtain a value r = +1. Hence a perfect linear
relationship exists between the values of X and Y in our sample when $r = \pm 1$. If r is close to +1 or
-1, the linear relationship between the two variables is strong and we say that we have a high
correlation. However, if r is close to zero, the linear relationship between X and Y is weak or perhaps
nonexistent.
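The relations among r, the slope b, the sample standard deviations and SSE can be confirmed numerically. The sketch below is one way to do that, reusing the six data pairs from the regression example earlier in these notes; all variable names are illustrative assumptions.

```python
import math

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 4, 3, 5, 4, 2]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# r from the computational (raw sums) formula.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Least-squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Sample variances S_x^2 and S_y^2 (divisor n - 1).
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

# SSE: squared vertical distances of the points from the fitted line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_from_slope = b * math.sqrt(var_x / var_y)      # r = b * S_x / S_y
r2_from_sse = 1 - sse / ((n - 1) * var_y)        # r^2 = 1 - SSE / ((n-1) S_y^2)

print(r, r_from_slope)
print(r * r, r2_from_sse)
```

Each printed pair should agree up to floating-point rounding, which is a quick way to see that the three expressions describe the same quantity.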
        A number that expresses the proportion of the total variation in the values of the variable Y
that can be accounted for or explained by the linear relationship with the values of the variable X is
usually referred to as the sample coefficient of determination and is denoted by $r^2$. Thus a correlation of
r = 0.6 means that 0.36 or 36% of the total variation of the values of Y in our sample is accounted for
by the linear relationship with the values of X.
The values of r and their interpretation

            r               Interpretation
            1               Perfect positive correlation
       0.91 to 0.99         Very highly positively correlated
       0.71 to 0.90         Highly positively correlated
       0.41 to 0.70         Marked or moderately positively correlated
       0.21 to 0.40         Low or slightly positively correlated
       0.01 to 0.20         Very low positive or negligible
      -0.01 to -0.20        Very low negative or negligible
      -0.21 to -0.40        Low or slightly negatively correlated
      -0.41 to -0.70        Marked or moderately negatively correlated
      -0.71 to -0.90        Highly negatively correlated
      -0.91 to -0.99        Very highly negatively correlated
           -1               Perfect negative correlation
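The table can be read as a simple lookup on the size and sign of r. The sketch below is one possible encoding, with some assumptions that are not in the table itself: the function name `interpret_r` is hypothetical, the labels are condensed versions of the table's wording, r is rounded to whole hundredths so that values falling between rows (such as 0.905) land in a row, and r = 0, which has no row, is reported as zero correlation.

```python
def interpret_r(r):
    """Map a correlation coefficient to a label based on the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        # Assumption: the table has no row for exactly zero.
        return "zero correlation"
    sign = "positive" if r > 0 else "negative"
    if abs(r) == 1:
        return f"perfect {sign} correlation"
    # Work in whole hundredths; clamp so that values very near 0 or 1
    # stay in the lowest and highest non-perfect rows.
    hundredths = min(99, max(1, round(abs(r) * 100)))
    if hundredths >= 91:
        strength = "very high"
    elif hundredths >= 71:
        strength = "high"
    elif hundredths >= 41:
        strength = "marked or moderate"
    elif hundredths >= 21:
        strength = "low or slight"
    else:
        strength = "very low or negligible"
    return f"{strength} {sign} correlation"


print(interpret_r(0.6))
print(interpret_r(-0.95))
```

Cut-offs like these are conventions rather than statistical laws, so the labels are best treated as a rough vocabulary for describing r.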
Example 1: Compute and interpret the correlation coefficient for the following data:

        X      4      5      9     14     18     22     24
        Y     16     22     11     16      7      3     17
       Solution:

                    x_i       y_i      x_i^2      y_i^2     x_i y_i
                     4        16         16        256         64
                     5        22         25        484        110
                     9        11         81        121         99
                    14        16        196        256        224
                    18         7        324         49        126
                    22         3        484          9         66
                    24        17        576        289        408
        Total       96        92       1702       1464       1097

        $n = 7$, $\sum x_i = 96$, $\sum y_i = 92$, $\sum x_i y_i = 1097$, $\sum x_i^2 = 1702$ and $\sum y_i^2 = 1464$
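Substituting these values in the formula for r, we get

$$\begin{aligned}
r &= \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}} \\
  &= \frac{7(1097) - (96)(92)}{\sqrt{[7(1702) - (96)^2][7(1464) - (92)^2]}} \\
  &= \frac{7679 - 8832}{\sqrt{(11914 - 9216)(10248 - 8464)}} \\
  &= \frac{-1153}{\sqrt{(2698)(1784)}} = \frac{-1153}{\sqrt{4813232}} = \frac{-1153}{2193.9079} = -0.5255462
\end{aligned}$$

$$r = -0.53$$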
       Since r= -0.53, the two variables X and Y are moderately negatively correlated.
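Each intermediate number in this solution can be reproduced in a few lines. The sketch below follows the same steps as the hand computation (totals, numerator, the two bracketed terms, then r); the variable names are illustrative.

```python
import math

xs = [4, 5, 9, 14, 18, 22, 24]
ys = [16, 22, 11, 16, 7, 3, 17]
n = len(xs)

# Totals of the solution table.
sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Pieces of the formula for r.
numerator = n * sum_xy - sum_x * sum_y
term_x = n * sum_x2 - sum_x ** 2
term_y = n * sum_y2 - sum_y ** 2
r = numerator / math.sqrt(term_x * term_y)

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(numerator, term_x, term_y)
print(round(r, 2))
```

Since the data are integers, the totals and the numerator are exact, and only the final square root and division involve floating-point arithmetic.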
Example 2. Compute and interpret the correlation coefficient for the aptitude scores and grade point
averages below:
                                     Grade-point Average          Aptitude Score
                                             Y                          X
                                            1.93                       565
                                            2.55                       525
                                            1.72                       477
                                            2.48                       555
                                            2.87                       502
                                            1.87                       469
                                            1.34                       517
                                            3.03                       555
                                            2.54                       576
                                            2.34                       559
                                            1.40                       574
                                            1.45                       578
                                            1.72                       548
                                            3.80                       656
                                            2.13                       688
                                            1.81                        465
                                            2.33                        661
                                            2.53                        477
                                            2.04                        490
                                            3.20                        524
Solution:

                 GPA (y_i)   AS (x_i)     x_i y_i       x_i^2        y_i^2
                   1.93        565        1090.45       319225       3.7249
                   2.55        525        1338.75       275625       6.5025
                   1.72        477         820.44       227529       2.9584
                   2.48        555        1376.40       308025       6.1504
                   2.87        502        1440.74       252004       8.2369
                   1.87        469         877.03       219961       3.4969
                   1.34        517         692.78       267289       1.7956
                   3.03        555        1681.65       308025       9.1809
                   2.54        576        1463.04       331776       6.4516
                   2.34        559        1308.06       312481       5.4756
                   1.40        574         803.60       329476       1.9600
                   1.45        578         838.10       334084       2.1025
                   1.72        548         942.56       300304       2.9584
                   3.80        656        2492.80       430336      14.4400
                   2.13        688        1465.44       473344       4.5369
                   1.81        465         841.65       216225       3.2761
                   2.33        661        1540.13       436921       5.4289
                   2.53        477        1206.81       227529       6.4009
                   2.04        490         999.60       240100       4.1616
                   3.20        524        1676.80       274576      10.2400
      TOTAL       45.08      10961       24896.83      6084835     109.4790
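$n = 20$, $\sum x_i = 10961$, $\sum y_i = 45.08$, $\sum x_i y_i = 24896.83$, $\sum x_i^2 = 6084835$ and $\sum y_i^2 = 109.4790$

$$\begin{aligned}
r &= \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}} \\
  &= \frac{20(24896.83) - (10961)(45.08)}{\sqrt{[20(6084835) - (10961)^2][20(109.4790) - (45.08)^2]}} \\
  &= \frac{497936.6 - 494121.88}{\sqrt{(121696700 - 120143521)(2189.58 - 2032.2064)}} \\
  &= \frac{3814.72}{\sqrt{(1553179)(157.3736)}} \\
  &= \frac{3814.72}{\sqrt{244429370.67}} \\
  &= \frac{3814.72}{15634.2371} \\
  &= 0.2440
\end{aligned}$$

$$r = 0.24$$

       Since r = 0.24, the grade-point averages have a low positive correlation with the aptitude scores.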
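With twenty data pairs and sums in the millions, a hand computation like this one is easy to get wrong by a single digit, so it is worth recomputing every intermediate quantity directly from the data. The sketch below does that in two ways: with the raw-sums formula used in the solution, and with the algebraically equivalent deviation form $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, which is brought in here as an independent cross-check and is not part of the solution above. The variable names are illustrative.

```python
import math

# y: grade-point average, x: aptitude score (20 students, in table order).
gpa = [1.93, 2.55, 1.72, 2.48, 2.87, 1.87, 1.34, 3.03, 2.54, 2.34,
       1.40, 1.45, 1.72, 3.80, 2.13, 1.81, 2.33, 2.53, 2.04, 3.20]
aptitude = [565, 525, 477, 555, 502, 469, 517, 555, 576, 559,
            574, 578, 548, 656, 688, 465, 661, 477, 490, 524]
n = len(gpa)

# Raw-sums formula, step by step.
sum_x, sum_y = sum(aptitude), sum(gpa)
sum_xy = sum(x * y for x, y in zip(aptitude, gpa))
sum_x2 = sum(x * x for x in aptitude)
sum_y2 = sum(y * y for y in gpa)
numerator = n * sum_xy - sum_x * sum_y
term_x = n * sum_x2 - sum_x ** 2
term_y = n * sum_y2 - sum_y ** 2
r_sums = numerator / math.sqrt(term_x * term_y)

# Deviation form: sums of products of deviations from the means.
x_bar, y_bar = sum_x / n, sum_y / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(aptitude, gpa))
s_xx = sum((x - x_bar) ** 2 for x in aptitude)
s_yy = sum((y - y_bar) ** 2 for y in gpa)
r_dev = s_xy / math.sqrt(s_xx * s_yy)

print(sum_x, round(sum_y, 2), round(sum_xy, 2), sum_x2, round(sum_y2, 4))
print(round(numerator, 2), term_x, round(term_y, 4))
print(round(r_sums, 4), round(r_dev, 4))
```

Comparing the printed values with the hand solution line by line shows where any slip occurred, and agreement between the two forms of r guards against an error in the formula itself.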
        The sample correlation coefficient r is a value computed from a random sample of n pairs of
measurements. Different random samples of size n from the same population will generally produce
different values of r.
EXERCISES: Solve each of the following problems. Show all solutions.
       1. The grades of a class of 9 students on a midterm report (x) and on the final examination
          (y) are as follows:
              x 77 50 71 72 81 94 96 99 67
              y 82 66 78 34 47 85 99 99 67
              (a) Find the equation of the regression line.
              (b) Estimate the final examination grade of a student who received a grade of 85 on the
                  midterm report but was ill at the time of the final examination.
              (c) Compute r.
       2. A study was made on the amount of converted sugar in a certain process at various
          temperatures. The data were coded and recorded as follows:
              Temperature, x      Converted Sugar, y        Temperature, x      Converted Sugar, y
                    1.0                  8.1                      1.6                  8.6
                    1.1                  7.8                      1.7                 10.2
                    1.2                  8.5                      1.8                  9.3
                    1.3                  9.8                      1.9                  9.2
                    1.4                  9.5                      2.0                 10.5
                    1.5                  8.9
              (a) Estimate the linear regression line.
              (b) Estimate the amount of converted sugar produced when the coded temperature is
                  1.75.
       3. A mathematics placement test is given to all entering freshmen at a small college. A
          student who receives a grade below 35 is denied admission to the regular mathematics
          course and placed in a remedial class. The placement test scores and the final grades for
          20 students who took the regular course were recorded as follows:
              Placement Test        Course Grade    Placement Test   Course Grade
                   50                    53              90               54
                   35                    41              80               91
                   35                    61              60               48
                   40                    56              60               71
                   55                    68              60               71
                   65                    36              40               47
                   35                    11              55               53
                   60                    70              50               68
                   90                    79              65               57
                   35                    59              50               79
       (a) Plot a scatter diagram.
       (b) Find the equation of the regression line to predict course grades from placement
           test scores.
       (c) Graph the line on the scatter diagram
       (d) If 60 is the minimum passing grade, below which placement test score should
           students in the future be denied admission to this course?
4. Compute and interpret the correlation for the following grades of 6 students selected at
   random.
       Mathematics Grade       70     92   80      74   65   83
       English Grade           74     84   63      87   78   90
5. The following data were obtained in a study of the relationship between the weight and
   chest size of infants at birth:
    Weight (kg) Chest Size (cm) Weight (kg)         Chest Size (cm)
       2.75                29.5             4.32          27.7
       2.15                26.3             2.31          28.3
       4.41                32.2             4.30          30.3
       5.52                36.5             3.71          28.7
       3.21                27.2
   (a) Calculate r.
   (b) Graph the line on a scatter diagram.
   (c) Find the point estimate of $\mu_{Y|4}$.