INTRODUCTION TO PROBABILITY
AND STATISTICS
    FOURTEENTH EDITION
      Chapter 12
      Linear Regression and
      Correlation
INTRODUCTION
• In Chapter 11, we used ANOVA to investigate the effect of various
  factor-level combinations (treatments) on a response x.
• Our objective was to see whether the treatment means were different.
• In Chapters 12 and 13, we investigate a response y which is affected
  by various independent variables, xi.
• Our objective is to use the information provided by the xi to predict
  the value of y.
EXAMPLE
• Let y be a student's college achievement, measured by his/her GPA.
  This might be a function of several variables:
    x1 = rank in high school class
    x2 = high school's overall rating
    x3 = high school GPA
    x4 = SAT scores
• We want to predict y using knowledge of x1, x2, x3 and x4.
EXAMPLE
• Let y be the monthly sales revenue for a company. This might be a
  function of several variables:
    x1 = advertising expenditure
    x2 = time of year
    x3 = state of economy
    x4 = size of inventory
• We want to predict y using knowledge of x1, x2, x3 and x4.
SOME QUESTIONS
• Which of the independent variables are useful and which are not?
• How could we create a prediction equation to allow us to predict y
  using knowledge of x1, x2, x3, etc.?
• How good is this prediction?
We start with the simplest case, in which the response y is a function
of a single independent variable, x.
A SIMPLE LINEAR MODEL
• In Chapter 3, we used the equation of a line to describe the
  relationship between y and x for a sample of n pairs, (x, y).
• If we want to describe the relationship between y and x for the whole
  population, there are two models we can choose:
  • Deterministic model: y = α + βx
  • Probabilistic model:
    – y = deterministic model + random error
    – y = α + βx + ε
  A SIMPLE LINEAR MODEL
• Since the bivariate measurements that we observe do not generally
  fall exactly on a straight line, we choose to use the probabilistic
  model:
      y = α + βx + ε
      E(y) = α + βx
Points deviate from the line of means, E(y) = α + βx, by an amount ε,
where ε has a normal distribution with mean 0 and variance σ².
THE RANDOM ERROR
• The line of means, E(y) = α + βx, describes the average value of y
  for any fixed value of x.
• The population of measurements is generated as y deviates from the
  population line by ε. We estimate α and β using sample information.
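A short simulation can make the probabilistic model concrete. The sketch below (Python; the parameter values α = 40, β = 0.75, σ = 8 are illustrative assumptions, not values from the text) generates data from y = α + βx + ε and checks that the deviations from the line of means average near zero:

```python
import numpy as np

# Illustrative simulation; alpha, beta, sigma are assumed values,
# not taken from the text.
rng = np.random.default_rng(seed=1)
alpha, beta, sigma = 40.0, 0.75, 8.0
x = rng.uniform(20, 80, size=200)
eps = rng.normal(0.0, sigma, size=200)  # random error: N(0, sigma^2)
y = alpha + beta * x + eps              # probabilistic model
deviations = y - (alpha + beta * x)     # distance from the line of means
print(abs(deviations.mean()) < 3.0)     # True: deviations average near 0
```

The deviations are exactly the simulated errors ε, so their sample mean sits close to the population mean of 0.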
             THE METHOD OF
             LEAST SQUARES
• The equation of the best-fitting line is calculated using a set of
  n pairs (xi, yi).
• We choose our estimates a and b to estimate α and β so that the
  vertical distances of the points from the line are minimized.

Best-fitting line: ŷ = a + bx

Choose a and b to minimize
  SSE = Σ(y − ŷ)² = Σ(y − a − bx)²
LEAST SQUARES
ESTIMATORS
• Calculate the sums of squares:

  Sxx = Σx² − (Σx)²/n        Syy = Σy² − (Σy)²/n

  Sxy = Σxy − (Σx)(Σy)/n

Best-fitting line: ŷ = a + bx, where

  b = Sxy/Sxx   and   a = ȳ − b x̄
EXAMPLE
The table shows the math achievement test
scores for a random sample of n = 10 college
freshmen, along with their final calculus
grades.
  Student            1    2    3    4    5    6    7    8    9   10
  Math test, x      39   43   21   64   57   47   28   75   34   52
  Calculus grade, y 65   78   52   82   92   89   73   98   56   75

Use your calculator to find the sums and sums of squares:

  Σx = 460       Σy = 760
  Σx² = 23634    Σy² = 59816
  Σxy = 36854
  x̄ = 46         ȳ = 76
EXAMPLE
Sxx = 23634 − (460)²/10 = 2474
Syy = 59816 − (760)²/10 = 2056
Sxy = 36854 − (460)(760)/10 = 1894

b = 1894/2474 = .76556   and   a = 76 − .76556(46) = 40.78

Best-fitting line: ŷ = 40.78 + .77x
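These hand calculations can be checked with a few lines of code. A minimal Python sketch, using the ten (x, y) pairs from the table, reproduces Sxx, Syy, Sxy, and the least squares estimates:

```python
# The ten (x, y) pairs from the table.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)
Sxx = sum(v*v for v in x) - sum(x)**2 / n                # 2474.0
Syy = sum(v*v for v in y) - sum(y)**2 / n                # 2056.0
Sxy = sum(a*b for a, b in zip(x, y)) - sum(x)*sum(y)/n   # 1894.0
b = Sxy / Sxx                  # slope estimate
a = sum(y)/n - b * sum(x)/n    # intercept estimate
print(round(b, 5), round(a, 2))  # 0.76556 40.78
```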
    THE ANALYSIS OF VARIANCE
• The total variation in the experiment is measured by the total sum
  of squares:
      Total SS = Syy = Σ(y − ȳ)²
• The Total SS is divided into two parts:
  ✓ SSR (sum of squares for regression): measures the variation
    explained by using x in the model.
  ✓ SSE (sum of squares for error): measures the leftover variation
    not explained by x.
THE ANALYSIS OF
VARIANCE
We calculate:

  SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741

  SSE = Total SS − SSR = Syy − (Sxy)²/Sxx
      = 2056 − 1449.9741 = 606.0259
THE ANOVA TABLE
Total df = n − 1                  Mean Squares
Regression df = 1                 MSR = SSR/(1)
Error df = n − 1 − 1 = n − 2      MSE = SSE/(n − 2)

  Source       df       SS         MS            F
  Regression   1        SSR        SSR/(1)       MSR/MSE
  Error        n − 2    SSE        SSE/(n − 2)
  Total        n − 1    Total SS
THE CALCULUS PROBLEM
  SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741

  SSE = Total SS − SSR = Syy − (Sxy)²/Sxx
      = 2056 − 1449.9741 = 606.0259

  Source       df    SS          MS          F
  Regression   1     1449.9741   1449.9741   19.14
  Error        8     606.0259    75.7532
  Total        9     2056.0000
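The ANOVA quantities follow directly from the sums of squares. A minimal Python sketch, using the values computed above:

```python
# ANOVA quantities from the sums of squares computed earlier.
Sxx, Syy, Sxy, n = 2474.0, 2056.0, 1894.0, 10
SSR = Sxy**2 / Sxx          # variation explained by x
SSE = Syy - SSR             # leftover (unexplained) variation
MSE = SSE / (n - 2)         # best estimate of sigma^2
F = (SSR / 1) / MSE         # F statistic with 1 and n-2 df
print(round(SSR, 4), round(SSE, 4), round(F, 2))  # 1449.9741 606.0259 19.14
```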
TESTING THE USEFULNESS
OF THE MODEL
• The first question to ask is whether the
  independent variable x is of any use in
  predicting y.
• If it is not, then the value of y does not change, regardless of the
  value of x. This implies that the slope of the line, β, is zero.

            H0: β = 0 versus Ha: β ≠ 0
TESTING THE
USEFULNESS OF THE
MODEL
• The test statistic is a function of b, our best estimate of β. Using
  MSE as the best estimate of the random variation σ², we obtain a
  t statistic:

  Test statistic: t = (b − 0) / √(MSE/Sxx), which has a t distribution
  with df = n − 2,

  or a confidence interval: b ± tα/2 √(MSE/Sxx)
THE CALCULUS PROBLEM
• Is there a significant relationship between the calculus grades and
  the test scores at the 5% level of significance?

      H0: β = 0 versus Ha: β ≠ 0

      t = (b − 0)/√(MSE/Sxx) = (.7656 − 0)/√(75.7532/2474) = 4.38

  Reject H0 when |t| > 2.306. Since t = 4.38 falls into the rejection
  region, H0 is rejected.
There is a significant linear relationship between the calculus grades and the
test scores for the population of college freshmen.
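The t statistic can be verified numerically. A minimal Python sketch, using the estimates from the calculus example:

```python
from math import sqrt

# Values from the calculus example above.
b, MSE, Sxx, n = 0.76556, 75.7532, 2474.0, 10
t = (b - 0) / sqrt(MSE / Sxx)    # test statistic for H0: beta = 0
print(round(t, 2))               # 4.38 (compare with t_.025 = 2.306, df = 8)
```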
THE F TEST
• You can test the overall usefulness of the model using an F test. If
  the model is useful, MSR will be large compared to the unexplained
  variation, MSE.

  To test H0: the model is useful in predicting y,

  Test statistic: F = MSR/MSE

  Reject H0 if F > Fα with 1 and n − 2 df.

  This test is exactly equivalent to the t-test, with t² = F.
MINITAB OUTPUT
Regression Analysis: y versus x

The regression equation is y = 40.8 + 0.766 x

Predictor         Coef     SE Coef      T       P
Constant        40.784       8.507    4.79   0.001
x               0.7656      0.1750    4.38   0.002

S = 8.70363    R-Sq = 70.5%    R-Sq(adj) = 66.8%

Analysis of Variance
Source           DF      SS        MS       F       P
Regression        1   1450.0    1450.0   19.14   0.002
Residual Error    8    606.0      75.8
Total             9   2056.0

✓ The regression equation is the least squares line; the Coef column
  gives the regression coefficients a and b.
✓ The T and P values in the row for x test H0: β = 0; the MS for
  Residual Error is MSE, and t² = F.
MEASURING THE STRENGTH
OF THE RELATIONSHIP
• If the independent variable x is useful in predicting y, you will
  want to know how well the model fits.
• The strength of the relationship between x and y can be measured
  using:

  Correlation coefficient: r = Sxy / √(Sxx Syy)

  Coefficient of determination: r² = (Sxy)² / (Sxx Syy) = SSR / Total SS
MEASURING THE STRENGTH
OF THE RELATIONSHIP
• Since Total SS = SSR + SSE, r² measures
✓ the proportion of the total variation in the responses that can be
  explained by using the independent variable x in the model.
✓ the percent reduction in the total variation achieved by using the
  regression equation rather than just using the sample mean y-bar to
  estimate y.

For the calculus problem, r² = SSR/Total SS = .705, or 70.5%. The
model is working well!
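A quick check of r and r² for the calculus problem, using the sums of squares computed earlier (a minimal Python sketch):

```python
from math import sqrt

# Sums of squares from the calculus example.
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0
r = Sxy / sqrt(Sxx * Syy)          # correlation coefficient
r2 = Sxy**2 / (Sxx * Syy)          # coefficient of determination = SSR/Total SS
print(round(r, 4), round(r2, 3))   # 0.8398 0.705
```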
INTERPRETING A
SIGNIFICANT REGRESSION
 •   Even if you do not reject the null hypothesis
     that the slope of the line equals 0, it does
     not necessarily mean that y and x are
     unrelated.
 •   Type II error—falsely declaring that the
     slope is 0 and that x and y are unrelated.
 •   It may happen that y and x are perfectly
     related in a nonlinear way.
SOME CAUTIONS
•   You may have fit the wrong model.
•   Extrapolation—predicting values of y
    outside the range of the fitted data.
•   Causality—Do not conclude that x causes
    y. There may be an unknown variable at
    work!
CHECKING THE
REGRESSION ASSUMPTIONS
 • Remember that the results of a regression
   analysis are only valid when the necessary
   assumptions have been satisfied.
1. The relationship between x and y is linear, given by
   y = α + βx + ε.
2. The random error terms ε are independent and, for any value of x,
   have a normal distribution with mean 0 and variance σ².
DIAGNOSTIC TOOLS
• We use the same diagnostic tools
  used in Chapter 11 to check the
  normality assumption and the
  assumption of equal variances.
 1. Normal probability plot of
    residuals
 2. Plot of residuals versus fit or
    residuals versus variables
RESIDUALS
• The residual error is the "leftover" variation in each data point
  after the variation explained by the regression model has been
  removed.

      Residual = yi − ŷi = yi − a − bxi

• If all assumptions have been met, these residuals should be normal,
  with mean 0 and variance σ².
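As a sketch of the residual calculation, the Python lines below compute the ten residuals for the calculus example; a standard property of least squares is that the residuals sum to zero (up to rounding):

```python
# Residuals for the calculus example, using the fitted line y-hat = a + b*x.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
a, b = 40.78424, 0.76556
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-6)  # True: least squares residuals sum to zero
```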
NORMAL PROBABILITY PLOT
✓ If the normality assumption is valid,
  the plot should resemble a straight
  line, sloping upward to the right.
✓ If not, you will often see the pattern
  fail in the tails of the graph.
[Figure: Normal probability plot of the residuals (response is y),
with residuals on the horizontal axis and normal percentiles on the
vertical axis.]
RESIDUALS VERSUS FITS
✓ If the equal variance assumption is
  valid, the plot should appear as a
  random scatter around the zero
  center line.
✓ If not, you will see a pattern in the
  residuals.

[Figure: Residuals versus the fitted values (response is y).]
ESTIMATION AND
PREDICTION
• Once you have
  ✓ determined that the regression line is useful
  ✓ used the diagnostic plots to check for violation of the
    regression assumptions,
• you are ready to use the regression line to
  ✓ Estimate the average value of y for a given value of x
  ✓ Predict a particular value of y for a given value of x.
ESTIMATION AND
PREDICTION
Estimating a
particular value of y
when x = x0
                        Estimating the
                        average value of
                        y when x = x0
ESTIMATION AND
PREDICTION
• The best estimate of either E(y) or y for a given value x = x0 is

      ŷ = a + bx0

• Particular values of y are more difficult to predict, requiring a
  wider range of values in the prediction interval.
ESTIMATION AND
PREDICTION
To estimate the average value of y when x = x0:

  ŷ ± tα/2 √[ MSE (1/n + (x0 − x̄)²/Sxx) ]

To predict a particular value of y when x = x0:

  ŷ ± tα/2 √[ MSE (1 + 1/n + (x0 − x̄)²/Sxx) ]
THE CALCULUS
PROBLEM
Estimate the average calculus grade for students whose achievement
score is 50, with a 95% confidence interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06

  ŷ ± 2.306 √[ 75.7532 (1/10 + (50 − 46)²/2474) ]

  79.06 ± 6.55, or 72.51 to 85.61.
 THE CALCULUS
 PROBLEM
Estimate the calculus grade for a particular student whose achievement
score is 50, with a 95% prediction interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06

  ŷ ± 2.306 √[ 75.7532 (1 + 1/10 + (50 − 46)²/2474) ]

  79.06 ± 21.11, or 57.95 to 100.17.

Notice how much wider this interval is!
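Both intervals can be reproduced with a short calculation. A Python sketch using the values from the calculus example (t.025 = 2.306 for df = 8):

```python
from math import sqrt

# Values from the calculus example; t_crit = t_.025 with df = 8.
n, MSE, Sxx, xbar = 10, 75.7532, 2474.0, 46.0
a, b, x0, t_crit = 40.78424, 0.76556, 50.0, 2.306
y_hat = a + b * x0
half_ci = t_crit * sqrt(MSE * (1/n + (x0 - xbar)**2 / Sxx))      # mean of y
half_pi = t_crit * sqrt(MSE * (1 + 1/n + (x0 - xbar)**2 / Sxx))  # single y
print(round(y_hat, 2), round(half_ci, 2), round(half_pi, 2))  # 79.06 6.55 21.11
```

The extra "1 +" inside the prediction-interval radical accounts for the variability of a single observation around its mean, which is why that interval is so much wider.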
MINITAB OUTPUT
            Confidence and prediction
            intervals when x = 50
Predicted Values for New Observations
New Obs   Fit     SE Fit    95.0% CI                                95.0% PI
1       79.06       2.84   (72.51, 85.61)                         (57.95,100.17)
Values of Predictors for New Observations
New Obs         x
1            50.0

[Figure: Fitted line plot of y = 40.78 + 0.7656x with 95% CI and
95% PI bands; S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%.]

✓ Green prediction bands are always wider than red confidence bands.
✓ Both intervals are narrowest when x = x̄.
CORRELATION ANALYSIS
• The strength of the relationship between x and y is measured using
  the coefficient of correlation:

  Correlation coefficient: r = Sxy / √(Sxx Syy)

• Recall from Chapter 3 that
  (1) −1 ≤ r ≤ 1
  (2) r and b have the same sign
  (3) r ≈ 0 means no linear relationship
  (4) r near 1 or −1 means a strong (+) or (−) relationship
EXAMPLE
The table shows the heights and weights of
n = 10 randomly selected college football
players.
  Player       1    2    3    4    5    6    7    8    9   10
  Height, x   73   71   75   72   72   75   67   69   71   69
  Weight, y  185  175  200  210  190  195  150  170  180  175

Use your calculator to find the sums and sums of squares:

  Sxy = 328    Sxx = 60.4    Syy = 2610

  r = 328 / √[(60.4)(2610)] = .8261
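The sums of squares and r for the football data can be checked in code. A minimal Python sketch:

```python
from math import sqrt

# Heights and weights of the n = 10 football players from the table.
height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(height)
Sxx = sum(h*h for h in height) - sum(height)**2 / n
Syy = sum(w*w for w in weight) - sum(weight)**2 / n
Sxy = sum(h*w for h, w in zip(height, weight)) - sum(height)*sum(weight)/n
r = Sxy / sqrt(Sxx * Syy)
print(round(Sxy, 1), round(Sxx, 1), round(Syy, 1), round(r, 4))
```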
FOOTBALL PLAYERS
[Figure: Scatterplot of Weight vs Height for the n = 10 players.]

r = .8261: strong positive correlation. As the player's height
increases, so does his weight.
SOME CORRELATION PATTERNS
• Use the Exploring Correlation applet to explore some correlation
  patterns:
  – r = 0: no correlation
  – r = .931: strong positive correlation
  – r = 1: linear relationship
  – r = −.67: weaker negative correlation
INFERENCE USING R
• The population coefficient of correlation is called ρ ("rho"). We
  can test for a significant correlation between x and y using a
  t test:

  To test H0: ρ = 0 versus Ha: ρ ≠ 0,

  Test statistic: t = r √[(n − 2)/(1 − r²)]

  Reject H0 if t > tα/2 or t < −tα/2 with n − 2 df.

  This test is exactly equivalent to the t-test for the slope, β = 0.
EXAMPLE
Is there a significant positive correlation between weight and height
(r = .8261) in the population of all college football players?

  H0: ρ = 0 versus Ha: ρ > 0

  Test statistic: t = r √[(n − 2)/(1 − r²)]
                    = .8261 √[8/(1 − .8261²)] = 4.15

Use the t-table with n − 2 = 8 df to bound the p-value as
p-value < .005. There is a significant positive correlation.
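A quick numeric check of this test statistic (a minimal Python sketch):

```python
from math import sqrt

# Test statistic for H0: rho = 0, using r from the football example.
r, n = 0.8261, 10
t = r * sqrt((n - 2) / (1 - r**2))
print(round(t, 2))  # 4.15
```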
KEY CONCEPTS
I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the appropriate model
   is y = α + βx + ε.
2. The random error ε has a normal distribution with mean 0 and
   variance σ².
II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to minimize SSE, the sum
   of the squared deviations about the regression line, ŷ = a + bx.
2. The least squares estimates are b = Sxy/Sxx and a = ȳ − b x̄.
KEY CONCEPTS
III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = Syy and SSR = (Sxy)² / Sxx.
2. The best estimate of σ² is MSE = SSE / (n − 2).
IV. Testing, Estimation, and Prediction
1.   A test for the significance of the linear regression, H0: β = 0,
     can be implemented using one of two test statistics:

         t = b / √(MSE/Sxx)    or    F = MSR/MSE
KEY CONCEPTS
2.   The strength of the relationship between x and y can be measured
     using

         R² = SSR / Total SS

     which gets closer to 1 as the relationship gets stronger.
3.   Use residual plots to check for nonnormality,
     inequality of variances, and an incorrectly fit model.
4.   Confidence intervals can be constructed to estimate the intercept
     α and slope β of the regression line and to estimate the average
     value of y, E(y), for a given value of x.
5.   Prediction intervals can be constructed to predict a
     particular observation, y, for a given value of x. For a
     given x, prediction intervals are always wider than
     confidence intervals.
   KEY CONCEPTS
V. Correlation Analysis
1. Use the correlation coefficient to measure the relationship between
   x and y when both variables are random:

       r = Sxy / √(Sxx Syy)

2. The sign of r indicates the direction of the relationship; r near 0
   indicates no linear relationship, and r near 1 or −1 indicates a
   strong linear relationship.
3. A test of the significance of the correlation coefficient is
   identical to the test of the slope β.