Lecture 3
Simple Linear Regression
Linear regression quantifies the relationship between one or
more predictor variables and one outcome variable. Linear regression is
used for predictive analysis and modeling. For example, linear regression
can be used to quantify the relative impacts of age, gender, and diet (the
predictor variables) on height (the outcome variable). With a single
predictor the method is called simple linear regression; with two or more
predictors it is called multiple regression.
Regression Model and Regression Equation
The equation that describes how y is related to x and an error term is called
the regression model. The regression model used in simple linear
regression follows.
$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad E\{Y\} = \beta_0 + \beta_1 x.$ (1)
The graph of the simple linear regression equation is a straight line; $\beta_0$ is
the y-intercept of the regression line, $\beta_1$ is the slope, and $E\{Y\}$ is the mean
or expected value of y for a given value of x.
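As a small illustration of the regression equation, the sketch below evaluates $E\{Y\} = \beta_0 + \beta_1 x$ for three lines with positive, negative, and zero slope. The function name and the parameter values are hypothetical and were chosen only for illustration; they do not come from the lecture.

```python
def mean_response(x, beta0, beta1):
    """Return the mean of y at x under the model E{Y} = beta0 + beta1 * x."""
    return beta0 + beta1 * x


# Hypothetical (beta0, beta1) pairs, one per slope sign.
cases = {
    "positive slope": (1.0, 0.5),
    "negative slope": (5.0, -0.5),
    "zero slope": (3.0, 0.0),
}

for label, (beta0, beta1) in cases.items():
    # Evaluate the line at a few x values to see its direction.
    means = [mean_response(x, beta0, beta1) for x in (0, 2, 4)]
    print(label, means)
```

With a zero slope the mean of y does not depend on x at all, so the line is horizontal.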
Examples of possible regression lines are shown in Figure 1.
FIGURE 1. POSSIBLE REGRESSION LINES IN SIMPLE LINEAR REGRESSION
Estimated Regression Equation
If the values of the population parameters $\beta_0$ and $\beta_1$ were known, we could
use equation (1) to compute the mean value of y for a given value of x. In
practice, the parameter values are not known and must be estimated using
sample data. Sample statistics (denoted b0 and b1) are computed as
estimates of the population parameters $\beta_0$ and $\beta_1$. Substituting the values of
the sample statistics b0 and b1 for $\beta_0$ and $\beta_1$ in the regression equation, we
obtain the estimated regression equation. The estimated regression
equation for simple linear regression follows.
ESTIMATED SIMPLE LINEAR REGRESSION EQUATION
$\hat{y} = b_0 + b_1 x.$ (2)
Figure 2 provides a summary of the estimation process for simple linear
regression.
FIGURE 2. THE ESTIMATION PROCESS IN SIMPLE LINEAR REGRESSION
The graph of the estimated simple linear regression equation is called the
estimated regression line; b0 is the y-intercept and b1 is the slope. In the
next section, we show how the least squares method can be used to
compute the values of b0 and b1 in the estimated regression equation.
In general, $\hat{y}$ is the point estimator of $E\{Y\}$, the mean value of y for a
given value of x.
$\beta_0$ and $\beta_1$ are the unknown parameters of interest, and b0 and b1 are the
sample statistics used to estimate the parameters.
In simple linear regression, each observation consists of two values:
one for the independent variable and one for the dependent variable.
Least Squares Method
Example 1. The least squares method is a procedure for
using sample data to find the estimated regression equation. To illustrate
the least squares method, suppose data were collected from a sample of 10
Armand’s Pizza Parlor restaurants located near college campuses. For the
ith observation or restaurant in the sample, xi is the size of the student
population (in thousands) and yi is the quarterly sales (in thousands of
dollars). The values of xi and yi for the 10 restaurants in the sample are
summarized in Table 1. We see that restaurant 1, with x1 = 2 and y1 = 58, is
near a campus with 2000 students and has quarterly sales of $58,000.
Restaurant 2, with x2 = 6 and y2 = 105, is near a campus with 6000
students and has quarterly sales of $105,000. The largest sales value is for
restaurant 10, which is near a campus with 26,000 students and has
quarterly sales of $202,000.
TABLE 1. STUDENT POPULATION AND QUARTERLY SALES DATA
FOR 10 ARMAND’S PIZZA PARLORS
Figure 3 is a scatter diagram of the data in Table 1.
FIGURE 3. SCATTER DIAGRAM OF STUDENT POPULATION
AND QUARTERLY SALES FOR ARMAND’S PIZZA PARLORS
For these data the relationship between the size of the student population
and quarterly sales appears to be approximated by a straight line; indeed, a
positive linear relationship is indicated between x and y. We therefore
choose the simple linear regression model to represent the relationship
between quarterly sales and student population. Given that choice, our next
task is to use the sample data in Table 1 to determine the values of b0 and b1
in the estimated simple linear regression equation. For the ith restaurant,
the estimated regression equation provides
$\hat{y}_i = b_0 + b_1 x_i$ (3)
where
$\hat{y}_i$ = predicted value of quarterly sales ($1000s) for the ith restaurant
b0 = the y-intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = size of the student population (1000s) for the ith restaurant
With $y_i$ denoting the observed (actual) sales for restaurant i and $\hat{y}_i$ in
equation (3) representing the predicted value of sales for restaurant i, every
restaurant in the sample will have an observed value of sales $y_i$ and a
predicted value of sales $\hat{y}_i$. For the estimated regression line to provide a
good fit to the data, we want the differences between the observed sales
values and the predicted sales values to be small.
The least squares method uses the sample data to provide the values of
b0 and b1 that minimize the sum of the squares of the deviations between
the observed values of the dependent variable $y_i$ and the predicted values of
the dependent variable $\hat{y}_i$. The criterion for the least squares method is
given by expression (4).
$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (4)
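The least squares criterion, the sum of squared deviations between observed and predicted values, can be sketched in a few lines of Python. The data and the two candidate lines below are hypothetical toy values, not the Armand's data, and the function name is illustrative.

```python
def sum_squared_errors(x, y, b0, b1):
    """Sum of squared deviations between observed y and predicted b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Two candidate lines; the second is the least squares line for this toy data.
sse_a = sum_squared_errors(x, y, b0=2.0, b1=0.5)
sse_b = sum_squared_errors(x, y, b0=2.2, b1=0.6)

print(sse_a, sse_b)  # the second value is the smaller of the two
```

The least squares method picks the pair (b0, b1) for which this quantity is as small as possible.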
Differential calculus can be used to show that the values of b0 and b1 that
minimize expression (4) can be found by using equations (5) and (6).
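$b_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ (5)
$b_0 = \bar{y} - b_1 \bar{x}$ (6)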
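A minimal Python sketch of the closed-form estimates in equations (5) and (6) follows. The function name `least_squares_fit` and the toy data are assumptions made for illustration; only the formulas (slope as the ratio of summed cross-deviations to summed squared x-deviations, intercept as the mean of y minus slope times the mean of x) are taken from the text.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) for the estimated line y_hat = b0 + b1 * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data (not from the lecture).
b0, b1 = least_squares_fit([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 4), round(b1, 4))
```

The guard against identical x values is a design choice of this sketch: with no variation in x the denominator of the slope formula is zero.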
Some of the calculations necessary to develop the least squares estimated
regression equation for Armand’s Pizza Parlors are shown in Table 2. With
the sample of 10 restaurants, we have n = 10 observations. Because
equations (5) and (6) require $\bar{x}$ and $\bar{y}$, we begin the calculations by
computing $\bar{x}$ and $\bar{y}$.
Using equations (5) and (6) and the information in Table 1, we can compute
the slope and intercept of the estimated regression equation for Armand’s
Pizza Parlors.
TABLE 2. CALCULATIONS FOR THE LEAST SQUARES ESTIMATED
REGRESSION EQUATION FOR ARMAND’S PIZZA PARLORS
The calculation of the slope (b1) proceeds as follows.
The calculation of the y intercept (b0) follows.
Thus, the estimated regression equation is $\hat{y} = 60 + 5x$.
Figure 4 shows the graph of this equation on the scatter diagram.
The slope of the estimated regression equation (b1 = 5) is positive,
implying that as student population increases, sales increase. In fact, we
can conclude (based on sales measured in $1000s and student population
in 1000s) that an increase in the student population of 1000 is associated
with an increase of $5000 in expected sales; that is, quarterly sales are
expected to increase by $5 per student.
If we believe the least squares estimated regression equation adequately
describes the relationship between x and y, it would seem reasonable to use
the estimated regression equation to predict the value of y for a given value
of x. For example, if we wanted to predict quarterly sales for a restaurant to
be located near a campus with 16,000 students, we would compute
$\hat{y} = 60 + 5(16) = 140$.
Hence, we would predict quarterly sales of $140,000 for this restaurant.
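The prediction step can be written as a short function. This is a sketch that assumes the fitted values b0 = 60 and b1 = 5 reported for Armand's Pizza Parlors; the function name is illustrative.

```python
def predict_quarterly_sales(student_population_thousands, b0=60.0, b1=5.0):
    """Predicted quarterly sales in $1000s for a student population in 1000s."""
    return b0 + b1 * student_population_thousands


# A campus with 16,000 students corresponds to x = 16.
predicted = predict_quarterly_sales(16)
print(predicted)  # 140.0, i.e. $140,000 in quarterly sales
```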
FIGURE 4. GRAPH OF THE ESTIMATED REGRESSION EQUATION FOR
ARMAND’S PIZZA PARLORS: $\hat{y} = 60 + 5x$
Example 2
Companies in the U.S. car rental market vary greatly in terms of the size of
the fleet, the number of locations, and annual revenue. In 2011 Hertz had
320,000 cars in service and annual revenue of approximately $4.2 billion.
The following data show the number of cars in service (1000s) and the
annual revenue ($ millions) for six smaller car rental companies (Auto
Rental News website, August 7, 2012).
a) Develop a scatter diagram with the number of cars in service as the
independent variable.
b) What does the scatter diagram developed in part (a) indicate about the
relationship between the two variables?
c) Use the least squares method to develop the estimated regression
equation.
d) For every additional car placed in service, estimate how much annual
revenue will change.
e) Fox Rent A Car has 15,000 cars in service. Use the estimated regression
equation developed in part (c) to predict annual revenue for Fox Rent A
Car.
SOLUTION of EXAMPLE 2
Table 3
a) Develop a scatter diagram with the number of cars in
service as the independent variable.
FIGURE 5
b) What does the scatter diagram developed in part (a)
indicate about the relationship between the two variables?
For these data the relationship between the number of cars in service
(1000s) and the annual revenue ($ millions) appears to be approximated
by a straight line; indeed, a positive linear relationship is indicated
between x and y. We therefore choose the simple linear regression model
to represent the relationship between the annual revenue ($ millions) of
the car rental company and the number of cars in service. Given that
choice, our next task is to use the sample data in Table 3 to determine the
values of b0 and b1 in the estimated simple linear regression equation. For
the ith car rental company, the estimated regression equation provides
$\hat{y}_i = b_0 + b_1 x_i$ (3)
where
$\hat{y}_i$ = predicted value of revenue ($ millions) for the ith car rental company
b0 = the y-intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = number of cars in service (1000s) for the ith car rental company
c) Use the least squares method to develop the estimated
regression equation.
Some of the calculations necessary to develop the least squares estimated
regression equation for the car rental companies are shown in Table 4. With
the sample of 6 companies, we have n = 6 observations. Because equations
(5) and (6) require $\bar{x}$ and $\bar{y}$, we begin the calculations by computing $\bar{x}$ and
$\bar{y}$.
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{43.5}{6} = 7.25$
$\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n} = \dfrac{462}{6} = 77$
Using the above equations and the information in Table 3, we can compute
the slope and intercept of the estimated regression equation for the car
rental companies.
TABLE 4. CALCULATIONS FOR THE LEAST SQUARES ESTIMATED
REGRESSION EQUATION FOR THE CAR RENTAL COMPANIES
i        $x_i$    $y_i$    $x_i - \bar{x}$    $y_i - \bar{y}$    $(x_i - \bar{x})(y_i - \bar{y})$    $(x_i - \bar{x})^2$
1        11.5     118       4.25      41       174.25     18.0625
2        10       135       2.75      58       159.50      7.5625
3         9       100       1.75      23        40.25      3.0625
4         5.5      37      -1.75     -40        70.00      3.0625
5         4.2      40      -3.05     -37       112.85      9.3025
6         3.3      32      -3.95     -45       177.75     15.6025
Totals   43.5     462                          734.60     56.6550
$\bar{x} = 7.25$, $\bar{y} = 77$, $b_1 = 12.9662$
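The columns of Table 4 can be recomputed from the six (x, y) pairs as a check on the totals. The sketch below uses the cars-in-service and revenue values listed in the table; the variable names are illustrative.

```python
# Cars in service (1000s) and annual revenue ($ millions) from Table 4.
cars = [11.5, 10.0, 9.0, 5.5, 4.2, 3.3]
revenue = [118, 135, 100, 37, 40, 32]

n = len(cars)
x_bar = sum(cars) / n
y_bar = sum(revenue) / n

rows = []
for i, (x, y) in enumerate(zip(cars, revenue), start=1):
    dx = x - x_bar  # deviation of x from its mean
    dy = y - y_bar  # deviation of y from its mean
    rows.append((i, x, y, dx, dy, dx * dy, dx * dx))

sxy = sum(row[5] for row in rows)  # total of the cross-deviation column
sxx = sum(row[6] for row in rows)  # total of the squared x-deviation column

for row in rows:
    print("{:d} {:5.1f} {:4d} {:6.2f} {:4.0f} {:8.2f} {:8.4f}".format(*row))
print("x_bar =", round(x_bar, 2), "y_bar =", round(y_bar, 2))
print("totals:", round(sxy, 2), round(sxx, 4))
```

The two totals are the numerator and denominator of the slope formula.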
The calculation of the slope (b1) proceeds as follows.
$b_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{734.6}{56.655} = 12.9662$
The calculation of the y intercept (b0) follows.
$b_0 = \bar{y} - b_1 \bar{x} = 77 - 12.9662(7.25) = -17.0049$
Thus, the estimated regression equation is
$\hat{y} = -17.0049 + 12.9662x$
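The slope and intercept arithmetic can be checked directly from the totals 734.6 and 56.655 and the means 7.25 and 77 reported in Table 4. This is a minimal check with illustrative variable names, not a general fitting routine.

```python
# Totals and means as reported in Table 4.
sum_cross_deviations = 734.6    # sum of (x_i - x_bar) * (y_i - y_bar)
sum_squared_deviations = 56.655  # sum of (x_i - x_bar) ** 2
x_bar = 7.25
y_bar = 77.0

b1 = sum_cross_deviations / sum_squared_deviations  # slope
b0 = y_bar - b1 * x_bar                             # intercept

print(round(b1, 4), round(b0, 4))
```

Because b1 times the mean of x exceeds the mean of y (about 94.0 versus 77), the intercept is negative.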
d) For every additional car placed in service, estimate how
much annual revenue will change.
Figure 5 shows the graph of this equation on the scatter diagram.
The slope of the estimated regression equation (b1 = 12.9662) is positive,
implying that as the number of cars in service increases, revenue increases.
In fact, we can conclude (based on revenue measured in $ millions and
number of cars in 1000s) that an increase of 1000 cars in service is
associated with an increase of $12.9662 million in expected revenue; that
is, revenue is expected to increase by about $12,966 per car.
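The per-car figure is a unit conversion of the slope, which is expressed in $ millions of revenue per 1000 cars. A short sketch, with illustrative variable names:

```python
b1 = 12.9662  # slope: $ millions of annual revenue per 1000 cars in service

DOLLARS_PER_MILLION = 1_000_000
CARS_PER_THOUSAND = 1_000

# Convert ($ millions per 1000 cars) to (dollars per single car).
dollars_per_car = b1 * DOLLARS_PER_MILLION / CARS_PER_THOUSAND

print(round(dollars_per_car, 2))  # 12966.2
```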
e) Fox Rent A Car has 15,000 cars in service. Use the estimated
regression equation developed in part (c) to predict annual
revenue for Fox Rent A Car.
If we believe the least squares estimated regression equation adequately
describes the relationship between x and y, it would seem reasonable to use
the estimated regression equation to predict the value of y for a given value
of x. For example, if we wanted to predict revenue for Fox Rent A Car
with 15,000 cars in service, we would compute
$\hat{y} = -17.0049 + 12.9662(15) = 177.488$
Hence, we would predict annual revenue of $177.488 million for this
company.
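The prediction for Fox Rent A Car can be sketched as below, using intercept -17.0049 and slope 12.9662 (the negative sign follows from 77 - 12.9662 × 7.25). The range check is an addition of this sketch rather than part of the worked solution: 15 (thousand cars) lies outside the sampled range of 3.3 to 11.5, so the prediction is an extrapolation and deserves extra caution. Function and constant names are illustrative.

```python
# Smallest and largest number of cars in service (1000s) in the sample.
X_MIN, X_MAX = 3.3, 11.5


def predict_revenue(cars_thousands, b0=-17.0049, b1=12.9662):
    """Predicted annual revenue in $ millions for cars in service in 1000s."""
    return b0 + b1 * cars_thousands


fox_cars = 15.0  # 15,000 cars in service
prediction = predict_revenue(fox_cars)
# True when the x value lies outside the range used to fit the line.
is_extrapolation = not (X_MIN <= fox_cars <= X_MAX)

print(round(prediction, 3), is_extrapolation)
```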