3 Linear Regression
This chapter is about linear regression, a very simple approach for super-
vised learning. In particular, linear regression is a useful tool for predicting
a quantitative response. It has been around for a long time and is the topic
of innumerable textbooks. Though it may seem somewhat dull compared to
some of the more modern statistical learning approaches described in later
chapters of this book, linear regression is still a useful and widely used sta-
tistical learning method. Moreover, it serves as a good jumping-off point for
newer approaches: as we will see in later chapters, many fancy statistical
learning approaches can be seen as generalizations or extensions of linear
regression. Consequently, the importance of having a good understanding
of linear regression before studying more complex learning methods cannot
be overstated. In this chapter, we review some of the key ideas underlying
the linear regression model, as well as the least squares approach that is
most commonly used to fit this model.
Recall the Advertising data from Chapter 2. Figure 2.1 displays sales
(in thousands of units) for a particular product as a function of advertis-
ing budgets (in thousands of dollars) for TV, radio, and newspaper media.
Suppose that in our role as statistical consultants we are asked to suggest,
on the basis of these data, a marketing plan for next year that will result in
high product sales. What information would be useful in order to provide
such a recommendation? Here are a few important questions that we might
seek to address:
1. Is there a relationship between advertising budget and sales?
Our first goal should be to determine whether the data provide evi-
dence of an association between advertising expenditure and sales. If
the evidence is weak, then one might argue that no money should be
spent on advertising!
© Springer Nature Switzerland AG 2023 69
G. James et al., An Introduction to Statistical Learning, Springer Texts in Statistics,
https://doi.org/10.1007/978-3-031-38747-0_3
2. How strong is the relationship between advertising budget and sales?
Assuming that there is a relationship between advertising and sales,
we would like to know the strength of this relationship. Does knowl-
edge of the advertising budget provide a lot of information about
product sales?
3. Which media are associated with sales?
Are all three media—TV, radio, and newspaper—associated with
sales, or are just one or two of the media associated? To answer this
question, we must find a way to separate out the individual contribu-
tion of each medium to sales when we have spent money on all three
media.
4. How large is the association between each medium and sales?
For every dollar spent on advertising in a particular medium, by
what amount will sales increase? How accurately can we predict this
amount of increase?
5. How accurately can we predict future sales?
For any given level of television, radio, or newspaper advertising, what
is our prediction for sales, and what is the accuracy of this prediction?
6. Is the relationship linear?
If there is approximately a straight-line relationship between advertis-
ing expenditure in the various media and sales, then linear regression
is an appropriate tool. If not, then it may still be possible to trans-
form the predictor or the response so that linear regression can be
used.
7. Is there synergy among the advertising media?
Perhaps spending $50,000 on television advertising and $50,000 on ra-
dio advertising is associated with higher sales than allocating $100,000
to either television or radio individually. In marketing, this is known
as a synergy effect, while in statistics it is called an interaction effect.
It turns out that linear regression can be used to answer each of these
questions. We will first discuss all of these questions in a general context,
and then return to them in this specific context in Section 3.4.
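To make question 7 concrete, here is a minimal sketch of how an interaction
term changes predictions. The function name predicted_sales and all four
coefficient values are hypothetical, chosen only for illustration; they are
not estimated from the Advertising data. Budgets are in thousands of dollars.

```python
def predicted_sales(tv, radio, b0, b_tv, b_radio, b_inter):
    """Linear model with an interaction term:
    b0 + b_tv * tv + b_radio * radio + b_inter * tv * radio."""
    return b0 + b_tv * tv + b_radio * radio + b_inter * tv * radio


# Hypothetical coefficients for illustration only (not fitted to any data).
coefs = dict(b0=5.0, b_tv=0.02, b_radio=0.03, b_inter=0.001)

split = predicted_sales(50, 50, **coefs)       # $50,000 on each medium
all_tv = predicted_sales(100, 0, **coefs)      # $100,000 on TV only
all_radio = predicted_sales(0, 100, **coefs)   # $100,000 on radio only

# Same budgets with the interaction coefficient set to zero (purely additive).
additive = dict(coefs, b_inter=0.0)
split_additive = predicted_sales(50, 50, **additive)

print("with interaction:   ", split, all_tv, all_radio)
print("without interaction:", split_additive)
```

With a positive interaction coefficient, the split budget is predicted to
outsell either single-medium budget. In the purely additive version, the
split prediction is just the average of the two single-medium predictions,
so it can never beat the better of the two.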
3.1 Simple Linear Regression
Simple linear regression lives up to its name: it is a very straightforward
approach for predicting a quantitative response Y on the basis of a single
predictor variable X. It assumes that there is approximately a linear
relationship between X and Y. Mathematically, we can write this linear
relationship as
Y ≈ β0 + β1 X. (3.1)
You might read “≈” as “is approximately modeled as”. We will sometimes
describe (3.1) by saying that we are regressing Y on X (or Y onto X).
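As a minimal sketch of what regressing Y on X can look like in practice, the
code below estimates the two coefficients in (3.1) with the standard
closed-form least squares formulas. The data are synthetic, generated from an
assumed true line with intercept 7 and slope 0.05; they are not the
Advertising data. The helper name fit_simple_linear is hypothetical.

```python
import random


def fit_simple_linear(x, y):
    """Return least squares estimates (intercept, slope) for y ≈ b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Synthetic predictor and response (assumption: true line 7 + 0.05 * x,
# with Gaussian noise of standard deviation 1).
rng = random.Random(0)
x = [rng.uniform(0, 300) for _ in range(200)]
y = [7.0 + 0.05 * xi + rng.gauss(0, 1.0) for xi in x]

b0_hat, b1_hat = fit_simple_linear(x, y)
print(f"estimated intercept: {b0_hat:.3f}, estimated slope: {b1_hat:.3f}")
```

The estimates should land close to, but not exactly on, the values used to
generate the data, which is the sense in which "≈" is used in (3.1).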