Review of Linear Models I
Presidency University
       February, 2025
Guessing the value of a variable
    - Suppose we need to guess a single value for a quantitative random variable Y. What is the best value to guess?
    - To answer this question, we need to pick a function to optimize, one that measures how good or bad our guesses are: how big an error we are making. A reasonable, traditional starting point is the mean squared error
                          MSE(c) = E[(Y - c)^2].
      We would like to find the value c for which MSE(c) is smallest.
    - The optimal choice is c = E(Y). Hence the best guess we can make about Y with respect to mean squared error is E(Y).
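    - To see why (a standard one-line calculation, sketched here): for any constant c,
                          MSE(c) = E[(Y - c)^2] = E[(Y - E(Y))^2] + (E(Y) - c)^2 = Var(Y) + (E(Y) - c)^2,
      since the cross term 2(E(Y) - c) E[Y - E(Y)] vanishes. The first term does not involve c, so the minimum is attained at c = E(Y).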
Guessing Y from knowledge of another variable
    - Now suppose we have another, auxiliary variable X and we guess Y by some function of X, say g(X).
    - As before, if we take MSE as the optimality criterion, then we seek to minimize E[(Y - g(X))^2] over g. The optimal function turns out to be f(x) = E(Y | X = x).
    - This function f(X) is called the regression function, which is what we would like to know when we try to predict Y based on X. The regression of Y on X is the locus of the conditional mean E(Y | X).
    - Problem: this function f(X) is generally unknown unless we assume a completely known probability distribution for (X, Y).
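    - A sketch of why the conditional mean is optimal: for any candidate g,
                          E[(Y - g(X))^2] = E[(Y - E(Y|X))^2] + E[(E(Y|X) - g(X))^2],
      because the cross term has zero expectation (condition on X and use E[Y - E(Y|X) | X] = 0). The first term does not depend on g, so the MSE is minimized by g(X) = E(Y|X).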
Regression Analysis
    - What we have at hand is a random sample (x1, y1), ..., (xn, yn) from that distribution. This is often called the learning set or the training set.
    - Regression analysis is all about constructing a suitable approximation f̂ of f based on this training data set.
    - As we shall see, constructing a suitable approximation f̂ is a two-step process:
        - Step 1: Restrict attention to a class of functions F and find the best approximation f_F of f within that class.
        - Step 2: Estimate (or learn) f_F from the data (x1, y1), ..., (xn, yn) to obtain f̂.
More about Step 1
    - The choice of the class F is a trade-off:
        - If F contains very complicated functions, the fit will capture too much of the variation in the training data, leading to what we call over-fitting or undersmoothing.
        - On the contrary, if F contains only very simple functions, the fit will fail to capture the variation in the training data: this is called underfitting or oversmoothing.
        - Neither scenario is desirable: both have their own problems, as we shall see.
Two Perspectives
    - In general there are two objectives of any regression analysis:
        - Given a new data point x_new, we want to predict the value of the response variable y; that is, we are interested only in getting the fitted value ŷ at some new point x_new. This is the problem of prediction.
        - We want to know the functional relationship between y and x; that is, we want an approximation of the true regression of y on x. This is the problem of curve estimation, or the problem of inference.
    - The problem of curve estimation is much wider than the problem of prediction, because solving the former solves the latter as a consequence.
    - A natural question, then, is why consider the two problems separately? Because the problem of prediction is much simpler than the problem of curve estimation, and hence we can devise many simple regression procedures for that purpose.
Problem of inference
    - Here our objective starts with understanding how the covariate X affects the response Y.
    - We want to estimate f, but not for predicting Y. Now f̂ cannot be treated as a black box: we need to know its exact mathematical form.
    - In this setting, one may be interested in answering the following questions:
        - Which predictors are associated with the response? It is often the case that only some of the available predictors are substantially associated with Y. Identifying the few important predictors among a large set of possible variables can be extremely useful.
        - What is the relationship between the response and each predictor? Some predictors may have a positive relationship with Y while others may have the opposite relationship.
        - Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? In some situations a linear form is reasonable or desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
Problem of prediction
    - In many situations, a set of inputs X is readily available, but the output Y cannot be easily obtained.
    - In this setting, since the error term averages to zero, we can predict Y using
                          Ŷ = f̂(X)
      where f̂ is an estimate of f and Ŷ is the prediction for Y.
    - We note that f̂ acts like a black box, in the sense that we need not know the exact mathematical form of f̂, as long as it yields accurate predictions Ŷ.
Measurement of accuracy
    - In the problem of prediction the main objective is to find an accurate estimate f̂.
    - But how do we measure accuracy here? The answer is: using the mean squared error. We need to choose f̂ for which the MSE is minimum.
    - For any approximation f̂_n(x) based on a sample of size n, we can write
                  MSE(f̂_n(x)) = σ_x^2 + Bias^2(f̂_n(x)) + Var(f̂_n(x))
      where
        - σ_x^2 = Var(Y | X = x) is the variance which is uncontrollable (variance due to random causes),
        - the second term Bias^2(f̂_n(x)) = [f(x) - E(f̂_n(x))]^2 is the squared approximation bias (or error), which is incurred by using f̂_n(x) instead of f(x), and
        - the third term Var(f̂_n(x)) = E[f̂_n(x) - E(f̂_n(x))]^2 is the variance of our estimated regression function.
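    - Where this decomposition comes from (a sketch, treating f̂_n as independent of the noise in the new response): write Y = f(x) + ε at X = x with E(ε | X = x) = 0 and Var(ε | X = x) = σ_x^2. Then
                  E[(Y - f̂_n(x))^2] = E[(Y - f(x))^2] + E[(f(x) - f̂_n(x))^2]
                                    = σ_x^2 + [f(x) - E(f̂_n(x))]^2 + E[(f̂_n(x) - E(f̂_n(x)))^2],
      the cross terms vanishing because E(ε | X = x) = 0 and by expanding around E(f̂_n(x)).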
Bias-variance Trade-off
    - Although σ_x^2 is beyond our control, Bias^2(f̂_n(x)) and Var(f̂_n(x)) depend on the choice of f̂_n(x), which makes the situation interesting.
    - If we choose f̂_n(x) from a class of complicated functions, then Bias^2(f̂_n(x)) is small but Var(f̂_n(x)) increases considerably.
    - We note that even an unbiased estimator f̂_n(x) may not be admissible because of large variance.
    - On the other hand, if we choose f̂_n(x) from a class of simple functions, then Var(f̂_n(x)) is close to zero but Bias^2(f̂_n(x)) increases considerably.
    - The catch is that, at least past a certain point, decreasing the approximation bias can only come through increasing the estimation variance.
    - This is the bias-variance trade-off.
    - This trade-off is exactly the one we discussed earlier: overfitting versus oversmoothing.
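    - A minimal simulation sketch (not from the slides; the data-generating curve, noise level, and sample sizes are made up) comparing prediction error on fresh data for an over-simple, a moderate, and a very flexible polynomial fit; with this kind of setup the two extremes typically do worse, though exact numbers depend on the seed:

        import numpy as np

        rng = np.random.default_rng(0)
        f = lambda x: np.sin(2 * np.pi * x)              # assumed "true" regression function
        x_tr = rng.uniform(0, 1, 20)
        y_tr = f(x_tr) + rng.normal(0, 0.3, 20)          # small training set
        x_te = rng.uniform(0, 1, 2000)
        y_te = f(x_te) + rng.normal(0, 0.3, 2000)        # fresh data for judging prediction

        for degree in (1, 3, 10):                        # too simple, moderate, very flexible
            coef = np.polyfit(x_tr, y_tr, degree)        # least-squares polynomial fit
            mse = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
            print(degree, round(mse, 3))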
Methods of finding f̂
    - We shall explore many linear and non-linear approaches for estimating f.
    - Most statistical estimation methods for this task can be characterized as either parametric or non-parametric. Both approaches have their relative merits and demerits.
Parametric methods
    - Parametric methods generally take a two-step approach to building models:
       1. Model assumption: First we assume a specific functional form for f. For example, with p predictors X = (X1, X2, ..., Xp) and a response Y, one may assume a linear form for f(X):
                          f(X) = β0 + β1 X1 + .... + βp Xp.
          We note that this assumption of linearity makes our search for f simple. We no longer need to search among the set of all p-dimensional functions; rather we only need to estimate the p + 1 coefficients βi to get the desired model.
       2. Fitting the assumed model: After a model has been selected, we need a procedure that uses the training data to fit or train the model. For example, in the case of the linear model, fitting means estimating the parameters β0, β1, ..., βp. That is, we want estimates β̂0, β̂1, ..., β̂p such that
                          Y ≈ β̂0 + β̂1 X1 + ... + β̂p Xp.
    - One of the many available techniques to fit such a model is the least squares approach, sketched below.
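    - A minimal least-squares sketch (made-up data; the coefficient values are for illustration only) of the two steps, assume a linear form and then estimate its coefficients from training data:

        import numpy as np

        rng = np.random.default_rng(1)
        n, p = 100, 3
        X = rng.normal(size=(n, p))                    # p quantitative predictors
        beta = np.array([1.0, 2.0, -1.0, 0.5])         # illustrative beta_0, ..., beta_p
        y = beta[0] + X @ beta[1:] + rng.normal(0, 0.5, n)

        Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept column
        beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        print(beta_hat)                                # estimates of beta_0, ..., beta_p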
Issues with parametric approach
    - This model-based approach is called parametric because estimating the model essentially reduces to estimating a number of parameters.
    - Although assuming a parametric model simplifies the task, this approach has its own limitations.
    - First of all, we need to make a definite assumption about the model, and this choice usually does not match the true form. Further, if this choice is too bad, then the estimates will be poor.
    - We can make our model more flexible by including more parameters, but that will lead to overfitting.
    - This is a trade-off: if we choose too few parameters we may get oversmoothing, which means ignoring many potential causes, and if we include many parameters we suffer from overfitting, or what we call undersmoothing.
Non-parametric approach
    - Non-parametric methods do not make explicit assumptions about the functional form of f.
    - Instead, here we seek an estimate of f that gets as close to the data points as possible without being too rough.
    - An example of such a non-parametric method is spline regression.
    - The major advantage of such approaches is that we avoid the assumption of a particular functional form for f, and hence they have the potential to accurately fit a wider range of possible shapes for f.
    - But a major disadvantage is this: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate of f. A small sketch of a spline fit follows.
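    - A minimal smoothing-spline sketch (an assumed example; the slides name spline regression but not a particular implementation or data set):

        import numpy as np
        from scipy.interpolate import UnivariateSpline

        rng = np.random.default_rng(2)
        x = np.sort(rng.uniform(0, 1, 200))
        y = np.sin(4 * x) + rng.normal(0, 0.2, 200)      # noisy data around a smooth curve

        # No functional form is assumed; the smoothing parameter s controls roughness.
        spline = UnivariateSpline(x, y, s=len(x) * 0.04)
        print(spline(np.linspace(0, 1, 5)))              # fitted values at a few new points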
Step 1: F contains constant functions
    - We choose the restricted class F to be the constant functions f(x) = f0.
    - This indicates oversmoothing.
    - But at times this may produce appropriate results, for instance when
        - the true regression f(x) really is a constant, or
        - f(x) varies rapidly but within narrow limits.
    - In such situations we can actually do better by fitting a constant than by matching the correct functional form.
    - In the second situation f̂(x) = f0 will be biased but can have smaller MSE than an unbiased estimator.
Example: Bias-variance tradeoff in action
    - For example, suppose our f(x) is of the form
                          f(x) = α + β sin(γx).
    - Further, we assume β ≪ 1 and γ ≫ 1, so that f varies rapidly but stays within narrow limits.
    - Here estimating a constant regression function does better than estimating the regression with the correct functional form.
    - In fact, here the MSE of the model f̂1(x) = f0 is less than the MSE of the unbiased model f̂2(x) = α̂ + β̂ sin(γx) (assuming γ to be known).
Example (contd.)
    [Figure: simulated data plotted as y against x on (0, 1), with the y-axis running from about 0.4 to 1.6; the fitted curves are described on the next slide.]
Example (Contd.)
    - A rapidly-varying but nearly-constant regression function: y = 1 + 0.02 sin(200x) + ε, where ε ∼ N(0, 0.5).
    - The red dotted line is the constant line indicating the sample mean of the response.
    - The blue dot-dashed curve is the estimated function of the form α̂ + β̂ sin(200x).
    - With just a few observations, the constant actually predicts better on new data (MSE 0.53) than does the estimated sine function (MSE 0.59).
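    - A sketch that reproduces the spirit of this experiment (the seed, the sample size, and whether 0.5 is the standard deviation or the variance are assumptions here, so the MSE values will differ from 0.53 and 0.59):

        import numpy as np

        rng = np.random.default_rng(3)
        n = 10                                            # just a few observations
        x = rng.uniform(0, 1, n)
        y = 1 + 0.02 * np.sin(200 * x) + rng.normal(0, 0.5, n)    # taking 0.5 as the sd

        c_hat = y.mean()                                  # constant fit: the sample mean

        A = np.column_stack([np.ones(n), np.sin(200 * x)])        # correct form, gamma = 200 known
        (a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

        x_new = rng.uniform(0, 1, 10000)                  # fresh data for prediction error
        y_new = 1 + 0.02 * np.sin(200 * x_new) + rng.normal(0, 0.5, 10000)
        print(np.mean((y_new - c_hat) ** 2))                              # constant fit
        print(np.mean((y_new - (a_hat + b_hat * np.sin(200 * x_new))) ** 2))  # sine fit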
What does this example tell us?
    - The “optimum” choice does not necessarily mean the “truth”.
    - In this example, the truth was a sine curve whereas the constant function is optimum.
    - Optimum means a “reasonably good” approximation of the truth that serves our purpose.
    - We should always remember that we are searching for the optimum, not the truth.
    - This is the motivation behind fixing the class F.
General Linear model
    - Consider a setup where we have a single response variable y which is quantitative and p covariates x1, x2, ..., xp which can be quantitative or qualitative or both.
    - Suppose we have n observations on each of these p variables. That is, suppose we have observations y1, y2, ..., yn on the response y and x1i, x2i, ..., xni on the i-th covariate xi, i = 1, 2, ..., p.
    - Then the general linear model can be written as
                          y = Xβ + ε
      where y = (y1, y2, ..., yn) is the response vector, β = (β1, ..., βp) is the vector of parameters, and

                              [ x11  x12  ...  x1p ]
                              [ x21  x22  ...  x2p ]
                          X = [  .    .          . ]
                              [ xn1  xn2  ...  xnp ]

      is the design matrix.
Linear Model (contd.)
    - Further, ε = (ε1, ε2, ..., εn) is the vector of random errors, where we assume
        - E(εi | X) = 0 for all i,
        - Var(εi | X) = σ^2 for all i,
        - Cov(εi, εj | X) = 0 for all i ≠ j.
    - These assumptions can alternatively be stated as E(ε | X) = 0 and D(ε | X) = σ^2 In.
    - More specifically, we assume a single quantitative response variable y and p covariates such that
                          E(y | X) = Xβ and Var(y | X) = σ^2 In.
Example: Simple Linear Regression
    - Suppose we restrict our class to the class of all linear functions F = {f(x) : f(x) = α + βx, α, β ∈ R}.
    - For n observations, we can write it as
                          y = Xθ + ε
      where E(ε) = 0 and Var(ε) = σ^2 In.
    - Here we have

                              [ 1  x1 ]
                              [ 1  x2 ]               ( α )
                          X = [ .   . ]   and   θ  =  ( β ).
                              [ 1  xn ]
Example: Polynomial Regression
    - An immediate extension of this can be made by expanding the class to incorporate polynomials in x:
              F = {f : f(x) = β0 + β1 x + .... + βp x^p for some p}.
    - For polynomial regression, we have

                              [ 1  x1  x1^2  ...  x1^p ]              ( β0 )
                              [ 1  x2  x2^2  ...  x2^p ]              ( β1 )
                          X = [ .   .    .          .  ]   and   θ =  (  . )
                              [ 1  xn  xn^2  ...  xn^p ]              ( βp ).
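    - A minimal sketch (made-up x values) of building this polynomial design matrix in code:

        import numpy as np

        x = np.array([0.1, 0.5, 0.9, 1.3])
        p = 2
        X = np.vander(x, N=p + 1, increasing=True)   # rows are [1, x_i, x_i^2]
        print(X)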
Example: Multiple Regression
    - Hence we can consider the class of functions
              F = {f : f(x) = β0 + β1 x1 + .... + βp xp}
      where we have a single response variable y and p quantitative predictor variables x1, x2, ..., xp.
    - For multiple linear regression we have

                              [ 1  x11  x12  ...  x1p ]              ( β0 )
                              [ 1  x21  x22  ...  x2p ]              ( β1 )
                          X = [ .   .     .         . ]   and   θ =  (  . )
                              [ 1  xn1  xn2  ...  xnp ]              ( βp ).
Classification
    - If all the columns of X (except the first column) contain values of continuous variables, then the linear model is called a regression model.
    - If all the columns of X contain values of discrete variables (more specifically, if all the columns contain values 0 or 1), then the linear model is called an ANOVA model.
    - If some columns of X contain values of continuous variables and some columns contain values of discrete variables, then the linear model is called an ANCOVA (or ANOCOVA) model.
Dealing with Factors
    - In linear models we need to deal with what we call factor variables, or factors, which are categorical variables with different categories. The different categories of a factor are called the factor levels.
    - Suppose we have a single factor A with k levels A1, A2, ..., Ak having a potential effect on the response y. A natural question is: how do we model the effects of all these levels in a single linear model?
    - The answer is to use indicator variables or dummy variables x1, x2, ..., xk−1, where
              xi = 1 if the observation receives the i-th level, and 0 otherwise.
    - We can then write a linear model as
                          y = α + β1 x1 + .... + βk−1 xk−1 + ε
      where βi is the effect of the i-th level of A. A small coding sketch follows.
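    - A minimal dummy-coding sketch (the level names and the library choice are illustrative, not from the slides):

        import pandas as pd

        A = pd.Series(["A1", "A2", "A3", "A1", "A3"], dtype="category")
        dummies = pd.get_dummies(A, drop_first=True)   # k - 1 indicator columns
        print(dummies)                                 # the dropped level acts as the baseline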
Using dummy variables
    - So why did we use k − 1 dummy variables when we had k levels? Where is the effect of Ak modeled in the linear model?
    - The answer: if the observation receives the k-th level of the factor A, then all xi = 0, i = 1, 2, ..., k − 1, and as such α represents the expected value of y when the observation receives Ak.
    - When an observation receives the level Ai, i = 1, 2, ..., k − 1, the expected value of y is α + βi. As such, βi, i = 1, 2, ..., k − 1, represents the change in the expected value of y due to Ai as compared to Ak.
    - That means each βi represents the expected difference in y between an observation belonging to Ai and one belonging to Ak. For this reason the βi's are sometimes called contrasts between the two classes.
    - Here we compare the effects of the other levels with that of Ak. In such a case we call Ak the baseline level.
    - Obviously we can take any level Ai (not necessarily Ak) to be the baseline level.
    - A general rule is thus: if we are working with a factor with k levels, then we need to introduce k − 1 dummy variables.
Using dummy for all levels
    - Now suppose that in the same situation we use k dummy variables x1, x2, ..., xk instead of the k − 1 variables x1, x2, ..., xk−1 and fit the model
                          y = α + β1 x1 + β2 x2 + ... + βk xk + ε.
    - Then we note that the variables x1, x2, ..., xk are not independent: they satisfy the constraint Σ xi = 1, since any observation must receive exactly one of the levels Ai.
    - Here the design matrix is

                              [ 1  x11  x12  ...  x1k ]
                              [ 1  x21  x22  ...  x2k ]
                          X = [ .   .     .         . ]
                              [ 1  xn1  xn2  ...  xnk ]

      but it is not of full column rank. (A small numerical check follows.)
    - Statistical lesson: there can be alternative parametrizations of the same model.
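    - A minimal numerical check (a made-up example with k = 3 levels and 6 observations) that the intercept-plus-k-dummies matrix is rank deficient while the k − 1 coding is not:

        import numpy as np

        levels = np.array([0, 0, 1, 1, 2, 2])                  # which level each observation receives
        D = np.eye(3)[levels]                                  # one indicator column per level
        X_full = np.column_stack([np.ones(6), D])              # intercept + k dummies: 4 columns
        X_drop = np.column_stack([np.ones(6), D[:, :2]])       # intercept + k - 1 dummies: 3 columns
        print(np.linalg.matrix_rank(X_full), np.linalg.matrix_rank(X_drop))   # prints 3 3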
Example: ANOVA model (One way layout)
    - Suppose we have a factor A and let A1, A2, ..., Ak be the levels of A, which constitute the populations of interest.
    - Further assume there are ni observations receiving the level Ai, and let yij be the j-th observation receiving the i-th level Ai.
    - The model we consider is
                          yij = µi + eij, j = 1, 2, ..., ni, i = 1, 2, ..., k,
      where µi is the fixed effect due to Ai and eij is a random error.
    - We assume that
                          eij ∼ N(0, σ^2)
      and that the eij's are independent.
    - This implies that E(yij) = µi and Var(yij) = σ^2 for all j = 1, ..., ni, which means the µi's are the factor level means and σ^2 is the common variability among observations belonging to each group.
One way ANOVA as linear model
    - Here we have introduced k dummy variables for the k levels, but without any intercept.
    - In terms of dummy variables we can write
                          y = µ1 x1 + µ2 x2 + .... + µk xk + ε
      where xi = 1 or 0 according as the observation receives Ai or not.
One way ANOVA as linear model
    - Suppose we denote

          y = (y11, y12, ..., y1n1, y21, ..., y2n2, ..., yk1, ..., yknk),
          β = (µ1, µ2, ..., µk),
          ε = (e11, e12, ..., e1n1, e21, ..., e2n2, ..., ek1, ..., eknk),

      and

                              [ 1n1   0   ...   0  ]
                              [  0   1n2  ...   0  ]
                    Xn×k   =  [  .    .          . ]
                              [  0    0   ...  1nk ]

      where 1m denotes the m × 1 vector of ones.
    - Then the above model can be written as
                          y = Xβ + ε
      where ε ∼ Nn(0, σ^2 In).
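    - A minimal sketch (the group sizes are made up) of building this block design matrix in code:

        import numpy as np

        n_i = [3, 2, 4]                               # n_1, ..., n_k
        X = np.repeat(np.eye(len(n_i)), n_i, axis=0)  # column i is 1 exactly for group i's rows
        print(X.shape)                                # (sum of n_i, k)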
Reparametrization
    - At times an alternative, but completely equivalent, formulation of the single-factor ANOVA model is used. This alternative formulation is called the factor effects model.
    - Let us write
                          µi = µ̄ + (µi − µ̄) = µ + αi
      where µ = µ̄ = (Σ ni µi) / n and αi = µi − µ̄.
    - Then we note that Σi ni αi = 0.
    - Now our linear model of interest becomes
                          yij = µ + αi + eij, j = 1, 2, ..., ni, i = 1, 2, ..., k,
      where µ denotes the general (average) effect, αi denotes the additional fixed effect due to Ai subject to the restriction Σi ni αi = 0, and eij denotes the random error.
    - We assume that the eij are independent N(0, σ^2) variables for all i, j.
Reparametrized form as linear Model
    - Now, in terms of dummy variables, we have included k dummy variables for the k levels along with an intercept.
    - In this case the linear model becomes
                          y = Xβ + ε
      where β = (µ, α1, α2, ..., αk)^T and

                              [ 1n1  1n1   0   ...   0  ]
                              [ 1n2   0   1n2  ...   0  ]
                 Xn×(k+1)  =  [  .    .    .          . ]
                              [ 1nk   0    0   ...  1nk ].
Example: More use of dummy variables
    - Consider a setup where we need to judge the effectiveness of a treatment (or perhaps compare the effectiveness of two treatments, in which case the control group may be thought of as getting some treatment). This may be a controlled experiment or an observational study.
    - Suppose the data are obtained in the form

          Control:    y11, y12, ..., y1n1
          Treatment:  y21, y22, ..., y2n2

    - Note that we allow the numbers of observations in the two groups to be different, and the y's represent the values of the response.
    - This situation can also be handled with a linear model through the use of dummy variables.
More use of dummy (contd.)
    I   Let us define a dummy variable as
        x = 1 or 0 according as the observation receives the treatment or not.
    I   Then the linear model can be written as

                                z = α + βx + ε

        or more precisely

                zi = α + βxi + εi ,   i = 1, 2, ..., n (= n1 + n2).

    I   Here
                zi = y1i           for i = 1, 2, ..., n1
                zi = y2(i−n1)      for i = n1 + 1, ..., n1 + n2

        and ε1, ε2, ..., εn are the random errors.
More use (Contd.)
    I Suppose we write the above linear model as

                                z = X θ + ε.

    I Then θ = (α, β)^T is the vector of parameters, z is the response vector and ε is
      the random error vector.
    I It is instructive to have a look at the structure of the design matrix

                           [ 1_{n1}   0_{n1} ]
                 Xn×2  =   [                 ]
                           [ 1_{n2}   1_{n2} ]

      where 1_m and 0_m denote columns of m ones and m zeros, so the upper submatrix
      consists of the n1 control rows and the lower one contains the n2 treatment rows.
    I Note that in the above formulation the effect of the treatment is α + β and the
      effect of the control is α, so the change in effect due to the treatment is β.
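A minimal sketch (made-up group sizes and data, not from the notes) showing that least squares on this design recovers the group means: α̂ equals the control mean and β̂ equals the difference of the two group means.

    import numpy as np

    rng = np.random.default_rng(0)

    y_control   = rng.normal(10.0, 1.0, size=6)    # n1 = 6 control responses
    y_treatment = rng.normal(12.0, 1.0, size=8)    # n2 = 8 treatment responses

    z = np.concatenate([y_control, y_treatment])
    x = np.concatenate([np.zeros(6), np.ones(8)])  # dummy: 1 = treatment

    X = np.column_stack([np.ones_like(z), x])      # design matrix [1  x]
    (alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, z, rcond=None)

    print(alpha_hat, y_control.mean())                       # equal
    print(beta_hat, y_treatment.mean() - y_control.mean())   # equal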
More than one categorical predictor
     I   We can include as many factor covariates as we wish in our
         linear model.
     I   All the factors need not have the same number of levels.
     I   But then we have a separate set of dummy variables for each
         factor.
     I   The only wrinkle with having multiple factors is that α, the
         overall intercept, is now the expected value of y for
         individuals where all categorical variables are at their respective
         baseline levels.
     I   With multiple factors we shall have a new issue in our model
         called the interaction effect.
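A small sketch (hypothetical factors and level names) of building a separate set of dummies for each of two factors with pandas; one level of each factor is dropped, so the intercept of a subsequent fit corresponds to both factors at their baseline levels.

    import pandas as pd

    df = pd.DataFrame({
        "fertilizer": ["A", "A", "B", "B", "C", "C"],
        "soil":       ["clay", "sand", "clay", "sand", "clay", "sand"],
    })

    # one set of dummy columns per factor; drop_first keeps a baseline level
    X = pd.get_dummies(df, columns=["fertilizer", "soil"], drop_first=True)
    print(X)    # columns: fertilizer_B, fertilizer_C, soil_sand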
Interaction
     I   Interaction effect is the joint effect of two or more factors.
     I   Suppose we measure the effect of fertilizers and soil quality on
         the yield of plots.
     I   Here the responses are the yields on different plots and there
         are two factors: fertilizer brand and soil type.
     I   It may happen that a particular soil type behaves exceptionally
         in the presence of a particular fertilizer.
     I   This is the interaction effect and should be included in the
         model explicitly.
Interaction effect
     I   Formally, when we say that there are no interactions between
         two variables xi and xj , we mean that

                              ∂E[y|x]/∂xi

         is not a function of xj .
     I   This means there are no interactions if and only if

                       E[y|x] = α + Σ_{i=1}^{p} fi(xi)

         so that each coordinate of x makes its own separate, additive
         contribution to y.
     I   The standard multiple linear regression model of course
         includes no interactions between any of the predictor variables.
     I   But general considerations of statistical modeling give us no
         reason whatsoever to anticipate that interactions are rare, or
         that when they exist they are small.
Interaction (Contd.)
     I   Conventionally interactions are included in a linear model by
         adding a product term.
     I   For example, suppose we are dealing with two factor covariates
         each with two levels so that we include their effect in the linear
         model by two dummy variables x1 and x2 .
     I   Then the interaction effect will be modeled as

                       y = α + β1 x1 + β2 x2 + β3 x1 x2 + ε
     I   It is no longer correct to interpret β1 as
              E [y |X1 = x1 + 1, X2 = x2 ] − E [y |X1 = x1 , X2 = x2 ].
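A brief sketch (simulated data with invented coefficient values) of fitting such a product-interaction model with the statsmodels formula interface, where x1:x2 denotes the product term:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.integers(0, 2, n)                 # two binary dummies
    x2 = rng.integers(0, 2, n)
    y = 1.0 + 2.0*x1 + 0.5*x2 + 1.5*x1*x2 + rng.normal(0, 1, n)

    df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
    fit = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
    print(fit.params)                          # estimates of α, β1, β2, β3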
Interaction (Contd.)
     I   That difference is, rather, β1 + β3 x2 .
     I   Similarly, β2 is no longer the expected difference in y between
         two otherwise-identical cases where x2 differs by 1.
     I   The fact that we can’t give one answer to “how much does the
         response change when we change this variable?”, that the
         correct answer to that question always involves the other
         variable, is what interaction means.
     I   What we can say is that β1 is the slope with regard to x1 when
         x2 = 0, and likewise β2 is how much we expect y to change for
         a one-unit change in x2 when x1 = 0.
     I   β3 is the rate at which the slope on x1 changes as x2 changes,
         and likewise the rate at which the slope on x2 changes with x1 .
Why Product Interactions?
    I   Conventionally linear models use interaction terms that are products
        of the indicator variables, e.g. x1 x2 .
    I   Interactions could alternatively have been introduced by using
        terms like x1 x2 / (1 + |x1 x2 |), or x1 H(x2 − c) where H is the step function
        H(x) = 1 if x ≥ 0 and H(x) = 0 if x < 0.
    I   A natural question is: Is there any special reason to use
        product interactions?
    I   Suppose that the real regression function µ(x) = E (Y |x) is a
        smooth function of all the coordinates of x.
    I   Because it is smooth, we should be able to do a Taylor
        expansion around any particular point, say x ∗ , as

        µ(x) ≈ µ(x ∗ ) + Σ_{i=1}^{p} (xi − xi∗ ) ∂µ/∂xi |_{x=x ∗ }
                       + (1/2) Σ_{i=1}^{p} Σ_{j=1}^{p} (xi − xi∗ )(xj − xj∗ ) ∂²µ/(∂xi ∂xj ) |_{x=x ∗ }
Product interactions (Contd.)
     I   The first term, µ(x ∗ ), is a constant. The next sum will give us
         linear terms in all the xi (plus more constants). The double
         sum after that will give us terms for each product xi xj , plus
         all the squares xi², plus more constants.
     I   Thus, if the true regression function is smooth, and we only
         see a small range of values for each predictor variable, using
         product terms is reasonable, provided we also include
         quadratic terms for each variable.
     I   Further we note that if the xi 's are indicators, the quadratic terms
         are the same as the linear terms (since xi² = xi for a 0/1 variable).
     I   Obviously we can include other types of interaction terms, like
         x1 x2 / (1 + |x1 x2 |), but then we need to form a new column of predictors
         in the design matrix.
     I   Also there may then be difficulty in interpretation.
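As a concrete check of the expansion above, here is a small sketch (using a made-up smooth function, µ(x1, x2) = exp(x1 x2), purely for illustration) whose second-order Taylor polynomial around the origin contains exactly a product interaction term:

    import sympy as sp

    x1, x2 = sp.symbols("x1 x2")
    mu = sp.exp(x1 * x2)        # hypothetical smooth regression function

    # second-order Taylor polynomial around (0, 0), term by term
    at0 = {x1: 0, x2: 0}
    taylor2 = (mu.subs(at0)
               + sp.diff(mu, x1).subs(at0) * x1
               + sp.diff(mu, x2).subs(at0) * x2
               + sp.Rational(1, 2) * sp.diff(mu, x1, 2).subs(at0) * x1**2
               + sp.Rational(1, 2) * sp.diff(mu, x2, 2).subs(at0) * x2**2
               + sp.diff(mu, x1, x2).subs(at0) * x1 * x2)

    print(sp.expand(taylor2))   # x1*x2 + 1: only the product term survives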
Example: Two way ANOVA
    I   Let there be two factors A and B. Suppose p levels of A
        namely A1 , A2 , ..., Ap and q levels of B namely B1 , B2 , ..., Bq
        constitute the entire population.
    I   Therefore we have pq level combinations (Ai , Bj ).
    I   Further suppose µij is the fixed effect due to (Ai , Bj ).
    I   Thus µij is the mean response of observations receiving
        treatment combination (Ai , Bj ).
Interpretation
     I   The interpretation of a treatment mean µij depends on
         whether the study is observational, experimental, or a mixture
         of the two.
     I   In an observational study, the treatment mean µij corresponds
         to the population mean for the elements having the
         characteristics of the i th level of factor A and the j th level of
         factor B.
     I   In an experimental study, the treatment mean µij stands for
         the mean response that would be obtained if the treatment
         consisting of the i th level of factor A and the j th level of factor
         B were applied to all units in the population of experimental
         units about which inferences are to be drawn.
Reparametrization
    I   For all i, j, rewrite µij as
             µij = µ̄00 + (µ̄i0 − µ̄00 ) + (µ̄0j − µ̄00 ) + (µij − µ̄i0 − µ̄0j + µ̄00 )
    I   Here  µ̄00 = (1/pq) Σ_i Σ_j µij  is the general effect (say µ) as it is
        obtained by averaging over the effects of all possible level
        combinations.
    I   Further

                  µ̄i0 = (1/q) Σ_j µij = the fixed effect due to Ai

        ⇒ αi = µ̄i0 − µ̄00 = fixed additional (main) effect due to Ai , with Σ_i αi = 0.
I And

                  µ̄0j = (1/p) Σ_i µij = the fixed effect due to Bj

    ⇒ βj = µ̄0j − µ̄00 = fixed additional (main) effect due to Bj , with Σ_j βj = 0.
I Also µij − µ̄i0 is the additional effect due to Bj when A is held constant at the
  i th level Ai .
I Averaging out over those effects for varying i, we get µ̄0j − µ̄00 .
I Thus
        γij = (µij − µ̄i0 ) − (µ̄0j − µ̄00 ) = fixed interaction effect due to (Ai , Bj )
  with
                  Σ_i γij = 0 for all j    and    Σ_j γij = 0 for all i.
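A minimal numerical sketch (with a made-up table of cell means µij) of this reparametrization, checking the side conditions and that the decomposition reproduces µij exactly:

    import numpy as np

    # hypothetical cell means mu[i, j] for p = 3 levels of A and q = 2 levels of B
    mu = np.array([[10.0, 12.0],
                   [11.0, 15.0],
                   [ 9.0, 10.0]])

    mu_00 = mu.mean()                              # general effect
    alpha = mu.mean(axis=1) - mu_00                # main effects of A
    beta  = mu.mean(axis=0) - mu_00                # main effects of B
    gamma = (mu - mu.mean(axis=1, keepdims=True)
                - mu.mean(axis=0, keepdims=True) + mu_00)   # interaction effects

    print(alpha.sum(), beta.sum())                 # both 0 (side conditions)
    print(gamma.sum(axis=0), gamma.sum(axis=1))    # all 0
    print(np.allclose(mu, mu_00 + alpha[:, None] + beta[None, :] + gamma))  # True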
Interaction or no interaction?
     I   One potential question of interest is when should we include
         interaction in our model?
     I   The fact is that there cannot be any objective answer to this
         question.
     I   Rather let us understand the difference between including or
         not including the interaction term in the model.
     I   For illustration let us consider an example of a simple
         two-factor study in which the effects of gender (male and
         female) and age (young, middle and old) on learning of a task
         are of interest.
      I   When we assume no interaction effects we say the factor
          effects are additive, that is,
                                µij = µ + αi + βj
     I   This can mean two things.
No interaction
     I   The figure shows that Age has some effect (the lines are at
         different heights) whereas gender has no effect (the lines have
         zero slope) on the mean response.
     I   Also the lines are parallel, meaning that there is no
         interaction effect.
No interaction
     I   Here both age and gender have effects on the mean response,
         but still there is no interaction effect because the lines are
         parallel.
     I   Thus it is entirely possible that the factors are additive (that is,
         the factors have main effects but they do not interact).
Interaction
     I   There are main effects of both the factors along with the
         interaction effect.
     I   Is it possible that factors have interaction effects but no main
         effects? (Can some parallel lines intersect ? )
Notes on interactions
     I   In case of multifactor studies some interactions may be zero
         even though the factors are interacting. All interactions must
         equal zero in order for the two factors to be additive.
     I   When two factors interact, the question arises whether the
         factor level means are still meaningful measures.
      I   For instance, suppose in our example the gender factor level
          means come out to be 13 and 11. It may be argued that
          these are misleading measures.
     I   They indicate that some difference exists in learning time for
         men and women, but that this difference is not too great.
     I   These factor level means hide the fact that there is no
         difference in mean learning time between genders for young
         persons, but there is a relatively large difference for old
         persons.
I   In such a case we call the interactions important
    interactions, implying that one should not ordinarily examine
    the effects of each factor separately in terms of the factor level
    means.
I   Sometimes when two factors interact, the interaction effects
    are so small that they are considered to be unimportant
    interactions (the curves are almost parallel).
I   In the case of unimportant interactions, the analysis of factor
    effects can proceed as for the case of no interactions.
I   The determination of whether interactions are important or
    unimportant is admittedly sometimes difficult because it
    depends on the context of the application.
I   The subject area specialist (researcher) needs to play a
    prominent role in deciding whether an interaction is important
    or unimportant. The advantage of unimportant (or no)
    interactions, namely, that one is then able to analyze the
    factor effects separately, is especially great when the study
    contains more than two factors.
I Occasionally, it is meaningful to consider the effects of each factor in
   terms of the factor level means even when important interactions are
   present.
I For example, two methods of teaching Linear Models (hard: using
   projections and standard: using usual sampling distributions) were used in
   teaching students of excellent, good, and medium quantitative ability.
I Important interactions between teaching method and student’s
   quantitative ability were found to be present.
I Students with excellent quantitative ability tended to perform equally well
   with the two teaching methods.
I Students of moderate or good quantitative ability, however, tended to
   perform better when taught by the standard method.
I If equal numbers of students with moderate, good, and excellent
   quantitative ability are to be taught by one of the teaching methods,
   then the method that produces the best average result for all students
   might be of interest even in the presence of important interactions.
I A comparison of the teaching method factor level means would then be
   relevant, even though important interactions are present.
Two way layout with one observation per cell
     I   In many studies we have constraints on cost, time, and
         materials that limit the number of observations that can be
         obtained.
     I   For example, a process engineer in a manufacturing company
         may have only a limited time to experiment with the
         production line.
     I   If the line is available for one day and only eight batches of
         product can be produced in a day, the experiment may have to
         be limited to eight observations.
     I   If the study involves one factor at four levels and a second
         factor at two levels so that there are eight factor level
         combinations, only one replication of the experiment is then
         possible for each treatment.
I   Another reason why some studies contain only one case per
    treatment is that the response of interest is a single aggregate
    measure of performance.
I   For example, in a marketing research study of alternative
    package designs, evaluation of each alternative may require a
    separate market test.
I   The response of interest is the observed market share, and this
    results in a single response for each treatment combination.
I   Special attention is required for the analysis of two-factor
    studies containing only one replication per treatment because
    no degrees of freedom are available for estimation of the
    experimental error with the standard two-factor ANOVA
    model.
No interaction model
    I   When there is only one case for each treatment, we can no longer
        work with the two-factor ANOVA model that includes an interaction
        effect.
    I   This is because no estimate of the error variance σ 2 will be
        available.
    I   Recall that SSE is a sum of squares made up of components
        measuring the variability within each treatment.
    I   With only one case per treatment, there is no variability within
        a treatment, and SSE will then always be zero.
I   A way out of this difficulty is to change the model.
I   We shall see later that if the two factors do not interact so
    that γij = 0, the interaction mean square MSAB has
    expectation σ 2 .
I   Thus, if it is possible to assume that the two factors do not
    interact, we may use MSAB as the estimator of the error
    variance σ 2 and proceed with the analysis of factor effects as
    usual.
I   If it is unreasonable to assume that the two factors do not
    interact, transformations may be tried to remove the
    interaction effects.
Model
    I   We assume that we have a single observation yij corresponding
        to each level combination.
    I   Hence the model we consider here is
                   yij = µij + eij ,   i = 1, 2, ..., p,  j = 1, 2, ..., q
        where µij is the fixed effect due to (Ai , Bj ) and eij is the random error.
    I   With the reparametrized version the model reduces to
                              yij = µ + αi + βj + eij .
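A short sketch (entirely made-up data and level names) of fitting this additive model with statsmodels when there is one observation per cell; the residual line of the ANOVA table then carries the (p − 1)(q − 1) degrees of freedom that would otherwise belong to the interaction, and its mean square plays the role of MSAB as the estimate of σ²:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(2)

    # hypothetical layout: p = 4 levels of A, q = 2 levels of B, one case per cell
    cells = [(a, b) for a in ["A1", "A2", "A3", "A4"] for b in ["B1", "B2"]]
    df = pd.DataFrame(cells, columns=["A", "B"])
    df["y"] = rng.normal(10.0, 1.0, len(df))

    fit = smf.ols("y ~ C(A) + C(B)", data=df).fit()   # additive, no interaction
    print(anova_lm(fit))                               # residual df = (p-1)(q-1) = 3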
Example: Two way layout with more than one observation
per cell
    I   Let there be two factors A and B such that A has p levels
        A1 , A2 , ..., Ap and B has q levels B1 , B2 , ..., Bq . These pq level
        combinations (Ai , Bj ) constitute the entire population of
        interest.
    I   Further we assume that we have m observations corresponding
        to each level combination.
    I   Let yijk be the k th observation receiving the treatment
        combination (Ai , Bj ).
    I   Then the model we consider here is
        yijk = µ + αi + βj + γij + eijk ,  k = 1, 2, ..., m,  i = 1, 2, ..., p,  j = 1, 2, ..., q
        where the eijk are the random errors which we assume to be
          I   independent (over i, j and k)
          I   distributed as N(0, σ²) for all i, j, k.
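A brief sketch (simulated data, hypothetical level names) of fitting this model with replicates in statsmodels; C(A) * C(B) expands to the two main effects plus the interaction C(A):C(B):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(3)

    # hypothetical balanced layout: p = 3, q = 2, m = 4 replicates per cell
    rows = [(a, b) for a in ["A1", "A2", "A3"]
                   for b in ["B1", "B2"]
                   for _ in range(4)]
    df = pd.DataFrame(rows, columns=["A", "B"])
    df["y"] = rng.normal(10.0, 1.0, len(df))

    fit = smf.ols("y ~ C(A) * C(B)", data=df).fit()
    print(anova_lm(fit, typ=2))   # rows for A, B, the A:B interaction, and error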
Ordinal Factors
     I In case of ordinal variables the levels can be put in a sensible order, but
        there’s no implication that the distance from one level to the next is
        constant.
     I We have basically two ways to handle them:
           I   Ignore the ordering and treat them like nominal categorical
               variables.
           I   Ignore the fact that they're only ordinal and not metric, assign
               them numerical codes (say 1, 2, 3, . . . ) and treat them like
               ordinary numerical variables.
     I The first procedure is unbiased, but can end up dealing with a lot of
       distinct coefficients.
     I It also has the drawback that if the relationship between Y and the
        categorical variable is monotone, that may not be respected by the
        coefficients we estimate.
     I The second procedure is very easy, but usually without any substantive or
        logical basis. It implies that each step up in the ordinal variable will
        predict exactly the same difference in y , and why should that be the
        case?
     I If, after treating an ordinal variable like a nominal one, we get contrasts
        which are all (approximately) equally spaced, we might then try the second
        approach of assigning equally spaced numerical codes.
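A minimal sketch (hypothetical ordinal variable and level names) showing the two coding strategies side by side with pandas:

    import pandas as pd

    df = pd.DataFrame({"ability": ["medium", "good", "excellent", "good", "medium"]})

    # strategy 1: treat as nominal -- one dummy per non-baseline level
    nominal = pd.get_dummies(df["ability"], prefix="ability", drop_first=True)

    # strategy 2: treat as numeric -- equally spaced codes respecting the order
    numeric = df["ability"].map({"medium": 1, "good": 2, "excellent": 3})

    print(nominal)
    print(numeric)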
Factors along with quantitative covariates
     I   It is perfectly possible that in our linear model some covariates
         are factors and others are numeric variables.
     I   To illustrate things let us assume that we are dealing with a
         factor having two levels and there are other p numeric
         covariates x1 , x2 , ..., xp .
      I   Thus introducing a single dummy variable xb we can write the
          linear model as

                       y = α + βb xb + Σ_{i=1}^{p} βi xi + ε.
     I   Geometrically, if we plot the expected value of y against
         x1 , ...xp , we will now get two regression surfaces.
     I   They will be parallel to each other, and offset by βb .
     I   We thus have a model where each category gets its own
         intercept: α for the baseline level and α + βb for the other
         class.
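     I   As a minimal sketch (not part of the original slides), such a
         model could be fit in Python with the statsmodels formula
         interface; the data frame df, the factor g and the covariates
         x1 , x2 below are hypothetical names and the data is simulated
         purely for illustration.

         # Hypothetical illustration: two-level factor g plus numeric x1, x2.
         # C(g) builds the dummy variable; level "A" is the baseline.
         import numpy as np
         import pandas as pd
         import statsmodels.formula.api as smf

         rng = np.random.default_rng(0)
         n = 200
         g = rng.choice(["A", "B"], size=n)        # two-level factor
         x1 = rng.normal(size=n)
         x2 = rng.normal(size=n)
         xb = (g == "B").astype(float)             # dummy variable for level B
         y = 1.0 + 2.0 * xb + 0.5 * x1 - 1.0 * x2 + rng.normal(scale=0.3, size=n)
         df = pd.DataFrame({"y": y, "g": g, "x1": x1, "x2": x2})

         # Common slopes on x1, x2; intercept is `Intercept` for level A
         # and `Intercept + C(g)[T.B]` for level B.
         fit = smf.ols("y ~ C(g) + x1 + x2", data=df).fit()
         print(fit.params)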
Why not just split the data?
     I   If we want to give each class its own intercept, why not just
         split the data and estimate two models, one for each class?
Why not just split the data?
     I   If we want to give each class its own intercept, why not just
         split the data and estimate two models, one for each class?
     I   The answer is that sometimes we’ll do just this, especially if
         there’s a lot of data for each class.
Why not just split the data?
     I   If we want to give each class its own intercept, why not just
         split the data and estimate two models, one for each class?
     I   The answer is that sometimes we’ll do just this, especially if
         there’s a lot of data for each class.
     I   However, if the regression surfaces for the two categories really
         are parallel to each other, by splitting the data we’re losing
         some precision in our estimate of the common slopes, without
         gaining anything.
Why not just split the data?
     I   If we want to give each class its own intercept, why not just
         split the data and estimate two models, one for each class?
     I   The answer is that sometimes we’ll do just this, especially if
         there’s a lot of data for each class.
     I   However, if the regression surfaces for the two categories really
         are parallel to each other, by splitting the data we’re losing
         some precision in our estimate of the common slopes, without
         gaining anything.
     I   In fact, if the two surfaces are nearly parallel, for moderate
         sample sizes the small bias that comes from pretending the
         slopes are all equal can be overwhelmed by the reduction in
         variance, so that the resulting MSE of the estimates of
         parameters are less.
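     I   A small simulation sketch (not part of the original slides,
         hypothetical names and data) illustrating this trade-off: when
         the two groups truly share a slope, the pooled dummy-variable
         fit estimates that slope more precisely than two split fits.

         # Hypothetical simulation: both groups share the slope 0.5.
         import numpy as np
         import pandas as pd
         import statsmodels.formula.api as smf

         rng = np.random.default_rng(1)
         n = 60                                    # moderate per-group size
         g = np.repeat(["A", "B"], n)
         x = rng.normal(size=2 * n)
         xb = (g == "B").astype(float)
         y = 1.0 + 2.0 * xb + 0.5 * x + rng.normal(size=2 * n)
         df = pd.DataFrame({"y": y, "g": g, "x": x})

         pooled = smf.ols("y ~ C(g) + x", data=df).fit()        # common slope
         fit_A = smf.ols("y ~ x", data=df[df.g == "A"]).fit()   # split fits
         fit_B = smf.ols("y ~ x", data=df[df.g == "B"]).fit()

         # The pooled fit uses all 2n points for the slope, so its standard
         # error is typically smaller than that of either split fit.
         print("pooled slope SE:", pooled.bse["x"])
         print("split slope SEs:", fit_A.bse["x"], fit_B.bse["x"])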
Interaction of Categorical and Numerical Variables
     I   If we multiply the indicator variable,say xb for a binary
         category, with an ordinary numerical variable, say x1 , we get a
         different slope on xi for each category:
                          y = α + β1 x1 + β1b xb x1 + 
Interaction of Categorical and Numerical Variables
     I   If we multiply the indicator variable,say xb for a binary
         category, with an ordinary numerical variable, say x1 , we get a
         different slope on xi for each category:
                          y = α + β1 x1 + β1b xb x1 + 
     I   When xb = 0, the slope on x1 is β1 , but when xb = 1, the
         slope on x1 is β1 + β1b
Interaction of Categorical and Numerical Variables
     I   If we multiply the indicator variable,say xb for a binary
         category, with an ordinary numerical variable, say x1 , we get a
         different slope on xi for each category:
                           y = α + β1 x1 + β1b xb x1 + 
     I   When xb = 0, the slope on x1 is β1 , but when xb = 1, the
         slope on x1 is β1 + β1b
     I   The coefficient for the interaction is the difference in slopes
         between the two categories.
Interaction of Categorical and Numerical Variables
     I   If we multiply the indicator variable,say xb for a binary
         category, with an ordinary numerical variable, say x1 , we get a
         different slope on xi for each category:
                           y = α + β1 x1 + β1b xb x1 + 
     I   When xb = 0, the slope on x1 is β1 , but when xb = 1, the
         slope on x1 is β1 + β1b
     I   The coefficient for the interaction is the difference in slopes
         between the two categories.
     I   It says that the categories share a common intercept, but their
         regression lines are not parallel (unless β1b = 0).
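     I   A minimal sketch (not part of the original slides) of this
         model, building the product xb x1 by hand so the fitted
         coefficients map directly onto β1 and β1b ; the column names
         and simulated data are hypothetical.

         # Hypothetical illustration: common intercept, group-specific slope.
         import numpy as np
         import pandas as pd
         import statsmodels.formula.api as smf

         rng = np.random.default_rng(2)
         n = 150
         g = rng.choice(["A", "B"], size=n)
         x1 = rng.normal(size=n)
         xb = (g == "B").astype(float)
         y = 1.0 + 0.5 * x1 + 0.8 * xb * x1 + rng.normal(scale=0.3, size=n)
         df = pd.DataFrame({"y": y, "x1": x1, "xb_x1": xb * x1})

         # Slope on x1 is beta_1 when xb = 0 and beta_1 + beta_1b when xb = 1.
         fit = smf.ols("y ~ x1 + xb_x1", data=df).fit()
         print(fit.params)    # Intercept, x1 (beta_1), xb_x1 (beta_1b)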
Interaction (Contd.)
     I   We could expand the model by letting each category have its
         own slope and its own intercept:
                      y = α + βb xb + β1 x1 + β1b xb x1 + ε
Interaction (Contd.)
     I   We could expand the model by letting each category have its
         own slope and its own intercept:
                       y = α + βb xb + β1 x1 + β1b xb x1 + ε
     I   This model, where “everything is interacted with the category”,
         is very close to just running two separate regressions, one per
         category.
Interaction (Contd.)
     I   We could expand the model by letting each category have its
         own slope and its own intercept:
                       y = α + βb xb + β1 x1 + β1b xb x1 + ε
     I   This model, where “everything is interacted with the category”,
         is very close to just running two separate regressions, one per
         category.
     I   It does, however, insist on having a single noise variance σ 2
         (which separate regressions wouldn’t accomplish).
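     I   A minimal sketch (not part of the original slides, hypothetical
         names and data) comparing the fully interacted model with two
         separate regressions: the coefficients agree after
         reparametrization, but the interacted model pools the residuals
         into one estimate of the error variance.

         # Hypothetical illustration: own intercept and slope per group.
         import numpy as np
         import pandas as pd
         import statsmodels.formula.api as smf

         rng = np.random.default_rng(3)
         n = 100
         g = np.repeat(["A", "B"], n)
         x1 = rng.normal(size=2 * n)
         xb = (g == "B").astype(float)
         y = 1.0 + 1.5 * xb + 0.5 * x1 + 0.8 * xb * x1 + rng.normal(scale=0.4, size=2 * n)
         df = pd.DataFrame({"y": y, "g": g, "x1": x1})

         # "x1 * C(g)" expands to x1 + C(g) + x1:C(g).
         full = smf.ols("y ~ x1 * C(g)", data=df).fit()
         sep_A = smf.ols("y ~ x1", data=df[df.g == "A"]).fit()
         sep_B = smf.ols("y ~ x1", data=df[df.g == "B"]).fit()

         # Group A's intercept/slope are `Intercept` and `x1`; group B's add
         # the C(g)[T.B] terms, matching the two separate fits.
         print(full.params)
         print(sep_A.params, sep_B.params)
         # One pooled residual variance vs. two separate ones:
         print(full.scale, sep_A.scale, sep_B.scale)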