Review of Linear Models I
Presidency University
       February, 2025
Guessing the value of a variable
    - Suppose we need to guess a single value for a quantitative random variable Y. What is the best value to guess?
    - To answer this question, we need to pick a function to optimize, one that measures how good or bad our guesses are: how big an error we are making. A reasonable, traditional starting point is the mean squared error
                          MSE(c) = E[(Y - c)^2].
      We would like to find the value c for which MSE(c) is smallest.
    - The optimal choice is c = E(Y). Hence the best guess we can make about Y with respect to mean squared error is E(Y).
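    - To see why (a standard one-line calculation, sketched here): for any constant c,
                          MSE(c) = E[(Y - c)^2] = E[(Y - E(Y))^2] + (E(Y) - c)^2 = Var(Y) + (E(Y) - c)^2,
      since the cross term 2(E(Y) - c) E[Y - E(Y)] vanishes. The first term does not involve c, so the minimum is attained at c = E(Y).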
Guessing Y from knowledge of another variable
    - Now suppose we have another, auxiliary variable X and we guess Y by some function of X, say g(X).
    - As before, if we take MSE as the optimality criterion, then we seek to minimize E[(Y - g(X))^2] over g. The optimal function turns out to be f(x) = E(Y | X = x).
    - This function f(X) is called the regression function, which is what we would like to know when we try to predict Y based on X. The regression of Y on X is the locus of the conditional mean E(Y | X).
    - Problem: this function f(X) is generally unknown unless we assume a completely known probability distribution for (X, Y).
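    - A sketch of why the conditional mean is optimal: for any candidate g,
                          E[(Y - g(X))^2] = E[(Y - E(Y|X))^2] + E[(E(Y|X) - g(X))^2],
      because the cross term has zero expectation (condition on X and use E[Y - E(Y|X) | X] = 0). The first term does not depend on g, so the MSE is minimized by g(X) = E(Y|X).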
Regression Analysis
    - What we have at hand is a random sample (x1, y1), ..., (xn, yn) from that distribution. This is often called the learning set or the training set.
    - Regression analysis is all about constructing a suitable approximation f̂ of f based on this training data set.
    - As we shall see, constructing a suitable approximation f̂ is a two-step process:
        - Step 1: Restrict attention to a class of functions F and find the best approximation f_F of f within that class.
        - Step 2: Estimate (or learn) f_F from the data (x1, y1), ..., (xn, yn) to obtain f̂.
More about Step 1
    - The choice of the class F is a trade-off:
        - If F contains very complicated functions, the fit will capture too much of the variation in the training data, leading to what we call over-fitting or undersmoothing.
        - On the contrary, if F contains only very simple functions, the fit will fail to capture the variation in the training data: this is called underfitting or oversmoothing.
        - Neither scenario is desirable: both have their own problems, as we shall see.
Two Perspectives
    - In general there are two objectives of any regression analysis:
        - Given a new data point x_new, we want to predict the value of the response variable y; that is, we are interested only in getting the fitted value ŷ at some new point x_new. This is the problem of prediction.
        - We want to know the functional relationship between y and x; that is, we want an approximation of the true regression of y on x. This is the problem of curve estimation, or the problem of inference.
    - The problem of curve estimation is much wider than the problem of prediction, because solving the former solves the latter as a consequence.
    - A natural question, then, is why consider the two problems separately? Because the problem of prediction is much simpler than the problem of curve estimation, and hence we can devise many simple regression procedures for that purpose.
Problem of inference
    - Here our objective starts with understanding how the covariate X affects the response Y.
    - We want to estimate f, but not for predicting Y. Now f̂ cannot be treated as a black box: we need to know its exact mathematical form.
    - In this setting, one may be interested in answering the following questions:
        - Which predictors are associated with the response? It is often the case that only some of the available predictors are substantially associated with Y. Identifying the few important predictors among a large set of possible variables can be extremely useful.
        - What is the relationship between the response and each predictor? Some predictors may have a positive relationship with Y while others may have the opposite relationship.
        - Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? In some situations a linear form is reasonable or desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
Problem of prediction
    - In many situations, a set of inputs X is readily available, but the output Y cannot be easily obtained.
    - In this setting, since the error term averages to zero, we can predict Y using
                          Ŷ = f̂(X)
      where f̂ is an estimate of f and Ŷ is the prediction for Y.
    - We note that f̂ acts like a black box, in the sense that we need not know the exact mathematical form of f̂, as long as it yields accurate predictions Ŷ.
Measurement of accuracy
    - In the problem of prediction the main objective is to find an accurate estimate f̂.
    - But how do we measure accuracy here? The answer is: using the mean squared error. We need to choose f̂ for which the MSE is minimum.
    - For any approximation f̂_n(x) based on a sample of size n, we can write
                  MSE(f̂_n(x)) = σ_x^2 + Bias^2(f̂_n(x)) + Var(f̂_n(x))
      where
        - σ_x^2 = Var(Y | X = x) is the variance which is uncontrollable (variance due to random causes),
        - the second term Bias^2(f̂_n(x)) = [f(x) - E(f̂_n(x))]^2 is the squared approximation bias (or error), which is incurred by using f̂_n(x) instead of f(x), and
        - the third term Var(f̂_n(x)) = E[f̂_n(x) - E(f̂_n(x))]^2 is the variance of our estimated regression function.
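    - Where this decomposition comes from (a sketch, treating f̂_n as independent of the noise in the new response): write Y = f(x) + ε at X = x with E(ε | X = x) = 0 and Var(ε | X = x) = σ_x^2. Then
                  E[(Y - f̂_n(x))^2] = E[(Y - f(x))^2] + E[(f(x) - f̂_n(x))^2]
                                    = σ_x^2 + [f(x) - E(f̂_n(x))]^2 + E[(f̂_n(x) - E(f̂_n(x)))^2],
      the cross terms vanishing because E(ε | X = x) = 0 and by expanding around E(f̂_n(x)).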
Bias-variance Trade-off
    - Although σ_x^2 is beyond our control, Bias^2(f̂_n(x)) and Var(f̂_n(x)) depend on the choice of f̂_n(x), which makes the situation interesting.
    - If we choose f̂_n(x) from a class of complicated functions, then Bias^2(f̂_n(x)) is small but Var(f̂_n(x)) increases considerably.
    - We note that even an unbiased estimator f̂_n(x) may not be admissible because of large variance.
    - On the other hand, if we choose f̂_n(x) from a class of simple functions, then Var(f̂_n(x)) is close to zero but Bias^2(f̂_n(x)) increases considerably.
    - The catch is that, at least past a certain point, decreasing the approximation bias can only come through increasing the estimation variance.
    - This is the bias-variance trade-off.
    - This trade-off is exactly the one we discussed earlier: overfitting versus oversmoothing.
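    - A minimal simulation sketch (not from the slides; the data-generating curve, noise level, and sample sizes are made up) comparing prediction error on fresh data for an over-simple, a moderate, and a very flexible polynomial fit; with this kind of setup the two extremes typically do worse, though exact numbers depend on the seed:

        import numpy as np

        rng = np.random.default_rng(0)
        f = lambda x: np.sin(2 * np.pi * x)              # assumed "true" regression function
        x_tr = rng.uniform(0, 1, 20)
        y_tr = f(x_tr) + rng.normal(0, 0.3, 20)          # small training set
        x_te = rng.uniform(0, 1, 2000)
        y_te = f(x_te) + rng.normal(0, 0.3, 2000)        # fresh data for judging prediction

        for degree in (1, 3, 10):                        # too simple, moderate, very flexible
            coef = np.polyfit(x_tr, y_tr, degree)        # least-squares polynomial fit
            mse = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
            print(degree, round(mse, 3))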
Methods of finding f̂
    - We shall explore many linear and non-linear approaches for estimating f.
    - Most statistical estimation methods for this task can be characterized as either parametric or non-parametric. Both approaches have their relative merits and demerits.
Parametric methods
    - Parametric methods generally take a two-step approach to building models:
       1. Model assumption: First we assume a specific functional form for f. For example, with p predictors X = (X1, X2, ..., Xp) and a response Y, one may assume a linear form for f(X):
                          f(X) = β0 + β1 X1 + .... + βp Xp.
          We note that this assumption of linearity makes our search for f simple. We no longer need to search among the set of all p-dimensional functions; rather we only need to estimate the p + 1 coefficients βi to get the desired model.
       2. Fitting the assumed model: After a model has been selected, we need a procedure that uses the training data to fit or train the model. For example, in the case of the linear model, fitting means estimating the parameters β0, β1, ..., βp. That is, we want estimates β̂0, β̂1, ..., β̂p such that
                          Y ≈ β̂0 + β̂1 X1 + ... + β̂p Xp.
    - One of the many available techniques to fit such a model is the least squares approach, sketched below.
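    - A minimal least-squares sketch (made-up data; the coefficient values are for illustration only) of the two steps, assume a linear form and then estimate its coefficients from training data:

        import numpy as np

        rng = np.random.default_rng(1)
        n, p = 100, 3
        X = rng.normal(size=(n, p))                    # p quantitative predictors
        beta = np.array([1.0, 2.0, -1.0, 0.5])         # illustrative beta_0, ..., beta_p
        y = beta[0] + X @ beta[1:] + rng.normal(0, 0.5, n)

        Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept column
        beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        print(beta_hat)                                # estimates of beta_0, ..., beta_p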
Issues with parametric approach
    - This model-based approach is called parametric because estimating the model essentially reduces to estimating a number of parameters.
    - Although assuming a parametric model simplifies the task, this approach has its own limitations.
    - First of all, we need to make a definite assumption about the model, and this choice usually does not match the true form. Further, if this choice is too bad, then the estimates will be poor.
    - We can make our model more flexible by including more parameters, but that will lead to overfitting.
    - This is a trade-off: if we choose too few parameters we may get oversmoothing, which means ignoring many potential causes, and if we include many parameters we suffer from overfitting, or what we call undersmoothing.
Non-parametric approach
    - Non-parametric methods do not make explicit assumptions about the functional form of f.
    - Instead, here we seek an estimate of f that gets as close to the data points as possible without being too rough.
    - An example of such a non-parametric method is spline regression.
    - The major advantage of such approaches is that we avoid the assumption of a particular functional form for f, and hence they have the potential to accurately fit a wider range of possible shapes for f.
    - But a major disadvantage is this: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate of f. A small sketch of a spline fit follows.
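    - A minimal smoothing-spline sketch (an assumed example; the slides name spline regression but not a particular implementation or data set):

        import numpy as np
        from scipy.interpolate import UnivariateSpline

        rng = np.random.default_rng(2)
        x = np.sort(rng.uniform(0, 1, 200))
        y = np.sin(4 * x) + rng.normal(0, 0.2, 200)      # noisy data around a smooth curve

        # No functional form is assumed; the smoothing parameter s controls roughness.
        spline = UnivariateSpline(x, y, s=len(x) * 0.04)
        print(spline(np.linspace(0, 1, 5)))              # fitted values at a few new points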
Step 1: F contains constant functions
    - We choose the restricted class F to be the constant functions f(x) = f0.
    - This indicates oversmoothing.
    - But at times this may produce appropriate results, for instance when
        - the true regression f(x) really is a constant, or
        - f(x) varies rapidly but within narrow limits.
    - In such situations we can actually do better by fitting a constant than by matching the correct functional form.
    - In the second situation f̂(x) = f0 will be biased but can have smaller MSE than an unbiased estimator.
Example: Bias-variance tradeoff in action
    - For example, suppose our f(x) is of the form
                          f(x) = α + β sin(γx).
    - Further, we assume β ≪ 1 and γ ≫ 1, so that f varies rapidly but stays within narrow limits.
    - Here estimating a constant regression function does better than estimating the regression with the correct functional form.
    - In fact, here the MSE of the model f̂1(x) = f0 is less than the MSE of the unbiased model f̂2(x) = α̂ + β̂ sin(γx) (assuming γ to be known).
Example (contd.)
    [Figure: simulated data plotted as y against x on (0, 1), with the y-axis running from about 0.4 to 1.6; the fitted curves are described on the next slide.]
Example (Contd.)
    - A rapidly-varying but nearly-constant regression function: y = 1 + 0.02 sin(200x) + ε, where ε ∼ N(0, 0.5).
    - The red dotted line is the constant line indicating the sample mean of the response.
    - The blue dot-dashed curve is the estimated function of the form α̂ + β̂ sin(200x).
    - With just a few observations, the constant actually predicts better on new data (MSE 0.53) than does the estimated sine function (MSE 0.59).
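    - A sketch that reproduces the spirit of this experiment (the seed, the sample size, and whether 0.5 is the standard deviation or the variance are assumptions here, so the MSE values will differ from 0.53 and 0.59):

        import numpy as np

        rng = np.random.default_rng(3)
        n = 10                                            # just a few observations
        x = rng.uniform(0, 1, n)
        y = 1 + 0.02 * np.sin(200 * x) + rng.normal(0, 0.5, n)    # taking 0.5 as the sd

        c_hat = y.mean()                                  # constant fit: the sample mean

        A = np.column_stack([np.ones(n), np.sin(200 * x)])        # correct form, gamma = 200 known
        (a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

        x_new = rng.uniform(0, 1, 10000)                  # fresh data for prediction error
        y_new = 1 + 0.02 * np.sin(200 * x_new) + rng.normal(0, 0.5, 10000)
        print(np.mean((y_new - c_hat) ** 2))                              # constant fit
        print(np.mean((y_new - (a_hat + b_hat * np.sin(200 * x_new))) ** 2))  # sine fit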
What does this example tell us?
    - The “optimum” choice does not necessarily mean the “truth”.
    - In this example, the truth was a sine curve whereas the constant function is optimum.
    - Optimum means a “reasonably good” approximation of the truth that serves our purpose.
    - We should always remember that we are searching for the optimum, not the truth.
    - This is the motivation behind fixing the class F.
General Linear model
    - Consider a setup where we have a single response variable y which is quantitative and p covariates x1, x2, ..., xp which can be quantitative or qualitative or both.
    - Suppose we have n observations on each of these p variables. That is, suppose we have observations y1, y2, ..., yn on the response y and x1i, x2i, ..., xni on the i-th covariate xi, i = 1, 2, ..., p.
    - Then the general linear model can be written as
                          y = Xβ + ε
      where y = (y1, y2, ..., yn) is the response vector, β = (β1, ..., βp) is the vector of parameters, and

                              [ x11  x12  ...  x1p ]
                              [ x21  x22  ...  x2p ]
                          X = [  .    .          . ]
                              [ xn1  xn2  ...  xnp ]

      is the design matrix.
Linear Model (contd.)
    - Further, ε = (ε1, ε2, ..., εn) is the vector of random errors, where we assume
        - E(εi | X) = 0 for all i,
        - Var(εi | X) = σ^2 for all i,
        - Cov(εi, εj | X) = 0 for all i ≠ j.
    - These assumptions can alternatively be stated as E(ε | X) = 0 and D(ε | X) = σ^2 In.
    - More specifically, we assume a single quantitative response variable y and p covariates such that
                          E(y | X) = Xβ and Var(y | X) = σ^2 In.
Example: Simple Linear Regression
    - Suppose we restrict our class to the class of all linear functions F = {f(x) : f(x) = α + βx, α, β ∈ R}.
    - For n observations, we can write it as
                          y = Xθ + ε
      where E(ε) = 0 and Var(ε) = σ^2 In.
    - Here we have

                              [ 1  x1 ]
                              [ 1  x2 ]               ( α )
                          X = [ .   . ]   and   θ  =  ( β ).
                              [ 1  xn ]
Example: Polynomial Regression
    - An immediate extension of this can be made by expanding the class to incorporate polynomials in x:
              F = {f : f(x) = β0 + β1 x + .... + βp x^p for some p}.
    - For polynomial regression, we have

                              [ 1  x1  x1^2  ...  x1^p ]              ( β0 )
                              [ 1  x2  x2^2  ...  x2^p ]              ( β1 )
                          X = [ .   .    .          .  ]   and   θ =  (  . )
                              [ 1  xn  xn^2  ...  xn^p ]              ( βp ).
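    - A minimal sketch (made-up x values) of building this polynomial design matrix in code:

        import numpy as np

        x = np.array([0.1, 0.5, 0.9, 1.3])
        p = 2
        X = np.vander(x, N=p + 1, increasing=True)   # rows are [1, x_i, x_i^2]
        print(X)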
Example: Multiple Regression
    - Hence we can consider the class of functions
              F = {f : f(x) = β0 + β1 x1 + .... + βp xp}
      where we have a single response variable y and p quantitative predictor variables x1, x2, ..., xp.
    - For multiple linear regression we have

                              [ 1  x11  x12  ...  x1p ]              ( β0 )
                              [ 1  x21  x22  ...  x2p ]              ( β1 )
                          X = [ .   .     .         . ]   and   θ =  (  . )
                              [ 1  xn1  xn2  ...  xnp ]              ( βp ).
Classification
    - If all the columns of X (except the first column) contain values of continuous variables, then the linear model is called a regression model.
    - If all the columns of X contain values of discrete variables (more specifically, if all the columns contain values 0 or 1), then the linear model is called an ANOVA model.
    - If some columns of X contain values of continuous variables and some columns contain values of discrete variables, then the linear model is called an ANCOVA (or ANOCOVA) model.
Dealing with Factors
    - In linear models we need to deal with what we call factor variables, or factors, which are categorical variables with different categories. The different categories of a factor are called the factor levels.
    - Suppose we have a single factor A with k levels A1, A2, ..., Ak having a potential effect on the response y. A natural question is: how do we model the effects of all these levels in a single linear model?
    - The answer is to use indicator variables or dummy variables x1, x2, ..., xk−1, where
              xi = 1 if the observation receives the i-th level, and 0 otherwise.
    - We can then write a linear model as
                          y = α + β1 x1 + .... + βk−1 xk−1 + ε
      where βi is the effect of the i-th level of A. A small coding sketch follows.
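    - A minimal dummy-coding sketch (the level names and the library choice are illustrative, not from the slides):

        import pandas as pd

        A = pd.Series(["A1", "A2", "A3", "A1", "A3"], dtype="category")
        dummies = pd.get_dummies(A, drop_first=True)   # k - 1 indicator columns
        print(dummies)                                 # the dropped level acts as the baseline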
Using dummy variables
    - So why did we use k − 1 dummy variables when we had k levels? Where is the effect of Ak modeled in the linear model?
    - The answer: if the observation receives the k-th level of the factor A, then all xi = 0, i = 1, 2, ..., k − 1, and as such α represents the expected value of y when the observation receives Ak.
    - When an observation receives the level Ai, i = 1, 2, ..., k − 1, the expected value of y is α + βi. As such, βi, i = 1, 2, ..., k − 1, represents the change in the expected value of y due to Ai as compared to Ak.
    - That means each βi represents the expected difference in y between an observation belonging to Ai and one belonging to Ak. For this reason the βi's are sometimes called contrasts between the two classes.
    - Here we compare the effects of the other levels with that of Ak. In such a case we call Ak the baseline level.
    - Obviously we can take any level Ai (not necessarily Ak) to be the baseline level.
    - A general rule is thus: if we are working with a factor with k levels, then we need to introduce k − 1 dummy variables.
Using dummy for all levels
    - Now suppose that in the same situation we use k dummy variables x1, x2, ..., xk instead of the k − 1 variables x1, x2, ..., xk−1 and fit the model
                          y = α + β1 x1 + β2 x2 + ... + βk xk + ε.
    - Then we note that the variables x1, x2, ..., xk are not independent: they satisfy the constraint Σ xi = 1, since any observation must receive exactly one of the levels Ai.
    - Here the design matrix is

                              [ 1  x11  x12  ...  x1k ]
                              [ 1  x21  x22  ...  x2k ]
                          X = [ .   .     .         . ]
                              [ 1  xn1  xn2  ...  xnk ]

      but it is not of full column rank. (A small numerical check follows.)
    - Statistical lesson: there can be alternative parametrizations of the same model.
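    - A minimal numerical check (a made-up example with k = 3 levels and 6 observations) that the intercept-plus-k-dummies matrix is rank deficient while the k − 1 coding is not:

        import numpy as np

        levels = np.array([0, 0, 1, 1, 2, 2])                  # which level each observation receives
        D = np.eye(3)[levels]                                  # one indicator column per level
        X_full = np.column_stack([np.ones(6), D])              # intercept + k dummies: 4 columns
        X_drop = np.column_stack([np.ones(6), D[:, :2]])       # intercept + k - 1 dummies: 3 columns
        print(np.linalg.matrix_rank(X_full), np.linalg.matrix_rank(X_drop))   # prints 3 3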
Example: ANOVA model (One way layout)
    - Suppose we have a factor A and let A1, A2, ..., Ak be the levels of A, which constitute the populations of interest.
    - Further assume there are ni observations receiving the level Ai, and let yij be the j-th observation receiving the i-th level Ai.
    - The model we consider is
                          yij = µi + eij, j = 1, 2, ..., ni, i = 1, 2, ..., k,
      where µi is the fixed effect due to Ai and eij is a random error.
    - We assume that
                          eij ∼ N(0, σ^2)
      and that the eij's are independent.
    - This implies that E(yij) = µi and Var(yij) = σ^2 for all j = 1, ..., ni, which means the µi's are the factor level means and σ^2 is the common variability among observations belonging to each group.
One way ANOVA as linear model
    - Here we have introduced k dummy variables for the k levels, but without any intercept.
    - In terms of dummy variables we can write
                          y = µ1 x1 + µ2 x2 + .... + µk xk + ε
      where xi = 1 or 0 according as the observation receives Ai or not.
One way ANOVA as linear model
    - Suppose we denote

          y = (y11, y12, ..., y1n1, y21, ..., y2n2, ..., yk1, ..., yknk),
          β = (µ1, µ2, ..., µk),
          ε = (e11, e12, ..., e1n1, e21, ..., e2n2, ..., ek1, ..., eknk),

      and

                              [ 1n1   0   ...   0  ]
                              [  0   1n2  ...   0  ]
                    Xn×k   =  [  .    .          . ]
                              [  0    0   ...  1nk ]

      where 1m denotes the m × 1 vector of ones.
    - Then the above model can be written as
                          y = Xβ + ε
      where ε ∼ Nn(0, σ^2 In).
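    - A minimal sketch (the group sizes are made up) of building this block design matrix in code:

        import numpy as np

        n_i = [3, 2, 4]                               # n_1, ..., n_k
        X = np.repeat(np.eye(len(n_i)), n_i, axis=0)  # column i is 1 exactly for group i's rows
        print(X.shape)                                # (sum of n_i, k)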
Reparametrization
    - At times an alternative, but completely equivalent, formulation of the single-factor ANOVA model is used. This alternative formulation is called the factor effects model.
    - Let us write
                          µi = µ̄ + (µi − µ̄) = µ + αi
      where µ = µ̄ = (Σ ni µi) / n and αi = µi − µ̄.
    - Then we note that Σi ni αi = 0.
    - Now our linear model of interest becomes
                          yij = µ + αi + eij, j = 1, 2, ..., ni, i = 1, 2, ..., k,
      where µ denotes the general (average) effect, αi denotes the additional fixed effect due to Ai subject to the restriction Σi ni αi = 0, and eij denotes the random error.
    - We assume that the eij are independent N(0, σ^2) variables for all i, j.
Reparametrized form as linear Model
    - Now, in terms of dummy variables, we have included k dummy variables for the k levels along with an intercept.
    - In this case the linear model becomes
                          y = Xβ + ε
      where β = (µ, α1, α2, ..., αk)^T and

                              [ 1n1  1n1   0   ...   0  ]
                              [ 1n2   0   1n2  ...   0  ]
                 Xn×(k+1)  =  [  .    .    .          . ]
                              [ 1nk   0    0   ...  1nk ].
Example: More use of dummy variables
    - Consider a setup where we need to judge the effectiveness of a treatment (or perhaps compare the effectiveness of two treatments, in which case the control group may be thought of as getting some treatment). This may be a controlled experiment or an observational study.
    - Suppose the data are obtained in the form

          Control:    y11, y12, ..., y1n1
          Treatment:  y21, y22, ..., y2n2

    - Note that we allow the numbers of observations in the two groups to be different, and the y's represent the values of the response.
    - This situation can also be handled with a linear model through the use of dummy variables.
More use of dummy (contd.)
    I   Let us define a dummy variable as
        x = 1 or 0 according as the observation receives the treatment or not.
    I   Then the linear model can be written as

                                z = α + βx + ε

        or more precisely

                zi = α + βxi + εi ,   i = 1, 2, ..., n (= n1 + n2).

    I   Here
                zi = y1i           for i = 1, 2, ..., n1
                zi = y2(i−n1)      for i = n1 + 1, ..., n1 + n2

        and ε1, ε2, ..., εn are the random errors.
More use (Contd.)
    I Suppose we write the above linear model as

                                z = X θ + ε.

    I Then θ = (α, β)^T is the vector of parameters, z is the response vector and ε is
      the random error vector.
    I It is instructive to have a look at the structure of the design matrix

                           [ 1_{n1}   0_{n1} ]
                 Xn×2  =   [                 ]
                           [ 1_{n2}   1_{n2} ]

      where 1_m and 0_m denote columns of m ones and m zeros, so the upper submatrix
      consists of the n1 control rows and the lower one contains the n2 treatment rows.
    I Note that in the above formulation the effect of the treatment is α + β and the
      effect of the control is α, so the change in effect due to the treatment is β.
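A minimal sketch (made-up group sizes and data, not from the notes) showing that least squares on this design recovers the group means: α̂ equals the control mean and β̂ equals the difference of the two group means.

    import numpy as np

    rng = np.random.default_rng(0)

    y_control   = rng.normal(10.0, 1.0, size=6)    # n1 = 6 control responses
    y_treatment = rng.normal(12.0, 1.0, size=8)    # n2 = 8 treatment responses

    z = np.concatenate([y_control, y_treatment])
    x = np.concatenate([np.zeros(6), np.ones(8)])  # dummy: 1 = treatment

    X = np.column_stack([np.ones_like(z), x])      # design matrix [1  x]
    (alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, z, rcond=None)

    print(alpha_hat, y_control.mean())                       # equal
    print(beta_hat, y_treatment.mean() - y_control.mean())   # equal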
More than one categorical predictor
     I   We can include as many factor covariates as we wish in our
         linear model.
     I   All the factors need not have the same number of levels.
     I   But then we have a separate set of dummy variables for each
         factor.
     I   The only wrinkle with having multiple factors is that α, the
         overall intercept, is now the expected value of y for
         individuals where all categorical variables are at their respective
         baseline levels.
     I   With multiple factors we shall have a new issue in our model
         called the interaction effect.
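A small sketch (hypothetical factors and level names) of building a separate set of dummies for each of two factors with pandas; one level of each factor is dropped, so the intercept of a subsequent fit corresponds to both factors at their baseline levels.

    import pandas as pd

    df = pd.DataFrame({
        "fertilizer": ["A", "A", "B", "B", "C", "C"],
        "soil":       ["clay", "sand", "clay", "sand", "clay", "sand"],
    })

    # one set of dummy columns per factor; drop_first keeps a baseline level
    X = pd.get_dummies(df, columns=["fertilizer", "soil"], drop_first=True)
    print(X)    # columns: fertilizer_B, fertilizer_C, soil_sand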
Interaction
     I   Interaction effect is the joint effect of two or more factors.
     I   Suppose we measure the effect of fertilizers and soil quality on
         the yield of plots.
     I   Here the responses are the yields on different plots and there
         are two factors: fertilizer brand and soil type.
     I   It may happen that a particular soil type behaves exceptionally
         in the presence of a particular fertilizer.
     I   This is the interaction effect and should be included in the
         model explicitly.
Interaction effect
     I   Formally, when we say that there are no interactions between
         two variables xi and xj , we mean that

                              ∂E[y|x]/∂xi

         is not a function of xj .
     I   This means there are no interactions if and only if

                       E[y|x] = α + Σ_{i=1}^{p} fi(xi)

         so that each coordinate of x makes its own separate, additive
         contribution to y.
     I   The standard multiple linear regression model of course
         includes no interactions between any of the predictor variables.
     I   But general considerations of statistical modeling give us no
         reason whatsoever to anticipate that interactions are rare, or
         that when they exist they are small.
Interaction (Contd.)
     I   Conventionally interactions are included in a linear model by
         adding a product term.
     I   For example, suppose we are dealing with two factor covariates
         each with two levels so that we include their effect in the linear
         model by two dummy variables x1 and x2 .
     I   Then the interaction effect will be modeled as

                       y = α + β1 x1 + β2 x2 + β3 x1 x2 + ε
     I   It is no longer correct to interpret β1 as
              E [y |X1 = x1 + 1, X2 = x2 ] − E [y |X1 = x1 , X2 = x2 ].
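A brief sketch (simulated data with invented coefficient values) of fitting such a product-interaction model with the statsmodels formula interface, where x1:x2 denotes the product term:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.integers(0, 2, n)                 # two binary dummies
    x2 = rng.integers(0, 2, n)
    y = 1.0 + 2.0*x1 + 0.5*x2 + 1.5*x1*x2 + rng.normal(0, 1, n)

    df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
    fit = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
    print(fit.params)                          # estimates of α, β1, β2, β3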
Interaction (Contd.)
     I   That difference is, rather, β1 + β3 x2 .
     I   Similarly, β2 is no longer the expected difference in y between
         two otherwise-identical cases where x2 differs by 1.
     I   The fact that we can’t give one answer to “how much does the
         response change when we change this variable?”, that the
         correct answer to that question always involves the other
         variable, is what interaction means.
     I   What we can say is that β1 is the slope with regard to x1 when
         x2 = 0, and likewise β2 is how much we expect y to change for
         a one-unit change in x2 when x1 = 0.
     I   β3 is the rate at which the slope on x1 changes as x2 changes,
         and likewise the rate at which the slope on x2 changes with x1 .
Why Product Interactions?
    I   Conventionally linear models use interaction terms that are products
        of the indicator variables, e.g. x1 x2 .
    I   Interactions could alternatively have been introduced by using
        terms like x1 x2 / (1 + |x1 x2 |), or x1 H(x2 − c) where H is the step function
        H(x) = 1 if x ≥ 0 and H(x) = 0 if x < 0.
    I   A natural question is: Is there any special reason to use
        product interactions?
    I   Suppose that the real regression function µ(x) = E (Y |x) is a
        smooth function of all the coordinates of x.
    I   Because it is smooth, we should be able to do a Taylor
        expansion around any particular point, say x ∗ , as

        µ(x) ≈ µ(x ∗ ) + Σ_{i=1}^{p} (xi − xi∗ ) ∂µ/∂xi |_{x=x ∗ }
                       + (1/2) Σ_{i=1}^{p} Σ_{j=1}^{p} (xi − xi∗ )(xj − xj∗ ) ∂²µ/(∂xi ∂xj ) |_{x=x ∗ }
Product interactions (Contd.)
     I   The first term, µ(x ∗ ), is a constant. The next sum will give us
         linear terms in all the xi (plus more constants). The double
         sum after that will give us terms for each product xi xj , plus
         all the squares xi², plus more constants.
     I   Thus, if the true regression function is smooth, and we only
         see a small range of values for each predictor variable, using
         product terms is reasonable, provided we also include
         quadratic terms for each variable.
     I   Further we note that if the xi 's are indicators, the quadratic terms
         are the same as the linear terms (since xi² = xi for a 0/1 variable).
     I   Obviously we can include other types of interaction terms, like
         x1 x2 / (1 + |x1 x2 |), but then we need to form a new column of predictors
         in the design matrix.
     I   Also there may then be difficulty in interpretation.
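As a concrete check of the expansion above, here is a small sketch (using a made-up smooth function, µ(x1, x2) = exp(x1 x2), purely for illustration) whose second-order Taylor polynomial around the origin contains exactly a product interaction term:

    import sympy as sp

    x1, x2 = sp.symbols("x1 x2")
    mu = sp.exp(x1 * x2)        # hypothetical smooth regression function

    # second-order Taylor polynomial around (0, 0), term by term
    at0 = {x1: 0, x2: 0}
    taylor2 = (mu.subs(at0)
               + sp.diff(mu, x1).subs(at0) * x1
               + sp.diff(mu, x2).subs(at0) * x2
               + sp.Rational(1, 2) * sp.diff(mu, x1, 2).subs(at0) * x1**2
               + sp.Rational(1, 2) * sp.diff(mu, x2, 2).subs(at0) * x2**2
               + sp.diff(mu, x1, x2).subs(at0) * x1 * x2)

    print(sp.expand(taylor2))   # x1*x2 + 1: only the product term survives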
Example: Two way ANOVA
    I   Let there be two factors A and B. Suppose p levels of A
        namely A1 , A2 , ..., Ap and q levels of B namely B1 , B2 , ..., Bq
        constitute the entire population.
    I   Therefore we have pq level combinations (Ai , Bj ).
    I   Further suppose µij is the fixed effect due to (Ai , Bj ).
    I   Thus µij is the mean response of observations receiving
        treatment combination (Ai , Bj ).
Interpretation
     I   The interpretation of a treatment mean µij depends on
         whether the study is observational, experimental, or a mixture
         of the two.
     I   In an observational study, the treatment mean µij corresponds
         to the population mean for the elements having the
         characteristics of the i th level of factor A and the j th level of
         factor B.
     I   In an experimental study, the treatment mean µij stands for
         the mean response that would be obtained if the treatment
         consisting of the i th level of factor A and the j th level of factor
         B were applied to all units in the population of experimental
         units about which inferences are to be drawn.
Reparametrization
    I   For all i, j, rewrite µij as
             µij = µ̄00 + (µ̄i0 − µ̄00 ) + (µ̄0j − µ̄00 ) + (µij − µ̄i0 − µ̄0j + µ̄00 )
    I   Here  µ̄00 = (1/pq) Σ_i Σ_j µij  is the general effect (say µ) as it is
        obtained by averaging over the effects of all possible level
        combinations.
    I   Further

                  µ̄i0 = (1/q) Σ_j µij = the fixed effect due to Ai

        ⇒ αi = µ̄i0 − µ̄00 = fixed additional (main) effect due to Ai , with Σ_i αi = 0.
I And

                  µ̄0j = (1/p) Σ_i µij = the fixed effect due to Bj

    ⇒ βj = µ̄0j − µ̄00 = fixed additional (main) effect due to Bj , with Σ_j βj = 0.
I Also µij − µ̄i0 is the additional effect due to Bj when A is held constant at the
  i th level Ai .
I Averaging out over those effects for varying i, we get µ̄0j − µ̄00 .
I Thus
        γij = (µij − µ̄i0 ) − (µ̄0j − µ̄00 ) = fixed interaction effect due to (Ai , Bj )
  with
                  Σ_i γij = 0 for all j    and    Σ_j γij = 0 for all i.
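A minimal numerical sketch (with a made-up table of cell means µij) of this reparametrization, checking the side conditions and that the decomposition reproduces µij exactly:

    import numpy as np

    # hypothetical cell means mu[i, j] for p = 3 levels of A and q = 2 levels of B
    mu = np.array([[10.0, 12.0],
                   [11.0, 15.0],
                   [ 9.0, 10.0]])

    mu_00 = mu.mean()                              # general effect
    alpha = mu.mean(axis=1) - mu_00                # main effects of A
    beta  = mu.mean(axis=0) - mu_00                # main effects of B
    gamma = (mu - mu.mean(axis=1, keepdims=True)
                - mu.mean(axis=0, keepdims=True) + mu_00)   # interaction effects

    print(alpha.sum(), beta.sum())                 # both 0 (side conditions)
    print(gamma.sum(axis=0), gamma.sum(axis=1))    # all 0
    print(np.allclose(mu, mu_00 + alpha[:, None] + beta[None, :] + gamma))  # True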
Interaction or no interaction?
     I   One potential question of interest is when should we include
         interaction in our model?
     I   The fact is that there cannot be any objective answer to this
         question.
     I   Rather let us understand the difference between including or
         not including the interaction term in the model.
     I   For illustration let us consider an example of a simple
         two-factor study in which the effects of gender (male and
         female) and age (young, middle and old) on learning of a task
         are of interest.
      I   When we assume no interaction effects we say the factor
          effects are additive, that is,
                                µij = µ + αi + βj
     I   This can mean two things.
No interaction
     I   The figure shows that Age has some effect (the lines are at
         different heights) whereas gender has no effect (the lines have
         zero slope) on the mean response.
     I   Also the lines are parallel, meaning that there is no
         interaction effect.
No interaction
     I   Here both age and gender have effects on the mean response,
         but still there is no interaction effect because the lines are
         parallel.
     I   Thus it is entirely possible that the factors are additive (that is,
         the factors have main effects but they do not interact).
Interaction
     I   There are main effects of both the factors along with the
         interaction effect.
     I   Is it possible that factors have interaction effects but no main
         effects? (Can some parallel lines intersect ? )
Notes on interactions
     I   In case of multifactor studies some interactions may be zero
         even though the factors are interacting. All interactions must
         equal zero in order for the two factors to be additive.
     I   When two factors interact, the question arises whether the
         factor level means are still meaningful measures.
      I   For instance, suppose in our example the gender factor level
          means come out to be 13 and 11. It may be argued that
          these are misleading measures.
     I   They indicate that some difference exists in learning time for
         men and women, but that this difference is not too great.
     I   These factor level means hide the fact that there is no
         difference in mean learning time between genders for young
         persons, but there is a relatively large difference for old
         persons.
I   In such a case we call the interactions important
    interactions, implying that one should not ordinarily examine
    the effects of each factor separately in terms of the factor level
    means.
I   Sometimes when two factors interact, the interaction effects
    are so small that they are considered to be unimportant
    interactions (the curves are almost parallel).
I   In the case of unimportant interactions, the analysis of factor
    effects can proceed as for the case of no interactions.
I   The determination of whether interactions are important or
    unimportant is admittedly sometimes difficult because it
    depends on the context of the application.
I   The subject area specialist (researcher) needs to play a
    prominent role in deciding whether an interaction is important
    or unimportant. The advantage of unimportant (or no)
    interactions, namely, that one is then able to analyze the
    factor effects separately, is especially great when the study
    contains more than two factors.
I Occasionally, it is meaningful to consider the effects of each factor in
   terms of the factor level means even when important interactions are
   present.
I For example, two methods of teaching Linear Models (hard: using
   projections and standard: using usual sampling distributions) were used in
   teaching students of excellent, good, and medium quantitative ability.
I Important interactions between teaching method and student’s
   quantitative ability were found to be present.
I Students with excellent quantitative ability tended to perform equally well
   with the two teaching methods.
I Students of moderate or good quantitative ability, however, tended to
   perform better when taught by the standard method.
I If equal numbers of students with moderate, good, and excellent
   quantitative ability are to be taught by one of the teaching methods,
   then the method that produces the best average result for all students
   might be of interest even in the presence of important interactions.
I A comparison of the teaching method factor level means would then be
   relevant, even though important interactions are present.
Two way layout with one observation per cell
     I   In many studies we have constraints on cost, time, and
         materials that limit the number of observations that can be
         obtained.
     I   For example, a process engineer in a manufacturing company
         may have only a limited time to experiment with the
         production line.
     I   If the line is available for one day and only eight batches of
         product can be produced in a day, the experiment may have to
         be limited to eight observations.
     I   If the study involves one factor at four levels and a second
         factor at two levels so that there are eight factor level
         combinations, only one replication of the experiment is then
         possible for each treatment.
I   Another reason why some studies contain only one case per
    treatment is that the response of interest is a single aggregate
    measure of performance.
I   For example, in a marketing research study of alternative
    package designs, evaluation of each alternative may require a
    separate market test.
I   The response of interest is the observed market share, and this
    results in a single response for each treatment combination.
I   Special attention is required for the analysis of two-factor
    studies containing only one replication per treatment because
    no degrees of freedom are available for estimation of the
    experimental error with the standard two-factor ANOVA
    model.
No interaction model
    I   When there is only one case for each treatment, we can no longer
        work with the two-factor ANOVA model that includes an interaction
        effect.
    I   This is because no estimate of the error variance σ 2 will be
        available.
    I   Recall that SSE is a sum of squares made up of components
        measuring the variability within each treatment.
    I   With only one case per treatment, there is no variability within
        a treatment, and SSE will then always be zero.
I   A way out of this difficulty is to change the model.
I   We shall see later that if the two factors do not interact so
    that γij = 0, the interaction mean square MSAB has
    expectation σ 2 .
I   Thus, if it is possible to assume that the two factors do not
    interact, we may use MSAB as the estimator of the error
    variance σ 2 and proceed with the analysis of factor effects as
    usual.
I   If it is unreasonable to assume that the two factors do not
    interact, transformations may be tried to remove the
    interaction effects.
Model
    I   We assume that we have a single observation yij corresponding
        to each level combination.
    I   Hence the model we consider here is
                   yij = µij + eij ,   i = 1, 2, ..., p,  j = 1, 2, ..., q
        where µij is the fixed effect due to (Ai , Bj ) and eij is the random error.
    I   With the reparametrized version the model reduces to
                              yij = µ + αi + βj + eij .
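A short sketch (entirely made-up data and level names) of fitting this additive model with statsmodels when there is one observation per cell; the residual line of the ANOVA table then carries the (p − 1)(q − 1) degrees of freedom that would otherwise belong to the interaction, and its mean square plays the role of MSAB as the estimate of σ²:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(2)

    # hypothetical layout: p = 4 levels of A, q = 2 levels of B, one case per cell
    cells = [(a, b) for a in ["A1", "A2", "A3", "A4"] for b in ["B1", "B2"]]
    df = pd.DataFrame(cells, columns=["A", "B"])
    df["y"] = rng.normal(10.0, 1.0, len(df))

    fit = smf.ols("y ~ C(A) + C(B)", data=df).fit()   # additive, no interaction
    print(anova_lm(fit))                               # residual df = (p-1)(q-1) = 3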
Example: Two way layout with more than one observation
per cell
    I   Let there be two factors A and B such that A has p levels
        A1 , A2 , ..., Ap and B has q levels B1 , B2 , ..., Bq . These pq level
        combinations (Ai , Bj ) constitute the entire population of
        interest.
    I   Further we assume that we have m observations corresponding
        to each level combination.
    I   Let yijk be the k th observation receiving the treatment
        combination (Ai , Bj ).
    I   Then the model we consider here is
        yijk = µ + αi + βj + γij + eijk ,  k = 1, 2, ..., m,  i = 1, 2, ..., p,  j = 1, 2, ..., q
        where the eijk are the random errors which we assume to be
          I   independent (over i, j and k)
          I   distributed as N(0, σ²) for all i, j, k.
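A brief sketch (simulated data, hypothetical level names) of fitting this model with replicates in statsmodels; C(A) * C(B) expands to the two main effects plus the interaction C(A):C(B):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(3)

    # hypothetical balanced layout: p = 3, q = 2, m = 4 replicates per cell
    rows = [(a, b) for a in ["A1", "A2", "A3"]
                   for b in ["B1", "B2"]
                   for _ in range(4)]
    df = pd.DataFrame(rows, columns=["A", "B"])
    df["y"] = rng.normal(10.0, 1.0, len(df))

    fit = smf.ols("y ~ C(A) * C(B)", data=df).fit()
    print(anova_lm(fit, typ=2))   # rows for A, B, the A:B interaction, and error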
Ordinal Factors
     I In case of ordinal variables the levels can be put in a sensible order, but
        there’s no implication that the distance from one level to the next is
        constant.
     I We have basically two ways to handle them:
           I   Ignore the ordering and treat them like nominal categorical
               variables.
           I   Ignore the fact that they're only ordinal and not metric, assign
               them numerical codes (say 1, 2, 3, . . . ) and treat them like
               ordinary numerical variables.
     I The first procedure is unbiased, but can end up dealing with a lot of
       distinct coefficients.
     I It also has the drawback that if the relationship between Y and the
        categorical variable is monotone, that may not be respected by the
        coefficients we estimate.
     I The second procedure is very easy, but usually without any substantive or
        logical basis. It implies that each step up in the ordinal variable will
        predict exactly the same difference in y , and why should that be the
        case?
     I If, after treating an ordinal variable like a nominal one, we get contrasts
        which are all (approximately) equally spaced, we might then try the second
        approach of assigning equally spaced numerical codes.
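A minimal sketch (hypothetical ordinal variable and level names) showing the two coding strategies side by side with pandas:

    import pandas as pd

    df = pd.DataFrame({"ability": ["medium", "good", "excellent", "good", "medium"]})

    # strategy 1: treat as nominal -- one dummy per non-baseline level
    nominal = pd.get_dummies(df["ability"], prefix="ability", drop_first=True)

    # strategy 2: treat as numeric -- equally spaced codes respecting the order
    numeric = df["ability"].map({"medium": 1, "good": 2, "excellent": 3})

    print(nominal)
    print(numeric)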
Factors along with quantitative covariates
     I   It is perfectly possible that in our linear model some covariates
         are factors and others are numeric variables.
     I   To illustrate things let us assume that we are dealing with a
         factor having two levels and there are other p numeric
         covariates x1 , x2 , ..., xp .
      I   Thus introducing a single dummy variable xb we can write the
          linear model as

                       y = α + βb xb + Σ_{i=1}^{p} βi xi + ε.
     I   Geometrically, if we plot the expected value of y against
         x1 , ...xp , we will now get two regression surfaces.
     I   They will be parallel to each other, and offset by βb .
     I   We thus have a model where each category gets its own
         intercept: α for the baseline level and α + βb for the other
         class.
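     I   As a minimal sketch (not part of the original slides), such a
         model could be fit in Python with the statsmodels formula
         interface; the data frame df, the factor g and the covariates
         x1 , x2 below are hypothetical names and the data is simulated
         purely for illustration.

         # Hypothetical illustration: two-level factor g plus numeric x1, x2.
         # C(g) builds the dummy variable; level "A" is the baseline.
         import numpy as np
         import pandas as pd
         import statsmodels.formula.api as smf

         rng = np.random.default_rng(0)
         n = 200
         g = rng.choice(["A", "B"], size=n)        # two-level factor
         x1 = rng.normal(size=n)
         x2 = rng.normal(size=n)
         xb = (g == "B").astype(float)             # dummy variable for level B
         y = 1.0 + 2.0 * xb + 0.5 * x1 - 1.0 * x2 + rng.normal(scale=0.3, size=n)
         df = pd.DataFrame({"y": y, "g": g, "x1": x1, "x2": x2})

         # Common slopes on x1, x2; intercept is `Intercept` for level A
         # and `Intercept + C(g)[T.B]` for level B.
         fit = smf.ols("y ~ C(g) + x1 + x2", data=df).fit()
         print(fit.params)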
Why not just split the data?
     I   If we want to give each class its own intercept, why not just
         split the data and estimate two models, one for each class?
Why not just split the data?
     I   If we want to give each class its own intercept, why not just
         split the data and estimate two models, one for each class?
     I   The answer is that sometimes we’ll do just this, especially if
         there’s a lot of data for each class.
Why not just split the data?
     I   If we want to give each class its own intercept, why not just
         split the data and estimate two models, one for each class?
     I   The answer is that sometimes we’ll do just this, especially if
         there’s a lot of data for each class.
     I   However, if the regression surfaces for the two categories really
         are parallel to each other, by splitting the data we’re losing
         some precision in our estimate of the common slopes, without
         gaining anything.
Why not just split the data?
     I   If we want to give each class its own intercept, why not just
         split the data and estimate two models, one for each class?
     I   The answer is that sometimes we’ll do just this, especially if
         there’s a lot of data for each class.
     I   However, if the regression surfaces for the two categories really
         are parallel to each other, by splitting the data we’re losing
         some precision in our estimate of the common slopes, without
         gaining anything.
     I   In fact, if the two surfaces are nearly parallel, for moderate
         sample sizes the small bias that comes from pretending the
         slopes are all equal can be overwhelmed by the reduction in
         variance, so that the resulting MSE of the estimates of
         parameters are less.
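     I   A small simulation sketch (not part of the original slides,
         hypothetical names and data) illustrating this trade-off: when
         the two groups truly share a slope, the pooled dummy-variable
         fit estimates that slope more precisely than two split fits.

         # Hypothetical simulation: both groups share the slope 0.5.
         import numpy as np
         import pandas as pd
         import statsmodels.formula.api as smf

         rng = np.random.default_rng(1)
         n = 60                                    # moderate per-group size
         g = np.repeat(["A", "B"], n)
         x = rng.normal(size=2 * n)
         xb = (g == "B").astype(float)
         y = 1.0 + 2.0 * xb + 0.5 * x + rng.normal(size=2 * n)
         df = pd.DataFrame({"y": y, "g": g, "x": x})

         pooled = smf.ols("y ~ C(g) + x", data=df).fit()        # common slope
         fit_A = smf.ols("y ~ x", data=df[df.g == "A"]).fit()   # split fits
         fit_B = smf.ols("y ~ x", data=df[df.g == "B"]).fit()

         # The pooled fit uses all 2n points for the slope, so its standard
         # error is typically smaller than that of either split fit.
         print("pooled slope SE:", pooled.bse["x"])
         print("split slope SEs:", fit_A.bse["x"], fit_B.bse["x"])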
Interaction of Categorical and Numerical Variables
     I   If we multiply the indicator variable,say xb for a binary
         category, with an ordinary numerical variable, say x1 , we get a
         different slope on xi for each category:
                          y = α + β1 x1 + β1b xb x1 + 
Interaction of Categorical and Numerical Variables
     I   If we multiply the indicator variable,say xb for a binary
         category, with an ordinary numerical variable, say x1 , we get a
         different slope on xi for each category:
                          y = α + β1 x1 + β1b xb x1 + 
     I   When xb = 0, the slope on x1 is β1 , but when xb = 1, the
         slope on x1 is β1 + β1b
Interaction of Categorical and Numerical Variables
     I   If we multiply the indicator variable,say xb for a binary
         category, with an ordinary numerical variable, say x1 , we get a
         different slope on xi for each category:
                           y = α + β1 x1 + β1b xb x1 + 
     I   When xb = 0, the slope on x1 is β1 , but when xb = 1, the
         slope on x1 is β1 + β1b
     I   The coefficient for the interaction is the difference in slopes
         between the two categories.
Interaction of Categorical and Numerical Variables
     I   If we multiply the indicator variable,say xb for a binary
         category, with an ordinary numerical variable, say x1 , we get a
         different slope on xi for each category:
                           y = α + β1 x1 + β1b xb x1 + 
     I   When xb = 0, the slope on x1 is β1 , but when xb = 1, the
         slope on x1 is β1 + β1b
     I   The coefficient for the interaction is the difference in slopes
         between the two categories.
     I   It says that the categories share a common intercept, but their
         regression lines are not parallel (unless β1b = 0).
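     I   A minimal sketch (not part of the original slides) of this
         model, building the product xb x1 by hand so the fitted
         coefficients map directly onto β1 and β1b ; the column names
         and simulated data are hypothetical.

         # Hypothetical illustration: common intercept, group-specific slope.
         import numpy as np
         import pandas as pd
         import statsmodels.formula.api as smf

         rng = np.random.default_rng(2)
         n = 150
         g = rng.choice(["A", "B"], size=n)
         x1 = rng.normal(size=n)
         xb = (g == "B").astype(float)
         y = 1.0 + 0.5 * x1 + 0.8 * xb * x1 + rng.normal(scale=0.3, size=n)
         df = pd.DataFrame({"y": y, "x1": x1, "xb_x1": xb * x1})

         # Slope on x1 is beta_1 when xb = 0 and beta_1 + beta_1b when xb = 1.
         fit = smf.ols("y ~ x1 + xb_x1", data=df).fit()
         print(fit.params)    # Intercept, x1 (beta_1), xb_x1 (beta_1b)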
Interaction (Contd.)
     I   We could expand the model by letting each category have its
         own slope and its own intercept:
                      y = α + βb xb + β1 x1 + β1b xb x1 + ε
Interaction (Contd.)
     I   We could expand the model by letting each category have its
         own slope and its own intercept:
                       y = α + βb xb + β1 x1 + β1b xb x1 + ε
     I   This model, where “everything is interacted with the category”,
         is very close to just running two separate regressions, one per
         category.
Interaction (Contd.)
     I   We could expand the model by letting each category have its
         own slope and its own intercept:
                       y = α + βb xb + β1 x1 + β1b xb x1 + ε
     I   This model, where “everything is interacted with the category”,
         is very close to just running two separate regressions, one per
         category.
     I   It does, however, insist on having a single noise variance σ 2
         (which separate regressions wouldn’t accomplish).
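     I   A minimal sketch (not part of the original slides, hypothetical
         names and data) comparing the fully interacted model with two
         separate regressions: the coefficients agree after
         reparametrization, but the interacted model pools the residuals
         into one estimate of the error variance.

         # Hypothetical illustration: own intercept and slope per group.
         import numpy as np
         import pandas as pd
         import statsmodels.formula.api as smf

         rng = np.random.default_rng(3)
         n = 100
         g = np.repeat(["A", "B"], n)
         x1 = rng.normal(size=2 * n)
         xb = (g == "B").astype(float)
         y = 1.0 + 1.5 * xb + 0.5 * x1 + 0.8 * xb * x1 + rng.normal(scale=0.4, size=2 * n)
         df = pd.DataFrame({"y": y, "g": g, "x1": x1})

         # "x1 * C(g)" expands to x1 + C(g) + x1:C(g).
         full = smf.ols("y ~ x1 * C(g)", data=df).fit()
         sep_A = smf.ols("y ~ x1", data=df[df.g == "A"]).fit()
         sep_B = smf.ols("y ~ x1", data=df[df.g == "B"]).fit()

         # Group A's intercept/slope are `Intercept` and `x1`; group B's add
         # the C(g)[T.B] terms, matching the two separate fits.
         print(full.params)
         print(sep_A.params, sep_B.params)
         # One pooled residual variance vs. two separate ones:
         print(full.scale, sep_A.scale, sep_B.scale)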