Lab 8
Simple regression equation

$$Y = \alpha + \beta x + e$$

where $\alpha$ is the intercept, $\beta$ the slope, and $Y$ the observed value.

NORMAL EQUATIONS

The least squares estimators $A$ and $B$ are used to estimate $\alpha$ and $\beta$:

$$B = \frac{\sum_i x_i Y_i - n\bar{x}\bar{Y}}{\sum_i x_i^2 - n\bar{x}^2}, \qquad A = \bar{Y} - B\bar{x}$$

They identify the line that minimizes the sum of the squared differences between the observed data points and the predicted values on the line.
         Simple regression equation – in R
      set.seed(123) # Create some random data
n <- 10; x <- 1:n; y <- 1:n + runif(n, -2, 2)
      df <- data.frame(x = x, y = y)
      # Least Squares Estimates
      B <- (sum(x * y) - n * mean(x) * mean(y)) /
          (sum(x^2) - n * mean(x)^2)
      A <- mean(y) - B * mean(x)
      # Plot the regression line
      library(ggplot2)
      ggplot(df, aes(x = x, y = y)) +
          geom_segment(aes(x = x, xend = x, y = A + B*x,
              yend = y), color = "cyan4") +
          geom_point(color = "brown3") +
          geom_abline(aes(slope = B, intercept = A)) +
          labs(title = "Estimated regression line",
               subtitle = paste0("A = ", round(A, 2), ", B = ", round(B, 2)))
         The lm() function
         The lm() function uses the least squares regression method to fit a linear model:
► y is a vector of dependent variable values
► x is a vector of independent variable values
► data is the data.frame where the x and y variables are contained
           model <- lm(y ~ x, data = df)
           model
           #    Call:
           #    lm(formula = y ~ x, data = df)
           #    Coefficients:
           #    (Intercept)           x
           #        0.2702       1.0078
         lm()’s output
Using names() on the linear model estimated by lm(), we can see the many outputs
available:
names(model)
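For the model above it returns the components of an lm object; the output should look like the following (sketched from a standard lm fit):
#  [1] "coefficients"  "residuals"     "effects"       "rank"
#  [5] "fitted.values" "assign"        "qr"            "df.residual"
#  [9] "xlevels"       "call"          "terms"         "model"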
         Where:
► coefficients is a named vector of coefficients
► the residuals are the response minus the fitted.values
► fitted.values are the fitted mean values
         lm()’s output
To obtain the estimated coefficients we can use the following code:
           coefficients(model); coef(model)
           model$coefficients
           # (Intercept) x
           # 0.2702156 1.0077772
Exercise 1 – Least squares estimates
The dataset below contains the weights and flipper lengths of 10 penguins from the
palmerpenguins dataset:
             weights <- c(4100, 3800, 4300, 4550, 3775, 2900, 4600, 5400, 4100, 4500)
             flipper_lengths <- c(216, 191, 210, 205, 199, 187, 209, 228, 210, 211)
             A.    Build a simple regression line using the flipper length as the independent variable and the
                   weight as the response variable. Manually compute the least squares estimates for the
                   intercept and slope .
             B.    Use the lm() function and compare the coefficients to those obtained in A.
             C.    Use ggplot2() to plot the weights and flipper lengths as points. Then add the regression
                   line using the geom_abline() function.
D.    Remove the regression line from the plot above and replace it using the geom_smooth()
      function with method = "lm" and formula = "y ~ x".
SOLUTION
               Exercise 1 – Least squares estimates
             A.    Build a simple regression line using the flipper length as the independent variable and
                   the weight as the response variable. Manually compute the least squares estimates
                   for the intercept and slope.
               weights <- c(4100, 3800, 4300, 4550, 3775, 2900, 4600, 5400, 4100, 4500)
               flipper_lengths <- c(216, 191, 210, 205, 199, 187, 209, 228, 210, 211)
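The computation itself is not reproduced on the slide; a minimal sketch applying the normal equations above (the names slope, intercept, and data are chosen to match the plotting code in part C):

# Least squares estimates, computed manually
n <- length(weights)
slope <- (sum(flipper_lengths * weights) -
    n * mean(flipper_lengths) * mean(weights)) /
    (sum(flipper_lengths^2) - n * mean(flipper_lengths)^2)
intercept <- mean(weights) - slope * mean(flipper_lengths)
data <- data.frame(weights, flipper_lengths)
slope; intercept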
SOLUTION
               Exercise 1 – Least squares estimates
             B.    Use the lm() function and compare the coefficients to those obtained in A.
             C.    Use ggplot2() to plot the weights and flipper lengths as points. Then add the
                   regression line using the geom_abline() function.
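The lm() fit for part B is not shown on the slide; a minimal sketch (data, intercept, and slope as defined in part A):

model <- lm(weights ~ flipper_lengths, data = data)
coef(model)  # compare with the manual intercept and slope from A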
               library(ggplot2)
               ggplot(data, aes(x = flipper_lengths,
                       y = weights)) +
                   geom_point() +
                   geom_abline(aes(intercept = intercept,
                       slope = slope))
SOLUTION
               Exercise 1 – Least squares estimates
             D.    Remove the regression line from the plot above and replace it using the
                   geom_smooth() function with method = "lm" and formula = "y ~ x".
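The code is not reproduced on the slide; a minimal sketch. Note that inside geom_smooth() the formula is written in terms of the plot's aesthetics, y ~ x:

ggplot(data, aes(x = flipper_lengths, y = weights)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)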
Assessing the model: plot the regression line

Graphical evaluation of a simple regression line involves visually inspecting the scatter plot of the data points and the fitted line, to assess how well the line captures the overall trend and the relationship between the variables.

[Figure: four example scatterplots, showing a strong relationship, a weak relationship, no relationship, and no linear relationship]
Assessing the model: amount of variation

The dependent variable $y$ has a certain amount of variability. When we build a regression model we can decompose the total variability into two components: a portion explained by the independent variable $x$ and the residual, which remains unexplained.

$$S_{yy} = SS_E + SS_R$$

$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$

► $S_{yy}$ = Total sum of squares, based on $(y_i - \bar{y})^2$
► $SS_E$ = Explained sum of squares, based on $(\hat{y}_i - \bar{y})^2$ (the part explained thanks to the regression)
► $SS_R$ = Residual sum of squares, based on $(y_i - \hat{y}_i)^2$ (the unexplained part)

[Figure: scatterplot with the fitted line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and the mean $\bar{y}$, annotating the total, explained, and unexplained deviations for one point $y_i$]
Coefficient of determination $R^2$
         ► By decomposing the errors, we can gain insights into the variability explained by the
           model and the remaining unexplained variation.
         ► The coefficient of determination 𝑹𝟐 measures the proportion of the variance in the
           dependent variable that is explained by the independent variables in a regression
           model.
         ► It ranges from 0 to 1, with 0 indicating no explanation and 1 indicating full explanation.
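In terms of the decomposition above (and as computed in Exercise 2 below):

$$R^2 = \frac{SS_E}{S_{yy}} = 1 - \frac{SS_R}{S_{yy}}$$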
Exercise 2 – Determination index
A.    Estimate a model without the intercept. Minimise the sum of squares for a model with only a
      slope and compute its value on the dataset. Then use the lm() function with a formula
      of the form "y ~ -1 + x" to estimate the same model, named model1, and compare the
      two values.
             B.    As above, estimate an intercept-only model, named model2. Then use the lm() function
                   with a formula of the form "y ~ 1" to estimate the same model and compare the two
                   values. What do you notice?
             C.    Estimate a third model, named model3, with intercept and slope. Plot the regression lines
                   for model1, model2, and model3 over the scatterplot of the data.
D.    For model3, compute $S_{yy}$, $SS_E$, and $SS_R$. Then compute the determination index.
SOLUTION
               Exercise 2 – Determination index
A.    Estimate a model without the intercept. Minimise the sum of squares for a model with only a
      slope and compute its value on the dataset. Then use the lm() function with a formula
      of the form "y ~ -1 + x" to estimate the same model, named model1, and compare
      the two values.
For the model $y_i = \beta x_i + e_i$, minimizing $\sum_{i=1}^{n}(y_i - b\,x_i)^2$ over $b$ gives $b = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$.
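The corresponding R code is not reproduced on the slide; a minimal sketch, assuming the exercise's data frame data with columns hours and scores:

b <- sum(data$hours * data$scores) / sum(data$hours^2)  # manual estimate
model1 <- lm(scores ~ -1 + hours, data = data)          # same model via lm()
b; coef(model1)                                         # the two values coincide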
SOLUTION
               Exercise 2 – Determination index
             B.    As above, estimate an intercept-only model, named model2. Then use the lm()
                   function with a formula of the form "y ~ 1" to estimate the same model and
                   compare the two values. What do you notice?
For the intercept-only model $y_i = \alpha + e_i$, minimizing $\sum_{i=1}^{n}(y_i - a)^2$ over $a$ gives $a = \dfrac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$, i.e. the sample mean:

mean(scores)
model2 <- lm(scores ~ 1, data = data)
model2
SOLUTION
               Exercise 2 – Determination index
             C.    Estimate a third model, named model3, with intercept and slope. Plot the regression
                   lines for model1, model2, and model3 over the scatterplot of the data.
               model3 <- lm(scores ~ hours, data = data)
               ggplot(data, aes(x = hours, y = scores)) +
                   geom_point() +
                   geom_abline(aes(intercept = 0,
                       slope = coef(model1), color = "Slope")) +
                   geom_abline(aes(intercept = coef(model2),
                       slope = 0, color = "Intercept")) +
                   geom_abline(aes(intercept = coef(model3)[1],
                       slope = coef(model3)[2],
                       color = "Intercept and Slope")) +
                   guides(color = guide_legend("Model"))
SOLUTION
               Exercise 2 – Determination index
D. For model3, compute $S_{yy}$, $SS_E$, and $SS_R$. Then compute the determination index.
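The sums of squares themselves are not shown on the slide; a minimal sketch for model3 (Syy and SSr are the names used in the line below; SSe is an assumed name for the explained part):

Syy <- sum((data$scores - mean(data$scores))^2)    # total sum of squares
SSe <- sum((fitted(model3) - mean(data$scores))^2) # explained sum of squares
SSr <- sum(residuals(model3)^2)                    # residual sum of squares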
               R2 <- 1 - SSr/Syy
# 82.85% of the variability in scores is explained by the different
# amounts of study hours!
Assessing the model: residuals

Residuals are the differences between the observed values $y$ of the dependent variable and the values $\hat{y}$ predicted by the regression model.

[Figure: two residual plots, one satisfactory and one funnel-shaped]
         Assessing the model: residuals
Residual analysis helps assess how well the regression model captures the patterns and trends in the data. Residuals are usually standardised by dividing them by the unbiased estimate of the standard deviation:

$$\text{standardised residual}_i = \frac{y_i - \hat{y}_i}{\sqrt{\dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2}}}$$

where the numerator $y_i - \hat{y}_i$ is the $i$-th residual and the denominator is the unbiased estimate of the standard deviation.
         They are useful for:
► Checking normality: standardized residuals should be distributed as a standard
  normal random variable.
         ► Detecting Heteroscedasticity: Plotting standardized residuals against the predicted
           values or the independent variable can reveal patterns, such as a funnel shape or cone
           shape, which may suggest heteroscedasticity.
► Identifying outliers: points with standardized residuals outside a certain range
  (e.g., ±2 or ±3) may indicate observations with a large impact on the regression
  model. Such outliers can unduly influence the estimated coefficients or the overall
  fit of the model.
         Assessing the model: residuals’ plots
# Define a model
m1 <- lm(y ~ x, data = df)
# Extract the residuals
res <- residuals(m1)
# Extract the fitted values
fitted <- fitted(m1)
# Compute the standardized residuals
stdres <- res /
   sqrt(sum(res^2) / m1$df.residual)
# Plot the standardized residuals against
# the fitted values
ggplot(df, aes(x = fitted, y = stdres)) +
    geom_point(color = "brown3") +
    geom_hline(aes(yintercept = 0),
        linetype = "dashed") +
    labs(x = "Fitted values",
        y = "Standardized Residuals")
         Assessing the model: residuals’ normality
          Residuals' normality in simple linear regression can be
          assessed using several methods:
► Normality tests: Shapiro-Wilk or Anderson-Darling; p-values > 0.05 are
  consistent with normality (see the sketch after this list).
          ► Skewness and Kurtosis: values close to zero.
          ► Visual inspection: residuals vs fitted values, residuals
            histogram and density plot, qq-plots.
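A minimal sketch of these checks in base R, using the standardized residuals stdres computed on the previous slide (the Anderson-Darling test would additionally require the nortest package):

shapiro.test(stdres)           # Shapiro-Wilk normality test
hist(stdres, freq = FALSE)     # histogram of the residuals
qqnorm(stdres); qqline(stdres) # Q-Q plot against the normal quantiles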
         Linear model summary()
In R, calling summary() on a linear model object (e.g., produced by the lm()
function) returns all the summary information about the regression model.
[Figure: summary() output for the model, with the coefficient estimates highlighted]
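For the model fitted earlier, the call is simply the one below; the printed report contains the Call, the residual quantiles, the coefficient table (estimates, standard errors, t values, and p-values), the residual standard error, the R-squared values, and the F-statistic:

summary(model)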
Exercise 3 – Regression on penguins
               library(palmerpenguins)
               df <- na.omit(penguins[, c("body_mass_g", "flipper_length_mm")])
             A.    Plot the body weights, the flipper lengths and draw the regression line in red.
             B.    How much variability of the body weights is explained by the model? (hint: compute the determination
                   index)
             C.    Compute the standardised residuals and plot them together with the fitted values. Are they
                   satisfactory? (hint: do you see any pattern?)
             D.    Compute the theoretical and observed quantiles for the standardised residuals. Draw the Q-Q plot.
                   Then use the function plot() on the regression model output. What are you obtaining?
             E.    Do A., B., C., and D., but instead of using the flipper length as independent variable, see if the
                   bill_length_mm variable could be a good regressor for body weight (hint: how is the determination
                   index? How are the standardised residuals distributed?)
SOLUTION
               Exercise 3 – Regression on penguins
A. Plot the body weights, the flipper lengths and draw the regression line in red.
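The code is not reproduced on this slide; a minimal sketch, mirroring the structure used for bill_length_mm in part E (m1 is the model name the later slides assume):

m1 <- lm(body_mass_g ~ flipper_length_mm, data = df)
ggplot(df, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point() +
    geom_abline(aes(intercept = coef(m1)[1],
        slope = coef(m1)[2]), color = "red") +
    labs(x = "Flipper length (mm)", y = "Body mass (g)")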
SOLUTION
               Exercise 3 – Regression on penguins
B.    How much variability of the body weights is explained by the model? (hint: compute the
      determination index)

# Automatically
summary(m1)
# Or manually
SSy <- sum((df$body_mass_g -
    mean(df$body_mass_g))^2)
SSexp <- sum((fitted(m1) -
    mean(df$body_mass_g))^2)
SSunexp <- sum(residuals(m1)^2)
R2 <- SSexp/SSy
SOLUTION
               Exercise 3 – Regression on penguins
             C.    Compute the standardised residuals and plot them together with the fitted values.
                   Are they satisfactory?
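The code is not shown on the slide; a minimal sketch following the standardized-residual computation from the earlier residuals slide (df$stdres is the name the part-D code assumes):

df$fitted <- fitted(m1)
df$stdres <- residuals(m1) /
    sqrt(sum(residuals(m1)^2) / m1$df.residual)
ggplot(df, aes(x = fitted, y = stdres)) +
    geom_point() +
    geom_hline(aes(yintercept = 0), linetype = "dashed") +
    labs(x = "Fitted values", y = "Standardized Residuals")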
SOLUTION
               Exercise 3 – Regression on penguins
             D.    Compute the theoretical and observed quantiles for the
                   standardised residuals. Draw the Q-Q plot. Then use the function
                   plot() on the regression model output. What are you obtaining?
# Assess normality via a Q-Q plot of the standardized residuals
probs <- seq(0, 1, length.out = nrow(df))
df$observed_q <- quantile(df$stdres, probs = probs)
df$theoretical_q <- qnorm(probs)
ggplot(df, aes(x = theoretical_q, y = observed_q)) +
    geom_point() +
    geom_abline(aes(intercept = 0, slope = 1)) +
    labs(x = "Theoretical quantiles", y = "Observed quantiles")
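The second half of part D calls plot() on the model object; a minimal sketch (it produces R's four standard diagnostic plots, including a Q-Q plot of the standardized residuals):

par(mfrow = c(2, 2))
plot(m1)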
SOLUTION
               Exercise 3 – Regression on penguins
             E.    Do A., B., C., and D. but instead of using the flipper length as
                   independent variable, see if the bill_length_mm variable could
                   be a good regressor for body weight.
               df <- na.omit(penguins[, c("body_mass_g", "bill_length_mm")])
               m2 <- lm(body_mass_g ~ bill_length_mm, data = df)
               ggplot(data = df, aes(x = bill_length_mm, y = body_mass_g)) +
                   geom_point() + geom_abline(aes(slope = coef(m2)[2],
                       intercept = coef(m2)[1]), color = "red") +
                   labs(x = "Bill length (mm)", y = "Body mass (g)",
                        title = "Palmer penguins",
                        subtitle = "Bill length and body mass")
               summary(m2)
SOLUTION
               Exercise 3 – Regression on penguins
par(mfrow = c(2, 2))
               plot(m2)