home_work_2
lokesh
                                         2023-03-09
1)
Intercept
The corresponding hypothesis test, with null and alternative hypotheses:
H0:β0=0
Ha:β0≠0
When x1 = x2 = x3 = 0, y takes the value β0. In other words, β0 is the level of sales we
would anticipate if there were no TV, radio, or newspaper advertising.
Since p < 0.0001, we have strong evidence to reject H0 in favor of the alternative. In
other words, there is substantial evidence that sales would not be zero even in the
absence of newspaper, radio, and television advertising.
TV & radio
The corresponding test for TV, with null and alternative hypotheses:
H0:β1=0 Ha:β1≠0
and similarly for radio:
H0:β2=0 Ha:β2≠0
In each case the null hypothesis states that, with all other predictors held fixed, the
corresponding variable has no effect on y.
Because the p-value is below 0.0001 in both cases, we reject H0 and conclude that a
change in the TV or radio budget does have some effect on sales.
Newspaper
The corresponding test for newspaper, with null and alternative hypotheses:
H0:β3=0 Ha:β3≠0
•   The null hypothesis states that newspaper advertising has no impact on y (Sales)
    when the other variables (TV and radio) are held fixed.
•   Given that p = 0.8599, there is insufficient evidence to reject this hypothesis;
    adjusting the newspaper budget appears to have no effect on sales.
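For reference, the p-values discussed above come from the multiple regression of sales on
all three media. A minimal sketch of that fit, assuming the Advertising.csv file from the
ISLR website is available in the working directory:
# Refit the Advertising regression behind the p-values quoted above
# (assumes Advertising.csv is present; columns are TV, radio, newspaper, sales)
adv = read.csv("Advertising.csv")
adv_lm = lm(sales ~ TV + radio + newspaper, data = adv)
summary(adv_lm)$coefficients # t-statistics and p-values for each coefficient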
3)
a)
The third answer (C) is correct.
To check this case, let us fix values for GPA and IQ: X1 = a and X2 = b. Substituting into
the least squares regression model and rearranging, we have
ŷ = 50 + 20a + 0.07b + 0.01ab + (35 − 10a)X3.
•   The coefficient of X3 is positive for lower GPA values (a < 3.5), which means that
    for a fixed IQ and a fixed GPA below 3.5, females earn more on average than males,
    since X3 = 1 for females and 0 for males. The coefficient of X3 turns negative for
    GPA values above 3.5, which means that at a fixed IQ and a fixed GPA above 3.5,
    males earn more on average. Hence the model predicts that, provided the GPA is high
    enough, males earn more on average than females for a fixed value of IQ and GPA (a
    quick numeric check follows).
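A quick numeric check of this sign change; x3_coef is a hypothetical helper, not part of
the fitted model:
# Coefficient multiplying X3 (the female indicator) as a function of GPA
x3_coef = function(gpa) 35 - 10 * gpa
x3_coef(c(3.0, 3.5, 4.0)) # 5 0 -5: positive below GPA = 3.5, negative above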
b)
137.1K
Substituting into the least squares regression model, we have
ŷ = 50 + 20(4.0) + 0.07(110) + 35(1) + 0.01(4.0)(110) − 10(4.0)(1) = 137.1.
This gives an estimated salary of $137,100 for a female with an IQ of 110 and a GPA of
4.0.
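The arithmetic can be verified directly in R:
# Plug-in check for a female (X3 = 1) with IQ = 110 and GPA = 4.0
50 + 20*4.0 + 0.07*110 + 35*1 + 0.01*4.0*110 - 10*4.0*1
## [1] 137.1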
c)
False.
•   The size of an interaction term's coefficient by itself neither proves nor disproves
    an interaction effect. To truly determine whether there is evidence of an
    interaction, we would need the p-value for the coefficient of the interaction term,
    which we could either be given or calculate.
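As a sketch of what that test would look like, assuming a hypothetical data frame
salary_df with columns Salary, GPA, IQ, and Gender (no such data is given in the
question):
# Hypothetical: fit the model with its interaction terms and read the p-value
# on the GPA:IQ row of the coefficient table
fit = lm(Salary ~ GPA + IQ + Gender + GPA:IQ + GPA:Gender, data = salary_df)
summary(fit) # inspect Pr(>|t|) for GPA:IQ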
10)
(a)
library(ISLR)
attach(Carseats)
carseats_lm = lm(Sales~Price+Urban+US,data=Carseats)
summary(carseats_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.9206 -1.6220 -0.0564  1.5786  7.0581
##
##    Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
##    (Intercept) 13.043469   0.651012 20.036 < 2e-16 ***
##    Price       -0.054459   0.005242 -10.389 < 2e-16 ***
##    UrbanYes    -0.021916   0.271650 -0.081     0.936
##    USYes        1.200573   0.259042   4.635 4.86e-06 ***
##    ---
##    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##    Residual standard error: 2.472 on 396 degrees of freedom
##    Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
##    F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b)
•   The intercept (13.04) is the predicted sales when Price = 0 and the store is
    non-urban and outside the US, i.e. the baseline with all predictors at zero.
•   The Price coefficient is negative, so sales drop by roughly 54 seats (0.054 × 1000,
    since Sales is measured in thousands of units) for every $1 increase in price (see
    the check below).
•   The Urban=Yes coefficient is not statistically significant. The US=Yes coefficient
    is 1.2, which implies an average increase of about 1,200 car seats sold when
    US=Yes; this predictor indicates stores located in the US.
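A quick check of the Price effect using predict(); the prices $100 and $101 are
arbitrary illustration values:
# Predicted sales at prices differing by $1, other predictors held fixed
nd = data.frame(Price = c(100, 101), Urban = "Yes", US = "Yes")
diff(predict(carseats_lm, nd)) * 1000 # about -54 units (Sales is in 1000s)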
(c)
Dummy variables used by R:
attach(Carseats)
## The following objects are masked from Carseats (pos = 3):
##
##     Advertising, Age, CompPrice, Education, Income, Population, Price,
##     Sales, ShelveLoc, Urban, US
contrasts(US)
##     Yes
## No    0
## Yes   1
contrasts(Urban)
##     Yes
## No    0
## Yes   1
Equation:
Sales = 13.04 − 0.05·Price − 0.02·Urban(Yes: 1, No: 0) + 1.20·US(Yes: 1, No: 0)
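The same coefficients can be read directly off the fitted model:
round(coef(carseats_lm), 2)
## (Intercept)       Price    UrbanYes       USYes
##       13.04       -0.05       -0.02        1.20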
(d)
Using all variables:
carseats_all_lm = lm(Sales~.,data=Carseats)
summary(carseats_all_lm)
##
##    Call:
##    lm(formula = Sales ~ ., data = Carseats)
##
##    Residuals:
##        Min      1Q   Median       3Q       Max
##    -2.8692 -0.6908   0.0211   0.6636    3.4115
##
##    Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
##    (Intercept)        5.6606231 0.6034487    9.380 < 2e-16 ***
##    CompPrice          0.0928153 0.0041477 22.378 < 2e-16 ***
##    Income             0.0158028 0.0018451    8.565 2.58e-16 ***
##    Advertising        0.1230951 0.0111237 11.066 < 2e-16 ***
##    Population         0.0002079 0.0003705    0.561    0.575
##    Price             -0.0953579 0.0026711 -35.700 < 2e-16 ***
##    ShelveLocGood      4.8501827 0.1531100 31.678 < 2e-16 ***
##    ShelveLocMedium    1.9567148 0.1261056 15.516 < 2e-16 ***
##    Age               -0.0460452 0.0031817 -14.472 < 2e-16 ***
##    Education         -0.0211018 0.0197205 -1.070      0.285
##    UrbanYes           0.1228864 0.1129761    1.088    0.277
##    USYes             -0.1840928 0.1498423 -1.229      0.220
##    ---
##    Signif. codes:    0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##    Residual standard error: 1.019 on 388 degrees of freedom
##    Multiple R-squared: 0.8734, Adjusted R-squared: 0.8698
##    F-statistic: 243.4 on 11 and 388 DF, p-value: < 2.2e-16
•   The null hypothesis can be rejected for the following: CompPrice, Income,
    Advertising, Price, ShelveLocGood, ShelveLocMedium and Age (see the check below).
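As a check, the same list can be pulled programmatically from the coefficient table:
# Predictors with p-values below 0.05 (intercept dropped)
pvals = summary(carseats_all_lm)$coefficients[-1, "Pr(>|t|)"]
names(pvals)[pvals < 0.05]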
(e)
carseats_all_lm2 = lm(Sales~.-Education-Urban-US-Population,data=Carseats)
summary(carseats_all_lm2)
##
## Call:
## lm(formula = Sales ~ . - Education - Urban - US - Population,
##     data = Carseats)
##
## Residuals:
##     Min      1Q Median       3Q     Max
##    -2.7728 -0.6954   0.0282    0.6732    3.3292
##
##    Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
##    (Intercept)      5.475226   0.505005   10.84   <2e-16 ***
##    CompPrice        0.092571   0.004123   22.45   <2e-16 ***
##    Income           0.015785   0.001838    8.59   <2e-16 ***
##    Advertising      0.115903   0.007724   15.01   <2e-16 ***
##    Price           -0.095319   0.002670 -35.70    <2e-16 ***
##    ShelveLocGood    4.835675   0.152499   31.71   <2e-16 ***
##    ShelveLocMedium 1.951993    0.125375   15.57   <2e-16 ***
##    Age             -0.046128   0.003177 -14.52    <2e-16 ***
##    ---
##    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##    Residual standard error: 1.019 on 392 degrees of freedom
##    Multiple R-squared: 0.872, Adjusted R-squared: 0.8697
##    F-statistic: 381.4 on 7 and 392 DF, p-value: < 2.2e-16
(f)
•   The RSE drops from 2.47 in model (a) to 1.02 in model (e). The R² statistic
    increases from 0.24 in (a) to 0.872 in (e), and the F-statistic goes up from 41.52
    to 381.4.
•   The statistical evidence clearly shows that (e) is a much better fit (a
    side-by-side comparison follows).
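A compact side-by-side comparison of the two fits:
# RSE and R^2 for model (a) vs model (e)
sapply(list(model_a = carseats_lm, model_e = carseats_all_lm2),
       function(m) c(RSE = summary(m)$sigma, R2 = summary(m)$r.squared))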
(g)
confint(carseats_all_lm2)
##                          2.5 %      97.5 %
##    (Intercept)      4.48236820 6.46808427
##    CompPrice        0.08446498 0.10067795
##    Income           0.01217210 0.01939784
##    Advertising      0.10071856 0.13108825
##    Price           -0.10056844 -0.09006946
##    ShelveLocGood    4.53585700 5.13549250
##    ShelveLocMedium 1.70550103 2.19848429
##    Age             -0.05237301 -0.03988204
(h)
par(mfrow=c(2,2))
plot(carseats_all_lm2)
•   Since there is no obvious pattern in the residuals vs. fitted values plot, the
    model appears to fit the data well.
•   There seem to be a few outliers. As before, we can check using studentized
    residuals; observation 358 appears to be an outlier.
rstudent(carseats_all_lm2)[which(rstudent(carseats_all_lm2)>3)]
##     358
## 3.34075
•   There appears to be one high-leverage observation.
hatvalues(carseats_all_lm2)[order(hatvalues(carseats_all_lm2), decreasing =
T)][1]
##        311
## 0.06154635
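For context, the leverage of observation 311 can be compared with the average leverage
(p + 1)/n:
# Average leverage: (p + 1)/n with 7 coefficients beyond the intercept, n = 400
p = length(coef(carseats_all_lm2)) - 1
(p + 1) / nrow(Carseats) # 0.02, so obs 311 (0.0615) has about 3x the average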
13)
(a) (b) (c) (d) (e) (f)
set.seed(1)
x2 = rnorm(100, mean=0, sd=1)
eps = rnorm(100, mean=0, sd=0.5)
y2 = -1 +(0.5*x2) + eps
•   Length of y2 = 100, β0 = −1, β1 = 0.5
plot(y2~x2, main= 'Scatter plot of x2 against y2', col='red')
# Linear regression line.
lm.fit6 = lm(y2~x2)
summary(lm.fit6)
##
##   Call:
##   lm(formula = y2 ~ x2)
##
##   Residuals:
##        Min       1Q   Median          3Q       Max
##   -0.93842 -0.30688 -0.06975     0.26970   1.17309
##
##   Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
##   (Intercept) -1.01885    0.04849 -21.010 < 2e-16 ***
##   x2            0.49947   0.05386   9.273 4.58e-15 ***
##   ---
##   Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##   Residual standard error: 0.4814 on 98 degrees of freedom
##   Multiple R-squared: 0.4674, Adjusted R-squared: 0.4619
##   F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15
abline(lm.fit6, lwd=1, col ="blue")
# Population regression line and legends.
abline(a=-1,b=0.5, lwd=1, col="red")
legend('bottomright', bty='n', legend=c('Least Squares Line', 'Population Line'),
       col=c('blue','red'), lty = c(1, 1))
•   X2 and Y2 have a positive linear relationship, with the error term adding extra
    variation.
•   β̂0 = −1.019 and β̂1 = 0.499. The regression estimates are very close to the true
    values β0 = −1 and β1 = 0.5, which is confirmed by the regression and population
    lines lying almost on top of each other. The p-values are near zero and the
    F-statistic is large, so the null hypothesis can be rejected.
(g)
# Polynomial regression
lm.fit7 = lm(y2~x2+I(x2^2))
summary(lm.fit7)
##
##    Call:
##    lm(formula = y2 ~ x2 + I(x2^2))
##
##    Residuals:
##         Min       1Q   Median           3Q        Max
##    -0.98252 -0.31270 -0.06441      0.29014    1.13500
##
##    Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
##    (Intercept) -0.97164        0.05883 -16.517      < 2e-16 ***
##    x2           0.50858        0.05399   9.420      2.4e-15 ***
##    I(x2^2)     -0.05946        0.04238 -1.403         0.164
##    ---
##    Signif. codes: 0 '***'      0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##    Residual standard error: 0.479 on 97 degrees of freedom
##    Multiple R-squared: 0.4779, Adjusted R-squared: 0.4672
##    F-statistic: 44.4 on 2 and 97 DF, p-value: 2.038e-14
•   The quadratic term does not improve the model fit: its p-value (0.164) is above
    0.05, so it is not statistically significant, and the F-statistic drops from 85.99
    to 44.4 (a formal nested-model comparison follows).
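The same conclusion follows from a formal nested-model comparison; for a single added
term the ANOVA F-test reproduces the t-test p-value (0.164):
# F-test on the added I(x2^2) term
anova(lm.fit6, lm.fit7)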
(h)
eps = rnorm(100, mean=0, sd=sqrt(0.01))
y2 = -1 +(0.5*x2) + eps
plot(y2~x2, main='Reduced Noise', col='red')
lm.fit7 = lm(y2~x2)
summary(lm.fit7)
##
##    Call:
##    lm(formula = y2 ~ x2)
##
##    Residuals:
##          Min        1Q    Median              3Q         Max
##    -0.291411 -0.048230 -0.004533        0.064924    0.264157
##
##    Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
##    (Intercept) -0.99726    0.01047 -95.25    <2e-16 ***
##    x2            0.50212   0.01163   43.17   <2e-16 ***
##    ---
##    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##    Residual standard error: 0.1039 on 98 degrees of freedom
##    Multiple R-squared: 0.9501, Adjusted R-squared: 0.9495
##    F-statistic: 1864 on 1 and 98 DF, p-value: < 2.2e-16
abline(lm.fit7, lwd=1, col ="blue")
abline(a=-1,b=0.5, lwd=1, col="red")
legend('bottomright', bty='n', legend=c('Least Squares Line', 'Population Line'),
       col=c('blue','red'), lty = c(1, 1))
•   Compared with the earlier error variance of 0.25, the points are much more tightly
    clustered, the RSE is smaller, and the R² and F-statistic are substantially larger.
    With the noise reduced, the population and least squares lines nearly coincide and
    the relationship looks much more clearly linear.
(i)
eps = rnorm(100, mean=0, sd=sqrt(0.5625))
y2 = -1 +(0.5*x2) + eps
plot(y2~x2, main='Increased Noise', col='red')
lm.fit8 = lm(y2~x2)
summary(lm.fit8)
##
##    Call:
##    lm(formula = y2 ~ x2)
##
##    Residuals:
##         Min       1Q   Median           3Q        Max
##    -1.88719 -0.40893 -0.02832      0.50466    1.40916
##
##    Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
##    (Intercept) -0.95675   0.07521 -12.721 < 2e-16 ***
##    x2           0.45824   0.08354   5.485 3.23e-07 ***
##    ---
##    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##    Residual standard error: 0.7466 on 98 degrees of freedom
##    Multiple R-squared: 0.2349, Adjusted R-squared: 0.2271
##    F-statistic: 30.09 on 1 and 98 DF, p-value: 3.227e-07
abline(lm.fit8, lwd=1, col ="blue")
abline(a=-1,b=0.5, lwd=1, col="red")
legend('bottomright', bty='n', legend=c('Least Squares Line', 'Population Line'),
       col=c('blue','red'), lty = c(1, 1))
•   The points are more spread out, so the relationship looks less linear: the RSE is
    higher and the R² and F-statistic are lower than with a variance of 0.25 (a
    summary table follows).
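A summary table across the three noise levels; note that lm.fit7 was reassigned to the
reduced-noise fit in part (h), so it is no longer the quadratic model from (g):
# RSE and R^2 for the original, reduced-noise, and increased-noise fits
sapply(list(original = lm.fit6, reduced = lm.fit7, increased = lm.fit8),
       function(m) c(RSE = summary(m)$sigma, R2 = summary(m)$r.squared))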
(j)
# 95% confidence intervals for the original, reduced-noise, and increased-noise fits
confint(lm.fit6)
##                  2.5 %     97.5 %
## (Intercept) -1.1150804 -0.9226122
## x2           0.3925794 0.6063602
confint(lm.fit7)
##                  2.5 %     97.5 %
## (Intercept) -1.0180413 -0.9764850
## x2           0.4790377 0.5251957
confint(lm.fit8)
##                  2.5 %     97.5 %
## (Intercept) -1.1060050 -0.8074970
## x2           0.2924541 0.6240169
•   The confidence intervals are narrowest for the lowest-variance model, widest for
    the highest-variance model, and in between for the original model (widths computed
    below).
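The interval widths for the slope make this explicit:
# Width of the 95% CI for the slope under each noise level
sapply(list(original = lm.fit6, reduced = lm.fit7, increased = lm.fit8),
       function(m) diff(confint(m)["x2", ]))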