Exam Paper 2

The document outlines a data analysis project focused on understanding the relationship between car prices and various performance indicators using a dataset of 50 observations and 12 variables. It includes multiple linear regression analysis, significance testing, multicollinearity checks, variable selection methods, and random effects modeling. Each section contains specific questions and tasks related to the analysis, emphasizing statistical methods and interpretations.

© The Hong Kong Polytechnic University

Questions 2 to 6 are based on the dataset below:

You are a data analyst interested in investigating which types of cars are more expensive on the
market. Therefore, you wish to investigate the relationship between a car's miles/(US) gallon
($Y$) and other indicators of performance ($X_1, \ldots, X_{11}$). The data set contains 50 observations
on 12 variables.

Notation   Variable   Description
$Y$        mpg        Miles/(US) gallon
$X_1$      cyl        Number of cylinders
$X_2$      hp         Gross horsepower
$X_3$      wt         Weight (lb/1000)
$X_4$      am         Transmission (0 = automatic, 1 = manual)
$X_5$      disp       Displacement (cu. in.)
$X_6$      drat       Rear axle ratio
$X_7$      qsec       1/4 mile time
$X_8$      vs         V/S
$X_9$      gear       Number of forward gears
$X_{10}$   carb       Number of carburetors
$X_{11}$   brand      Groups of manufacturers ("Ford", "Chevrolet", "Dodge")


To answer Question 2, refer to Appendix: Code and Output Q2.

Question 2: Multiple Linear Regression (MLR) [Total: 30 marks]

We study the association between the response variable mpg ($Y$) and the explanatory variables
cyl ($X_1$), hp ($X_2$), and wt ($X_3$), and assume an MLR model to describe their relationship. That
is,
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i, \quad i = 1, \ldots, 50. \tag{1}$$

2.1 Fit an MLR model relating $Y$ to $X_1$, $X_2$, and $X_3$, based on the result of least-squares
estimation (LSE).
[3 marks]

2.2 What is the function of an analysis-of-variance approach? Construct the analysis-of-variance
table based on model (1). [(1+3) marks]

2.3 Test for significance of regression by the p-value method.
Hint: give the full procedure, including the null and alternative hypotheses, test statistic,
decision rule, etc.
[4 marks]

2.4 Instead of the p-value method for the test of significance of regression, determine the critical
region. Make your decision regarding the null hypothesis based on the critical region.
[3 marks]

2.5 Use the partial F-test to test whether $X_1$ and $X_3$ are significant regressors given that $X_2$ is
already in the model. [8 marks]

2.6 Use the general linear hypothesis approach to test $H_0: \beta_1 - \beta_2 = 0,\ \beta_2 + \beta_3 = 0$ by
transforming $H_0$ into $H_0: T\beta = 0$ ($\alpha = 0.05$). [8 marks]

To answer Question 3, refer to Appendix: Code and Output Q3.

Question 3: Indicator Variables [Total: 10 marks]

We are now interested in including the indicator variable am ($X_4$) and studying its effects on the
response. We still assume an MLR model:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \varepsilon_i, \quad i = 1, \ldots, 50. \tag{2}$$

3.1 Determine the least-squares fit between $Y$ and $X_1$, $X_2$, $X_3$, $X_4$ based on model (2).
[2 marks]

3.2 What does it mean when $X_4 = 0$ or when $X_4 = 1$, and how do you interpret the meaning
of the new $\beta_4$ term, based on the corresponding code and output? [(2+2) marks]

3.3 Add a new interaction term between $X_1$ and $X_4$ to model (2), that is
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_{14} X_{1i} X_{4i} + \varepsilon_i, \quad i = 1, \ldots, 50. \tag{3}$$
Test the significance of the interaction effect term ($\beta_{14}$) ($\alpha = 0.05$).
[4 marks]


To answer Question 4, refer to Appendix: Code and Output Q4.

Question 4: Multicollinearity [Total: 10 marks]


We wish to inspect the entire data set for possible multicollinearity issues. Therefore, we now
consider relating $Y$ to $X_1, \ldots, X_{10}$, the first 10 explanatory variables in the data set, and examine
whether there is multicollinearity among the explanatory variables.

4.1 Based on the matrix of correlations between all possible regressors (10 in total), which
variables are highly correlated? Interpret such high correlations.
[(3+2) marks]

4.2 Find the variance inflation factors (VIFs). Based on the VIFs, is there evidence of
multicollinearity in these data, and why?
[(2+3) marks]


To answer Question 5, refer to Appendix: Code and Output Q5.

Question 5: Variable Selection [Total: 15 marks]


We have detected the existence of multicollinearity in the data set. Now we employ variable
selection methods to obtain reduced models to fit the data.

5.1 This sub-question is based on code block 1 under Q5 (Appendix). Based on the output,
which selection method is used to select the appropriate model (forward, backward, or
both)? Which criterion is the selection based on (AIC or BIC)?
[(2+2) marks]

5.2 This sub-question is based on code block 2 under Q5 (Appendix). Based on the output,
which selection method is used to select the appropriate model (forward, backward, or
both)? Which criterion is the selection based on (AIC or BIC)?
[(2+2) marks]

5.3 Determine the selected models by the above two procedures. [(2+2) marks]

5.4 Did the two procedures above lead to the same result? Give one key difference between the
two procedures. [(1+2) marks]


To answer Question 6, refer to Appendix: Code and Output Q6.

Question 6: Random Effects Model [Total: 25 marks]


Suppose the cars are produced by three manufacturers ("Ford", "Chevrolet", "Dodge"). Now
we want to study the fixed effect of cyl ($X_1$) on the response mpg ($Y$) while accounting for
differences among the three groups of manufacturers ($X_{11}$).
We assume a Laird-Ware form of random effects model:

$$Y_{ij} = \beta_0 + \beta_1 X_{1ij} + \gamma_j + \varepsilon_{ij}, \quad i = 1, \ldots, n_j; \; j = 1, 2, 3; \; \sum_{j=1}^{3} n_j = 50, \tag{4}$$

where $\gamma_j \sim N(0, d^2)$ and the $\gamma_j$'s are independent, $\varepsilon_{ij} \sim N(0, \sigma^2)$ and the $\varepsilon_{ij}$'s are independent, and the $\gamma_j$'s
and $\varepsilon_{ij}$'s are independent.

6.1 Based on the relevant part of summary(RAmodel), how much variation is left unexplained after
including the fixed and random effects in the model? [4 marks]

6.2 How many levels does this model contain? Explain what each level is. [(3+3) marks]

6.3 Fit the random effects models for each group $j$. [6 marks]

6.4 The intra-class correlation coefficient (ICC) is the proportion of variation in individual
cars' miles/(US) gallon due to differences among groups of manufacturers.
Use the following equation to find the ICC ($\rho$):

$$\frac{\operatorname{var}(\gamma_j)}{\operatorname{var}(Y_{ij})} = \frac{d^2}{d^2 + \sigma^2} = \rho$$
[3 marks]

6.5 Which method of estimation should be preferred for random effects models (MLE or
REML)? Explain your answer. [(3+3) marks]

END


Appendix: Code and Output

Q2
# Read the data set (the header row gives the variable names) and inspect its structure
data <- read.table("FinalData.txt", header = TRUE)
str(data)

## 'data.frame': 50 obs. of 11 variables:
## $ mpg : num 17.3 15.2 10.4 22.8 19.2 15.2 17.8 14.3 21.4 27.3 ...
## $ cyl : num 8 8 8 4 8 8 6 8 4 4 ...
## $ disp: num 276 276 472 141 400 ...
## $ hp : num 180 180 205 95 175 150 123 245 109 66 ...
## $ drat: num 3.07 3.07 2.93 3.92 3.08 3.15 3.92 3.21 4.11 4.08 ...
## $ wt : num 3.73 3.78 5.25 3.15 3.85 ...
## $ qsec: num 17.6 18 18 22.9 17.1 ...
## $ vs : num 0 0 0 1 0 0 1 0 1 1 ...
## $ am : num 0 0 0 0 0 0 0 0 1 1 ...
## $ gear: num 3 3 3 4 3 3 4 3 4 4 ...
## $ carb: num 3 3 4 2 2 2 4 4 2 1 ...
head(data,5)

##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 2 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 3 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 4 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 5 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
attach(data)
model1 <- lm(mpg~cyl+hp+wt)
summary(model1)

##
## Call:
## lm(formula = mpg ~ cyl + hp + wt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.699 -1.155 -0.407 1.016 5.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.315679 1.355342 28.270 < 2e-16 ***
## cyl -0.718206 0.424885 -1.690 0.0977 .
## hp -0.024945 0.009663 -2.582 0.0131 *
## wt -3.182233 0.557917 -5.704 8.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.342 on 46 degrees of freedom
## Multiple R-squared: 0.8552, Adjusted R-squared: 0.8458
## F-statistic: 90.58 on 3 and 46 DF, p-value: < 2.2e-16
model2 <- lm(mpg~cbind(cyl,hp,wt))
anova(model2)

## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cbind(cyl, hp, wt) 3 1490.58 496.86 90.576 < 2.2e-16 ***
## Residuals 46 252.34 5.49
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qf(0.95,3,46)

## [1] 2.806845
anova(lm(mpg~hp))

## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 1146.79 1146.79 92.339 9.185e-13 ***
## Residuals 48 596.13 12.42
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qf(0.95,2,46)

## [1] 3.199582
fmodel <- lm(mpg~cyl+hp+wt)
anova(fmodel)

## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 1 1270.77 1270.77 231.6568 < 2.2e-16 ***
## hp 1 41.35 41.35 7.5375 0.008593 **
## wt 1 178.46 178.46 32.5330 8.031e-07 ***
## Residuals 46 252.34 5.49
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
z1 <- cyl+hp
z2 <- hp-wt
rmodel <- lm(mpg~z1+z2)
anova(rmodel)

## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## z1 1 1158.31 1158.31 182.744 < 2.2e-16 ***
## z2 1 286.69 286.69 45.231 2.134e-08 ***
## Residuals 47 297.91 6.34
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qf(0.95,2,47)

## [1] 3.195056
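
The F statistics used in Question 2 can be rechecked from the sums of squares printed above. The following is a minimal sketch, assuming the values are typed in by hand from the ANOVA tables rather than recomputed from FinalData.txt; the variable names are illustrative and not part of the original code.

```r
# Sums of squares copied from the ANOVA output above (assumed inputs, rounded as printed)
ssr_full <- 1490.58   # regression SS, mpg ~ cyl + hp + wt (3 df)
sse_full <- 252.34    # residual SS of the same model (46 df)
sse_hp   <- 596.13    # residual SS, mpg ~ hp (48 df)
mse_full <- sse_full / 46

# Overall F statistic for significance of regression
f_overall <- (ssr_full / 3) / mse_full

# Partial F statistic for adding cyl and wt to a model that already contains hp
f_partial <- ((sse_hp - sse_full) / 2) / mse_full

# 5% critical values, matching the qf() calls above
crit_overall <- qf(0.95, 3, 46)
crit_partial <- qf(0.95, 2, 46)

print(c(f_overall = f_overall, crit_overall = crit_overall))
print(c(f_partial = f_partial, crit_partial = crit_partial))
```

Because the inputs are rounded, f_overall agrees with the printed F value of 90.576 only to about two decimal places.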


Q3
model3 <- lm(mpg~cyl+hp+wt+am)
summary(model3)

##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2600 -1.8061 -0.4625 1.1947 5.6968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.79646 2.22767 16.069 < 2e-16 ***
## cyl -0.54270 0.43821 -1.238 0.22198
## hp -0.03060 0.01036 -2.954 0.00498 **
## wt -2.65241 0.66678 -3.978 0.00025 ***
## am 1.43371 1.01244 1.416 0.16363
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.317 on 45 degrees of freedom
## Multiple R-squared: 0.8614, Adjusted R-squared: 0.8491
## F-statistic: 69.92 on 4 and 45 DF, p-value: < 2.2e-16
model4 <- lm(mpg~cyl+hp+wt+am+cyl*am)
summary(model4)

##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am + cyl * am)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7203 -1.2476 -0.6787 1.4713 5.3366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.81802 2.31229 15.058 < 2e-16 ***
## cyl -0.47848 0.43602 -1.097 0.278441
## hp -0.02601 0.01076 -2.418 0.019837 *
## wt -2.71064 0.66108 -4.100 0.000175 ***
## am 5.04318 2.76492 1.824 0.074950 .
## cyl:am -0.67974 0.48531 -1.401 0.168341
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.293 on 44 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8522
## F-statistic: 57.52 on 5 and 44 DF, p-value: < 2.2e-16
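
The t test for the interaction coefficient can be reproduced from the cyl:am row of the summary above. This is a sketch under the assumption that the estimate and standard error are copied by hand from the printed output; the variable names are illustrative.

```r
# Estimate and standard error of cyl:am, copied from summary(model4) (assumed inputs)
est <- -0.67974
se  <- 0.48531
df_resid <- 44   # residual degrees of freedom of model4

# t statistic, two-sided p-value, and the 5% two-sided critical value
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df_resid)
t_crit <- qt(0.975, df_resid)

print(c(t_stat = t_stat, p_val = p_val, t_crit = t_crit))
```

The null hypothesis of no interaction is rejected at the 5% level only if abs(t_stat) exceeds t_crit, or equivalently if p_val is below 0.05.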


Q4
cor(data[,-1])

## cyl disp hp drat wt qsec
## cyl 1.0000000 0.9082751 0.8570174 -0.7271632 0.7783612 -0.62561185
## disp 0.9082751 1.0000000 0.8074629 -0.7437779 0.8915563 -0.45416093
## hp 0.8570174 0.8074629 1.0000000 -0.4872792 0.6763000 -0.73164607
## drat -0.7271632 -0.7437779 -0.4872792 1.0000000 -0.7218245 0.13472010
## wt 0.7783612 0.8915563 0.6763000 -0.7218245 1.0000000 -0.20760018
## qsec -0.6256119 -0.4541609 -0.7316461 0.1347201 -0.2076002 1.00000000
## vs -0.7959708 -0.7033907 -0.7294454 0.4477173 -0.5401273 0.78012235
## am -0.5405752 -0.6232762 -0.3274658 0.7081947 -0.6973783 -0.13894923
## gear -0.5631739 -0.6358329 -0.2864005 0.7165513 -0.6046845 -0.09239065
## carb 0.5326364 0.3948202 0.7221994 -0.1294226 0.4501591 -0.65813626
## vs am gear carb
## cyl -0.7959708 -5.405752e-01 -0.56317392 5.326364e-01
## disp -0.7033907 -6.232762e-01 -0.63583286 3.948202e-01
## hp -0.7294454 -3.274658e-01 -0.28640052 7.221994e-01
## drat 0.4477173 7.081947e-01 0.71655128 -1.294226e-01
## wt -0.5401273 -6.973783e-01 -0.60468446 4.501591e-01
## qsec 0.7801224 -1.389492e-01 -0.09239065 -6.581363e-01
## vs 1.0000000 1.701706e-01 0.23472935 -5.750294e-01
## am 0.1701706 1.000000e+00 0.78734229 -6.646151e-05
## gear 0.2347293 7.873423e-01 1.00000000 1.942447e-01
## carb -0.5750294 -6.646151e-05 0.19424470 1.000000e+00
library(car)

## Loading required package: carData
vif(lm(mpg~.,data=data))

## cyl disp hp drat wt qsec vs am
## 16.221784 25.766526 10.377170 3.680522 16.531852 8.702772 5.267179 4.050267
## gear carb
## 5.300046 8.074390


Q5
# Code block 1
full <- lm(mpg~.,data=data)
step(full,direction="backward",k=2)

## Start: AIC=92.03
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.085 202.96 90.050
## - carb 1 0.400 203.28 90.128
## - vs 1 1.122 204.00 90.305
## - drat 1 3.076 205.96 90.782
## - gear 1 5.154 208.03 91.284
## <none> 202.88 92.030
## - disp 1 8.427 211.31 92.064
## - qsec 1 10.993 213.87 92.668
## - am 1 11.944 214.82 92.890
## - hp 1 15.941 218.82 93.812
## - wt 1 44.111 246.99 99.866
##
## Step: AIC=90.05
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.331 203.30 88.132
## - vs 1 1.041 204.00 88.306
## - drat 1 3.155 206.12 88.822
## - gear 1 5.385 208.35 89.360
## <none> 202.96 90.050
## - disp 1 9.891 212.86 90.430
## - qsec 1 11.269 214.23 90.752
## - am 1 11.910 214.88 90.902
## - hp 1 15.880 218.84 91.817
## - wt 1 45.327 248.29 98.129
##
## Step: AIC=88.13
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear
##
## Df Sum of Sq RSS AIC
## - vs 1 1.258 204.55 86.440
## - drat 1 3.131 206.43 86.896
## - gear 1 5.218 208.51 87.399
## <none> 203.30 88.132
## - am 1 12.305 215.60 89.070
## - qsec 1 14.194 217.49 89.506
## - disp 1 21.570 224.87 91.174
## - hp 1 28.329 231.62 92.655
## - wt 1 100.141 303.44 106.157
##
## Step: AIC=86.44
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - drat 1 3.805 208.36 85.362
## - gear 1 5.509 210.06 85.769
## <none> 204.55 86.440
## - am 1 11.390 215.94 87.150
## - disp 1 20.402 224.96 89.194
## - hp 1 27.193 231.75 90.681
## - qsec 1 34.801 239.35 92.296
## - wt 1 112.388 316.94 106.335
##
## Step: AIC=85.36
## mpg ~ disp + hp + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 8.296 216.65 85.314
## <none> 208.36 85.362
## - am 1 14.690 223.05 86.768
## - disp 1 17.885 226.24 87.480
## - hp 1 26.022 234.38 89.246
## - qsec 1 36.035 244.39 91.338
## - wt 1 116.314 324.67 105.540
##
## Step: AIC=85.31
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 216.65 85.314
## - disp 1 10.283 226.94 85.633
## - hp 1 21.768 238.42 88.101
## - qsec 1 31.085 247.74 90.018
## - am 1 40.246 256.90 91.833
## - wt 1 108.079 324.73 103.549
##
## Call:
## lm(formula = mpg ~ disp + hp + wt + qsec + am, data = data)
##
## Coefficients:
## (Intercept) disp hp wt qsec am
## 16.63563 0.01092 -0.02434 -4.13734 0.93306 2.98110
# Code block 2
null <- lm(mpg~1)
step(null,scope =list(upper=full),direction="forward",k=log(50))

## Start: AIC=181.48
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + wt 1 1313.75 429.17 115.31
## + disp 1 1273.36 469.55 119.81
## + cyl 1 1270.77 472.15 120.09
## + hp 1 1146.79 596.13 131.75
## + drat 1 819.20 923.71 153.64
## + vs 1 766.31 976.60 156.43
## + am 1 662.25 1080.67 161.49
## + carb 1 578.74 1164.18 165.21
## + gear 1 502.42 1240.50 168.39
## + qsec 1 378.53 1364.39 173.15
## <none> 1742.91 181.48
##
## Step: AIC=115.32
## mpg ~ wt
##
## Df Sum of Sq RSS AIC
## + hp 1 161.157 268.01 95.686
## + qsec 1 148.765 280.40 97.946
## + cyl 1 140.272 288.90 99.438
## + vs 1 92.750 336.42 107.052
## + carb 1 75.145 354.02 109.603
## + disp 1 55.337 373.83 112.325
## <none> 429.17 115.315
## + drat 1 12.622 416.54 117.735
## + am 1 0.407 428.76 119.180
## + gear 1 0.390 428.78 119.182
##
## Step: AIC=95.69
## mpg ~ wt + hp
##
## Df Sum of Sq RSS AIC
## <none> 268.01 95.686
## + am 1 18.2054 249.80 96.081
## + cyl 1 15.6739 252.34 96.585
## + drat 1 12.7799 255.23 97.155
## + gear 1 11.2232 256.79 97.459
## + qsec 1 9.7038 258.31 97.754
## + vs 1 7.2137 260.80 98.234
## + carb 1 0.6156 267.39 99.483
## + disp 1 0.1881 267.82 99.563
##
## Call:
## lm(formula = mpg ~ wt + hp)
##
## Coefficients:
## (Intercept) wt hp
## 37.20961 -3.67607 -0.03662
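
For linear models, the value that step() labels "AIC" is n * log(RSS / n) + k * p, where p is the number of estimated coefficients and k is the penalty passed to step(). The sketch below rechecks the two starting values printed above. The helper name step_criterion is hypothetical, and the RSS values are copied by hand from the output.

```r
# Criterion reported by step() for a linear model:
# n * log(RSS / n) + k * (number of estimated coefficients)
step_criterion <- function(rss, n, n_coef, k) {
  n * log(rss / n) + k * n_coef
}

n <- 50

# Code block 1: full model, 10 regressors plus intercept, penalty k = 2
start_block1 <- step_criterion(rss = 202.88, n = n, n_coef = 11, k = 2)

# Code block 2: intercept-only model, penalty k = log(n)
start_block2 <- step_criterion(rss = 1742.91, n = n, n_coef = 1, k = log(n))

print(c(start_block1 = start_block1, start_block2 = start_block2))
```

The only thing that differs between the two calls, apart from the model, is the penalty k, which is what separates the two criteria named in Question 5 even though the output labels both columns "AIC".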


Q6
set.seed(123)
data$manufacturer <- sample(c("Ford", "Chevrolet", "Dodge"), size = nrow(data), replace = TRUE)
library(lme4)

## Loading required package: Matrix
RAmodel <- lmer(mpg ~ cyl + (1 | manufacturer), data = data)
summary(RAmodel)

## Linear mixed model fit by REML ['lmerMod']
## Formula: mpg ~ cyl + (1 | manufacturer)
## Data: data
##
## REML criterion at convergence: 254.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.51307 -0.53749 0.00938 0.34874 2.51611
##
## Random effects:
## Groups Name Variance Std.Dev.
## manufacturer (Intercept) 0.217 0.4658
## Residual 9.686 3.1122
## Number of obs: 50, groups: manufacturer, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 37.3134 1.6144 23.11
## cyl -2.8269 0.2487 -11.37
##
## Correlation of Fixed Effects:
## (Intr)
## cyl -0.947
ranef(RAmodel)

## $manufacturer
## (Intercept)
## Chevrolet 0.06349953
## Dodge -0.26661727
## Ford 0.20311774
##
## with conditional variances for "manufacturer"
var_components <- VarCorr(RAmodel)
d2 <- var_components$manufacturer[1]
d2

## [1] 0.217013
s2 <- sigma(RAmodel)^2
s2

## [1] 9.685797
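
The two variance components above are the inputs to the ICC formula in Question 6.4, and the ranef() values shift the fixed intercept for each manufacturer. The following is a minimal sketch, assuming all numbers are copied by hand from the output above instead of being extracted from RAmodel; the names icc and group_intercepts are illustrative.

```r
# Variance components copied from the output above (assumed inputs)
d2 <- 0.217013   # between-manufacturer variance of the random intercept
s2 <- 9.685797   # residual variance

# Intra-class correlation: share of total variance due to manufacturer
icc <- d2 / (d2 + s2)

# Group-specific intercepts: fixed intercept plus each predicted random effect
fixed_intercept <- 37.3134
random_effects <- c(Chevrolet = 0.06349953, Dodge = -0.26661727, Ford = 0.20311774)
group_intercepts <- fixed_intercept + random_effects

print(round(icc, 4))
print(round(group_intercepts, 4))
```

Every group shares the fixed slope for cyl (-2.8269 in the summary above); only the intercept varies by manufacturer.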
