
SB11 - Group 1

[ISB-UEH] Statistics for Business - Final Report (Group assignment)

GROUP ASSIGNMENT COVER SHEET

STUDENT DETAILS

Student name: Nguyễn Hương Giang Student ID number: 31231020782


Student name: Nguyễn Ngọc Thanh Mai Student ID number: 31231027073
Student name: Vũ Tuyết Nhi Student ID number: 31231022001
Student name: Nguyễn Ngọc Huyền Trân Student ID number: 31231024101
Student name: Trần Thị Cát Tường Student ID number: 31231020394

UNIT AND TUTORIAL DETAILS


Unit name: Unit number:
Tutorial/Lecture: SB - Group Assignment Class day and time:
Lecturer or Tutor name: Trần Minh Hoàng

ASSIGNMENT DETAILS

Title: SB11_Group 1
Length: 34 pages Due date: 28/04/2024 Date submitted: 28/04/2024

DECLARATION
I hold a copy of this assignment if the original is lost or damaged.
I hereby certify that no part of this assignment or product has been copied from any other student’s
work or from any other source except where due acknowledgement is made in the assignment.
I hereby certify that no part of this assignment or product has been submitted by me in another
(previous or current) assessment, except where appropriately referenced, and with prior
permission from the Lecturer / Tutor / Unit Coordinator for this unit.
No part of the assignment/product has been written/ produced for me by any other person except
where collaboration has been authorised by the Lecturer / Tutor /Unit Coordinator concerned.
I am aware that this work may be reproduced and submitted to plagiarism detection software programs
for the purpose of detecting possible plagiarism (which may retain a copy on its database for future
plagiarism checking).
Student’s signature: Nguyễn Hương Giang
Student’s signature: Nguyễn Ngọc Thanh Mai
Student’s signature: Vũ Tuyết Nhi
Student’s signature: Nguyễn Ngọc Huyền Trân
Student’s signature: Trần Thị Cát Tường

Note: An examiner or lecturer / tutor has the right to not mark this assignment if the above declaration has not
been signed.
SB-11 | Group 1

STATISTICS FOR
BUSINESS ASSIGNMENT

I. Problem Solving

Problem 1

a. Find the values of µ and σ.

● Given P(X̄ > 9.5) = 0.9255, the z-score corresponding to a probability of 0.9255 is approximately 1.44.

Since X follows a normal distribution, we use the formula for the standard error of the mean, which is also the standard deviation of X̄:

● $SE = \dfrac{\sigma}{\sqrt{n}}$, where σ denotes the standard deviation and n represents the sample size.
Given SE = 0.346410 and n = 12 → σ = SE × √n is approximately 1.2.

● Using the z-score formula: $z = \dfrac{x - \mu}{\sigma} \;\rightarrow\; z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$, where X̄ is the value being standardized, µ is the mean value, and σ/√n is the standard error.

Given X̄ = 9.5, σ = 1.2, and z = 1.44 → µ is approximately 9.0011.

⇒ Conclusion: σ ≈ 1.2

µ ≈ 9.0011
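
The step from the standard error to σ can be checked numerically. The following is a minimal sketch that uses only the two figures quoted above (SE = 0.346410, n = 12); the function name is illustrative and not part of the report.

```python
import math

def sigma_from_standard_error(se: float, n: int) -> float:
    """Recover sigma from SE = sigma / sqrt(n)."""
    return se * math.sqrt(n)

# Figures quoted in part a)
sigma = sigma_from_standard_error(0.346410, 12)
print(round(sigma, 4))  # 1.2
```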

b. Determine the mean, variance and size of this sample.

The normal distribution is a continuous probability distribution that is symmetric about its
mean, with the highest point of the curve occurring at the mean. This means that the left
and right tails of the distribution are mirror images of each other.

The shape of the normal distribution resembles a bell, with the curve gradually rising from
the mean, reaching its peak at the mean, and then gradually decreasing as it moves away
from the mean in both directions → Mean = Median = Mode.


The mean of the sample is the midpoint of the confidence interval for the population mean.
Given the confidence interval (9.53055, 10.65263), the mean of the sample is the average
of the two endpoints:
● Mean of sample = Average of confidence interval for mean = (9.53055 +
10.65263)÷ 2 ≈ 10.09159

The variance of the sample can be calculated using the confidence interval for the
population variance. Given the interval (1.12632, 3.57520), the variance is the average of
the two endpoints:
● Variance of the sample = Average of confidence interval for variance = (1.12632
+ 3.57520)÷ 2 ≈ 2.35076

For a two-sided confidence interval, the margin of error is the distance from the estimated
statistic to each endpoint. When a confidence interval is symmetric, the margin of error is
half the width of the confidence interval → E = (10.65263 - 9.53055) ÷ 2 ≈ 0.56104

For a 95% confidence interval, we have α/2 = 0.05/2 = 0.025 → $z_{\alpha/2} = 1.96$

Given that we're dealing with a normal distribution, $s^2$ is the sample variance, which is the same as the given variance 2.35076. So, $s = \sqrt{2.35076} \approx 1.5332$

● Sample size $n = \left(\dfrac{z_{\alpha/2} \cdot s}{E}\right)^2 = \left(\dfrac{1.96 \times 1.5332}{0.56104}\right)^2 \approx 28.69$, which rounds up to 29.

⇒ Conclusion:
- The mean of the sample ≈ 10.09159
- The variance of this sample ≈ 2.35076
- The size of this sample ≈ 29 (28.69)
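
The sample-size formula can be evaluated directly from the interval endpoints. This is a sketch under the report's own assumptions (z = 1.96, variance taken as 2.35076); the helper name is hypothetical.

```python
import math

def sample_size_for_margin(z: float, s: float, margin: float) -> float:
    """Solve E = z * s / sqrt(n) for n (unrounded)."""
    return (z * s / margin) ** 2

s = math.sqrt(2.35076)             # standard deviation from the variance used in the text
margin = (10.65263 - 9.53055) / 2  # half-width of the interval for the mean
n = sample_size_for_margin(1.96, s, margin)
print(math.ceil(n))  # 29
```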

c. Test at 90% level of confidence whether the mean and variance of the
population in part b) are the same as those of part a).

Let $X \sim N(\mu, \sigma^2)$ in part a) and $X_1 \sim N(\mu_1, \sigma_1^2)$ in part b).

Take a sample of size $n_1$ of X (part a).

Take a sample of size $n_2$ of $X_1$ (part b).


➢ MEAN
Hypothesis
𝐻0: µ = µ1

𝐻1: µ ≠ µ1

We assume that the unknown variances are equal.

Test statistics
Pooled variance:

$$s_p^2 = \frac{(n_1 - 1) s^2 + (n_2 - 1) s_1^2}{(n_1 - 1) + (n_2 - 1)} = \frac{(12 - 1) \times 1.2^2 + (29 - 1) \times 1.5332^2}{(12 - 1) + (29 - 1)} = 2.0938$$

D.f. = $n_1 + n_2 - 2$ = 12 + 29 − 2 = 39 → Using Appendix D, the t-critical value is 1.685.

Decision rule: Reject H0 if t-calc < −1.685 or t-calc > 1.685; otherwise do not reject H0.

$$t_{calc} = \frac{\bar{X} - \bar{X}_1}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}} = \frac{9.0011 - 10.09159}{\sqrt{\dfrac{2.0938}{12} + \dfrac{2.0938}{29}}} \approx -2.196 < -1.685$$

⇒ Reject the null hypothesis.

⇒ Conclusion: There is sufficient evidence at the 10% significance level to reject H0. The population mean in part b) is not the same as that of part a).
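
The pooled-variance test can be recomputed from the summary figures quoted in the text. This is a sketch only: the function names are illustrative, and the critical value 1.685 is passed in from the t-table rather than computed.

```python
import math

def pooled_variance(n1: int, var1: float, n2: int, var2: float) -> float:
    """Weighted average of two sample variances (equal-variance assumption)."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

def pooled_t(mean1: float, mean2: float, n1: int, n2: int, sp2: float) -> float:
    """Two-sample t statistic using the pooled variance sp2."""
    return (mean1 - mean2) / math.sqrt(sp2 / n1 + sp2 / n2)

sp2 = pooled_variance(12, 1.2 ** 2, 29, 2.35076)
t = pooled_t(9.0011, 10.09159, 12, 29, sp2)
t_crit = 1.685                 # two-tailed, alpha = 0.10, d.f. = 39 (from the table)
reject_h0 = abs(t) > t_crit
print(round(sp2, 4), round(t, 3), reject_h0)
```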


➢ VARIANCE
Hypothesis
$H_0: \sigma^2 = \sigma_1^2$
$H_1: \sigma^2 \neq \sigma_1^2$

Test statistics

Using the Statistical Table, we have:

Decision rule: Reject 𝐻0 if F > 𝐹𝑈 or F < 𝐹𝐿; otherwise do not reject 𝐻0.

F = 0.6127 < $F_U$ = 2.19 ⇒ do not reject the null hypothesis

⇒ Conclusion: There is not sufficient evidence to reject H0, so we cannot conclude that the population variance in part b) differs from that in part a).


Problem 2

a. Find the probability of the number of defective light bulbs in a box of 12.
Draw the graph of the corresponding probability distribution.

The question about finding the probability of the number of defective light bulbs in a box
of 12 involves a situation where there are only two possible outcomes for each bulb:
defective or not defective.
Moreover, the probability that a bulb is defective is the same for every bulb. Therefore, the Binomial distribution is used to solve this question.

Let X be the number of defective light bulbs, X ~ Binomial (12, 0.06)
n be the number of light bulbs in a box, n = 12
π be the probability that a bulb is defective, π = 6%

$$P(X = x) = \binom{12}{x}\, 0.06^{x}\, (1 - 0.06)^{12 - x}, \quad x = 0, 1, \dots, 12$$

X P(X)
0 0.4759203148
1 0.3645347092
2 0.1279749511
3 0.027228713
4 0.003910506655
5 0.0003993708924
6 0.0000297403856
7 0.000001627133559
8 0.00000006491224304
9 0.000000001841482072
10 0
11 0
12 0
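
The table above can be reproduced with a few lines of code. This is a minimal sketch, assuming X ~ Binomial(12, 0.06) as stated; the function name is illustrative.

```python
from math import comb

def binomial_pmf(x: int, n: int = 12, p: float = 0.06) -> float:
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

for x in range(13):
    print(x, f"{binomial_pmf(x):.10f}")
```

The values for x = 10, 11, 12 are not exactly zero; they are smaller than the precision shown in the table.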


b. Find the probabilities that a box containing 0, 1, 2, ..., 12 defective light bulbs, respectively, is rejected.

The customer's decision to reject a box depends on how many defective bulbs appear in the sample drawn from it. Therefore, the hypergeometric distribution is used, as it gives the probability of getting a specific number of defective bulbs (successes) in the drawn sample (2 bulbs) from a finite box (12 bulbs).

Let N be the number of light bulbs in a box, N =12

n be the number of light bulbs being selected, n=2

s be the number of defective bulbs in a box, s ∈ {0; 1; 2; …; 12}

X be the number of defective bulbs being selected


$$P(\text{Rejected} \mid n, N, s) = 1 - P(X = 0 \mid n, N, s) = 1 - \frac{C_s^{0}\, C_{N-s}^{n}}{C_N^{n}}$$

● If s = 0 ⇒ x=0
Probability of a box containing 0 defective light bulb is accepted: P (X = 0) = 1
Probability of a box containing 0 defective light bulb is rejected: 1 - P (X = 0) = 0

● If s ≥ 1, the probabilities of a box containing 1; 2; …; 12 defective light bulbs being rejected are:

$$P(\text{Rejected} \mid 2, 12, s) = 1 - P(X = 0 \mid 2, 12, s) = 1 - \frac{C_s^{0}\, C_{12-s}^{2}}{C_{12}^{2}}$$

s     P (Accepted | X=0)    P (Rejected)
1     0.833333333           0.166666667
2     0.681818182           0.318181818
3     0.545454545           0.454545455
4     0.424242424           0.575757576
5     0.318181818           0.681818182
6     0.227272727           0.772727273
7     0.151515152           0.848484848
8     0.090909091           0.909090909
9     0.045454545           0.954545455
10    0.015151515           0.984848485
11    0                     1
12    0                     1
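
The rejection probabilities in the table follow from counting combinations. The sketch below assumes the rule described in the text (2 bulbs drawn without replacement from a box of 12, box accepted only if none is defective); the function name is illustrative.

```python
from math import comb

def prob_rejected(s: int, box_size: int = 12, sample_size: int = 2) -> float:
    """P(at least one defective in the sample) for a box holding s defectives."""
    # comb(a, b) is 0 when b > a, which covers s = 11 and s = 12
    p_accept = comb(box_size - s, sample_size) / comb(box_size, sample_size)
    return 1 - p_accept

for s in range(13):
    print(s, f"{prob_rejected(s):.9f}")
```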

c. Determine the expected proportion of boxes rejected if the rule is applied to many independent boxes.

Expected proportion of boxes rejected:

$$E = \sum_{x=0}^{12} P(\text{Rejected} \mid X = x) \times P(X = x)$$

x      P (Rejected | X=x)    P (X=x)                   E (Proportion of rejected boxes)
0      0                     0.4759203148              0
1      0.166666667           0.3645347092              0.06075578487
2      0.318181818           0.1279749511              0.04071930263
3      0.454545455           0.027228713               0.01237668773
4      0.575757576           0.003910506655            0.002251503831
5      0.681818182           0.0003993708924           0.0002722983357
6      0.772727273           0.0000297403856           0.00002298120706
7      0.848484848           0.000001627133559         0.000001380598171
8      0.909090909           0.00000006491224304       0.00000005901113003
9      0.954545455           0.000000001841482072      0.000000001757778341
10     0.984848485           0                         0
11     1                     0                         0
12     1                     0                         0
Total                                                  0.1164

⇒ Conclusion: The expected proportion of boxes rejected, which is also the probability that a box is rejected, is 0.1164.
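
The weighted sum can be verified by combining the two distributions from parts a) and b). This is a sketch with illustrative function names, assuming the same acceptance rule (sample of 2 bulbs, no defectives allowed).

```python
from math import comb

def binomial_pmf(x: int, n: int = 12, p: float = 0.06) -> float:
    """P(X = x): probability that a box holds x defective bulbs."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def prob_rejected(s: int, box_size: int = 12, sample_size: int = 2) -> float:
    """P(box rejected | it holds s defectives), sampling without replacement."""
    return 1 - comb(box_size - s, sample_size) / comb(box_size, sample_size)

# Law of total probability over the number of defectives in a box
expected_rejected = sum(prob_rejected(x) * binomial_pmf(x) for x in range(13))
print(round(expected_rejected, 4))  # 0.1164
```

The result equals 1 − 0.94², since a box is accepted exactly when both sampled bulbs are good.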


d. The customer examines each box until he accepts one. What are the expected value and variance of the number of boxes he needs to examine?

Because the question asks for the number of boxes examined up to and including the first success (the first accepted box), we use the Geometric distribution to solve the problem.

● Probability of a box being accepted:

π = 1 − Probability of a box being rejected = 1 − 0.1164 = 0.8836

● Expected number of boxes he needs to examine:

$$\mu = \frac{1}{\pi} = \frac{1}{0.8836} = 1.1317$$

⇒ On average, he needs to examine between 1 and 2 boxes.

● Variance of the number of boxes he needs to examine:

$$\sigma^2 = \frac{1 - \pi}{\pi^2} = \frac{1 - 0.8836}{0.8836^2} = 0.1491$$
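
Both geometric moments are one-line formulas. A minimal sketch, assuming the acceptance probability 0.8836 derived above; the function name is illustrative.

```python
def geometric_mean_variance(p: float) -> tuple[float, float]:
    """Mean and variance of the number of trials up to the first success."""
    return 1 / p, (1 - p) / p ** 2

mean, variance = geometric_mean_variance(0.8836)
print(round(mean, 4), round(variance, 4))  # 1.1317 0.1491
```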

e. Suppose 1000 boxes are examined by the above rule. Find the probability
that the number of boxes rejected is between 120 and 140 (inclusive). You
should find your answers by two different methods.

Let n be the number of boxes examined, n = 1000

X be the number of boxes being rejected

π be the probability of a box being rejected, π = 0.1164

➢ Using Binomial Distribution

X~ Binomial (1000, 0.1164)

P (120 ≤ 𝑋 ≤ 140) = P (X = 120) + P (X = 121) + P (X = 122) + … + P (X = 140)

X      P(X)
120    0.03645
121    0.03492
122    0.03314
123    0.03116
124    0.02904
125    0.02681
126    0.02452
127    0.02223
128    0.01997
129    0.01779
130    0.0157
131    0.01373
132    0.01191
133    0.01024
134    0.00873
135    0.00738
136    0.00618
137    0.00513
138    0.00423
139    0.00346
140    0.0028
...

⇒ P (120 ≤ X ≤ 140) = 0.36552

➢ Using Normal Approximation to the Binomial Distribution

nπ = 1000 × 0.1164 = 116.4 > 10

n(1 − π) = 1000 × 0.8836 = 883.6 > 10

⇒ X can be approximated by Y, where Y ~ N (116.4; 102.85104) and the variance is nπ(1 − π) = 102.85104.

P (120 ≤ X ≤ 140) ≈ P (119.5 ≤ Y ≤ 140.5)

$$= P\left(\frac{119.5 - 116.4}{\sqrt{102.85104}} \le \frac{Y - 116.4}{\sqrt{102.85104}} \le \frac{140.5 - 116.4}{\sqrt{102.85104}}\right)$$

= P (0.31 ≤ Z ≤ 2.38)
= P (Z ≤ 2.38) – P (Z ≤ 0.31)
= 0.9913 – 0.6217 = 0.3696


⇒ Conclusion: The probability that the number of boxes rejected is between 120 and 140 is around 0.36552. The Normal Approximation method gives 0.3696, a small error (less than 0.01), which arises because a continuous normal distribution is used to approximate the discrete binomial distribution.
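
The two methods can be compared side by side. The sketch below uses only the Python standard library and the figures from the text (n = 1000, π = 0.1164); because it does not round the z-scores to two decimals, its normal-approximation result differs slightly from the table-based 0.3696.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 1000, 0.1164

# Method 1: exact binomial sum over 120..140 inclusive
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(120, 141))

# Method 2: normal approximation with continuity correction
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
approx = approx_dist.cdf(140.5) - approx_dist.cdf(119.5)

print(round(exact, 4), round(approx, 4))
```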

II. Data Analysis

1. Provide descriptions for the variables Manufacturer and Propulsion Type.

a. Manufacturers

The diagram illustrates the proportion of 5 car manufacturers among 500 vehicles.

In detail:

This is categorical data because each value represents a distinct category or group, which in this instance is the brand of the car manufacturer.

Ford is the manufacturer that appears most frequently, accounting for 28.8% of the 500 vehicles. Vauxhall follows closely behind with 28.0%, a difference of 0.8 percentage points. Next in descending order are Volkswagen at 18.2% and BMW at 16.8%. Toyota appears least often among the 500 vehicles, accounting for only 8.2%.

In general, the market share among these 5 manufacturers is distributed unevenly.

b. Propulsion type

The diagram illustrates the percentage of each fuel type that is used among 500 vehicles.

The codes of its propulsion type are decoded in the pie chart as follows:

Code 1 - Petrol; Code 2 - Diesel; Code 8 - Electric/Petrol

In detail:

Out of the 5 propulsion types provided, 3 types are seen among the 500 vehicles: Petrol, Diesel, and Electric/Petrol.

This is categorical data because the codes carry no meaningful numerical value. Each code labels a distinct category, which in this instance is the fuel type used by a vehicle.


Petrol is the most used of the 3 fuel types, accounting for more than half of the vehicles (58.2%). Diesel ranks second at 37.6%. Lastly, Electric/Petrol is recorded at 4.2%, the lowest usage among the 500 vehicles. The proportions of the 3 propulsion types differ substantially: Petrol exceeds Diesel by 20.6 percentage points, and Diesel exceeds Electric/Petrol by 33.4 percentage points.

2. Provide descriptions for the variables Engine Size, Mass, and CO2 Emission.

a. Engine size

The diagram illustrates the distribution of engine capacity among 500 vehicles in cubic
centimeters.


In detail:

This is an example of numerical data.

The histogram reveals a positively skewed distribution (skewness 1.87 > 0), which means values are concentrated towards smaller engine sizes with a long tail towards larger ones. The kurtosis value of 8.14 indicates a highly leptokurtic distribution, meaning a sharper peak and heavier tails than a normal distribution. Therefore, this distribution is asymmetrical.

● The average engine size is 1667.2 cc. The median of 1596 cc provides another measure of central tendency and is lower than the mean because of the positive skew. Moreover, the minimum value (647 cc) lies much closer to the mean than the maximum value (4951 cc), which further indicates the positive skew.

● The standard deviation of 490.4 cc measures the spread of engine sizes around the mean value. Because the distribution is positively skewed, the standard deviation alone does not describe the spread well: it is sensitive to the large values in the right tail.

● The interquartile range is 600 cc. It is a more robust measure of spread because it is not affected by extreme values and describes the middle 50% of the data.

● Potential outliers are identified with the two fences Q3 + 1.5 × IQR and Q1 − 1.5 × IQR. Applying them, there are 2 extreme outliers in the data set, both located in the right tail of the diagram.
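
The fence rule can be sketched in code as follows. The engine sizes in the list are hypothetical values chosen for illustration (they are not the report's data set), and Python's default quartile method may differ slightly from the one Minitab uses.

```python
from statistics import quantiles

def iqr_fences(values: list[float], k: float = 1.5) -> tuple[float, float]:
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical engine sizes in cc, for illustration only
engine_cc = [998, 1199, 1242, 1398, 1496, 1596, 1598, 1798, 1968, 1995, 4951]
low, high = iqr_fences(engine_cc)
outliers = [v for v in engine_cc if v < low or v > high]
print(low, high, outliers)
```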

In conclusion, the engine sizes are not normally distributed, as shown by the following:

- Positively skewed and highly leptokurtic, hence not bell shaped.
- The mean and median are not equal.
- The presence of extreme outliers.

b. Mass

The diagram illustrates different masses of vehicles among 500 vehicles in kilograms,
including the average driver’s mass at 75 kilograms.


In detail:

The vehicle mass values are numerical data.

This histogram shows a positively skewed distribution (skewness 0.7 > 0), suggesting that lighter vehicles are more common. A kurtosis of 0.64 suggests a slightly leptokurtic distribution, with a slightly sharper peak and heavier tails than a normal distribution. Therefore, this distribution is asymmetrical.

● The average vehicle mass is 1413 kilograms (kg). The median of 1395 kg provides another measure of central tendency and is slightly lower than the mean because of the positive skew.

● The standard deviation of 262.5 kg indicates the dispersion of vehicle masses around the mean value. Because the distribution is positively skewed, the standard deviation alone does not describe the spread well: it is sensitive to the large values in the right tail.

● The interquartile range is 428 kg. It is a more robust measure of spread because it is not affected by extreme values and describes the middle 50% of the data.

● Potential outliers are identified with the two fences Q3 + 1.5 × IQR and Q1 − 1.5 × IQR. Applying them, there are approximately 6 extreme outliers, located in the right tail of the diagram.

In conclusion, the vehicle masses are not normally distributed, as shown by the following:

- Positively skewed and slightly leptokurtic, hence not bell shaped.
- The mean and median are not equal, even though they are quite close to each other.
- The presence of extreme outliers.

c. CO2 emission

The diagram depicts the different amount of carbon dioxide (CO2) emission among 500
vehicles in grams per kilometer.


In detail:

The diagram displays numerical data.

The skewness of 1.13 shows that the distribution is positively skewed (1.13 > 0), meaning that values are concentrated towards the lower end with a longer right tail. The kurtosis value of 2.73 indicates a leptokurtic distribution, with a sharper peak and heavier tails than a normal distribution. Therefore, this distribution is asymmetrical.

● The average CO2 emission of the 500 vehicles is 135.35 grams per kilometer (g/km), which provides a baseline for the vehicles' typical CO2 output. The median value is 125 g/km, lower than the mean due to the positively skewed distribution.

● The standard deviation of 36.97 g/km shows how spread out the CO2 emission values are around the mean value.

● The interquartile range is 42.00 g/km, meaning that the middle half of the data points spans a range of 42 g/km.

● Potential outliers are identified with the two fences Q3 + 1.5 × IQR and Q1 − 1.5 × IQR. Applying them, there are approximately 7 to 15 extreme outliers, located in the right tail of the diagram.

In conclusion, the CO2 emissions are not normally distributed, as shown by the following:

- Positively skewed and leptokurtic, hence not bell shaped.
- Mean and median are not equal.
- The presence of many extreme outliers.


3. Investigate the relationship between Mass and Engine Size.

The matrix plot illustrates the relationship between engine size and mass in a dataset of 500 vehicles.

A positive correlation is evident between mass and engine size: as mass increases, the engine size also tends to increase. This is shown by the upward slope of the data points in the diagram. However, the dispersion around the upward trend implies that factors other than engine size also affect the mass.

Given a sample correlation coefficient of r = +0.645, the diagram suggests a moderately strong positive linear relationship. We can be 95% confident that the true correlation coefficient lies between 0.591 and 0.694 ( CI = (0.591, 0.694) ). This indicates a clear trend: vehicles with larger engine capacities tend to have larger mass.

In conclusion, the data above suggests a moderately strong positive correlation between engine size and mass, with larger engines generally corresponding to heavier vehicles.
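
The report does not state how the confidence interval was obtained. One common method, assumed here, is the Fisher z-transformation; the sketch below applies it to the rounded r = 0.645 with n = 500 and lands close to the reported interval.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def correlation_ci(r: float, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Approximate CI for a correlation coefficient via Fisher's z-transformation."""
    z = atanh(r)                 # transform r to an approximately normal scale
    se = 1 / sqrt(n - 3)         # standard error on the z scale
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

lower, upper = correlation_ci(0.645, 500)
print(round(lower, 3), round(upper, 3))
```

Small differences in the third decimal place are expected because r is used here in rounded form.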


4. Test at 1% significance level whether the variance of the CO2 emission level in the year 2002 and year 2016 are the same.

● The F-test: This test assumes that the two samples come from populations that are normally distributed.

● In this case, since the tests considered here are based on the normal distribution, we use the F-test.

99% CI for variance:

● You can be 99% confident that the true population variance for CO2 Emission in
2002 lies between (790.229, 1476.920).
● You can be 99% confident that the true population variance for CO2 Emission in
2016 lies between (604.363, 887.415).

⇒ The confidence intervals for the variance of CO2 emission in 2002 (790.229, 1476.920) and 2016 (604.363, 887.415) overlap only between 790.229 and 887.415, which suggests that the variances are likely different.


Estimated Ratio of variances: The ratio of variances is the variance of the CO2
Emission in 2002 divided by the variance of the CO2 Emission in 2016.

99% CI for Ratio using F: We can be 99% confident that the ratio of the two population variances is between 1.022 and 2.130. Because the interval does not contain the value 1, we can conclude that the population variances differ.

The graph below illustrates that the 99% CI for the ratio of the two population variances does not include 1. In other words, the 2 population variances are different.


Test statistic
We can conclude from the test statistic by finding the critical value and applying the
decision rule.

We can use Minitab to draw the graph and the critical values:

Because F > Fu (1.45595 > 1.425), there is sufficient evidence at 1% significance level to
reject Ho.


P - value:

We can also conclude from the p-value. The p-value is 0.006, which is less than the significance level (1%), so the decision is to reject the null hypothesis.

⇒ Conclusion: The null hypothesis states that the ratio of the variances is 1. Because the p-value is less than the significance level α = 0.01, we reject the null hypothesis. In other words, we have enough evidence to conclude that the variances of the CO2 emissions in 2002 and 2016 are different. The variance in 2002 is found to be significantly higher than in 2016 at the significance level α = 0.01.

5. Test at 1% significance level whether there is a significant decrease in the mean CO2 emission level from year 2002 to 2016.

The result of Question 4 shows that the two population variances are different. Therefore, when using Minitab, we do not assume equal variances.


SE Mean:
● The SE mean for the CO2 emissions in 2016 is smaller than the SE mean for the CO2 emissions in 2002 (1.4 compared to 2.8), as the sample size for 2016 is larger.
● This means that the 2016 sample provides a more precise estimate of its population mean.

Difference:
● Difference is the difference between the means of the two samples (CO2 emissions in 2002 and CO2 emissions in 2016).
● 52.45 ≈ 173.2 − 120.8 (the sample means are shown to one decimal place)

99% Lower Bound for Difference:

● We can be 99% confident that the difference between the population means (µ1 − µ2) is greater than 45.16.

µ1 is the population mean of CO2 Emission in 2002

µ2 is the population mean of CO2 Emission in 2016


T value:

We can compare the t-value to critical values of the t-distribution to determine whether to
reject the null hypothesis.

$t_{0.01,\,212} = 2.344$

Because t > $t_{0.01,\,212}$ ⇒ There is sufficient evidence at the 1% significance level to reject H0.

P value:

However, using the p-value of the test to make the same determination is usually more practical and convenient. Because the p-value is 0.000, which is less than the significance level of 0.01, the decision is to reject the null hypothesis. In other words, we can conclude that the mean CO2 emission in 2002 is greater than that in 2016.

⇒ Conclusion: The null hypothesis states that there is no decrease in the mean CO2 emission, that is, µ1 is not greater than µ2. Because the p-value is less than the significance level α = 0.01, we reject the null hypothesis. In other words, we have enough evidence to conclude that the mean CO2 emission decreased from 2002 to 2016. The mean of 2002 is found to be significantly higher than that of 2016 at the significance level of α = 0.01.
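
The lower bound and test statistic can be approximated from the rounded summary figures quoted above (difference 52.45, SE means 2.8 and 1.4, critical value 2.344). This is a sketch of the unequal-variance calculation, not Minitab's exact output; because the inputs are rounded, the bound comes out near, but not exactly at, the reported 45.16.

```python
from math import sqrt

def welch_lower_bound(diff: float, se1: float, se2: float, t_crit: float) -> tuple[float, float]:
    """Return (one-sided lower bound for the difference, t statistic)."""
    se_diff = sqrt(se1 ** 2 + se2 ** 2)  # SE of the difference, variances not pooled
    return diff - t_crit * se_diff, diff / se_diff

lower_bound, t_stat = welch_lower_bound(52.45, 2.8, 1.4, 2.344)
print(round(lower_bound, 2), round(t_stat, 2))
```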


6. Investigate at 1% significance level whether there is a significant difference in CO2 emission among different manufacturers.

Hypotheses

To determine whether there is a significant difference in CO2 emissions among the 5 different manufacturers, we set up the following hypotheses to test the equality of means:

● Null Hypothesis (H0): The mean CO2 emissions for all vehicle manufacturers are
equal.

● Alternative Hypothesis (H1): At least one manufacturer's mean CO2 emission differs
from the others.

Mathematically, this is expressed as:

● H0: μ1 = μ2 = μ3 = μ4 = μ5

(where μ1, μ2, μ3, μ4, μ5 represent the population means of the 5 manufacturers)

● H1: At least one of the means μi differs from the others.

Assumptions

● Each observation of CO2 emission is independent of others.

● CO2 emissions for each manufacturer follow a normal distribution, with the same
but unknown variances.

ANOVA test

a. CO2 emission in 2002

The results are summarized in the table below:


● The P-value (p = 36 × 10⁻¹⁰) is far below the significance level (α = 0.01), indicating a highly significant difference in CO2 emissions among the manufacturers.

● To further support this conclusion, we conduct the F test using α = 0.01 with $F_{4,\,133}$ (Numerator: df1 = 4; Denominator: df2 = 133). The decision rule is:

○ If F > $F_{0.01;\,4,\,133}$, we reject the null hypothesis H0

The F-Distribution Plot below shows the critical value and the rejection region for this test.

● The F-value (F = 13.34) far exceeds the F critical value of 3.463.

⇒ Conclusion: At a 1% significance level, there is sufficient evidence to reject the null hypothesis. Therefore, there is a significant difference between CO2 emissions from different manufacturers of vehicles registered in 2002.


b. CO2 emission in 2016

The results are summarized in the table below:

● The P-value (p = 12 × 10⁻⁵) is far below the significance level (α = 0.01), indicating a highly significant difference in CO2 emissions among the manufacturers.

● To further support this conclusion, we conduct the F test using α = 0.01 with $F_{4,\,357}$ (Numerator: df1 = 4; Denominator: df2 = 357). The decision rule is:

○ If F > $F_{0.01;\,4,\,357}$, we reject the null hypothesis H0.

The F-Distribution Plot below shows the critical value and the rejection region for this test.

● The F-value (F = 5.95) exceeds the F critical value of 3.372.

⇒ Conclusion: At a 1% significance level, there is sufficient evidence to reject the null hypothesis. Therefore, there is a significant difference between CO2 emissions from different manufacturers of vehicles registered in 2016.

7. Determine the regression equation for CO2 emission with independent variables Engine Size and Mass. Comment on the fit of the model. Give the interpretation of the coefficients of the regression equation.

● CO2 Emission: the predicted amount of CO2 emissions (g/km)

● Coef:

○ b0 = 83.57 is the expected CO2 emission for a vehicle with 0 Mass and 0 Engine Size. It is not meaningful because an engine size of zero is not physically possible.

○ b1 = 0.03744: For every 1 cc increase in Engine Size, the CO2 Emission increases on average by 0.03744 g/km, holding Mass constant.

○ b2 = -0.00760: For every 1 kg increase in Mass, the CO2 Emission decreases on average by 0.00760 g/km, holding Engine Size constant.

● SE Coef: The standard error of the coefficient

○ The standard error of the coefficient measures the precision of the estimate of
the coefficient. The smaller the standard error, the more precise the estimate.

○ The standard error of the Engine Size coefficient is smaller than that of Mass
(0.00392 compared to 0.00733). Therefore, the estimate of the coefficient for
the Engine Size has greater precision.

● P-value:

○ The p-value of Engine Size (reported as 0) is less than the significance level (0.01), so we can conclude that there is a statistically significant association between the Engine Size and CO2 Emissions.

○ The p-value of Mass (0.3) is greater than the significance level (0.01), so we cannot conclude that there is a statistically significant association between the Mass and the CO2 Emissions.

● VIF:

The VIF of Engine Size and Mass is 1.71, which is between 1 and 5. We can conclude that Engine Size and Mass are moderately correlated with each other.
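
The coefficients above imply the fitted equation CO2 Emission = 83.57 + 0.03744 × Engine Size − 0.00760 × Mass. The sketch below evaluates it at the sample means reported earlier and checks the VIF against the correlation from Question 3, using the fact that with exactly two predictors VIF = 1 / (1 − r²). Function names are illustrative.

```python
def predict_co2(engine_cc: float, mass_kg: float) -> float:
    """Fitted CO2 emission (g/km) from the coefficients quoted in the report."""
    return 83.57 + 0.03744 * engine_cc - 0.00760 * mass_kg

def vif_two_predictors(r: float) -> float:
    """VIF when the model has exactly two predictors with correlation r."""
    return 1 / (1 - r ** 2)

# Evaluate at the sample means of Engine Size (1667.2 cc) and Mass (1413 kg)
print(round(predict_co2(1667.2, 1413), 2))
print(round(vif_two_predictors(0.645), 2))  # 1.71
```

The prediction at the sample means comes out close to the mean CO2 emission of 135.35 g/km, as expected for a least-squares fit.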


● P-value:

○ The P-value of Regression is reported as 0, which is less than the significance level (0.01). This means the model explains some of the variation in the response: at least one coefficient is different from 0.

○ The P-value for the estimated coefficient of Engine Size is less than the significance level (0.01). It means Engine Size is significantly related to CO2 Emissions.

○ The P-value for the estimated coefficient of Mass is higher than the significance level (0.01). It means Mass is not significantly related to CO2 Emissions at the 0.01 level.

○ The P-value for Lack-of-Fit is reported as 0, which is less than the significance level. We conclude that the model does not correctly specify the relationship between the response and the predictors.

When measuring fit, we use R-sq (R2):


● Since the CO2 Emission regression yields R2 = 21.5%, we can say that X (“Engine Size” and “Mass”) explains 21.5 percent of the variation in Y (CO2 Emission). The remaining 78.5 percent of the variation in CO2 Emission is not explained by Engine Size and Mass.

● The R-sq is low (21.50%), indicating that our model does not have a strong fit and should be improved.

We can also use S (the standard error of the regression) to measure overall fit:

● A smaller value of S indicates a better fit.
● The value of S is 32.8223 g/km, which is large relative to the CO2 emission values: it is only slightly below their standard deviation of 36.97 g/km. Therefore, the model does not have a strong fit.

8. Test at 1% significance level whether the proportion of cars meeting the CO2 emission target in 2016 is greater than that in 2002.

Data

Registered year    Number of vehicles    Number of vehicles meeting the CO2 emission target    Proportion of cars meeting the CO2 emission target
2016               n1 = 362              x1 = 56                                                p1 = 56/362 = 0.154696
2002               n2 = 138              x2 = 0                                                 p2 = 0/138 = 0.000000

Hypotheses

To determine whether the proportion of cars meeting the CO2 emission target in 2016 is
greater than that in 2002, we set up the following hypotheses:

● Null Hypothesis: The population proportion of cars meeting the CO2 emission target in 2016 is less than or equal to that in 2002.

● Alternative Hypothesis: The population proportion of cars meeting the CO2 emission target in 2016 is greater than that in 2002.

Mathematically, this is expressed as:

● H0: π1 ≤ π2
(where π1, π2 represent the population proportion of cars meeting the target in 2016 and
2002, respectively)

● H1: π1 > π2

Assumptions

For a test of two proportions, the criterion for normality is nπ ≥ 10 and n(1 - π) ≥ 10 for each sample:

● n1p1 = 56 > 10
● n1(1 - p1) = 306 > 10
● n2p2 = 0 < 10 (does not meet the requirement for normality)
● n2(1 - p2) = 138 > 10

Because one condition is not met, the normal approximation is only rough for the 2002 sample, and the Minitab results below should be read with this limitation in mind.

Test statistic

● The P-value (p ≈ 0) is less than the significance level α = 0.01 → reject H0

● Given the Z-value (z = 4.90), the decision rule is: If z > z0.01, we reject H0
At α = 0.01, the right-tail critical value is z0.01 = 2.326, which is less than z = 4.90, confirming that we should reject H0.

⇒ Conclusion: There is sufficient evidence at a 1% significance level to reject H0, indicating that the population proportion of cars meeting the CO2 emission target in 2016 is greater than that in 2002.

Confidence interval

To determine whether the proportion in 2016 is greater than that in 2002, we construct the
confidence interval of the difference between these two proportions:

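$$(p_1 - p_2) \pm z_{\alpha/2} \sqrt{\frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}} = (0.154696 - 0) \pm 2.576 \sqrt{\frac{0.154696\,(1 - 0.154696)}{362} + \frac{0}{138}}$$

= [ 0.10574, 0.20365 ]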

The lower bound for the difference of 0.10574 is greater than zero, confirming that the
proportion of cars meeting the CO2 emission target in 2016 is higher than that
in 2002.
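
The z statistic and the interval can both be recomputed from the counts in the data table. The sketch below assumes the usual pooled standard error for the test and the unpooled standard error for the 99% two-sided interval; the report does not show Minitab's settings, so these choices are assumptions, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for H0: pi1 = pi2, using the pooled proportion."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se

def difference_ci(x1: int, n1: int, x2: int, n2: int, confidence: float = 0.99) -> tuple[float, float]:
    """Two-sided CI for pi1 - pi2, using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z_crit * se, (p1 - p2) + z_crit * se

z = two_proportion_z(56, 362, 0, 138)
lower, upper = difference_ci(56, 362, 0, 138)
print(round(z, 2), round(lower, 5), round(upper, 5))
```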

⇒ Conclusion: From 2002 to 2016, there has been a remarkable improvement in the
percentage of vehicles meeting the CO2 emission target. These five manufacturers have
made substantial progress in reducing CO2 emissions from their vehicles.
