DSC551: Programming for Data Science (R Programming)
6. Basic Inferential Statistics
Lecturer, Department of Statistics
2024-10-01
Asmui Rahim, DSC551:R , Oct 2024
Introduction
Aims:
1. Discuss the assumption of normality.
2. Introduce basic hypothesis testing (one-sample t-test).
Assumptions of Normality
Plot the distribution of the data using:
1. Histogram
2. Boxplot
3. Density plot
4. QQ (Quantile-Quantile) plot
Normality tests:
1. Kolmogorov-Smirnov test
2. Shapiro-Wilk test
Data preparation:
mydata1 <- read.csv("telco.csv",
                    stringsAsFactors = TRUE)
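The histogram and boxplot listed above have no code on these slides. A minimal sketch follows. Because telco.csv is not reproduced here, it uses a simulated stand-in vector, usage (an assumption, with an arbitrary mean and spread), in place of mydata1$Usage_GB.

```r
# Simulated stand-in for mydata1$Usage_GB (assumption: illustrative values only)
set.seed(551)
usage <- rnorm(45, mean = 18, sd = 6)

# Histogram: bar heights count the observations falling in each bin
h <- hist(usage, main = "Histogram of usage", xlab = "Usage (GB)")

# Boxplot: median, quartiles and potential outliers at a glance
b <- boxplot(usage, main = "Boxplot of usage", ylab = "Usage (GB)")
```

A roughly symmetric, bell-shaped histogram and a boxplot with the median near the centre of the box are both consistent with normality.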
Density plot
Visualize the distribution of a continuous variable.
See the “shape” of the data, highlighting where most of the values cluster and how spread
out they are.
Identify patterns:
Skewness: Is the distribution symmetric, or skewed to the left or right?
Multimodality: Are there multiple peaks in the distribution?
plot(density(mydata1$Usage_GB),
     main = "Density estimate of data")
QQ-plot
Assess whether a dataset follows a normal distribution.
If the points fall approximately along a straight diagonal line, this suggests that the data
follow a normal distribution.
qqnorm(mydata1$Usage_GB)
qqline(mydata1$Usage_GB,
       col = "red", lwd = 3)
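To make clear what the QQ plot draws, the sketch below rebuilds its coordinates by hand: the sorted data are plotted against the normal quantiles of evenly spaced probabilities. The vector usage is a simulated stand-in for mydata1$Usage_GB (an assumption, since telco.csv is not reproduced here).

```r
# Simulated stand-in for mydata1$Usage_GB (assumption: illustrative values only)
set.seed(551)
usage <- rnorm(45, mean = 18, sd = 6)

# Coordinates computed by qqnorm(), without drawing the plot
qq <- qqnorm(usage, plot.it = FALSE)

# The same coordinates by hand:
# theoretical quantiles of the standard normal at evenly spaced probabilities
theoretical <- qnorm(ppoints(length(usage)))
# sample quantiles are simply the sorted observations
observed <- sort(usage)

# Both versions agree once the qqnorm() output is sorted
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), observed)
```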
Normality test (Kolmogorov-Smirnov test)
ks.test(mydata1$Usage_GB, "pnorm",
        mean = mean(mydata1$Usage_GB),
        sd = sd(mydata1$Usage_GB))
Asymptotic one-sample Kolmogorov-Smirnov test
data: mydata1$Usage_GB
D = 0.076064, p-value = 0.957
alternative hypothesis: two-sided
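pnorm specifies the theoretical distribution (normal in this case).
The mean and standard deviation of that distribution must be supplied. Here they are estimated
from the data, so the reported p-value is only approximate.
Conclusion:
Since p-value = 0.957 > 0.05, we do not reject normality: the KS test gives no evidence that
Usage_GB departs from a normal distribution.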
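The decision above can also be made in code by storing the test result and reading its components. This is a minimal sketch on a simulated stand-in vector, usage (an assumption, since telco.csv is not reproduced here), so its numbers will differ from the output shown on the slide.

```r
# Simulated stand-in for mydata1$Usage_GB (assumption: illustrative values only)
set.seed(551)
usage <- rnorm(45, mean = 18, sd = 6)

# Store the result instead of only printing it
ks_res <- ks.test(usage, "pnorm", mean = mean(usage), sd = sd(usage))

ks_res$statistic  # the D statistic
ks_res$p.value    # the p-value

# Decision rule at the 5% level: TRUE means normality is rejected
alpha <- 0.05
reject_normality <- ks_res$p.value <= alpha
reject_normality
```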
Normality test (Shapiro-Wilk test)
shapiro.test(mydata1$Usage_GB)
Shapiro-Wilk normality test
data: mydata1$Usage_GB
W = 0.9878, p-value = 0.9119
Conclusion:
Since p-value = 0.9119 > 0.05, we do not reject normality: the SW test gives no evidence that
Usage_GB departs from a normal distribution.
How to make the normality conclusion?
The density plot, QQ plot, KS test and SW test all point the same way: there is no evidence
that the internet usage data differ from a normal distribution, so normality can reasonably be
assumed.
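For contrast, it helps to see what a rejection looks like. The sketch below applies the Shapiro-Wilk test to two idealised samples of size 45 built from evenly spaced quantiles: one normal-shaped and one exponential-shaped (strongly right-skewed). Both samples and their parameters are assumptions made for illustration and are not taken from telco.csv.

```r
n <- 45

# Idealised normal-shaped sample (assumption: mean 18, sd 6)
normal_like <- qnorm(ppoints(n), mean = 18, sd = 6)

# Idealised right-skewed sample with the same mean (exponential shape)
skewed <- qexp(ppoints(n), rate = 1 / 18)

sw_normal <- shapiro.test(normal_like)
sw_skewed <- shapiro.test(skewed)

sw_normal$p.value  # above 0.05: no evidence against normality
sw_skewed$p.value  # below 0.05: normality rejected at the 5% level
```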
Hypothesis testing
Steps in hypothesis testing
Hypothesis test with one sample
Question: There was a claim that the usage of internet quota by the students was different
from the average of 15 GB. A study was conducted to investigate the claim and 45 students
were selected at random. Test at 5% level of significance.
Step 1: State the hypothesis and identify the claim
H0 : μ = 15
H1 : μ ≠ 15
Step 2: State the level of significance
α = 0.05 (5% level of significance, confidence level at 95%)
Step 3: Find the p-value
t.test(mydata1$Usage_GB, mu = 15)
One Sample t-test
data: mydata1$Usage_GB
t = 3.2358, df = 44, p-value = 0.002306
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
16.18011 20.07766
sample estimates:
mean of x
18.12889
Step 4: Make the decision
Reject H0 if p-value ≤ α. Since p-value = 0.0023 < α = 0.05, reject H0.
Step 5: Summarize the result
At the 5% significance level, there is sufficient evidence that the mean internet usage by the
students is different from 15 GB.
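The t statistic and p-value reported in Step 3 can be reproduced by hand, which shows what t.test() computes for a two-sided one-sample test. The sketch uses a simulated stand-in vector, usage (an assumption, since telco.csv is not reproduced here), so the numbers differ from the slide output.

```r
# Simulated stand-in for mydata1$Usage_GB (assumption: illustrative values only)
set.seed(551)
usage <- rnorm(45, mean = 18, sd = 6)

mu0 <- 15           # hypothesised mean under H0
n <- length(usage)

# t = (sample mean - hypothesised mean) / standard error of the mean
t_stat <- (mean(usage) - mu0) / (sd(usage) / sqrt(n))
df <- n - 1

# Two-sided p-value: probability in both tails beyond |t|
p_value <- 2 * pt(-abs(t_stat), df)

# The built-in test gives the same values
res <- t.test(usage, mu = mu0)
all.equal(unname(res$statistic), t_stat)
all.equal(res$p.value, p_value)
```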
Confidence interval
t.test(mydata1$Usage_GB, mu = 15)
One Sample t-test
data: mydata1$Usage_GB
t = 3.2358, df = 44, p-value = 0.002306
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
16.18011 20.07766
sample estimates:
mean of x
18.12889
We are 95% confident that the mean of internet usage by the students is between 16.1801
and 20.0777.
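The interval is the sample mean plus or minus a t quantile times the standard error. A minimal sketch of that calculation, again on a simulated stand-in vector usage (an assumption, not the telco.csv data):

```r
# Simulated stand-in for mydata1$Usage_GB (assumption: illustrative values only)
set.seed(551)
usage <- rnorm(45, mean = 18, sd = 6)

n <- length(usage)
se <- sd(usage) / sqrt(n)         # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # 97.5th percentile, for a two-sided 95% interval

# Lower and upper limits computed by hand
ci_manual <- mean(usage) + c(-1, 1) * t_crit * se

# Limits reported by t.test()
ci_ttest <- as.numeric(t.test(usage, mu = 15)$conf.int)
all.equal(ci_manual, ci_ttest)
```

Note that the interval does not depend on mu: the hypothesised value only affects the t statistic and the p-value.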
Note
To test a directional alternative hypothesis H1 (mean less than or greater than the
hypothesised value), specify the argument alternative="less" or alternative="greater".
The initial letter alone is enough, for example alternative="l".
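The three choices of alternative are related in a simple way, sketched below on a simulated stand-in vector usage (an assumption, not the telco.csv data): the two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them.

```r
# Simulated stand-in for mydata1$Usage_GB (assumption: illustrative values only)
set.seed(551)
usage <- rnorm(45, mean = 18, sd = 6)

p_two     <- t.test(usage, mu = 15)$p.value                        # H1: mean != 15
p_less    <- t.test(usage, mu = 15, alternative = "less")$p.value  # H1: mean < 15
p_greater <- t.test(usage, mu = 15, alternative = "g")$p.value     # "g" abbreviates "greater"

# Relationships between the three p-values
all.equal(p_less + p_greater, 1)
all.equal(p_two, 2 * min(p_less, p_greater))
```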
Exercise
Question 1: There was a claim that the usage of internet quota by the students was less than
the average of 20 GB. A study was conducted to investigate the claim and 45 students were
selected at random. Test at 10% level of significance.
Step 1: Hypothesis
H0 : The usage of internet quota by the students is equal to 20 GB
H1 : The usage of internet quota by the students is less than 20 GB
Step 2: Level of significance
α = 0.10
Step 3: Find the p-value
t.test(mydata1$Usage_GB, mu = 20, alternative = "less", conf.level = 0.9)
One Sample t-test
data: mydata1$Usage_GB
t = -1.9351, df = 44, p-value = 0.02971
alternative hypothesis: true mean is less than 20
90 percent confidence interval:
-Inf 19.38699
sample estimates:
mean of x
18.12889
Step 4: Decision rule
Reject H0 if p-value ≤ α. Since p-value = 0.02971 < α = 0.10, we reject H0.
Step 5: Conclusion
At the 10% level of significance, there is sufficient evidence that the mean usage of internet
quota by the students is less than 20 GB.
Question 2: A claim has been made that the average daily mobile phone usage is more than
5 hours. Using the dataset from Question 1, first assess the normality of the daily hours of
mobile phone usage. Then conduct a hypothesis test at the 5% significance level to determine
whether there is sufficient evidence to support the claim.
Note
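The argument conf.level= sets the confidence level of the reported interval, which is 1 − α.
For a 10% significance level, set the value to 0.9. For 1%, set the value to 0.99. The default
is 0.95 (5% level of significance). The p-value itself does not depend on conf.level; the test
decision comes from comparing the p-value with α.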
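A short check of what conf.level does and does not change, using a simulated stand-in vector usage (an assumption, not the telco.csv data):

```r
# Simulated stand-in for mydata1$Usage_GB (assumption: illustrative values only)
set.seed(551)
usage <- rnorm(45, mean = 18, sd = 6)

r90 <- t.test(usage, mu = 20, alternative = "less", conf.level = 0.90)
r99 <- t.test(usage, mu = 20, alternative = "less", conf.level = 0.99)

# The p-value is unaffected by conf.level
identical(r90$p.value, r99$p.value)

# Only the confidence bound moves: higher confidence gives a wider interval
r90$conf.int
r99$conf.int
```

The same p-value is therefore compared with 0.10, 0.05 or 0.01, depending on the significance level chosen for the test.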
End of Slides