
DSC551: Programming for Data Science (R Programming)
6. Basic Inferential Statistics

Lecturer, Department of Statistics


2024-10-01

Asmui Rahim, DSC551:R , Oct 2024


Introduction
Aims:

1. Discuss the assumption of normality.


2. Introduce basic hypothesis testing (one sample t-test).



Assumptions of Normality
Plot the distribution of the data using:
1. Histogram
2. Boxplot
3. Density plot
4. QQ (quantile-quantile) plot
Normality tests:
1. Kolmogorov-Smirnov test
2. Shapiro-Wilk test
Data preparation:

# Read the telco data; text columns are converted to factors
mydata1 <- read.csv("telco.csv",
                    stringsAsFactors = TRUE)
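The histogram and boxplot listed above can be sketched as follows. This is only an illustration: telco.csv is not available here, so `usage_gb` is a simulated stand-in for `mydata1$Usage_GB`, and its mean and standard deviation are assumptions.

```r
# Simulated stand-in for mydata1$Usage_GB (assumption: 45 values, roughly normal)
set.seed(551)
usage_gb <- rnorm(45, mean = 18, sd = 6.5)

# Histogram: look for a single peak and rough symmetry
h <- hist(usage_gb,
          main = "Histogram of internet usage",
          xlab = "Usage (GB)")

# Boxplot: look for a median near the centre of the box and few outliers
b <- boxplot(usage_gb, horizontal = TRUE,
             main = "Boxplot of internet usage",
             xlab = "Usage (GB)")
```

With the real data, replace `usage_gb` with `mydata1$Usage_GB`.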



Density plot
Visualize the distribution of a continuous variable.
See the “shape” of the data, highlighting where most of the values cluster and how spread
out they are.
Identify patterns:
Skewness: Is the distribution symmetric, or skewed to the left or right?
Multimodality: Are there multiple peaks in the distribution?

# Kernel density estimate of internet usage
plot(density(mydata1$Usage_GB),
     main = "Density estimate of data")
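One way to judge the shape is to overlay a normal curve with the same mean and standard deviation on the density estimate. A minimal sketch, assuming simulated data in place of `mydata1$Usage_GB`:

```r
# Simulated stand-in for mydata1$Usage_GB (assumption)
set.seed(551)
usage_gb <- rnorm(45, mean = 18, sd = 6.5)

# Kernel density estimate of the sample
d <- density(usage_gb)
plot(d, main = "Density estimate vs fitted normal curve")

# Normal curve using the sample mean and standard deviation
curve(dnorm(x, mean = mean(usage_gb), sd = sd(usage_gb)),
      add = TRUE, lty = 2, col = "red")

legend("topright",
       legend = c("Kernel density", "Normal curve"),
       lty = c(1, 2), col = c("black", "red"))
```

If the two curves are close, the data look roughly normal; a long tail on one side indicates skewness.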



QQ-plot
Assess whether a dataset follows a normal distribution.
If the points fall approximately along a straight diagonal line, this suggests that the data
follow a normal distribution.

# QQ plot with a reference line through the quartiles
qqnorm(mydata1$Usage_GB)
qqline(mydata1$Usage_GB,
       col = "red", lwd = 3)
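The "straight line" impression can also be summarised numerically. The sketch below, on simulated data standing in for `mydata1$Usage_GB`, computes the correlation between the theoretical and sample quantiles; a value close to 1 means the points lie close to a straight line. This is an informal aid, not a formal test.

```r
# Simulated stand-in for mydata1$Usage_GB (assumption)
set.seed(551)
usage_gb <- rnorm(45, mean = 18, sd = 6.5)

# qqnorm() returns the plotted coordinates; plot.it = FALSE skips the plot
qq <- qqnorm(usage_gb, plot.it = FALSE)

# Correlation between theoretical (x) and sample (y) quantiles
qq_cor <- cor(qq$x, qq$y)
qq_cor
```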



Normality test (Kolmogorov-Smirnov test)
ks.test(mydata1$Usage_GB, "pnorm",
        mean = mean(mydata1$Usage_GB),
        sd = sd(mydata1$Usage_GB))

Asymptotic one-sample Kolmogorov-Smirnov test


data: mydata1$Usage_GB
D = 0.076064, p-value = 0.957
alternative hypothesis: two-sided

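"pnorm" specifies the theoretical distribution (normal in this case).

The mean and standard deviation of that distribution must be supplied. Here they are estimated from the data, so the p-value is only approximate.
Conclusion:
p-value = 0.957 > 0.05, so we fail to reject normality. The KS test gives no evidence that the Usage_GB data depart from a normal distribution.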
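The decision can also be made in code, because `ks.test()` returns an object of class "htest" whose components can be extracted. A minimal sketch on simulated data (the data and the name `alpha` are assumptions for illustration):

```r
# Simulated stand-in for mydata1$Usage_GB (assumption)
set.seed(551)
usage_gb <- rnorm(45, mean = 18, sd = 6.5)
alpha <- 0.05

ks <- ks.test(usage_gb, "pnorm",
              mean = mean(usage_gb),
              sd = sd(usage_gb))

ks$statistic  # the D statistic
ks$p.value    # the p-value

# Compare the p-value with the significance level
if (ks$p.value > alpha) {
  message("Fail to reject normality at alpha = ", alpha)
} else {
  message("Reject normality at alpha = ", alpha)
}
```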



Normality test (Shapiro-Wilk test)
shapiro.test(mydata1$Usage_GB)

Shapiro-Wilk normality test


data: mydata1$Usage_GB
W = 0.9878, p-value = 0.9119

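Conclusion:
p-value = 0.9119 > 0.05, so we fail to reject normality. The SW test gives no evidence that the Usage_GB data depart from a normal distribution.

How to make the normality conclusion?

Combine the plots and the tests. Neither the KS test nor the SW test suggests that the internet usage data differ from a normal distribution, which agrees with what the density plot and QQ plot indicate. We therefore treat the data as approximately normally distributed.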
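For contrast, the sketch below shows how the Shapiro-Wilk test responds to clearly skewed data. Both samples are simulated; the sample size of 200 and the exponential distribution are choices made for this illustration only.

```r
set.seed(551)

# One roughly normal sample and one strongly right-skewed sample (both simulated)
normal_like <- rnorm(200, mean = 18, sd = 6.5)
skewed <- rexp(200, rate = 1 / 18)

sw_normal <- shapiro.test(normal_like)
sw_skewed <- shapiro.test(skewed)

# Side-by-side summary of the two tests
data.frame(
  sample = c("normal_like", "skewed"),
  W = c(unname(sw_normal$statistic), unname(sw_skewed$statistic)),
  p_value = c(sw_normal$p.value, sw_skewed$p.value)
)
```

A very small p-value for the skewed sample is what rejecting normality looks like in practice.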



Hypothesis testing

Steps in hypothesis testing



Hypothesis test with one sample
Question: It is claimed that the mean internet quota usage of the students differs from
15 GB. To investigate the claim, 45 students were selected at random. Test the claim at
the 5% level of significance.
Step 1: State the hypotheses and identify the claim

H0: μ = 15

H1: μ ≠ 15 (claim)

Step 2: State the level of significance


α = 0.05 (5% level of significance, confidence level at 95%)



Step 3: Find the p-value
t.test(mydata1$Usage_GB, mu = 15)

One Sample t-test


data: mydata1$Usage_GB
t = 3.2358, df = 44, p-value = 0.002306
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
16.18011 20.07766
sample estimates:
mean of x
18.12889
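To see where the t statistic and p-value come from, they can be computed by hand and compared with `t.test()`. A minimal sketch on simulated data (an assumption, since telco.csv is not available here), using t = (x̄ − μ0) / (s / √n) with n − 1 degrees of freedom:

```r
# Simulated stand-in for mydata1$Usage_GB (assumption)
set.seed(551)
usage_gb <- rnorm(45, mean = 18, sd = 6.5)
mu0 <- 15  # hypothesised mean under H0

n <- length(usage_gb)
se <- sd(usage_gb) / sqrt(n)           # standard error of the mean
t_stat <- (mean(usage_gb) - mu0) / se  # t statistic
p_val <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value

# Compare with the built-in test
res <- t.test(usage_gb, mu = mu0)
all.equal(t_stat, unname(res$statistic))
all.equal(p_val, res$p.value)
```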

Step 4: Make the decision


Reject H0 if p-value ≤ α. Since p-value = 0.0023 < α = 0.05, reject H0.
Step 5: Summarize the result
At the 5% significance level, there is sufficient evidence that the mean internet usage by the students is different from 15 GB.



Confidence interval
t.test(mydata1$Usage_GB, mu = 15)

One Sample t-test


data: mydata1$Usage_GB
t = 3.2358, df = 44, p-value = 0.002306
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
16.18011 20.07766
sample estimates:
mean of x
18.12889

We are 95% confident that the mean internet usage by the students is between 16.18 GB
and 20.08 GB. Because this interval does not contain 15, it agrees with the decision to reject H0.
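The interval can be reproduced from the numbers printed in the output. This sketch uses only the reported mean, t statistic and degrees of freedom; the standard error is recovered from them, so the result matches the output up to rounding.

```r
# Values copied from the t.test output
x_bar <- 18.12889  # sample mean
t_stat <- 3.2358   # t statistic
df <- 44           # degrees of freedom
mu0 <- 15          # hypothesised mean

# Recover the standard error from t = (x_bar - mu0) / se
se <- (x_bar - mu0) / t_stat

# 95% confidence interval: x_bar +/- t critical value * se
t_crit <- qt(0.975, df)
ci <- c(lower = x_bar - t_crit * se,
        upper = x_bar + t_crit * se)
round(ci, 2)
```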

Note

To test a directional alternative hypothesis (H1: less than, or H1: greater than), specify
the argument alternative="less" or alternative="greater". You can
specify just the initial letter.
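A short sketch of how the `alternative` argument behaves, using simulated data in place of `mydata1$Usage_GB` (an assumption):

```r
# Simulated stand-in for mydata1$Usage_GB (assumption)
set.seed(551)
usage_gb <- rnorm(45, mean = 18, sd = 6.5)

two_sided <- t.test(usage_gb, mu = 15)  # default: alternative = "two.sided"
greater <- t.test(usage_gb, mu = 15, alternative = "greater")
greater_short <- t.test(usage_gb, mu = 15, alternative = "g")  # initial letter only
less <- t.test(usage_gb, mu = 15, alternative = "less")

# The abbreviated form gives the same test
all.equal(greater$p.value, greater_short$p.value)

# The two one-sided p-values sum to 1, and the two-sided p-value
# is twice the smaller of them
greater$p.value + less$p.value
2 * min(greater$p.value, less$p.value)
two_sided$p.value
```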



Exercise
Question 1: It is claimed that the mean internet quota usage of the students is less than
20 GB. To investigate the claim, 45 students were selected at random. Test the claim at
the 10% level of significance.
Step 1: Hypothesis

H0: The mean usage of internet quota by the students is equal to 20 GB (μ = 20)


H1: The mean usage of internet quota by the students is less than 20 GB (μ < 20, claim)
Step 2: Level of significance
α = 0.10
Step 3: Find the p-value
t.test(mydata1$Usage_GB, mu = 20, alternative = "less", conf.level = 0.9)

One Sample t-test


data: mydata1$Usage_GB
t = -1.9351, df = 44, p-value = 0.02971
alternative hypothesis: true mean is less than 20
90 percent confidence interval:
-Inf 19.38699
sample estimates:
mean of x
18.12889
Step 4: Decision rule
Reject H0 if p-value ≤ α. Since p-value = 0.02971 < α = 0.10, we reject H0.
Step 5: Conclusion
At the 10% level of significance, there is sufficient evidence that the mean usage of internet quota by the students is less than 20 GB.
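The one-sided p-value and the upper confidence bound in this output can be checked from the reported values alone. A small sketch; the numbers are copied from the output above, so the results agree only up to rounding.

```r
# Values copied from the t.test output for Question 1
x_bar <- 18.12889  # sample mean
t_stat <- -1.9351  # t statistic
df <- 44           # degrees of freedom
mu0 <- 20          # hypothesised mean
alpha <- 0.10

# For H1: mean < mu0, the p-value is the lower-tail probability
p_val <- pt(t_stat, df)
p_val

# One-sided 90% upper bound: x_bar + t critical value * se
se <- (x_bar - mu0) / t_stat
upper <- x_bar + qt(1 - alpha, df) * se
upper

# Decision: TRUE means reject H0
p_val <= alpha
```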
Question 2: It is claimed that the mean daily mobile phone usage is more than 5 hours.
Using the dataset from Question 1, first assess the normality of the daily hours of mobile
phone usage. Then conduct a hypothesis test at the 5% significance level to determine
whether there is sufficient evidence to support the claim.

Note

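You can change the confidence level of the reported interval with the argument conf.level= . For a 10%
significance level, set the value to 0.9. For 1%, set the value to 0.99. The default is 0.95 (5% level of
significance). This argument changes only the confidence interval; the p-value is unaffected and is compared with α directly.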
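A minimal sketch, on simulated data standing in for `mydata1$Usage_GB` (an assumption), of what `conf.level` changes in the `t.test()` result:

```r
# Simulated stand-in for mydata1$Usage_GB (assumption)
set.seed(551)
usage_gb <- rnorm(45, mean = 18, sd = 6.5)

res90 <- t.test(usage_gb, mu = 15, conf.level = 0.90)
res99 <- t.test(usage_gb, mu = 15, conf.level = 0.99)

# The p-value does not depend on conf.level
c(res90$p.value, res99$p.value)

# The interval does: a higher confidence level gives a wider interval
res90$conf.int
res99$conf.int
```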



End of Slides
