Mock Test


Westminster International University in Tashkent

School of Business and Economics

4ECON006C_n
Introduction to Statistics and Data Science

2024 10:00 – 12:15 / 15:00 – 17:15

Venue: WIUT
Time allowed: 2 hours and 15 minutes

The following items are provided:

One Examination Question Paper (10 printed pages in total, inclusive of the cover page)

Candidates are permitted to bring into the examination room:

• Calculator

• Black or blue pen

• Ruler

Instructions:

Candidates must answer all questions. Round your solutions to at least 2 (two) decimals in Section C.

• The paper consists of 3 sections: Section A, Section B and Section C. The sections carry weights of
40, 10 and 50 respectively.
• All answers must be written in the designated area ONLY.
• All rough work must be written in this book ONLY.
• A line should be drawn through any rough work to indicate to the examiner that it is not part of the
work to be graded.
Written answers must be in black or blue pen only; pencil must be used for all drawings and sketches. The use
of pencils or erasable pens is not permitted without approval.

Ensure you have the correct number of pages on your examination question paper. Note to
Candidates: Please check the Module Code and Module Title to ensure that you are
attempting the correct examination. If in doubt, please contact the Invigilator.

DO NOT TURN OVER THIS PAGE UNTIL INSTRUCTED TO DO SO BY THE INVIGILATOR


Page 1 of 10
SECTION A:
Answer all questions in the answer booklet. Provide only the correct answer choice in this section.
No explanation or solution is needed. Each question is worth 1 mark.

1. How do you install a new package in R?


A. install_package("package_name")
B. install.packages[package_name]
C. packages.install("package_name")
D. install.packages("package_name")
E. add_package("package_name")

2. Which of the following functions is used to create a data frame object?


A. dataframe()
B. df()
C. c()
D. data.frame()
E. data_frame()

3. When calculating the average of a numeric vector ‘x’ in R, considering the presence of missing
values, which option below ensures an accurate calculation while excluding the missing values?
A. average(x)
B. mean(x, na.rm = FALSE)
C. mean(x, na.rm = omit)
D. average(x, na.rm = TRUE)
E. mean(x, na.rm = T)

4. What will be the output from the following R operations?


x <- list(1:4, c("a", "b"), c(T, F))
x[[1]][[2]]
A. 2
B. "a" "b"
C. 1234
D. 1
E. 1 2 3 4 "a" "b"

5. Consider a scenario where you have a dataset containing information about car colors, and you need
to analyze them in R. Which statement best describes the primary advantage of representing car
colors as a factor rather than a character variable?
A. Factors save memory compared to characters.
B. Factors allow for efficient sorting and ordering of levels.
C. Characters support additional statistical operations not available with factors.
D. Factors enable better compatibility with external data formats.
E. Characters provide faster access times in data retrieval.

6. When sub-setting data in R, what is the key distinction between using the base package and the dplyr
package?
A. The base package is faster for large datasets and it is very user friendly.
B. The base package supports conditional filtering more efficiently.
C. dplyr provides more concise syntax and a pipeline approach.
D. dplyr is limited to data frames and cannot handle other data structures.
E. The base package is better suited for data visualization tasks.

7. Suppose you have “Survived” as dependent and “Fare” as an independent variable from “titanic”
dataset. Which of the following functions is used to create a logistic model?
A. glm(Survived ~ Fare, data = titanic)
B. glm(Survived ~ Fare, data = titanic, family = "binomial")
C. glm(Survived ~ Fare, data = titanic, family = "logistic")
D. lm(Survived ~ Fare, data = titanic, family = "binomial")
E. lm(Survived ~ Fare, data = titanic, family = "logistic")

8. You are tasked with creating a visually engaging plot in R using the base package. To achieve this, you
aim to create a scatter plot with a regression line, integrate a legend, and emphasize a specific horizontal
point. Which sequence of functions and options would you use to accomplish this task in R?
A. lines(), plot(), legend(), abline()
B. abline(), legend(), lines(), plot()
C. plot(), legend(), abline(), lines()
D. plot(), abline(), lines(), legend()
E. C and D

9. In R, how do you create a sequence of numbers from 1 to 10 with a step size of 2?


A. seq(1, 10, by = 2)
B. sequence(1, 10, by = 2)
C. str(1, 10, by = 2)
D. range(1, 10, by = 2)
E. 1:10

10. Suppose you have a data frame named my_data with three columns: "ID", "Profit", and "QuantitySold".
Your goal is to filter out rows where the profit is above the median profit, and the quantity sold is
greater than 100 units. Which combination of base R functions and conditions would you use to
achieve this task?
A. subset(my_data, Profit > median(my_data$Profit) & my_data$QuantitySold > 100)
B. my_data[my_data$Profit > median(my_data$Profit) | my_data$QuantitySold > 100, ]
C. filter(my_data, Profit > median(my_data$Profit) ! my_data$QuantitySold > 100)
D. my_data %>% filter(Profit >= median(my_data$Profit) & QuantitySold >= 100)
E. select(my_data, Profit > median(my_data$Profit) | my_data$QuantitySold > 100)

11. What is the primary purpose of the %/% operator in R when used for numeric operations?
A. Exponential calculation
B. Integer division
C. Percentage calculation
D. Logical operators
E. none of the above

12. In base R, what distinguishes the frequency counting behavior between the barplot() and
hist() functions?
A. barplot() automatically calculates frequencies; hist() requires the use of the table() function.
B. barplot() requires manual grouping using table(), while hist() automatically counts frequencies.
C. hist() relies on the table() to count frequencies, while barplot() performs automatic binning.
D. Both barplot() and hist() automatically count frequencies without the need for additional functions.
E. barplot() and hist() both require the use of the table() function for frequency calculation.

13. When working with integers in R, what is the significance of using the 'L' suffix?
A. Counting the length of vector
B. Signifying a logarithmic calculation
C. Connoting a substantial integer
D. Indicating a literal integer value
E. Representing the list.

14. In the context of the plot() function in R, what does the ‘cex’ argument determine when creating a
scatter plot?
A. Plot color
B. Point size
C. X-axis label
D. Y-axis label
E. Line type

15. You are tasked with comparing the mean scores of two groups, group1 and group2, in an experiment
using a t-test in R. However, there's a twist – the assumption of equal variances is in question. Which
R function would you choose to perform a t-test, accounting for potential unequal variances?
A. t.test(group1, group2, paired = FALSE, var.equal = TRUE)
B. t.test(group1, group2, paired = TRUE, var.equal = FALSE)
C. t.test(group1, group2, alternative = "less")
D. t.test(group1, group2, alternative = "greater")
E. t.test(group1, group2, paired = FALSE, var.equal = FALSE)

The famous iris data set built into R gives the measurements in centimeters of the variables sepal length and
width and petal length and width, respectively, for 50 flowers from each species of iris. The species are iris
setosa, versicolor, and virginica. The variable names are Sepal.Length, Sepal.Width, Petal.Length,
Petal.Width and Species. Refer to the built-in iris dataset to answer Q16 - Q20.

16. Which of the following code(s) load(s) the iris dataset into R?
A. iris <- read.csv(iris.csv)
B. iris <- read.csv["iris.csv"]
C. data("iris")
D. iris <-readcsv("iris.csv")
E. none of the above

17. If the Species variable in our dataset is a character, which of the following code(s) convert(s) the Species
variable into a factor?
A. iris$Species <- factor(iris$Species)
B. iris$Species <- as.factor(iris$Species)
C. iris[,5] <- factor(iris[,5])
D. All the above
E. none of the above

18. Knowing that Petal.Length is the third variable of the iris dataset, how can we find the value of the 14th
observation for the Petal.Length variable only?
A. iris[14, ]
B. iris[14, 3]
C. iris$Petal.Length[14, ]
D. iris$Petal.Length[14]
E. B and D

19. Based on the iris dataset, which code helps you assess whether the petal length of the flower
differs across species?
A. summary(aov(Petal.Length ~ Species, data = iris))
B. summary(aov(Petal.Length & Species, data = iris))
C. summary(aov(iris$Petal.Length ~ iris$Species))
D. A and C
E. None of the above

20. Which of the following commands correctly creates a boxplot of Sepal.Length for each Species in the
iris dataset using base R?
A. boxplot(Sepal.Length ~ Species, data = iris)
B. boxplot(Sepal.Length ~ Species, dataset = iris)
C. plot(Sepal.Length ~ Species, data = iris)
D. boxplot(iris$Sepal.Length, iris$Species)
E. plot(Sepal.Length ~ Species, data = iris, type = "boxplot")

[END OF SECTION A]

ENTER YOUR ANSWERS TO THE MULTIPLE-CHOICE QUESTIONS IN THE BELOW TABLE!


NOTE: ONLY ANSWERS IN THE TABLE WILL BE MARKED!

Question № 1 2 3 4 5 6 7 8 9 10

Answers

Question № 11 12 13 14 15 16 17 18 19 20

Answers

SECTION B: Answer all fill-in-the-blank questions in the designated space below. For each question,
provide your concise, correct answer ONLY. R programming is case-sensitive. Any answer with
incorrect capitalization or spelling will be marked as incorrect (0 marks).

You are given a dataset df containing customer transaction data. Your goal is to clean the data, perform
exploratory analysis, fit a regression model, and visualize the results using R base packages.

1. To check for missing values in df, use the function:


sum(____________(df))
2. To remove all rows with missing values, use:
df_clean <- ____________(df)
3. To convert a categorical variable customer_type into a factor, use:
df_clean$customer_type <- ______________(df_clean$customer_type)
4. To compute summary statistics for all numerical variables, use:
_______________(df_clean)
5. To find the mean transaction amount after excluding the missing values, use:
mean(df$amount, __________ = TRUE)
6. To compute the standard deviation of amount after excluding the missing values, use:
______(df$amount, __________ = TRUE)
7. To fit a linear regression model predicting amount based on income and age, use:
model <- __________(amount ~ income + _______, data = df_clean)
8. To check model summary statistics, use:
__________(model)

Please be informed that Section B is 100% based on the R-bootcamp Week 5 (Data
Visualization) materials and is 100% argument-based!

SECTION C:

Answer ALL questions. Provide all steps of the solutions for full credit. Round your
calculations to at least three decimal values (e.g. 2.023).

1) An automobile financing company uses a rather complex credit rating system for car loans. The
questionnaire requires substantial time to fill out, taking sales staff time and risking alienating the customer.
The company decides to see whether four variables will reproduce the credit score reasonably accurately.
The variables are given below:
• Age in years
• Monthly income in thousand US dollars
• Credit Balance in US dollars
• Ethnicity: Asian, African American, Caucasian.
Data were obtained on a sample (with no evident biases) of 400 applications. The complicated rating score
was calculated and served as the dependent variable in a multiple linear regression.
The partial results from R output for regression are shown next:

a. Fit a multiple linear regression model to predict Credit Score with the following explanatory variables:
Age, Income, Balance, and Ethnicity. Provide your fitted models for each Ethnicity.
b. What is the predicted Credit Score of a customer who is Caucasian, 30 years old, with income of $45000
and credit balance of $500?
c. Provide interpretations for partial slopes.
d. Test for overall significance of the model at 1%. Provide your hypotheses, decision and interpretations.
Provide the degrees of freedom for the critical value.
e. Test whether Age and Income are significant factors in determining the Credit Score. Provide your
hypotheses, decision, and interpretations. Use α = 0.10.
f. Comment on the adjusted R-squared value.

2) To optimize the e-purchase process, the UI Manager of an e-commerce firm introduced a new interface. A
randomized trial involved 27 customers, split into a Control Group (12) using the standard UI and a Focus
Group (15) utilizing the new UI. Transaction times (in minutes) were recorded (see Table 1).
Customer   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Control    7  10   9  11   5   5  10   8   8  11   9   9
Focus      7   3   4   7   7   4   4   4   4   4   5   6   6   9   2
Table 1: Time spent on completing e-purchase transactions (in minutes)

Students are challenged to conduct an independent two-sample t-test, assessing whether the new UI significantly
reduces transaction time, assuming that the variances are equal. Use the 0.01 significance level. Use the
following R output for the independent two-sample t-test when needed.

a. Formulate the null and alternative hypotheses for testing whether the new UI significantly reduces
transaction time.
b. Provide a detailed list of assumptions and conditions necessary for valid inferences.
c. Using the t-statistic and sample means provided in the R output, calculate the standard error.
d. Based on the p-value, make a decision about the null hypothesis. Interpret the findings in practical terms
regarding the impact of the new UI on purchase transaction times.

3) We want to predict the probability that a movie will be profitable using a multiple logistic regression
model. The explanatory variables are opening-weekend revenue in mln USD (Opening), the cost of
producing the movie in mln USD (Budget), the number of theaters in which the movie was shown in the opening
weekend (Theaters), and the age restrictions of the movie (Rating). Two dummies were created for
Rating: RatingPG-13 and RatingR. PG-rated movies are taken as the reference group. The response
variable is binary, with 1 for profitable and 0 for not profitable. The multiple logit regression results
are given in the following table.

a. Suppose there is a new PG-13-rated movie which was released two weeks ago. What are the log odds and
odds ratio of being profitable if it cost 25 mln USD to produce, was shown in 3000 theaters in its opening
weekend and generated 10 mln USD in its opening weekend? Comment on the values of the ratios.
b. What is the probability that the new movie in part a will be profitable? Would you classify this movie
as profitable or not profitable if the cutoff point is 0.5?
c. Which independent variable(s) is/are significant at the 5% level, and why?
d. You applied your model to 20 movies. If the accuracy ratio is 0.7, the sensitivity ratio is 0.8, the specificity
ratio is 0.4 and the precision ratio is 0.8, create a confusion matrix.
i. True positive [2 marks]
ii. True negative [2 marks]
iii. False positive [2 marks]
iv. False negative [2 marks]

4) A retail company is interested in understanding if there are differences in the mean prices of men's, women's,
and children's athletic shoes. The company collected random samples of 18 shoes for each category and
performed an analysis of variance (ANOVA) on the rounded dollar amounts of the shoe prices. The
calculated between-group sum of squares (SSB) is 2210.1, and the within-group mean square
(MSW) is 99.67. Use α = 0.05.
a. Utilize the provided SSB and MSW to construct an ANOVA output table, including degrees of freedom,
mean squares, and F-statistic.
b. Formulate the null and alternative hypotheses for testing the equality of mean prices across the three
types of athletic shoes.
c. Define the rejection rule and provide your decision along with practical business interpretations,
considering a critical value of 3.23.

5) You are given a binary classification problem where a bank wants to predict whether a loan will default (1)
or not (0). You have the following dataset of actual vs. predicted values from a model:

Actual     1 1 0 0 1 0 1 1 0 1 1 0 1 1 0 1 0 1 1 0
Predicted  1 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1

a. Create the confusion matrix using the table above.


b. Using the confusion matrix, compute the following:
• Accuracy
• Sensitivity
• Specificity
• Precision

[END OF EXAM]

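SOME USEFUL FORMULAS AND TABLES

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}
\]

\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}
\]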
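As a minimal sketch, the four ratios above can be computed in base R from the cells of a confusion matrix. The counts below are hypothetical and are not taken from any question in this paper; the helper name classification_metrics is likewise an illustrative assumption.

```r
# Hypothetical confusion-matrix counts (illustrative only, not from this paper)
tp <- 40  # true positives
tn <- 30  # true negatives
fp <- 20  # false positives
fn <- 10  # false negatives

# Hypothetical helper: returns the four ratios as a named numeric vector
classification_metrics <- function(tp, tn, fp, fn) {
  c(
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp)
  )
}

metrics <- classification_metrics(tp, tn, fp, fn)
print(round(metrics, 3))
```

When the data arrive as two 0/1 vectors rather than as counts, table(actual, predicted) yields the same four cells.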

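Multiple Logit Model:

\[
\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon
\]

\[
P(Y = 1 \mid x_1, x_2, \dots, x_k) = p = \frac{e^{(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}{1 + e^{(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\]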
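The step from a fitted logit equation to a probability can be sketched in base R as follows. The coefficients and predictor values are hypothetical and do not correspond to any regression output in this paper; the 0.5 cutoff is the one mentioned in Section C.

```r
# Hypothetical coefficients and predictor values (illustrative only)
beta <- c(intercept = -1.5, x1 = 0.8, x2 = -0.2)
x    <- c(1, 2.0, 3.0)  # the leading 1 multiplies the intercept

log_odds <- sum(beta * x)      # linear predictor: b0 + b1*x1 + b2*x2
odds     <- exp(log_odds)      # odds of Y = 1
p        <- odds / (1 + odds)  # probability of Y = 1

# Classify as 1 when the probability reaches the 0.5 cutoff
predicted_class <- as.integer(p >= 0.5)

print(c(log_odds = log_odds, odds = odds, p = p))
```

The built-in plogis() function applies the same transformation directly to the log odds, so plogis(log_odds) should agree with p.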

[END OF PAPER]
