STATISTICS Ph.D.
QUALIFYING EXAM
STATISTICAL INFERENCE
January 2024
General Instructions: Write your ID number on all answer sheets.
Do not put your name on any of your answer sheets. Show all work and write
down all definitions and theorems used; without them a problem may
not be counted even if the answer is correct. Please write neatly so it is easy
to read your solution.
Problem 1. (10 points) Let X and Y be two random variables such
that Y has the binomial distribution with size N and probability π and,
given Y = y, X has the binomial distribution with size y and probability p.
N is known. It is given that π ∈ (0, 1) and p ∈ (0, 1) but otherwise these
are unknown parameters of interest. Find a minimal sufficient statistic for
(π, p).
Problem 2. (5 points) Let $X_1, X_2, \dots, X_n$ be iid according to a uniform
distribution $U(\theta - 1/2, \theta + 1/2)$. Find a complete sufficient statistic or show
that it does not exist.
Problem 3. (10 points) Formulate and prove the Cramer-Rao inequal-
ity.
Problem 4. (15 points). You are asked to prove the famous Characterization
Theorem of Linnik. Let X have a density $p_\theta(x) = p_0(x - \theta)$ where
$x \in \mathbb{R}^1$, $\theta \in \mathbb{R}^1$ and:
(i) $p_0(x)$ is continuously differentiable and positive; (ii) $\int x^2 p_0(x)\,dx \le \sigma^2$;
(iii) $|x| p_0(x) \to 0$ as $|x| \to \infty$.
Denote by $J_p(\theta) = \int (d\ln(p_0(x))/dx)^2 p_0(x)\,dx$ the Fisher information quantity.
Prove that $\min J_p = \sigma^{-2}$, where min is over all distributions satisfying
(i) - (iii). (Hint: It suffices to show that $1 \le \sigma^2 J_p$. Begin with
$1 = \int p_0(x)\,dx$, then use integration by parts, and then use the idea of the
proof of the Cramer-Rao inequality which is actually the Cauchy-Schwarz
inequality.)
Problem 5. (10 points) Let X be distributed according to Binomial
distribution B(n, p). Formulate two classical methods of finding UMVU
estimate and then use one of them to find UMVU estimate of $p^2$. Do not
forget to write down all involved definitions and theorems.
Problem 6. (10 points). Suggest a Bayes estimate for the probability
of success p of a binomial random variable B(n, p). Explain its meaning for
small and large n. Hint: Think about an appropriate conjugate prior.
Problem 7. (10 points) Consider a sample of size n from the
density $f(x|\theta) = \theta(1 - x)^{\theta - 1} I(x \in (0, 1))$ and θ ∈ [1, ∞). Find the MLE.
Problem 8. (10 points). Consider a sample of size n from Pareto(θ, ν)
distribution with density $f(x|\theta, \nu) = \theta \nu^{\theta} x^{-\theta - 1} I(x \ge \nu)$, θ > 0, ν > 0. Find
the LRT for testing $H_0$: θ = 1 and ν is unknown versus $H_a$: θ ≠ 1 and ν
is unknown.
Problem 9. (10 points) Consider a sample of size n from the density
$(a/\theta)(x/\theta)^{a-1} I(x \in (0, \theta))$. Here a > 0 is known. Use the CDF of the minimal
sufficient statistic to find a confidence interval for θ with the confidence
coefficient 1 − α.
Problem 10. (10 points). Let $p_0$ and $p_1$ be two distinct densities for
X. Consider a sample of size n from X given density $p_0$. Prove that
$$P_{p_0}\left( \prod_{i=1}^{n} \frac{p_1(X_i)}{p_0(X_i)} < 1 \right) = 1 + o_n(1).$$
Hint: Recall that this is the first step in asymptotic theory of MLE.
STATISTICS Ph.D. QUALIFYING EXAM
Probability
January 2024
General Instructions: Write your ID number on all answer sheets. Do
not put your name on any of your answer sheets. Show all work/proofs/references/definitions.
Please write neatly so it is easy to read your solution.
Problem 1. (15 points) Consider a Brownian motion $\{B_t, t \ge 0\}$. Let
$0 = t_0 < \dots < t_n$. Prove that for any a > 0
$$P\left( \max_{k=0,1,\dots,n} B_{t_k} > a \right) \le 2P(B_{t_n} > a).$$
Problem 2. (5 points) Formulate the dominated convergence theorem.
Problem 3. (5 points) Give the definition of the Lebesgue integral.
Hint: Recall several steps in the definition.
Problem 4. (5 points) Formulate Radon-Nikodym Theorem. If in its
assumptions you use some notions - define them.
Problem 5. (10 points) Let $f_1, f_2, \dots$ be real-valued Borel measurable
and uniformly integrable functions on the probability space $(\Omega, \mathcal{F}, P)$. Show
that
$$\int_{\Omega} (\liminf_{n} f_n)\,dP \le \liminf_{n} \int_{\Omega} f_n\,dP.$$
Hint: Give definitions of Borel measurable and uniformly integrable func-
tions. In the proof you may use Fatou’s lemma without proof, only formulate
it.
Problem 6. (5 points) Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\mathcal{G}$ be a sub
sigma-field of $\mathcal{F}$. Fix $B \in \mathcal{F}$. Define the conditional probability of B given
$\mathcal{G}$ and prove its uniqueness.
Problem 7. (5 points) Formulate and prove Borel-Cantelli Lemma.
Problem 8. (10 points) Let $\{X_n, \mathcal{F}_n\}$ be a martingale and let g be a convex
and increasing function from R to R. Suppose that $g(X_n)$ is integrable for
all n. What can you say about $\{g(X_n), \mathcal{F}_n\}$? Prove your assertion. Note:
Please give definitions of all notions mentioned in the problem.
Problem 9. (35 points) Formulate Lindeberg’s Central Limit Theorem
and prove it.
Problem 10. (5 points) Consider two sigma-fields $\mathcal{G}_1 \subset \mathcal{G}_2$, simplify
$E\{E\{X|\mathcal{G}_1\}|\mathcal{G}_2\}$, and prove your assertion.
QE ID (given by Kisa)
January 2024 Qualifying Exam in Linear Models
January 3rd, 2024
Instruction:
• This is a closed-book test.
• There are four questions; each has multiple parts.
• Answer each question as fully as possible.
• Show and justify all steps of your solutions.
• Refer clearly to any known results that you are using, stating such results precisely.
• Show how the assumptions of a result you are using are satisfied in your application of the result.
• Indicate how the assumptions given in the question are used in the solution.
• Write your solutions on the blank sheets of paper that are provided.
• Write your QE ID number (given to you by Kisa) on all answer sheets. DO NOT put your name,
UTD ID, or any other identifying information on any of your answer sheets.
• On each sheet, identify which question and part is being answered.
• Begin each question on a new sheet.
• When finished, arrange your sheets in order, number each sheet, and be sure that your QE ID number
(given by Kisa) is on each sheet.
• Although the notations used in Q1, Q2, Q3, and Q4 are similar, they are independent, standalone
problems.
• The total possible points is 100.
Q1. (20%) Let $Y \sim N_n(\mu 1_n, \Sigma)$, where $\Sigma = (1 - \rho)I_n + \rho J_n$ and $\rho > -1/(n - 1)$. Matrix $I_n$ is the n × n
identity matrix and $J_n$ is the n × n matrix of 1’s.
(a) (10%) Find ρ such that the sample mean $\bar{Y}$ and $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ are not independent.
(b) (10%) When ρ = 0, find the distribution of $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$.
Q2. (20%) Consider a linear regression model
$$y_i = \beta_0 + \sum_{j=1}^{k} x_{ij}\beta_j + \varepsilon_i,$$
where $X = (x_1^\top, \dots, x_n^\top)^\top$ is an $n \times (p + 1)$ full-rank matrix with $x_i = (1, x_{i1}, \dots, x_{ip})$ for $i = 1, \dots, n$,
and $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ for $i = 1, \dots, n$ with $\sigma^2$ unknown. Denote by $\hat\beta = (\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p)^\top$ the maximum
likelihood estimator (MLE) of $\beta = (\beta_0, \beta_1, \dots, \beta_k)^\top$, by $\hat{y}_i$ the fitted value of the ith observation, and
by $e_i = y_i - \hat{y}_i$ the corresponding residual.
(a) (10%) Construct a level α test for $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ versus $H_1: \beta_j \ne 0$ for some
$j = 1, \dots, k$.
(b) (10%) Determine the smallest sample size n such that $\mathrm{Var}(\hat\beta_j) < 0.1$ for all $j = 1, \dots, k$.
Q3. (25%) Consider the same setting as in Q2. We further denote $H = X(X^\top X)^{-1}X^\top$ as the projection matrix.
(a) (5%) Let $\hat\beta_{(i)} = (\hat\beta_{0(i)}, \hat\beta_{1(i)}, \dots, \hat\beta_{p(i)})^\top$ be the MLE of $(\beta_0, \beta_1, \dots, \beta_k)^\top$ using all observations
except the ith one. Show that $\mathrm{Var}(\hat\beta_{j(i)}) \ge \mathrm{Var}(\hat\beta_j)$ for all $j = 0, 1, \dots, k$. Hint: show that both
$\hat\beta$ and $\hat\beta_{(i)}$ are linear unbiased estimators of β.
(b) (10%) Denote by $\hat{y}_{(i)}$ the fitted value of the response for the ith data point, using all observations
except the ith one. Using the fact that $\hat\beta - \hat\beta_{(i)} = (X^\top X)^{-1} x_i^\top e_i/(1 - h_{ii})$, where $h_{ii}$ is the ith diagonal
element of H, find the distribution of the difference of the fitted values $\hat{y}_i - \hat{y}_{(i)}$.
(c) (10%) Denote $S_{(i)} = \sum_{\ell \ne i} (y_\ell - x_\ell \hat\beta_{(i)})^2/(n - p - 1)$. Show that $S_{(i)}$ and $y_i - \hat{y}_{(i)}$ are independent and
that the externally studentized residual $(y_i - \hat{y}_{(i)})/\sqrt{S_{(i)}(1 - h_{ii})}$ follows a t-distribution with
$n - p - 1$ degrees of freedom.
Q4. (35pts) Given a linear regression model without intercept:
$$y_i = \beta_1 x_{i1} + \varepsilon_i,$$
where $\beta_1$ is known to be nonnegative, $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ for $i = 1, \dots, n$ and $\sigma^2 > 0$ is known. Assume
that $\sum_{i=1}^{n} x_{i1}^2 = 1$ and $\sum_{i=1}^{n} x_{i1} = 0$. Consider the following penalized loss function:
$$Q(\beta_1, p, \lambda) = \sum_{i=1}^{n} (y_i - \beta_1 x_{i1})^2 + \lambda g_p(\beta_1),$$
where 0 ≤ λ < ∞, 0 ≤ p < ∞,
$$g_p(\beta_1) = \begin{cases} \beta_1^p, & \beta_1 \ge 0; \\ \infty, & \beta_1 < 0, \end{cases}$$
and define $0^0 = 1$. Let $\hat\beta_1(p, \lambda)$ be the minimizer of $Q(\beta_1, p, \lambda)$.
(a) (5pts) Show that $\hat\beta_1(p, \lambda) \ge 0$ for all 0 < λ < ∞.
(b) (10pts) Find $\hat\beta_1(0, \lambda)$ and derive its distribution.
(c) (10pts) Determine $\hat\beta_1(1, \lambda)$ and $\hat\beta_1(2, \lambda)$ and show that both estimators are biased estimators of $\beta_1$
when λ > 0.
(d) (10pts) Given that the data $\{(x_{i1}, y_i)\}_{i=1}^{n}$ have positive sample correlation C > 0, find the region
of λ such that $\hat\beta_1(1, \lambda) \le \hat\beta_1(2, \lambda)$. What if C ≤ 0? Hint: $C = \sum_{i=1}^{n} y_i x_{i1}$.
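You may use the following facts:
• $$\frac{dA^\top \beta}{d\beta} = A, \qquad \frac{d\beta^\top A \beta}{d\beta} = 2A\beta \quad (A \text{ is symmetric}).$$
• if all inverses exist,
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1} = \begin{pmatrix} A_{11}^{-1} + B_{12}B_{22}^{-1}B_{21} & -B_{12}B_{22}^{-1} \\ -B_{22}^{-1}B_{21} & B_{22}^{-1} \end{pmatrix},$$
where $B_{22} = A_{22} - A_{21}A_{11}^{-1}A_{12}$, $B_{21} = A_{21}A_{11}^{-1}$, and $B_{12} = B_{21}^\top$.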
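As a numerical companion to the penalized loss in Q4, the following is a minimal sketch that evaluates $Q(\beta_1, p, \lambda)$ and locates its minimizer by a plain grid search. It is not a derivation and not the intended solution method. The function names `penalized_loss` and `grid_minimizer`, the grid bounds, and the three data points are all hypothetical; the data were chosen only so that $\sum x_{i1} = 0$ and $\sum x_{i1}^2 = 1$ hold.

```python
def penalized_loss(beta1, x, y, p, lam):
    """Q(beta1, p, lambda) with g_p(beta1) = beta1**p for beta1 >= 0, +inf otherwise."""
    if beta1 < 0:
        return float("inf")
    rss = sum((yi - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    # Python evaluates 0.0 ** 0 as 1.0, which matches the convention 0^0 = 1.
    return rss + lam * beta1 ** p


def grid_minimizer(x, y, p, lam, upper=5.0, step=0.001):
    """Approximate minimizer of Q over an evenly spaced grid on [0, upper]."""
    grid = [k * step for k in range(int(upper / step) + 1)]
    return min(grid, key=lambda b: penalized_loss(b, x, y, p, lam))


# Hypothetical design: sum(x) = 0 and sum(x^2) = 1, as the question assumes.
r = 2 ** -0.5
x = [-r, 0.0, r]
y = [0.0, 1.0, 3.0]

beta_p1 = grid_minimizer(x, y, p=1, lam=1.0)
beta_p2 = grid_minimizer(x, y, p=2, lam=1.0)
print("p = 1:", beta_p1)
print("p = 2:", beta_p2)
```

A grid search is used here only because it makes no assumption about the shape of the penalty; comparing its output against a closed-form answer for several values of λ is one way to check a derivation.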
Qualifying Exam January 2024 – Statistical Methods
Instructions
• NOTE: You are not allowed to use the internet except for downloading the data, uploading your report,
or using SAS OnDemand for Academics. To use it for any other purpose, ask the proctor.
• Log on to eLearning to download the data. Let the proctor know if you have any problems with this
step.
• You can use any software of your choice. You can use the lab machines or your own laptop.
• Your report should clearly explain the steps, results, conclusions, and justification for the conclusions.
Also include your codes (with brief but adequate comments explaining each step) and outputs (ONLY
relevant parts, highlighted wherever possible). Do not attach the parts of the output that are not used
in answering the questions.
• Submit an electronic report in eLearning. Upload only one single PDF file with the whole report. DO
NOT submit separate files for codes or outputs.
• Write your QE ID number provided by Kisa on all answer sheets. DO NOT put your
name, UTD ID, or any other identifying information on any of your answer sheets. DO
NOT email or share your exam with anyone.
Problems
1. Parts (a) - (d) are related to the following “Los Angeles Smoking” study. [18 points]
Researchers in Los Angeles conducted a study to explore the relationship between maternal smoking
during pregnancy and higher blood pressure in children. The study involved 200 mothers and their
newborns, focusing on each newborn’s systolic blood pressure. The following is the full model employed
in the study:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon, \quad \epsilon \sim N(0, \sigma^2),$$
where Y represents the infant’s blood pressure in mmHg. The model considers the infant’s age in
weeks ($x_1$), weight in kg ($x_2$), and two variables for the mother’s smoking status: passive exposure to
smoking ($x_3$), and active smoking during pregnancy ($x_4$), where “Passive exposure to smoking” refers
to non-smoking mothers regularly breathing in secondhand smoke from others. The study’s findings,
including the analysis of the full model and two reduced models focusing on two different variable sets,
are detailed in Figures 1 and 2. Additionally, Figure 3 illustrates the residual plots for the full model.
Figure 1: R output of regression analysis for the full model
Figure 2: ANOVA tables for the full model (including $x_1$, $x_2$, $x_3$, and $x_4$), the reduced model 1 (including
$x_1$ and $x_2$), and the reduced model 2 (including $x_3$ and $x_4$)
Figure 3: Residual plots for the full model
(a) Interpret all the estimated coefficients in the full model. [3 points]
(b) Calculate and interpret the coefficient of determination, $R^2$, for the full model (Show your work
step-by-step to compute this value). [3 points]
(c) Enumerate all assumptions underlying the model. For each assumption, comment on whether it
is satisfied. In cases where an assumption appears to be violated, suggest appropriate remedial measures
to address these discrepancies. [6 points]
(d) Use appropriate sum of squares or mean squares in Figure 2 to test whether the infant’s systolic
blood pressure is associated with the mother’s smoking status given that the age and weight of the
infant are in the model at the significance level of 0.05 (Show your work step-by-step). [6 points]
Parts (e) and (f) are related to the following “Dallas Smoking” study. [12 points]
A researcher in Dallas hypothesized that the effect of age on the systolic blood pressure of an infant
may vary depending on the mother’s smoking status and racial background. To
investigate this, data were gathered from n = 1,236 local mother-newborn pairs. The dataset, provided
in the file smoking_dallas.csv, includes the newborns’ systolic blood pressure and nine explanatory
variables as follows.
No.  Variable name                  Description
1    gestation_days                 Length of gestation in days
2    gender                         Infant’s sex: 1 for male and 2 for female
3    baby_weight_kg                 Infant’s birth weight in kilograms
4    mother_race                    Mother’s race: 1 for White, 2 for Black, and 3 for Asian
5    mother_age_years               Mother’s age in years at termination of pregnancy
6    mother_smoking_status          Does mother smoke? 0 for never, 1 for smokes now, 2 for until current pregnancy, and 3 for once did, not now
7    mother_smoking_freq            Number of cigarettes smoked per day: 0 for never, 1 for 1-4, 2 for 5-9, 3 for 10-14, 4 for 15-19, 5 for 20-29, 6 for 30-39, 7 for 40-60, 8 for 60+, 9 for unknown
8    passive_exposure_to_smoking    1 for passive exposure to smoking and 0 otherwise
9    baby_age_weeks                 Infant’s age in weeks when measurement taken
Y    baby_blood_pressure_mmhg       Infant’s blood pressure in mmHg
(e) Propose a new model that allows the slope of age to differ among the four levels of the mother’s
smoking status. Write down the null and alternative hypotheses that would be used to address
this research question, and draw your conclusion based on the new data. [8 points]
(f) Can you improve the analysis in (e) by applying a transformation on the blood pressure? If yes,
investigate the utility of the transformation and show the model has a better fit; if no, explain
why and briefly describe a solution. [4 points]
2. In a two-sample problem, two independent simple random samples with sample sizes of n1 and n2
are drawn from two populations. To test the equality of the two population variances, assuming both
populations are normal, one can conduct a two-sided F-test with the following test statistic:
$$F = \frac{s_1^2}{s_2^2},$$
where $s_1^2$ and $s_2^2$ are the two sample variances. We would like to estimate the true type I error rates at
the nominal level of α = 0.05 using simulation for three distributions: (1) Gaussian, (2) exponential,
(3) $t_{20}$. For simplicity, assume equal sample sizes, i.e., n1 = n2 = n. Set n to be 10 and 100, and for
each value of the sample size, generate 1,000 replicates and report the estimated type I error rate. Set
any unknown parameters of these distributions to whatever values you believe will serve the purpose.
At the end, you will obtain 6 type I error rates (resulting from 3 distributions and 2 sample sizes).
Compare them with the nominal level and comment on the reason(s) for differences, if any. [20 points]
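The simulation described in Problem 2 could be organized along the following lines. This is only a sketch, assuming NumPy and SciPy are available; the helper name `type1_error_rate`, the seed, and the parameter choices (standard normal, exponential with scale 1) are illustrative assumptions and are not prescribed by the problem.

```python
import numpy as np
from scipy import stats


def type1_error_rate(sampler, n, n_rep=1000, alpha=0.05, rng=None):
    """Rejection rate of the two-sided F-test when both samples share one distribution."""
    rng = np.random.default_rng() if rng is None else rng
    # Two-sided critical values from the F(n - 1, n - 1) reference distribution.
    lower = stats.f.ppf(alpha / 2, n - 1, n - 1)
    upper = stats.f.ppf(1 - alpha / 2, n - 1, n - 1)
    x1 = sampler(rng, (n_rep, n))  # one replicate per row
    x2 = sampler(rng, (n_rep, n))
    f_stat = x1.var(axis=1, ddof=1) / x2.var(axis=1, ddof=1)
    return float(np.mean((f_stat < lower) | (f_stat > upper)))


# Assumed parameter choices; H0 (equal variances) holds in every case.
samplers = {
    "gaussian": lambda rng, size: rng.normal(0.0, 1.0, size),
    "exponential": lambda rng, size: rng.exponential(1.0, size),
    "t20": lambda rng, size: rng.standard_t(20, size),
}

rng = np.random.default_rng(2024)
results = {
    (name, n): type1_error_rate(draw, n, rng=rng)
    for name, draw in samplers.items()
    for n in (10, 100)
}
for (name, n), rate in results.items():
    print(f"{name:12s} n={n:3d}  estimated type I error = {rate:.3f}")
```

Because both samples in each replicate come from the same distribution, every rejection is a type I error, so the rejection proportion estimates the true level directly.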
3. Consider the mpg data in mpg.csv. We would like to perform inference on the median (θ) of mpg. Use
nonparametric bootstrap with at least 1,000 resamples and report the following: [20 points]
• histogram and Q-Q plot of the bootstrap distribution of θ̂ with comments on the shape of the
distribution
• bias and standard error of θ̂
• 2.5th and 97.5th percentiles of the sampling distribution of θ̂
• 2.5th and 97.5th percentiles of the sampling distribution of θ̂ − θ
• 95% confidence interval (CI) for θ using the Studentized and percentile methods. For the Studentized
method, use the following standard error formula: $\sigma_{median} = 1.253\,\sigma/\sqrt{n}$.
• Comparison of the above two CIs with the CI assuming the normality, which can be computed
directly using the standard error formula above.
(Important Note: You are required to write your own code for bootstrap and not use any package in
R, e.g., boot, or macro in SAS, e.g., %BOOT)
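A hand-written nonparametric bootstrap for the median, using only the Python standard library, might look like the sketch below. The `mpg` values are made-up placeholders and are NOT the contents of mpg.csv; the helper names (`se_median`, `percentile`, `bootstrap_median`) and the seed are assumptions, and the histogram and Q-Q plot are left out.

```python
import math
import random
import statistics


def se_median(sample):
    """Standard error of the median from the stated formula 1.253 * s / sqrt(n)."""
    return 1.253 * statistics.stdev(sample) / math.sqrt(len(sample))


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def bootstrap_median(data, n_boot=1000, seed=0):
    """Resample with replacement; return theta_hat, sorted medians, sorted t-statistics."""
    rng = random.Random(seed)
    theta_hat = statistics.median(data)
    medians, t_stats = [], []
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))
        med = statistics.median(resample)
        medians.append(med)
        t_stats.append((med - theta_hat) / se_median(resample))
    return theta_hat, sorted(medians), sorted(t_stats)


# Placeholder data for illustration only (not the exam's mpg.csv).
mpg = [18.0, 15.0, 18.0, 16.0, 17.0, 15.0, 14.0, 14.0, 14.0, 15.0, 24.0,
       22.0, 18.0, 21.0, 27.0, 26.0, 25.0, 24.0, 25.0, 26.0, 21.0]

theta_hat, medians, t_stats = bootstrap_median(mpg)
bias = statistics.mean(medians) - theta_hat
std_err = statistics.stdev(medians)
ci_percentile = (percentile(medians, 0.025), percentile(medians, 0.975))
# Studentized (bootstrap-t) interval: the upper t quantile sets the lower limit.
ci_studentized = (theta_hat - percentile(t_stats, 0.975) * se_median(mpg),
                  theta_hat - percentile(t_stats, 0.025) * se_median(mpg))
print(bias, std_err, ci_percentile, ci_studentized)
```

The percentiles of $\hat\theta - \theta$ can be approximated from the same resamples by subtracting `theta_hat` from each bootstrap median before taking percentiles.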
4. In an agricultural trial, 3 varieties of cotton (Factor A) were grown in 4 locations (Factor B) and the
yield (response Y ) was recorded. There were 3 replicates of each variety at each location for a total of
3 × 4 × 3 = 36 observations. Investigators treated both variety and location to be random factors in
their analysis of the yield. Use the information below to answer the following questions. [30 points]
Source    DF    Sum of Squares    Mean Square    F Value    Pr > F
Model     11       2753.638889     250.330808      13.23    <.0001
Error     24        454.000000      18.916667
Total     35       3207.638889
R-Square    Coeff Var    Root MSE      Y Mean
0.858463     18.27023    4.349329    23.80556
Source    DF    Type III SS    Mean Square    F Value    Pr > F
A          2     676.055556     338.027778      17.87    <.0001
B          3     230.305556      76.768519       4.06    0.0182
A*B        6    1847.277778      307.87963      16.28    <.0001
(a) Write down the statistical model, and clearly define all model components and assumptions. [4
points]
(b) How many variance parameters are there in the model? What are the estimated variances? [6
points]
(c) For testing the main effect of variety (Factor A), state the null and the alternative hypotheses.
Show steps to calculate the test statistic and specify the null distribution. [6 points]
(d) Calculate a 95% confidence interval for the average cotton yield. [6 points]
(e) Construct a 95% confidence interval for the variance component associated with variety. [8 points]
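For orientation, the mean squares in the table above can be turned into method-of-moments (ANOVA) variance component estimates as sketched below. This assumes the usual balanced two-way random-effects model with interaction, in which E(MSE) = σ², E(MS_AB) = σ² + nσ²_AB, E(MS_A) = σ² + nσ²_AB + bnσ²_A, and E(MS_B) = σ² + nσ²_AB + anσ²_B; the variable names are illustrative, and other estimation methods (for example REML) can give different values.

```python
# Design sizes stated in the problem: 3 varieties, 4 locations, 3 replicates.
a, b, n = 3, 4, 3

# Mean squares copied from the ANOVA output above.
ms_a, ms_b, ms_ab, ms_e = 338.027778, 76.768519, 307.87963, 18.916667

# Method-of-moments estimates obtained by solving the expected mean squares.
estimates = {
    "error": ms_e,
    "A*B": (ms_ab - ms_e) / n,
    "A": (ms_a - ms_ab) / (b * n),
    "B": (ms_b - ms_ab) / (a * n),
}

# A negative moment estimate is commonly reported as zero.
truncated = {name: max(value, 0.0) for name, value in estimates.items()}

# With both factors random, the main effect of A is tested against MS_AB, not MSE.
f_a = ms_a / ms_ab
df_a = (a - 1, (a - 1) * (b - 1))

for name, value in estimates.items():
    print(f"{name:6s} raw = {value:10.4f}  truncated = {truncated[name]:10.4f}")
print("F for A:", round(f_a, 4), "on df", df_a)
```

Note that the F values printed in the table divide by the error mean square, which is the fixed-effects convention; under the random-effects assumption used in this sketch the denominator for the main effects changes.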