Biostatistics
Lecture 6
Estimation/Confidence Intervals
                2024-1 Fall Semester
               Instructor: Min Jin Ha
  Department of Health Informatics and Biostatistics
         Graduate School of Public Health
                 Yonsei University
Reading
• Pagano and Gauvreau, Chapter 9
Statistical Inference
• Having investigated the theoretical properties of the distribution of
  sample means, we are ready to take the next step and apply this
  knowledge to the process of statistical inference
• Aim: estimate some characteristic of a continuous random variable
  (e.g., mean) using information contained in a sample of observations
Interval Estimation
• Point estimation: using sample data to calculate a single number to
  estimate the parameter of interest
   • Sample mean 𝑥̅ to estimate the population mean 𝜇
• The problem is that two different samples are very likely to result in
  different sample means → there is some degree of uncertainty
• A point estimate does not provide any information about the inherent
  variability of the point estimator
• From CLT, we know that 𝑥̅ is more likely to be near the true population
  mean if it is based on large sample
• Interval estimation provides a range of reasonable values that are intended
  to contain the parameter of interest with a certain degree of confidence
What is a Confidence Interval?
• A confidence interval provides a range of reasonable values that are
  intended to contain the parameter of interest with a certain degree of
  confidence. It often takes the form
                           Point estimate ± margin of error
   and is written
   (point estimate – margin of error, point estimate + margin of error)
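A tiny numeric sketch of this form in R (the numbers below are made up purely for illustration):

```r
# Hypothetical values, only to illustrate "point estimate +/- margin of error"
point_estimate  <- 147.4   # e.g., a sample mean
margin_of_error <- 1.8     # whatever the margin of error works out to be
c(point_estimate - margin_of_error,
  point_estimate + margin_of_error)   # (lower limit, upper limit)
```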
Caveat
• For illustration, we start by assuming 𝜎 is known.
   •   When is 𝜎 known?
   •   Almost never!
   •   However, it’s easier to understand if we assume that to start.
   •   By the end of the class, we’ll get rid of this assumption
Two-sided 95% Confidence Intervals (𝜎 known)
• A random variable 𝑋 has mean 𝜇 and standard deviation 𝜎
• The CLT states that
                        𝑍 = (𝑋̅ − 𝜇)/(𝜎/√𝑛) ∼ 𝑁(0,1)
• From Lecture 5, we know 𝑃(−1.96 < 𝑍 < 1.96) = 0.95
• Equivalently, 𝑃(−1.96 < (𝑋̅ − 𝜇)/(𝜎/√𝑛) < 1.96) = 0.95
• Given this, we are able to manipulate the inequality inside the parentheses
  without altering the probability statement to the form
                             𝑃(𝐿 < 𝜇 < 𝑈) = 0.95
                        Show how L and U are derived
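A sketch of one way to do the algebra, starting from the probability statement above and isolating 𝜇 in the middle:

```latex
% Starting point (the CLT statement above)
P\left(-1.96 < \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95
% Multiply all three parts by \sigma/\sqrt{n}
\iff P\left(-1.96\,\frac{\sigma}{\sqrt{n}} < \bar{X}-\mu < 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95
% Subtract \bar{X}, multiply by -1, and reverse the inequalities
\iff P\left(\bar{X}-1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X}+1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95
```

Reading off the endpoints gives 𝐿 = 𝑋̅ − 1.96 𝜎/√𝑛 and 𝑈 = 𝑋̅ + 1.96 𝜎/√𝑛, the form shown on the next slide.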
Two-sided 95% Confidence Intervals (𝜎 known)
               𝑃(𝑋̅ − 1.96 𝜎/√𝑛 < 𝜇 < 𝑋̅ + 1.96 𝜎/√𝑛) = 0.95
  • The quantities 𝑋̅ − 1.96 𝜎/√𝑛 and 𝑋̅ + 1.96 𝜎/√𝑛 are the 95% confidence limits
    for the population mean 𝜇
     • we are 95% confident that the interval will cover 𝜇
      • If we were to select 100 random samples from the population and use these samples to
        calculate 100 CIs for 𝜇, approximately 95 of the CIs would cover the true population mean 𝜇
        and about 5 would not
  • Wrong Interpretations:
     • There is 95% chance that 𝜇 lies in the interval
      • Why is this wrong? 𝜇 is fixed and does not move; it is the interval that
        varies from sample to sample
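A minimal simulation sketch of the frequency interpretation above (the true mean, 𝜎, sample size, and number of replicates are arbitrary choices, not values from the lecture):

```r
set.seed(1)                      # for reproducibility
mu <- 10; sigma <- 2; n <- 50    # hypothetical true mean, sd, and sample size
covered <- replicate(1000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  lower <- mean(x) - 1.96 * sigma / sqrt(n)
  upper <- mean(x) + 1.96 * sigma / sqrt(n)
  lower < mu && mu < upper       # TRUE if this interval covers the true mean
})
mean(covered)                    # proportion of intervals covering mu; close to 0.95
```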
Two-sided (1 − 𝛼)×100% Confidence Intervals (𝜎 known)
• A generic confidence interval for 𝜇 can be obtained
• Let 𝑧_{𝛼/2} be the upper 𝛼/2 quantile, i.e., 𝑃(𝑍 > 𝑧_{𝛼/2}) = 𝛼/2 = 𝑃(𝑍 < −𝑧_{𝛼/2})
• The generic form of the (1 − 𝛼)×100% CI for 𝜇 is
                    (𝑋̅ − 𝑧_{𝛼/2} 𝜎/√𝑛 ,  𝑋̅ + 𝑧_{𝛼/2} 𝜎/√𝑛)
• When 𝛼 = 0.05, the (1 − 𝛼)×100% CI is the 95% CI that we found
99% Confidence Interval
• For a 99% interval, we need the z-value that cuts off the top 0.5% or
  0.005 of the distribution. Which value is this?
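One way to look this value up is R's standard normal quantile function, qnorm():

```r
qnorm(0.995)          # z-value cutting off the top 0.005 of N(0,1); about 2.576
# more generally, for a (1 - alpha)*100% CI:
alpha <- 0.01
qnorm(1 - alpha / 2)
```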
When can we use this CI?
• The CI given by (𝑋̅ − 𝑧_{𝛼/2} 𝜎/√𝑛 ,  𝑋̅ + 𝑧_{𝛼/2} 𝜎/√𝑛) is safe to use in the
  following circumstances when 𝜎 is known
   • X is normal (regardless of sample size)
   • X is non-normal but the sample size is large
• It is typically not safe to use this CI when the sample size is small and
  X is not a normal random variable.
How can we get a narrower CI?
                    (𝑋̅ − 𝑧_{𝛼/2} 𝜎/√𝑛 ,  𝑋̅ + 𝑧_{𝛼/2} 𝜎/√𝑛)
• Known 𝜎
• Decrease the margin of error
   1. Compromise on our level of confidence, e.g., 90% interval
   2. Increase the sample size n!
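A small sketch of both options in R, using illustrative values 𝜎 = 6 and n = 31 just to show the direction of each effect:

```r
sigma <- 6                          # illustrative known sd
n <- 31                             # illustrative sample size
qnorm(0.975) * sigma / sqrt(n)      # margin of error for a 95% CI
qnorm(0.950) * sigma / sqrt(n)      # option 1: a 90% CI has a smaller margin
qnorm(0.975) * sigma / sqrt(4 * n)  # option 2: quadrupling n halves the margin
```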
What if 𝜎 is unknown?
                    (𝑋̅ − 𝑧_{𝛼/2} 𝜎/√𝑛 ,  𝑋̅ + 𝑧_{𝛼/2} 𝜎/√𝑛)
• The CI cannot be computed if 𝜎 is unknown
• We use the sample standard deviation 𝑠 as an estimate of 𝜎
• In practice we never know 𝜎, and if we simply replace 𝜎 with 𝑠, then
   • We can no longer rely directly on the CLT result above
What do you do when 𝜎 is unknown?
          • While working for the Guinness brewery in Dublin, William
             Sealy Gosset published a paper on the t distribution, which
             became known as Student’s t distribution.
         (He published under “Student” because the brewery didn’t
         allow him to use his own name)
         • The t distribution is appropriate for constructing a
           confidence interval for the mean when we need to
           account for the additional variability due to estimating 𝜎
           with s
Student’s t-distribution
• The Student’s t-distribution accounts
  for the additional variability due to
  estimating 𝜎 with 𝑠
• The t distribution looks a lot like the normal
except that it has fatter tails
• The parameter of the t distribution is called the degrees of freedom (df)
• As the df (denoted by 𝜈) gets larger, the t distribution looks
  more and more like the normal
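One quick way to see this in R is to compare t quantiles with the corresponding normal quantile as the df grow:

```r
# 97.5th percentiles: the t values shrink toward the normal value as df increase
qt(0.975, df = c(2, 5, 10, 30, 100, 1000))
qnorm(0.975)    # about 1.96, the limiting value
```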
t distribution for CI
• The df measure the amount of information available in the data to
  estimate 𝜎
• The statistic 𝑡 = (𝑋̅ − 𝜇)/(𝑠/√𝑛) has a t distribution with 𝑛 − 1 df (denoted
  by 𝑡_{𝑛−1}). One df is used in estimating the sample mean 𝑥̅
• Thus, as 𝑛 gets larger → 𝑠 becomes a better estimate of 𝜎 → the
  distribution of the t statistic looks more and more like the normal
• With a large enough 𝑛, the normal approximation can be used to construct
  the CI
• A CI from the t distribution is wider, accounting for the uncertainty in
  estimating 𝜎
 What if 𝜎 is unknown?
• We use the CI given by
               (𝑋̅ − 𝑡_{𝑛−1,𝛼/2} 𝑠/√𝑛 ,  𝑋̅ + 𝑡_{𝑛−1,𝛼/2} 𝑠/√𝑛)
  where 𝑡_{𝑛−1,𝛼/2} is the quantile of probability 1 − 𝛼/2 from the 𝑡_{𝑛−1} distribution
• In R, for a 95% CI use qt(p=0.975, df=n-1)
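A minimal sketch with simulated data (the sample below is made up; only the structure matters), showing the t-based interval and how it compares with the z-based one:

```r
set.seed(2)
x <- rnorm(25, mean = 10, sd = 2)          # hypothetical sample, n = 25
n <- length(x); xbar <- mean(x); s <- sd(x)
alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 1)
c(xbar - t_crit * s / sqrt(n),
  xbar + t_crit * s / sqrt(n))             # t-based 95% CI (wider)
z_crit <- qnorm(1 - alpha / 2)
c(xbar - z_crit * s / sqrt(n),
  xbar + z_crit * s / sqrt(n))             # z-based interval, slightly narrower
t.test(x)$conf.int                         # built-in check: same t-based interval
```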
Applications
• Consider the distribution of heights for the population of individuals
  between ages of 12 and 40 who suffer from fetal alcohol syndrome.
  Fetal alcohol syndrome is the severe end of the spectrum of
  disabilities caused by maternal alcohol use during pregnancy. The
  distribution of heights has unknown mean 𝜇. A random sample of 31
  patients is selected from the underlying population; the average
  height for these individuals was 𝑥̅ = 147.4 cm.
   1. When 𝜎 is known to be 6cm, construct 90% and 99% confidence intervals
      for 𝜇. Interpret the results.
    2. When 𝜎 is not known and the sample standard deviation is calculated to be
       6 cm, construct 90% and 99% confidence intervals for 𝜇. Interpret the results.
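One way to carry out the computations in R, plugging in the numbers from the problem (n = 31, 𝑥̅ = 147.4 cm, and 𝜎 or 𝑠 equal to 6 cm); the interpretation is left as asked in the exercise:

```r
n <- 31; xbar <- 147.4; sd_val <- 6              # values given in the problem
for (level in c(0.90, 0.99)) {
  z <- qnorm(1 - (1 - level) / 2)                # part 1: sigma known
  cat(level, "z-CI:", xbar - z * sd_val / sqrt(n),
      xbar + z * sd_val / sqrt(n), "\n")
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)  # part 2: sigma unknown, s = 6
  cat(level, "t-CI:", xbar - t_crit * sd_val / sqrt(n),
      xbar + t_crit * sd_val / sqrt(n), "\n")
}
```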