Point Estimation
Samatrix Consulting Pvt Ltd
Point Estimation
• The framework of statistical inference allows us to draw conclusions
or make decisions about a population based on the information
extracted from the sample data.
• Several types of inferences such as point estimation, interval
estimation, and hypothesis testing, can be made about the
population parameters.
• The primary goal of statistical inference is to find good estimates of the
population parameters.
Point Estimation
• We have seen how a probability distribution can be characterized
using certain parameters.
• For example, the normal distribution requires two parameters, 𝜇 and 𝜎², the
Poisson distribution requires 𝜆, and the binomial distribution requires 𝑝.
• With the help of these parameters, we can characterize the entire
population.
• However, these parameters are unknown and our objective is to
estimate them.
Point Estimation
• Point estimation is a type of inference in which we calculate a single
statistic from the sample data and use that statistic to estimate the
unknown but fixed parameter.
• For inference problems, we use 𝜃 to represent the unknown parameter.
Point Estimator
• Let 𝑋 be a random variable with probability distribution 𝑓(𝑥).
• An unknown parameter 𝜃 is used to characterize the distribution.
• We take a random sample 𝑋₁, 𝑋₂, …, 𝑋ₙ of size 𝑛 from 𝑋.
• The statistic Θ̂ = ℎ(𝑋₁, 𝑋₂, …, 𝑋ₙ) is called the point estimator of 𝜃.
• After the sample is selected, Θ̂ takes on a value θ̂, which is known as the
point estimate of 𝜃.
Point Estimator
• Example: Suppose normally distributed random variable 𝑋 has
unknown mean 𝜇.
• We can use the sample mean as a point estimator for the unknown population
mean 𝜇, so μ̂ = X̄.
• After we select the sample, the observed value x̄ becomes the point estimate of 𝜇.
• If 𝑥₁ = 25, 𝑥₂ = 30, 𝑥₃ = 29, 𝑥₄ = 31, the point estimate of 𝜇 is
x̄ = (25 + 30 + 29 + 31)/4 = 28.75
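A minimal Python sketch of this calculation; the same four observations also give the sample variance s² ≈ 6.9 quoted on the next slide:

```python
# Point estimates of the population mean and variance from the four observations.
import statistics

sample = [25, 30, 29, 31]

x_bar = statistics.mean(sample)      # point estimate of mu
s_sq = statistics.variance(sample)   # sample variance with the n - 1 divisor

print(f"x_bar = {x_bar}")            # 28.75
print(f"s^2   = {s_sq:.1f}")         # 6.9
```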
Point Estimator
• Similarly, if the population variance 𝜎² is also unknown, we can use the
sample variance 𝑆² as the point estimator of 𝜎².
• From the sample data we can calculate the numerical value 𝑠² = 6.9;
this 𝑠² is the point estimate of 𝜎².
• The approach used for estimating a population parameter should provide
precise estimates of that parameter.
• Among all the approaches, we need to select the best approach.
• The statistical concepts such as bias, variability, consistency,
efficiency, sufficiency, and completeness can help us compare various
estimators.
General Concepts of Point Estimation
Unbiased Estimator
• The estimator should be close to the true value of the unknown parameter.
• We call Θ̂ an unbiased estimator of 𝜃 if the expected value of Θ̂ is equal to 𝜃.
• Equivalently, the mean of the probability distribution of Θ̂ is equal to 𝜃.
• The point estimator Θ̂ is an unbiased estimator of 𝜃 if
E(Θ̂) = 𝜃
• If the estimator is not unbiased, the value E(Θ̂) − 𝜃 is called the bias of
the estimator.
• For an unbiased estimator, E(Θ̂) − 𝜃 = 0.
Variance of Point Estimator
• Suppose we have two different unbiased estimators, Θ̂₁ and Θ̂₂, of 𝜃.
• Even though the distributions of both estimators are centered at the true
value of 𝜃, their variances may differ.
• If Θ̂₁ has a smaller variance than Θ̂₂, the estimator Θ̂₁ will give estimates
that are closer to the true value 𝜃.
• Hence, if we have multiple unbiased estimators of 𝜃, we should choose the
estimator with the minimum variance.
• The estimator with the minimum variance is known as the minimum variance
unbiased estimator (MVUE).
Theorem
• Let 𝑋 = (𝑋₁, 𝑋₂, …, 𝑋ₙ) be an i.i.d. (random) sample of a random variable 𝑋
with population mean E(𝑋ᵢ) = 𝜇 and population variance Var(𝑋ᵢ) = 𝜎², for all
𝑖 = 1, 2, …, 𝑛. Then
1. The sample mean X̄ = (1/𝑛) Σᵢ₌₁ⁿ 𝑋ᵢ is an unbiased estimator of 𝜇
2. The sample variance 𝑆² = (1/(𝑛−1)) Σᵢ₌₁ⁿ (𝑋ᵢ − X̄)² is an unbiased estimator of 𝜎²
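A small simulation sketch (numpy, with an assumed normal population) illustrating that X̄ and the 𝑛−1 sample variance are unbiased on average:

```python
# Monte Carlo check of unbiasedness: average x_bar and S^2 over many samples.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 5, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
x_bars = samples.mean(axis=1)
s_sq = samples.var(axis=1, ddof=1)   # n - 1 divisor

print(np.mean(x_bars))   # close to mu = 10
print(np.mean(s_sq))     # close to sigma^2 = 4
```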
Theorem
1. For an i.i.d. sample from a normally distributed population N(𝜇, 𝜎²), the
sample mean X̄ = (1/𝑛) Σᵢ₌₁ⁿ 𝑋ᵢ is an unbiased point estimator for 𝜇
2. For an i.i.d. sample from a normally distributed population N(𝜇, 𝜎²), the
sample variance 𝑆² = (1/(𝑛−1)) Σᵢ₌₁ⁿ (𝑋ᵢ − X̄)² is an unbiased point estimator
for 𝜎²
3. For an i.i.d. sample from a normally distributed population N(𝜇, 𝜎²), the
sample variance Ŝ² = (1/𝑛) Σᵢ₌₁ⁿ (𝑋ᵢ − X̄)² is a biased point estimator for 𝜎².
However, as the sample size tends to infinity, the bias tends to zero
4. For an i.i.d. sample from a Bernoulli distributed population B(1, 𝑝), the
sample mean X̄ = (1/𝑛) Σᵢ₌₁ⁿ 𝑋ᵢ is an unbiased point estimator for 𝑝
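A numerical sketch (numpy, assumed normal population) of point 3 above, showing that the 1/𝑛 variance estimator is biased downward but the bias shrinks as 𝑛 grows:

```python
# Bias of the 1/n variance estimator: E[S_hat^2] = (n - 1)/n * sigma^2.
import numpy as np

rng = np.random.default_rng(1)
sigma2, reps = 4.0, 200_000

for n in (5, 20, 100):
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s_hat_sq = samples.var(axis=1, ddof=0)                 # divisor n, biased
    bias = np.mean(s_hat_sq) - sigma2
    print(f"n={n:3d}  estimated bias={bias:+.3f}  theory={-sigma2 / n:+.3f}")
```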
Method of Evaluating Estimator
Mean Square Error of an Estimator
• Sometimes we prefer a biased estimator over an unbiased estimator. In
such cases, the mean square error of the estimators becomes important.
• The mean square error is the expected squared difference between Θ̂ and 𝜃
MSE(Θ̂) = E[(Θ̂ − 𝜃)²]
• The mean square error can also be written as
MSE(Θ̂) = V(Θ̂) + bias²
MSE = Variance + bias²
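A short sketch (numpy, assumed normal data) checking the decomposition MSE = Variance + bias² for the biased 1/𝑛 variance estimator:

```python
# Verify MSE(theta_hat) = Var(theta_hat) + bias^2 by simulation.
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, reps = 4.0, 10, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s_hat_sq = samples.var(axis=1, ddof=0)        # biased 1/n estimator of sigma^2

mse = np.mean((s_hat_sq - sigma2) ** 2)
var = np.var(s_hat_sq)
bias = np.mean(s_hat_sq) - sigma2

print(f"MSE          = {mse:.4f}")
print(f"Var + bias^2 = {var + bias ** 2:.4f}")   # matches the MSE above
```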
Mean Square Error of an Estimator
• For an unbiased estimator of 𝜃, the mean square error of Θ̂ is equal to the
variance of Θ̂.
• The mean square error is an important criterion for comparing two
estimators.
• Suppose we have two estimators, Θ̂₁ and Θ̂₂, of the parameter 𝜃, with mean
square errors MSE(Θ̂₁) and MSE(Θ̂₂).
• The relative efficiency of Θ̂₂ to Θ̂₁ can be defined as
MSE(Θ̂₁) / MSE(Θ̂₂)
Mean Square Error of an Estimator
• If the relative efficiency is less than 1, we can conclude that Θ̂₁ is a
more efficient estimator than Θ̂₂.
• On several occasions, in order to reduce the mean square error, we
deliberately introduce a small bias in exchange for a reduction in variance.
• If the reduction in variance is greater than the increase in squared bias,
we improve the mean square error.
• A biased estimator Θ̂₁ can have a smaller variance than an unbiased
estimator Θ̂₂, as the simulation sketch below illustrates.
• If MSE(Θ̂₁) < MSE(Θ̂₂), then Θ̂₁ will produce estimates that tend to
be closer to the true value of 𝜃.
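A simulation sketch (numpy; the shrinkage factor 0.9 is an arbitrary choice made here for illustration) comparing the unbiased sample mean with a slightly biased, lower-variance alternative:

```python
# Compare the MSE of the unbiased sample mean with a shrunken (biased) version.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 1.0, 3.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
theta1 = 0.9 * samples.mean(axis=1)   # biased (shrunk toward 0), smaller variance
theta2 = samples.mean(axis=1)         # unbiased sample mean

mse1 = np.mean((theta1 - mu) ** 2)
mse2 = np.mean((theta2 - mu) ** 2)
print(f"MSE(biased Theta_1)   = {mse1:.3f}")
print(f"MSE(unbiased Theta_2) = {mse2:.3f}")
print(f"relative efficiency MSE_1/MSE_2 = {mse1 / mse2:.2f}")   # below 1 here
```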
Best Unbiased Estimator
• There is no overall “best estimator”, because the class of all estimators is
too large.
• We can find a best estimator by reducing the class of estimators considered.
• We can restrict the class of estimators by considering unbiased
estimators only.
• In that case the mean square error of each estimator equals its variance.
• We can therefore choose the estimator with the smallest variance.
Sufficient Estimator
• Let’s understand the concepts of sufficiency of estimators through
some examples.
• Suppose we have two independent random variables 𝑋 and 𝑌, both following
the N(𝜇, 1) distribution. So we can say that both of them contain
information about 𝜇.
• During our experiment, we consider two estimators of 𝜇, namely
μ̂₁ = 𝑋 + 𝑌 and μ̂₂ = 𝑋 − 𝑌. We want to know whether we should use
μ̂₁ or μ̂₂ to estimate 𝜇.
Sufficient Estimator
E(μ̂₁) = E(𝑋) + E(𝑌) = 𝜇 + 𝜇 = 2𝜇
Var(μ̂₁) = Var(𝑋) + Var(𝑌) = 1 + 1 = 2
E(μ̂₂) = E(𝑋) − E(𝑌) = 𝜇 − 𝜇 = 0
Var(μ̂₂) = Var(𝑋) + Var(𝑌) = 1 + 1 = 2
• We can say μ̂₁ ∼ N(2𝜇, 2) and μ̂₂ ∼ N(0, 2).
• So μ̂₁ contains information about 𝜇, whereas μ̂₂ does not contain
information about 𝜇.
• We can also say that μ̂₂ has lost the information about 𝜇. We call this
property “loss of information”.
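A minimal simulation sketch (numpy, with 𝜇 arbitrarily set to 5) showing that 𝑋 + 𝑌 is centred at 2𝜇 while 𝑋 − 𝑌 is centred at 0 regardless of 𝜇:

```python
# X + Y carries information about mu; X - Y does not.
import numpy as np

rng = np.random.default_rng(4)
mu, reps = 5.0, 100_000

x = rng.normal(mu, 1.0, reps)
y = rng.normal(mu, 1.0, reps)

print(np.mean(x + y), np.var(x + y))   # about 2*mu = 10, variance about 2
print(np.mean(x - y), np.var(x - y))   # about 0, variance about 2, no trace of mu
```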
Sufficient Estimator
• Let’s take another example. Suppose we have the sample data 1, 2, 3, 4, 5. Then
x̄ = (1 + 2 + 3 + 4 + 5)/5 = 3.
You can estimate the population mean 𝜇 using x̄.
• Now assume that you do not know the individual values; you only know the
sample mean, i.e. x̄ = 3. Without knowing the whole sample data set, you can
still estimate the population mean using the sample mean.
• We can say that if all the information about 𝜇 contained in a sample of
size 𝑛 can be obtained through, for example, the sample mean, then it is
sufficient to use this one-dimensional summary statistic to make inferences
about 𝜇.
Methods of Finding Estimators
Method of Moments
• The method of moments is the oldest method of finding the point
estimators.
• The method of moments is very simple to use and always yields some estimate.
• However, the estimators it yields can often be improved upon.
• Let 𝑋1 , 𝑋2 , … , 𝑋𝑛 be the random sample from a population with discrete
probability mass function or continuous probability density function 𝑓 𝑥 .
• We can find the method of moments estimators by equating the first 𝑘
sample moments with the first 𝑘 population moments.
• The population moments are defined in the terms of expected values.
Method of Moments
• The 𝑘th population moment is E(𝑋ᵏ), for 𝑘 = 1, 2, …. The corresponding 𝑘th
sample moment is (1/𝑛) Σᵢ₌₁ⁿ 𝑋ᵢᵏ, for 𝑘 = 1, 2, ….
• For example, the first population moment is E(𝑋¹) = E(𝑋) = 𝜇.
• The first sample moment is (1/𝑛) Σᵢ₌₁ⁿ 𝑋ᵢ¹ = (1/𝑛) Σᵢ₌₁ⁿ 𝑋ᵢ = X̄.
• By equating the population and sample moments, we find μ̂ = X̄.
• So we can say the sample mean is the moment estimator of the population
mean.
Method of Moments
• Example: Let 𝑋1 , 𝑋2 , … , 𝑋𝑛 be the random sample from an
exponential distribution that has parameter 𝜆.
• In this case only one parameter needs to be estimated, so E(𝑋) is equated
to X̄.
• For the exponential distribution, E(𝑋) = 1/𝜆.
• Setting E(𝑋) equal to X̄ gives 1/𝜆 = X̄.
• Therefore, the moment estimator of 𝜆 is λ̂ = 1/X̄.
Method of Moments
• As an example, during an experiment, the team tested time to failure
of an electronic module that has been used in an automobile engine
controller at an elevated temperature to accelerate the failure
mechanism.
• In this case the time to failure is exponentially distributed.
• The test was conducted on eight randomly selected samples.
• The failure times were as follows: 𝑥₁ = 20.5, 𝑥₂ = 8.9, 𝑥₃ = 77.3, 𝑥₄ = 30.8,
𝑥₅ = 22.6, 𝑥₆ = 56.3, 𝑥₇ = 9.1, 𝑥₈ = 29.1.
• Therefore x̄ = 31.825.
• The moment estimate of 𝜆 is λ̂ = 1/x̄ = 1/31.825 = 0.0314
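The same calculation as a short Python sketch:

```python
# Method-of-moments estimate of lambda for the failure-time data.
failure_times = [20.5, 8.9, 77.3, 30.8, 22.6, 56.3, 9.1, 29.1]

x_bar = sum(failure_times) / len(failure_times)   # first sample moment
lam_hat = 1.0 / x_bar                             # equate E(X) = 1/lambda to x_bar

print(f"x_bar      = {x_bar:.3f}")    # 31.825
print(f"lambda_hat = {lam_hat:.4f}")  # 0.0314
```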
Method of Maximum Likelihood
• In the 1920s, the famous British statistician Sir R. A. Fisher developed the
method of maximum likelihood, one of the best methods for obtaining a point
estimator of a parameter.
• Before getting into the concepts of Maximum Likelihood, let’s
understand the concepts of likelihood function.
Likelihood Function
• The concept of likelihood is one of the most important concepts of
modern statistics.
• While studying random variable distributions, we saw that each
distribution depends on one or more parameters.
• For a normal distribution, 𝜇 and 𝜎 are the parameters of interest
whereas for a Poisson distribution 𝜆 is the parameter of interest.
• In the case of binomial distribution, the parameter is 𝑝.
• We can denote the unknown parameter by 𝜃.
• Since the probability distribution depends on 𝜃, we can denote the
dependence by 𝑓(𝑥|𝜃).
Likelihood Function
• In the case of the Bernoulli distribution, the parameter is 𝜋 = 𝜃. The
distribution is
𝑓(𝑥|𝜃) = 𝜋ˣ (1 − 𝜋)^(1−𝑥),  𝑥 = 0, 1
• Once we plug in an observed value of 𝑥, we have a function of 𝜋 only.
• For example, for the value 𝑥 = 1, the function is 𝜋.
• For the value 𝑥 = 0, the function is 1 − 𝜋.
• So we can say that when we plug the observed values of 𝑥 into the
function 𝑓(𝑥|𝜃), we get a function of the parameter 𝜃, which we call the
likelihood function. For a sample of 𝑛 observations we denote it by
𝐿(𝜃) or 𝐿(𝜃|𝑥) = Πᵢ₌₁ⁿ 𝑓(𝑋ᵢ|𝜃)
Likelihood Function
• In the above example, the likelihood function when 𝑥 = 1 is 𝐿(𝜋|𝑥) = 𝜋.
For 𝑥 = 0, the likelihood function is 𝐿(𝜋|𝑥) = 1 − 𝜋.
Likelihood Function
• Whereas for a Bernoulli distribution with the parameter fixed at 𝜋 = 0.5, we
get the probability function 𝑓(𝑥|𝜋 = 0.5), which assigns probability 0.5 to
each of 𝑥 = 0 and 𝑥 = 1.
Likelihood Function
• Algebraically, the likelihood 𝐿(𝜃|𝑥) has the same form as 𝑓(𝑥|𝜃), but the
two are interpreted very differently.
• 𝐿(𝜃|𝑥) is a function of 𝜃 for the fixed observed value 𝑥, whereas 𝑓(𝑥|𝜃) is
a function of 𝑥 for a fixed 𝜃 and is a probability density (or mass) function
over 𝑥.
• Hence the graph of 𝐿(𝜃|𝑥) is different from that of 𝑓(𝑥|𝜃).
• For a discrete random variable, the probability distribution 𝑓(𝑥|𝜃)
has spikes at specific values of 𝑥, whereas the likelihood 𝐿(𝜃|𝑥)
is a continuous curve or line over the parameter space (all
possible values of 𝜃).
Maximum Likelihood Estimator (MLE)
• Suppose we conduct 𝑛 = 5 independent Bernoulli trials where the
probability of success for each trial is 𝑝. The total number of successes in
the trials is 𝑋, so 𝑋 ∼ Bin(5, 𝑝). For 𝑋 = 3, the likelihood is
𝐿(𝑝|𝑥) = [𝑛! / (𝑥!(𝑛 − 𝑥)!)] 𝑝ˣ (1 − 𝑝)^(𝑛−𝑥)
        = [5! / (3!(5 − 3)!)] 𝑝³ (1 − 𝑝)^(5−3)
        ∝ 𝑝³ (1 − 𝑝)²
Maximum Likelihood Estimator (MLE)
• We can ignore the constant. A graph of 𝐿(𝑝|𝑥) = 𝑝³(1 − 𝑝)² over the
interval 𝑝 ∈ (0, 1) can be produced as in the sketch below.
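A minimal plotting sketch (numpy and matplotlib assumed available) that reproduces this curve:

```python
# Plot the binomial likelihood (up to a constant) for x = 3 successes in n = 5 trials.
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0, 1, 500)
likelihood = p**3 * (1 - p)**2     # L(p | x), ignoring the binomial constant

plt.plot(p, likelihood)
plt.axvline(0.6, linestyle="--")   # maximum occurs at p = x/n = 3/5
plt.xlabel("p")
plt.ylabel("L(p | x)")
plt.title("Likelihood for x = 3 successes in n = 5 Bernoulli trials")
plt.show()
```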
Maximum Likelihood Estimator (MLE)
• From the graph we can infer that the maximum value of the function occurs at
𝑝 = 0.6.
• So we can estimate the value of the unknown parameter 𝜃 by the value
for which the likelihood function 𝐿(𝜃|𝑥) is largest.
• This approach is known as maximum likelihood (ML) estimation.
• The value of the parameter 𝜃 at which the likelihood function attains its
maximum is denoted by θ̂ and is called the maximum likelihood
estimate (MLE) of 𝜃.
Maximum Likelihood Estimator (MLE)
• We can calculate the MLE using the following steps:
  • Calculate the derivative of 𝐿(𝜃|𝑥) with respect to 𝜃
  • Set the derivative equal to zero
  • Find the value of 𝜃 by solving the resulting equation
• We can simplify the computations by maximizing the log-likelihood
function
𝑙(𝜃|𝑥) = log 𝐿(𝜃|𝑥)
• Here “log” means the natural log (logarithm to the base 𝑒). Because the
natural log is an increasing function, maximizing the log-likelihood maximizes
the likelihood. The advantage of the log-likelihood is its simpler form, which
is easier to differentiate.
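A numerical sketch of these steps (using scipy.optimize to maximize the log-likelihood; for this binomial example the closed-form answer is p̂ = 𝑥/𝑛 = 0.6):

```python
# Maximize the binomial log-likelihood numerically for x = 3 successes in n = 5 trials.
import numpy as np
from scipy.optimize import minimize_scalar

n, x = 5, 3

def neg_log_likelihood(p):
    # l(p | x) = x*log(p) + (n - x)*log(1 - p), up to an additive constant
    return -(x * np.log(p) + (n - x) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"MLE of p: {result.x:.4f}")   # approximately 0.6 = x / n
```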
Bayes’ Estimator
• We again consider the example of the Bernoulli trials as shown above.
• The conditional distribution of a random variable 𝑋, with 𝑛 trials and
parameter 𝜋 is binomial(𝑛, 𝜋).
• The conditional probability function for 𝑥 given 𝜋 is
𝑓(𝑥|𝜋) = [𝑛! / (𝑥!(𝑛 − 𝑥)!)] 𝜋ˣ (1 − 𝜋)^(𝑛−𝑥)  for 𝑥 = 0, 1, …, 𝑛
• Here the parameter 𝜋 is fixed and we are examining the probability
distribution of all the possible values of 𝑥
Bayes’ Estimator
• We can look into this relationship between 𝜋 and 𝑥 by fixing the value
of 𝑥 at the number of successes we observed and allowing the parameter 𝜋 to
vary over all its possible values.
• As a result, we get the likelihood function
𝑓(𝑥|𝜋) = 𝐿(𝜋|𝑥) = [𝑛! / (𝑥!(𝑛 − 𝑥)!)] 𝜋ˣ (1 − 𝜋)^(𝑛−𝑥)  for 0 ≤ 𝜋 ≤ 1
Bayes’ Estimator
• In this case the functional form relating the observation 𝑥 and the
parameter 𝜋 remains the same, but now we treat it as a function of the
parameter, with the observation held at the value that actually occurred.
• For Bayes’ theorem, we use a prior distribution 𝑔(𝜋) that expresses our
belief about the possible values 𝜋 can take before seeing the data.
• The prior cannot be constructed from the data. Bayes’ theorem can
be written as
posterior ∝ prior × likelihood
𝑔(𝜋|𝑥) ∝ 𝑔(𝜋) × 𝑓(𝑥|𝜋)
Bayes’ Estimator
• This relationship can be justified only when the prior is independent
of likelihood.
• So the choice of prior is not influenced by the observed data.
• To get the exact posterior density, we need to divide the right-hand side by
a constant 𝑘 so that the area under the posterior integrates to one.
• 𝑘 is the integral of 𝑔(𝜋) × 𝑓(𝑥|𝜋) over the whole range:
𝑔(𝜋|𝑥) = 𝑔(𝜋) × 𝑓(𝑥|𝜋) / ∫₀¹ 𝑔(𝜋) × 𝑓(𝑥|𝜋) d𝜋
Bayes’ Estimator
𝑔(𝜋|𝑥) = 𝑔(𝜋) × 𝑓(𝑥|𝜋) / ∫₀¹ 𝑔(𝜋) × 𝑓(𝑥|𝜋) d𝜋
• The posterior distribution is a conditional distribution: it is conditional
upon observing the sample.
• We can use the posterior distribution to make statements about the
parameter 𝜃, which is still treated as a random quantity.
• We can use the mean of the posterior distribution as the point
estimate of 𝜃.
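A short sketch (scipy.stats; the uniform Beta(1, 1) prior is an assumption made here purely for illustration) for the Bernoulli-trials example with 𝑥 = 3 successes in 𝑛 = 5 trials:

```python
# Bayes point estimate of the binomial success probability pi.
from scipy import stats

n, x = 5, 3        # 3 successes in 5 Bernoulli trials
a, b = 1.0, 1.0    # uniform Beta(1, 1) prior on pi (an assumed choice)

# Conjugate update: Beta prior + binomial likelihood -> Beta(a + x, b + n - x) posterior.
posterior = stats.beta(a + x, b + n - x)

print(f"posterior mean   = {posterior.mean():.3f}")   # (a + x) / (a + b + n) = 0.571
print(f"MLE for contrast = {x / n:.3f}")              # 0.600
```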
Thanks
Samatrix Consulting Pvt Ltd