
Probability and Stochastic Process
PART 2

Concepts of Estimation Theory

Mohammed S. Elmusrati – University of VAASA 1


Introduction
• In estimation theory we look for the best estimate of hidden (unseen) values or parameters based on available data (measurements) or observations.
• Let's define the problem as:

y = h(x; θ) + n

where x is the hidden input, n is the random unknown noise or bias, h is a vector of mapping functions with parameters θ, and N is the number of samples (observations).
Mohammed S. Elmusrati – University of VAASA 2
Estimation Problems
• For the previous equation we have at least the following three interesting cases in estimation:
1. We know (at least partially) x and the observation y, and we are looking for the best set of functions h and their parameters θ. We call this a regression problem (or curve fitting), and it is one of the core tools in many applications such as machine learning. We may know the general form of the mapping h, for example the increasing exponential functions in population models, but we need to know (or estimate) the parameters of the exponential function. Sometimes we do not even know the mapping, and we need to test or modify some general shapes to fit the available x and y data, or use black-box modelling.

Mohammed S. Elmusrati – University of VAASA 3


Estimation Problems
2. The second case is when we have the observations y and we know (or can at least assume) h with its parameters, and we are looking to estimate x. Since the vector x is unknown to us (fully or partially) and is corrupted by noise, we should treat it as a random process.
3. The third case of interest is when we have only the observations y and we know neither the mapping functions h nor the inputs x. However, we may still build some statistical relations between the inputs and the outputs: what is the best estimate of the statistical parameters based on the available information (e.g., y)?

Mohammed S. Elmusrati – University of VAASA 4


Introduction
• We may simplify the general form of our estimation problem, mainly to cover the third case, as:

y = h(x) + n,   x ∈ ℝ^{M×1},   y, n ∈ ℝ^{N×1}

• Here we consider only one mapping function, and we are interested in estimating the deterministic vector x based on the N observations y.
Mohammed S. Elmusrati – University of VAASA 5


Unbiased Estimation
• How do we assess the estimation process? How do we decide which estimate is the best? What are our evaluation criteria?
• Usually we estimate the hidden parameter or inputs, x_est, based on the observations {y}.
• However, since the estimate x_est is based on the random variables y, the estimate will be a random variable as well.
• If the expected value of the estimate, E[x_est], equals the actual estimated value, we call the estimate unbiased.
• x_est is an unbiased estimate of x if E[x_est] = x.
• This means that the expected value of the error between the estimate and the actual value is zero.

Mohammed S. Elmusrati – University of VAASA 6


Consistent Estimator
• The estimator x_est is called consistent if increasing the number of observations reduces its variance, i.e., with N the number of observations (or samples) used in the estimation,

lim_{N→∞} E[(x_est − E[x_est])²] = 0

• Hence, it is highly desirable for the estimator to be unbiased and consistent, because then we will be close to the actual parameter when N is large enough.

Mohammed S. Elmusrati – University of VAASA 7


Efficient Estimator
• It is not possible to have an infinite number of samples in order to make the estimate extremely close to the actual parameter.
• Furthermore, it is possible to have several different unbiased and consistent estimators. Hence, which one should we select?
• We may define the efficient estimator as the one which has the minimum error variance for a certain finite number of samples N.

Mohammed S. Elmusrati – University of VAASA 8
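As a quick illustration of these three properties, here is a minimal Monte Carlo sketch (Python with NumPy; the chosen mean, noise level, and trial count are arbitrary). It compares two unbiased estimators of the mean of Gaussian data, the sample mean and the sample median: both are unbiased and consistent, but the sample mean has the smaller variance, so it is the more efficient of the two.

import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, trials = 2.0, 1.0, 20000

for N in (10, 100, 1000):
    data = rng.normal(true_mean, sigma, size=(trials, N))
    means = data.mean(axis=1)            # estimator 1: sample mean
    medians = np.median(data, axis=1)    # estimator 2: sample median
    # Both biases stay near zero (unbiased); both variances shrink with N (consistent);
    # the sample mean keeps the smaller variance (more efficient for Gaussian data).
    print(f"N={N:5d}  bias(mean)={means.mean() - true_mean:+.4f}  "
          f"bias(median)={medians.mean() - true_mean:+.4f}  "
          f"var(mean)={means.var():.5f}  var(median)={medians.var():.5f}")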


Efficient Estimator
• In order to decide whether an estimator is efficient or not, we should know the minimum variance achievable by estimators of a certain problem.
• Fortunately, the minimum variance, or more precisely the lower bound on the variance of unbiased estimators, is given by the Cramér-Rao (CR) lower bound.
• Consider the same problem as in Slide 2, where we are looking to estimate the vector of input parameters x = [x_1, …, x_M]^T based on the observations y = [y_1, …, y_N]^T.
Mohammed S. Elmusrati – University of VAASA 9
Efficient Estimator
• Define the covariance matrix of the vector of estimates as

R_{x_est} = E[(x − x_est)(x − x_est)^T]

• The element (i, j) of the Fisher information matrix I(x) is given by

[I(x)]_{ij} = −E[∂² log f(y; x) / (∂x_i ∂x_j)]

• where f(y; x) is the joint probability density function between the observations y and the parameters x.
Mohammed S. Elmusrati – University of VAASA 10
Cramer-Rao Lower Bound
• Under the following regularity condition

E[∂ log f(y; x) / ∂x_i] = 0, for all i,

• we may define the lower bound of the estimate variance as

Var[(x_est)_i] ≥ [I^{-1}(x)]_{ii}

• This means that the variance of the i-th parameter estimate is lower bounded by the corresponding diagonal element of the inverse Fisher information matrix, under the regularity condition.
Mohammed S. Elmusrati – University of VAASA 11
Cramer-Rao Lower Bound
• We can reduce the general form of the Cramér-Rao Lower Bound (CRLB) given in the previous slide to a single estimated parameter as:

Var(x_est) ≥ −1 / E[∂² log f(y; x) / ∂x²]

• The regularity condition is

E[∂ log f(y; x) / ∂x] = 0

• The proof is not difficult and can be found in many books on estimation theory, so it is not given here. Nevertheless, we will use the bound to prove the efficiency of certain estimators.
Mohammed S. Elmusrati – University of VAASA 12
Least Mean Square Error Estimate
• Let’s derive our first estimation algorithm. We
assume a single scalar parameter, i.e., we have N
observations based on yi = h( x ) + ni , i=1,..,N
• We don’t know the function h(.) and ni is additive
random measurement noise.
• One possible criteria is to find the best estimate
xest that minimizes the mean square error as:
é
( ( )) ù
2
minE ê x - x est y úû
ë
Mohammed S. Elmusrati – University of VAASA 13
Least Mean Square Error Estimate
• We may formulate the expected value as

E[(x − x_est(y))²] = ∫∫ (x − x_est(y))² f_{XY}(x, y) dx dy

• where f_{XY}(x, y) is the joint distribution between the measurements and the parameter x. We may modify this formula, since f_{XY}(x, y) = f_Y(y) f_{X|Y}(x|y):

E[(x − x_est(y))²] = ∫∫ (x − x_est(y))² f_Y(y) f_{X|Y}(x|y) dx dy

• We can find the x_est that minimizes this function from

d/dx_est E[(x − x_est(y))²] = 0
Mohammed S. Elmusrati – University of VAASA 14
Least Mean Square Error Estimate
• Hence,

d/dx_est ∫ f_Y(y) [ ∫ (x − x_est(y))² f_{X|Y}(x|y) dx ] dy = 0

• But since f_Y(y) is always positive, it suffices that

d/dx_est ∫ (x − x_est(y))² f_{X|Y}(x|y) dx = 0  ⇒  −2 ∫ (x − x_est(y)) f_{X|Y}(x|y) dx = 0

∫ x f_{X|Y}(x|y) dx − x_est(y) ∫ f_{X|Y}(x|y) dx = 0

• Therefore,

x_est(y) = ∫ x f_{X|Y}(x|y) dx = E[x | y]
Mohammed S. Elmusrati – University of VAASA 15


Least Mean Square Error Estimate
• The previous result is very important since it says that the best estimate of the parameter x is the mean of the conditional probability density function of the parameter given the observations, i.e., f(x|y).
• This function is not always available.
• Are there other ways to estimate the parameter x based on the observations?
• Yes, we can use several other norms, such as minimizing the absolute value of the error or minimizing the maximum value of the error, as in the next slides:
Mohammed S. Elmusrati – University of VAASA 16
Minimum Absolute Error
• What is the optimum parameter estimate x_est that minimizes error = E[|x − x_est(y)|]?
• Let's proceed in a similar way as we did to minimize the mean square error:

E[|x − x_est(y)|] = ∫∫ |x − x_est(y)| f_Y(y) f_{X|Y}(x|y) dx dy = ∫ f_Y(y) [ ∫ |x − x_est(y)| f_{X|Y}(x|y) dx ] dy

d/dx_est E[|x − x_est(y)|] = 0  ⇒  d/dx_est ∫ |x − x_est(y)| f_{X|Y}(x|y) dx = 0

⇒ d/dx_est [ − ∫_{−∞}^{x_est(y)} (x − x_est(y)) f_{X|Y}(x|y) dx + ∫_{x_est(y)}^{∞} (x − x_est(y)) f_{X|Y}(x|y) dx ] = 0
Mohammed S. Elmusrati – University of VAASA 17


Minimum Absolute Error
• Now we differentiate with respect to x_est using the following Leibniz integral rule:

∂/∂z ∫_{a(z)}^{b(z)} f(x, z) dx = ∫_{a(z)}^{b(z)} ∂f(x, z)/∂z dx + f(b(z), z) ∂b/∂z − f(a(z), z) ∂a/∂z

• We obtain:

∫_{−∞}^{x_est(y)} f_{X|Y}(x|y) dx − ∫_{x_est(y)}^{∞} f_{X|Y}(x|y) dx = 0  ⇒  ∫_{−∞}^{x_est(y)} f_{X|Y}(x|y) dx = ∫_{x_est(y)}^{∞} f_{X|Y}(x|y) dx

• This means that the optimum estimate is the median of the conditional probability density function f(x|y).
Mohammed S. Elmusrati – University of VAASA 18


MinMax Error Criteria
• The third common criterion for optimizing the estimated parameter is the one that minimizes the maximum error, i.e., min{ max E[|x − x_est(y)|] }.
• This could be formulated as

d/dx_est { max ∫ |x − x_est(y)| f_{X|Y}(x|y) dx }

• Roughly speaking, the maximum is achieved at the maximum of f(x|y). In other words, the best estimate in this case is the mode of f(x|y).

Mohammed S. Elmusrati – University of VAASA 19


Different Estimators
• We have seen so far three different estimators according to the chosen criterion. However, all of them are based on the conditional probability density function:
– Minimizing the error variance (L2 norm): the mean of f_{X|Y}(x|y)
– Minimizing the absolute value of the error (L1 norm): the median of f_{X|Y}(x|y)
– Minimizing the maximum of the error (L∞ norm): the mode of f_{X|Y}(x|y)

Mohammed S. Elmusrati – University of VAASA 20


Different Estimators

[Figure: a conditional density f_{X|Y}(x|y) with the three estimates x̂_MV (mean), x̂_MM (median), and x̂_MAP (mode) marked on the x-axis.]
It is interesting to note that if the conditional density is symmetric, like the Normal distribution, then all three of these estimators are identical.

Mohammed S. Elmusrati – University of VAASA 21
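To make the comparison concrete, here is a small numerical sketch (Python with NumPy; the two example densities are arbitrary illustrative choices). It evaluates a conditional density on a grid and extracts its mean, median, and mode: for a symmetric, Gaussian-shaped density the three coincide, while for a skewed density they differ.

import numpy as np

def mean_median_mode(x, fx):
    """Mean, median and mode of a density sampled on a uniform grid x."""
    dx = x[1] - x[0]
    fx = fx / (fx.sum() * dx)            # normalize so the density integrates to 1
    mean = (x * fx).sum() * dx
    cdf = np.cumsum(fx) * dx             # approximate cumulative distribution
    median = x[np.searchsorted(cdf, 0.5)]
    mode = x[np.argmax(fx)]
    return mean, median, mode

x = np.linspace(-5.0, 20.0, 25001)
symmetric = np.exp(-0.5 * (x - 3.0) ** 2)              # Gaussian-shaped density
skewed = np.where(x > 0, x ** 2 * np.exp(-x), 0.0)     # Gamma(3,1)-shaped density

print("symmetric:", mean_median_mode(x, symmetric))    # all three are close to 3.0
print("skewed   :", mean_median_mode(x, skewed))       # mode 2, median ~2.67, mean 3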


Maximum Likelihood Estimator
• It has been shown that the best estimator should be based on the conditional probability of the parameter we are looking to estimate given the observations or measurements, f_{X|Y}(x|y).
• Unfortunately, it is a real challenge to have an accurate formulation of this posterior density. However, using Bayes' rule, we may express it as:

f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / f_Y(y)
Mohammed S. Elmusrati – University of VAASA 22


Maximum Likelihood Estimator
• In terms of estimation theory we may interpret the terms of the previous Bayes formula as:
– The density f_{X|Y}(x|y) represents the distribution of the unknown parameter x after collecting the measurements y. Hence, it is called the posterior probability density function.
– The density f_X(x) represents our beliefs about the possible values of x before we see any observations or collect any measurements. This could be based on assumptions and/or physical behavior. It represents the prior statistical knowledge about x.
– The density f_{Y|X}(y|x) is called the likelihood density, which expresses how the measurements or observations should behave for a certain parameter x.
– Finally, the density f_Y(y) represents the general distribution of the measurements regardless of the parameter x.
Mohammed S. Elmusrati – University of VAASA 23
Maximum Likelihood Estimator
• As we have seen from the different estimation techniques, one method is to take the maximum value of the posterior probability density f_{X|Y}(x|y).
• Therefore, it is named maximum a posteriori (MAP) estimation.
• However, since f_{X|Y}(x|y) is generally very hard to know, let's see how to find some other equivalent estimator.
• Taking the logarithm of the posterior probability density we obtain
Mohammed S. Elmusrati – University of VAASA 24
Maximum Likelihood Estimator
log f_{X|Y}(x|y) = log f_{Y|X}(y|x) + log f_X(x) − log f_Y(y)

• It is clear that taking the logarithm makes the density function easier to handle. Moreover, the logarithm is a monotonically increasing function, i.e., it does not change the location of the maximum point.
• Generally speaking, if g(x) > 0 for all x and x_max = argmax g(x), then it is always true that x_max = argmax log(g(x)).
• Therefore, the MAP estimate could be formulated as

max { log f_{X|Y}(x|y) } = max { log f_{Y|X}(y|x) + log f_X(x) }

Mohammed S. Elmusrati – University of VAASA 25


Maximum Likelihood Estimator
• In the last formulation we have dropped f_Y(y) because it is not a function of the parameter x, so it has no effect on finding the point which maximizes the estimate.
• It is clear that to find the MAP point we need to know the likelihood density function as well as the prior statistical knowledge about the parameter, f_X(x).
• In case we ignore the a priori part and maximize only the likelihood density, we call this estimate the maximum likelihood estimate.
Mohammed S. Elmusrati – University of VAASA 26
Maximum Likelihood Estimator
• In other words, the maximum likelihood estimator is defined as:

x_ML = argmax f_{Y|X}(y|x) = argmax log[ f_{Y|X}(y|x) ]

• We have seen that MAP is an optimum estimator according to a certain criterion.
• Is MLE (the maximum likelihood estimator) optimum in any sense?
• The ML estimator can be the optimum solution, like the MAP, in some cases and a suboptimal estimator in other cases.
Mohammed S. Elmusrati – University of VAASA 27
Maximum Likelihood Estimator
• To see when MLE=MAP, let’s revisit the Bayes rule as
assume that all measurements (y1, y2, ..,yN) are
independent, then
max {log ( f ( x y ))} = max {log ( f ( y x ) f ( x ))} = max ílog ç Õ f ( y x ) f ( x )÷ ý
ìï æ N
ö üï
X k X
ï è øï
XY YX YX
î k=1 þ
ìN ü
ïî k=1
( ) ()
= max íå log é fY X yk x f X x ù ý
êë úû ï
þ
• It is clear that fX(x) is weighting the likelihood function.
Hence, if fX(x) is uniformly distributed over the whole
range of interest, then, it will not have any effects on the
location of the optimum x. In that case MLE=MAP.

Mohammed S. Elmusrati – University of VAASA 28


Maximum Likelihood Estimator
• The previous condition for MAP = MLE could be shown mathematically as:

d/dx [ log f_{X|Y}(x|y) ] |_{x=x_MAP} = 0 = d/dx [ ∑_{k=1}^{N} log f_{Y|X}(y_k|x) + log f_X(x) ] |_{x=x_MAP}

= ∑_{k=1}^{N} (1 / f_{Y|X}(y_k|x)) d f_{Y|X}(y_k|x)/dx + (1 / f_X(x)) d f_X(x)/dx = 0

• Hence, when f_X(x) = constant, its derivative is zero, so that

∑_{k=1}^{N} (1 / f_{Y|X}(y_k|x)) d f_{Y|X}(y_k|x)/dx |_{x=x_ML} = 0

Mohammed S. Elmusrati – University of VAASA 29


Maximum Likelihood Estimator
• Hence, the MLE is in most cases suboptimal. However, it can also be considered an optimal solution when no prior information is available about the parameter to be estimated.
• In this case, the best thing to assume is that the parameter is uniformly distributed. In other words, our pre-knowledge uncertainty is the same for every value.
• What is your pre-knowledge about a thrown coin landing heads or tails, without looking at any observations? Surely the best assumption is that each outcome has probability 0.5 (uniform). Then the MLE is optimum, like the MAP. But if you know in advance (based on some pre-knowledge) that the probability of heads is, for example, larger than 0.7, then MLE becomes just a suboptimal estimator, and one should use MAP, which will give a better estimate.
• These concepts and more will be described through some examples.
Mohammed S. Elmusrati – University of VAASA 30
Example (1)
• Assume we are looking at a process with two outcomes, (Success) or (Fail). It can represent many practical applications, for example:
– Hitting a target or missing it
– Correct or incorrect reception of a transmitted symbol or message
– Positive or negative revenue
• Based on some historical independent observations, we would like to estimate the process parameter (in this case the probability of success, p). Let's first assume that we have no pre-knowledge about the process.

Mohammed S. Elmusrati – University of VAASA 31


Example (1)
• Assume that we have N observations, where M of them were Successes (S) with y_k = 1 and (N − M) were Fails (F) with y_k = 0, e.g.:
SSFSFSSFFSFFFSSFSSFFFFSSSFFS …
• Since all observations are independent,

P(y | x = p) = ∏_{k=1}^{N} p^{y_k} (1 − p)^{1−y_k} = p^M (1 − p)^{N−M}

• Although it is easy to find the MLE for this problem by direct differentiation, in other, more complicated problems this can be tedious and lengthy. Converting the multiplications into a summation by taking the logarithm makes it much handier. We may call the result the likelihood function (not density!):

l(y; x = p) = log[ P(y | x = p) ] = M log p + (N − M) log(1 − p)
Mohammed S. Elmusrati – University of VAASA 32
Example (1)
• Now we find the estimate of p which maximizes the likelihood function:

d/dp l(y; x = p) |_{p=p̂} = 0  ⇒  M/p̂ − (N − M)/(1 − p̂) = 0

⇒ M/p̂ = (N − M)/(1 − p̂)  ⇒  Np̂ − Mp̂ = M − Mp̂  ⇒  p̂ = M/N

• Hence, the intuitively expected way to estimate the probability is exactly the MLE.
Mohammed S. Elmusrati – University of VAASA 33
Example (1)
• Actually, we may express what we have done in the previous slide mathematically as:

p̂ = (1/N) ∑_{k=1}^{N} y_k ;   P(y_k = 1) = p, and P(y_k = 0) = 1 − p

• In some applications, we may call y_k an indicator function.
• Is this MLE estimator biased or unbiased?
• Is it consistent or not?
• Is it an efficient estimator?

Mohammed S. Elmusrati – University of VAASA 34
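Before these questions are answered analytically on the next slides, a short Monte Carlo sketch (Python with NumPy; p = 0.3 is an arbitrary choice for the demonstration) can check them empirically: the average of p̂ = M/N over many trials stays close to p, and its spread shrinks as N grows, matching the variance p(1 − p)/N derived two slides later.

import numpy as np

rng = np.random.default_rng(1)
p, trials = 0.3, 50000

for N in (10, 100, 1000):
    y = rng.random((trials, N)) < p        # N independent Bernoulli(p) observations per trial
    p_hat = y.mean(axis=1)                 # MLE  p_hat = M / N  for each trial
    print(f"N={N:5d}  E[p_hat]~{p_hat.mean():.4f}  (true p={p})  "
          f"Var[p_hat]~{p_hat.var():.6f}  p(1-p)/N={p * (1 - p) / N:.6f}")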


Example (1)
• Based on the definition of unbiased estimators, we should find the expected value of the estimate:

E[y_k] = 1·P(y_k = 1) + 0·P(y_k = 0) = p

• Therefore,

E[p̂] = E[ (1/N) ∑_{k=1}^{N} y_k ] = (1/N) ∑_{k=1}^{N} E[y_k] = (1/N) ∑_{k=1}^{N} p = Np/N = p

• Hence, this estimator is unbiased.

Mohammed S. Elmusrati – University of VAASA 35


Example (1)
• Now let’s compute the variance of the estimated
value to see if it is consistent or not! The
derivation is given in step-by-step next:
éæ 1 N ö 2 ù
() (é
) ù
( )
Var p̂ = E ê p̂ - E éë p̂ ùû ú = E é p̂ - p ù = E éë p̂2 ùû - p2 = E êç å yk ÷ ú - p2
2 2

ë û êë úû êè N k=1 ø ú
ë û
éæ 1 N ö 2 ù é 1 æ N ö2ù é 1 N N ù 1 N N
Þ E êç å yk ÷ ú = E ê 2 ç å y k ÷ ú = E ê 2 å å yk yi ú = 2 å å E éë yk yi ùû
êè N k=1 ø ú ê N è k=1 ø ú ë N k=1 i=1 û N k=1 i=1
ë û ë û
ì E é y2 ù , k=i ìï p, k = i
ï ë kû
E éë yk yi ùû = í ( ) ( )
;E éë yk ùû = 1 P yk = 1 + 0 P y k = 0 = p Þ E éë yk yi ùû = í 2
2 2 2

ï E éë yk ùû E éë yi ùû , k ¹ i ïî p , k ¹ i
î
p N -1 2 ( ) p N -1 2 2 p 1- p ( ) ( )
1 N N
N k=1 i=1
é ù
N
1
( (
Þ 2 å å E ë y k y i û = 2 Np + N - N p = +
2
Mohammed
2
S. N
) )
Elmusrati – N
p ÞVar p̂ = +
University of VAASA N N
()
p -p =
36N
Example (1)
• From the previous slide's result it is clear that

lim_{N→∞} Var(p̂) = lim_{N→∞} p(1 − p)/N = 0

• Hence the estimator is also consistent ☺
• Is it possible to have a better unbiased and consistent estimator than this one? To answer this, we should find the variance lower bound (CRLB):

Var(x_est) ≥ −1 / E[∂² log f(y; x) / ∂x²]

• This is left as an exercise!
Mohammed S. Elmusrati – University of VAASA 37
Example (2)
• Suppose we have some extra prior knowledge, or a different uncertainty level, about the parameter x, and that it can be represented by the following density function:

f_X(x) = 4 e^{−αx}, 0 ≤ x ≤ 1

• Find the value of α.
• How might this knowledge affect our optimum estimate of the probability x based on the observations?
• Compare both results.
Mohammed S. Elmusrati – University of VAASA 38
Example (2)
• If we ignore our uncertainty about the parameter before looking at any measurements or observations, we will have the MLE estimator.
• However, this might not be optimum, as we have ignored important uncertainty information.
• The prior probability density function of the unknown parameter is shown in the next slide.

Mohammed S. Elmusrati – University of VAASA 39


Example (2)
∫_0^1 f_X(x) dx = 1  ⇒  4 ∫_0^1 e^{−αx} dx = 1  ⇒  α = 3.9207 (prove it!)

[Figure: the prior density f_X(x) = 4e^{−αx} on 0 ≤ x ≤ 1, decaying from 4 at x = 0.]

Looking at the prior distribution, the probability of being a Success or a Fail is not uniform. Now we have a more accurate impression of the uncertainty. Actually, we know that the probability of the Success case is less than 0.3 with a chance of about 70%. This kind of information should have an impact and improve our estimation of the parameter x.

Mohammed S. Elmusrati – University of VAASA 40


Example (2)
• From slide 29:

d/dx [ ∑_{k=1}^{N} log f_{Y|X}(y_k|x) + log f_X(x) ] |_{x=x_MAP} = 0

d/dx [ M log x + (N − M) log(1 − x) + log 4 − αx ] |_{x=x_MAP} = 0

M/x_MAP − (N − M)/(1 − x_MAP) − α = 0  ⇒  α x_MAP² − (α + N) x_MAP + M = 0

⇒ x_MAP = [ (3.92 + N) ± √((3.92 + N)² − 15.68 M) ] / 7.84,   0 ≤ x_MAP ≤ 1
Mohammed S. Elmusrati – University of VAASA 41
Example (2)
• The previous expression gives two roots, and we should always select the one which lies between 0 and 1.
• Let's assume that in our observations we have M/N = 0.5. The MLE estimate is then p = 0.5. But what will the MAP estimate be, given the availability of the a priori density function?
• The next figure shows the MAP estimate for M/N = 0.5 for several values of N.

Mohammed S. Elmusrati – University of VAASA 42


Example (2)
[Figure: the MAP estimate p̂ = x_MAP plotted versus N for the fixed ratio M/N = 1/2.]

From this figure we can easily see the impact of the pre-knowledge on the estimation of the parameter.

If we have only two observations, i.e., N = 2 with M = 1, the MLE of the probability is p = 0.5. But with MAP, we can see that the estimate is only about 0.2. However, if we repeat the experiment many times, i.e., N is very large and we still have M/N = 0.5, then we approach the belief that p = 0.5.

Mohammed S. Elmusrati – University of VAASA 43
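The curve described above can be reproduced with the following sketch (Python with NumPy). It evaluates the closed-form root derived on slide 41 for the fixed ratio M/N = 1/2 and keeps the root lying in [0, 1]; the value α = 3.9207 is the one found on slide 40.

import numpy as np

ALPHA = 3.9207                     # prior parameter found on slide 40

def x_map(M, N, alpha=ALPHA):
    """Root of alpha*x^2 - (alpha + N)*x + M = 0 that lies in [0, 1]."""
    b = alpha + N
    roots = (b + np.array([-1.0, 1.0]) * np.sqrt(b ** 2 - 4 * alpha * M)) / (2 * alpha)
    return roots[(roots >= 0) & (roots <= 1)][0]

for N in (2, 4, 10, 50, 200, 1000):
    M = N / 2                      # keep the observed ratio M/N = 0.5
    print(f"N={N:5d}  MLE=0.500  MAP={x_map(M, N):.3f}")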


Exercise
• In the previous estimation example, suppose that as a priori information we know that the probability is uniformly distributed between 0.4 and 0.8.
• Find the MAP estimate in this case.

Mohammed S. Elmusrati – University of VAASA 44


Example (3)
• Assume that we are interested in estimating the actual value of a constant x. However, the observation (or measurement) is always corrupted by zero-mean Gaussian noise with known variance σ².
• The mathematical model of this problem is

y_i = x + n_i

• where y_1, y_2, …, y_N are the N available measurements and n_1, n_2, …, n_N are independent identically distributed zero-mean Gaussian (Normal) noise samples.

Mohammed S. Elmusrati – University of VAASA 45


Example (3)
• The example presented in the previous slide is very important, as it gives the foundation for many concepts in estimation theory.
• Since the samples n_i are a zero-mean normally distributed random process, it is clear that the measurements y_i are also normally distributed, but with mean equal to the constant x and with the same variance as n_i. Therefore,

f_{Y|X}(y_i | x) = (1 / (√(2π) σ)) e^{−(y_i − x)² / (2σ²)}
Mohammed S. Elmusrati – University of VAASA 46
Example (3)
• In this example, we assume that we have no prior knowledge about the parameter x that we are looking to estimate. Therefore, the optimum estimator is the maximum likelihood estimate.
• Assume we have N measurements y_1, y_2, …, y_N. Then the likelihood density becomes (due to the independence of the n_i)

f_{Y|X}(y | x) = ∏_{i=1}^{N} (1 / (√(2π) σ)) e^{−(y_i − x)² / (2σ²)} = (2π)^{−N/2} σ^{−N} exp( −∑_{i=1}^{N} (y_i − x)² / (2σ²) )
Mohammed S. Elmusrati – University of VAASA 47


Example (3)
• Again we compute the likelihood function as

l(y; x) = log[ f_{Y|X}(y|x) ] = −(N/2) log(2π) − N log σ − ∑_{i=1}^{N} (y_i − x)² / (2σ²)

⇒ d/dx l(y; x) |_{x=x_ML} = ∑_{i=1}^{N} (y_i − x_ML)/σ² = 0  ⇒  x_ML = (1/N) ∑_{i=1}^{N} y_i

• Hence, the conventional sample mean is the MLE of the actual mean value.
Mohammed S. Elmusrati – University of VAASA 48


Example (3)
• It is quite easy to prove that the previous MLE of the mean is an unbiased and consistent estimator:

E[x_ML] = E[ (1/N) ∑_{i=1}^{N} y_i ] = (1/N) ∑_{i=1}^{N} E[y_i] = (1/N) ∑_{i=1}^{N} x = Nx/N = x  ⇒ unbiased

Var(x_ML) = E[(x_ML − E[x_ML])²] = E[ ((1/N) ∑_i y_i − x)² ] = (1/N²) E[ (∑_i y_i − Nx)² ]

= (1/N²) E[ ∑_i y_i² + ∑_i ∑_{k≠i} y_i y_k − 2Nx ∑_i y_i + N²x² ]

= (1/N²) [ N(σ² + x²) + N(N − 1)x² − 2N²x² + N²x² ] = Nσ²/N² = σ²/N  ⇒ consistent
Mohammed S. Elmusrati – University of VAASA 49
Example (3)
• Let’s check if the estimator is efficient or not:
-1
( )
Var x est ³
é ¶2 ù
E ê 2 log f y;x ú ( )
ë ¶x û
It should be quit easy for
( ) ( ) ()
fY,X y ; x = fY X y x f X x Þ log éë fY,X ( ) ë ( )
y ; x ùû = log éê fY X y x ()
ù + log é f x ù
úû ë X û you to prove the
N regularity condition:
å( y - x )
2

N
( ) ( ) ( ( ))
i
= - log 2p - N log s - i=1
+ a constant representslog éë f X x ùû
2 2s 2

(
å y i - x ¶2 ) é ¶ ù

( )
Þ log éë fY,X y ; x ùû = i=1 2
N
Þ 2 log éë fY,X y ; x ùû = - 2 ( ) ( )
E ê log f y;x ú = 0
¶x s ¶x s ë ¶x û
s2
( )
ÞVar x est ³
N
, Hence, x ML is an efficient estimator

Mohammed S. Elmusrati – University of VAASA 50
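A quick numerical cross-check of this efficiency claim (a sketch in Python with NumPy; the values of x and σ are arbitrary): the empirical variance of the sample-mean estimator over many noisy realizations should sit essentially at the bound σ²/N.

import numpy as np

rng = np.random.default_rng(2)
x_true, sigma, trials = 5.0, 2.0, 100000

for N in (5, 50, 500):
    y = x_true + rng.normal(0.0, sigma, size=(trials, N))   # y_i = x + n_i
    x_ml = y.mean(axis=1)                                    # MLE = sample mean
    print(f"N={N:4d}  empirical Var(x_ML)={x_ml.var():.5f}  "
          f"CRLB sigma^2/N={sigma ** 2 / N:.5f}")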


Example (4)
• In the same example, assume that we are also interested in estimating the noise variance.

l(y; x) = log[ f_{Y|X}(y|x) ] = −(N/2) log(2π) − N log σ − ∑_{i=1}^{N} (y_i − x)² / (2σ²)

⇒ d/dσ l(y; x) |_{σ=σ_ML} = −N/σ_ML + ∑_{i=1}^{N} (y_i − x_ML)² / σ_ML³ = 0  ⇒  σ_ML² = (1/N) ∑_{i=1}^{N} (y_i − x_ML)²
Mohammed S. Elmusrati – University of VAASA 51


Example (4)
• Is the previous estimate of the variance unbiased?

∵ E[x_ML²] = (1/N²) E[ (∑_{i=1}^{N} y_i)² ] = [ N(σ² + x²) + N(N − 1)x² ] / N² = (σ² + Nx²)/N = σ²/N + x²

and, similarly, E[y_i x_ML] = (1/N) ∑_{k=1}^{N} E[y_i y_k] = (1/N) [ (σ² + x²) + (N − 1)x² ] = σ²/N + x²

E[σ_ML²] = (1/N) E[ ∑_{i=1}^{N} (y_i − x_ML)² ] = (1/N) ∑_{i=1}^{N} E[ y_i² − 2 y_i x_ML + x_ML² ]

= (N/N) [ (σ² + x²) − 2(σ²/N + x²) + (σ²/N + x²) ] = ((N − 1)/N) σ²  ⇒  Biased

• To have an unbiased estimator:

σ_ML² = ∑_{i=1}^{N} (y_i − x_ML)² / (N − 1)

• Is it a big problem? No, especially for large N!
Mohammed S. Elmusrati – University of VAASA 52
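The bias and its N − 1 correction can be seen numerically with the following sketch (Python with NumPy; x = 1, σ² = 4 and N = 5 are arbitrary choices for the demonstration).

import numpy as np

rng = np.random.default_rng(3)
x_true, sigma2, N, trials = 1.0, 4.0, 5, 200000

y = x_true + rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
x_ml = y.mean(axis=1, keepdims=True)          # sample mean of each trial
ss = ((y - x_ml) ** 2).sum(axis=1)            # sum of squared deviations from x_ML

print("average of  ss / N      :", (ss / N).mean())         # ~ (N-1)/N * sigma2 = 3.2 (biased)
print("average of  ss / (N-1)  :", (ss / (N - 1)).mean())    # ~ sigma2 = 4.0 (unbiased)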
Example (5)
• In the same previous example, let’s assume that we have a
priori knowledge about the parameter to be estimated.
• For example, assume x itself has Normal distribution with
known mean 𝜇x and variance 𝝈x.
• It might be the same problem that x is not fixed but
changing in a random manner, however, we may assume
that it is fixed during the measurement period. One
example is tracking a moving object in unpredictable way.
Hence, we will collect data to estimate its updated location
• Since in this problem we have some extra knowledge even
with high uncertainty, we should use MAP instead of MLE.

Mohammed S. Elmusrati – University of VAASA 53


Example (5)
• Using the MAP formulation given before in slide 29:

Mohammed S. Elmusrati – University of VAASA 54
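For reference, a reconstruction sketch of the result this formulation leads to (assuming the Gaussian prior N(μ_x, σ_x²), the Gaussian likelihood of Example 3 with noise variance σ², and β = σ_x²/σ² so that the limiting behaviour matches the next slide's discussion):

x_MAP = ( σ_x² ∑_{i=1}^{N} y_i + σ² μ_x ) / ( N σ_x² + σ² ) = ( β ∑_{i=1}^{N} y_i + μ_x ) / ( N β + 1 ),   β = σ_x²/σ²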


Example (5)
• The result of the previous example is rather interesting.
• If σ_x is very small, close to zero, this means that the uncertainty about x is very small and x should be very close to μ_x. Look at the MAP estimate with β close to zero: you can see that x_MAP → μ_x regardless of the number of samples N and the values of y_i.
• When σ_x is not small but N is very large, then our estimate will be close to the MLE (the sum of the measurements divided by N).
• Actually, the MAP estimate is an optimized compromise between the information gained from the measurements and the prior information carried by f_X(x).

Mohammed S. Elmusrati – University of VAASA 55


Example (6)
• Assume that we are interested in estimating a slowly changing unknown random process. We know that it follows the Normal distribution N(1, 8); however, our measurements are corrupted with zero-mean random noise distributed as N(0, 1).
• Write a simulation code to assess both the MAP and MLE estimation methods for N = 1 to 100. Compute the average error over 20 random values of the unknown parameter.
• Plot the results.
Mohammed S. Elmusrati – University of VAASA 56
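One possible solution sketch (Python with NumPy and Matplotlib; it assumes N(1, 8) denotes a prior with mean 1 and variance 8, N(0, 1) zero-mean unit-variance noise, and it reuses the Gaussian-prior MAP form sketched after slide 54):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
mu_x, var_x, var_n = 1.0, 8.0, 1.0        # assumed prior N(1, 8) and noise N(0, 1) variances
n_values = np.arange(1, 101)
n_params = 20                             # 20 random draws of the unknown parameter

err_mle = np.zeros(len(n_values))
err_map = np.zeros(len(n_values))

for _ in range(n_params):
    x = rng.normal(mu_x, np.sqrt(var_x))                      # the (slowly changing) parameter
    y = x + rng.normal(0.0, np.sqrt(var_n), size=n_values[-1])
    for j, N in enumerate(n_values):
        s = y[:N].sum()
        x_mle = s / N                                          # sample mean
        x_map = (var_x * s + var_n * mu_x) / (N * var_x + var_n)
        err_mle[j] += abs(x_mle - x) / n_params
        err_map[j] += abs(x_map - x) / n_params

plt.plot(n_values, err_mle, label="MLE")
plt.plot(n_values, err_map, label="MAP")
plt.xlabel("N"); plt.ylabel("average |error|"); plt.legend(); plt.show()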
Example (7)
• A random variable x is to be estimated on the basis of a priori information, and the i-th noisy measurement is expressed as y_i = x + n_i, where n_i is the i-th noise sample.
• Moreover, x and n_i are assumed to be independent. The distribution functions of x and n_i are shown next.
• Find the optimum estimate of x.

[Figure: f_X(x) is uniform on 0 ≤ x ≤ 2 with height 1/2; f_n(n) is uniform on −1/2 ≤ n ≤ 1/2 with height 1.]
Mohammed Salem Elmusrati 57


Example (7)
• As we have done before, let’s first construct fX|yi
fy ( y x) f (x)
f X y x yi =
i
( ) i
X i

( )
f y yi
X

• Since fy is not function in the parameter x, then we may ignore


it, also fX is fixed from 0 to 2, then it is useful only in the
determination of the range of the admissible values of x.
• Therefore, as for the MLE estimation, we may find the
optimum x by looking to one of moments of fyi|X
• It is clear that ì 1 1
fy
i
X ( ) ( ï 1 - £ yi - x £
yi x = fn yi - x = í
ï 0
2 ) 2
î eleswhere
Mohammed Salem Elmusrati 58
Example (7)
• From the previous equation, viewed as a function of x, we have

f_{y_i|X}(y_i | x) = f_n(y_i − x) = 1 for y_i − 1/2 ≤ x ≤ y_i + 1/2, and 0 elsewhere

• Therefore, the parameter x can be determined based on the measurement. For example, if we have a single measurement, say y_i = 1, the conditional density will be uniform from 0.5 to 1.5. Clearly this uniform density does not have a single mode (maximum) value. Hence, we may take the mean or the median, which are equal, giving x̂ = 1, i.e., x̂ = y_i. However, we should keep in mind that 0 ≤ x ≤ 2 as well.
Mohammed S. Elmusrati – University of VAASA 59


Example (7)
• For example, what will be our estimate if y_i = 2.2? Here, we know from our prior information that the maximum of x is 2, so x cannot be 2.2. Therefore, we should truncate the maximum to 2, while the minimum is 2.2 − 0.5 = 1.7. The average is then (1.7 + 2)/2 = 1.85.
• We may construct the optimum estimate of the parameter based on the measurement as:

x̂ = y_i,            0.5 ≤ y_i ≤ 1.5
x̂ = (y_i + 0.5)/2,   y_i ≤ 0.5
x̂ = (y_i + 1.5)/2,   y_i > 1.5

Mohammed S. Elmusrati – University of VAASA 60
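A small sketch of this piecewise rule (Python; the printed checks reproduce the cases y_i = 1 and y_i = 2.2 worked above, and the rule is written as the midpoint of the admissible interval, which is equivalent to the three cases):

def x_hat(y):
    """Midpoint of the admissible interval [max(0, y - 0.5), min(2, y + 0.5)]."""
    lo = max(0.0, y - 0.5)
    hi = min(2.0, y + 0.5)
    return (lo + hi) / 2.0

print(x_hat(1.0))   # 1.0   (interval [0.5, 1.5]: the estimate equals the measurement)
print(x_hat(2.2))   # 1.85  (interval truncated to [1.7, 2.0])
print(x_hat(0.2))   # 0.35  ( = (0.2 + 0.5) / 2, interval truncated to [0, 0.7])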
Example (8)
• In Example (3), suppose the noise samples n_i are correlated. How will this dependence affect the estimation of the parameter x?
• The conditional probability of the measurements given the estimated parameter is:
Mohammed S. Elmusrati – University of VAASA 61


Example (8)

Mohammed S. Elmusrati – University of VAASA 62


Example (9)
• In Example (8), find analytically the ML estimate for two measurements and the following three cases:

R_nn = [ σ²  0 ;  0  σ² ],   R_nn = [ σ₁²  0 ;  0  σ₂² ],   R_nn = [ σ₁²  a ;  a  σ₂² ]

• Compare and comment on the results.

Mohammed S. Elmusrati – University of VAASA 63
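Since the worked derivation is not reproduced here, the following is a hedged numerical sketch (Python with NumPy; my own formulation, offered as an assumption rather than the slides' solution). One standard way to write the Gaussian ML estimate of a constant x from y = x·1 + n with known noise covariance R_nn is the weighted average x_ML = (1ᵀ R_nn⁻¹ y) / (1ᵀ R_nn⁻¹ 1), which reduces to the plain sample mean when R_nn = σ²I; the three covariance cases above can then be compared numerically.

import numpy as np

def x_ml_correlated(y, R):
    """ML estimate of x for y = x*1 + n, n ~ N(0, R): a weighted average of y."""
    ones = np.ones(len(y))
    w = np.linalg.solve(R, ones)           # R^{-1} 1
    return (w @ y) / (w @ ones)            # (1' R^-1 y) / (1' R^-1 1)

y = np.array([2.0, 3.0])
s1, s2, a = 1.0, 4.0, 1.5                  # example variances and correlation term (assumed)

for R in (np.diag([s1, s1]),               # equal variances   -> plain average (2.5)
          np.diag([s1, s2]),               # unequal variances -> the precise sensor weighs more
          np.array([[s1, a], [a, s2]])):   # correlated noise  -> the weights change again
    print(np.round(R, 2).tolist(), "->", round(x_ml_correlated(y, R), 4))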


Example (9)

Mohammed S. Elmusrati – University of VAASA 64


Example (9)

Mohammed S. Elmusrati – University of VAASA 65


Example (9)

Mohammed S. Elmusrati – University of VAASA 66


Example 10
• Assume a system consists of complex interconnected subsystems. Those subsystems may fail independently, with exponentially distributed times to failure with parameter λ. The system is robust, so it suffers a general fault only if k of the subsystems have failed to operate.
• Therefore, the time until the system has a general failure is given by

y = ∑_{i=1}^{k} x_i

where x_i is an exponentially distributed random variable with parameter λ. Hence, the distribution of y is Gamma with the following probability density function:

f_Y(y) = y^{k−1} λ^{k} e^{−λy} / Γ(k)
Mohammed S. Elmusrati – University of VAASA 67
Example 10
• We have a database history of a certain system with several recorded failure times y = [y_1, y_2, …, y_N].
• Assume we know neither k nor λ. Based on the observations y, find the MLE of k and λ.
• Solution:
• Since we assume that all records are independent,

f_Y(y) = ∏_{i=1}^{N} y_i^{k−1} λ^{k} e^{−λ y_i} / Γ(k) = λ^{Nk} ( ∏_{i=1}^{N} y_i )^{k−1} e^{−λ ∑_{i=1}^{N} y_i} / Γ(k)^{N}

• The log-likelihood function is given by

l(y; k, λ) = Nk log λ + (k − 1) ∑_{i=1}^{N} log y_i − λ ∑_{i=1}^{N} y_i − N log Γ(k)
Mohammed S. Elmusrati – University of VAASA 68


Example 10
• Now we can find the optimum parameters that maximize the log-likelihood function:

∂l/∂λ |_{λ=λ_ML} = Nk/λ_ML − ∑_{i=1}^{N} y_i = 0  ⇒  λ_ML = N k_ML / ∑_{i=1}^{N} y_i

• Substituting this result into ∂l/∂k = N log λ + ∑_{i=1}^{N} log y_i − N Γ′(k)/Γ(k) = 0, we obtain

log( N k_ML / ∑_{i=1}^{N} y_i ) − Γ′(k_ML)/Γ(k_ML) + (1/N) ∑_{i=1}^{N} log y_i = 0

⇒ log k_ML − Γ′(k_ML)/Γ(k_ML) = log( (1/N) ∑_{i=1}^{N} y_i ) − (1/N) ∑_{i=1}^{N} log y_i,   where Γ′(k_ML) = dΓ(k)/dk |_{k=k_ML}

Mohammed S. Elmusrati – University of VAASA 69
Example 10
• It is clear from the previous result that estimating the parameters of the Gamma distribution requires solving a non-linear equation. There are many efficient numerical methods that can be used to solve it.
• If we have the following database, y = [2, 3, 7, 9, 3, 5], estimate the parameters k and λ.

Mohammed S. Elmusrati – University of VAASA 70
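One possible numerical sketch for the given data (Python with NumPy and SciPy; it solves the equation from slide 69 with Brent's method and treats k as a continuous parameter):

import numpy as np
from scipy.special import digamma          # digamma(k) = Gamma'(k) / Gamma(k)
from scipy.optimize import brentq

y = np.array([2.0, 3.0, 7.0, 9.0, 3.0, 5.0])
rhs = np.log(y.mean()) - np.log(y).mean()  # log(mean) - mean(log), the right-hand side

# Solve  log(k) - digamma(k) = rhs  for k; the left side decreases toward 0 as k grows.
k_ml = brentq(lambda k: np.log(k) - digamma(k) - rhs, 1e-6, 1e6)
lam_ml = k_ml / y.mean()                   # from d l / d lambda = 0: lambda_ML = k_ML / mean(y)

print(f"k_ML = {k_ml:.3f},  lambda_ML = {lam_ml:.3f}")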


Exercise
• You have a database history of a certain system, y = [y_1, y_2, …, y_N]. We believe that the y_i represent Chi-square random variables (slide 116 in Part 1). Find the MLE of the number of degrees of freedom and the variance.

Mohammed S. Elmusrati – University of VAASA 71


THANK YOU

Mohammed S. Elmusrati – University of VAASA 72
