Part 2: Estimation
Estimation Problems
• For the previous equation we have at least the following three interesting cases in estimation:
1. We know (at least partially) x and the observation y, and we are looking for the best set of functions h and their parameters θ. We call this the regression problem (or curve fitting), and it is one of the tools used in many applications such as machine learning. We may know the general form of the mapping h, for example the increasing exponential functions in population models, but we need to know (or estimate) the parameters of the exponential function. Sometimes we do not even know the mapping, and we need to test or modify some general shapes to fit the available x and y data, or use black-box modelling (a minimal curve-fitting sketch follows below).
• For a vector of estimated parameters, the CRLB for the $i$th element is

$$\operatorname{Var}\left[x_{\text{est},i}\right] \;\ge\; \left[\mathbf{I}^{-1}(\mathbf{x})\right]_{ii}$$
• This means that the variance of the ith parameter estimate
is lower bounded by the diagonal of the inverse Fisher
information matrix under the regularity condition.
Cramer-Rao Lower Bound
• We can reduce the general form of the Cramer-Rao Lower Bound (CRLB) given in the previous slide to the case of a single estimated parameter:
$$\operatorname{Var}\left(x_{\text{est}}\right) \;\ge\; \frac{-1}{E\!\left[\dfrac{\partial^2}{\partial x^2}\log f(y;x)\right]}$$
• The regularity condition is
$$E\!\left[\frac{\partial}{\partial x}\log f(y;x)\right] = 0$$
• The proof is not difficult and can be found in many books on estimation theory; it is not given here. Nevertheless, we will use the bound to prove the efficiency of certain estimators.
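As a hedged numerical illustration (mine, not from the slides), the bound can be checked by Monte Carlo: under the regularity condition, $-E[\partial^2 \log f(y;x)/\partial x^2]$ equals the Fisher information $E[(\partial \log f(y;x)/\partial x)^2]$, which we approximate by averaging the squared score over simulated data. Here we assume a single observation $y \sim \mathcal{N}(x, \sigma^2)$, for which the bound is $\sigma^2$.

```python
import numpy as np

# Assumed model (illustration only): one observation y ~ N(x, sigma^2).
x_true, sigma = 1.0, 2.0
rng = np.random.default_rng(1)
y = rng.normal(x_true, sigma, size=200_000)

# Score = d/dx log f(y; x), evaluated at the true x (known inside the simulation).
score = (y - x_true) / sigma**2
print("regularity check, E[score] ~ 0:", score.mean())

# Fisher information I(x) = E[score^2]; the CRLB is 1/I(x) = sigma^2 here.
fisher_info = np.mean(score**2)
print("Monte Carlo CRLB:", 1.0 / fisher_info, " analytic sigma^2:", sigma**2)
```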
Least Mean Square Error Estimate
• Let's derive our first estimation algorithm. We assume a single scalar parameter, i.e., we have N observations given by $y_i = h(x) + n_i,\; i = 1, \dots, N$.
• We don't know the function $h(\cdot)$, and $n_i$ is additive random measurement noise.
• One possible criterion is to find the best estimate $x_{\text{est}}$ that minimizes the mean square error:
$$\min_{x_{\text{est}}}\; E\!\left[\left(x - x_{\text{est}}(y)\right)^2\right]$$
Least Mean Square Error Estimate
• We may formulate the expected value as

$$E\!\left[\left(x - x_{\text{est}}(y)\right)^2\right] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left(x - x_{\text{est}}(y)\right)^2 f_{XY}(x,y)\,dx\,dy$$

• Writing $f_{XY}(x,y) = f_{X|Y}(x|y)\,f_Y(y)$ and noting that $f_Y(y) \ge 0$, it is enough to minimize the inner integral over x for each y:

$$\frac{d}{dx_{\text{est}}}\!\left[\int_{-\infty}^{\infty}\left(x - x_{\text{est}}(y)\right)^2 f_{X|Y}(x|y)\,dx\right] = 0 \;\Rightarrow\; -2\int_{-\infty}^{\infty}\left(x - x_{\text{est}}(y)\right) f_{X|Y}(x|y)\,dx = 0$$

$$\int_{-\infty}^{\infty} x\, f_{X|Y}(x|y)\,dx - \int_{-\infty}^{\infty} x_{\text{est}}(y)\, f_{X|Y}(x|y)\,dx = \int_{-\infty}^{\infty} x\, f_{X|Y}(x|y)\,dx - x_{\text{est}}(y)\int_{-\infty}^{\infty} f_{X|Y}(x|y)\,dx = 0$$

• Therefore,

$$x_{\text{est}}(y) = \int_{-\infty}^{\infty} x\, f_{X|Y}(x|y)\,dx = E\left[x \mid y\right]$$
• Similarly, if we instead minimize the mean absolute error, the same steps give

$$\frac{d}{dx_{\text{est}}}\,E\Big[\big|x - x_{\text{est}}(y)\big|\Big] = 0 \;\Rightarrow\; \frac{d}{dx_{\text{est}}}\int_{-\infty}^{\infty} f_Y(y)\left[\int_{-\infty}^{\infty}\big|x - x_{\text{est}}(y)\big|\, f_{X|Y}(x|y)\,dx\right] dy = 0$$

$$\Rightarrow\; \frac{d}{dx_{\text{est}}}\int_{-\infty}^{\infty}\big|x - x_{\text{est}}(y)\big|\, f_{X|Y}(x|y)\,dx = \frac{d}{dx_{\text{est}}}\left[-\int_{-\infty}^{x_{\text{est}}(y)}\left(x - x_{\text{est}}(y)\right) f_{X|Y}(x|y)\,dx + \int_{x_{\text{est}}(y)}^{\infty}\left(x - x_{\text{est}}(y)\right) f_{X|Y}(x|y)\,dx\right] = 0$$

$$\Rightarrow\; \int_{-\infty}^{x_{\text{est}}(y)} f_{X|Y}(x|y)\,dx - \int_{x_{\text{est}}(y)}^{\infty} f_{X|Y}(x|y)\,dx = 0 \;\Rightarrow\; \int_{-\infty}^{x_{\text{est}}(y)} f_{X|Y}(x|y)\,dx = \int_{x_{\text{est}}(y)}^{\infty} f_{X|Y}(x|y)\,dx$$

• That is, $x_{\text{est}}(y)$ is the median of $f_{X|Y}(x|y)$.
• Finally, for the uniform (hit-or-miss) cost function the problem reduces to maximizing the conditional probability that the error is small, i.e., maximizing an integral of $f_{X|Y}(x|y)$ over a small interval of width $2\Delta$ centered at $x_{\text{est}}(y)$:

$$\max_{x_{\text{est}}}\;\int_{x_{\text{est}}(y)-\Delta}^{\,x_{\text{est}}(y)+\Delta} f_{X|Y}(x|y)\,dx$$

• Roughly speaking, for small $\Delta$ this maximum is achieved at the maximum of $f_{X|Y}(x|y)$. In other words, the best estimate in this case is the mode of $f_{X|Y}(x|y)$.
[Figure: the conditional density $f_{X|Y}(x|y)$ with the three estimates $\hat{x}_{MV}$, $\hat{x}_{MM}$, and $\hat{x}_{MAP}$ marked on the $x$-axis.]

It is interesting to know that if the conditional density is symmetric, like the Normal distribution, then all three estimators are identical.
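To make the comparison concrete, the sketch below (my own illustration, not from the slides) evaluates a conditional density on a grid and reads off the three estimates; for a symmetric, Gaussian-shaped $f_{X|Y}(x|y)$ they coincide, while for a skewed density they differ.

```python
import numpy as np

def mean_median_mode(x, fx):
    """Given a conditional density sampled on a grid x, return (mean, median, mode)."""
    fx = fx / np.trapz(fx, x)                 # normalize the grid density
    mean = np.trapz(x * fx, x)                # conditional mean (minimum MSE estimate)
    cdf = np.cumsum(fx) * (x[1] - x[0])
    median = x[np.searchsorted(cdf, 0.5)]     # minimizes the mean absolute error
    mode = x[np.argmax(fx)]                   # MAP estimate
    return mean, median, mode

x = np.linspace(-5.0, 10.0, 4001)
symmetric = np.exp(-0.5 * (x - 2.0) ** 2)        # Gaussian-shaped: all three ~ 2
skewed = np.where(x > 0, x * np.exp(-x), 0.0)    # Gamma-shaped: mean ~ 2, median ~ 1.7, mode = 1

print("symmetric:", mean_median_mode(x, symmetric))
print("skewed   :", mean_median_mode(x, skewed))
```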
• Using Bayes' rule,

$$f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\,f_X(x)}{f_Y(y)}$$

and, for N independent observations $y_1, \dots, y_N$, setting the derivative of the log of the numerator with respect to x to zero gives the MAP condition

$$\sum_{k=1}^{N}\frac{1}{f_{Y|X}(y_k|x)}\,\frac{d\,f_{Y|X}(y_k|x)}{dx} \;+\; \frac{1}{f_X(x)}\,\frac{d\,f_X(x)}{dx} \;=\; 0$$
• For example, for N binary observations containing M ones, maximizing the likelihood of p (with no prior term) gives

$$\frac{M}{\hat{p}} = \frac{N-M}{1-\hat{p}} \;\Rightarrow\; N\hat{p} - M\hat{p} = M - M\hat{p} \;\Rightarrow\; \hat{p} = \frac{M}{N}$$
• Hence, the intuitively expected result for estimating the probability is actually the ML estimate (a quick numerical check follows below).
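A quick sketch (with assumed data, not from the slides) that checks $\hat{p} = M/N$ numerically by maximizing the Bernoulli log-likelihood $M\log p + (N-M)\log(1-p)$ on a grid:

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, N = 0.3, 1000
y = rng.binomial(1, p_true, size=N)   # N Bernoulli observations (hypothetical data)
M = int(y.sum())                      # number of ones

# Log-likelihood of the observations as a function of p, maximized over a grid.
p_grid = np.linspace(1e-4, 1.0 - 1e-4, 100_000)
loglike = M * np.log(p_grid) + (N - M) * np.log(1.0 - p_grid)
p_ml = p_grid[np.argmax(loglike)]

print("M/N =", M / N, " grid-search MLE =", p_ml)   # the two should agree closely
```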
Example (1)
• Actually we may express what we have done in the
previous slide mathematically as:
$$\hat{p} = \frac{1}{N}\sum_{k=1}^{N} y_k;\qquad P(y_k = 1) = p,\quad P(y_k = 0) = 1 - p$$
• In some applications, $y_k$ may be called an indicator function.
• Is this MLE estimator biased or unbiased?
• Is it consistent or not?
• Is it an efficient estimator?
$$E[\hat{p}] = E\!\left[\frac{1}{N}\sum_{k=1}^{N} y_k\right] = \frac{1}{N}\sum_{k=1}^{N} E[y_k] = p \;\Rightarrow\; \text{unbiased}$$

$$\Rightarrow\; E\!\left[\left(\frac{1}{N}\sum_{k=1}^{N} y_k\right)^{2}\right] = E\!\left[\frac{1}{N^2}\left(\sum_{k=1}^{N} y_k\right)^{2}\right] = E\!\left[\frac{1}{N^2}\sum_{k=1}^{N}\sum_{i=1}^{N} y_k y_i\right] = \frac{1}{N^2}\sum_{k=1}^{N}\sum_{i=1}^{N} E\left[y_k y_i\right]$$

$$E\left[y_k y_i\right] = \begin{cases} E\left[y_k^2\right], & k = i \\ E\left[y_k\right] E\left[y_i\right], & k \ne i \end{cases};\qquad E\left[y_k\right] = 1\cdot P(y_k = 1) + 0\cdot P(y_k = 0) = p \;\Rightarrow\; E\left[y_k y_i\right] = \begin{cases} p, & k = i \\ p^2, & k \ne i \end{cases}$$

$$\Rightarrow\; \frac{1}{N^2}\sum_{k=1}^{N}\sum_{i=1}^{N} E\left[y_k y_i\right] = \frac{1}{N^2}\left(Np + \left(N^2 - N\right)p^2\right) = \frac{p}{N} + \frac{(N-1)p^2}{N}$$

$$\Rightarrow\; \operatorname{Var}(\hat{p}) = E\left[\hat{p}^2\right] - p^2 = \frac{p}{N} + \frac{(N-1)p^2}{N} - p^2 = \frac{p(1-p)}{N}$$
Example (1)
• From the previous slide result it is clear that
$$\lim_{N\to\infty}\operatorname{Var}(\hat{p}) = \lim_{N\to\infty}\frac{p(1-p)}{N} = 0$$

i.e., $\hat{p}$ is a consistent estimator.
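A short Monte Carlo check (my own sketch) that the empirical variance of $\hat{p}$ matches $p(1-p)/N$ and shrinks as N grows, i.e., that the estimator is consistent:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
for N in (10, 100, 1000, 10_000):
    # 5000 independent experiments; each gives one estimate p_hat = mean of N Bernoulli draws.
    p_hat = rng.binomial(1, p, size=(5000, N)).mean(axis=1)
    print(f"N={N:6d}  empirical Var={p_hat.var():.2e}  theory p(1-p)/N={p * (1 - p) / N:.2e}")
```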
• The a priori density is taken as $f_X(x) = 4e^{-\alpha x}$ for $0 \le x \le 1$, and normalization fixes α:

$$\int_0^1 f_X(x)\,dx = 1 \;\Rightarrow\; 4\int_0^1 e^{-\alpha x}\,dx = 1 \;\Rightarrow\; \alpha \approx 3.92$$

• Setting the derivative of the log-posterior to zero:

$$\frac{d}{dx}\Big(M\log x + (N-M)\log(1-x) + \log 4 - \alpha x\Big)\bigg|_{x = x_{\text{MAP}}} = 0$$

$$\frac{M}{x_{\text{MAP}}} - \frac{N-M}{1 - x_{\text{MAP}}} - \alpha = 0 \;\Rightarrow\; \alpha\, x_{\text{MAP}}^2 - (\alpha + N)\,x_{\text{MAP}} + M = 0$$

$$\Rightarrow\; x_{\text{MAP}} = \frac{(3.92 + N) \pm \sqrt{(3.92 + N)^2 - 15.68\,M}}{7.84},\qquad 0 \le x_{\text{MAP}} \le 1$$
Example (2)
• The previous expression gives two roots, and we should always select the one that lies between 0 and 1.
• Let's assume that in our observations we have M/N = 0.5. The MLE would then give p = 0.5. But how will the MAP estimate behave, given the availability of the a priori density function?
• The next figure shows the MAP estimate for M/N = 0.5 for several values of N (a numerical sketch is given below).
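As a stand-in for that figure, the sketch below (mine) evaluates the reconstructed closed-form $x_{\text{MAP}}$ (with α ≈ 3.92, so 2α = 7.84 and 4α = 15.68) for M/N = 0.5 and several values of N:

```python
import numpy as np

ALPHA = 3.92  # normalization constant of the assumed prior f_X(x) = 4*exp(-alpha*x), 0 <= x <= 1

def x_map(N, M, alpha=ALPHA):
    """Root of alpha*x^2 - (alpha + N)*x + M = 0 that lies in [0, 1]."""
    disc = (alpha + N) ** 2 - 4.0 * alpha * M
    roots = (((alpha + N) - np.sqrt(disc)) / (2.0 * alpha),
             ((alpha + N) + np.sqrt(disc)) / (2.0 * alpha))
    return next(r for r in roots if 0.0 <= r <= 1.0)

for N in (2, 10, 50, 200, 1000):
    M = N / 2.0                      # observations with M/N = 0.5
    print(f"N={N:5d}  x_MAP={x_map(N, M):.4f}   (MLE would give 0.5)")
```

The prior favors small x, so for small N the MAP estimate is pulled below 0.5; as N grows, the data dominate and the MAP estimate approaches the MLE.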
$$f_{Y|X}(y_i|x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y_i - x)^2}{2\sigma^2}}$$
Example (3)
• In this example, we assume that we have no a priori knowledge about the parameter x that we are looking to estimate. Therefore, the optimum estimator is the maximum likelihood estimator.
• Assume we have N measurements, $y_1, y_2, \dots, y_N$. Therefore, the likelihood density becomes (due to the independence assumption on the $n_i$):
$$l(y;x) = \log\!\left[f_{Y|X}(y|x)\right] = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{\sum_{i=1}^{N}(y_i - x)^2}{2\sigma^2}$$

$$\Rightarrow\; \frac{d}{dx}\,l(y;x)\bigg|_{x = x_{\text{ML}}} = \frac{\sum_{i=1}^{N}\left(y_i - x_{\text{ML}}\right)}{\sigma^2} = 0 \;\Rightarrow\; x_{\text{ML}} = \frac{\sum_{i=1}^{N} y_i}{N}$$
$$E\left[x_{\text{ML}}\right] = E\!\left[\frac{1}{N}\sum_{i=1}^{N} y_i\right] = \frac{1}{N}\sum_{i=1}^{N} E[y_i] = \frac{1}{N}\sum_{i=1}^{N} x = \frac{Nx}{N} = x \;\Rightarrow\; \text{unbiased}$$
$$\operatorname{Var}\left(x_{\text{ML}}\right) = E\!\left[\left(x_{\text{ML}} - E\left[x_{\text{ML}}\right]\right)^2\right] = E\!\left[\left(\frac{\sum_{i=1}^{N} y_i}{N} - x\right)^{2}\right] = \frac{1}{N^2}\,E\!\left[\left(\sum_{i=1}^{N} y_i - Nx\right)^{2}\right]$$

$$= \frac{1}{N^2}\,E\!\left[\sum_{i=1}^{N} y_i^2 + \sum_{i=1}^{N}\sum_{k\ne i} y_i y_k - 2Nx\sum_{i=1}^{N} y_i + N^2 x^2\right]$$

$$= \frac{1}{N^2}\left(N\left(\sigma^2 + x^2\right) + N(N-1)x^2 - 2N^2 x^2 + N^2 x^2\right) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}$$

$$\Rightarrow\; \text{Consistent (the variance vanishes as } N \to \infty\text{).}$$
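A brief simulation (my own sketch, with assumed values) confirming that the sample mean is unbiased and that its variance behaves as $\sigma^2/N$:

```python
import numpy as np

rng = np.random.default_rng(4)
x_true, sigma = 5.0, 2.0
for N in (5, 50, 500):
    # 20000 repetitions of: collect N noisy measurements y_i = x + n_i, take the sample mean.
    x_ml = rng.normal(x_true, sigma, size=(20_000, N)).mean(axis=1)
    print(f"N={N:4d}  mean of x_ML={x_ml.mean():.4f} (true x = {x_true})"
          f"  Var={x_ml.var():.4f}  sigma^2/N={sigma**2 / N:.4f}")
```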
Example (3)
• Let’s check if the estimator is efficient or not:
$$\operatorname{Var}\left(x_{\text{est}}\right) \;\ge\; \frac{-1}{E\!\left[\dfrac{\partial^2}{\partial x^2}\log f(y;x)\right]}$$

$$f_{Y,X}(y;x) = f_{Y|X}(y|x)\,f_X(x) \;\Rightarrow\; \log\!\left[f_{Y,X}(y;x)\right] = \log\!\left[f_{Y|X}(y|x)\right] + \log\!\left[f_X(x)\right]$$

$$= -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{\sum_{i=1}^{N}(y_i - x)^2}{2\sigma^2} + \text{a constant representing } \log\!\left[f_X(x)\right]$$

$$\Rightarrow\; \frac{\partial}{\partial x}\log\!\left[f_{Y,X}(y;x)\right] = \frac{\sum_{i=1}^{N}(y_i - x)}{\sigma^2} \;\Rightarrow\; \frac{\partial^2}{\partial x^2}\log\!\left[f_{Y,X}(y;x)\right] = -\frac{N}{\sigma^2}$$

(It should be quite easy for you to prove the regularity condition: $E\!\left[\frac{\partial}{\partial x}\log f(y;x)\right] = 0$.)

$$\Rightarrow\; \operatorname{Var}\left(x_{\text{est}}\right) \ge \frac{\sigma^2}{N};\quad\text{hence } x_{\text{ML}} \text{ is an efficient estimator.}$$
• Similarly, the ML estimate of σ is obtained by differentiating the same log-likelihood with respect to σ:

$$l(y;x) = \log\!\left[f_{Y|X}(y|x)\right] = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{\sum_{i=1}^{N}(y_i - x)^2}{2\sigma^2}$$

$$\Rightarrow\; \frac{d}{d\sigma}\,l(y;x)\bigg|_{\sigma=\sigma_{\text{ML}}} = -\frac{N}{\sigma_{\text{ML}}} + \frac{\sum_{i=1}^{N}\left(y_i - x_{\text{ML}}\right)^2}{\sigma_{\text{ML}}^3} = 0 \;\Rightarrow\; \sigma_{\text{ML}}^2 = \frac{\sum_{i=1}^{N}\left(y_i - x_{\text{ML}}\right)^2}{N}$$
• Since

$$E\left[x_{\text{ML}}^2\right] = \frac{1}{N^2}\,E\!\left[\left(\sum_{i=1}^{N} y_i\right)^{2}\right] = \frac{N\left(\sigma^2 + x^2\right) + N(N-1)x^2}{N^2} = \frac{\sigma^2 + Nx^2}{N} = \frac{\sigma^2}{N} + x^2$$

and, similarly, $E\left[y_i\, x_{\text{ML}}\right] = \frac{\sigma^2}{N} + x^2$, we get

$$E\left[\sigma_{\text{ML}}^2\right] = E\!\left[\frac{1}{N}\sum_{i=1}^{N}\left(y_i - x_{\text{ML}}\right)^2\right] = \frac{1}{N}\sum_{i=1}^{N} E\left[y_i^2 - 2 y_i x_{\text{ML}} + x_{\text{ML}}^2\right]$$

$$= \frac{N}{N}\left[\sigma^2 + x^2 - 2\left(\frac{\sigma^2}{N} + x^2\right) + \frac{\sigma^2}{N} + x^2\right] = \frac{N-1}{N}\,\sigma^2 \;\Rightarrow\; \text{Biased}$$
• To have an unbiased estimator, use instead

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N}\left(y_i - x_{\text{ML}}\right)^2}{N-1}$$

• Is the bias a big problem? No, especially for large N! (A quick numerical check follows below.)
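The divide-by-N versus divide-by-(N-1) distinction corresponds to numpy's ddof argument; the quick sketch below (my own, with assumed values) shows the bias numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, N, reps = 3.0, 5, 200_000
y = rng.normal(0.0, sigma, size=(reps, N))   # reps experiments of N measurements each

var_ml = y.var(axis=1, ddof=0)        # divide by N   -> the ML estimate of sigma^2
var_unbiased = y.var(axis=1, ddof=1)  # divide by N-1 -> the bias-corrected estimate

print("true sigma^2          :", sigma**2)
print("mean of ML estimate   :", var_ml.mean(), " ~ (N-1)/N * sigma^2 =", (N - 1) / N * sigma**2)
print("mean of N-1 estimate  :", var_unbiased.mean())
```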
Example (5)
• Continuing the previous example, let's assume that we have a priori knowledge about the parameter to be estimated.
• For example, assume x itself has a Normal distribution with known mean $\mu_x$ and variance $\sigma_x^2$.
• It might be the same problem where x is not fixed but changes in a random manner; however, we may assume that it is fixed during the measurement period. One example is tracking an object that moves in an unpredictable way: we collect data to estimate its updated location.
• Since in this problem we have some extra knowledge, even if with high uncertainty, we should use MAP instead of MLE.
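Purely as a numerical illustration (mine, not from the slides, with assumed values for μx, σx, σ, and the data), the MAP estimate in this setting can be located by maximizing the log-posterior $\log f_{Y|X}(y|x) + \log f_X(x)$ over a grid of x values:

```python
import numpy as np

# Assumed example values (illustration only).
mu_x, sigma_x = 0.0, 1.0            # prior: x ~ N(mu_x, sigma_x^2)
sigma, x_true, N = 2.0, 1.5, 20     # noise std, true parameter, number of measurements

rng = np.random.default_rng(6)
y = rng.normal(x_true, sigma, size=N)       # y_i = x + n_i

# Log-posterior (up to an additive constant): sum of log-likelihoods plus the log-prior.
x_grid = np.linspace(-5.0, 5.0, 20_001)
log_post = (-np.sum((y[:, None] - x_grid[None, :]) ** 2, axis=0) / (2.0 * sigma**2)
            - (x_grid - mu_x) ** 2 / (2.0 * sigma_x**2))
x_map = x_grid[np.argmax(log_post)]

print("MLE (sample mean):", y.mean(), "  MAP (pulled toward the prior mean):", x_map)
```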
[Figure: the prior density $f_X(x)$, the noise density $f_n(n)$, which is uniform on $[-\tfrac{1}{2}, \tfrac{1}{2}]$, and the resulting conditional density $f_{y_i|X}(y_i|x)$, which is uniform in x on $[y_i - \tfrac{1}{2},\, y_i + \tfrac{1}{2}]$.]

$$f_{y_i|X}(y_i|x) = f_n(y_i - x) = \begin{cases} 1, & y_i - \tfrac{1}{2} \le x \le y_i + \tfrac{1}{2} \\ 0, & \text{elsewhere} \end{cases}$$
• Therefore, the parameter x can be determined based on the measurement. For example, if we have a single measurement, say $y_i = 1$, then the conditional density will be uniform from 0.5 to 1.5. It is clear that this uniform density does not have a single mode (maximum) value. Hence, we may consider the mean or the median, which are equal, giving $\hat{x} = 1$, i.e., $\hat{x} = y_i$. However, we should keep in mind that $0 \le x \le 2$ as well.
Setting the derivative of the gamma log-likelihood with respect to k to zero leads to the condition

$$\log k_{\text{ML}} - \frac{\Gamma'\!\left(k_{\text{ML}}\right)}{\Gamma\!\left(k_{\text{ML}}\right)} = \log\!\left(\frac{1}{N}\sum_{i=1}^{N} y_i\right) - \frac{1}{N}\sum_{i=1}^{N}\log y_i,\qquad\text{where } \Gamma'\!\left(k_{\text{ML}}\right) = \frac{d\Gamma(k)}{dk}\bigg|_{k = k_{\text{ML}}}$$
Example (10)
• It is clear from the previous result that, to estimate the parameters of the gamma distribution, we need to solve a non-linear equation. Many efficient numerical methods can be used to solve it.
• If we have the following data set, y = [2, 3, 7, 9, 3, 5], estimate the parameters k and λ (a numerical sketch follows below).
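A minimal numerical sketch for this exercise (my own, using scipy's digamma for Γ'/Γ and a bracketing root finder). The companion equation λ = k/ȳ is the standard ML update for the rate parameter of the gamma distribution; it is stated here as an assumption, since it does not appear in the extracted text above.

```python
import numpy as np
from scipy.special import digamma        # digamma(k) = Gamma'(k) / Gamma(k)
from scipy.optimize import brentq

y = np.array([2.0, 3.0, 7.0, 9.0, 3.0, 5.0])
rhs = np.log(y.mean()) - np.mean(np.log(y))   # right-hand side of the ML condition

# Solve log(k) - digamma(k) = rhs for k (the non-linear equation from the previous slide).
k_ml = brentq(lambda k: np.log(k) - digamma(k) - rhs, 1e-3, 1e3)

# Assumed companion ML equation for the rate parameter: lambda = k / mean(y).
lam_ml = k_ml / y.mean()
print("k_ML =", k_ml, "  lambda_ML =", lam_ml)
```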