Maximum Likelihood Estimation for Hidden Markov Models (HMMs)
1. Overview of HMMs
Definition: HMMs are probabilistic models for sequential data with hidden states.
o States: Hidden variables {q1, ..., qT}, where each qt ∈ {1, ..., N}.
o Observations: Observed variables {o1, ..., oT}, where each ot ∈ {1, ..., M}.
o Parameters:
 Transition matrix A: aij = P(qt = j ∣ qt−1 = i).
 Emission matrix B: bj(k) = P(ot = k ∣ qt = j).
 Initial state distribution π: πi = P(q1 = i).
2. Maximum Likelihood Estimation (MLE) for HMMs
Objective: Estimate parameters θ = (A, B, π) that maximize the likelihood of the observed data O = {o1, ..., oT}.
Likelihood Function:
L(θ) = P(O ∣ θ) = ∑Q P(O, Q ∣ θ)
where Q = {q1, ..., qT} is a hidden state sequence.
Challenge: Direct summation over all possible Q is intractable: there are N^T possible state sequences, so the naive sum costs O(T · N^T). The forward recursion below reduces this to O(N² · T).
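A minimal sketch of the forward recursion that computes this likelihood efficiently (the parameter values are assumed for illustration, not given in the notes):

```python
import numpy as np

def forward_likelihood(O, A, B, pi):
    """Compute P(O | theta) with the forward recursion in O(N^2 * T),
    avoiding the O(N^T) sum over all hidden state paths."""
    T, N = len(O), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                    # alpha_1(i) = pi_i * b_i(o_1)
    for t in range(1, T):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha[-1].sum()                        # P(O | theta) = sum_i alpha_T(i)

# Illustrative (assumed) parameters:
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward_likelihood([0, 1, 2], A, B, pi))
```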
3. Expectation-Maximization (EM) Algorithm for HMMs
The Baum-Welch algorithm (a special case of EM) iteratively updates θ to maximize L(θ).
Key Steps:
1. Forward-Backward Procedure: Compute:
o Forward probability αt(i) = P(o1, ..., ot, qt = i ∣ θ).
o Backward probability βt(i) = P(ot+1, ..., oT ∣ qt = i, θ).
o Posterior probabilities:
 γt(i) = P(qt = i ∣ O, θ) ∝ αt(i) βt(i).
 ξt(i,j) = P(qt = i, qt+1 = j ∣ O, θ) ∝ αt(i) aij bj(ot+1) βt+1(j).
2. E-Step: Compute γt(i) and ξt(i,j) using the current θ.
3. M-Step: Update parameters:
o πi_new = γ1(i).
o aij_new = ∑t=1..T−1 ξt(i,j) / ∑t=1..T−1 γt(i).
o bj(k)_new = ∑t=1..T γt(j) · I(ot = k) / ∑t=1..T γt(j).
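The steps above can be sketched as one E/M iteration in NumPy, assuming discrete observations and the illustrative parameters below (not from the notes):

```python
import numpy as np

def baum_welch_step(O, A, B, pi):
    """One EM iteration: forward-backward (E-step), then the
    re-estimation formulas above (M-step). A sketch, not optimized."""
    T, N = len(O), A.shape[0]
    # E-step: forward and backward probabilities.
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood                 # gamma_t(i)
    xi = np.zeros((T - 1, N, N))                      # xi_t(i, j)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :]
    xi /= likelihood
    # M-step: re-estimate parameters from the expected counts.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(O) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new, likelihood

# Illustrative (assumed) parameters and observation sequence:
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
A1, B1, pi1, L0 = baum_welch_step([0, 1, 2], A, B, pi)
```

Each iteration returns the likelihood of the data under the *incoming* parameters, which EM guarantees to be non-decreasing across iterations.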
4. Example: Weather Prediction HMM
Scenario:
Hidden states: {Sunny, Rainy}.
Observations: {Walk, Shop, Clean}.
Given Data: O = {Walk, Shop, Clean}.
Goal: Estimate A, B, π using Baum-Welch.
Steps:
1. Initialize parameters θ(0).
2. Iterate until convergence:
o Compute γt(i), ξt(i,j).
o Update π, A, B.
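A self-contained sketch of this loop on the weather HMM, with random initialization of θ(0) and a convergence check on the log-likelihood (the seed and tolerance are assumptions for illustration):

```python
import numpy as np

STATES = ["Sunny", "Rainy"]                 # hidden states
OBS = {"Walk": 0, "Shop": 1, "Clean": 2}    # observation symbols
O = [OBS["Walk"], OBS["Shop"], OBS["Clean"]]

rng = np.random.default_rng(0)

def normalize(M):
    return M / M.sum(axis=-1, keepdims=True)

# Step 1: random initialization of theta^(0).
A = normalize(rng.random((2, 2)))
B = normalize(rng.random((2, 3)))
pi = normalize(rng.random(2))

def em_step(O, A, B, pi):
    """Forward-backward E-step plus M-step re-estimation (one iteration)."""
    T, N = len(O), A.shape[0]
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    L = alpha[-1].sum()
    gamma = alpha * beta / L
    xi = np.array([alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1]) / L
                   for t in range(T - 1)])
    pi_n = gamma[0]
    A_n = xi.sum(0) / gamma[:-1].sum(0)[:, None]
    B_n = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_n[:, k] = gamma[np.array(O) == k].sum(0)
    B_n = B_n / gamma.sum(0)[:, None]
    return A_n, B_n, pi_n, np.log(L)

# Step 2: iterate until the log-likelihood stops improving.
prev = -np.inf
for _ in range(50):
    A, B, pi, ll = em_step(O, A, B, pi)
    if ll - prev < 1e-6:
        break
    prev = ll
```

Note that a single length-3 sequence is far too little data to learn meaningful weather dynamics; in practice Baum-Welch is run over many (or much longer) sequences.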
5. Practical Considerations
Initialization: Use domain knowledge or random initialization.
Local Maxima: Baum-Welch converges to local optima; multiple restarts may be needed.
Scaling: Use log probabilities or scaling factors to avoid underflow.
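The log-probability approach can be sketched as a forward recursion done entirely in log space, so no per-step probability ever underflows (the parameters below are assumed for illustration):

```python
import numpy as np

def log_forward(O, A, B, pi):
    """Forward recursion in log space: only log values are stored, and
    sums of probabilities are computed with logaddexp (log-sum-exp)."""
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    log_alpha = logpi + logB[:, O[0]]
    for t in range(1, len(O)):
        # log alpha_t(j) = logsumexp_i(log alpha_{t-1}(i) + log a_ij) + log b_j(o_t)
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + logA, axis=0) + logB[:, O[t]]
    return np.logaddexp.reduce(log_alpha)     # log P(O | theta)

# Illustrative (assumed) parameters:
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
log_p = log_forward([0, 1, 2], A, B, pi)
```

For very long sequences this returns a finite log-likelihood where the plain product of probabilities would underflow to zero.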
6. Extensions & Limitations
Regularization: Add priors (e.g., Dirichlet) for small datasets.
Variants: Bayesian HMMs (using MCMC or variational inference).
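The Dirichlet regularization mentioned above amounts to adding pseudocounts to the expected counts before normalizing in the M-step; a minimal sketch, where the hyperparameter `alpha` and the example counts are assumptions, not values from the notes:

```python
import numpy as np

def map_normalize(counts, alpha=1.1):
    """MAP re-estimation under a symmetric Dirichlet(alpha) prior:
    add (alpha - 1) pseudocounts to each cell before normalizing,
    so no row collapses to zero probability on small datasets."""
    smoothed = counts + (alpha - 1.0)
    return smoothed / smoothed.sum(axis=-1, keepdims=True)

# e.g. expected transition counts sum_t xi_t(i, j) with a zero entry:
xi_counts = np.array([[3.0, 0.0],
                      [1.0, 2.0]])
A_map = map_normalize(xi_counts)
```

With `alpha > 1` every estimated probability stays strictly positive, which keeps later forward-backward passes from assigning zero likelihood to unseen transitions.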