Got it — let’s go deep into the Speech Recognition Architecture so you have a detailed,
structured view from raw speech input to final text output, especially in the context of
HMM-based systems.
Speech Recognition Architecture — Detailed View
A typical Automatic Speech Recognition (ASR) system consists of six major modules
working together:
1. Acoustic Front-End (Feature Extraction)
This is the signal processing stage, which converts the raw speech waveform into a sequence
of feature vectors that capture relevant information for recognition.
Steps (a minimal code sketch of this pipeline follows the list):
1. Pre-Emphasis
   - Purpose: Amplify high-frequency components, which are weaker in speech signals.
   - Equation: y[n] = x[n] − α·x[n−1], with α ≈ 0.95
2. Framing
   - Speech is non-stationary, but over short durations (~20–25 ms) it can be treated as stationary.
   - The waveform is divided into overlapping frames (e.g., 25 ms frame length, 10 ms shift).
3. Windowing
   - Apply a Hamming window to reduce spectral discontinuities at frame edges.
4. FFT (Fast Fourier Transform)
   - Converts each frame from the time domain into the frequency domain.
5. Mel Filterbank
   - Passes the spectrum through a bank of filters spaced according to the Mel scale to mimic human auditory perception.
6. Log Compression
   - Converts the filterbank energy values to a logarithmic scale.
7. DCT (Discrete Cosine Transform)
   - Produces MFCCs (Mel-Frequency Cepstral Coefficients), which are compact representations of speech.
8. Feature Normalization
   - CMVN (Cepstral Mean and Variance Normalization) to handle channel and environmental variations.
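To make these steps concrete, here is a minimal NumPy/SciPy sketch of the front-end described above (pre-emphasis through CMVN). It is an illustration, not a production feature extractor; the frame length, hop size, filter count, and α are common but illustrative defaults.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=0.025, frame_shift=0.010,
         n_fft=512, n_mels=26, n_ceps=13, alpha=0.95):
    """Compute MFCC features following steps 1-8 above (illustrative defaults)."""
    signal = np.asarray(signal, dtype=np.float64)

    # 1. Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # 2-3. Framing (25 ms frames, 10 ms shift) + Hamming window
    flen, fshift = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + max(0, (len(emphasized) - flen) // fshift)
    frames = np.stack([emphasized[i * fshift:i * fshift + flen]
                       for i in range(n_frames)])
    frames *= np.hamming(flen)

    # 4. FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 5. Mel filterbank: triangular filters spaced evenly on the Mel scale
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    energies = power @ fbank.T

    # 6. Log compression (small floor avoids log(0))
    log_energies = np.log(np.maximum(energies, 1e-10))

    # 7. DCT -> keep the first n_ceps coefficients (the MFCCs)
    ceps = dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

    # 8. CMVN: per-utterance cepstral mean and variance normalization
    return (ceps - ceps.mean(axis=0)) / (ceps.std(axis=0) + 1e-10)
```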
2. Acoustic Model
The acoustic model maps the extracted features to basic sound units (phonemes or sub-phonetic states).
Why HMMs?
- Speech is a time-varying sequence; HMMs model both temporal progression and statistical variation.
- Each HMM state corresponds to a segment of a phoneme.
Emission Probability Models:
- Traditionally GMMs (Gaussian Mixture Models).
- Modern systems use DNNs (Deep Neural Networks), CNNs, or RNNs to output state likelihoods.
Training:
- Uses the Baum-Welch algorithm (a form of the EM algorithm) to estimate model parameters.
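As a small illustration of GMM-HMM acoustic modeling, the sketch below trains one model per sound unit using the hmmlearn library (an assumption; classic toolkits such as HTK or Kaldi would normally be used). The three-state, four-mixture configuration is an illustrative default, and `train_phoneme_model` / `classify` are hypothetical helper names.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # assumed available: pip install hmmlearn

# Suppose `features` is a list of MFCC matrices, one per training utterance of
# the same unit (each of shape [n_frames, 13]), e.g. produced by mfcc() above.
def train_phoneme_model(features, n_states=3, n_mix=4):
    X = np.vstack(features)                    # stack all frames
    lengths = [f.shape[0] for f in features]   # per-utterance frame counts
    model = GMMHMM(n_components=n_states,      # HMM states per unit
                   n_mix=n_mix,                # Gaussians per state
                   covariance_type='diag',
                   n_iter=20)                  # Baum-Welch (EM) iterations
    model.fit(X, lengths)
    return model

# Isolated-unit recognition: pick the model with the highest log-likelihood
# log P(O | W) for the observed feature sequence O.
def classify(obs, models):
    return max(models, key=lambda name: models[name].score(obs))
```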
3. Pronunciation Lexicon (Dictionary)
This is the bridge between phonemes and words.
Contains:
- Word → phoneme mappings, for example:
  HELLO → HH AH0 L OW1
  CAT → K AE1 T
- Stress and tone markers for some languages.
Importance:
- Ensures the system knows how a word is pronounced.
- Can handle multiple pronunciations of the same word.
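In code, a lexicon is often just a dictionary from words to one or more phoneme sequences. The entries below are illustrative, written in CMUdict-style ARPAbet notation.

```python
# Minimal lexicon sketch: each word maps to one or more phoneme sequences.
LEXICON = {
    "HELLO":  [["HH", "AH0", "L", "OW1"]],
    "CAT":    [["K", "AE1", "T"]],
    # A word with multiple accepted pronunciations:
    "EITHER": [["IY1", "DH", "ER0"], ["AY1", "DH", "ER0"]],
}

def pronunciations(word):
    """Return all phoneme sequences for a word (empty list if out-of-vocabulary)."""
    return LEXICON.get(word.upper(), [])

print(pronunciations("hello"))  # [['HH', 'AH0', 'L', 'OW1']]
```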
4. Language Model
The language model predicts the most likely sequence of words.
Purpose: Resolve ambiguities in acoustic decoding.
Example:
- The acoustic model might confuse "recognize speech" with "wreck a nice beach".
- The language model makes "recognize speech" more probable.
Types:
- N-gram models (unigram, bigram, trigram).
- Neural LMs (RNN- or Transformer-based).
Training Data:
- Large text corpora (news, transcriptions, books).
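A toy bigram model makes the idea concrete; the tiny training text and the add-one smoothing below are purely illustrative choices.

```python
import math
from collections import Counter

# Toy bigram language model with add-one (Laplace) smoothing.
# The training text is purely illustrative.
corpus = ("we recognize speech . they recognize speech well . "
          "it is hard to wreck a nice beach .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus[:-1], corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_bigram(prev, word):
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def sequence_logprob(words):
    """log P(W) approximated as a product of bigram probabilities."""
    return sum(math.log(p_bigram(a, b)) for a, b in zip(words[:-1], words[1:]))

# The LM assigns a higher score to the word sequence it has evidence for,
# which is how "recognize speech" wins over "wreck a nice beach" in decoding.
print(sequence_logprob("recognize speech".split()))
print(sequence_logprob("wreck a nice beach".split()))
```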
5. Decoder
The decoder combines:
- Acoustic model scores.
- Pronunciation lexicon.
- Language model probabilities.
Goal: Find the most probable word sequence W* given the observed features O:
    W* = argmax_W P(W|O) = argmax_W P(O|W) · P(W)
- P(O|W): Acoustic model likelihood.
- P(W): Language model probability.
Methods:
- Viterbi Search: Finds the best path through the HMM state network (a minimal sketch follows this list).
- Beam Search: Prunes unlikely paths to improve efficiency.
- WFST Decoding: Uses Weighted Finite-State Transducers to unify the AM, LM, and lexicon into a single search graph.
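For intuition, here is a minimal log-domain Viterbi search over a single HMM, assuming log transition, emission, and initial-state scores are already available (e.g., from the acoustic model above).

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Most likely state path through one HMM (log domain).

    log_trans: [S, S] log P(state_j | state_i)
    log_emit:  [T, S] log P(o_t | state_j)  (from a GMM or DNN acoustic model)
    log_init:  [S]    log P(state at t=0)
    Returns (best_path, best_log_score).
    """
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)    # best score ending in state s at time t
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery

    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # [S, S]: previous -> current
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_emit[t]

    # Trace back the best path from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))
```

In a full decoder, the same dynamic program runs over a graph that composes the HMM states with the lexicon and language model (often as a WFST), and beam search simply discards partial paths whose score falls too far below the current best.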
6. Post-Processing
- Confidence Scoring: Assigns reliability scores to recognized words.
- Punctuation & Capitalization: Restores formatting.
- Error Correction: Uses additional grammar rules or statistical post-filters.
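A trivial post-processing pass might look like the sketch below; the 0.5 confidence threshold and the word/confidence pairs are made-up values for illustration.

```python
# Hypothetical decoder output: (word, confidence) pairs.
hypothesis = [("recognize", 0.93), ("speech", 0.88), ("uh", 0.21)]

def post_process(words_with_conf, threshold=0.5):
    """Drop low-confidence words, then restore basic formatting."""
    kept = [w for w, conf in words_with_conf if conf >= threshold]
    if not kept:
        return ""
    text = " ".join(kept)
    return text[0].upper() + text[1:] + "."  # capitalization + punctuation

print(post_process(hypothesis))  # "Recognize speech."
```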
End-to-End View Diagram (Text Form)
Speech Input → Acoustic Front-End → Acoustic Model (HMM/DNN)
    → Decoder (combining Pronunciation Lexicon + Language Model)
    → Post-Processing → Recognized Text Output
If you want, I can make a full graphical diagram of this architecture so that each block and
data flow is visually clear. That would make it easier to remember and present.