
Got it — let’s go deep into the Speech Recognition Architecture so you have a detailed, structured view from raw speech input to final text output, especially in the context of HMM-based systems.

Speech Recognition Architecture — Detailed View


A typical Automatic Speech Recognition (ASR) system consists of five major modules working together, followed by a post-processing stage:

1. Acoustic Front-End (Feature Extraction)

This is the signal processing stage, which converts the raw speech waveform into a sequence
of feature vectors that capture relevant information for recognition.

Steps (a code sketch of the full pipeline follows this list):

1. Pre-Emphasis
o Purpose: Amplify high-frequency components, which are weaker in speech
signals.
o Equation:

$y[n] = x[n] - \alpha x[n-1], \quad \alpha \approx 0.95$

2. Framing
o Speech is non-stationary, but over short durations (~20–25 ms) it can be
treated as stationary.
o The waveform is divided into overlapping frames (e.g., 25 ms frame length, 10
ms shift).
3. Windowing
o Apply a Hamming window to reduce spectral discontinuities at frame edges.
4. FFT (Fast Fourier Transform)
o Converts the time-domain signal into frequency domain.
5. Mel Filterbank
o Passes the spectrum through a bank of filters spaced according to the Mel
scale to mimic human auditory perception.
6. Log Compression
o Converts energy values to a logarithmic scale.
7. DCT (Discrete Cosine Transform)
o Produces MFCCs (Mel Frequency Cepstral Coefficients), which are compact
representations of speech.
8. Feature Normalization
o CMVN (Cepstral Mean and Variance Normalization) to handle channel and
environmental variations.
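
To make the eight steps concrete, here is a minimal NumPy/SciPy sketch of the front-end. The function and parameter names are my own, and the 512-point FFT, 26 mel filters, and 13 cepstral coefficients are common defaults rather than values stated above:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_ms=25, shift_ms=10,
         n_fft=512, n_mels=26, n_ceps=13, alpha=0.95):
    # 1. Pre-emphasis: boost the weaker high-frequency components
    x = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # 2. Framing: 25 ms frames with a 10 ms shift (overlapping)
    flen, fshift = sr * frame_ms // 1000, sr * shift_ms // 1000
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = x[idx]

    # 3. Windowing: Hamming window to reduce edge discontinuities
    frames *= np.hamming(flen)

    # 4. FFT: power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 5. Mel filterbank: triangular filters at mel-spaced center frequencies
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # 6. Log compression of the filterbank energies
    log_mel = np.log(np.maximum(power @ fbank.T, 1e-10))

    # 7. DCT: keep the first n_ceps coefficients -> MFCCs
    ceps = dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]

    # 8. CMVN: per-utterance mean/variance normalization
    return (ceps - ceps.mean(axis=0)) / (ceps.std(axis=0) + 1e-10)
```

Each row of the returned matrix is one feature vector (one frame); this is the sequence the acoustic model consumes.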
2. Acoustic Model

The acoustic model maps the extracted features to basic sound units (phonemes or sub-
phonetic states).

 Why HMMs?
o Speech is a time-varying sequence; HMMs model both temporal progression
and statistical variation.
o Each HMM state corresponds to a segment of a phoneme.
 Emission Probability Models:
o Traditionally GMMs (Gaussian Mixture Models).
o Modern systems use DNNs (Deep Neural Networks), CNNs, or RNNs to
output state likelihoods.
 Training:
o Uses Baum-Welch algorithm (a form of EM algorithm) to estimate model
parameters.
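
As a toy illustration of GMM emission scoring (the parameters here are random placeholders; in a real system they come out of Baum-Welch training):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 3-state phoneme HMM, each state a 2-component
# diagonal-covariance GMM over 13-dimensional MFCC vectors.
rng = np.random.default_rng(0)
n_states, n_mix, dim = 3, 2, 13
weights = np.full((n_states, n_mix), 0.5)        # mixture weights per state
means = rng.normal(size=(n_states, n_mix, dim))  # component means
variances = np.ones((n_states, n_mix, dim))      # diagonal covariances

def emission_loglik(frame):
    """log b_j(frame) per state j: log sum_m w_jm N(frame; mu_jm, var_jm)."""
    logs = np.array([[np.log(weights[j, m]) +
                      multivariate_normal.logpdf(frame, means[j, m],
                                                 np.diag(variances[j, m]))
                      for m in range(n_mix)] for j in range(n_states)])
    # log-sum-exp over mixture components, done per state
    mx = logs.max(axis=1, keepdims=True)
    return (mx + np.log(np.exp(logs - mx).sum(axis=1, keepdims=True))).ravel()

print(emission_loglik(rng.normal(size=dim)))  # one log-likelihood per state
```

A DNN-based acoustic model replaces this function with a network that outputs (scaled) state posteriors, but the HMM time structure around it stays the same.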

3. Pronunciation Lexicon (Dictionary)

This is the bridge between phonemes and words.

 Contains:
o Word → Phoneme mappings.
o Example:
o HELLO → HH AH0 L OW1
o CAT → K AE1 T
o Stress and tone markers for some languages.
 Importance:
o Ensures the system knows how a word is pronounced.
o Can handle multiple pronunciations of the same word.
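
A lexicon is essentially a word-to-pronunciations map; here is a toy version in Python, using ARPAbet symbols as in the examples above (the second HELLO variant, as found in the CMU Pronouncing Dictionary, illustrates multiple pronunciations):

```python
# Toy lexicon: each word maps to one or more phoneme sequences.
lexicon = {
    "HELLO": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
    "CAT":   [["K", "AE1", "T"]],
}

def pronunciations(word):
    """Return every known phoneme sequence for a word (empty if OOV)."""
    return lexicon.get(word.upper(), [])

print(pronunciations("hello"))  # two variants -> multiple decoding paths
print(pronunciations("dog"))    # [] -> out-of-vocabulary word
```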

4. Language Model

The language model predicts the most likely sequence of words.

 Purpose: Resolve ambiguities in acoustic decoding.
 Example:
o The acoustic model might confuse "recognize speech" with "wreck a nice beach".
o The language model makes "recognize speech" more probable.
 Types:
o N-gram models (unigram, bigram, trigram).
o Neural LMs (RNN, Transformer-based).
 Training Data:
o Large text corpora (news, transcriptions, books).
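
A toy add-one-smoothed bigram model shows the idea; the corpus and the printed scores are illustrative only:

```python
import math
from collections import Counter

# Train a toy bigram LM with add-one (Laplace) smoothing on a tiny corpus.
corpus = [["<s>", "recognize", "speech", "</s>"],
          ["<s>", "recognize", "speech", "easily", "</s>"],
          ["<s>", "wreck", "a", "nice", "beach", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter()
for sent in corpus:
    bigrams.update(zip(sent[:-1], sent[1:]))
V = len(unigrams)   # vocabulary size, used for smoothing

def logprob(sentence):
    """log P(sentence) under the smoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
               for a, b in zip(words[:-1], words[1:]))

print(logprob("recognize speech"))    # about -4.4: frequent in training
print(logprob("wreck a nice beach"))  # about -8.2: rarer, less probable
```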
5. Decoder

The decoder combines:

 Acoustic model scores.
 Pronunciation lexicon.
 Language model probabilities.

Goal: Find the most probable word sequence $W^*$ given observed features $O$:

$W^* = \arg\max_W P(W \mid O) = \arg\max_W P(O \mid W) \cdot P(W)$

 $P(O \mid W)$: Acoustic model likelihood.
 $P(W)$: Language model probability.

Methods:

 Viterbi Search: Finds the best path through the HMM state network.
 Beam Search: Prunes unlikely paths to improve efficiency.
 WFST Decoding: Uses Weighted Finite-State Transducers to unify AM, LM, and
lexicon into a single search graph.
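
Here is a minimal Viterbi recursion over a generic HMM state network; beam search and WFST decoding refine this same dynamic program. The array shapes and the uniform initial distribution are assumptions for illustration:

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Most likely state path through an HMM.

    log_trans: (S, S) log transition probabilities between states.
    log_emit:  (T, S) log emission score of each state for each frame
               (e.g. from the acoustic model sketch above).
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)   # best path score ending in each state
    back = np.zeros((T, S), dtype=int) # best predecessor for traceback
    score[0] = log_emit[0]             # assume uniform initial distribution
    for t in range(1, T):
        # cand[i, j] = score of reaching state j at time t via state i
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    # Trace back the best path from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Beam search would simply drop states whose score falls more than a fixed margin below the frame's best score before the next step.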

6. Post-Processing

 Confidence Scoring: Assigns reliability scores to recognized words.
 Punctuation & Capitalization: Restores formatting.
 Error Correction: Uses additional grammar rules or statistical post-filters.
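
As a small example of confidence scoring in use (the words, scores, and threshold are made up):

```python
# Hypothetical decoder output: (word, confidence) pairs.
hypothesis = [("recognize", 0.91), ("speech", 0.88), ("beach", 0.42)]

THRESHOLD = 0.5   # words below this are flagged for review or correction
text = " ".join(w if c >= THRESHOLD else f"[{w}?]" for w, c in hypothesis)
print(text)       # recognize speech [beach?]
```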

End-to-End View Diagram (Text Form)


Speech Input → Acoustic Front-End → Acoustic Model (HMM/DNN)
→ Pronunciation Lexicon → Language Model
→ Decoder → Recognized Text Output

If you want, I can make a full graphical diagram of this architecture so that each block and
data flow is visually clear. That would make it easier to remember and present.
