Module 5
Speech Recognition and Natural Language Processing (NLP)
Speech Recognition
Speech recognition, also known as Automatic Speech Recognition (ASR),
involves converting spoken language into a sequence of words. It is a
complex process that maps acoustic signals into meaningful text
representations.
How Speech Recognition Works:
1. Input Representation:
o The audio signal is divided into frames, often around 20 ms
each, to create input vectors.
2. Feature Extraction:
o Traditional systems use hand-designed features such as MFCCs, while
deep learning systems can learn features directly from raw input (a
feature-extraction sketch follows this list).
3. Modelling and Alignment:
o Early systems used Hidden Markov Models (HMMs) combined
with Gaussian Mixture Models (GMMs) to model phonemes and
their sequences.
o Modern systems incorporate deep learning approaches like
LSTMs and convolutional networks to improve recognition
accuracy.
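To make steps 1 and 2 concrete, here is a minimal sketch of frame-based MFCC
feature extraction. It assumes the librosa library; the 16 kHz sample rate, the
synthetic test tone, and the 20 ms frame / 10 ms hop sizes are illustrative
choices, not values prescribed by these notes.

# Sketch: framing an audio signal and extracting MFCC feature vectors (steps 1-2).
# Assumes numpy and librosa are installed; the signal here is synthetic, not real speech.
import numpy as np
import librosa

sr = 16000                                  # assumed 16 kHz sample rate
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)       # one second of a pure tone standing in for speech

# ~20 ms frames (n_fft=320 samples) with a 10 ms hop: one feature vector per frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=320, hop_length=160)
print(mfcc.shape)                           # (13, number_of_frames)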
Deep Learning in Speech Recognition:
Deep feedforward networks and Restricted Boltzmann Machines
(RBMs) were early neural techniques.
Advanced models, including recurrent networks like LSTMs and
attention-based systems, help align acoustic signals with linguistic
sequences.
Applications:
Virtual assistants (e.g., Alexa, Siri)
Real-time transcription services
Voice command systems.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is a field of artificial intelligence
focused on enabling machines to understand and respond to human
language. It bridges the gap between human communication and machine
understanding by transforming unstructured language data into a
structured format that computers can process.
Applications:
Machine translation (e.g., Google Translate)
Sentiment analysis (e.g., product reviews)
Chatbots and virtual assistants
Text summarization and more
How NLP Works:
1. Preprocessing:
o Tokenization: Splitting text into sentences or words.
o Cleaning: Removing noise like punctuation and stop words.
2. Language Modelling:
o Early models like n-grams focused on short sequences.
o Neural language models replaced these with distributed
representations and embeddings for efficiency.
3. Modern Advances:
o RNNs and LSTMs handle sequential data by preserving context
over time.
o Attention mechanisms and Transformers, such as BERT and
GPT, allow parallel processing of sequences, improving tasks
like translation and summarization (see the sketch below).
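As a sketch of how such Transformer-based models are used in practice, the
snippet below calls the Hugging Face transformers library's pipeline API for
summarization and sentiment analysis; the library choice and the default
pretrained models it downloads on first use are assumptions, not part of these
notes.

# Sketch: applying pretrained Transformer models to two common NLP tasks.
# Assumes the "transformers" package is installed; each pipeline downloads a
# default pretrained model the first time it is created.
from transformers import pipeline

summarizer = pipeline("summarization")
classifier = pipeline("sentiment-analysis")

text = ("Natural Language Processing bridges the gap between human communication "
        "and machine understanding by turning unstructured text into structured "
        "data that computers can process.")

print(summarizer(text)[0]["summary_text"])
print(classifier("The translation quality has improved dramatically."))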
Steps Involved in NLP
1. Tokenization
The first step in NLP involves breaking down text into smaller units called
tokens, which can be words, characters, or subwords. This segmentation
is essential for further processing.
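A minimal, standard-library-only sketch of sentence- and word-level
tokenization; the regular expressions are deliberate simplifications, and
production systems usually rely on trained tokenizers (e.g., NLTK or subword
tokenizers such as BPE).

# Sketch: simple sentence and word tokenization using only the standard library.
import re

text = "NLP turns raw text into tokens. Tokens can be words, characters, or subwords!"

sentences = re.split(r"(?<=[.!?])\s+", text)   # split on sentence-ending punctuation
words = re.findall(r"[A-Za-z']+", text)        # keep alphabetic word tokens only

print(sentences)   # ['NLP turns raw text into tokens.', 'Tokens can be words, characters, or subwords!']
print(words)       # ['NLP', 'turns', 'raw', 'text', 'into', 'tokens', ...]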
2. Text Cleaning and Preprocessing
Removing unnecessary characters, punctuation, and stop words.
Lowercasing text for uniformity.
Stemming or lemmatization to reduce words to their root forms.
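A minimal sketch of these cleaning steps, assuming NLTK's PorterStemmer; the
tiny stop-word set is a toy stand-in for the much longer lists used in practice.

# Sketch: lowercasing, punctuation removal, stop-word removal, and stemming.
import string
from nltk.stem import PorterStemmer

stop_words = {"the", "is", "a", "an", "and", "of", "to"}   # toy stop-word list
stemmer = PorterStemmer()

text = "The movies were running, and the reviews of the films were glowing!"

tokens = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
cleaned = [stemmer.stem(tok) for tok in tokens if tok not in stop_words]
print(cleaned)   # ['movi', 'were', 'run', 'review', 'film', 'were', 'glow']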
3. Feature Representation
Converting tokens into numerical representations that machine learning
models can process. Common methods include:
Bag of Words (BoW): A sparse representation based on word
frequency.
TF-IDF: Weights words based on importance.
Word Embeddings: Dense vector representations capturing
semantic relationships (e.g., Word2Vec, GloVe).
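A minimal scikit-learn sketch of the first two representations on three made-up
documents; dense embeddings such as Word2Vec or GloVe come from separate
libraries (e.g., gensim) and are omitted here.

# Sketch: Bag-of-Words counts and TF-IDF weights with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the product works great",
        "the product stopped working",
        "great value and it works well"]

bow = CountVectorizer()        # sparse word-frequency counts (Bag of Words)
tfidf = TfidfVectorizer()      # counts re-weighted by how distinctive each word is

X_bow = bow.fit_transform(docs)
X_tfidf = tfidf.fit_transform(docs)

print(bow.get_feature_names_out())     # learned vocabulary
print(X_bow.toarray())                 # one row of counts per document
print(X_tfidf.toarray().round(2))      # one row of TF-IDF weights per document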
4. Language Modelling
Developing probabilistic models to predict sequences of words. Early
methods used n-grams, while modern systems employ neural language
models, such as those based on Recurrent Neural Networks (RNNs) or
Transformers.
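To make the n-gram idea concrete, here is a minimal bigram model estimated by
counting, P(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1}); the toy
corpus and the unsmoothed maximum-likelihood estimate are simplifications of
what real systems do.

# Sketch: a bigram language model built from raw counts.
from collections import Counter

corpus = "the cat sat on the mat the cat slept on the sofa".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    # Maximum-likelihood estimate; real systems add smoothing for unseen pairs.
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(bigram_prob("the", "cat"))   # 2 of the 4 occurrences of "the" are followed by "cat" -> 0.5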
5. Model Training and Optimization
Building predictive models for specific tasks like classification or
translation using supervised, unsupervised, or reinforcement learning
techniques.
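A minimal scikit-learn sketch of supervised training for a text-classification
task; the four labelled reviews and the TF-IDF plus logistic-regression
pipeline are illustrative choices, not a method prescribed by these notes.

# Sketch: training a sentiment classifier (TF-IDF features + logistic regression).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, works well", "terrible, broke in a day",
         "excellent value", "awful quality, do not buy"]
labels = [1, 0, 1, 0]            # 1 = positive review, 0 = negative review

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["works really well", "broke after a day"]))   # likely [1 0] on this toy data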
6. Evaluation and Improvement
Using metrics such as accuracy, precision, recall, F1 score, or BLEU score
(for translation) to assess performance and refine the model.
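A short sketch of these classification metrics on made-up labels; BLEU for
translation would come from a separate toolkit (e.g., NLTK) and is omitted
here.

# Sketch: computing accuracy, precision, recall, and F1 with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]    # reference labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]    # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))    # 6/8 = 0.75
print("precision:", precision_score(y_true, y_pred))   # 3/4 = 0.75
print("recall   :", recall_score(y_true, y_pred))      # 3/4 = 0.75
print("F1 score :", f1_score(y_true, y_pred))          # 0.75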
Long Short-Term Memory (LSTM): Working Principles with
Equations
Long Short-Term Memory (LSTM) networks are a type of gated recurrent
neural network designed to handle long-term dependencies in sequence
data. They address issues like vanishing gradients that traditional RNNs
face during training by incorporating a system of gates to control the flow
of information.
Core Components of LSTM:
1. Cell State (C_t):
A memory element that carries information across time steps,
allowing the network to retain or discard information.
2. Gates:
Three primary gates regulate the flow of information:
o Forget Gate
o Input Gate
o Output Gate
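In one common formulation, these gates and the cell state are computed as
follows (with \sigma the logistic sigmoid, \odot element-wise multiplication,
W and U weight matrices, and b bias vectors; notation varies slightly between
textbooks):

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)            (forget gate)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)            (input gate)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)            (output gate)
\tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C)     (candidate cell state)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t      (cell state update)
h_t = o_t \odot \tanh(C_t)                           (hidden state / output)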
Advantages of LSTM:
Efficient handling of long-term dependencies by controlling when to
forget or retain information.
Adaptable time scales of memory depending on the sequence
context.
Widely used in tasks such as language modelling, speech
recognition, and time-series prediction.
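A minimal NumPy sketch of a single LSTM cell step implementing the gate
equations above; the dimensions, random weights, and dictionary layout of the
parameters are illustrative only, not how any particular library organises
them.

# Sketch: one LSTM cell step in NumPy, following the gate equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell state
    c_t = f * c_prev + i * c_tilde                              # cell state update
    h_t = o * np.tanh(c_t)                                      # hidden state / output
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3
W = {k: rng.standard_normal((n_hidden, n_in)) for k in "fioc"}
U = {k: rng.standard_normal((n_hidden, n_hidden)) for k in "fioc"}
b = {k: np.zeros(n_hidden) for k in "fioc"}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.standard_normal((5, n_in)):    # a 5-step input sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h)                                      # final hidden state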
How Recurrent Neural Networks (RNNs) Process Data Sequences
Recurrent Neural Networks (RNNs) are specialized neural networks
designed for processing sequential data. Unlike feedforward networks,
RNNs maintain a hidden state that captures information about previous
inputs, making them suitable for tasks involving temporal or sequential
patterns.
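Concretely, a vanilla RNN updates its hidden state at every step as
h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h). The NumPy sketch below runs
that recurrence over a short sequence; the dimensions and random weights are
made up for illustration.

# Sketch: a vanilla RNN processing a sequence one time step at a time.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, seq_len = 4, 3, 6

W_xh = rng.standard_normal((n_hidden, n_in)) * 0.1       # input-to-hidden weights
W_hh = rng.standard_normal((n_hidden, n_hidden)) * 0.1   # recurrent (hidden-to-hidden) weights
b_h = np.zeros(n_hidden)

x_seq = rng.standard_normal((seq_len, n_in))   # one input vector per time step
h = np.zeros(n_hidden)                         # initial hidden state

for x_t in x_seq:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # hidden state carries context forward
print(h)                                       # depends on the entire sequence seen so far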
Advantages of RNNs:
They can process variable-length sequences, making them flexible
for tasks like language modelling and speech recognition.
They capture temporal dependencies and maintain memory through
their hidden state.
Challenges:
RNNs face issues like vanishing or exploding gradients during
training, which can make learning long-term dependencies difficult.
Applications:
Natural Language Processing (e.g., language translation, text
generation)
Speech Recognition
Time-Series Forecasting
Video and Gesture Recognition
Bidirectional Recurrent Neural Networks (RNNs)
Bidirectional Recurrent Neural Networks (BRNNs) are an extension of
traditional RNNs designed to process sequential data more effectively by
considering both past and future context during training and prediction.
Traditional RNNs process sequences in a causal structure, where the state
at time t depends only on the past inputs x(1), x(2), ..., x(t−1) and the
present input x(t). However, many tasks, such as speech and handwriting
recognition, require understanding dependencies in both directions.
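A minimal NumPy sketch of this idea: the same sequence is scanned once forward
and once backward by two independent RNNs, and their hidden states are
concatenated at each time step; the weights, dimensions, and concatenation
choice are illustrative only.

# Sketch: a bidirectional RNN built from a forward pass and a backward pass.
import numpy as np

def rnn_pass(x_seq, W_xh, W_hh, b_h):
    # Run a vanilla RNN over x_seq and return the hidden state at every step.
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(2)
n_in, n_hidden, seq_len = 4, 3, 5
x_seq = rng.standard_normal((seq_len, n_in))

fwd = (rng.standard_normal((n_hidden, n_in)), rng.standard_normal((n_hidden, n_hidden)), np.zeros(n_hidden))
bwd = (rng.standard_normal((n_hidden, n_in)), rng.standard_normal((n_hidden, n_hidden)), np.zeros(n_hidden))

h_forward = rnn_pass(x_seq, *fwd)                  # scans past -> future
h_backward = rnn_pass(x_seq[::-1], *bwd)[::-1]     # scans future -> past, re-aligned in time

h_bi = np.concatenate([h_forward, h_backward], axis=1)
print(h_bi.shape)    # (5, 6): every step now sees context from both directions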
Advantages
Context Awareness: Allows predictions to depend on the entire
input sequence, rather than just the preceding elements.
Better for Ambiguities: Particularly useful in tasks like speech
recognition, where the correct interpretation of a word or phoneme
may depend on the surrounding context.
Flexible Applications: Can be extended to handle 2D inputs, such
as images, by incorporating RNNs in four directions (up, down, left,
right).
Applications
Speech Recognition: Enables accurate phoneme classification by
considering linguistic dependencies.
Handwriting Recognition: Processes both local and global
patterns in the writing sequence.
Bioinformatics: Analyses DNA sequences by considering forward
and reverse strands.