NLP ASSIGNMENT
DINESH R
URK22AI1022
I. Machine Translation
1. Introduction to the Application Domain
Machine Translation (MT) is one of the oldest and most significant applications of Natural
Language Processing (NLP). It refers to the use of computer software to translate speech
or written text from one language to another. In an era of globalization, multilingual
communication has become a necessity in education, healthcare, global business, and
diplomacy. Manual translation is slow and costly, which makes MT systems an effective
solution for instant communication across language barriers.
Historically, MT began with the Rule-Based Machine Translation (RBMT) systems of the
1950s, which relied on manually encoded dictionaries and grammar rules. Later, Statistical
Machine Translation (SMT) applied probabilistic models to improve translation accuracy.
More recently, Neural Machine Translation (NMT) has become the most accurate and
widely used approach, applying deep learning models that capture the meaning and
context of the text.
2. Flowchart of Machine Translation System
Here is a detailed step-by-step flow of how a typical neural machine translation system
works:
This flow allows the system to effectively understand, process, and generate human-like
translations.
3. Methodology Used in Machine Translation
Current MT is built on Neural Machine Translation (NMT), which is rooted in
deep learning, specifically the Encoder-Decoder architecture.
a. Encoder-Decoder Architecture
The encoder reads the input sequence and compresses it into a context vector. The
decoder uses this vector to produce the output in the target language.
b. Attention Mechanism
Improves translation accuracy by enabling the decoder to focus on the relevant parts of
the input sentence.
c. Transformer Models
Transformers replace RNN-based systems with self-attention, enabling efficient parallel
processing. Well-known Transformer-based models include BERT and GPT.
d. Subword Units
Models employ Byte Pair Encoding (BPE) to manage rare words and minimize
vocabulary size.
e. Training and Inference
Models are trained on parallel corpora and employ beam search or greedy decoding during
inference to produce translations.
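To make these ideas concrete, the short sketch below uses the open-source Hugging Face
transformers library with a pretrained MarianMT checkpoint (the model name
Helsinki-NLP/opus-mt-en-fr is an assumed example). It illustrates subword tokenization, the
encoder-decoder model, and beam search decoding in a few lines; it is a minimal sketch, not a
production setup.

# Minimal NMT sketch with a pretrained MarianMT encoder-decoder model.
# Assumes: pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"                 # assumed English-to-French checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)   # SentencePiece subword tokenizer (BPE-like role)
model = MarianMTModel.from_pretrained(model_name)         # encoder-decoder Transformer

text = "Machine translation breaks down language barriers."
inputs = tokenizer(text, return_tensors="pt")             # text -> subword IDs

# Beam search keeps several candidate translations before picking the best one.
outputs = model.generate(**inputs, num_beams=4, max_length=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))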
4. Working Principles in Simple Terms
Machine Translation (MT) automatically converts text from one language into another
using computer programs. The strongest form of MT today is Neural Machine
Translation (NMT), which is based on neural networks, a form of artificial intelligence
loosely modeled on the way our brains work.
At the center of this system is the encoder-decoder architecture. Imagine yourself
translating a sentence. First, you read the entire sentence and understand what it says.
That is precisely what the encoder does: it reads the input sentence (say, in English) and
converts it into a list of numbers known as vectors that capture the meaning of the
sentence in a machine-readable format. This relies on a technique called embedding,
which enables the model to handle words, subwords, or even characters in a flexible
manner.
Then there is the decoder, which is similar to the section of your brain that begins
constructing a new sentence in the target language (such as French or Hindi). The decoder
examines the encoded meaning and then produces the translated sentence word by word.
At every step, it determines what word should be next based on the context of the original
sentence and the words it has already generated.
To ensure the model pays attention to the correct words during translation, a critical
technique known as the attention mechanism is applied. It allows the model to "look
back" at the relevant parts of the input sentence as it produces each word of the
translation. Picture it as a spotlight shifting over the most important words at every
step.
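To make the "spotlight" idea concrete, the tiny sketch below computes scaled dot-product
attention with NumPy for made-up query, key, and value vectors; the resulting weights show
how strongly the decoder "looks back" at each input word. All numbers here are illustrative
assumptions.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity between the decoder query and every encoder position.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns scores into weights that sum to 1 (the "spotlight").
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the encoder values is the context the decoder uses.
    return weights @ V, weights

# Toy example: one decoder query attending over three encoded input words.
Q = np.array([[1.0, 0.0]])                            # current decoder state (query)
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])    # encoder keys
V = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])    # encoder values
context, weights = scaled_dot_product_attention(Q, K, V)
print(weights)   # higher weight = more attention on that input word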
A more sophisticated version of this system is the Transformer model, which does not
rely on older methods such as Recurrent Neural Networks (RNNs) or Long Short-Term
Memory (LSTM) models. Instead, it employs a mechanism known as self-attention to
process all words in the input simultaneously rather than one by one. This makes
processing much faster and far better at capturing long-distance relationships between
words.
During training, the system is provided with millions of sentence pairs (such as English-
French) so that it learns how words and phrases are used in context. When the model is
actually in use (during inference), it employs various strategies for generating output,
such as greedy search (selecting the best word at every step) or beam search (exploring
several possibilities before selecting the best overall sentence).
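As a toy illustration of greedy decoding, the sketch below repeatedly picks the most probable
next word from a hand-made probability table until an end token appears. The vocabulary and
probabilities are invented purely for illustration; a real NMT model would produce these
probabilities at every step.

# Toy greedy decoder: always pick the most probable next word.
# The "model" below is a hand-made lookup table, not a real NMT model.
next_word_probs = {
    "<start>": {"le": 0.6, "la": 0.3, "<end>": 0.1},
    "le":      {"chat": 0.7, "chien": 0.2, "<end>": 0.1},
    "chat":    {"dort": 0.5, "<end>": 0.5},
    "dort":    {"<end>": 0.9, ".": 0.1},
}

word, sentence = "<start>", []
while word != "<end>" and len(sentence) < 10:
    candidates = next_word_probs.get(word, {"<end>": 1.0})
    word = max(candidates, key=candidates.get)   # greedy choice at each step
    if word != "<end>":
        sentence.append(word)
print(" ".join(sentence))   # -> "le chat dort"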
In short, Machine Translation systems interpret the meaning of a sentence through an
encoder, decide what to write in the other language through a decoder, and improve
accuracy through attention. They employ deep learning and neural networks to learn how
languages work from enormous amounts of data, making translations more accurate,
fluent, and natural than ever before.
5. Real-Time Applications of Machine Translation In NLP
• Google Translate – Translates text, voice, and images across 100+ languages.
• Microsoft Translator – Integrated into Office, Teams, and Edge.
• Skype Translator – Real-time speech translation.
• Amazon Translate – Cloud translation for apps and websites.
• YouTube – Auto-captioning and subtitle translation.
• Facebook/Instagram – Auto-translates user posts and comments.
• Chatbots – Real-time multilingual customer support.
• Education Platforms – Translate online courses for global students.
• Healthcare Apps – Translate symptoms and instructions.
• Legal Services – Translate policies and legal documents.
• Travel Apps – Translate signage and menus for tourists.
II. Text-to-Speech Conversion
1. Introduction to the Application Domain
Text-to-Speech (TTS) conversion is a field within Natural Language Processing (NLP)
that enables machines to read text aloud. It is a crucial technology for human-computer
interaction, transforming written language into speech that sounds increasingly natural
thanks to advancements in deep learning. TTS is widely used in areas like:
• Assistive technology: Helping visually impaired users by reading aloud digital
content.
• Virtual assistants: Siri, Alexa, and Google Assistant use TTS to communicate.
• Customer service: TTS powers automated IVR systems in call centers.
• Education: Assists language learners in improving pronunciation.
• Navigation systems: GPS apps like Google Maps use TTS for voice directions.
Earlier TTS systems produced robotic, unnatural voices. However, recent breakthroughs
using deep neural networks have led to fluid, expressive, and human-like synthetic voices.
2. Flowchart of Text-to-Speech (TTS) System
3. Methodology Used in Text-to-Speech (TTS) System
Modern TTS systems follow a two-stage pipeline: linguistic analysis and speech
synthesis.
A. Linguistic Analysis (Frontend)
• Text Normalization: Converts written content into a speakable form by
expanding numbers, dates, and abbreviations.
E.g., "Dr. John has 2kg rice" → "Doctor John has two kilograms rice".
• Tokenization: Breaks the normalized text into words or sentences.
• Grapheme-to-Phoneme Conversion (G2P): Maps letters to phonetic
representations using phonemes.
E.g., "cat" → /k/, /æ/, /t/
• Prosody Prediction: Estimates how the sentence should be spoken – including
pitch, duration, intonation, and rhythm.
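A very small sketch of these frontend steps is shown below: it expands a couple of
abbreviations and then looks words up in a tiny hand-made pronunciation dictionary. The rules
and the dictionary are illustrative assumptions; real systems use far larger resources or learned
G2P models.

import re

# Tiny, hand-made normalization rules and pronunciation dictionary
# (illustrative assumptions; real systems use much larger resources).
ABBREVIATIONS = {"Dr.": "Doctor", "2kg": "two kilograms"}
PHONEME_DICT = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "cat": ["K", "AE", "T"],
}

def normalize(text):
    # Text normalization: expand abbreviations and numbers into spoken words.
    for written, spoken in ABBREVIATIONS.items():
        text = text.replace(written, spoken)
    return text

def to_phonemes(text):
    # Tokenization followed by dictionary-based grapheme-to-phoneme lookup.
    tokens = re.findall(r"[a-zA-Z]+", normalize(text).lower())
    # Unknown words fall back to being spelled out letter by letter.
    return [PHONEME_DICT.get(tok, list(tok.upper())) for tok in tokens]

print(normalize("Dr. John has 2kg rice"))   # Doctor John has two kilograms rice
print(to_phonemes("Dr. cat"))               # [['D','AA','K','T','ER'], ['K','AE','T']]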
B. Speech Synthesis (Backend)
• Acoustic Modeling: Predicts acoustic features such as Mel-spectrograms (a visual
representation of sound intensity over time). Models like Tacotron 2, FastSpeech,
or Glow-TTS handle this.
• Vocoder: Converts Mel-spectrograms into actual speech waveforms. Modern
vocoders include:
o WaveNet: Autoregressive model generating high-quality speech.
o HiFi-GAN: Fast, real-time vocoder producing natural audio.
o Griffin-Lim: Older method with less natural output.
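At a low level, the backend can be illustrated with the open-source librosa library: the sketch
below converts a waveform into a Mel-spectrogram and then inverts it back to audio with the
classic Griffin-Lim method (the least natural of the vocoders listed above). It assumes librosa
and soundfile are installed and that a file named input.wav exists.

# Mel-spectrogram -> waveform round trip using Griffin-Lim.
# Assumes: pip install librosa soundfile, and a local file "input.wav".
import librosa
import soundfile as sf

audio, sr = librosa.load("input.wav", sr=22050)

# Acoustic representation: Mel-spectrogram (frequency content over time).
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)

# "Vocoder" step: invert the Mel-spectrogram back to a waveform with the
# Griffin-Lim algorithm (older and less natural than WaveNet or HiFi-GAN).
reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=sr)

sf.write("reconstructed.wav", reconstructed, sr)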
4. Working Principles in Simple Terms
A. Encoder-Decoder Architecture
This architecture is central to models like Tacotron 2. It works in two parts:
• Encoder: Converts input text (or phonemes) into a sequence of embeddings.
• Decoder: Uses an attention mechanism to focus on parts of the input and produce
a Mel-spectrogram frame-by-frame.
B. Attention Mechanism
This helps the decoder know which part of the input to focus on while generating each
output frame. It improves alignment between the text and the generated speech.
C. Acoustic Modeling
Once the phoneme sequence is passed through the encoder-decoder, the model outputs a
Mel-spectrogram — a time-frequency representation of speech. This visual form of audio
captures tone, pitch, and energy.
D. Neural Vocoder
The vocoder then takes the spectrogram and generates raw audio. Models like WaveNet
and HiFi-GAN use deep convolutional layers or autoregressive techniques to generate
high-fidelity sound.
E. Neural Networks in TTS
Modern TTS relies heavily on deep learning, particularly Recurrent Neural Networks
(RNNs) and Convolutional Neural Networks (CNNs). Recently, Transformers have
also been used (like in FastSpeech 2) for faster training and better parallelization.
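End to end, the whole pipeline above can be driven with a pretrained model. The sketch below
assumes the open-source Coqui TTS package and one of its published Tacotron 2 checkpoints;
the package and model name are assumptions, not the only way to do this.

# End-to-end neural TTS sketch (assumes: pip install TTS, i.e. Coqui TTS).
# The model name below is one of Coqui's published English checkpoints.
from TTS.api import TTS

# Tacotron 2 acoustic model plus a neural vocoder bundled with the checkpoint.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Text in, waveform out: frontend analysis, Mel-spectrogram prediction,
# and vocoding all happen inside this single call.
tts.tts_to_file(text="Text to speech makes computers talk.",
                file_path="speech.wav")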
5. Real-Time Applications of Text-to-Speech In NLP
• Virtual Assistants (e.g., Siri, Alexa, Google Assistant)
• Navigation Systems (e.g., Google Maps providing spoken directions)
• Accessibility Tools (e.g., screen readers for visually impaired users)
• Customer Service Chatbots (spoken replies for better UX)
• Audiobook Generation
• Language Learning Apps (teaching pronunciation)
• Voice Alerts in Smart Devices (home automation systems)
• Public Transport Announcements
• Interactive Voice Response (IVR) Systems in call centers
• Voice-Based Smart Toys or Educational Robots
• Speech Interfaces for Industrial Machines
III. Automatic Speech Recognition
1. Introduction to the Application Domain
Automatic Speech Recognition (ASR) is a core technology of Natural Language
Processing (NLP) that converts spoken language into written text. ASR enables real-time
human-machine interaction through voice and forms the basis of voice assistants,
transcription services, and speech interfaces.
The demand for ASR has grown significantly with the rise of:
• Smartphones
• Virtual assistants (like Siri, Google Assistant)
• Voice typing tools
• Automated customer service
• Real-time transcription systems
ASR technology bridges the gap between human speech and computer understanding,
making digital systems more accessible and intuitive.
2. Flowchart of Automatic Speech Recognition
3. Methodology Used in Automatic Speech Recognition
Modern ASR systems follow a multi-step deep learning-based methodology combining
audio signal processing with linguistic modeling.
A. Preprocessing
• Raw speech is often noisy and unstructured.
• Techniques like noise suppression, normalization, and framing are used to
prepare the audio signal.
B. Feature Extraction
• Converts raw audio into useful numerical features using:
o MFCC (Mel-Frequency Cepstral Coefficients)
o Spectrograms
o Log-Mel Spectrograms
These features capture the frequency, pitch, and energy components of the voice.
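These features are straightforward to compute with the open-source librosa library; the short
sketch below extracts MFCCs and a log-Mel spectrogram from a local recording (the file name
speech.wav is an assumption).

# Feature extraction for ASR using librosa (assumes: pip install librosa).
import librosa

audio, sr = librosa.load("speech.wav", sr=16000)   # 16 kHz is common for ASR

# 13 MFCCs per analysis frame: a compact summary of the spectral envelope.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

# Log-Mel spectrogram: the input most neural acoustic models expect.
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)

print(mfcc.shape, log_mel.shape)   # (13, num_frames) and (80, num_frames)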
C. Acoustic Modeling
• Uses neural networks (like CNNs, RNNs, or Transformers) to model the
relationship between audio features and phonemes (the smallest unit of sound in a
language).
D. Language Modeling
• Predicts word sequences from phoneme predictions.
• Helps choose the most likely transcription from several possibilities (e.g., “write”
vs. “right”).
E. Decoding
• Combines the outputs of acoustic and language models to produce the final
transcription.
• Uses algorithms like beam search or CTC decoding (Connectionist Temporal
Classification).
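Putting these steps together, the sketch below transcribes a short recording with a pretrained
Wav2Vec 2.0 model from the Hugging Face transformers library and a simple greedy CTC
decode. The checkpoint name and the 16 kHz mono file speech.wav are assumptions.

# End-to-end ASR sketch with a pretrained Wav2Vec 2.0 + CTC model.
# Assumes: pip install transformers torch librosa
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio, sr = librosa.load("speech.wav", sr=16000)           # model expects 16 kHz audio
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits              # per-frame character scores

# Greedy CTC decoding: take the best character per frame, then
# collapse repeats and remove blanks inside batch_decode.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])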
4. Working Principles in Simple Terms
a. Audio Signal Processing
• Speech is first broken into small frames (~20ms each).
• These frames are converted into Mel spectrograms that visualize how sound
frequencies vary over time.
b. Neural Networks in ASR
Modern ASR systems use a deep learning pipeline to map audio features to text.
i. Recurrent Neural Networks (RNNs)
• Models like LSTM (Long Short-Term Memory) can remember past input while
processing current speech, making them ideal for time-series data.
ii. Convolutional Neural Networks (CNNs)
• Extract local patterns in audio (e.g., detecting phoneme-like structures).
iii. Transformer Models
• Use self-attention to capture long-range dependencies in speech.
• Models like Wav2Vec 2.0, HuBERT, and Whisper are based on Transformers.
c. Encoder-Decoder Models
• Encoder: Converts audio into hidden representations.
• Decoder: Maps those representations to text.
• Often enhanced with attention mechanisms to improve alignment between audio
and text.
d. Connectionist Temporal Classification (CTC)
• A loss function used in end-to-end ASR models.
• It allows variable-length audio to be aligned with variable-length transcriptions
without pre-aligned data.
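PyTorch provides a ready-made implementation of this loss. The small sketch below applies
torch.nn.CTCLoss to random placeholder tensors, aligning 50 "audio" frames with a 10-token
target without any frame-level alignment information.

# CTC loss on dummy data (assumes: pip install torch).
import torch
import torch.nn as nn

T, N, C = 50, 1, 28        # 50 audio frames, batch of 1, 27 labels + 1 blank
S = 10                     # length of the target transcription

# Log-probabilities a model might output for each frame (random here).
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)

# A random target transcription (labels 1..27; index 0 is the CTC blank).
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC sums over every valid alignment between the 50 frames and 10 labels.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())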
5. Real-Time Applications of Automatic Speech Recognition In NLP
1. Virtual Assistants
Assistants like Google Assistant, Siri, and Alexa use ASR to process user commands.
2. Real-Time Captions and Subtitles
Platforms like Zoom, Google Meet, and YouTube use ASR to generate live captions for
accessibility.
3. Voice Typing
Google Docs and mobile keyboards support voice-to-text input using ASR.
4. Customer Service
ASR automates voice-based customer service systems through IVRs and chatbots.
5. Healthcare Transcription
Doctors use ASR to dictate notes during patient visits, saving time on documentation.
6. Language Learning
Apps like Duolingo use ASR to evaluate pronunciation and provide feedback.
7. Smart Devices
Smart TVs, smartwatches, and home automation systems use voice input through ASR.
8. Voice Search
Search engines like Google support voice queries, converting spoken questions into text.
9. Dictation Tools
Professionals use dictation software (e.g., Dragon NaturallySpeaking) to write emails,
reports, and documents.
10. Real-Time Translation
Applications like Google Translate use ASR to transcribe spoken words before translating
them to another language.
11. Accessibility Tools
ASR helps individuals with physical disabilities interact with technology through voice.
IV. Chatbots and Dialogue Systems
1. Introduction to the Application Domain
Chatbots and Dialogue Systems form a core area of Natural Language Processing (NLP)
and Artificial Intelligence (AI). They are designed to enable human-computer interaction
in natural language, either in the form of text or speech.
• A Chatbot is typically built to serve a specific purpose like answering FAQs,
handling simple user queries, or guiding users through defined processes (like
booking a ticket).
• A Dialogue System is broader and more complex, enabling more flexible and
natural interactions. These systems can handle multi-turn, context-aware
conversations, and may include voice-based interfaces (like Alexa or Google
Assistant).
These systems combine several NLP techniques such as natural language understanding
(NLU), dialogue management, natural language generation (NLG), and often speech
recognition/synthesis in voice interfaces.
2. Flowchart of Chatbots and Dialogue Systems
3. Methodology Used in Chatbots and Dialogue Systems
Modern chatbots and dialogue systems are powered by a pipeline of NLP components:
a. Input Processing
• If the user provides spoken input, it first goes through Automatic Speech
Recognition (ASR) to convert it into text.
• If it's already in text, preprocessing steps like tokenization and stopword removal
are applied.
b. Natural Language Understanding (NLU)
• Intent Recognition: Determines what the user wants (e.g., "Book a flight").
• Entity Extraction: Identifies key pieces of information (e.g., "New York",
"tomorrow").
c. Dialogue Management
• Maintains context of the conversation.
• Uses dialogue policies to decide how to respond.
• Advanced systems use Reinforcement Learning (RL) or Transformer-based
models for decision-making.
d. Natural Language Generation (NLG)
• Transforms structured data or decisions into natural language.
• Can be template-based (rule-driven) or model-based (neural networks).
e. Output Delivery
• Text is either displayed or converted back into speech using Text-to-Speech
(TTS).
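A toy, rule-based version of this pipeline is sketched below: keyword matching stands in for
intent recognition, a regular expression stands in for entity extraction, and fixed templates
stand in for NLG. Every rule and phrase here is an invented placeholder, not a real chatbot
framework.

import re

# --- NLU: keyword-based intent recognition and entity extraction (toy rules) ---
def detect_intent(text):
    if "book" in text.lower() and "flight" in text.lower():
        return "book_flight"
    if "weather" in text.lower():
        return "get_weather"
    return "fallback"

def extract_city(text):
    # Grab capitalized words after "to", e.g. "to New York".
    match = re.search(r"to ([A-Z][a-z]+(?: [A-Z][a-z]+)*)", text)
    return match.group(1) if match else None

# --- NLG: template-based responses ---------------------------------------------
TEMPLATES = {
    "book_flight": "Sure, booking a flight to {city}.",
    "get_weather": "Here is the weather forecast for {city}.",
    "fallback": "Sorry, I did not understand that.",
}

def respond(text):
    intent, city = detect_intent(text), extract_city(text) or "your city"
    return TEMPLATES[intent].format(city=city)

print(respond("Please book a flight to New York tomorrow"))
# -> "Sure, booking a flight to New York."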
4. Working Principles in Simple Terms
1. Natural Language Understanding (NLU)
• Uses models and toolkits such as BERT, RoBERTa, or spaCy for:
o Intent classification: Predicts the goal of the user.
o Named Entity Recognition (NER): Extracts specific data (like dates,
names, etc.).
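For example, with the open-source spaCy library (its small English model en_core_web_sm is
assumed to be installed), entity extraction for a travel query looks like this:

# Entity extraction with spaCy.
# Assumes: pip install spacy, then python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Book a flight to New York for tomorrow")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "New York" GPE, "tomorrow" DATE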
2. Dialogue Management
• Can be rule-based or AI-driven.
• Advanced systems use:
o RNNs or LSTM networks for context tracking.
o Transformers like GPT, DialoGPT, ChatGPT, or LaMDA for response
generation.
o Reinforcement Learning to learn optimal dialogue strategies over time.
3. Natural Language Generation (NLG)
• Can be:
o Template-based: Predefined phrases (e.g., “Your flight is booked.”).
o Neural models: Use encoder-decoder architecture to generate human-like
responses.
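As an illustration of the neural option, the sketch below generates a reply with a small
pretrained encoder-decoder chatbot from the Hugging Face transformers library; the
BlenderBot checkpoint named here is an assumed example.

# Neural response generation with an encoder-decoder chatbot model.
# Assumes: pip install transformers torch
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

name = "facebook/blenderbot-400M-distill"        # assumed example checkpoint
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

user_message = "Can you recommend a good book about space?"
inputs = tokenizer(user_message, return_tensors="pt")   # encoder input

# The decoder generates the reply token by token from the encoded input.
reply_ids = model.generate(**inputs, max_length=60)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))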
4. Encoder-Decoder Models
• Common in transformer-based systems.
• The encoder reads user input; the decoder generates the response.
• These are often trained using large dialogue datasets like Persona-Chat,
DailyDialog, or OpenSubtitles.
5. Speech Integration
• For voice systems, ASR is used on input and TTS on output.
• Models: Whisper, Wav2Vec 2.0 for ASR; Tacotron 2, FastSpeech for TTS.
5. Real-Time Applications of Chatbots and Dialogue Systems In NLP
• Customer Support Chatbots (e.g., Zendesk bots)
• Virtual Assistants (e.g., Siri, Google Assistant)
• Healthcare Bots (e.g., symptom checkers)
• Banking and Finance Bots (e.g., balance inquiries)
• Educational Tutors (e.g., language learning assistants)
• E-commerce Bots (e.g., product suggestions)
• Appointment Scheduling (e.g., chatbot calendars)
• FAQ and Helpdesk Automation
• Mental Health Chatbots (e.g., Woebot)
• Voice Interfaces in Smart Devices
• Social Chatbots (e.g., Replika)