5.2 Language Modeling:
● Introduction
● N-gram Models
● Language Model Evaluation
● Parameter Estimation
● Language Model Adaptation
● Types of Language Models
● Language-Specific Modeling Problems
● Multilingual and Cross-Lingual Language Modeling
LANGUAGE MODELING
● Language modeling is a fundamental task in Natural Language Processing
(NLP).
● It involves predicting the next word (or character) in a sequence, or
estimating the likelihood of a given sequence of words.
● It helps machines understand and generate human language.
● A language model (LM) assigns a probability to a sequence of words in a
language.
● For example, the sentence:
"I am going to the store."
is more likely (has a higher probability) than:
"I am going to the elephant."
● Language models help identify which sequences of words are more natural
or likely in a language.
LANGUAGE MODELING
Language modeling is used in:
● Speech recognition
● Text generation
● Machine translation
● Spelling correction
● Chatbots
● Autocompletion (e.g., Gmail, Google search)
LANGUAGE MODELING
Language models used in practice:
● n-gram models (traditional)
● RNNs / LSTMs
● Transformers (e.g., BERT, GPT, RoBERTa)
LANGUAGE MODELING
Working of Language Modeling:
● Let’s say we have a sentence:
"The cat sat on the ___"
A language model tries to predict the next word (mat, table, etc.)
based on the previous words.
LANGUAGE MODELING
Working of Language Modeling:
1. Next Word Prediction
Example:
Input: "The sun is shining in the"
Model Output: "sky"
Use case:
Used in mobile keyboards (like SwiftKey, Gboard) or writing assistants
(like Grammarly) to suggest the next word.
LANGUAGE MODELING
Working of Language Modeling:
2. Text Generation
Example:
Input Prompt: "Once upon a time"
Model Output: "there was a little girl who lived in a village near the
forest..."
Use case:
Used in content generation, story writing, and chatbot conversations (like
ChatGPT).
LANGUAGE MODELING
Working of Language Modeling:
3. Speech Recognition
Example:
Audio Input: "I want to go to the"
Model helps choose between acoustically similar candidates, e.g.:
● "bank"
● "tank"
by picking whichever word is more probable given the preceding words.
Use case:
Language models help disambiguate words based on context in automatic
speech recognition (ASR).
LANGUAGE MODELING
Working of Language Modeling:
4. Machine Translation
Example:
Input: "Je suis étudiant"
Model Output: "I am a student"
Use case:
Language models help ensure fluency and grammatical correctness in
machine translation (like Google Translate).
LANGUAGE MODELING
Working of Language Modeling:
5. Spelling and Grammar Correction
Example:
Input: "He go to school everyday"
Model Correction: "He goes to school every day"
Use case:
Used in grammar tools to identify and correct common language errors.
LANGUAGE MODELING
Working of Language Modeling:
6. Autocompletion
Example:
Start typing: "Thank you for your"
Model Suggests: "time" / "support" / "email"
Use case:
Used in code editors, emails, and document writing to improve productivity.
LANGUAGE MODELING
Working of Language Modeling:
7. Question Answering
Example:
Question: "Who is the president of the United States?"
Contextual text is processed using a language model to extract: "Joe Biden"
Use case:
Used in search engines, digital assistants, and educational tools.
LANGUAGE MODELING
Working of Language Modeling:
8. Chatbots and Virtual Assistants
Example:
User: "Tell me a joke"
Bot: "Why don't scientists trust atoms? Because they make up everything!"
Use case:
Core of NLP chatbots like Alexa, Siri, or ChatGPT.
LANGUAGE MODELING
Working of Language Modeling:
9. Summarization
Example:
Long Text: An article on climate change.
Model Output: A concise summary: "Climate change is causing global
temperatures to rise due to greenhouse gases."
Use case:
Used in news apps, summarizing legal or medical documents.
LANGUAGE MODELING
Types of Language Models (based on approach):
1. Statistical Language Models (Traditional):
They use probabilities based on word frequency in training data.
● Unigram model: Each word is independent.
○ P(w1, w2, w3) = P(w1) * P(w2) * P(w3)
● Bigram model: Each word depends on the previous one.
○ P(w3 | w2)
● Trigram model: Each word depends on the two previous words.
○ P(w3 | w1, w2)
Example (Bigram):
"The cat" = P("The") * P("cat" | "The")
LANGUAGE MODELING
Types of Language Models (based on approach):
2. Neural Language Models:
Use neural networks to learn complex patterns.
● RNNs (Recurrent Neural Networks)
● LSTMs (Long Short-Term Memory)
● Transformers (like GPT, BERT)
These models learn context better and generate more human-like language.
Example in Python (Simple Bigram Model):
from collections import defaultdict

sentences = ["I love NLP", "I love Python", "Python is great"]
bigram_model = defaultdict(lambda: defaultdict(int))

# Build bigram counts
for sentence in sentences:
    words = sentence.split()
    for i in range(len(words) - 1):
        bigram_model[words[i]][words[i + 1]] += 1

# Convert counts to probabilities
for w1 in bigram_model:
    total_count = float(sum(bigram_model[w1].values()))
    for w2 in bigram_model[w1]:
        bigram_model[w1][w2] /= total_count

# Predict next word after "I"
print("Next word predictions for 'I':", dict(bigram_model["I"]))
LANGUAGE MODELING
Types of Language Modeling in NLP
Unigram Language Model
● Assumes each word is independent of the previous words.
● Probability of a sentence:
P(w1, w2, w3) = P(w1) * P(w2) * P(w3)
Use case: Simple applications, quick baselines.
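A minimal sketch of the unigram computation (the toy corpus below is an assumption for illustration; real models are estimated from large corpora):

from collections import Counter

corpus = "I love NLP I love Python Python is great".split()  # toy corpus
counts = Counter(corpus)
total = sum(counts.values())

def unigram_sentence_prob(sentence):
    # P(w1, w2, ..., wk) = P(w1) * P(w2) * ... * P(wk), each word independent
    prob = 1.0
    for word in sentence.split():
        prob *= counts[word] / total
    return prob

print(unigram_sentence_prob("I love NLP"))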
LANGUAGE MODELING
Types of Language Modeling in NLP
Bigram & N-gram Language Models
● Assumes each word depends on (n-1) previous words.
● Bigram: P(wi | wi-1)
● Trigram: P(wi | wi-2, wi-1)
● General n-gram: P(wi | wi-n+1, ..., wi-1)
Use case: Early statistical models, text prediction, autocomplete.
Limitation: Poor at capturing long-range dependencies; suffers from data sparsity.
LANGUAGE MODELING
Types of Language Modeling in NLP
Neural Language Models (NLM)
● Use neural networks (like feedforward NN, RNN) to learn language
probability distributions.
● Capture better context than traditional n-grams.
● Use case: Advanced NLP tasks like translation, text generation.
LANGUAGE MODELING
Types of Language Modeling in NLP
Recurrent Neural Network (RNN) Language Models
● Model sequential dependencies better than n-grams.
● Maintain a hidden state that remembers past information.
Use case: Sequence generation, speech recognition.
Limitation: Struggle with long-term dependencies.
LANGUAGE MODELING
Types of Language Modeling in NLP
LSTM / GRU Language Models
● Improved version of RNNs.
● Handle long-range dependencies using gates (forget, input,
output).
● GRU = simpler version of LSTM.
Use case: Text generation, chatbots, summarization.
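A minimal PyTorch sketch of an LSTM language model (the toy vocabulary and untrained weights are assumptions; a real model would be trained with cross-entropy loss on a large corpus):

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        emb = self.embed(x)        # (batch, seq_len, embed_dim)
        out, _ = self.lstm(emb)    # hidden state at every position
        return self.fc(out)        # logits over the vocabulary at every position

# Toy usage: score possible next words after "the cat sat on the"
vocab = ["<pad>", "the", "cat", "sat", "on", "mat"]   # assumed toy vocabulary
model = LSTMLanguageModel(vocab_size=len(vocab))
input_ids = torch.tensor([[1, 2, 3, 4, 1]])           # "the cat sat on the"
logits = model(input_ids)
predicted_id = logits[0, -1].argmax().item()          # untrained, so arbitrary
print("Predicted next word:", vocab[predicted_id])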
LANGUAGE MODELING
Types of Language Modeling in NLP
Transformer-Based Language Models
● Do not use recurrence—use self-attention mechanism instead.
● Can process entire sequences in parallel.
● Capture global context better.
Use case: All state-of-the-art models today.
Examples:
● BERT (Bidirectional Encoder Representations from Transformers) →
Masked LM
● GPT (Generative Pre-trained Transformer) → Causal LM
● T5, XLNet, RoBERTa, DistilBERT, etc.
LANGUAGE MODELING
Types of Language Modeling in NLP
Masked Language Models (MLM)
● Predict missing (masked) words in a sentence.
● Used in BERT.
Example:
Input: "The cat [MASK] on the mat"
Model predicts: "sat"
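A small sketch of masked-word prediction with the Hugging Face fill-mask pipeline (bert-base-uncased is used here because it expects the [MASK] token shown above; exact predictions depend on the checkpoint):

from transformers import pipeline

# Masked language modeling: the model fills in the [MASK] token
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The cat [MASK] on the mat")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))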
LANGUAGE MODELING
Types of Language Modeling in NLP
Causal / Autoregressive Language Models
● Predict the next word given previous words (unidirectional).
● Used in GPT models.
Example:
Input: "The cat sat on the"
Model predicts: "mat"
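A comparable sketch for causal generation with GPT-2 (GPT-2 is just a readily available example checkpoint; the continuation depends on the decoding settings):

from transformers import pipeline

# Causal LM: generate the continuation left to right, one token at a time
generator = pipeline("text-generation", model="gpt2")
result = generator("The cat sat on the", max_new_tokens=5)
print(result[0]["generated_text"])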
LANGUAGE MODELING
Types of Language Modeling in NLP
Bidirectional Language Models
● Look at both previous and next words to understand context.
● Example: BERT, ELMo.
LANGUAGE MODELING
Types of Language Modeling in NLP
Type—Key Idea—Example Models
• Unigram Model—Each word is independent—NLTK, Gensim
• Bigram / N-gram—Word depends on n-1 previous words—KenLM, NLTK
• Neural Language Model—Uses neural networks—Simple FFNN
• RNN Language Model—Remembers past words (sequential)—RNN
• LSTM / GRU Language Model—Remembers longer sequences—LSTM, GRU
LANGUAGE MODELING
Types of Language Modeling in NLP
Type—Key Idea—Example Models
• Transformer Language Model—Self-attention for global context—GPT, BERT, T5
• Masked Language Model (MLM)—Predict masked words—BERT
• Causal Language Model—Predict next word—GPT
• Bidirectional Language Model—Context from both directions—BERT, ELMo
● N-gram Models
● Language Model Evaluation
● Parameter Estimation
● Language Model Adaptation
● Types of Language Models
● Language-Specific Modeling Problems
● Multilingual and Cross-Lingual Language Modeling
N-GRAM MODELS
N-gram models are a fundamental concept in Natural Language
Processing (NLP), especially for language modeling and text
prediction tasks.
An N-gram is a contiguous sequence of N items (usually words or
characters) from a given text or speech.
● Unigram (1-gram): "I", "am", "happy"
● Bigram (2-gram): "I am", "am happy"
● Trigram (3-gram): "I am happy"
N-gram models help us:
● Predict the next word in a sentence.
● Calculate the probability of a sentence.
● Improve speech recognition, text generation, and autocomplete
systems.
N-gram models are based on the Markov Assumption, which means
the probability of a word depends only on the previous (N-1) words.
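Written out (a standard formulation consistent with the bigram/trigram formulas above), the Markov assumption approximates the chain rule as:

P(w_1, w_2, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})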
Example
Sentence: "I love NLP"
● Unigrams: I, love, NLP
● Bigrams: I love, love NLP
● Trigrams: I love NLP
Limitations of N-gram Models
● Data sparsity: Larger values of N require very large amounts of training data.
● Memory usage: Storing N-grams becomes expensive.
● Context limitations: N-grams don’t understand long-term
dependencies.
Solutions to Limitations
● Smoothing techniques (e.g., Laplace, Kneser-Ney)
● Neural Language Models (e.g., RNNs, Transformers like
BERT/GPT)
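As a rough sketch of add-one (Laplace) smoothing for bigrams (the toy counts are assumptions; Kneser-Ney is more involved and is typically taken from a library such as KenLM or NLTK):

from collections import defaultdict

# Toy counts (assumed for illustration)
bigram_counts = defaultdict(int, {("I", "love"): 2, ("love", "NLP"): 1, ("love", "Python"): 1})
unigram_counts = defaultdict(int, {"I": 2, "love": 2, "NLP": 1, "Python": 1})
V = len(unigram_counts)  # vocabulary size

def laplace_bigram_prob(w1, w2):
    # P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(laplace_bigram_prob("love", "NLP"))    # seen bigram
print(laplace_bigram_prob("love", "code"))   # unseen bigram still gets non-zero probability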
N-GRAM MODEL:
from collections import defaultdict
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

nltk.download('punkt')

text = "I love NLP and I love machine learning"

# Tokenize the text
tokens = word_tokenize(text)

# Create bigrams
bigrams = list(ngrams(tokens, 2))

# Count bigrams
bigram_freq = defaultdict(int)
for bg in bigrams:
    bigram_freq[bg] += 1

print("Bigrams and their frequencies:")
for pair, freq in bigram_freq.items():
    print(f"{pair}: {freq}")
Word-Level N-gram Examples
Sentence:
"Natural Language Processing is amazing"
Unigrams (1-gram):
['Natural', 'Language', 'Processing', 'is', 'amazing']
Bigrams (2-grams):
[('Natural', 'Language'),
('Language', 'Processing'),
('Processing', 'is'),
('is', 'amazing')]
Trigrams (3-grams):
[('Natural', 'Language', 'Processing'),
('Language', 'Processing', 'is'),
('Processing', 'is', 'amazing')]
4-grams:
[('Natural', 'Language', 'Processing', 'is'),
('Language', 'Processing', 'is', 'amazing')]
Character-Level N-gram Examples
Word: "NLP"
Unigrams (1-gram):
['N', 'L', 'P']
Bigrams (2-grams):
['NL', 'LP']
Trigrams (3-grams):
['NLP']
Word: "text"
bigrams: ['te', 'ex', 'xt']
trigrams: ['tex', 'ext']
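A minimal way to generate character n-grams (the sliding-window slice below is one common idiom):

def char_ngrams(word, n):
    # Slide a window of size n over the characters of the word
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("NLP", 2))    # ['NL', 'LP']
print(char_ngrams("text", 3))   # ['tex', 'ext']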
Real-World Text Example
Sentence:
"I am learning Python"
Trigrams (3-grams):
[('I', 'am', 'learning'),
('am', 'learning', 'Python')]
Use Cases with Examples:
Use Case—N-gram
Autocomplete—Trigrams
Example: Input: "I am" → Suggests: "learning", "a student", etc.
Spell Correction—Bigrams
Example: "He go to shcool" → "go to school"
Text Classification—Unigrams/Bigrams
Example: Extract features like ("not", "good") to detect sentiment
Plagiarism Detection—Character 4-grams
Example: Compare overlapping phrases
● Language Model Evaluation
● Parameter Estimation
● Language Model Adaptation
● Types of Language Models
● Language-Specific Modeling Problems
● Multilingual and Cross-Lingual Language Modeling
Language Model Evaluation:
In Natural Language Processing (NLP), language model evaluation
refers to the process of measuring how well a language model
performs on a specific task. A language model predicts or generates
text based on some input, and evaluation helps us determine how
accurately or effectively it does that.
To evaluate a language model, we need to:
● Check accuracy (Is it correct?).
● Measure fluency (Does it sound natural?).
● Compare models (Which one is better?).
● Decide if it’s ready for production.
Common Tasks for Language Models
● Text generation
● Text classification
● Machine translation
● Question answering
● Sentiment analysis
Each task may need different evaluation metrics.
Types of Evaluation
1. Intrinsic Evaluation
● Tests the model independently of real-world tasks.
● Focuses on things like:
○ Perplexity (how well the model predicts words)
○ BLEU score (for translation tasks)
○ ROUGE (for summarization)
Example:
If a language model predicts the next word in a sentence, we can
check how many times it was correct — this is intrinsic.
2. Extrinsic Evaluation
● Measures model performance on real tasks.
● You use the model in an application, and check how much it helps
improve the task.
Example:
Using a language model in a chatbot — how well does it help answer
customer questions?
1. Perplexity (for Language Modeling)
🔹 What it does:
Measures how well a language model predicts the next word in a
sequence.
🔹 Interpretation:
● Lower perplexity = better model.
● A perplexity of P means the model is as confused as if it had to
choose between P equally likely words.
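Concretely (this is the standard definition), perplexity is the exponential of the average negative log-likelihood over the N test tokens:

PP(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)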
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "The sky is blue and the sun is"
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels makes the model return the average
# cross-entropy loss over the sequence
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss

# Perplexity is the exponential of that average loss
perplexity = torch.exp(loss)
print("Perplexity:", perplexity.item())
2. BLEU Score (for Machine Translation or Text Generation)
🔹 What it does:
Compares generated text to reference text by checking how many
n-grams (1-gram, 2-gram, etc.) match.
🔹 Interpretation:
● BLEU score ranges from 0 to 1 (or 0 to 100 if scaled).
● Higher is better — more overlap with reference = better translation.
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "sky", "is", "blue"]]
candidate = ["the", "sky", "is", "bright"]

# For such a short sentence, weight only unigrams and bigrams;
# the default 4-gram weights would make this example score (almost) zero.
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))
print("BLEU Score:", score)
3. ROUGE Score (for Summarization)
🔹 What it does:
Measures overlap between reference summary and generated
summary.
🔹 Types:
● ROUGE-N: Overlap of n-grams
● ROUGE-L: Longest Common Subsequence (LCS)
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score("The cat was found under the bed", "The cat is under the bed")
print(scores)
4. Accuracy / F1-Score (for Classification Tasks)
🔹 What it does:
Used when a language model is fine-tuned for tasks like sentiment
analysis or spam detection.
● Accuracy: % of correct predictions.
● F1-Score: Balance between precision & recall.
from sklearn.metrics import accuracy_score, f1_score
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
5. Human Evaluation
🔹 What it does:
Humans rate generated output for:
● Fluency (does it sound natural?)
● Relevance (does it make sense?)
● Coherence (is it logically connected?)
6. Edit Distance / Levenshtein Distance
🔹 What it does:
Measures how many edits (insertions, deletions, substitutions) are
needed to convert model output to reference text.
● Lower distance = better match
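A quick sketch using NLTK's built-in Levenshtein distance (any edit-distance implementation would do):

from nltk.metrics.distance import edit_distance

# Number of single-character insertions, deletions, or substitutions needed
print(edit_distance("He go to shcool", "He goes to school"))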
Approach—Task Type—Goal
Perplexity—Language Modeling—How well it predicts the next word
BLEU—Translation/Generation—N-gram overlap
ROUGE—Summarization—N-gram/LCS overlap
Approach—Task Type—Goal
Accuracy/F1—Classification—Prediction correctness
Human Eval—General—Fluency, coherence, relevance
Edit Distance—Generation—Similarity to reference
● Parameter Estimation
● Language Model Adaptation
● Types of Language Models
● Language-Specific Modeling Problems
● Multilingual and Cross-Lingual Language Modeling
Parameter Estimation:
In Natural Language Processing (NLP), parameter estimation refers to
the process of determining the best values for the parameters of a model so
that it can make accurate predictions or generate meaningful language.
What are Parameters?
Parameters are internal variables of a model that are learned from data.
These parameters influence how the model behaves.
● In a statistical model (like Naive Bayes), parameters might be:
○ Word probabilities (like P(word | class))
○ Prior probabilities (like P(class))
● In a machine learning model (like Logistic Regression or a Neural
Network):
○ Parameters include weights and biases that are tuned during training.
● In deep learning NLP models (like BERT, GPT, etc.):
○ Parameters can number in the millions or billions, and they are
optimized using training data through techniques like backpropagation.
What is Parameter Estimation?
It’s the process of finding the values of those parameters that maximize the
model's performance on a given task.
In simple terms:
Given training data, we estimate the parameters of our model so that it
best fits or explains the data.
Examples
1. Naive Bayes Classifier (text classification):
We estimate parameters like:
P(word_i | class_j) = (count(word_i in class_j) + 1) / (total words in class_j + V)
where V is the vocabulary size (Laplace smoothing).
This is parameter estimation using Maximum Likelihood Estimation (MLE)
or Maximum A Posteriori (MAP).
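A minimal sketch of this estimation with scikit-learn (the toy texts and labels are assumptions; MultinomialNB with alpha=1.0 applies exactly this add-one estimate of P(word | class)):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["good great movie", "bad boring movie", "great acting", "boring plot"]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words counts as features
X = CountVectorizer().fit_transform(texts)

# alpha=1.0 corresponds to the add-one (Laplace) estimate above
clf = MultinomialNB(alpha=1.0).fit(X, labels)

# The estimated log P(word | class) values are the learned parameters
print(clf.feature_log_prob_)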
2. Logistic Regression for NLP tasks:
We estimate weights w for each feature (e.g., word presence) using Gradient
Descent to minimize a loss function (e.g., cross-entropy loss).
3. Neural Networks (e.g., BERT, LSTM):
Millions of weights are initialized randomly and then learned from data
using backpropagation + optimizers like Adam.
Common Parameter Estimation Techniques
● Maximum Likelihood Estimation (MLE)
● Maximum A Posteriori Estimation (MAP)
● Gradient Descent and its variants (used in neural models)
● Bayesian Estimation
Term—Meaning
Parameters—Internal values learned from data (e.g., word probabilities,
model weights)
Estimation—Finding the best values for these parameters using data
Purpose—To help the model make accurate predictions or generate text
Used in—Everything from simple models (like Naive Bayes) to complex
transformers (like BERT)
● Language Model Adaptation
● Types of Language Models
● Language-Specific Modeling Problems
● Multilingual and Cross-Lingual Language Modeling
Language Model Adaptation is the process of modifying a pre-trained
language model so it performs better on a specific task, domain, or type of
language (e.g., legal, medical, social media).
Think of it like teaching a general-purpose language model to specialize in
something new, like legal documents, medical conversations, or customer
reviews.
Why Adapt a Language Model?
Because:
● Pretrained models (like GPT, BERT, etc.) are trained on general data (e.g.,
Wikipedia, books, the web), so they may miss the vocabulary, style, and
task-specific patterns of a specialized domain.
Types of Language Model Adaptation
1. Domain Adaptation
Train or fine-tune the model on data from a specific domain.
● E.g., adapting BERT for medical language → BioBERT
● Goal: Help the model understand domain-specific vocabulary and context.
2. Task Adaptation
Adapting a model for a specific task (like sentiment analysis, summarization).
● E.g., fine-tuning BERT for named entity recognition (NER) or question
answering.
3. User/Style Adaptation
Adapting to a specific writing style or speaker's language.
● E.g., personalizing a chatbot to speak like the user.
How is Adaptation Done?
Usually through fine-tuning:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Fine-tuning on a custom dataset
# (train_dataset is assumed to be a tokenized, labelled dataset prepared beforehand)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", num_train_epochs=3),
    train_dataset=train_dataset,
)
trainer.train()
Pretraining—Adaptation
• Trained on general, huge corpora—Fine-tuned on domain/task-specific data
• Expensive and slow—Faster, cheaper
• Learns general language patterns—Learns task/domain-specific knowledge
● Language Model Adaptation = Making a general model work better for a
specific context.
● Common methods: fine-tuning, continued pretraining, adapter layers.
● Used in: chatbots, medical NLP, legal document analysis, etc.
Cross Lingual Language Modeling
• Applications of Cross-Lingual Models
• 3. Sentiment Analysis Across Languages
• A sentiment analysis model trained in English can analyze emotions in
French or Hindi, even with little to no training data in those languages.
• 4. Low-Resource Language Support
• If a model is trained in English and Spanish, it can still provide basic NLP
functionality in related languages like Portuguese, even without additional
data.
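A hedged sketch of zero-shot cross-lingual use with a multilingual checkpoint (the model name below is one publicly available example on the Hugging Face Hub; any multilingual sentiment model would illustrate the same idea):

from transformers import pipeline

# A multilingual model can score text in several languages with a single
# checkpoint, thanks to shared multilingual pretraining.
classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)
print(classifier("Ce film était excellent !"))  # French input, positive review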
Cross Lingual Language Modeling
Key Benefits of Cross-Lingual Modeling in NLP
1. Overcomes Language Barriers:
Enables seamless communication between speakers of different languages.
Example: A model trained in English can help process French, Hindi, or
Chinese without requiring separate training for each language.
Why It Matters: Supports global businesses, diplomacy, and multilingual
customer interactions.