
UNIT IV

Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Parameter Estimation, Language Model Adaptation, Types of Language Models, Language-Specific Modeling Problems, Multilingual and Crosslingual Language Modeling

1. Language Modeling: Introduction


Language modeling is a fundamental task in natural language processing (NLP) that involves predicting the next word or sequence of words in a given context. It is a core component of many NLP applications, such as machine translation, speech recognition, text generation, and more. Language models (LMs) are trained to capture the structure, grammar, and semantics of a language, enabling them to generate coherent and contextually appropriate text.

Methods of Language Modeling

Two methods of language modeling:

1. Statistical Language Modeling: the development of probabilistic models that can predict the next word in a sequence given the words that precede it. N-gram language models are a classic example.

2. Neural Language Modeling: neural network methods achieve better results than classical methods, both as standalone language models and when incorporated into larger systems for challenging tasks such as speech recognition and machine translation. One way of building a neural language model is through word embeddings.

Key Concepts in Language Modeling

1. Probability Distribution:

o A language model assigns probabilities to sequences of words. For a given sequence of words w1, w2, …, wn, the model estimates the probability P(w1, w2, …, wn).

o The probability of a sequence is typically computed using the chain rule of probability:

P(w1, w2, …, wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) … P(wn | w1, w2, …, wn−1)

2. N-gram Models:

o N-gram models are a traditional approach to language modeling. They approximate

the probability of a word given its history by considering only the

previous n−1 words.

o For example, in a bigram model (n=2), the probability of a word depends only on

the previous word: P(wi∣w1,w2,…,wi−1)≈P(wi∣wi−1)

o N-gram models are simple but suffer from sparsity issues (i.e., many possible word

sequences are never seen in the training data).

3. Neural Language Models:

o Neural networks have largely replaced traditional n-gram models due to their

ability to capture long-range dependencies and generalize better to unseen data.

o Common architectures include:

 Recurrent Neural Networks (RNNs): Process sequences one word at a

time, maintaining a hidden state that captures context.

 Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs):

Variants of RNNs designed to handle long-term dependencies.

 Transformers: Use self-attention mechanisms to capture relationships

between words in a sequence, enabling parallel processing and better

scalability.

o Pre-trained models like GPT (Generative Pre-trained Transformer) and BERT

(Bidirectional Encoder Representations from Transformers) have set new

benchmarks in language modeling.

4. Evaluation Metrics:

o Perplexity: A common metric for evaluating language models. It measures how well

the model predicts a sample. Lower perplexity indicates better performance.

o BLEU, ROUGE, METEOR: Used for evaluating generated text in tasks like

machine translation and summarization.

5. Applications of Language Models:

o Text Generation: Generating coherent and contextually relevant text (e.g.,

chatbots, story generation).

o Machine Translation: Translating text from one language to another.


o Speech Recognition: Converting spoken language into text.

o Autocomplete and Spell Checking: Assisting users in typing by predicting the

next word or correcting errors.

o Sentiment Analysis: Understanding the sentiment expressed in text.

6. Challenges in Language Modeling:

o Data Sparsity: Rare or unseen word sequences can be difficult to model.

o Context Length: Capturing long-range dependencies in text.

o Bias and Fairness: Language models can inherit biases present in the training

data.

o Computational Resources: Training large-scale models like GPT-3 requires

significant computational power.

N-gram
N-gram can be defined as the contiguous sequence of n items from a given sample of

text or speech. The items can be letters, words, or base pairs according to the application.

The N-grams typically are collected from a text or speech corpus (A long text dataset).

For instance, N-grams can be unigrams like (“This”, “article”, “is”, “on”, “NLP”) or bigrams

(“This article”, “article is”, “is on”, “on NLP”).

Given the previous N−1 words, an N-gram model predicts the most likely words to follow. The model is a probabilistic language model trained on a collection of text, and it is useful in applications such as speech recognition and machine translation.

The N-gram language model is about finding probability distributions over sequences of words. Consider the sentences "There was heavy rain" and "There was heavy flood". From experience, we can say that the first sentence sounds more natural. The N-gram language model captures this: "heavy rain" occurs more frequently than "heavy flood" in text, so the first sentence is more likely and will be selected by the model.

 In the unigram (one-gram) model, the model relies only on which word occurs most often, without considering the previous words.

 In a 2-gram (bigram) model, only the previous word is considered when predicting the current word.

 In a 3-gram (trigram) model, the two previous words are considered.

 In the N-gram language model, the following probability is calculated:

P("There was heavy rain") = P("There", "was", "heavy", "rain")
= P("There") · P("was" | "There") · P("heavy" | "There was") · P("rain" | "There was heavy")
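As a small illustration, the chain-rule product can be computed directly in Python. The conditional probabilities below are made-up numbers for the sketch, not values estimated from any corpus:

# Chain-rule product with hypothetical conditional probabilities.
p = 1.0
for cond_prob in [0.05,   # P("There")
                  0.30,   # P("was" | "There")
                  0.10,   # P("heavy" | "There was")
                  0.40]:  # P("rain" | "There was heavy")
    p *= cond_prob
print(p)  # ~0.0006, the probability assigned to "There was heavy rain"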

Advantages of N-grams

1. Simplicity: N-grams are intuitive and relatively simple to understand and implement.

2. Low Memory Usage: They require minimal memory for storage compared to more complex

models.

Limitations of N-grams

1. Limited Context: N-grams have a finite context window, which means they cannot

capture long-range dependencies or context beyond the previous N-1 words.

2. Sparsity: As N increases, the number of possible N-grams grows exponentially, leading

to sparse data and increased computational demands.


N-gram Language Model
An N-gram language model predicts the probability of a given N-gram within any

sequence of words in a language. A well-crafted N-gram model can effectively predict the

next word in a sentence, which is essentially determining the value of p(w∣h), where h is the

history or context and w is the word to predict.

Let's explore how to predict the next word in a sentence. We need to calculate p(w|h), where w is the candidate for the next word. Consider the sentence "This article is on …". If we want to calculate the probability of the next word being "NLP", the probability can be expressed as:

p("NLP" | "This", "article", "is", "on")

To generalize, the conditional probability of the fifth word given the first four can be

written as:

p(w5∣w1,w2,w3,w4) or p(W)=p(wn∣w1,w2,…,wn−1)

This is calculated using the chain rule of probability:

P(A∣B)=P(A∩B)/P(B) and P(A∩B)=P(A∣B)P(B)

Now generalize this to sequence probability:

P(X1,X2,…,Xn)=P(X1)P(X2∣X1)P(X3∣X1,X2)…P(Xn∣X1,X2,…,Xn−1)

This yields:

P(w1,w2,w3,…,wn)=∏i P(wi∣w1,w2,…,wi−1)

By applying Markov assumptions, which propose that the future state depends only on the

current state and not on the sequence of events that preceded it, we simplify the formula:

P(wi∣w1,w2,…,wi−1)≈P(wi∣wi−k,…,wi−1)

For a unigram model (k = 0), this simplifies further to:

P(w1, w2, …, wn) ≈ ∏i P(wi)

And for a bigram model (k = 1):

P(wi | w1, w2, …, wi−1) ≈ P(wi | wi−1)

Limitations of N-gram Model in NLP


The N-gram language model also has some limitations. One problem is out-of-vocabulary (OOV) words: words that appear during testing but not in the training data. One solution is to use a fixed vocabulary and convert out-of-vocabulary words in the training data to pseudowords (e.g., <UNK>).

When applied to sentiment analysis, the bigram model outperformed the unigram model, but the number of features then doubled. So scaling the N-gram model to larger datasets, or moving to higher orders, requires better feature-selection approaches. The N-gram model also captures long-distance context poorly; it has been shown that beyond about 6-grams the performance gain is limited.
2. Language Model Evaluation
Language Model (LM) evaluation is a critical aspect of Natural Language Processing (NLP)

that involves assessing the performance, quality, and effectiveness of language models.

Language models are designed to understand, generate, and manipulate human language, and

their evaluation ensures they meet desired standards for specific tasks.

1. Types of Language Models

Language models can be categorized based on their architecture and purpose:

 Statistical Language Models: Traditional models like n-grams.

 Neural Language Models: Modern models like RNNs, LSTMs, Transformers (e.g.,

GPT, BERT).

 Pre-trained Language Models: Models like GPT, BERT, T5, and others fine-tuned for

specific tasks.

2. Evaluation Metrics

The choice of evaluation metrics depends on the task the language model is designed for.

Common tasks include:

 Text Generation

 Text Classification

 Machine Translation

 Question Answering

 Summarization

 Language Understanding

3. Evaluation Metrics Methods:

1. Intrinsic Evaluation

2. Extrinsic Evaluation

3. Qualitative Evaluation

4. Bias and Fairness Evaluation

5. Efficiency and Scalability Evaluation

6. Robustness Evaluation

7. Long-Term Evaluation
1. Intrinsic Evaluation

Intrinsic evaluation measures the performance of the model based on its ability to generate

or understand text. It is usually done by using standardized datasets or predefined tasks

and does not involve real-world applications directly.

Common Methods:

 Perplexity: Measures how well a model predicts a sample. Perplexity is the inverse probability of the test set, normalized by the number of words. A lower perplexity indicates a better model:

PP(W) = ( ∏i 1 / P(wi) )^(1/N)

where P(wi) is the probability of the i-th word in the sequence, and N is the total number of words. (A small computational sketch follows this list.)

 Accuracy: For classification tasks, accuracy is a common evaluation metric, indicating

how many predictions are correct.

 BLEU (Bilingual Evaluation Understudy): Common in machine translation, BLEU

compares n-grams of the generated text with reference text, evaluating the overlap.

It is particularly useful for evaluating translation quality.

 ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Mainly used for text

summarization tasks. It measures the overlap between n-grams in the generated and

reference summaries.

 F1 Score: The harmonic mean of precision and recall, especially important when the

classes are imbalanced, such as in text classification or named entity recognition

(NER).

 Accuracy and Loss: In tasks like text classification, the accuracy (percentage of

correct predictions) and loss (the difference between predicted probabilities and

true labels) are key metrics for evaluation.
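A minimal sketch of the perplexity computation referenced above, assuming we already have the model's probability for each word of a test sample; the per-word probabilities here are invented for illustration:

import math

def perplexity(word_probs):
    # PP = (product of 1/P(w_i)) ** (1/N), computed in log space for numerical stability
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Hypothetical per-word probabilities assigned by a model to a 4-word test sample.
print(perplexity([0.1, 0.2, 0.05, 0.3]))   # lower values indicate a better model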

2. Extrinsic Evaluation

Extrinsic evaluation assesses how well the model performs on a specific task or application,

providing real-world relevance. This kind of evaluation uses the model as a component of a

larger system and measures its impact on the overall system's performance.
Common Methods:

 Task-Specific Performance: This includes evaluating the model on tasks like

sentiment analysis, named entity recognition (NER), part-of-speech tagging, or

document classification, among others.

 Human Evaluation: For tasks like text generation, summarization, or translation,

human annotators may assess the quality of generated outputs. This can include

evaluating fluency, coherence, relevance, and informativeness.

 End-User Impact: In some cases, the performance of the language model is

measured by how it impacts users or their workflows, such as speed, accuracy, and

user satisfaction in applications like chatbots or virtual assistants.

3. Qualitative Evaluation

Qualitative evaluation involves analyzing the behavior of the language model in more

subjective terms. This helps assess the model's ability to handle diverse inputs, such as

slang, ambiguity, or rare scenarios.

Common Methods:

 Error Analysis: Reviewing specific model errors to understand its weaknesses. This

might include misclassifications, false positives, or generated text that lacks

coherence.

 Human Judgments: Experts or users assess outputs based on factors like relevance,

fluency, diversity, and creativity, which can't always be captured by automated

metrics like BLEU or perplexity.

 Interpretability and Explainability: In some domains (e.g., healthcare, law),

understanding why a model makes certain decisions is important. Tools like LIME

(Local Interpretable Model-agnostic Explanations) or SHAP (Shapley Additive

Explanations) can be used to interpret the model's predictions.

4. Bias and Fairness Evaluation

Bias evaluation is crucial to ensure that models do not perpetuate or amplify harmful

stereotypes. Language models can inherit societal biases from the data they are trained on,

which may lead to biased outputs in tasks such as sentiment analysis, gender prediction, and

job candidate screening.


Common Methods:

 Bias Detection in Outputs: Evaluating model outputs for biased language, such as

gender, racial, or cultural bias. This is especially important when the model is

deployed in real-world applications affecting diverse communities.

 Fairness Metrics: These are metrics designed to ensure that a model's predictions

do not unfairly favor one group over another. Metrics like demographic parity and

equalized odds can be used in classification tasks to evaluate fairness.

5. Efficiency and Scalability Evaluation

Given the resource-intensive nature of many large-scale language models (e.g., GPT-4),

efficiency and scalability evaluations are also critical.

Common Methods:

 Inference Speed: Evaluating how fast the model can generate predictions or process

inputs. This is essential for real-time applications like conversational AI.

 Memory and Computational Cost: Measuring how much memory and computational

resources the model consumes, which is vital when deploying models at scale or on

edge devices.

6. Robustness Evaluation

Robustness refers to how well a model performs in the face of noisy or adversarial inputs.

Common Methods:

 Adversarial Testing: Introducing adversarial examples or noisy inputs to evaluate

the model's performance under challenging conditions.

 Out-of-Distribution (OOD) Testing: Evaluating the model on inputs that differ from

the training data distribution. This helps assess how well the model generalizes to

unseen scenarios.

7. Long-Term Evaluation

In real-world applications, models are often deployed over time and must be continuously

evaluated to track their performance, as it may degrade due to changing data or

environments.

Common Methods:

 A/B Testing: Running multiple model versions concurrently and testing their

performance on different user segments to evaluate improvements or regressions.


 User Feedback: Collecting real-world feedback from users to assess model

performance over time and ensure the system remains effective.

Common Metrics

1. Perplexity:

o Measures how well a language model predicts a sample.

o Lower perplexity indicates better performance.

o Commonly used for evaluating generative models.

2. BLEU (Bilingual Evaluation Understudy):

o Used for evaluating machine translation and text generation.

o Measures the overlap between generated text and reference text using n-

grams.

3. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

o Used for summarization and text generation tasks.

o Focuses on recall (overlap of n-grams between generated and reference text).

4. METEOR:

o Evaluates machine translation by considering synonym matching, stemming, and

word order.

5. Accuracy:

o Used for classification tasks to measure the percentage of correct

predictions.

6. F1 Score:

o Balances precision and recall, useful for tasks like named entity recognition

(NER) and sentiment analysis.

7. Human Evaluation:

o Involves human judges to assess fluency, coherence, and relevance of

generated text.

o Subjective but essential for tasks like dialogue systems and creative text

generation.

8. BERTScore:

o Uses contextual embeddings from BERT to evaluate text generation by

comparing semantic similarity.


9. Diversity and Novelty:

o Measures the variety of generated text to avoid repetitive or generic outputs.

10. Task-Specific Metrics:

o For example, Exact Match (EM) for question answering or CIDEr for image

captioning.

4. Challenges in Language Model Evaluation

 Subjectivity: Tasks like text generation or summarization often require human

judgment, which can be subjective.

 Bias and Fairness: Models may exhibit biases present in training data, leading to

unfair or harmful outputs.

 Generalization: Models may perform well on benchmarks but fail in real-world

scenarios.

 Interpretability: Understanding why a model makes certain predictions can be

difficult, especially for deep learning models.

 Overfitting to Benchmarks: Models may optimize for specific evaluation metrics

without improving overall language understanding.

5. Evaluation Benchmarks and Datasets

To standardize evaluation, researchers use benchmark datasets and tasks:

 GLUE (General Language Understanding Evaluation): A collection of tasks for

evaluating language understanding.

 SuperGLUE: A more challenging version of GLUE.

 SQuAD (Stanford Question Answering Dataset): For evaluating question answering

systems.

 WMT (Workshop on Machine Translation): For machine translation evaluation.

 Common Crawl, Wikipedia, and BookCorpus: Often used for pre-training language

models.

6. Emerging Trends in Evaluation

 Zero-Shot and Few-Shot Evaluation: Assessing models' ability to generalize to

unseen tasks with minimal examples.

 Robustness Testing: Evaluating models on adversarial or out-of-distribution data.

 Ethical Evaluation: Assessing models for fairness, bias, and ethical considerations.
 Multimodal Evaluation: Evaluating models that process both text and other

modalities (e.g., images, audio).

3. Parameter Estimation
Parameter estimation in language modeling is a fundamental task in Natural Language

Processing (NLP). It involves determining the parameters of a statistical or neural language

model to accurately predict the probability distribution of word sequences.

What is a Language Model?

A language model assigns probabilities to sequences of words. For example, given a sequence

of words w1,w2,…,wn the model estimates:

P(w1,w2,…,wn)

This probability can be used for tasks like text generation, machine translation, and speech

recognition.

Types of Language Models

 N-gram Models: These models estimate probabilities based on the frequency of word

sequences (n-grams) in a corpus.

 Neural Language Models: These use neural networks (e.g., RNNs, LSTMs,

Transformers) to model word sequences and capture complex dependencies.

Parameter Estimation

Parameter estimation involves learning the model's parameters from training data. The

goal is to maximize the likelihood of the observed data under the model.

1. Maximum Likelihood Estimation (MLE)

2. Bayesian Parameter Estimation

3. Large-Scale Language Models


3.1 Maximum Likelihood Estimation (MLE)

MLE is a method for estimating the parameters of a statistical model by maximizing the

likelihood of the observed data. In language modeling, MLE is used to estimate the

probabilities of n-grams (sequences of n words) based on their frequencies in the training

corpus.

MLE for N-gram Models

 For n-gram models, MLE estimates probabilities by counting n-gram frequencies in the training data:

P(wn | w1, …, wn−1) = Count(w1, …, wn) / Count(w1, …, wn−1)

 Count: the number of times the n-gram or (n−1)-gram appears in the training data.

Example

Suppose we have a bigram model (n=2) and the following training corpus:

the cat sat on the mat

the dog sat on the cat

 The bigram "the cat" appears twice.


 The unigram "the" appears four times.
 Using MLE, the probability of "cat" given "the" is:

P("cat" | "the") = Count("the cat") / Count("the") = 2 / 4 = 0.5
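The same counts and the MLE estimate can be reproduced with a short Python sketch over the toy corpus above:

from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the cat"]
sentences = [s.split() for s in corpus]

unigrams = Counter(w for sent in sentences for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in sentences for i in range(len(sent) - 1))

def mle_bigram(prev, word):
    # P(word | prev) = Count(prev word) / Count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(mle_bigram("the", "cat"))   # 2 / 4 = 0.5
print(bigrams[("the", "flood")])  # 0 -> MLE assigns zero probability to unseen bigrams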

Limitations of MLE

 Zero Probability Problem: If an n-gram does not appear in the training data, its

probability is estimated as zero. This is problematic because it assigns zero probability

to unseen but valid sequences.

 Overfitting: MLE tends to overfit to the training data, especially for rare n-grams.

Smoothing Techniques

Smoothing techniques address the limitations of MLE by redistributing probability mass to

account for unseen or rare n-grams. Here are some common smoothing methods:

 Laplace (Add-One) Smoothing: Add 1 to all counts.


 Good-Turing Smoothing: Adjust counts for rare events.
 Kneser-Ney Smoothing: A more advanced method that considers the context of n-
grams.
1. Laplace (Add-One) Smoothing

 Add 1 to the count of every n-gram (seen and unseen).

 Adjust the denominator to account for the added counts:

P(wn | wn−1) = (Count(wn−1, wn) + 1) / (Count(wn−1) + V)

V: vocabulary size (total number of unique words).

Example:

 Using the same corpus as above, suppose the vocabulary size V = 6.

 The smoothed probability of "cat" given "the" is:

P("cat" | "the") = (Count("the cat") + 1) / (Count("the") + V) = (2 + 1) / (4 + 6) = 0.3

Limitation: Adds too much probability mass to unseen events, which can distort the distribution.
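A self-contained sketch of add-one smoothing on the same toy corpus (vocabulary size V = 6):

from collections import Counter

sentences = [s.split() for s in ["the cat sat on the mat", "the dog sat on the cat"]]
unigrams = Counter(w for sent in sentences for w in sent)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigrams)   # 6 unique words

def laplace_bigram(prev, word):
    # (Count(prev word) + 1) / (Count(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(laplace_bigram("the", "cat"))   # (2 + 1) / (4 + 6) = 0.3
print(laplace_bigram("the", "sat"))   # unseen bigram still gets (0 + 1) / 10 = 0.1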

2. Good-Turing Smoothing

 Estimates the probability of n-grams based on the frequency of their occurrence.

 Replaces the count of an n-gram with a smoothed count:

c* = (c + 1) · Nc+1 / Nc

o c: original count of the n-gram.

o Nc: number of n-grams with count c.

Example:

 If there are 10 bigrams that appear once (N1 = 10) and 5 bigrams that appear twice (N2 = 5), the smoothed count for bigrams with c = 1 is:

c* = (1 + 1) · N2 / N1 = 2 · 5 / 10 = 1

Advantage: Better handles rare events compared to Laplace smoothing.
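A tiny sketch of the adjusted-count formula, using the frequency-of-frequency numbers from the example (N1 = 10, N2 = 5):

def good_turing_count(c, freq_of_freq):
    # c* = (c + 1) * N_{c+1} / N_c
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

freq_of_freq = {1: 10, 2: 5}               # 10 bigrams seen once, 5 seen twice
print(good_turing_count(1, freq_of_freq))  # (1 + 1) * 5 / 10 = 1.0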


3. Kneser-Ney Smoothing

 A more advanced smoothing technique that considers the context of n-grams.

 Uses a discounting factor to redistribute probability mass to unseen n-grams.

 For a bigram model, the probability is calculated as:

P(wi | wi−1) = max(Count(wi−1, wi) − d, 0) / Count(wi−1) + λ(wi−1) · Pcontinuation(wi)

where d is the discount, λ(wi−1) is a normalizing weight, and Pcontinuation(wi) is proportional to the number of distinct contexts in which wi appears.

Comparison of Smoothing Techniques

 MLE: directly uses observed counts. Advantages: simple and intuitive. Limitations: fails for unseen n-grams; overfits to training data.

 Laplace: adds 1 to all counts. Advantages: handles zero probabilities. Limitations: overestimates the probability of rare events.

 Good-Turing: adjusts counts based on the frequency of frequencies. Advantages: better for rare events. Limitations: computationally expensive for large datasets.

 Kneser-Ney: uses discounting and continuation probabilities. Advantages: state-of-the-art for n-gram models; captures context diversity. Limitations: more complex to implement.

3.2 Bayesian Parameter Estimation

Bayesian Parameter Estimation is a probabilistic approach to estimating the

parameters of a model by incorporating prior knowledge and updating it with observed data.

Unlike Maximum Likelihood Estimation (MLE), which only relies on the observed data,

Bayesian estimation uses Bayes' Theorem to combine prior beliefs with evidence from the

data. This approach is particularly useful in language modeling when dealing with limited or

sparse data.
Key Concepts in Bayesian Parameter Estimation

Bayes' Theorem

Bayesian estimation is based on Bayes' Theorem, which describes how to update the probability of a hypothesis (or parameter) θ given new evidence (the data D):

P(θ | D) = P(D | θ) · P(θ) / P(D)

Here P(θ) is the prior, P(D | θ) is the likelihood, P(θ | D) is the posterior, and P(D) is the evidence.

Prior Distribution

The prior represents our initial belief about the parameters before observing any data. For
example:

 In language modeling, we might assume that all n-grams are equally likely (uniform
prior).

 Alternatively, we might use a more informed prior based on domain knowledge.

Posterior Distribution

The posterior combines the prior and the likelihood to provide an updated estimate of the
parameters. It represents our belief about the parameters after observing the data.

Predictive Distribution

Once the posterior is computed, we can use it to make predictions about new data:

P(x_new | D) = ∫ P(x_new | θ) · P(θ | D) dθ
Bayesian Parameter Estimation in Language Modeling

In language modeling, Bayesian estimation is often used to estimate the probabilities


of n-grams or other linguistic patterns. Here's how it works:

Dirichlet Prior for Multinomial Distributions

 In n-gram models, the parameters are multinomial distributions over words or n-grams.

 A common choice for the prior is the Dirichlet distribution, which is the conjugate
prior for the multinomial distribution. This means the posterior will also be a Dirichlet
distribution, making computations tractable.

The Dirichlet distribution is parameterized by a vector α=(α1,α2,…,αV), where V is the


vocabulary size. The prior is:

P(θ)=Dirichlet(θ; α)

Posterior Distribution

After observing the data D, the posterior distribution is:

P(θ∣D) =Dirichlet (θ; α+c)

c= (c1, c2,…,cV): Counts of each word or n-gram in the data.

Predictive Probability

The predictive probability of a word wi given its context is:

P(wi | context, D) = (ci + αi) / Σj (cj + αj)

Example: Bayesian Estimation for Bigram Models

Suppose we have a bigram model and a vocabulary of size V=3 (words: A, B, C).
We use a Dirichlet prior with α=(1,1,1) (uniform prior).

Observed Data

ABAC

BAC

 Counts: c(A,B)=1 c(A,C)=1, c(B,A)=1, c(B,C)=1


Posterior Distribution

For the context "A", with counts c(A, A) = 0, c(A, B) = 1, c(A, C) = 1, the posterior parameters are:

α′ = α + c = (1 + 0, 1 + 1, 1 + 1) = (1, 2, 2)

Predictive Probability

The probability of the next word given "A" is:

P(A | A) = 1/5 = 0.2,  P(B | A) = 2/5 = 0.4,  P(C | A) = 2/5 = 0.4
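The posterior-predictive calculation for context "A" can be written out as a short sketch, using the uniform prior and the counts stated above:

from collections import Counter

alpha = {"A": 1, "B": 1, "C": 1}            # uniform Dirichlet prior
counts_after_A = Counter({"B": 1, "C": 1})  # bigram counts with "A" as the context

def predictive(word):
    # P(word | context "A", data) = (count + alpha) / sum over vocab of (count + alpha)
    total = sum(counts_after_A[w] + alpha[w] for w in alpha)
    return (counts_after_A[word] + alpha[word]) / total

for w in alpha:
    print(w, predictive(w))   # A: 0.2, B: 0.4, C: 0.4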

3.3 Large-Scale Language Models (LLMs):

Large-Scale Language Models (LLMs) represent a significant advancement in Natural


Language Processing (NLP), leveraging massive amounts of data and computational resources to
achieve state-of-the-art performance on a wide range of tasks. Parameter estimation in these
models involves learning billions (or even trillions) of parameters from vast datasets

Parameter Estimation in Large-Scale Language Models

Model Architecture

 Transformer: The core architecture of LLMs, consisting of:

o Self-Attention Mechanisms: Capture dependencies between words in a


sequence.

o Feedforward Layers: Transform the representations.

o Layer Normalization and Residual Connections: Stabilize training.

 Parameters: Include weights and biases in attention layers, feedforward layers,


embeddings, and positional encodings.

Pretraining Objective

 Autoregressive Models (e.g., GPT): Predict the next word in a sequence given the
previous words.

o Objective: Maximize the likelihood of the next word:

L = − Σi=1..N log P(wi | w<i)

 Masked Language Models (e.g., BERT): Predict masked words in a sequence.

o Objective: Maximize the likelihood of the masked words:

L = − Σi∈masked log P(wi | wcontext)
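A minimal PyTorch sketch of the autoregressive objective, assuming torch is installed; the shapes and token ids below are invented. The loss is the average negative log-likelihood of the true next token under the model's predicted distribution:

import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(1, 4, vocab_size)   # hypothetical LM outputs for 4 positions
targets = torch.tensor([[2, 5, 5, 9]])   # the actual "next word" at each position

# cross-entropy = -(1/N) * sum_i log P(w_i | w_<i)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())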
Optimization

 Stochastic Gradient Descent (SGD): Used to minimize the loss function.

 Adaptive Optimizers: Techniques like Adam or AdamW are commonly used to handle
large-scale optimization efficiently.

 Learning Rate Scheduling: Dynamic adjustment of the learning rate during training
(e.g., warmup followed by decay).

Distributed Training

 Data Parallelism: Split the data across multiple GPUs or TPUs.

 Model Parallelism: Split the model across devices.

 Mixed Precision Training: Use lower precision (e.g., FP16) to speed up computation and
reduce memory usage.

Challenges in Parameter Estimation for LLMs

Computational Resources

 Hardware: Requires high-performance GPUs or TPUs.

 Memory: Large models require significant memory for storing parameters and
intermediate activations.

 Training Time: Pretraining can take weeks or months.

Data Requirements

 Large Corpora: Models are trained on terabytes of text data.

 Data Quality: High-quality, diverse datasets are essential for generalization.

Overfitting

 Despite the large scale, models can still overfit to the training data, especially in fine-
tuning.

Scalability

 Scaling up models (e.g., increasing the number of parameters) requires careful


engineering to maintain efficiency.

Examples of Large-Scale Language Models

1. GPT (Generative Pre-trained Transformer)

 Architecture: Autoregressive Transformer.

 Pretraining Objective: Predict the next word in a sequence.


 Applications: Text generation, summarization, question answering.

2. BERT (Bidirectional Encoder Representations from Transformers)

 Architecture: Bidirectional Transformer.

 Pretraining Objective: Predict masked words in a sequence.

 Applications: Sentence classification, named entity recognition, sentiment analysis.

3. T5 (Text-to-Text Transfer Transformer)

 Architecture: Encoder-decoder Transformer.

 Pretraining Objective: Convert all tasks into a text-to-text format.

 Applications: Machine translation, summarization, question answering.


4. Language Model Adaptation
Language Model Adaptation refers to the process of modifying a pre-trained language
model to better suit a specific task, domain, or dataset. This is particularly important in NLP
because pre-trained language models (e.g., GPT, BERT) are typically trained on large,
general-purpose corpora, and their performance can be significantly improved by adapting
them to the nuances of a specific domain or task.

Why Adapt Language Models?

Pre-trained language models are powerful, but they may not perform optimally in specific
scenarios due to:

 Domain Mismatch: The pre-training corpus may differ significantly from the target
domain (e.g., medical, legal, or technical text).

 Task-Specific Requirements: Downstream tasks (e.g., sentiment analysis, named


entity recognition) may require specialized knowledge or fine-grained understanding.

 Data Scarcity: The target domain or task may have limited labeled data, making it
difficult to train a model from scratch.

Adaptation bridges the gap between general-purpose pre-training and task-specific


requirements.

Techniques for Language Model Adaptation

Fine-Tuning

 What it is: Fine-tuning involves continuing the training of a pre-trained model on a


smaller, task-specific dataset.

 How it works:

o Initialize the model with pre-trained weights.

o Train the model on the target dataset using a task-specific loss function (e.g.,
cross-entropy for classification).

 Advantages:

o Leverages general knowledge from pre-training.

o Adapts the model to the specific task or domain.

 Challenges:

o Risk of overfitting if the target dataset is small.

o Computationally expensive for very large models.
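A hedged sketch of the fine-tuning recipe above, assuming a hypothetical pre-trained encoder module (anything that maps token ids to a fixed-size vector) and a labelled DataLoader; it adds a new classification head and trains with cross-entropy:

import torch
import torch.nn as nn

class ClassifierWithHead(nn.Module):
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder                           # pre-trained weights, loaded elsewhere
        self.head = nn.Linear(hidden_size, num_labels)   # new task-specific layer

    def forward(self, input_ids):
        features = self.encoder(input_ids)               # assumed shape: (batch, hidden_size)
        return self.head(features)

def fine_tune(model, dataloader, epochs=3, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for input_ids, labels in dataloader:
            optimizer.zero_grad()
            loss = loss_fn(model(input_ids), labels)     # task-specific loss
            loss.backward()
            optimizer.step()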


Domain Adaptation

 What it is: Adapting a pre-trained model to a specific domain (e.g., medical, legal, or
financial text).

 Techniques:

o Continued Pretraining: Further pre-train the model on domain-specific


unlabeled data before fine-tuning.

o Domain-Specific Vocabulary: Adjust the tokenizer to include domain-specific


terms.

 Example: Adapting BERT to the biomedical domain by pre-training on PubMed


articles.

Task-Specific Adaptation

 What it is: Adapting a pre-trained model to a specific task (e.g., sentiment analysis,
machine translation).

 Techniques:

o Add task-specific layers (e.g., a classification head for sentiment analysis).

o Fine-tune the entire model or only the task-specific layers.

 Example: Adding a linear layer on top of BERT for text classification.

Parameter-Efficient Adaptation

 What it is: Adapting a model with minimal changes to its parameters to reduce
computational cost.

 Techniques:

o Adapter Modules: Insert small, trainable layers between the pre-trained


layers.

o LoRA (Low-Rank Adaptation): Decompose weight updates into low-rank


matrices.

o Prompt Tuning: Modify the input prompt to guide the model's behavior
without changing its parameters.

 Advantages:

o Reduces memory and computational requirements.

o Enables adaptation for multiple tasks without retraining the entire model.
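A minimal sketch of the LoRA idea (not a production implementation): the pre-trained weight is frozen and only a low-rank update B·A is trained, so the number of trainable parameters drops to r·(d_in + d_out) per adapted layer. The layer sizes below are assumptions for illustration:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear, r=8, alpha=16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False                   # freeze the pre-trained weights
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: no change at the start
        self.scale = alpha / r

    def forward(self, x):
        # original output plus the scaled low-rank update
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trainable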
Few-Shot and Zero-Shot Adaptation

 What it is: Adapting a model to a new task with very few or no labeled examples.

 Techniques:

o Prompt Engineering: Design input prompts to elicit desired outputs from the
model.

o Meta-Learning: Train the model to quickly adapt to new tasks with minimal
data.

 Example: Using GPT-3 for zero-shot text classification by crafting appropriate


prompts.

Challenges in Language Model Adaptation

Overfitting

 Fine-tuning on small datasets can lead to overfitting.

 Solution: Use regularization techniques like dropout or weight decay.

Catastrophic Forgetting

 The model may forget general knowledge learned during pre-training.

 Solution: Use techniques like elastic weight consolidation (EWC) or replay buffers.

Computational Cost

 Fine-tuning large models requires significant computational resources.

 Solution: Use parameter-efficient methods like adapters or LoRA.

Data Scarcity

 Limited labeled data in the target domain can hinder adaptation.

 Solution: Use semi-supervised learning or data augmentation techniques.

Applications of Language Model Adaptation

Domain-Specific Applications

 Medical NLP: Adapting models to clinical text for tasks like diagnosis prediction or
medical entity recognition.

 Legal NLP: Adapting models to legal documents for tasks like contract analysis or
case law summarization.

 Financial NLP: Adapting models to financial reports for tasks like sentiment analysis
or risk assessment.
Task-Specific Applications

 Sentiment Analysis: Fine-tuning a model to classify text as positive, negative, or


neutral.

 Machine Translation: Adapting a model to translate between specific language pairs.

 Question Answering: Fine-tuning a model to answer questions based on a given


context.

Multilingual Adaptation

 Adapting models to work across multiple languages, especially low-resource


languages.
5. Types of Language Models
Language models are a core component of Natural Language Processing (NLP) and are
used to predict the probability of a sequence of words. They can be categorized based on their
architecture, training objectives, and applications.

1. Class-based language models


Class-based language models are a type of statistical language model that groups words
into classes or categories based on certain criteria, such as semantic similarity, syntactic role,
or frequency. These models aim to reduce the sparsity problem in traditional n-gram models
by generalizing over word classes rather than individual words. This approach can improve
generalization, especially in cases where the training data is limited.

How Class-Based Language Models Work

1. Word Classification: Words are grouped into classes based on shared characteristics.
For example, all days of the week might be grouped into a single class, or all verbs might
be grouped into another class.

2. Probability Estimation: Instead of estimating the probability of a word given its history
(as in traditional n-gram models), the model estimates the probability of a word class
given its history, and then the probability of the word given its class.

3. Smoothing: By grouping words into classes, the model can better handle rare or unseen
words, as the probability of a class is often more stable than the probability of an
individual word.

Example of a Class-Based Language Model

Let's consider a simple example where we have a small vocabulary and we want to
build a class-based bigram model.

Vocabulary:

 Words: "cat", "dog", "mouse", "runs", "jumps", "sleeps"

 Classes:

o Class 1 (Animals): "cat", "dog", "mouse"

o Class 2 (Actions): "runs", "jumps", "sleeps"


Bigram Probabilities:

 Instead of estimating the probability of "dog" given "cat" (P("dog" | "cat")),


we estimate:

o The probability of Class 1 given Class 1 (P(Class 1 | Class 1))

o The probability of "dog" given Class 1 (P("dog" | Class 1))

Example Calculation:

1. Class Transition Probability:

P(Class 2 | Class 1) = 0.7 (e.g., the probability that an action follows an animal)

2. Word Emission Probability:

P("runs" | Class 2) = 0.4 (e.g., the probability of the word "runs" given that we are
in the action class)

Generating a Sentence:

 Start with Class 1 (Animals): Choose "cat" (P("cat" | Class 1) = 0.5)

 Transition to Class 2 (Actions): P(Class 2 | Class 1) = 0.7

 Choose "runs" from Class 2: P("runs" | Class 2) = 0.4

The generated sentence could be: "cat runs"
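The calculation above can be written directly as a small sketch; the class-transition and emission probabilities are the illustrative numbers given in the example:

word_class = {"cat": "Animals", "dog": "Animals", "mouse": "Animals",
              "runs": "Actions", "jumps": "Actions", "sleeps": "Actions"}
class_transition = {("Animals", "Actions"): 0.7}
word_emission = {("cat", "Animals"): 0.5, ("runs", "Actions"): 0.4}

def class_bigram_prob(prev_word, next_word):
    c_prev, c_next = word_class[prev_word], word_class[next_word]
    # P(next_word | prev_word) ~= P(c_next | c_prev) * P(next_word | c_next)
    return class_transition[(c_prev, c_next)] * word_emission[(next_word, c_next)]

print(class_bigram_prob("cat", "runs"))   # 0.7 * 0.4 = 0.28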

2. Variable-length language models


Variable-length language models are a type of language model that can handle input and
output sequences of varying lengths. Unlike fixed-length models, which require inputs and
outputs to be of a specific size, variable-length models are more flexible and better suited
for tasks like text generation, translation, and summarization, where the length of the input
and output can vary significantly.

Key Characteristics of Variable-Length Language Models:

1. Dynamic Input/Output Handling: They can process sequences of any length, making
them adaptable to different tasks.

2. Recurrent or Transformer-Based Architectures: These models often use


architectures like RNNs (Recurrent Neural Networks), LSTMs (Long Short-Term
Memory), or Transformers, which are designed to handle sequences of varying lengths.

3. Attention Mechanisms: Transformers, for example, use self-attention to weigh the


importance of different parts of the input sequence, enabling them to handle variable-
length inputs effectively.
Example: Transformer-Based Variable-Length Language Model

A popular example of a variable-length language model is the Transformer architecture, which


powers models like GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder
Representations from Transformers).

How It Works:

1. Input Representation: The input text is tokenized into subwords or words, and each
token is converted into an embedding vector.

2. Positional Encoding: Since Transformers don't inherently understand the order of


tokens, positional encodings are added to the embeddings to provide information about
the position of each token in the sequence.

3. Self-Attention Mechanism: The model computes attention scores between all pairs of
tokens in the sequence, allowing it to capture dependencies regardless of their distance.

4. Output Generation: For tasks like text generation, the model predicts the next token
in the sequence iteratively, allowing it to generate sequences of arbitrary length.

Example Task: Text Generation

Suppose we want to generate a story starting with the prompt:

"Once upon a time"

A variable-length language model like GPT-3 would:

1. Take the input sequence ["Once", "upon", "a", "time"].

2. Generate the next token (e.g., ",").

3. Append the new token to the sequence and repeat the process until a stopping condition
is met (e.g., reaching a maximum length or generating an end-of-sequence token).

The output might look like:

"Once upon a time, in a faraway land, there lived a brave knight..."

The model can continue generating text indefinitely, demonstrating its ability to handle
variable-length sequences.

3. Discriminative language models

Discriminative language models are a type of language model that focuses on


distinguishing between different classes or categories of text. Unlike generative language
models, which aim to generate new text, discriminative models are trained to classify or
predict labels for given input text. They are widely used in tasks such as sentiment analysis,
text classification, named entity recognition, and more.
Key Characteristics of Discriminative Language Models:

1. Task-Specific: They are designed for specific tasks like classification or prediction.

2. Conditional Probability: They model the conditional probability P(y | x), where y is the label and x is the input text.

3. No Text Generation: They do not generate new text but instead predict labels or
categories for input text.

4. Supervised Learning: They require labeled data for training.

Examples of Discriminative Language Models:

1. Logistic Regression for Text Classification:

o A simple discriminative model that predicts the probability of a class label given
a text input.

o Example: Classifying emails as "spam" or "not spam."

2. Support Vector Machines (SVM):

o Used for text classification tasks like sentiment analysis.

o Example: Determining whether a movie review is "positive" or "negative."

3. Conditional Random Fields (CRF):

o Commonly used for sequence labeling tasks like named entity recognition (NER).

o Example: Identifying names, dates, and locations in a sentence.

4. BERT (Bidirectional Encoder Representations from Transformers):

o Although BERT is a pre-trained generative model, it can be fine-tuned for


discriminative tasks like text classification, question answering, and NER.

o Example: Fine-tuning BERT to classify news articles into categories like "sports,"
"politics," or "technology."

5. RoBERTa (Robustly Optimized BERT Approach):

o An optimized version of BERT, often used for discriminative tasks.

o Example: Sentiment analysis on social media posts.


Example Use Case: Sentiment Analysis

 Task: Classify movie reviews as "positive" or "negative."

 Model: Fine-tuned BERT for binary classification.

 Input: "The movie was a fantastic experience with brilliant performances."

 Output: "Positive"
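A minimal discriminative baseline for this use case, assuming scikit-learn is available; the tiny training set below is made up for illustration. TF-IDF features plus logistic regression model P(label | text) directly:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["fantastic experience, brilliant performances",
         "terrible plot and poor acting",
         "loved every minute of it",
         "boring and far too long"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)                     # learns P(y | x) from labelled examples
print(clf.predict(["The movie was a fantastic experience with brilliant performances."]))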

Comparison with Generative Language Models:

 Generative Models: Focus on generating text by modeling P(x) (e.g., GPT, Transformer-XL).

 Discriminative Models: Focus on predicting labels by modeling P(y | x).

4. Syntax-based language models

Syntax-based language models are a type of language model that incorporates syntactic
structure (grammar rules, sentence structure, etc.) into their predictions. These models go
beyond simple word sequence prediction and consider the grammatical relationships between
words in a sentence. They are particularly useful for tasks like parsing, grammar correction,
and generating syntactically correct sentences.
How Syntax-Based Language Models Work

Syntax-based models often use:

1. Parse Trees: Represent the syntactic structure of a sentence.

2. Grammar Rules: Define how words and phrases can be combined.

3. Probabilistic Context-Free Grammars (PCFGs): Assign probabilities to different


syntactic structures.

4. Dependency Parsing: Captures relationships between words (e.g., subject-verb-object).

These models can be integrated into neural networks or used in rule-based systems.

Example of a Syntax-Based Language Model

Let's consider a simple example using Probabilistic Context-Free Grammar (PCFG).

Grammar Rules:

1. S → NP VP (Sentence → Noun Phrase Verb Phrase)

2. NP → Det N (Noun Phrase → Determiner Noun)

3. VP → V NP (Verb Phrase → Verb Noun Phrase)

4. Det → "the" | "a"


5. N → "cat" | "dog"

6. V → "chased" | "ate"

Sentence Generation:

Using the above rules, the model can generate sentences like:

 "The cat chased the dog."

 "A dog ate the cat."

Parsing:

Given a sentence, the model can parse it into a syntactic structure:

 Input: "The cat chased the dog."

 Parse Tree:

S
/ \
NP VP
/ \ / \
Det N V NP
| | | / \
the cat chased Det N
| |
the dog
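The toy grammar above can be tried out with NLTK, assuming it is installed; the rule probabilities below are assumptions added for the sketch, since the example lists only the rules themselves:

import nltk

grammar = nltk.PCFG.fromstring("""
    S  -> NP VP   [1.0]
    NP -> Det N   [1.0]
    VP -> V NP    [1.0]
    Det -> 'the' [0.5] | 'a' [0.5]
    N  -> 'cat' [0.5] | 'dog' [0.5]
    V  -> 'chased' [0.5] | 'ate' [0.5]
""")

parser = nltk.ViterbiParser(grammar)              # finds the most probable parse
for tree in parser.parse("the cat chased the dog".split()):
    print(tree)                                   # (S (NP (Det the) (N cat)) (VP ...))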

5. Maximum Entropy (MaxEnt) language models

Maximum Entropy (MaxEnt) language models, also known as logistic regression models,
are a type of statistical model used in natural language processing (NLP) to predict the
probability of a sequence of words. These models are based on the principle of maximum
entropy, which states that the best model is the one that makes the least assumptions about
the data while still being consistent with the observed data.

Key Characteristics of MaxEnt Language Models:

1. Feature-Based: MaxEnt models use features extracted from the input data to make
predictions. These features can be anything from the presence of specific words to
more complex linguistic patterns.

2. Discriminative: Unlike generative models (e.g., n-gram models), MaxEnt models are
discriminative, meaning they directly model the conditional probability P(y∣x), where y is
the output (e.g., the next word) and x is the input (e.g., the previous words).
3. Flexible: MaxEnt models can incorporate a wide variety of features, making them highly
flexible and capable of capturing complex relationships in the data.

Example of a MaxEnt Language Model:

Suppose we want to build a MaxEnt model to predict the next word in a sentence. Let's
consider a simple example where we want to predict the next word after the sequence "I want
to".

Step 1: Define Features

We define a set of features that might be useful for predicting the next word. For example:

 The previous word is "to".

 The word before the previous word is "want".

 The word before that is "I".

Step 2: Collect Training Data

We collect a corpus of sentences and extract the features for each instance where the
sequence "I want to" appears. For each instance, we note the next word (the target) and the
features.

Step 3: Train the Model

We train the MaxEnt model using the collected data. The model learns the weights for
each feature, which indicate how important each feature is for predicting the next word.

Step 4: Make Predictions

Once the model is trained, we can use it to predict the next word given a new sequence. For
example, given the sequence "I want to", the model might predict:

 "go" with a probability of 0.4

 "eat" with a probability of 0.3

 "sleep" with a probability of 0.2

 etc.

Mathematical Formulation:

The MaxEnt model defines the conditional probability P(y | x) as:

P(y | x) = (1 / Z(x)) · exp( Σi λi fi(x, y) )

where:

 fi(x,y) are the feature functions.

 λi are the weights associated with each feature.

 Z(x) is the normalization factor to ensure the probabilities sum to 1.
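A small sketch of that formula with hypothetical feature weights; the λ values below are invented rather than learned, and a single indicator feature per candidate fires when the previous word is "to":

import math

candidates = ["go", "eat", "sleep"]
weights = {"go": 1.2, "eat": 0.9, "sleep": 0.5}   # hypothetical lambda_i values

def maxent_probs(prev_word):
    # score(y) = exp(sum_i lambda_i * f_i(x, y)); the feature fires only when prev_word == "to"
    scores = {y: math.exp(weights[y] if prev_word == "to" else 0.0) for y in candidates}
    z = sum(scores.values())                      # normalization factor Z(x)
    return {y: s / z for y, s in scores.items()}

print(maxent_probs("to"))   # roughly {'go': 0.45, 'eat': 0.33, 'sleep': 0.22}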

Advantages of MaxEnt Models:

 Incorporation of Diverse Features: MaxEnt models can incorporate a wide range of


features, including contextual, syntactic, and semantic information.

 Efficiency: They are computationally efficient to train and use, especially with large
datasets.

 Interpretability: The weights learned by the model can provide insights into the
importance of different features.

6. Factored language models

Factored language models are a type of statistical language model that incorporate
additional structure or factors beyond the standard n-gram models. These factors can include
morphological, syntactic, semantic, or even contextual information to improve the model's
ability to predict the next word in a sequence.

Example of a Factored Language Model

Let's consider a simple example where we want to predict the next word in a sentence
using a factored language model that incorporates both word forms and their part-of-speech
(POS) tags.

Sentence:

"The cat sat on the mat."

Step 1: Tokenization and POS Tagging

First, we tokenize the sentence and assign POS tags to each word:

 The (DT)

 cat (NN)

 sat (VBD)

 on (IN)

 the (DT)

 mat (NN)

 . (.)
Step 2: Define Factors

In this example, we define two factors:

1. Word Form: The actual word (e.g., "cat", "sat").

2. POS Tag: The part of speech of the word (e.g., "NN", "VBD").

Step 3: Create Factored N-grams

Instead of just using the word forms to create n-grams, we use both the word
forms and their POS tags. For example, a bigram model would consider pairs of (word
form, POS tag) sequences.

Step 4: Model Training

We train the model on a corpus where each word is represented as a combination of its
form and POS tag. For example, the sentence "The cat sat on the mat." would be represented
as:

 (The, DT)

 (cat, NN)

 (sat, VBD)

 (on, IN)

 (the, DT)

 (mat, NN)

 (., .)

Step 5: Prediction

When predicting the next word, the model considers both the previous word's form and
its POS tag. For example, if the previous word was "the" (DT), the model might predict that
the next word is likely to be a noun (NN) like "cat" or "mat".

Advantages of Factored Language Models

1. Improved Generalization: By incorporating additional linguistic information, the model


can generalize better to unseen data.

2. Better Handling of Rare Words: POS tags can help the model make better predictions
even for rare words by leveraging their syntactic role.

3. Contextual Understanding: Factors like POS tags provide syntactic context, which can
improve the model's understanding of sentence structure.
Example Prediction

Given the sequence:

 (The, DT)

 (cat, NN)

The model might predict that the next word is likely to be a verb (VBD) like "sat" or "jumped",
based on the learned patterns from the training data.
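A compact sketch of factored bigram counting over (word form, POS tag) pairs, using the single tagged sentence from the example (so the counts are trivially small):

from collections import Counter

tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", ".")]

# Factored bigrams condition on both the previous word form and its POS tag.
pair_bigrams = Counter(zip(tagged, tagged[1:]))
contexts = Counter(tagged[:-1])

def factored_prob(prev_pair, next_pair):
    return pair_bigrams[(prev_pair, next_pair)] / contexts[prev_pair]

print(factored_prob(("the", "DT"), ("mat", "NN")))   # 1.0 in this one-sentence corpus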

7. Other Tree-based language models

Tree-based language models are a class of models that leverage hierarchical structures
(trees) to represent and process language. Unlike sequential models like RNNs or
Transformers, tree-based models explicitly capture syntactic or semantic relationships in a
hierarchical manner. Below are some examples of tree-based language models and their
applications:

1. Recursive Neural Networks (RNNs)

 Description: Recursive Neural Networks process data in a tree structure, where each
node represents a word or phrase, and the model recursively combines child nodes to
form parent nodes.

 Example:

o Constituency Parsing: Building a parse tree for a sentence by recursively


combining words into phrases.

o Sentiment Analysis: Assigning sentiment scores to phrases and aggregating them


up the tree to determine the overall sentiment of a sentence.

2. Tree-LSTMs

 Description: Tree-LSTMs are an extension of Long Short-Term Memory (LSTM)


networks designed to operate on tree structures. They capture dependencies in
hierarchical data more effectively than sequential LSTMs.

 Example:

o Dependency Parsing: Encoding dependency trees to predict syntactic


relationships between words.

o Semantic Relatedness: Measuring the similarity between two sentences by


comparing their Tree-LSTM representations.
3. Grammar-Based Models

 Description: These models use formal grammars (e.g., Context-Free Grammars or


Probabilistic Context-Free Grammars) to generate or parse sentences based on tree
structures.

 Example:

o Syntax-Aware Language Modeling: Generating sentences that adhere to specific


grammatical rules.

o Machine Translation: Using grammar-based trees to align source and target


language structures.

4. Neural Variational Inference for Tree Structures

 Description: These models use variational inference to learn latent tree structures from
text data, often in an unsupervised manner.

 Example:

o Unsupervised Parsing: Inferring syntactic trees without labeled data.

o Text Generation: Generating text by sampling from latent tree structures.

5. Transformer-Based Tree Models

 Description: Some recent models combine Transformers with tree structures to


leverage the strengths of both approaches. For example, they might use attention
mechanisms over tree nodes instead of sequential tokens.

 Example:

o Syntax-Infused Transformers: Enhancing Transformer models with syntactic


tree information for better language understanding.

o Code Generation: Using tree structures to represent abstract syntax trees


(ASTs) for programming languages.

6. Abstract Syntax Tree (AST) Based Models

 Description: These models are specifically designed for programming languages, where
the input is represented as an Abstract Syntax Tree (AST).

 Example:

o Code Completion: Predicting the next token or node in a program's AST.

o Bug Detection: Analyzing ASTs to identify potential bugs in code.


7. Neural Symbolic Machines

 Description: These models combine neural networks with symbolic reasoning, often using
tree structures to represent logical forms or programs.

 Example:

o Semantic Parsing: Mapping natural language to executable logical forms (e.g., SQL
queries).

o Program Synthesis: Generating code from natural language descriptions.

8. Bayesian topic-based language models

Bayesian topic-based language models are a class of probabilistic models that


incorporate topic modeling into language modeling. These models aim to capture the latent
semantic structure of text by representing documents as mixtures of topics, where each topic
is a distribution over words. By integrating topics into language models, they can generate more
coherent and contextually relevant text.

One popular approach is the Latent Dirichlet Allocation (LDA) model, which is a
generative probabilistic model that assumes documents are generated from a mixture of
topics. However, LDA itself is not a language model. To create a topic-based language model,
LDA can be combined with traditional language models like n-grams or neural language models.

Example: Bayesian Topic-Based Language Model

Let's consider a simplified example of a Bayesian topic-based language model. Suppose


we have a corpus of documents, and we want to generate text that is both syntactically correct
and semantically coherent.

Step 1: Topic Modeling with LDA

1. Input: A corpus of documents.

2. Output: A set of topics, where each topic is a distribution over words, and each
document is a distribution over topics.

For example, suppose we have three topics:

 Topic 1: {science, research, experiment, data}

 Topic 2: {sports, game, player, team}

 Topic 3: {politics, government, election, policy}


Step 2: Language Modeling

We can use a traditional language model (e.g., a bigram or trigram model) to capture the
syntactic structure of the text. However, instead of using a single language model for the
entire corpus, we can create a separate language model for each topic.

For example:

 Language Model for Topic 1: P("research" | "science") = 0.3, P("experiment" |


"research") = 0.2, etc.

 Language Model for Topic 2: P("game" | "sports") = 0.4, P("player" | "game") = 0.3, etc.

 Language Model for Topic 3: P("government" | "politics") = 0.5, P("election" |


"government") = 0.2, etc.

Step 3: Generating Text

To generate text, we first sample a topic from the document-topic distribution. Then, we use
the language model associated with that topic to generate words.

For example:

1. Sample a topic: Suppose we sample Topic 1 (science).

2. Generate words using the language model for Topic 1:

o Start with "science".

o Next word: "research" (with probability 0.3).

o Next word: "experiment" (with probability 0.2).

o And so on.

The generated text might look like: "science research experiment data..."
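A toy sketch of topic-conditioned generation: only the science topic's bigram table is filled in, and the transition probabilities are illustrative numbers in the spirit of the example above:

topic_bigrams = {
    "science": {"science": {"research": 0.3},
                "research": {"experiment": 0.2},
                "experiment": {"data": 0.25}},
}

def generate(topic, start, max_len=4):
    word, output = start, [start]
    for _ in range(max_len - 1):
        next_words = topic_bigrams[topic].get(word)
        if not next_words:
            break
        # greedily take the most probable continuation under this topic's LM
        word = max(next_words, key=next_words.get)
        output.append(word)
    return " ".join(output)

print(generate("science", "science"))   # "science research experiment data"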

Bayesian Framework

In a Bayesian framework, we can place priors on the parameters of the topic models and
language models. For example:

 A Dirichlet prior can be placed on the topic distributions.

 A Dirichlet prior can also be placed on the word distributions within each topic.

This allows us to perform Bayesian inference to estimate the posterior distributions of the
topics and language model parameters given the observed data.

Advantages of Bayesian Topic-Based Language Models

1. Semantic Coherence: By incorporating topics, the generated text is more semantically


coherent.
2. Flexibility: The model can adapt to different domains by learning domain-specific topics.

3. Uncertainty Quantification: Bayesian methods provide a natural way to quantify


uncertainty in the model parameters.

9. Neural Network Language Models (NNLMs)

Neural Network Language Models (NNLMs) are a class of language models that use
neural networks to predict the probability of a sequence of words. They have become the
foundation of modern natural language processing (NLP) due to their ability to capture complex
patterns in language data. Below is an explanation of NNLMs, along with an example.

What is a Neural Network Language Model?

A Neural Network Language Model is a model that uses neural networks to estimate the
probability distribution of words in a sequence. It learns to predict the next word in a sequence
given the previous words. NNLMs are trained on large text corpora and can capture syntactic
and semantic relationships between words.

Key components of NNLMs:

1. Input Layer: Represents words or tokens as vectors (e.g., word embeddings).

2. Hidden Layers: Learn patterns and relationships in the data (e.g., Recurrent Neural
Networks (RNNs), Long Short-Term Memory (LSTM), or Transformers).

3. Output Layer: Produces a probability distribution over the vocabulary for the next
word.

Example of a Neural Network Language Model

Let's consider a simple example of predicting the next word in a sentence using a neural
network.

Input Sentence:

"The cat sat on the ___"

Goal:

Predict the next word (e.g., "mat").

Steps in the NNLM Process:

1. Tokenization:

o Break the sentence into tokens: ["The", "cat", "sat", "on", "the"].
2. Word Embeddings:

o Convert each word into a dense vector representation (e.g., using Word2Vec,
GloVe, or learned embeddings).

o Example: "cat" → [0.2, -0.5, 0.7], "sat" → [0.1, 0.3, -0.2], etc.

3. Neural Network Architecture:

o Use a sequence model (e.g., RNN, LSTM, or Transformer) to process the sequence
of word embeddings.

o The model learns to capture dependencies between words.

4. Prediction:

o The model outputs a probability distribution over the vocabulary for the next
word.

o Example: P("mat") = 0.6, P("chair") = 0.3, P("floor") = 0.1.

5. Output:

o The word with the highest probability is selected as the prediction (e.g., "mat").
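The sketch below walks through these steps with a tiny LSTM-based model in PyTorch (assumed available). The toy vocabulary, embedding size, and hidden size are illustrative assumptions, and the model is left untrained, so its prediction is arbitrary; the point is the input, embedding, hidden, and output pipeline.

```python
import torch
import torch.nn as nn

vocab = ["<pad>", "the", "cat", "sat", "on", "mat", "chair", "floor"]
stoi = {w: i for i, w in enumerate(vocab)}

class TinyNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=16, hidden_dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                # input layer: word embeddings
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # hidden layer: sequence model
        self.out = nn.Linear(hidden_dim, vocab_size)                # output layer: scores over vocabulary

    def forward(self, token_ids):
        embedded = self.emb(token_ids)            # (batch, seq_len, emb_dim)
        hidden_states, _ = self.rnn(embedded)     # (batch, seq_len, hidden_dim)
        return self.out(hidden_states[:, -1, :])  # logits for the next word

model = TinyNNLM(len(vocab))
context = torch.tensor([[stoi[w] for w in ["the", "cat", "sat", "on", "the"]]])
probs = torch.softmax(model(context), dim=-1)     # P(next word | context)
print(vocab[int(probs.argmax())])                 # arbitrary until the model is trained
```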

Types of Neural Network Language Models

1. Feedforward Neural Networks:

o Early NNLMs used fixed-size context windows to predict the next word.

o Example: Bengio et al.'s 2003 model.

2. Recurrent Neural Networks (RNNs):

o Process sequences of arbitrary length by maintaining a hidden state.

o Example: LSTM or GRU-based models.

3. Transformers:

o Use self-attention mechanisms to capture long-range dependencies.

o Example: GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder


Representations from Transformers).
Language-Specific Modeling Problems
Language-specific modeling problems in NLP arise from the inherent complexities and
nuances of natural language, including ambiguity, context dependence, and the need for models
to understand and generate text that is both grammatically correct and semantically
meaningful.
Language models (LMs) in NLP face various language-specific modeling problems due to
linguistic diversity, structural differences, and cultural nuances across languages. Below are
key challenges with examples:
1. Morphological Richness
 Problem: Languages with complex morphology (e.g., agglutinative/fusional) challenge
tokenization and word representation.
 Example:
o Finnish: "taloissani" ("in my houses") combines root + plural + possessive suffix.
o Arabic: the unvocalized string "كتبنا" can be read as "katabnā" ("we wrote") or "kutubunā" ("our books") and requires disambiguation.
2. Word Order Variations
 Problem: Fixed-context LMs (e.g., GPT) struggle with free-word-order languages.
 Example:
o German: "Den Mann beißt der Hund" (object-first OVS order, "The dog bites the man") vs. the standard SVO "Der Hund beißt den Mann."
o Turkish: "Ali okula dün gitti" vs. "Dün Ali okula gitti": the subject and adverb can move freely, yet both orders mean "Ali went to school yesterday."
3. Pro-Drop Phenomena
 Problem: Subject omission requires context-aware inference.
 Example:
o Spanish: "Hablo español" ("[I] speak Spanish") lacks an explicit subject.
o Japanese: "食べた" ("[I] ate") relies on context.
4. Gender and Agreement
 Problem: Gender-neutral languages vs. gendered systems cause mismatches.
 Example:
o English: Singular "they" in "They are a doctor" is gender-neutral, but French forces a choice between "Il est médecin" (masculine) and "Elle est médecin" (feminine).
o Hebrew: Verbs agree with gender ("הוא הלך" = "he went" vs. "היא הלכה" = "she went").
5. Script and Tokenization
 Problem: Non-Latin scripts (e.g., logographic, abugida) complicate subword splitting.
 Example:

o Chinese: "喜欢" ("xı̌huān" = "like") is one word but two characters.

o Thai: "กําลังไป" ("kamlang pai" = "is going") has no spaces between words.
6. Idioms and Cultural Context

 Problem: Literal translations fail for culturally rooted phrases.

 Example:

o English: "Kick the bucket" means "to die"; a literal translation loses the idiom.

o Dutch: "De kat uit de boom kijken" (literally "watch the cat out of the tree") means to wait and see before acting.

7. Low-Resource Data Scarcity

 Problem: LMs for rare languages lack training data.

 Example:

o Navajo: "Shik'éí dóó shidine'é" ("my relatives and my people") has limited NLP
resources.

8. Polysemy and Homonymy

 Problem: Words with multiple meanings require disambiguation.

 Example:

o English: "Bank" (financial institution vs. river edge).

o Russian: "Ключ" ("klyuch" = "key" or "spring").

9. Formality and Honorifics

 Problem: Social hierarchy encoded in language.

 Example:

o Korean: "합니다" (formal) vs. "해" (informal) for "do."

o Japanese: "食べます" (polite) vs. "食べる" (plain).

10. Dialectal Variations

 Problem: LMs trained on standardized forms fail on dialects.

 Example:

o Arabic: Modern Standard Arabic (MSA) vs. Egyptian Arabic "عايز" ("ʕāyez" = "want").

o English: "Y'all" (Southern U.S.) vs. "You guys" (Northern U.S.).


1. Language modeling for morphologically rich languages (MRLs)

Language modeling for morphologically rich languages (MRLs) in natural language


processing (NLP) refers to the specific challenges and strategies required to handle languages
with complex word structures, where a single word can convey a wide range of meanings
through inflection, derivation, and compounding. These languages, which include languages like
Turkish, Finnish, Arabic, Russian, and many others, have highly inflected forms that require
special treatment compared to languages with more regular or simple morphology, like English.

Key Challenges for MRLs in NLP:

1. Word Forms Explosion:

o MRLs tend to have a large number of word forms due to rich inflectional systems
(e.g., gender, number, case, tense, aspect, mood). This can result in a vocabulary
explosion, making it difficult for traditional language models to handle all possible
forms of a word.

o Example: In Turkish, the verb "gelmek" (to come) can be inflected in many forms:

 geliyorum (I am coming)

 geldim (I came)

 geleceğim (I will come)

 gelmeliyim (I should come)

2. Ambiguity and Polysemy:

o MRLs may have a high degree of ambiguity, where different word forms can look
similar but have different meanings based on context, requiring better
disambiguation.

o Example: In Arabic, the unvocalized string "كتب" can be read as "kataba" (he wrote), "kutiba" (it was written), or "kutub" (books); without diacritics or context it is ambiguous.

3. Morphological Parsing:

o MRLs require accurate morphological parsing to break down words into meaningful
units (morphemes). This is essential for understanding word structure and
generating meaningful outputs.

o Example:

o In Finnish, the compound "lentokonesuihkuturbiinimittaristo" ("airplane jet turbine instrument panel") must be parsed into its components to be understood.

4. Word Segmentation:

o Some MRLs, such as Chinese or Thai, do not use spaces between words, which
requires segmentation as part of preprocessing. While this is less of an issue for
languages like Turkish or Finnish, segmentation still plays a role in understanding
compound words.
o Example: In Chinese, "我喜欢吃苹果" ("I like eating apples") must be correctly
segmented into "我 喜欢 吃 苹果."

Strategies for Language Modeling in MRLs:

1. Subword Tokenization:

o Byte-Pair Encoding (BPE), SentencePiece, or WordPiece tokenization can help by


breaking down complex words into subword units. This helps the model to
generalize better, as it doesn't need to handle every possible word form
separately.

o Example: For Turkish, using BPE might break "geliyorum" (I am coming) into
subwords like "gel" (root for come), "iyor" (present continuous suffix), and "um"
(first-person singular agreement suffix).

2. Morphological Analysis:

o Preprocessing with morphological analyzers can help to identify root forms and
affixes (prefixes, suffixes) and treat them as distinct components. This allows
the model to learn and generalize better from these components rather than
memorizing every word form.

o Example: In Finnish, the compound "lentokonesuihkuturbiinimittaristo" can be split into smaller units such as "lentokone" (airplane), "suihkuturbiini" (jet turbine), and "mittaristo" (instrument panel).

3. Character-Level Modeling:

o Some models handle the task at the character level rather than word level. This
can be especially useful for MRLs, where complex morphology means that the form
of a word can change substantially while maintaining its core meaning.

o Example: In Arabic, instead of modeling "kataba" (he wrote) as a word, a


character-level model might break it down to its constituent characters: "k", "a",
"t", "a", "b", "a", and learn the morphological rules from these characters.

4. Transfer Learning with Pre-trained Models:

o Using pre-trained models (like multilingual BERT, XLM-R, or mT5) on a large


corpus of multiple languages, including MRLs, can help transfer knowledge about
morphology and syntax from similar languages. This allows models to better handle
the rich morphological structures of MRLs.
o Example: Multilingual BERT (mBERT) can be fine-tuned for specific MRLs, such as
Turkish or Arabic, to adapt to the nuances of the language and improve
performance on downstream tasks.

5. Dealing with Out-of-Vocabulary (OOV) Words:

o MRLs often have words that don't appear in the training data. Subword
tokenization and character-level models are particularly useful in mitigating OOV
issues because they break words into smaller components that can be learned even
if the full word wasn't seen during training.

o Example: A rare word like "kitapçık" (booklet) in Turkish might be split into
subwords like "kitap" (book) and "çık" (diminutive suffix), allowing the model to
handle it effectively.
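A minimal sketch of this idea, using a toy subword vocabulary and greedy longest-match lookup (both illustrative assumptions, not a trained tokenizer): an unseen word is mapped to known subword IDs instead of a single unknown token.

```python
SUBWORD_VOCAB = ["<unk>", "kitap", "çık", "lar", "dan", "gel", "iyor", "um"]
PIECE_TO_ID = {piece: i for i, piece in enumerate(SUBWORD_VOCAB)}

def encode(word):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):       # try the longest known piece first
            if word[i:j] in PIECE_TO_ID:
                pieces.append(word[i:j])
                i = j
                break
        else:                                   # no piece matched at this position
            pieces.append("<unk>")
            i += 1
    return pieces, [PIECE_TO_ID[p] for p in pieces]

print(encode("kitapçık"))    # (['kitap', 'çık'], [1, 2])
print(encode("geliyorum"))   # (['gel', 'iyor', 'um'], [5, 6, 7])
```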

2. Selection of Subword Units

In natural language processing (NLP), subword units are smaller linguistic components
that are used to break down words into manageable pieces, especially for languages with
complex morphology or when dealing with rare or unseen words. The process of selecting
subword units is critical because it affects how a model learns to understand and generate
language. Subword units can be characters, syllables, or more complex units like word stems or
morphemes.

Importance of Subword Units in Language Models:

1. Handling Out-of-Vocabulary (OOV) Words:

o Words that do not appear in the training set, especially rare or compound words,
can be decomposed into subword units, reducing the impact of OOV issues.

o Example: For the word “unhappiness,” a subword tokenizer might break it down
into subword units like “un,” “happi,” and “ness.”

2. Handling Morphologically Rich Languages:

o In languages with rich morphology (like Turkish or Finnish), words can have many
inflections or derivations. Subword units help break these words into more basic
units that can be processed effectively.

o Example: In Turkish, the word "kitaplarınızdan" (from your books) can be split
into subword units like "kitap" (book), "lar" (plural suffix), "ınız" (your suffix), and
"dan" (ablative case).
3. Efficient Representation of Rare Words:

o Subword tokenization helps represent rare and compound words, even if they
don't appear in the training corpus, by breaking them down into familiar subword
units.

o Example: "Googleplex" might be split into "Goog" and "leplex," making it easier for
the model to understand and process.

Common Methods for Selecting Subword Units:

1. Byte-Pair Encoding (BPE):

o BPE is a data-driven algorithm that iteratively merges the most frequent pair of
characters or subword units into a new symbol. This approach is useful in managing
both OOV words and rare word forms by breaking them into frequent subword
units.

Steps in BPE:

o Start with a vocabulary of individual characters.

o Count all pairs of adjacent characters.

o Merge the most frequent pair into a new symbol.

o Repeat until the desired vocabulary size is achieved.

Example:

o Let's say we have the word "lower" and "newer."

1. Initially, BPE treats each character as a subword: [l, o, w, e, r], [n, e, w, e,


r].

2. The most frequent pair of characters is "e" and "r", so they are merged
into "er".

3. Now, the words become [l, o, w, er], [n, e, w, er].

4. Continue merging the most frequent pairs.

Result: Words like "lower" and "newer" come to share subword units such as [l, o, w, er] and [n, e, w, er], making it easier to process variations of both words. (A short code sketch of this merge loop appears after this list.)
2. WordPiece:

o WordPiece is similar to BPE but with some modifications. It also merges subwords
based on frequency, but it tries to maximize the likelihood of a sequence of
subword units given a training corpus.

o WordPiece is widely used in models like BERT and its variants.

Example:

o WordPiece starts with individual characters and progressively merges them into
larger units, so a word like "unhappiness" ends up represented by subwords such as
"un", "happi", and "ness." It builds a vocabulary of subword units that maximizes the
likelihood of the training corpus under the language model.

3. SentencePiece:

o SentencePiece is a data-driven, unsupervised text tokenizer and detokenizer that


is based on subword units. It is often used in neural machine translation and other
NLP tasks. SentencePiece can be used in both BPE and unigram language model
modes.

Example:

o Using SentencePiece, the word "unhappiness" might be split into ["▁un",


"happiness"] (where "▁" indicates a space).

SentencePiece often produces a compact and efficient representation, especially for


languages that do not use spaces between words.

4. Morpheme-based Tokenization:

o In languages with rich morphology, morpheme-based tokenization can be


employed, where the subword units are actual morphemes (the smallest meaning-
carrying units of language). This method relies on linguistic knowledge and is more
precise but requires proper morphological analysis tools.

Example:

o In Finnish, the compound "lentokonesuihkuturbiinimittaristo" ("airplane jet turbine instrument panel") might be split into morphemes like ["lento" (flight), "kone" (machine), "suihku" (jet), "turbiini" (turbine), "mittaristo" (instrument panel)].
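Below is a minimal sketch (pure Python) of the BPE merge loop described in method 1 above, run on the toy words "lower" and "newer"; the number of merges is an illustrative assumption.

```python
from collections import Counter

def bpe_merges(words, num_merges=3):
    # Represent each word as a tuple of symbols; every word has frequency 1 here.
    corpus = {tuple(w): 1 for w in words}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]     # most frequent adjacent pair
        merges.append(best)
        new_corpus = {}
        for symbols, freq in corpus.items():        # apply the merge everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges, corpus

print(bpe_merges(["lower", "newer"]))
# With this toy corpus ("e","r") and ("w","e") are tied at count 2; ties are broken by
# insertion order here, so the first merges yield pieces such as "we" and "er".
```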
Subword Selection in Practice: Examples

1. English:

o The word “unhappiness” might be split into subwords like ["un", "happiness"] or
["un", "happi", "ness"] using BPE or WordPiece. This allows the model to handle
variations like "happy," "happier," and "unhappy" by focusing on the core subwords.

2. German:

o German has compound words, and a simple tokenization approach might result in
long words that are difficult for the model to process. A subword model would
break down a word like "Donaudampfschifffahrtsgesellschaftskapitän" (Danube
steamship company captain) into manageable subwords like ["Donau", "dampf",
"schiff", "fahrts", "gesellschafts", "kapitän"].

3. Turkish:

o Turkish has agglutination, where words are built from a root plus multiple affixes.
For example, the word "kitaplarınızdan" (from your books) can be broken into
subwords: ["kitap" (book), "lar" (plural), "ınız" (your), "dan" (ablative)].

4. Chinese:

o In languages like Chinese, where there are no spaces between words, subword
tokenization (like SentencePiece) is used to split sentences into meaningful
subword units. For example, "我喜欢吃苹果" (I like eating apples) might be split
into ["我", "喜欢", "吃", "苹果"].

3. Modeling with Morphological Categories

Modeling morphological categories in language models refers to the ability to

understand and represent the different forms and structures of words that convey

grammatical information, such as tense, case, gender, number, and other linguistic features.

Morphology is particularly important in languages with rich inflection or derivation systems,

where a single word can have many forms based on its morphological structure.

In Natural Language Processing (NLP), incorporating morphological categories allows

language models to better capture the meaning and grammatical function of words, improving

performance on tasks like machine translation, text generation, and speech recognition.
Key Morphological Categories in NLP:

1. Tense (Verb inflection indicating time of action):

o Example: In English, the verb "to walk" can be inflected to show tense: "walk"

(present), "walked" (past), "walking" (progressive), etc.

o In Russian, verbs have different forms based on past, present, future, and

aspect (perfective vs. imperfective).

2. Case (Grammatical category indicating the syntactic or semantic role of a noun or

pronoun):

o Example: In German, nouns change form depending on the case: "der Hund"

(nominative, the dog), "den Hund" (accusative, the dog), "dem Hund" (dative, to

the dog), etc.

3. Gender (The classification of nouns and adjectives as masculine, feminine, or neuter):

o Example: In Spanish, nouns have gender: "el libro" (the book, masculine) and "la

mesa" (the table, feminine).

4. Number (Singular or plural form of a word):

o Example: In English, "cat" is singular, while "cats" is plural. In Arabic, the plural may be formed with a regular (sound) suffix (e.g., "مهندس" → "مهندسون", engineer → engineers) or as a broken plural with internal change (e.g., "كتاب" → "كتب", book → books; "ولد" → "أولاد", child → children).

5. Person (The subject of the verb: 1st, 2nd, or 3rd person):

o Example: In French, the verb changes according to the person: "je mange" (I

eat), "tu manges" (you eat), "il mange" (he eats).

6. Mood (Indicates the attitude of the speaker toward the action: indicative, imperative,

subjunctive, etc.):

o Example: In Spanish, the subjunctive mood is used in sentences like "Espero que

él venga" (I hope that he comes).

7. Aspect (Describes the internal temporal flow of an action, such as whether it is

completed or ongoing):

o Example: In Russian, the verb "читать" (to read) can have a perfective aspect

"прочитать" (to read completely) and an imperfective aspect "читать" (to read

regularly).
Methods for Modeling Morphological Categories in NLP

Morphological categories are often captured and modeled in NLP by breaking down words into

smaller units like morphemes, stems, and affixes. Here are a few techniques and models used

to represent and model these categories:

1. Morphological Analysis (Morphological Tagging)

 Morphological analysis is the task of identifying and tagging words with their

morphological features (e.g., tense, case, gender). This can be done using rule-based or

statistical approaches.

 Example: In Turkish, the word "kitaplarınızdan" can be analyzed as:

o Root: kitap (book)

o Plural suffix: -lar

o Possessive suffix: -ınız (your)

o Ablative suffix: -dan (from)

 Output: "kitaplarınızdan" → Root: kitap, Case: Ablative, Person: 2nd, Number: Plural (a toy analyzer in this style is sketched after this list).

2. Subword Tokenization:

 Using methods like Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, subword

tokenization can break down complex words into smaller units (e.g., prefixes, roots,

suffixes), which can help capture morphological features more effectively.

 Example: For the Turkish word "geliyorum" (I am coming), the tokenization might

break it into subword units like:

o "gel" (root for come)

o "iyor" (present continuous suffix)

o "um" (first-person singular suffix)

 Result: This allows the model to capture the structure and tense (present continuous)

without needing to see the entire word form during training.

3. Character-Level Models:

 Character-level models model words as sequences of characters rather than whole

words. This can be especially useful for morphologically rich languages where words

can change significantly through affixation or compounding.

 Example: The word "katılım" (participation) in Turkish could be represented as a

sequence of characters: ['k', 'a', 't', 'ı', 'l', 'ı', 'm'].


 Advantage: By processing characters instead of entire words, the model can learn

patterns for suffixes, prefixes, and other morphological markers.

4. Pretrained Multilingual Models (e.g., mBERT, XLM-R)

 Pretrained multilingual models like mBERT (Multilingual BERT) and XLM-R (XLM-

RoBERTa) can capture morphological categories by leveraging training data from many

languages. These models can learn the underlying grammatical structures and

relationships between words across languages, making them effective for

morphologically rich languages.

 Example: mBERT, when fine-tuned on specific tasks, can model grammatical features

such as tense, number, and case for various languages, including languages like Arabic,

Turkish, and Finnish, without requiring separate models for each language.

5. LSTM and Transformer-based Models with Morphological Features:

 Long Short-Term Memory (LSTM) networks and transformers can also be enhanced

to model morphology by incorporating additional morphological features, such as POS

(part-of-speech) tags or grammatical features.

 Example: In Finnish, the compound "lentokonesuihkuturbiinimittaristo" (airplane jet

turbine instrument panel) could be split into its components, and an LSTM or

Transformer model could incorporate the features that "lentokone" is a noun (airplane),

"suihkuturbiini" is a compound (jet turbine), and so on.

6. Morphological Embeddings:

 Morphological embeddings can capture the meaning of different morphemes (prefixes,

suffixes, roots) and their relationship. By representing morphological components as

embeddings (dense vector representations), a model can effectively learn to capture

their contribution to the overall meaning of a word.

 Example: A model might learn that "un-" typically negates the meaning of a word (e.g.,

"unhappy") and represent this prefix as a unique embedding that can be combined with

the root word embedding (e.g., "happy").
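As a concrete illustration of the morphological tagging in method 1 above, here is a toy rule-based sketch; the three-suffix list is an illustrative assumption, and a real Turkish analyzer must also handle vowel harmony, allomorphs, and many more affixes.

```python
# (surface form, feature) pairs, stripped from the end of the word
SUFFIXES = [
    ("dan", "Case=Ablative"),
    ("ınız", "Poss=2pl"),
    ("lar", "Number=Plural"),
]

def analyze(word):
    features = []
    changed = True
    while changed:                       # keep stripping suffixes until none match
        changed = False
        for surface, feature in SUFFIXES:
            if word.endswith(surface):
                features.append(feature)
                word = word[: -len(surface)]
                changed = True
    return {"root": word, "features": list(reversed(features))}

print(analyze("kitaplarınızdan"))
# {'root': 'kitap', 'features': ['Number=Plural', 'Poss=2pl', 'Case=Ablative']}
```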


4. Languages without Word Segmentation

In many languages, words are not explicitly separated by spaces, which presents a

unique challenge for language models in Natural Language Processing (NLP). This is the

case for languages like Chinese, Japanese, Thai, Vietnamese, and Malay, where text is

written without clear word boundaries. In such languages, the task of segmentation

becomes crucial, as it involves determining where one word ends and the next begins, a

process known as word segmentation or tokenization.

Challenges of Word Segmentation:

 No Spaces Between Words: Unlike languages like English, where spaces are used to

separate words, many languages do not use spaces or punctuation marks to indicate

word boundaries.

 Ambiguity: In languages without word segmentation, multiple interpretations of a

sequence of characters are possible. For example, a string of characters might be

segmented into different words depending on context.

Approaches to Handle Languages Without Word Segmentation:

1. Character-Level Models:

o In languages without explicit word segmentation, character-level models can be

used to process the text as sequences of characters rather than whole words.

This approach eliminates the need for explicit word boundaries, as the model

learns to recognize patterns and linguistic structures from the characters

themselves.
o Example: For Chinese or Japanese, character-level sequence models (RNNs, or

Transformer architectures operating over characters) can be trained or fine-tuned on

tasks such as machine translation, sentiment analysis, or text generation.

2. Word Segmentation Models:

o Statistical Models: Techniques like Hidden Markov Models (HMM), Conditional

Random Fields (CRF), and Maximum Entropy Models have been applied to the

word segmentation problem. These models predict word boundaries based on the

probability distributions of sequences in a large corpus of labeled data.

o Neural Network Models: With the rise of deep learning, neural network-based

models such as BiLSTMs (Bidirectional Long Short-Term Memory networks),

CRFs, and Transformers have shown strong performance in word segmentation.

These models are trained end-to-end and can handle ambiguous segmentations

by considering the broader context in which a sequence of characters appears.

3. Pretrained Language Models:

o Multilingual Transformers (e.g., mBERT, XLM-R): Pretrained models like mBERT

are capable of handling text in languages without explicit word segmentation.

These models are trained on massive multilingual corpora and can learn to

process languages like Chinese, Japanese, and Thai without relying on traditional

word segmentation.

4. Lexical Resources:

o Dictionary-based Segmentation: A dictionary-based approach uses precompiled

lexicons to identify known words in the text and segment accordingly. This can

be combined with statistical methods to improve accuracy, especially for unseen

words or rare terms.
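A minimal sketch of dictionary-based segmentation using greedy longest match (the tiny dictionary and maximum word length are illustrative assumptions); real systems combine this with the statistical and neural methods above.

```python
DICTIONARY = {"我", "喜欢", "吃", "苹果"}

def max_match(text, max_word_len=4):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            candidate = text[i:j]
            if candidate in DICTIONARY or j == i + 1:   # fall back to a single character
                words.append(candidate)
                i = j
                break
    return words

print(max_match("我喜欢吃苹果"))   # ['我', '喜欢', '吃', '苹果']
```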

5. Spoken versus Written Languages

The distinction between spoken and written languages in Natural Language Processing
(NLP) is an important one, as these two forms of language exhibit different
characteristics and challenges. These differences can affect how language models are
trained and applied for tasks like text generation, translation, sentiment analysis, and
speech recognition.
Key Differences Between Spoken and Written Language:

1. Formality and Structure:

o Spoken Language: It is often informal, more conversational, and spontaneous.


Speech tends to have shorter, incomplete sentences, frequent use of fillers
(e.g., "uh," "you know"), and contractions (e.g., "I'm" instead of "I am").

o Written Language: Writing is usually more formal, structured, and carefully


constructed. Sentences are complete, and grammar and punctuation are strictly
followed.

o Example:

 Spoken: "Uh, I was thinking about going to the store later."

 Written: "I was considering going to the store later."

2. Vocabulary:

o Spoken Language: The vocabulary used in spoken language is often simpler, with
frequent use of common, everyday words. There are also colloquial expressions,
slang, and regional dialects.

o Written Language: Written language tends to have a more sophisticated


vocabulary, with fewer colloquialisms and more formal word choices.

o Example:

 Spoken: "Wanna go out later?"

 Written: "Would you like to go out later?"

3. Disfluencies and Hesitations:

o Spoken Language: In speech, people often hesitate, repeat words, or make


errors that they quickly correct. These disfluencies are common in everyday
conversation.

o Written Language: In writing, disfluencies are usually removed or edited. Text


is typically clean and free from interruptions or self-corrections.

o Example:

 Spoken: "I... I don't know if I can make it."

 Written: "I don't know if I can make it."

4. Contextual Cues:

o Spoken Language: Spoken communication relies heavily on contextual and non-


verbal cues (e.g., tone of voice, facial expressions, gestures). Speakers can
clarify meaning through prosody (intonation and stress) and body language.
o Written Language: Writing lacks these immediate non-verbal cues, so it relies
more on explicit context and grammatical structure to convey meaning.

5. Pacing and Length:

o Spoken Language: Speech tends to be more immediate and fast-paced, often


with incomplete thoughts and shorter phrases.

o Written Language: Writing is typically slower, more deliberate, and more


thought-out. Writers have time to revise their ideas and choose words carefully.

Challenges in NLP for Spoken vs. Written Language:

1. Speech Recognition (Converting spoken language to written text):

o Spoken language models for speech recognition need to deal with natural
disfluencies, hesitations, and incomplete phrases. Speech recognition systems
must be robust enough to handle these elements and convert speech into
grammatically correct text.

o Example: Converting “Um, could you, like, help me with this?” into a clean
sentence requires the model to understand and remove filler words like “um” and
“like.”

2. Text Generation:

o Generating spoken text (e.g., for virtual assistants like Siri or Alexa) requires
the language model to produce informal, conversational language that fits the
context of an ongoing interaction.

o Example: If a user asks, "What's the weather like today?" a spoken language
model might respond with something like, "It's sunny and 75 degrees." In
contrast, a written language model might provide a more detailed answer: "The
weather today is sunny with a temperature of 75°F."

3. Machine Translation:

o Spoken language translation (e.g., simultaneous translation in real-time


conversations) requires handling the informal structure and the often context-
dependent nature of spoken phrases.

o Example: In spoken language translation, idiomatic expressions like "kick the


bucket" (meaning to die) need to be understood and translated accurately,
whereas in written translation, these expressions might be handled in a more
structured and formal manner.

4. Disfluency Handling:
o In spoken language models, handling disfluencies is key. This includes the task
of filtering out unnecessary parts of speech (e.g., "uh," "um") or correcting false
starts (e.g., "I mean, I think we should go...").

o Example: A spoken language model might clean up a sentence like: "Um, I was,
like, thinking about going to the park... but I don't know." It might be converted
to: "I was thinking about going to the park, but I don't know."

Techniques to Handle Both Spoken and Written Language:

1. Multimodal Models:

o Multimodal models combine both spoken and written data to handle both forms
of language effectively. These models are trained on both speech (audio) and
text to bridge the gap between spoken and written language. For example,
DeepSpeech or wav2vec can recognize speech and then translate it into written
form.

2. Pretrained Language Models:

o Pretrained models like BERT, GPT-3, or T5 can be fine-tuned for both spoken
and written language tasks. For instance, a model could be trained on both
formal text (e.g., news articles) and informal spoken text (e.g., dialogue
datasets).

3. Speech-to-Text (STT) and Text-to-Speech (TTS):

o STT systems convert spoken language into written form, while TTS systems
take written text and generate spoken output. These models are critical in
bridging the gap between spoken and written language and are useful for voice
assistants, transcription services, and accessibility tools.

Examples of NLP Applications for Spoken vs. Written Language:

1. Spoken Language Applications:

o Voice Assistants: Models like Siri, Google Assistant, and Alexa rely on spoken
language models to respond to verbal commands in a conversational manner.

o Speech Recognition Systems: These systems convert spoken input into written
text. Examples include Dragon NaturallySpeaking and Google Speech-to-Text.

o Real-Time Translation: Services like Google Translate and Skype Translator


use spoken language models to convert spoken input in one language to spoken
output in another language in real-time.

2. Written Language Applications:


o Text Summarization: Models like BERT and T5 can be used to generate
summaries of written content, such as articles or reports, in a concise and
structured manner.

o Grammar and Style Correction: Tools like Grammarly or Hemingway Editor


focus on improving written language by correcting grammar, spelling, and style.

o Text Classification: Written language models are widely used for classifying
text into categories, such as spam detection or sentiment analysis, based on
formal written content.

Multilingual and Crosslingual Language Modeling

Multilingual Language Modeling (MLM) involves training a single language model on text
from multiple languages, enabling it to process and generate text across different
languages without requiring separate monolingual models. This approach is crucial for
improving NLP applications in low-resource languages and reducing redundancy in model
development.

1. Key Concepts in Multilingual Language Modeling

a) Shared Vocabulary & Tokenization

 Subword Tokenization (e.g., Byte-Pair Encoding (BPE), SentencePiece, Unigram


LM) helps handle multiple languages efficiently.

 A shared vocabulary is created across languages, allowing the model to represent


words from different languages in a unified embedding space.

 Example: XLM-R uses a single vocabulary trained on 100+ languages.

b) Crosslingual Transfer Learning

 High-resource languages (e.g., English, Chinese) help improve performance on low-


resource languages (e.g., Swahili, Bengali).

 The model learns language-agnostic representations, enabling zero-shot or few-


shot transfer.

c) Language Identification & Embeddings

 Some models (e.g., mBERT, mT5) use language embeddings to indicate the input
language, helping the model switch between languages.
2. Popular Multilingual Language Models

Model | Architecture | Languages | Key Features
mBERT | Transformer (BERT-based) | 104 | Trained on Wikipedia, no explicit alignment
XLM-R (RoBERTa-based) | Transformer | 100+ | Larger scale, better crosslingual transfer
mT5 (Multilingual T5) | Seq2Seq (T5-based) | 101 | Text-to-text framework, supports generation
GPT-3.5/4 (ChatGPT) | Decoder-only | ~100+ | Few-shot multilingual capabilities
BLOOM | Decoder-only | 46 | Open-source, large-scale multilingual LM
3. Training Strategies for MLM

a) Monolingual Corpus Mixing

 Train on a mix of monolingual texts from different languages (e.g., mBERT,


XLM-R).

 Challenge: Some languages may dominate due to data imbalance (a common mitigation, exponent-smoothed sampling, is sketched after this list).

b) Parallel Data-Based Training

 Use translation pairs (e.g., XLM with Translation Language Modeling (TLM)).

 Helps align representations across languages.

c) Unsupervised Alignment

 Techniques like masked language modeling (MLM) and back-translation help


align languages without parallel data.

d) Parameter-Efficient Fine-Tuning

 Adapters, LoRA: Train small language-specific modules instead of full fine-


tuning.
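The sketch below shows the exponent-smoothed sampling mentioned under (a): each language's share of the data is raised to a power alpha < 1 before normalizing, so low-resource languages are up-sampled. The alpha value and corpus sizes are illustrative assumptions.

```python
def sampling_probs(corpus_sizes, alpha=0.3):
    total = sum(corpus_sizes.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in corpus_sizes.items()}
    norm = sum(smoothed.values())
    return {lang: w / norm for lang, w in smoothed.items()}

sizes = {"en": 3_000_000, "hi": 300_000, "sw": 30_000}   # sentences per language (toy numbers)
print(sampling_probs(sizes))
# English still dominates, but Swahili's sampling share rises from under 1% of the
# raw data to roughly 14%, so the model sees low-resource languages more often.
```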

4. Applications of Multilingual LMs

 Machine Translation (e.g., Google's mT5, Facebook's NLLB)

 Crosslingual Text Classification (e.g., sentiment analysis in multiple languages)

 Named Entity Recognition (NER) (e.g., identifying entities in low-resource


languages)

 Multilingual Chatbots & Virtual Assistants (e.g., ChatGPT, Alexa)

 Information Retrieval & Search (e.g., multilingual search engines)


5. Challenges & Future Directions

Challenges

 Language Imbalance: High-resource languages dominate model performance.

 Script & Grammar Differences: Handling languages with different syntax (e.g.,
Arabic vs. English).

 Computational Cost: Training large multilingual models is expensive.

Future Trends

 Better Low-Resource Adaptation (e.g., meta-learning, few-shot prompting).

 Efficient Multilingual Models (e.g., modular architectures like AdapterFusion).

 Improved Crosslingual Alignment (e.g., contrastive learning, better


tokenization).

Crosslingual Language Modeling (CLM)

Crosslingual Language Modeling (CLM) focuses on training models to transfer knowledge


from one language to another, enabling them to perform tasks in a target language even
with limited or no supervised data. Unlike Multilingual Language Models (MLMs) (which
handle multiple languages simultaneously), CLM emphasizes cross-language generalization,
making it crucial for low-resource languages.

1. Key Concepts in Crosslingual Language Modeling

a) Definition & Goal

 Goal: Train a model on one or more source languages (typically high-resource)


and apply it to target languages (often low-resource) with minimal fine-tuning.

 Core Idea: Learn language-agnostic representations that generalize across


languages.

b) Crosslingual Transfer Learning

 Zero-shot Transfer: Apply a model trained on Language A directly to Language


B without additional training.

 Few-shot Transfer: Use a small amount of labeled data in the target language
for adaptation.

 Unsupervised CLM: Align languages without parallel data (e.g., using back-
translation, masked LM).
c) Alignment Strategies

 Lexical Alignment: Mapping words/subwords across languages (e.g., using


bilingual dictionaries).

 Sentence-Level Alignment: Using parallel corpora (e.g., Europarl, UN datasets).

 Latent Space Alignment: Forcing embeddings of different languages into a


shared space (e.g., LASER, VecMap).

2. Approaches to Crosslingual Language Modeling

A) Supervised CLM (Using Parallel Data)

 Translation Language Modeling (TLM) (XLM):

o Extends Masked LM (MLM) by masking tokens in parallel sentences and


predicting them bidirectionally.

o Example: Mask a word in an English sentence and predict it using context


from its French translation.

 Multilingual Seq2Seq (mBART, mT5):

o Trained on large-scale parallel corpora for translation-like tasks.

o Can be fine-tuned for crosslingual tasks (summarization, QA).

B) Unsupervised CLM (No Parallel Data Needed)

 Back-Translation:

o Generate synthetic parallel data by translating monolingual text.

o Used in Unsupervised Machine Translation (UMT).

 Adversarial Training:

o Use GANs or contrastive learning to align embeddings (e.g., MUSE


embeddings).

 Self-Training:

o Use model predictions on unlabeled data to iteratively improve performance.

C) Pretrain & Fine-Tune Paradigm

1. Pretrain on multiple languages (e.g., mBERT, XLM-R).

2. Fine-tune on a source language (e.g., English NER).

3. Transfer to target languages with zero/few-shot learning.
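A minimal sketch of this recipe using the Hugging Face transformers library (assumed available) and the multilingual checkpoint "xlm-roberta-base". The English fine-tuning step is only indicated by a comment; the point is that the same weights are then applied, unchanged, to text in another language.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "xlm-roberta-base"   # step 1: multilingual pretraining (already done upstream)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Step 2 (not shown): fine-tune `model` on an English sentiment dataset.

# Step 3: zero-shot application of the same model to a German sentence.
inputs = tokenizer("Der Film war großartig!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))   # class probabilities (arbitrary here, since the head is untrained)
```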


3. Popular Crosslingual Language Models

Model | Type | Key Feature | Languages Supported
XLM (Facebook) | Encoder | TLM & MLM | 15+ (needs parallel data)
XLM-R | Encoder | Large-scale MLM | 100+ (no parallel data)
mBART | Seq2Seq | Denoising autoencoder | 25+ (parallel data for fine-tuning)
Unicoder (Microsoft) | Encoder | Multi-task alignment | 12+
InfoXLM | Encoder | Contrastive learning | 100+

4. Applications of Crosslingual Models

✔ Machine Translation (e.g., Google's mT5, Meta's NLLB)


✔ Crosslingual Text Classification (e.g., sentiment analysis in unseen languages)
✔ Named Entity Recognition (NER) (e.g., recognizing entities in Swahili using English
data)
✔ Question Answering (QA) (e.g., XQuAD benchmark for crosslingual QA)
✔ Multilingual Search & Retrieval

5. Challenges in Crosslingual Modeling

A) Linguistic Divergence

 Syntax Differences: Subject-Object-Verb (SOV) vs. SVO languages.

 Morphological Complexity: Agglutinative (e.g., Turkish) vs. analytic (e.g.,


Chinese).

B) Data Scarcity & Bias

 Most models rely on English-centric pretraining, hurting low-resource


languages.

 Domain mismatch (e.g., Wikipedia vs. social media text).

C) Evaluation Difficulties

 Lack of standardized benchmarks for all languages.

 Zero-shot performance often lags behind supervised models.

6. Future Directions

 Better Unsupervised Alignment (e.g., using contrastive learning)
 Parameter-Efficient Transfer (e.g., Adapters, LoRA for per-language tuning)
 Meta-Learning for Few-Shot Adaptation
 Improving Low-Resource Language Performance
