NLP Unit-4
Language modeling involves predicting the next word or sequence of words in a given context. It is a core
component of many NLP applications, such as machine translation, speech recognition, text
generation, and more. Language models (LMs) are trained to capture the structure, grammar,
and semantics of a language, enabling them to generate coherent and contextually appropriate
text.
1. Statistical Language Modeling: The development of probabilistic models that can predict the next word in a
sequence given the words that precede it, for example N-gram language modeling.
2. Neural Language Modeling: Neural network methods are achieving better results than
classical methods both on standalone language models and when models are incorporated
into larger models on challenging tasks like speech recognition and machine translation.
1. Probability Distribution:
For a given sequence of words w1, w2, …, wn, the model estimates the probability P(w1, w2, …, wn).
o For example, in a bigram model (n=2), the probability of a word depends only on the previous word.
o N-gram models are simple but suffer from sparsity issues (i.e., many possible word sequences never appear in the training data).
o Neural networks have largely replaced traditional n-gram models due to their scalability.
4. Evaluation Metrics:
o Perplexity: A common metric for evaluating language models. It measures how well the model predicts a sample of text.
o BLEU, ROUGE, METEOR: Used for evaluating generated text in tasks like machine translation and summarization.
o Bias and Fairness: Language models can inherit biases present in the training data.
N-gram
    N-gram can be defined as the contiguous sequence of n items from a given sample of
text or speech. The items can be letters, words, or base pairs according to the application.
The N-grams typically are collected from a text or speech corpus (A long text dataset).
For instance, N-grams can be unigrams like (“This”, “article”, “is”, “on”, “NLP”) or bigrams like (“This article”, “article is”, “is on”, “on NLP”).
Given the previous N−1 words, an N-gram model predicts the most likely word to follow the sequence.
The N-gram model is a probabilistic language model trained on a collection of text (a corpus). It is useful in
applications such as speech recognition and machine translation.
The N-gram language model is about finding probability distributions over sequences of words. Consider
the sentences "There was heavy rain" and "There was heavy flood". The N-gram language model tells us that
"heavy rain" occurs more frequently than "heavy flood". So, the first sentence is more likely and will be
selected by the model.
 In a unigram (1-gram) model, the prediction relies only on how often each word occurs, without considering context.
 In a bigram (2-gram) model, only the previous word is considered when predicting the current word.
Advantages of N-grams
1. Simplicity: N-grams are intuitive and relatively simple to understand and implement.
2. Low Memory Usage: They require minimal memory for storage compared to more complex
models.
Limitations of N-grams
1. Limited Context: N-grams have a finite context window, which means they cannot capture long-range dependencies between words.
An N-gram model assigns probabilities to sequences of words in a language. A well-crafted N-gram model can effectively predict the
next word in a sentence, which is essentially determining the value of p(w∣h), where h is the history (the preceding words).
Let's explore how to predict the next word in a sentence. We need to calculate p(w|h),
where w is the candidate for the next word. Consider the sentence 'This article is on…'. If
we want to calculate the probability of the next word being “NLP”, the probability can be
expressed as:
p(“NLP”∣“This”,“article”,“is”,“on”)
To generalize, the conditional probability of the fifth word given the first four can be
written as:
p(w5∣w1,w2,w3,w4), or in general, p(wn∣w1,w2,…,wn−1)
Applying the chain rule of probability:
P(w1,w2,…,wn) = P(w1) P(w2∣w1) P(w3∣w1,w2) … P(wn∣w1,w2,…,wn−1)
This yields:
P(w1,w2,w3,…,wn)=∏i P(wi∣w1,w2,…,wi−1)
By applying Markov assumptions, which propose that the future state depends only on the
current state and not on the sequence of events that preceded it, we simplify the formula:
P(wi∣w1,w2,…,wi−1)≈P(wi∣wi−k,…,wi−1)
For a unigram model, this reduces to:
P(w1,w2,…,wn) ≈ ∏i P(wi)
For a bigram model:
P(wi∣w1,w2,…,wi−1) ≈ P(wi∣wi−1)
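To make the bigram factorization concrete, here is a minimal Python sketch that scores two sentences as a product of bigram probabilities; the probability values and the <s> start marker are made-up assumptions for illustration.

# Minimal sketch: scoring sentences with a bigram model (illustrative probabilities).
bigram_p = {
    ("<s>", "there"): 0.5,
    ("there", "was"): 0.8,
    ("was", "heavy"): 0.3,
    ("heavy", "rain"): 0.6,
    ("heavy", "flood"): 0.1,
}

def sentence_prob(words, probs):
    """P(w1..wn) ~= product of P(wi | wi-1), starting from the <s> marker."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= probs.get((prev, w), 0.0)  # unseen bigrams get probability 0 here (no smoothing)
        prev = w
    return p

print(sentence_prob(["there", "was", "heavy", "rain"], bigram_p))   # 0.072
print(sentence_prob(["there", "was", "heavy", "flood"], bigram_p))  # 0.012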
A further issue is out-of-vocabulary (OOV) words: words that appear during testing but not in training. One solution
is to use a fixed vocabulary and then convert out-of-vocabulary words in the training data
to pseudowords (e.g., <UNK>).
When applied to sentiment analysis, the bigram model outperformed the unigram model.
So, scaling the N-gram model to larger datasets or moving to higher-order N-grams requires better
feature selection approaches. The N-gram model captures long-distance context poorly: it has been shown
that beyond 6-grams, the performance gain is limited.
2. Language Model Evaluation
     Language Model (LM) evaluation is a critical aspect of Natural Language Processing (NLP)
that involves assessing the performance, quality, and effectiveness of language models.
Language models are designed to understand, generate, and manipulate human language, and
their evaluation ensures they meet desired standards for specific tasks.
 Neural Language Models: Modern models like RNNs, LSTMs, Transformers (e.g.,
GPT, BERT).
 Pre-trained Language Models: Models like GPT, BERT, T5, and others fine-tuned for
specific tasks.
2. Evaluation Metrics
The choice of evaluation metrics depends on the task the language model is designed for.
 Text Generation
 Text Classification
 Machine Translation
 Question Answering
 Summarization
 Language Understanding
1. Intrinsic Evaluation
2. Extrinsic Evaluation
3. Qualitative Evaluation
4. Bias and Fairness Evaluation
5. Efficiency Evaluation
6. Robustness Evaluation
         7. Long-Term Evaluation
1. Intrinsic Evaluation
Intrinsic evaluation measures the performance of the model based on its ability to model or generate text on its own, independent of any downstream task.
Common Methods:
 Perplexity: Measures how well a model predicts a sample. Perplexity is the inverse
probability of the test set normalized by the number of words. A lower perplexity
indicates a better model:
Perplexity = ( ∏i 1/P(wi) )^(1/N)
where P(wi) is the probability of the i-th word in the sequence, and N is the total
number of words.
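A minimal sketch of the perplexity computation from per-word probabilities; the probability values are assumed purely for illustration.

import math

# Probabilities a model assigned to each word of a test sequence (illustrative values).
word_probs = [0.2, 0.1, 0.25, 0.05, 0.15]

# Perplexity = (prod_i 1/P(wi))^(1/N), computed in log space for numerical stability.
N = len(word_probs)
log_prob = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_prob / N)
print(perplexity)  # lower perplexity means the model predicts the sample better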
 BLEU (Bilingual Evaluation Understudy): Compares n-grams of the generated text with reference text, evaluating the overlap.
 ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Mainly used for text
summarization tasks. It measures the overlap between n-grams in the generated and
reference summaries.
 F1 Score: The harmonic mean of precision and recall, especially important when classes are
imbalanced, e.g., in named entity recognition (NER).
 Accuracy and Loss: In tasks like text classification, the accuracy (percentage of
correct predictions) and loss (the difference between predicted probabilities and the true labels) are reported.
2. Extrinsic Evaluation
Extrinsic evaluation assesses how well the model performs on a specific task or application,
providing real-world relevance. This kind of evaluation uses the model as a component of a
larger system and measures its impact on the overall system's performance.
Common Methods:
 Human Evaluation: Human annotators may assess the quality of generated outputs. This can include
fluency, relevance, and correctness.
 User Impact: The model's value is also measured by how it impacts users or their workflows, such as speed, accuracy, and
user satisfaction.
3. Qualitative Evaluation
Qualitative evaluation involves analyzing the behavior of the language model in more
subjective terms. This helps assess the model's ability to handle diverse inputs, such as
ambiguous, creative, or domain-specific text.
Common Methods:
 Error Analysis: Reviewing specific model errors to understand its weaknesses. This can reveal problems with grammar, factual accuracy, or
coherence.
 Human Judgments: Experts or users assess outputs based on factors like relevance, fluency, and coherence.
 Interpretability: Understanding why a model makes certain decisions is important. Tools like LIME can help explain individual predictions.
4. Bias and Fairness Evaluation
Bias evaluation is crucial to ensure that models do not perpetuate or amplify harmful
stereotypes. Language models can inherit societal biases from the data they are trained on,
which may lead to biased outputs in tasks such as sentiment analysis, gender prediction, and text generation.
 Bias Detection in Outputs: Evaluating model outputs for biased language, such as
gender, racial, or cultural bias. This is especially important when the model is deployed in user-facing applications.
 Fairness Metrics: These are metrics designed to ensure that a model's predictions
do not unfairly favor one group over another. Metrics like demographic parity and equalized odds are commonly used.
5. Efficiency Evaluation
Given the resource-intensive nature of many large-scale language models (e.g., GPT-4),
efficiency evaluation measures the computational cost of training and running the model.
Common Methods:
 Inference Speed: Evaluating how fast the model can generate predictions or process input text.
 Memory and Computational Cost: Measuring how much memory and computational
resources the model consumes, which is vital when deploying models at scale or on
edge devices.
6. Robustness Evaluation
Robustness refers to how well a model performs in the face of noisy or adversarial inputs.
Common Methods:
 Out-of-Distribution (OOD) Testing: Evaluating the model on inputs that differ from
the training data distribution. This helps assess how well the model generalizes to
unseen scenarios.
7. Long-Term Evaluation
In real-world applications, models are often deployed over time and must be continuously
monitored and re-evaluated as data and user behavior change in dynamic environments.
Common Methods:
 A/B Testing: Running multiple model versions concurrently and comparing their performance with real users.
Common Metrics
1. Perplexity:
o Measures how well the model predicts a held-out sample; lower is better.
2. BLEU:
o Measures the overlap between generated text and reference text using n-grams.
3. ROUGE:
o Recall-oriented n-gram overlap, mainly used for summarization.
4. METEOR:
o Considers synonyms, stemming, and word order.
5. Accuracy:
o Percentage of correct predictions.
6. F1 Score:
o Balances precision and recall, useful for tasks like named entity recognition (NER).
7. Human Evaluation:
o Human judges rate the quality of generated text.
o Subjective but essential for tasks like dialogue systems and creative text generation.
8. BERTScore:
o Uses contextual embeddings to measure semantic similarity between generated and reference text.
9. Task-Specific Metrics:
o For example, Exact Match (EM) for question answering or CIDEr for image captioning.
 Bias and Fairness: Models may exhibit biases present in training data, leading to unfair or harmful outputs in real-world
scenarios and downstream systems.
 Common Crawl, Wikipedia, and BookCorpus: Often used for pre-training language
models.
 Ethical Evaluation: Assessing models for fairness, bias, and ethical considerations.
 Multimodal Evaluation: Evaluating models that process both text and other modalities (e.g., images or audio).
3. Parameter Estimation
Parameter estimation in language modeling is a fundamental task in Natural Language Processing (NLP).
A language model assigns probabilities to sequences of words. For example, given a sequence of words w1, w2, …, wn, the model estimates:
P(w1,w2,…,wn)
This probability can be used for tasks like text generation, machine translation, and speech
recognition.
 N-gram Models: These models estimate probabilities based on the frequency of word sequences (n-grams) in the training data.
 Neural Language Models: These use neural networks (e.g., RNNs, LSTMs, Transformers) to estimate the probabilities.
Parameter Estimation
Parameter estimation involves learning the model's parameters from training data. The
goal is to maximize the likelihood of the observed data under the model.
Maximum Likelihood Estimation (MLE)
MLE is a method for estimating the parameters of a statistical model by maximizing the
likelihood of the observed data. In language modeling, MLE is used to estimate the
probabilities of words or n-grams from a training corpus.
 For n-gram models, MLE estimates probabilities by counting n-gram frequencies in the
training data:
P(wn | w1, …, wn−1) = Count(w1, …, wn) / Count(w1, …, wn−1)
 Count: The number of times the n-gram or (n−1)-gram appears in the training data.
Example
Suppose we have a bigram model (n=2) and the following training corpus:
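As an illustration, MLE for a bigram model reduces to counting; the toy corpus and function names below are assumptions, not taken from the notes.

from collections import Counter

# Toy training corpus (assumed for illustration).
sentences = ["I like NLP", "I like deep learning", "I love NLP"]
tokens = [s.split() for s in sentences]

unigram_counts = Counter(w for sent in tokens for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def p_mle(prev, word):
    """MLE estimate: Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "like"))     # 2/3: "I like" occurs twice, "I" occurs three times
print(p_mle("like", "NLP"))   # 1/2
print(p_mle("love", "deep"))  # 0.0 -> the zero-probability problem for unseen bigrams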
Limitations of MLE
 Zero Probability Problem: If an n-gram does not appear in the training data, its estimated probability is zero, which makes any sequence containing it impossible under the model.
 Overfitting: MLE tends to overfit to the training data, especially for rare n-grams.
Smoothing Techniques
Smoothing techniques adjust the MLE estimates to account for unseen or rare n-grams. Here are some common smoothing methods:
1. Laplace (Add-One) Smoothing
 Adds 1 to every count so that no n-gram has zero probability:
P(wn | wn−1) = (Count(wn−1, wn) + 1) / (Count(wn−1) + V)
Example:
 Using the same corpus as above, suppose the vocabulary size V=6.
 Limitation: Adds too much probability mass to unseen events, which can distort
the distribution.
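A minimal sketch of add-one (Laplace) smoothing on bigram counts; the toy corpus is an assumption and happens to have a vocabulary of size V = 6.

from collections import Counter

sentences = ["I like NLP", "I like deep learning", "I love NLP"]
tokens = [s.split() for s in sentences]
unigram_counts = Counter(w for sent in tokens for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
V = len(unigram_counts)  # vocabulary size (6 for this toy corpus)

def p_laplace(prev, word):
    """Add-one smoothing: (Count(prev, word) + 1) / (Count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("love", "deep"))  # no longer zero, unlike the MLE estimate
print(p_laplace("I", "like"))     # 1/3, lower than the MLE value of 2/3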
2. Good-Turing Smoothing
 Re-estimates the count of n-grams seen c times using the number of n-grams seen c+1 times: c* = (c+1) · N(c+1) / N(c), where N(c) is the number of distinct n-grams occurring exactly c times.
Example:
 If there are 10 bigrams that appear once (N1=10) and 5 bigrams that appear twice (N2=5), the adjusted count for once-seen bigrams is c* = (1+1) · 5 / 10 = 1.
Comparison of smoothing methods:
Method  | Idea                  | Advantage                   | Limitation
Laplace | Adds 1 to all counts. | Handles zero probabilities. | Overestimates the probability of rare events.
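A minimal sketch of the Good-Turing count adjustment c* = (c+1)·N(c+1)/N(c), using the frequency-of-frequency numbers from the example above.

# N[c] = number of distinct bigrams that occur exactly c times (from the example above).
N = {1: 10, 2: 5}

def good_turing_count(c, N):
    """Adjusted count: c* = (c + 1) * N(c + 1) / N(c)."""
    return (c + 1) * N.get(c + 1, 0) / N[c]

print(good_turing_count(1, N))  # (1 + 1) * 5 / 10 = 1.0 for bigrams seen once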
Bayesian Estimation
Bayesian estimation is a method for estimating the parameters of a model by incorporating prior knowledge and updating it with observed data.
Unlike Maximum Likelihood Estimation (MLE), which only relies on the observed data,
Bayesian estimation uses Bayes' Theorem to combine prior beliefs with evidence from the
data. This approach is particularly useful in language modeling when dealing with limited or
sparse data.
Key Concepts in Bayesian Parameter Estimation
Bayes' Theorem
Bayesian estimation is based on Bayes' Theorem, which describes how to update our belief about the parameters θ after observing data D:
P(θ | D) = P(D | θ) P(θ) / P(D)
Prior Distribution
The prior represents our initial belief about the parameters before observing any data. For
example:
      In language modeling, we might assume that all n-grams are equally likely (uniform
       prior).
Posterior Distribution
The posterior combines the prior and the likelihood to provide an updated estimate of the
parameters. It represents our belief about the parameters after observing the data.
Predictive Distribution
Once the posterior is computed, we can use it to make predictions about new data:
Bayesian Parameter Estimation in Language Modeling
 In n-gram models, the parameters are multinomial distributions over words or n-grams.
      A common choice for the prior is the Dirichlet distribution, which is the conjugate
       prior for the multinomial distribution. This means the posterior will also be a Dirichlet
       distribution, making computations tractable.
P(θ)=Dirichlet(θ; α)
Posterior Distribution
 With a Dirichlet(α) prior and observed counts c = (c1, …, cV), the posterior is Dirichlet(α + c).
Predictive Probability
 The predictive probability of the next word w is (c_w + α_w) / (N + Σk αk), where N is the total observed count.
Example:
             Suppose we have a bigram model and a vocabulary of size V=3 (words: A, B, C).
       We use a Dirichlet prior with α=(1,1,1) (uniform prior).
Observed Data
ABAC
BAC
Predictive Probability
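The sketch below works this example through under the Dirichlet(1,1,1) prior, computing the posterior predictive probability of each word following "A"; the function names and the choice of context are illustrative assumptions.

from collections import Counter

vocab = ["A", "B", "C"]
alpha = {w: 1.0 for w in vocab}  # Dirichlet(1,1,1): uniform prior

# Observed sequences from the example above.
sequences = [["A", "B", "A", "C"], ["B", "A", "C"]]
bigram_counts = Counter((s[i], s[i + 1]) for s in sequences for i in range(len(s) - 1))

def predictive_prob(prev, word):
    """Posterior predictive: (count(prev, word) + alpha_word) / (count(prev, anything) + sum(alpha))."""
    total = sum(bigram_counts[(prev, w)] for w in vocab)
    return (bigram_counts[(prev, word)] + alpha[word]) / (total + sum(alpha.values()))

for w in vocab:
    print(f"P({w} | A) = {predictive_prob('A', w):.3f}")  # 1/6, 1/3, 1/2 -- they sum to 1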
Model Architecture
Pretraining Objective
      Autoregressive Models (e.g., GPT): Predict the next word in a sequence given the
       previous words.
The training loss is the negative log-likelihood:
L = −∑(i=1..N) log P(wi ∣ w<i)
 Masked Language Models (e.g., BERT): Predict masked tokens given the surrounding context, with loss:
L = −∑(i∈masked) log P(wi ∣ wcontext)
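A small sketch of how these objectives are computed from the model's per-token probabilities; the probability values and masked positions are assumed for illustration.

import math

# Probabilities the model assigned to each target token given its context (assumed values).
p_tokens = [0.4, 0.1, 0.3, 0.05]

# Autoregressive objective: L = -sum_i log P(w_i | w_<i)
autoregressive_loss = -sum(math.log(p) for p in p_tokens)
print(autoregressive_loss)

# The masked-LM objective has the same form, summed only over the masked positions,
# with each probability conditioned on the surrounding (bidirectional) context.
masked_positions = [1, 3]
masked_loss = -sum(math.log(p_tokens[i]) for i in masked_positions)
print(masked_loss)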
Optimization
      Adaptive Optimizers: Techniques like Adam or AdamW are commonly used to handle
       large-scale optimization efficiently.
      Learning Rate Scheduling: Dynamic adjustment of the learning rate during training
       (e.g., warmup followed by decay).
Distributed Training
      Mixed Precision Training: Use lower precision (e.g., FP16) to speed up computation and
       reduce memory usage.
Computational Resources
      Memory: Large models require significant memory for storing parameters and
       intermediate activations.
Data Requirements
Overfitting
      Despite the large scale, models can still overfit to the training data, especially in fine-
       tuning.
Scalability
  Pre-trained language models are powerful, but they may not perform optimally in specific
  scenarios due to:
        Domain Mismatch: The pre-training corpus may differ significantly from the target
         domain (e.g., medical, legal, or technical text).
        Data Scarcity: The target domain or task may have limited labeled data, making it
         difficult to train a model from scratch.
Fine-Tuning
 How it works:
            o   Train the model on the target dataset using a task-specific loss function (e.g.,
                cross-entropy for classification).
 Advantages:
 Challenges:
Domain Adaptation
 What it is: Adapting a pre-trained model to a specific domain (e.g., medical, legal, or
financial text).
 Techniques:
Task-Specific Adaptation
     What it is: Adapting a pre-trained model to a specific task (e.g., sentiment analysis,
      machine translation).
 Techniques:
Parameter-Efficient Adaptation
     What it is: Adapting a model with minimal changes to its parameters to reduce
      computational cost.
 Techniques:
         o   Prompt Tuning: Modify the input prompt to guide the model's behavior
             without changing its parameters.
 Advantages:
         o   Enables adaptation for multiple tasks without retraining the entire model.
Few-Shot and Zero-Shot Adaptation
 What it is: Adapting a model to a new task with very few or no labeled examples.
 Techniques:
          o   Prompt Engineering: Design input prompts to elicit desired outputs from the
              model.
          o   Meta-Learning: Train the model to quickly adapt to new tasks with minimal
              data.
Overfitting
Catastrophic Forgetting
 Solution: Use techniques like elastic weight consolidation (EWC) or replay buffers.
Computational Cost
Data Scarcity
Domain-Specific Applications
      Medical NLP: Adapting models to clinical text for tasks like diagnosis prediction or
       medical entity recognition.
      Legal NLP: Adapting models to legal documents for tasks like contract analysis or
       case law summarization.
      Financial NLP: Adapting models to financial reports for tasks like sentiment analysis
       or risk assessment.
Task-Specific Applications
Multilingual Adaptation
   1. Word Classification: Words are grouped into classes based on shared characteristics.
      For example, all days of the week might be grouped into a single class, or all verbs might
      be grouped into another class.
   2. Probability Estimation: Instead of estimating the probability of a word given its history
      (as in traditional n-gram models), the model estimates the probability of a word class
      given its history, and then the probability of the word given its class.
   3. Smoothing: By grouping words into classes, the model can better handle rare or unseen
      words, as the probability of a class is often more stable than the probability of an
      individual word.
              Let's consider a simple example where we have a small vocabulary and we want to
      build a class-based bigram model.
Vocabulary:
 Classes:
Example Calculation:
P(Class 2 | Class 1) = 0.7 (e.g., the probability that an action follows an animal)
             P("runs" | Class 2) = 0.4 (e.g., the probability of the word "runs" given that we are
         in the action class)
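Combining these factors, a class-based bigram estimates P(word_i | word_{i-1}) ≈ P(class_i | class_{i-1}) · P(word_i | class_i). Below is a minimal sketch using the probabilities above; the class names, the extra vocabulary entries, and the dictionary layout are illustrative assumptions.

# Class transition and word emission probabilities (only the 0.7 and 0.4 values come from the example above).
p_class_given_class = {("Class1", "Class2"): 0.7}   # e.g. the action class following the animal class
p_word_given_class = {("Class2", "runs"): 0.4}

word_class = {"cat": "Class1", "dog": "Class1", "runs": "Class2", "sleeps": "Class2"}

def class_bigram_prob(prev_word, word):
    """P(word | prev_word) ~= P(class(word) | class(prev_word)) * P(word | class(word))."""
    c_prev, c_cur = word_class[prev_word], word_class[word]
    return (p_class_given_class.get((c_prev, c_cur), 0.0)
            * p_word_given_class.get((c_cur, word), 0.0))

print(class_bigram_prob("cat", "runs"))  # 0.7 * 0.4 = 0.28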
Generating a Sentence:
   1. Dynamic Input/Output Handling: They can process sequences of any length, making
      them adaptable to different tasks.
How It Works:
   1. Input Representation: The input text is tokenized into subwords or words, and each
      token is converted into an embedding vector.
   3. Self-Attention Mechanism: The model computes attention scores between all pairs of
      tokens in the sequence, allowing it to capture dependencies regardless of their distance.
   4. Output Generation: For tasks like text generation, the model predicts the next token
      in the sequence iteratively, allowing it to generate sequences of arbitrary length.
   3. Append the new token to the sequence and repeat the process until a stopping condition
      is met (e.g., reaching a maximum length or generating an end-of-sequence token).
The model can continue generating text indefinitely, demonstrating its ability to handle
variable-length sequences.
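A minimal sketch of such a generation loop; predict_next_token stands in for a real trained model, and its canned outputs are a dummy assumption.

def predict_next_token(tokens):
    """Stand-in for a real model: returns a next token (dummy lookup for illustration)."""
    canned = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    return canned.get(tokens[-1], "<eos>")

def generate(prompt_tokens, max_length=10, eos="<eos>"):
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:        # stopping condition: maximum length
        nxt = predict_next_token(tokens)   # predict the next token from the sequence so far
        if nxt == eos:                     # stopping condition: end-of-sequence token
            break
        tokens.append(nxt)                 # append the new token and repeat
    return tokens

print(generate(["the"]))  # keeps extending the sequence until a stopping condition is met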
1. Task-Specific: They are designed for specific tasks like classification or prediction.
  3. No Text Generation: They do not generate new text but instead predict labels or
     categories for input text.
        o   A simple discriminative model that predicts the probability of a class label given
            a text input.
o Commonly used for sequence labeling tasks like named entity recognition (NER).
        o   Example: Fine-tuning BERT to classify news articles into categories like "sports,"
            "politics," or "technology."
 Output: "Positive"
       Syntax-based language models are a type of language model that incorporates syntactic
structure (grammar rules, sentence structure, etc.) into their predictions. These models go
beyond simple word sequence prediction and consider the grammatical relationships between
words in a sentence. They are particularly useful for tasks like parsing, grammar correction,
and generating syntactically correct sentences.
How Syntax-Based Language Models Work
These models can be integrated into neural networks or used in rule-based systems.
Grammar Rules:
1. S → NP VP
2. NP → Det N
3. VP → V NP
4. Det → "the"
5. N → "cat" | "dog"
6. V → "chased" | "ate"
Sentence Generation:
Using the above rules, the model can generate sentences like "the cat chased the dog" or "the dog ate the cat".
Parsing:
 Parse Tree:
                                    S
                                 /     \
                                NP      VP
                               / \    /    \
                            Det N V          NP
                             |   | |         / \
                           the cat chased Det N
                                           |     |
                                          the dog
       Maximum Entropy (MaxEnt) language models, also known as logistic regression models,
are a type of statistical model used in natural language processing (NLP) to predict the
probability of a sequence of words. These models are based on the principle of maximum
entropy, which states that the best model is the one that makes the least assumptions about
the data while still being consistent with the observed data.
   1. Feature-Based: MaxEnt models use features extracted from the input data to make
       predictions. These features can be anything from the presence of specific words to
       more complex linguistic patterns.
   2. Discriminative: Unlike generative models (e.g., n-gram models), MaxEnt models are
       discriminative, meaning they directly model the conditional probability P(y∣x), where y is
the output (e.g., the next word) and x is the input (e.g., the previous words).
   3. Flexible: MaxEnt models can incorporate a wide variety of features, making them highly
       flexible and capable of capturing complex relationships in the data.
       Suppose we want to build a MaxEnt model to predict the next word in a sentence. Let's
consider a simple example where we want to predict the next word after the sequence "I want
to".
We define a set of features that might be useful for predicting the next word. For example:
       We collect a corpus of sentences and extract the features for each instance where the
sequence "I want to" appears. For each instance, we note the next word (the target) and the
features.
       We train the MaxEnt model using the collected data. The model learns the weights for
each feature, which indicate how important each feature is for predicting the next word.
   Once the model is trained, we can use it to predict the next word given a new sequence. For
example, given the sequence "I want to", the model might predict:
 etc.
Mathematical Formulation:
The model defines P(y | x) = exp( ∑i λi fi(x, y) ) / Z(x), where fi(x, y) are feature functions, λi are their learned weights, and Z(x) is a normalization constant summing over all candidate outputs y.
      Efficiency: They are computationally efficient to train and use, especially with large
       datasets.
      Interpretability: The weights learned by the model can provide insights into the
       importance of different features.
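A minimal sketch of the MaxEnt prediction step for the "I want to" example: features are extracted for each candidate next word, weighted, and normalized. The particular features and weights are made-up assumptions.

import math

def features(context, candidate):
    """Binary feature functions f_i(x, y); these particular features are illustrative."""
    return {
        "prev_word=" + context[-1] + "&next=" + candidate: 1.0,
        "prev_bigram=" + " ".join(context[-2:]) + "&next=" + candidate: 1.0,
    }

# Weights (lambda_i) that a trained model might have learned (assumed values).
weights = {
    "prev_word=to&next=eat": 1.2,
    "prev_word=to&next=sleep": 0.8,
    "prev_bigram=want to&next=eat": 0.9,
}

def maxent_probs(context, candidates):
    """P(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)."""
    scores = {y: math.exp(sum(weights.get(f, 0.0) * v for f, v in features(context, y).items()))
              for y in candidates}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(maxent_probs(["I", "want", "to"], ["eat", "sleep", "go"]))  # "eat" receives the highest probability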
  Factored language models are a type of statistical language model that incorporate
additional structure or factors beyond the standard n-gram models. These factors can include
morphological, syntactic, semantic, or even contextual information to improve the model's
ability to predict the next word in a sequence.
       Let's consider a simple example where we want to predict the next word in a sentence
using a factored language model that incorporates both word forms and their part-of-speech
(POS) tags.
Sentence: "The cat sat on the mat."
Step 1: Tokenization and POS Tagging
First, we tokenize the sentence and assign POS tags to each word:
 The (DT)
 cat (NN)
 sat (VBD)
 on (IN)
 the (DT)
 mat (NN)
          . (.)
Step 2: Define Factors
1. Word Form: The surface form of the word (e.g., "cat", "sat").
2. POS Tag: The part of speech of the word (e.g., "NN", "VBD").
Step 3: Build Factored N-grams
 Instead of just using the word forms to create n-grams, we use both the word
forms and their POS tags. For example, a bigram model would consider pairs of (word
form, POS tag) sequences.
Step 4: Training
 We train the model on a corpus where each word is represented as a combination of its
form and POS tag. For example, the sentence "The cat sat on the mat." would be represented
as:
 (The, DT)
 (cat, NN)
 (sat, VBD)
 (on, IN)
 (the, DT)
 (mat, NN)
 (., .)
Step 5: Prediction
         When predicting the next word, the model considers both the previous word's form and
its POS tag. For example, if the previous word was "the" (DT), the model might predict that
the next word is likely to be a noun (NN) like "cat" or "mat".
      2. Better Handling of Rare Words: POS tags can help the model make better predictions
         even for rare words by leveraging their syntactic role.
      3. Contextual Understanding: Factors like POS tags provide syntactic context, which can
         improve the model's understanding of sentence structure.
Example Prediction
 (The, DT)
 (cat, NN)
The model might predict that the next word is likely to be a verb (VBD) like "sat" or "jumped",
based on the learned patterns from the training data.
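A minimal sketch of a factored bigram score that combines a POS-transition factor with a word-given-POS factor; the counting scheme and scoring function are illustrative assumptions, not a full factored model.

from collections import Counter

# Training sentence represented as (word form, POS tag) pairs, as above.
sentence = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
            ("the", "DT"), ("mat", "NN"), (".", ".")]

# Count factored bigrams: POS -> next POS transitions, and POS -> word emissions.
pos_bigrams = Counter((sentence[i][1], sentence[i + 1][1]) for i in range(len(sentence) - 1))
word_given_pos = Counter((tag, word.lower()) for word, tag in sentence)

def next_word_score(prev_tag, word, tag):
    """Unnormalized score combining the POS-transition and word-given-POS factors."""
    return pos_bigrams[(prev_tag, tag)] * word_given_pos[(tag, word.lower())]

# After "the" (DT), nouns like "cat" or "mat" score higher than a verb such as "sat".
print(next_word_score("DT", "cat", "NN"), next_word_score("DT", "sat", "VBD"))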
       Tree-based language models are a class of models that leverage hierarchical structures
(trees) to represent and process language. Unlike sequential models like RNNs or
Transformers, tree-based models explicitly capture syntactic or semantic relationships in a
hierarchical manner. Below are some examples of tree-based language models and their
applications:
      Description: Recursive Neural Networks process data in a tree structure, where each
       node represents a word or phrase, and the model recursively combines child nodes to
       form parent nodes.
 Example:
2. Tree-LSTMs
 Example:
 Example:
     Description: These models use variational inference to learn latent tree structures from
      text data, often in an unsupervised manner.
 Example:
 Example:
     Description: These models are specifically designed for programming languages, where
      the input is represented as an Abstract Syntax Tree (AST).
 Example:
      Description: These models combine neural networks with symbolic reasoning, often using
       tree structures to represent logical forms or programs.
 Example:
          o   Semantic Parsing: Mapping natural language to executable logical forms (e.g., SQL
              queries).
       One popular approach is the Latent Dirichlet Allocation (LDA) model, which is a
generative probabilistic model that assumes documents are generated from a mixture of
topics. However, LDA itself is not a language model. To create a topic-based language model,
LDA can be combined with traditional language models like n-grams or neural language models.
1. Input: A collection of documents (the text corpus).
2. Output: A set of topics, where each topic is a distribution over words, and each
document is a distribution over topics.
We can use a traditional language model (e.g., a bigram or trigram model) to capture the
syntactic structure of the text. However, instead of using a single language model for the
entire corpus, we can create a separate language model for each topic.
For example:
 Language Model for Topic 2: P("game" | "sports") = 0.4, P("player" | "game") = 0.3, etc.
To generate text, we first sample a topic from the document-topic distribution. Then, we use
the language model associated with that topic to generate words.
For example:
o And so on.
The generated text might look like: "science research experiment data..."
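A minimal sketch of this two-step generation procedure (sample a topic, then generate with that topic's bigram table); all of the distributions below are made-up assumptions.

import random

# Document-topic distribution and per-topic bigram successor tables (illustrative values only).
doc_topic_probs = {"science": 0.7, "sports": 0.3}
topic_bigrams = {
    "science": {"<s>": ["science"], "science": ["research"], "research": ["experiment"],
                "experiment": ["data"], "data": ["</s>"]},
    "sports":  {"<s>": ["sports"], "sports": ["game"], "game": ["player"], "player": ["</s>"]},
}

def generate(doc_topic_probs, topic_bigrams, max_len=10):
    # Step 1: sample a topic from the document-topic distribution.
    topic = random.choices(list(doc_topic_probs), weights=list(doc_topic_probs.values()))[0]
    # Step 2: generate words using that topic's bigram table.
    words, prev = [], "<s>"
    while len(words) < max_len:
        nxt = random.choice(topic_bigrams[topic].get(prev, ["</s>"]))
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt
    return topic, " ".join(words)

print(generate(doc_topic_probs, topic_bigrams))  # e.g. ('science', 'science research experiment data')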
Bayesian Framework
In a Bayesian framework, we can place priors on the parameters of the topic models and
language models. For example:
 A Dirichlet prior can be placed on the document-topic distributions.
 A Dirichlet prior can also be placed on the word distributions within each topic.
This allows us to perform Bayesian inference to estimate the posterior distributions of the
topics and language model parameters given the observed data.
      Neural Network Language Models (NNLMs) are a class of language models that use
neural networks to predict the probability of a sequence of words. They have become the
foundation of modern natural language processing (NLP) due to their ability to capture complex
patterns in language data. Below is an explanation of NNLMs, along with an example.
      A Neural Network Language Model is a model that uses neural networks to estimate the
probability distribution of words in a sequence. It learns to predict the next word in a sequence
given the previous words. NNLMs are trained on large text corpora and can capture syntactic
and semantic relationships between words.
   2. Hidden Layers: Learn patterns and relationships in the data (e.g., Recurrent Neural
      Networks (RNNs), Long Short-Term Memory (LSTM), or Transformers).
   3. Output Layer: Produces a probability distribution over the vocabulary for the next
      word.
Let's consider a simple example of predicting the next word in a sentence using a neural
network.
Input Sentence: "The cat sat on the ___"
Goal: Predict the most likely next word.
1. Tokenization:
           o   Break the sentence into tokens: ["The", "cat", "sat", "on", "the"].
  2. Word Embeddings:
        o   Convert each word into a dense vector representation (e.g., using Word2Vec,
            GloVe, or learned embeddings).
o Example: "cat" → [0.2, -0.5, 0.7], "sat" → [0.1, 0.3, -0.2], etc.
3. Sequence Modeling:
o Use a sequence model (e.g., RNN, LSTM, or Transformer) to process the sequence
of word embeddings.
4. Prediction:
        o   The model outputs a probability distribution over the vocabulary for the next
            word.
5. Output:
o The word with the highest probability is selected as the prediction (e.g., "mat").
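A minimal sketch of such a next-word predictor, assuming PyTorch is installed; the model is untrained, so its prediction is arbitrary, and the vocabulary and layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

# Tiny vocabulary and the context from the example above.
vocab = ["<pad>", "The", "cat", "sat", "on", "the", "mat"]
word2id = {w: i for i, w in enumerate(vocab)}

class TinyNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=16, hidden_dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                # input layer: word embeddings
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # hidden layer: sequence model
        self.out = nn.Linear(hidden_dim, vocab_size)                # output layer: scores over the vocabulary

    def forward(self, token_ids):
        hidden, _ = self.lstm(self.emb(token_ids))
        return self.out(hidden[:, -1, :])                           # logits for the next word

model = TinyNNLM(len(vocab))
context = torch.tensor([[word2id[w] for w in ["The", "cat", "sat", "on", "the"]]])
probs = torch.softmax(model(context), dim=-1)                       # probability distribution over the vocabulary
print(vocab[int(probs.argmax())])                                   # untrained, so this prediction is arbitrary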
o Early NNLMs used fixed-size context windows to predict the next word.
3. Transformers:
o   Thai: "กำลังไป" ("kamlang pai" = "is going") has no spaces between words.
6. Idioms and Cultural Context
 Example:
o Dutch: "De kat uit de boom kijken" ("watch the cat out of the tree" = hesitate).
 Example:
o   Navajo: "Shik'éí dóó shidine'é" ("my relatives and my people") has limited NLP
resources.
 Example:
 Example:
 Example:
o   Arabic: Modern Standard Arabic (MSA) vs. Egyptian Arabic "عايز" ("ʕāyez" =
"want").
         o   MRLs tend to have a large number of word forms due to rich inflectional systems
             (e.g., gender, number, case, tense, aspect, mood). This can result in a vocabulary
             explosion, making it difficult for traditional language models to handle all possible
             forms of a word.
o Example: In Turkish, the verb "gelmek" (to come) can be inflected in many forms:
 geliyorum (I am coming)
 geldim (I came)
         o   MRLs may have a high degree of ambiguity, where different word forms can look
             similar but have different meanings based on context, requiring better
             disambiguation.
o   Example: In Arabic, the word "كتب" (kataba) can mean "he wrote" or "he is writing,"
depending on tense, but without proper context, it can be ambiguous.
3. Morphological Parsing:
         o   MRLs require accurate morphological parsing to break down words into meaningful
             units (morphemes). This is essential for understanding word structure and
             generating meaningful outputs.
o Example:
4. Word Segmentation:
         o   Some MRLs, such as Chinese or Thai, do not use spaces between words, which
             requires segmentation as part of preprocessing. While this is less of an issue for
             languages like Turkish or Finnish, segmentation still plays a role in understanding
             compound words.
        o   Example: In Chinese, "我喜欢吃苹果" ("I like eating apples") must be correctly
            segmented into "我 喜欢 吃 苹果."
1. Subword Tokenization:
o   Example: For Turkish, using BPE might break "geliyorum" (I am coming) into
subwords like "gel" (root), "iyor" (present continuous suffix), and "um" (1st person
singular suffix).
2. Morphological Analysis:
        o   Preprocessing with morphological analyzers can help to identify root forms and
            affixes (prefixes, suffixes) and treat them as distinct components. This allows
            the model to learn and generalize better from these components rather than
            memorizing every word form.
3. Character-Level Modeling:
        o   Some models handle the task at the character level rather than word level. This
            can be especially useful for MRLs, where complex morphology means that the form
            of a word can change substantially while maintaining its core meaning.
o   MRLs often have words that don't appear in the training data. Subword
             tokenization and character-level models are particularly useful in mitigating OOV
             issues because they break words into smaller components that can be learned even
             if the full word wasn't seen during training.
         o   Example: A rare word like "kitapçık" (booklet) in Turkish might be split into
             subwords like "kitap" (book) and "çık" (diminutive suffix), allowing the model to
             handle it effectively.
      In natural language processing (NLP), subword units are smaller linguistic components
that are used to break down words into manageable pieces, especially for languages with
complex morphology or when dealing with rare or unseen words. The process of selecting
subword units is critical because it affects how a model learns to understand and generate
language. Subword units can be characters, syllables, or more complex units like word stems or
morphemes.
         o   Words that do not appear in the training set, especially rare or compound words,
             can be decomposed into subword units, reducing the impact of OOV issues.
         o   Example: For the word “unhappiness,” a subword tokenizer might break it down
             into subword units like “un,” “happi,” and “ness.”
         o   In languages with rich morphology (like Turkish or Finnish), words can have many
             inflections or derivations. Subword units help break these words into more basic
             units that can be processed effectively.
         o   Example: In Turkish, the word "kitaplarınızdan" (from your books) can be split
             into subword units like "kitap" (book), "lar" (plural suffix), "ınız" (your suffix), and
             "dan" (ablative case).
   3. Efficient Representation of Rare Words:
         o   Subword tokenization helps represent rare and compound words, even if they
             don't appear in the training corpus, by breaking them down into familiar subword
             units.
         o   Example: "Googleplex" might be split into "Goog" and "leplex," making it easier for
             the model to understand and process.
         o   BPE is a data-driven algorithm that iteratively merges the most frequent pair of
             characters or subword units into a new symbol. This approach is useful in managing
             both OOV words and rare word forms by breaking them into frequent subword
             units.
Steps in BPE:
1. Split every word in the corpus into individual characters.
2. Count all adjacent symbol pairs and merge the most frequent pair into a new symbol.
3. Repeat the merge step until a desired vocabulary size is reached.
Example:
1. Start with words such as "lower" and "newer" split into characters.
2. The most frequent pair of characters is "e" and "r", so they are merged into "er".
Result: Words like "lower" and "newer" may be represented as subword units like [l, o, w, er],
making it easier to process variations of "lower" and "newer."
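A minimal sketch of the BPE merge loop on a toy corpus ("lower", "newer", "wider" with assumed frequencies); real implementations add end-of-word markers and stop when a target vocabulary size is reached.

from collections import Counter

# Toy corpus: each word is split into characters and paired with its frequency (assumed values).
vocab = {("l", "o", "w", "e", "r"): 2, ("n", "e", "w", "e", "r"): 3, ("w", "i", "d", "e", "r"): 3}

def pair_counts(vocab):
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    new_vocab = {}
    for symbols, freq in vocab.items():
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[tuple(merged)] = freq
    return new_vocab

for _ in range(2):
    best = pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print("merged", best, "->", vocab)
# The first merge combines "e" and "r", so "lower" becomes [l, o, w, er], as in the result above.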
2. WordPiece:
      o   WordPiece is similar to BPE but with some modifications. It also merges subwords
          based on frequency, but it tries to maximize the likelihood of a sequence of
          subword units given a training corpus.
Example:
o   WordPiece starts with individual characters and progressively merges them into
subwords such as "un", "happi", and "ness", building a vocabulary of subword units
that maximizes the likelihood of the training corpus under the language model.
3. SentencePiece:
o   Treats the input as a raw character stream (spaces included), so it needs no pre-tokenization and works well for languages without explicit word boundaries.
Example:
4. Morpheme-based Tokenization:
o   Splits words into linguistically meaningful morphemes (roots and affixes) using a morphological analyzer.
Example:
1. English:
         o   The word “unhappiness” might be split into subwords like ["un", "happiness"] or
             ["un", "happi", "ness"] using BPE or WordPiece. This allows the model to handle
             variations like "happy," "happier," and "unhappy" by focusing on the core subwords.
2. German:
         o   German has compound words, and a simple tokenization approach might result in
             long words that are difficult for the model to process. A subword model would
             break down a word like "Donaudampfschifffahrtsgesellschaftskapitän" (Danube
             steamship company captain) into manageable subwords like ["Donau", "dampf",
             "schiff", "fahrt", "gesellschaft", "skapitän"].
3. Turkish:
         o   Turkish has agglutination, where words are built from a root plus multiple affixes.
             For example, the word "kitaplarınızdan" (from your books) can be broken into
             subwords: ["kitap" (book), "lar" (plural), "ınız" (your), "dan" (ablative)].
4. Chinese:
         o   In languages like Chinese, where there are no spaces between words, subword
             tokenization (like SentencePiece) is used to split sentences into meaningful
             subword units. For example, "我喜欢吃苹果" (I like eating apples) might be split
             into ["我", "喜欢", "吃", "苹果"].
Modeling morphological categories means enabling models to understand and represent the different forms and structures of words that convey
grammatical information, such as tense, case, gender, number, and other linguistic features.
This is especially important for morphologically rich languages, where a single word can have many forms based on its morphological structure.
Capturing these categories allows language models to better represent the meaning and grammatical function of words, improving
performance on tasks like machine translation, text generation, and speech recognition.
Key Morphological Categories in NLP:
1. Tense (Indicates when an action takes place: past, present, or future):
o Example: In English, the verb "to walk" can be inflected to show tense: "walk," "walked," "walking."
o In Russian, verbs have different forms based on past, present, and future tense.
2. Case (Indicates the grammatical role of a noun or pronoun):
o Example: In German, nouns change form depending on the case: "der Hund"
(nominative, the dog), "den Hund" (accusative, the dog), "dem Hund" (dative, to the dog).
3. Gender (Classifies nouns as masculine, feminine, or neuter):
o Example: In Spanish, nouns have gender: "el libro" (the book, masculine) and "la casa" (the house, feminine).
4. Number (Indicates whether a word is singular or plural):
o Example: In English, "cat" is singular, while "cats" is plural. In Arabic, the plural
form can be regular (e.g., "كتاب" → "كتب" for books) or a broken plural.
5. Person (Indicates who performs the action: first, second, or third person):
o Example: In French, the verb changes according to the person: "je mange" (I eat), "tu manges" (you eat).
6. Mood (Indicates the attitude of the speaker toward the action: indicative, imperative,
subjunctive, etc.):
o Example: In Spanish, the subjunctive mood is used in sentences like "Espero que vengas" (I hope you come).
7. Aspect (Indicates whether an action is completed or ongoing):
o Example: In Russian, the verb "читать" (to read) has a perfective aspect
"прочитать" (to read completely) and an imperfective aspect "читать" (to read
regularly).
Methods for Modeling Morphological Categories in NLP
Morphological categories are often captured and modeled in NLP by breaking down words into
smaller units like morphemes, stems, and affixes. Here are a few techniques and models used:
1. Morphological Analysis:
 Morphological analysis is the task of identifying and tagging words with their
morphological features (e.g., tense, case, gender). This can be done using rule-based or
statistical approaches.
 Example Input: the Turkish word "kitaplarınızdan" (from your books)
 Output: "kitaplarınızdan" → Root: kitap, Case: Ablative, Person: 2nd, Number: Plural
2. Subword Tokenization:
 Using methods like Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, subword
tokenization can break down complex words into smaller units (e.g., prefixes, roots, and suffixes).
 Example: For the Turkish word "geliyorum" (I am coming), the tokenization might produce "gel" + "iyor" + "um".
 Result: This allows the model to capture the structure and tense (present continuous) of the verb.
3. Character-Level Models:
 Character-level models process text as sequences of characters rather than whole
words. This can be especially useful for morphologically rich languages where words have many inflected forms.
4. Multilingual Pretrained Models:
 Pretrained multilingual models like mBERT (Multilingual BERT) and XLM-R (XLM-
RoBERTa) can capture morphological categories by leveraging training data from many
languages. These models can learn the underlying grammatical structures and morphological patterns shared across languages.
 Example: mBERT, when fine-tuned on specific tasks, can model grammatical features
such as tense, number, and case for various languages, including languages like Arabic,
Turkish, and Finnish, without requiring separate models for each language.
5. Morphology-Aware Sequence Models:
 Long Short-Term Memory (LSTM) networks and transformers can also be enhanced with explicit morphological features.
 Example: A Finnish compound word glossed as "airplane engine instrument panel" could be split into its components, and an LSTM or
Transformer model could incorporate the feature that "lentokone" (airplane) is a noun (root).
6. Morphological Embeddings:
 Example: A model might learn that "un-" typically negates the meaning of a word (e.g.,
"unhappy") and represent this prefix as a unique embedding that can be combined with the embedding of the root word.
In many languages, words are not explicitly separated by spaces, which presents a
unique challenge for language models in Natural Language Processing (NLP). This is the
case for languages like Chinese, Japanese, Thai, Vietnamese, and Malay, where text is
written without clear word boundaries. In such languages, the task of segmentation
becomes crucial, as it involves determining where one word ends and the next begins, a
process known as word segmentation.
 No Spaces Between Words: Unlike languages like English, where spaces are used to
separate words, many languages do not use spaces or punctuation marks to indicate
word boundaries.
1. Character-Level Models:
o Character-level models (e.g., RNNs, LSTMs, or Transformers) can be
used to process the text as sequences of characters rather than whole words.
This approach eliminates the need for explicit word boundaries, as the model learns to infer them
itself.
o Example: For Chinese or Japanese, character-level models like RNNs or
Transformers can operate directly on raw character sequences.
2. Statistical and Machine Learning Models:
o Classical sequence models such as Conditional
Random Fields (CRF) and Maximum Entropy Models have been applied to the
word segmentation problem. These models predict word boundaries based on the
surrounding characters and context.
o Neural Network Models: With the rise of deep learning, neural network-based
segmenters have become common. These models are trained end-to-end and can handle ambiguous segmentations more robustly.
3. Pretrained Multilingual Models:
o These models are trained on massive multilingual corpora and can learn to
process languages like Chinese, Japanese, and Thai without relying on traditional
word segmentation.
4. Lexical Resources:
o Dictionary-based approaches use
lexicons to identify known words in the text and segment accordingly. This can
be combined with statistical or neural methods, as sketched below.
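A minimal sketch of lexicon-driven segmentation using greedy forward maximum matching; the tiny lexicon and maximum word length are assumptions for illustration.

# Greedy forward maximum-matching segmentation with a small lexicon (illustrative).
lexicon = {"我", "喜欢", "吃", "苹果"}
MAX_WORD_LEN = 4

def max_match(text, lexicon):
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon or length == 1:   # fall back to a single character
                words.append(candidate)
                i += length
                break
    return words

print(max_match("我喜欢吃苹果", lexicon))  # ['我', '喜欢', '吃', '苹果']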
The distinction between spoken and written languages in Natural Language Processing
(NLP) is an important one, as these two forms of language exhibit different
characteristics and challenges. These differences can affect how language models are
trained and applied for tasks like text generation, translation, sentiment analysis, and
speech recognition.
Key Differences Between Spoken and Written Language:
o Example:
2. Vocabulary:
      o   Spoken Language: The vocabulary used in spoken language is often simpler, with
          frequent use of common, everyday words. There are also colloquial expressions,
          slang, and regional dialects.
o Example:
o Example:
4. Contextual Cues:
      o   Spoken language models for speech recognition need to deal with natural
          disfluencies, hesitations, and incomplete phrases. Speech recognition systems
          must be robust enough to handle these elements and convert speech into
          grammatically correct text.
      o   Example: Converting “Um, could you, like, help me with this?” into a clean
          sentence requires the model to understand and remove filler words like “um” and
          “like.”
2. Text Generation:
      o   Generating spoken text (e.g., for virtual assistants like Siri or Alexa) requires
          the language model to produce informal, conversational language that fits the
          context of an ongoing interaction.
o   Example: If a user asks, "What's the weather like today?" a spoken language
model might respond with something like, "It's sunny and 75 degrees." In
          contrast, a written language model might provide a more detailed answer: "The
          weather today is sunny with a temperature of 75°F."
3. Machine Translation:
4. Disfluency Handling:
     o   In spoken language models, handling disfluencies is key. This includes the task
         of filtering out unnecessary parts of speech (e.g., "uh," "um") or correcting false
         starts (e.g., "I mean, I think we should go...").
o   Example: A spoken language model might clean up a sentence like: "Um, I was,
like, thinking about going to the park... but I don't know." It might be converted
to: "I was thinking about going to the park, but I don't know."
1. Multimodal Models:
     o   Multimodal models combine both spoken and written data to handle both forms
         of language effectively. These models are trained on both speech (audio) and
         text to bridge the gap between spoken and written language. For example,
         DeepSpeech or wav2vec can recognize speech and then translate it into written
         form.
2. Fine-Tuned Pretrained Models:
o   Pretrained models like BERT, GPT-3, or T5 can be fine-tuned for both spoken
         and written language tasks. For instance, a model could be trained on both
         formal text (e.g., news articles) and informal spoken text (e.g., dialogue
         datasets).
3. Speech-to-Text (STT) and Text-to-Speech (TTS) Systems:
o   STT systems convert spoken language into written form, while TTS systems
         take written text and generate spoken output. These models are critical in
         bridging the gap between spoken and written language and are useful for voice
         assistants, transcription services, and accessibility tools.
     o   Voice Assistants: Models like Siri, Google Assistant, and Alexa rely on spoken
         language models to respond to verbal commands in a conversational manner.
     o   Speech Recognition Systems: These systems convert spoken input into written
         text. Examples include Dragon NaturallySpeaking and Google Speech-to-Text.
       o   Text Classification: Written language models are widely used for classifying
           text into categories, such as spam detection or sentiment analysis, based on
           formal written content.
   Multilingual Language Modeling (MLM) involves training a single language model on text
from multiple languages, enabling it to process and generate text across different
languages without requiring separate monolingual models. This approach is crucial for
improving NLP applications in low-resource languages and reducing redundancy in model
development.
      Some models (e.g., mBERT, mT5) use language embeddings to indicate the input
       language, helping the model switch between languages.
2. Popular Multilingual Language Models
 Use translation pairs (e.g., XLM with Translation Language Modeling (TLM)).
c) Unsupervised Alignment
d) Parameter-Efficient Fine-Tuning
Challenges
         Script & Grammar Differences: Handling languages with different syntax (e.g.,
          Arabic vs. English).
Future Trends
         Few-shot Transfer: Use a small amount of labeled data in the target language
          for adaptation.
         Unsupervised CLM: Align languages without parallel data (e.g., using back-
          translation, masked LM).
  c) Alignment Strategies
 Back-Translation:
 Adversarial Training:
 Self-Training:
Model | Architecture | Pretraining Objective | Languages
XLM (Facebook) | Encoder | TLM & MLM | 15+ (needs parallel data)
A) Linguistic Divergence
C) Evaluation Difficulties
6. Future Directions