
R22 B.Tech. CSE - NLP

CS525PE: Natural Language Processing (Professional Elective - II)

Prerequisites:
1. Data structures and compiler design

Course Objectives:
Introduction to some of the problems and solutions of NLP and their relation to linguistics and statistics.

Course Outcomes:
1. Show sensitivity to linguistic phenomena and an ability to model them with formal grammars.
2. Understand and carry out proper experimental methodology for training and evaluating empirical NLP systems.
3. Manipulate probabilities, construct statistical models over strings and trees, and estimate parameters using supervised and unsupervised training methods.
4. Design, implement, and analyze NLP algorithms; and design different language modelling techniques.
UNIT - I
Finding the Structure of Words: Words and Their Components, Issues and Challenges, Morphological Models
Finding the Structure of Documents: Introduction, Methods, Complexity of the Approaches, Performances of the Approaches, Features
UNIT - II
Syntax I: Parsing Natural Language, Treebanks: A Data-Driven Approach to Syntax, Representation of Syntactic Structure, Parsing Algorithms
UNIT - III
Syntax II: Models for Ambiguity Resolution in Parsing, Multilingual Issues
Semantic Parsing I: Introduction, Semantic Interpretation, System Paradigms, Word Sense
UNIT - IV
Semantic Parsing II: Predicate-Argument Structure, Meaning Representation Systems
UNIT - V
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Bayesian Parameter Estimation, Language Model Adaptation, Language Models - Class Based, Variable Length, Bayesian Topic Based, Multilingual and Cross Lingual Language Modeling

TEXT BOOKS:
1. Multilingual Natural Language Processing Applications: From Theory to Practice - Daniel M. Bikel and Imed Zitouni, Pearson Publication.

REFERENCE BOOKS:
1. Speech and Language Processing - Daniel Jurafsky & James H. Martin, Pearson Publications.
2. Natural Language Processing and Information Retrieval - Tanveer Siddiqui, U.S. Tiwary.

3. GeoQuery:
GeoQuery is a Natural Language Interface (NLI) designed to interact with a geographic database called Geobase. Geobase contains about 800 Prolog facts, which store geographic information such as populations, neighbouring states, major rivers, and major cities in a relational database.

4. Robocup: CLang
Robocup is an international competition, organized by the artificial intelligence community, in which teams of robots play soccer. The goal is to advance AI and robotics research through this challenging and fun domain.

2.2 Software:
• WASP
• KRISPER
• CHILL

UNIT - V
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Bayesian Parameter Estimation, Language Model Adaptation, Language Models - Class Based, Variable Length, Bayesian Topic Based, Multilingual and Cross Lingual Language Modeling

Language Modeling:
5.1 Introduction:
What is language modeling?
Language modeling, or LM, is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions.

Language modeling is used in artificial intelligence (AI), natural language processing (NLP), natural language understanding and natural language generation systems, particularly ones that perform text generation, machine translation and question answering.

How language modeling works:

Language models determine word probability by analyzing text data. They interpret this data by feeding it through an algorithm that establishes rules for context in natural language. Then, the model applies these rules in language tasks to accurately predict or produce new sentences. The model essentially learns the features and characteristics of basic language and uses those features to understand new phrases.

There are several different probabilistic approaches to modeling language. They vary depending on the purpose of the language model. From a technical perspective, the various language model types differ in the amount of text data they analyze and the math they use to analyze it. For example, a language model designed to generate sentences for an automated social media bot might use different math and analyze text data in different ways than a language model designed for determining the likelihood of a search query.

Language modeling types:
There are several approaches to building language models. Some common statistical language modeling types are the following:
1. N-gram
2. Unigram
3. Bidirectional
4. Exponential
5. Neural language models
6. Continuous space

Importance of language modeling:
Language modeling is crucial in modern NLP applications. It's the reason that machines can understand qualitative information. Each language model type, in one way or another, turns qualitative information into quantitative information. This allows people to communicate with machines as they do with each other, to a limited extent.

Language modeling is used in a variety of industries including information technology, finance, healthcare, transportation, legal, military and government. In addition, it's likely that most people have interacted with a language model in some way at some point in the day, whether through Google search, an autocomplete text function or engaging with a voice assistant.

The roots of language modeling can be traced back to 1948. That year, Claude Shannon published a paper titled "A Mathematical Theory of Communication." In it, he detailed the use of a stochastic model called the Markov chain to create a statistical model for the sequences of letters in English text. This paper had a large impact on the telecommunications industry and laid the groundwork for information theory and language modeling. The Markov model is still used today, and n-grams are tied closely to the concept.
Uses and examples of language modelling:
Language models are the backbone of NLP. Below are some NLP use cases and tasks that employ language modeling:

Speech recognition:
This involves a machine being able to process speech audio. Voice assistants such as Siri and Alexa commonly use speech recognition.

Text generation:
This application uses prediction to generate coherent and contextually relevant text. It has applications in creative writing, content generation, and summarization of structured data and other text.

Chatbots:
These bots engage in humanlike conversations with users as well as generate accurate responses to questions. Chatbots are used in virtual assistants, customer support applications and information retrieval systems.

Machine translation:
This involves the translation of one language to another by a machine. Google Translate and Microsoft Translator are two programs that do this. Another is SDL Government, which is used to translate foreign social media feeds in real time for the U.S. government.

Parts-of-speech tagging:
This use involves the markup and categorization of words by certain grammatical characteristics. This model is used in the study of linguistics. It was first and perhaps most famously used in the study of the Brown Corpus, a body of random English prose that was designed to be studied by computers. This corpus has been used to train several important language models, including one used by Google to improve search quality.

Parsing:
This use involves analysis of any string of data or sentence that conforms to formal grammar and syntax rules. In language modeling, this can take the form of sentence diagrams that depict each word's relationship to the others. Spell-checking applications use language modeling and parsing.

Optical character recognition:
This application involves the use of a machine to convert images of text into machine-encoded text. The image can be a scanned document or document photo, or a photo with text somewhere in it -- on a sign, for example. Optical character recognition is often used in data entry when processing old paper records that need to be digitized. It can also be used to analyze and identify handwriting samples.

Information retrieval:
This approach involves searching in a document for information, searching for documents in general and searching for metadata that corresponds to a document. Web browsers are the most common information retrieval applications.

Observed data analysis:
These language models analyze observed data such as sensor data, telemetric data and data from experiments.

Sentiment analysis:
This application involves determining the sentiment behind a given phrase. Specifically, sentiment analysis is used to understand opinions and attitudes expressed in a text. Businesses use it to analyze unstructured data, such as product reviews and general posts about their product, as well as to analyze internal data such as employee surveys and customer support chats. Some services that provide sentiment analysis tools are Repustate and HubSpot's Service Hub. Google's NLP tool BERT is also used for sentiment analysis.

5.2 N-gram:
An N-gram can be defined as a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs according to the application. The N-grams are typically collected from a text or speech corpus (a long text dataset).

N-gram Model:
This simple approach to a language model creates a probability distribution for a sequence of n. The n can be any number and defines the size of the gram, or sequence of words or random variables being assigned a probability. This allows the model to accurately predict the next word or variable in a sentence. For example, if n = 5, a gram might look like this: "can you please call me." The model then assigns probabilities using sequences of n size. Basically, n can be thought of as the amount of context the model is told to consider. Some types of n-grams are unigrams, bigrams, trigrams and so on. N-grams can also help detect malware by analyzing strings in a file.

An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language. A good N-gram model can predict the next word in the sentence, i.e. the value of p(w|h).
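The definition above can be made concrete with a short sketch (not part of the notes) that extracts the word-level n-grams of a sample sentence; the function name is just illustrative, and the sentence is the example discussed later in this unit.

```python
def ngrams(text, n):
    """Return the contiguous n-word sequences (n-grams) of a sentence."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I love reading history books and watching documentaries"
print(ngrams(sentence, 1))   # unigrams: ('I',), ('love',), ('reading',), ...
print(ngrams(sentence, 2))   # bigrams:  ('I', 'love'), ('love', 'reading'), ...
print(ngrams(sentence, 3))   # trigrams: ('I', 'love', 'reading'), ...
```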
Examples of N-grams:
unigram ("This", "article", "is", "on", "NLP") or
bi-gram ("This article", "article is", "is on", "on NLP").

Now, we will establish a relation on how to find the next word in the sentence. We need to calculate p(w|h), where w is the candidate for the next word and h is the history (the previous words). For example, in the above example, let's consider that we want to calculate the probability of the last word being "NLP" given the previous words:

p(NLP | this article is on)

After generalizing, the above equation can be written as:

p(w5 | w1, w2, w3, w4)   or, in general,   P(wn | w1, w2, ..., wn-1)

But how do we calculate it? The answer lies in the chain rule of probability:

P(A | B) = P(A, B) / P(B)
P(A, B) = P(A | B) P(B)

More variables:
P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)

Now generalize the above equation using the chain rule:
P(X1, X2, ..., Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) ... P(Xn | X1, X2, ..., Xn-1)

P("about five minutes from") = P(about) × P(five | about) × P(minutes | about five) × P(from | about five minutes)

Probability of a word sequence:
P(w1 w2 w3 ... wn) = ∏i P(wi | w1 w2 ... wi-1)

Simplifying the above formula using the Markov assumption:
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-k, ..., wi-1)

The resulting approximations for different N (illustrated in the short sketch after this list) are:
- For a unigram: P(w1 w2 ... wn) ≈ ∏i P(wi)   (no history is used)
- For a bi-gram: P(wi | w1 w2 ... wi-1) ≈ P(wi | wi-1)   (one word of history)
- For a tri-gram: P(wi | w1 w2 ... wi-1) ≈ P(wi | wi-2 wi-1)   (two words of history)
- For a four-gram: P(wi | w1 w2 ... wi-1) ≈ P(wi | wi-3 wi-2 wi-1)   (three words of history)
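As a small illustration of the Markov approximations listed above (a sketch, not part of the notes), the helper below prints the conditional factors that a k-th order Markov model would multiply together for a sentence; the function name and example sentence are just for illustration.

```python
def markov_factors(sentence, k):
    """Return the conditional factors P(w_i | previous k words) used by a
    k-th order Markov (i.e. (k+1)-gram) approximation of the sentence probability."""
    words = sentence.split()
    factors = []
    for i, w in enumerate(words):
        history = words[max(0, i - k):i]
        factors.append(f"P({w} | {' '.join(history)})" if history else f"P({w})")
    return factors

print(markov_factors("about five minutes from college", k=1))
# ['P(about)', 'P(five | about)', 'P(minutes | five)', 'P(from | minutes)', 'P(college | from)']
print(markov_factors("about five minutes from college", k=3))
# ['P(about)', 'P(five | about)', 'P(minutes | about five)',
#  'P(from | about five minutes)', 'P(college | five minutes from)']
```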
Generally, in practical applications, the Bi-gram (previous one word), Tri-gram (previous two words) and Four-gram (previous three words) models are used.

Unigram (1-gram): No history is used.
"about five minutes from ......"
Assume that in the corpus the word "dinner" has the highest probability. The unigram model does not take into account probabilities conditioned on the previous words such as "from" and "minutes", so it will predict "dinner":
"about five minutes from dinner"

Bi-gram (2-gram): One word of history.
"about five minutes from ......"
Assumption: the next word may be "college" or "class".

Using the full history, the candidate probabilities would be estimated as:
P(college | about five minutes from) = count(about five minutes from college) / count(about five minutes from)
P(class | about five minutes from) = count(about five minutes from class) / count(about five minutes from)

As the number of previous states (the history) increases, it becomes very difficult to match that exact sequence of words in the corpus, and the probability of a larger collection of words becomes very small. To overcome this problem, the Bi-gram model is used, which scores each candidate sequence using only one word of history:

P(about five minutes from) ≈ P(about | <s>) × P(five | about) × P(minutes | five) × P(from | minutes)
P(about five minutes from college) ≈ P(about | <s>) × P(five | about) × P(minutes | five) × P(from | minutes) × P(college | from)
P(about five minutes from class) ≈ P(about | <s>) × P(five | about) × P(minutes | five) × P(from | minutes) × P(class | from)
For a bi-gram model, the next-word probability is:

P(Wi | Wi-1) = count(Wi-1, Wi) / count(Wi-1)

Exercise 1: Estimating bi-gram probabilities
What is the most probable next word predicted by the model for the following word sequences?

Given Corpus:
<S> I am henry </S>
<S> I like college </S>
<S> Do henry like college </S>
<S> Henry I am </S>
<S> Do I like henry </S>
<S> Do I like college </S>
<S> I do like henry </S>

Word frequencies:
<S> 7, </S> 7, I 6, am 2, henry 5, like 5, college 3, do 4

1) <S> Do ?
Next-word prediction probabilities with Wi-1 = do:
P(</S> | do) = 0/4
P(I | do) = 2/4
P(am | do) = 0/4
P(henry | do) = 1/4
P(like | do) = 1/4
P(college | do) = 0/4
P(do | do) = 0/4

2) <S> I like henry ?
Next-word prediction probabilities with Wi-1 = henry:
P(</S> | henry) = 3/5
P(I | henry) = 1/5
P(am | henry) = 0
P(henry | henry) = 0
P(like | henry) = 1/5
P(college | henry) = 0
P(do | henry) = 0

Exercise 2: Which of the following sentences is better, i.e., gets a higher probability with this model? Use the bi-gram model with the same corpus and word frequencies as above.

1. <S> I like college </S>
= P(I | <S>) × P(like | I) × P(college | like) × P(</S> | college)
= 3/7 × 3/6 × 3/5 × 3/3 = 9/70 = 0.13

2. <S> Do I like henry </S>
= P(do | <S>) × P(I | do) × P(like | I) × P(henry | like) × P(</S> | henry)
= 3/7 × 2/4 × 3/6 × 2/5 × 3/5 = 9/350 = 0.0257

Ans: The first sentence is more probable. (A short script reproducing these numbers follows below.)

An N-gram is a sequence of N words in the modeling of NLP. Consider an example statement for modeling: "I love reading history books and watching documentaries". In a one-gram or unigram, there is a one-word sequence; for the above statement, the unigrams are "I", "love", "reading", "history", "books", "and", "watching", "documentaries". In a two-gram or bi-gram, there is a two-word sequence, i.e. "I love", "love reading", or "history books". In a three-gram or tri-gram, there is a three-word sequence, i.e. "I love reading", "history books and", or "and watching documentaries" [3]. The illustration of N-gram modeling, i.e. for N = 1, 2, 3, is given in the figure.
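The bi-gram estimates and sentence probabilities worked out in Exercises 1 and 2 above can be checked with a short script. This is only an illustrative sketch: it lower-cases the corpus and writes the sentence markers as <s> and </s>.

```python
from collections import Counter

corpus = [
    "<s> i am henry </s>",
    "<s> i like college </s>",
    "<s> do henry like college </s>",
    "<s> henry i am </s>",
    "<s> do i like henry </s>",
    "<s> do i like college </s>",
    "<s> i do like henry </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """Bigram estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# Exercise 1: most probable word after "do"
print({w: p(w, "do") for w in ["i", "henry", "like"]})   # {'i': 0.5, 'henry': 0.25, 'like': 0.25}

def sentence_prob(sentence):
    tokens = sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)
    return prob

# Exercise 2: which sentence gets the higher bigram probability?
print(round(sentence_prob("<s> i like college </s>"), 4))    # 0.1286  (= 9/70)
print(round(sentence_prob("<s> do i like henry </s>"), 4))   # 0.0257  (= 9/350)
```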
For N-1 words of history, the N-gram model predicts the most frequently occurring words that can follow the sequence. The model is a probabilistic language model trained on a collection of text. This model is useful in applications such as speech recognition and machine translation. A simple model has some limitations that can be improved by smoothing, interpolation, and back-off. So, the N-gram language model is about finding probability distributions over sequences of words.

Consider the sentences "There was heavy rain" and "There was heavy flood". From experience, it can be said that the first sentence is good. The N-gram language model tells us that "heavy rain" occurs more frequently than "heavy flood". So, the first sentence is more likely to occur and it will be the one selected by this model.

In the one-gram model, the model usually relies on which word occurs often, without pondering the previous words. In the 2-gram, only the previous word is considered for predicting the current word. In the 3-gram, two previous words are considered. In the N-gram language model the following probabilities are calculated:

P("There was heavy rain") = P("There", "was", "heavy", "rain")
= P("There") P("was" | "There") P("heavy" | "There was") P("rain" | "There was heavy")

Since it is not practical to calculate the conditional probability with the full history, by using the Markov assumption this is approximated with the bi-gram model as [4]:

P("There was heavy rain") ~ P("There") P("was" | "There") P("heavy" | "was") P("rain" | "heavy")

Which N-gram should be used as a language model?
– Bigger N: the model will be more accurate.
  • But we may not get good estimates for N-gram probabilities.
  • The N-gram tables will be more sparse.
– Smaller N: the model will be less accurate.
  • But we may get better estimates for N-gram probabilities.
  • The N-gram table will be less sparse.
– In reality, we do not use higher than a trigram (often not more than a bigram).
– How big are N-gram tables with 10,000 words?
  • Unigram: 10,000
  • Bigram: 10,000 × 10,000 = 100,000,000
  • Trigram: 10,000 × 10,000 × 10,000 = 1,000,000,000,000

The assumption that the probability of a word depends only on the previous word(s) is called the Markov assumption.
• Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.
• A bigram is called a first-order Markov model (because it looks one token into the past);
• A trigram is called a second-order Markov model;
• In general, an N-gram is called an (N-1)-order Markov model.

5.3 Language Model Evaluation:
• Does our language model prefer good sentences to bad ones?
  – Does it assign higher probability to "real" or "frequently observed" sentences than to "ungrammatical" or "rarely observed" sentences?
• We train the parameters of our model on a training set.
• We test the model's performance on data we haven't seen.
  – A test set is an unseen dataset that is different from our training set, totally unused.
• An evaluation metric tells us how well our model does on the test set.
• Extrinsic evaluation of an N-gram language model is to use it in an application and measure how much the application improves.
• To compare two language models A and B:
  – Use each language model in a task such as a spelling corrector or an MT system.
  – Get an accuracy for A and for B:
    • how many misspelled words are corrected properly;
    • how many words are translated correctly.
  – Compare the accuracy for A and B.
  – The model that produces the better accuracy is the better model.
• Extrinsic evaluation can be time-consuming.
• An intrinsic evaluation metric is one that measures the quality of a model independent of any application.
• When a corpus of text is given and we want to compare two different N-gram models:
  – Divide the data into training and test sets,
  – Train the parameters of both models on the training set, and
  – Compare how well the two trained models fit the test set.
  – Whichever model assigns a higher probability to the test set is the better model.
• In practice, a probability-based metric called perplexity is used instead of raw probability as our metric for evaluating language models.

Perplexity:
• The best language model is one that best predicts an unseen test set
  – i.e., it gives the highest P(test set).
• The perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words.
• Minimizing perplexity is the same as maximizing probability.
• The perplexity PP for a test set W = w1 w2 ... wN is:
  PP(W) = P(w1 w2 ... wN)^(-1/N)
• The perplexity PP for bigrams is:
  PP(W) = ( ∏i 1 / P(wi | wi-1) )^(1/N)
• Lower perplexity = better model.
  – Example: with models trained on 38 million words and tested on 1.5 million words of WSJ text, perplexity drops as the order of the N-gram model increases.
• Perplexity can be seen as the weighted average branching factor of a language.
  – The branching factor of a language is the number of possible next words that can follow any word.
• Suppose a sentence consists of random digits. What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
  – PP = ((1/10)^N)^(-1/N) = 10 (see also the short computation below).
• An intrinsic improvement in perplexity does not guarantee an (extrinsic) improvement in the performance of a language processing task like speech recognition or machine translation.
  – Nonetheless, because perplexity often correlates with such improvements, it is commonly used as a quick check on an algorithm.
  – But a model's improvement in perplexity should always be confirmed by an end-to-end evaluation on a real task before concluding the evaluation of the model.
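A minimal sketch of the perplexity computation described above (not from the notes); it works in log space, and the probability lists are made-up examples. The random-digits case reproduces the branching-factor answer of 10.

```python
import math

def perplexity(probabilities):
    """Perplexity of a test sequence given the per-word model probabilities:
    PP = (prod p_i)^(-1/N), computed in log space for numerical stability."""
    n = len(probabilities)
    log_prob = sum(math.log(p) for p in probabilities)
    return math.exp(-log_prob / n)

# A "sentence" of 12 random digits, each assigned P = 1/10 by the model:
print(perplexity([0.1] * 12))               # ~10.0 (the branching factor)

# A model that is more certain about each word has lower perplexity:
print(perplexity([0.5, 0.25, 0.5, 1.0]))    # 2.0
```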
• The N-gram model, like many statistical models, is dependent on the training corpus.
  – The probabilities often encode specific facts about a given training corpus.
• N-grams only work well for word prediction if the test corpus looks like the training corpus.
  – In real life, it often doesn't.
  – We need to train robust models that generalize!
  – One kind of generalization: getting rid of zeros!
• Zeros are things that don't ever occur in the training set but do occur in the test set. They cause problems for two reasons:
  – first, we underestimate the probability of all sorts of words that might occur;
  – second, if the probability of any word in the test set is 0, the entire probability of the test set is 0.
• We also have to deal with words we haven't seen before, which we call unknown words.
• We can model these potential unknown words in the test set by adding a pseudo-word called <UNK> into our training set too.
• One way to handle unknown words is:
  – Replace words in the training data by <UNK> based on their frequency.
    • For example, we can replace by <UNK> all words that occur fewer than n times in the training set, where n is some small number, or
    • equivalently, select a vocabulary size V in advance (say 50,000), choose the top V words by frequency and replace the rest by <UNK>.
  – Proceed to train the language model as before, treating <UNK> like a regular word.

5.4 Bayesian Parameter Estimation:
• Bayesian parameter estimation is a method in which the set of parameters of a model is treated as a random variable governed by a prior distribution and, once the data are observed, a posterior distribution:
  P(Ø | S) = P(S | Ø) P(Ø) / P(S)
• where S is the training sample, a sequence of words W1 ... WT;
  Ø is the set of parameters: <P(W1), ..., P(Wk)> for a unigram model (where k is the vocabulary size), or <P(W1 | h1), ..., P(Wk | hk)> for an n-gram model;
  P(Ø) is the prior distribution over the different possible values of Ø.
• A point estimate of Ø is obtained with the Maximum A Posteriori (MAP) criterion as follows:
  Ø_MAP = argmax_Ø P(Ø | S) = argmax_Ø P(S | Ø) P(Ø)
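As a concrete, hedged illustration of MAP estimation for the unigram case: with a symmetric Dirichlet prior over Ø, the MAP point estimate reduces to additive smoothing of the raw counts. The tiny corpus, vocabulary and prior strength below are illustrative assumptions, not taken from the notes.

```python
from collections import Counter

def map_unigram(tokens, vocab, alpha=2.0):
    """MAP estimate of unigram parameters with a symmetric Dirichlet(alpha) prior.
    For a multinomial likelihood the MAP works out to additive smoothing:
        P(w) = (count(w) + alpha - 1) / (N + V * (alpha - 1))
    alpha = 1 recovers maximum likelihood; alpha = 2 gives add-one (Laplace) smoothing.
    """
    counts = Counter(tokens)
    n, v = len(tokens), len(vocab)
    return {w: (counts[w] + alpha - 1) / (n + v * (alpha - 1)) for w in vocab}

tokens = "i like college i like henry i am henry".split()
vocab = ["i", "like", "college", "henry", "am", "do"]   # "do" is unseen in this sample

print(map_unigram(tokens, vocab, alpha=1.0)["do"])   # 0.0    (maximum likelihood)
print(map_unigram(tokens, vocab, alpha=2.0)["do"])   # ~0.067 (the prior keeps it non-zero)
```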
5.5 Language Model Adaptation:

Language model adaptation is the process of fine-tuning a pre-trained language model to a specific domain or task with a smaller amount of task-specific data. This approach can improve the performance of the language model on the target domain or task by allowing it to better capture the specific linguistic patterns and vocabulary of that domain.

The most common approach to language model adaptation is called transfer learning, which involves initializing the language model with pre-trained weights and fine-tuning it on the target domain or task using a smaller amount of task-specific data.

This process typically involves updating the final layers of the language model, which are responsible for predicting the target output, while keeping the lower-level layers, which capture more general language patterns, fixed (a minimal sketch of this freeze-and-fine-tune setup is given after the list below).

There are several advantages to using language model adaptation, including:
1. Improved performance on task-specific data: By fine-tuning a pre-trained language model on task-specific data, the model can better capture the specific linguistic patterns and vocabulary of that domain, leading to improved performance on task-specific data.
2. Reduced training time and computational resources: By starting with a pre-trained language model, the amount of training data and computational resources required to achieve good performance on the target task is reduced, making it a more efficient approach.
3. Better handling of rare and out-of-vocabulary words: Pre-trained language models have learned to represent a large vocabulary of words, which can be beneficial for handling rare and out-of-vocabulary words in the target domain.
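A minimal sketch of the freeze-and-fine-tune recipe described above, written in PyTorch-style code. The model architecture, checkpoint file name and hyperparameters are illustrative assumptions; real adaptation would start from an actual pre-trained language model.

```python
import torch
import torch.nn as nn

# A tiny LSTM language model; architecture and sizes are illustrative only.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=10_000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)   # lower-level layers
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)    # final prediction layer

    def forward(self, tokens):
        x = self.embed(tokens)
        out, _ = self.lstm(x)
        return self.head(out)

model = TinyLM()
# model.load_state_dict(torch.load("pretrained_lm.pt"))  # hypothetical pre-trained weights

# Adaptation: freeze the general-purpose lower layers, fine-tune only the output head.
for layer in (model.embed, model.lstm):
    for p in layer.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def fine_tune_step(batch_tokens, batch_targets):
    """One fine-tuning step on in-domain data (batch shapes: [B, T])."""
    logits = model(batch_tokens)                      # [B, T, vocab]
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), batch_targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```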
Language model adaptation has been applied successfully in a wide range of NLP tasks, including sentiment analysis, text classification, named entity recognition, and machine translation. However, it does require a small amount of task-specific data, which may not always be available or representative of the target domain.

Types of Language Models:
1. Class-based Language Models
2. Variable-Length Language Models
3. Discriminative Language Models
4. Syntax-based Language Models
5. MaxEnt Language Models
6. Factored Language Models
7. Other Tree-based Language Models
8. Bayesian Topic-Based Language Models
9. Neural Network Language Models

5.6 Class-Based Language Models:

Class-based language models are a type of probabilistic language model that groups words into classes based on their distributional similarity. The goal of class-based models is to reduce the sparsity problem in language modeling by grouping similar words together and estimating the probability of a word given its class rather than estimating the probability of each individual word.

The process of building a class-based language model typically involves the following steps:
1. Word clustering: The first step is to cluster words based on their distributional similarity. This can be done using unsupervised clustering algorithms such as k-means clustering or hierarchical clustering.
2. Class construction: After clustering, each cluster is assigned a class label. The number of classes can be predefined or determined automatically based on the size of the training corpus and the desired level of granularity.
3. Probability estimation: Once the classes are constructed, the probability of a word given its class is estimated using a variety of techniques, such as maximum likelihood estimation or Bayesian estimation.
4. Language modeling: The final step is to use the estimated probabilities to build a language model that can predict the probability of a sequence of words (a minimal sketch of this is given after the advantages list below).

Class-based language models have several advantages over traditional word-based models, including:
1. Reduced sparsity: By grouping similar words together, class-based models reduce the sparsity problem in language modeling, which can improve the accuracy of the model.
2. Improved data efficiency: Since class-based models estimate the probability of a word given its class rather than estimating the probability of each individual word, they require less training data and can be more data-efficient.
3. Better handling of out-of-vocabulary words: Class-based models can handle out-of-vocabulary words better than word-based models, since unseen words can often be assigned to an existing class based on their distributional similarity.
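One standard instantiation of steps 3-4 is the Brown-style class bigram, which factors P(wi | wi-1) ≈ P(class(wi) | class(wi-1)) × P(wi | class(wi)). The sketch below is illustrative only: a hand-written word-to-class map stands in for the clustering of steps 1-2, and the tiny corpus is made up.

```python
from collections import Counter

# Hand-assigned classes stand in for steps 1-2 (clustering); real systems learn these.
word2class = {
    "henry": "PERSON", "mary": "PERSON",
    "college": "PLACE", "class": "PLACE",
    "like": "VERB", "attend": "VERB",
}

corpus = "henry like college mary attend class henry attend college".split()

class_seq = [word2class[w] for w in corpus]
class_bigrams = Counter(zip(class_seq, class_seq[1:]))
class_counts = Counter(class_seq)
word_counts = Counter(corpus)

def p_class_bigram(c, c_prev):
    # Step 3a: P(class | previous class)
    return class_bigrams[(c_prev, c)] / class_counts[c_prev]

def p_word_given_class(w):
    # Step 3b: P(word | its class)
    return word_counts[w] / class_counts[word2class[w]]

def p_word_bigram(w, w_prev):
    # Step 4: class-based bigram, P(w | w_prev) ≈ P(c(w) | c(w_prev)) * P(w | c(w))
    return p_class_bigram(word2class[w], word2class[w_prev]) * p_word_given_class(w)

# Neither word bigram "like class" nor "mary like" occurs in the corpus, yet the
# class-based model still assigns them probability via the class transitions:
print(p_word_bigram("class", "like"))   # P(PLACE|VERB) * P(class|PLACE) = 1.0 * 1/3
print(p_word_bigram("like", "mary"))    # P(VERB|PERSON) * P(like|VERB)  = 1.0 * 1/3
```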
However, class-based models also have some limitations, such as the need for a large training corpus to build accurate word clusters and the potential loss of some information due to the grouping of words into classes.

Overall, class-based language models are a useful tool for reducing the sparsity problem in language modeling and improving the accuracy of language models, particularly in cases where data is limited or out-of-vocabulary words are common.

5.7 Variable-Length Language Models:

Variable-length language models are a type of language model that can handle variable-length input sequences, rather than the fixed-length input sequences used by n-gram models.

The main advantage of variable-length language models is that they can handle input sequences of any length, which is particularly useful for tasks such as machine translation or summarization, where the length of the input or output can vary greatly.

One approach to building variable-length language models is to use recurrent neural networks (RNNs), which can model sequences of variable length. RNNs use a hidden state that is updated at each time step based on the input at that time step and the previous hidden state. This allows the network to capture the dependencies between words in a sentence, regardless of the sentence length.
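A bare-bones sketch of the recurrence just described (illustrative only, with random weights standing in for trained parameters): the hidden state at each step is computed from the current input vector and the previous hidden state, so the same update rule handles sentences of any length.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hidden_dim = 8, 16

# Random parameters stand in for trained weights.
W_xh = rng.normal(size=(hidden_dim, emb_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def encode(sequence_of_vectors):
    """Run a simple (Elman-style) RNN over a variable-length sequence of word vectors."""
    h = np.zeros(hidden_dim)                       # initial hidden state
    for x_t in sequence_of_vectors:                # one step per word, any sentence length
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # h_t depends on x_t and h_{t-1}
    return h                                       # final state summarizes the sentence

short_sentence = [rng.normal(size=emb_dim) for _ in range(3)]
long_sentence = [rng.normal(size=emb_dim) for _ in range(12)]

print(encode(short_sentence).shape, encode(long_sentence).shape)   # (16,) (16,) for both
```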
Another approach is to use transformer-based models, which can also handle variable-length input sequences. Transformer-based models use a self-attention mechanism to capture the dependencies between words in a sentence, allowing them to model long-range dependencies without the need for recurrent connections.

Variable-length language models can be evaluated using a variety of metrics, such as perplexity or BLEU score. Perplexity measures how well the model can predict the next word in a sequence, while BLEU score measures how well the model can generate translations that match a reference translation.

5.8 Bayesian Topic-Based Language Models:

Bayesian topic-based language models, also known as topic models, are a type of language model used to uncover latent topics in a corpus of text. These models use Bayesian inference to estimate the probability distribution of words in each topic, and the probability distribution of topics in each document.

The basic idea behind topic models is that a document is a mixture of several latent topics, and each word in the document is generated by one of these topics. The model tries to learn the distribution of these topics from the corpus, and uses this
information to predict the probability distribution of words in each document.

One of the most popular Bayesian topic-based language models is Latent Dirichlet Allocation (LDA). LDA assumes that the corpus is generated by a mixture of latent topics, and each topic is a probability distribution over the words in the corpus. The model uses a Dirichlet prior over the topic distributions, which encourages sparsity and prevents overfitting.

LDA has been used for a variety of NLP tasks, including text classification, information retrieval, and topic modeling. It has been shown to be effective in uncovering hidden themes and patterns in large corpora of text, and can be used to identify key topics and concepts in a document.
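A small, hedged sketch of topic modeling in the LDA style, using scikit-learn's LatentDirichletAllocation (a library choice made for this sketch, not something named in the notes); the four tiny documents are made up, and with this little data the recovered topics are only suggestive.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the team won the cricket match with a great innings",
    "the batsman scored a century in the final match",
    "parliament passed the new tax bill after a long debate",
    "the government announced a budget and new tax rules",
]

# Bag-of-words counts, then LDA with 2 latent topics (sports vs. politics, hopefully).
counts = CountVectorizer(stop_words="english").fit(documents)
X = counts.transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):            # per-topic word weights
    top = [words[i] for i in topic.argsort()[-4:]]
    print(f"topic {k}:", top)

print(lda.transform(X).round(2))                        # per-document topic mixture
```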
5.9 Multilingual and Cross-Lingual Language Modeling:

Multilingual and cross-lingual language modeling are two related but distinct areas of natural language processing that deal with modeling language data across multiple languages.

Multilingual language modeling refers to the task of training a language model on data from multiple languages. The goal is to create a single model that can handle input in multiple languages. This can be useful for applications such as machine translation, where the model needs to be able to process input in different languages.

Cross-lingual language modeling, on the other hand, refers to the task of training a language model on data from one language and using it to process input in another language. The goal is to create a model that can transfer knowledge from one language to another, even if the languages are unrelated. This can be useful for tasks such as cross-lingual document classification, where the model needs to be able to classify documents written in different languages.

There are several challenges associated with multilingual and cross-lingual language modeling, including:
1. Vocabulary size: Different languages have different vocabularies, which can make it challenging to train a model that can handle input from multiple languages.
2. Grammatical structure: Different languages have different grammatical structures, which can make it challenging to create a model that can handle input from multiple languages.
3. Data availability: It can be challenging to find enough training data for all the languages of interest.

To overcome these challenges, researchers have developed various approaches to multilingual and cross-lingual language modeling, including:
1. Shared embedding space: One approach is to train a model with a shared embedding space, where the embeddings for words in different languages are learned jointly. This can help address the vocabulary size challenge.
2. Language-specific layers: Another approach is to use
language-specific layers in the model to handle the differences in grammatical structure across languages.
3. Pretraining and transfer learning: Pretraining a model on large amounts of data in one language and then fine-tuning it on smaller amounts of data in another language can help address the data availability challenge.

Multilingual and cross-lingual language modeling are active areas of research, with many potential applications in machine translation, cross-lingual information retrieval, and other areas.

1. Multilingual Language Modeling:

Multilingual language modeling is the task of training a single language model that can process input in multiple languages. The goal is to create a model that can handle the vocabulary and grammatical structures of multiple languages.

One approach to multilingual language modeling is to train the model on a mixture of data from multiple languages. The model can then learn to share information across languages and generalize to new languages. This approach can be challenging because of differences in vocabulary and grammar across languages.

Another approach is to use a shared embedding space for the different languages. In this approach, the embeddings for words in different languages are learned jointly, allowing the model to transfer knowledge across languages. This approach has been shown to be effective for low-resource languages.

Multilingual language models have many potential applications, including machine translation, language identification, and cross-lingual information retrieval. They can also be used for tasks such as sentiment analysis and named entity recognition across multiple languages. However, there are also challenges associated with multilingual language modeling, including the need for large amounts of multilingual data and the difficulty of balancing the modeling of multiple languages.

2. Cross-Lingual Language Modeling:

Cross-lingual language modeling is a type of multilingual language modeling that focuses specifically on the problem of transferring knowledge between languages that are not necessarily closely related. The goal is to create a language model that can understand multiple languages and can be used to perform tasks across languages, even when there is limited data available for some of the languages.

One approach to cross-lingual language modeling is to use a shared encoder for multiple languages, which can be used to map input text into a common embedding space. This approach allows the model to transfer knowledge across languages and to leverage shared structures and features across languages.
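A rough sketch of the shared-encoder idea (an illustration, not an implementation from the notes): one embedding table and one encoder are shared across languages, so sentences from different languages are mapped into the same vector space and can be compared directly. The vocabulary size, token ids and pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One embedding table + one encoder shared by all languages, so sentences
    from different languages land in a common embedding space."""
    def __init__(self, vocab_size=32_000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)       # joint multilingual vocabulary
        self.encoder = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):                         # [batch, seq_len]
        states, _ = self.encoder(self.embed(token_ids))   # [batch, seq_len, dim]
        return states.mean(dim=1)                         # mean-pool to one sentence vector

encoder = SharedEncoder()

# Hypothetical token ids from a joint vocabulary for two sentences in different languages.
sentence_lang_a = torch.tensor([[101, 2054, 2003, 17953]])
sentence_lang_b = torch.tensor([[101, 9331, 4520, 11253]])

vec_a, vec_b = encoder(sentence_lang_a), encoder(sentence_lang_b)
similarity = torch.cosine_similarity(vec_a, vec_b)        # comparable because the space is shared
print(vec_a.shape, similarity.shape)                      # torch.Size([1, 256]) torch.Size([1])
```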
Another approach is to use parallel corpora, which are pairs of texts in two different languages that have been aligned sentence-by-sentence. These parallel corpora can be used to train
models that can map sentences in one language to sentences in another language, which can be used for tasks like machine translation.

Cross-lingual language modeling has many potential applications, including cross-lingual information retrieval, machine translation, and cross-lingual classification. It is particularly useful for low-resource languages where there may be limited labelled data available, as it allows knowledge from other languages to be transferred to the low-resource language.

However, cross-lingual language modeling also presents several challenges, including the need for large amounts of parallel data, the difficulty of aligning sentence pairs across languages, and the potential for errors to propagate across languages.