UNIT 5

CS525PE: Natural Language Processing (Professional Elective – II)
R22 B.Tech. CSE

Prerequisites:
1. Data structures and compiler design

Course Objectives:
Introduction to some of the problems and solutions of NLP and their relation to linguistics and statistics.

Course Outcomes:
1. Show sensitivity to linguistic phenomena and an ability to model them with formal grammars.
2. Understand and carry out proper experimental methodology for training and evaluating empirical NLP systems.
3. Manipulate probabilities, construct statistical models over strings and trees, and estimate parameters using supervised and unsupervised training methods.
4. Design, implement, and analyze NLP algorithms, and design different language modelling techniques.
UNIT - I
Finding the Structure of Words: Words and Their Components, Issues and Challenges, Morphological Models
Finding the Structure of Documents: Introduction, Methods, Complexity of the Approaches, Performances of the Approaches, Features

UNIT - II
3. GeoQuery:
GeoQuery is a Natural Language Interface (NLI) designed to interact with a geographic database called Geobase. Geobase contains about 800 Prolog facts, which store geographic information such as populations, neighbouring states, major rivers, and major cities in a relational database.

4. Robocup: CLang
Robocup is an international competition, organized by the artificial intelligence community, in which teams of robots play soccer. Its goal is to advance AI and robotics research through this challenging domain.

2.2 Software:
WASP
KRISPER
CHILL

UNIT - V
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Bayesian Parameter Estimation, Language Model Adaptation, Language Models - Class Based, Variable Length, Bayesian Topic Based, Multilingual and Cross Lingual Language Modeling

Language Modeling:

5.1 Introduction:

What is language modeling?
Language modeling, or LM, is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions.

Language modeling is used in artificial intelligence (AI), natural language processing (NLP), natural language understanding and natural language generation systems, particularly ones that perform text generation, machine translation and question answering.
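For a sequence of words w1, w2, …, wn, this probability is given by the chain rule of probability:

P(w1, w2, …, wn) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × … × P(wn | w1, …, wn-1)

N-gram models, discussed later in this unit, approximate each conditional term using only the most recent few words of history.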
Language models determine word probability by analyzing text data. They interpret this data by feeding it through an algorithm that establishes rules for context in natural language. Then, the model applies these rules in language tasks to accurately predict or produce new sentences. The model essentially learns the features and characteristics of basic language and uses those features to understand new phrases.

There are several different probabilistic approaches to modeling language. They vary depending on the purpose of the language model. From a technical perspective, the various language model types differ in the amount of text data they analyze and the math they use to analyze it. For example, a language model designed to generate sentences for an automated social media bot might use different math and analyze text data in different ways than a language model designed for determining the likelihood of a search query.

Language modeling types
There are several approaches to building language models. Some common statistical language modeling types are the following:
1. N-gram
2. Unigram
3. Bidirectional
4. Exponential
5. Neural Language Models
6. Continuous space

Importance of language modeling
Language modeling is crucial in modern NLP applications. It's the reason that machines can understand qualitative information. Each language model type, in one way or another, turns qualitative information into quantitative information. This allows people to communicate with machines as they do with each other, to a limited extent.

Language modeling is used in a variety of industries including information technology, finance, healthcare, transportation, legal, military and government. In addition, it's likely that most people have interacted with a language model in some way at some point in the day, whether through Google search, an autocomplete text function or engaging with a voice assistant.

The roots of language modeling can be traced back to 1948. That year, Claude Shannon published a paper titled "A Mathematical Theory of Communication." In it, he detailed the use of a stochastic model called the Markov chain to create a statistical model for the sequences of letters in English text. This paper had a large impact on the telecommunications industry and laid the groundwork for information theory and language modeling. The Markov model is still used today, and n-grams are tied closely to the concept.
Uses and examples of language modeling:

Language models are the backbone of NLP. Below are some NLP use cases and tasks that employ language modeling:

Speech recognition:
This involves a machine being able to process speech audio. Voice assistants such as Siri and Alexa commonly use speech recognition.

Text generation:
This application uses prediction to generate coherent and contextually relevant text. It has applications in creative writing, content generation, and summarization of structured data and other text.

Chatbots:
These bots engage in humanlike conversations with users as well as generate accurate responses to questions. Chatbots are used in virtual assistants, customer support applications and information retrieval systems.

Machine translation:
This involves the translation of one language to another by a machine. Google Translate and Microsoft Translator are two programs that do this. Another is SDL Government, which is used to translate foreign social media feeds in real time for the U.S. government.

Parts-of-speech tagging:
This use involves the markup and categorization of words by certain grammatical characteristics. This model is used in the study of linguistics. It was first and perhaps most famously used in the study of the Brown Corpus, a body of random English prose that was designed to be studied by computers. This corpus has been used to train several important language models, including one used by Google to improve search quality.

Parsing:
This use involves analysis of any string of data or sentence that conforms to formal grammar and syntax rules. In language modeling, this can take the form of sentence diagrams that depict each word's relationship to the others. Spell-checking applications use language modeling and parsing.

Optical character recognition:
This application involves the use of a machine to convert images of text into machine-encoded text. The image can be a scanned document or document photo, or a photo with text somewhere in it -- on a sign, for example. Optical character recognition is often used in data entry when processing old paper records that need to be digitized. It can also be used to analyze and identify handwriting samples.

Information retrieval:
This approach involves searching in a document for information, searching for documents in general and searching for metadata that corresponds to a document. Web browsers are the most common information retrieval applications.

Observed data analysis:
These language models analyze observed data such as sensor data, telemetric data and data from experiments.

Sentiment analysis:
This application involves determining the sentiment behind a given phrase. Specifically, sentiment analysis is used to understand opinions and attitudes expressed in a text. Businesses use it to analyze unstructured data, such as product reviews and general posts about their product, as well as to analyze internal data such as employee surveys and customer support chats. Some services that provide sentiment analysis tools are Repustate and HubSpot's Service Hub. Google's NLP tool BERT is also used for sentiment analysis.

5.2 N-gram

An N-gram can be defined as a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs according to the application. N-grams are typically collected from a text or speech corpus (a long text dataset).

N-gram Model:

This simple approach to a language model creates a probability distribution for a sequence of n. The n can be any number and defines the size of the gram, or sequence of words or random variables being assigned a probability. This allows the model to accurately predict the next word or variable in a sentence. For example, if n = 5, a gram might look like this: "can you please call me." The model then assigns probabilities using sequences of n size. Basically, n can be thought of as the amount of context the model is told to consider. Some types of n-grams are unigrams, bigrams, trigrams and so on. N-grams can also help detect malware by analyzing strings in a file.

An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language. A good N-gram model can predict the next word in a sentence, e.g., "about five minutes from……".
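The following is a minimal sketch in Python of how such n-gram counts can be collected and turned into maximum-likelihood bigram probabilities. The toy corpus and function names are illustrative, not from the text:

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Collect every contiguous n-item sequence in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus; a real model would be trained on a large text collection.
corpus = "about five minutes from college about five minutes from class".split()

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)

# Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[(w1,)]

print(bigram_prob("five", "minutes"))  # 1.0: "five" is always followed by "minutes"
print(bigram_prob("from", "college"))  # 0.5: "from" is followed by "college" half the time
```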
Generally, in practical applications, bi-grams (previous one word), tri-grams (previous two words), and four-grams (previous three words) are used.

Unigram (1-gram): No history is used.
"about five minutes from……"
Assume that in the corpus the word "dinner" occurs with the highest probability. A unigram model does not take into account the probabilities of preceding words like "from" and "minutes", so it will predict "dinner":
"about five minutes from dinner"

Bi-gram (2-gram): One word of history is used.
"about five minutes from"
Assumption: the next word may be "college" or "class". Under the bi-gram chain rule:

P(about five minutes from) = P(about | <s>) × P(five | about) × P(minutes | five) × P(from | minutes)

P(about five minutes from college) = P(about | <s>) × P(five | about) × P(minutes | five) × P(from | minutes) × P(college | from)

P(about five minutes from class) = P(about | <s>) × P(five | about) × P(minutes | five) × P(from | minutes) × P(class | from)

The candidate continuations can also be estimated directly from counts:

P(college | about five minutes from) = count(about five minutes from college) / count(about five minutes from)

P(class | about five minutes from) = count(about five minutes from class) / count(about five minutes from)

As the number of previous states (the history) increases, it becomes very difficult to find that exact sequence of words in the corpus.
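As a rough illustration, here is a sketch of the relative-frequency estimate computed directly from token counts on a made-up toy corpus (the corpus and helper name are invented for this example). Note how a slightly different history already has zero matches, which is the sparsity problem just described:

```python
# Estimate P(next_word | history) by counting how often the history occurs,
# and how often it is immediately followed by the candidate word.
def next_word_prob(tokens, history, word):
    h = len(history)
    history_count = sum(1 for i in range(len(tokens) - h + 1)
                        if tokens[i:i + h] == history)
    followed_count = sum(1 for i in range(len(tokens) - h)
                         if tokens[i:i + h] == history and tokens[i + h] == word)
    return followed_count / history_count if history_count else 0.0

tokens = ("about five minutes from college . "
          "about five minutes from college . "
          "about five minutes from class").split()

history = ["about", "five", "minutes", "from"]
print(next_word_prob(tokens, history, "college"))  # 2/3
print(next_word_prob(tokens, history, "class"))    # 1/3
# A longer or rarer history quickly has no matches at all in the corpus.
print(next_word_prob(tokens, ["five", "minutes", "from", "dinner"], "college"))  # 0.0
```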
For N-1 words of history, the N-gram model predicts the most probable words that can follow the sequence. The model is a probabilistic language model trained on a collection of text. It is useful in applications such as speech recognition and machine translation. A simple model has limitations that can be improved by smoothing, interpolation, and back-off. So, the N-gram language model is about finding probability distributions over sequences of words.

Consider the sentences "There was heavy rain" and "There was heavy flood". From experience, we can say that the first sentence sounds better. The N-gram language model tells us that "heavy rain" occurs more frequently than "heavy flood", so the first sentence is more likely to occur and will be selected by the model, as the sketch after the list below illustrates. Under a bi-gram model:

P("There was heavy rain") ≈ P("There") × P("was" | "There") × P("heavy" | "was") × P("rain" | "heavy")

In the one-gram (unigram) model, the model instead relies only on how frequently each word itself occurs, without considering the previous words.

Which N-gram should be used as a language model?
– The bigger the N, the more accurate the model.
  • But we may not get good estimates for N-gram probabilities.
  • The N-gram tables will be more sparse.
– The smaller the N, the less accurate the model.
  • But we may get better estimates for N-gram probabilities.
  • The N-gram table will be less sparse.
– In reality, we do not use anything higher than a trigram (not more than …).
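Here is a sketch of how the comparison plays out under a bi-gram model. The probability values below are invented purely for illustration:

```python
# Approximate P(sentence) with the bi-gram chain rule:
# P(w1 .. wn) ~ P(w1 | <s>) * P(w2 | w1) * ... * P(wn | wn-1)
def sentence_prob(sentence, bigram_p):
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_p.get((prev, cur), 0.0)  # unseen bigram -> probability 0 (no smoothing)
    return p

# Hypothetical bigram probabilities, chosen only to illustrate the comparison.
bigram_p = {
    ("<s>", "There"): 0.2, ("There", "was"): 0.5, ("was", "heavy"): 0.1,
    ("heavy", "rain"): 0.4, ("heavy", "flood"): 0.01,
}

print(sentence_prob("There was heavy rain", bigram_p))   # 0.004
print(sentence_prob("There was heavy flood", bigram_p))  # 0.0001 -> the "rain" sentence wins
```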
Language Model Adaptation has been applied successfully in a wide range of NLP tasks, including sentiment analysis, text classification, named entity recognition, and machine translation. However, it does require a small amount of task-specific data, which may not always be available or representative of the target domain.

Types of Language Models:
1. Class based Language Models
2. Variable Length Language Models
3. Discriminative Language Models
4. Syntax based Language Models
5. MaxEnt Language Models
6. Factored Language Models
7. Other Tree-based Language Models
8. Bayesian Topic-Based Language Models
9. Neural Network Language Models

5.6 Class-Based Language Models:

Class-based language models are a type of probabilistic language model that groups words into classes based on their distributional similarity. The goal of class-based models is to reduce the sparsity problem in language modeling by grouping similar words together and estimating the probability of a word given its class rather than estimating the probability of each individual word.

The process of building a class-based language model typically involves the following steps (a short code sketch follows the advantages list below):
1. Word clustering: The first step is to cluster words based on their distributional similarity. This can be done using unsupervised clustering algorithms such as k-means clustering or hierarchical clustering.
2. Class construction: After clustering, each cluster is assigned a class label. The number of classes can be predefined or determined automatically based on the size of the training corpus and the desired level of granularity.
3. Probability estimation: Once the classes are constructed, the probability of a word given its class is estimated using a variety of techniques, such as maximum likelihood estimation or Bayesian estimation.
4. Language modeling: The final step is to use the estimated probabilities to build a language model that can predict the probability of a sequence of words.

Class-based language models have several advantages over traditional word-based models, including:
1. Reduced sparsity: By grouping similar words together, class-based models reduce the sparsity problem in language modeling, which can improve the accuracy of the model.
2. Improved data efficiency: Since class-based models estimate the probability of a word given its class rather than the probability of each individual word, they require less training data and can be more data-efficient.
3. Better handling of out-of-vocabulary words: Class-based models can handle out-of-vocabulary words better than word-based models, since unseen words can often be assigned to an existing class based on their distributional similarity.
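A minimal sketch of the resulting model in Python, assuming the clustering and estimation steps have already been carried out. The classes and probabilities below are hand-picked for illustration, not estimated from a real corpus:

```python
# Class-based bigram decomposition: P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i)
word2class = {  # hand-assigned classes, standing in for the word-clustering step
    "monday": "DAY", "tuesday": "DAY",
    "ran": "VERB", "walked": "VERB",
}
class_bigram_p = {("VERB", "DAY"): 0.30}        # P(class of w_i | class of w_{i-1})
word_given_class_p = {("monday", "DAY"): 0.50}  # P(w_i | its class)

def class_based_prob(prev_word, word):
    c_prev, c = word2class[prev_word], word2class[word]
    return class_bigram_p.get((c_prev, c), 0.0) * word_given_class_p.get((word, c), 0.0)

# "walked monday" gets a nonzero probability even if that exact word pair was
# never seen in training, because VERB -> DAY transitions were observed.
print(class_based_prob("walked", "monday"))  # 0.3 * 0.5 = 0.15
```

Because the probability is factored through classes, an unseen word pair can still score above zero, which is exactly the sparsity benefit described above.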
However, class-based models also have some limitations, such as the need for a large training corpus to build accurate word clusters and the potential loss of some information due to the grouping of words into classes.

Overall, class-based language models are a useful tool for reducing the sparsity problem in language modeling and improving the accuracy of language models, particularly in cases where data is limited or out-of-vocabulary words are common.

5.7 Variable Length Language Models:

One approach is to use recurrent neural networks, which can capture the dependencies between words in a sentence, regardless of the sentence length.

Another approach is to use transformer-based models, which can also handle variable-length input sequences. Transformer-based models use a self-attention mechanism to capture the dependencies between words in a sentence, allowing them to model long-range dependencies without the need for recurrent connections.
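Below is a stripped-down sketch of self-attention: a single head with identity projections in place of learned query/key/value matrices, which is a simplification of real transformers. Its point is only to show that nothing in the computation fixes the sequence length:

```python
import numpy as np

def self_attention(X):
    # X: (seq_len, d) token embeddings; seq_len can be any length.
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                   # token-to-token similarity scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax: each token's attention
    return weights @ X                              # weighted mix of all positions

rng = np.random.default_rng(0)
print(self_attention(rng.normal(size=(5, 8))).shape)   # (5, 8)
print(self_attention(rng.normal(size=(12, 8))).shape)  # (12, 8): no fixed length anywhere
```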
5.8 Bayesian Topic-Based Language Models:

Bayesian topic-based language models use topic information to predict the probability distribution of words in each document.

One of the most popular Bayesian topic-based language models is Latent Dirichlet Allocation (LDA). LDA assumes that the corpus is generated by a mixture of latent topics, and each topic is a probability distribution over the words in the corpus. The model uses a Dirichlet prior over the topic distributions, which encourages sparsity and prevents overfitting.

LDA has been used for a variety of NLP tasks, including text classification, information retrieval, and topic modeling. It has been shown to be effective in uncovering hidden themes and patterns in large corpora of text, and can be used to identify key topics and concepts in a document.
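One way to try LDA in practice is scikit-learn's LatentDirichletAllocation. The sketch below fits two topics on a tiny invented corpus, far too small for meaningful topics but enough to show the moving parts:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # tiny invented corpus
    "rivers flow into the sea",
    "the state has major rivers and cities",
    "the model predicts the next word",
    "language models assign probabilities to word sequences",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                 # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)           # per-document topic mixture

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_): # each row: word weights for one topic
    top = topic.argsort()[-3:][::-1]
    print(f"topic {k}:", [terms[i] for i in top])
```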
5.9 Multilingual and Cross Lingual Language Modeling:

Multilingual and crosslingual language modeling are two related but distinct areas of natural language processing that deal with modeling language data across multiple languages.

Multilingual language modeling refers to the task of training a language model on data from multiple languages. The goal is to create a single model that can handle input in multiple languages. This can be useful for applications such as machine translation, where the model needs to be able to process input in different languages.

Cross lingual language modeling, on the other hand, refers to the task of training a language model on data from one language and using it to process input in another language. The goal is to create a model that can transfer knowledge from one language to another, even if the languages are unrelated. This can be useful for tasks such as cross lingual document classification, where the model needs to be able to classify documents written in different languages.

There are several challenges associated with multilingual and cross lingual language modeling, including:
1. Vocabulary size: Different languages have different vocabularies, which can make it challenging to train a model that can handle input from multiple languages.
2. Grammatical structure: Different languages have different grammatical structures, which can make it challenging to create a model that can handle input from multiple languages.
3. Data availability: It can be challenging to find enough training data for all the languages of interest.

To overcome these challenges, researchers have developed various approaches to multilingual and cross lingual language modeling, including:
1. Shared embedding space: One approach is to train a model with a shared embedding space, where the embeddings for words in different languages are learned jointly. This can help address the vocabulary size challenge.
2. Language-specific layers: Another approach is to use language-specific layers in the model to handle the differences in grammatical structure across languages.
3. Pretraining and transfer learning: Pretraining a model on large amounts of data in one language and then fine-tuning it on smaller amounts of data in another language can help address the data availability challenge.

Multilingual and cross lingual language modeling are active areas of research, with many potential applications in machine translation, cross lingual information retrieval, and other areas.

1. Multilingual Language Modeling:

Multilingual language modeling is the task of training a single language model that can process input in multiple languages. The goal is to create a model that can handle the vocabulary and grammatical structures of multiple languages.

One approach to multilingual language modeling is to train the model on a mixture of data from multiple languages. The model can then learn to share information across languages and generalize to new languages. This approach can be challenging because of differences in vocabulary and grammar across languages.

Another approach is to use a shared embedding space for the different languages. In this approach, the embeddings for words in different languages are learned jointly, allowing the model to transfer knowledge across languages. This approach has been shown to be effective for low-resource languages.

Multilingual language models have many potential applications, including machine translation, language identification, and cross-lingual information retrieval. They can also be used for tasks such as sentiment analysis and named entity recognition across multiple languages. However, there are also challenges associated with multilingual language modeling, including the need for large amounts of multilingual data and the difficulty of balancing the modeling of multiple languages.

2. Cross lingual Language Modeling:

Crosslingual language modeling is a type of multilingual language modeling that focuses specifically on the problem of transferring knowledge between languages that are not necessarily closely related. The goal is to create a language model that can understand multiple languages and can be used to perform tasks across languages, even when there is limited data available for some of the languages.

One approach to crosslingual language modeling is to use a shared encoder for multiple languages, which can be used to map input text into a common embedding space. This approach allows the model to transfer knowledge across languages and to leverage shared structures and features across languages.

Another approach is to use parallel corpora, which are pairs of texts in two different languages that have been aligned sentence-by-sentence. These parallel corpora can be used to train …
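As one concrete illustration of a single shared model handling several languages (an example added here, not from the text; it assumes the Hugging Face transformers package is installed and the publicly available xlm-roberta-base checkpoint, a masked language model trained on about 100 languages with a shared vocabulary, can be downloaded):

```python
# Requires the `transformers` package; downloads the multilingual
# xlm-roberta-base checkpoint on first use.
from transformers import pipeline

fill = pipeline("fill-mask", model="xlm-roberta-base")  # one model, many languages

# The same shared model completes masked tokens in English and in French.
print(fill("The capital of France is <mask>.")[0]["token_str"])
print(fill("La capitale de la France est <mask>.")[0]["token_str"])
```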