Introduction to Language Models: Markov models, N-grams
Language modeling is a foundational concept in the field of Natural Language
Processing (NLP), which lies at the intersection of computer science, linguistics,
and artificial intelligence. At its core, language modeling involves the prediction
of the next word or token in a sequence of words.
N-grams: Language Modeling
N-grams serve as the fundamental building blocks of language modeling.
N-grams are probabilistic language models that estimate the likelihood of a word
based on the preceding N-1 words.
In other words, they model the conditional probability of a word given its
context.
 Definition: An N-gram is a contiguous sequence of N items (words, characters, or other tokens) from a given sample of text or speech.
 Example: In a bigram (2-gram) model, the probability of a word depends solely on the previous word. So, to predict the third word in the sentence “I love Natural Language,” the model only considers the second word, “love” (a short sketch follows below).
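To make the bigram idea concrete, here is a minimal Python sketch that estimates P(word | previous word) from raw counts; the tiny corpus is an assumption for illustration only.

```python
from collections import defaultdict

# Toy corpus (an assumption for illustration).
corpus = "i love natural language i love machine learning".split()

# Count bigrams and their one-word contexts.
bigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[(prev, curr)] += 1
    context_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev) = count(prev curr) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / context_counts[prev]

print(bigram_prob("i", "love"))        # 1.0 in this toy corpus
print(bigram_prob("love", "natural"))  # 0.5
```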
Advantages of N-grams
1. Simplicity: N-grams are intuitive and relatively simple to understand and
    implement.
2. Low Memory Usage: They require minimal memory for storage compared
    to more complex models.
Limitations of N-grams
1. Limited Context: N-grams have a finite context window, which means they
    cannot capture long-range dependencies or context beyond the previous N-1
    words.
2. Sparsity: As N increases, the number of possible N-grams grows
    exponentially, leading to sparse data and increased computational demands.
While N-grams provide a useful introduction to language modeling, they have
clear limitations when it comes to capturing nuanced language patterns.
To address these limitations, we turn to a more sophisticated approach: Markov
models.
Markov Models: Contextual Predictions
Markov models are a step up from N-grams in terms of contextual prediction.
They are based on the Markov property, which posits that the probability of a
future state depends solely on the current state.
In the context of language modeling, this translates to predicting the next word
based on the current word, which is known as a first-order Markov model or a
Markov chain.
Exploring Markov Models
 First-Order Markov Model: In this model, the probability of a word depends only on the preceding word. For example, to predict the third word in a sentence, the model considers only the second word.
 Higher-Order Markov Models: These models extend the context window beyond one word. A second-order Markov model considers the probability of a word based on the previous two words, and so on (a short sketch follows below).
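As referenced in the list above, here is a minimal sketch of a second-order Markov model; the toy corpus is an assumption, and the “state” is the pair of the two preceding words.

```python
from collections import defaultdict

# Toy corpus (an assumption for illustration).
words = "the cat sat on the mat the cat ran on the road".split()

# Second-order model: the next word depends on the previous two words.
transitions = defaultdict(lambda: defaultdict(int))
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    transitions[(w1, w2)][w3] += 1

def next_word_distribution(w1, w2):
    """Return P(next | w1, w2) as a dict, estimated from counts."""
    counts = transitions[(w1, w2)]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

print(next_word_distribution("the", "cat"))  # {'sat': 0.5, 'ran': 0.5}
print(next_word_distribution("on", "the"))   # {'mat': 0.5, 'road': 0.5}
```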
Advantages of Markov Models
1. Improved Contextual Understanding: Markov models capture contextual
    information better than N-grams, as they can consider more extensive
    context.
Limitations of Markov Models
1. Curse of Dimensionality: Higher-order Markov models suffer from the curse
    of dimensionality. As the context size increases, the number of possible
    states grows exponentially, making it challenging to estimate accurate
    probabilities.
2. Limited Long-Range Dependencies: Even with higher orders, Markov
    models struggle to capture very long-range dependencies in language.
While Markov models offer enhanced contextual prediction, they too have their
limitations, particularly when it comes to modeling complex language patterns.
Hidden Markov Models
The Hidden Markov Model (HMM) can be applied to part-of-speech tagging. Part-of-speech tagging is a fully supervised learning task, because we have a corpus of words labeled with the correct part-of-speech tag. But many applications don’t have labeled data.
Markov chain
The HMM is based on augmenting the Markov chain. A Markov chain is a model that tells us
something about the probabilities of sequences of random variables, states, each of which can
take on values from some set.
These sets can be words, or tags, or symbols representing anything, like the weather.
A Markov chain makes a very strong assumption that if we want to predict the future in the
sequence, all that matters is the current state.
The states before the current state have no impact on the future except via the current state.
It’s as if to predict tomorrow’s weather you could examine today’s weather but you weren’t
allowed to look at yesterday’s weather.
Note: A detailed description of HMMs follows (see HMM.pdf).
A Markov chain is a sequence of states. A sequence means that there is always a transition, a moment where the process moves from one state to another. The idea is to generate a sequence of states based on the existing states and the probability of each outcome that follows them.
Markov chains are one of the most important stochastic processes, that is, processes in which some value changes randomly over time. They are called so because they obey the Markov property, which says that
“the next state of the process depends only on how it is right now,”
i.e., the information contained in the current state of the process is all that is needed to determine the future states.
The chain doesn’t have a “memory” of how it was before. It is helpful to think of a Markov chain as evolving through discrete steps in time, although the “step” doesn’t need to have anything to do with time.
A 2-State Markov Model
Consider a model with two states, 0 and 1, whose transition probabilities are represented as P(1|0) = p, P(0|0) = 1 − p, P(0|1) = q, and P(1|1) = 1 − q, so that the probabilities leaving each state sum to 1.
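A small simulation sketch of this two-state chain follows; the values of p and q are chosen arbitrarily for illustration, and each step depends only on the current state.

```python
import random

# Transition probabilities (arbitrary values for illustration):
# P(1|0) = p, P(0|0) = 1 - p, P(0|1) = q, P(1|1) = 1 - q
p, q = 0.3, 0.6

def step(state):
    """Sample the next state given only the current state (the Markov property)."""
    if state == 0:
        return 1 if random.random() < p else 0
    return 0 if random.random() < q else 1

state, sequence = 0, []
for _ in range(10):
    state = step(state)
    sequence.append(state)
print(sequence)  # e.g. [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
```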
The Model
Now think of text as a sequence of states, and use a character-based language model. Such models predict one character at a time: given a state, the machine predicts the next character.
myText = "the they them .... the them ... the they .. then"
Given the string myText above, we want the machine to record the frequency of the next character for a chosen window size, where the window size is the number of characters grouped and treated as a whole when looking up the next character.
Here, we take the window size, k=3, which gives us the following
frequency table
X      y    Frequency
the    _    3
he_    t    3
e_t    h    3
_th    e    8
the    y    2
the    n    1
...    ...  ...
Once this tabular data has been retrieved, the model predicts the most likely output whenever a certain sequence is encountered, based on the probability of its occurrence in the string: for example, the probability of the next character when the string ‘the’ is encountered.
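A minimal sketch of this character-level model follows. It uses the myText string as written above (the "..." parts are literal placeholders, so the printed counts will not exactly match the table) and a window size of k = 3.

```python
from collections import Counter, defaultdict

myText = "the they them .... the them ... the they .. then"
k = 3  # window size: k characters grouped and treated as one state (X)

# Build the frequency table of next characters (y) for each window (X).
freq = defaultdict(Counter)
for i in range(len(myText) - k):
    window = myText[i:i + k]
    next_char = myText[i + k]
    freq[window][next_char] += 1

def predict(window):
    """Return the most likely next character after `window`, with its probability."""
    counts = freq[window]
    total = sum(counts.values())
    char, count = counts.most_common(1)[0]
    return char, count / total

print(freq["the"])     # frequency of each character seen after "the"
print(predict("the"))  # most probable next character after "the"
```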
Estimating Probabilities: Probability estimation for words, smoothing
techniques
What is Additive Smoothing?
Additive smoothing is a technique that adjusts the estimated probabilities of n-
grams by adding a small constant value (usually denoted as α) to the count of
each n-gram. This approach ensures that no probability is zero, even for n-grams
that were not observed in the training data.
Working of Additive Smoothing
The main idea behind additive smoothing is to distribute some probability mass
to unseen n-grams by adding a constant α to each n-gram count. This has the
effect of lowering the probability of observed n-grams slightly while ensuring
that unseen n-grams receive a small, non-zero probability.
The choice of the smoothing parameter α is crucial:
 If α = 1: This is known as Laplace Smoothing. It treats all n-grams, whether seen or unseen, with equal weight.
 If 0 < α < 1: This is often referred to as Lidstone Smoothing. It provides a more fine-grained adjustment, typically resulting in better performance in practice.
Laplace Smoothing in Language Models
Laplace Smoothing is a specific case of additive smoothing where the smoothing
parameter α is set to 1. The primary goal of Laplace Smoothing is to prevent the
probability of any n-gram from being zero, which would otherwise happen if the
n-gram was not observed in the training data.
The formula for calculating the smoothed probability using Laplace
Smoothing is:
P(wn | wn−1, …, w1) = (C(w1, …, wn) + 1) / (C(w1, …, wn−1) + V)
Where:
 C(w1, …, wn) is the count of the n-gram (w1, …, wn) in the training data.
 C(w1, …, wn−1) is the count of the (n−1)-gram prefix.
 V is the size of the vocabulary (i.e., the total number of unique words in the training data).
How Laplace Smoothing Works:
Laplace Smoothing works by adding 1 to the count of every possible n-gram,
including those that were not observed in the training data. This adjustment
ensures that no n-gram has a zero probability, which would indicate that it is
impossible according to the model. By doing so, Laplace Smoothing distributes
some probability mass to these unseen n-grams, making the model more
adaptable to new data.
Here’s a step-by-step breakdown of how Laplace Smoothing is applied:
1. Count the N-grams: First, count the occurrences of all n-grams in the training data.
2. Add 1 to All Counts: Add 1 to the count of each n-gram, including those with zero counts.
3. Adjust the Denominator: Add the size of the vocabulary V to the denominator, accounting for the total number of possible n-grams (a short sketch follows below).
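As a concrete illustration of the steps above, here is a minimal sketch of additive smoothing for bigrams; the toy corpus is an assumption, α = 1 reproduces Laplace (add-one) smoothing, and 0 < α < 1 gives Lidstone smoothing.

```python
from collections import defaultdict

def smoothed_bigram_prob(bigram_counts, context_counts, vocab_size, prev, curr, alpha=1.0):
    """Additive (Lidstone) smoothing; alpha = 1.0 is Laplace (add-one) smoothing.
    P(curr | prev) = (C(prev, curr) + alpha) / (C(prev) + alpha * V)
    """
    return (bigram_counts[(prev, curr)] + alpha) / (context_counts[prev] + alpha * vocab_size)

# Toy counts (an assumption for illustration).
corpus = "i like natural language processing and i like machine learning".split()
bigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[(prev, curr)] += 1
    context_counts[prev] += 1
V = len(set(corpus))  # vocabulary size

print(smoothed_bigram_prob(bigram_counts, context_counts, V, "i", "like"))     # seen bigram
print(smoothed_bigram_prob(bigram_counts, context_counts, V, "i", "machine"))  # unseen, yet non-zero
```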
Smoothing techniques commonly used in NLP
 Laplacian (add-one) Smoothing
 Lidstone (add-k) Smoothing
 Absolute Discounting
 Katz Backoff
 Kneser-Ney Smoothing
 Interpolation
NOTE: The token count is the number of words in a document or a sentence, while the vocab is the number of different types of words in the document or sentence. For example, in the following sentence there are 10 tokens and 8 vocabulary items (because "I" and "like" each occur twice).
"I like natural language processing and I like machine learning."
What is a corpus, and how is it used in NLP?
A corpus (plural corpora), also known as a text corpus in linguistics, is
usually a large collection of texts, and it could be compared to a database in
which each text is a record. It is often designed to contain various types of
utterances, registers, and genres as examples of language that naturally
occurs in a specific context. In natural language processing (NLP), corpora
are used to train algorithms and develop statistical models. These corpora
can be used by linguists, lexicographers, data scientists, and experts in NLP
for various tasks, including word frequency analysis, part-of-speech
tagging, and text classification. As an essential tool for anyone working
with NLP, their implementation can be as varied as creating text-to-speech
modules or a system for machine translation. However, not all corpora are
the same, and each helps accomplish NLP tasks differently.
Corpora use cases
In the field of linguistics, a corpus is a large and structured set of texts
(nowadays, usually electronically stored and processed). The texts in a
corpus have been selected to represent a particular language or subject
matter. The notion of a corpus has been helpful in computational
linguistics, where corpus-based methods are used for statistical analysis
and hypothesis testing on the data, checking the number of occurrences, or
even validating linguistic rules within the confines of a specific language
territory. Corpora are used to perform research in many different
disciplines, not just linguistics. For example, in the field of medicine,
corpora are used to help researchers develop new treatments and drugs. In
the field of law, corpora can be used to help lawyers find relevant cases and
precedents. And in the field of history, corpora can be used to help
historians find primary sources for their research.
Types of corpora
Different types of corpora can be classified according to their content, size,
and structure. However, a corpus can also have other characteristics or
properties for its organization, and often a corpus will fit into more than
one classification.
Corpora classifications based on content
Text corpora are the most common type of corpora that contain texts from
different sources.
Speech corpora contain recordings of people speaking and verbatim audio
transcriptions, and are often used to study how people speak a particular
language or to develop speech recognition software.
Image corpora contain images to develop computer vision algorithms, and
usually, each image is tagged to allow for identification.
Video corpora include videos and are used to create algorithms for
tracking objects on video.
Corpora classifications based on size
Small corpora typically comprise just a few texts and can be used for
specific research tasks. For example, small corpora of medical texts might
be used to study a specific disease.
Large corpora are composed of hundreds or even millions of texts and are
often used for general research tasks, such as studying the overall patterns
of a language.
Corpora classifications based on structure
Monolingual corpora are the most common type of corpus and contain texts from a single language source only.
Multilingual corpora, simply put, contain more than one language. They
can be classified further based on how the text was created and the
relationship between both languages.
Parallel corpora are made from two or more monolingual corpora where
one corpus is the source and the second one will be a direct translation. In
this type of corpora, both languages will be aligned to have matching
segments at the paragraph or sentence level.
Comparable corpora are made of two or more monolingual corpora built using the same principles and, therefore, offer similar results. However, as the texts are not translations of each other, they are not aligned.
Corpora classifications based on purpose and other factors
General corpora contain various types of texts that can be utilized in
different research fields, offering a baseline resource for general studies.
The source can be written text or spoken language, along with
transcriptions.
Specialized corpora are designed for specific research goals containing a
particular text type. These constraints can refer to a specific time frame or a
particular subject, among other things.
Diachronic corpora contain language data from different historical
periods, and language experts use these to study the changes and
development in a specific language.
Synchronic corpora would be the opposite of diachronic corpora, and all
texts must be compiled from the same period.
Monitor corpora are diachronic and expandable. They are continuously
updated to reflect the changes in language usage by incorporating new
words and expressions.
National corpora contain texts that represent language used in a specific
country.
Reference corpora, in general terms, are used as the base of comparison
with other corpora. However, these are expected to be large general
corpora that offer comprehensive coverage, which the community of users
can regard as the standard for the particular use case.
Learner corpora include samples produced by non-native speakers of a
language. This type of corpora allows researchers to compare the texts
created by native speakers against those produced by language learners.
Developmental corpora contain language data from monolingual speakers
at different stages in their language development. These can track and
understand first language acquisition and vocabulary development.
Raw corpora provide no annotations or additional information and are
given as originally collected.
Annotated corpora contain texts annotated with information about their
structure, content, or meaning. For example, a corpus of medical texts
might be annotated with information about the diseases mentioned in each
text. Annotated corpora are often used to develop computational linguistics
applications, such as question-answering systems.
So, how do you build a corpus?
As mentioned, a corpus is an extensive collection of texts. Building one
provides an essential resource to investigate language and the learning data
necessary to create different tools that can be implemented in numerous
applications. Here are some steps on how to go about building a corpus for
your specific project needs.
1. Define the scope
Decide what kind of corpus you want to create. As there are many different
types of corpora, each type serves a specific purpose. Understanding
exactly what kind of data you need is the first step to building an effective
corpus for your project.
2. Define the format
Collect texts in whatever format your project requires. The text collection
could be digital (e.g., websites or other digitally stored files) or physical
(e.g., books or other printed documents). The collection stage could also
require samples of spoken language that will need to be transcribed before
the text can be used.
3. Organize the data
Organize your texts into a coherent structure. Doing this will make it easier
to search and analyze the language data in your corpus later. Having your
text divided into different categories or topics is a common first approach
for text organization.
4. Use the right tools
Use a corpus-building tool or service to help create and manage your
corpus. Many software options and platforms are available to help you
collect or even generate new text for your project.
5. Annotate the data
Annotate your corpus with metadata. Tagging or annotations will describe
the contents of each text and can be used to categorize and search the
corpus for further implementation.
6. Analyze the data
Explore your corpus! Once you have built it, you can start to carry out all
sorts of interesting analyses, such as looking at word frequencies or finding
collocations and interesting language patterns.
NOTE: A corpus is a large and structured set of machine-readable texts that have been produced
in a natural communicative setting. Its plural is corpora. They can be derived in different ways like
text that was originally electronic, transcripts of spoken language and optical character recognition,
etc.
    Key Characteristics of a Corpus:
    1. Structured Collection of Text:
    A corpus is typically organized and labeled in a way that makes it easy to
    analyze. It might include metadata about each text, such as its source,
    author, date of publication, or linguistic annotations (e.g., part-of-speech
    tags).
    2. Size and Scope:
    Corpora vary in size, ranging from small, domain-specific collections to
    massive datasets containing millions or billions of words. The size of the
    corpus often depends on the NLP task at hand. Larger corpora provide
    more data for training and improving the accuracy of models, particularly
    for deep learning-based NLP models.
    3. Diversity of Text Types:
    A corpus may contain different types of text, such as books, articles, blog
    posts, social media content, transcripts of conversations, or legal
    documents. Depending on the task, a corpus may focus on specific domains
    (e.g., medical or legal) or provide a broad representation of general
    language usage.
    Task-specific corpora:
   POS Tagging: Penn Treebank's WSJ section is tagged with a 45-tag
    tagset. Use Ritter dataset for social media content.
   Named Entity Recognition: CoNLL 2003 NER task is newswire content
    from Reuters RCV1 corpus. It considers four entity types. WNUT 2017
    Emerging Entities task and OntoNotes 5.0 are other datasets.
   Constituency Parsing: Penn Treebank's WSJ section has dataset for this
    purpose.
   Semantic role labelling: OntoNotes v5.0 is useful due to syntactic and
    semantic annotations.
   Sentiment Analysis: IMDb has released 50K movie reviews. Others are
    Amazon Customer Reviews of 130 million reviews, 6.7 million business
    reviews from Yelp, and Sentiment140 of 160K tweets.
   Text Classification/Clustering: Reuters-21578 is a collection of news
    documents from 1987 indexed by categories. 20 Newsgroups is another
    dataset of about 20K documents from 20 newsgroups.
   Question Answering: Stanford Question Answering Dataset (SQuAD) is
    a reading comprehension dataset with 100K questions plus 50K
    unanswerable questions.
    A Wordlist Corpus is a specific type of corpus that contains a list of words used
    for tasks where word-level information is required.
    Understanding Wordlist Corpus
    A Wordlist Corpus is a collection of words organized in a specific format with
    each word on a separate line. This type of corpus is widely used in NLP tasks that
    require a predefined set of words such as creating custom dictionaries, spell-
    checking applications, text normalization or filtering out certain words based on
    the task’s requirements.
1. Text Preprocessing: Removes stop words, unwanted words or filters out non-relevant terms.
2. Spell Checking: Ensures the presence of correctly spelled words by referencing them from a dictionary-like corpus.
3. Text Normalization: Converts variations of the same word into a standard format for further processing.
4. Word Filtering: When working with a text corpus it may be necessary to filter out certain words or phrases (a short sketch follows after this list).
5. Building Custom Dictionaries: You can create custom dictionaries to enhance Named Entity Recognition (NER), classification or other NLP tasks that require domain-specific knowledge.
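As referenced in the list above, here is a hedged sketch that uses NLTK's wordlist corpora (stopwords and words) for word filtering and a simple spell check; it assumes both corpora have been downloaded with nltk.download.

```python
import nltk
# One-time downloads (assumed): nltk.download("stopwords"); nltk.download("words")
from nltk.corpus import stopwords, words

stop_words = set(stopwords.words("english"))            # wordlist corpus of English stop words
english_vocab = set(w.lower() for w in words.words())   # wordlist corpus of English words

text = "I like naturall language processing".lower().split()

# Word filtering: drop stop words.
filtered = [w for w in text if w not in stop_words]
print(filtered)

# Simple spell check: flag tokens absent from the wordlist corpus.
# (Inflected forms may also be flagged, since the corpus holds mostly base forms.)
misspelled = [w for w in filtered if w not in english_vocab]
print(misspelled)  # the misspelled 'naturall' should appear here
```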
The design of a corpus is determined by the following two factors −
Balance − the range of genres included in a corpus
Sampling − how the chunks for each genre are selected.
Corpus Balance
Another very important element of corpus design is corpus balance − the range of genres included in a corpus. We have already studied that the representativeness of a general corpus depends upon how balanced the corpus is. A balanced corpus covers a wide range of text categories, which are supposed to be representative of the language. We do not have any reliable scientific measure for balance, so the best estimation and intuition are used in this concern. In other words, we can say that the accepted balance is determined by its intended uses only.
Sampling
Another important element of corpus design is sampling. Corpus representativeness and balance are very closely associated with sampling. That is why we can say that sampling is inescapable in corpus building.
Sampling unit − It refers to the unit which requires a sample. For example, for written text, a sampling unit may be a newspaper, a journal or a book.
Sampling frame − The list of all sampling units is called a sampling frame.
Population − It may be referred to as the assembly of all sampling units. It is defined in terms of language production, language reception or language as a product.
Corpus Size
The size of the corpus depends upon the purpose for which it is intended as well
as on some practical considerations as follows −
With the advancement in technology, the corpus size also increases. The
following table of comparison will help you understand how the corpus size
works −
Year                  Name of the Corpus               Size (in words)
1960s - 70s           Brown and LOB                    1 million words
1980s                 The Birmingham corpora           20 million words
1990s                 The British National Corpus      100 million words
Early 21st century    The Bank of English corpus       650 million words
A few examples of corpora:
TreeBank Corpus
It may be defined as a linguistically parsed text corpus that annotates syntactic or semantic sentence structure. Geoffrey Leech coined the term treebank, reflecting the fact that the most common way of representing grammatical analysis is by means of a tree structure. Generally, Treebanks are created on top of a corpus which has already been annotated with part-of-speech tags.
Types of TreeBank Corpus
Semantic and Syntactic Treebanks are the two most common types of Treebanks
in linguistics. Let us now learn more about these types −
Semantic Treebanks
These Treebanks use a formal representation of a sentence's semantic structure. They vary in the depth of their semantic representation. The Robot Commands Treebank, Geoquery, the Groningen Meaning Bank and the RoboCup Corpus are some examples of semantic Treebanks.
Syntactic Treebanks
In contrast to semantic Treebanks, inputs to syntactic Treebank systems are expressions of the formal language obtained from the conversion of parsed Treebank data, and the outputs of such systems are predicate-logic-based meaning representations. Various syntactic Treebanks in different languages have been created so far. For example, the Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic Treebanks created in the Arabic language; the Sinica Treebank was created in the Chinese language; and Lucy, Susanne and the BLLIP WSJ syntactic corpus were created in the English language.
Applications of TreeBank Corpus
Following are some of the applications of Treebanks −
In Computational Linguistics
In computational linguistics, the best use of Treebanks is to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems.
In Corpus Linguistics
In corpus linguistics, the best use of Treebanks is to study syntactic phenomena.
In Theoretical Linguistics and Psycholinguistics
The best use of Treebanks in theoretical linguistics and psycholinguistics is as interaction evidence.
PropBank Corpus
PropBank, more specifically called the Proposition Bank, is a corpus which is annotated with verbal propositions and their arguments. The corpus is a verb-oriented resource; the annotations here are more closely related to the syntactic level. Martha Palmer et al., Department of Linguistics, University of Colorado Boulder, developed it. We can use the term PropBank as a common noun referring to any corpus that has been annotated with propositions and their arguments.
In Natural Language Processing (NLP), the PropBank project has played a very
significant role. It helps in semantic role labeling.
VerbNet(VN)
VerbNet (VN) is the largest hierarchical, domain-independent lexical resource for English that incorporates both semantic and syntactic information about its contents. VN is a broad-coverage verb lexicon with mappings to other lexical resources such as WordNet, Xtag and FrameNet. It is organized into verb classes that extend the Levin classes through refinement and the addition of subclasses, achieving syntactic and semantic coherence among class members.
Each VerbNet (VN) class contains −
A set of syntactic descriptions or syntactic frames
For depicting the possible surface realizations of the argument structure for
constructions such as transitive, intransitive, prepositional phrases, resultatives,
and a large set of diathesis alternations.
A set of semantic descriptions such as animate, human, organization
For constraining the types of thematic roles allowed by the arguments; further restrictions may be imposed. This helps indicate the syntactic nature of the constituent likely to be associated with the thematic role.
WordNet
WordNet, created at Princeton University, is a lexical database for the English language. It is part of the NLTK corpus collection. In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms called synsets. All the synsets are linked with the help of conceptual-semantic and lexical relations. Its structure makes it very useful for natural language processing (NLP).
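A short sketch of querying WordNet through NLTK follows, assuming the wordnet corpus has been downloaded.

```python
import nltk
# One-time download (assumed): nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# Synsets group words into sets of cognitive synonyms.
for synset in wn.synsets("dog")[:3]:
    print(synset.name(), "-", synset.definition())

# Conceptual-semantic and lexical relations, e.g. hypernyms ("is-a" parents) and synonyms.
dog = wn.synset("dog.n.01")
print([s.name() for s in dog.hypernyms()])       # e.g. ['canine.n.02', 'domestic_animal.n.01']
print([lemma.name() for lemma in dog.lemmas()])  # lemma names (synonyms) in this synset
```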
POS(Parts-Of-Speech) Tagging in NLP
Parts of Speech tagging is a linguistic activity in Natural Language
Processing (NLP) wherein each word in a document is given a particular
part of speech (adverb, adjective, verb, etc.) or grammatical category.
Through the addition of a layer of syntactic and semantic information to
the words, this procedure makes it easier to comprehend the sentence’s
structure and meaning.
In NLP applications, POS tagging is useful for machine translation, named
entity recognition, and information extraction, among other things. It also
works well for clearing out ambiguity in terms with numerous meanings
and revealing a sentence’s grammatical structure.
Default tagging is a basic step for the part-of-speech tagging. It is
performed using the DefaultTagger class. The DefaultTagger class takes
‘tag’ as a single argument. NN is the tag for a singular noun.
DefaultTagger is most useful when it works with the most common part-of-speech tag, which is why a noun tag is recommended.
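A minimal sketch of the DefaultTagger described above:

```python
from nltk.tag import DefaultTagger

# DefaultTagger assigns the same tag to every token; 'NN' (singular noun) is the usual choice.
tagger = DefaultTagger("NN")
tokens = "The quick brown fox jumps over the lazy dog".split()
print(tagger.tag(tokens))
# [('The', 'NN'), ('quick', 'NN'), ..., ('dog', 'NN')]
```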
Example of POS Tagging
Consider the sentence: “The quick brown fox jumps over the lazy dog.”
After performing POS Tagging:
 “The” is tagged as determiner (DT)
 “quick” is tagged as adjective (JJ)
 “brown” is tagged as adjective (JJ)
 “fox” is tagged as noun (NN)
 “jumps” is tagged as verb (VBZ)
 “over” is tagged as preposition (IN)
 “the” is tagged as determiner (DT)
 “lazy” is tagged as adjective (JJ)
 “dog” is tagged as noun (NN)
By offering insights into the grammatical structure, this tagging aids
machines in comprehending not just individual words but also the
connections between them inside a phrase. For many NLP applications,
like text summarization, sentiment analysis, and machine translation, this
kind of data is essential.
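The same sentence can be tagged with NLTK's built-in tagger; a sketch follows (the exact tags depend on the tagger model and may differ slightly from the list above, and the download names can vary by NLTK version).

```python
import nltk
# One-time downloads (assumed): nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Expected to be close to the tags listed above, e.g.
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'),
#  ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
```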
Workflow of POS Tagging in NLP
The following are the processes in a typical natural language processing
(NLP) example of part-of-speech (POS) tagging:
 Tokenization: Divide the input text into discrete tokens, which are
  usually units of words or subwords. The first stage in NLP tasks is
  tokenization.
 Loading Language Models: To utilize a library such as NLTK or
  SpaCy, be sure to load the relevant language model. These models
  offer a foundation for comprehending a language’s grammatical
  structure since they have been trained on a vast amount of linguistic
  data.
 Text Processing: If required, preprocess the text to handle special
  characters, convert it to lowercase, or eliminate superfluous
  information. Correct PoS labeling is aided by clear text.
 Linguistic Analysis: To determine the text’s grammatical structure,
  use linguistic analysis. This entails understanding each word’s purpose
  inside the sentence, including whether it is an adjective, verb, noun, or
  other.
 Part-of-Speech Tagging: Apply the POS tagger so that each token receives a grammatical category (tag), using the loaded language model and the surrounding context (a spaCy sketch follows after this list).
 Results Analysis: Verify the accuracy and consistency of the PoS
  tagging findings with the source text. Determine and correct any
  possible problems or mistagging.
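As referenced in the list above, here is a compact sketch of this workflow with spaCy, assuming the small English model en_core_web_sm has been installed.

```python
import spacy

# Assumes the model has been installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")  # load the language model

# Tokenization, linguistic analysis and POS tagging happen inside the pipeline.
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(token.text, token.pos_, token.tag_)  # coarse (UPOS) and fine-grained (Penn-style) tags
```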
Types of POS Tagging in NLP
Assigning grammatical categories to words in a text is known as Part-of-
Speech (PoS) tagging, and it is an essential aspect of Natural Language
Processing (NLP). Different PoS tagging approaches exist, each with a
unique methodology. Here are a few typical kinds:
1. Rule-Based Tagging
Rule-based part-of-speech (POS) tagging involves assigning words their
respective parts of speech using predetermined rules, contrasting with
machine learning-based POS tagging that requires training on annotated
text corpora. In a rule-based system, POS tags are assigned based on
specific word characteristics and contextual cues.
For instance, a rule-based POS tagger could designate the “noun” tag to
words ending in “tion” or “ment,” recognizing common noun-forming
suffixes. This approach offers transparency and interpretability, as it
doesn’t rely on training data.
Let’s consider an example of how a rule-based part-of-speech (POS)
tagger might operate:
Rule: Assign the POS tag “noun” to words ending in “-tion” or “-ment.”
Text: “The presentation highlighted the key achievements of the project’s
development.”
Rule based Tags:
 “The” – Determiner (DET)
 “presentation” – Noun (N)
 “highlighted” – Verb (V)
 “the” – Determiner (DET)
 “key” – Adjective (ADJ)
 “achievements” – Noun (N)
 “of” – Preposition (PREP)
 “the” – Determiner (DET)
 “project’s” – Noun (N)
 “development” – Noun (N)
In this instance, the predetermined rule is followed by the rule-based POS tagger to label words. “Noun” tags are applied to words like “presentation,” “achievements,” and “development” because of the aforementioned rule. Despite the simplicity of this example, rule-based taggers may handle a broad variety of linguistic patterns by incorporating different rules, which makes the tagging process transparent and comprehensible.
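A hedged sketch of a rule-based tagger in this spirit, using NLTK's RegexpTagger with a few hypothetical suffix rules (not the exact rule set above):

```python
from nltk.tag import RegexpTagger

# Hypothetical rules: suffix patterns decide the tag; anything unmatched falls back to 'NN'.
patterns = [
    (r".*tion$", "NN"),      # e.g. "presentation"
    (r".*ment$", "NN"),      # e.g. "development"
    (r".*ed$", "VBD"),       # e.g. "highlighted"
    (r"^(the|The)$", "DT"),  # determiner
    (r".*", "NN"),           # default tag
]
tagger = RegexpTagger(patterns)

tokens = "The presentation highlighted the key achievements of the project".split()
print(tagger.tag(tokens))
```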
2. Transformation-Based Tagging
Transformation-based tagging (TBT) is a part-of-speech (POS) tagging
method that uses a set of rules to change the tags that are applied to
words inside a text. In contrast, statistical POS tagging uses trained
algorithms to predict tags probabilistically, while rule-based POS tagging
assigns tags directly based on predefined rules.
To change word tags in TBT, a set of rules is created depending on
contextual information. A rule could, for example, change a verb’s tag to a
noun if it comes after a determiner like “the.” The text is systematically
subjected to these criteria, and after each transformation, the tags are
updated.
When compared to rule-based tagging, TBT can provide higher accuracy, especially when dealing with complex grammatical structures. Nevertheless, to attain ideal performance it might require a large rule set and additional computing power.
Consider the transformation rule: Change the tag of a verb to a noun if it
follows a determiner like “the.”
Text: “The cat chased the mouse”.
Initial Tags:
 “The” – Determiner (DET)
 “cat” – Noun (N)
 “chased” – Verb (V)
 “the” – Determiner (DET)
 “mouse” – Noun (N)
Transformation rule applied:
Change the tag of “chased” from Verb (V) to Noun (N) because it follows
the determiner “the.”
Updated tags:
 “The” – Determiner (DET)
 “cat” – Noun (N)
 “chased” – Noun (N)
 “the” – Determiner (DET)
 “mouse” – Noun (N)
In this instance, the tag “chased” was changed from a verb to a noun by
the TBT system using a transformation rule based on the contextual
pattern. The tagging is updated iteratively and the rules are applied
sequentially. Although this example is simple, given a well-defined set of
transformation rules, TBT systems can handle more complex grammatical
patterns.
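A hand-rolled sketch of applying this single transformation rule follows; real TBT systems such as the Brill tagger learn many such rules automatically from annotated data.

```python
# Rule (from the example above): change a Verb (V) tag to Noun (N)
# if the previous word is the determiner "the".
initial = [("The", "DET"), ("cat", "N"), ("chased", "V"), ("the", "DET"), ("mouse", "N")]

def apply_rule(tagged):
    updated = list(tagged)
    for i in range(1, len(updated)):
        word, tag = updated[i]
        prev_word, _ = updated[i - 1]
        if tag == "V" and prev_word.lower() == "the":
            updated[i] = (word, "N")  # transformation: V -> N after a determiner
    return updated

print(apply_rule(initial))
# [('The', 'DET'), ('cat', 'N'), ('chased', 'N'), ('the', 'DET'), ('mouse', 'N')]
```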
3. Statistical POS Tagging
Utilizing probabilistic models, statistical part-of-speech (POS) tagging is a computational linguistics technique that assigns grammatical categories to words in a text. Whereas rule-based tagging relies on hand-written rules, statistical tagging uses machine learning algorithms trained on massive annotated corpora.
In order to capture the statistical linkages present in language, these algorithms learn the probability distribution of word-tag sequences. Conditional random fields (CRFs) and Hidden Markov Models (HMMs) are popular models for statistical POS tagging. During training, the algorithm learns from labeled samples to estimate the probability of observing a specific tag given the current word and its context.
The most likely tags for text that hasn’t been seen are then predicted
using the trained model. Statistical POS tagging works especially well for
languages with complicated grammatical structures because it is
exceptionally good at handling linguistic ambiguity and catching subtle
language trends.
 Hidden Markov Model POS tagging: Hidden Markov Models (HMMs)
    serve as a statistical framework for part-of-speech (POS) tagging in
    natural language processing (NLP). In HMM-based POS tagging, the
    model undergoes training on a sizable annotated text corpus to discern
    patterns in various parts of speech. Leveraging this training, the model
    predicts the POS tag for a given word based on the probabilities
    associated with different tags within its context.
  Comprising states for potential POS tags and transitions between
  them, the HMM-based POS tagger learns transition probabilities and
  word-emission probabilities during training. To tag new text, the model,
  employing the Viterbi algorithm, calculates the most probable
  sequence of POS tags based on the learned probabilities.
  Widely applied in NLP, HMMs excel at modeling intricate sequential
  data, yet their performance may hinge on the quality and quantity of
  annotated training data.
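A small sketch of training a supervised HMM tagger with NLTK on the Penn Treebank sample it ships with; this assumes nltk.download("treebank") has been run and is an illustration rather than a production setup.

```python
import nltk
# One-time download (assumed): nltk.download("treebank")
from nltk.corpus import treebank
from nltk.tag import hmm

# Split the tagged sentences into a training slice and a small held-out slice.
tagged_sents = treebank.tagged_sents()
train_sents, test_sents = tagged_sents[:3000], tagged_sents[3000:3100]

# Learn transition and emission probabilities from the labeled data.
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_sents)

print(tagger.tag("The quick brown fox jumps over the lazy dog".split()))
print(tagger.accuracy(test_sents))  # .accuracy() in recent NLTK; older versions use .evaluate()
```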
Advantages of POS Tagging
There are several advantages of Parts-Of-Speech (POS) Tagging
including:
 Text Simplification: Breaking complex sentences down into their
   constituent parts makes the material easier to understand and easier to
   simplify.
 Information Retrieval: Information retrieval systems are enhanced by part-of-speech (POS) tagging, which allows for more precise indexing and search based on grammatical categories.
 Named Entity Recognition: POS tagging helps to identify entities
   such as names, locations, and organizations inside text and is a
   precondition for named entity identification.
 Syntactic Parsing: It facilitates syntactic parsing, which helps with
   phrase structure analysis and word link identification.
Disadvantages of POS Tagging
Some common disadvantages in part-of-speech (POS) tagging include:
 Ambiguity: The inherent ambiguity of language makes POS tagging
  difficult since words can signify different things depending on the
  context, which can result in misunderstandings.
 Idiomatic Expressions: Slang, colloquialisms, and idiomatic phrases
  can be problematic for POS tagging systems since they don’t always
  follow formal grammar standards.
 Out-of-Vocabulary Words: Out-of-vocabulary words (words not
  included in the training corpus) can be difficult to handle since the
  model might have trouble assigning the correct POS tags.
 Domain Dependence: POS tagging models trained on a single domain might not generalize well to other domains, so for best results they need plenty of domain-specific training data.