Introduction to Language Models: Markov models, N-grams
Language modeling is a foundational concept in the field of Natural Language
Processing (NLP), which lies at the intersection of computer science, linguistics,
and artificial intelligence. At its core, language modeling involves the prediction
of the next word or token in a sequence of words.
N-grams: Language Modeling
N-grams serve as the fundamental building blocks of language modeling.
N-grams are probabilistic language models that estimate the likelihood of a word
based on the preceding N-1 words.
In other words, they model the conditional probability of a word given its
context.
 Definition: An N-gram is a contiguous sequence of N items (words, characters, or other tokens) from a given sample of text or speech.
 Example: In a bigram (2-gram) model, the probability of a word depends solely on the previous word. So, to predict the third word in the sentence “I love Natural Language,” the model only considers the second word, “love” (a short sketch follows below).
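To make the bigram idea concrete, here is a minimal Python sketch that estimates P(word | previous word) from raw counts; the tiny corpus is an assumption for illustration only.

```python
from collections import defaultdict

# Toy corpus (an assumption for illustration).
corpus = "i love natural language i love machine learning".split()

# Count bigrams and their one-word contexts.
bigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[(prev, curr)] += 1
    context_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev) = count(prev curr) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / context_counts[prev]

print(bigram_prob("i", "love"))        # 1.0 in this toy corpus
print(bigram_prob("love", "natural"))  # 0.5
```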
Advantages of N-grams
1. Simplicity: N-grams are intuitive and relatively simple to understand and
    implement.
2. Low Memory Usage: They require minimal memory for storage compared
    to more complex models.
Limitations of N-grams
1. Limited Context: N-grams have a finite context window, which means they
    cannot capture long-range dependencies or context beyond the previous N-1
    words.
2. Sparsity: As N increases, the number of possible N-grams grows
    exponentially, leading to sparse data and increased computational demands.
While N-grams provide a useful introduction to language modeling, they have
clear limitations when it comes to capturing nuanced language patterns.
To address these limitations, we turn to a more sophisticated approach: Markov
models.
Markov Models: Contextual Predictions
Markov models are a step up from N-grams in terms of contextual prediction.
They are based on the Markov property, which posits that the probability of a
future state depends solely on the current state.
In the context of language modeling, this translates to predicting the next word
based on the current word, which is known as a first-order Markov model or a
Markov chain.
Exploring Markov Models
 First-Order Markov Model: In this model, the probability of a word depends only on the preceding word. For example, to predict the third word in a sentence, the model considers only the second word.
 Higher-Order Markov Models: These models extend the context window beyond one word. A second-order Markov model considers the probability of a word based on the previous two words, and so on (a short sketch follows below).
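As referenced in the list above, here is a minimal sketch of a second-order Markov model; the toy corpus is an assumption, and the “state” is the pair of the two preceding words.

```python
from collections import defaultdict

# Toy corpus (an assumption for illustration).
words = "the cat sat on the mat the cat ran on the road".split()

# Second-order model: the next word depends on the previous two words.
transitions = defaultdict(lambda: defaultdict(int))
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    transitions[(w1, w2)][w3] += 1

def next_word_distribution(w1, w2):
    """Return P(next | w1, w2) as a dict, estimated from counts."""
    counts = transitions[(w1, w2)]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

print(next_word_distribution("the", "cat"))  # {'sat': 0.5, 'ran': 0.5}
print(next_word_distribution("on", "the"))   # {'mat': 0.5, 'road': 0.5}
```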
Advantages of Markov Models
1. Improved Contextual Understanding: Markov models capture contextual
    information better than N-grams, as they can consider more extensive
    context.
Limitations of Markov Models
1. Curse of Dimensionality: Higher-order Markov models suffer from the curse
    of dimensionality. As the context size increases, the number of possible
    states grows exponentially, making it challenging to estimate accurate
    probabilities.
2. Limited Long-Range Dependencies: Even with higher orders, Markov
    models struggle to capture very long-range dependencies in language.
While Markov models offer enhanced contextual prediction, they too have their
limitations, particularly when it comes to modeling complex language patterns.
Hidden Markov Models
The Hidden Markov Model (HMM) can be applied to part-of-speech tagging. Part-of-speech tagging is a fully supervised learning task, because we have a corpus of words labeled with the correct part-of-speech tag. But many applications don’t have labeled data.
Markov chain
The HMM is based on augmenting the Markov chain. A Markov chain is a model that tells us
something about the probabilities of sequences of random variables, states, each of which can
take on values from some set.
These sets can be words, or tags, or symbols representing anything, like the weather.
A Markov chain makes a very strong assumption that if we want to predict the future in the
sequence, all that matters is the current state.
The states before the current state have no impact on the future except via the current state.
It’s as if to predict tomorrow’s weather you could examine today’s weather but you weren’t
allowed to look at yesterday’s weather.
Note: A detailed description of HMMs follows (see HMM.pdf).
A Markov chain is a sequence of states. A sequence means that there is always a transition, a moment where the process moves from one state to another. The idea is to generate a sequence of states based on the existing states and the probability of each outcome that follows them.
Markov chains are one of the most important stochastic processes, that is, processes in which some value changes randomly over time. They are called so because they obey the Markov property, which says that
“the next state of the process depends only on how it is right now,”
i.e., the information contained in the current state of the process is all that is needed to determine the future states.
The chain doesn’t have a “memory” of how it was before. It is helpful to think of a Markov chain as evolving through discrete steps in time, although the “step” doesn’t need to have anything to do with time.
A 2-State Markov Model
Consider a model with two states, 0 and 1, whose transition probabilities are represented as P(1|0) = p, P(0|0) = 1 − p, P(0|1) = q, and P(1|1) = 1 − q, so that the probabilities leaving each state sum to 1.
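A small simulation sketch of this two-state chain follows; the values of p and q are chosen arbitrarily for illustration, and each step depends only on the current state.

```python
import random

# Transition probabilities (arbitrary values for illustration):
# P(1|0) = p, P(0|0) = 1 - p, P(0|1) = q, P(1|1) = 1 - q
p, q = 0.3, 0.6

def step(state):
    """Sample the next state given only the current state (the Markov property)."""
    if state == 0:
        return 1 if random.random() < p else 0
    return 0 if random.random() < q else 1

state, sequence = 0, []
for _ in range(10):
    state = step(state)
    sequence.append(state)
print(sequence)  # e.g. [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
```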
The Model
Now think of text as a sequence of states, and use a character-based language model. Such models predict one character at a time: given a state, the machine predicts the next character.
myText = "the they them .... the them ... the they .. then"
Given the string myText above, we want the machine to record the frequency of the next character for a chosen window size, where the window size is the number of characters grouped and treated as a whole when looking up the next character.
Here, we take the window size, k=3, which gives us the following
frequency table
X      y    Frequency
the    _    3
he_    t    3
e_t    h    3
_th    e    8
the    y    2
the    n    1
...    ...  ...
Once this tabular data has been retrieved, the model predicts the most likely output whenever a certain sequence is encountered, based on the probability of its occurrence in the string: for example, the probability of the next character when the string ‘the’ is encountered.
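A minimal sketch of this character-level model follows. It uses the myText string as written above (the "..." parts are literal placeholders, so the printed counts will not exactly match the table) and a window size of k = 3.

```python
from collections import Counter, defaultdict

myText = "the they them .... the them ... the they .. then"
k = 3  # window size: k characters grouped and treated as one state (X)

# Build the frequency table of next characters (y) for each window (X).
freq = defaultdict(Counter)
for i in range(len(myText) - k):
    window = myText[i:i + k]
    next_char = myText[i + k]
    freq[window][next_char] += 1

def predict(window):
    """Return the most likely next character after `window`, with its probability."""
    counts = freq[window]
    total = sum(counts.values())
    char, count = counts.most_common(1)[0]
    return char, count / total

print(freq["the"])     # frequency of each character seen after "the"
print(predict("the"))  # most probable next character after "the"
```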
Estimating Probabilities: Probability estimation for words, smoothing
techniques
What is Additive Smoothing?
Additive smoothing is a technique that adjusts the estimated probabilities of n-
grams by adding a small constant value (usually denoted as α) to the count of
each n-gram. This approach ensures that no probability is zero, even for n-grams
that were not observed in the training data.
Working of Additive Smoothing
The main idea behind additive smoothing is to distribute some probability mass
to unseen n-grams by adding a constant α to each n-gram count. This has the
effect of lowering the probability of observed n-grams slightly while ensuring
that unseen n-grams receive a small, non-zero probability.
The choice of the smoothing parameter α is crucial:
 If α = 1: This is known as Laplace Smoothing. It treats all n-grams, whether seen or unseen, with equal weight.
 If 0 < α < 1: This is often referred to as Lidstone Smoothing. It provides a more fine-grained adjustment, typically resulting in better performance in practice.
Laplace Smoothing in Language Models
Laplace Smoothing is a specific case of additive smoothing where the smoothing
parameter α is set to 1. The primary goal of Laplace Smoothing is to prevent the
probability of any n-gram from being zero, which would otherwise happen if the
n-gram was not observed in the training data.
The formula for calculating the smoothed probability using Laplace
Smoothing is:
P(wn | wn−1, …, w1) = (C(w1, …, wn) + 1) / (C(w1, …, wn−1) + V)
Where:
 C(w1, …, wn) is the count of the n-gram (w1, …, wn) in the training data.
 C(w1, …, wn−1) is the count of the (n−1)-gram prefix.
 V is the size of the vocabulary (i.e., the total number of unique words in the training data).
How Laplace Smoothing Works:
Laplace Smoothing works by adding 1 to the count of every possible n-gram,
including those that were not observed in the training data. This adjustment
ensures that no n-gram has a zero probability, which would indicate that it is
impossible according to the model. By doing so, Laplace Smoothing distributes
some probability mass to these unseen n-grams, making the model more
adaptable to new data.
Here’s a step-by-step breakdown of how Laplace Smoothing is applied:
1. Count the N-grams: First, count the occurrences of all n-grams in the training data.
2. Add 1 to All Counts: Add 1 to the count of each n-gram, including those with zero counts.
3. Adjust the Denominator: Add the size of the vocabulary V to the denominator, accounting for the total number of possible n-grams (a short sketch follows below).
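As a concrete illustration of the steps above, here is a minimal sketch of additive smoothing for bigrams; the toy corpus is an assumption, α = 1 reproduces Laplace (add-one) smoothing, and 0 < α < 1 gives Lidstone smoothing.

```python
from collections import defaultdict

def smoothed_bigram_prob(bigram_counts, context_counts, vocab_size, prev, curr, alpha=1.0):
    """Additive (Lidstone) smoothing; alpha = 1.0 is Laplace (add-one) smoothing.
    P(curr | prev) = (C(prev, curr) + alpha) / (C(prev) + alpha * V)
    """
    return (bigram_counts[(prev, curr)] + alpha) / (context_counts[prev] + alpha * vocab_size)

# Toy counts (an assumption for illustration).
corpus = "i like natural language processing and i like machine learning".split()
bigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[(prev, curr)] += 1
    context_counts[prev] += 1
V = len(set(corpus))  # vocabulary size

print(smoothed_bigram_prob(bigram_counts, context_counts, V, "i", "like"))     # seen bigram
print(smoothed_bigram_prob(bigram_counts, context_counts, V, "i", "machine"))  # unseen, yet non-zero
```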
Smoothing techniques commonly used in NLP
 Laplacian (add-one) Smoothing
 Lidstone (add-k) Smoothing
 Absolute Discounting
 Katz Backoff
 Kneser-Ney Smoothing
 Interpolation
NOTE: The token count is the number of words in a document or a sentence, while the vocab is the number of different types of words in the document or sentence. For example, in the following sentence there are 10 tokens and 8 vocabulary items (because "I" and "like" each occur twice).
"I like natural language processing and I like machine learning."
What is a corpus, and how is it used in NLP?
A corpus (plural corpora), also known as a text corpus in linguistics, is
usually a large collection of texts, and it could be compared to a database in
which each text is a record. It is often designed to contain various types of
utterances, registers, and genres as examples of language that naturally
occurs in a specific context. In natural language processing (NLP), corpora
are used to train algorithms and develop statistical models. These corpora
can be used by linguists, lexicographers, data scientists, and experts in NLP
for various tasks, including word frequency analysis, part-of-speech
tagging, and text classification. As an essential tool for anyone working
with NLP, their implementation can be as varied as creating text-to-speech
modules or a system for machine translation. However, not all corpora are
the same, and each helps accomplish NLP tasks differently.
Corpora use cases
In the field of linguistics, a corpus is a large and structured set of texts
(nowadays, usually electronically stored and processed). The texts in a
corpus have been selected to represent a particular language or subject
matter. The notion of a corpus has been helpful in computational
linguistics, where corpus-based methods are used for statistical analysis
and hypothesis testing on the data, checking the number of occurrences, or
even validating linguistic rules within the confines of a specific language
territory. Corpora are used to perform research in many different
disciplines, not just linguistics. For example, in the field of medicine,
corpora are used to help researchers develop new treatments and drugs. In
the field of law, corpora can be used to help lawyers find relevant cases and
precedents. And in the field of history, corpora can be used to help
historians find primary sources for their research.
Types of corpora
Different types of corpora can be classified according to their content, size,
and structure. However, a corpus can also have other characteristics or
properties for its organization, and often a corpus will fit into more than
one classification.
Corpora classifications based on content
Text corpora are the most common type of corpora that contain texts from
different sources.
Speech corpora contain recordings of people speaking and verbatim audio
transcriptions, and are often used to study how people speak a particular
language or to develop speech recognition software.
Image corpora contain images to develop computer vision algorithms, and
usually, each image is tagged to allow for identification.
Video corpora include videos and are used to create algorithms for
tracking objects on video.
Corpora classifications based on size
Small corpora typically comprise just a few texts and can be used for
specific research tasks. For example, small corpora of medical texts might
be used to study a specific disease.
Large corpora are composed of hundreds or even millions of texts and are
often used for general research tasks, such as studying the overall patterns
of a language.
Corpora classifications based on structure
Monolingual corpora are the most common type of corpus and contain texts from a single language source only.
Multilingual corpora, simply put, contain more than one language. They
can be classified further based on how the text was created and the
relationship between both languages.
Parallel corpora are made from two or more monolingual corpora where
one corpus is the source and the second one will be a direct translation. In
this type of corpora, both languages will be aligned to have matching
segments at the paragraph or sentence level.
Comparable corpora are made of two or more monolingual corpora built using the same principles and, therefore, offer similar results. However, as the texts are not translations of each other, they are not aligned.
Corpora classifications based on purpose and other factors
General corpora contain various types of texts that can be utilized in
different research fields, offering a baseline resource for general studies.
The source can be written text or spoken language, along with
transcriptions.
Specialized corpora are designed for specific research goals containing a
particular text type. These constraints can refer to a specific time frame or a
particular subject, among other things.
Diachronic corpora contain language data from different historical
periods, and language experts use these to study the changes and
development in a specific language.
Synchronic corpora would be the opposite of diachronic corpora, and all
texts must be compiled from the same period.
Monitor corpora are diachronic and expandable. They are continuously
updated to reflect the changes in language usage by incorporating new
words and expressions.
National corpora contain texts that represent language used in a specific
country.
Reference corpora, in general terms, are used as the base of comparison
with other corpora. However, these are expected to be large general
corpora that offer comprehensive coverage, which the community of users
can regard as the standard for the particular use case.
Learner corpora include samples produced by non-native speakers of a
language. This type of corpora allows researchers to compare the texts
created by native speakers against those produced by language learners.
Developmental corpora contain language data from monolingual speakers
at different stages in their language development. These can track and
understand first language acquisition and vocabulary development.
Raw corpora provide no annotations or additional information and are
given as originally collected.
Annotated corpora contain texts annotated with information about their
structure, content, or meaning. For example, a corpus of medical texts
might be annotated with information about the diseases mentioned in each
text. Annotated corpora are often used to develop computational linguistics
applications, such as question-answering systems.
So, how do you build a corpus?
As mentioned, a corpus is an extensive collection of texts. Building one
provides an essential resource to investigate language and the learning data
necessary to create different tools that can be implemented in numerous
applications. Here are some steps on how to go about building a corpus for
your specific project needs.
1. Define the scope
Decide what kind of corpus you want to create. As there are many different
types of corpora, each type serves a specific purpose. Understanding
exactly what kind of data you need is the first step to building an effective
corpus for your project.
2. Define the format
Collect texts in whatever format your project requires. The text collection
could be digital (e.g., websites or other digitally stored files) or physical
(e.g., books or other printed documents). The collection stage could also
require samples of spoken language that will need to be transcribed before
the text can be used.
3. Organize the data
Organize your texts into a coherent structure. Doing this will make it easier
to search and analyze the language data in your corpus later. Having your
text divided into different categories or topics is a common first approach
for text organization.
4. Use the right tools
Use a corpus-building tool or service to help create and manage your
corpus. Many software options and platforms are available to help you
collect or even generate new text for your project.
5. Annotate the data
Annotate your corpus with metadata. Tagging or annotations will describe
the contents of each text and can be used to categorize and search the
corpus for further implementation.
6. Analyze the data
Explore your corpus! Once you have built it, you can start to carry out all
sorts of interesting analyses, such as looking at word frequencies or finding
collocations and interesting language patterns.
NOTE: A corpus is a large and structured set of machine-readable texts that have been produced
in a natural communicative setting. Its plural is corpora. They can be derived in different ways like
text that was originally electronic, transcripts of spoken language and optical character recognition,
etc.
    Key Characteristics of a Corpus:
    1. Structured Collection of Text:
    A corpus is typically organized and labeled in a way that makes it easy to
    analyze. It might include metadata about each text, such as its source,
    author, date of publication, or linguistic annotations (e.g., part-of-speech
    tags).
    2. Size and Scope:
    Corpora vary in size, ranging from small, domain-specific collections to
    massive datasets containing millions or billions of words. The size of the
    corpus often depends on the NLP task at hand. Larger corpora provide
    more data for training and improving the accuracy of models, particularly
    for deep learning-based NLP models.
    3. Diversity of Text Types:
    A corpus may contain different types of text, such as books, articles, blog
    posts, social media content, transcripts of conversations, or legal
    documents. Depending on the task, a corpus may focus on specific domains
    (e.g., medical or legal) or provide a broad representation of general
    language usage.
    Task-specific corpora:
   POS Tagging: Penn Treebank's WSJ section is tagged with a 45-tag
    tagset. Use Ritter dataset for social media content.
   Named Entity Recognition: CoNLL 2003 NER task is newswire content
    from Reuters RCV1 corpus. It considers four entity types. WNUT 2017
    Emerging Entities task and OntoNotes 5.0 are other datasets.
   Constituency Parsing: Penn Treebank's WSJ section has dataset for this
    purpose.
   Semantic role labelling: OntoNotes v5.0 is useful due to syntactic and
    semantic annotations.
   Sentiment Analysis: IMDb has released 50K movie reviews. Others are
    Amazon Customer Reviews of 130 million reviews, 6.7 million business
    reviews from Yelp, and Sentiment140 of 160K tweets.
   Text Classification/Clustering: Reuters-21578 is a collection of news
    documents from 1987 indexed by categories. 20 Newsgroups is another
    dataset of about 20K documents from 20 newsgroups.
   Question Answering: Stanford Question Answering Dataset (SQuAD) is
    a reading comprehension dataset with 100K questions plus 50K
    unanswerable questions.
    A Wordlist Corpus is a specific type of corpus that contains a list of words used
    for tasks where word-level information is required.
    Understanding Wordlist Corpus
    A Wordlist Corpus is a collection of words organized in a specific format with
    each word on a separate line. This type of corpus is widely used in NLP tasks that
    require a predefined set of words such as creating custom dictionaries, spell-
    checking applications, text normalization or filtering out certain words based on
    the task’s requirements.
1. Text Preprocessing: Removes stop words, unwanted words or filters out non-relevant terms.
2. Spell Checking: Ensures the presence of correctly spelled words by referencing them from a dictionary-like corpus.
3. Text Normalization: Converts variations of the same word into a standard format for further processing.
4. Word Filtering: When working with a text corpus it may be necessary to filter out certain words or phrases (a short sketch follows after this list).
5. Building Custom Dictionaries: You can create custom dictionaries to enhance Named Entity Recognition (NER), classification or other NLP tasks that require domain-specific knowledge.
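As referenced in the list above, here is a hedged sketch that uses NLTK's wordlist corpora (stopwords and words) for word filtering and a simple spell check; it assumes both corpora have been downloaded with nltk.download.

```python
import nltk
# One-time downloads (assumed): nltk.download("stopwords"); nltk.download("words")
from nltk.corpus import stopwords, words

stop_words = set(stopwords.words("english"))            # wordlist corpus of English stop words
english_vocab = set(w.lower() for w in words.words())   # wordlist corpus of English words

text = "I like naturall language processing".lower().split()

# Word filtering: drop stop words.
filtered = [w for w in text if w not in stop_words]
print(filtered)

# Simple spell check: flag tokens absent from the wordlist corpus.
# (Inflected forms may also be flagged, since the corpus holds mostly base forms.)
misspelled = [w for w in filtered if w not in english_vocab]
print(misspelled)  # the misspelled 'naturall' should appear here
```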
The design of a corpus is determined by the following two factors −
Balance − the range of genres included in a corpus
Sampling − how the chunks for each genre are selected.
Corpus Balance
Another very important element of corpus design is corpus balance − the range of genres included in a corpus. We have already studied that the representativeness of a general corpus depends upon how balanced the corpus is. A balanced corpus covers a wide range of text categories, which are supposed to be representative of the language. We do not have any reliable scientific measure for balance, so the best estimation and intuition are used in this concern. In other words, we can say that the accepted balance is determined by its intended uses only.
Sampling
Another important element of corpus design is sampling. Corpus representativeness and balance are very closely associated with sampling. That is why we can say that sampling is inescapable in corpus building.
Sampling unit − It refers to the unit which requires a sample. For example, for written text, a sampling unit may be a newspaper, a journal or a book.
Sampling frame − The list of all sampling units is called a sampling frame.
Population − It may be referred to as the assembly of all sampling units. It is defined in terms of language production, language reception or language as a product.
Corpus Size
The size of the corpus depends upon the purpose for which it is intended as well
as on some practical considerations as follows −
With the advancement in technology, the corpus size also increases. The
following table of comparison will help you understand how the corpus size
works −
Year                  Name of the Corpus               Size (in words)
1960s - 70s           Brown and LOB                    1 million words
1980s                 The Birmingham corpora           20 million words
1990s                 The British National Corpus      100 million words
Early 21st century    The Bank of English corpus       650 million words
A few examples of corpora:
TreeBank Corpus
It may be defined as a linguistically parsed text corpus that annotates syntactic or semantic sentence structure. Geoffrey Leech coined the term treebank, reflecting the fact that the most common way of representing grammatical analysis is by means of a tree structure. Generally, Treebanks are created on top of a corpus which has already been annotated with part-of-speech tags.
Types of TreeBank Corpus
Semantic and Syntactic Treebanks are the two most common types of Treebanks
in linguistics. Let us now learn more about these types −
Semantic Treebanks
These Treebanks use a formal representation of a sentence's semantic structure. They vary in the depth of their semantic representation. The Robot Commands Treebank, Geoquery, the Groningen Meaning Bank and the RoboCup Corpus are some examples of semantic Treebanks.
Syntactic Treebanks
In contrast to semantic Treebanks, inputs to syntactic Treebank systems are expressions of the formal language obtained from the conversion of parsed Treebank data, and the outputs of such systems are predicate-logic-based meaning representations. Various syntactic Treebanks in different languages have been created so far. For example, the Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic Treebanks created in the Arabic language; the Sinica Treebank was created in the Chinese language; and Lucy, Susanne and the BLLIP WSJ syntactic corpus were created in the English language.
Applications of TreeBank Corpus
Following are some of the applications of Treebanks −
In Computational Linguistics
In computational linguistics, the best use of Treebanks is to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems.
In Corpus Linguistics
In corpus linguistics, the best use of Treebanks is to study syntactic phenomena.
In Theoretical Linguistics and Psycholinguistics
The best use of Treebanks in theoretical linguistics and psycholinguistics is as interaction evidence.
PropBank Corpus
PropBank, more specifically called the Proposition Bank, is a corpus which is annotated with verbal propositions and their arguments. The corpus is a verb-oriented resource; the annotations here are more closely related to the syntactic level. Martha Palmer et al., Department of Linguistics, University of Colorado Boulder, developed it. We can use the term PropBank as a common noun referring to any corpus that has been annotated with propositions and their arguments.
In Natural Language Processing (NLP), the PropBank project has played a very
significant role. It helps in semantic role labeling.
VerbNet(VN)
VerbNet (VN) is the largest hierarchical, domain-independent lexical resource for English that incorporates both semantic and syntactic information about its contents. VN is a broad-coverage verb lexicon with mappings to other lexical resources such as WordNet, Xtag and FrameNet. It is organized into verb classes that extend the Levin classes through refinement and the addition of subclasses, achieving syntactic and semantic coherence among class members.
Each VerbNet (VN) class contains −
A set of syntactic descriptions or syntactic frames
For depicting the possible surface realizations of the argument structure for
constructions such as transitive, intransitive, prepositional phrases, resultatives,
and a large set of diathesis alternations.
A set of semantic descriptions such as animate, human, organization
For constraining the types of thematic roles allowed by the arguments; further restrictions may be imposed. This helps indicate the syntactic nature of the constituent likely to be associated with the thematic role.
WordNet
WordNet, created at Princeton University, is a lexical database for the English language. It is part of the NLTK corpus collection. In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms called synsets. All the synsets are linked with the help of conceptual-semantic and lexical relations. Its structure makes it very useful for natural language processing (NLP).
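A short sketch of querying WordNet through NLTK follows, assuming the wordnet corpus has been downloaded.

```python
import nltk
# One-time download (assumed): nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# Synsets group words into sets of cognitive synonyms.
for synset in wn.synsets("dog")[:3]:
    print(synset.name(), "-", synset.definition())

# Conceptual-semantic and lexical relations, e.g. hypernyms ("is-a" parents) and synonyms.
dog = wn.synset("dog.n.01")
print([s.name() for s in dog.hypernyms()])       # e.g. ['canine.n.02', 'domestic_animal.n.01']
print([lemma.name() for lemma in dog.lemmas()])  # lemma names (synonyms) in this synset
```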
POS(Parts-Of-Speech) Tagging in NLP
Parts of Speech tagging is a linguistic activity in Natural Language
Processing (NLP) wherein each word in a document is given a particular
part of speech (adverb, adjective, verb, etc.) or grammatical category.
Through the addition of a layer of syntactic and semantic information to
the words, this procedure makes it easier to comprehend the sentence’s
structure and meaning.
In NLP applications, POS tagging is useful for machine translation, named
entity recognition, and information extraction, among other things. It also
works well for clearing out ambiguity in terms with numerous meanings
and revealing a sentence’s grammatical structure.
Default tagging is a basic step for the part-of-speech tagging. It is
performed using the DefaultTagger class. The DefaultTagger class takes
‘tag’ as a single argument. NN is the tag for a singular noun.
DefaultTagger is most useful when it works with the most common part-of-speech tag, which is why a noun tag is recommended.
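A minimal sketch of the DefaultTagger described above:

```python
from nltk.tag import DefaultTagger

# DefaultTagger assigns the same tag to every token; 'NN' (singular noun) is the usual choice.
tagger = DefaultTagger("NN")
tokens = "The quick brown fox jumps over the lazy dog".split()
print(tagger.tag(tokens))
# [('The', 'NN'), ('quick', 'NN'), ..., ('dog', 'NN')]
```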
Example of POS Tagging
Consider the sentence: “The quick brown fox jumps over the lazy dog.”
After performing POS Tagging:
 “The” is tagged as determiner (DT)
 “quick” is tagged as adjective (JJ)
 “brown” is tagged as adjective (JJ)
 “fox” is tagged as noun (NN)
 “jumps” is tagged as verb (VBZ)
 “over” is tagged as preposition (IN)
 “the” is tagged as determiner (DT)
 “lazy” is tagged as adjective (JJ)
 “dog” is tagged as noun (NN)
By offering insights into the grammatical structure, this tagging aids
machines in comprehending not just individual words but also the
connections between them inside a phrase. For many NLP applications,
like text summarization, sentiment analysis, and machine translation, this
kind of data is essential.
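The same sentence can be tagged with NLTK's built-in tagger; a sketch follows (the exact tags depend on the tagger model and may differ slightly from the list above, and the download names can vary by NLTK version).

```python
import nltk
# One-time downloads (assumed): nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Expected to be close to the tags listed above, e.g.
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'),
#  ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
```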
Workflow of POS Tagging in NLP
The following are the processes in a typical natural language processing
(NLP) example of part-of-speech (POS) tagging:
 Tokenization: Divide the input text into discrete tokens, which are
  usually units of words or subwords. The first stage in NLP tasks is
  tokenization.
 Loading Language Models: To utilize a library such as NLTK or
  SpaCy, be sure to load the relevant language model. These models
  offer a foundation for comprehending a language’s grammatical
  structure since they have been trained on a vast amount of linguistic
  data.
 Text Processing: If required, preprocess the text to handle special
  characters, convert it to lowercase, or eliminate superfluous
  information. Correct PoS labeling is aided by clear text.
 Linguistic Analysis: To determine the text’s grammatical structure,
  use linguistic analysis. This entails understanding each word’s purpose
  inside the sentence, including whether it is an adjective, verb, noun, or
  other.
 Part-of-Speech Tagging: Apply the POS tagger so that each token receives a grammatical category (tag), using the loaded language model and the surrounding context (a spaCy sketch follows after this list).
 Results Analysis: Verify the accuracy and consistency of the PoS
  tagging findings with the source text. Determine and correct any
  possible problems or mistagging.
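As referenced in the list above, here is a compact sketch of this workflow with spaCy, assuming the small English model en_core_web_sm has been installed.

```python
import spacy

# Assumes the model has been installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")  # load the language model

# Tokenization, linguistic analysis and POS tagging happen inside the pipeline.
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(token.text, token.pos_, token.tag_)  # coarse (UPOS) and fine-grained (Penn-style) tags
```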
Types of POS Tagging in NLP
Assigning grammatical categories to words in a text is known as Part-of-
Speech (PoS) tagging, and it is an essential aspect of Natural Language
Processing (NLP). Different PoS tagging approaches exist, each with a
unique methodology. Here are a few typical kinds:
1. Rule-Based Tagging
Rule-based part-of-speech (POS) tagging involves assigning words their
respective parts of speech using predetermined rules, contrasting with
machine learning-based POS tagging that requires training on annotated
text corpora. In a rule-based system, POS tags are assigned based on
specific word characteristics and contextual cues.
For instance, a rule-based POS tagger could designate the “noun” tag to
words ending in “tion” or “ment,” recognizing common noun-forming
suffixes. This approach offers transparency and interpretability, as it
doesn’t rely on training data.
Let’s consider an example of how a rule-based part-of-speech (POS)
tagger might operate:
Rule: Assign the POS tag “noun” to words ending in “-tion” or “-ment.”
Text: “The presentation highlighted the key achievements of the project’s
development.”
Rule based Tags:
 “The” – Determiner (DET)
 “presentation” – Noun (N)
 “highlighted” – Verb (V)
 “the” – Determiner (DET)
 “key” – Adjective (ADJ)
 “achievements” – Noun (N)
 “of” – Preposition (PREP)
 “the” – Determiner (DET)
 “project’s” – Noun (N)
 “development” – Noun (N)
In this instance, the predetermined rule is followed by the rule-based POS tagger to label words. “Noun” tags are applied to words like “presentation,” “achievements,” and “development” because of the aforementioned rule. Despite the simplicity of this example, rule-based taggers may handle a broad variety of linguistic patterns by incorporating different rules, which makes the tagging process transparent and comprehensible.
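A hedged sketch of a rule-based tagger in this spirit, using NLTK's RegexpTagger with a few hypothetical suffix rules (not the exact rule set above):

```python
from nltk.tag import RegexpTagger

# Hypothetical rules: suffix patterns decide the tag; anything unmatched falls back to 'NN'.
patterns = [
    (r".*tion$", "NN"),      # e.g. "presentation"
    (r".*ment$", "NN"),      # e.g. "development"
    (r".*ed$", "VBD"),       # e.g. "highlighted"
    (r"^(the|The)$", "DT"),  # determiner
    (r".*", "NN"),           # default tag
]
tagger = RegexpTagger(patterns)

tokens = "The presentation highlighted the key achievements of the project".split()
print(tagger.tag(tokens))
```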
2. Transformation-Based Tagging
Transformation-based tagging (TBT) is a part-of-speech (POS) tagging
method that uses a set of rules to change the tags that are applied to
words inside a text. In contrast, statistical POS tagging uses trained
algorithms to predict tags probabilistically, while rule-based POS tagging
assigns tags directly based on predefined rules.
To change word tags in TBT, a set of rules is created depending on
contextual information. A rule could, for example, change a verb’s tag to a
noun if it comes after a determiner like “the.” The text is systematically
subjected to these criteria, and after each transformation, the tags are
updated.
When compared to rule-based tagging, TBT can provide higher accuracy, especially when dealing with complex grammatical structures. Nevertheless, to attain ideal performance it might require a large rule set and additional computing power.
Consider the transformation rule: Change the tag of a verb to a noun if it
follows a determiner like “the.”
Text: “The cat chased the mouse”.
Initial Tags:
 “The” – Determiner (DET)
 “cat” – Noun (N)
 “chased” – Verb (V)
 “the” – Determiner (DET)
 “mouse” – Noun (N)
Transformation rule applied:
Change the tag of “chased” from Verb (V) to Noun (N) because it follows
the determiner “the.”
Updated tags:
 “The” – Determiner (DET)
 “cat” – Noun (N)
 “chased” – Noun (N)
 “the” – Determiner (DET)
 “mouse” – Noun (N)
In this instance, the tag “chased” was changed from a verb to a noun by
the TBT system using a transformation rule based on the contextual
pattern. The tagging is updated iteratively and the rules are applied
sequentially. Although this example is simple, given a well-defined set of
transformation rules, TBT systems can handle more complex grammatical
patterns.
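A hand-rolled sketch of applying this single transformation rule follows; real TBT systems such as the Brill tagger learn many such rules automatically from annotated data.

```python
# Rule (from the example above): change a Verb (V) tag to Noun (N)
# if the previous word is the determiner "the".
initial = [("The", "DET"), ("cat", "N"), ("chased", "V"), ("the", "DET"), ("mouse", "N")]

def apply_rule(tagged):
    updated = list(tagged)
    for i in range(1, len(updated)):
        word, tag = updated[i]
        prev_word, _ = updated[i - 1]
        if tag == "V" and prev_word.lower() == "the":
            updated[i] = (word, "N")  # transformation: V -> N after a determiner
    return updated

print(apply_rule(initial))
# [('The', 'DET'), ('cat', 'N'), ('chased', 'N'), ('the', 'DET'), ('mouse', 'N')]
```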
3. Statistical POS Tagging
Utilizing probabilistic models, statistical part-of-speech (POS) tagging is a computational linguistics technique that assigns grammatical categories to words in a text. Whereas rule-based tagging relies on hand-written rules, statistical tagging uses machine learning algorithms trained on massive annotated corpora.
In order to capture the statistical linkages present in language, these algorithms learn the probability distribution of word-tag sequences. Conditional random fields (CRFs) and Hidden Markov Models (HMMs) are popular models for statistical POS tagging. During training, the algorithm learns from labeled samples to estimate the probability of observing a specific tag given the current word and its context.
The most likely tags for text that hasn’t been seen are then predicted
using the trained model. Statistical POS tagging works especially well for
languages with complicated grammatical structures because it is
exceptionally good at handling linguistic ambiguity and catching subtle
language trends.
 Hidden Markov Model POS tagging: Hidden Markov Models (HMMs)
    serve as a statistical framework for part-of-speech (POS) tagging in
    natural language processing (NLP). In HMM-based POS tagging, the
    model undergoes training on a sizable annotated text corpus to discern
    patterns in various parts of speech. Leveraging this training, the model
    predicts the POS tag for a given word based on the probabilities
    associated with different tags within its context.
  Comprising states for potential POS tags and transitions between
  them, the HMM-based POS tagger learns transition probabilities and
  word-emission probabilities during training. To tag new text, the model,
  employing the Viterbi algorithm, calculates the most probable
  sequence of POS tags based on the learned probabilities.
  Widely applied in NLP, HMMs excel at modeling intricate sequential
  data, yet their performance may hinge on the quality and quantity of
  annotated training data.
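A small sketch of training a supervised HMM tagger with NLTK on the Penn Treebank sample it ships with; this assumes nltk.download("treebank") has been run and is an illustration rather than a production setup.

```python
import nltk
# One-time download (assumed): nltk.download("treebank")
from nltk.corpus import treebank
from nltk.tag import hmm

# Split the tagged sentences into a training slice and a small held-out slice.
tagged_sents = treebank.tagged_sents()
train_sents, test_sents = tagged_sents[:3000], tagged_sents[3000:3100]

# Learn transition and emission probabilities from the labeled data.
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_sents)

print(tagger.tag("The quick brown fox jumps over the lazy dog".split()))
print(tagger.accuracy(test_sents))  # .accuracy() in recent NLTK; older versions use .evaluate()
```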
Advantages of POS Tagging
There are several advantages of Parts-Of-Speech (POS) Tagging
including:
 Text Simplification: Breaking complex sentences down into their
   constituent parts makes the material easier to understand and easier to
   simplify.
 Information Retrieval: Information retrieval systems are enhanced by part-of-speech (POS) tagging, which allows for more precise indexing and search based on grammatical categories.
 Named Entity Recognition: POS tagging helps to identify entities
   such as names, locations, and organizations inside text and is a
   precondition for named entity identification.
 Syntactic Parsing: It facilitates syntactic parsing, which helps with
   phrase structure analysis and word link identification.
Disadvantages of POS Tagging
Some common disadvantages in part-of-speech (POS) tagging include:
 Ambiguity: The inherent ambiguity of language makes POS tagging
  difficult since words can signify different things depending on the
  context, which can result in misunderstandings.
 Idiomatic Expressions: Slang, colloquialisms, and idiomatic phrases
  can be problematic for POS tagging systems since they don’t always
  follow formal grammar standards.
 Out-of-Vocabulary Words: Out-of-vocabulary words (words not
  included in the training corpus) can be difficult to handle since the
  model might have trouble assigning the correct POS tags.
 Domain Dependence: POS tagging models trained on a single domain might not generalize well to other domains, so for best results they need plenty of domain-specific training data.