
UNIT-I

Major Challenges of Natural Language Processing

In the evolving landscape of artificial intelligence (AI), Natural Language Processing (NLP) stands out as a technology that bridges the gap between humans and machines. In this article, we will look at the major challenges of NLP faced by organizations. Understanding these challenges not only helps you explore advanced NLP but also lets you leverage its capabilities to transform how we interact with machines, from customer service automation to complex data analysis.

What is Natural Language Processing (NLP)?

Natural Language Processing is a powerful branch of Artificial Intelligence that enables computers to understand, interpret and generate meaningful, human-readable text. NLP is a set of methods for processing and analyzing text data. In Natural Language Processing the text is tokenized, meaning it is broken into tokens, which may be words, phrases or characters; this is usually the first step in an NLP task. The text is cleaned and preprocessed before NLP techniques are applied.

Natural Language Processing techniques are used in machine translation, healthcare, finance, customer service, sentiment analysis and for extracting valuable information from text data. NLP is also used in text generation, language modeling and question answering. Many companies use NLP techniques to solve their text-related problems. Tools such as ChatGPT and Google Bard, which are trained on large corpora of text data, use NLP techniques to answer user queries.

10 Major Challenges of Natural Language Processing(NLP)

Natural Language Processing (NLP) faces various challenges due to the complexity and diversity
of human language. Let's discuss 10 major challenges in NLP:

1. Language differences

Human language and understanding are rich and intricate, and there are thousands of languages spoken around the world, each with its own grammar, vocabulary and cultural nuances. No one understands every language, and the productivity of human language is high. Natural language is also ambiguous, since the same words and phrases can have different meanings in different contexts. This is one of the major challenges in understanding natural language.

Natural languages have complex syntactic structures and grammatical rules, covering word order, verb conjugation, tense, aspect and agreement. Human language carries rich semantic content that allows speakers to convey a wide range of meanings through words and sentences. Language is also pragmatic: it is used in context to achieve communication goals. Finally, human language evolves over time through processes such as lexical change.

2. Training Data

Training data is a curated collection of input-output pairs, where the input represents the
features or attributes of the data, and the output is the corresponding label or target. Training
data is composed of both the features (inputs) and their corresponding labels (outputs). For
NLP, features might include text data, and labels could be categories, sentiments, or any other
relevant annotations.

It helps the model generalize patterns from the training set to make predictions or
classifications on new, previously unseen data.
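As a small illustration of such input-output pairs, here is a minimal sketch that assumes scikit-learn is installed and uses a made-up toy sentiment dataset (the texts, labels and test sentence are hypothetical examples, not from the original text):

# Minimal sketch: text features paired with sentiment labels, assuming scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["the movie was great", "awful plot and bad acting",
         "loved every minute", "boring and far too long"]
labels = ["positive", "negative", "positive", "negative"]

# Convert raw text to bag-of-words counts, then fit a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# The model generalizes the learned patterns to new, previously unseen text
print(model.predict(["what a great film"]))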

3. Development Time and Resource Requirements

Development time and resource requirements for Natural Language Processing (NLP) projects depend on various factors, including the task complexity, the size and quality of the data, the availability of existing tools and libraries, and the expertise of the team involved. Here are some key points:

 Complexity of the task: Tasks such as text classification or sentiment analysis may require less time than more complex tasks such as machine translation or question answering.

 Availability and quality of data: NLP models require high-quality annotated data. Collecting, annotating, and preprocessing large text datasets can be time-consuming and resource-intensive, especially for tasks that require specialized domain knowledge or fine-grained annotations.

 Selection of algorithm and development of the model: It can be difficult to choose the machine learning algorithm best suited to a given NLP task.

 Training and evaluation: Training requires powerful computational resources (GPUs or TPUs) and time for iterative training runs. It is also important to evaluate the model's performance with suitable metrics and validation techniques to confirm the quality of the results.

4. Navigating Phrasing Ambiguities in NLP

Navigating phrasing ambiguities is a crucial aspect of NLP because of the inherent complexity of human languages. A phrasing ambiguity arises when a phrase can be interpreted in multiple ways, leading to uncertainty about its meaning. Here are some key points for navigating phrasing ambiguities in NLP:

 Contextual Understanding: Contextual information such as previous sentences, topic focus, or conversational cues can give valuable clues for resolving ambiguities.

 Semantic Analysis: The text is analyzed to determine meaning based on word senses, lexical relationships and semantic roles. Tools such as word sense disambiguation and semantic role labeling can help resolve phrasing ambiguities.

 Syntactic Analysis: The syntactic structure of the sentence is analyzed to find the possible interpretations based on grammatical relationships and syntactic patterns.

 Pragmatic Analysis: Pragmatic factors such as the speaker's intentions and implicatures are used to infer the meaning of a phrase; this requires understanding the pragmatic context.

 Statistical methods: Statistical methods and machine learning models are used to learn patterns from data and make predictions about the intended reading of a phrase.

5. Misspellings and Grammatical Errors

Overcoming misspellings and grammatical errors is a basic challenge in NLP, as these forms of linguistic noise can reduce the accuracy of understanding and analysis. Here are some key points for handling misspellings and grammatical errors in NLP:

 Spell Checking: Implement spell-check algorithms and dictionaries to find and correct misspelled words.

 Text Normalization: The text is converted into a standard format, which may involve converting it to lowercase, removing punctuation and special characters, and expanding contractions.

 Tokenization: The text is split into individual tokens, which makes it easier to identify, isolate and correct misspelled words and grammatical errors.

 Language Models: Language models trained on large corpora can predict how likely a word or phrase is in its context, which helps flag and correct errors (see the sketch after this list).
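Here is a minimal sketch of these steps using only Python's standard library; the vocabulary and the input sentence are made-up examples, and a real system would rely on a much larger dictionary or a trained language model:

import re
import difflib

vocabulary = ["natural", "language", "processing", "is", "fun"]
text = "Natural langage procesing is fun!!"

# Text normalization: lowercase and strip punctuation/special characters
normalized = re.sub(r"[^\w\s]", "", text.lower())

# Tokenization: split the normalized text into word tokens
tokens = normalized.split()

def correct(token):
    # Spell checking: suggest the closest vocabulary word for unknown tokens
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else token

print([correct(tok) for tok in tokens])
# ['natural', 'language', 'processing', 'is', 'fun']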

6. Mitigating Innate Biases in NLP Algorithms

Mitigating innate biases in NLP algorithms is a crucial step toward ensuring fairness, equity, and inclusivity in natural language processing applications. Here are some key points for mitigating biases in NLP algorithms:

 Collection of data and annotation: It is very important to confirm that the training data used to develop NLP algorithms is diverse, representative and free from biases.

 Analysis and detection of bias: Apply bias detection and analysis methods to the training data to find biases based on demographic factors such as race, gender and age.

 Data Preprocessing: Preprocess the training data to mitigate biases, for example by debiasing word embeddings, balancing class distributions and augmenting underrepresented samples.

 Fair representation learning: NLP models are trained to learn fair representations that are invariant to protected attributes like race or gender.

 Auditing and Evaluation of Models: Models are evaluated for fairness and bias with suitable metrics and audits, tested on diverse datasets, and subjected to post-hoc analyses to find and mitigate remaining biases.

7. Words with Multiple Meanings

Words with multiple meanings pose a lexical challenge in Natural Language Processing because of their ambiguity. Such words, known as polysemous or homonymous words, take different meanings depending on the context in which they are used. Here are some key points for handling the lexical challenge posed by words with multiple meanings in NLP:

 Semantic analysis: Apply semantic analysis techniques to determine the intended meaning of a word in its context. Semantic representations such as word embeddings or semantic networks can capture the similarity and relatedness between different word senses (a small word sense disambiguation sketch follows this list).

 Domain-specific knowledge: Domain knowledge provides valuable context and constraints for determining the correct sense of a word in NLP tasks.

 Multi-word Expressions (MWEs): The meaning of the entire sentence or phrase is analyzed to disambiguate a word with multiple meanings.

 Knowledge Graphs and Ontologies: Apply knowledge graphs and ontologies to find the semantic relationships between words in different contexts.
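As a small illustration, the sketch below applies NLTK's implementation of the Lesk word sense disambiguation algorithm. It assumes nltk is installed and the WordNet data has been downloaded (nltk.download('wordnet')); the sentences are made-up examples and the chosen senses depend on the installed WordNet data:

from nltk.wsd import lesk

sent1 = "I deposited my money at the bank".split()
sent2 = "We sat on the bank of the river".split()

# lesk() picks the WordNet sense whose gloss overlaps most with the context,
# so the same word can resolve to different senses in different sentences
sense1 = lesk(sent1, "bank")
sense2 = lesk(sent2, "bank")

print(sense1, "-", sense1.definition())
print(sense2, "-", sense2.definition())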

8. Addressing Multilingualism

Addressing language diversity and multilingualism in Natural Language Processing is essential so that NLP systems can handle text data in multiple languages effectively. Here are some key points to address language diversity and multilingualism:

 Multilingual Corpora: Multilingual corpora consist of text data in various languages and serve as valuable resources for training NLP models and systems.

 Cross-Lingual Transfer Learning: Techniques that transfer knowledge learned in one language to another.

 Language Identification: Build language identification models to automatically detect the language of a given text.

 Machine Translation: Machine translation enables communication and information access across language barriers and can be used as a preprocessing step for multilingual NLP tasks.

9. Reducing Uncertainty and False Positives in NLP

Reducing uncertainty and false positives in Natural Language Processing (NLP) is a crucial task for improving the accuracy and reliability of NLP models. Here are some key points to approach the solution:

 Probabilistic Models: Use probabilistic models to quantify the uncertainty in predictions. Models such as Bayesian networks give probabilistic estimates of outputs, allowing uncertainty quantification and better decision making.

 Confidence Scores: Confidence scores or probability estimates are calculated for NLP predictions to assess how certain the model is. They help identify cases where the model is uncertain or likely to produce false positives.

 Threshold Tuning: For classification tasks, decision thresholds are adjusted to balance sensitivity (recall) and specificity; setting appropriate thresholds reduces false positives (see the sketch after this list).

 Ensemble Methods: Apply ensemble learning techniques that combine multiple models to reduce uncertainty.
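A minimal sketch of confidence scores and threshold tuning is shown below; it assumes scikit-learn is installed and uses made-up toy scores and labels:

from sklearn.metrics import precision_recall_curve

# Hypothetical model confidence scores and the true labels (1 = positive class)
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0]

# Each candidate threshold trades recall against precision;
# raising the threshold reduces false positives but may miss true cases
precision, recall, thresholds = precision_recall_curve(labels, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")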

10. Facilitating Continuous Conversations with NLP

Facilitating continuous conversations with NLP involves developing systems that understand and respond to human language in real time, enabling seamless interaction between users and machines. Implementing real-time NLP pipelines gives the system the ability to analyze and interpret user input as it is received; algorithms and systems are optimized for low-latency processing to ensure quick responses to user queries and inputs.

It also requires building NLP models that can maintain context throughout a conversation. Context awareness enables systems to interpret user intent, track conversation history, and generate relevant responses based on the ongoing dialogue. Intent recognition algorithms are applied to identify the underlying goals and intentions expressed by users in their messages.

How to overcome NLP Challenges

Overcoming the challenges in NLP requires a combination of innovative technologies, domain expertise, and sound methodology. Here are some key points for overcoming the challenges of NLP tasks:

 Quantity and Quality of Data: High-quality, diverse data is needed to train NLP algorithms effectively. Data augmentation, data synthesis and crowdsourcing are techniques for addressing data scarcity.

 Ambiguity: NLP algorithms should be trained to disambiguate words and phrases.

 Out-of-vocabulary Words: Techniques such as tokenization, character-level modeling, and vocabulary expansion are implemented to handle out-of-vocabulary words.

 Lack of Annotated Data: Techniques such as transfer learning and pre-training can transfer knowledge from large datasets to specific tasks with limited labeled data.
ORIGINS OF NLP

Natural language processing (NLP) is an exciting area that has grown over time at the junction of linguistics, artificial intelligence (AI), and computer science.

This article takes you on an in-depth journey through the history of NLP, tracing its development from its early beginnings to contemporary advances. The story of NLP is an intriguing one that continues to revolutionize how we interact with technology.

History and Evolution of NLP

History of Natural Language Processing (NLP)

 The Dawn of NLP (1950s-1970s)

 The Statistical Revolution (1980s-1990s)

 The Deep Learning Era (2000s-Present)

What is Natural Language Processing (NLP)?


Natural Language Processing (NLP) is a field of computer science and artificial intelligence (AI)
concerned with the interaction between computers and human language. Its core objective is to
enable computers to understand, analyze, and generate human language in a way that is similar
to how humans do. This includes tasks like:

 Understanding the meaning: Being able to extract the meaning from text, speech, or
other forms of human language.

 Analyzing structure: Recognizing the grammatical structure and syntax of language,


including parts of speech and sentence construction.

 Generating human-like language: Creating text or speech that is natural, coherent, and
grammatically correct.

Ultimately, NLP aims to bridge the gap between human communication and machine
comprehension, fostering seamless interaction between us and technology.

History of Natural Language Processing (NLP)

The history of NLP (Natural Language Processing) is divided into three segments that are as
follows:

The Dawn of NLP (1950s-1970s)

In the 1950s, the dream of effortless communication across languages fueled the birth of NLP.
Machine translation (MT) was the driving force, and rule-based systems emerged as the initial
approach.

How Rule-Based Systems Worked:

These systems functioned like complex translation dictionaries on steroids. Linguists


meticulously crafted a massive set of rules that captured the grammatical structure (syntax) and
vocabulary of specific languages.

Imagine the rules as a recipe for translation. Here's a simplified breakdown:

1. Sentence Breakdown: The system would first analyze the source language sentence and
break it down into its parts of speech (nouns, verbs, adjectives, etc.).

2. Matching Rules: Each word or phrase would be matched against the rule base to find its
equivalent in the target language, considering grammatical roles and sentence structure.

3. Rearrangement: Finally, the system would use the rules to rearrange the translated
words and phrases to form a grammatically correct sentence in the target language.
Limitations of Rule-Based Systems:

While offering a foundation for MT, this approach had several limitations:

 Inflexibility: Languages are full of nuances and exceptions. Rule-based systems struggled
to handle idioms, slang, and variations in sentence structure. A slight deviation from the
expected format could throw the entire translation off.

 Scalability Issues: Creating and maintaining a vast rule base for every language pair was
a time-consuming and laborious task. Imagine the immense effort required for just a
handful of languages!

 Limited Scope: These systems primarily focused on syntax and vocabulary, often failing
to capture the deeper meaning and context of the text. This resulted in translations that
sounded grammatically correct but unnatural or even nonsensical.

Despite these limitations, rule-based systems laid the groundwork for future NLP
advancements. They demonstrated the potential for computers to understand and manipulate
human language, paving the way for more sophisticated approaches that would emerge later.

The Statistical Revolution (1980s-1990s)

 A Shift Towards Statistics: The 1980s saw a paradigm shift towards statistical NLP
approaches. Machine learning algorithms emerged as powerful tools for NLP tasks.

 The Power of Data: Large collections of text data (corpora) became crucial for training
these statistical models.

 Learning from Patterns: Unlike rule-based systems, statistical models learn patterns
from data, allowing them to handle variations and complexities of natural language.

The Deep Learning Era (2000s-Present)

 The Deep Learning Revolution: The 2000s ushered in the era of deep learning,
significantly impacting NLP.

 Artificial Neural Networks (ANNs): These complex algorithms, inspired by the human
brain, became the foundation of deep learning advancements in NLP.

 Advanced Architectures: Deep learning architectures like recurrent neural networks and transformers further enhanced NLP capabilities.

The Advent of Rule-Based Systems


The 1960s and 1970s witnessed the emergence of rule-based systems in the realm of NLP. Collaborations among linguists and computer scientists led to the development of systems that relied on predefined rules to analyze and understand human language.

The aim was to codify linguistic knowledge, including syntax and grammar, into algorithms that computers could execute to process and generate human-like text.

During this period, the General Problem Solver (GPS) gained prominence. Developed by Allen Newell and Herbert A. Simon in 1957, GPS was not explicitly designed for language processing, but it demonstrated the capability of rule-based systems by showcasing how computers could solve problems using predefined rules and heuristics.

What are the current Challenges in the field of NLP?

The enthusiasm surrounding rule-based systems was tempered by the realization that human language is inherently complex. Its nuances, ambiguities, and context-dependent meanings proved hard to capture with rigid rules. As a result, rule-based NLP systems struggled with real-world language applications, prompting researchers to explore alternative techniques. While statistical models represented a sizable leap forward, the real revolution in NLP came with the arrival of neural networks. Inspired by the form and function of the human brain, neural networks have shown remarkable capabilities for learning complex patterns from data.

In the mid-2010s, the application of deep learning methods, especially recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, triggered significant breakthroughs in NLP. These architectures allowed machines to capture sequential dependencies in language, permitting more nuanced understanding and generation of text. As NLP continued to advance, ethical issues surrounding bias, fairness, and transparency became increasingly prominent. Biases present in training data often manifest in NLP models, raising concerns about the potential reinforcement of societal inequalities. Researchers and practitioners began addressing these issues, advocating for responsible AI development and the incorporation of ethical considerations into the fabric of NLP.

The Evolution of Multimodal NLP

Multimodal NLP represents the next frontier in the evolution of natural language processing. Traditionally, NLP focused on processing and understanding textual data.

However, the rise of multimedia-rich content on the web and the proliferation of devices equipped with cameras and microphones have created a need for NLP systems to handle a wide range of modalities, including images, audio, and video.

1. Image Captioning: One of the early applications of multimodal NLP is image captioning, in which models generate textual descriptions for images. This task requires the model not only to correctly recognize the objects within a photograph but also to understand the context and relationships among them. Integrating visual information with linguistic knowledge is a considerable challenge, but it opens avenues for more immersive applications.

2. Speech-to-Text and Audio Processing: Multimodal NLP extends into audio processing, with applications ranging from speech-to-text conversion to the analysis of audio content. Speech recognition systems equipped with NLP capabilities allow more natural interactions with devices through voice commands. This has implications for accessibility and usability, making technology more inclusive for people with varying levels of literacy.

3. Video Understanding: As the amount of video content on the web keeps growing, there is a burgeoning need for NLP systems to understand and summarize video data. This involves not only recognizing objects and actions within videos but also understanding the narrative structure and context. Video understanding opens doors to applications in content recommendation, video summarization, and even sentiment analysis based on visual and auditory cues.

4. Social Media Analysis: Multimodal NLP is especially relevant in the context of social media, where users share a vast range of content, including text, pictures, and videos. Analyzing and understanding the sentiment, context, and potential implications of social media content requires NLP systems to be proficient at processing multimodal information. This has implications for content moderation, brand monitoring, and trend analysis on social media platforms.

The Emergence of Explainable AI in NLP

As NLP models become increasingly complex and powerful, there is a growing demand for transparency and interpretability. The black-box nature of deep learning models, especially neural networks, has raised concerns about their decision-making processes. In response, the field of explainable AI (XAI) has gained prominence, aiming to shed light on the inner workings of complex models and make their outputs more understandable to users.
1. Interpretable Models: Traditional machine learning models, such as decision trees and linear models, are inherently more interpretable because of their explicit representation of rules. However, as NLP embraced the power of deep learning, particularly with models like BERT and GPT, interpretability has become a major challenge. Researchers are actively exploring techniques to improve the interpretability of neural NLP models without sacrificing their performance.

2. Attention Mechanisms and Interpretability: The attention mechanism, an essential component of many state-of-the-art NLP models, plays a pivotal role in determining which parts of the input sequence the model focuses on during processing. Leveraging attention mechanisms for interpretability involves visualizing the attention weights and showing which words or tokens contribute most to the model's decision. This gives valuable insights into how the model processes information.

3. Rule-based Explanations: Integrating rule-based explanations into NLP involves incorporating human-understandable rules alongside the complex neural network architecture. This hybrid approach seeks a balance between the expressive power of deep learning and the transparency of rule-based systems. By providing rule-based explanations, users can gain insight into why the model made a particular prediction or decision.

4. User-Friendly Interfaces: Making AI systems accessible to non-experts calls for user-friendly interfaces that present model outputs and explanations cleanly and intuitively. Visualization tools and interactive interfaces empower users to explore model behavior, understand predictions, and verify the reliability of NLP applications. Such interfaces bridge the gap between technical experts and end users, fostering a more inclusive and informed interaction with AI.

5. Ethical Considerations in Explainability: The pursuit of explainable AI in NLP is intertwined with ethical concerns. Ensuring that explanations are not only accurate but also unbiased and fair is important. Researchers and practitioners have to navigate the delicate balance between model transparency and the risk of revealing sensitive information. Striking this balance is vital for building trust in AI systems and addressing issues of accountability and fairness.

The Evolution of Language Models

Language models form the backbone of NLP, powering applications ranging from chatbots and digital assistants to machine translation and sentiment analysis. The evolution of language models reflects the continuing quest for greater accuracy, context awareness, and efficient natural language understanding.

The early days of NLP were dominated by rule-based systems that tried to codify linguistic rules into algorithms. However, the limitations of these systems in handling the complexity of human language paved the way for statistical methods. Statistical techniques, such as n-gram models and Hidden Markov Models, leveraged large datasets to identify patterns and probabilities, improving the accuracy of language processing tasks.

Word Embeddings and Distributed Representations

The advent of word embeddings, such as Word2Vec and GloVe, marked a paradigm shift in how machines represent and understand words. These embeddings enabled words to be represented as dense vectors in a continuous vector space, capturing semantic relationships and contextual information. Distributed representations facilitated more nuanced language understanding and improved the performance of downstream NLP tasks.
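A minimal sketch of training word embeddings with gensim's Word2Vec (assuming gensim 4.x is installed) is shown below; the tiny made-up corpus is only for illustration, since useful embeddings require far more text:

from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
]

# Each word is mapped to a dense vector; words in similar contexts get similar vectors
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["dog"][:5])                # first few dimensions of the "dog" vector
print(model.wv.similarity("dog", "cat"))  # cosine similarity between the two words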

The mid-2010s witnessed the rise of deep learning in NLP, with the application of recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. These architectures addressed the challenge of capturing sequential dependencies in language, allowing models to process and generate text with a better understanding of context. RNNs and LSTMs laid the foundation for subsequent advances in neural NLP.

The Transformer Architecture

In 2017, the introduction of the Transformer architecture by Vaswani et al. marked a major leap forward in NLP. Transformers, characterized by self-attention mechanisms, outperformed previous approaches on numerous language tasks.

The Transformer architecture has become the cornerstone of recent developments, allowing parallelization and efficient learning of contextual information across long sequences.

BERT and Pre-trained Models

Bidirectional Encoder Representations from Transformers (BERT), introduced by Google in 2018, demonstrated the power of pre-training large-scale language models on massive corpora. BERT and subsequent models like GPT (Generative Pre-trained Transformer) achieved remarkable performance by learning contextualized representations of words and phrases. These pre-trained models, fine-tuned for specific tasks, have become the driving force behind breakthroughs in natural language understanding.

The evolution of language models continued with improvements like XLNet, which addressed limitations in capturing bidirectional context. XLNet introduced a permutation language modeling objective, allowing the model to consider all possible orderings of a sequence. This approach further improved the modeling of contextual information and demonstrated the iterative nature of advances in language modeling.
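As a hedged illustration of how such pre-trained models are used in practice, the sketch below loads BERT through the Hugging Face transformers library (assumed installed; the first call downloads the model weights):

from transformers import pipeline

# Masked-language modelling: BERT predicts the hidden word from its context
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Natural language processing is a [MASK] field."):
    print(prediction["token_str"], round(prediction["score"], 3))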

Ethical Considerations in NLP: A Closer Look

Rapid progress in NLP has brought transformative changes to numerous industries, from healthcare and finance to education and entertainment. However, with great power comes great responsibility, and the ethical issues surrounding NLP have become increasingly important.

1. Transparency and Accountability: The black-box nature of some advanced NLP models poses challenges for transparency and accountability. Users may struggle to understand why a model made a specific prediction or decision. Enhancing transparency involves providing explanations for model outputs and allowing users to understand the decision-making process. Establishing clear lines of accountability is equally important, ensuring that developers and organizations take responsibility for the ethical implications of their NLP applications.

2. Bias in NLP Models: One of the primary ethical concerns in NLP is the potential bias present in training data and its impact on model predictions. If training data reflects existing societal biases, NLP models may inadvertently perpetuate and amplify them. For example, biased language in historical texts or news articles can lead to biased representations in language models, influencing their outputs.

3. Fairness and Equity: Ensuring fairness and equity in NLP applications is a complex task. NLP models should be evaluated for their performance across different demographic groups to identify and mitigate disparities. Addressing fairness involves not only refining algorithms but also adopting a holistic approach that considers the diverse perspectives and experiences of users.

What are Language Models in NLP?


Language models are a fundamental component of natural language processing (NLP) and
computational linguistics. They are designed to understand, generate, and predict human
language. These models analyze the structure and use of language to perform tasks such as
machine translation, text generation, and sentiment analysis.

This article explores language models in depth, highlighting their development, functionality,
and significance in natural language processing.

What is a Language Model in Natural Language Processing?

A language model in natural language processing (NLP) is a statistical or machine learning model
that is used to predict the next word in a sequence given the previous words. Language models
play a crucial role in various NLP tasks such as machine translation, speech recognition, text
generation, and sentiment analysis. They analyze and understand the structure and use of
human language, enabling machines to process and generate text that is contextually
appropriate and coherent.

Grammar-Based Language Models

Grammar-based LMs use formal grammar rules (syntax) to determine whether a sentence is
valid and sometimes assign probabilities.

(a) Context-Free Grammars (CFGs)

 Represent sentences using production rules.

 Example:

 S → NP VP

 NP → Det N

 VP → V NP

 Det → "the" | "a"

 N → "dog" | "cat"

 V → "chased" | "saw"

 Sentence: “The dog chased the cat” ✔ (valid according to rules).
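The same grammar can be tried out with NLTK (assumed installed); this is a minimal sketch of how a CFG validates and parses a sentence, not part of the original example:

import nltk

# The production rules from the example above, written in NLTK's CFG syntax
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased' | 'saw'
""")

parser = nltk.ChartParser(grammar)
tokens = "the dog chased the cat".split()

# A sentence is valid if the parser can build at least one parse tree for it
for tree in parser.parse(tokens):
    print(tree)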

(b) Strengths

 Captures syntactic structure.


 Useful for parsing and machine translation.

 Explains why a sentence is valid (interpretability).

(c) Weaknesses

 Not good at capturing probability of word sequences.

 Ambiguity: multiple parse trees possible.

 Building complete grammars is difficult for natural languages.

3. Statistical Language Models

Statistical LMs use probability distributions learned from data (large corpora). Instead of strict rules, they estimate how likely a sequence of words is.

(a) Chain Rule

The probability of a word sequence can be decomposed with the chain rule:

P(w_1, w_2, …, w_n) = ∏_{i=1}^{n} P(w_i | w_1, …, w_{i-1})

But computing this for long sequences is intractable, which motivates the Markov assumption.

(b) N-gram Models

 Approximate probability by considering only the last n-1 words:

P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-n+1}, …, w_{i-1})

 Example (Bigram model, n=2):

P(w_1, w_2, …, w_n) = ∏_{i=1}^{n} P(w_i | w_{i-1})

 Trigram (n=3): Considers two previous words.

(c) Estimation

Probabilities are estimated using counts from corpora:

P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})

where C(x) is the count of the word or phrase x in the corpus.

(d) Issues

 Data Sparsity: Some word sequences may never appear in training data.
 Solution: Smoothing (e.g., Laplace smoothing, Good-Turing, Kneser-Ney).

 Long-distance dependencies are poorly captured in n-grams.
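A minimal sketch of bigram estimation from counts, with optional add-one (Laplace) smoothing to soften the data sparsity issue above, is shown below; the toy corpus is a made-up example:

from collections import Counter

corpus = "the dog chased the cat and the dog saw the cat".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigram_counts)

def bigram_prob(prev, word, smooth=False):
    # P(word | prev) = C(prev, word) / C(prev), optionally with Laplace smoothing
    if smooth:
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "dog"))                 # 2/4 = 0.5
print(bigram_prob("the", "mouse"))               # 0.0 -> unseen bigram (data sparsity)
print(bigram_prob("the", "mouse", smooth=True))  # small non-zero probability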

4. Comparison: Grammar-based LM vs. Statistical LM

Aspect | Grammar-based LM | Statistical LM (N-grams, etc.)
Basis | Formal grammar rules (syntax) | Probabilities from data
Interpretability | High (explainable structure) | Lower (just probability numbers)
Data Requirement | Small (rules manually created) | Large corpora needed
Strengths | Syntax representation, parsing | Captures word sequence probabilities
Weaknesses | Ambiguity, incomplete coverage | Data sparsity, poor long-distance modeling

5. Modern Perspective

 Hybrid models: Probabilistic CFGs (PCFGs) combine grammar + probabilities.

 Neural LMs: Today, RNNs, LSTMs, and Transformers have replaced traditional statistical
LMs. They model long-distance dependencies better.

REGULAR EXPRESSIONS

A Regular Expression or RegEx is a special sequence of characters that uses a search pattern to
find a string or set of strings.

It can detect the presence or absence of a text by matching it with a particular pattern and
also can split a pattern into one or more sub-patterns.

Regex Module in Python

Python has a built-in module named "re" that is used for regular expressions. We can import this module using an import statement.

Import the re module in Python with the following command:

import re
How to Use RegEx in Python?

You can use RegEx in Python after importing re module.

Example:

This Python code uses regular expressions to search for the word "portal" in the given string
and then prints the start and end indices of the matched word within the string.

import re

s = 'GeeksforGeeks: A computer science portal for geeks'

match = re.search(r'portal', s)

print('Start Index:', match.start())

print('End Index:', match.end())

Output

Start Index: 34

End Index: 40

Note: Here r character (r’portal’) stands for raw, not regex. The raw string is slightly different
from a regular string, it won’t interpret the \ character as an escape character. This is because
the regular expression engine uses \ character for its own escaping purpose.

Before starting with the Python regex module let's see how to actually write regex using
metacharacters or special sequences.

RegEx Functions

The re module in Python provides various functions that help search, match, and manipulate
strings using regular expressions.

Below are main functions available in the re module:


re.findall() - Finds and returns all matching occurrences in a list
re.compile() - Compiles a regular expression into a pattern object
re.split() - Splits a string by the occurrences of a character or a pattern
re.sub() - Replaces all occurrences of a character or pattern with a replacement string
re.subn() - Similar to re.sub(), but returns a tuple: (new_string, number_of_substitutions)
re.escape() - Escapes special characters
re.search() - Searches for the first occurrence of a character or pattern

Let's see the working of these RegEx functions with definition and examples:

1. re.findall()

Returns all non-overlapping matches of a pattern in the string as a list. It scans the string from
left to right.

Example: This code uses regular expression \d+ to find all sequences of one or more digits in
the given string.

import re

string = """Hello my Number is 123456789 and

my friend's number is 987654321"""


regex = r'\d+'

match = re.findall(regex, string)

print(match)

Output

['123456789', '987654321']

2. re.compile()

Compiles a regex into a pattern object, which can be reused for matching or substitutions.

Example 1: The pattern [a-e] matches all lowercase letters between 'a' and 'e' in the input string "Aye, said Mr. Gibenson Stark". The output is ['e', 'a', 'd', 'b', 'e', 'a'], the matching characters.

import re

p = re.compile('[a-e]')

print(p.findall("Aye, said Mr. Gibenson Stark"))

Output

['e', 'a', 'd', 'b', 'e', 'a']

Explanation:

 First occurrence is 'e' in "Aye" and not 'A', as it is Case Sensitive.

 Next Occurrence is 'a' in "said", then 'd' in "said", followed by 'b' and 'e' in
"Gibenson", the Last 'a' matches with "Stark".

 Metacharacter backslash '\' has a very important role as it signals various sequences. If the backslash is to be used without its special meaning as a metacharacter, use '\\'.

Example 2: The code uses regular expressions to find and list all single digits and sequences of
digits in the given input strings. It finds single digits with \d and sequences of digits with \d+.

import re

p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

p = re.compile(r'\d+')

print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

Output

['1', '1', '4', '1', '8', '8', '6']

['11', '4', '1886']

Example 3: Word and non-word characters

 \w matches a single word character.

 \w+ matches a group of word characters.

 \W matches non-word characters.

import re

p = re.compile(r'\w')

print(p.findall("He said * in some_lang."))

p = re.compile(r'\w+')

print(p.findall("I went to him at 11 A.M., he \

said *** in some_language."))

p = re.compile(r'\W')

print(p.findall("he said *** in some_language."))

Output
['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']

['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']

[' ', ' ', '*', '*', '*', ' ', ' ', '.']

Example 4: The regular expression pattern 'ab*' finds and lists all occurrences of 'a' followed by zero or more 'b' characters in the input string "ababbaabbb". It returns the following list of matches: ['ab', 'abb', 'a', 'abbb'].

import re

p = re.compile('ab*')

print(p.findall("ababbaabbb"))

Output

['ab', 'abb', 'a', 'abbb']

Explanation:

 Output 'ab', is valid because of single 'a' accompanied by single 'b'.

 Output 'abb', is valid because of single 'a' accompanied by 2 'b'.

 Output 'a', is valid because of single 'a' accompanied by 0 'b'.

 Output 'abbb', is valid because of single 'a' accompanied by 3 'b'.

3. re.split()

Splits a string wherever the pattern matches. The remaining characters are returned as list
elements.

Syntax:

re.split(pattern, string, maxsplit=0, flags=0)

 pattern: Regular expression to match split points.

 string: The input string to split.

 maxsplit (optional): Limits the number of splits. Default is 0 (no limit).

 flags (optional): Apply regex flags like re.IGNORECASE.


Example 1: Splitting by non-word characters or digits

This example demonstrates how to split a string using different patterns like non-word
characters (\W+), apostrophes, and digits (\d+).

from re import split

print(split(r'\W+', 'Words, words , Words'))

print(split(r'\W+', "Word's words Words"))

print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))

print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))

Output

['Words', 'words', 'Words']

['Word', 's', 'words', 'Words']

['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']

['On ', 'th Jan ', ', at ', ':', ' AM']

Example 2: Using maxsplit and flags

This example shows how to limit the number of splits using maxsplit, and how flags can
control case sensitivity.

import re

print(re.split(r'\d+', 'On 12th Jan 2016, at 11:02 AM', 1))

print(re.split('[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE))

print(re.split('[a-f]+', 'Aey, Boy oh boy, come here'))

Output

['On ', 'th Jan 2016, at 11:02 AM']

['', 'y, ', 'oy oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']

Note: In the second and third cases above, [a-f]+ splits the string using any combination of lowercase letters from 'a' to 'f'. The re.IGNORECASE flag includes uppercase letters in the match.

4. re.sub()

The re.sub() function replaces all occurrences of a pattern in a string with a replacement
string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

 pattern: The regex pattern to search for.

 repl: The string to replace matches with.

 string: The input string to process.

 count (optional): Maximum number of substitutions (default is 0, which means replace


all).

 flags (optional): Regex flags like re.IGNORECASE.

Example 1: The following examples show different ways to replace the pattern 'ub' with '~*',
using various flags and count values.

import re

# Case-insensitive replacement of all 'ub'

print(re.sub('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE))

# Case-sensitive replacement of all 'ub'

print(re.sub('ub', '~*', 'Subject has Uber booked already'))

# Replace only the first 'ub', case-insensitive


print(re.sub('ub', '~*', 'Subject has Uber booked already', count=1, flags=re.IGNORECASE))

# Replace "AND" with "&", ignoring case

print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE))

Output

S~*ject has ~*er booked already

S~*ject has Uber booked already

S~*ject has Uber booked already

Baked Beans & Spam

5. re.subn()

re.subn() function works just like re.sub(), but instead of returning only the modified string, it
returns a tuple: (new_string, number_of_substitutions)

Syntax:

re.subn(pattern, repl, string, count=0, flags=0)

Example: Substitution with count

This example shows how re.subn() gives both the replaced string and the number of times
replacements were made.

import re

# Case-sensitive replacement

print(re.subn('ub', '~*', 'Subject has Uber booked already'))

# Case-insensitive replacement

t = re.subn('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE)


print(t)

print(len(t)) # tuple length

print(t[0]) # modified string

Output

('S~*ject has Uber booked already', 1)

('S~*ject has ~*er booked already', 2)

S~*ject has ~*er booked already

6. re.escape()

re.escape() function adds a backslash (\) before all special characters in a string. This is useful
when you want to match a string literally, including any characters that have special meaning
in regex (like ., *, [, ], etc.).

Syntax:

re.escape(string)

Example: Escaping special characters

This example shows how re.escape() treats spaces, brackets, dashes, and tabs as literal
characters.

import re

print(re.escape("This is Awesome even 1 AM"))

print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

Output

This\ is\ Awesome\ even\ 1\ AM

I\ Asked\ what\ is\ this\ \[a\-9\]\,\ he\ said\ \ \ \^WoW

7. re.search()
The re.search() function searches for the first occurrence of a pattern in a string. It returns
a match object if found, otherwise None.

Note: Use it when you want to check if a pattern exists or extract the first match.

Example: Search and extract values

This example searches for a date pattern with a month name (letters) followed by a day
(digits) in a sentence.

import re

regex = r"([a-zA-Z]+) (\d+)"

match = re.search(regex, "I was born on June 24")

if match:

print("Match at index %s, %s" % (match.start(), match.end()))

print("Full match:", match.group(0))

print("Month:", match.group(1))

print("Day:", match.group(2))

else:

print("The regex pattern does not match.")

Output

Match at index 14, 21

Full match: June 24

Month: June

Day: 24

Meta-characters
Metacharacters are special characters in regular expressions used to define search patterns.
The re module in Python supports several metacharacters that help you perform powerful
pattern matching.

Below is a quick reference table:

\ - Used to drop the special meaning of the character following it
[] - Represents a character class
^ - Matches the beginning
$ - Matches the end
. - Matches any character except newline
| - Means OR (matches any one of the characters separated by it)
? - Matches zero or one occurrence
* - Matches any number of occurrences (including 0 occurrences)
+ - Matches one or more occurrences
{} - Indicates the number of occurrences of the preceding regex to match
() - Encloses a group of regex

Let's discuss each of these metacharacters in detail:

1. \ - Backslash

The backslash (\) makes sure that the character is not treated in a special way. This can be
considered a way of escaping metacharacters.

For example, if you want to search for the dot (.) in a string, you will find that the dot is treated as a special character, since it is one of the metacharacters (as shown in the above table). So in this case, we put a backslash (\) just before the dot (.) so that it loses its special meaning. See the example below for a better understanding.

Example: The first search (re.search(r'.', s)) matches any character, not just the period, while
the second search (re.search(r'\.', s)) specifically looks for and matches the period character.

import re

s = 'geeks.forgeeks'

# without using \

match = re.search(r'.', s)

print(match)

# using \

match = re.search(r'\.', s)

print(match)
Output

<re.Match object; span=(0, 1), match='g'>

<re.Match object; span=(5, 6), match='.'>

2. [] - Square Brackets

Square Brackets ([]) represent a character class consisting of a set of characters that we wish
to match. For example, the character class [abc] will match any single a, b, or c.

We can also specify a range of characters using - inside the square brackets. For example,

 [0-3] is same as [0123]

 [a-c] is same as [abc]

We can also invert the character class using the caret(^) symbol. For example,

 [^0-3] means any character except 0, 1, 2, or 3

 [^a-c] means any character except a, b, or c

Example: In this code, you're using a regular expression to find all characters in the string that fall within the range 'a' to 'm'. The re.findall() function returns a list of all such characters, as shown in the output below.

import re

string = "The quick brown fox jumps over the lazy dog"

pattern = "[a-m]"

result = re.findall(pattern, string)

print(result)

Output
['h', 'e', 'i', 'c', 'k', 'b', 'f', 'j', 'm', 'e', 'h', 'e', 'l', 'a', 'd', 'g']

3. ^ - Caret

Caret (^) symbol matches the beginning of the string i.e. checks whether the string starts with
the given character(s) or not. For example -

 ^g will check if the string starts with g such as geeks, globe, girl, g, etc.

 ^ge will check if the string starts with ge such as geeks, geeksforgeeks, etc.

Example: This code uses regular expressions to check if a list of strings starts with "The". If a
string begins with "The," it's marked as "Matched" otherwise, it's labeled as "Not matched".

import re

regex = r'^The'

strings = ['The quick brown fox', 'The lazy dog', 'A quick brown fox']

for string in strings:

if re.match(regex, string):

print(f'Matched: {string}')

else:

print(f'Not matched: {string}')

Output

Matched: The quick brown fox

Matched: The lazy dog

Not matched: A quick brown fox

4. $ - Dollar

Dollar($) symbol matches the end of the string i.e checks whether the string ends with the
given character(s) or not. For example-

 s$ will check for a string that ends with s such as geeks, ends, s, etc.

 ks$ will check for the string that ends with ks such as geeks, geeksforgeeks, ks, etc.
Example: This code uses a regular expression to check if the string ends with "World!". If a
match is found, it prints "Match found!" otherwise, it prints "Match not found".

import re

string = "Hello World!"

pattern = r"World!$"

match = re.search(pattern, string)

if match:

print("Match found!")

else:

print("Match not found.")

Output

Match found!

5. . - Dot

Dot(.) symbol matches only a single character except for the newline character (\n). For
example -

 a.b will check for the string that contains any character at the place of the dot such as
acb, acbd, abbb, etc

 .. will check if the string contains at least 2 characters

Example: This code uses a regular expression to search for the pattern "brown.fox" within the
string. The dot (.) in the pattern represents any character. If a match is found, it prints "Match
found!" otherwise, it prints "Match not found".

import re
string = "The quick brown fox jumps over the lazy dog."

pattern = r"brown.fox"

match = re.search(pattern, string)

if match:

print("Match found!")

else:

print("Match not found.")

Output

Match found!

6. | - Or

The | operator means either pattern on its left or right can match. a|b will match any string
that contains a or b such as acd, bcd, abcd, etc.

7. ? - Question Mark

The question mark (?) indicates that the preceding element should be matched zero or one
time. It allows you to specify that the element is optional, meaning it may occur once or not
at all.

For example, ab?c will be matched for the string ac, acb, dabc but will not be matched for
abbc because there are two b. Similarly, it will not be matched for abdc because b is not
followed by c.

8.* - Star

Star (*) symbol matches zero or more occurrences of the regex preceding the * symbol.

For example, ab*c will be matched for the string ac, abc, abbbc, dabc, etc. but will not be
matched for abdc because b is not followed by c.

9. + - Plus

Plus (+) symbol matches one or more occurrences of the regex preceding the + symbol.
For example, ab+c will be matched for the string abc, abbc, dabc, but will not be matched for
ac, abdc, because there is no b in ac and b, is not followed by c in abdc.

10. {m,n} - Braces

Braces match any number of repetitions of the preceding regex from m to n, both inclusive.

For example, a{2,4} will be matched for the strings aaab, baaaac, gaad, but will not be matched for strings like abc, bc because there is only one a or no a in both cases.

11. (<regex>) - Group

Group symbol is used to group sub-patterns.

For example, (a|b)cd will match for strings like acd, abcd, gacd, etc.
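The short example below, run against made-up strings, illustrates the ?, *, +, {m,n} and group metacharacters described above.

import re

print(re.findall(r'ab?c', 'ac abc abbc'))     # ['ac', 'abc'] -> b is optional
print(re.findall(r'ab*c', 'ac abc abbbc'))    # ['ac', 'abc', 'abbbc']
print(re.findall(r'ab+c', 'ac abc abbbc'))    # ['abc', 'abbbc']
print(re.findall(r'a{2,4}', 'a aa aaaaa'))    # ['aa', 'aaaa']
print(re.search(r'(a|b)cd', 'gacd').group())  # 'acd' -> the group (a|b) matched 'a'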

Special Sequences

Special sequences match either a particular class of character (such as a digit or word character) or a specific location in the search string (such as a word boundary) where the match must occur. They make it easier to write commonly used patterns.

List of special sequences

Special Sequence | Description | Example
\A | Matches if the string begins with the given character | \Afor matches "for geeks", "for the world"
\b | Matches if the word begins or ends with the given character; \b(string) checks the beginning of a word and (string)\b checks the end | \bge matches "geeks", "get"
\B | The opposite of \b, i.e. the string should not start or end with the given regex | \Bge matches "together", "forge"
\d | Matches any decimal digit; equivalent to the set class [0-9] | \d matches "123", "gee1"
\D | Matches any non-digit character; equivalent to the set class [^0-9] | \D matches "geeks", "geek1"
\s | Matches any whitespace character | \s matches "gee ks", "a bc a"
\S | Matches any non-whitespace character | \S matches "a bd", "abcd"
\w | Matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_] | \w matches "123", "geeKs4"
\W | Matches any non-alphanumeric character | \W matches ">$", "gee<>"
\Z | Matches if the string ends with the given regex | ab\Z matches "abcdab", "abababab"

Sets for character matching

A Set is a set of characters enclosed in '[]' brackets. Sets are used to match a single character
in the set of characters specified between brackets. Below is the list of Sets:

\{n,\} - Quantifies the preceding character or group and matches at least n occurrences
* - Quantifies the preceding character or group and matches zero or more occurrences
[0123] - Matches the specified digits (0, 1, 2, or 3)
[^arn] - Matches any character EXCEPT a, r, and n
\d - Matches any digit (0-9)
[0-5][0-9] - Matches any two-digit number from 00 to 59
\w - Matches any alphanumeric character (a-z, A-Z, 0-9, or _)
[a-n] - Matches any lowercase letter between a and n
\D - Matches any non-digit character
[arn] - Matches where one of the specified characters (a, r, or n) is present
[a-zA-Z] - Matches any character between a and z, lowercase OR uppercase
[0-9] - Matches any digit between 0 and 9

Match Object
A Match object contains all the information about the search and the result and if there is no
match found then None will be returned. Let's see some of the commonly used methods and
attributes of the match object.

1. Getting the string and the regex

match.re attribute returns the regular expression passed and match.string attribute returns
the string passed.

Example:

The code searches for the letter "G" at a word boundary in the string "Welcome to
GeeksForGeeks" and prints the regular expression pattern (res.re) and the original
string (res.string).

import re

s = "Welcome to GeeksForGeeks"

res = re.search(r"\bG", s)

print(res.re)

print(res.string)

Output

re.compile('\\bG')

Welcome to GeeksForGeeks

2. Getting index of matched object

 start() method returns the starting index of the matched substring

 end() method returns the ending index of the matched substring

 span() method returns a tuple containing the starting and the ending index of the
matched substring

Example: Getting index of matched object


The code searches for substring "Gee" at a word boundary in string "Welcome to
GeeksForGeeks" and prints start index of the match (res.start()), end index of the match
(res.end()) and span of the match (res.span()).

import re

s = "Welcome to GeeksForGeeks"

res = re.search(r"\bGee", s)

print(res.start())

print(res.end())

print(res.span())

Output

11

14

(11, 14)

3. Getting matched substring

group() method returns the part of the string for which the patterns match. See the below
example for a better understanding.

Example: Getting matched substring

The code searches for a sequence of two non-digit characters followed by a space and the
letter 't' in the string "Welcome to GeeksForGeeks" and prints the matched text
using res.group().

import re

s = "Welcome to GeeksForGeeks"

res = re.search(r"\D{2} t", s)


print(res.group())

Output

me t

In the above example, our pattern matches two non-digit characters that are followed by a space, and that space is followed by a t.

Basic RegEx Patterns

Let's understand some of the basic regular expressions. They are as follows:

1. Character Classes

Character classes allow matching any one character from a specified set. They are enclosed in
square brackets [].

import re

print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: \

A computer science portal for geeks'))

Output

['Geeks', 'Geeks', 'geeks']

2. Ranges

In RegEx, a range allows matching characters or digits within a span using - inside []. For
example, [0-9] matches digits, [A-Z] matches uppercase letters.

import re

print('Range',re.search(r'[a-zA-Z]', 'x'))

Output

Range <re.Match object; span=(0, 1), match='x'>

3. Negation
Negation in a character class is specified by placing a ^ at the beginning of the brackets,
meaning match anything except those characters.

Syntax:

[^a-z]

Example:

import re

print(re.search(r'[^a-z]', 'c'))

print(re.search(r'G[^e]', 'Geeks'))

Output

None

None

4. Shortcuts

Shortcuts are shorthand representations for common character classes. Let's discuss some of
the shortcuts provided by the regular expression engine.

 \w - matches a word character

 \d - matches digit character

 \s - matches whitespace character (space, tab, newline, etc.)

 \b - matches a word boundary (a zero-width match)

import re

print('Geeks:', re.search(r'\bGeeks\b', 'Geeks'))

print('GeeksforGeeks:', re.search(r'\bGeeks\b', 'GeeksforGeeks'))

Output
Geeks: <_sre.SRE_Match object; span=(0, 5), match='Geeks'>

GeeksforGeeks: None

5. Beginning and End of String

The ^ character anchors a match to the beginning of a string and the $ character anchors it to the
end of a string.

import re

# Beginning of String

match = re.search(r'^Geek', 'Campus Geek of the month')

print('Beg. of String:', match)

match = re.search(r'^Geek', 'Geek of the month')

print('Beg. of String:', match)

# End of String

match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks')

print('End of String:', match)

Output

Beg. of String: None

Beg. of String: <_sre.SRE_Match object; span=(0, 4), match='Geek'>

End of String: <_sre.SRE_Match object; span=(31, 36), match='Geeks'>

6. Any Character

The . character matches any single character (except a newline) when used outside a bracketed character class.
import re

print('Any Character', re.search(r'p.th.n', 'python 3'))

Output

Any Character <_sre.SRE_Match object; span=(0, 6), match='python'>

7. Optional Characters

Regular expression engine allows you to specify optional characters using the ? character. It
allows a character or character class either to present once or else not to occur. Let's consider
the example of a word with an alternative spelling - color or colour.

import re

print('Color',re.search(r'colou?r', 'color'))

print('Colour',re.search(r'colou?r', 'colour'))

Output

Color <_sre.SRE_Match object; span=(0, 5), match='color'>

Colour <_sre.SRE_Match object; span=(0, 6), match='colour'>

8. Repetition

Repetition enables you to repeat the same character or character class. Consider an example
of a date that consists of day, month, and year. Let's use a regular expression to identify the
date (mm-dd-yyyy).

import re

print('Date{mm-dd-yyyy}:', re.search(r'[\d]{2}-[\d]{2}-[\d]{4}','18-08-2020'))

Output

Date{mm-dd-yyyy}: <_sre.SRE_Match object; span=(0, 10), match='18-08-2020'>


Here, the regular expression engine checks for two consecutive digits. Upon finding the
match, it moves to the hyphen character. After then, it checks the next two consecutive digits
and the process is repeated.

Let's discuss three other regular expressions under repetition.

8.1 Repetition ranges

The repetition range is useful when you have to accept one or more formats. Consider a
scenario where both three digits, as well as four digits, are accepted. Let's have a look at the
regular expression.

import re

print('Three Digit:', re.search(r'[\d]{3,4}', '189'))

print('Four Digit:', re.search(r'[\d]{3,4}', '2145'))

Output

Three Digit: <_sre.SRE_Match object; span=(0, 3), match='189'>

Four Digit: <_sre.SRE_Match object; span=(0, 4), match='2145'>

8.2 Open-Ended Ranges

There are scenarios where there is no limit for a character repetition. In such scenarios, you
can set the upper limit as infinitive. A common example is matching street addresses. Let's
have a look

import re

print(re.search(r'[\d]{1,}','5th Floor, A-118,\

Sector-136, Noida, Uttar Pradesh - 201305'))

Output

<_sre.SRE_Match object; span=(0, 1), match='5'>


8.3 Shorthand

Shorthand characters allow you to use the + character to specify one or more occurrences ({1,}) and the * character to specify zero or more occurrences ({0,}).

import re

print(re.search(r'[\d]+', '5th Floor, A-118,\

Sector-136, Noida, Uttar Pradesh - 201305'))

Output

<_sre.SRE_Match object; span=(0, 1), match='5'>

9. Grouping

Grouping is the process of separating an expression into groups by using parentheses, and it
allows you to fetch each individual matching group.

import re

grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})', '26-08-2020')

print(grp)

Output

<_sre.SRE_Match object; span=(0, 10), match='26-08-2020'>

Let's see some of its functionality.

9.1 Return the entire match

The re module allows you to return the entire match using the group() method

import re

grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')

print(grp.group())
Output

26-08-2020

9.2 Return a tuple of matched groups

You can use groups() method to return a tuple that holds individual matched groups

import re

grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')

print(grp.groups())

Output

('26', '08', '2020')

9.3 Retrieve a single group

Upon passing the index to a group method, you can retrieve just a single group.

import re

grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')

print(grp.group(3))

Output

2020

9.4 Name your groups

The re module allows you to name your groups. Let's look into the syntax.

import re

match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',

'26-08-2020')

print(match.group('mm'))
Output

08

9.5 Individual match as a dictionary

We have seen how regular expression provides a tuple of individual groups. Not only tuple,
but it can also provide individual match as a dictionary in which the name of each group acts
as the dictionary key.

import re

match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',

'26-08-2020')

print(match.groupdict())

Output

{'dd': '26', 'mm': '08', 'yyyy': '2020'}

10. Lookahead

A negated character class still requires some character to be present to match against; for example, n[^e] fails on 'Python' because nothing follows the final 'n'. We can overcome this with lookahead, which accepts or rejects a match based on the presence or absence of following content without consuming it.

import re

print('negation:', re.search(r'n[^e]', 'Python'))

print('lookahead:', re.search(r'n(?!e)', 'Python'))

Output

negation: None

lookahead: <_sre.SRE_Match object; span=(5, 6), match='n'>


Lookahead can also require that the match be followed by particular content. This is called a positive lookahead, and is written by simply replacing the ! character with the = character.

import re

print('positive lookahead', re.search(r'n(?=e)', 'jasmine'))

Output

positive lookahead <_sre.SRE_Match object; span=(5, 6), match='n'>

11. Substitution

The re.sub() method replaces matches of a pattern in a string and returns the modified string. It is useful when you want to strip characters such as /, -, ., etc. before storing a value in a database (a short example follows the argument list below). It takes three arguments:

 the regular expression

 the replacement string

 the source string being searched
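
A minimal sketch of re.sub() with the three arguments in that order (the date string below is just an illustration):

import re

date = "26/08/2020"

# re.sub(pattern, replacement, source): strip the / separators before storage
cleaned = re.sub(r'/', '', date)

print(cleaned)   # 26082020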

Introduction of Finite Automata:

Finite automata are abstract machines used to recognize patterns in input sequences, forming
the basis for understanding regular languages in computer science.

 Consist of states, transitions, and input symbols, processing each symbol step-by-step.

 If the machine ends in an accepting state after processing the input, the input is accepted; otherwise, it is rejected.

 Finite automata come in deterministic (DFA) and non-deterministic (NFA) variants, both of which can recognize the same set of regular languages.

 Widely used in text processing, compilers, and network protocols.


Figure: Features of Finite Automata

Features of Finite Automata

 Input: Set of symbols or characters provided to the machine.

 Output: Accept or reject based on the input pattern.

 States of Automata: The conditions or configurations of the machine.

 State Relation: The transitions between states.

 Output Relation: Based on the final state, the output decision is made.

Formal Definition of Finite Automata

A finite automaton can be defined as a tuple:

{ Q, Σ, q, F, δ }, where:

 Q: Finite set of states

 Σ: Set of input symbols

 q: Initial state

 F: Set of final states

 δ: Transition function

Types of Finite Automata

There are two types of finite automata:

 Deterministic Finite Automata (DFA)


 Non-Deterministic Finite Automata (NFA)

1. Deterministic Finite Automata (DFA)

A DFA is represented as {Q, Σ, q, F, δ}. In DFA, for each input symbol, the machine transitions
to one and only one state. DFA does not allow any null transitions, meaning every state must
have a transition defined for every input symbol.

DFA consists of 5 tuples {Q, Σ, q, F, δ}.


Q : set of all states.
Σ : set of input symbols. ( Symbols which machine takes as input )
q : Initial state. ( Starting state of a machine )
F : set of final state.
δ : Transition Function, defined as δ : Q X Σ --> Q.

Example:

Construct a DFA that accepts all strings ending with 'a'.

Given:

Σ = {a, b},

Q = {q0, q1},

F = {q1}

Fig 1. State Transition Diagram for DFA with Σ = {a, b}

State\Symbol    a     b

q0              q1    q0

q1              q1    q0

In this example, if the string ends in 'a', the machine reaches state q1, which is an accepting
state.
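
The DFA above can be simulated with a simple transition dictionary; the following is an illustrative sketch (the state names and helper function are ours, not part of any library):

# Transition function δ for the DFA over {a, b} that accepts strings ending in 'a'
delta = {
    ('q0', 'a'): 'q1', ('q0', 'b'): 'q0',
    ('q1', 'a'): 'q1', ('q1', 'b'): 'q0',
}

def accepts(string, start='q0', finals=('q1',)):
    state = start
    for symbol in string:
        state = delta[(state, symbol)]   # exactly one next state per (state, symbol) pair
    return state in finals

print(accepts("abba"))   # True  (ends in 'a', finishes in q1)
print(accepts("abab"))   # False (ends in 'b', finishes in q0)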

2) Non-Deterministic Finite Automata (NFA)

NFA is similar to DFA but includes the following features:

 It can transition to multiple states for the same input.

 It allows null (ϵ) moves, where the machine can change states without consuming any
input.

Example:

Construct an NFA that accepts strings ending in 'a'.

Given:

Σ = {a, b},

Q = {q0, q1},

F = {q1}

Fig 2. State Transition Diagram for NFA with Σ = {a, b}

State Transition Table for the above automaton:

State\Symbol    a           b

q0              {q0, q1}    q0

q1              φ           φ

In an NFA, if any transition leads to an accepting state, the string is accepted.

Comparison of DFA and NFA

Although NFAs appear more flexible, they do not have more computational power than DFAs.
Every NFA can be converted to an equivalent DFA, although the resulting DFA may have more
states.

 DFA: Single transition for each input symbol, no null moves.

 NFA: Multiple transitions and null moves allowed.

 Power: Both DFA and NFA recognize the same set of regular languages.

English Morphology and Transducers for Lexicon and Rules

1. English Morphology

(a) Definition

 Morphology = study of the internal structure of words and how they are formed.

 In NLP, morphology helps in analyzing roots, prefixes, suffixes, and inflections to


process words effectively.

(b) Types of Morphology


1. Inflectional Morphology

o Changes the form of a word to express tense, number, gender, case, etc.

o Does not change the basic meaning or word class.

o Examples:

 walk → walks → walked → walking

 cat → cats

2. Derivational Morphology

o Creates a new word with a new meaning or grammatical category.

o Examples:

 happy → happiness (adjective → noun)

 teach → teacher (verb → noun)

(c) Morphological Processes

 Affixation: Adding prefixes, suffixes.

o Example: un- (prefix), -ness (suffix).

 Compounding: Joining words.

o Example: blackboard, ice-cream.

 Reduplication: Repeating part/all of a word.

o Rare in English (boo-boo).

 Conversion (Zero Derivation): Changing class without adding affix.

o Example: to email (verb) ↔ an email (noun).

(d) Morphological Analysis in NLP

 Stemming: Cutting off affixes to get the stem.


o Example: playing, played → play.

o Porter Stemmer algorithm is commonly used.

 Lemmatization: Uses dictionary to return the base form (lemma).

o Example: better → good.

 Applications: Search engines, spell checkers, machine translation.
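
A quick NLTK sketch contrasting stemming and lemmatization (assumes the wordnet resource has been downloaded; the outputs shown as comments are indicative):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("playing"))                   # play
print(stemmer.stem("played"))                    # play
print(lemmatizer.lemmatize("better", pos="a"))   # good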

2. Finite-State Transducers (FSTs) for Lexicon and Rules

(a) Definition

 A Finite-State Transducer (FST) is like a Finite-State Automaton (FSA), but instead of


only accepting/rejecting, it maps between two representations.

 In morphology:

o Lexical level (abstract word form) ↔ Surface level (actual word form)

(b) Components

 States and transitions (like FSA).

 Each transition carries input and output symbols.

 Example:

 Lexical: cat + PLURAL

 Surface: cats

FST maps +PLURAL → s.

(c) How FST Works in Morphology

 Input: Lexical representation (stem + features).

 Output: Surface word form.

Example:
 Lexical: walk + PAST

 FST → walked

Steps:

1. Lexicon lookup: Identify base form (walk).

2. Morphological rule application: Add suffix -ed.

3. Surface form generation: walked.

(d) Applications in NLP

 Morphological generation: Creating correct word forms (e.g., MT systems).

 Morphological parsing: Decomposing words into root + features.

 Spell-checking: Recognizing valid forms via rules.

 Speech recognition/synthesis: Mapping between pronunciations and word forms.

(e) Example of Rule in FST

Rule: If verb ends with y → change y to i before adding -es.

 Lexical: try + 3rdPersonSingular

 Surface: tries

FST Transition:

 Input: try + s

 Output: tries
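
The y → i rule above can be approximated in plain Python with a small rule function; this is only a sketch of the rewrite rule, not a full two-level FST implementation:

import re

def third_person_singular(verb):
    # Rule: consonant + y at the end -> change y to i and add -es (try -> tries)
    if re.search(r'[^aeiou]y$', verb):
        return re.sub(r'y$', 'ies', verb)
    # Default: just attach -s (walk -> walks)
    return verb + 's'

print(third_person_singular("try"))    # tries
print(third_person_singular("walk"))   # walks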

3. Relation Between Morphology and Transducers

 Morphology describes how words are formed.

 FSTs implement morphological rules computationally, mapping lexical entries to actual


word forms.
Tokenization, Detecting & Correcting Spelling Errors, and Minimum Edit Distance

1. Tokenization

(a) Definition

 Tokenization is the process of splitting a text into smaller units (tokens) such as words,
sentences, or subwords.

 It is usually the first step in NLP pipelines.

(b) Types of Tokenization

1. Word Tokenization

o Splits text into words.

o Example:

o Input: "NLP is fun!"

o Output: ["NLP", "is", "fun"]

2. Sentence Tokenization

o Splits text into sentences using punctuation rules.

o Example:

o Input: "I love NLP. It is powerful."

o Output: ["I love NLP.", "It is powerful."]

3. Subword/Character Tokenization

o Useful for morphologically rich languages.

o Example (Byte-Pair Encoding, BPE):


unhappiness → un + happi + ness.
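
A minimal NLTK sketch of word and sentence tokenization (assumes the punkt model is available; note that NLTK keeps punctuation as separate tokens):

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

text = "I love NLP. It is powerful."

print(word_tokenize(text))   # ['I', 'love', 'NLP', '.', 'It', 'is', 'powerful', '.']
print(sent_tokenize(text))   # ['I love NLP.', 'It is powerful.']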

(c) Challenges in Tokenization


 Ambiguity in punctuation (e.g., “Mr. Smith” vs. end of sentence).

 Compounds and contractions (e.g., “don’t” → “do” + “not”).

 Multilingual scripts (e.g., Chinese/Japanese don’t use spaces).

2. Detecting and Correcting Spelling Errors

(a) Types of Spelling Errors

1. Non-word errors – word not in dictionary.

o Example: “speling” → “spelling”.

2. Real-word errors – valid word but wrong in context.

o Example: “I went too the store” → should be “to”.

(b) Approaches to Spelling Error Detection

1. Dictionary Lookup

o Check each word against a dictionary.

o If absent → mark as misspelled.

2. Statistical/Language Model Approach

o Uses context (bigram/trigram/transformers) to detect real-word errors.

o Example: “I like read books” → should be “reading”.

(c) Approaches to Spelling Error Correction

1. Edit Distance (Levenshtein distance)

o Suggests corrections by finding dictionary words with minimum edit distance.

2. Noisy Channel Model

o Assumes user intended correct word but an error occurred in transmission


(typing).
o Correction chosen as: ŵ = argmax over candidate words w of P(error | w) · P(w), where:

o P(w) = language model probability of w.

o P(error | w) = probability of making that mistake given the intended word w (the error/channel model).

3. Context-Sensitive Correction

o Uses context (n-grams or neural LMs) to choose between candidates.

o Example: “Their going to school” → “They’re going to school”.

3. Minimum Edit Distance

(a) Definition

 The minimum edit distance between two strings is the smallest number of operations
required to transform one string into another.

 Operations:

o Insertion (add a character).

o Deletion (remove a character).

o Substitution (replace a character).

(b) Example

Find edit distance between kitten and sitting:

1. kitten → sitten (substitute ‘k’→‘s’)

2. sitten → sittin (substitute ‘e’→‘i’)

3. sittin → sitting (insert ‘g’)


✅ Minimum Edit Distance = 3
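
The value above can be computed with the standard dynamic-programming algorithm; here is a minimal sketch using unit costs for insertion, deletion, and substitution:

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # dp[i][j] = edit distance between source[:i] and target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # delete all remaining source characters
    for j in range(m + 1):
        dp[0][j] = j                      # insert all remaining target characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[n][m]

print(min_edit_distance("kitten", "sitting"))   # 3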

(c) Applications in NLP

 Spelling correction (finding nearest correct word).


 Plagiarism detection / text similarity.

 Machine translation evaluation (BLEU/TER metrics).

 Speech recognition (matching predicted vs. actual output).

UNIT-II

Unsmoothed N-grams, Evaluation, Smoothing, Interpolation & Backoff:

Language modeling involves determining the probability of a sequence of words. It is


fundamental to many Natural Language Processing (NLP) applications such as speech
recognition, machine translation and spam filtering where predicting or ranking the likelihood
of phrases and sentences is crucial.

N-gram

An N-gram is a contiguous sequence of n items from a given sample of text or speech, and N-gram modelling builds language models from such sequences. The N-grams are collected from a text or speech corpus. Items can be:

 Words like “This”, “article”, “is”, “on”, “NLP” → unigrams

 Word pairs like "This article", "article is", "is on", "on NLP" → bigrams
 Triplets (trigrams) or larger combinations

N-gram Language Model

N-gram models predict the probability of a word given the previous n−1 words. For example,
a trigram model uses the preceding two words to predict the next word:

Goal: Calculate p(w | h), the probability that the next word is w, given the context/history h.

Example: For the phrase "This article is on …", if we want to predict the likelihood of "NLP" as the next word:

p("NLP" | "This", "article", "is", "on")

Chain Rule of Probability

The probability of a sequence of words is computed as:

P(w1, w2, …, wn) = ∏ (i = 1 to n) P(wi | w1, w2, …, wi−1)

Markov Assumption

To reduce complexity, N-gram models assume the probability of a word depends only on the
previous n−1 words.

P(wi | w1, …, wi−1) ≈ P(wi | wi−(n−1), …, wi−1)

Evaluating Language Models

1. Entropy: Measures the uncertainty or information content in a distribution.

H(p) = −∑x p(x) · log p(x)

Entropy is always non-negative.

2. Cross-Entropy: Measures how well a probability distribution predicts a sample from test
data.

H(p, q) = −∑x p(x) · log q(x)

Usually ≥ entropy; reflects model “surprise” at the test data.

3. Perplexity: Exponential of cross-entropy; lower values indicate a better model.

Perplexity(W) = ( ∏ (i = 1 to N) 1 / P(wi | wi−1) )^(1/N), i.e., the N-th root of the inverse probability of the test sequence.
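
As a toy illustration (the bigram probabilities below are made-up values, not from a trained model), perplexity can be computed directly from the probabilities assigned to the test words:

import math

# Assumed bigram probabilities P(wi | wi-1) for a 3-word test sequence (toy values)
probs = [0.2, 0.5, 0.1]

N = len(probs)
log_prob = sum(math.log(p) for p in probs)
perplexity = math.exp(-log_prob / N)     # N-th root of the inverse probability

print(round(perplexity, 2))              # 4.64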
Implementing N-Gram Language Modelling in NLTK

 words = nltk.word_tokenize(' '.join(reuters.words())): tokenizes the entire Reuters


corpus into words

 tri_grams = list(trigrams(words)): creates 3-word sequences from the tokenized words

 model = defaultdict(lambda: defaultdict(lambda: 0)): initializes nested dictionary for


trigram counts

 model[(w1, w2)][w3] += 1: counts occurrences of third word w3 after (w1, w2)

 model[w1_w2][w3] /= total_count: converts counts to probabilities

 return max(next_word_probs, key=next_word_probs.get): returns the most likely next


word based on highest probability

import nltk
from nltk import trigrams
from nltk.corpus import reuters
from collections import defaultdict

nltk.download('reuters')
nltk.download('punkt')

# Tokenize the Reuters corpus and collect trigrams
words = nltk.word_tokenize(' '.join(reuters.words()))
tri_grams = list(trigrams(words))

model = defaultdict(lambda: defaultdict(lambda: 0))

# Count occurrences of w3 after the context (w1, w2)
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Convert counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

def predict_next_word(w1, w2):
    next_word_probs = model[w1, w2]
    if next_word_probs:
        return max(next_word_probs, key=next_word_probs.get)
    else:
        return "No prediction available"

print("Next Word:", predict_next_word('the', 'stock'))

Output:

Next Word: of

Advantages

 Simple and Fast: Easy to build and fast to run for small n.

 Interpretable: Easy to understand and debug.

 Good Baseline: Useful as a starting point for many NLP tasks.

Limitations

 Limited Context: Only considers a few previous words, missing long-range


dependencies.

 Data Sparsity: Needs lots of data; rare n-grams are common as n increases.
 High Memory: Bigger n-gram models require lots of storage.

 Poor with Unseen Words: Struggles with new or rare words unless smoothing is
applied.

1. N-grams and Language Models

(a) Definition

 An N-gram model predicts the next word based on the previous n−1 words.

 Uses the Markov assumption → the probability of a word depends only on a few previous words.

2. Unsmoothed N-grams

(a) Definition

 Raw probabilities estimated directly from frequency counts.
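
For a bigram model, the unsmoothed (maximum likelihood) estimate is P(wi | wi−1) = C(wi−1 wi) / C(wi−1). A minimal counting sketch on a toy corpus (the corpus is purely illustrative):

from collections import Counter

corpus = "i love nlp i love python".split()
bigrams = list(zip(corpus, corpus[1:]))

unigram_counts = Counter(corpus)
bigram_counts = Counter(bigrams)

# Unsmoothed MLE estimate: P(love | i) = C(i love) / C(i)
print(bigram_counts[("i", "love")] / unigram_counts["i"])   # 1.0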

Word Classes and Part-of-Speech (PoS) Tagging

1. Word Classes

(a) Definition

 Word classes (also called lexical categories or parts of speech) are groups of words that
share similar grammatical properties.

 In English, major word classes are:

(b) Major Word Classes

1. Nouns (N)

o Represent people, places, things, ideas.


o Examples: dog, book, happiness.

2. Verbs (V)

o Express actions, processes, states.

o Examples: run, eat, is, seem.

3. Adjectives (Adj)

o Describe nouns.

o Examples: happy, large, beautiful.

4. Adverbs (Adv)

o Modify verbs, adjectives, or other adverbs.

o Examples: quickly, very, silently.

5. Pronouns (Pro)

o Replace nouns.

o Examples: he, she, it, they.

6. Prepositions (Prep)

o Show relationships between nouns/pronouns and other words.

o Examples: in, on, under, with.

7. Conjunctions (Conj)

o Connect words, phrases, clauses.

o Examples: and, but, although.

8. Determiners (Det)

o Specify nouns.

o Examples: the, a, this, those.

9. Interjections (Int)

o Express emotions.
o Examples: oh!, wow!, alas!

2. Part-of-Speech (PoS) Tagging

(a) Definition

 PoS tagging is the process of assigning the correct part-of-speech label to each word in
a sentence.

 Example:

o Sentence: "The cat sat on the mat"

o Tags: [Det, Noun, Verb, Prep, Det, Noun]

(b) Importance of PoS Tagging

 Essential for syntactic parsing.

 Helps in information extraction.

 Useful in speech recognition, machine translation, and text-to-speech.

(c) Challenges in PoS Tagging

 Ambiguity: Some words belong to multiple classes.

o Example: "book" = noun (a book) or verb (to book a ticket).

 Context-dependence: Correct tag depends on sentence context.

o Example: "can" = modal verb (I can swim) vs. noun (a tin can).

(d) Approaches to PoS Tagging

1. Rule-Based Tagging

o Uses handcrafted linguistic rules.


o Example: If a word ends in -ing and is preceded by “is” → tag as VBG (verb,
gerund).

o Advantages: interpretable.

o Limitations: requires many rules, not scalable.

2. Stochastic (Statistical) Tagging

o Uses probability models trained on tagged corpora.

o Hidden Markov Models (HMMs) are common:

P(tags | words) ∝ P(words | tags) · P(tags)

o Viterbi algorithm finds best tag sequence.

3. Transformation-Based Tagging (Brill Tagger)

o Starts with simple tagging (like most frequent tag).

o Applies transformation rules to improve accuracy.

4. Neural Network-Based Tagging (Modern Approach)

o Uses deep learning (RNNs, LSTMs, Transformers).

o Models learn word embeddings + context.

o Achieves state-of-the-art performance (e.g., BERT-based taggers).

(e) PoS Tagsets

Different corpora use standardized tagsets:

 Penn Treebank Tagset (45 tags) – widely used in English NLP.

o Examples:

 NN → Noun (singular), NNS → Noun (plural)

 VB → Verb base form, VBD → Verb past tense

 JJ → Adjective, RB → Adverb
 Universal PoS Tagset (17 tags) – simpler, cross-linguistic.

3. Example of PoS Tagging

Sentence: "She quickly read the book"

 She → PRP (Pronoun)

 quickly → RB (Adverb)

 read → VBD (Verb, past tense)

 the → DT (Determiner)

 book → NN (Noun, singular)
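
The same sentence can be tagged automatically with NLTK's default tagger (a sketch; the averaged-perceptron model must be downloaded, and its output may differ slightly from the hand-tagged example above):

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("She quickly read the book")
print(nltk.pos_tag(tokens))
# e.g. [('She', 'PRP'), ('quickly', 'RB'), ('read', 'VBD'), ('the', 'DT'), ('book', 'NN')]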

PoS Tagging Approaches: Rule-based, Stochastic, and Transformation-based

1. Rule-Based Tagging

(a) Definition

 Uses a set of handcrafted linguistic rules to assign tags.

 Developed in early NLP systems before statistical methods.

(b) Working

1. Lexicon lookup:

o Each word is assigned possible tags from a dictionary (lexicon).

o Example: “book” → could be NN (noun) or VB (verb).

2. Disambiguation using rules:


o Contextual rules decide the correct tag.

(c) Example

 Rule: If a word ends with -ing → likely VBG (verb gerund).

 Rule: If a word follows a determiner → likely NN (noun).

Sentence: “The book is interesting”

 "book" (NN), "is" (VBZ), "interesting" (VBG).

(d) Pros and Cons

 ✅ Advantages: Transparent, interpretable, linguistically sound.

 ❌ Disadvantages: Requires many rules, hard to scale, fails on unseen words.

2. Stochastic (Statistical) Tagging

(a) Definition

 Uses probabilistic models trained on annotated corpora to choose the most likely tag
sequence.

(b) Types

1. Most Frequent Tag

o Assigns each word its most frequent tag in training data.

o Example: “book” is usually tagged as NN → always tagged NN.

o Problem: ignores context.

2. N-gram Models (HMM Tagging)

o Uses Hidden Markov Models (HMMs) to model sequence of tags.

o Goal: find the best tag sequence T for the word sequence W.

o Formula:

P(T | W) ∝ P(W | T) · P(T)

 P(T) = probability of the tag sequence (transition probability).

 P(W | T) = likelihood of the words given the tags (emission probability).

o Solved using Viterbi Algorithm (dynamic programming).

3. Maximum Entropy Tagging

o Discriminative model, chooses tag with highest conditional probability given


features.

(c) Example (Bigram HMM)

Sentence: “I book a ticket”

 "book" → ambiguous (NN or VB).

 HMM chooses "VB" because sequence "PRP + VB" is more probable than "PRP + NN".

(d) Pros and Cons

 ✅ Advantages: Robust, handles ambiguity statistically, data-driven.

 ❌ Disadvantages: Needs large annotated corpus, cannot capture long dependencies


well.

3. Transformation-Based Tagging (Brill Tagger)

(a) Definition

 Hybrid approach introduced by Eric Brill (1995).

 Combines rule-based and statistical learning.

(b) Working

1. Initialization: Assign each word a simple tag (e.g., most frequent tag).

2. Transformation Rules: Learn rules that fix errors based on context.

o Rules are automatically learned from annotated corpus.

o Rules are ordered by how much they improve accuracy.

(c) Example of Transformation Rules


 Rule 1: Change a tag from NN → VB if the previous word is a pronoun.

 Rule 2: Change tag from VBD → VBN if preceded by “has/have/had”.

Sentence: “I book a flight”

 Initial tagging: PRP NN DT NN (mistake: “book” = NN).

 Rule applies: If a pronoun is before “book”, change NN → VB.

 Final tagging: PRP VB DT NN.

(d) Pros and Cons

 ✅ Advantages: Learns rules automatically, interpretable like rule-based.

 ✅ More accurate than simple statistical taggers.

 ❌ Disadvantages: Slower training, still requires annotated corpus.

4. Comparative Summary

Approach               Basis                     Example Method   Strengths                                Weaknesses

Rule-based             Handwritten rules         Englex parser    Interpretable, no training data needed   Hard to scale, brittle

Stochastic             Probability/statistics    HMM, MaxEnt      Robust, data-driven                      Needs large annotated corpus

Transformation-based   Rules learned from data   Brill Tagger     Balance between rules & statistics       Requires training data, slower

5. Applications in NLP

 Rule-based: Early grammar checkers, language teaching tools.

 Stochastic: Modern PoS taggers, speech recognition.

 Brill tagger: Used historically in parsing pipelines, still inspires hybrid methods.
Issues in Part-of-Speech (PoS) Tagging

1. Introduction

 PoS Tagging = assigning each word in a sentence its correct part of speech (Noun, Verb,
Adjective, etc.).

 Though highly useful, PoS tagging faces many linguistic and computational challenges.

 Errors usually arise due to ambiguity, sparsity, and context dependence.

2. Major Issues in PoS Tagging

(a) Ambiguity

1. Lexical Ambiguity

o Words can belong to multiple classes.

o Example:

 “book” → Noun (a book) or Verb (to book a ticket).

 “can” → Noun (a tin can) or Verb (I can swim).

2. Contextual Ambiguity

o Meaning depends on sentence structure.

o Example:

 “Visiting relatives can be boring.”

 “Visiting” = Verb (if action) or Adjective (if describing relatives).


(b) Unknown / Out-of-Vocabulary (OOV) Words

 Words not present in the training corpus.

 Common with:

o Proper nouns (new names, places).

o Neologisms (newly coined words, e.g., “selfie”).

o Foreign words / borrowed terms.

 Example: “The new iPhoneX was launched” → “iPhoneX” may not be in lexicon.

(c) Morphological Richness

 Languages like Turkish, Finnish, Hindi have complex morphology.

 A single root can generate thousands of word forms.

 Example (Turkish): “evlerinizden” → from your houses

o Root = “ev” (house), suffixes indicate plural, possession, case.

 Makes tagging much harder compared to English.

(d) Multiword Expressions (MWEs)

 Fixed phrases behave as a single unit but may confuse taggers.

 Example: “kick the bucket” → means “to die”.

 Literal tagging would assign kick (VB), the (DT), bucket (NN), but semantically it’s one
expression.

(e) Domain Adaptation Problem

 Taggers trained on one domain (e.g., news articles) may perform poorly on another
(e.g., medical text, social media).
 Example:

o News: “Apple launches new product” → “Apple” = Proper Noun (ORG).

o Recipes: “Add apple slices” → “apple” = Noun (fruit).

(f) Data Sparsity

 For statistical taggers, unseen word-tag combinations cause errors.

 Example: Rare word forms, low-resource languages → poor coverage.

(g) Tagset Variation

 Different corpora use different tagsets (Penn Treebank: 45 tags, Universal Tagset: 17
tags).

 Lack of standardization makes training, evaluation, and cross-lingual adaptation


harder.

(h) Real-Word Errors

 Even when the word exists in lexicon, choosing correct PoS is difficult.

 Example:

o “Time flies like an arrow.”

 “flies” = Noun (plural insects) OR Verb (3rd person singular).

o Tagger may misclassify depending on context.

3. Impact of Issues on Applications

 Parsing errors in syntactic analysis.

 Machine Translation errors due to wrong grammatical role.

 Speech recognition mistakes (wrong stress/intonation).


 Information Retrieval issues (incorrect keyword indexing).

4. Approaches to Mitigate Issues

1. Lexicon Expansion → regularly update dictionary with new words.

2. Morphological Analyzers → handle rich inflections.

3. Contextual Models → use neural networks (BiLSTMs, Transformers) for context-


sensitive tagging.

4. Domain Adaptation → retraining or fine-tuning on specific domain data.

5. Universal PoS Tagset → reduces cross-domain inconsistency.

Hidden Markov Models (HMM) and Maximum Entropy (MaxEnt) Models

1. Hidden Markov Models (HMM)

(a) Definition

 A Hidden Markov Model is a probabilistic model used for sequence labeling tasks (like
PoS tagging, speech recognition, NER).

 It assumes:

1. Markov Assumption – the probability of the current state depends only on the
previous state.

2. Output Independence Assumption – the observed word depends only on the


current hidden state (tag).
(b) Components of HMM

1. States → Hidden labels (e.g., PoS tags: Noun, Verb, Adj).

2. Observations → Visible sequence (words in the sentence).

3. Transition Probabilities (P(tᵢ | tᵢ₋₁)) → Probability of moving from one tag to another.

o Example: P(Verb | Noun).

4. Emission Probabilities (P(wᵢ | tᵢ)) → Probability of a word given a tag.

o Example: P(“run” | Verb).

5. Start/End Probabilities → Likelihood of beginning/ending with a certain tag.

(c) Working of HMM Tagger

 Goal: Find the most likely sequence of tags T = t1, t2, …, tn for a given word sequence W = w1, w2, …, wn.

 Formula:

T̂ = argmax over T of P(T | W) ∝ P(W | T) · P(T)

 Decoding Algorithm: Viterbi Algorithm (Dynamic Programming).

(d) Example

Sentence: “I book a ticket”

 Possible tags: book → {NN, VB}.

 Transition probabilities favor PRP → VB → DT → NN over PRP → NN → DT → NN.

 HMM outputs: PRP VB DT NN.
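
A toy Viterbi decoder for this example is sketched below; all transition and emission probabilities are made-up illustrative values, not estimates from a corpus:

states = ['PRP', 'VB', 'DT', 'NN']

start_p = {'PRP': 0.6, 'VB': 0.1, 'DT': 0.2, 'NN': 0.1}

# P(tag_i | tag_{i-1}) -- assumed transition probabilities
trans_p = {
    'PRP': {'PRP': 0.05, 'VB': 0.70, 'DT': 0.05, 'NN': 0.20},
    'VB':  {'PRP': 0.10, 'VB': 0.05, 'DT': 0.60, 'NN': 0.25},
    'DT':  {'PRP': 0.05, 'VB': 0.05, 'DT': 0.05, 'NN': 0.85},
    'NN':  {'PRP': 0.10, 'VB': 0.30, 'DT': 0.20, 'NN': 0.40},
}

# P(word | tag) -- assumed emission probabilities
emit_p = {
    'PRP': {'I': 0.5},
    'VB':  {'book': 0.4, 'ticket': 0.01},
    'DT':  {'a': 0.6},
    'NN':  {'book': 0.3, 'ticket': 0.5},
}

def viterbi(words):
    # Each cell stores (best probability so far, best tag path so far)
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in states}]
    for w in words[1:]:
        layer = {}
        for t in states:
            prev = max(states, key=lambda p: V[-1][p][0] * trans_p[p][t])
            prob = V[-1][prev][0] * trans_p[prev][t] * emit_p[t].get(w, 0.0)
            layer[t] = (prob, V[-1][prev][1] + [t])
        V.append(layer)
    return max(V[-1].values(), key=lambda cell: cell[0])[1]

print(viterbi(['I', 'book', 'a', 'ticket']))   # ['PRP', 'VB', 'DT', 'NN']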

(e) Pros and Cons

 ✅ Handles sequence data well, mathematically elegant.


 ✅ Performs well with enough training data.

 ❌ Strong independence assumptions (word depends only on tag).

 ❌ Struggles with long-distance dependencies.

2. Maximum Entropy Models (MaxEnt)

(a) Definition

 A Maximum Entropy Model (also called a logistic regression model for classification) is
a discriminative probabilistic model.

 Unlike HMM, which models the joint probability P(W, T), MaxEnt models the conditional probability P(T | W).

(b) Principle of Maximum Entropy

 Among all possible probability distributions that fit the training data, choose the one
with the highest entropy (least biased, most uniform).

 Ensures no extra assumptions are made beyond observed data.

(c) Features in MaxEnt Tagging

 Features can be arbitrary, context-based indicators.

 Example features:

o Word identity (wᵢ = “book”).

o Prefix/suffix (suffix=“ing” → likely verb).

o Previous/next words.

o Capitalization (if capitalized at sentence start → Proper Noun).

(d) Probability Estimation


For word position i:

P(ti | context) = (1 / Z(context)) · exp( Σj λj · fj(ti, context) )

 fj = feature function.

 λj = learned weights.

 Z(context) = normalization factor.

(e) Example

Sentence: “I book a ticket”

 Features:

o Previous word = “I” → “book” likely verb.

o Suffix = “-ook” → could be noun or verb.

 MaxEnt combines features with weights and predicts VB.
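
A small discriminative sketch in the MaxEnt spirit, using scikit-learn's logistic regression over hand-crafted context features (the tiny training set and the feature names are illustrative assumptions, not a real tagger):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training examples: context features -> tag (illustrative only)
X_train = [
    {'word': 'book',   'prev': 'I',    'suffix2': 'ok'},
    {'word': 'book',   'prev': 'the',  'suffix2': 'ok'},
    {'word': 'ticket', 'prev': 'a',    'suffix2': 'et'},
    {'word': 'run',    'prev': 'they', 'suffix2': 'un'},
]
y_train = ['VB', 'NN', 'NN', 'VB']

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)

# Predict the tag of "book" in "I book a ticket"
test = {'word': 'book', 'prev': 'I', 'suffix2': 'ok'}
print(clf.predict(vec.transform([test]))[0])   # expected: VB, driven by the prev='I' feature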

(f) Pros and Cons

 ✅ Flexible: Can use many overlapping/contextual features.

 ✅ No independence assumptions (unlike HMM).

 ✅ Generally more accurate than HMM.

 ❌ Training is computationally expensive.

 ❌ Needs large labeled corpora for good performance.

3. HMM vs MaxEnt

Aspect        Hidden Markov Model (HMM)                      Maximum Entropy (MaxEnt)

Model type    Generative (models the joint P(W, T))          Discriminative (models P(T | W))

Assumptions   Markov + output independence                   No independence assumption

Features      Only tags & words (limited)                    Any contextual features

Inference     Viterbi decoding                               Softmax over feature weights

Strengths     Efficient, interpretable, sequence-oriented    Flexible, high accuracy, context-aware

Weaknesses    Rigid assumptions, less accurate               Expensive training, needs more data

4. Applications in NLP

 HMM:

o Speech recognition

o PoS tagging

o Named Entity Recognition (early systems)

 MaxEnt:

o PoS tagging

o Information extraction

o Machine translation (feature-rich models)
