Unit 1 and 2
Natural Language Processing (NLP) faces various challenges due to the complexity and diversity
of human language. Let's discuss 10 major challenges in NLP:
1. Language differences
Human language is rich and intricate, and thousands of languages are spoken around the world, each with its own grammar, vocabulary and cultural nuances. No single person understands all of them, and the productivity of human language is enormous. Natural language is also ambiguous: the same words and phrases can have different meanings in different contexts. This is one of the major challenges in understanding natural language.
Natural languages have complex syntactic structures and grammatical rules, covering word order, verb conjugation, tense, aspect and agreement. Human language carries rich semantic content that allows speakers to convey a wide range of meanings through words and sentences. Natural language is also pragmatic: how language is used in context determines whether communication goals are achieved. Finally, human language evolves over time through processes such as lexical change.
2. Training Data
Training data is a curated collection of input-output pairs, where the input represents the features or attributes of the data and the output is the corresponding label or target. For NLP, features typically consist of text data, and labels can be categories, sentiments, or other relevant annotations. Training data helps the model generalize patterns from the training set so it can make predictions or classifications on new, previously unseen data.
Development time and resource requirements for NLP projects depend on various factors, including task complexity, the size and quality of the data, the availability of existing tools and libraries, and the expertise of the team involved. Here are some key points:
      Complexity of the task: Tasks such as text classification or sentiment analysis may require less time than more complex tasks such as machine translation or question answering.
      Availability and quality of data: NLP models require high-quality annotated data. Collecting, annotating, and preprocessing large text datasets can be time-consuming and resource-intensive, especially for tasks that require specialized domain knowledge or fine-grained annotations.
      Semantic Analysis: The semantic content of text is analyzed to determine meaning based on words, lexical relationships, and semantic roles. Techniques such as word sense disambiguation and semantic role labeling help resolve phrasing ambiguities.
      Syntactic Analysis: The syntactic structure of a sentence is analyzed to find possible interpretations based on grammatical relationships and syntactic patterns.
      Statistical methods: Statistical methods and machine learning models are used to learn patterns from data and make predictions about the input phrase.
Overcoming misspellings and grammatical errors is a basic challenge in NLP, since these forms of linguistic noise can reduce the accuracy of understanding and analysis. Here are some key points for handling misspellings and grammatical errors in NLP (a small sketch follows this list):
      Spell Checking: Implement spell-check algorithms and dictionaries to find and correct misspelled words.
      Text Normalization: The text is normalized by converting it into a standard format, which may include lowercasing, removing punctuation and special characters, and expanding contractions.
      Tokenization: The text is split into individual tokens using tokenization techniques. This makes it easier to identify and isolate misspelled words and grammatical errors so the phrase can be corrected.
      Language Models: Language models trained on large corpora can predict how likely a word or phrase is to be correct in its context.
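A minimal sketch combining text normalization with dictionary-based spelling correction, using only the Python standard library (difflib). The VOCABULARY here is a made-up toy lexicon; a real system would use a full dictionary and a stronger correction model.

import re
import string
from difflib import get_close_matches

# Hypothetical mini-dictionary used for illustration only.
VOCABULARY = {"natural", "language", "processing", "is", "fun", "and", "useful"}

def normalize(text):
    """Lowercase, strip punctuation, and collapse extra whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def correct_spelling(text):
    """Replace each out-of-vocabulary token with its closest dictionary entry."""
    corrected = []
    for token in text.split():
        if token in VOCABULARY:
            corrected.append(token)
        else:
            matches = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
            corrected.append(matches[0] if matches else token)
    return " ".join(corrected)

raw = "Natural   Langauge Procesing is FUN!!"
print(correct_spelling(normalize(raw)))
# -> "natural language processing is fun"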
Mitigating innate biases in NLP algorithms is a crucial step toward fairness, equity, and inclusivity in natural language processing applications. Here are some key points for mitigating biases in NLP algorithms:
      Data collection and annotation: It is very important to ensure that the training data used to develop NLP algorithms is diverse, representative, and free from biases.
      Bias detection and analysis: Apply bias detection and analysis methods to the training data to find biases based on demographic factors such as race, gender, and age.
      Data preprocessing: Preprocess the training data to mitigate biases, for example by debiasing word embeddings, balancing class distributions, and augmenting underrepresented samples.
      Fair representation learning: NLP models are trained to learn fair representations that are invariant to protected attributes such as race or gender.
      Model auditing and evaluation: NLP models are evaluated for fairness and bias with the help of metrics and audits. Models are evaluated on diverse datasets, and post-hoc analyses are performed to find and mitigate innate biases in NLP algorithms.
Words with multiple meanings pose a lexical challenge in Natural Language Processing because of their ambiguity. Such words, known as polysemous or homonymous words, take on different meanings depending on the context in which they are used. One key approach to this lexical challenge in NLP:
      Knowledge Graphs and Ontologies: Apply knowledge graphs and ontologies to capture the semantic relationships between the different senses of a word.
8. Addressing Multilingualism
      Multilingual Corpora: Multilingual corpora consist of text data in multiple languages and serve as valuable resources for training NLP models and systems.
Reducing uncertainty and false positives in Natural Language Processing (NLP) is a crucial task for improving the accuracy and reliability of NLP models. Here are some key points to approach the solution (a small sketch follows this list):
      Threshold Tuning: For classification tasks, the decision threshold is adjusted to balance sensitivity (recall) and specificity. False positives in NLP can be reduced by setting appropriate thresholds.
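A minimal sketch of threshold tuning on hypothetical classifier probabilities; the scores and labels below are made up for illustration. It shows how raising the decision threshold reduces false positives at the cost of more false negatives.

# Hypothetical predicted probabilities from some classifier, with the true labels.
predicted_probs = [0.95, 0.80, 0.62, 0.55, 0.30, 0.10]
true_labels     = [1,    1,    0,    1,    0,    0]

def evaluate(threshold):
    """Count true positives, false positives, and false negatives at a threshold."""
    preds = [1 if p >= threshold else 0 for p in predicted_probs]
    tp = sum(1 for p, t in zip(preds, true_labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, true_labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, true_labels) if p == 0 and t == 1)
    return tp, fp, fn

for threshold in (0.5, 0.7, 0.9):
    tp, fp, fn = evaluate(threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn}")
# threshold=0.5: TP=3 FP=1 FN=0
# threshold=0.7: TP=2 FP=0 FN=1
# threshold=0.9: TP=1 FP=0 FN=2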
Facilitating continuous conversations with NLP involves developing systems that understand and respond to human language in real time, enabling seamless interaction between users and machines. Implementing real-time NLP pipelines gives systems the ability to analyze and interpret user input as it is received; this requires algorithms and systems optimized for low-latency processing so that user queries and inputs receive quick responses.
It also requires building NLP models that maintain context throughout a conversation. Understanding context enables systems to interpret user intent, track conversation history, and generate relevant responses based on the ongoing dialogue. Intent recognition algorithms are applied to identify the underlying goals and intentions expressed by users in their messages.
      Quantity and Quality of data: High-quality, diverse data is needed to train NLP algorithms effectively. Data augmentation, data synthesis, and crowdsourcing are techniques for addressing data scarcity issues.
      Ambiguity: The NLP algorithm should be trained to disambiguate words and phrases.
      Lack of Annotated Data: Techniques such as transfer learning and pre-training can be used to transfer knowledge from large datasets to specific tasks with limited labeled data.
ORIGINS OF NLP
Natural language processing (NLP) is an exciting area that has grown over time at the junction of linguistics, artificial intelligence (AI), and computer science.
This article takes you on an in-depth journey through the history of NLP, tracing its development. From its early beginnings to contemporary advances, the story of NLP is an intriguing one that continues to revolutionize how we interact with technology.
      Understanding the meaning: Being able to extract the meaning from text, speech, or
       other forms of human language.
      Generating human-like language: Creating text or speech that is natural, coherent, and
       grammatically correct.
Ultimately, NLP aims to bridge the gap between human communication and machine
comprehension, fostering seamless interaction between us and technology.
The history of NLP (Natural Language Processing) is divided into three segments that are as
follows:
In the 1950s, the dream of effortless communication across languages fueled the birth of NLP.
Machine translation (MT) was the driving force, and rule-based systems emerged as the initial
approach.
   1. Sentence Breakdown: The system would first analyze the source language sentence and
      break it down into its parts of speech (nouns, verbs, adjectives, etc.).
   2. Matching Rules: Each word or phrase would be matched against the rule base to find its
      equivalent in the target language, considering grammatical roles and sentence structure.
   3. Rearrangement: Finally, the system would use the rules to rearrange the translated
      words and phrases to form a grammatically correct sentence in the target language.
Limitations of Rule-Based Systems:
While offering a foundation for MT, this approach had several limitations:
      Inflexibility: Languages are full of nuances and exceptions. Rule-based systems struggled
       to handle idioms, slang, and variations in sentence structure. A slight deviation from the
       expected format could throw the entire translation off.
      Scalability Issues: Creating and maintaining a vast rule base for every language pair was
       a time-consuming and laborious task. Imagine the immense effort required for just a
       handful of languages!
      Limited Scope: These systems primarily focused on syntax and vocabulary, often failing
       to capture the deeper meaning and context of the text. This resulted in translations that
       sounded grammatically correct but unnatural or even nonsensical.
Despite these limitations, rule-based systems laid the groundwork for future NLP
advancements. They demonstrated the potential for computers to understand and manipulate
human language, paving the way for more sophisticated approaches that would emerge later.
      A Shift Towards Statistics: The 1980s saw a paradigm shift towards statistical NLP
       approaches. Machine learning algorithms emerged as powerful tools for NLP tasks.
      The Power of Data: Large collections of text data (corpora) became crucial for training
       these statistical models.
      Learning from Patterns: Unlike rule-based systems, statistical models learn patterns
       from data, allowing them to handle variations and complexities of natural language.
      The Deep Learning Revolution: The 2000s ushered in the era of deep learning,
       significantly impacting NLP.
      Artificial Neural Networks (ANNs): These complex algorithms, inspired by the human
       brain, became the foundation of deep learning advancements in NLP.
      Advanced Architectures: Deep learning architectures such as recurrent neural networks and transformers further enhanced NLP capabilities.
The aim became to codify linguistic rules, including syntax and grammar, into algorithms that computers could execute to process and generate human-like text.
During this period, the General Problem Solver (GPS) gained prominence. Developed by Allen Newell and Herbert A. Simon in 1957, GPS wasn't explicitly designed for language processing. However, it demonstrated the capability of rule-based systems by showing how computers could solve problems using predefined rules and heuristics.
The enthusiasm surrounding rule-based systems was tempered by the realization that human language is inherently complex. Its nuances, ambiguities, and context-dependent meanings proved hard to capture with rigid rules. As a result, rule-based NLP systems struggled with real-world language applications, prompting researchers to explore other techniques. While statistical models represented a sizable leap forward, the real revolution in NLP came with the arrival of neural networks. Inspired by the structure and function of the human brain, neural networks have developed remarkable capabilities for learning complex patterns from data.
In the mid-2010s, the application of deep learning techniques, especially recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, triggered significant breakthroughs in NLP. These architectures allowed machines to capture sequential dependencies in language, permitting more nuanced understanding and generation of text. As NLP continued to advance, ethical concerns surrounding bias, fairness, and transparency became increasingly prominent. Biases present in training data often manifested in NLP models, raising concerns about the potential reinforcement of societal inequalities. Researchers and practitioners began addressing these issues, advocating for responsible AI development and the incorporation of ethical considerations into the fabric of NLP.
Multimodal NLP represents the next frontier in the evolution of natural language processing. Traditionally, NLP focused primarily on processing and understanding textual data.
However, the rise of multimedia-rich content on the web and the proliferation of devices equipped with cameras and microphones have created the need for NLP systems to handle a wide variety of modalities, including images, audio, and video.
   1. Image Captioning: One of the early applications of multimodal NLP is image captioning, in which models generate textual descriptions for images. This task requires the model not only to correctly recognize objects within a picture but also to understand the context and relationships among those objects. Integrating visual information with linguistic knowledge poses a considerable challenge, but it opens avenues for more immersive applications.
   2. Speech-to-Text and Audio Processing: Multimodal NLP extends into audio processing, with applications ranging from speech-to-text conversion to the analysis of audio content. Speech recognition systems equipped with NLP capabilities permit more natural interactions with devices through voice commands. This has implications for accessibility and usability, making technology more inclusive for people with varying levels of literacy.
   3. Video Understanding: As the amount of video content on the web keeps growing, there is a burgeoning need for NLP systems to understand and summarize video data. This involves not only recognizing objects and actions within videos but also understanding narrative structure and context. Video understanding opens doors to applications in content recommendation, video summarization, and even sentiment analysis based on visual and auditory cues.
   4. Social Media Analysis: Multimodal NLP becomes especially relevant in the context of social media, where users share a vast range of content including text, images, and videos. Analyzing and understanding the sentiment, context, and potential implications of social media content requires NLP systems that are proficient at processing multimodal information. This has implications for content moderation, brand monitoring, and trend analysis on social media platforms.
As NLP models become increasingly complex and powerful, there is a growing call for transparency and interpretability. The black-box nature of deep learning models, especially neural networks, has raised concerns about their decision-making processes. In response, the field of explainable AI (XAI) has gained prominence, aiming to shed light on the internal workings of complex models and make their outputs more understandable to users.
   1. Interpretable Models: Traditional machine learning models, such as decision trees and linear models, are inherently more interpretable because of their explicit representation of rules. However, as NLP embraced the power of deep learning, especially with models like BERT and GPT, interpretability has become a major challenge. Researchers are actively exploring techniques to improve the interpretability of neural NLP models without sacrificing their performance.
   2. Rule-Based Explanations: Integrating rule-based explanations into NLP involves incorporating human-comprehensible rules alongside the complex neural network architecture. This hybrid approach seeks a balance between the expressive power of deep learning and the transparency of rule-based systems. By providing rule-based explanations, users can gain insight into why the model made a particular prediction or decision.
Language models form the backbone of NLP, powering applications ranging from chatbots and digital assistants to machine translation and sentiment analysis. The evolution of language models reflects the ongoing quest for greater accuracy, context awareness, and efficient natural language understanding.
The early days of NLP saw the dominance of rule-based systems that tried to codify linguistic rules into algorithms. However, the limitations of these systems in handling the complexity of human language paved the way for statistical methods. Statistical techniques, such as n-gram models and Hidden Markov Models, leveraged large datasets to learn patterns and probabilities, improving the accuracy of language processing tasks.
The advent of word embeddings, such as Word2Vec and GloVe, marked a paradigm shift in how machines represent and understand words. These embeddings enabled words to be represented as dense vectors in a continuous vector space, capturing semantic relationships and contextual information. Distributed representations facilitated more nuanced language understanding and improved the performance of downstream NLP tasks.
The mid-2010s witnessed the rise of deep learning in NLP, with the application of recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. These architectures addressed the challenge of capturing sequential dependencies in language, allowing models to process and generate text with a better understanding of context. RNNs and LSTMs laid the foundation for subsequent advances in neural NLP.
In 2017, the introduction of the Transformer architecture by Vaswani et al. marked a major leap forward in NLP. Transformers, characterized by self-attention mechanisms, outperformed previous approaches on numerous language tasks.
The Transformer architecture has become the cornerstone of the latest developments, allowing parallelization and efficient learning of contextual information across long sequences.
Bidirectional Encoder Representations from Transformers (BERT), introduced by Google in 2018, demonstrated the power of pre-training large-scale language models on massive corpora. BERT and subsequent models like GPT (Generative Pre-trained Transformer) achieved outstanding performance by learning contextualized representations of words and phrases. These pre-trained models, fine-tuned for specific tasks, became the driving force behind breakthroughs in natural language understanding.
The evolution of language models continued with refinements such as XLNet, which addressed limitations in capturing bidirectional context. XLNet introduced a permutation language modeling objective, allowing the model to consider all possible orderings of a sequence. This approach further improved the understanding of contextual information and demonstrated the iterative nature of advances in language modeling.
The rapid development of NLP has brought transformative changes to numerous industries, from healthcare and finance to education and entertainment. However, with great power comes great responsibility, and the ethical issues surrounding NLP have become increasingly essential.
   1. Bias in NLP Models: One of the primary ethical concerns in NLP revolves around potential bias in training data and its impact on model predictions. If training data reflects existing societal biases, NLP models may inadvertently perpetuate and amplify them. For example, biased language in historical texts or news articles can lead to biased representations in language models, influencing their outputs.
   2. Fairness and Equity: Ensuring fairness and equity in NLP applications is a complex task. NLP systems should be evaluated for their performance across different demographic groups to identify and mitigate disparities. Addressing fairness involves not only refining algorithms but also adopting a holistic approach that considers the diverse perspectives and experiences of users.
This article explores language models in depth, highlighting their development, functionality,
and significance in natural language processing.
A language model in natural language processing (NLP) is a statistical or machine learning model
that is used to predict the next word in a sequence given the previous words. Language models
play a crucial role in various NLP tasks such as machine translation, speech recognition, text
generation, and sentiment analysis. They analyze and understand the structure and use of
human language, enabling machines to process and generate text that is contextually
appropriate and coherent.
Grammar-based LMs use formal grammar rules (syntax) to determine whether a sentence is
valid and sometimes assign probabilities.
 Example:
 S → NP VP
 NP → Det N
 VP → V NP
 N → "dog" | "cat"
 V → "chased" | "saw"
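A minimal sketch of this toy grammar in NLTK. Note that a Det rule (e.g., Det → "the" | "a") is assumed here so that complete sentences can be parsed; it is not part of the grammar fragment above.

import nltk

# Toy grammar from above, plus an assumed Det rule for illustration.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased' | 'saw'
""")

parser = nltk.ChartParser(grammar)
sentence = "the dog chased a cat".split()
for tree in parser.parse(sentence):
    print(tree)
# (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det a) (N cat))))

If a sentence is not generated by the grammar, the parser simply yields no trees, which is one way a grammar-based LM can judge validity.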
(b) Strengths
(c) Weaknesses
Statistical LMs use probability distributions learned from data (large corpora). Instead of strict
rules, they estimate how likely a sequence of words is.
(c) Estimation
(d) Issues
      Data Sparsity: Some word sequences may never appear in training data.
       Solution: Smoothing (e.g., Laplace smoothing, Good-Turing, Kneser-Ney).
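A minimal sketch of add-one (Laplace) smoothing for a bigram model; the corpus below is a made-up toy example, and the probabilities are illustrative only.

from collections import Counter, defaultdict

corpus = ["the cat sat", "the dog sat", "the cat ran"]

# Count unigrams (as contexts) and bigrams from the toy corpus.
unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
vocab = set()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    vocab.update(tokens)
    for prev, word in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[prev][word] += 1

def laplace_prob(prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    V = len(vocab)
    return (bigram_counts[prev][word] + 1) / (unigram_counts[prev] + V)

print(laplace_prob("the", "cat"))   # seen bigram  -> 3/9
print(laplace_prob("the", "flew"))  # unseen bigram -> 1/9, small but non-zero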
5. Modern Perspective
       Neural LMs: Today, RNNs, LSTMs, and Transformers have replaced traditional statistical
        LMs. They model long-distance dependencies better.
REGULAR EXPRESSIONS
A Regular Expression or RegEx is a special sequence of characters that uses a search pattern to
find a string or set of strings.
It can detect the presence or absence of a text by matching it with a particular pattern and
also can split a pattern into one or more sub-patterns.
Python has a built-in module named "re" that is used for regular expressions in Python. We
can import this module by using import statement.
import re
How to Use RegEx in Python?
Example:
This Python code uses regular expressions to search for the word "portal" in the given string
and then prints the start and end indices of the matched word within the string.
import re

s = "GeeksforGeeks: A computer science portal for geeks"
match = re.search(r'portal', s)

print('Start Index:', match.start())
print('End Index:', match.end())
Output
Start Index: 34
End Index: 40
Note: Here r character (r’portal’) stands for raw, not regex. The raw string is slightly different
from a regular string, it won’t interpret the \ character as an escape character. This is because
the regular expression engine uses \ character for its own escaping purpose.
Before starting with the Python regex module let's see how to actually write regex using
metacharacters or special sequences.
RegEx Functions
The re module in Python provides various functions that help search, match, and manipulate
strings using regular expressions.
Let's see the working of these RegEx functions with definition and examples:
1. re.findall()
Returns all non-overlapping matches of a pattern in the string as a list. It scans the string from
left to right.
Example: This code uses regular expression \d+ to find all sequences of one or more digits in
the given string.
import re

string = "Hello my Number is 123456789 and my friend's number is 987654321"
match = re.findall(r'\d+', string)
print(match)
Output
['123456789', '987654321']
2. re.compile()
Compiles a regex into a pattern object, which can be reused for matching or substitutions.
Example 1: This pattern [a-e] matches all lowercase letters between 'a' and 'e' in the input
string "Aye, said Mr. Gibenson Stark". The output should be ['e', 'a', 'd', 'b', 'e', 'a'], which are the
matching characters.
import re
p = re.compile('[a-e]')
print(p.findall("Aye, said Mr. Gibenson Stark"))
Output
['e', 'a', 'd', 'b', 'e', 'a']
Explanation:
         First occurrence is 'e' in "Aye". Next is 'a' in "said", then 'd' in "said", followed by 'b' and 'e' in
          "Gibenson"; the last 'a' matches in "Stark".
         Metacharacter backslash '\' has a very important role as it signals various sequences. If
          the backslash is to be used without its special meaning as a metacharacter, use '\\'.
         Metacharacter backslash '\' has a very important role as it signals various sequences. If
          the backslash is to be used without its special meaning as metacharacter, use'\\'
Example 2: The code uses regular expressions to find and list all single digits and sequences of
digits in the given input strings. It finds single digits with \d and sequences of digits with \d+.
import re

p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

p = re.compile(r'\d+')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))
Output
['1', '1', '4', '1', '8', '8', '6']
['11', '4', '1886']
Example 3: Here \w matches any alphanumeric character or underscore, \w+ matches whole words, and \W matches any non-alphanumeric character.
import re

p = re.compile(r'\w')
print(p.findall("He said * in some_lang."))

p = re.compile(r'\w+')
print(p.findall("I went to him at 11 A.M., he said *** in some_language."))

p = re.compile(r'\W')
print(p.findall("he said *** in some_language."))
Output
['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']
Example 4: The regular expression pattern 'ab*' finds and lists all occurrences of 'a' followed
by zero or more 'b' characters in the input string "ababbaabbb". It returns the following list
of matches: ['ab', 'abb', 'a', 'abbb'].
import re
p = re.compile('ab*')
print(p.findall("ababbaabbb"))
Output
['ab', 'abb', 'a', 'abbb']
Explanation: The lone 'a' also matches because * allows zero 'b' characters; every other 'a' is matched together with the run of 'b's that follows it.
3. re.split()
Splits a string wherever the pattern matches. The remaining characters are returned as list
elements.
Syntax:
re.split(pattern, string, maxsplit=0, flags=0)
This example demonstrates how to split a string using different patterns like non-word
characters (\W+), apostrophes, and digits (\d+).

import re

print(re.split(r'\W+', 'Words, words , Words'))
print(re.split(r'\W+', "Word's words Words"))
print(re.split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))
print(re.split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))

Output
['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']
This example shows how to limit the number of splits using maxsplit, and how flags can
control case sensitivity.
import re

# Case-insensitive split: uppercase 'A' and 'B' also match [a-f]
print(re.split(r'[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE))

# Case-sensitive split: only lowercase a-f match
print(re.split(r'[a-f]+', 'Aey, Boy oh boy, come here'))

# maxsplit=1 stops after the first split
print(re.split(r'[a-f]+', 'Aey, Boy oh boy, come here', maxsplit=1))

Output
['', 'y, ', 'oy oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh boy, come here']

Note: In the first two cases, [a-f]+ splits the string on any run of letters from 'a' to 'f'; the
re.IGNORECASE flag additionally lets uppercase letters ('A', 'B') match. In the third case,
maxsplit=1 limits the number of splits to one.
4. re.sub()
The re.sub() function replaces all occurrences of a pattern in a string with a replacement
string.
Syntax:
re.sub(pattern, repl, string, count=0, flags=0)
Example 1: The following examples show different ways to replace the pattern 'ub' with '~*',
using various flags and count values.
import re

# Case-insensitive: 'ub' in 'Subject' and 'Ub' in 'Uber' are both replaced
print(re.sub('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE))

# Case-sensitive: only the lowercase 'ub' in 'Subject' is replaced
print(re.sub('ub', '~*', 'Subject has Uber booked already'))

# count=1 replaces only the first occurrence
print(re.sub('ub', '~*', 'Subject has Uber booked already', count=1, flags=re.IGNORECASE))

Output
S~*ject has ~*er booked already
S~*ject has Uber booked already
S~*ject has Uber booked already
5. re.subn()
re.subn() function works just like re.sub(), but instead of returning only the modified string, it
returns a tuple: (new_string, number_of_substitutions)
Syntax:
re.subn(pattern, repl, string, count=0, flags=0)
This example shows how re.subn() gives both the replaced string and the number of times
replacements were made.
import re

# Case-sensitive replacement
print(re.subn('ub', '~*', 'Subject has Uber booked already'))

# Case-insensitive replacement
print(re.subn('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE))

Output
('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
6. re.escape()
re.escape() function adds a backslash (\) before all special characters in a string. This is useful
when you want to match a string literally, including any characters that have special meaning
in regex (like ., *, [, ], etc.).
Syntax:
re.escape(string)
This example shows how re.escape() treats spaces, brackets, dashes, and dots as literal
characters.
import re

print(re.escape("find this [a-z] . literally"))

Output
find\ this\ \[a\-z\]\ \.\ literally
7. re.search()
The re.search() function searches for the first occurrence of a pattern in a string. It returns
a match object if found, otherwise None.
Note: Use it when you want to check if a pattern exists or extract the first match.
This example searches for a date pattern with a month name (letters) followed by a day
(digits) in a sentence.
import re

s = "I was born on June 24"
match = re.search(r"([a-zA-Z]+) (\d+)", s)

if match:
    print("Month:", match.group(1))
    print("Day:", match.group(2))
else:
    print("No match found")
Output
Month: June
Day: 24
Meta-characters
Metacharacters are special characters in regular expressions used to define search patterns.
The re module in Python supports several metacharacters that help you perform powerful
pattern matching.
MetaCharacters and their Description:
\      Escapes a special character or signals a special sequence
[]     Character class: a set of characters to match
^      Matches the beginning of the string
$      Matches the end of the string
.      Matches any single character except newline
|      Alternation: matches the pattern on its left or on its right
?      Zero or one occurrence of the preceding element
*      Zero or more occurrences of the preceding element
+      One or more occurrences of the preceding element
{m,n}  Between m and n occurrences of the preceding element
()     Groups a sub-pattern
1. \ - Backslash
The backslash (\) makes sure that the character is not treated in a special way. This can be
considered a way of escaping metacharacters.
For example, if you want to search for the dot (.) in the string, you will find that the dot (.) is
treated as a special character, as it is one of the metacharacters (as shown in the above
list). So in this case, we use the backslash (\) just before the dot (.) so that it loses its
special meaning. See the example below for a better understanding.
Example: The first search (re.search(r'.', s)) matches any character, not just the period, while
the second search (re.search(r'\.', s)) specifically looks for and matches the period character.
import re
s = 'geeks.forgeeks'
# without using \
match = re.search(r'.', s)
print(match)
# using \
match = re.search(r'\.', s)
print(match)
Output
<re.Match object; span=(0, 1), match='g'>
<re.Match object; span=(5, 6), match='.'>
2. [] - Square Brackets
Square Brackets ([]) represent a character class consisting of a set of characters that we wish
to match. For example, the character class [abc] will match any single a, b, or c.
We can also specify a range of characters using - inside the square brackets. For example, [0-9] matches any digit and [a-z] matches any lowercase letter.
We can also invert the character class using the caret (^) symbol. For example, [^0-9] matches any character that is not a digit.
Example: In this code, you're using regular expressions to find all the characters in the string
that fall within the range of 'a' to 'm'. The re.findall() function returns a list of all such
characters. In the given string, the characters that match this pattern are: 'h', 'e', 'i', 'c', 'k', 'b',
'f', 'j', 'm', 'e', 'h', 'e', 'l', 'a', 'd', 'g'.
import re

string = "The quick brown fox jumps over the lazy dog"
pattern = "[a-m]"
result = re.findall(pattern, string)
print(result)
Output
['h', 'e', 'i', 'c', 'k', 'b', 'f', 'j', 'm', 'e', 'h', 'e', 'l', 'a', 'd', 'g']
3. ^ - Caret
Caret (^) symbol matches the beginning of the string i.e. checks whether the string starts with
the given character(s) or not. For example -
 ^g will check if the string starts with g such as geeks, globe, girl, g, etc.
 ^ge will check if the string starts with ge such as geeks, geeksforgeeks, etc.
Example: This code uses regular expressions to check if a list of strings starts with "The". If a
string begins with "The," it's marked as "Matched" otherwise, it's labeled as "Not matched".
import re

regex = r'^The'
strings = ['The quick brown fox', 'The lazy dog', 'A quick brown fox']

for string in strings:
    if re.match(regex, string):
        print(f'Matched: {string}')
    else:
        print(f'Not matched: {string}')

Output
Matched: The quick brown fox
Matched: The lazy dog
Not matched: A quick brown fox
4. $ - Dollar
Dollar($) symbol matches the end of the string i.e checks whether the string ends with the
given character(s) or not. For example-
 s$ will check for the string that ends with s such as geeks, ends, s, etc.
          ks$ will check for the string that ends with ks such as geeks, geeksforgeeks, ks, etc.
Example: This code uses a regular expression to check if the string ends with "World!". If a
match is found, it prints "Match found!" otherwise, it prints "Match not found".
import re

string = "Hello World!"
pattern = r"World!$"

match = re.search(pattern, string)
if match:
    print("Match found!")
else:
    print("Match not found")
Output
Match found!
5. . - Dot
Dot(.) symbol matches only a single character except for the newline character (\n). For
example -
        a.b will check for the string that contains any character at the place of the dot such as
         acb, acbd, abbb, etc
Example: This code uses a regular expression to search for the pattern "brown.fox" within the
string. The dot (.) in the pattern represents any character. If a match is found, it prints "Match
found!" otherwise, it prints "Match not found".
import re

string = "The quick brown fox jumps over the lazy dog."
pattern = r"brown.fox"

match = re.search(pattern, string)
if match:
    print("Match found!")
else:
    print("Match not found")
Output
Match found!
6. | - Or
The | operator means either pattern on its left or right can match. a|b will match any string
that contains a or b such as acd, bcd, abcd, etc.
7. ? - Question Mark
The question mark (?) indicates that the preceding element should be matched zero or one
time. It allows you to specify that the element is optional, meaning it may occur once or not
at all.
For example, ab?c will be matched for the string ac, acb, dabc but will not be matched for
abbc because there are two b. Similarly, it will not be matched for abdc because b is not
followed by c.
8. * - Star
Star (*) symbol matches zero or more occurrences of the regex preceding the * symbol.
For example, ab*c will be matched for the string ac, abc, abbbc, dabc, etc. but will not be
matched for abdc because b is not followed by c.
9. + - Plus
Plus (+) symbol matches one or more occurrences of the regex preceding the + symbol.
For example, ab+c will be matched for the string abc, abbc, dabc, but will not be matched for
ac, abdc, because there is no b in ac and b, is not followed by c in abdc.
10. {m,n} - Braces
Braces match the preceding element at least m and at most n times. For example, a{2,4} will be matched for
the strings aaab, baaaac, gaad, but will not be matched for strings like abc, bc because there is only one a
or no a in both cases.
11. (<regex>) - Group
Parentheses group a sub-pattern. For example, (a|b)cd will match strings like acd, abcd, gacd, etc.
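A small illustrative sketch of these quantifiers and grouping with re.findall; the input strings below are made up for illustration.

import re

# ? makes the preceding element optional
print(re.findall(r'ab?c', 'ac abc abbc'))      # ['ac', 'abc']
# * allows zero or more repetitions
print(re.findall(r'ab*c', 'ac abc abbbc'))     # ['ac', 'abc', 'abbbc']
# + requires at least one repetition
print(re.findall(r'ab+c', 'ac abc abbc'))      # ['abc', 'abbc']
# {m,n} sets an explicit repetition range
print(re.findall(r'a{2,4}', 'a aa aaaaa'))     # ['aa', 'aaaa']
# (?:a|b) groups an alternation (non-capturing, so findall returns full matches)
print(re.findall(r'(?:a|b)cd', 'acd bcd ccd')) # ['acd', 'bcd']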
Special Sequences
Special sequences do not match for the actual character in the string instead it tells the
specific location in the search string where the match must occur. It makes it easier to write
commonly used patterns.
Special Sequence    Description                                     Examples
\A                  Matches if the string begins with the           \Afor  ->  "for geeks", "for the world"
                    given character
\s                  Matches any whitespace character                \s     ->  "gee ks", "a bc a"
\S                  Matches any non-whitespace character            \S     ->  "a bd", "abcd"
\W                  Matches any non-alphanumeric character          \W     ->  ">$", "gee<>"
\Z                  Matches if the string ends with the             ab\Z   ->  "abcdab", "abababab"
                    given regex
A Set is a group of characters enclosed in '[]' brackets. Sets are used to match a single character
from the set of characters specified between the brackets.
Match Object
A Match object contains all the information about the search and the result and if there is no
match found then None will be returned. Let's see some of the commonly used methods and
attributes of the match object.
match.re attribute returns the regular expression passed and match.string attribute returns
the string passed.
Example:
The code searches for the letter "G" at a word boundary in the string "Welcome to
GeeksForGeeks" and prints the regular expression pattern (res.re) and the original
string (res.string).
import re
s = "Welcome to GeeksForGeeks"
res = re.search(r"\bG", s)
print(res.re)
print(res.string)
Output
re.compile('\\bG')
Welcome to GeeksForGeeks
         The start(), end() and span() methods return the starting index, the ending index, and a
          (start, end) tuple for the matched substring.
import re
s = "Welcome to GeeksForGeeks"
res = re.search(r"\bGee", s)
print(res.start())
print(res.end())
print(res.span())
Output
11
14
(11, 14)
group() method returns the part of the string for which the patterns match. See the below
example for a better understanding.
The code searches for a sequence of two non-digit characters followed by a space and the
letter 't' in the string "Welcome to GeeksForGeeks" and prints the matched text
using res.group().
import re

s = "Welcome to GeeksForGeeks"
res = re.search(r"\D{2} t", s)
print(res.group())
Output
me t
In the above example, our pattern specifies for the string that contains at least 2 characters
which are followed by a space, and that space is followed by a t.
Let's understand some of the basic regular expressions. They are as follows:
1. Character Classes
Character classes allow matching any one character from a specified set. They are enclosed in
square brackets [].
import re

print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: A computer science portal for geeks'))

Output
['Geeks', 'Geeks', 'geeks']
2. Ranges
In RegEx, a range allows matching characters or digits within a span using - inside []. For
example, [0-9] matches digits, [A-Z] matches uppercase letters.
import re
print('Range',re.search(r'[a-zA-Z]', 'x'))
Output
Range <re.Match object; span=(0, 1), match='x'>
3. Negation
Negation in a character class is specified by placing a ^ at the beginning of the brackets,
meaning match anything except those characters.
Syntax:
[^a-z]
Example:
import re
print(re.search(r'[^a-z]', 'c'))
print(re.search(r'G[^e]', 'Geeks'))
Output
None
None
4. Shortcuts
Shortcuts are shorthand representations for common character classes. Let's discuss some of
the shortcuts provided by the regular expression engine.
import re

print('Geeks:', re.search(r'\bGeeks\b', 'Geeks'))
print('GeeksforGeeks:', re.search(r'\bGeeks\b', 'GeeksforGeeks'))
Output
Geeks: <_sre.SRE_Match object; span=(0, 5), match='Geeks'>
GeeksforGeeks: None
The ^ character chooses the beginning of a string and the $ character chooses the end of a
string.
import re

# Beginning of String
print('Beg. of String:', re.search(r'^Geek', 'Geek of the month'))

# End of String
print('End of String:', re.search(r'Geeks$', 'Welcome to GeeksforGeeks'))

Output
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>
End of String: <re.Match object; span=(19, 24), match='Geeks'>
5. Any Character
The . character represents any single character outside a bracketed character class.
import re

print('Any Character:', re.search(r'p.th.n', 'python 3'))

Output
Any Character: <re.Match object; span=(0, 6), match='python'>
6. Optional Characters
Regular expression engine allows you to specify optional characters using the ? character. It
allows a character or character class either to present once or else not to occur. Let's consider
the example of a word with an alternative spelling - color or colour.
import re
print('Color',re.search(r'colou?r', 'color'))
print('Colour',re.search(r'colou?r', 'colour'))
Output
Color <re.Match object; span=(0, 5), match='color'>
Colour <re.Match object; span=(0, 6), match='colour'>
7. Repetition
Repetition enables you to repeat the same character or character class. Consider an example
of a date that consists of day, month, and year. Let's use a regular expression to identify the
date (mm-dd-yyyy).
import re
print('Date{mm-dd-yyyy}:', re.search(r'[\d]{2}-[\d]{2}-[\d]{4}','18-08-2020'))
Output
Date{mm-dd-yyyy}: <re.Match object; span=(0, 10), match='18-08-2020'>
The repetition range is useful when you have to accept one or more formats. Consider a
scenario where both three digits, as well as four digits, are accepted. Let's have a look at the
regular expression.
import re

print('Three Digit:', re.search(r'[\d]{3,4}', '189'))
print('Four Digit:', re.search(r'[\d]{3,4}', '2145'))

Output
Three Digit: <re.Match object; span=(0, 3), match='189'>
Four Digit: <re.Match object; span=(0, 4), match='2145'>
There are scenarios where there is no limit for a character repetition. In such scenarios, you
can set the upper limit as infinitive. A common example is matching street addresses. Let's
have a look
import re

print('Address:', re.search(r'\d{1,}', 'Flat no 4578, Sector 21'))

Output
Address: <re.Match object; span=(8, 12), match='4578'>
Shorthand characters allow you to use + character to specify one or more ({1,}) and *
character to specify zero or more ({0,}.
import re

print('One or more digits:', re.search(r'\d+', 'Order 12345 confirmed'))
print('Zero or more bs:', re.search(r'ab*', 'ac'))

Output
One or more digits: <re.Match object; span=(6, 11), match='12345'>
Zero or more bs: <re.Match object; span=(0, 1), match='a'>
8. Grouping
Grouping is the process of separating an expression into groups by using parentheses, and it
allows you to fetch each individual matching group.
import re

grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})', '26-08-2020')
print(grp)

Output
<re.Match object; span=(0, 10), match='26-08-2020'>
The re module allows you to return the entire match using the group() method
import re
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.group())
Output
26-08-2020
You can use groups() method to return a tuple that holds individual matched groups
import re
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.groups())
Output
('26', '08', '2020')
Upon passing the index to a group method, you can retrieve just a single group.
import re
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.group(3))
Output
2020
The re module allows you to name your groups. Let's look into the syntax.
import re
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',
'26-08-2020')
print(match.group('mm'))
Output
08
We have seen how regular expression provides a tuple of individual groups. Not only tuple,
but it can also provide individual match as a dictionary in which the name of each group acts
as the dictionary key.
import re
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',
'26-08-2020')
print(match.groupdict())
Output
{'dd': '26', 'mm': '08', 'yyyy': '2020'}
9. Lookahead
In the case of a negated character class, it won't match if a character is not present to check
against the negated character. We can overcome this case by using lookahead; it accepts or
rejects a match based on the presence or absence of content.
import re

# Negated character class: no match, because 'n' is the last character and nothing follows it
print('negation:', re.search(r'n[^e]', 'Python'))

Output
negation: None
import re

# Negative lookahead: matches 'n' not followed by 'e', without requiring a following character
print('lookahead:', re.search(r'n(?!e)', 'Python'))

Output
lookahead: <re.Match object; span=(5, 6), match='n'>
10. Substitution
The regular expression can replace the string and returns the replaced one using the re.sub
method. It is useful when you want to avoid characters such as /, -, ., etc. before storing it to a
database. It takes three arguments: the pattern to search for, the replacement string, and the string to operate on.
Finite automata are abstract machines used to recognize patterns in input sequences, forming
the basis for understanding regular languages in computer science.
 Consist of states, transitions, and input symbols, processing each symbol step-by-step.
        If ends in an accepting state after processing the input, then the input is accepted;
         otherwise, rejected.
 Output Relation: Based on the final state, the output decision is made.
{ Q, Σ, q, F, δ }, where:
 Q: Finite set of states
 Σ: Input alphabet (set of input symbols)
 q: Initial state
 F: Set of final (accepting) states
 δ: Transition function
A DFA is represented as {Q, Σ, q, F, δ}. In DFA, for each input symbol, the machine transitions
to one and only one state. DFA does not allow any null transitions, meaning every state must
have a transition defined for every input symbol.
Example:
Given:
Σ = {a, b},
Q = {q0, q1},
F = {q1}
State\Symbol    a     b
q0              q1    q0
q1              q1    q0
In this example, if the string ends in 'a', the machine reaches state q1, which is an accepting
state.
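A minimal Python sketch simulating this DFA; the dictionary-based transition table mirrors the table above, so the machine accepts exactly the strings over {a, b} that end in 'a'.

# Transition table for the DFA above
transitions = {
    ('q0', 'a'): 'q1', ('q0', 'b'): 'q0',
    ('q1', 'a'): 'q1', ('q1', 'b'): 'q0',
}
start_state = 'q0'
accepting = {'q1'}

def accepts(string):
    """Run the DFA over the input and report whether it ends in an accepting state."""
    state = start_state
    for symbol in string:
        state = transitions[(state, symbol)]
    return state in accepting

print(accepts("abba"))  # True  (ends in 'a')
print(accepts("ab"))    # False (ends in 'b')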
      An NFA (Nondeterministic Finite Automaton) allows null (ϵ) moves, where the machine can
       change states without consuming any input.
Example:
Given:
Σ = {a, b},
Q = {q0, q1},
F = {q1}
State\Symbol    a            b
q0              {q0, q1}     q0
q1              φ            φ
Although NFAs appear more flexible, they do not have more computational power than DFAs.
Every NFA can be converted to an equivalent DFA, although the resulting DFA may have more
states.
 Power: Both DFA and NFA recognize the same set of regular languages.
1. English Morphology
(a) Definition
 Morphology = study of the internal structure of words and how they are formed.
1. Inflectional Morphology
o Changes the form of a word to express tense, number, gender, case, etc.
o Examples:
 cat → cats
2. Derivational Morphology
o Examples:
Finite State Transducers (FST)
(a) Definition
 An FST maps between two levels of word representation, producing an output form for each input form.
 In morphology:
o Lexical level (abstract word form) ↔ Surface level (actual word form)
(b) Components
 Example:
 Lexical: cat + PL
 Surface: cats
Example:
        Lexical: walk + PAST
 FST → walked
Steps:
 Surface: tries
FST Transition:
 Input: try + s
 Output: tries
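A minimal rule-based sketch approximating what such an FST does for the English "-s" suffix. This is a simplification for illustration, not a full two-level FST implementation, and the spelling rules shown are only the common cases.

import re

def add_s_suffix(lemma):
    """Approximate the surface form of lemma + s (e.g., try+s -> tries, walk+s -> walks)."""
    # y -> ie before s when preceded by a consonant (try -> tries, but boy -> boys)
    if re.search(r'[^aeiou]y$', lemma):
        return lemma[:-1] + 'ies'
    # e-insertion after sibilants (fox -> foxes, watch -> watches)
    if re.search(r'(s|z|x|ch|sh)$', lemma):
        return lemma + 'es'
    return lemma + 's'

print(add_s_suffix("try"))   # tries
print(add_s_suffix("fox"))   # foxes
print(add_s_suffix("walk"))  # walks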
1. Tokenization
(a) Definition
      Tokenization is the process of splitting a text into smaller units (tokens) such as words,
       sentences, or subwords (see the short NLTK sketch after the list below).
1. Word Tokenization
o Example:
2. Sentence Tokenization
o Example:
3. Subword/Character Tokenization
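A minimal sketch of word and sentence tokenization using NLTK. It assumes the punkt tokenizer models are available; the exact resource name ('punkt' vs 'punkt_tab') can differ across NLTK versions.

import nltk
nltk.download('punkt')  # tokenizer models (one-time download; name may vary by NLTK version)

from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLP is fun. It has many applications!"
print(sent_tokenize(text))  # ['NLP is fun.', 'It has many applications!']
print(word_tokenize(text))  # ['NLP', 'is', 'fun', '.', 'It', 'has', 'many', 'applications', '!']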
1. Dictionary Lookup
3. Context-Sensitive Correction
(a) Definition
      The minimum edit distance between two strings is the smallest number of operations
       required to transform one string into another.
 Operations: insertion, deletion, substitution (each typically counted with cost 1).
(b) Example
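For example, transforming "kitten" into "sitting" needs 3 operations (substitute k→s, substitute e→i, insert g). A minimal dynamic-programming sketch with unit costs:

def min_edit_distance(source, target):
    """Levenshtein distance with unit costs for insertion, deletion, substitution."""
    m, n = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of source[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(min_edit_distance("kitten", "sitting"))  # 3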
UNIT-II
N-gram
      Word pairs like "This article", "article is", "is on", "on NLP" → bigrams
      Triplets (trigrams) or larger combinations
N-gram models predict the probability of a word given the previous n−1 words. For example,
a trigram model uses the preceding two words to predict the next word:
Goal: Calculate P(w | h), the probability that the next word is w, given the context/history h.
context/history hh.
Example: For the phrase "This article is on…", if we want to predict the likelihood of "NLP" as
the next word:
P("NLP" | "This", "article", "is", "on")
By the chain rule of probability, the joint probability of a word sequence factorizes as:
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})
Markov Assumption
To reduce complexity, N-gram models assume the probability of a word depends only on the
previous n−1 words.
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})
1. Entropy: Measures the average uncertainty (information content) of a probability distribution.
H(p) = \sum_x p(x) \cdot (-\log p(x)) = -\sum_x p(x) \log p(x)
2. Cross-Entropy: Measures how well a probability distribution predicts a sample from test
data.
H(p, q) = -\sum_x p(x) \log q(x)
3. Perplexity: Measures how well a model predicts a test sequence W of N words (lower is better); for a bigram model:
\text{Perplexity}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
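A small numeric sketch of this perplexity formula for a toy bigram model; the probabilities below are made-up values, not learned from data.

import math

# Assumed bigram probabilities P(w_i | w_{i-1}) for a toy test sentence.
bigram_probs = [0.5, 0.25, 0.25, 0.5]   # hypothetical values

N = len(bigram_probs)
log_prob = sum(math.log2(p) for p in bigram_probs)
perplexity = 2 ** (-log_prob / N)
print(perplexity)  # 2.828..., i.e. the model is about as "surprised" as a uniform choice over ~2.8 words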
Implementing N-Gram Language Modelling in NLTK
import nltk
from nltk.corpus import reuters
from nltk.util import trigrams
from collections import defaultdict

nltk.download('reuters')
nltk.download('punkt')

words = reuters.words()
tri_grams = list(trigrams(words))

# Count trigram occurrences, then convert counts to probabilities
model = defaultdict(lambda: defaultdict(int))
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

def predict_next_word(w1, w2):
    next_word_probs = model[(w1, w2)]
    if next_word_probs:
        return max(next_word_probs, key=next_word_probs.get)
    else:
        return None

print("Next Word:", predict_next_word('the', 'price'))  # the exact prediction depends on the corpus counts
Output:
Next Word: of
Advantages
 Simple and Fast: Easy to build and fast to run for small n.
Limitations
         Data Sparsity: Needs lots of data; rare n-grams are common as n increases.
      High Memory: Bigger n-gram models require lots of storage.
      Poor with Unseen Words: Struggles with new or rare words unless smoothing is
       applied.
(a) Definition
 An N-gram model predicts the next word based on the previous n−1 words.
2. Unsmoothed N-grams
(a) Definition
1. Word Classes
(a) Definition
      Word classes (also called lexical categories or parts of speech) are groups of words that
       share similar grammatical properties.
1. Nouns (N)
2. Verbs (V)
3. Adjectives (Adj)
o Describe nouns.
4. Adverbs (Adv)
5. Pronouns (Pro)
o Replace nouns.
6. Prepositions (Prep)
7. Conjunctions (Conj)
8. Determiners (Det)
o Specify nouns.
9. Interjections (Int)
       o   Express emotions.
           o     Examples: oh!, wow!, alas!
(a) Definition
      PoS tagging is the process of assigning the correct part-of-speech label to each word in
       a sentence.
 Example:
o Example: "can" = modal verb (I can swim) vs. noun (a tin can).
1. Rule-Based Tagging
o Advantages: interpretable.
o Examples:
                     JJ → Adjective, RB → Adverb
      Universal PoS Tagset (17 tags) – simpler, cross-linguistic.
 quickly → RB (Adverb)
 the → DT (Determiner)
1. Rule-Based Tagging
(a) Definition
(b) Working
1. Lexicon lookup:
(c) Example
(a) Definition
      Uses probabilistic models trained on annotated corpora to choose the most likely tag
       sequence.
(b) Types
o Goal: find the best tag sequence T for the word sequence W.
o Formula: \hat{T} = \arg\max_T P(T \mid W) \approx \arg\max_T \prod_{i} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
 HMM chooses "VB" because sequence "PRP + VB" is more probable than "PRP + NN".
(a) Definition
(b) Working
1. Initialization: Assign each word a simple tag (e.g., most frequent tag).
4. Comparative Summary
Each approach can be compared in terms of its basis, example method, strengths, and weaknesses.
5. Applications in NLP
      Brill tagger: Used historically in parsing pipelines, still inspires hybrid methods.
Issues in Part-of-Speech (PoS) Tagging
1. Introduction
      PoS Tagging = assigning each word in a sentence its correct part of speech (Noun, Verb,
       Adjective, etc.).
 Though highly useful, PoS tagging faces many linguistic and computational challenges.
(a) Ambiguity
1. Lexical Ambiguity
o Example:
2. Contextual Ambiguity
o Example:
 Common with:
 Example: “The new iPhoneX was launched” → “iPhoneX” may not be in lexicon.
      Literal tagging would assign kick (VB), the (DT), bucket (NN), but semantically it’s one
       expression.
      Taggers trained on one domain (e.g., news articles) may perform poorly on another
       (e.g., medical text, social media).
      Example:
      Different corpora use different tagsets (Penn Treebank: 45 tags, Universal Tagset: 17
       tags).
 Even when the word exists in lexicon, choosing correct PoS is difficult.
 Example:
(a) Definition
      A Hidden Markov Model is a probabilistic model used for sequence labeling tasks (like
       PoS tagging, speech recognition, NER).
 It assumes:
           1. Markov Assumption – the probability of the current state depends only on the
              previous state.
3. Transition Probabilities (P(tᵢ | tᵢ₋₁)) → Probability of moving from one tag to another.
      Goal: Find the most likely sequence of tags T = t_1, t_2, \ldots, t_n for a given word
       sequence W = w_1, w_2, \ldots, w_n.
 Formula:
\hat{T} = \arg\max_T \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
(d) Example
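A minimal Viterbi sketch for the "I can swim" example; the transition and emission probabilities below are hand-picked toy values for illustration, not estimated from a corpus.

# Toy HMM: tags and hypothetical probabilities.
tags = ["PRP", "VB", "NN"]
start_p = {"PRP": 0.7, "VB": 0.1, "NN": 0.2}
trans_p = {"PRP": {"PRP": 0.05, "VB": 0.7, "NN": 0.25},
           "VB":  {"PRP": 0.2,  "VB": 0.3, "NN": 0.5},
           "NN":  {"PRP": 0.1,  "VB": 0.6, "NN": 0.3}}
emit_p = {"PRP": {"I": 0.5},
          "VB":  {"can": 0.4, "swim": 0.4},
          "NN":  {"can": 0.2, "swim": 0.05}}

def viterbi(words):
    """Return the most probable tag sequence for the observed words."""
    V = [{}]  # V[i][tag] = (best probability, best previous tag)
    for tag in tags:
        V[0][tag] = (start_p[tag] * emit_p[tag].get(words[0], 1e-6), None)
    for i in range(1, len(words)):
        V.append({})
        for tag in tags:
            best_prev, best_prob = max(
                ((prev, V[i - 1][prev][0] * trans_p[prev][tag] * emit_p[tag].get(words[i], 1e-6))
                 for prev in tags),
                key=lambda x: x[1])
            V[i][tag] = (best_prob, best_prev)
    # Backtrack from the best final tag.
    last_tag = max(V[-1], key=lambda t: V[-1][t][0])
    sequence = [last_tag]
    for i in range(len(words) - 1, 0, -1):
        last_tag = V[i][last_tag][1]
        sequence.append(last_tag)
    return list(reversed(sequence))

print(viterbi(["I", "can", "swim"]))  # ['PRP', 'VB', 'VB'] with these toy numbers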
(a) Definition
      A Maximum Entropy Model (also called a logistic regression model for classification) is
       a discriminative probabilistic model.
      Unlike HMM, which models the joint probability P(W, T), MaxEnt models the
       conditional probability P(T | W).
      Among all possible probability distributions that fit the training data, choose the one
       with the highest entropy (least biased, most uniform).
 Example features:
o Previous/next words.
(e) Example
 Features:
3. HMM vs MaxEnt
Aspect         HMM                                    MaxEnt
Model type     Generative (models joint P(W, T))      Discriminative (models P(T | W))
Assumptions    Markov + output independence           No independence assumptions
Weaknesses     Rigid assumptions, less accurate       Expensive training, needs more data
4. Applications in NLP
 HMM:
o Speech recognition
o PoS tagging
 MaxEnt:
o PoS tagging
o Information extraction