NLP UNIT 1 NOTES
https://www.datacamp.com/blog/what-is-natural-language-processing
What is natural language processing?
Natural Language Processing (NLP) is a branch of artificial intelligence that
focuses on the interaction between computers and humans through natural
language.
It encompasses the development of algorithms and techniques to enable
computers to understand, interpret, and generate human language in a way that
is both meaningful and useful.
Components of NLP –
Natural Language Processing (NLP) can be divided into two main components:
Natural Language Understanding (NLU) and Natural Language Generation
(NLG).
Natural Language Understanding (NLU):
Definition: Natural Language Understanding focuses on enabling computers to
comprehend and interpret human language input. It involves extracting meaning
from text or speech data, understanding the intent behind the communication,
and identifying relevant information.
Key Tasks:
Text Parsing: Breaking down sentences or phrases into smaller units (tokens)
for analysis.
Entity Recognition: Identifying specific entities mentioned in the text, such as
names, dates, locations, or organizations.
Sentiment Analysis: Determining the sentiment or emotion expressed in the
text (positive, negative, or neutral).
Intent Recognition: Understanding the purpose or goal behind a user's input,
such as extracting commands or requests.
Semantic Parsing: Analyzing the structure of sentences to extract semantic
meaning and relationships between words.
Example:
Given the text "Book a table for two at the Italian restaurant downtown," NLU
would identify the intent as making a restaurant reservation, extract relevant
entities (table for two, Italian restaurant), and understand the action requested
(booking).
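A minimal, purely illustrative Python sketch of this kind of NLU step is shown below; the intent keywords, regular-expression patterns, and the returned labels are assumptions made for this example, not the API of any particular NLU library.

import re

# Toy intent detection (keyword matching) and entity extraction (regular
# expressions). Real NLU systems use trained classifiers and NER models.
INTENT_KEYWORDS = {
    "book_restaurant": ["book a table", "reserve a table"],
    "get_weather": ["weather", "forecast"],
}

def detect_intent(text):
    lowered = text.lower()
    for intent, phrases in INTENT_KEYWORDS.items():
        if any(phrase in lowered for phrase in phrases):
            return intent
    return "unknown"

def extract_entities(text):
    lowered = text.lower()
    entities = {}
    party = re.search(r"for (one|two|three|four|\d+)", lowered)
    if party:
        entities["party_size"] = party.group(1)
    cuisine = re.search(r"(italian|chinese|mexican|indian) restaurant", lowered)
    if cuisine:
        entities["cuisine"] = cuisine.group(1)
    return entities

query = "Book a table for two at the Italian restaurant downtown"
print(detect_intent(query))      # book_restaurant
print(extract_entities(query))   # {'party_size': 'two', 'cuisine': 'italian'}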
Natural Language Generation (NLG):
Definition: Natural Language Generation focuses on generating human-like text
or speech output based on input data or instructions. It involves converting
structured data or concepts into coherent and understandable language.
Key Tasks:
Text Planning: Organizing and structuring the content to be generated,
considering coherence and relevance.
Content Selection: Choosing the most relevant information or concepts to
include in the generated text.
Text Structuring: Arranging the selected content into grammatically correct
sentences and paragraphs.
Lexicalization: Choosing appropriate words and phrases to express the
intended meaning.
Surface Realization: Converting the structured representation into natural
language text or speech output.
Example:
Given structured data about a weather forecast (e.g., temperature, humidity,
chance of rain), NLG would generate a coherent and informative text like
"Tomorrow will be partly cloudy with a high of 75°F and a 20% chance of
rain."
Stages of NLP -
1. Lexical or Morphological Analysis:
This phase involves breaking down text into smaller units like paragraphs,
phrases, and words. It analyzes the structures of words and identifies their root
forms. Additionally, it assigns parts of speech tags to words based on their
grammatical functions.
Example: When analyzing the sentence "The cat sat on the mat," lexical
analysis identifies each word ("The," "cat," "sat," "on," "the," "mat"), finds
their root forms, and assigns word classes, such as tagging "cat" as a noun and
"sat" as a verb.
2. Syntax Analysis or Parsing:
Syntax analysis checks the grammatical structure of sentences and arranges
words to show their relationships. It ensures that the arrangement of words
follows the rules of grammar for the given language. This phase helps in
understanding how words combine to form meaningful sentences.
Example: If we consider the jumbled string "to New York goes John the," syntax
analysis would recognize that this arrangement of words does not follow English
grammar, whereas the reordered sentence "John goes to New York" would be
accepted as grammatically correct.
3. Semantic Analysis:
Semantic analysis focuses on understanding the meaning of words, phrases, and
sentences. It goes beyond the surface-level structure and examines the literal
and implied meanings of text. This phase ensures that the extracted meaning is
logical and coherent.
Example: In the sentence "The guava ate an apple," while the sentence may be
syntactically valid, semantic analysis recognizes that guavas cannot eat apples,
making the sentence illogical.
4. Discourse Integration:
Discourse integration establishes context by considering the meaning of
preceding and subsequent sentences. It ensures coherence and consistency
between sentences within a larger text or conversation. This phase helps in
understanding how individual sentences relate to each other and contribute to
the overall meaning.
Example: Consider the sentence "Billy bought it." Discourse integration
recognizes that the meaning of "it" depends on the context provided in
preceding sentences, which may not be explicitly stated.
5. Pragmatic Analysis:
Pragmatic analysis focuses on understanding the broader communicative and
social context of language use. It considers factors such as speaker intentions,
shared knowledge, and social norms to interpret language effectively. This
phase helps in understanding language use in various real-world situations and
contexts.
Example: When someone says "Switch on the TV," pragmatic analysis
recognizes that it's a directive or request to turn on the television, understanding
the intention behind the language use in a given situation.
NLP techniques and methods -
To analyze and understand human language, NLP employs a variety of
techniques and methods. Here are some fundamental techniques used in NLP:
Tokenization: This is the process of breaking text into words, phrases, symbols,
or other meaningful elements, known as tokens.
Parsing: Parsing involves analyzing the grammatical structure of a sentence to
extract meaning.
Lemmatization: This technique reduces words to their base or root form,
allowing for the grouping of different forms of the same word.
Named Entity Recognition (NER): NER is used to identify entities such as
persons, organizations, locations, and other named items in the text.
Sentiment analysis: This method is used to gain an understanding of the
sentiment or emotion conveyed in a piece of text.
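Several of these techniques can be seen in a single pass with an off-the-shelf pipeline. The sketch below uses spaCy and assumes the small English model has been installed (python -m spacy download en_core_web_sm); sentiment analysis is not shown because it normally needs a separate model.

import spacy

# The small English pipeline performs tokenization, lemmatization,
# POS tagging, parsing, and named entity recognition in one call.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    # each token with its lemma and coarse part-of-speech tag
    print(token.text, token.lemma_, token.pos_)

for ent in doc.ents:
    # named entities with their labels, e.g. ORG or MONEY
    print(ent.text, ent.label_)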
Application of NLP –
Chatbots and Virtual Assistants: Many businesses deploy chatbots on their
websites or messaging platforms to provide instant customer support. These
chatbots use NLP to understand user queries, provide relevant information, and
assist with tasks such as booking appointments, placing orders, or resolving
issues in real-time.
Social Media Monitoring: Companies use NLP to monitor social media
platforms in real-time to track brand mentions, analyze customer sentiment, and
identify emerging trends. This helps businesses to engage with their audience,
address customer concerns promptly, and adapt their marketing strategies in
response to real-time feedback.
Real-time Translation Services: NLP-powered translation services like Google
Translate provide real-time translation of text and speech across multiple
languages. Users can communicate with people from different linguistic
backgrounds instantly, facilitating global communication and collaboration.
Voice Search and Voice Assistants: Voice-enabled devices and applications
leverage NLP to understand spoken queries and commands in real-time. Users
can perform tasks such as searching the web, checking the weather, or
controlling smart home devices using voice commands, with responses
delivered instantly.
Real-time News Analysis: News organizations use NLP to analyze news
articles and social media feeds in real-time to identify breaking news, track
developing stories, and detect misinformation. This helps journalists and media
outlets to report news promptly and accurately to their audience.
Real-time Sentiment Analysis: Companies use NLP algorithms to monitor
real-time streams of customer feedback, reviews, and social media posts to
gauge customer sentiment towards their products or services. This enables
businesses to identify issues quickly, address customer concerns, and maintain
brand reputation.
Real-time Speech Transcription: NLP technologies enable real-time speech
transcription services, allowing users to transcribe live speeches, meetings,
lectures, or phone conversations into text. This facilitates accessibility for
individuals with hearing impairments and enables real-time note-taking during
meetings or events.
Real-time Content Moderation: Online platforms use NLP-based content
moderation systems to detect and filter out inappropriate or harmful content in
real-time, such as hate speech, harassment, or spam. This helps maintain a safe
and positive user experience for platform users.
Real-time Personalization: E-commerce platforms and content
recommendation systems use NLP to analyze user behavior and preferences in
real-time, delivering personalized product recommendations, content
suggestions, and targeted advertisements to users based on their interests and
browsing history.
Why is NLP hard?
Natural Language Processing (NLP) presents several challenges that make it a
complex and difficult field. Here are some reasons why NLP is considered hard:
Ambiguity: Natural language is inherently ambiguous. Words and phrases often
have multiple meanings, and understanding the intended meaning relies heavily
on context. For example, the word "bank" could refer to a financial institution or
the side of a river.
Variability: Language is incredibly diverse and constantly evolving. People use
language in different ways based on factors like region, culture, age, and social
context. NLP systems must be adaptable to handle this variability effectively.
Syntax Complexity: Grammar rules in natural language can be complex and
irregular. Sentences can have intricate structures with multiple clauses, and
grammar rules can vary between languages and dialects.
Lack of Formality: Unlike programming languages, natural languages often
lack strict rules and formal structure. This makes it challenging to develop
algorithms and models that accurately process and understand language.
Context Dependency: The meaning of a word or phrase can change based on
the surrounding words and the broader context of the conversation or text. NLP
systems must be able to infer context to accurately interpret language.
Data Sparsity: Natural language data is vast, but it is also sparse. NLP systems
require large amounts of annotated data for training, but collecting and
annotating such data can be time-consuming and expensive.
Pragmatic Inference: Understanding language often requires pragmatic
inference—interpreting what someone means based on shared knowledge,
social norms, and context. This aspect of language understanding is particularly
challenging for NLP systems.
Disambiguation: Resolving ambiguity in language—such as distinguishing
between homonyms or interpreting sarcasm and irony—requires sophisticated
linguistic knowledge and reasoning capabilities.
Cultural and Linguistic Diversity: NLP applications are used globally across
diverse languages and cultures. Accommodating this diversity requires building
models and systems that are sensitive to linguistic and cultural nuances.
Natural language vs Programming Language –
https://www.geeksforgeeks.org/natural-language-processingnlp-vs-programming-language
Challenges and Issues of NLP –
Data Annotation and Labeling: Creating high-quality annotated datasets for
training NLP models can be time-consuming and expensive, especially for tasks
requiring fine-grained annotations such as named entity recognition or semantic
role labeling.
Ambiguity: Natural languages are inherently ambiguous, with words and
phrases often having multiple meanings depending on context. Resolving this
ambiguity accurately is a significant challenge for NLP systems.
Syntax and Grammar: Human languages have complex grammatical rules and
syntactic structures that can vary widely across languages and dialects. NLP
systems must accurately parse and understand the grammatical structure of
sentences to extract meaning effectively.
Semantic Understanding: Understanding the meaning of words, phrases, and
sentences in context is a key challenge for NLP. Words can have different
meanings based on their context, and understanding nuances, idioms, and
figurative language adds complexity to semantic analysis.
Lack of Data: NLP models require large amounts of annotated data for
training, but labeled datasets are often scarce or expensive to obtain.
Additionally, data may be noisy, biased, or incomplete, posing challenges for
training accurate and robust models.
Domain Specificity: NLP systems trained on general datasets may not perform
well in domain-specific applications, such as healthcare or legal documents.
Adapting NLP models to specific domains requires specialized training data and
domain knowledge.
Out-of-Vocabulary Words: NLP models may encounter words or phrases that
are not present in their vocabulary, especially when dealing with domain-
specific or newly coined terms. Handling out-of-vocabulary words effectively is
crucial for maintaining the accuracy and performance of NLP systems.
Multi-linguality: NLP tasks become more complex when dealing with
multilingual or code-switching text data. Models must be able to handle diverse
languages and language varieties effectively.
Scale and Efficiency: Processing large volumes of text data efficiently is a
challenge, particularly for real-time or streaming applications. Developing
scalable and resource-efficient NLP algorithms and systems is essential for
handling big data.
Basics of Text Processing –
Tokenization –
Tokenization is a fundamental process in Natural Language Processing (NLP)
that involves breaking down text into smaller units, called tokens. These tokens
can be words, phrases, or other meaningful elements, depending on the specific
task or context.
Types of Tokenization –
1. Word Tokenization:
Word tokenization is the process of segmenting a text into individual words,
where each word is considered a token.
This process typically involves splitting the text at whitespace characters
(spaces, tabs, newlines), punctuation marks, and other delimiters.
In some cases, tokenization may also involve handling special cases such as
contractions ("don't" -> ["do", "n't"]) or hyphenated words ("self-driving" ->
["self", "-", "driving"]).
Word tokenization is often used as the initial step in NLP tasks such as text
classification, sentiment analysis, and machine translation.
Example: Consider the sentence: "Natural language processing is fascinating!"
Word Tokens: ["Natural", "language", "processing", "is", "fascinating"]
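A short sketch with NLTK's word tokenizer, assuming the punkt tokenizer data has been downloaded; note that NLTK keeps the exclamation mark as its own token.

import nltk
nltk.download("punkt")  # one-time download of the tokenizer models

from nltk.tokenize import word_tokenize

text = "Natural language processing is fascinating!"
print(word_tokenize(text))
# ['Natural', 'language', 'processing', 'is', 'fascinating', '!']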
2. Sentence Tokenization:
Sentence tokenization involves dividing a text into individual sentences, where
each sentence is treated as a separate token.
This process typically relies on identifying sentence boundaries, which can be
indicated by punctuation marks (periods, exclamation marks, question marks) or
specific sentence boundary markers.
Sentence tokenization is useful for tasks that require analyzing text at the
sentence level, such as text summarization, named entity recognition, and
information extraction.
Example: Consider the paragraph: "NLP is fascinating. It involves analyzing
and understanding text data. Sentiment analysis is one of its applications."
Sentence Tokens: ["NLP is fascinating.", "It involves analyzing and
understanding text data.", "Sentiment analysis is one of its applications."]
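The same paragraph split with NLTK's sentence tokenizer (it relies on the punkt data mentioned above):

from nltk.tokenize import sent_tokenize

paragraph = ("NLP is fascinating. It involves analyzing and understanding "
             "text data. Sentiment analysis is one of its applications.")
print(sent_tokenize(paragraph))
# ['NLP is fascinating.', 'It involves analyzing and understanding text data.',
#  'Sentiment analysis is one of its applications.']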
3. Character Tokenization:
Character tokenization breaks down the text into its constituent characters, with
each character treated as a separate token.
This approach is particularly useful for tasks that operate at the character level,
such as text generation, handwriting recognition, and spelling correction.
Character tokenization preserves the sequential order of characters in the text
and can capture fine-grained patterns and structures that may be missed by word
or sentence tokenization.
However, character tokenization typically results in a larger vocabulary size and
may require additional processing steps to handle variable-length sequences.
Example: Consider the word: "hello"
Character Tokens: ['h', 'e', 'l', 'l', 'o']
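In Python no special library is needed for this; converting a string to a list yields the character tokens:

word = "hello"
print(list(word))  # ['h', 'e', 'l', 'l', 'o']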
4. Subword Tokenization:
Subword tokenization divides the text into smaller subword units, such as
morphemes, syllables, or character n-grams.
This approach is especially beneficial for handling out-of-vocabulary words,
rare or infrequent terms, and languages with complex morphology.
Popular subword tokenization algorithms, such as Byte Pair Encoding (BPE)
and WordPiece, iteratively merge frequent character sequences to build a
vocabulary of subword units.
Example: Consider the word: "unbelievable"
Subword Tokens (using Byte Pair Encoding): ["un", "be", "lie", "v", "able"]
Explanation: Subword tokenization splits the word into smaller subword units.
In this example, the word "unbelievable" is divided into subword tokens such as
"un", "be", "lie", "v", and "able", which capture meaningful parts of the word.
Stemming –
Stemming is a text processing technique used in Natural Language Processing
(NLP) to reduce words to their base or root form, known as the stem. The
primary goal of stemming is to normalize words with similar meanings but
different inflections or variations to a common form.
How Stemming Works?
Stemming is a linguistic normalization process in natural language processing
and information retrieval. Its primary goal is to reduce words to their base or
root form, known as the stem. Stemming helps group words with similar
meanings or roots together, even if they have different inflections, prefixes, or
suffixes.
The process involves removing common affixes (prefixes, suffixes) from words,
resulting in a simplified form that represents the word’s core meaning.
Stemming is a heuristic process and may not always produce a valid dictionary
word. Still, it is effective for tasks like information retrieval, where the
focus is on matching the essential meaning of words rather than their exact
grammatical form.
For example:
Running -> Run
Jumps -> Jump
Swimming -> Swim
Stemming algorithms use various rules and heuristics to identify and remove
affixes, making them widely applicable in text-processing tasks to enhance
information retrieval and analysis.
Types of Stemming –
Porter Stemmer:
The Porter Stemmer is based on a set of heuristic rules applied sequentially to
reduce words to their stems. These rules involve removing suffixes from words
while maintaining linguistic meaning.
Example: The rule for removing the "-ing" suffix from verbs is applied to
transform "running" to "run".
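A short sketch with NLTK's Porter stemmer; the stems shown in the comment are the typical outputs.

from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ["running", "jumps", "swimming"]:
    print(word, "->", porter.stem(word))
# running -> run, jumps -> jump, swimming -> swim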
Snowball Stemmer:
The Snowball Stemmer, also known as the Porter2 Stemmer, is an extension of
the Porter Stemmer. It offers improved stemming for multiple languages and
provides more aggressive stemming.
Example: In addition to stemming English words, the Snowball Stemmer can
handle stemming for languages like French, Spanish, and German.
Example:
Original word: "universally"
Stemmed word: "univers"
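The same idea with NLTK's Snowball stemmer, which also ships rule sets for other languages; the French word below is only illustrative, and its exact stem depends on the French rules.

from nltk.stem import SnowballStemmer

english = SnowballStemmer("english")
print(english.stem("universally"))     # univers

french = SnowballStemmer("french")
print(french.stem("continuellement"))  # stem produced by the French rule set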
Lancaster Stemming in NLP
The Lancaster stemmer is one of the fastest stemming algorithms available.
Unlike the Snowball and Porter stemmers, the stems it produces are often not
intuitive: many words are heavily truncated, which sharply reduces the number
of distinct word forms. It should therefore be avoided when you need to
preserve distinctions between different words. One of its rules, for example,
converts the suffix 'ies' into 'y'. In terms of stem quality it is generally
considered less effective than the Snowball stemmer, although it runs faster.
Example:
Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox']
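A short sketch with NLTK's Lancaster stemmer; because the algorithm is aggressive, the exact stems it prints may be shorter or less readable than the list above.

from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
words = ["running", "jumped", "happily", "quickly", "foxes"]
print([lancaster.stem(w) for w in words])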
Regexp Stemmer -
The Regexp Stemmer, or Regular Expression Stemmer, is a stemming algorithm
that utilizes regular expressions to identify and remove suffixes from words. It
allows users to define custom rules for stemming by specifying patterns to
match and remove.
This method provides flexibility and control over the stemming process, making
it suitable for specific applications where custom rule-based stemming is
desired.
Example: Regular expressions can be used to match specific suffix patterns,
such as "-ing" for verbs, and remove them to produce stems.
Original Word: running
Stemmed Word: runn
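A short sketch with NLTK's RegexpStemmer; the suffix pattern and the minimum length are choices made for this example.

from nltk.stem import RegexpStemmer

# Strip the listed suffixes; min=4 leaves very short words untouched.
regexp = RegexpStemmer("ing$|s$|ed$|able$", min=4)
print(regexp.stem("running"))  # runn (only the literal "ing" is removed)
print(regexp.stem("cars"))     # car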
Lemmatization –
Lemmatization is a text normalization technique used in Natural Language
Processing (NLP) to reduce words to their base or canonical form, known as the
lemma. The primary goal of lemmatization is to transform words into their
dictionary or lexicon form, which helps in grouping together inflected or variant
forms of the same word.
How Lemmatization works?
Lemmatization is a linguistic process that involves reducing words to their base
or root form, known as the lemma. The goal is to normalize different inflected
forms of a word so that they can be analyzed or compared more easily. This is
particularly useful in natural language processing (NLP) and text analysis.
Here’s how lemmatization generally works:
Tokenization: The first step is to break down a text into individual words or
tokens. This can be done using various methods, such as splitting the text based
on spaces.
POS Tagging: Parts-of-speech tagging involves assigning a grammatical
category (like noun, verb, adjective, etc.) to each token. Lemmatization often
relies on this information, as the base form of a word can depend on its
grammatical role in a sentence.
Lemmatization: Once each word has been tokenized and assigned a part-of-
speech tag, the lemmatization algorithm uses a lexicon or linguistic rules to
determine the lemma of each word. The lemma is the base form of the word,
which may not necessarily be the same as the word’s root. For example, the
lemma of “running” is “run,” and the lemma of “better” (in the context of an
adjective) is “good.”
Applying Rules: Lemmatization algorithms often rely on linguistic rules and
patterns. For irregular verbs or words with multiple possible lemmas, these rules
help in making the correct lemmatization decision.
Output: The result of lemmatization is a set of words in their base or dictionary
form, making it easier to analyze and understand the underlying meaning of a
text.
Example:
Suppose we have a sentence:
"Chickens are laying eggs in the coop."
Now, let's perform lemmatization on this sentence:
Tokenization: Split the sentence into individual words:
["Chickens", "are", "laying", "eggs", "in", "the", "coop"]
Part-of-Speech (POS) Tagging: Determine the grammatical category of each
word:
[("Chickens", "NNS"), ("are", "VBP"), ("laying", "VBG"), ("eggs", "NNS"),
("in", "IN"), ("the", "DT"), ("coop", "NN")]
Lemmatization: Reduce each word to its base form (lemma) based on its POS
tag:
"Chickens" (NNS) -> "Chicken" (NNS)
"are" (VBP) -> "be" (VB)
"laying" (VBG) -> "lay" (VB)
"eggs" (NNS) -> "egg" (NN)
"in" (IN) -> "in" (IN)
"the" (DT) -> "the" (DT)
"coop" (NN) -> "coop" (NN)
Result: The lemmatized sentence becomes:
"Chicken be lay egg in the coop."
In this example, lemmatization transformed words like "Chickens" to "Chicken"
and "laying" to "lay" to represent their base forms. This normalization process
helps in grouping together words with similar meanings and reduces the
complexity of text data for downstream NLP tasks.
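A sketch of the same pipeline with NLTK's WordNet lemmatizer. The helper that maps Penn Treebank tags to WordNet POS labels is an illustrative assumption (WordNet only distinguishes nouns, verbs, adjectives, and adverbs), and the punkt, tagger, and wordnet data must be downloaded first.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

def to_wordnet_pos(penn_tag):
    # Map Penn Treebank tags to the coarse POS labels WordNet expects.
    if penn_tag.startswith("V"):
        return wordnet.VERB
    if penn_tag.startswith("J"):
        return wordnet.ADJ
    if penn_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("Chickens are laying eggs in the coop.")
tagged = nltk.pos_tag(tokens)
lemmas = [lemmatizer.lemmatize(w.lower(), to_wordnet_pos(t)) for w, t in tagged]
print(lemmas)
# roughly: ['chicken', 'be', 'lay', 'egg', 'in', 'the', 'coop', '.']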
Types of Lemmatization –
1. Rule Based Lemmatization-
Rule-based lemmatization involves the application of predefined rules to derive
the base or root form of a word. Unlike machine learning-based approaches,
which learn from data, rule-based lemmatization relies on linguistic rules and
patterns.
Here’s a simplified example of rule-based lemmatization for English verbs:
Rule: For regular verbs ending in “-ed,” remove the “-ed” suffix.
Example:
Word: “walked”
Rule Application: Remove “-ed”
Result: “walk
This approach extends to other verb conjugations, providing a systematic way to
obtain lemmas for regular verbs. While rule-based lemmatization may not cover
all linguistic nuances, it serves as a transparent and interpretable method for
deriving base forms in many cases.
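A minimal, purely rule-based sketch in Python; the suffix rules and the length check are illustrative assumptions, and the last example shows a typical failure of rules alone.

# Toy rule-based lemmatizer for a few regular English endings.
RULES = [("ied", "y"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def rule_lemma(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + replacement
    return word

print(rule_lemma("walked"))   # walk
print(rule_lemma("studies"))  # study
print(rule_lemma("running"))  # runn -- the doubled consonant needs an extra rule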
2. Dictionary-Based Lemmatization
Dictionary-based lemmatization relies on predefined dictionaries or lookup
tables to map words to their corresponding base forms or lemmas. Each word is
matched against the dictionary entries to find its lemma. This method is
effective for languages with well-defined rules.
Suppose we have a dictionary with lemmatized forms for some words:
‘running’ -> ‘run’
‘better’ -> ‘good’
‘went’ -> ‘go’
When we apply dictionary-based lemmatization to a text like “I was running to
become a better athlete, and then I went home,” the resulting lemmatized form
would be: “I was run to become a good athlete, and then I go home.”
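A tiny Python sketch of the lookup approach; the dictionary mirrors the three entries above and is obviously far from complete, so unknown words are passed through unchanged.

LEMMA_DICT = {"running": "run", "better": "good", "went": "go"}

def dict_lemmatize(tokens):
    # Fall back to the original token when no dictionary entry exists.
    return [LEMMA_DICT.get(token.lower(), token) for token in tokens]

sentence = "I was running to become a better athlete and then I went home"
print(" ".join(dict_lemmatize(sentence.split())))
# I was run to become a good athlete and then I go home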
3. Machine Learning-Based Lemmatization
Machine learning-based lemmatization leverages computational models to
automatically learn the relationships between words and their base forms.
Unlike rule-based or dictionary-based approaches, machine learning models,
such as neural networks or statistical models, are trained on large text datasets
to generalize patterns in language.
Example:
Consider a machine learning-based lemmatizer trained on diverse texts. When
encountering the word ‘went,’ the model, having learned patterns, predicts the
base form as ‘go.’ Similarly, for ‘happier,’ the model deduces ‘happy’ as the
lemma. The advantage lies in the model’s ability to adapt to varied linguistic
nuances and handle irregularities, making it robust for lemmatizing diverse
vocabularies.
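One way to try a trained, neural lemmatizer is the Stanza library, whose lemmatization component is a learned sequence-to-sequence model with a dictionary backoff. The sketch below assumes the English models have been downloaded; the lemmas in the comment are typical outputs rather than guaranteed ones.

import stanza

stanza.download("en")  # one-time download of the English models
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")

doc = nlp("I went home because I felt happier there.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, "->", word.lemma)
# typically: went -> go, felt -> feel, happier -> happy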
Advantages of Lemmatization with NLTK
1. Improves text analysis accuracy: Lemmatization helps in improving the
accuracy of text analysis by reducing words to their base or dictionary
form. This makes it easier to identify and analyze words that have similar
meanings.
2. Reduces data size: Since lemmatization reduces words to their base form,
it helps in reducing the data size of the text, which makes it easier to
handle large datasets.
3. Better search results: Lemmatization helps in retrieving better search
results since it reduces different forms of a word to a common base form,
making it easier to match different forms of a word in the text.
Disadvantages of Lemmatization with NLTK
1. Time-consuming: Lemmatization can be time-consuming since it
involves parsing the text and performing a lookup in a dictionary or a
database of word forms.
2. Not suitable for real-time applications: Since lemmatization is time-
consuming, it may not be suitable for real-time applications that require
quick response times.
3. May lead to ambiguity: Lemmatization may lead to ambiguity, as a single
word may have multiple meanings depending on the context in which it is
used. In such cases, the lemmatizer may not be able to determine the
correct meaning of the word.
Part Of Speech Tagging –
POS tagging is the process of assigning grammatical categories or tags to words
in a sentence based on their syntactic roles.
Common POS tags include nouns, verbs, adjectives, adverbs, pronouns,
prepositions, conjunctions, and determiners.
POS tagging is crucial for many NLP tasks, such as syntactic parsing, semantic
analysis, and information extraction.
POS taggers can be rule-based or trained using machine learning techniques,
such as hidden Markov models (HMMs) or deep learning models like recurrent
neural networks (RNNs) or transformers.
Example:
Consider the following sentence:
"The quick brown fox jumps over the lazy dog."
Now, let's perform POS tagging on this sentence:
Tokenization: Split the sentence into individual words or tokens:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
POS Tagging: Assign a specific part-of-speech tag to each word:
[("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"), ("jumps",
"VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), (".", ".")]
In this example, each word in the sentence has been assigned a specific part-of-
speech tag based on its grammatical function within the sentence. For instance,
"quick" and "brown" are tagged as adjectives (JJ), "fox" and "dog" are tagged as
nouns (NN), "jumps" is tagged as a verb (VBZ), "over" is tagged as a
preposition (IN), and so on.
POS tagging provides valuable linguistic information about the structure of the
text, which can be used in various NLP applications. It helps in syntactic
analysis, semantic understanding, and information extraction from text data.
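The same example with NLTK's default tagger; the punkt and tagger data must be downloaded once, and the exact tags can vary slightly between model versions.

import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))
# expected to be close to the tag sequence listed above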
Types of POS Tagging:
Rule-Based POS Tagging:
Rule-based POS tagging relies on manually crafted linguistic rules to assign
tags to words based on their morphological, syntactic, and contextual features.
These rules are typically derived from linguistic theories and observations.
Example: Assigning the tag "NN" (noun) to words ending in "-tion" or "-ment"
(e.g., "information," "government").
Stochastic/Probabilistic POS Tagging:
Stochastic or probabilistic POS tagging employs statistical models trained on
annotated corpora to predict the most likely tag for each word based on its
context. These models estimate the probability distribution of tags given the
observed words.
Example: Hidden Markov Models (HMMs) and Conditional Random Fields
(CRFs) are common probabilistic models used for POS tagging.
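A short sketch that trains an HMM tagger on NLTK's sample of the Penn Treebank; the corpus must be downloaded, and the training split used here is an arbitrary choice for illustration.

import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

nltk.download("treebank")

tagged_sents = treebank.tagged_sents()
train = tagged_sents[:3000]

trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train)

# Tags for a new sentence; words unseen in training can receive poor tags
# with a plain HMM unless smoothing is added.
print(hmm_tagger.tag(["The", "dog", "barked", "."]))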
Lexical-Based POS Tagging:
Lexical-based POS tagging involves using lexical resources such as dictionaries
or lexicons to assign tags to words based on their known meanings or
properties.
Example: Assigning the tag "NNP" (proper noun) to words listed in a dictionary
as proper nouns (e.g., "John," "London").
Transformation-Based POS Tagging:
Transformation-based POS tagging employs machine learning algorithms to
learn transformation rules that map observed word sequences to their
corresponding tag sequences. These rules are derived from training data and
iteratively refined to improve tagging accuracy.
Example: Learning rules that map word suffixes to POS tags (e.g., "-ed" → past
tense verb).
Hybrid POS Tagging:
Hybrid POS tagging combines multiple approaches, such as rule-based,
stochastic, and lexical-based methods, to leverage their respective strengths and
improve tagging accuracy.
Example: Integrating rule-based patterns with statistical models to handle
complex linguistic phenomena more effectively.
Deep Learning-Based POS Tagging:
Deep learning-based POS tagging utilizes neural network architectures, such as
recurrent neural networks (RNNs), convolutional neural networks (CNNs), or
transformers, to automatically learn features and representations from raw text
data for tagging.
Example: Training a neural network model to predict POS tags directly from
word embeddings or character-level representations.
Why is POS Tagging needed?
Syntactic Parsing: POS tags provide essential information about the
grammatical structure of sentences, enabling syntactic parsers to analyze and
understand the relationships between words in a sentence. Syntactic parsing
involves identifying phrases, clauses, and dependencies within sentences, which
is essential for tasks like information extraction, question answering, and
machine translation.
Disambiguation: Many words in natural language are ambiguous and can have
different meanings depending on their context. POS tagging helps disambiguate
these words by assigning them the appropriate grammatical category based on
their context. For example, the word "book" can be a noun (e.g., "I read a
book") or a verb (e.g., "I will book a flight"), and POS tagging helps determine
its correct usage in a sentence.
Morphological Analysis: POS tagging provides information about the
morphology of words, such as their tense, number, gender, and case. This
information is useful for analyzing word forms and inflections, which is
important for tasks like morphological generation and morphological
disambiguation.
Word Sense Disambiguation (WSD): POS tags serve as an important feature
for word sense disambiguation, which involves determining the intended
meaning of ambiguous words based on their context. WSD is crucial for tasks
like machine translation, information retrieval, and semantic analysis.