NLP UNIT 1 NOTES
https://www.datacamp.com/blog/what-is-natural-language-processing
What is natural language processing?
Natural Language Processing (NLP) is a branch of artificial intelligence that
focuses on the interaction between computers and humans through natural
language.
It encompasses the development of algorithms and techniques to enable
computers to understand, interpret, and generate human language in a way that
is both meaningful and useful.
Components of NLP –
Natural Language Processing (NLP) can be divided into two main components:
Natural Language Understanding (NLU) and Natural Language Generation
(NLG).
Natural Language Understanding (NLU):
Definition: Natural Language Understanding focuses on enabling computers to
comprehend and interpret human language input. It involves extracting meaning
from text or speech data, understanding the intent behind the communication,
and identifying relevant information.
Key Tasks:
Text Parsing: Breaking down sentences or phrases into smaller units (tokens)
for analysis.
Entity Recognition: Identifying specific entities mentioned in the text, such as
names, dates, locations, or organizations.
Sentiment Analysis: Determining the sentiment or emotion expressed in the
text (positive, negative, or neutral).
Intent Recognition: Understanding the purpose or goal behind a user's input,
such as extracting commands or requests.
Semantic Parsing: Analyzing the structure of sentences to extract semantic
meaning and relationships between words.
Example:
Given the text "Book a table for two at the Italian restaurant downtown," NLU
would identify the intent as making a restaurant reservation, extract relevant
entities (table for two, Italian restaurant), and understand the action requested
(booking).
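A minimal, purely illustrative Python sketch of this kind of NLU step is shown below; the intent keywords, regular-expression patterns, and the returned labels are assumptions made for this example, not the API of any particular NLU library.

import re

# Toy intent detection (keyword matching) and entity extraction (regular
# expressions). Real NLU systems use trained classifiers and NER models.
INTENT_KEYWORDS = {
    "book_restaurant": ["book a table", "reserve a table"],
    "get_weather": ["weather", "forecast"],
}

def detect_intent(text):
    lowered = text.lower()
    for intent, phrases in INTENT_KEYWORDS.items():
        if any(phrase in lowered for phrase in phrases):
            return intent
    return "unknown"

def extract_entities(text):
    lowered = text.lower()
    entities = {}
    party = re.search(r"for (one|two|three|four|\d+)", lowered)
    if party:
        entities["party_size"] = party.group(1)
    cuisine = re.search(r"(italian|chinese|mexican|indian) restaurant", lowered)
    if cuisine:
        entities["cuisine"] = cuisine.group(1)
    return entities

query = "Book a table for two at the Italian restaurant downtown"
print(detect_intent(query))      # book_restaurant
print(extract_entities(query))   # {'party_size': 'two', 'cuisine': 'italian'}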
Natural Language Generation (NLG):
Definition: Natural Language Generation focuses on generating human-like text
or speech output based on input data or instructions. It involves converting
structured data or concepts into coherent and understandable language.
Key Tasks:
Text Planning: Organizing and structuring the content to be generated,
considering coherence and relevance.
Content Selection: Choosing the most relevant information or concepts to
include in the generated text.
Text Structuring: Arranging the selected content into grammatically correct
sentences and paragraphs.
Lexicalization: Choosing appropriate words and phrases to express the
intended meaning.
Surface Realization: Converting the structured representation into natural
language text or speech output.
Example:
Given structured data about a weather forecast (e.g., temperature, humidity,
chance of rain), NLG would generate a coherent and informative text like
"Tomorrow will be partly cloudy with a high of 75°F and a 20% chance of
rain."
Stages of NLP -
1. Lexical or Morphological Analysis:
This phase involves breaking down text into smaller units like paragraphs,
phrases, and words. It analyzes the structures of words and identifies their root
forms. Additionally, it assigns parts of speech tags to words based on their
grammatical functions.
Example: When analyzing the sentence "The cat sat on the mat," lexical
analysis identifies each word ("The," "cat," "sat," "on," "the," "mat"), finds
their root forms, and assigns word classes, such as tagging "cat" as a noun and
"sat" as a verb.
2. Syntax Analysis or Parsing:
Syntax analysis checks the grammatical structure of sentences and arranges
words to show their relationships. It ensures that the arrangement of words
follows the rules of grammar for the given language. This phase helps in
understanding how words combine to form meaningful sentences.
Example: If we consider the jumbled string "to New York goes John the," syntax
analysis would recognize that this arrangement of words does not follow English
grammar, whereas the reordered sentence "John goes to New York" would be
accepted as grammatically correct.
3. Semantic Analysis:
Semantic analysis focuses on understanding the meaning of words, phrases, and
sentences. It goes beyond the surface-level structure and examines the literal
and implied meanings of text. This phase ensures that the extracted meaning is
logical and coherent.
Example: In the sentence "The guava ate an apple," while the sentence may be
syntactically valid, semantic analysis recognizes that guavas cannot eat apples,
making the sentence illogical.
4. Discourse Integration:
Discourse integration establishes context by considering the meaning of
preceding and subsequent sentences. It ensures coherence and consistency
between sentences within a larger text or conversation. This phase helps in
understanding how individual sentences relate to each other and contribute to
the overall meaning.
Example: Consider the sentence "Billy bought it." Discourse integration
recognizes that the meaning of "it" depends on the context provided in
preceding sentences, which may not be explicitly stated.
5. Pragmatic Analysis:
Pragmatic analysis focuses on understanding the broader communicative and
social context of language use. It considers factors such as speaker intentions,
shared knowledge, and social norms to interpret language effectively. This
phase helps in understanding language use in various real-world situations and
contexts.
Example: When someone says "Switch on the TV," pragmatic analysis
recognizes that it's a directive or request to turn on the television, understanding
the intention behind the language use in a given situation.
NLP techniques and methods -
To analyze and understand human language, NLP employs a variety of
techniques and methods. Here are some fundamental techniques used in NLP:
Tokenization: This is the process of breaking text into words, phrases, symbols,
or other meaningful elements, known as tokens.
Parsing: Parsing involves analyzing the grammatical structure of a sentence to
extract meaning.
Lemmatization: This technique reduces words to their base or root form,
allowing for the grouping of different forms of the same word.
Named Entity Recognition (NER): NER is used to identify entities such as
persons, organizations, locations, and other named items in the text.
Sentiment analysis: This method is used to gain an understanding of the
sentiment or emotion conveyed in a piece of text.
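Several of these techniques can be seen in a single pass with an off-the-shelf pipeline. The sketch below uses spaCy and assumes the small English model has been installed (python -m spacy download en_core_web_sm); sentiment analysis is not shown because it normally needs a separate model.

import spacy

# The small English pipeline performs tokenization, lemmatization,
# POS tagging, parsing, and named entity recognition in one call.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    # each token with its lemma and coarse part-of-speech tag
    print(token.text, token.lemma_, token.pos_)

for ent in doc.ents:
    # named entities with their labels, e.g. ORG or MONEY
    print(ent.text, ent.label_)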
Application of NLP –
Chatbots and Virtual Assistants: Many businesses deploy chatbots on their
websites or messaging platforms to provide instant customer support. These
chatbots use NLP to understand user queries, provide relevant information, and
assist with tasks such as booking appointments, placing orders, or resolving
issues in real-time.
Social Media Monitoring: Companies use NLP to monitor social media
platforms in real-time to track brand mentions, analyze customer sentiment, and
identify emerging trends. This helps businesses to engage with their audience,
address customer concerns promptly, and adapt their marketing strategies in
response to real-time feedback.
Real-time Translation Services: NLP-powered translation services like Google
Translate provide real-time translation of text and speech across multiple
languages. Users can communicate with people from different linguistic
backgrounds instantly, facilitating global communication and collaboration.
Voice Search and Voice Assistants: Voice-enabled devices and applications
leverage NLP to understand spoken queries and commands in real-time. Users
can perform tasks such as searching the web, checking the weather, or
controlling smart home devices using voice commands, with responses
delivered instantly.
Real-time News Analysis: News organizations use NLP to analyze news
articles and social media feeds in real-time to identify breaking news, track
developing stories, and detect misinformation. This helps journalists and media
outlets to report news promptly and accurately to their audience.
Real-time Sentiment Analysis: Companies use NLP algorithms to monitor
real-time streams of customer feedback, reviews, and social media posts to
gauge customer sentiment towards their products or services. This enables
businesses to identify issues quickly, address customer concerns, and maintain
brand reputation.
Real-time Speech Transcription: NLP technologies enable real-time speech
transcription services, allowing users to transcribe live speeches, meetings,
lectures, or phone conversations into text. This facilitates accessibility for
individuals with hearing impairments and enables real-time note-taking during
meetings or events.
Real-time Content Moderation: Online platforms use NLP-based content
moderation systems to detect and filter out inappropriate or harmful content in
real-time, such as hate speech, harassment, or spam. This helps maintain a safe
and positive user experience for platform users.
Real-time Personalization: E-commerce platforms and content
recommendation systems use NLP to analyze user behavior and preferences in
real-time, delivering personalized product recommendations, content
suggestions, and targeted advertisements to users based on their interests and
browsing history.
Why is NLP hard?
Natural Language Processing (NLP) presents several challenges that make it a
complex and difficult field. Here are some reasons why NLP is considered hard:
Ambiguity: Natural language is inherently ambiguous. Words and phrases often
have multiple meanings, and understanding the intended meaning relies heavily
on context. For example, the word "bank" could refer to a financial institution or
the side of a river.
Variability: Language is incredibly diverse and constantly evolving. People use
language in different ways based on factors like region, culture, age, and social
context. NLP systems must be adaptable to handle this variability effectively.
Syntax Complexity: Grammar rules in natural language can be complex and
irregular. Sentences can have intricate structures with multiple clauses, and
grammar rules can vary between languages and dialects.
Lack of Formality: Unlike programming languages, natural languages often
lack strict rules and formal structure. This makes it challenging to develop
algorithms and models that accurately process and understand language.
Context Dependency: The meaning of a word or phrase can change based on
the surrounding words and the broader context of the conversation or text. NLP
systems must be able to infer context to accurately interpret language.
Data Sparsity: Natural language data is vast, but it is also sparse. NLP systems
require large amounts of annotated data for training, but collecting and
annotating such data can be time-consuming and expensive.
Pragmatic Inference: Understanding language often requires pragmatic
inference—interpreting what someone means based on shared knowledge,
social norms, and context. This aspect of language understanding is particularly
challenging for NLP systems.
Disambiguation: Resolving ambiguity in language—such as distinguishing
between homonyms or interpreting sarcasm and irony—requires sophisticated
linguistic knowledge and reasoning capabilities.
Cultural and Linguistic Diversity: NLP applications are used globally across
diverse languages and cultures. Accommodating this diversity requires building
models and systems that are sensitive to linguistic and cultural nuances.
Natural language vs Programming Language –
https://www.geeksforgeeks.org/natural-language-processingnlp-vs-programming-language
Challenges and Issues of NLP –
Data Annotation and Labeling: Creating high-quality annotated datasets for
training NLP models can be time-consuming and expensive, especially for tasks
requiring fine-grained annotations such as named entity recognition or semantic
role labeling.
Ambiguity: Natural languages are inherently ambiguous, with words and
phrases often having multiple meanings depending on context. Resolving this
ambiguity accurately is a significant challenge for NLP systems.
Syntax and Grammar: Human languages have complex grammatical rules and
syntactic structures that can vary widely across languages and dialects. NLP
systems must accurately parse and understand the grammatical structure of
sentences to extract meaning effectively.
Semantic Understanding: Understanding the meaning of words, phrases, and
sentences in context is a key challenge for NLP. Words can have different
meanings based on their context, and understanding nuances, idioms, and
figurative language adds complexity to semantic analysis.
Lack of Data: NLP models require large amounts of annotated data for
training, but labeled datasets are often scarce or expensive to obtain.
Additionally, data may be noisy, biased, or incomplete, posing challenges for
training accurate and robust models.
Domain Specificity: NLP systems trained on general datasets may not perform
well in domain-specific applications, such as healthcare or legal documents.
Adapting NLP models to specific domains requires specialized training data and
domain knowledge.
Out-of-Vocabulary Words: NLP models may encounter words or phrases that
are not present in their vocabulary, especially when dealing with domain-
specific or newly coined terms. Handling out-of-vocabulary words effectively is
crucial for maintaining the accuracy and performance of NLP systems.
Multi-linguality: NLP tasks become more complex when dealing with
multilingual or code-switching text data. Models must be able to handle diverse
languages and language varieties effectively.
Scale and Efficiency: Processing large volumes of text data efficiently is a
challenge, particularly for real-time or streaming applications. Developing
scalable and resource-efficient NLP algorithms and systems is essential for
handling big data.
Basics of Text Processing –
Tokenization –
Tokenization is a fundamental process in Natural Language Processing (NLP)
that involves breaking down text into smaller units, called tokens. These tokens
can be words, phrases, or other meaningful elements, depending on the specific
task or context.
Types of Tokenization –
1. Word Tokenization:
Word tokenization is the process of segmenting a text into individual words,
where each word is considered a token.
This process typically involves splitting the text at whitespace characters
(spaces, tabs, newlines), punctuation marks, and other delimiters.
In some cases, tokenization may also involve handling special cases such as
contractions ("don't" -> ["do", "n't"]) or hyphenated words ("self-driving" ->
["self", "-", "driving"]).
Word tokenization is often used as the initial step in NLP tasks such as text
classification, sentiment analysis, and machine translation.
Example: Consider the sentence: "Natural language processing is fascinating!"
Word Tokens: ["Natural", "language", "processing", "is", "fascinating"]
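A short sketch with NLTK's word tokenizer, assuming the punkt tokenizer data has been downloaded; note that NLTK keeps the exclamation mark as its own token.

import nltk
nltk.download("punkt")  # one-time download of the tokenizer models

from nltk.tokenize import word_tokenize

text = "Natural language processing is fascinating!"
print(word_tokenize(text))
# ['Natural', 'language', 'processing', 'is', 'fascinating', '!']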
2. Sentence Tokenization:
Sentence tokenization involves dividing a text into individual sentences, where
each sentence is treated as a separate token.
This process typically relies on identifying sentence boundaries, which can be
indicated by punctuation marks (periods, exclamation marks, question marks) or
specific sentence boundary markers.
Sentence tokenization is useful for tasks that require analyzing text at the
sentence level, such as text summarization, named entity recognition, and
information extraction.
Example: Consider the paragraph: "NLP is fascinating. It involves analyzing
and understanding text data. Sentiment analysis is one of its applications."
Sentence Tokens: ["NLP is fascinating.", "It involves analyzing and
understanding text data.", "Sentiment analysis is one of its applications."]
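The same paragraph split with NLTK's sentence tokenizer (it relies on the punkt data mentioned above):

from nltk.tokenize import sent_tokenize

paragraph = ("NLP is fascinating. It involves analyzing and understanding "
             "text data. Sentiment analysis is one of its applications.")
print(sent_tokenize(paragraph))
# ['NLP is fascinating.', 'It involves analyzing and understanding text data.',
#  'Sentiment analysis is one of its applications.']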
3. Character Tokenization:
Character tokenization breaks down the text into its constituent characters, with
each character treated as a separate token.
This approach is particularly useful for tasks that operate at the character level,
such as text generation, handwriting recognition, and spelling correction.
Character tokenization preserves the sequential order of characters in the text
and can capture fine-grained patterns and structures that may be missed by word
or sentence tokenization.
However, character tokenization typically results in a larger vocabulary size and
may require additional processing steps to handle variable-length sequences.
Example: Consider the word: "hello"
Character Tokens: ['h', 'e', 'l', 'l', 'o']
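In Python no special library is needed for this; converting a string to a list yields the character tokens:

word = "hello"
print(list(word))  # ['h', 'e', 'l', 'l', 'o']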
4. Subword Tokenization:
Subword tokenization divides the text into smaller subword units, such as
morphemes, syllables, or character n-grams.
This approach is especially beneficial for handling out-of-vocabulary words,
rare or infrequent terms, and languages with complex morphology.
Popular subword tokenization algorithms, such as Byte Pair Encoding (BPE)
and WordPiece, iteratively merge frequent character sequences to build a
vocabulary of subword units.
Example: Consider the word: "unbelievable"
Subword Tokens (using Byte Pair Encoding): ["un", "be", "lie", "v", "able"]
Explanation: Subword tokenization splits the word into smaller subword units.
In this example, the word "unbelievable" is divided into subword tokens such as
"un", "be", "lie", "v", and "able", which capture meaningful parts of the word.
Stemming –
Stemming is a text processing technique used in Natural Language Processing
(NLP) to reduce words to their base or root form, known as the stem. The
primary goal of stemming is to normalize words with similar meanings but
different inflections or variations to a common form.
How Stemming Works?
Stemming is a linguistic normalization process in natural language processing
and information retrieval. Its primary goal is to reduce words to their base or
root form, known as the stem. Stemming helps group words with similar
meanings or roots together, even if they have different inflections, prefixes, or
suffixes.
The process involves removing common affixes (prefixes, suffixes) from words,
resulting in a simplified form that represents the word’s core meaning.
Stemming is a heuristic process and may not always produce a valid dictionary
word. Still, it is effective for tasks like information retrieval, where the
focus is on matching the essential meaning of words rather than their exact
grammatical form.
For example:
Running -> Run
Jumps -> Jump
Swimming -> Swim
Stemming algorithms use various rules and heuristics to identify and remove
affixes, making them widely applicable in text-processing tasks to enhance
information retrieval and analysis.
Types of Stemming –
Porter Stemmer:
The Porter Stemmer is based on a set of heuristic rules applied sequentially to
reduce words to their stems. These rules involve removing suffixes from words
while maintaining linguistic meaning.
Example: The rule for removing the "-ing" suffix from verbs is applied to
transform "running" to "run".
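A short sketch with NLTK's Porter stemmer; the stems shown in the comment are the typical outputs.

from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ["running", "jumps", "swimming"]:
    print(word, "->", porter.stem(word))
# running -> run, jumps -> jump, swimming -> swim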
Snowball Stemmer:
The Snowball Stemmer, also known as the Porter2 Stemmer, is an extension of
the Porter Stemmer. It offers improved stemming for multiple languages and
provides more aggressive stemming.
Example: In addition to stemming English words, the Snowball Stemmer can
handle stemming for languages like French, Spanish, and German.
Example:
Original word: "universally"
Stemmed word: "univers"
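The same idea with NLTK's Snowball stemmer, which also ships rule sets for other languages; the French word below is only illustrative, and its exact stem depends on the French rules.

from nltk.stem import SnowballStemmer

english = SnowballStemmer("english")
print(english.stem("universally"))     # univers

french = SnowballStemmer("french")
print(french.stem("continuellement"))  # stem produced by the French rule set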
Lancaster Stemming in NLP
The Lancaster stemmer is one of the fastest stemming algorithms available.
Unlike the Snowball and Porter stemmers, the stems it produces are often not
intuitive: many words are heavily truncated, which sharply reduces the number
of distinct word forms. It should therefore be avoided when you need to
preserve distinctions between different words. One of its rules, for example,
converts the suffix 'ies' into 'y'. In terms of stem quality it is generally
considered less effective than the Snowball stemmer, although it runs faster.
Example:
Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox']
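A short sketch with NLTK's Lancaster stemmer; because the algorithm is aggressive, the exact stems it prints may be shorter or less readable than the list above.

from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
words = ["running", "jumped", "happily", "quickly", "foxes"]
print([lancaster.stem(w) for w in words])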
Regexp Stemmer -
The Regexp Stemmer, or Regular Expression Stemmer, is a stemming algorithm
that utilizes regular expressions to identify and remove suffixes from words. It
allows users to define custom rules for stemming by specifying patterns to
match and remove.
This method provides flexibility and control over the stemming process, making
it suitable for specific applications where custom rule-based stemming is
desired.
Example: Regular expressions can be used to match specific suffix patterns,
such as "-ing" for verbs, and remove them to produce stems.
Original Word: running
Stemmed Word: runn
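A short sketch with NLTK's RegexpStemmer; the suffix pattern and the minimum length are choices made for this example.

from nltk.stem import RegexpStemmer

# Strip the listed suffixes; min=4 leaves very short words untouched.
regexp = RegexpStemmer("ing$|s$|ed$|able$", min=4)
print(regexp.stem("running"))  # runn (only the literal "ing" is removed)
print(regexp.stem("cars"))     # car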
Lemmatization –
Lemmatization is a text normalization technique used in Natural Language
Processing (NLP) to reduce words to their base or canonical form, known as the
lemma. The primary goal of lemmatization is to transform words into their
dictionary or lexicon form, which helps in grouping together inflected or variant
forms of the same word.
How Lemmatization works?
Lemmatization is a linguistic process that involves reducing words to their base
or root form, known as the lemma. The goal is to normalize different inflected
forms of a word so that they can be analyzed or compared more easily. This is
particularly useful in natural language processing (NLP) and text analysis.
Here’s how lemmatization generally works:
Tokenization: The first step is to break down a text into individual words or
tokens. This can be done using various methods, such as splitting the text based
on spaces.
POS Tagging: Parts-of-speech tagging involves assigning a grammatical
category (like noun, verb, adjective, etc.) to each token. Lemmatization often
relies on this information, as the base form of a word can depend on its
grammatical role in a sentence.
Lemmatization: Once each word has been tokenized and assigned a part-of-
speech tag, the lemmatization algorithm uses a lexicon or linguistic rules to
determine the lemma of each word. The lemma is the base form of the word,
which may not necessarily be the same as the word’s root. For example, the
lemma of “running” is “run,” and the lemma of “better” (in the context of an
adjective) is “good.”
Applying Rules: Lemmatization algorithms often rely on linguistic rules and
patterns. For irregular verbs or words with multiple possible lemmas, these rules
help in making the correct lemmatization decision.
Output: The result of lemmatization is a set of words in their base or dictionary
form, making it easier to analyze and understand the underlying meaning of a
text.
Example:
Suppose we have a sentence:
"Chickens are laying eggs in the coop."
Now, let's perform lemmatization on this sentence:
Tokenization: Split the sentence into individual words:
["Chickens", "are", "laying", "eggs", "in", "the", "coop"]
Part-of-Speech (POS) Tagging: Determine the grammatical category of each
word:
[("Chickens", "NNS"), ("are", "VBP"), ("laying", "VBG"), ("eggs", "NNS"),
("in", "IN"), ("the", "DT"), ("coop", "NN")]
Lemmatization: Reduce each word to its base form (lemma) based on its POS
tag:
"Chickens" (NNS) -> "Chicken" (NNS)
"are" (VBP) -> "be" (VB)
"laying" (VBG) -> "lay" (VB)
"eggs" (NNS) -> "egg" (NN)
"in" (IN) -> "in" (IN)
"the" (DT) -> "the" (DT)
"coop" (NN) -> "coop" (NN)
Result: The lemmatized sentence becomes:
"Chicken be lay egg in the coop."
In this example, lemmatization transformed words like "Chickens" to "Chicken"
and "laying" to "lay" to represent their base forms. This normalization process
helps in grouping together words with similar meanings and reduces the
complexity of text data for downstream NLP tasks.
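A sketch of the same pipeline with NLTK's WordNet lemmatizer. The helper that maps Penn Treebank tags to WordNet POS labels is an illustrative assumption (WordNet only distinguishes nouns, verbs, adjectives, and adverbs), and the punkt, tagger, and wordnet data must be downloaded first.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

def to_wordnet_pos(penn_tag):
    # Map Penn Treebank tags to the coarse POS labels WordNet expects.
    if penn_tag.startswith("V"):
        return wordnet.VERB
    if penn_tag.startswith("J"):
        return wordnet.ADJ
    if penn_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("Chickens are laying eggs in the coop.")
tagged = nltk.pos_tag(tokens)
lemmas = [lemmatizer.lemmatize(w.lower(), to_wordnet_pos(t)) for w, t in tagged]
print(lemmas)
# roughly: ['chicken', 'be', 'lay', 'egg', 'in', 'the', 'coop', '.']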
Types of Lemmatization –
1. Rule Based Lemmatization-
Rule-based lemmatization involves the application of predefined rules to derive
the base or root form of a word. Unlike machine learning-based approaches,
which learn from data, rule-based lemmatization relies on linguistic rules and
patterns.
Here’s a simplified example of rule-based lemmatization for English verbs:
Rule: For regular verbs ending in “-ed,” remove the “-ed” suffix.
Example:
Word: “walked”
Rule Application: Remove “-ed”
Result: “walk
This approach extends to other verb conjugations, providing a systematic way to
obtain lemmas for regular verbs. While rule-based lemmatization may not cover
all linguistic nuances, it serves as a transparent and interpretable method for
deriving base forms in many cases.
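A minimal, purely rule-based sketch in Python; the suffix rules and the length check are illustrative assumptions, and the last example shows a typical failure of rules alone.

# Toy rule-based lemmatizer for a few regular English endings.
RULES = [("ied", "y"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def rule_lemma(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + replacement
    return word

print(rule_lemma("walked"))   # walk
print(rule_lemma("studies"))  # study
print(rule_lemma("running"))  # runn -- the doubled consonant needs an extra rule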
2. Dictionary-Based Lemmatization
Dictionary-based lemmatization relies on predefined dictionaries or lookup
tables to map words to their corresponding base forms or lemmas. Each word is
matched against the dictionary entries to find its lemma. This method is
effective for languages with well-defined rules.
Suppose we have a dictionary with lemmatized forms for some words:
‘running’ -> ‘run’
‘better’ -> ‘good’
‘went’ -> ‘go’
When we apply dictionary-based lemmatization to a text like “I was running to
become a better athlete, and then I went home,” the resulting lemmatized form
would be: “I was run to become a good athlete, and then I go home.”
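A tiny Python sketch of the lookup approach; the dictionary mirrors the three entries above and is obviously far from complete, so unknown words are passed through unchanged.

LEMMA_DICT = {"running": "run", "better": "good", "went": "go"}

def dict_lemmatize(tokens):
    # Fall back to the original token when no dictionary entry exists.
    return [LEMMA_DICT.get(token.lower(), token) for token in tokens]

sentence = "I was running to become a better athlete and then I went home"
print(" ".join(dict_lemmatize(sentence.split())))
# I was run to become a good athlete and then I go home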
3. Machine Learning-Based Lemmatization
Machine learning-based lemmatization leverages computational models to
automatically learn the relationships between words and their base forms.
Unlike rule-based or dictionary-based approaches, machine learning models,
such as neural networks or statistical models, are trained on large text datasets
to generalize patterns in language.
Example:
Consider a machine learning-based lemmatizer trained on diverse texts. When
encountering the word ‘went,’ the model, having learned patterns, predicts the
base form as ‘go.’ Similarly, for ‘happier,’ the model deduces ‘happy’ as the
lemma. The advantage lies in the model’s ability to adapt to varied linguistic
nuances and handle irregularities, making it robust for lemmatizing diverse
vocabularies.
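One way to try a trained, neural lemmatizer is the Stanza library, whose lemmatization component is a learned sequence-to-sequence model with a dictionary backoff. The sketch below assumes the English models have been downloaded; the lemmas in the comment are typical outputs rather than guaranteed ones.

import stanza

stanza.download("en")  # one-time download of the English models
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")

doc = nlp("I went home because I felt happier there.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, "->", word.lemma)
# typically: went -> go, felt -> feel, happier -> happy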
Advantages of Lemmatization with NLTK
1. Improves text analysis accuracy: Lemmatization helps in improving the
accuracy of text analysis by reducing words to their base or dictionary
form. This makes it easier to identify and analyze words that have similar
meanings.
2. Reduces data size: Since lemmatization reduces words to their base form,
it helps in reducing the data size of the text, which makes it easier to
handle large datasets.
3. Better search results: Lemmatization helps in retrieving better search
results since it reduces different forms of a word to a common base form,
making it easier to match different forms of a word in the text.
Disadvantages of Lemmatization with NLTK
1. Time-consuming: Lemmatization can be time-consuming since it
involves parsing the text and performing a lookup in a dictionary or a
database of word forms.
2. Not suitable for real-time applications: Since lemmatization is time-
consuming, it may not be suitable for real-time applications that require
quick response times.
3. May lead to ambiguity: Lemmatization may lead to ambiguity, as a single
word may have multiple meanings depending on the context in which it is
used. In such cases, the lemmatizer may not be able to determine the
correct meaning of the word.
Part Of Speech Tagging –
POS tagging is the process of assigning grammatical categories or tags to words
in a sentence based on their syntactic roles.
Common POS tags include nouns, verbs, adjectives, adverbs, pronouns,
prepositions, conjunctions, and determiners.
POS tagging is crucial for many NLP tasks, such as syntactic parsing, semantic
analysis, and information extraction.
POS taggers can be rule-based or trained using machine learning techniques,
such as hidden Markov models (HMMs) or deep learning models like recurrent
neural networks (RNNs) or transformers.
Example:
Consider the following sentence:
"The quick brown fox jumps over the lazy dog."
Now, let's perform POS tagging on this sentence:
Tokenization: Split the sentence into individual words or tokens:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
POS Tagging: Assign a specific part-of-speech tag to each word:
[("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"), ("jumps",
"VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), (".", ".")]
In this example, each word in the sentence has been assigned a specific part-of-
speech tag based on its grammatical function within the sentence. For instance,
"quick" and "brown" are tagged as adjectives (JJ), "fox" and "dog" are tagged as
nouns (NN), "jumps" is tagged as a verb (VBZ), "over" is tagged as a
preposition (IN), and so on.
POS tagging provides valuable linguistic information about the structure of the
text, which can be used in various NLP applications. It helps in syntactic
analysis, semantic understanding, and information extraction from text data.
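The same example with NLTK's default tagger; the punkt and tagger data must be downloaded once, and the exact tags can vary slightly between model versions.

import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))
# expected to be close to the tag sequence listed above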
Types of POS Tagging:
Rule-Based POS Tagging:
Rule-based POS tagging relies on manually crafted linguistic rules to assign
tags to words based on their morphological, syntactic, and contextual features.
These rules are typically derived from linguistic theories and observations.
Example: Assigning the tag "NN" (noun) to words ending in "-tion" or "-ment"
(e.g., "information," "government").
Stochastic/Probabilistic POS Tagging:
Stochastic or probabilistic POS tagging employs statistical models trained on
annotated corpora to predict the most likely tag for each word based on its
context. These models estimate the probability distribution of tags given the
observed words.
Example: Hidden Markov Models (HMMs) and Conditional Random Fields
(CRFs) are common probabilistic models used for POS tagging.
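A short sketch that trains an HMM tagger on NLTK's sample of the Penn Treebank; the corpus must be downloaded, and the training split used here is an arbitrary choice for illustration.

import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

nltk.download("treebank")

tagged_sents = treebank.tagged_sents()
train = tagged_sents[:3000]

trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train)

# Tags for a new sentence; words unseen in training can receive poor tags
# with a plain HMM unless smoothing is added.
print(hmm_tagger.tag(["The", "dog", "barked", "."]))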
Lexical-Based POS Tagging:
Lexical-based POS tagging involves using lexical resources such as dictionaries
or lexicons to assign tags to words based on their known meanings or
properties.
Example: Assigning the tag "NNP" (proper noun) to words listed in a dictionary
as proper nouns (e.g., "John," "London").
Transformation-Based POS Tagging:
Transformation-based POS tagging employs machine learning algorithms to
learn transformation rules that map observed word sequences to their
corresponding tag sequences. These rules are derived from training data and
iteratively refined to improve tagging accuracy.
Example: Learning rules that map word suffixes to POS tags (e.g., "-ed" → past
tense verb).
Hybrid POS Tagging:
Hybrid POS tagging combines multiple approaches, such as rule-based,
stochastic, and lexical-based methods, to leverage their respective strengths and
improve tagging accuracy.
Example: Integrating rule-based patterns with statistical models to handle
complex linguistic phenomena more effectively.
Deep Learning-Based POS Tagging:
Deep learning-based POS tagging utilizes neural network architectures, such as
recurrent neural networks (RNNs), convolutional neural networks (CNNs), or
transformers, to automatically learn features and representations from raw text
data for tagging.
Example: Training a neural network model to predict POS tags directly from
word embeddings or character-level representations.
Why is POS Tagging needed?
Syntactic Parsing: POS tags provide essential information about the
grammatical structure of sentences, enabling syntactic parsers to analyze and
understand the relationships between words in a sentence. Syntactic parsing
involves identifying phrases, clauses, and dependencies within sentences, which
is essential for tasks like information extraction, question answering, and
machine translation.
Disambiguation: Many words in natural language are ambiguous and can have
different meanings depending on their context. POS tagging helps disambiguate
these words by assigning them the appropriate grammatical category based on
their context. For example, the word "book" can be a noun (e.g., "I read a
book") or a verb (e.g., "I will book a flight"), and POS tagging helps determine
its correct usage in a sentence.
Morphological Analysis: POS tagging provides information about the
morphology of words, such as their tense, number, gender, and case. This
information is useful for analyzing word forms and inflections, which is
important for tasks like morphological generation and morphological
disambiguation.
Word Sense Disambiguation (WSD): POS tags serve as an important feature
for word sense disambiguation, which involves determining the intended
meaning of ambiguous words based on their context. WSD is crucial for tasks
like machine translation, information retrieval, and semantic analysis.