Unit 2

The document discusses syntax analysis in natural language processing (NLP), focusing on parsing techniques such as constituency and dependency parsing. It explains the structure of parse trees, the use of context-free grammar (CFG), and the challenges of ambiguity in syntactic analysis. Additionally, it introduces treebanks as a data-driven approach to syntax, providing a collection of sentences with expert-judged syntactic analyses to aid in knowledge acquisition.


Syntax Analysis

• Parsing Natural Language


• Parsing in NLP is the process of determining the syntactic structure of a text by analyzing its constituent words based on an underlying grammar.
• Sentence -> noun_phrase, verb_phrase
• Noun_phrase -> proper_noun
• Noun_phrase -> determiner, noun
• Verb_phrase -> verb, noun_phrase
• Proper_noun -> [Tom]
• Verb -> [ate]
• Determiner -> [an]
• Noun -> [apple]
Syntax Analysis:
• The outcome of the parsing process is a parse tree in which sentence is the root, intermediate nodes such as noun_phrase and verb_phrase are non-terminals, and the leaves of the tree 'Tom', 'ate', 'an', 'apple' are terminals.

• Following structure shows the parse tree for the sentence “Tom ate an
apple”
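
• As a quick check, the toy grammar above can be encoded in nltk and run through a chart parser (a minimal sketch; the category names are shortened versions of the rules above):

import nltk

# the toy grammar from the slide, with shortened category names
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PropN | Det N
VP -> V NP
PropN -> 'Tom'
V -> 'ate'
Det -> 'an'
N -> 'apple'
""")

# chart-parse the sentence and print every tree found
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['Tom', 'ate', 'an', 'apple']):
    print(tree)
# (S (NP (PropN Tom)) (VP (V ate) (NP (Det an) (N apple))))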
Syntax Analysis:
• Parsing natural language refers to the process of analyzing and interpreting the grammatical structure and meaning of a sentence or text.

• Parsing recovers information that is not explicitly given in the input sentence.

• To recover this information, the parser requires some additional knowledge, i.e., the grammar of the language.

• Context Free Grammar (CFG) is used to write the rules of syntax. For example:

• S -> NP VP

• NP -> 'Ajay' | 'pockets' | D N | N | P | PP

• V -> 'bought'
Syntax Analysis
• D -> 'a'

• N -> 'shirt'

• PP -> 'on the table'

• P -> 'on' | 'with' | 'in'

• where V = verb, NP = noun phrase, VP = verb phrase, PP = prepositional phrase, D = determiner, N = noun, P = preposition.

• Typically, lexical rules are stated in the form X -> w, where X is a part of speech and w is the terminal (word) symbol it generates.

• For example, in the rule V -> 'saw', V is a part of speech that generates the verb 'saw'.
Syntax Analysis
• The sentence 'Ajay bought a shirt with pockets' can have two possible parses based on the rules stated in the CFG.

• In the first parse, the PP 'with pockets' attaches to the verb phrase, as if pockets were the means used to buy the shirt; in the second parse, it attaches to the noun phrase, giving the meaning that Ajay is purchasing a shirt that has pockets.

• 1st Parse:
(S (NP Ajay)
   (VP (VP (V bought)
           (NP (D a) (N shirt)))
       (PP (P with)
           (NP pockets))))

• 2nd Parse:
(S (NP Ajay)
   (VP (V bought)
       (NP (NP (D a) (N shirt))
           (PP (P with)
               (NP pockets)))))
Parse Tree
Syntax Analysis
• From the above example we understand that writing down all possible syntactic constructions of a language is a difficult and complex task, as we cannot list them in a single form.

• All the possibilities need to be explored based on the part-of-speech tags assigned to a particular word.

• It is also difficult to specify the lexical properties of a particular word; this is the first knowledge acquisition problem.

• The second knowledge acquisition problem arises from the fact that it is not sufficient to know the syntactic rules of a language; due to ambiguity in syntactic analysis, we must also determine which analysis is the most plausible for the sentence in terms of meaning.
Syntax Analysis
• For example, consider the CFG:

• N -> N N
• N -> 'natural' | 'language' | 'processing' | 'book'

• If we take the input string "natural language processing", two ambiguous parses can be generated:

• Parse 1:
(N (N (N natural)
      (N language))
   (N processing))

• Parse 2:
(N (N natural)
   (N (N language)
      (N processing)))
Syntax Analysis
• Parse 1 corresponds to the meaning 'processing of natural language', while parse 2 corresponds to 'a natural way to do language processing'.

• This ambiguity problem is a limitation of CFG.
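
• The ambiguity can be checked mechanically: a chart parser enumerates every tree the grammar licenses for the string (a minimal sketch using nltk):

import nltk

# the ambiguous noun-compound grammar from the slide
grammar = nltk.CFG.fromstring("""
N -> N N
N -> 'natural' | 'language' | 'processing' | 'book'
""")

# the chart parser yields one tree per distinct analysis
parser = nltk.ChartParser(grammar)
for tree in parser.parse('natural language processing'.split()):
    print(tree)
# (N (N (N natural) (N language)) (N processing))
# (N (N natural) (N (N language) (N processing)))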


Shallow Parsing

• A parser takes a sequence of tokens as input and produces a parse tree as output.
Shallow Parsing
Statement: He ate the pizza
G = (V, T, P, S)
V - set of variables = {S, NP, VP, ART, N, V, PRO}
T - set of terminals = {he, ate, the, pizza}
P - production rules
S - start symbol
P: { S -> NP VP
     NP -> PRO
     NP -> ART N
     VP -> V NP
     PRO -> he
     V -> ate
     ART -> the
     N -> pizza }
Shallow Parsing
• Parse tree for the sentence “The rose smells wonderful”
Shallow Parsing

• A parser takes a sequence of tokens as input and produces a parse tree as output.

• Parsing is of two types: top-down parsing and bottom-up parsing.
Shallow Parsing

• Top-down parsing

• Top-down parsing is also known as recursive or predictive parsing.

• In top-down parsing, parsing starts from the start symbol and transforms it into the input string.

• The parse tree representation of the input string "Book that flight" is as follows:
Shallow Parsing

• Bottom-up parsing
• Bottom-up parsing is also known as shift-reduce parsing.
• Bottom-up parsing is used to construct a parse tree for an input string.
• In bottom-up parsing, parsing starts with the input symbols and constructs the parse tree up to the start symbol by tracing out the rightmost derivation of the string in reverse.
• The parse tree representation of the input string "id * id" is as follows:
Shallow Parsing

Given grammar, with input string "id * id":

E -> T
T -> T * F
T -> F
F -> id
Shallow Parsing

import nltk

# POS-tagged input sentence to be chunked
sentence = [
    ("a", "DT"),
    ("clever", "JJ"),
    ("fox", "NN"),
    ("was", "VBP"),
    ("jumping", "VBP"),
    ("over", "IN"),
    ("the", "DT"),
    ("wall", "NN"),
]

# chunk grammar: an NP is an optional determiner, any number of
# adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
reg_parser = nltk.RegexpParser(grammar)
output = reg_parser.parse(sentence)
# output.draw()  # opens a window showing the chunk tree
print(output)
# (S (NP a/DT clever/JJ fox/NN) was/VBP jumping/VBP over/IN (NP the/DT wall/NN))
Dependency-based Parsing
• The parsing can also be divided into dependency-based and constituency-
based parsing.

• Dependency-based parsing is a type of syntactic parsing in natural


language processing (NLP) that focuses on representing the grammatical
structure of a sentence through dependencies between words.

• In a dependency-based parse, each word in a sentence is associated with


its syntactic relationship to other words.

• Dependency parsing displays only the relationships between words, while constituency parsing displays the entire sentence structure and phrase relationships.

• All the nodes in the tree are words, and the links between words are labelled by syntactic function.
Dependency-based Parsing
• A dependency parse tree is a graph whose set of vertices contains the words in the sentence, and whose edges each connect two words.

• The graph must satisfy three conditions:

1. There has to be a single root node with no incoming edges.

2. For each node v, there must be a path from the root to v.

3. Each node except the root must have exactly one incoming edge.
Dependency-based Parsing
• Additionally, each edge has a type, which defines the grammatical relation that occurs between the two words.

• Each relationship:

Has one head and a dependent that modifies the head.

Is labelled according to the nature of the dependency


between the head and the dependent.

These labels can be found at Universal Dependency

Relations.
Dependency-based Parsing
• Example – 1: Black car
• There is a relationship between car and black, because black modifies the meaning of car.
• Here, car acts as the head and black is a dependent of the head.
• The nature of the relationship here is amod which
stands for “Adjectival Modifier”.
• It is an adjective or an adjective phrase that
modifies a noun.
Dependency-based Parsing
• Example – 2 : I saw a fox

• The root of the tree is the verb of the sentence, and edges
between words describe their relationships.

• The word “saw” has an outgoing edge of type nsubj to the


word “I”, meaning that “I” is the nominal subject of the
verb “saw”.
• In this case, we say that “I” depends on “saw”.
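
• These relations can be inspected programmatically. Below is a small sketch using spaCy (assuming the en_core_web_sm model has been installed with: python -m spacy download en_core_web_sm); it prints each word, its dependency label, and its head:

import spacy

# load the small English pipeline
nlp = spacy.load("en_core_web_sm")

doc = nlp("I saw a fox")
for token in doc:
    print(token.text, token.dep_, token.head.text)

# expected output (labels may vary slightly across model versions):
# I nsubj saw
# saw ROOT saw
# a det fox
# fox dobj saw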
Dependency-based Parsing
• Example 3: Deemed Universities charge huge fees.
• The dependency-parse representation is not based on a single dependency word; every word in the sentence is linked to its head.
Constituency-based Parsing

• Constituency parsing displays the entire sentence structure and relationships.

• Constituency parsing is the process of analyzing a sentence by breaking it down into sub-phrases, also known as constituents.

• These sub-phrases belong to a specific grammatical category, such as NP (noun phrase) and VP (verb phrase).

• Constituency parsing aims to extract a constituency-based parse tree from a sentence that represents its syntactic structure according to a phrase structure grammar or CFG.
Constituency-based Parsing

• Example 1: John sees Bill.

• Example 2: I saw a fox.
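
• A bracketed constituency tree such as the one in Example 1 can be built and displayed with nltk's Tree class (a small sketch; the bracketing assumes the usual S -> NP VP analysis):

from nltk import Tree

# build the constituency tree for "John sees Bill" from its bracketed form
t = Tree.fromstring("(S (NP John) (VP (V sees) (NP Bill)))")

# print an ASCII rendering of the tree
t.pretty_print()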


Constituency Grammar (CG) and
Dependency Grammar(DG)

• Constituency Grammar (CG)

• Constituency Grammar (CG), also known as Phrase Structure


Grammar, is a formalism in the field of linguistics and natural
language processing (NLP) used to describe the hierarchical
structure of sentences in natural language.

• Constituency Grammar focuses on dividing sentences into


constituents (phrases or word groups) based on their
grammatical structure.

• Each constituent is represented as a hierarchical tree structure,


which is often referred to as a "parse tree."
Constituency Grammar (CG) and
Dependency Grammar(DG)
• A phrase-structure grammar is defined by a finite vocabulary (alphabet) Vp, a finite set Σ of initial strings in Vp, and a finite set F of rules of the form X → Y, where X and Y are strings in Vp.

• In linguistics, phrase structure grammars are all those grammars


that are based on the constituency relation, as opposed to the
dependency relation associated with dependency grammars.

• The constituency relation derives from the subject-


predicate division of Latin and Greek grammars.
Constituency Grammar (CG) and
Dependency Grammar(DG)
• Here, we study the clause structure in terms of
noun phrase NP and verb phrase VP.
• Sentence: This tree is illustrating IC-analysis
according to the constituency relation
Constituency Grammar (CG) and
Dependency Grammar(DG)
• Dependency Grammar(DG)

• It is opposite to the constituency grammar and is based on


the dependency relation.

• Dependency grammar (DG) is opposite to constituency


grammar because it lacks phrasal nodes.

• It emphasizes the relationships between words in a


sentence.
• In Dependency Grammar, the words are connected to each
other by directed links.

Constituency Grammar (CG) and
Dependency Grammar(DG)
• Every other syntactic unit is connected to the verb by a directed link.
• These syntactic units are called dependencies.
• Sentence: This tree is illustrating the dependency
relation
Difference between constituency parsing and dependency parsing

• Constituency parsing focuses on identifying the constituent structure of a sentence, such as noun phrases and verb phrases. Dependency parsing focuses on identifying the grammatical relationships between words in a sentence, such as subject-verb relationships.

• Constituency parsing uses phrase structure grammar, such as context-free grammar. Dependency parsing uses dependency grammar, which represents the relationships between words as labeled directed arcs.

• Constituency parsing is based on a top-down approach, where the parse tree is built from the root node down to the leaves. Dependency parsing is based on a bottom-up approach, where the parse tree is built from the leaves up to the root.
Difference between constituency parsing and dependency parsing

• Constituency parsing represents a sentence as a tree structure with non-overlapping constituents. Dependency parsing represents a sentence as a directed graph, where words are represented as nodes and grammatical relationships are represented as edges.

• Constituency parsing is more suitable for natural language understanding tasks. Dependency parsing is more suitable for natural language generation tasks and dependency-based machine learning models.

• Constituency parsing is more expressive and captures more syntactic information, but can be more complex to compute and interpret. Dependency parsing is simpler and more efficient, but may not capture as much syntactic information as constituency parsing.
Difference between constituency parsing and dependency parsing

• Constituency parsing is more appropriate for languages with rich morphology, such as agglutinative languages. Dependency parsing is more appropriate for languages with less morphological inflection, like English and Chinese.

• Constituency parsing is used for more traditional NLP tasks like Named Entity Recognition, text classification, and sentiment analysis. Dependency parsing is used for more advanced NLP tasks like machine translation, language modelling, and text summarization.
Syntax Analysis
• Treebanks: A Data-Driven Approach to Syntax

• Parsing recovers information that is not explicit in the input


sentence.

• This implies that a parser requires some knowledge in addition


to the input sentence about the kind of syntactic analysis that
should be produced as output.

• One method to provide such knowledge to the parser is to write down a grammar of the language – a set of rules of syntactic analysis, such as a CFG.

• In natural language, however, it is far too complex to simply list all the syntactic rules in terms of a CFG.
Treebanks: A Data-Driven Approach to Syntax
• The second problem is the knowledge acquisition problem: not only do we need to know the syntactic rules for a particular language, but we also need to know which analysis is the most plausible (probable) for a given input sentence.

• The construction of a treebank is a data-driven approach to syntax analysis that allows us to address both of these knowledge acquisition bottlenecks in one stroke.

• A treebank is simply a collection of sentences (also called a corpus of text), where each sentence is provided with a complete syntax analysis.
Treebanks: A Data-Driven Approach to Syntax
• The syntactic analysis for each sentence has been judged by a human expert as the most plausible analysis for that sentence.

• A lot of care is taken during the human annotation


process to ensure that a consistent treatment is provided
across the treebank for related grammatical phenomena.

• There is no set of syntactic rules or linguistic grammar


explicitly provided by a treebank, and typically there is
no list of syntactic constructions provided explicitly in a
treebank.
Treebanks: A Data-Driven Approach to Syntax
• A detailed set of assumptions about the syntax is typically
used as an annotation guideline to help the human
experts produce the single-most plausible syntactic
analysis for each sentence in the corpus.

• Treebanks provide a solution to the two kinds of


knowledge acquisition bottlenecks.

• Treebanks solve the first knowledge acquisition problem of


finding the grammar underlying the syntax analysis
because the syntactic analysis is directly given instead of
a grammar.
Treebanks: A Data-Driven Approach to Syntax

• Treebanks solve the second knowledge acquisition problem as well.

• Since each sentence in a treebank has been given its most plausible (probable) syntactic analysis, supervised machine learning methods can be used to learn a scoring function over all possible syntax analyses.
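
• nltk ships with a small sample of the Penn Treebank, so the expert-annotated parses in a treebank can be inspected directly (a sketch; assumes the 'treebank' corpus has been downloaded):

import nltk

# fetch the bundled Penn Treebank sample
nltk.download('treebank')
from nltk.corpus import treebank

# each sentence comes with its expert-judged parse tree;
# the treebank itself provides no explicit grammar or rule list
tree = treebank.parsed_sents('wsj_0001.mrg')[0]
print(tree)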
CFG with LMD and RMD
CFG: A CFG consists of a finite set of grammar rules having the following four components:
Set of Non-Terminals
Set of Terminals
Set of Productions
Start Symbol

Set of Non-Terminals
• It is represented by V. The non-terminals are syntactic variables that denote sets of strings, which help in defining the language generated by the grammar.
CFG with LMD and RMD
• Set of Terminals
• Terminals are also known as tokens and are represented by Σ. Strings are formed from these basic symbols.
• Set of Productions
• It is represented by P. This set describes how the terminals and non-terminals can be combined.
• Start Symbol
• The derivation begins from the start symbol, which is represented by S. The start symbol is always a non-terminal.
CFG with LMD and RMD
• Left Most Derivation (LMD) and Derivation Tree:
• The leftmost derivation of a string from the start symbol S is done by replacing the leftmost non-terminal symbol with the RHS of the corresponding production rule.
• (The grammar G referred to here appeared in the preceding figure; from the derivations it can be read as S -> SS | aSb | bSa | ε.)
• For example, the leftmost derivation of the string abab from grammar G is: S => aSb => abSab => abab
• At each step, the leftmost non-terminal is replaced using a production rule.
CFG with LMD and RMD
• Right Most Derivation (RMD):
• The rightmost derivation of a string from the start symbol S is done by replacing the rightmost non-terminal symbol with the RHS of the corresponding production rule.
• For example, the rightmost derivation of the string abab from grammar G is: S => SS => SaSb => Sab => aSbab => abab
• At each step, the rightmost non-terminal is replaced using a production rule.
• The derivation tree for abab using the rightmost derivation is shown below.
CFG with LMD and RMD
• Ambiguous Context-Free Grammar:

• A context-free grammar is called ambiguous if there exists more than one LMD or more than one RMD for some string generated by the grammar.

• There will also be more than one derivation tree for such a string in an ambiguous grammar.

• The grammar described above is ambiguous because there are two derivation trees (Figure 1 and Figure 2).

• There is more than one rightmost derivation for the string abab:

S => SS => SaSb => Sab => aSbab => abab

S => aSb => abSab => abab
CFG with LMD and RMD
Text Normalization
• Text normalization is the process of transforming text into a consistent
and standardized format.

• Text normalization is a critical preprocessing step in NLP, as it helps


reduce noise and ensures that text data is consistent and suitable for
various tasks like sentiment analysis, machine learning, and information
retrieval.

• The primary goals of text normalization are to reduce text variations and
enhance the accuracy and efficiency of text processing.

• Common text normalization tasks are : Tokenization, Removing


Punctuation, Stemming and Lemmatization, Removing Stop
Words, Spell Checking and Correction, Handling Numbers,
Removing HTML Tags etc.
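
• As a small illustration, a minimal sketch combining two of these steps (tokenization and punctuation removal, after lowercasing) might look like this:

import string
from nltk.tokenize import word_tokenize

# lowercase, tokenize, then drop punctuation tokens; further steps
# (stemming, stop-word removal, spell correction) would be chained on
text = "Text Normalization reduces VARIATIONS in text!!"
tokens = word_tokenize(text.lower())
tokens = [t for t in tokens if t not in string.punctuation]
print(tokens)
# ['text', 'normalization', 'reduces', 'variations', 'in', 'text']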
Text Normalization
• Removing Stopwords :

• StopWords: A stopword is a commonly used word that a


search engine has been programmed to ignore, both
when indexing entries for searching and when
retrieving them as the result of a search query.

• For instance, nltk can list the English language


stopwords from the corpus using the code below:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
Text Normalization
• Removing Stopwords :
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
"you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its',
'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who',
'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were',
'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during',
'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on',
'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
"didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
"haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't",
'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
"wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Total: 179 stopwords.
Removing Stopwords:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# converts the words in word_tokens to lower case and then checks
# whether they are present in stop_words or not
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]

# with no lower case conversion
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Output:
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
Text Normalization
• Correcting Words:

• Each method takes a list of misspelled words and suggests the correct word for each incorrect word.

• It tries to find a word in the list of correct spellings that has the shortest distance to, and the same initial letter as, the misspelled word.

• It then returns the word that matches the given criteria.

• The word-correction methods differ in the distance measure used to find the closest word.
Text Normalization
• Method 1: Using the Jaccard distance

• This method is mainly used to measure similarity between two sets or between two asymmetric binary vectors.

• Jaccard distance = 1 - J(A, B), where J(A, B) = |A ∩ B| / |A ∪ B| is the Jaccard index.
Text Normalization
• Example of correcting words using the Jaccard distance method:

• With J(A, B) = 1/3 ≈ 0.33, Jaccard distance = 1 - J(A, B) = 1 - 0.33 ≈ 0.67
Text Normalization
# Python program using nltk for correcting words
# with the Jaccard distance method

# importing the nltk suite
import nltk

# importing jaccard_distance and ngrams
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams

# downloading and importing the 'words' package from the nltk corpus
nltk.download('words')
from nltk.corpus import words
correct_words = words.words()

w1 = set('cat')
w2 = set('cap')
print(jaccard_distance(w1, w2))

Output: 0.5
Text Normalization
# Correcting words using the Jaccard distance method
# list of incorrect spellings that need to be corrected
incorrect_words = ['happpy', 'azmaing', 'intelliengt']

# loop for finding correct spellings based on Jaccard distance
# over character bigrams, printing the correct word for each
for word in incorrect_words:
    temp = [(jaccard_distance(set(ngrams(word, 2)),
                              set(ngrams(w, 2))), w)
            for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])

Output:
happy
amazing
intelligent
Text Normalization
• Method 2: Using the Edit distance

• The edit distance between a source string and a target string is the minimum number of edit operations (deletions, insertions, or substitutions) required to transform the source into the target.

• The lower the distance, the more similar the two strings.
Text Normalization
• Example Python program using the edit distance method

# importing the nltk suite
import nltk

# importing edit distance
from nltk.metrics.distance import edit_distance

# downloading and importing the 'words' package
nltk.download('words')
from nltk.corpus import words
correct_words = words.words()

w1 = 'cat'
w2 = 'cap'
print(edit_distance(w1, w2))

Output: 1
Text Normalization
# Correcting words using the edit distance method:
# list of incorrect spellings that need to be corrected
incorrect_words = ['intelliengt']

# loop for finding correct spellings based on edit distance,
# printing the correct word
for word in incorrect_words:
    temp = [(edit_distance(word, w), w)
            for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])

Output: intelligent
Parts of Speech (POS) Tagging
• Part-of-speech (POS) tagging, also known as speech
tagging or word-category disambiguation, is a
fundamental task in natural language processing
(NLP).

• It involves the process of assigning a specific part-of-


speech tag (a grammatical category or label) to each
word in a sentence or a text.

• These tags represent the syntactic and grammatical


role of each word within the sentence.
Parts of Speech (POS) Tagging
• Input: I am going.
• Output:
With stop words:
[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('.', '.')]
Without stop words:
[('I', 'PRP'), ('going', 'VBG'), ('.', '.')]
Where:
PRP - personal pronoun
VBP - verb, non-3rd person singular present
VBG - verb, gerund/present participle
Parts of Speech (POS) Tagging

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
txt = "Everything is all about money."

# sent_tokenize is an instance of PunktSentenceTokenizer
# from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for i in tokenized:
    # word tokenizers find the words and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]
    # using a part-of-speech tagger (POS-tagger)
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

Output:
[('Everything', 'VBG'), ('money', 'NN'), ('.', '.')]
Parts of Speech (POS) Tagging
# with stop words
import nltk

# download required nltk packages
nltk.download('punkt')                        # required for tokenization
nltk.download('averaged_perceptron_tagger')   # required for POS tagging

# input text
sentence = """Today morning, I am Very Happy."""

# tokenize into words
tokens = nltk.word_tokenize(sentence)

# parts of speech tagging
tagged = nltk.pos_tag(tokens)

# print tagged tokens
print(tagged)

Output:
[('Today', 'NN'), ('morning', 'NN'), (',', ','), ('I', 'PRP'),
 ('am', 'VBP'), ('Very', 'RB'), ('Happy', 'JJ'), ('.', '.')]
Parts of Speech (POS) Tagging
# without stop words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
txt = "Today morning, I am Very Happy."

# sent_tokenize is an instance of PunktSentenceTokenizer
# from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for i in tokenized:
    # word tokenizers find the words and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]
    # using a part-of-speech tagger (POS-tagger)
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

Output:
[('Today', 'NN'), ('morning', 'NN'), (',', ','), ('I', 'PRP'),
 ('Very', 'RB'), ('Happy', 'JJ'), ('.', '.')]
Parsing Algorithms
• Parsing algorithms in natural language processing (NLP)
are methods and techniques used to analyze the
grammatical structure of sentences and generate parse
trees or dependency structures.

• Parsing plays a crucial role in understanding the syntactic


and semantic relationships between words in a sentence.

• There are various parsing algorithms used in NLP; the choice among them depends on the specific NLP task and the nature of the text being parsed.
Parsing Algorithms
• Shift-Reduce Parsing

• The shift-reduce parser constructs a parse in the same manner as bottom-up parsing, i.e., the parse tree is constructed from the leaves (bottom) to the root (up).

• A more general form of the shift-reduce parser is the LR parser.

• This parser requires some data structures, namely:

• An input buffer for storing the input string.

• A stack for storing and accessing the production rules.
Parsing Algorithms
• To build a parser, we need an algorithm that can perform the steps of such a rightmost derivation in reverse, for any grammar and any input string.

• Basic Operations –

• Shift: This involves moving symbols from the input buffer


onto the stack.

• Reduce: If the handle appears on top of the stack then, it is


reduced by using appropriate production rule i.e. RHS of a
production rule is popped out of a stack and LHS of a
production rule is pushed onto the stack.
Parsing Algorithms
• Accept: If only the start symbol is present on the stack and the input buffer is empty, then the parsing action is called accept. Reaching the accept action means parsing has completed successfully.

• Error: This is the situation in which the parser can perform neither a shift action nor a reduce action, and cannot accept.

• Example 1 – Consider the grammar


S –> S + S
S –> S * S
S –> id

• Perform Shift Reduce parsing for input string “id + id + id”.


Shift-Reduce Parsing
Stack      | Input String    | Action
$          | id + id + id $  | Shift id
$ id       | + id + id $     | Reduce S --> id
$ S        | + id + id $     | Shift +
$ S +      | id + id $       | Shift id
$ S + id   | + id $          | Reduce S --> id
$ S + S    | + id $          | Reduce S --> S + S
$ S        | + id $          | Shift +
$ S +      | id $            | Shift id
$ S + id   | $               | Reduce S --> id
$ S + S    | $               | Reduce S --> S + S
$ S        | $               | Accept
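
• The same kind of grammar can be tried with nltk's built-in shift-reduce parser (a sketch; note that nltk's ShiftReduceParser resolves conflicts greedily and does not backtrack, so it finds at most one parse):

import nltk

# the example grammar, with '+' and '*' as terminals
grammar = nltk.CFG.fromstring("""
S -> S '+' S | S '*' S | 'id'
""")

# trace=2 prints each shift and reduce step as it happens
parser = nltk.ShiftReduceParser(grammar, trace=2)
for tree in parser.parse(['id', '+', 'id', '+', 'id']):
    print(tree)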
Shift-Reduce Parsing
• Example 2 – Consider the grammar:
S → S + S
S → S - S
S → (S)
S → a

• Perform shift-reduce parsing for the input string "a1 - (a2 + a3)".
Shift-Reduce Parsing
• Advantages:
1. Shift-reduce parsing is efficient and can handle a wide
range of context-free grammars.

2. It can parse a large variety of programming languages


and is widely used in practice.

3. It is capable of handling both left and right-recursive


grammars, which can be important in parsing certain
programming languages.

4. The parse table generated for shift-reduce parsing is


typically small, which makes the parser efficient in
terms of memory usage.
Shift-Reduce Parsing
• Disadvantages:
1. Shift-reduce parsing has a limited look ahead, which
means that it may miss some syntax errors that require
a larger look ahead.

2. It may also generate false-positive shift-reduce conflicts,


which can require additional manual intervention to
resolve.

3. Shift-reduce parsers may have difficulty in parsing


ambiguous grammars, where there are multiple possible
parse trees for a given input sequence.

4. In some cases, the parse tree generated by shift-reduce


parsing may be more complex than other parsing
techniques.
Parsing algorithm
• Hypergraphs and Chart Parsing:

• Hypergraphs and chart parsing are two related concepts in natural


language processing (NLP) and computational linguistics.

• They are used to represent and analyze the structure of sentences and
syntactic or semantic relationships in a more flexible and expressive
manner than traditional parsing techniques.

• Hypergraphs:

• A hypergraph is a generalization of a graph, where edges (hyperedges)


can connect more than two nodes.

• In the context of NLP, hypergraphs are often used to represent complex


syntactic or semantic structures in a sentence.
Parsing algorithm
• Each node in a hypergraph typically corresponds to a word or a sub-phrase,
and hyperedges connect nodes to represent relationships between them.

• In NLP, hypergraphs are commonly used in parsing and semantic analysis


tasks to represent various linguistic structures, such as constituency and
dependency parses.

• Chart Parsing

• Chart parsing is a parsing technique used in NLP to construct parse charts,


which are data structures that store and organize partial parsing results for
sentences.

• It's a dynamic programming approach that incrementally builds and


combines parse trees or structures as it processes words in a sentence.
Parsing algorithm
• Chart parsing is commonly used for constituency parsing and
dependency parsing.

• The key components of chart parsing include:

• Chart: The data structure that stores parsing results for sub-phrases.

• Earley Parser: A popular algorithm for chart parsing, it incrementally


adds and combines parsing states in the chart as it processes input
words.

• Chart parsing can handle ambiguity, non-projectivity, and other


linguistic phenomena, making it a powerful tool for syntactic and
semantic analysis in NLP.

• An example of chart parsing is the CYK algorithm.
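
• Before turning to CYK, here is a minimal sketch of chart parsing with nltk's EarleyChartParser (trace=1 prints chart edges as they are added):

import nltk
from nltk.parse import EarleyChartParser

# a toy grammar for the running example "I saw a fox"
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I' | Det N
VP -> V NP
Det -> 'a'
N -> 'fox'
V -> 'saw'
""")

parser = EarleyChartParser(grammar, trace=1)
for tree in parser.parse(['I', 'saw', 'a', 'fox']):
    print(tree)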


Parsing algorithm
• CYK algorithm

• The CYK algorithm is a parsing algorithm for context-free


grammar.

• It is used to check if a particular string can be derived from


the language generated by a given grammar.

• It is also called the membership algorithm as it tells whether


the given string is a member of the given grammar or not.

• It was developed independently by John Cocke, Daniel Younger, and Tadao Kasami, hence the name CYK!
Parsing algorithm
• In the CYK algorithm, the grammar must be in Chomsky normal form (CNF).

• The CYK algorithm is a dynamic programming-based parsing technique, or table-filling algorithm, used to identify the possible constituency parses of a given sentence.

• The grammar is in CNF if each rule has one of the following forms:
• A → BC (at most two non-terminals on the right-hand side)
• A → a (a single terminal on the right-hand side)
• S → ε (the null string, for the start symbol only)
Parsing algorithm
• If the given Grammar is not in the CNF form, convert it to the CNF
form before applying the CYK Algorithm.

• Deciding Membership Using CYK Algorithm

• To decide the membership of any given string, we construct a


triangular table where

• Each row of the table corresponds to one particular length of substrings.

• The bottom-most row corresponds to substrings of length 1.

• The second row from the bottom corresponds to substrings of length 2.

• The third row from the bottom corresponds to substrings of length 3.

• The top-most row corresponds to the given string of length n.
CYK algorithm
• For a given string 'w' of length 4, the triangular table looks as shown below.

• Let w be the string of length n to be parsed, and let G represent the set of rules in our grammar with start symbol S.

1. Construct a table of size n × n.

2. If w = ε (the empty string) and S -> ε is a rule in G, then we accept the string; otherwise we reject.
CYK algorithm
3. For i = 1 to n:
     For each variable A:
       Check if A -> b is a rule and b = w_i:
         If so, place A in cell (i, i) of the table.

4. For l = 2 to n:
     For i = 1 to n-l+1:
       j = i+l-1
       For k = i to j-1:
         For each rule A -> BC:
           Check if cell (i, k) contains B and cell (k+1, j) contains C:
             If so, put A in cell (i, j) of the table.

5. Check if S is in cell (1, n):
     If so, accept the string.
     Else, reject.
CYK algorithm
• Example 1. Let our grammar G be:
S -> AB | BB
A -> CC | AB | a
B -> BB | CA | b
C -> BA | AA | b

We check if aabb is in L(G).

• Example 2. Let our grammar G be:
S -> AB | BC
A -> BA | a
B -> CC | b
C -> AB | a

We check if baaba is in L(G).
CYK algorithm
• Example 3. Let our grammar G be:
S -> S + S
S -> S * S
S -> id

We check if the string id * id + id is in L(G).
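
• To make the table-filling concrete, here is a minimal CYK membership checker in Python (a sketch, not tied to any library; the grammar is passed as a plain dict, shown here with Example 2's grammar):

# CYK membership test for a grammar in CNF (a minimal sketch).
# The grammar maps each variable to a list of right-hand sides,
# where each RHS is a tuple of symbols.
def cyk(word, grammar, start='S'):
    n = len(word)
    # table[i][j] holds the set of variables deriving word[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    # length-1 substrings: apply rules A -> a
    for i, ch in enumerate(word):
        for lhs, rhss in grammar.items():
            if (ch,) in rhss:
                table[i][i].add(lhs)
    # longer substrings: apply rules A -> BC over every split point k
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for lhs, rhss in grammar.items():
                    for rhs in rhss:
                        if (len(rhs) == 2
                                and rhs[0] in table[i][k]
                                and rhs[1] in table[k + 1][j]):
                            table[i][j].add(lhs)
    # accept iff the start symbol derives the whole string
    return start in table[0][n - 1]

# Example 2's grammar:
G = {'S': [('A', 'B'), ('B', 'C')],
     'A': [('B', 'A'), ('a',)],
     'B': [('C', 'C'), ('b',)],
     'C': [('A', 'B'), ('a',)]}
print(cyk('baaba', G))  # True: baaba is in L(G)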
