Unit 2

The document discusses syntax analysis in natural language processing (NLP), focusing on parsing techniques such as constituency and dependency parsing. It explains the structure of parse trees, the use of context-free grammar (CFG), and the challenges of ambiguity in syntactic analysis. Additionally, it introduces treebanks as a data-driven approach to syntax, providing a collection of sentences with expert-judged syntactic analyses to aid in knowledge acquisition.


Syntax Analysis

• Parsing Natural Language


• Parsing in NLP is the process of determining the syntactic structure of a text by analyzing its constituent words based on an underlying grammar.
• Sentence -> noun_phrase, verb_phrase
• Noun_phrase -> proper_noun
• Noun_phrase -> determiner, noun
• Verb_phrase -> verb, noun_phrase
• Proper_noun -> [Tom]
• Verb -> [ate]
• Determiner -> [an]
• Noun -> [apple]
Syntax Analysis:
• The outcome of the parsing process is a parse tree in which sentence is the root, intermediate nodes such as noun_phrase and verb_phrase are non-terminals, and the leaves of the tree 'Tom', 'ate', 'an', 'apple' are terminals.

• Following structure shows the parse tree for the sentence “Tom ate an
apple”
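
• As a quick check, the toy grammar above can be encoded in nltk and run through a chart parser (a minimal sketch; the category names are shortened versions of the rules above):

import nltk

# the toy grammar from the slide, with shortened category names
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PropN | Det N
VP -> V NP
PropN -> 'Tom'
V -> 'ate'
Det -> 'an'
N -> 'apple'
""")

# chart-parse the sentence and print every tree found
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['Tom', 'ate', 'an', 'apple']):
    print(tree)
# (S (NP (PropN Tom)) (VP (V ate) (NP (Det an) (N apple))))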
Syntax Analysis:
• Parsing natural language refers to the process of analyzing and interpreting the grammatical structure and meaning of a sentence or text.

• Parsing recovers information that is not explicitly given in the input sentence.

• To recover this information, the parser requires some additional knowledge, i.e., the grammar of the language.

• Context Free Grammar (CFG) is used to write the rules of syntax. For example:

• S -> NP VP

• NP -> 'Ajay' | 'pockets' | D N | N | P | PP

• V -> 'bought'
Syntax Analysis
• D -> 'a'

• N -> 'shirt'

• PP -> 'on the table'

• P -> 'on' | 'with' | 'in'

• where V = verb, NP = noun phrase, VP = verb phrase, PP = prepositional phrase, D = determiner, N = noun, P = preposition.

• Typically, lexical rules are stated in the form X -> w, where X is a part of speech and w is the terminal (word) symbol it generates.

• For example, in the rule V -> 'saw', V is a part of speech that generates the verb 'saw'.
Syntax Analysis
• The sentence 'Ajay bought a shirt with pockets' can have two possible parses based on the rules stated in the CFG.

• In the first parse, the PP 'with pockets' attaches to the verb phrase, as if pockets were the means used to buy the shirt; in the second parse, it attaches to the noun phrase, giving the meaning that Ajay is purchasing a shirt that has pockets.

• 1st Parse:
(S (NP Ajay)
   (VP (VP (V bought)
           (NP (D a) (N shirt)))
       (PP (P with)
           (NP pockets))))

• 2nd Parse:
(S (NP Ajay)
   (VP (V bought)
       (NP (NP (D a) (N shirt))
           (PP (P with)
               (NP pockets)))))
Parse Tree
Syntax Analysis
• From the above example we understand that writing down all possible syntactic constructions of a language is a difficult and complex task, as we cannot list them in a single form.

• All the possibilities need to be explored based on the part-of-speech tags assigned to a particular word.

• It is also difficult to specify the lexical properties of a particular word; this is the first knowledge acquisition problem.

• The second knowledge acquisition problem arises from the fact that it is not sufficient to know the syntactic rules of a language; due to ambiguity in syntactic analysis, we must also determine which analysis is the most plausible for the sentence in terms of meaning.
Syntax Analysis
• For example, consider the CFG:

• N -> N N
• N -> 'natural' | 'language' | 'processing' | 'book'

• If we take the input string "natural language processing", two ambiguous parses can be generated:

• Parse 1:
(N (N (N natural)
      (N language))
   (N processing))

• Parse 2:
(N (N natural)
   (N (N language)
      (N processing)))
Syntax Analysis
• Parse 1 corresponds to the meaning 'processing of natural language', while parse 2 corresponds to 'a natural way to do language processing'.

• This ambiguity problem is a limitation of CFG.
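
• The ambiguity can be checked mechanically: a chart parser enumerates every tree the grammar licenses for the string (a minimal sketch using nltk):

import nltk

# the ambiguous noun-compound grammar from the slide
grammar = nltk.CFG.fromstring("""
N -> N N
N -> 'natural' | 'language' | 'processing' | 'book'
""")

# the chart parser yields one tree per distinct analysis
parser = nltk.ChartParser(grammar)
for tree in parser.parse('natural language processing'.split()):
    print(tree)
# (N (N (N natural) (N language)) (N processing))
# (N (N natural) (N (N language) (N processing)))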


Shallow Parsing

• A parser takes a sequence of tokens as input and produces a parse tree as output.
Shallow Parsing
Statement: He ate the pizza
G = (V, T, P, S)
V - set of variables = {S, NP, VP, ART, N, V, PRO}
T - set of terminals = {he, ate, the, pizza}
P - production rules
S - start symbol
P: { S -> NP VP
     NP -> PRO
     NP -> ART N
     VP -> V NP
     PRO -> he
     V -> ate
     ART -> the
     N -> pizza }
Shallow Parsing
• Parse tree for the sentence “The rose smells wonderful”
Shallow Parsing

• A parser takes a sequence of tokens as input and produces a parse tree as output.

• Parsing is of two types: top-down parsing and bottom-up parsing.
Shallow Parsing

• Top-down parsing

• Top-down parsing is also known as recursive or predictive parsing.

• In top-down parsing, parsing starts from the start symbol and transforms it into the input string.

• The parse tree representation of the input string "Book that flight" is as follows:
Shallow Parsing

• Bottom-up parsing
• Bottom-up parsing is also known as shift-reduce parsing.
• Bottom-up parsing is used to construct a parse tree for an input string.
• In bottom-up parsing, parsing starts with the input symbols and constructs the parse tree up to the start symbol by tracing out the rightmost derivation of the string in reverse.
• The parse tree representation of the input string "id * id" is as follows:
Shallow Parsing

Given grammar, with input string "id * id":

E -> T
T -> T * F
T -> F
F -> id
Shallow Parsing

import nltk

# POS-tagged input sentence to be chunked
sentence = [
    ("a", "DT"),
    ("clever", "JJ"),
    ("fox", "NN"),
    ("was", "VBP"),
    ("jumping", "VBP"),
    ("over", "IN"),
    ("the", "DT"),
    ("wall", "NN"),
]

# chunk grammar: an NP is an optional determiner, any number of
# adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
reg_parser = nltk.RegexpParser(grammar)
output = reg_parser.parse(sentence)
# output.draw()  # opens a window showing the chunk tree
print(output)
# (S (NP a/DT clever/JJ fox/NN) was/VBP jumping/VBP over/IN (NP the/DT wall/NN))
Dependency-based Parsing
• The parsing can also be divided into dependency-based and constituency-
based parsing.

• Dependency-based parsing is a type of syntactic parsing in natural


language processing (NLP) that focuses on representing the grammatical
structure of a sentence through dependencies between words.

• In a dependency-based parse, each word in a sentence is associated with


its syntactic relationship to other words.

• Dependency parsing displays only the relationships between words, while constituency parsing displays the entire sentence structure and phrase relationships.

• All the nodes in the tree are words, and the links between words are labelled by syntactic function.
Dependency-based Parsing
• A dependency parse tree is a graph whose set of vertices contains the words in the sentence, and whose edges each connect two words.

• The graph must satisfy three conditions:

1. There has to be a single root node with no incoming edges.

2. For each node v, there must be a path from the root to v.

3. Each node except the root must have exactly one incoming edge.
Dependency-based Parsing
• Additionally, each edge has a type, which defines the grammatical relation that occurs between the two words.

• Each relationship:

Has one head and a dependent that modifies the head.

Is labelled according to the nature of the dependency


between the head and the dependent.

These labels can be found at Universal Dependency

Relations.
Dependency-based Parsing
• Example – 1: Black car
• There is a relationship between car and black, because black modifies the meaning of car.
• Here, car acts as the head and black is a dependent of the head.
• The nature of the relationship here is amod which
stands for “Adjectival Modifier”.
• It is an adjective or an adjective phrase that
modifies a noun.
Dependency-based Parsing
• Example – 2 : I saw a fox

• The root of the tree is the verb of the sentence, and edges
between words describe their relationships.

• The word “saw” has an outgoing edge of type nsubj to the


word “I”, meaning that “I” is the nominal subject of the
verb “saw”.
• In this case, we say that “I” depends on “saw”.
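
• These relations can be inspected programmatically. Below is a small sketch using spaCy (assuming the en_core_web_sm model has been installed with: python -m spacy download en_core_web_sm); it prints each word, its dependency label, and its head:

import spacy

# load the small English pipeline
nlp = spacy.load("en_core_web_sm")

doc = nlp("I saw a fox")
for token in doc:
    print(token.text, token.dep_, token.head.text)

# expected output (labels may vary slightly across model versions):
# I nsubj saw
# saw ROOT saw
# a det fox
# fox dobj saw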
Dependency-based Parsing
• Example 3: Deemed Universities charge huge fees.
• The dependency-parse representation is not based on a single dependency word; every word in the sentence is linked to its head.
Constituency-based Parsing

• Constituency parsing displays the entire sentence structure and relationships.

• Constituency parsing is the process of analyzing a sentence by breaking it down into sub-phrases, also known as constituents.

• These sub-phrases belong to a specific grammatical category, such as NP (noun phrase) and VP (verb phrase).

• Constituency parsing aims to extract a constituency-based parse tree from a sentence that represents its syntactic structure according to a phrase structure grammar or CFG.
Constituency-based Parsing

• Example 1: John sees Bill.

• Example 2: I saw a fox.
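
• A bracketed constituency tree such as the one in Example 1 can be built and displayed with nltk's Tree class (a small sketch; the bracketing assumes the usual S -> NP VP analysis):

from nltk import Tree

# build the constituency tree for "John sees Bill" from its bracketed form
t = Tree.fromstring("(S (NP John) (VP (V sees) (NP Bill)))")

# print an ASCII rendering of the tree
t.pretty_print()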


Constituency Grammar (CG) and
Dependency Grammar(DG)

• Constituency Grammar (CG)

• Constituency Grammar (CG), also known as Phrase Structure


Grammar, is a formalism in the field of linguistics and natural
language processing (NLP) used to describe the hierarchical
structure of sentences in natural language.

• Constituency Grammar focuses on dividing sentences into


constituents (phrases or word groups) based on their
grammatical structure.

• Each constituent is represented as a hierarchical tree structure,


which is often referred to as a "parse tree."
Constituency Grammar (CG) and
Dependency Grammar(DG)
• A phrase-structure grammar is defined by a finite vocabulary (alphabet) Vp, a finite set Σ of initial strings in Vp, and a finite set F of rules of the form X → Y, where X and Y are strings in Vp.

• In linguistics, phrase structure grammars are all those grammars


that are based on the constituency relation, as opposed to the
dependency relation associated with dependency grammars.

• The constituency relation derives from the subject-


predicate division of Latin and Greek grammars.
Constituency Grammar (CG) and
Dependency Grammar(DG)
• Here, we study the clause structure in terms of
noun phrase NP and verb phrase VP.
• Sentence: This tree is illustrating IC-analysis
according to the constituency relation
Constituency Grammar (CG) and
Dependency Grammar(DG)
• Dependency Grammar(DG)

• It is opposite to the constituency grammar and is based on


the dependency relation.

• Dependency grammar (DG) is opposite to constituency


grammar because it lacks phrasal nodes.

• It emphasizes the relationships between words in a


sentence.
• In Dependency Grammar, the words are connected to each
other by directed links.

Constituency Grammar (CG) and
Dependency Grammar(DG)
• Every other syntactic unit is connected to the verb by a directed link.
• These syntactic units are called dependencies.
• Sentence: This tree is illustrating the dependency
relation
Difference between constituency parsing and dependency parsing

• Constituency parsing focuses on identifying the constituent structure of a sentence, such as noun phrases and verb phrases. Dependency parsing focuses on identifying the grammatical relationships between words in a sentence, such as subject-verb relationships.

• Constituency parsing uses phrase structure grammar, such as context-free grammar. Dependency parsing uses dependency grammar, which represents the relationships between words as labeled directed arcs.

• Constituency parsing is based on a top-down approach, where the parse tree is built from the root node down to the leaves. Dependency parsing is based on a bottom-up approach, where the parse tree is built from the leaves up to the root.
Difference between constituency parsing and dependency parsing

• Constituency parsing represents a sentence as a tree structure with non-overlapping constituents. Dependency parsing represents a sentence as a directed graph, where words are represented as nodes and grammatical relationships are represented as edges.

• Constituency parsing is more suitable for natural language understanding tasks. Dependency parsing is more suitable for natural language generation tasks and dependency-based machine learning models.

• Constituency parsing is more expressive and captures more syntactic information, but can be more complex to compute and interpret. Dependency parsing is simpler and more efficient, but may not capture as much syntactic information as constituency parsing.
Difference between constituency parsing and dependency parsing

• Constituency parsing is more appropriate for languages with rich morphology, such as agglutinative languages. Dependency parsing is more appropriate for languages with less morphological inflection, like English and Chinese.

• Constituency parsing is used for more traditional NLP tasks like Named Entity Recognition, text classification, and sentiment analysis. Dependency parsing is used for more advanced NLP tasks like machine translation, language modelling, and text summarization.
Syntax Analysis
• Treebanks: A Data-Driven Approach to Syntax

• Parsing recovers information that is not explicit in the input


sentence.

• This implies that a parser requires some knowledge in addition


to the input sentence about the kind of syntactic analysis that
should be produced as output.

• One method to provide such knowledge to the parser is to write down a grammar of the language – a set of rules of syntactic analysis, such as a CFG.

• In natural language, however, it is far too complex to simply list all the syntactic rules in terms of a CFG.
Treebanks: A Data-Driven Approach to Syntax
• The second problem is the knowledge acquisition problem: not only do we need to know the syntactic rules for a particular language, but we also need to know which analysis is the most plausible (probable) for a given input sentence.

• The construction of a treebank is a data-driven approach to syntax analysis that allows us to address both of these knowledge acquisition bottlenecks in one stroke.

• A treebank is simply a collection of sentences (also called a corpus of text), where each sentence is provided with a complete syntax analysis.
Treebanks: A Data-Driven Approach to Syntax
• The syntactic analysis for each sentence has been judged by a human expert as the most plausible analysis for that sentence.

• A lot of care is taken during the human annotation


process to ensure that a consistent treatment is provided
across the treebank for related grammatical phenomena.

• There is no set of syntactic rules or linguistic grammar


explicitly provided by a treebank, and typically there is
no list of syntactic constructions provided explicitly in a
treebank.
Treebanks: A Data-Driven Approach to Syntax
• A detailed set of assumptions about the syntax is typically
used as an annotation guideline to help the human
experts produce the single-most plausible syntactic
analysis for each sentence in the corpus.

• Treebanks provide a solution to the two kinds of


knowledge acquisition bottlenecks.

• Treebanks solve the first knowledge acquisition problem of


finding the grammar underlying the syntax analysis
because the syntactic analysis is directly given instead of
a grammar.
Treebanks: A Data-Driven Approach to Syntax

• Treebanks solve the second knowledge acquisition problem as well.

• Since each sentence in a treebank has been given its most plausible (probable) syntactic analysis, supervised machine learning methods can be used to learn a scoring function over all possible syntax analyses.
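
• nltk ships with a small sample of the Penn Treebank, so the expert-annotated parses in a treebank can be inspected directly (a sketch; assumes the 'treebank' corpus has been downloaded):

import nltk

# fetch the bundled Penn Treebank sample
nltk.download('treebank')
from nltk.corpus import treebank

# each sentence comes with its expert-judged parse tree;
# the treebank itself provides no explicit grammar or rule list
tree = treebank.parsed_sents('wsj_0001.mrg')[0]
print(tree)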
CFG with LMD and RMD
CFG: A CFG consists of a finite set of grammar rules having the following four components:
Set of Non-Terminals
Set of Terminals
Set of Productions
Start Symbol

Set of Non-Terminals
• It is represented by V. The non-terminals are syntactic variables that denote sets of strings, which help in defining the language generated by the grammar.
CFG with LMD and RMD
• Set of Terminals
• Terminals are also known as tokens and are represented by Σ. Strings are formed from these basic symbols.
• Set of Productions
• It is represented by P. This set describes how the terminals and non-terminals can be combined.
• Start Symbol
• The derivation begins from the start symbol, which is represented by S. The start symbol is always a non-terminal.
CFG with LMD and RMD
• Left Most Derivation (LMD) and Derivation Tree:
• The leftmost derivation of a string from the start symbol S is done by replacing the leftmost non-terminal symbol with the RHS of the corresponding production rule.
• (The grammar G referred to here appeared in the preceding figure; from the derivations it can be read as S -> SS | aSb | bSa | ε.)
• For example, the leftmost derivation of the string abab from grammar G is: S => aSb => abSab => abab
• At each step, the leftmost non-terminal is replaced using a production rule.
CFG with LMD and RMD
• Right Most Derivation (RMD):
• The rightmost derivation of a string from the start symbol S is done by replacing the rightmost non-terminal symbol with the RHS of the corresponding production rule.
• For example, the rightmost derivation of the string abab from grammar G is: S => SS => SaSb => Sab => aSbab => abab
• At each step, the rightmost non-terminal is replaced using a production rule.
• The derivation tree for abab using the rightmost derivation is shown below.
CFG with LMD and RMD
• Ambiguous Context-Free Grammar:

• A context-free grammar is called ambiguous if there exists more than one LMD or more than one RMD for some string generated by the grammar.

• There will also be more than one derivation tree for such a string in an ambiguous grammar.

• The grammar described above is ambiguous because there are two derivation trees (Figure 1 and Figure 2).

• There is more than one rightmost derivation for the string abab:

S => SS => SaSb => Sab => aSbab => abab

S => aSb => abSab => abab
CFG with LMD and RMD
Text Normalization
• Text normalization is the process of transforming text into a consistent
and standardized format.

• Text normalization is a critical preprocessing step in NLP, as it helps


reduce noise and ensures that text data is consistent and suitable for
various tasks like sentiment analysis, machine learning, and information
retrieval.

• The primary goals of text normalization are to reduce text variations and
enhance the accuracy and efficiency of text processing.

• Common text normalization tasks are : Tokenization, Removing


Punctuation, Stemming and Lemmatization, Removing Stop
Words, Spell Checking and Correction, Handling Numbers,
Removing HTML Tags etc.
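
• As a small illustration, a minimal sketch combining two of these steps (tokenization and punctuation removal, after lowercasing) might look like this:

import string
from nltk.tokenize import word_tokenize

# lowercase, tokenize, then drop punctuation tokens; further steps
# (stemming, stop-word removal, spell correction) would be chained on
text = "Text Normalization reduces VARIATIONS in text!!"
tokens = word_tokenize(text.lower())
tokens = [t for t in tokens if t not in string.punctuation]
print(tokens)
# ['text', 'normalization', 'reduces', 'variations', 'in', 'text']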
Text Normalization
• Removing Stopwords :

• StopWords: A stopword is a commonly used word that a


search engine has been programmed to ignore, both
when indexing entries for searching and when
retrieving them as the result of a search query.

• For instance, nltk can list the English language


stopwords from the corpus using the code below:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
Text Normalization
• Removing Stopwords :
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
"you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its',
'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who',
'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were',
'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during',
'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on',
'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
"didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
"haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't",
'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
"wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Total: 179 stopwords.
Removing Stopwords:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# converts the words in word_tokens to lower case and then checks
# whether they are present in stop_words or not
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]

# with no lower case conversion
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Output:
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
Text Normalization
• Correcting Words:

• Each method takes a list of misspelled words and suggests the correct word for each incorrect word.

• It tries to find a word in the list of correct spellings that has the shortest distance to, and the same initial letter as, the misspelled word.

• It then returns the word that matches the given criteria.

• The word-correction methods differ in the distance measure used to find the closest word.
Text Normalization
• Method 1: Using the Jaccard distance

• This method is mainly used to measure similarity between two sets or between two asymmetric binary vectors.

• Jaccard distance = 1 - J(A, B), where J(A, B) = |A ∩ B| / |A ∪ B| is the Jaccard index.
Text Normalization
• Example of correcting words using the Jaccard distance method:

• With J(A, B) = 1/3 ≈ 0.33, Jaccard distance = 1 - J(A, B) = 1 - 0.33 ≈ 0.67
Text Normalization
# Python program using nltk for correcting words
# with the Jaccard distance method

# importing the nltk suite
import nltk

# importing jaccard_distance and ngrams
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams

# downloading and importing the 'words' package from the nltk corpus
nltk.download('words')
from nltk.corpus import words
correct_words = words.words()

w1 = set('cat')
w2 = set('cap')
print(jaccard_distance(w1, w2))

Output: 0.5
Text Normalization
# Correcting words using the Jaccard distance method
# list of incorrect spellings that need to be corrected
incorrect_words = ['happpy', 'azmaing', 'intelliengt']

# loop for finding correct spellings based on Jaccard distance
# over character bigrams, printing the correct word for each
for word in incorrect_words:
    temp = [(jaccard_distance(set(ngrams(word, 2)),
                              set(ngrams(w, 2))), w)
            for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])

Output:
happy
amazing
intelligent
Text Normalization
• Method 2: Using the Edit distance

• The edit distance between a source string and a target string is the minimum number of edit operations (deletions, insertions, or substitutions) required to transform the source into the target.

• The lower the distance, the more similar the two strings.
Text Normalization
• Example Python program using the edit distance method

# importing the nltk suite
import nltk

# importing edit distance
from nltk.metrics.distance import edit_distance

# downloading and importing the 'words' package
nltk.download('words')
from nltk.corpus import words
correct_words = words.words()

w1 = 'cat'
w2 = 'cap'
print(edit_distance(w1, w2))

Output: 1
Text Normalization
# Correcting words using the edit distance method:
# list of incorrect spellings that need to be corrected
incorrect_words = ['intelliengt']

# loop for finding correct spellings based on edit distance,
# printing the correct word
for word in incorrect_words:
    temp = [(edit_distance(word, w), w)
            for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])

Output: intelligent
Parts of Speech (POS) Tagging
• Part-of-speech (POS) tagging, also known as speech
tagging or word-category disambiguation, is a
fundamental task in natural language processing
(NLP).

• It involves the process of assigning a specific part-of-


speech tag (a grammatical category or label) to each
word in a sentence or a text.

• These tags represent the syntactic and grammatical


role of each word within the sentence.
Parts of Speech (POS) Tagging
• Input: I am going.
• Output:
With stop words:
[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('.', '.')]
Without stop words:
[('I', 'PRP'), ('going', 'VBG'), ('.', '.')]
Where:
PRP - personal pronoun
VBP - verb, non-3rd person singular present
VBG - verb, gerund/present participle
Parts of Speech (POS) Tagging

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
txt = "Everything is all about money."

# sent_tokenize is an instance of PunktSentenceTokenizer
# from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for i in tokenized:
    # word tokenizers find the words and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]
    # using a part-of-speech tagger (POS-tagger)
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

Output:
[('Everything', 'VBG'), ('money', 'NN'), ('.', '.')]
Parts of Speech (POS) Tagging
# with stop words
import nltk

# download required nltk packages
nltk.download('punkt')                        # required for tokenization
nltk.download('averaged_perceptron_tagger')   # required for POS tagging

# input text
sentence = """Today morning, I am Very Happy."""

# tokenize into words
tokens = nltk.word_tokenize(sentence)

# parts of speech tagging
tagged = nltk.pos_tag(tokens)

# print tagged tokens
print(tagged)

Output:
[('Today', 'NN'), ('morning', 'NN'), (',', ','), ('I', 'PRP'),
 ('am', 'VBP'), ('Very', 'RB'), ('Happy', 'JJ'), ('.', '.')]
Parts of Speech (POS) Tagging
# without stop words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
txt = "Today morning, I am Very Happy."

# sent_tokenize is an instance of PunktSentenceTokenizer
# from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for i in tokenized:
    # word tokenizers find the words and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]
    # using a part-of-speech tagger (POS-tagger)
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

Output:
[('Today', 'NN'), ('morning', 'NN'), (',', ','), ('I', 'PRP'),
 ('Very', 'RB'), ('Happy', 'JJ'), ('.', '.')]
Parsing Algorithms
• Parsing algorithms in natural language processing (NLP)
are methods and techniques used to analyze the
grammatical structure of sentences and generate parse
trees or dependency structures.

• Parsing plays a crucial role in understanding the syntactic


and semantic relationships between words in a sentence.

• There are various parsing algorithms used in NLP; the choice among them depends on the specific NLP task and the nature of the text being parsed.
Parsing Algorithms
• Shift-Reduce Parsing

• The shift-reduce parser constructs a parse in the same manner as bottom-up parsing, i.e., the parse tree is constructed from the leaves (bottom) to the root (up).

• A more general form of the shift-reduce parser is the LR parser.

• This parser requires some data structures, namely:

• An input buffer for storing the input string.

• A stack for storing and accessing the production rules.
Parsing Algorithms
• To build a parser, we need an algorithm that can perform the steps of such a rightmost derivation in reverse, for any grammar and any input string.

• Basic Operations –

• Shift: This involves moving symbols from the input buffer


onto the stack.

• Reduce: If the handle appears on top of the stack then, it is


reduced by using appropriate production rule i.e. RHS of a
production rule is popped out of a stack and LHS of a
production rule is pushed onto the stack.
Parsing Algorithms
• Accept: If only the start symbol is present on the stack and the input buffer is empty, then the parsing action is called accept. Reaching the accept action means parsing has completed successfully.

• Error: This is the situation in which the parser can perform neither a shift action nor a reduce action, and cannot accept.

• Example 1 – Consider the grammar


S –> S + S
S –> S * S
S –> id

• Perform Shift Reduce parsing for input string “id + id + id”.


Shift-Reduce Parsing
Stack      | Input String    | Action
$          | id + id + id $  | Shift id
$ id       | + id + id $     | Reduce S --> id
$ S        | + id + id $     | Shift +
$ S +      | id + id $       | Shift id
$ S + id   | + id $          | Reduce S --> id
$ S + S    | + id $          | Reduce S --> S + S
$ S        | + id $          | Shift +
$ S +      | id $            | Shift id
$ S + id   | $               | Reduce S --> id
$ S + S    | $               | Reduce S --> S + S
$ S        | $               | Accept
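
• The same kind of grammar can be tried with nltk's built-in shift-reduce parser (a sketch; note that nltk's ShiftReduceParser resolves conflicts greedily and does not backtrack, so it finds at most one parse):

import nltk

# the example grammar, with '+' and '*' as terminals
grammar = nltk.CFG.fromstring("""
S -> S '+' S | S '*' S | 'id'
""")

# trace=2 prints each shift and reduce step as it happens
parser = nltk.ShiftReduceParser(grammar, trace=2)
for tree in parser.parse(['id', '+', 'id', '+', 'id']):
    print(tree)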
Shift-Reduce Parsing
• Example 2 – Consider the grammar:
S → S + S
S → S - S
S → (S)
S → a

• Perform shift-reduce parsing for the input string "a1 - (a2 + a3)".
Shift-Reduce Parsing
• Advantages:
1. Shift-reduce parsing is efficient and can handle a wide
range of context-free grammars.

2. It can parse a large variety of programming languages


and is widely used in practice.

3. It is capable of handling both left and right-recursive


grammars, which can be important in parsing certain
programming languages.

4. The parse table generated for shift-reduce parsing is


typically small, which makes the parser efficient in
terms of memory usage.
Shift-Reduce Parsing
• Disadvantages:
1. Shift-reduce parsing has a limited look ahead, which
means that it may miss some syntax errors that require
a larger look ahead.

2. It may also generate false-positive shift-reduce conflicts,


which can require additional manual intervention to
resolve.

3. Shift-reduce parsers may have difficulty in parsing


ambiguous grammars, where there are multiple possible
parse trees for a given input sequence.

4. In some cases, the parse tree generated by shift-reduce


parsing may be more complex than other parsing
techniques.
Parsing algorithm
• Hypergraphs and Chart Parsing:

• Hypergraphs and chart parsing are two related concepts in natural


language processing (NLP) and computational linguistics.

• They are used to represent and analyze the structure of sentences and
syntactic or semantic relationships in a more flexible and expressive
manner than traditional parsing techniques.

• Hypergraphs:

• A hypergraph is a generalization of a graph, where edges (hyperedges)


can connect more than two nodes.

• In the context of NLP, hypergraphs are often used to represent complex


syntactic or semantic structures in a sentence.
Parsing algorithm
• Each node in a hypergraph typically corresponds to a word or a sub-phrase,
and hyperedges connect nodes to represent relationships between them.

• In NLP, hypergraphs are commonly used in parsing and semantic analysis


tasks to represent various linguistic structures, such as constituency and
dependency parses.

• Chart Parsing

• Chart parsing is a parsing technique used in NLP to construct parse charts,


which are data structures that store and organize partial parsing results for
sentences.

• It's a dynamic programming approach that incrementally builds and


combines parse trees or structures as it processes words in a sentence.
Parsing algorithm
• Chart parsing is commonly used for constituency parsing and
dependency parsing.

• The key components of chart parsing include:

• Chart: The data structure that stores parsing results for sub-phrases.

• Earley Parser: A popular algorithm for chart parsing, it incrementally


adds and combines parsing states in the chart as it processes input
words.

• Chart parsing can handle ambiguity, non-projectivity, and other


linguistic phenomena, making it a powerful tool for syntactic and
semantic analysis in NLP.

• An example of chart parsing is the CYK algorithm.
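
• Before turning to CYK, here is a minimal sketch of chart parsing with nltk's EarleyChartParser (trace=1 prints chart edges as they are added):

import nltk
from nltk.parse import EarleyChartParser

# a toy grammar for the running example "I saw a fox"
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I' | Det N
VP -> V NP
Det -> 'a'
N -> 'fox'
V -> 'saw'
""")

parser = EarleyChartParser(grammar, trace=1)
for tree in parser.parse(['I', 'saw', 'a', 'fox']):
    print(tree)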


Parsing algorithm
• CYK algorithm

• The CYK algorithm is a parsing algorithm for context-free


grammar.

• It is used to check if a particular string can be derived from


the language generated by a given grammar.

• It is also called the membership algorithm as it tells whether


the given string is a member of the given grammar or not.

• It was developed independently by John Cocke, Daniel Younger, and Tadao Kasami, hence the name CYK!
Parsing algorithm
• In the CYK algorithm, the grammar must be in Chomsky normal form (CNF).

• The CYK algorithm is a dynamic programming-based parsing technique, or table-filling algorithm, used to identify the possible constituency parses of a given sentence.

• The grammar is in CNF if each rule has one of the following forms:
• A → BC (at most two non-terminals on the right-hand side)
• A → a (a single terminal on the right-hand side)
• S → ε (the null string, for the start symbol only)
Parsing algorithm
• If the given Grammar is not in the CNF form, convert it to the CNF
form before applying the CYK Algorithm.

• Deciding Membership Using CYK Algorithm

• To decide the membership of any given string, we construct a


triangular table where

• Each row of the table corresponds to one particular length of substrings.

• The bottom-most row corresponds to substrings of length 1.

• The second row from the bottom corresponds to substrings of length 2.

• The third row from the bottom corresponds to substrings of length 3.

• The top-most row corresponds to the given string of length n.
CYK algorithm
• For a given string 'w' of length 4, the triangular table looks as shown below.

• Let w be the string of length n to be parsed, and let G represent the set of rules in our grammar with start symbol S.

1. Construct a table of size n × n.

2. If w = ε (the empty string) and S -> ε is a rule in G, then we accept the string; otherwise we reject.
CYK algorithm
3. For i = 1 to n:
     For each variable A:
       Check if A -> b is a rule and b = w_i:
         If so, place A in cell (i, i) of the table.

4. For l = 2 to n:
     For i = 1 to n-l+1:
       j = i+l-1
       For k = i to j-1:
         For each rule A -> BC:
           Check if cell (i, k) contains B and cell (k+1, j) contains C:
             If so, put A in cell (i, j) of the table.

5. Check if S is in cell (1, n):
     If so, accept the string.
     Else, reject.
CYK algorithm
• Example 1. Let our grammar G be:
S -> AB | BB
A -> CC | AB | a
B -> BB | CA | b
C -> BA | AA | b

We check if aabb is in L(G).

• Example 2. Let our grammar G be:
S -> AB | BC
A -> BA | a
B -> CC | b
C -> AB | a

We check if baaba is in L(G).
CYK algorithm
• Example 3. Let our grammar G be:
S -> S + S
S -> S * S
S -> id

We check if the string id * id + id is in L(G).
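
• To make the table-filling concrete, here is a minimal CYK membership checker in Python (a sketch, not tied to any library; the grammar is passed as a plain dict, shown here with Example 2's grammar):

# CYK membership test for a grammar in CNF (a minimal sketch).
# The grammar maps each variable to a list of right-hand sides,
# where each RHS is a tuple of symbols.
def cyk(word, grammar, start='S'):
    n = len(word)
    # table[i][j] holds the set of variables deriving word[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    # length-1 substrings: apply rules A -> a
    for i, ch in enumerate(word):
        for lhs, rhss in grammar.items():
            if (ch,) in rhss:
                table[i][i].add(lhs)
    # longer substrings: apply rules A -> BC over every split point k
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for lhs, rhss in grammar.items():
                    for rhs in rhss:
                        if (len(rhs) == 2
                                and rhs[0] in table[i][k]
                                and rhs[1] in table[k + 1][j]):
                            table[i][j].add(lhs)
    # accept iff the start symbol derives the whole string
    return start in table[0][n - 1]

# Example 2's grammar:
G = {'S': [('A', 'B'), ('B', 'C')],
     'A': [('B', 'A'), ('a',)],
     'B': [('C', 'C'), ('b',)],
     'C': [('A', 'B'), ('a',)]}
print(cyk('baaba', G))  # True: baaba is in L(G)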
