Syntax Analysis
•Parsing Natural Language
• Parsing in NLP is the process of determining the syntactic structure of a text by analyzing its constituent words according to an underlying grammar.
• Sentence -> noun_phrase, verb_phrase
• Noun_phrase -> proper_noun
• Noun_phrase -> determiner, noun
• Verb_phrase -> verb, noun_phrase
• Proper_noun -> [Tom]
• Verb -> [ate]
• Determiner -> [an]
• Noun -> [apple]
  Syntax Analysis:
• The outcome of the parsing process is a parse tree in which sentence is the root, the intermediate nodes such as noun_phrase, verb_phrase, etc. are non-terminals, and the leaves of the tree 'Tom', 'ate', 'an', 'apple' are terminals.
• The following structure shows the parse tree for the sentence "Tom ate an apple" (a small NLTK sketch of this grammar follows).
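To make this concrete, here is a minimal sketch of the same grammar in NLTK (assuming nltk is installed; only the spelling of the rules is adapted to NLTK's CFG notation):

import nltk

# The toy grammar above, written in NLTK's CFG notation
grammar = nltk.CFG.fromstring("""
Sentence -> Noun_phrase Verb_phrase
Noun_phrase -> Proper_noun | Determiner Noun
Verb_phrase -> Verb Noun_phrase
Proper_noun -> 'Tom'
Verb -> 'ate'
Determiner -> 'an'
Noun -> 'apple'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['Tom', 'ate', 'an', 'apple']):
    tree.pretty_print()   # Sentence is the root; 'Tom', 'ate', 'an', 'apple' are the leaves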
  Syntax Analysis:
• Parsing natural language refers to the process of analyzing and
 interpreting the grammatical structure and meaning of a sentence or
 text.
• Parsing recovers information that is not explicitly given in the input sentence.
• To recover this information, the parser therefore needs additional knowledge, namely the grammar of the language.
• Context Free Grammar (CFG) is used to write the rules of syntax. For example:
• S -> NP VP
• NP -> 'Ajay' | 'pockets' | D N | N | NP PP
• VP -> V NP | VP PP
  Syntax Analysis
• V -> 'bought'
• D -> 'a'
• N -> 'shirt'
• PP -> P NP
• P -> 'on' | 'with' | 'in'
• where V = verb, NP = noun phrase, VP = verb phrase, PP = prepositional phrase, D = determiner, N = noun, and P = preposition.
• Lexical rules are typically stated in the form X -> w, where 'X' is the part of speech of the word 'w', the generated terminal symbol.
• For example, in the rule V -> 'saw', V is the part of speech that generates the verb 'saw'.
  Syntax Analysis
• The sentence 'Ajay bought a shirt with pockets' can have two possible parses based on the rules stated in the CFG.
• In the first parse, the phrase 'with pockets' attaches to the verb phrase, as if pockets were the means (a kind of currency) used to buy the shirt; in the second parse, it attaches to the noun phrase, meaning that Ajay is purchasing a shirt that has pockets.
• 1st parse (PP attached to the VP):
  (S (NP Ajay)
     (VP (VP (V bought)
             (NP (D a) (N shirt)))
         (PP (P with)
             (NP pockets))))

• 2nd parse (PP attached to the NP):
  (S (NP Ajay)
     (VP (V bought)
         (NP (NP (D a) (N shirt))
             (PP (P with)
                 (NP pockets)))))
Parse Tree
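The two parses above can be reproduced with NLTK's chart parser. The grammar below is a sketch reconstructed to match the rules listed earlier (assuming nltk is installed); it is not claimed to be the exact rule set from the original slides:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'Ajay' | 'pockets' | D N | NP PP
VP -> V NP | VP PP
PP -> P NP
V -> 'bought'
D -> 'a'
N -> 'shirt'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse('Ajay bought a shirt with pockets'.split()):
    print(tree)   # one parse attaches the PP to the VP, the other to the NP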
 Syntax Analysis
• From the above example we see that writing down all possible syntactic constructions of a language is a difficult and complex task, as they cannot be listed in a single, fixed form.
• All the possibilities need to be explored based on the part-of-speech tags given for a particular word.
• It is also difficult to specify the lexical properties of a particular word; this is the first knowledge acquisition problem.
• The second knowledge acquisition problem arises from the fact that it is not sufficient to know the syntactic rules of a language: because of ambiguity in syntactic analysis, we also need to know which analysis is the most plausible for the sentence in terms of meaning.
  Syntax Analysis
• For example, consider the CFG:
• N -> N N
• N -> 'natural' | 'language' | 'processing' | 'book'
• If we consider the input string "natural language processing", two ambiguous parses can be generated.
• Parse 1:
  (N (N (N natural)
        (N language))
     (N processing))

• Parse 2:
  (N (N natural)
     (N (N language)
        (N processing)))
 Syntax Analysis
• Parse 1 corresponds to the meaning "processing of natural language", while parse 2 corresponds to "a natural way to do language processing".
• This ambiguity problem is a limitation of CFGs.
                     Shallow Parsing
• A parser takes input in the form of a sequence of tokens and produces output in the form of a parse tree.
                 Shallow Parsing
Statement: He ate the pizza
G = (V, T, P, S)
V - set of variables (non-terminals) = {S, NP, VP, ART, N, V, PRO}
T - set of terminals = {he, ate, the, pizza}
P - set of production rules
S - start symbol
P: { S -> NP VP
     NP -> PRO
     NP -> ART N
     VP -> V NP
     PRO -> he
     V -> ate
     ART -> the
     N -> pizza }
Shallow Parsing
               Shallow Parsing
• Parse tree for the sentence “The rose smells wonderful”
                    Shallow Parsing
• Parsing is of two types: top-down parsing and bottom-up parsing.
                     Shallow Parsing
• Top-down parsing
• Top-down parsing is also known as recursive-descent parsing or predictive parsing.
• In top-down parsing, parsing starts from the start symbol and transforms it into the input string.
• The parse tree of the input string "Book that flight" is as follows (a small sketch with NLTK's recursive-descent parser appears below):
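A top-down parse of this sentence can be sketched with NLTK's RecursiveDescentParser; the tiny grammar below is an assumption chosen just to cover this one sentence, not the grammar from the original slide:

import nltk

grammar = nltk.CFG.fromstring("""
S -> VP
VP -> V NP
NP -> Det N
V -> 'Book'
Det -> 'that'
N -> 'flight'
""")

rd_parser = nltk.RecursiveDescentParser(grammar)
for tree in rd_parser.parse(['Book', 'that', 'flight']):
    tree.pretty_print()   # expansion starts from S and works down to the words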
                         Shallow Parsing
• Bottom-up parsing
• Bottom-up parsing is also known as shift-reduce parsing.
• Bottom-up parsing is used to construct a parse tree for an input string.
• In bottom-up parsing, parsing starts with the input symbols and constructs the parse tree up to the start symbol by tracing out the rightmost derivation of the string in reverse.
• The parse tree of the input string "id * id" is as follows:
                Shallow Parsing
Given grammar (for input string "id * id"):
E -> T
T -> T * F
T -> id
F -> T
F -> id
                         Shallow Parsing
import nltk

# POS-tagged input sentence ("was" and "jumping" are tagged VBD and VBG here)
sentence = [
  ("a", "DT"),
  ("clever", "JJ"),
  ("fox", "NN"),
  ("was", "VBD"),
  ("jumping", "VBG"),
  ("over", "IN"),
  ("the", "DT"),
  ("wall", "NN")
]

# Chunk grammar: an NP is an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
Reg_parser = nltk.RegexpParser(grammar)
Output = Reg_parser.parse(sentence)
# Output.draw()   # opens a window showing the chunk tree
print(Output)
Dependency-based Parsing
• The parsing can also be divided into dependency-based and constituency-
 based parsing.
• Dependency-based parsing is a type of syntactic parsing in natural
 language processing (NLP) that focuses on representing the grammatical
 structure of a sentence through dependencies between words.
• In a dependency-based parse, each word in a sentence is associated with
 its syntactic relationship to other words.
• Dependency parsing displays only the relationships between words (heads and their dependents), while constituency parsing displays the entire sentence structure and relationships.
• All the nodes in the tree are words and the links among words are
 labelled by syntactic function.
Dependency-based Parsing
• A dependency parse tree is a graph where the set of
 vertices contains the words in the sentence, and each
 edge connects two words.
• The graph must satisfy three conditions:
1. There has to be a single root node with no incoming edges.
2. For each node v, there must be a path from the root to v.
3. Each node except the root must have exactly one incoming edge.
Dependency-based Parsing
• Additionally, each edge in the graph has a type, which defines the grammatical relation that holds between the two words.
• Each relationship:
Has one head and a dependent that modifies the head.
Is labelled according to the nature of the dependency
 between the head and the dependent.
 These labels can be found at Universal Dependency
 Relations.
Dependency-based Parsing
• Example – 1: Black car
• There is a relationship between car and black, because black modifies the meaning of car.
• Here, car acts as the head and black is a dependent of the head.
• The nature of the relationship here is amod, which stands for "adjectival modifier": an adjective or adjective phrase that modifies a noun.
Dependency-based Parsing
• Example – 2 : I saw a fox
• The root of the tree is the verb of the sentence, and edges
 between words describe their relationships.
• The word “saw” has an outgoing edge of type nsubj to the
 word “I”, meaning that “I” is the nominal subject of the
 verb “saw”.
• In this case, we say that “I” depends on “saw”.
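As a minimal sketch, the dependency relations for this example can be obtained with spaCy (an assumption: spaCy and its en_core_web_sm model are installed; neither is used elsewhere in these notes):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I saw a fox")

for token in doc:
    # token.dep_ is the relation label; token.head is the word this token depends on
    print(token.text, token.dep_, token.head.text)

# Expected output (roughly): 'I' -> nsubj of 'saw', 'saw' -> ROOT,
# 'a' -> det of 'fox', 'fox' -> dobj of 'saw'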
Dependency-based Parsing
• Example 3: Deemed Universities charge huge
  fees.
• The dependency-parsing representation is not based on phrasal constituents; each word is attached directly to its head word.
  Constituency -based Parsing
• Constituency parsing displays the entire sentence structure and the relationships within it.
• Constituency parsing is the process of analyzing sentences by breaking them down into sub-phrases, also known as constituents.
• These sub-phrases belong to a specific grammatical category, such as NP (noun phrase) and VP (verb phrase).
• Constituency parsing aims to extract a constituency-based parse tree from a sentence that represents its syntactic structure according to a phrase structure grammar or CFG.
  Constituency -based Parsing
• Example 1: John sees Bill.
• Example 2: I saw a fox.
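As a sketch, a constituency tree for Example 1 above can be written in bracketed form and displayed with NLTK (the bracketing and tag names are my own choices for illustration):

from nltk import Tree

# Bracketed constituency structure for "John sees Bill"
t = Tree.fromstring("(S (NP John) (VP (V sees) (NP Bill)))")
t.pretty_print()   # draws the phrase-structure tree as ASCII art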
Constituency Grammar (CG) and
Dependency Grammar(DG)
• Constituency Grammar (CG)
• Constituency Grammar (CG), also known as Phrase Structure
 Grammar, is a formalism in the field of linguistics and natural
 language processing (NLP) used to describe the hierarchical
 structure of sentences in natural language.
• Constituency Grammar focuses on dividing sentences into
 constituents   (phrases    or   word   groups)   based   on   their
 grammatical structure.
• Each constituent is represented as a hierarchical tree structure,
 which is often referred to as a "parse tree."
Constituency Grammar (CG) and
Dependency Grammar(DG)
• A phrase-structure grammar is defined by a finite vocabulary (alphabet) Vp, a finite set Σ of initial strings in Vp, and a finite set F of rules of the form X → Y, where X and Y are strings in Vp.
• In linguistics, phrase structure grammars are all those grammars
 that are based on the constituency relation, as opposed to the
 dependency relation associated with dependency grammars.
• The    constituency      relation    derives     from      the subject-
 predicate division of Latin and Greek grammars.
Constituency Grammar (CG) and
Dependency Grammar(DG)
• Here, we study the clause structure in terms of
  noun phrase NP and verb phrase VP.
• Sentence: This tree is illustrating IC-analysis
  according to the constituency relation
Constituency Grammar (CG) and
Dependency Grammar(DG)
• Dependency Grammar(DG)
• It is opposite to the constituency grammar and is based on
    the dependency relation.
• Dependency grammar (DG) is opposite to constituency
    grammar because it lacks phrasal nodes.
• It emphasizes the relationships between words in a
    sentence.
• In Dependency Grammar, the words are connected to each
    other by directed links.
Constituency Grammar (CG) and
Dependency Grammar(DG)
• Every other syntactic unit is connected to the verb by a directed link.
• These syntactic units are called dependencies.
• Sentence: This tree is illustrating the dependency
  relation
  Difference between constituency parsing and
  dependency parsing
• Constituency parsing focuses on identifying the constituent structure of a sentence, such as noun phrases and verb phrases; dependency parsing focuses on identifying the grammatical relationships between words in a sentence, such as subject-verb relationships.
• Constituency parsing uses phrase structure grammar, such as context-free grammar or dependency grammar; dependency parsing uses dependency grammar, which represents the relationships between words as labeled directed arcs.
• Constituency parsing is based on a top-down approach, where the parse tree is built from the root node down to the leaves; dependency parsing is based on a bottom-up approach, where the parse tree is built from the leaves up to the root.
    Difference between constituency parsing and
    dependency parsing
• Constituency parsing represents a sentence as a tree structure with non-overlapping constituents; dependency parsing represents a sentence as a directed graph, where words are represented as nodes and grammatical relationships are represented as edges.
• Constituency parsing is more suitable for natural language understanding tasks; dependency parsing is more suitable for natural language generation tasks and dependency-based machine learning models.
• Constituency parsing is more expressive and captures more syntactic information, but can be more complex to compute and interpret; dependency parsing is simpler and more efficient, but may not capture as much syntactic information as constituency parsing.
  Difference between constituency parsing and
  dependency parsing
• Constituency parsing is more appropriate for languages with rich morphology, such as agglutinative languages; dependency parsing is more appropriate for languages with less morphological inflection, like English and Chinese.
• Constituency parsing is used for more traditional NLP tasks like named entity recognition, text classification, and sentiment analysis; dependency parsing is used for more advanced NLP tasks like machine translation, language modelling, and text summarization.
 Syntax Analysis
•Treebanks: A Data-Driven Approach to Syntax
• Parsing recovers information that is not explicit in the input
 sentence.
• This implies that a parser requires some knowledge in addition
 to the input sentence about the kind of syntactic analysis that
 should be produced as output.
• One method to provide such knowledge to the parser is to write down a grammar of the language – a set of rules of syntactic analysis, such as a CFG.
• For natural language, however, it is far too complex to simply list all the syntactic rules in terms of a CFG.
Treebanks: A Data-Driven Approach to Syntax
• The second issue is the knowledge acquisition problem: not only do we need to know the syntactic rules of a particular language, but we also need to know which analysis is the most plausible for a given input sentence.
• The construction of a treebank is a data-driven approach to syntax analysis that allows us to address both of these knowledge acquisition bottlenecks in one stroke.
• A treebank is simply a collection of sentences (also called a corpus of text), where each sentence is provided with a complete syntax analysis.
Treebanks: A Data-Driven Approach to Syntax
• The syntactic analysis for each sentence has been judged by a human expert as the most plausible analysis for that sentence.
• A lot of care is taken during the human annotation
  process to ensure that a consistent treatment is provided
  across the treebank for related grammatical phenomena.
• There is no set of syntactic rules or linguistic grammar
  explicitly provided by a treebank, and typically there is
  no list of syntactic constructions provided explicitly in a
  treebank.
Treebanks: A Data-Driven Approach to Syntax
• A detailed set of assumptions about the syntax is typically
  used as an annotation guideline to help the human
  experts produce the single-most plausible syntactic
  analysis for each sentence in the corpus.
• Treebanks provide a solution to the two kinds of
  knowledge acquisition bottlenecks.
• Treebanks solve the first knowledge acquisition problem of
  finding the grammar underlying the syntax analysis
  because the syntactic analysis is directly given instead of
  a grammar.
Treebanks: A Data-Driven Approach to Syntax
• Treebanks solve the second knowledge acquisition problem as well.
• Since each sentence in a treebank has been given its most plausible syntactic analysis, supervised machine learning methods can be used to learn a scoring function over all possible syntax analyses.
  CFG with LMD and RMD
CFG: CFG consists of a finite set of grammar rules having the
 following four components
Set of Non-Terminals
Set of Terminals
Set of Productions
Start Symbol
Set of Non-terminals
• It is represented by V. The non-terminals are syntactic variables that denote sets of strings, which help define the language generated by the grammar.
 CFG with LMD and RMD
• Set of Terminals
• It is also known as tokens and represented by Σ. Strings
 are formed with the help of the basic symbols of terminals.
• Set of Productions
• It is represented by P. The set gives an idea about how the
 terminals and non-terminals can be combined.
• Start Symbol
• Derivations begin from the start symbol, which is represented by the symbol S. The start symbol is always a non-terminal symbol.
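The four components can be inspected directly in NLTK; the toy grammar below is an assumption used only for illustration:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'he' | 'it'
VP -> V NP
V -> 'saw'
""")

print(grammar.start())          # the start symbol S
for prod in grammar.productions():
    print(prod)                 # each production combines non-terminals and terminals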
CFG with LMD and RMD
CFG with LMD and RMD
    CFG with LMD and RMD
• Left Most Derivation (LMD) and Derivation Tree:
• A leftmost derivation of a string from the start symbol S is obtained by repeatedly replacing the leftmost non-terminal symbol by the RHS of a corresponding production rule.
• For example, the leftmost derivation of the string abab from the grammar G above (whose productions include S -> aSb | bSa | SS | ε) is: S => aSb => abSab => abab.
• At each step, the leftmost non-terminal is the symbol replaced using a production rule.
 CFG with LMD and RMD
• Right Most Derivation (RMD):
• A rightmost derivation of a string from the start symbol S is obtained by repeatedly replacing the rightmost non-terminal symbol by the RHS of a corresponding production rule.
• For example, a rightmost derivation of the string abab from the grammar G above is: S => SS => SaSb => Sab => aSbab => abab.
• At each step, the rightmost non-terminal is the symbol replaced using a production rule.
• The derivation tree for abab using the rightmost derivation is shown in the figure.
  CFG with LMD and RMD
• Ambiguous Context Free Grammar:
• A context-free grammar is called ambiguous if there exists more than one LMD or more than one RMD for some string generated by the grammar.
• There is then also more than one derivation tree for that string in the ambiguous grammar.
• The grammar described above is ambiguous because there are two derivation trees for abab (Figure 1 and Figure 2).
• There is more than one RMD for the string abab, for example:
  S => SS => SaSb => Sab => aSbab => abab
  S => aSb => abSab => abab
CFG with LMD and RMD
   Text Normalization
• Text normalization is the process of transforming text into a consistent
 and standardized format.
• Text normalization is a critical preprocessing step in NLP, as it helps
 reduce noise and ensures that text data is consistent and suitable for
 various tasks like sentiment analysis, machine learning, and information
 retrieval.
• The primary goals of text normalization are to reduce text variations and
 enhance the accuracy and efficiency of text processing.
• Common text normalization tasks are : Tokenization, Removing
 Punctuation, Stemming and Lemmatization, Removing Stop
 Words, Spell Checking and Correction, Handling Numbers,
 Removing HTML Tags etc.
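Two of these tasks, lower-casing and punctuation removal, can be sketched with the Python standard library alone (the sample text is an assumption):

import string

text = "Hello, World!! NLP is FUN."
lowered = text.lower()                                            # case folding
no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
print(no_punct)   # -> hello world nlp is fun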
  Text Normalization
• Removing Stopwords :
• Stop words: A stop word is a commonly used word that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
• For instance, nltk can list the English-language stop words from its corpus using the code below:
import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')   # run once to fetch the stopwords corpus
print(stopwords.words('english'))
   Text Normalization
• Removing Stopwords :
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
   "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
   'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its',
   'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who',
   'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were',
   'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
   'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
   'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during',
   'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on',
   'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
   'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
   'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
   's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
   'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
   "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
   "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't",
   'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
   "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Total: 179 stop words
Removing Stopwords :
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('stopwords'); nltk.download('punkt')   # required corpora/models

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# convert each token to lower case and keep it only if it is not a stop word
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]
Removing Stopwords:
# with no lower-case conversion
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Output:
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
  Text Normalization
• Correcting Words:
• Each method takes a list of misspelled words and suggests a correct word for each incorrect word.
• It tries to find a word in the list of correct spellings that has the shortest distance to, and the same initial letter as, the misspelled word.
• It then returns the word that matches the given criteria.
• The word-correction methods differ in the distance measure used to find the closest word.
  Text Normalization
• Method 1: Using the Jaccard distance method
• This method is mainly used to measure the similarity between two sets (or between two asymmetric binary vectors).
• The Jaccard index is J(A, B) = |A ∩ B| / |A ∪ B|, and Jaccard distance = 1 - J(A, B), so a smaller distance means more similar sets.
  Text Normalization
• Example of word correction using the Jaccard distance method:
• For instance, if A = {a, b} and B = {b, c}, then J(A, B) = |A ∩ B| / |A ∪ B| = 1/3 ≈ 0.33, so Jaccard distance = 1 - J(A, B) = 1 - 0.33 = 0.67.
 Text Normalization
# Python program using nltk for correcting words with the Jaccard distance method

# importing the nltk suite
import nltk
# importing jaccard_distance and ngrams
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
# downloading and importing the package 'words' from the nltk corpus
nltk.download('words')
from nltk.corpus import words
correct_words = words.words()

# Jaccard distance between the character sets of two words
w1 = set('cat')
w2 = set('cap')
print(nltk.jaccard_distance(w1, w2))

Output: 0.5
  Text Normalization
# Correcting words using the Jaccard distance method:
# list of incorrect spellings that need to be corrected
incorrect_words = ['happpy', 'azmaing', 'intelliengt']

# for each word, find the candidate (same first letter) with the smallest
# Jaccard distance between the sets of character bigrams, and print it
for word in incorrect_words:
    temp = [(jaccard_distance(set(ngrams(word, 2)), set(ngrams(w, 2))), w)
            for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])

Output:
happy
amazing
intelligent
  Text Normalization
• Method 2: Using Edit distance Method
• The distance between the source string and the
 target string is the minimum number of edit
 operations (deletions, insertions, or substitutions)
 required to transform the source into the target.
• The lower the distance, the more similar the two
 strings.
 Text Normalization
• Example Python program using the edit distance method
# importing the nltk suite
import nltk
# importing edit_distance
from nltk.metrics.distance import edit_distance
# downloading and importing the package 'words'
nltk.download('words')
 Text Normalization
from nltk.corpus import words
correct_words = words.words()

w1 = 'cat'
w2 = 'cap'
print(nltk.edit_distance(w1, w2))

Output: 1
  Text Normalization
# Correcting words using the edit distance method:
# list of incorrect spellings that need to be corrected
incorrect_words = ['intelliengt']

# for each word, find the candidate (same first letter) with the
# smallest edit distance and print it
for word in incorrect_words:
    temp = [(edit_distance(word, w), w)
            for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])

Output: intelligent
 Parts of Speech (POS) Tagging
• Part-of-speech (POS) tagging, also known as speech
 tagging or word-category disambiguation, is a
 fundamental task in natural language processing
 (NLP).
• It involves the process of assigning a specific part-of-
 speech tag (a grammatical category or label) to each
 word in a sentence or a text.
• These tags represent the syntactic and grammatical
 role of each word within the sentence.
 Parts of Speech (POS) Tagging
• Input: I am going.
• Output:
With stop words:
[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('.', '.')]
Without stop words:
[('I', 'PRP'), ('going', 'VBG'), ('.', '.')]
where
PRP: personal pronoun
VBP: verb, non-3rd person singular present
VBG: verb, gerund / present participle
  Parts of Speech (POS) Tagging
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
txt = "Everything is all about money."

# sent_tokenize is an instance of PunktSentenceTokenizer
# from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)

for i in tokenized:
    # word_tokenize finds the words and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]
    # using a part-of-speech tagger (POS tagger)
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

Output:
[('Everything', 'VBG'), ('money', 'NN'), ('.', '.')]
  Parts of Speech (POS) Tagging
# with stop words
import nltk

# download required nltk packages
nltk.download('punkt')                         # required for tokenization
nltk.download('averaged_perceptron_tagger')    # required for parts-of-speech tagging

# input text
sentence = """Today morning, I am Very Happy."""

# tokenize into words
tokens = nltk.word_tokenize(sentence)

# parts-of-speech tagging
tagged = nltk.pos_tag(tokens)

# print tagged tokens
print(tagged)

Output:
[('Today', 'NN'), ('morning', 'NN'), (',', ','), ('I', 'PRP'), ('am', 'VBP'), ('Very', 'RB'), ('Happy', 'JJ'), ('.', '.')]
  Parts of Speech (POS) Tagging
# without stop words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
txt = "Today morning, I am Very Happy."

# sent_tokenize is an instance of PunktSentenceTokenizer
# from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)

for i in tokenized:
    # word_tokenize finds the words and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]
    # using a part-of-speech tagger (POS tagger)
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

Output:
[('Today', 'NN'), ('morning', 'NN'), (',', ','), ('I', 'PRP'), ('Very', 'RB'), ('Happy', 'JJ'), ('.', '.')]
Parsing Algorithms
• Parsing algorithms in natural language processing (NLP) are methods and techniques used to analyze the grammatical structure of sentences and generate parse trees or dependency structures.
• Parsing plays a crucial role in understanding the syntactic and semantic relationships between words in a sentence.
• There are various parsing algorithms used in NLP; the choice of algorithm depends on the specific NLP task and the nature of the text being parsed.
 Parsing Algorithms
• Shift-Reduce Parsing
• A shift-reduce parser attempts to construct a parse in the same manner as bottom-up parsing, i.e. the parse tree is constructed from the leaves (bottom) to the root (up), and the process can be written out in the form of a table.
• A more general form of the shift-reduce parser is the LR parser.
• This parser requires some data structures, i.e.
  • An input buffer for storing the input string.
  • A stack for storing and accessing the grammar symbols recognized so far.
 Parsing Algorithms
• To build a parser, we need an algorithm that can perform the
 steps in the above rightmost derivation for any grammar
 and for any input string.
• Basic Operations –
• Shift: This involves moving symbols from the input buffer
 onto the stack.
• Reduce: If the handle appears on top of the stack, it is reduced using the appropriate production rule, i.e. the RHS of the production rule is popped off the stack and its LHS is pushed onto the stack.
 Parsing Algorithms
• Accept: If only the start symbol is present on the stack and the input buffer is empty, the parsing action is accept. When the accept action is reached, parsing has completed successfully.
• Error: This is the situation in which the parser can neither
 perform shift action nor reduce action and not even accept
 action.
• Example 1 – Consider the grammar
                  S –> S + S
                  S –> S * S
                  S –> id
• Perform Shift Reduce parsing for input string “id + id + id”.
    Shift-Reduce Parsing
Stack         Input String     Action
$             id + id + id $   Shift id
$ id            + id + id $    Reduce S --> id
$S              + id + id $    Shift +
$S+                id + id $   Shift id
$ S + id             + id $    Reduce S --> id
$S+S                 + id $    Reduce S --> S + S
$S                    + id $   Shift +
$S+                     id $   Shift id
$ S + id                   $   Reduce S --> id
$S+S                       $   Reduce S --> S + S
$S                         $   Accept
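NLTK's ShiftReduceParser can replay roughly this trace. The sketch below assumes nltk is installed and uses the grammar and input of Example 1; note that this greedy parser returns a single parse, not all of them:

import nltk

grammar = nltk.CFG.fromstring("""
S -> S '+' S
S -> S '*' S
S -> 'id'
""")

sr_parser = nltk.ShiftReduceParser(grammar, trace=2)   # trace=2 prints each shift/reduce step
for tree in sr_parser.parse('id + id + id'.split()):
    print(tree)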
  Shift-Reduce Parsing
• Example 2 – Consider the grammar
  S → S+S
  S → S-S
  S → (S)
  S→a
• Perform Shift Reduce parsing for input string
 “a1-(a2+a3)”
Shift-Reduce Parsing
  Shift-Reduce Parsing
• Advantages:
1. Shift-reduce parsing is efficient and can handle a wide
   range of context-free grammars.
2. It can parse a large variety of programming languages
   and is widely used in practice.
3. It is capable of handling both left and right-recursive
   grammars, which can be important in parsing certain
   programming languages.
4. The parse table generated for shift-reduce parsing is
   typically small, which makes the parser efficient in
   terms of memory usage.
  Shift-Reduce Parsing
• Disadvantages:
1. Shift-reduce parsing has a limited look ahead, which
   means that it may miss some syntax errors that require
   a larger look ahead.
2. It may also generate false-positive shift-reduce conflicts,
   which can require additional manual intervention to
   resolve.
3. Shift-reduce parsers may have difficulty in parsing
   ambiguous grammars, where there are multiple possible
   parse trees for a given input sequence.
4. In some cases, the parse tree generated by shift-reduce
   parsing may be more complex than other parsing
   techniques.
 Parsing algorithm
• Hypergraphs and Chart Parsing:
• Hypergraphs and chart parsing are two related concepts in natural
 language processing (NLP) and computational linguistics.
• They are used to represent and analyze the structure of sentences and
 syntactic or semantic relationships in a more flexible and expressive
 manner than traditional parsing techniques.
• Hypergraphs:
• A hypergraph is a generalization of a graph, where edges (hyperedges)
 can connect more than two nodes.
• In the context of NLP, hypergraphs are often used to represent complex
 syntactic or semantic structures in a sentence.
Parsing algorithm
• Each node in a hypergraph typically corresponds to a word or a sub-phrase,
 and hyperedges connect nodes to represent relationships between them.
• In NLP, hypergraphs are commonly used in parsing and semantic analysis
 tasks to represent various linguistic structures, such as constituency and
 dependency parses.
• Chart Parsing
• Chart parsing is a parsing technique used in NLP to construct parse charts,
 which are data structures that store and organize partial parsing results for
 sentences.
• It's a dynamic programming approach that incrementally builds and
 combines parse trees or structures as it processes words in a sentence.
Parsing algorithm
• Chart parsing is commonly used for constituency parsing and
 dependency parsing.
• The key components of chart parsing include:
• Chart: The data structure that stores parsing results for sub-phrases.
• Earley Parser: A popular algorithm for chart parsing, it incrementally
 adds and combines parsing states in the chart as it processes input
 words.
• Chart parsing can handle ambiguity, non-projectivity, and other
 linguistic phenomena, making it a powerful tool for syntactic and
 semantic analysis in NLP.
• Example of chart parsing is CYK algorithm.
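As a brief sketch, NLTK provides chart parsers, including an Earley-style one; EarleyChartParser and the toy grammar below are assumptions used only for illustration:

import nltk
from nltk.parse import EarleyChartParser

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I' | Det N
VP -> V NP
Det -> 'a'
N -> 'fox'
V -> 'saw'
""")

parser = EarleyChartParser(grammar, trace=1)   # trace=1 prints the chart edges as they are added
for tree in parser.parse('I saw a fox'.split()):
    print(tree)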
Parsing algorithm
• CYK algorithm
• The CYK algorithm is a parsing algorithm for context-free
 grammar.
• It is used to check if a particular string can be derived from
 the language generated by a given grammar.
• It is also called the membership algorithm as it tells whether
 the given string is a member of the given grammar or not.
• It was independently developed by three scientists named Cocke, Younger, and Kasami, hence the name CYK.
Parsing algorithm
• In a CYK algorithm, the structure of grammar should be
 in Chomsky normal form (CNF).
• The CYK algorithm is a dynamic programming-based
 parsing technique or table filling algorithm used to
 identify possible constituency parses for a given sentence.
• The grammar will be in CNF if each rule has one of the following forms:
• A → BC (at most two non-terminals on the right-hand side)
• A → a (a single terminal on the right-hand side)
• S → ε (the null string)
Parsing algorithm
• If the given Grammar is not in the CNF form, convert it to the CNF
 form before applying the CYK Algorithm.
• Deciding Membership Using CYK Algorithm
• To decide the membership of any given string, we construct a
 triangular table where
 Each row of the table corresponds to one particular length of sub
 strings.
 Bottom most row corresponds to strings of length-1.
 Second row from bottom corresponds to strings of length-2.
 Third row from bottom corresponds to strings of length-3.
 Top most row from bottom corresponds to the given string of length-n.
CYK algorithm
• For a given string ‘w’ of length 4 , triangular table looks as
  below:
• Let w be the string of length n to be parsed, and let G represent the set of rules in our grammar with start symbol S.
1. Construct a table of size n × n.
2. If w = ε (the empty string) and S -> ε is a rule in G, then we accept the string; otherwise we reject.
CYK algorithm
3. For i = 1 to n:
     For each variable A:
       If A -> b is a rule and b = w[i]:
         place A in cell (i, i) of the table.
4. For l = 2 to n:
     For i = 1 to n - l + 1:
       j = i + l - 1
       For k = i to j - 1:
         For each rule A -> BC:
           If cell (i, k) contains B and cell (k + 1, j) contains C:
             put A in cell (i, j) of the table.
5. If S is in cell (1, n): accept the string; else reject.
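The steps above translate almost directly into Python. The following is my own compact sketch (not code from the slides), encoded for the CNF grammar of Example 1 below, and it reports whether 'aabb' is a member:

from itertools import product

# Rules map a right-hand side (a pair of variables, or a terminal) to the set
# of variables that can produce it (grammar of Example 1 below).
rules = {
    ('A', 'B'): {'S', 'A'}, ('B', 'B'): {'S', 'B'},
    ('C', 'C'): {'A'}, ('C', 'A'): {'B'},
    ('B', 'A'): {'C'}, ('A', 'A'): {'C'},
    'a': {'A'}, 'b': {'B', 'C'},
}

def cyk(w, rules, start='S'):
    n = len(w)
    # table[i][j] holds the variables that derive the substring w[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):                       # length-1 substrings
        table[i][i] = set(rules.get(ch, set()))
    for length in range(2, n + 1):                   # longer substrings
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                    # split point
                for B, C in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= rules.get((B, C), set())
    return start in table[0][n - 1]

print(cyk('aabb', rules))   # True: aabb is in L(G) for Example 1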
CYK algorithm
• Example 1. Let our grammar G be:
S -> AB | BB
A -> CC | AB | a
B -> BB | CA | b
C -> BA | AA | b
We check if aabb is in L(G):
• Example 2. Let our grammar G be:
S -> AB | BC
A -> BA | a
 B -> CC| b
C -> AB | a
We check if baaba is in L(G):
CYK algorithm
• Example 3. Let our grammar G be:
S -> S + S
S -> S * S
S -> id
We check if the string "id * id + id" is in L(G):