UNIT-3
Parsing Algorithms
A natural language parser is a program that figures out which groups of words go together (as
"phrases") and which words are the subject or object of a verb. The NLP parser separates a
series of text into smaller pieces based on grammar rules. A sentence that cannot be
parsed may contain grammatical errors.
Document parsing breaks a document up into its largest constituents,
typically sentences and clauses. The initial step is usually to convert the sentences of the
source text into their stemmed form, called the sentence graph. Document parsing also includes
tokenization.
There are two types of parsing:
    • Top-down parsing
    • Bottom-up parsing
          Deep Vs Shallow Parsing
Deep Parsing:
    • The parser builds a complete syntactic structure for the sentence.
    • Suitable for complex NLP applications.
    • Used in NLP applications such as dialogue systems and summarization.
    • Also called full parsing.
Shallow Parsing:
    • The parser recovers only a limited part of the syntactic information (e.g., phrase chunks).
    • Suitable for less complex NLP applications.
    • Used in NLP applications such as information extraction and text mining.
    • Also called chunking.
Some Parsing Algorithms:
CYK Algorithm
The CYK algorithm is an algorithm for parsing context-free grammars (CFGs). It is used to
decide whether a given string belongs to the language described by a given CFG, i.e. whether
the string satisfies the conditions specified by the grammar.
The algorithm works bottom-up using dynamic programming: it fills a table recording which
non-terminals can derive which substrings of the input, and it accepts the string if the start
symbol derives the whole input. In its standard form, CYK requires the grammar to be in
Chomsky Normal Form, where every rule is either A -> B C or A -> a.
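
As a hedged illustration of the dynamic-programming idea, the sketch below is a minimal CYK
recognizer for a toy grammar already in Chomsky Normal Form; the grammar, the sentence, and
all names are invented for this example.

# Minimal CYK recognizer (sketch). Assumes the grammar is already in
# Chomsky Normal Form: every rule is either A -> B C or A -> terminal.
from itertools import product

binary_rules = {              # A -> B C
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexical_rules = {             # A -> terminal
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"},
    "saw": {"V"},
}

def cyk_recognize(tokens, start="S"):
    n = len(tokens)
    # table[i][j] = set of non-terminals that derive tokens[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(tokens):
        table[i][i] = set(lexical_rules.get(word, set()))
    for span in range(2, n + 1):                  # length of the substring
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                 # split point
                for B, C in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary_rules.get((B, C), set())
    return start in table[0][n - 1]

print(cyk_recognize("the dog saw a cat".split()))   # True
print(cyk_recognize("dog the saw".split()))         # False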
Earley’s Algorithm
Earley's algorithm is a top-down chart parser that operates on context-free grammars. It was
designed to handle practical parsing problems (including ambiguity and grammars that trip up
simple shift-reduce parsers) while retaining much of the simplicity and efficiency of LL parsers.
It builds a constituent parse tree from a formal grammar description and is one of the most
general and efficient parsing algorithms produced to date. It has been applied successfully to
tasks such as sentence identification, text classification, machine translation, and statistical
part-of-speech tagging.
Generally, it uses a chart for parsing and is implemented as a dynamic program that relies on
solving simpler sub-problems. The purpose of the algorithm is to decide whether a
given grammar generates a given text.
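
As a hedged illustration (assuming NLTK is installed and exposes its Earley chart parser as in
recent releases), the sketch below parses a toy sentence with NLTK's EarleyChartParser; the
grammar and sentence are invented for this example.

# Earley-style chart parsing with NLTK (illustrative sketch).
from nltk import CFG
from nltk.parse import EarleyChartParser

# Toy grammar, written for this example only
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'saw'
""")

parser = EarleyChartParser(grammar)
for tree in parser.parse("the dog saw a cat".split()):
    tree.pretty_print()   # prints the constituent parse tree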
LL Parsing Algorithm
The LL parsing algorithm (LL stands for Left-to-right scan, Leftmost derivation) is the simplest
of all parsers and is straightforward to implement compared with other parsing algorithms. The
method is used to create a parser for a specific language; the parser employs a set of
hand-written rules for recognizing the various tokens of that language.
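
As a hedged sketch, the snippet below is a tiny hand-written recursive-descent (LL(1)) parser for
the toy grammar expr -> term ('+' term)*, term -> NUMBER; the grammar and token format are
assumptions made only for this example.

# Minimal recursive-descent (LL(1)) parser for:
#   expr -> term ('+' term)*
#   term -> NUMBER
def parse_expr(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"unexpected token {tok!r} at position {pos}")
        pos += 1
        return tok

    def term():
        tok = eat()                      # expect a number token
        if not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return int(tok)

    def expr():
        value = term()
        while peek() == "+":             # one token of lookahead picks the rule
            eat("+")
            value += term()
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return result

print(parse_expr("1 + 2 + 3".split()))   # 6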
LR Parsing Algorithm
The LR parsing algorithm is a bottom-up parsing algorithm that is practical, simple, and
efficient. Since the inception of computer languages, LR parsing has remained a standard
algorithm in programming language design, especially for parsers implemented as compiler
components or used with general-purpose programming languages.
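
LR parsers work by shifting tokens onto a stack and reducing when the top of the stack matches
the right-hand side of a grammar rule. The sketch below is a deliberately simplified shift-reduce
loop for the toy rule expr -> expr '+' expr | NUM; a real LR parser drives the shift/reduce
decisions from generated ACTION/GOTO tables, which are omitted here.

# Simplified shift-reduce (bottom-up) parsing for:
#   expr -> expr '+' expr | NUM
# A real LR parser consults generated parse tables; here the reduce
# decisions are hard-coded to keep the illustration short.
def shift_reduce(tokens):
    stack, rest = [], list(tokens)
    while rest or len(stack) > 1:
        if stack and stack[-1].isdigit():
            stack[-1] = "expr"                        # reduce: expr -> NUM
        elif len(stack) >= 3 and stack[-3:] == ["expr", "+", "expr"]:
            stack[-3:] = ["expr"]                     # reduce: expr -> expr '+' expr
        elif rest:
            stack.append(rest.pop(0))                 # shift the next token
        else:
            raise SyntaxError(f"cannot parse, stack = {stack}")
    return stack == ["expr"]

print(shift_reduce("1 + 2 + 3".split()))   # True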
Packrat Parsing Algorithm
Packrat parsing is a method for creating linear-time parsers for Top-Down Parsing Language
(and Parsing Expression Grammar) grammars. It is essentially recursive-descent parsing with
memoization: the result of trying each production rule at each input position is stored, so
backtracking never repeats work and the parser runs in time linear in the length of the input.
It is implemented in practice as part of a wider workbench or toolkit rather than in isolation.
A primary benefit is that it supports unlimited lookahead and backtracking without exponential
cost; a known limitation is that plain packrat parsers cannot handle left-recursive rules without
special extensions, so such rules must be rewritten before parsing.
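
As a hedged sketch of the core idea, memoized recursive descent, the snippet below caches the
outcome of trying each rule at each input position; the toy PEG-style grammar and all names are
invented for this example.

# Core idea of packrat parsing: recursive descent plus memoization.
# Toy grammar (PEG-style ordered choice):  expr <- num '+' expr / num
from functools import lru_cache

TEXT = "1+2+3"

@lru_cache(maxsize=None)              # memo table: position -> parse result
def parse_num(pos):
    return pos + 1 if pos < len(TEXT) and TEXT[pos].isdigit() else None

@lru_cache(maxsize=None)
def parse_expr(pos):
    # First alternative: num '+' expr
    p = parse_num(pos)
    if p is not None and p < len(TEXT) and TEXT[p] == "+":
        q = parse_expr(p + 1)
        if q is not None:
            return q
    # Second alternative (tried only if the first fails): num
    return parse_num(pos)

# The whole string parses iff the expr rule consumes all of TEXT.
print(parse_expr(0) == len(TEXT))   # True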
Parser Combinator
A parser combinator is a higher-order function that takes parser functions as input and returns
a new parser function. It plays an important role in parser design because complex parsers can
be built by combining many small, simple ones.
Parser combinators are well suited for creating complex and fast text parsers. Moreover, they
can handle context-free and context-sensitive grammars. The basic idea when using parser
combinators is to express the grammar rules directly as code, composing small parsers into
larger ones, instead of writing the parser by hand in some other style or generating it from a
separate grammar file.
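
The sketch below shows the idea on a very small scale: a "parser" is a function from (text,
position) to a result, and combinators such as sequence and either build bigger parsers out of
smaller ones; all names are invented for this example.

# Tiny parser-combinator sketch. A "parser" is a function
#   parser(text, pos) -> (value, new_pos), or None on failure.
def literal(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return parse

def digit(text, pos):
    if pos < len(text) and text[pos].isdigit():
        return int(text[pos]), pos + 1
    return None

def sequence(*parsers):            # combinator: run parsers one after another
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def either(p1, p2):                # combinator: try p1, fall back to p2
    def parse(text, pos):
        return p1(text, pos) or p2(text, pos)
    return parse

# "addition" and "expr" are built by combining smaller parsers.
addition = sequence(digit, literal("+"), digit)
expr = either(addition, digit)

print(expr("1+2", 0))   # ([1, '+', 2], 3)
print(expr("7", 0))     # (7, 1)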
Pratt Parsing Algorithm
Generally, the Pratt algorithm allows you to create a "general-purpose parser" into which you
can plug all of the parsing rules you need for your language. It is a top-down, general-purpose
(operator-precedence) parser and can be programmed in any programming language.
The algorithm works by recursion: each token is associated with parsing rules and, for
operators, a binding power. The parser recurses until it reaches the terminal tokens and then
builds the derivation tree as the recursive calls return, attaching operands to their operators
according to precedence.
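
A hedged sketch of the idea: each operator is given a binding power, and the parser recurses
with a minimum binding power to decide how much of the input the current operator may claim;
the token format and names are invented for this example.

# Minimal Pratt (top-down operator-precedence) expression parser.
# Each operator has a binding power; a higher power binds more tightly.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse_expression(tokens, pos=0, min_bp=0):
    """Parse tokens[pos:] into a nested tuple tree, honoring precedence."""
    left = int(tokens[pos])              # a terminal (number) token
    pos += 1
    while pos < len(tokens):
        op = tokens[pos]
        bp = BINDING_POWER[op]
        if bp < min_bp:                  # operator binds too loosely: stop here
            break
        # Recurse for the right-hand side with a higher minimum binding power.
        right, pos = parse_expression(tokens, pos + 1, bp + 1)
        left = (op, left, right)
    return left, pos

tree, _ = parse_expression("1 + 2 * 3 - 4".split())
print(tree)   # ('-', ('+', 1, ('*', 2, 3)), 4)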
WordNet Lexical Relations
WordNet is a lexical database of semantic relations between words, with wordnets available in
more than 200 languages. In the field of natural language processing it supports a variety of
tasks such as automatic text classification, sentiment analysis, and text summarization.
WordNet links words into semantic relations including synonymy, hyponymy, and meronymy.
Synonyms are grouped into synsets with short definitions and usage examples, so WordNet can
be seen as a combination and extension of a dictionary and a thesaurus.
WordNet's semantic network aids in the discovery of word relationships, synonyms, grammatical
information, and other properties of words. This supports NLP tasks like sentiment analysis,
automatic language translation, and text similarity.
There are several types of lexical relations, such as (an NLTK WordNet example follows the list):
1. Homonymy
2. Polysemy
3. Synonymy
4. Antonymy
5. Hyponymy
6. Metonymy
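
As a hedged illustration (assuming NLTK and its WordNet corpus are installed, e.g. via
nltk.download('wordnet')), the sketch below looks up synsets, definitions, hyponyms, meronyms,
and antonyms:

# Exploring WordNet lexical relations with NLTK.
from nltk.corpus import wordnet as wn   # may require nltk.download('wordnet')

# Synonymy: a synset groups words that share one sense
for syn in wn.synsets("car")[:2]:
    print(syn.name(), "-", syn.definition())

tree = wn.synset("tree.n.01")
# Hyponymy: more specific concepts ("kinds of tree")
print([s.name() for s in tree.hyponyms()[:3]])
# Meronymy: parts of a tree (trunk, limb, ...)
print([s.name() for s in tree.part_meronyms()[:3]])

# Antonymy is defined on lemmas rather than synsets
good = wn.synset("good.a.01").lemmas()[0]
print([a.name() for a in good.antonyms()])   # e.g. ['bad']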
Lexical Word:
Lexical words are those that have independent meaning, such as nouns (N), verbs (V),
adjectives (A), adverbs (Adv), and prepositions (P). A definition that reports the meaning of a
word or a phrase as it is actually used by people is called a lexical definition.
Function words, by contrast, belong to the closed classes, like conjunctions and prepositions.
There is an unlimited number of lexical words, because they belong to the open classes, and a
limited, relatively small number of function words, because they belong to the closed classes.
Evidence for Deeper Structure
The deep structure of a linguistic expression is a theoretical construct that seeks to unify
several related structures. For example, the sentences "Pat loves Chris" and "Chris is loved by
Pat" mean roughly the same thing and use similar words.
Properties of Deep Structure
Major grammatical relations, such as subject of and object of, are defined at deep structure.
All lexical insertion occurs at deep structure. All transformations occur after deep structure.
Semantic interpretation occurs at deep structure.
Deep and Surface Structure Examples
Deep Structure                           Surface Structure
Please boss is hard.                     It is hard to please my boss.
John throw ball.                         John throws the ball.
Under bed is book.                       The book is underneath the bed.
Key Phrase Extraction
Keyphrase or keyword extraction in NLP is a text analysis technique that
extracts important words and phrases from the input text. These key phrases can
be used in a variety of tasks, including information retrieval, document
summarization, and content categorization.
This task is performed in two stages:
   1. Candidate Generation: This process involves the identification of all
      possible keywords from the input text.
   2. Keyphrase Ranking: After the candidate keywords are generated, they
      are ranked in order of importance for the identification of the best
      keywords.
   Some of the popular key phrase generating tools and algorithms
   are RAKE, YAKE, spaCy, Textacy.
RAKE
      RAKE stands for Rapid Automatic Keyword Extraction, and it is a
      frequency-based key phrase extractor. To implement RAKE we will
      use the rake-nltk library, which can be installed using the
      following command.
pip install rake-nltk
# Importing libraries
from rake_nltk import Rake
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Note: rake-nltk relies on NLTK's stopword list and tokenizers; if they are
# missing, run nltk.download('stopwords') and nltk.download('punkt') once.

# Initializing the Rake instance
rake = Rake()
# Input text
input_text = '''
NLP stands for Natural Language Processing.
It is the branch of Artificial Intelligence that gives the ability to machine
understand and process human languages. Human languages can be in the form of
text or audio format.
Natural Language Processing started in 1950 When Alan Mathison Turing published
an article in the name Computing Machinery and Intelligence.
It is based on Artificial intelligence. It talks about automatic interpretation
and generation of natural language.
As the technology evolved, different approaches have come to deal with NLP tasks.
'''
# Extracting keywords and phrases
rake.extract_keywords_from_text(input_text)
keywords = rake.get_ranked_phrases()
# Displaying the keywords
print(keywords)
# Generate WordCloud
wordcloud = WordCloud().generate(' '.join(keywords))
# Display the WordCloud
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
['alan mathison turing published', 'natural language processing started',
 'natural language processing', 'name computing machinery',
 'process human languages', 'natural language', 'human languages',
 'technology evolved', 'nlp tasks', 'nlp stands', 'machine understand',
 'different approaches', 'automatic interpretation', 'audio format',
 'artificial intelligence', 'artificial intelligence', 'intelligence',
 'text', 'talks', 'gives', 'generation', 'form', 'deal', 'come',
 'branch', 'based', 'article', 'ability', '1950']
YAKE
YAKE stands for Yet Another Keyword Extractor and it is an unsupervised
approach for automatic keyword extraction by leveraging text features.
To implement YAKE we will use the yake library. This library can be
installed using the following command.
pip install yake
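
A hedged usage sketch (assuming the yake package's KeywordExtractor API and reusing the
input_text string defined in the RAKE example above):

# Keyword extraction with YAKE (unsupervised, feature-based).
import yake

# n = maximum n-gram size, top = number of keyphrases to return
extractor = yake.KeywordExtractor(lan="en", n=3, top=10)
for phrase, score in extractor.extract_keywords(input_text):
    print(f"{score:.4f}  {phrase}")   # in YAKE, lower scores are more relevant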
spaCy
spaCy is a free, open-source library specifically designed for efficiently
performing various NLP tasks. It is usually used for setting up
production-level pipelines with pre-trained models for tasks such as
information extraction or sentiment analysis. It can also be used to
extract key phrases and words from the text input. This library can be
installed using the following commands.
pip install -U spacy
python -m spacy download en_core_web_sm
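
A hedged sketch (assuming the en_core_web_sm model downloaded above and reusing input_text
from the RAKE example): noun chunks from spaCy's pipeline can serve as candidate key phrases,
ranked here by a simple frequency count, which is an illustrative choice rather than a spaCy
built-in ranking.

# Candidate key phrases from spaCy noun chunks (illustrative ranking only).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(input_text)

# Candidate generation: noun chunks, lowercased and stripped
candidates = [chunk.text.lower().strip() for chunk in doc.noun_chunks]

# Keyphrase ranking: here simply by frequency of each candidate
for phrase, count in Counter(candidates).most_common(10):
    print(count, phrase)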
Textacy
Textacy is a Python library that provides a simple and intuitive interface
for performing various natural language processing (NLP) tasks. It is built
on top of spaCy, another popular NLP library, and offers additional
functionalities and utilities to simplify common NLP workflows.
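
A hedged sketch, assuming a recent textacy release where the key-term extractors live under
textacy.extract.keyterms (the module path has moved between versions, so check your installed
version) and reusing input_text from above:

# TextRank-based key terms with textacy (built on top of spaCy).
import textacy
from textacy.extract import keyterms

doc = textacy.make_spacy_doc(input_text, lang="en_core_web_sm")
print(keyterms.textrank(doc, normalize="lemma", topn=5))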
What is text clustering?
Text clustering is a natural language processing task in which a
collection of text documents is grouped based on textual similarity.
The goal of clustering is to ensure that documents within a single group
are much more alike than they are to documents in other groups.
This technique has found application in several areas, including
organizing large collections of documents, data summarisation,
information retrieval, and recommendation systems.
How can text clustering in NLP be performed?
Here’s a step-by-step overview of how text clustering can be performed:
1. Text Preprocessing:
All text data must be cleaned and prepared before clustering; this is the
text-preprocessing stage.
Common steps in this process include the following (a short preprocessing
sketch follows the list):
   •   Tokenization: splitting the text into single words called tokens.
   •   Lowercasing: converting all tokens to the same (lower) case.
   •   Stop-word removal: removing frequently occurring but not
       meaningful items such as "a" or "the".
   •   Stemming/lemmatization: reducing words to their base or root form
       so that different forms of the same word are treated alike.
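
A hedged preprocessing sketch using NLTK (assuming its punkt, stopwords, and wordnet data
have been downloaded):

# Basic text preprocessing for clustering: tokenize, lowercase,
# remove stop words, lemmatize.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # tokenize + lowercase
    tokens = [t for t in tokens if t.isalpha()]           # drop punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]   # stop-word removal
    return [lemmatizer.lemmatize(t) for t in tokens]      # reduce to base form

print(preprocess("The dogs were barking at the cats!"))
# e.g. ['dog', 'barking', 'cat']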
2. Feature extraction:
After preprocessing, the text data must be converted into a numeric
format suitable for clustering.
This can be done in any of the following popular ways (a TF-IDF sketch
follows the list):
   •   Bag of Words (BoW): representing each text as the set of its words
       and their frequencies.
   •   Term Frequency-Inverse Document Frequency (TF-IDF): a metric that
       shows how significant a given word is in a particular document
       relative to the other documents.
   •   Pre-trained embeddings: models such as Word2Vec (Mikolov et al.,
       2013), GloVe, and BERT represent words as dense vectors that
       capture semantic relationships.
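
A hedged TF-IDF sketch with scikit-learn; the four toy documents are invented for this example
and are reused by the clustering sketches below.

# TF-IDF features for a small document collection with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)        # sparse matrix: documents x terms

print(X.shape)                                 # (4, number_of_terms)
print(vectorizer.get_feature_names_out()[:5])  # a few of the learned terms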
3. Dimensionality Reduction:
Analyzing high-dimensional data can have high computational costs, and
it may contain noise. The number of dimensions can be reduced with
principal component analysis (PCA) or t-distributed stochastic neighbor
embedding (t-SNE), which try to preserve the most essential information.
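
A hedged sketch continuing from the TF-IDF matrix X above: for sparse TF-IDF features,
scikit-learn's TruncatedSVD is a common PCA-style reducer, and t-SNE can then map the result
to 2-D for visualization.

# Reducing the TF-IDF matrix to a few dense dimensions.
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)              # X is the TF-IDF matrix from above
print(X_reduced.shape)                        # (4, 2)

# t-SNE for 2-D visualization (perplexity must be smaller than the sample count)
X_2d = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(X_reduced)
print(X_2d.shape)                             # (4, 2)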
4. Clustering Algorithms:
There are numerous algorithms by which text data can be clustered (a
K-Means sketch follows the list):
   •   K-Means: one of the most popular and simplest methods; it
       partitions the data into k clusters based on feature similarity.
   •   Hierarchical Clustering: builds a hierarchy of nested clusters,
       either from the bottom up (agglomerative) or from the top down
       (divisive).
   •   DBSCAN (Density-Based Spatial Clustering of Applications with
       Noise): forms clusters from regions where enough neighbouring data
       points surround a core point, which allows clusters of arbitrary
       shape and treats sparse points as noise.
   •   Gaussian Mixture Models (GMM): assume that the data is generated
       by a combination of several Gaussian distributions and give the
       probability that any data point belongs to a given cluster.
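
A hedged K-Means sketch, clustering the TF-IDF matrix X and documents defined above into two
groups:

# Clustering the TF-IDF vectors from above with K-Means (k = 2).
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)                # X: TF-IDF matrix built earlier

for doc, label in zip(documents, labels):
    print(label, doc)
# Expected grouping: the two pet sentences in one cluster,
# the two stock-market sentences in the other.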
5. Clustering Evaluation:
It can be difficult to evaluate the quality of clustering because it
usually involves unsupervised learning.
Several metrics are used to evaluate clustering (a short evaluation
sketch follows the list), such as:
   •   Silhouette Score: measures how much an item resembles its own
       cluster relative to other clusters.
   •   Davies-Bouldin Index: the average similarity ratio between each
       cluster and the cluster most similar to it (lower values indicate
       better separation).
   •   Adjusted Rand Index (ARI): compares how alike two different
       clusterings are in terms of the class labels they assign to
       individual objects (it requires reference labels).
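
A hedged evaluation sketch, continuing from the K-Means result above (ARI is omitted because it
needs reference labels):

# Evaluating the K-Means clustering from above.
from sklearn.metrics import silhouette_score, davies_bouldin_score

dense = X.toarray()                       # these metrics expect dense arrays
print("Silhouette:", silhouette_score(dense, labels))
print("Davies-Bouldin:", davies_bouldin_score(dense, labels))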
Applications of Text Clustering in NLP:
Text clustering has numerous applications across various domains. Here
are some of the key applications:
1. Document Organization and Management:
  •   Library Systems: Automatically organizing books, articles, and
      other materials into relevant categories.
  •   Digital Archives: Structuring and categorizing large digital archives
      to improve navigation and accessibility.
2. Information Retrieval:
  •   Search Engines: Improving search results by grouping similar
      documents together, making it easier to present relevant results.
  •   Topic Modeling: Identifying and organizing documents based on
      topic.
3. Recommendation Systems / Content Recommendation:
  •   Personalized Content Delivery: by clustering users' browsing history
      or preferences, custom content can be delivered to the users of a
      website.
4. Social Media Analysis / Trend Analysis:
  •   Social media posts or tweets can be grouped to reveal emerging
      trends or topics, especially when they are divided into subgroups
      of interest.
  •   User comments can be clustered according to the different sentiments
      identified in a tweet or status update.
Pros & cons of text clustering in NPL:
Aspect                 Pros                        Cons
                       Automatic                   Sensitivity to
Organization and       Categorization:             Parameters:
Management             Organizes large volumes     Algorithms like K-
Aspect              Pros                      Cons
                    of text data into         Means require the
                    meaningful groups.        number of clusters to
                                              be set in advance,
                    Scalability: Handles      which can be
                    large datasets            challenging.
                    efficiently.
                    Pattern Detection:
                    Identifies hidden         Dependence on
                    patterns and              Feature Extraction:
                    relationships.            Quality of clusters
                                              depends heavily on the
                    Trend Analysis: Detects   chosen feature
                    emerging trends and       extraction method.
Insight Discovery   common themes.
                    Enhanced Search:
                    Improves the relevance
                    of search results by
                                              Resource Intensive:
                    grouping similar
                                              Clustering large
                    documents.
                                              datasets can be
                                              computationally
                    Personalized
                                              expensive and time-
                    Recommendations:
                                              consuming.
Improved Search     Enables accurate
and                 content
Recommendations     recommendations.
                    Various Applications:
                    Applicable across         Interpretability: Some
                    different fields like     algorithms, particularly
                    healthcare, e-commerce,   advanced ones, can be
                    and market research.      difficult to interpret and
                                              understand.
Versatility         Integration: Can be
Aspect                 Pros                        Cons
                       combined with other
                       NLP techniques for
                       enhanced analysis.
Hard Clustering Vs Soft Clustering in NLP
In NLP, hard clustering and soft clustering refer to different methods of
grouping text data based on similarity.
Hard clustering, also known as crisp clustering or traditional clustering,
is a technique that assigns each data point, in this case, a document or
word, to exactly one cluster. The assignment is based on a similarity
measure, such as cosine similarity or Euclidean distance, which calculates
the similarity between the data points. Hard clustering algorithms like k-
means or hierarchical clustering partition the data into distinct, non-
overlapping clusters. Each document or word belongs exclusively to a
single cluster, and there is no ambiguity in the assignment.
On the other hand, soft clustering, also called fuzzy clustering or
probabilistic clustering, allows for more flexible assignments. Instead of
assigning a data point to a single cluster, soft clustering assigns a
probability or degree of membership to each cluster. This means that a
document or word can belong to multiple clusters simultaneously, with
varying degrees of membership. Soft clustering algorithms, such as fuzzy
c-means or Gaussian mixture models, calculate the likelihood of a data
point belonging to each cluster based on probabilistic models.
Hard clustering: Assigns each data point to a single, exclusive cluster
based on similarity. There is no ambiguity in the assignment.
Soft clustering: Assigns a degree of membership or probability to each
cluster for a data point. Allows for overlapping membership in
multiple clusters.
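
A hedged sketch contrasting the two: K-Means gives each point a single (hard) label, while a
Gaussian mixture model gives per-cluster membership probabilities (soft); the toy
one-dimensional data is invented for this example.

# Hard vs soft clustering on toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

points = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [3.0]])

# Hard clustering: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(points)
print(hard_labels)                          # one hard cluster label per point

# Soft clustering: each point gets a degree of membership in every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(points)
print(gmm.predict_proba(points).round(2))   # each row sums to 1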