NLP Unit-3

UNIT-3

Parsing Algorithms
A natural language parser is a program that figures out which groups of words go together (as
“phrases”) and which words are the subject or object of a verb. The NLP parser separates a
series of text into smaller pieces based on grammar rules; a sentence that cannot be parsed
may contain grammatical errors.
Document parsing breaks a document up into its most extensive constituents, typically
sentences and clauses. The initial step is usually to convert the sentences of the source text
into a reduced representation called the sentence graph. Document parsing also includes
tokenization.
There are two types of parsing:
• Top-down parsing
• Bottom-up parsing

Deep Vs Shallow Parsing


Deep Parsing:
• The search strategy gives a complete syntactic structure to a sentence.
• It is suitable for complex NLP applications.
• Dialogue systems and summarization are examples of NLP applications where deep parsing is used.
• It is also called full parsing.

Shallow Parsing:
• Only a limited part of the syntactic information is parsed from the given text.
• It can be used for less complex NLP applications.
• Information extraction and text mining are examples of NLP applications where shallow parsing is used.
• It is also called chunking.


Some Parsing Algorithms:
CYK Algorithm

The CYK algorithm is an algorithm for parsing context-free grammars (CFGs). It is used to
decide whether a given string belongs to the language described by a given CFG, i.e. whether
a string S satisfies the conditions specified in the grammar.

The algorithm assumes the grammar is in Chomsky Normal Form and uses dynamic
programming: it fills a table recording which non-terminals can derive each substring of the
input, and the string is in the language if the start symbol can derive the whole sentence.
A minimal membership-test sketch is shown below.
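As an illustration, here is a minimal sketch of the CYK membership test in Python. The tiny grammar and sentence are invented for this example (they are not from the original text), and the grammar is assumed to already be in Chomsky Normal Form.

# A minimal CYK membership-test sketch (illustrative CNF grammar and sentence)
grammar = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP")],
    "NP": [("Det", "N")],
    "Det": ["the", "a"],
    "N": ["dog", "cat"],
    "V": ["chased"],
}

def cyk_parse(words, grammar):
    n = len(words)
    # table[i][j] holds the non-terminals that derive words[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    # Length-1 spans come from terminal rules (A -> word)
    for i, word in enumerate(words):
        for lhs, rhss in grammar.items():
            if word in rhss:
                table[i][0].add(lhs)
    # Longer spans are built by combining two shorter adjacent spans
    for length in range(2, n + 1):            # span length
        for start in range(n - length + 1):   # span start
            for split in range(1, length):    # split point inside the span
                left = table[start][split - 1]
                right = table[start + split][length - split - 1]
                for lhs, rhss in grammar.items():
                    for rhs in rhss:
                        if isinstance(rhs, tuple) and rhs[0] in left and rhs[1] in right:
                            table[start][length - 1].add(lhs)
    # The sentence is in the language if the start symbol derives the whole string
    return "S" in table[0][n - 1]

print(cyk_parse("the dog chased a cat".split(), grammar))   # True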

Earley’s Algorithm

Earley’s algorithm is a top-down parser that operates on context-free grammars. It was
designed to address practical parsing problems such as those that arise in shift-reduce
parsing, while retaining much of the simplicity and efficiency of LL parsers.

It is a top-down algorithm for generating a constituent parse tree from a formal description
of a sentence. It is one of the most efficient general parsing algorithms produced to date and
has been applied successfully to tasks such as sentence identification, text classification,
machine translation, and statistical part-of-speech tagging.

Generally, it uses a chart for parsing and is implemented as a dynamic program that relies on
solving simpler sub-problems. The purpose of the algorithm is to decide whether a given
grammar generates a given text.

LL Parsing Algorithm

The LL parsing algorithm (LL stands for Left-to-right scan, Leftmost derivation) is among the
simplest of all parsers and is straightforward to implement compared with other parsing
algorithms. The method is used for creating a parser for a specific language; the parser
employs a set of hand-written rules for recognizing the various tokens of a given
programming language.

LR Parsing Algorithm

The LR parsing algorithm is a bottom-up parsing algorithm that is practical, simple, and
efficient. Since the inception of computer languages, LR parsing has remained a standard
algorithm in contemporary programming-language design, especially for parsers implemented
as compiler components or used with general-purpose programming languages.

Packrat Parsing Algorithm

Packrat parsing is a method for creating linear-time parsers for Top-Down Parsing Language
grammars. Instead of the standard LR(k) approach, which builds an LALR parse table, a
packrat parser works directly from the set of production rules to generate the leftmost
derivation, memoizing intermediate results as it goes; this memoization is what lets it find a
parse of a given sentence in linear time.

It is implemented in practice as part of a wider workbench or toolkit rather than in isolation.
A primary benefit is that memoization removes the exponential cost that naive backtracking
would incur; a known limitation is that the basic technique does not handle left-recursive
grammars without extensions, so such grammars are usually rewritten before parsing.

Parser Combinator

A parser combinator is a higher-order function that takes parser functions as input and
returns a new parser function as output. It plays an important role in parser design by
letting complex parsers be built from simple ones.

Parser combinators are well suited for creating complex and fast text parsers, and they can
handle context-free and context-sensitive grammars. The basic idea is to generate the parser
by implementing the grammar rules directly as code (by composing combinators) instead of
writing the parser by hand in another formalism. A minimal sketch is shown below.
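As a rough illustration of the idea (not a real combinator library), the following sketch defines parsers as plain Python functions that return a (result, remaining text) pair or None, plus two combinators, seq and alt, that build new parsers from existing ones; all names here are illustrative.

# A minimal parser-combinator sketch
def char(c):
    """Parser that matches one expected character."""
    def parse(text):
        if text and text[0] == c:
            return c, text[1:]
        return None
    return parse

def seq(p1, p2):
    """Combinator: run p1, then p2 on the remaining text."""
    def parse(text):
        r1 = p1(text)
        if r1 is None:
            return None
        v1, rest = r1
        r2 = p2(rest)
        if r2 is None:
            return None
        v2, rest2 = r2
        return (v1, v2), rest2
    return parse

def alt(p1, p2):
    """Combinator: try p1, fall back to p2 on failure."""
    def parse(text):
        return p1(text) or p2(text)
    return parse

# Build a parser for "ab" or "ac" out of the small pieces
ab_or_ac = alt(seq(char("a"), char("b")), seq(char("a"), char("c")))
print(ab_or_ac("ac!"))   # (('a', 'c'), '!')
print(ab_or_ac("xy"))    # None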

Pratt Parsing Algorithm

Generally, the Pratt algorithm allows you to create a “general-purpose parser” into which you
can plug all of the parsing rules you need for your language. It is a top-down, general-purpose
parser.

It can be programmed in any programming language; it is typically demonstrated on
numerical expression parsing, but it is not tied to implementations in Prolog or Lisp. The
algorithm works by recursion until it reaches the terminal symbols (tokens); each operator is
given a binding power (precedence), and the parser builds derivation trees by attaching
sub-expressions to operators as children. A small sketch follows.
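The following is a minimal sketch of a Pratt-style expression parser for simple arithmetic; the token set, binding powers, and function names are illustrative assumptions rather than part of any standard library.

# A minimal Pratt-style expression parser sketch
import re

BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(text):
    return re.findall(r"\d+|[+\-*/()]", text)

def parse_expression(tokens, min_bp=0):
    token = tokens.pop(0)
    if token == "(":
        left = parse_expression(tokens, 0)
        tokens.pop(0)             # discard the closing ")"
    else:
        left = int(token)         # a number is a complete expression on its own
    # Keep consuming operators while they bind more tightly than the context
    while tokens and tokens[0] in BINDING_POWER and BINDING_POWER[tokens[0]] > min_bp:
        op = tokens.pop(0)
        right = parse_expression(tokens, BINDING_POWER[op])
        left = (op, left, right)  # build a small expression tree
    return left

print(parse_expression(tokenize("1 + 2 * 3")))     # ('+', 1, ('*', 2, 3))
print(parse_expression(tokenize("(1 + 2) * 3")))   # ('*', ('+', 1, 2), 3)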

WordNet Lexical Relations

WordNet is a lexical database of semantic relations between words, available in more than
200 languages. In the field of natural language processing, it supports a variety of tasks such
as automatic text classification, sentiment analysis, and text summarization.

WordNet links words into semantic relations including synonyms, hyponyms, and meronyms.
The synonyms are grouped into synsets with short definitions and usage examples. It can thus
be seen as a combination and extension of a dictionary and thesaurus.

WordNet is a rich lexical resource. Its semantic network aids in the discovery of word
relationships, synonyms, and word senses, and this supports NLP tasks like sentiment
analysis, automatic language translation, and text similarity.
There are several types of lexical relations, such as:
1. Homonymy
2. Polysemy
3. Synonymy
4. Antonymy
5. Hyponymy
6. Metonymy
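As a quick illustration, the NLTK interface to WordNet can be used to explore these relations; the sketch below assumes the WordNet corpus has been downloaded with nltk.download('wordnet'), and the word "dog" is just an example.

# Exploring WordNet relations with NLTK (requires nltk.download("wordnet"))
from nltk.corpus import wordnet as wn

synsets = wn.synsets("dog")                        # all senses of "dog"
dog = synsets[0]                                   # first sense, e.g. dog.n.01
print(dog.definition())                            # short gloss of this sense
print([lemma.name() for lemma in dog.lemmas()])    # synonyms in the synset
print([h.name() for h in dog.hypernyms()])         # more general concepts
print([h.name() for h in dog.hyponyms()][:5])      # a few more specific concepts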
Lexical Word:

Lexical words are those that have independent meaning, such as nouns (N), verbs (V),
adjectives (A), and adverbs (Adv). The definition which reports the meaning of a word or a
phrase as it is actually used by people is called a lexical definition.

Function words, by contrast, belong to the closed classes, like conjunctions and prepositions.
There is an unlimited number of lexical words, because they belong to the open classes, and a
limited and relatively small number of function words, because they belong to the closed
classes.

Evidence for Deeper Structure

The deep structure of a linguistic expression is a theoretical construct that seeks to unify
several related structures. For example, the sentences "Pat loves Chris" and "Chris is loved by
Pat" mean roughly the same thing and use similar words.

Properties of Deep Structure

Major grammatical relations, such as subject of and object of, are defined at deep structure.
All lexical insertion occurs at deep structure. All transformations occur after deep structure.
Semantic interpretation occurs at deep structure.


Deep and Surface Structure Examples

Deep Structure Surface Structure

Please boss is hard. It is hard to please my boss.

John throw ball. John throws the ball.

Under bed is book. The book is underneath the bed.


Key Phrase Extraction
Keyphrase or keyword extraction in NLP is a text analysis technique that
extracts important words and phrases from the input text. These key phrases can
be used in a variety of tasks, including information retrieval, document
summarization, and content categorization.
This task is performed in two stages:
1. Candidate Generation: This process involves the identification of all
possible keywords from the input text.
2. Keyphrase Ranking: After the candidate keywords are generated, they
are ranked in order of importance for the identification of the best
keywords.
Some of the popular key phrase generating tools and algorithms
are RAKE, YAKE, spaCy, Textacy.
RAKE
RAKE stands for Rapid Automatic Keyword Extraction; it is a
frequency-based keyphrase extractor. To implement RAKE we will
use the rake-nltk library, which can be installed using the
following command.

pip install rake-nltk

# Importing libraries
from rake_nltk import Rake
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Initializing the Rake instance
rake = Rake()

# Input text
input_text = '''
NLP stands for Natural Language Processing.
It is the branch of Artificial Intelligence that gives the ability to machine understand
and process human languages. Human languages can be in the form of text or audio format.
Natural Language Processing started in 1950 When Alan Mathison Turing published
an article in the name Computing Machinery and Intelligence.
It is based on Artificial intelligence. It talks about automatic interpretation and
generation of natural language.
As the technology evolved, different approaches have come to deal with NLP tasks.
'''

# Extracting keywords and phrases
rake.extract_keywords_from_text(input_text)
keywords = rake.get_ranked_phrases()

# Displaying the keywords
print(keywords)

# Generate WordCloud
wordcloud = WordCloud().generate(' '.join(keywords))

# Display the WordCloud
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Output:

['alan mathison turing published', 'natural language processing started', 'natural language processing',
 'name computing machinery', 'process human languages', 'natural language', 'human languages',
 'technology evolved', 'nlp tasks', 'nlp stands', 'machine understand', 'different approaches',
 'automatic interpretation', 'audio format', 'artificial intelligence', 'artificial intelligence',
 'intelligence', 'text', 'talks', 'gives', 'generation', 'form', 'deal', 'come', 'branch', 'based',
 'article', 'ability', '1950']

YAKE

YAKE stands for Yet Another Keyword Extractor and it is an unsupervised
approach for automatic keyword extraction by leveraging text features.
To implement YAKE we will use the yake library. This library can be
installed using the following command.

pip install yake
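A minimal usage sketch with the yake library is shown below; the sample text and parameter values (language, phrase length, number of keyphrases) are illustrative choices.

# Extracting keyphrases with YAKE (lower scores mean more important phrases)
import yake

text = ("Natural Language Processing gives machines the ability to "
        "understand and process human language.")
extractor = yake.KeywordExtractor(lan="en", n=2, top=5)   # up to 2-word phrases
for phrase, score in extractor.extract_keywords(text):
    print(phrase, round(score, 4))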


spaCy

spaCy is a free, open-source library specifically designed for efficiently
performing various NLP tasks. It is usually used for setting up
production-level pipelines with pre-trained models, for tasks like
information extraction or sentiment analysis. It can also be used to
extract key phrases and words from input text. The library can be
installed using the following commands.

pip install -U spacy


python -m spacy download en_core_web_sm
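One simple way to get candidate key phrases with spaCy is to take the noun chunks of a document, as in the sketch below; the sample text is illustrative and the en_core_web_sm model is assumed to be installed as shown above.

# Candidate keyphrases from spaCy noun chunks
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is a branch of Artificial Intelligence "
          "that helps machines understand human language.")
candidates = [chunk.text for chunk in doc.noun_chunks]   # noun phrases as candidates
print(candidates)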

Textacy

Textacy is a Python library that provides a simple and intuitive interface
for performing various natural language processing (NLP) tasks. It is built
on top of spaCy, another popular NLP library, and offers additional
functionalities and utilities to simplify common NLP workflows.
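As a sketch, and assuming a recent textacy version in which the keyterm extractors live under textacy.extract.keyterms, TextRank-based key phrases can be obtained roughly as follows (sample text and parameters are illustrative):

# TextRank keyterms via textacy (built on top of spaCy)
import textacy
from textacy.extract import keyterms as kt

text = ("Natural Language Processing started in 1950 when Alan Turing "
        "published Computing Machinery and Intelligence.")
doc = textacy.make_spacy_doc(text, lang="en_core_web_sm")
print(kt.textrank(doc, normalize="lemma", topn=5))   # list of (term, score) pairs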

What is text clustering?

Text clustering is a natural language processing task in which a
collection of text documents is grouped based on textual similarity.

The goal of clustering is to ensure that documents within a single group
are much more alike than they are to documents in other groups. The
technique has found application in several areas, including organizing
large collections of documents, data summarisation, information
retrieval, and recommendation systems.

How can text clustering in NLP be performed?

Here’s a step-by-step overview of how text clustering can be performed:

1. Text Preprocessing:

All text data must be cleaned and prepared before clustering. Common
steps in this process include the following (a small NLTK sketch follows
the list):
• Tokenization: splitting the text into single words called tokens.
• Lowercasing: converting all tokens to the same case.
• Stop-word removal: removing frequently occurring but not meaningful items such as “a” or “the”.
• Stemming/lemmatization: reducing words to their base or root form so that different surface forms of the same word are treated alike.
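A small preprocessing sketch using NLTK is shown below; it assumes the punkt, stopwords, and wordnet resources have been downloaded with nltk.download, and the sample sentence is illustrative.

# Tokenize, lowercase, remove stop words, and lemmatize with NLTK
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "The cats are sitting on the mats."
tokens = nltk.word_tokenize(text.lower())                 # tokenize + lowercase
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]  # drop stop words / punctuation
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])          # ['cat', 'sitting', 'mat']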

2. Feature Extraction:

After preprocessing, the text data must be converted into a numeric
format suitable for clustering. This can be done in any of the following
popular ways (a TF-IDF sketch follows the list):
• Bag of Words (BoW): represents texts as a collection of words and their frequencies.
• Term Frequency-Inverse Document Frequency (TF-IDF): a weighting that shows how significant a given word is in a particular document relative to the other documents.
• Pre-trained embeddings: models such as Word2Vec (Mikolov et al., 2013), GloVe, and BERT represent words as dense vectors that capture semantic relationships.
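For example, TF-IDF features can be extracted with scikit-learn roughly as follows; the toy documents are illustrative.

# TF-IDF feature extraction with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)        # sparse matrix: documents x vocabulary
print(X.shape)                            # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out())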

3. Dimensionality Reduction:

Analyzing high-dimensional data can have high computational costs, and
the data may contain noise. The number of dimensions can be reduced with
principal component analysis (PCA) or t-distributed stochastic neighbor
embedding (t-SNE) while preserving the most essential information.
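A short sketch of reducing the dimensionality of TF-IDF features is shown below. TruncatedSVD is used here because it works directly on sparse matrices (plain PCA would require a dense array); this is an illustrative choice, and t-SNE could be applied on top of the reduced output.

# Reducing sparse TF-IDF features to two dimensions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "the dog chased the cat", "stocks fell sharply today"]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(X_reduced.shape)   # (3, 2)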

4. Clustering Algorithms:

There are numerous algorithms by which text data can be clustered
(a K-Means sketch follows the list):
• K-Means: one of the most popular and simplest methods; it partitions the data into k clusters based on feature similarity.
• Hierarchical Clustering: creates a series of nested clusters, either bottom-up (agglomerative) or top-down (divisive).
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): groups points based on the number of neighbouring data points around a core point, which allows clusters of arbitrary shape and explicit handling of noise.
• Gaussian Mixture Models (GMM): assume that the data is generated by a combination of multiple Gaussian distributions and give the probability that any data point belongs to a given cluster.
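As a concrete illustration of the first option, the sketch below clusters a few toy documents with K-Means on TF-IDF features; the documents and the choice of k are illustrative.

# K-Means clustering of TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply",
    "investors worry about the markets",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)            # one cluster id per document
print(labels)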

5. Clustering Evaluation:

It can be difficult to evaluate the quality of a clustering because it
usually involves unsupervised learning. Several metrics are used, such as
the following (a short sketch follows the list):
• Silhouette Score: measures how much an item resembles its own cluster relative to other clusters.
• Davies-Bouldin Index: the average similarity ratio between each cluster and the cluster most similar to it (lower values indicate better separation).
• Adjusted Rand Index (ARI): compares how much two different clusterings agree in the class labels they assign to individual objects.
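For example, the silhouette score of a K-Means clustering can be computed with scikit-learn as in the sketch below; the toy documents and k value are illustrative.

# Evaluating a clustering with the silhouette score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply",
    "investors worry about the markets",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
print(silhouette_score(X, labels))   # closer to 1 means better-separated clusters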

Applications of Text Clustering in NLP:

Text clustering has numerous applications across various domains. Here
are some of the key applications:
1. Document Organization and Management:
• Library Systems: automatically organizing books, articles, and other materials into relevant categories.
• Digital Archives: structuring and categorizing large digital archives to improve navigation and accessibility.

2. Information Retrieval:
• Search Engines: improving search results by grouping similar documents together, making it easier to present relevant results.
• Topic Modeling: identifying and organizing documents based on topic.

3. Recommendation Systems and Content Recommendation:
• Personalized Content Delivery: by clustering users' browsing history or preferences, it is possible to deliver custom content to visitors of a website.

4. Social Media Analysis and Trend Analysis:
• Social media posts or tweets can be grouped to reveal emerging trends or topics, especially when they are divided into subgroups of interest.
• User comments can be clustered according to the sentiment expressed in a particular tweet or status update.

Pros & Cons of Text Clustering in NLP:

Organization and Management
• Pros: Automatic Categorization: organizes large volumes of text data into meaningful groups. Scalability: handles large datasets efficiently.
• Cons: Sensitivity to Parameters: algorithms like K-Means require the number of clusters to be set in advance, which can be challenging.

Insight Discovery
• Pros: Pattern Detection: identifies hidden patterns and relationships. Trend Analysis: detects emerging trends and common themes.
• Cons: Dependence on Feature Extraction: the quality of the clusters depends heavily on the chosen feature extraction method.

Improved Search and Recommendations
• Pros: Enhanced Search: improves the relevance of search results by grouping similar documents. Personalized Recommendations: enables accurate content recommendations.
• Cons: Resource Intensive: clustering large datasets can be computationally expensive and time-consuming.

Versatility
• Pros: Various Applications: applicable across different fields like healthcare, e-commerce, and market research. Integration: can be combined with other NLP techniques for enhanced analysis.
• Cons: Interpretability: some algorithms, particularly advanced ones, can be difficult to interpret and understand.

Hard Clustering Vs Soft Clustering in NLP

In NLP, hard clustering and soft clustering refer to different methods of grouping text data
based on similarity.

Hard clustering, also known as crisp clustering or traditional clustering, is a technique that
assigns each data point, in this case a document or word, to exactly one cluster. The
assignment is based on a similarity measure, such as cosine similarity or Euclidean distance,
which calculates the similarity between the data points. Hard clustering algorithms like
k-means or hierarchical clustering partition the data into distinct, non-overlapping clusters.
Each document or word belongs exclusively to a single cluster, and there is no ambiguity in
the assignment.

On the other hand, soft clustering, also called fuzzy clustering or probabilistic clustering,
allows for more flexible assignments. Instead of assigning a data point to a single cluster,
soft clustering assigns a probability or degree of membership to each cluster. This means that
a document or word can belong to multiple clusters simultaneously, with varying degrees of
membership. Soft clustering algorithms, such as fuzzy c-means or Gaussian mixture models,
calculate the likelihood of a data point belonging to each cluster based on probabilistic
models.

Hard clustering: assigns each data point to a single, exclusive cluster based on similarity.
There is no ambiguity in the assignment.

Soft clustering: assigns a degree of membership or probability to each cluster for a data
point, allowing overlapping membership in multiple clusters.
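The sketch below contrasts the two on the same toy documents: K-Means returns one hard label per document, while a Gaussian mixture model returns a probability of membership in each cluster; the documents, the reduction to two dimensions, and the number of clusters are all illustrative choices.

# Hard (K-Means) vs soft (Gaussian mixture) cluster assignments
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply",
    "investors worry about the markets",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X2 = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)   # dense, low-dimensional

print(KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X2))   # one hard label per document
gmm = GaussianMixture(n_components=2, random_state=0).fit(X2)
print(gmm.predict_proba(X2).round(2))   # membership probability for each cluster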
