UNIT-3
Parsing Algorithms
A natural language parser is a program that figures out which groups of words go together (as
"phrases") and which words are the subject or object of a verb. The NLP parser separates a
series of text into smaller pieces based on grammar rules. A sentence that cannot be
parsed may contain grammatical errors.
Document parsing breaks a document up into its largest constituents,
typically sentences and clauses. The initial step is usually to convert the sentences of the
source text into their stemmed form, called the sentence graph. Document parsing also includes
tokenization.
There are two types of parsing:
    • Top-down parsing
    • Bottom-up parsing
          Deep Vs Shallow Parsing
Deep Parsing:
    • The parser builds a complete syntactic structure for the sentence.
    • Suitable for complex NLP applications.
    • Used in NLP applications such as dialogue systems and summarization.
    • Also called full parsing.
Shallow Parsing:
    • The parser recovers only a limited part of the syntactic information (e.g., phrase chunks).
    • Suitable for less complex NLP applications.
    • Used in NLP applications such as information extraction and text mining.
    • Also called chunking.
Some Parsing Algorithms:
CYK Algorithm
The CYK algorithm is an algorithm for parsing context-free grammars (CFGs). It is used to
decide whether a given string belongs to the language described by a given CFG, i.e. whether
the string satisfies the conditions specified by the grammar.
The algorithm works bottom-up using dynamic programming: it fills a table recording which
non-terminals can derive which substrings of the input, and it accepts the string if the start
symbol derives the whole input. In its standard form, CYK requires the grammar to be in
Chomsky Normal Form, where every rule is either A -> B C or A -> a.
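
As a hedged illustration of the dynamic-programming idea, the sketch below is a minimal CYK
recognizer for a toy grammar already in Chomsky Normal Form; the grammar, the sentence, and
all names are invented for this example.

# Minimal CYK recognizer (sketch). Assumes the grammar is already in
# Chomsky Normal Form: every rule is either A -> B C or A -> terminal.
from itertools import product

binary_rules = {              # A -> B C
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexical_rules = {             # A -> terminal
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"},
    "saw": {"V"},
}

def cyk_recognize(tokens, start="S"):
    n = len(tokens)
    # table[i][j] = set of non-terminals that derive tokens[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(tokens):
        table[i][i] = set(lexical_rules.get(word, set()))
    for span in range(2, n + 1):                  # length of the substring
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                 # split point
                for B, C in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary_rules.get((B, C), set())
    return start in table[0][n - 1]

print(cyk_recognize("the dog saw a cat".split()))   # True
print(cyk_recognize("dog the saw".split()))         # False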
Earley’s Algorithm
Earley's algorithm is a top-down chart parser that operates on context-free grammars. It was
designed to handle practical parsing problems (including ambiguity and grammars that trip up
simple shift-reduce parsers) while retaining much of the simplicity and efficiency of LL parsers.
It builds a constituent parse tree from a formal grammar description and is one of the most
general and efficient parsing algorithms produced to date. It has been applied successfully to
tasks such as sentence identification, text classification, machine translation, and statistical
part-of-speech tagging.
Generally, it uses a chart for parsing and is implemented as a dynamic program that relies on
solving simpler sub-problems. The purpose of the algorithm is to decide whether a
given grammar generates a given text.
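
As a hedged illustration (assuming NLTK is installed and exposes its Earley chart parser as in
recent releases), the sketch below parses a toy sentence with NLTK's EarleyChartParser; the
grammar and sentence are invented for this example.

# Earley-style chart parsing with NLTK (illustrative sketch).
from nltk import CFG
from nltk.parse import EarleyChartParser

# Toy grammar, written for this example only
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'saw'
""")

parser = EarleyChartParser(grammar)
for tree in parser.parse("the dog saw a cat".split()):
    tree.pretty_print()   # prints the constituent parse tree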
LL Parsing Algorithm
The LL parsing algorithm (LL stands for Left-to-right scan, Leftmost derivation) is the simplest
of all parsers and is straightforward to implement compared with other parsing algorithms. The
method is used to create a parser for a specific language; the parser employs a set of
hand-written rules for recognizing the various tokens of that language.
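
As a hedged sketch, the snippet below is a tiny hand-written recursive-descent (LL(1)) parser for
the toy grammar expr -> term ('+' term)*, term -> NUMBER; the grammar and token format are
assumptions made only for this example.

# Minimal recursive-descent (LL(1)) parser for:
#   expr -> term ('+' term)*
#   term -> NUMBER
def parse_expr(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"unexpected token {tok!r} at position {pos}")
        pos += 1
        return tok

    def term():
        tok = eat()                      # expect a number token
        if not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return int(tok)

    def expr():
        value = term()
        while peek() == "+":             # one token of lookahead picks the rule
            eat("+")
            value += term()
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return result

print(parse_expr("1 + 2 + 3".split()))   # 6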
LR Parsing Algorithm
The LR parsing algorithm is a bottom-up parsing algorithm that is practical, simple, and
efficient. Since the inception of computer languages, LR parsing has remained a standard
algorithm in programming language design, especially for parsers implemented as compiler
components or used with general-purpose programming languages.
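
LR parsers work by shifting tokens onto a stack and reducing when the top of the stack matches
the right-hand side of a grammar rule. The sketch below is a deliberately simplified shift-reduce
loop for the toy rule expr -> expr '+' expr | NUM; a real LR parser drives the shift/reduce
decisions from generated ACTION/GOTO tables, which are omitted here.

# Simplified shift-reduce (bottom-up) parsing for:
#   expr -> expr '+' expr | NUM
# A real LR parser consults generated parse tables; here the reduce
# decisions are hard-coded to keep the illustration short.
def shift_reduce(tokens):
    stack, rest = [], list(tokens)
    while rest or len(stack) > 1:
        if stack and stack[-1].isdigit():
            stack[-1] = "expr"                        # reduce: expr -> NUM
        elif len(stack) >= 3 and stack[-3:] == ["expr", "+", "expr"]:
            stack[-3:] = ["expr"]                     # reduce: expr -> expr '+' expr
        elif rest:
            stack.append(rest.pop(0))                 # shift the next token
        else:
            raise SyntaxError(f"cannot parse, stack = {stack}")
    return stack == ["expr"]

print(shift_reduce("1 + 2 + 3".split()))   # True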
Packrat Parsing Algorithm
Packrat parsing is a method for creating linear-time parsers for Top-Down Parsing Language
(and Parsing Expression Grammar) grammars. It is essentially recursive-descent parsing with
memoization: the result of trying each production rule at each input position is stored, so
backtracking never repeats work and the parser runs in time linear in the length of the input.
It is implemented in practice as part of a wider workbench or toolkit rather than in isolation.
A primary benefit is that it supports unlimited lookahead and backtracking without exponential
cost; a known limitation is that plain packrat parsers cannot handle left-recursive rules without
special extensions, so such rules must be rewritten before parsing.
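
As a hedged sketch of the core idea, memoized recursive descent, the snippet below caches the
outcome of trying each rule at each input position; the toy PEG-style grammar and all names are
invented for this example.

# Core idea of packrat parsing: recursive descent plus memoization.
# Toy grammar (PEG-style ordered choice):  expr <- num '+' expr / num
from functools import lru_cache

TEXT = "1+2+3"

@lru_cache(maxsize=None)              # memo table: position -> parse result
def parse_num(pos):
    return pos + 1 if pos < len(TEXT) and TEXT[pos].isdigit() else None

@lru_cache(maxsize=None)
def parse_expr(pos):
    # First alternative: num '+' expr
    p = parse_num(pos)
    if p is not None and p < len(TEXT) and TEXT[p] == "+":
        q = parse_expr(p + 1)
        if q is not None:
            return q
    # Second alternative (tried only if the first fails): num
    return parse_num(pos)

# The whole string parses iff the expr rule consumes all of TEXT.
print(parse_expr(0) == len(TEXT))   # True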
Parser Combinator
A parser combinator is a higher-order function that takes parser functions as input and returns
a new parser function. It plays an important role in parser design because complex parsers can
be built by combining many small, simple ones.
Parser combinators are well suited for creating complex and fast text parsers. Moreover, they
can handle context-free and context-sensitive grammars. The basic idea when using parser
combinators is to express the grammar rules directly as code, composing small parsers into
larger ones, instead of writing the parser by hand in some other style or generating it from a
separate grammar file.
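
The sketch below shows the idea on a very small scale: a "parser" is a function from (text,
position) to a result, and combinators such as sequence and either build bigger parsers out of
smaller ones; all names are invented for this example.

# Tiny parser-combinator sketch. A "parser" is a function
#   parser(text, pos) -> (value, new_pos), or None on failure.
def literal(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return parse

def digit(text, pos):
    if pos < len(text) and text[pos].isdigit():
        return int(text[pos]), pos + 1
    return None

def sequence(*parsers):            # combinator: run parsers one after another
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def either(p1, p2):                # combinator: try p1, fall back to p2
    def parse(text, pos):
        return p1(text, pos) or p2(text, pos)
    return parse

# "addition" and "expr" are built by combining smaller parsers.
addition = sequence(digit, literal("+"), digit)
expr = either(addition, digit)

print(expr("1+2", 0))   # ([1, '+', 2], 3)
print(expr("7", 0))     # (7, 1)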
Pratt Parsing Algorithm
Generally, the Pratt algorithm allows you to create a "general-purpose parser" into which you
can plug all of the parsing rules you need for your language. It is a top-down, general-purpose
(operator-precedence) parser and can be programmed in any programming language.
The algorithm works by recursion: each token is associated with parsing rules and, for
operators, a binding power. The parser recurses until it reaches the terminal tokens and then
builds the derivation tree as the recursive calls return, attaching operands to their operators
according to precedence.
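
A hedged sketch of the idea: each operator is given a binding power, and the parser recurses
with a minimum binding power to decide how much of the input the current operator may claim;
the token format and names are invented for this example.

# Minimal Pratt (top-down operator-precedence) expression parser.
# Each operator has a binding power; a higher power binds more tightly.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse_expression(tokens, pos=0, min_bp=0):
    """Parse tokens[pos:] into a nested tuple tree, honoring precedence."""
    left = int(tokens[pos])              # a terminal (number) token
    pos += 1
    while pos < len(tokens):
        op = tokens[pos]
        bp = BINDING_POWER[op]
        if bp < min_bp:                  # operator binds too loosely: stop here
            break
        # Recurse for the right-hand side with a higher minimum binding power.
        right, pos = parse_expression(tokens, pos + 1, bp + 1)
        left = (op, left, right)
    return left, pos

tree, _ = parse_expression("1 + 2 * 3 - 4".split())
print(tree)   # ('-', ('+', 1, ('*', 2, 3)), 4)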
WordNet Lexical Relations
WordNet is a lexical database of semantic relations between words, with wordnets available in
more than 200 languages. In the field of natural language processing it supports a variety of
tasks such as automatic text classification, sentiment analysis, and text summarization.
WordNet links words into semantic relations including synonymy, hyponymy, and meronymy.
Synonyms are grouped into synsets with short definitions and usage examples, so WordNet can
be seen as a combination and extension of a dictionary and a thesaurus.
WordNet's semantic network aids in the discovery of word relationships, synonyms, grammatical
information, and other properties of words. This supports NLP tasks like sentiment analysis,
automatic language translation, and text similarity.
There are several types of lexical relations, such as (an NLTK WordNet example follows the list):
1. Homonymy
2. Polysemy
3. Synonymy
4. Antonymy
5. Hyponymy
6. Metonymy
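
As a hedged illustration (assuming NLTK and its WordNet corpus are installed, e.g. via
nltk.download('wordnet')), the sketch below looks up synsets, definitions, hyponyms, meronyms,
and antonyms:

# Exploring WordNet lexical relations with NLTK.
from nltk.corpus import wordnet as wn   # may require nltk.download('wordnet')

# Synonymy: a synset groups words that share one sense
for syn in wn.synsets("car")[:2]:
    print(syn.name(), "-", syn.definition())

tree = wn.synset("tree.n.01")
# Hyponymy: more specific concepts ("kinds of tree")
print([s.name() for s in tree.hyponyms()[:3]])
# Meronymy: parts of a tree (trunk, limb, ...)
print([s.name() for s in tree.part_meronyms()[:3]])

# Antonymy is defined on lemmas rather than synsets
good = wn.synset("good.a.01").lemmas()[0]
print([a.name() for a in good.antonyms()])   # e.g. ['bad']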
Lexical Word:
Lexical words are those that have independent meaning, such as nouns (N), verbs (V),
adjectives (A), adverbs (Adv), and prepositions (P). A definition that reports the meaning of a
word or a phrase as it is actually used by people is called a lexical definition.
Function words, by contrast, belong to the closed classes, like conjunctions and prepositions.
There is an unlimited number of lexical words, because they belong to the open classes, and a
limited, relatively small number of function words, because they belong to the closed classes.
Evidence for Deeper Structure
The deep structure of a linguistic expression is a theoretical construct that seeks to unify
several related structures. For example, the sentences "Pat loves Chris" and "Chris is loved by
Pat" mean roughly the same thing and use similar words.
Properties of Deep Structure
Major grammatical relations, such as subject of and object of, are defined at deep structure.
All lexical insertion occurs at deep structure. All transformations occur after deep structure.
Semantic interpretation occurs at deep structure.
Deep and Surface Structure Examples
Deep Structure                           Surface Structure
Please boss is hard.                     It is hard to please my boss.
John throw ball.                         John throws the ball.
Under bed is book.                       The book is underneath the bed.
Key Phrase Extraction
Keyphrase or keyword extraction in NLP is a text analysis technique that
extracts important words and phrases from the input text. These key phrases can
be used in a variety of tasks, including information retrieval, document
summarization, and content categorization.
This task is performed in two stages:
   1. Candidate Generation: This process involves the identification of all
      possible keywords from the input text.
   2. Keyphrase Ranking: After the candidate keywords are generated, they
      are ranked in order of importance for the identification of the best
      keywords.
   Some of the popular key phrase generating tools and algorithms
   are RAKE, YAKE, spaCy, Textacy.
RAKE
      RAKE stands for Rapid Automatic Keyword Extraction, and it is a
      frequency-based key phrase extractor. To implement RAKE we will
      use the rake-nltk library, which can be installed using the
      following command.
pip install rake-nltk
# Importing libraries
from rake_nltk import Rake
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Note: rake-nltk relies on NLTK's stopword list and tokenizers; if they are
# missing, run nltk.download('stopwords') and nltk.download('punkt') once.

# Initializing the Rake instance
rake = Rake()
# Input text
input_text = '''
NLP stands for Natural Language Processing.
It is the branch of Artificial Intelligence that gives the ability to machine
understand and process human languages. Human languages can be in the form of
text or audio format.
Natural Language Processing started in 1950 When Alan Mathison Turing published
an article in the name Computing Machinery and Intelligence.
It is based on Artificial intelligence. It talks about automatic interpretation
and generation of natural language.
As the technology evolved, different approaches have come to deal with NLP tasks.
'''
# Extracting keywords and phrases
rake.extract_keywords_from_text(input_text)
keywords = rake.get_ranked_phrases()
# Displaying the keywords
print(keywords)
# Generate WordCloud
wordcloud = WordCloud().generate(' '.join(keywords))
# Display the WordCloud
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
['alan mathison turing published', 'natural language processing started',
 'natural language processing', 'name computing machinery',
 'process human languages', 'natural language', 'human languages',
 'technology evolved', 'nlp tasks', 'nlp stands', 'machine understand',
 'different approaches', 'automatic interpretation', 'audio format',
 'artificial intelligence', 'artificial intelligence', 'intelligence',
 'text', 'talks', 'gives', 'generation', 'form', 'deal', 'come',
 'branch', 'based', 'article', 'ability', '1950']
YAKE
YAKE stands for Yet Another Keyword Extractor and it is an unsupervised
approach for automatic keyword extraction by leveraging text features.
To implement YAKE we will use the yake library. This library can be
installed using the following command.
pip install yake
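
A hedged usage sketch (assuming the yake package's KeywordExtractor API and reusing the
input_text string defined in the RAKE example above):

# Keyword extraction with YAKE (unsupervised, feature-based).
import yake

# n = maximum n-gram size, top = number of keyphrases to return
extractor = yake.KeywordExtractor(lan="en", n=3, top=10)
for phrase, score in extractor.extract_keywords(input_text):
    print(f"{score:.4f}  {phrase}")   # in YAKE, lower scores are more relevant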
spaCy
spaCy is a free, open-source library specifically designed for efficiently
performing various NLP tasks. It is usually used for setting up
production-level pipelines with pre-trained models for tasks such as
information extraction or sentiment analysis. It can also be used to
extract key phrases and words from the text input. This library can be
installed using the following commands.
pip install -U spacy
python -m spacy download en_core_web_sm
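
A hedged sketch (assuming the en_core_web_sm model downloaded above and reusing input_text
from the RAKE example): noun chunks from spaCy's pipeline can serve as candidate key phrases,
ranked here by a simple frequency count, which is an illustrative choice rather than a spaCy
built-in ranking.

# Candidate key phrases from spaCy noun chunks (illustrative ranking only).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(input_text)

# Candidate generation: noun chunks, lowercased and stripped
candidates = [chunk.text.lower().strip() for chunk in doc.noun_chunks]

# Keyphrase ranking: here simply by frequency of each candidate
for phrase, count in Counter(candidates).most_common(10):
    print(count, phrase)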
Textacy
Textacy is a Python library that provides a simple and intuitive interface
for performing various natural language processing (NLP) tasks. It is built
on top of spaCy, another popular NLP library, and offers additional
functionalities and utilities to simplify common NLP workflows.
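
A hedged sketch, assuming a recent textacy release where the key-term extractors live under
textacy.extract.keyterms (the module path has moved between versions, so check your installed
version) and reusing input_text from above:

# TextRank-based key terms with textacy (built on top of spaCy).
import textacy
from textacy.extract import keyterms

doc = textacy.make_spacy_doc(input_text, lang="en_core_web_sm")
print(keyterms.textrank(doc, normalize="lemma", topn=5))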
What is text clustering?
Text clustering is a natural language processing task in which a
collection of text documents is grouped based on textual similarity.
The goal of clustering is to ensure that documents within a single group
are much more alike than they are to documents in other groups.
This technique has found application in several areas, including
organizing large collections of documents, data summarisation,
information retrieval, and recommendation systems.
How can text clustering in NLP be performed?
Here’s a step-by-step overview of how text clustering can be performed:
1. Text Preprocessing:
All text data must be cleaned and prepared before clustering; this is the
text-preprocessing stage.
Common steps in this process include the following (a short preprocessing
sketch follows the list):
   •   Tokenization: splitting the text into single words called tokens.
   •   Lowercasing: converting all tokens to the same (lower) case.
   •   Stop-word removal: removing frequently occurring but not
       meaningful items such as "a" or "the".
   •   Stemming/lemmatization: reducing words to their base or root form
       so that different forms of the same word are treated alike.
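
A hedged preprocessing sketch using NLTK (assuming its punkt, stopwords, and wordnet data
have been downloaded):

# Basic text preprocessing for clustering: tokenize, lowercase,
# remove stop words, lemmatize.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # tokenize + lowercase
    tokens = [t for t in tokens if t.isalpha()]           # drop punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]   # stop-word removal
    return [lemmatizer.lemmatize(t) for t in tokens]      # reduce to base form

print(preprocess("The dogs were barking at the cats!"))
# e.g. ['dog', 'barking', 'cat']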
2. Feature extraction:
After preprocessing, the text data must be converted into a numeric
format suitable for clustering.
This can be done in any of the following popular ways (a TF-IDF sketch
follows the list):
   •   Bag of Words (BoW): representing each text as the set of its words
       and their frequencies.
   •   Term Frequency-Inverse Document Frequency (TF-IDF): a metric that
       shows how significant a given word is in a particular document
       relative to the other documents.
   •   Pre-trained embeddings: models such as Word2Vec (Mikolov et al.,
       2013), GloVe, and BERT represent words as dense vectors that
       capture semantic relationships.
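
A hedged TF-IDF sketch with scikit-learn; the four toy documents are invented for this example
and are reused by the clustering sketches below.

# TF-IDF features for a small document collection with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)        # sparse matrix: documents x terms

print(X.shape)                                 # (4, number_of_terms)
print(vectorizer.get_feature_names_out()[:5])  # a few of the learned terms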
3. Dimensionality Reduction:
Analyzing high-dimensional data can have high computational costs, and
it may contain noise. The number of dimensions can be reduced with
principal component analysis (PCA) or t-distributed stochastic neighbor
embedding (t-SNE), which try to preserve the most essential information.
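
A hedged sketch continuing from the TF-IDF matrix X above: for sparse TF-IDF features,
scikit-learn's TruncatedSVD is a common PCA-style reducer, and t-SNE can then map the result
to 2-D for visualization.

# Reducing the TF-IDF matrix to a few dense dimensions.
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)              # X is the TF-IDF matrix from above
print(X_reduced.shape)                        # (4, 2)

# t-SNE for 2-D visualization (perplexity must be smaller than the sample count)
X_2d = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(X_reduced)
print(X_2d.shape)                             # (4, 2)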
4. Clustering Algorithms:
There are numerous algorithms by which text data can be clustered (a
K-Means sketch follows the list):
   •   K-Means: one of the most popular and simplest methods; it
       partitions the data into k clusters based on feature similarity.
   •   Hierarchical Clustering: builds a hierarchy of nested clusters,
       either from the bottom up (agglomerative) or from the top down
       (divisive).
   •   DBSCAN (Density-Based Spatial Clustering of Applications with
       Noise): forms clusters from regions where enough neighbouring data
       points surround a core point, which allows clusters of arbitrary
       shape and treats sparse points as noise.
   •   Gaussian Mixture Models (GMM): assume that the data is generated
       by a combination of several Gaussian distributions and give the
       probability that any data point belongs to a given cluster.
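
A hedged K-Means sketch, clustering the TF-IDF matrix X and documents defined above into two
groups:

# Clustering the TF-IDF vectors from above with K-Means (k = 2).
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)                # X: TF-IDF matrix built earlier

for doc, label in zip(documents, labels):
    print(label, doc)
# Expected grouping: the two pet sentences in one cluster,
# the two stock-market sentences in the other.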
5. Clustering Evaluation:
It can be difficult to evaluate the quality of clustering because it
usually involves unsupervised learning.
Several metrics are used to evaluate clustering (a short evaluation
sketch follows the list), such as:
   •   Silhouette Score: measures how much an item resembles its own
       cluster relative to other clusters.
   •   Davies-Bouldin Index: the average similarity ratio between each
       cluster and the cluster most similar to it (lower values indicate
       better separation).
   •   Adjusted Rand Index (ARI): compares how alike two different
       clusterings are in terms of the class labels they assign to
       individual objects (it requires reference labels).
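
A hedged evaluation sketch, continuing from the K-Means result above (ARI is omitted because it
needs reference labels):

# Evaluating the K-Means clustering from above.
from sklearn.metrics import silhouette_score, davies_bouldin_score

dense = X.toarray()                       # these metrics expect dense arrays
print("Silhouette:", silhouette_score(dense, labels))
print("Davies-Bouldin:", davies_bouldin_score(dense, labels))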
Applications of Text Clustering in NLP:
Text clustering has numerous applications across various domains. Here
are some of the key applications:
1. Document Organization and Management:
  •   Library Systems: Automatically organizing books, articles, and
      other materials into relevant categories.
  •   Digital Archives: Structuring and categorizing large digital archives
      to improve navigation and accessibility.
2. Information Retrieval:
  •   Search Engines: Improving search results by grouping similar
      documents together, making it easier to present relevant results.
  •   Topic Modeling: Identifying and organizing documents based on
      topic.
3. Recommendation Systems / Content Recommendation:
  •   Personalized Content Delivery: by clustering users' browsing history
      or preferences, custom content can be delivered to the users of a
      website.
4. Social Media Analysis / Trend Analysis:
  •   Social media posts or tweets can be grouped to reveal emerging
      trends or topics, especially when they are divided into subgroups
      of interest.
  •   User comments can be clustered according to the different sentiments
      identified in a tweet or status update.
Pros & cons of text clustering in NPL:
Aspect                 Pros                        Cons
                       Automatic                   Sensitivity to
Organization and       Categorization:             Parameters:
Management             Organizes large volumes     Algorithms like K-
Aspect              Pros                      Cons
                    of text data into         Means require the
                    meaningful groups.        number of clusters to
                                              be set in advance,
                    Scalability: Handles      which can be
                    large datasets            challenging.
                    efficiently.
                    Pattern Detection:
                    Identifies hidden         Dependence on
                    patterns and              Feature Extraction:
                    relationships.            Quality of clusters
                                              depends heavily on the
                    Trend Analysis: Detects   chosen feature
                    emerging trends and       extraction method.
Insight Discovery   common themes.
                    Enhanced Search:
                    Improves the relevance
                    of search results by
                                              Resource Intensive:
                    grouping similar
                                              Clustering large
                    documents.
                                              datasets can be
                                              computationally
                    Personalized
                                              expensive and time-
                    Recommendations:
                                              consuming.
Improved Search     Enables accurate
and                 content
Recommendations     recommendations.
                    Various Applications:
                    Applicable across         Interpretability: Some
                    different fields like     algorithms, particularly
                    healthcare, e-commerce,   advanced ones, can be
                    and market research.      difficult to interpret and
                                              understand.
Versatility         Integration: Can be
Aspect                 Pros                        Cons
                       combined with other
                       NLP techniques for
                       enhanced analysis.
Hard Clustering Vs Soft Clustering in NLP
In NLP, hard clustering and soft clustering refer to different methods of
grouping text data based on similarity.
Hard clustering, also known as crisp clustering or traditional clustering,
is a technique that assigns each data point, in this case, a document or
word, to exactly one cluster. The assignment is based on a similarity
measure, such as cosine similarity or Euclidean distance, which calculates
the similarity between the data points. Hard clustering algorithms like k-
means or hierarchical clustering partition the data into distinct, non-
overlapping clusters. Each document or word belongs exclusively to a
single cluster, and there is no ambiguity in the assignment.
On the other hand, soft clustering, also called fuzzy clustering or
probabilistic clustering, allows for more flexible assignments. Instead of
assigning a data point to a single cluster, soft clustering assigns a
probability or degree of membership to each cluster. This means that a
document or word can belong to multiple clusters simultaneously, with
varying degrees of membership. Soft clustering algorithms, such as fuzzy
c-means or Gaussian mixture models, calculate the likelihood of a data
point belonging to each cluster based on probabilistic models.
Hard clustering: Assigns each data point to a single, exclusive cluster
based on similarity. There is no ambiguity in the assignment.
Soft clustering: Assigns a degree of membership or probability to each
cluster for a data point. Allows for overlapping membership in
multiple clusters.
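
A hedged sketch contrasting the two: K-Means gives each point a single (hard) label, while a
Gaussian mixture model gives per-cluster membership probabilities (soft); the toy
one-dimensional data is invented for this example.

# Hard vs soft clustering on toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

points = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [3.0]])

# Hard clustering: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(points)
print(hard_labels)                          # one hard cluster label per point

# Soft clustering: each point gets a degree of membership in every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(points)
print(gmm.predict_proba(points).round(2))   # each row sums to 1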