The material in this presentation belongs to St. Francis Institute of Technology and is solely for educational purposes.
Distribution and modification of the content are prohibited.
Natural Language Processing
CDSC 7013
Subject In-charge: Ms. Pradnya Sawant
Assistant Professor
Room No. 405
Email: pradnyarane@sfit.ac.in
Module 6
Applications of NLP
  Contents
• Machine Translation
• Text Summarization
• Information Retrieval
• Question Answering Systems
• Sentiment Analysis
   Module 6
   Lecture 1
   • Machine translation
Machine Translation
  Approaches to Machine Translation
• Machine translation (MT) is the use of computers to automate some or all of the process of translating from one language to another.
• MT is classified into seven broad categories:
          •    rule-based
          •    knowledge-based
          •    principle-based
          •    statistical-based
          •    example-based
          •    hybrid-based
          •    online interactive based
  Approaches to Machine Translation
• The first three approaches (rule-based, knowledge-based, and principle-based) are the earliest and most widely used methods.
• Subsequently, most MT-related research has been based on the statistical and example-based approaches.
  Rule-based Approach
• It was the first MT strategy to be developed.
• A Rule-Based Machine Translation (RBMT) system
  consists of
          • a collection of rules, called grammar rules,
          • a bilingual or multilingual lexicon, and
          • software programs to process the rules.
  Rule-based Approach
• Building RBMT systems entails a huge human effort to code all of the linguistic resources, such as
           • source-side part-of-speech taggers and syntactic parsers
           • bilingual dictionaries
           • source-to-target transliteration rules
           • target-language morphological generators
           • structural transfer rules
           • reordering rules.
• Nevertheless, an RBMT system is always extensible and maintainable.
Rule Based Machine Translation (RBMT) System
• Given input sentences in some source language (SL), an RBMT system generates output sentences in some target language (TL) on the basis of morphological, syntactic, and semantic analysis of both the source and the target languages involved in a concrete translation task.
• It applies a set of linguistic rules in three different phases: analysis, transfer, and generation.
• It therefore requires syntax analysis, semantic analysis, syntax generation, and semantic generation.
    RBMT System
• RBMT generates the target text from a source text through the following steps.
  RBMT System
• The source language morphological analyzer analyzes a source language word and provides its morphological information.
• The source language parser is a syntax analyzer that analyzes source language sentences.
• The translator translates a source language word into the target language (word translation).
• The target language morphological analyzer works as a generator: it generates appropriate target language words for the given grammatical information.
• The target language parser works as a composer: it composes a suitable target language sentence.
RBMT System
• This type of MT system needs a minimum of three dictionaries:
1. Source Language Dictionary: used by the SL morphological analyzer.
2. Bilingual Dictionary: used by the translator to translate source language words into the target language.
3. Target Language Dictionary: used by the TL morphological generator to generate target language words.
1. Rule-based Approach
• Rules play a major role in various stages of
  translation, such as :
        • syntactic processing
        • semantic interpretation and
        • contextual processing of language.
• Generally, rules are written with linguistic
  knowledge gathered from linguists.
• Three different approaches under the RBMT category are
        • Direct Translation
        • Transfer-based MT and
        • Interlingua-based MT.
  1.1 Direct Translation
• In this method, the SL text is analyzed structurally only up to the morphological level, and the system is designed for a specific source-target language pair.
• The performance of a direct MT system depends on
     • the quality and quantity of the source-target language dictionaries
     • morphological analysis
     • text-processing software, and
     • word-by-word translation with minor grammatical adjustments to word order and morphology.
• After the words are translated, simple reordering rules are applied. Example: move adjectives after nouns when translating from English to French (a minimal sketch follows).
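A minimal Python sketch of the direct-translation idea under toy assumptions: the bilingual dictionary, the adjective/noun word lists, and the single adjective-after-noun reordering rule are all invented for illustration and are not part of any real system.

# Toy direct (word-by-word) English -> French translation with one reordering rule.
# The tiny dictionary and the adjective/noun lists below are illustrative only.

BILINGUAL_DICT = {
    "the": "le", "red": "rouge", "car": "voiture", "eats": "mange",
    "a": "une", "big": "grande", "apple": "pomme",
}
ADJECTIVES = {"red", "big"}   # hypothetical adjective list
NOUNS = {"car", "apple"}      # hypothetical noun list

def translate_direct(sentence: str) -> str:
    words = sentence.lower().split()
    # 1) Reordering rule: in French, most adjectives follow the noun,
    #    so swap adjacent (adjective, noun) pairs before lookup.
    i = 0
    while i < len(words) - 1:
        if words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2
        else:
            i += 1
    # 2) Word-by-word dictionary lookup (unknown words are copied through).
    return " ".join(BILINGUAL_DICT.get(w, w) for w in words)

print(translate_direct("the red car"))   # -> "le voiture rouge"
print(translate_direct("a big apple"))   # -> "une pomme grande"

Even this toy exposes the characteristic weakness of direct MT: beyond word order it makes no grammatical adjustments (e.g., gender agreement), so output such as "le voiture rouge" still needs morphological correction.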
    1.2 Transfer Based Translation
• On the basis of the structural differences between the source and target language, a transfer system can be broken down into three different stages:
           i) Analysis
           ii) Transfer and
           iii) Generation.
• Analysis: the source language (SL) parser is used to produce the syntactic representation of an SL sentence.
• Transfer: the result of the first stage is converted into an equivalent Target Language (TL)-oriented representation, i.e., the source-language parse tree is converted into a target-language parse tree.
• Generation: a TL morphological generator is used to produce the final TL text, i.e., the target-language parse tree is converted into an output sentence (a skeletal sketch of the three stages follows).
• Disadvantage of rule-based MT: it requires good dictionaries and manual crafting of rules.
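A skeletal Python sketch of the analysis-transfer-generation pipeline. The flat list of (word, tag) pairs standing in for a parse tree, the toy tagger, and the single structural transfer rule are simplifying assumptions made only to show how the three stages connect.

# Skeletal transfer-based MT pipeline: analysis -> transfer -> generation.
# Real systems use full syntactic trees; a tagged word list is used here for brevity.

BILINGUAL_DICT = {"the": "le", "red": "rouge", "car": "voiture", "stops": "s'arrête"}

def analyze(sentence):
    """Analysis: toy POS tagging of the SL sentence."""
    tags = {"the": "DET", "red": "ADJ", "car": "NOUN", "stops": "VERB"}
    return [(w, tags.get(w, "X")) for w in sentence.lower().split()]

def transfer(sl_tree):
    """Transfer: map SL lexical items to TL and apply a structural rule
    (the adjective is placed after the noun, as in French)."""
    tl_tree = [(BILINGUAL_DICT.get(w, w), t) for w, t in sl_tree]
    out = []
    for word, tag in tl_tree:
        if out and tag == "NOUN" and out[-1][1] == "ADJ":
            out.insert(-1, (word, tag))   # noun moves before the adjective
        else:
            out.append((word, tag))
    return out

def generate(tl_tree):
    """Generation: linearize the TL representation into a sentence."""
    return " ".join(w for w, _ in tl_tree)

print(generate(transfer(analyze("the red car stops"))))  # -> "le voiture rouge s'arrête"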
  1.3 Interlingua Based Translation
• In Interlingua approach, the translation is performed by first
    representing the SL (Source Language) text into an
    intermediary (semantic) form called Interlingua.
• The idea is to represent all sentences that mean the “same”
    thing in the same way, regardless of the language they happen
    to be in.
• The advantage of this approach is that Interlingua is a language
    independent representation from which translations can be
    generated to different TLs.
• Thus, the translation consists of two stages:
1. The SL text is first converted into the Interlingua (IL) representation.
2. The IL representation is translated into the TL (Target Language).
  1.3 Interlingua Based Translation
• The main advantage of the Interlingua approach is that the analyzer and parser for the SL are independent of the generator for the TL.
• Advantage of Interlingua: it is economical when translation among multiple languages is involved; for n languages only n analyzers and n generators (2n components) are needed, instead of roughly n(n-1) direct transfer systems.
• Disadvantages:
• What would a language-independent representation look like? i.e., it is difficult to define the interlingua.
• Interlingua does not take advantage of similarities between closely related languages; for example, Tamil and Telugu are siblings of the same (Dravidian) family.
   Module 6
   Lecture 2
   • Machine translation
2. Knowledge-based MT (KBMT)
• The emphasis is on a functionally complete understanding of the source text prior to translation into the target text.
• KBMT is implemented on the Interlingua architecture.
• KBMT must be supported by world knowledge and by linguistic semantic knowledge about the meanings of words and their combinations.
2. Knowledge-based MT (KBMT)
• Once the SL text is analyzed, it is run through the augmenter, a knowledge base that converts the source representation into an appropriate target representation before it is synthesized into the target sentence.
  • KBMT systems provide high quality translations.
  • They are quite expensive to produce due to the
    large amount of knowledge needed to accurately
    represent sentences in different languages.
  • E.g. The English-Vietnamese MT system
  3. Principle Based MT (PBMT)
• PBMT systems employ parsing methods based on the Principles and Parameters Theory of Chomsky's Generative Grammar.
 •     The parser generates a detailed syntactic structure that
       contains lexical, phrasal, grammatical, and thematic
       information.
 •     It also focuses on robustness, language-neutral
       representations, and deep linguistic analysis.
 •     In the PBMT, the grammar is thought of as a set of
       language-independent, interactive well-formed principles
       and a set of language-dependent parameters.
 •     Thus, for a system that uses n languages, one must have n
       parameter modules and a principles module.
   3. Principle Based MT (PBMT)
• It is well-suited for use with the interlingual architecture.
• PBMT parsing methods differ from the rule-based approaches.
• Although efficient in many circumstances, they have the
  drawback of language-dependence and increase exponentially
  in rules if one is using a multilingual translation system.
• They provide broad coverage of many linguistic phenomena,
  but lack the deep knowledge about the translation domain that
  KBMT and EBMT systems employ.
• PBMT is not an efficient method for language pairs that follow very different principles.
• E.g. UNITRAN, a principle-based "universal translator" system.
  4. Empirical MT (EMT) Systems
• EMT systems rely on large parallel, aligned corpora.
• Empirical systems acquire knowledge about the translation process automatically, in the form of rules induced from a collection of translation examples.
• They use automatically induced rules.
• Two categories: Statistical and Example-Based.
  4.1 Statistical-based Approach
• Translation is based on the knowledge and statistical
  models extracted from bilingual corpora.
• Bilingual or multilingual textual corpora of the source and target language(s) are required.
• A supervised or unsupervised ML algorithm is used to build statistical tables from the corpora.
• The statistical tables capture the characteristics of well-formed sentences and the correlations between the languages.
  4.1 Statistical-based Approach
• During translation, the collected statistical information is used
  to find the best translation for the input sentences, and this
  translation step is called the decoding process.
• There are three different statistical approaches to MT:
     • Word-based Translation
     • Phrase-based Translation
     • Hierarchical Phrase-based model.
• The idea behind SMT comes from information theory.
• A document is translated according to the probability distribution p(e|f), i.e., the probability of translating a sentence f in the SL (F) into a sentence e in the TL (E).
  4.1 Statistical-based Approach
• The problem of modeling the probability distribution p(e|f) has been approached in a number of ways.
• One intuitive approach is to apply Bayes' theorem: p(e|f) = p(f|e) p(e) / p(f).
• Since p(f) is fixed for a given input sentence, the best translation is the e that maximizes p(f|e) p(e), where p(f|e) is the translation model and p(e) is the language model.
• The translation model p(f|e) is the probability that the source sentence is a translation of the target sentence, i.e., it models how sentences in E get converted into sentences in F.
• The language model p(e) is the probability of seeing that TL string, i.e., it models which sentences are likely in language E (a toy example follows).
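A toy illustration of how the two models combine in the decision rule argmax over e of p(f|e) p(e): the candidate translations and all probability values below are invented purely for illustration.

# Toy noisy-channel scoring: pick the target sentence e maximizing p(f|e) * p(e).
# All candidates and probabilities are made up for this example.

f = "la maison bleue"                       # source (French) sentence

candidates = ["the blue house", "the house blue", "blue the house"]

translation_model = {                        # p(f | e): adequacy
    "the blue house": 0.20,
    "the house blue": 0.25,                  # word-for-word, so slightly "closer" to f
    "blue the house": 0.25,
}
language_model = {                           # p(e): fluency in the target language
    "the blue house": 0.010,
    "the house blue": 0.001,
    "blue the house": 0.0001,
}

def score(e: str) -> float:
    return translation_model[e] * language_model[e]

best = max(candidates, key=score)
print(best, score(best))                     # -> "the blue house" 0.002

Note how the language model prefers the fluent word order even though the translation model alone slightly favors the literal candidates.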
  4.1 Statistical-based Approach
• This decomposition is attractive because it splits the problem into two sub-problems.
• Finding the best translation ê is done by picking the candidate that gives the highest probability, as in the equation below:
      ê = argmax_e p(f|e) p(e)
4.1.1 Word-based Translation
• Here, the words in an input sentence are translated word by word individually, and these words are finally arranged in a specific way to get the target sentence.
• This approach was the very first attempt at statistical MT.
• It is comparatively simple and efficient.
• Disadvantage: translating words in isolation ignores context, which reduces the quality of the translation.
4.1.2 Phrase-based Translation
• It is a more accurate SMT approach.
• Here, each sentence is divided into separate phrases instead of individual words (a toy phrase-table lookup is sketched below).
• The alignment between the phrases in the input and output sentences normally follows certain patterns.
• It resulted in better performance than word-based translation.
• However, it did not improve the modeling of sentence-order patterns.
• The reordering technique may perform well with local phrase orders, but not as well with long sentences and complex orderings.
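A minimal sketch of phrase-based lookup: the sentence is segmented greedily into the longest phrases found in a toy phrase table (the phrase pairs are illustrative assumptions), so multi-word expressions such as idioms can be translated as units.

# Toy phrase-based translation: greedy longest-match segmentation against a
# small phrase table. The phrase pairs below are illustrative only.

PHRASE_TABLE = {
    ("kick", "the", "bucket"): ["casser", "sa", "pipe"],   # idiom handled as one phrase
    ("the", "bucket"): ["le", "seau"],
    ("he", "will"): ["il", "va"],
    ("kick",): ["frapper"],
    ("he",): ["il"], ("will",): ["va"], ("the",): ["le"], ("bucket",): ["seau"],
}
MAX_PHRASE_LEN = 3

def translate_phrases(sentence: str) -> str:
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest source phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            src = tuple(words[i:i + n])
            if src in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[src])
                i += n
                break
        else:
            out.append(words[i])   # unknown word: copy through
            i += 1
    return " ".join(out)

print(translate_phrases("he will kick the bucket"))
# -> "il va casser sa pipe"  (the idiom is translated as a unit, unlike word-by-word MT)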
4.1.3 Hierarchical Phrase-based Model
• The advantage of this approach is that hierarchical phrases
  have recursive structures instead of simple phrases.
• This higher level of abstraction further improved the accuracy of SMT systems.
   Module 6
   Lecture 3
   • Machine translation
   • Information Retrieval
  4.2 Example Based MT (EBMT)
• It is based on analogical reasoning between two translation
  examples.
• It relies on large parallel aligned corpora.
• An EBMT system is given a set of sentences in the SL and
  their corresponding translations in the TL, and uses those
  examples to translate other, similar SL sentences into the
  TL.
• The basic logic is that, if a previously translated sentence
  occurs again, the same translation is likely to be correct
  again.
• EBMT systems are attractive in that they require a minimum
  of prior knowledge; therefore, they quickly adapt to many
  language pairs.
  4.2 Example Based MT
       • A restricted form of example-based translation is
         available commercially, known as a translation memory.
       • In a translation memory, as the user translates text, the
         translations are added to a database, and when the same
         sentence occurs again, the previous translation is inserted
         into the translated document.
• This saves the user the effort of re-translating that sentence.
• More advanced translation memory systems will also return close but inexact (fuzzy) matches, on the assumption that editing the translation of a close match takes less time than generating a translation from scratch (a minimal sketch follows).
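A minimal translation-memory sketch, assuming a small in-memory store of previously translated sentence pairs; Python's difflib similarity ratio stands in for a real fuzzy-match score.

# Minimal translation memory: exact matches reuse the stored translation,
# near (fuzzy) matches are returned for the translator to post-edit.
import difflib

memory = {
    "The printer is out of paper.": "L'imprimante n'a plus de papier.",
    "Turn off the device before cleaning.": "Éteignez l'appareil avant le nettoyage.",
}

def lookup(sentence: str, threshold: float = 0.8):
    if sentence in memory:                       # exact match: reuse translation
        return memory[sentence], 1.0
    close = difflib.get_close_matches(sentence, memory.keys(), n=1, cutoff=threshold)
    if close:                                    # fuzzy match: propose for post-editing
        return memory[close[0]], difflib.SequenceMatcher(None, sentence, close[0]).ratio()
    return None, 0.0                             # no match: translate from scratch

print(lookup("The printer is out of paper."))   # exact match
print(lookup("The printer is out of ink."))     # close but inexact match

The second lookup has no exact match, so the memory proposes the stored translation of the closest sentence for the translator to post-edit.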
  4.2 Example Based MT
• E.g.
    • ALEPH: a company incorporated in Qatar with an international network of professional linguists offering a variety of translation services.
    • wEBMT: a machine translation system using the World Wide Web.
    • PanEBMT: developed at Carnegie Mellon University.
  5. Online Interactive MT
• In this interactive translation system, the user is allowed to suggest the correct translation to the translator online.
• This approach is very useful where the context of a word is unclear and there exist many possible meanings for a particular word.
• In such cases, the ambiguity can be resolved with the user's interpretation.
  6. Hybrid MT
• It takes advantage of both statistical and rule-based translation methodologies.
• It is more efficient than either approach used alone.
• These systems are based on both rules and statistics.
• The hybrid approach can be applied in different ways:
      • Translation is performed in a first stage using a rule-based approach, followed by adjusting or correcting the output using statistical information.
      • Alternatively, rules are used to pre-process the input data as well as to post-process the output of a statistical translation system. This technique offers more power, flexibility, and control in translation.
   6. Hybrid MT
• Example:
• METIS-II MT system is an example of hybridization
  which avoids the usual need for parallel corpora by
  using a bilingual dictionary and a monolingual corpus in
  the TL.
• Oepen MT System: It integrates statistical methods
  within an RBMT system to choose the best translation
  from a set of competing hypotheses (translations)
  generated using rule-based methods.
Information Retrieval
  Information Retrieval
  • IR deals with the representation, storage, and
    access of information and is concerned with the
    organization and retrieval of information from
    large database collections.
 Information Retrieval System
  Information Retrieval
  • Here, the user issues a query q from the front-end application
    (accessible via, e.g., a Web browser)
  • q is processed by a query interaction module that transforms
    it into a “machine-readable” query q’ to be fed into the core
    of the system, a search and query analysis module.
  • This is the part of the IR system having access to the content
    management module directly linked with the back-end
    information source (e.g., a database).
• Once a set of results r is made ready by the search module, it is returned to the user via the result interaction module; optionally, the result is modified (into r′) or updated until the user is completely satisfied.
  Information Retrieval
  • The most widespread applications of IR are the ones
    dealing with textual data.
  • As textual IR deals with document sources and questions,
    both expressed in natural language, a number of textual
    operations take place “on top” of the classic retrieval steps.
• The textual operations typically performed on a query by an IR engine are shown in the figure.
  Information Retrieval : Text
  Information Retrieval : Text
  1. The user need is specified via the user interface, in
  the form of a textual query qU (typically made of
  keywords).
  2. The query qU is parsed and transformed by a set of
  textual operations; the same operations have been
  previously applied to the contents indexed by the IR
  system. This step yields a refined query q’U .
  3. Query operations further transform the preprocessed
  query into a system-level representation, qS .
  Information Retrieval : Text
  4. The query qS is executed on top of a document
  source D (e.g., a text database) to retrieve a set of
  relevant documents, R.
  Fast query processing is made possible by the index
  structure previously built from the documents in the
  document source.
  5. The set of retrieved documents R is then ordered:
  documents are ranked according to the estimated
  relevance with respect to the user’s need.
6. The user then examines the set of ranked documents for useful information; they might pinpoint a subset of the documents as definitely of interest and thus provide feedback to the system (a minimal end-to-end sketch of steps 2-5 follows).
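A minimal end-to-end sketch of steps 2-5 above, under toy assumptions: three tiny "documents", a hand-picked stop-word list, and ranking by summed term frequency stand in for a real document source, textual operations, and ranking model.

# Minimal ranked retrieval over an inverted index.
from collections import defaultdict

DOCS = {
    "d1": "information retrieval deals with the storage and retrieval of documents",
    "d2": "machine translation converts text from a source language to a target language",
    "d3": "question answering systems retrieve answers from large document collections",
}
STOP_WORDS = {"the", "and", "of", "with", "a", "to", "from"}

def preprocess(text):
    # Textual operations: tokenization + stop-word removal (applied to documents and queries alike).
    return [t for t in text.lower().split() if t not in STOP_WORDS]

# Index construction: term -> {doc_id: term frequency}
index = defaultdict(lambda: defaultdict(int))
for doc_id, text in DOCS.items():
    for term in preprocess(text):
        index[term][doc_id] += 1

def search(query):
    q_terms = preprocess(query)              # same operations applied to the query
    scores = defaultdict(int)
    for term in q_terms:
        for doc_id, tf in index[term].items():
            scores[doc_id] += tf             # rank by summed term frequency
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("retrieval of documents"))      # -> d1 ranked first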
  Text Processing in IR
    • Not all words are equally effective for the
      representation of a document’s semantics.
• Nouns (single words or noun-phrase groups) are the most representative components of a document in terms of content.
    • Based on this observation, IR system also
      preprocesses the text of the documents to determine
      the most “important” terms to be used as index
      terms: a subset of the words is therefore selected to
      represent the content of a document.
  Text Processing in IR
     • When selecting candidate keywords, indexing
       must fulfill two different and potentially opposite
       goals:
     • one is exhaustiveness, i.e., assigning a sufficiently
       large number of terms to a document, and
     • the other is specificity, i.e., the exclusion of
       generic terms that carry little semantics and inflate
       the index.
     • Generic terms (conjunctions and prepositions) are
       characterized by a low discriminative power as
       their frequency across any document in the
       collection tends to be high.
  Text Processing in IR
• In other words, generic terms have high term
  frequency, defined as the number of occurrences of
  the term in a document.
• In contrast, specific terms have higher discriminative
  power, due to their rare occurrences across collection
  documents: they have low document frequency,
  defined as the number of documents in a collection in
  which a term occurs.
  Textual Operations
• The textual preprocessing phase typically performed by an IR engine takes a document as input and yields its index terms as output.
• The process of this extraction is shown in the figure.
  1. Document Parsing.
 • Documents come in all sorts of languages, character
       sets, and formats and the same document may
       contain multiple languages or formats.
 •     e.g. A French email with Portuguese PDF
       attachments.
 •     Document parsing deals with the recognition and
       “breaking down” of the document structure into
       individual components.
 •     In this preprocessing phase, unit documents are
       created;
 •     e.g., emails with attachments are split into one
       document representing the email and as many
       documents as there are attachments.
  2. Lexical Analysis.
    • After parsing, lexical analysis tokenizes a
      document, seen as an input stream, into words.
    • Issues related to lexical analysis include the
      correct identification of accents, abbreviations,
      dates, and cases.
    • The difficulty of this operation depends much on
      the language at hand.
  3. Stop-Word Removal
• A subsequent step optionally applied to the results of
  lexical analysis is stop-word removal, i.e., the removal
  of high-frequency words.
• The subsequent phases take the full-text structure
  derived from the initial phases of parsing and lexical
  analysis and process it in order to identify relevant
  keywords to serve as index terms.
  4. Phrase Detection
• This step aims to capture meaning beyond individual words by identifying multi-word phrases.
• Phrase detection may be approached in several ways,
  including
    • rules
    • morphological analysis
    • syntactic analysis, and combinations thereof.
• A common approach to phrase detection relies on the
  use of thesauri
• Thesauri usually contain synonyms and antonyms.
4. Phrase Detection
• Thesauri may be composed following different approaches.
• Human-made thesauri:
    • They are generally hierarchical, containing related terms, usage examples, and special cases.
    • Another format is the associative one, in which graphs are derived from underlying synonym sets (synsets), as in WordNet.
• An alternative to the consultation of thesauri is to use
  machine learning techniques.
• The Keyphrase Extraction Algorithm (KEA) identifies candidate key-phrases using lexical methods, calculates feature values for each candidate, and uses a supervised ML algorithm to predict which candidates are good key-phrases, based on a corpus of previously annotated documents (a toy candidate-phrase extractor is sketched below).
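A toy candidate-phrase extractor (not KEA itself): noun phrases matching a simple POS pattern are proposed as candidates and ranked by frequency. It assumes NLTK is installed with its tokenizer and POS-tagger data downloaded; the chunk grammar and example text are illustrative.

# Candidate key-phrase detection via a POS-pattern chunker.
from collections import Counter
import nltk

GRAMMAR = "NP: {<JJ>*<NN.*>+}"          # optional adjectives followed by one or more nouns
chunker = nltk.RegexpParser(GRAMMAR)

def candidate_phrases(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    phrases = [" ".join(w for w, _ in st.leaves())
               for st in tree.subtrees(filter=lambda t: t.label() == "NP")]
    return Counter(p.lower() for p in phrases)

text = ("Information retrieval systems index large document collections. "
        "Document collections are preprocessed before indexing.")
print(candidate_phrases(text).most_common(3))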
  5. Stemming and Lemmatization
• Stemming and lemmatization aim at stripping word suffixes in order to normalize a word to a base form (see the sketch below).
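A small illustration using NLTK's Porter stemmer and WordNet lemmatizer (the WordNet data must be downloaded); the word list is arbitrary. Stems need not be valid words, whereas lemmas are dictionary forms.

# Stemming vs. lemmatization on a few inflected forms.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "connections", "better"]:
    print(word,
          "| stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word))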
  6. Weighting
• The final phase of text preprocessing deals with term
  weighting.
• The words in a text have different descriptive power;
  hence, index terms can be weighted differently to
  account for their significance within a document
  and/or a document collection.
• Such a weighting can be binary, e.g., assigning 0 for term absence and 1 for presence; richer schemes combine the term frequency and document frequency introduced earlier, as in tf-idf (sketched below).
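A minimal weighting sketch over three toy documents, contrasting binary presence/absence with a tf-idf style weight built from the term frequency and document frequency defined earlier; the particular formula used (tf × log(N/df)) is one common variant among several.

# Index-term weighting: binary vs. tf-idf over a toy collection.
import math
from collections import Counter

docs = {
    "d1": "machine translation translation system",
    "d2": "information retrieval system",
    "d3": "question answering system",
}
tokenized = {d: text.split() for d, text in docs.items()}
N = len(docs)

def binary_weight(term, doc_id):
    return 1 if term in tokenized[doc_id] else 0

def tf_idf(term, doc_id):
    tf = Counter(tokenized[doc_id])[term]                        # term frequency in the document
    df = sum(1 for toks in tokenized.values() if term in toks)   # document frequency
    idf = math.log(N / df) if df else 0.0                        # rare terms get higher weight
    return tf * idf

for term in ["system", "translation"]:
    print(term, "binary:", binary_weight(term, "d1"), "tf-idf:", round(tf_idf(term, "d1"), 3))

Here a term that occurs in every document (such as "system") gets a tf-idf weight of zero, reflecting its low discriminative power, while a rarer term gets a higher weight.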
Question Answering System
          (QAS)
  Question Answering System (QAS)
• QASs attempt to answer questions asked by users in
  natural languages after retrieving and processing
  information from different data sources.
• The format of answers may also vary from simple
  text to multimedia.
General Architecture of a QAS
  Question Answering System (QAS)
A typical Question Answering System consists of
three main modules:
• Question Analysis
• Answer Retrieval
• Answer Generation.
  Question Answering System (QAS)
The question analysis module:
• takes natural language questions as input,
• identifies what the question is asking for (a location, a date, a
  person's name, etc.), and
• is responsible for analyzing the question completely.
• Its aim is to understand the purpose and meaning of the question,
  so the question has to be analyzed in several different ways.
  Morpho-syntactic analysis of the question
• First, carry out the morpho-syntactic analysis of the
  words in the question.
• This is done by POS tagging.
• After POS tagging, find out the questioning
  information (what the question is looking for).
• A question class helps the system to classify the
  question type and provide a suitable answer.
  Question Analysis
• To get the meaning of the question, we need to
  classify its semantic type.
• Question classification assigns the question to
  pre-defined semantic categories, each of which leads to a
  different processing strategy.
  Question Classification
• The question classification process generates possible
  question classes.
• For example, a question can ask for a date, a time, a
  location, or a person.
• E.g. the question “Who was the first American in
  space?” expects a person's name in the answer.
• This greatly reduces the search space of reasonable
  answers.
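• A minimal rule-based question classifier in Python (an illustrative sketch; real systems typically learn these classes from annotated questions, and the pattern list below is an assumption):

import re

# illustrative mapping from surface patterns to expected answer types
RULES = [
    (r"^who\b",                 "PERSON"),
    (r"^where\b",               "LOCATION"),
    (r"^when\b|\bwhat year\b",  "DATE"),
    (r"^how (many|much)\b",     "QUANTITY"),
]

def classify(question):
    q = question.lower().strip()
    for pattern, answer_type in RULES:
        if re.search(pattern, q):
            return answer_type
    return "OTHER"                      # fallback when no pattern matches

print(classify("Who was the first American in space?"))   # PERSON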
  Question Analysis
• Once the question type is recognized, question
  analysis needs to recognize further constraints that the
  expected answer must meet.
• This process can be as simple as extracting keywords from
  the question and using them to find candidate answer sentences.
• These keywords may then be extended by using
  morphological variants and/or synonym replacements, or by
  query expansion techniques.
• Together they form the representation of the question.
  Answer Retrieval
• It involves the following steps:
    • Document Retrieval
    • Document Processing
    • Syntactic Analysis
    • Semantic Analysis
    • Relation Identification
  Document Retrieval
• This module selects a set of relevant documents
  from a domain specific repository.
• Conceptual indexing is used for the retrieval
  process, since keyword-based indexing ignores
  the semantic content of the document collection.
• Both the documents and queries can be mapped into
  concepts and these concepts are used as a
  conceptual indexing space for identifying and
  extracting documents.
  Document Processing
• The retrieved documents are processed to extract the
  candidate answer set.
• This module is responsible for selecting the
  response based on the relevant fragments of the
  documents.
  Syntactic Analysis
• The documents are analyzed syntactically using
  NLP techniques such as POS tagging and NER.
• First, the documents are tokenized into a set of
  sentences. Then POS tagging and NER are
  performed.
• Shallow parsing is performed to identify the phrasal
  chunks.
• The chunks identified in the question analysis
  module are matched with those identified in the
  document and relevant sentences are retrieved.
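• An illustrative sketch of this step with NLTK (assuming the standard tokenizer, tagger, and chunker models have been downloaded; the toy noun-phrase grammar is an assumption):

import nltk

text = "Alan Shepard was the first American in space."

for sent in nltk.sent_tokenize(text):                  # sentence tokenization
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))    # POS tagging
    entities = nltk.ne_chunk(tagged)                   # named entity recognition
    chunks = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}").parse(tagged)  # shallow NP chunking
    print(entities)
    print(chunks)
# the NP chunks found here would be matched against the chunks extracted
# from the question, and the sentences containing matches are retrieved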
  Semantic Analysis
• Shallow parsing can be performed for finding the
  semantic phrases or clauses.
• Semantic roles are identified and mapped to
  semantic frames.
• The sentences whose semantic frames map exactly
  to the semantic frames of the question are also
  extracted.
  Relation Identification
• The base ontology (the relationships between the entities)
  is populated with domain knowledge
  incrementally as we go through different sets of
  documents.
• In this way, valid knowledge of any
  specialized discipline can be incorporated into the
  system.
• The relations among different concepts are
  identified using the domain knowledge, and the
  ontological information is obtained.
  Answer Generation
• The candidate answer set is filtered and the answers
  are generated.
• The user is supplied with a set of short and specific
  answers, ranked according to their relevance.
• The different stages are:
  - Filtering
  - Answer Ranking
  - Answer Generation
  Filtering
• The extracted sentences are filtered and the
  candidate answer set is produced.
• This is done by incorporating the information
  obtained from the question classification and
  document processing modules.
• The identified focus and frames are matched to get
  the candidate set.
  Answer Ranking
• The answer set is ranked based on the semantic similarity.
• Simple template matching is not adopted since it neglects the
    semantic content and domain knowledge.
• Answers are ranked based on the similarity between the
    question frame and the answer frame.
Example: The event E “John gave a balloon to the kid.” has the
role pattern AGENT give THEME to RECIPIENT, and its semantic
frame is identified as
• has_possession(start(E), Agent, Theme)
• has_possession(end(E), Recipient, Theme)
• transfer(during(E), Theme)
A candidate answer is kept when this frame matches the question
frame exactly.
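• A toy Python illustration of ranking answers by frame overlap (a simplification of the idea; the frame representation and the Jaccard similarity measure are assumptions, not the system's actual method):

# a semantic frame represented as a set of predicate strings
question_frame = {"has_possession(start(E), Agent, Theme)",
                  "has_possession(end(E), Recipient, Theme)",
                  "transfer(during(E), Theme)"}

candidate_frames = {
    "John gave a balloon to the kid.": {"has_possession(start(E), Agent, Theme)",
                                        "has_possession(end(E), Recipient, Theme)",
                                        "transfer(during(E), Theme)"},
    "John saw a balloon.":             {"perceive(during(E), Agent, Theme)"},
}

def similarity(frame_a, frame_b):
    # Jaccard overlap between the two predicate sets
    return len(frame_a & frame_b) / len(frame_a | frame_b)

for sentence, frame in sorted(candidate_frames.items(),
                              key=lambda kv: similarity(question_frame, kv[1]),
                              reverse=True):
    print(round(similarity(question_frame, frame), 2), sentence)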
  Answer Generation
• From the answer set, specific answers have to be
  generated in case the direct answers are not
  available.
• Hidden relations can be identified from the domain
  knowledge gathered from the ontology.
• Concept of natural language generation can also be
  utilized for this purpose.
  Classification of Question Answering System
• The large number of available QASs can be classified
   according to eight criteria.
These criteria are:
1. Application domains for which QASs are developed
2. Types of questions asked by the users
3. Types of analyses performed on users’ questions and source
documents
4. Types of data consulted in data sources
5. Characteristics of data sources
6. Types of representations used for questions and their matching
functions
7. Types of techniques used for retrieving answers
8. Forms of answers generated by QASs
  1. Classification based on application domain
• The task of generating answers of questions is
  related to the type of questions asked.
• Some users may require general information on a
  general topic
• Some may require specific information from a
  particular application domain.
• Therefore, selection of the domain as a basis of
  classification of QASs may be a natural choice.
  General domain (Open Domain) QASs
• In this category, QASs answer domain-independent
  questions.
• They search for answers within a large document
  collection.
• There is a large repository of questions that can be
  asked.
• QASs exploit general ontology and world
  knowledge for generating answers.
• Here, the quality of answers delivered is not high,
  and questions are asked by casual users.
  Pros of general domain QASs
• There are a large number of casual users; general
  domain QASs are more suitable for them.
• General domain QASs use a general dictionary.
• Users don’t need to acquire knowledge of domain
  specific keywords for formulating questions.
• There is a large repository of questions.
• Wikipedia or news text can be utilized as a source
  of information for such QASs
  Cons of general domain QASs
• The quality of answers is low.
• Whether the answers are satisfactory depends upon the users.
• Domain experts require specialized information in
  answers and hence restricted domain QASs may be
  more suitable for them.
    Restricted Domain (Closed Domain) QASs
• These QASs answer domain-specific questions.
• Answers are searched within domain specific document
  collections.
• Repository of question patterns is very limited; hence the
  systems can achieve good accuracy in answering
  questions.
• QASs exploit domain specific ontology and terminology.
• The quality of answers is expected to be high.
• Various restricted domain QASs have been developed:
  temporal domain QAS, geospatial domain QAS, medical
  domain QAS, patent QAS, community-based QAS, etc.
  Restricted domain (Closed Domain) QASs
• Different restricted domain QASs can be integrated
  to make General domain QASs .
• Such an integrated QAS must assign each question to an
  appropriate domain-specific QAS, based on the
  knowledge derived from the question's keywords.
• It faces problems in handling and forwarding the
  given questions to a particular restricted domain
  QAS, since systems suffer from question classification
  problems, ambiguity resolution problems, etc.
  Pros of restricted domain QASs
• Restricted domain QASs suit domain-expert
  users, as they need specialized answers.
• The quality of answers generated by restricted
  domain QASs is high.
• The level of satisfaction of the users depends on
  their domain knowledge.
  Cons of restricted domain QASs
• There is a limited repository of domain specific
  questions; such QASs can answer a limited number
  of questions.
  2. Classification based on Types of Questions
The different categories are
1. Factoid type questions
2. List type questions
3. Hypothetical type questions
4. Confirmation questions
5. Causal questions.
  Factoid type questions [what, when, which, who,
  how]
• These questions are simple and fact based that
  require answers in a single short phrase or sentence.
• The factoid type questions generally start with a wh-
  word.
• Current QASs have got a satisfactory performance
  in answering factoid type questions.
  List type questions
• The list questions require a list of entities or facts in the
  answer, e.g., list the names of employees getting a
  salary of more than 5K.
• QASs treat such questions as a series of factoid
  questions which are asked repeatedly, one after the
  other.
• The previous answers are ignored while the next
  questions are fired by QASs.
• QASs generally observe a problem in fixing the
  threshold value for the number or quantity of the
  entity asked in list type questions.
  Hypothetical type questions
• Hypothetical questions ask for information related
  to any hypothetical event.
• They generally begin with ‘what would happen if’.
  QASs require knowledge retrieval techniques for
  generating answers.
• Moreover, the answers to these questions are
  subjective.
• There are no specific correct answers to these
  questions.
  Confirmation questions
• Confirmation questions require answers in the form
  of yes or no.
• Systems require an inference mechanism, world
  knowledge and common-sense reasoning to
  generate answers.
  Causal questions [how or why]
• Causal questions require explanations about an
  entity.
• The answers are not named entities as observed in
  the case of factoid type questions.
• QASs require advanced natural language processing
  techniques to analyze the text at pragmatic and
  discourse level for generating answers.
Text Summarization
  Text Summarization
• Text summarization refers to the technique of shortening
  long pieces of text.
• The intention is to create a coherent and fluent summary
  having only the main points outlined in the document.
• There are two main approaches to text summarization in NLP:
    • Extraction-based summarization
    • Abstraction-based summarization
  Extraction-based Summarization
• The extractive text summarization technique involves
  pulling key phrases from the source document and
  combining them to make a summary.
• The extraction is made according to the defined metric
  without making any changes to the texts.
• Here is an example:
• Source text: Peter and Elizabeth took a taxi to attend the
  night party in the city. While in the party, Elizabeth
  collapsed and was rushed to the hospital.
• Extractive summary: Peter and Elizabeth attend party city.
  Elizabeth rushed hospital.
• As you can see above, the important words have been
  extracted and joined to create a summary — although
  sometimes the summary can be grammatically strange.
    Abstraction-based Summarization
• The abstraction technique entails paraphrasing and shortening parts
  of the source document.
• When abstraction is applied for text summarization in deep learning
  problems, it can overcome the grammar inconsistencies of the
  extractive method.
• The abstractive text summarization algorithms create new phrases
  and sentences that relay the most useful information from the
  original text — just like humans do.
• Therefore, abstraction performs better than extraction.
• However, the text summarization algorithms required to do
  abstraction are more difficult to develop; that’s why the use of
  extraction is still popular.
• Abstractive summary: Elizabeth was hospitalized after attending a
   party with Peter.
  How does a text summarization algorithm work?
• Usually, text summarization in NLP is treated as a
    supervised machine learning problem (where future
    outcomes are predicted based on provided data). Typically,
    the extraction-based approach to summarizing texts
    works as follows:
1. Introduce a method to extract the relevant key phrases from
the source document. For example, you can use part-of-speech
tagging, word sequences, or other linguistic patterns to
identify the key phrases.
2. Gather text documents with positively-labeled key phrases.
The key phrases should be compatible with the stipulated
extraction technique. To increase accuracy, you can also create
negatively-labeled key phrases.
3. Train a binary machine learning classifier to perform the text
summarization.
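• A compact sketch of these three steps in Python with scikit-learn (purely illustrative: the candidate phrases, features, and toy labels below are assumptions chosen to keep the example self-contained):

from sklearn.linear_model import LogisticRegression

def phrase_features(phrase, document):
    # step 1: simple features for a candidate key phrase
    words = document.lower().split()
    first_pos = words.index(phrase.split()[0]) / len(words)
    return [document.lower().count(phrase),   # frequency in the document
            len(phrase.split()),              # phrase length in words
            first_pos]                        # relative position of first occurrence

document = ("peter and elizabeth took a taxi to the party "
            "elizabeth collapsed at the party")

# step 2: positively and negatively labeled candidate key phrases
labeled = [("elizabeth", 1), ("party", 1), ("peter", 1),
           ("taxi", 0), ("took a", 0), ("and", 0)]

X = [phrase_features(p, document) for p, _ in labeled]
y = [label for _, label in labeled]

# step 3: a binary classifier decides which candidates belong in the summary
model = LogisticRegression().fit(X, y)
for candidate in ["elizabeth", "collapsed", "to the"]:
    print(candidate, model.predict([phrase_features(candidate, document)])[0])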
A general Text Summarization
           system
  A general text summarization system
• The system uses multiple documents in order to create an
      abstractive summary.
•     At first, a semantic graph is generated for every sentence in
      the documents by preprocessing each sentence.
•     Thereafter, the generated graph is reduced to a smaller
      graph, from which the abstractive summary is generated.
•     Heuristic rules are used to generate the abstractive
      summary.
•     The goal of the system is to condense the documents into a
      shorter version while preserving the important content.
  Text Preprocessing Module
• The preprocessing module accepts the input text and
  converts it into preprocessed sentences.
• It consists of four main processes: named entity recognition,
  morphological and syntactic analysis, co-reference
  resolution, and pronominal resolution.
  Text Preprocessing Module
• The named entity recognition process locates atomic
  elements into predefined categories such as person names,
  organizations, etc.
• In morphological analysis, each word is divided into
  morphemes and its grammatical categories are determined; the
  syntactic analysis parses the whole sentence to describe each
  word's syntactic function and build the parse tree; and typed
  dependencies express syntactic knowledge in terms of
  direct relationships between words.
• The co-reference and pronominal resolution
  processes identify co-referent named entities and
  resolve pronominal references in the whole input text.
• Co-reference is defined as the identification of surface terms
  (words within the document) that refer to the same entity.
  Rich Semantic Sub-graphs Generation Module
• The main objective of the Rich Semantic Graph Creation
  Phase is to represent the input documents semantically using
  Rich Semantic Graph (RSG).
• Unlike a traditional semantic graph, the Rich Semantic Graph
  is able to capture the meaning of words, sentences, and
  paragraphs.
• The Rich Semantic Sub-graphs Generation module is
  responsible for transforming each preprocessed sentence into a
  set of ranked rich semantic sub-graphs.
  Rich Semantic Sub-graphs Generation Module
• The main objective of the Rich Semantic Sub-graphs
  Generation module is to generate multiple rich semantic sub-
  graphs for each input preprocessed sentence.
• This module includes three processes: Word Senses
  Instantiation, Concepts Validation, and Semantic Sentences
  Ranking processes.
  Rich Semantic Sub-graphs Generation Module
• Word Senses Instantiation process: For each input
  preprocessed sentence, this process instantiates a set of word
  concepts for both noun and verb senses based on the
  domain ontology.
• Concept Validation Process: In this process, for each
  preprocessed sentence, the sentence concepts instantiated
  are interconnected and validated to generate multiple rich
  semantic sub-graphs.
  Rich Semantic Sub-graphs Generation Module
Sentences Ranking Process:
• It aims to rank and to threshold the highest ranked rich
   semantic sub-graphs for each sentence.
• To generate a single rich semantic graph and to keep the
   semantic consistency of the whole sentence, the process
   considers only the first-ranked rich semantic sub-graph.
• The ranking method is based on deriving the average weight
   of each concept (word sense). The weight of the word
   concept is derived according to its usage popularity
   (WordNet usage popularity).
  The Rich Semantic Graph Generation Module
• Finally, the Rich Semantic Graph Generation module is
  responsible for generating the final rich semantic graph of
  the whole input document from the highest-ranked rich
  semantic sub-graphs of the document sentences.
• The semantic sub-graphs of the input document will be
  merged to form the final rich semantic graph.
  The Rich Semantic Graph Reduction Phase
• This phase aims to reduce the generated rich semantic graph
  of the original document to a more reduced graph.
• In this phase, a set of heuristic rules are applied on the
  generated rich semantic graph to reduce it by merging,
  deleting, or consolidating the graph nodes.
  The Rich Semantic Graph Reduction Phase
Example of a rule
• Sentence1= [SN1, MV1, ON1]
• Sentence2= [SN2, MV2, ON2]
• Each sentence is composed of three nodes: Subject Noun
   (SN) node, Main verb (MV) node and Object Noun (ON)
   node.
  Rule 1.
•     IF SN1 is instance of noun N And
•     SN2 is instance of noun N And
•     MV1 is similar to MV2 And
•     ON1 is similar to ON2
•     THEN
•     Merge both MV1 and MV2 And
•     Merge both ON1 and ON2
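• A toy Python sketch of applying such a reduction rule (illustrative only; the similarity test below is a placeholder assumption for whatever word-sense similarity measure the system actually uses):

def similar(a, b):
    # placeholder similarity test; a real system would compare word senses/synonyms
    return a == b

def apply_rule1(sentence1, sentence2):
    # each sentence is a triple (SN, MV, ON): subject noun, main verb, object noun
    sn1, mv1, on1 = sentence1
    sn2, mv2, on2 = sentence2
    if sn1 == sn2 and similar(mv1, mv2) and similar(on1, on2):
        # the verb nodes and the object nodes are merged into one sentence node
        return (sn1, mv1, on1)
    return None    # conditions not met, nothing is merged

print(apply_rule1(("dog", "chase", "cat"), ("dog", "chase", "cat")))   # merged node
print(apply_rule1(("dog", "chase", "cat"), ("dog", "eat", "bone")))    # None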
  The Text Generation Phase
• The Rich Semantic Graph Generation module is responsible
  for generating a set of ranked RSGs for the input ranked
  semantic sub-graphs.
• This phase aims to generate the abstractive summary from
  the reduced Rich Semantic Graph (RSG).
• There are four modules namely the Text planning, the
  Sentence Planning, the Surface Realization, and the
  Evaluation modules.
• These modules are performed by processes arranged as a
  pipeline, so the output of each process is the input of the
  next one as shown in figure 4.
  The Text Generation Phase
1) The Text Planning module: It aims to select the appropriate
content material to be expressed in the final text. This phase
includes one process called “Content Determination”, which
decides what information should be included in the
generated text.
2) The Sentence Planning module: It specifies the sentence
boundaries, and generates and orders intermediate
paragraphs. The main objective of this phase is to improve the
fluency or understandability of the text.
  The Text Generation Phase
The sentence planning consists of four main processes:
1. Lexicalization Process: In this process, for each verb/noun
object, its synonyms are selected by accessing the WordNet
ontology to generate the target content.
2. Discourse Structuring Process: The main aim of this
process is to build a structure that contains the selected object
synonyms in the form of pseudo-sentences.
3. Aggregation Process: The main aim of this process is to
decide how pseudo-sentences should be combined into semi-
paragraphs.
4. Referring Expression Process: This process identifies and
replaces the intended referent by its appropriate pronoun.
  The Text Generation Phase
3) The Surface Realization module: This phase aims to
transform the enhanced semi-paragraphs into paragraphs by
correcting them grammatically (inflecting words for tense, etc.) and
adding the required punctuation (capitalization, semicolons,
etc.).
4) The Evaluation module: The main objective of this phase is
to evaluate and then rank the paragraphs according to two
factors: coherence between paragraph sentences and the
most frequently used paragraph word synonyms.
Text Categorization/
 Text Classification
  Text Categorization
• The goal in automatic text classification is to assign
  a document to a category by evaluating its text
  components.
• The general block diagram of a text categorization
  system is given below.
• The dataset is split into training and testing datasets
  for the classification process.
• Then the actual processing of the text data starts.
  Text Categorization [figure: general block diagram of a text categorization system]
  Text Preprocessing
• The main objective of pre-processing is to obtain the key
  features or key terms from stored text documents and to
  enhance the relevancy between word and document and the
  relevancy between word and category.
• Pre-processing step is crucial in determining the quality of
  the next stage, that is, the classification stage.
• It is important to select the significant keywords that carry
  the meaning and discard the words that do not contribute to
  distinguishing between the documents.
• The pre-processing phase converts the original
  textual data into a data-mining-ready structure.
      Text Preprocessing
• In general, text can be represented in two separate ways.
• The first is as a bag-of-words, in which a document is
    represented as a set of words, together with their associated
    frequency in the document.
•   The bag-of-words model is a simplifying representation used in
    natural language processing and information retrieval.
•   In this model, a text is represented as the bag (multiset) of its
    words, disregarding grammar and even word order but keeping
    multiplicity.
•   The bag-of-words model is commonly used in methods of
    document classification, where the (frequency of) occurrence of
    each word is used as a feature for training a classifier.
•   Such a representation is essentially independent of the sequence
    of words in the collection.
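•   A short bag-of-words example with scikit-learn (assuming scikit-learn is installed; CountVectorizer builds exactly this word-frequency representation):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Peter and Elizabeth took a taxi to the party",
        "Elizabeth collapsed and was rushed to the hospital"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # document-term frequency matrix

print(vectorizer.get_feature_names_out())    # the vocabulary (word order is lost)
print(X.toarray())                           # word counts per document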
  Text Preprocessing
• The second method is to represent text directly as
  strings, in which each document is a sequence of
  words.
• Most text and document data sets contain many
  unnecessary words such as stop words, misspelling,
  slang, etc.
• In many algorithms, especially statistical and
  probabilistic learning algorithms, noise and
  unnecessary features can have adverse effects on
  system performance.
• Thus, the following sections explain some
  techniques and methods for cleaning and pre-
  processing text data sets.
  1. Tokenization
• Tokenization is a pre-processing method which breaks a
  stream of text into words, phrases, symbols, or other
  meaningful elements called tokens.
• The main goal of this step is the investigation of the words in
  a sentence.
• Text classification requires a parser which processes the tokenization of the documents.
 Example :
• After sleeping for four hours, he decided to sleep for another
   four.
• In this case, the tokens are as follows:
• { “After” “sleeping” “for” “four” “hours” “he” “decided”
   “to” “sleep” “for” “another” “four” }.
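A minimal sketch of this step in Python, assuming a simple regular-expression tokenizer rather than a full parser:

```python
import re

def tokenize(text):
    # Keep runs of letters, digits and apostrophes; punctuation such as
    # the comma and the final period is discarded, matching the token
    # list in the example above.
    return re.findall(r"[A-Za-z0-9']+", text)

sentence = "After sleeping for four hours, he decided to sleep for another four."
print(tokenize(sentence))
# ['After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided',
#  'to', 'sleep', 'for', 'another', 'four']
```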
  2. Stop Word Removal
• Text and document classification data include many words that carry no significant meaning for the classification algorithms, such as {“a”, “about”, “above”, “across”, “after”, “afterwards”, “again”, . . .}.
• The most common technique to deal with these
  words is to remove them from the texts and
  documents.
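A small sketch of this technique; the tiny stop-word list below is only illustrative (real systems typically use a larger list, e.g. the one shipped with NLTK or scikit-learn):

```python
# Tiny illustrative stop-word list; a real system would use a larger one.
STOP_WORDS = {"a", "about", "above", "across", "after", "afterwards",
              "again", "for", "to", "he", "another"}

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop-word list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["After", "sleeping", "for", "four", "hours", "he", "decided",
          "to", "sleep", "for", "another", "four"]
print(remove_stop_words(tokens))
# ['sleeping', 'four', 'hours', 'decided', 'sleep', 'four']
```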
  3. Capitalization
• Text and document data points exhibit a diversity of capitalization across their sentences.
• Since documents consist of many sentences, diverse
  capitalization can be hugely problematic when classifying
  large documents.
• The most common approach for dealing with inconsistent
  capitalization is to reduce every letter to lower case.
• This technique projects all words in text and document into
  the same feature space, but it causes a significant problem
  for the interpretation of some words (e.g., “US” (United
  States of America) to “us” (pronoun)).
• Slang and abbreviation converters can help account for these
  exceptions.
  4. Slang and Abbreviation
• Slang and abbreviation are other forms of text
  anomalies that are handled in the pre-processing
  step.
• An abbreviation is a shortened form of a word or phrase, usually built from the first letters of the words, such as SVM, which stands for Support Vector Machine.
• Slang is a subset of the language used in informal talk or text that carries a meaning different from its literal one, such as “lost the plot”, which essentially means that someone has gone mad.
• A common method for dealing with these words is
  converting them into formal language.
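A small illustrative sketch of such a conversion using hypothetical lookup tables (a real system would rely on a much larger, curated dictionary):

```python
# Hypothetical lookup tables for illustration only.
ABBREVIATIONS = {"SVM": "Support Vector Machine",
                 "NLP": "Natural Language Processing"}
SLANG = {"lost the plot": "gone mad"}

def normalize(text):
    # Replace slang phrases first, then expand abbreviations token by token.
    for phrase, formal in SLANG.items():
        text = text.replace(phrase, formal)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())

print(normalize("He has lost the plot about the SVM results"))
# He has gone mad about the Support Vector Machine results
```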
  5. Noise Removal
• Most of the text and document data sets contain
  many unnecessary characters such as punctuation
  and special characters.
• Punctuation and special characters are important for human understanding of documents, but they can be detrimental to classification algorithms.
  6. Spelling Correction
• Spelling correction is an optional pre-processing
  step.
• Typos (short for typographical errors) are
  commonly present in texts and documents,
  especially in social media text data sets (e.g.,
  Twitter).
• Many algorithms, techniques, and methods have
  addressed this problem in NLP.
• Many techniques and methods are available for
  researchers including hashing-based and context-
  sensitive spelling correction techniques.
  7. Lemmatization
• Lemmatization is an NLP process that replaces the suffix of a word with a different one, or removes the suffix completely, to obtain the basic word form (lemma).
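A brief example using NLTK's WordNetLemmatizer, one possible tool for this step (it requires the WordNet data to be downloaded first):

```python
# Requires: pip install nltk, plus nltk.download("wordnet") on first use.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies", pos="n"))  # study
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("running", pos="v"))  # run
```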
  Feature Extraction
The common techniques of feature extraction are
• Term Frequency-Inverse Document Frequency (TF-
  IDF)
• Term Frequency (TF)
• Word2Vec
• Global Vectors for Word Representation (GloVe).
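As an illustration of TF-IDF feature extraction, here is a minimal sketch using scikit-learn's TfidfVectorizer (the tool choice and the toy documents are assumptions for illustration, not something prescribed by the slides):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was entertaining and worth the money",
    "the movie was boring and slow",
    "an entertaining and well directed movie",
]

vectorizer = TfidfVectorizer()       # tokenizes, lowercases, computes tf-idf
X = vectorizer.fit_transform(docs)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # vocabulary (scikit-learn >= 1.0)
print(X.shape)                             # (3, number_of_terms)
```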
Reference
• https://sci2lab.github.io/ml_tutorial/tfidf/
  Dimensionality Reduction
• Dimensionality reduction reduces the time and memory complexity of the text classification application.
• The most common techniques of dimensionality
  reduction include
o Principal Component Analysis (PCA)
o Linear Discriminant Analysis (LDA)
o Non-negative Matrix Factorization (NMF).
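A minimal sketch of applying PCA to a small TF-IDF matrix with scikit-learn, shown only as one possible way to realise this step (the toy documents are invented for illustration):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was entertaining and worth the money",
    "the movie was boring and slow",
    "an entertaining and well directed movie",
    "the plot was slow and boring",
]

X = TfidfVectorizer().fit_transform(docs).toarray()  # dense doc-term matrix
pca = PCA(n_components=2)                            # keep 2 components
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)                # (4, n_terms) -> (4, 2)
```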
  Classification Techniques
• In machine learning terminology, the classification problem
      comes under the supervised learning principle, where the
      system is trained and tested on the knowledge about classes
      before the actual classification process.
• Unsupervised learning is used when labeled data is not accessible; the process is more complicated and has performance issues, but it is suitable for big data.
•     Semi-supervised learning is followed when data is partly
      labeled and partly unlabeled.
    Classification Techniques
• Supervised learning is the most expensive and most difficult of the three.
• The main reason is that it requires the assignment of class labels to the training data.
• Here, the learning process can be simplified by prior assumptions.
• These assumptions about the data give rise to two approaches: parametric and non-parametric.
•    The model that could summarize data based on underlying
     parameters is called a parametric model.
• Logistic regression and Naïve Bayes algorithms are
     parametric classifiers whereas Support vector machines, k-
     nearest neighbor, rule induction, decision trees and neural
     networks are non-parametric classifiers.
    Naive Bayes Classifier
• These are probabilistic classifiers commonly used in ML.
• Bayesian classifiers are statistical in nature and also possess learning ability.
•    Multinomial models are used by Naïve Bayes for large
     datasets.
•    The performance could be enhanced by searching the
     dependencies among attributes.
•    It is mainly used in data pre-processing applications due to
     ease of computation.
•    Bayesian reasoning and probability inference are employed
     in predicting the target class.
•     Attributes play an important role in classification.
•     Therefore, assigning different weight values to attributes
     can potentially improve the performance.
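A hedged sketch of a Naïve Bayes text classifier built on TF-IDF features with scikit-learn; the training documents and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (documents and labels are hypothetical).
train_docs = ["great acting and a gripping story",
              "boring plot and terrible acting",
              "an entertaining and enjoyable film",
              "a dull and disappointing movie"]
train_labels = ["pos", "neg", "pos", "neg"]

# Multinomial Naive Bayes over TF-IDF features.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["an enjoyable and gripping film"]))  # likely ['pos']
```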
  Support Vector Machine
• The Support Vector Machine (SVM) algorithm is one of the
      supervised machine learning algorithms that is employed for
      various classification problems.
• SVMs are particularly suitable for high-dimensional data, for several reasons.
• Specifically, the complexity of the classifier depends on the number of support vectors instead of the data dimensionality, SVMs produce the same hyperplane for repeated training sets, and they have better generalization abilities.
•     SVMs also perform with the same accuracy even when the
      data is sparse.
  Decision trees
• Decision trees are highly comprehensible models when
      compared to neural nets.
• They work sequentially, testing each decision against a particular threshold value among the available values.
•     Testing happens according to certain logical rules similar to
      the concept of weights of neural networks.
•     C4.5 and CART are widely used decision tree techniques.
•     The tree growth phase partitions the training set and the
      pruning phase generalizes data over it.
  K-Nearest Neighbor
• K-Nearest Neighbor (k-NN) works on the principle of closest training samples: data points that are close to each other are assumed to belong to the same class. This is commonly called instance-based learning.
• Though it is robust to noisy data, deciding the value of k is complicated.
• Computational complexity further increases with increase in
  dimensionality.
    Artificial Neural Networks
• Artificial neural networks (ANNs) arrive at decisions in a way modelled on the human brain.
• They learn and evolve with minimal or no human intervention.
• For data classification, a competitive co-evolution-based neural network model has been suggested.
• The Radial Basis Function (RBF) network is a popular ANN component because it employs faster learning algorithms and has a compact network architecture that increases classification accuracy.
• Also, evolutionary algorithms tend to perform well in dynamic environments by learning rules on the fly and building highly adaptive models with ‘fuzzy’ characteristics.
  Centroid-based Classifier
• The centroid-based classification algorithm is very simple.
• For each set of documents belonging to the same class, we compute their centroid vector.
• If there are k classes in the training set, this leads to k centroid vectors (C1, C2, C3, ...), where each Ci is the centroid of the i-th class.
• The class of a new document x is determined as follows. First, the document frequencies of the various terms are computed from the training set.
• Then, the similarity between x and all k centroids is computed using the cosine measure.
• Finally, based on these similarities, x is assigned to the class corresponding to the most similar centroid.
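A minimal NumPy sketch of this procedure, using invented document vectors and class labels for illustration:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def train_centroids(X, y):
    # One centroid per class: the mean of that class's document vectors.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(x, centroids):
    # Assign x to the class whose centroid is most similar (cosine measure).
    return max(centroids, key=lambda c: cosine(x, centroids[c]))

# Invented 3-term document vectors for two classes.
X = np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0],   # class "sports"
              [0.0, 1.0, 0.8], [0.1, 0.9, 1.0]])  # class "politics"
y = np.array(["sports", "sports", "politics", "politics"])

centroids = train_centroids(X, y)
print(classify(np.array([0.8, 0.2, 0.1]), centroids))  # sports
```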
  Clustering
• Clustering is the task of finding groups of similar documents in a collection of documents.
• The similarity is computed using a similarity function.
  k-means Clustering
• k-means clustering is one of the partitioning algorithms widely used in data mining.
• In the context of text data, k-means partitions the n documents into k clusters, each built around a representative (centroid).
 The basic form of k-means algorithm is:
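Since the algorithm itself is not reproduced here, the following is a hedged plain-NumPy sketch of the basic k-means loop (random initialisation, assignment to the nearest centroid, centroid recomputation until convergence); the toy data are invented:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k documents at random as the initial cluster representatives.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign every document to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned documents.
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8]])
labels, _ = kmeans(X, k=2)
print(labels)  # two clusters, e.g. [0 0 1 1] (cluster ids may be swapped)
```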
  k-means Clustering
• Finding an optimal solution for k-means clustering is
  computationally difficult (NP-hard), however, there are
  efficient heuristics that are employed in order to converge
  rapidly to a local optimum.
• The main disadvantage of k-means clustering is that it is very sensitive to the initial choice of k.
• Thus some techniques are used to determine the initial k, e.g. by first running another lightweight clustering algorithm such as agglomerative clustering.
  Ontology Based Classification
• Traditional classification methods ignore relationships
  between words.
• But there exists a semantic relation between terms such as
  synonymy, hyponymy etc.
• Thus for better classification results these semantic relations
  need to be considered.
• An ontology stores words related to a particular domain, and this can be used for classification, e.g. by the Lingo algorithm.
  Lingo Algorithm
• The general idea behind LINGO is to first find
  meaningful descriptions of clusters, and then, based
  on the descriptions, determine their content.
• To assign documents to the already labeled groups
  LINGO could use the Latent Semantic Indexing
  (LSI) in the setting for which it was originally
  designed: given a query – retrieve the best matching
  documents.
• When a cluster label is fed to LSI as a query, the documents returned form the contents of that cluster.
  Lingo Algorithm
• This approach should take advantage of the LSI's ability to
  capture high-order semantic dependencies in the input
  collection.
• In this way not only would documents that contain the
  cluster label be retrieved, but also the documents in which
  the same concept is expressed without using the exact
  phrase.
• In web search results clustering, however, the effect of
  semantic retrieval is sharply diminished by the small size of
  the input web snippets.
• This, in turn, severely affects the precision of cluster content
  assignment.
Sentiment Analysis
  Sentiment Analysis
• Sentiment classification is a task under Sentiment Analysis
  (SA) that deals with automatically tagging text as positive,
  negative or neutral from the perspective of the speaker/writer
  with respect to a topic.
• Thus, a sentiment classifier tags the sentence ‘The movie is
  entertaining and totally worth your money!’ in a movie
  review as positive with respect to the movie.
• On the other hand, a sentence ‘The movie is so boring that I
  was dozing away through the second half.’ is labeled as
  negative.
• Finally, ‘The movie is directed by Nolan’ is labeled as
  neutral.
The general block diagram of a sentiment analysis system consists of the stages described below: data collection, data cleaning, feature extraction and selection, and classification.
  Data collection
• Data is collected from various social networking sites,
      blogging sites, and review sites
  Data Cleaning
1. Removal of URLs: Extracted data may contain URLs, which need to be removed as they do not carry any sentiment.
2. Case conversion: All the text should be converted to either upper case or lower case, i.e. there should be no difference between ‘paper’ and ‘PAPER’.
3. Removal of punctuation: Punctuation such as full stops, exclamation marks, commas, etc. should be removed as it does not represent any emotion.
4. Removal of hashtags: A hashtag word is preceded by a hash sign (#) and is generally used in social media to identify specific subjects.
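A small sketch of steps 1–4 using regular expressions (the example tweet is invented; hashtags are stripped before punctuation so the # marker is still present when the hashtag rule runs):

```python
import re
import string

def clean(text):
    text = re.sub(r"https?://\S+", "", text)   # 1. remove URLs
    text = text.lower()                        # 2. case conversion
    text = re.sub(r"#\w+", "", text)           # 4. remove hashtags
    # 3. remove remaining punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())              # collapse extra whitespace

tweet = "Loved this PAPER!!! #NLP details at https://example.com"
print(clean(tweet))
# loved this paper details at
```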
  Data Cleaning
5. Tokenization: It divides the given text into tokens.
6. Stemming: The Porter stemmer (M.F. Porter) is the most widely used algorithm for stemming words.
7. Negation rule: This method handles negation words, which reverse the meaning of a word in the review.
8. Conjunction rule: This method extracts meaning from the review using grammatical rules.
Feature Extraction and Selection
  Feature Extraction
• Some of the feature extraction techniques are:
1. Term presence and frequency: based on individual words or n-grams and their frequency counts.
2. Parts of Speech (POS): extracts adjectives and nouns from the data.
3. Opinion words and phrases: based on words which represent opinions, such as good or bad, like or hate, etc.
4. Negation: the appearance of negation words in text may reverse the meaning of an opinion; for example, “not good” is equivalent to “bad”.
  Feature Selection
• Feature Selection methods can be divided into lexicon-based
      methods that need human annotation, and statistical methods
      which are automatic methods that are more frequently used.
      Lexicon-based approaches usually begin with a small set of
      ‘seed’ words. Then they bootstrap this set through synonym
      detection or on-line resources to obtain a larger lexicon.
      Statistical approaches, on the other hand, are fully automatic.
• The feature selection techniques treat the documents either
      as group of words (Bag of Words (BOWs)), or as a string
      which retains the sequence of words in the document. BOW
      is used more often because of its simplicity for the
      classification process.
  Feature Selection
• Various feature selection methods are CountVectorizer, TF-IDF (Term Frequency–Inverse Document Frequency), IG (Information Gain), MI (Mutual Information), Feature Vector, Unigram, Bigram and N-gram methods.
• The TF-IDF score is taken into consideration to balance the most weighted and less weighted words.
• The Chi-square method gives good results for both positive and negative classes.
• Mutual Information, Chi-square, TF-IDF and Information Gain techniques are used to select features from high-dimensional data.
  Feature Selection
1. Count Vector: It is defined by the number of occurrences of a feature in the review.
2. TF-IDF: It is obtained by multiplying the frequency of the word in the review (TF) by the inverse document frequency of the word in the whole corpus (IDF):
   TF-IDF_i,j = tf_i,j * log(N / df_i)
   where TF-IDF_i,j is the weight of term i in sample j, tf_i,j is the frequency of term i in sample j, N is the total number of samples in the corpus, and df_i is the number of samples containing term i.
3. Information Gain: It is the most widely used attribute selection measure in the area of sentiment analysis. It determines the relevant features for predicting a review by studying the presence or absence of features in a document. Here P(c | f) is the conditional probability of class c given feature f, and P(c) denotes the marginal probability of class c.
  Feature Selection
4. Mutual Information: MI selects features that are not uniformly distributed across the sentiment classes, because such features are informative of their class; MI gives more importance to only a few terms. A commonly used form is
   MI(f, c) = log [ P(f, c) / ( P(f) P(c) ) ]
   where P(f, c) is the joint probability distribution of feature f and class c, and P(f) and P(c) are the marginal probability distributions of f and c; c ranges over the positive and negative classes.
5. Chi-square: Chi-square compares the observed count with the expected count and analyses how much deviation occurs between them. W, X, Y, Z represent frequencies of the presence or absence of the feature and the class in a sample: W is the count of samples in which feature f and class c occur together, X the count with f but not c, Y the count with c but not f, Z the count with neither, and N = W + X + Y + Z. The statistic is then
   chi-square(f, c) = N * (W*Z - X*Y)^2 / [ (W + X) * (W + Y) * (X + Z) * (Y + Z) ]
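A short worked sketch of these two measures for a single feature/class pair, using the standard 2x2-count forms given above (the counts are invented for illustration):

```python
import math

# Counts for one feature f and one class c over N training samples:
# W: f present and c,   X: f present, not c,
# Y: f absent and c,    Z: f absent, not c.
W, X, Y, Z = 40, 10, 20, 130
N = W + X + Y + Z

# (Pointwise) mutual information between feature presence and the class.
mi = math.log((W / N) / (((W + X) / N) * ((W + Y) / N)))

# Chi-square statistic for the same 2x2 contingency table.
chi2 = N * (W * Z - X * Y) ** 2 / ((W + X) * (W + Y) * (X + Z) * (Y + Z))

print(round(mi, 3), round(chi2, 3))
```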
  Classification
  Classification
• Sentiment Classification techniques can be roughly divided
  into machine learning approach, lexicon based approach and
  hybrid approach.
• The Machine Learning Approach (ML) applies the famous
  ML algorithms and uses linguistic features.
• The Lexicon-based Approach relies on a sentiment lexicon, a collection of known and precompiled sentiment terms.
• It is divided into dictionary-based approaches and corpus-
  based approaches which use statistical or semantic methods
  to find sentiment polarity.
• The hybrid Approach combines both approaches and is very
  common with sentiment lexicons playing a key role in the
  majority of methods.
  Classification
• The classification methods using ML approach can be
  roughly divided into supervised and unsupervised learning
  methods.
• The supervised methods make use of a large number of
  labeled training documents.
• The unsupervised methods are used when it is difficult to
      find these labeled training documents.
  Classification
• The lexicon-based approach depends on finding the opinion
   lexicon which is used to analyze the text.
There are two methods in this approach.
• The dictionary-based approach which depends on finding
   opinion seed words, and then searches the dictionary of their
   synonyms and antonyms.
• The corpus-based approach begins with a seed list of
   opinion words, and then finds other opinion words in a large
   corpus to help in finding opinion words with context specific
   orientations. This could be done by using statistical or
   semantic methods.
Named Entity Recognition
  Named Entity Recognition
• Entities are the who (and some of the what) of text analytics.
  On the most basic level, an entity in text is simply a proper
  noun such as a person, place, or product: John Coltrane,
  Coca Cola, and Indiana are all entities.
• Named Entity Recognition is a process where all the named
  entities which are the proper nouns are identified and
  classified into their predefined appropriate class. Named-
  entity recognition (NER) is a subtask of information
  extraction that seeks to locate and classify named entities in
  text into predefined categories such as the names of persons,
  organizations, locations, expressions of times, quantities,
  monetary values, percentages, etc. Thus it is the task of
  finding names such as organizations, persons, locations, etc.
  in text.
  Named Entity Recognition
• The example given below shows the named entities and their
  classes.
• Ram [PER] joined Symbiosis [ORG] in Pune [LOC] on 12th
  [DATE] Jan [MONTH] 2012 [YEAR] for a 3 [NUMB]
  course.
• NER is a two-stage problem: first, identification of the proper nouns, and second, classification of these proper nouns into their respective classes. The process of NER involves a few stages: pre-processing of text, data training, data testing, and lastly result evaluation. These steps are again broadly classified into pre-processing steps, feature extraction, NER algorithms and labeling.
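As an illustration, a hedged sketch using the spaCy library (one possible NER toolkit; the small English model must be installed separately, and its label set, e.g. PERSON, ORG, GPE, DATE, differs slightly from the tags used on the slide):

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ram joined Symbiosis in Pune on 12 Jan 2012 for a 3 year course.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical (model-dependent) output:
# Ram PERSON, Symbiosis ORG, Pune GPE, 12 Jan 2012 DATE
```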
  Named Entity Recognition
Step 1: Document collection
• Documents of varied formats such as .pdf, .html, .docx etc.
   from all sources will be collected. These documents will be
   inputs for the system.
Step 2: Pre-processing
• Data pre-processing describes any type of processing
   performed on raw data to prepare it for another processing
   procedure.
Step 2.1: Validation of input document
• Validation checks whether the given input text is in the language for which the system is implemented. It also checks whether the input is syntactically correct, but does not check semantic correctness.
  Named Entity Recognition
Step 2.2: Tokenization
• The aim of tokenization is the exploration of the words in a sentence, where every word, symbol and special character in the sentence is considered as a token.
Step 2.3: Stop word removal
• In stop word removal, words that occur very frequently and do not contribute much to the context and content are removed.
Step 2.4: Stemming
• Trimming or cutting off the extraneous parts of words down to their stem is called stemming. Here inflections are removed using stemming algorithms.
Step 2.5: Morphological analysis
• Morphological analysis is the procedure to find the root word. It is applied to recognize the inner structure of the word.
  Named Entity Recognition
Step 3: Data Training:
• This step is required to train the system. Training is done
   based on the feature extraction and the algorithm used. The
   output of this stage will be given to the testing stage.
Step 3.1: Feature extraction – In this process a small subset
from the sentence is extracted and then a feature set is applied to
the NER algorithms.
Step 3.2: NER algorithms – Various NER NLP algorithms
include rule based, machine learning and hybrid approaches.
  Named Entity Recognition
Step 4: Data Testing
Step 4.1: Feature extraction – This process is the same as
explained in the training data stage with the test data. The
extracted features are then tagged.
Step 4.2: Labeling (tagging) – In this process the entities are tagged using any of the algorithms.
Step 5: Result – The output of all the above stages then goes through the evaluation stage using evaluation parameters.
Step 6: Evaluation – The accuracy of NER can be measured using the Precision (P), Recall (R) and F1-measure metrics.
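A minimal sketch of how these metrics can be computed for NER output, treating predictions and gold annotations as sets of (entity, type) pairs (the example entities are invented):

```python
def evaluate(predicted, gold):
    # predicted / gold: sets of (entity_text, entity_type) pairs.
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Ram", "PER"), ("Symbiosis", "ORG"), ("Pune", "LOC")}
pred = {("Ram", "PER"), ("Pune", "ORG")}
print(evaluate(pred, gold))  # (0.5, 0.333..., 0.4)
```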
  Named Entity Recognition
• This process involves two main sub tasks firstly, identifying
  the proper nouns from the sequence of the text, secondly
  classifying these proper nouns into their predefined
  categories.
• The NER task can be performed using rule-based approaches, also called hand-crafted rules or the linguistic approach. In this approach the rules are written manually by the researchers for the system and for a particular language.
• Rule-based systems parse the source text and produce an intermediate representation which may be a parse tree or some abstract representation.
• Rule-based approaches are further classified into the list lookup approach and the linguistic approach.
  Named Entity Recognition
• List lookup approach: In the list lookup approach, a large corpus, called a bag of words, is built for all the named entities and their classes, and a list lookup is performed to identify named entities. This list is also called a gazetteer.
• Linguistic approach: In the linguistic approach one should have deep knowledge of the grammar of the specific language. This understanding and knowledge of the language lead to more accurate rules, so that named entities can be identified and classified easily.
      Machine Learning Approach
• In machine learning-based NER systems, the purpose is to convert the identification problem into a classification problem and to employ a statistical classification model to solve it.
• Machine learning approaches are also called corpus-based approaches. In this type of approach, the system looks for patterns and relationships in the text and builds a model using statistical models and machine learning algorithms.
• Based on this model, the system identifies and classifies nouns into particular classes such as persons, locations, times, etc., using machine learning algorithms.
• There are three types of machine learning models used for NER: supervised, semi-supervised and unsupervised models. Supervised learning utilizes only the labelled data to generate a model.
• Semi-supervised learning aims to combine both the labelled data and useful evidence from the unlabelled data in learning. Unsupervised learning is designed to learn without, or with very few, labelled data. These models are broadly classified further below.
Supervised Machine Learning Approach
• The supervised machine learning approach is also called the statistical approach. It has proved to be very effective.
• Statistical NER models usually treat recognising named entities as a sequence tagging problem, in which each word is tagged with its entity class if it belongs to one.
• The learning process is called supervised because human intervention is needed to train the system by providing labelled training examples from which the statistical model is constructed; good performance cannot be achieved without a large amount of training data.
• Different supervised machine learning approaches are as follows.
  Hidden Markov Model (HMM)
• It is a statistical language model that computes the likelihood
  of a sequence of words by employing a Markov chain, in
  which the likelihood of the next word is based on the current
  word.
• In this language model words are represented by states, NE
  classes are represented by regions and there is a probability
  associated with every transition from the current word to the
  next word.
• This model can predict the NE class of the next word if the current word and its NE class pair are given.
• It has a better capability of capturing the locality of phenomena, which indicates names in text.
Maximum Entropy Markov Model (MEMM)
• MEMM is also a statistical model and is very flexible. The output assignment for each word or token is based on its future (f), history (h) and features (g).
• Here an essential prerequisite is the selection of appropriate features. A maximum entropy solution to this, or any similar problem, allows the computation of p(f | h) for any f from the space of possible futures, F, and for every h from the space of possible histories, H.
• Since the outputs of the model are defined by the futures, the solution is to compute p(f | h) for any f from the space of possible futures, F, and for any h from the space of possible histories, H.
• Thus in NER, finding the probability of f for any token with respect to its index t can be formulated as p(f | h_t).
  Conditional Random Fields
• Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction.
• Whereas an ordinary classifier predicts labels or tags for entities without considering the context of neighbouring entities, a CRF takes the context into account; e.g., the linear-chain CRF predicts sequences of labels for sequences of input samples.
• It is a discriminative undirected probabilistic graphical model (random field) used to encode known relationships between observations and construct consistent interpretations.
Sample Questions
1. What are the different text preprocessing methods used in
   information retrieval?
2. What are the different approaches for Rule Based Machine
   Translation (RBMT)?
3. Explain different types of text summarization techniques.
4. Explain the different steps in text summarization.
5. Write short note on Question Answering System.
6. Explain Information retrieval system in detail.
7. Explain machine translation in detail.
8. What are the empirical machine translation systems?