The material in this presentation belongs to St. Francis Institute of Technology and is solely for educational purposes.
Distribution and modification of the content are prohibited.
Natural Language Processing
CDSC 7013
Subject In-charge: Ms. Pradnya Sawant
Assistant Professor
Room No. 405
Email: pradnyarane@sfit.ac.in
Module 6
Applications of NLP
  Contents
• Machine Translation
• Text Summarization
• Information Retrieval
• Question Answering Systems
• Sentiment Analysis
   Module 6
   Lecture 1
   • Machine translation
Machine Translation
  Approaches to Machine Translation
• Machine translation (MT) is the use of computers to automate some or all of the process of translating from one language to another.
• MT is classified into seven broad categories:
          •    rule-based
          •    knowledge-based
          •    principle-based
          •    statistical-based
          •    example-based
          •    hybrid-based
          •    online interactive based
  Approaches to Machine Translation
• The first three approaches (rule-based, knowledge-based, and principle-based) are the earliest and most widely used methods.
• Subsequently, most MT-related research has been based on the statistical and example-based approaches.
  Rule-based Approach
• It was the first MT strategy to be developed.
• A Rule-Based Machine Translation (RBMT) system
  consists of
          • a collection of rules, called grammar rules,
          • a bilingual or multilingual lexicon, and
          • software programs to process the rules.
  Rule-based Approach
• Building RBMT systems entails a huge human effort to code all of the linguistic resources, such as
           • source-side part-of-speech taggers and syntactic parsers
           • bilingual dictionaries
           • source-to-target transliteration rules
           • target-language morphological generators
           • structural transfer rules
           • reordering rules.
• Nevertheless, an RBMT system is always extensible and maintainable.
Rule Based Machine Translation (RBMT) System
• Given input sentences in some source language (SL), an RBMT system generates output sentences in some target language (TL) on the basis of morphological, syntactic, and semantic analysis of both the source and the target languages involved in a concrete translation task.
• It applies a set of linguistic rules in three different phases: analysis, transfer, and generation.
• It therefore requires syntax analysis, semantic analysis, syntax generation, and semantic generation.
    RBMT System
• RBMT generates the target text from a source text through the following steps.
  RBMT System
• The source language morphological analyzer analyzes a source language word and provides its morphological information.
• The source language parser is a syntax analyzer that analyzes source language sentences.
• The translator translates a source language word into the target language (word translation).
• The target language morphological analyzer works as a generator: it generates appropriate target language words for the given grammatical information.
• The target language parser works as a composer: it composes a suitable target language sentence.
RBMT System
• This type of MT system needs a minimum of three dictionaries:
1. Source Language Dictionary: used by the SL morphological analyzer.
2. Bilingual Dictionary: used by the translator to translate source language words into the target language.
3. Target Language Dictionary: used by the TL morphological generator to generate target language words.
1. Rule-based Approach
• Rules play a major role in various stages of
  translation, such as :
        • syntactic processing
        • semantic interpretation and
        • contextual processing of language.
• Generally, rules are written with linguistic
  knowledge gathered from linguists.
• Three different approaches under the RBMT category are
        • Direct Translation
        • Transfer-based MT and
        • Interlingua-based MT.
  1.1 Direct Translation
• In this method, the SL text is analyzed structurally only up to the morphological level, and the system is designed for a specific source-target language pair.
• The performance of a direct MT system depends on
     • the quality and quantity of the source-target language dictionaries
     • morphological analysis
     • text-processing software, and
     • word-by-word translation with minor grammatical adjustments to word order and morphology.
• After the words are translated, simple reordering rules are applied. Example: move adjectives after nouns when translating from English to French (a minimal sketch follows).
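A minimal Python sketch of the direct-translation idea under toy assumptions: the bilingual dictionary, the adjective/noun word lists, and the single adjective-after-noun reordering rule are all invented for illustration and are not part of any real system.

# Toy direct (word-by-word) English -> French translation with one reordering rule.
# The tiny dictionary and the adjective/noun lists below are illustrative only.

BILINGUAL_DICT = {
    "the": "le", "red": "rouge", "car": "voiture", "eats": "mange",
    "a": "une", "big": "grande", "apple": "pomme",
}
ADJECTIVES = {"red", "big"}   # hypothetical adjective list
NOUNS = {"car", "apple"}      # hypothetical noun list

def translate_direct(sentence: str) -> str:
    words = sentence.lower().split()
    # 1) Reordering rule: in French, most adjectives follow the noun,
    #    so swap adjacent (adjective, noun) pairs before lookup.
    i = 0
    while i < len(words) - 1:
        if words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2
        else:
            i += 1
    # 2) Word-by-word dictionary lookup (unknown words are copied through).
    return " ".join(BILINGUAL_DICT.get(w, w) for w in words)

print(translate_direct("the red car"))   # -> "le voiture rouge"
print(translate_direct("a big apple"))   # -> "une pomme grande"

Even this toy exposes the characteristic weakness of direct MT: beyond word order it makes no grammatical adjustments (e.g., gender agreement), so output such as "le voiture rouge" still needs morphological correction.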
    1.2 Transfer Based Translation
• On the basis of the structural differences between the source and target language, a transfer system can be broken down into three different stages:
           i) Analysis
           ii) Transfer and
           iii) Generation.
• Analysis: the source language (SL) parser is used to produce the syntactic representation of an SL sentence.
• Transfer: the result of the first stage is converted into an equivalent Target Language (TL)-oriented representation, i.e., the source-language parse tree is converted into a target-language parse tree.
• Generation: a TL morphological generator is used to produce the final TL text, i.e., the target-language parse tree is converted into an output sentence (a skeletal sketch of the three stages follows).
• Disadvantage of rule-based MT: it requires good dictionaries and manual crafting of rules.
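A skeletal Python sketch of the analysis-transfer-generation pipeline. The flat list of (word, tag) pairs standing in for a parse tree, the toy tagger, and the single structural transfer rule are simplifying assumptions made only to show how the three stages connect.

# Skeletal transfer-based MT pipeline: analysis -> transfer -> generation.
# Real systems use full syntactic trees; a tagged word list is used here for brevity.

BILINGUAL_DICT = {"the": "le", "red": "rouge", "car": "voiture", "stops": "s'arrête"}

def analyze(sentence):
    """Analysis: toy POS tagging of the SL sentence."""
    tags = {"the": "DET", "red": "ADJ", "car": "NOUN", "stops": "VERB"}
    return [(w, tags.get(w, "X")) for w in sentence.lower().split()]

def transfer(sl_tree):
    """Transfer: map SL lexical items to TL and apply a structural rule
    (the adjective is placed after the noun, as in French)."""
    tl_tree = [(BILINGUAL_DICT.get(w, w), t) for w, t in sl_tree]
    out = []
    for word, tag in tl_tree:
        if out and tag == "NOUN" and out[-1][1] == "ADJ":
            out.insert(-1, (word, tag))   # noun moves before the adjective
        else:
            out.append((word, tag))
    return out

def generate(tl_tree):
    """Generation: linearize the TL representation into a sentence."""
    return " ".join(w for w, _ in tl_tree)

print(generate(transfer(analyze("the red car stops"))))  # -> "le voiture rouge s'arrête"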
  1.3 Interlingua Based Translation
• In Interlingua approach, the translation is performed by first
    representing the SL (Source Language) text into an
    intermediary (semantic) form called Interlingua.
• The idea is to represent all sentences that mean the “same”
    thing in the same way, regardless of the language they happen
    to be in.
• The advantage of this approach is that Interlingua is a language
    independent representation from which translations can be
    generated to different TLs.
• Thus, the translation consists of two stages:
1. The SL text is first converted into the Interlingua (IL) representation.
2. The IL representation is translated into the TL (Target Language).
  1.3 Interlingua Based Translation
• The main advantage of the Interlingua approach is that the analyzer and parser for the SL are independent of the generator for the TL.
• Advantage of Interlingua: it is economical when translation among multiple languages is involved; for n languages only n analyzers and n generators (2n components) are needed, instead of roughly n(n-1) direct transfer systems.
• Disadvantages:
• What would a language-independent representation look like? i.e., it is difficult to define the interlingua.
• Interlingua does not take advantage of similarities between closely related languages; for example, Tamil and Telugu are siblings of the same (Dravidian) family.
   Module 6
   Lecture 2
   • Machine translation
2. Knowledge-based MT (KBMT)
• The emphasis is on a functionally complete understanding of the source text prior to translation into the target text.
• KBMT is implemented on the Interlingua architecture.
• KBMT must be supported by world knowledge and by linguistic semantic knowledge about the meanings of words and their combinations.
2. Knowledge-based MT (KBMT)
• Once the SL text is analyzed, it is run through the augmenter, a knowledge base that converts the source representation into an appropriate target representation before it is synthesized into the target sentence.
  • KBMT systems provide high quality translations.
  • They are quite expensive to produce due to the
    large amount of knowledge needed to accurately
    represent sentences in different languages.
  • E.g. The English-Vietnamese MT system
  3. Principle Based MT (PBMT)
• PBMT systems employ parsing methods based on the Principles and Parameters Theory of Chomsky's Generative Grammar.
 •     The parser generates a detailed syntactic structure that
       contains lexical, phrasal, grammatical, and thematic
       information.
 •     It also focuses on robustness, language-neutral
       representations, and deep linguistic analysis.
 •     In the PBMT, the grammar is thought of as a set of
       language-independent, interactive well-formed principles
       and a set of language-dependent parameters.
 •     Thus, for a system that uses n languages, one must have n
       parameter modules and a principles module.
   3. Principle Based MT (PBMT)
• It is well-suited for use with the interlingual architecture.
• PBMT parsing methods differ from the rule-based approaches.
• Although efficient in many circumstances, they have the
  drawback of language-dependence and increase exponentially
  in rules if one is using a multilingual translation system.
• They provide broad coverage of many linguistic phenomena,
  but lack the deep knowledge about the translation domain that
  KBMT and EBMT systems employ.
• PBMT is not an efficient method for language pairs that follow very different principles.
• E.g. UNITRAN, a principle-based "universal translator" system.
  4. Empirical MT (EMT) Systems
• EMT systems rely on large parallel, aligned corpora.
• Empirical systems acquire knowledge about the translation process automatically, in the form of rules induced from a collection of translation examples.
• They use automatically induced rules.
• Two categories: Statistical and Example-Based.
  4.1 Statistical-based Approach
• Translation is based on the knowledge and statistical
  models extracted from bilingual corpora.
• Bilingual or multilingual textual corpora of the source and target language(s) are required.
• A supervised or unsupervised ML algorithm is used to build statistical tables from the corpora.
• The statistical tables capture the characteristics of well-formed sentences and the correlations between the languages.
  4.1 Statistical-based Approach
• During translation, the collected statistical information is used
  to find the best translation for the input sentences, and this
  translation step is called the decoding process.
• There are three different statistical approaches to MT:
     • Word-based Translation
     • Phrase-based Translation
     • Hierarchical Phrase-based model.
• The idea behind SMT comes from information theory.
• A document is translated according to the probability distribution p(e|f), i.e., the probability of translating a sentence f in the SL (F) into a sentence e in the TL (E).
  4.1 Statistical-based Approach
• The problem of modeling the probability distribution p(e|f) has been approached in a number of ways.
• One intuitive approach is to apply Bayes' theorem: p(e|f) = p(f|e) p(e) / p(f).
• Since p(f) is fixed for a given input sentence, the best translation is the e that maximizes p(f|e) p(e), where p(f|e) is the translation model and p(e) is the language model.
• The translation model p(f|e) is the probability that the source sentence is a translation of the target sentence, i.e., it models how sentences in E get converted into sentences in F.
• The language model p(e) is the probability of seeing that TL string, i.e., it models which sentences are likely in language E (a toy example follows).
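A toy illustration of how the two models combine in the decision rule argmax over e of p(f|e) p(e): the candidate translations and all probability values below are invented purely for illustration.

# Toy noisy-channel scoring: pick the target sentence e maximizing p(f|e) * p(e).
# All candidates and probabilities are made up for this example.

f = "la maison bleue"                       # source (French) sentence

candidates = ["the blue house", "the house blue", "blue the house"]

translation_model = {                        # p(f | e): adequacy
    "the blue house": 0.20,
    "the house blue": 0.25,                  # word-for-word, so slightly "closer" to f
    "blue the house": 0.25,
}
language_model = {                           # p(e): fluency in the target language
    "the blue house": 0.010,
    "the house blue": 0.001,
    "blue the house": 0.0001,
}

def score(e: str) -> float:
    return translation_model[e] * language_model[e]

best = max(candidates, key=score)
print(best, score(best))                     # -> "the blue house" 0.002

Note how the language model prefers the fluent word order even though the translation model alone slightly favors the literal candidates.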
  4.1 Statistical-based Approach
• This decomposition is attractive because it splits the problem into two sub-problems.
• Finding the best translation ê is done by picking the candidate that gives the highest probability, as in the equation below:
      ê = argmax_e p(f|e) p(e)
4.1.1 Word-based Translation
• Here, the words in an input sentence are translated word by word individually, and these words are finally arranged in a specific way to get the target sentence.
• This approach was the very first attempt at statistical MT.
• It is comparatively simple and efficient.
• Disadvantage: translating words in isolation ignores context, which reduces the quality of the translation.
4.1.2 Phrase-based Translation
• It is a more accurate SMT approach.
• Here, each sentence is divided into separate phrases instead of individual words (a toy phrase-table lookup is sketched below).
• The alignment between the phrases in the input and output sentences normally follows certain patterns.
• It resulted in better performance than word-based translation.
• However, it did not improve the modeling of sentence-order patterns.
• The reordering technique may perform well with local phrase orders, but not as well with long sentences and complex orderings.
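A minimal sketch of phrase-based lookup: the sentence is segmented greedily into the longest phrases found in a toy phrase table (the phrase pairs are illustrative assumptions), so multi-word expressions such as idioms can be translated as units.

# Toy phrase-based translation: greedy longest-match segmentation against a
# small phrase table. The phrase pairs below are illustrative only.

PHRASE_TABLE = {
    ("kick", "the", "bucket"): ["casser", "sa", "pipe"],   # idiom handled as one phrase
    ("the", "bucket"): ["le", "seau"],
    ("he", "will"): ["il", "va"],
    ("kick",): ["frapper"],
    ("he",): ["il"], ("will",): ["va"], ("the",): ["le"], ("bucket",): ["seau"],
}
MAX_PHRASE_LEN = 3

def translate_phrases(sentence: str) -> str:
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest source phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            src = tuple(words[i:i + n])
            if src in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[src])
                i += n
                break
        else:
            out.append(words[i])   # unknown word: copy through
            i += 1
    return " ".join(out)

print(translate_phrases("he will kick the bucket"))
# -> "il va casser sa pipe"  (the idiom is translated as a unit, unlike word-by-word MT)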
4.1.3 Hierarchical Phrase-based Model
• The advantage of this approach is that hierarchical phrases
  have recursive structures instead of simple phrases.
• This higher level of abstraction further improved the accuracy of SMT systems.
   Module 6
   Lecture 3
   • Machine translation
   • Information Retrieval
  4.2 Example Based MT (EBMT)
• It is based on analogical reasoning between two translation
  examples.
• It relies on large parallel aligned corpora.
• An EBMT system is given a set of sentences in the SL and
  their corresponding translations in the TL, and uses those
  examples to translate other, similar SL sentences into the
  TL.
• The basic logic is that, if a previously translated sentence
  occurs again, the same translation is likely to be correct
  again.
• EBMT systems are attractive in that they require a minimum
  of prior knowledge; therefore, they quickly adapt to many
  language pairs.
  4.2 Example Based MT
       • A restricted form of example-based translation is
         available commercially, known as a translation memory.
       • In a translation memory, as the user translates text, the
         translations are added to a database, and when the same
         sentence occurs again, the previous translation is inserted
         into the translated document.
• This saves the user the effort of re-translating that sentence.
• More advanced translation memory systems will also return close but inexact (fuzzy) matches, on the assumption that editing the translation of a close match takes less time than generating a translation from scratch (a minimal sketch follows).
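A minimal translation-memory sketch, assuming a small in-memory store of previously translated sentence pairs; Python's difflib similarity ratio stands in for a real fuzzy-match score.

# Minimal translation memory: exact matches reuse the stored translation,
# near (fuzzy) matches are returned for the translator to post-edit.
import difflib

memory = {
    "The printer is out of paper.": "L'imprimante n'a plus de papier.",
    "Turn off the device before cleaning.": "Éteignez l'appareil avant le nettoyage.",
}

def lookup(sentence: str, threshold: float = 0.8):
    if sentence in memory:                       # exact match: reuse translation
        return memory[sentence], 1.0
    close = difflib.get_close_matches(sentence, memory.keys(), n=1, cutoff=threshold)
    if close:                                    # fuzzy match: propose for post-editing
        return memory[close[0]], difflib.SequenceMatcher(None, sentence, close[0]).ratio()
    return None, 0.0                             # no match: translate from scratch

print(lookup("The printer is out of paper."))   # exact match
print(lookup("The printer is out of ink."))     # close but inexact match

The second lookup has no exact match, so the memory proposes the stored translation of the closest sentence for the translator to post-edit.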
  4.2 Example Based MT
• E.g.
    • ALEPH: a company incorporated in Qatar with an international network of professional linguists offering a variety of translation services.
    • wEBMT: a machine translation system using the World Wide Web.
    • PanEBMT: developed at Carnegie Mellon University.
  5. Online Interactive MT
• In this interactive translation system, the user is allowed to suggest the correct translation to the translator online.
• This approach is very useful where the context of a word is unclear and there exist many possible meanings for a particular word.
• In such cases, the ambiguity can be resolved with the user's interpretation.
  6. Hybrid MT
• It takes advantage of both statistical and rule-based translation methodologies.
• It is more efficient than either approach used alone.
• These systems are based on both rules and statistics.
• The hybrid approach can be applied in different ways:
      • Translation is performed in a first stage using a rule-based approach, followed by adjusting or correcting the output using statistical information.
      • Alternatively, rules are used to pre-process the input data as well as to post-process the output of a statistical translation system. This technique offers more power, flexibility, and control in translation.
   6. Hybrid MT
• Example:
• METIS-II MT system is an example of hybridization
  which avoids the usual need for parallel corpora by
  using a bilingual dictionary and a monolingual corpus in
  the TL.
• Oepen MT System: It integrates statistical methods
  within an RBMT system to choose the best translation
  from a set of competing hypotheses (translations)
  generated using rule-based methods.
Information Retrieval
  Information Retrieval
  • IR deals with the representation, storage, and
    access of information and is concerned with the
    organization and retrieval of information from
    large database collections.
 Information Retrieval System
  Information Retrieval
  • Here, the user issues a query q from the front-end application
    (accessible via, e.g., a Web browser)
  • q is processed by a query interaction module that transforms
    it into a “machine-readable” query q’ to be fed into the core
    of the system, a search and query analysis module.
  • This is the part of the IR system having access to the content
    management module directly linked with the back-end
    information source (e.g., a database).
• Once a set of results r is made ready by the search module, it is returned to the user via the result interaction module; optionally, the result is modified (into r′) or updated until the user is completely satisfied.
  Information Retrieval
  • The most widespread applications of IR are the ones
    dealing with textual data.
  • As textual IR deals with document sources and questions,
    both expressed in natural language, a number of textual
    operations take place “on top” of the classic retrieval steps.
• The textual operations typically performed on a query by an IR engine are shown in the figure.
  Information Retrieval : Text
  Information Retrieval : Text
  1. The user need is specified via the user interface, in
  the form of a textual query qU (typically made of
  keywords).
  2. The query qU is parsed and transformed by a set of
  textual operations; the same operations have been
  previously applied to the contents indexed by the IR
  system. This step yields a refined query q’U .
  3. Query operations further transform the preprocessed
  query into a system-level representation, qS .
  Information Retrieval : Text
  4. The query qS is executed on top of a document
  source D (e.g., a text database) to retrieve a set of
  relevant documents, R.
  Fast query processing is made possible by the index
  structure previously built from the documents in the
  document source.
  5. The set of retrieved documents R is then ordered:
  documents are ranked according to the estimated
  relevance with respect to the user’s need.
6. The user then examines the set of ranked documents for useful information; they might pinpoint a subset of the documents as definitely of interest and thus provide feedback to the system (a minimal end-to-end sketch of steps 2-5 follows).
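A minimal end-to-end sketch of steps 2-5 above, under toy assumptions: three tiny "documents", a hand-picked stop-word list, and ranking by summed term frequency stand in for a real document source, textual operations, and ranking model.

# Minimal ranked retrieval over an inverted index.
from collections import defaultdict

DOCS = {
    "d1": "information retrieval deals with the storage and retrieval of documents",
    "d2": "machine translation converts text from a source language to a target language",
    "d3": "question answering systems retrieve answers from large document collections",
}
STOP_WORDS = {"the", "and", "of", "with", "a", "to", "from"}

def preprocess(text):
    # Textual operations: tokenization + stop-word removal (applied to documents and queries alike).
    return [t for t in text.lower().split() if t not in STOP_WORDS]

# Index construction: term -> {doc_id: term frequency}
index = defaultdict(lambda: defaultdict(int))
for doc_id, text in DOCS.items():
    for term in preprocess(text):
        index[term][doc_id] += 1

def search(query):
    q_terms = preprocess(query)              # same operations applied to the query
    scores = defaultdict(int)
    for term in q_terms:
        for doc_id, tf in index[term].items():
            scores[doc_id] += tf             # rank by summed term frequency
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("retrieval of documents"))      # -> d1 ranked first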
  Text Processing in IR
    • Not all words are equally effective for the
      representation of a document’s semantics.
• Nouns (single words or noun-phrase groups) are the most representative components of a document in terms of content.
    • Based on this observation, IR system also
      preprocesses the text of the documents to determine
      the most “important” terms to be used as index
      terms: a subset of the words is therefore selected to
      represent the content of a document.
  Text Processing in IR
     • When selecting candidate keywords, indexing
       must fulfill two different and potentially opposite
       goals:
     • one is exhaustiveness, i.e., assigning a sufficiently
       large number of terms to a document, and
     • the other is specificity, i.e., the exclusion of
       generic terms that carry little semantics and inflate
       the index.
     • Generic terms (conjunctions and prepositions) are
       characterized by a low discriminative power as
       their frequency across any document in the
       collection tends to be high.
  Text Processing in IR
• In other words, generic terms have high term
  frequency, defined as the number of occurrences of
  the term in a document.
• In contrast, specific terms have higher discriminative
  power, due to their rare occurrences across collection
  documents: they have low document frequency,
  defined as the number of documents in a collection in
  which a term occurs.
  Textual Operations
• The textual preprocessing phase typically performed by an IR engine takes a document as input and yields its index terms as output.
• The process of this extraction is shown in the figure.
  1. Document Parsing.
 • Documents come in all sorts of languages, character
       sets, and formats and the same document may
       contain multiple languages or formats.
 •     e.g. A French email with Portuguese PDF
       attachments.
 •     Document parsing deals with the recognition and
       “breaking down” of the document structure into
       individual components.
 •     In this preprocessing phase, unit documents are
       created;
 •     e.g., emails with attachments are split into one
       document representing the email and as many
       documents as there are attachments.
  2. Lexical Analysis.
    • After parsing, lexical analysis tokenizes a
      document, seen as an input stream, into words.
    • Issues related to lexical analysis include the
      correct identification of accents, abbreviations,
      dates, and cases.
    • The difficulty of this operation depends much on
      the language at hand.
  3. Stop-Word Removal
• A subsequent step optionally applied to the results of
  lexical analysis is stop-word removal, i.e., the removal
  of high-frequency words.
• The subsequent phases take the full-text structure
  derived from the initial phases of parsing and lexical
  analysis and process it in order to identify relevant
  keywords to serve as index terms.
  4. Phrase Detection
• This step aims to capture meaning beyond individual words by identifying multi-word phrases.
• Phrase detection may be approached in several ways,
  including
    • rules
    • morphological analysis
    • syntactic analysis, and combinations thereof.
• A common approach to phrase detection relies on the
  use of thesauri
• Thesauri usually contain synonyms and antonyms.
4. Phrase Detection
• Thesauri may be composed following different approaches.
• Human-made thesauri:
    • They are generally hierarchical, containing related terms, usage examples, and special cases.
    • Another format is the associative one, in which graphs are derived from underlying synonym sets (synsets), as in WordNet.
• An alternative to the consultation of thesauri is to use
  machine learning techniques.
• The Keyphrase Extraction Algorithm (KEA) identifies candidate key-phrases using lexical methods, calculates feature values for each candidate, and uses a supervised ML algorithm to predict which candidates are good key-phrases, based on a corpus of previously annotated documents (a toy candidate-phrase extractor is sketched below).
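A toy candidate-phrase extractor (not KEA itself): noun phrases matching a simple POS pattern are proposed as candidates and ranked by frequency. It assumes NLTK is installed with its tokenizer and POS-tagger data downloaded; the chunk grammar and example text are illustrative.

# Candidate key-phrase detection via a POS-pattern chunker.
from collections import Counter
import nltk

GRAMMAR = "NP: {<JJ>*<NN.*>+}"          # optional adjectives followed by one or more nouns
chunker = nltk.RegexpParser(GRAMMAR)

def candidate_phrases(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    phrases = [" ".join(w for w, _ in st.leaves())
               for st in tree.subtrees(filter=lambda t: t.label() == "NP")]
    return Counter(p.lower() for p in phrases)

text = ("Information retrieval systems index large document collections. "
        "Document collections are preprocessed before indexing.")
print(candidate_phrases(text).most_common(3))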
  5. Stemming and Lemmatization
• Stemming and lemmatization aim at stripping word suffixes in order to normalize a word to a base form (see the sketch below).
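A small illustration using NLTK's Porter stemmer and WordNet lemmatizer (the WordNet data must be downloaded); the word list is arbitrary. Stems need not be valid words, whereas lemmas are dictionary forms.

# Stemming vs. lemmatization on a few inflected forms.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "connections", "better"]:
    print(word,
          "| stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word))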
  6. Weighting
• The final phase of text preprocessing deals with term
  weighting.
• The words in a text have different descriptive power;
  hence, index terms can be weighted differently to
  account for their significance within a document
  and/or a document collection.
• Such a weighting can be binary, e.g., assigning 0 for term absence and 1 for presence; richer schemes combine the term frequency and document frequency introduced earlier, as in tf-idf (sketched below).
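A minimal weighting sketch over three toy documents, contrasting binary presence/absence with a tf-idf style weight built from the term frequency and document frequency defined earlier; the particular formula used (tf × log(N/df)) is one common variant among several.

# Index-term weighting: binary vs. tf-idf over a toy collection.
import math
from collections import Counter

docs = {
    "d1": "machine translation translation system",
    "d2": "information retrieval system",
    "d3": "question answering system",
}
tokenized = {d: text.split() for d, text in docs.items()}
N = len(docs)

def binary_weight(term, doc_id):
    return 1 if term in tokenized[doc_id] else 0

def tf_idf(term, doc_id):
    tf = Counter(tokenized[doc_id])[term]                        # term frequency in the document
    df = sum(1 for toks in tokenized.values() if term in toks)   # document frequency
    idf = math.log(N / df) if df else 0.0                        # rare terms get higher weight
    return tf * idf

for term in ["system", "translation"]:
    print(term, "binary:", binary_weight(term, "d1"), "tf-idf:", round(tf_idf(term, "d1"), 3))

Here a term that occurs in every document (such as "system") gets a tf-idf weight of zero, reflecting its low discriminative power, while a rarer term gets a higher weight.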
Question Answering System
          (QAS)
  Question Answering System (QAS)
• QASs attempt to answer questions asked by users in
  natural languages after retrieving and processing
  information from different data sources.
• The format of answers may also vary from simple
  text to multimedia.
General Architecture of a QAS
  Question Answering System (QAS)
A typical Question Answering System consists of
three main modules:
• Question Analysis
• Answer Retrieval
• Answer Generation.
  Question Answering System (QAS)
The question analysis module:
• takes natural language questions as input,
• identifies what the question is asking for (a location, a date, a
  person's name, etc.), and
• is responsible for analyzing the question completely.
• Its aim is to understand the purpose and meaning of the question,
  so the question has to be analyzed in several different ways.
  Morpho-syntactic analysis of the question
• First, carry out the morpho-syntactic analysis of the
  words in the question.
• This is done by POS tagging.
• After POS tagging, find out the questioning
  information (what the question is looking for).
• A question class helps the system to classify the
  question type and provide a suitable answer.
  Question Analysis
• To get the meaning of the question, we need to
  classify its semantic type.
• Question classification assigns the question to
  pre-defined semantic categories, each of which leads to a
  different processing strategy.
  Question Classification
• The question classification process generates possible
  question classes.
• For example, a question can ask for a date, a time, a
  location, or a person.
• E.g. the question “Who was the first American in
  space?” expects a person's name in the answer.
• This greatly reduces the search space of reasonable
  answers.
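• A minimal rule-based question classifier in Python (an illustrative sketch; real systems typically learn these classes from annotated questions, and the pattern list below is an assumption):

import re

# illustrative mapping from surface patterns to expected answer types
RULES = [
    (r"^who\b",                 "PERSON"),
    (r"^where\b",               "LOCATION"),
    (r"^when\b|\bwhat year\b",  "DATE"),
    (r"^how (many|much)\b",     "QUANTITY"),
]

def classify(question):
    q = question.lower().strip()
    for pattern, answer_type in RULES:
        if re.search(pattern, q):
            return answer_type
    return "OTHER"                      # fallback when no pattern matches

print(classify("Who was the first American in space?"))   # PERSON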
  Question Analysis
• Once the question type is recognized, question
  analysis needs to recognize further constraints that the
  expected answer must meet.
• This process can be as simple as extracting keywords from
  the question and using them to find candidate answer sentences.
• These keywords may then be extended by using
  morphological variants and/or synonym replacements, or by
  query expansion techniques.
• Together they form the representation of the question.
  Answer Retrieval
• It involves the following steps:
    • Document Retrieval
    • Document Processing
    • Syntactic Analysis
    • Semantic Analysis
    • Relation Identification
  Document Retrieval
• This module selects a set of relevant documents
  from a domain specific repository.
• Conceptual indexing is used for the retrieval
  process, since keyword-based indexing ignores
  the semantic content of the document collection.
• Both the documents and queries can be mapped into
  concepts and these concepts are used as a
  conceptual indexing space for identifying and
  extracting documents.
  Document Processing
• The retrieved documents are processed to extract the
  candidate answer set.
• This module is responsible for selecting the
  response based on the relevant fragments of the
  documents.
  Syntactic Analysis
• The documents are analyzed syntactically using
  NLP techniques such as POS tagging and NER.
• First, the documents are tokenized into a set of
  sentences. Then POS tagging and NER are
  performed.
• Shallow parsing is performed to identify the phrasal
  chunks.
• The chunks identified in the question analysis
  module are matched with those identified in the
  document and relevant sentences are retrieved.
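• An illustrative sketch of this step with NLTK (assuming the standard tokenizer, tagger, and chunker models have been downloaded; the toy noun-phrase grammar is an assumption):

import nltk

text = "Alan Shepard was the first American in space."

for sent in nltk.sent_tokenize(text):                  # sentence tokenization
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))    # POS tagging
    entities = nltk.ne_chunk(tagged)                   # named entity recognition
    chunks = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}").parse(tagged)  # shallow NP chunking
    print(entities)
    print(chunks)
# the NP chunks found here would be matched against the chunks extracted
# from the question, and the sentences containing matches are retrieved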
  Semantic Analysis
• Shallow parsing can be performed for finding the
  semantic phrases or clauses.
• Semantic roles are identified and mapped to
  semantic frames.
• The sentences whose semantic frames map exactly
  to the semantic frames of the question are also
  extracted.
  Relation Identification
• The base ontology (the relationships between the entities)
  is populated with domain knowledge
  incrementally as we go through different sets of
  documents.
• In this way, valid knowledge of any
  specialized discipline can be incorporated into the
  system.
• The relations among different concepts are
  identified using the domain knowledge, and the
  ontological information is obtained.
  Answer Generation
• The candidate answer set is filtered and the answers
  are generated.
• The user is supplied with a set of short and specific
  answers, ranked according to their relevance.
• The different stages are:
  - Filtering
  - Answer Ranking
  - Answer Generation
  Filtering
• The extracted sentences are filtered and the
  candidate answer set is produced.
• This is done by incorporating the information
  obtained from the question classification and
  document processing modules.
• The identified focus and frames are matched to get
  the candidate set.
  Answer Ranking
• The answer set is ranked based on the semantic similarity.
• Simple template matching is not adopted since it neglects the
    semantic content and domain knowledge.
• Answers are ranked based on the similarity between the
    question frame and the answer frame.
Example: The event E “John gave a balloon to the kid.” has the
role pattern AGENT give THEME to RECIPIENT, and its semantic
frame is identified as
• has_possession(start(E), Agent, Theme)
• has_possession(end(E), Recipient, Theme)
• transfer(during(E), Theme)
A candidate answer is kept when this frame matches the question
frame exactly.
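• A toy Python illustration of ranking answers by frame overlap (a simplification of the idea; the frame representation and the Jaccard similarity measure are assumptions, not the system's actual method):

# a semantic frame represented as a set of predicate strings
question_frame = {"has_possession(start(E), Agent, Theme)",
                  "has_possession(end(E), Recipient, Theme)",
                  "transfer(during(E), Theme)"}

candidate_frames = {
    "John gave a balloon to the kid.": {"has_possession(start(E), Agent, Theme)",
                                        "has_possession(end(E), Recipient, Theme)",
                                        "transfer(during(E), Theme)"},
    "John saw a balloon.":             {"perceive(during(E), Agent, Theme)"},
}

def similarity(frame_a, frame_b):
    # Jaccard overlap between the two predicate sets
    return len(frame_a & frame_b) / len(frame_a | frame_b)

for sentence, frame in sorted(candidate_frames.items(),
                              key=lambda kv: similarity(question_frame, kv[1]),
                              reverse=True):
    print(round(similarity(question_frame, frame), 2), sentence)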
  Answer Generation
• From the answer set, specific answers have to be
  generated in case the direct answers are not
  available.
• Hidden relations can be identified from the domain
  knowledge gathered from the ontology.
• Concept of natural language generation can also be
  utilized for this purpose.
  Classification of Question Answering System
• The large number of available QASs can be classified
   according to eight criteria.
These criteria are:
1. Application domains for which QASs are developed
2. Types of questions asked by the users
3. Types of analyses performed on users’ questions and source
documents
4. Types of data consulted in data sources
5. Characteristics of data sources
6. Types of representations used for questions and their matching
functions
7. Types of techniques used for retrieving answers
8. Forms of answers generated by QASs
  1. Classification based on application domain
• The task of generating answers of questions is
  related to the type of questions asked.
• Some users may require general information on a
  general topic
• Some may require specific information from a
  particular application domain.
• Therefore, selection of the domain as a basis of
  classification of QASs may be a natural choice.
  General domain (Open Domain) QASs
• In this category, QASs answer domain-independent
  questions.
• They search for answers within a large document
  collection.
• There is a large repository of questions that can be
  asked.
• QASs exploit general ontology and world
  knowledge for generating answers.
• Here, the quality of answers delivered is not high,
  and questions are asked by casual users.
  Pros of general domain QASs
• There are a large number of casual users; general
  domain QASs are more suitable for them.
• General domain QASs use a general dictionary.
• Users don’t need to acquire knowledge of domain
  specific keywords for formulating questions.
• There is a large repository of questions.
• Wikipedia or news text can be utilized as a source
  of information for such QASs
  Cons of general domain QASs
• The quality of answers is low.
• Whether the answers are satisfactory depends upon the users.
• Domain experts require specialized information in
  answers and hence restricted domain QASs may be
  more suitable for them.
    Restricted Domain (Closed Domain) QASs
• These QASs answer domain-specific questions.
• Answers are searched within domain specific document
  collections.
• Repository of question patterns is very limited; hence the
  systems can achieve good accuracy in answering
  questions.
• QASs exploit domain specific ontology and terminology.
• The quality of answers is expected to be high.
• Various restricted domain QASs have been developed:
  temporal domain QAS, geospatial domain QAS, medical
  domain QAS, patent QAS, community-based QAS, etc.
  Restricted domain (Closed Domain) QASs
• Different restricted domain QASs can be integrated
  to make General domain QASs .
• Such an integrated QAS must assign each question to an
  appropriate domain-specific QAS, based on the
  knowledge derived from the question's keywords.
• It faces problems in handling and forwarding the
  given questions to a particular restricted domain
  QAS, since systems suffer from question classification
  problems, ambiguity resolution problems, etc.
  Pros of restricted domain QASs
• Restricted domain QASs suit domain-expert
  users, as they need specialized answers.
• The quality of answers generated by restricted
  domain QASs is high.
• The level of satisfaction of the users depends on
  their domain knowledge.
  Cons of restricted domain QASs
• There is a limited repository of domain specific
  questions; such QASs can answer a limited number
  of questions.
  2. Classification based on Types of Questions
The different categories are
1. Factoid type questions
2. List type questions
3. Hypothetical type questions
4. Confirmation questions
5. Causal questions.
  Factoid type questions [what, when, which, who,
  how]
• These questions are simple and fact based that
  require answers in a single short phrase or sentence.
• The factoid type questions generally start with a wh-
  word.
• Current QASs have got a satisfactory performance
  in answering factoid type questions.
  List type questions
• The list questions require a list of entities or facts in the
  answer, e.g., list the names of employees getting a
  salary of more than 5K.
• QASs treat such questions as a series of factoid
  questions which are asked repeatedly, one after the
  other.
• The previous answers are ignored while the next
  questions are fired by QASs.
• QASs generally observe a problem in fixing the
  threshold value for the number or quantity of the
  entity asked in list type questions.
  Hypothetical type questions
• Hypothetical questions ask for information related
  to any hypothetical event.
• They generally begin with ‘what would happen if’.
  QASs require knowledge retrieval techniques for
  generating answers.
• Moreover, the answers to these questions are
  subjective.
• There are no specific correct answers to these
  questions.
  Confirmation questions
• Confirmation questions require answers in the form
  of yes or no.
• Systems require an inference mechanism, world
  knowledge and common-sense reasoning to
  generate answers.
  Causal questions [how or why]
• Causal questions require explanations about an
  entity.
• The answers are not named entities as observed in
  the case of factoid type questions.
• QASs require advanced natural language processing
  techniques to analyze the text at pragmatic and
  discourse level for generating answers.
Text Summarization
  Text Summarization
• Text summarization refers to the technique of shortening
  long pieces of text.
• The intention is to create a coherent and fluent summary
  having only the main points outlined in the document.
• There are two main approaches to text summarization in NLP:
    • Extraction-based summarization
    • Abstraction-based summarization
  Extraction-based Summarization
• The extractive text summarization technique involves
  pulling key phrases from the source document and
  combining them to make a summary.
• The extraction is made according to the defined metric
  without making any changes to the texts.
• Here is an example:
• Source text: Peter and Elizabeth took a taxi to attend the
  night party in the city. While in the party, Elizabeth
  collapsed and was rushed to the hospital.
• Extractive summary: Peter and Elizabeth attend party city.
  Elizabeth rushed hospital.
• As you can see above, the important words have been
  extracted and joined to create a summary — although
  sometimes the summary can be grammatically strange.
    Abstraction-based Summarization
• The abstraction technique entails paraphrasing and shortening parts
  of the source document.
• When abstraction is applied for text summarization in deep learning
  problems, it can overcome the grammar inconsistencies of the
  extractive method.
• The abstractive text summarization algorithms create new phrases
  and sentences that relay the most useful information from the
  original text — just like humans do.
• Therefore, abstraction performs better than extraction.
• However, the text summarization algorithms required to do
  abstraction are more difficult to develop; that’s why the use of
  extraction is still popular.
• Abstractive summary: Elizabeth was hospitalized after attending a
   party with Peter.
  How does a text summarization algorithm work?
• Usually, text summarization in NLP is treated as a
    supervised machine learning problem (where future
    outcomes are predicted based on provided data). Typically,
    the extraction-based approach to summarizing texts
    works as follows:
1. Introduce a method to extract the relevant key phrases from
the source document. For example, you can use part-of-speech
tagging, word sequences, or other linguistic patterns to
identify the key phrases.
2. Gather text documents with positively-labeled key phrases.
The key phrases should be compatible with the stipulated
extraction technique. To increase accuracy, you can also create
negatively-labeled key phrases.
3. Train a binary machine learning classifier to perform the text
summarization.
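• A compact sketch of these three steps in Python with scikit-learn (purely illustrative: the candidate phrases, features, and toy labels below are assumptions chosen to keep the example self-contained):

from sklearn.linear_model import LogisticRegression

def phrase_features(phrase, document):
    # step 1: simple features for a candidate key phrase
    words = document.lower().split()
    first_pos = words.index(phrase.split()[0]) / len(words)
    return [document.lower().count(phrase),   # frequency in the document
            len(phrase.split()),              # phrase length in words
            first_pos]                        # relative position of first occurrence

document = ("peter and elizabeth took a taxi to the party "
            "elizabeth collapsed at the party")

# step 2: positively and negatively labeled candidate key phrases
labeled = [("elizabeth", 1), ("party", 1), ("peter", 1),
           ("taxi", 0), ("took a", 0), ("and", 0)]

X = [phrase_features(p, document) for p, _ in labeled]
y = [label for _, label in labeled]

# step 3: a binary classifier decides which candidates belong in the summary
model = LogisticRegression().fit(X, y)
for candidate in ["elizabeth", "collapsed", "to the"]:
    print(candidate, model.predict([phrase_features(candidate, document)])[0])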
A general Text Summarization
           system
  A general text summarization system
• The system uses multiple documents in order to create an
      abstractive summary.
•     At first, a semantic graph is generated for every sentence in
      the documents by preprocessing each sentence.
•     Thereafter, the generated graph is reduced to a smaller
      graph, from which the abstractive summary is generated.
•     Heuristic rules are used to generate the abstractive
      summary.
•     The goal of the system is to condense the documents into a
      shorter version while preserving the important content.
  Text Preprocessing Module
• The preprocessing module accepts the input text and
  converts it into preprocessed sentences.
• It consists of four main processes: named entity recognition,
  morphological and syntactic analysis, co-reference
  resolution, and pronominal resolution.
  Text Preprocessing Module
• The named entity recognition process locates atomic
  elements into predefined categories such as person names,
  organizations, etc.
• In morphological analysis, each word is divided into
  morphemes and its grammatical categories are determined; the
  syntactic analysis parses the whole sentence to describe each
  word's syntactic function and build the parse tree; and typed
  dependencies express syntactic knowledge in terms of
  direct relationships between words.
• The co-reference and pronominal resolution
  processes identify co-referent named entities and
  resolve pronominal references in the whole input text.
• Co-reference is defined as the identification of surface terms
  (words within the document) that refer to the same entity.
  Rich Semantic Sub-graphs Generation Module
• The main objective of the Rich Semantic Graph Creation
  Phase is to represent the input documents semantically using
  Rich Semantic Graph (RSG).
• Unlike a traditional semantic graph, the Rich Semantic Graph
  is able to capture the meaning of words, sentences, and
  paragraphs.
• The Rich Semantic Sub-graphs Generation module is
  responsible for transforming each preprocessed sentence into a
  set of ranked rich semantic sub-graphs.
  Rich Semantic Sub-graphs Generation Module
• The main objective of the Rich Semantic Sub-graphs
  Generation module is to generate multiple rich semantic sub-
  graphs for each input preprocessed sentence.
• This module includes three processes: Word Senses
  Instantiation, Concepts Validation, and Semantic Sentences
  Ranking processes.
  Rich Semantic Sub-graphs Generation Module
• Word Senses Instantiation process: For each input
  preprocessed sentence, this process instantiates a set of word
  concepts for both noun and verb senses based on the
  domain ontology.
• Concept Validation Process: In this process, for each
  preprocessed sentence, the sentence concepts instantiated
  are interconnected and validated to generate multiple rich
  semantic sub-graphs.
  Rich Semantic Sub-graphs Generation Module
Sentences Ranking Process:
• It aims to rank and to threshold the highest ranked rich
   semantic sub-graphs for each sentence.
• To generate a single rich semantic graph and to keep the
   semantic consistency of the whole sentence, the process
   considers only the first-ranked rich semantic sub-graph.
• The ranking method is based on deriving the average weight
   of each concept (word sense). The weight of the word
   concept is derived according to its usage popularity
   (WordNet usage popularity).
  The Rich Semantic Graph Generation Module
• Finally, the Rich Semantic Graph Generation module is
  responsible for generating the final rich semantic graph of
  the whole input document from the highest-ranked rich
  semantic sub-graphs of the document sentences.
• The semantic sub-graphs of the input document will be
  merged to form the final rich semantic graph.
  The Rich Semantic Graph Reduction Phase
• This phase aims to reduce the generated rich semantic graph
  of the original document to a more reduced graph.
• In this phase, a set of heuristic rules are applied on the
  generated rich semantic graph to reduce it by merging,
  deleting, or consolidating the graph nodes.
  The Rich Semantic Graph Reduction Phase
Example of a rule
• Sentence1= [SN1, MV1, ON1]
• Sentence2= [SN2, MV2, ON2]
• Each sentence is composed of three nodes: Subject Noun
   (SN) node, Main verb (MV) node and Object Noun (ON)
   node.
  Rule 1.
•     IF SN1 is instance of noun N And
•     SN2 is instance of noun N And
•     MV1 is similar to MV2 And
•     ON1 is similar to ON2
•     THEN
•     Merge both MV1 and MV2 And
•     Merge both ON1 and ON2
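• A toy Python sketch of applying such a reduction rule (illustrative only; the similarity test below is a placeholder assumption for whatever word-sense similarity measure the system actually uses):

def similar(a, b):
    # placeholder similarity test; a real system would compare word senses/synonyms
    return a == b

def apply_rule1(sentence1, sentence2):
    # each sentence is a triple (SN, MV, ON): subject noun, main verb, object noun
    sn1, mv1, on1 = sentence1
    sn2, mv2, on2 = sentence2
    if sn1 == sn2 and similar(mv1, mv2) and similar(on1, on2):
        # the verb nodes and the object nodes are merged into one sentence node
        return (sn1, mv1, on1)
    return None    # conditions not met, nothing is merged

print(apply_rule1(("dog", "chase", "cat"), ("dog", "chase", "cat")))   # merged node
print(apply_rule1(("dog", "chase", "cat"), ("dog", "eat", "bone")))    # None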
  The Text Generation Phase
• The Rich Semantic Graph Generation module is responsible
  for generating a set of ranked RSGs for the input ranked
  semantic sub-graphs.
• This phase aims to generate the abstractive summary from
  the reduced Rich Semantic Graph (RSG).
• There are four modules namely the Text planning, the
  Sentence Planning, the Surface Realization, and the
  Evaluation modules.
• These modules are performed by processes arranged as a
  pipeline, so the output of each process is the input of the
  next one as shown in figure 4.
  The Text Generation Phase
1) The Text Planning module: It aims to select the appropriate
content material to be expressed in the final text. This phase
includes one process called “Content Determination”, which
decides what information should be included in the
generated text.
2) The Sentence Planning module: It specifies the sentence
boundaries, and generates and orders intermediate
paragraphs. The main objective of this phase is to improve the
fluency or understandability of the text.
  The Text Generation Phase
The sentence planning consists of four main processes:
1. Lexicalization Process: In this process, for each verb/noun
object, its synonyms are selected by accessing the WordNet
ontology to generate the target content.
2. Discourse Structuring Process: The main aim of this
process is to build a structure that contains the selected object
synonyms in the form of pseudo-sentences.
3. Aggregation Process: The main aim of this process is to
decide how pseudo-sentences should be combined into semi-
paragraphs.
4. Referring Expression Process: This process identifies and
replaces the intended referent by its appropriate pronoun.
  The Text Generation Phase
3) The Surface Realization module: This phase aims to
transform the enhanced semi-paragraphs into paragraphs by
correcting them grammatically (inflecting words for tense, etc.) and
adding the required punctuation (capitalization, semicolons,
etc.).
4) The Evaluation module: The main objective of this phase is
to evaluate and then rank the paragraphs according to two
factors: coherence between paragraph sentences and the
most frequently used paragraph word synonyms.
Text Categorization/
 Text Classification
  Text Categorization
• The goal in automatic text classification is to assign
  a document to a category by evaluating its text
  components.
• The general block diagram of a text categorization
  system is given below.
• The dataset is split into training and testing datasets
  for the classification process.
• Then the actual processing of the text data starts.
  Text Categorization [figure: general block diagram of a text categorization system]
  Text Preprocessing
• The main objective of pre-processing is to obtain the key
  features or key terms from stored text documents and to
  enhance the relevancy between word and document and the
  relevancy between word and category.
• Pre-processing step is crucial in determining the quality of
  the next stage, that is, the classification stage.
• It is important to select the significant keywords that carry
  the meaning and discard the words that do not contribute to
  distinguishing between the documents.
• The pre-processing phase converts the original
  textual data into a data-mining-ready structure.
      Text Preprocessing
• In general, text can be represented in two separate ways.
• The first is as a bag-of-words, in which a document is
    represented as a set of words, together with their associated
    frequency in the document.
•   The bag-of-words model is a simplifying representation used in
    natural language processing and information retrieval.
•   In this model, a text is represented as the bag (multiset) of its
    words, disregarding grammar and even word order but keeping
    multiplicity.
•   The bag-of-words model is commonly used in methods of
    document classification, where the (frequency of) occurrence of
    each word is used as a feature for training a classifier.
•   Such a representation is essentially independent of the sequence
    of words in the collection.
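•   A short bag-of-words example with scikit-learn (assuming scikit-learn is installed; CountVectorizer builds exactly this word-frequency representation):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Peter and Elizabeth took a taxi to the party",
        "Elizabeth collapsed and was rushed to the hospital"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # document-term frequency matrix

print(vectorizer.get_feature_names_out())    # the vocabulary (word order is lost)
print(X.toarray())                           # word counts per document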
  Text Preprocessing
• The second method is to represent text directly as
  strings, in which each document is a sequence of
  words.
• Most text and document data sets contain many
  unnecessary words such as stop words, misspelling,
  slang, etc.
• In many algorithms, especially statistical and
  probabilistic learning algorithms, noise and
  unnecessary features can have adverse effects on
  system performance.
• Thus, the following sections explain some
  techniques and methods for cleaning and pre-
  processing text data sets.
  1. Tokenization
• Tokenization is a pre-processing method which breaks a
  stream of text into words, phrases, symbols, or other
  meaningful elements called tokens.
• The main goal of this step is the investigation of the words in
  a sentence.
• Text classification requires a parser which processes the tokenization of the documents.
 Example :
• After sleeping for four hours, he decided to sleep for another
   four.
• In this case, the tokens are as follows:
• { “After” “sleeping” “for” “four” “hours” “he” “decided”
   “to” “sleep” “for” “another” “four” }.
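A minimal sketch of this step in Python, assuming a simple regular-expression tokenizer rather than a full parser:

```python
import re

def tokenize(text):
    # Keep runs of letters, digits and apostrophes; punctuation such as
    # the comma and the final period is discarded, matching the token
    # list in the example above.
    return re.findall(r"[A-Za-z0-9']+", text)

sentence = "After sleeping for four hours, he decided to sleep for another four."
print(tokenize(sentence))
# ['After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided',
#  'to', 'sleep', 'for', 'another', 'four']
```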
  2. Stop Word Removal
• Text and document classification data include many words that carry no significant meaning for the classification algorithms, such as {“a”, “about”, “above”, “across”, “after”, “afterwards”, “again”, . . .}.
• The most common technique to deal with these
  words is to remove them from the texts and
  documents.
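A small sketch of this technique; the tiny stop-word list below is only illustrative (real systems typically use a larger list, e.g. the one shipped with NLTK or scikit-learn):

```python
# Tiny illustrative stop-word list; a real system would use a larger one.
STOP_WORDS = {"a", "about", "above", "across", "after", "afterwards",
              "again", "for", "to", "he", "another"}

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop-word list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["After", "sleeping", "for", "four", "hours", "he", "decided",
          "to", "sleep", "for", "another", "four"]
print(remove_stop_words(tokens))
# ['sleeping', 'four', 'hours', 'decided', 'sleep', 'four']
```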
  3. Capitalization
• Text and document data points exhibit a diversity of capitalization across their sentences.
• Since documents consist of many sentences, diverse
  capitalization can be hugely problematic when classifying
  large documents.
• The most common approach for dealing with inconsistent
  capitalization is to reduce every letter to lower case.
• This technique projects all words in text and document into
  the same feature space, but it causes a significant problem
  for the interpretation of some words (e.g., “US” (United
  States of America) to “us” (pronoun)).
• Slang and abbreviation converters can help account for these
  exceptions.
  4. Slang and Abbreviation
• Slang and abbreviation are other forms of text
  anomalies that are handled in the pre-processing
  step.
• An abbreviation is a shortened form of a word or phrase, usually built from the first letters of the words, such as SVM, which stands for Support Vector Machine.
• Slang is a subset of the language used in informal talk or text that carries a meaning different from its literal one, such as “lost the plot”, which essentially means that someone has gone mad.
• A common method for dealing with these words is
  converting them into formal language.
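A small illustrative sketch of such a conversion using hypothetical lookup tables (a real system would rely on a much larger, curated dictionary):

```python
# Hypothetical lookup tables for illustration only.
ABBREVIATIONS = {"SVM": "Support Vector Machine",
                 "NLP": "Natural Language Processing"}
SLANG = {"lost the plot": "gone mad"}

def normalize(text):
    # Replace slang phrases first, then expand abbreviations token by token.
    for phrase, formal in SLANG.items():
        text = text.replace(phrase, formal)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())

print(normalize("He has lost the plot about the SVM results"))
# He has gone mad about the Support Vector Machine results
```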
  5. Noise Removal
• Most of the text and document data sets contain
  many unnecessary characters such as punctuation
  and special characters.
• Punctuation and special characters are important for human understanding of documents, but they can be detrimental to classification algorithms.
  6. Spelling Correction
• Spelling correction is an optional pre-processing
  step.
• Typos (short for typographical errors) are
  commonly present in texts and documents,
  especially in social media text data sets (e.g.,
  Twitter).
• Many algorithms, techniques, and methods have
  addressed this problem in NLP.
• Many techniques and methods are available for
  researchers including hashing-based and context-
  sensitive spelling correction techniques.
  7. Lemmatization
• Lemmatization is an NLP process that replaces the suffix of a word with a different one, or removes the suffix completely, to obtain the basic word form (lemma).
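A brief example using NLTK's WordNetLemmatizer, one possible tool for this step (it requires the WordNet data to be downloaded first):

```python
# Requires: pip install nltk, plus nltk.download("wordnet") on first use.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies", pos="n"))  # study
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("running", pos="v"))  # run
```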
  Feature Extraction
The common techniques of feature extraction are
• Term Frequency-Inverse Document Frequency (TF-
  IDF)
• Term Frequency (TF)
• Word2Vec
• Global Vectors for Word Representation (GloVe).
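As an illustration of TF-IDF feature extraction, here is a minimal sketch using scikit-learn's TfidfVectorizer (the tool choice and the toy documents are assumptions for illustration, not something prescribed by the slides):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was entertaining and worth the money",
    "the movie was boring and slow",
    "an entertaining and well directed movie",
]

vectorizer = TfidfVectorizer()       # tokenizes, lowercases, computes tf-idf
X = vectorizer.fit_transform(docs)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # vocabulary (scikit-learn >= 1.0)
print(X.shape)                             # (3, number_of_terms)
```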
Reference
• https://sci2lab.github.io/ml_tutorial/tfidf/
  Dimensionality Reduction
• Dimensionality reduction reduces the time and memory complexity of the text classification application.
• The most common techniques of dimensionality
  reduction include
o Principal Component Analysis (PCA)
o Linear Discriminant Analysis (LDA)
o Non-negative Matrix Factorization (NMF).
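A minimal sketch of applying PCA to a small TF-IDF matrix with scikit-learn, shown only as one possible way to realise this step (the toy documents are invented for illustration):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was entertaining and worth the money",
    "the movie was boring and slow",
    "an entertaining and well directed movie",
    "the plot was slow and boring",
]

X = TfidfVectorizer().fit_transform(docs).toarray()  # dense doc-term matrix
pca = PCA(n_components=2)                            # keep 2 components
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)                # (4, n_terms) -> (4, 2)
```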
  Classification Techniques
• In machine learning terminology, the classification problem
      comes under the supervised learning principle, where the
      system is trained and tested on the knowledge about classes
      before the actual classification process.
• Unsupervised learning is used when labeled data is not accessible; the process is more complicated and has performance issues, but it is suitable for big data.
•     Semi-supervised learning is followed when data is partly
      labeled and partly unlabeled.
    Classification Techniques
• Supervised learning is the most expensive and most difficult of the three.
• The main reason is that it requires the assignment of class labels to the training data.
• Here, the learning process can be simplified by prior assumptions.
• These assumptions about the data give rise to two approaches: parametric and non-parametric.
•    The model that could summarize data based on underlying
     parameters is called a parametric model.
• Logistic regression and Naïve Bayes algorithms are
     parametric classifiers whereas Support vector machines, k-
     nearest neighbor, rule induction, decision trees and neural
     networks are non-parametric classifiers.
    Naive Bayes Classifier
• These are probabilistic classifiers commonly used in ML.
• Bayesian classifiers are statistical in nature and also possess learning ability.
•    Multinomial models are used by Naïve Bayes for large
     datasets.
•    The performance could be enhanced by searching the
     dependencies among attributes.
•    It is mainly used in data pre-processing applications due to
     ease of computation.
•    Bayesian reasoning and probability inference are employed
     in predicting the target class.
•     Attributes play an important role in classification.
•     Therefore, assigning different weight values to attributes
     can potentially improve the performance.
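A hedged sketch of a Naïve Bayes text classifier built on TF-IDF features with scikit-learn; the training documents and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (documents and labels are hypothetical).
train_docs = ["great acting and a gripping story",
              "boring plot and terrible acting",
              "an entertaining and enjoyable film",
              "a dull and disappointing movie"]
train_labels = ["pos", "neg", "pos", "neg"]

# Multinomial Naive Bayes over TF-IDF features.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["an enjoyable and gripping film"]))  # likely ['pos']
```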
  Support Vector Machine
• The Support Vector Machine (SVM) algorithm is one of the
      supervised machine learning algorithms that is employed for
      various classification problems.
• SVMs are particularly suitable for high-dimensional data, for several reasons.
• Specifically, the complexity of the classifier depends on the number of support vectors instead of the data dimensionality, SVMs produce the same hyperplane for repeated training sets, and they have better generalization abilities.
•     SVMs also perform with the same accuracy even when the
      data is sparse.
  Decision trees
• Decision trees are highly comprehensible models when
      compared to neural nets.
• They work sequentially, testing each decision against a particular threshold value among the available values.
•     Testing happens according to certain logical rules similar to
      the concept of weights of neural networks.
•     C4.5 and CART are widely used decision tree techniques.
•     The tree growth phase partitions the training set and the
      pruning phase generalizes data over it.
  K-Nearest Neighbor
• K-Nearest Neighbor (k-NN) works on the principle of closest training samples: data points that are close to each other are assumed to belong to the same class. This is commonly called instance-based learning.
• Though it is robust to noisy data, deciding the value of k is complicated.
• Computational complexity further increases with increase in
  dimensionality.
    Artificial Neural Networks
• Artificial neural networks (ANNs) arrive at decisions in a way modelled on the human brain.
• They learn and evolve with minimal or no human intervention.
• For data classification, a competitive co-evolution-based neural network model has been suggested.
• The Radial Basis Function (RBF) network is a popular ANN component because it employs faster learning algorithms and has a compact network architecture that increases classification accuracy.
• Also, evolutionary algorithms tend to perform well in dynamic environments by learning rules on the fly and building highly adaptive models with ‘fuzzy’ characteristics.
  Centroid-based Classifier
• The centroid-based classification algorithm is very simple.
• For each set of documents belonging to the same class, we compute their centroid vector.
• If there are k classes in the training set, this leads to k centroid vectors (C1, C2, C3, ...), where each Ci is the centroid of the i-th class.
• The class of a new document x is determined as follows. First, the document frequencies of the various terms are computed from the training set.
• Then, the similarity between x and all k centroids is computed using the cosine measure.
• Finally, based on these similarities, x is assigned to the class corresponding to the most similar centroid.
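A minimal NumPy sketch of this procedure, using invented document vectors and class labels for illustration:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def train_centroids(X, y):
    # One centroid per class: the mean of that class's document vectors.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(x, centroids):
    # Assign x to the class whose centroid is most similar (cosine measure).
    return max(centroids, key=lambda c: cosine(x, centroids[c]))

# Invented 3-term document vectors for two classes.
X = np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0],   # class "sports"
              [0.0, 1.0, 0.8], [0.1, 0.9, 1.0]])  # class "politics"
y = np.array(["sports", "sports", "politics", "politics"])

centroids = train_centroids(X, y)
print(classify(np.array([0.8, 0.2, 0.1]), centroids))  # sports
```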
  Clustering
• Clustering is the task of finding groups of similar documents in a collection of documents.
• The similarity is computed using a similarity function.
  k-means Clustering
• k-means clustering is one of the partitioning algorithms widely used in data mining.
• In the context of text data, k-means partitions the n documents into k clusters, each built around a representative (centroid).
 The basic form of k-means algorithm is:
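Since the algorithm itself is not reproduced here, the following is a hedged plain-NumPy sketch of the basic k-means loop (random initialisation, assignment to the nearest centroid, centroid recomputation until convergence); the toy data are invented:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k documents at random as the initial cluster representatives.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign every document to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned documents.
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8]])
labels, _ = kmeans(X, k=2)
print(labels)  # two clusters, e.g. [0 0 1 1] (cluster ids may be swapped)
```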
  k-means Clustering
• Finding an optimal solution for k-means clustering is
  computationally difficult (NP-hard), however, there are
  efficient heuristics that are employed in order to converge
  rapidly to a local optimum.
• The main disadvantage of k-means clustering is that it is very sensitive to the initial choice of k.
• Thus some techniques are used to determine the initial k, e.g. by first running another lightweight clustering algorithm such as agglomerative clustering.
  Ontology Based Classification
• Traditional classification methods ignore relationships
  between words.
• But there exists a semantic relation between terms such as
  synonymy, hyponymy etc.
• Thus for better classification results these semantic relations
  need to be considered.
• An ontology stores words related to a particular domain, and this can be used for classification, e.g. by the Lingo algorithm.
  Lingo Algorithm
• The general idea behind LINGO is to first find
  meaningful descriptions of clusters, and then, based
  on the descriptions, determine their content.
• To assign documents to the already labeled groups
  LINGO could use the Latent Semantic Indexing
  (LSI) in the setting for which it was originally
  designed: given a query – retrieve the best matching
  documents.
• When a cluster label is fed to LSI as a query, the documents returned form the contents of that cluster.
  Lingo Algorithm
• This approach should take advantage of the LSI's ability to
  capture high-order semantic dependencies in the input
  collection.
• In this way not only would documents that contain the
  cluster label be retrieved, but also the documents in which
  the same concept is expressed without using the exact
  phrase.
• In web search results clustering, however, the effect of
  semantic retrieval is sharply diminished by the small size of
  the input web snippets.
• This, in turn, severely affects the precision of cluster content
  assignment.
Sentiment Analysis
  Sentiment Analysis
• Sentiment classification is a task under Sentiment Analysis
  (SA) that deals with automatically tagging text as positive,
  negative or neutral from the perspective of the speaker/writer
  with respect to a topic.
• Thus, a sentiment classifier tags the sentence ‘The movie is
  entertaining and totally worth your money!’ in a movie
  review as positive with respect to the movie.
• On the other hand, a sentence ‘The movie is so boring that I
  was dozing away through the second half.’ is labeled as
  negative.
• Finally, ‘The movie is directed by Nolan’ is labeled as
  neutral.
The general block diagram of a sentiment analysis system consists of the stages described below: data collection, data cleaning, feature extraction and selection, and classification.
  Data collection
• Data is collected from various social networking sites,
      blogging sites, and review sites
  Data Cleaning
1. Removal of URLs: Extracted data may contain URLs, which need to be removed as they do not carry any sentiment.
2. Case conversion: All the text should be converted to either upper case or lower case, i.e. there should be no difference between ‘paper’ and ‘PAPER’.
3. Removal of punctuation: Punctuation such as full stops, exclamation marks, commas, etc. should be removed as it does not represent any emotion.
4. Removal of hashtags: A hashtag word is preceded by a hash sign (#) and is generally used in social media to identify specific subjects.
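A small sketch of steps 1–4 using regular expressions (the example tweet is invented; hashtags are stripped before punctuation so the # marker is still present when the hashtag rule runs):

```python
import re
import string

def clean(text):
    text = re.sub(r"https?://\S+", "", text)   # 1. remove URLs
    text = text.lower()                        # 2. case conversion
    text = re.sub(r"#\w+", "", text)           # 4. remove hashtags
    # 3. remove remaining punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())              # collapse extra whitespace

tweet = "Loved this PAPER!!! #NLP details at https://example.com"
print(clean(tweet))
# loved this paper details at
```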
  Data Cleaning
5. Tokenization: It divides the given text into tokens.
6. Stemming: The Porter stemmer (M.F. Porter) is the most widely used algorithm for stemming words.
7. Negation rule: This method handles negation words, which reverse the meaning of a word in the review.
8. Conjunction rule: This method extracts meaning from the review using grammatical rules.
Feature Extraction and Selection
  Feature Extraction
• Some of the feature extraction techniques are:
1. Term presence and frequency: based on individual words or n-grams and their frequency counts.
2. Parts of Speech (POS): extracts adjectives and nouns from the data.
3. Opinion words and phrases: based on words which represent opinions, such as good or bad, like or hate, etc.
4. Negation: the appearance of negation words in text may reverse the meaning of an opinion; for example, “not good” is equivalent to “bad”.
  Feature Selection
• Feature Selection methods can be divided into lexicon-based
      methods that need human annotation, and statistical methods
      which are automatic methods that are more frequently used.
      Lexicon-based approaches usually begin with a small set of
      ‘seed’ words. Then they bootstrap this set through synonym
      detection or on-line resources to obtain a larger lexicon.
      Statistical approaches, on the other hand, are fully automatic.
• The feature selection techniques treat the documents either
      as group of words (Bag of Words (BOWs)), or as a string
      which retains the sequence of words in the document. BOW
      is used more often because of its simplicity for the
      classification process.
  Feature Selection
• Various feature selection methods are CountVectorizer, TF-IDF (Term Frequency–Inverse Document Frequency), IG (Information Gain), MI (Mutual Information), Feature Vector, Unigram, Bigram and N-gram methods.
• The TF-IDF score is taken into consideration to balance the most weighted and less weighted words.
• The Chi-square method gives good results for both positive and negative classes.
• Mutual Information, Chi-square, TF-IDF and Information Gain techniques are used to select features from high-dimensional data.
  Feature Selection
1. Count Vector: It is defined by the number of occurrences of a feature in the review.
2. TF-IDF: It is obtained by multiplying the frequency of the word in the review (TF) by the inverse document frequency of the word in the whole corpus (IDF):
   TF-IDF_i,j = tf_i,j * log(N / df_i)
   where TF-IDF_i,j is the weight of term i in sample j, tf_i,j is the frequency of term i in sample j, N is the total number of samples in the corpus, and df_i is the number of samples containing term i.
3. Information Gain: It is the most widely used attribute selection measure in the area of sentiment analysis. It determines the relevant features for predicting a review by studying the presence or absence of features in a document. Here P(c | f) is the conditional probability of class c given feature f, and P(c) denotes the marginal probability of class c.
  Feature Selection
4. Mutual Information: MI selects features that are not uniformly distributed across the sentiment classes, because such features are informative of their class; MI gives more importance to only a few terms. A commonly used form is
   MI(f, c) = log [ P(f, c) / ( P(f) P(c) ) ]
   where P(f, c) is the joint probability distribution of feature f and class c, and P(f) and P(c) are the marginal probability distributions of f and c; c ranges over the positive and negative classes.
5. Chi-square: Chi-square compares the observed count with the expected count and analyses how much deviation occurs between them. W, X, Y, Z represent frequencies of the presence or absence of the feature and the class in a sample: W is the count of samples in which feature f and class c occur together, X the count with f but not c, Y the count with c but not f, Z the count with neither, and N = W + X + Y + Z. The statistic is then
   chi-square(f, c) = N * (W*Z - X*Y)^2 / [ (W + X) * (W + Y) * (X + Z) * (Y + Z) ]
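A short worked sketch of these two measures for a single feature/class pair, using the standard 2x2-count forms given above (the counts are invented for illustration):

```python
import math

# Counts for one feature f and one class c over N training samples:
# W: f present and c,   X: f present, not c,
# Y: f absent and c,    Z: f absent, not c.
W, X, Y, Z = 40, 10, 20, 130
N = W + X + Y + Z

# (Pointwise) mutual information between feature presence and the class.
mi = math.log((W / N) / (((W + X) / N) * ((W + Y) / N)))

# Chi-square statistic for the same 2x2 contingency table.
chi2 = N * (W * Z - X * Y) ** 2 / ((W + X) * (W + Y) * (X + Z) * (Y + Z))

print(round(mi, 3), round(chi2, 3))
```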
  Classification
  Classification
• Sentiment Classification techniques can be roughly divided
  into machine learning approach, lexicon based approach and
  hybrid approach.
• The Machine Learning Approach (ML) applies the famous
  ML algorithms and uses linguistic features.
• The Lexicon-based Approach relies on a sentiment lexicon, a collection of known and precompiled sentiment terms.
• It is divided into dictionary-based approaches and corpus-
  based approaches which use statistical or semantic methods
  to find sentiment polarity.
• The hybrid Approach combines both approaches and is very
  common with sentiment lexicons playing a key role in the
  majority of methods.
  Classification
• The classification methods using ML approach can be
  roughly divided into supervised and unsupervised learning
  methods.
• The supervised methods make use of a large number of
  labeled training documents.
• The unsupervised methods are used when it is difficult to
      find these labeled training documents.
  Classification
• The lexicon-based approach depends on finding the opinion
   lexicon which is used to analyze the text.
There are two methods in this approach.
• The dictionary-based approach which depends on finding
   opinion seed words, and then searches the dictionary of their
   synonyms and antonyms.
• The corpus-based approach begins with a seed list of
   opinion words, and then finds other opinion words in a large
   corpus to help in finding opinion words with context specific
   orientations. This could be done by using statistical or
   semantic methods.
Named Entity Recognition
  Named Entity Recognition
• Entities are the who (and some of the what) of text analytics.
  On the most basic level, an entity in text is simply a proper
  noun such as a person, place, or product: John Coltrane,
  Coca Cola, and Indiana are all entities.
• Named Entity Recognition is a process where all the named
  entities which are the proper nouns are identified and
  classified into their predefined appropriate class. Named-
  entity recognition (NER) is a subtask of information
  extraction that seeks to locate and classify named entities in
  text into predefined categories such as the names of persons,
  organizations, locations, expressions of times, quantities,
  monetary values, percentages, etc. Thus it is the task of
  finding names such as organizations, persons, locations, etc.
  in text.
  Named Entity Recognition
• The example given below shows the named entities and their
  classes.
• Ram [PER] joined Symbiosis [ORG] in Pune [LOC] on 12th
  [DATE] Jan [MONTH] 2012 [YEAR] for a 3 [NUMB]
  course.
• NER is a two-stage problem: first, identification of the proper nouns, and second, classification of these proper nouns into their respective classes. The process of NER involves a few stages: pre-processing of text, data training, data testing, and lastly result evaluation. These steps are again broadly classified into pre-processing steps, feature extraction, NER algorithms and labeling.
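As an illustration, a hedged sketch using the spaCy library (one possible NER toolkit; the small English model must be installed separately, and its label set, e.g. PERSON, ORG, GPE, DATE, differs slightly from the tags used on the slide):

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ram joined Symbiosis in Pune on 12 Jan 2012 for a 3 year course.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical (model-dependent) output:
# Ram PERSON, Symbiosis ORG, Pune GPE, 12 Jan 2012 DATE
```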
  Named Entity Recognition
Step 1: Document collection
• Documents of varied formats such as .pdf, .html, .docx etc.
   from all sources will be collected. These documents will be
   inputs for the system.
Step 2: Pre-processing
• Data pre-processing describes any type of processing
   performed on raw data to prepare it for another processing
   procedure.
Step 2.1: Validation of input document
• Validation checks whether the given input text is in the language for which the system is implemented. It also checks whether the input is syntactically correct, but does not check semantic correctness.
  Named Entity Recognition
Step 2.2: Tokenization
• The aim of tokenization is the exploration of the words in a sentence, where every word, symbol and special character in the sentence is considered as a token.
Step 2.3: Stop word removal
• In stop word removal, words that occur very frequently and do not contribute much to the context and content are removed.
Step 2.4: Stemming
• Trimming or cutting off the extraneous parts of words down to their stem is called stemming. Here inflections are removed using stemming algorithms.
Step 2.5: Morphological analysis
• Morphological analysis is the procedure to find the root word. It is applied to recognize the inner structure of the word.
  Named Entity Recognition
Step 3: Data Training:
• This step is required to train the system. Training is done
   based on the feature extraction and the algorithm used. The
   output of this stage will be given to the testing stage.
Step 3.1: Feature extraction – In this process a small subset
from the sentence is extracted and then a feature set is applied to
the NER algorithms.
Step 3.2: NER algorithms – Various NER NLP algorithms
include rule based, machine learning and hybrid approaches.
  Named Entity Recognition
Step 4: Data Testing
Step 4.1: Feature extraction – This process is the same as
explained in the training data stage with the test data. The
extracted features are then tagged.
Step 4.2: Labeling (tagging) – In this process the entities are tagged using any of the algorithms.
Step 5: Result – The output of all the above stages then goes through the evaluation stage using evaluation parameters.
Step 6: Evaluation – The accuracy of NER can be measured using the Precision (P), Recall (R) and F1-measure metrics.
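A minimal sketch of how these metrics can be computed for NER output, treating predictions and gold annotations as sets of (entity, type) pairs (the example entities are invented):

```python
def evaluate(predicted, gold):
    # predicted / gold: sets of (entity_text, entity_type) pairs.
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Ram", "PER"), ("Symbiosis", "ORG"), ("Pune", "LOC")}
pred = {("Ram", "PER"), ("Pune", "ORG")}
print(evaluate(pred, gold))  # (0.5, 0.333..., 0.4)
```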
  Named Entity Recognition
• This process involves two main sub tasks firstly, identifying
  the proper nouns from the sequence of the text, secondly
  classifying these proper nouns into their predefined
  categories.
• The NER task can be performed using rule-based approaches, also called hand-crafted rules or the linguistic approach. In this approach the rules are written manually by the researchers for the system and for a particular language.
• Rule-based systems parse the source text and produce an intermediate representation which may be a parse tree or some abstract representation.
• Rule-based approaches are further classified into the list lookup approach and the linguistic approach.
  Named Entity Recognition
• List lookup approach: In the list lookup approach, a large corpus, called a bag of words, is built for all the named entities and their classes, and a list lookup is performed to identify named entities. This list is also called a gazetteer.
• Linguistic approach: In the linguistic approach one should have deep knowledge of the grammar of the specific language. This understanding and knowledge of the language lead to more accurate rules, so that named entities can be identified and classified easily.
      Machine Learning Approach
• In machine learning-based NER systems, the purpose is to convert the identification problem into a classification problem and to employ a statistical classification model to solve it.
• Machine learning approaches are also called corpus-based approaches. In this type of approach, the system looks for patterns and relationships in the text and builds a model using statistical models and machine learning algorithms.
• Based on this model, the system identifies and classifies nouns into particular classes such as persons, locations, times, etc., using machine learning algorithms.
• There are three types of machine learning models used for NER: supervised, semi-supervised and unsupervised models. Supervised learning utilizes only the labelled data to generate a model.
• Semi-supervised learning aims to combine both the labelled data and useful evidence from the unlabelled data in learning. Unsupervised learning is designed to learn without, or with very few, labelled data. These models are broadly classified further below.
Supervised Machine Learning Approach
• The supervised machine learning approach is also called the statistical approach. It has proved to be very effective.
• Statistical NER models usually treat recognising named entities as a sequence tagging problem, in which each word is tagged with its entity class if it belongs to one.
• The learning process is called supervised because human intervention is needed to train the system by providing labelled training examples from which the statistical model is constructed; good performance cannot be achieved without a large amount of training data.
• Different supervised machine learning approaches are as follows.
  Hidden Markov Model (HMM)
• It is a statistical language model that computes the likelihood
  of a sequence of words by employing a Markov chain, in
  which the likelihood of the next word is based on the current
  word.
• In this language model words are represented by states, NE
  classes are represented by regions and there is a probability
  associated with every transition from the current word to the
  next word.
• This model can predict the NE class of the next word if the current word and its NE class pair are given.
• It has a better capability of capturing the locality of phenomena, which indicates names in text.
Maximum Entropy Markov Model (MEMM)
• MEMM is also a statistical model and is very flexible. The output assignment for each word or token is based on its future (f), history (h) and features (g).
• Here an essential prerequisite is the selection of appropriate features. A maximum entropy solution to this, or any similar problem, allows the computation of p(f | h) for any f from the space of possible futures, F, and for every h from the space of possible histories, H.
• Since the outputs of the model are defined by the futures, the solution is to compute p(f | h) for any f from the space of possible futures, F, and for any h from the space of possible histories, H.
• Thus in NER, finding the probability of f for any token with respect to its index t can be formulated as p(f | h_t).
  Conditional Random Fields
• Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction.
• Whereas an ordinary classifier predicts labels or tags for entities without considering the context of neighbouring entities, a CRF takes the context into account; e.g., the linear-chain CRF predicts sequences of labels for sequences of input samples.
• It is a discriminative undirected probabilistic graphical model (random field) used to encode known relationships between observations and construct consistent interpretations.
Sample Questions
1. What are the different text preprocessing methods used in
   information retrieval?
2. What are the different approaches for Rule Based Machine
   Translation (RBMT)?
3. Explain different types of text summarization techniques.
4. Explain the different steps in text summarization.
5. Write short note on Question Answering System.
6. Explain Information retrieval system in detail.
7. Explain machine translation in detail.
8. What are the empirical machine translation systems?