Natural Language Processing
Dr. Ankur Priyadarshi
Assistant Professor
Computer Science and Information Technology
Syllabus
Prerequisites:
1. Basic knowledge of English grammar and the Theory of Computation.
2. Basic knowledge of Machine Learning tools.
Course objectives
1. To understand the algorithms available for processing linguistic information and the computational properties of natural languages.
2. To acquire basic knowledge of various morphological, syntactic, and semantic NLP tasks.
3. To become familiar with various publicly available NLP software libraries and datasets.
4. To develop systems for various NLP problems of moderate complexity.
5. To learn various strategies for NLP system evaluation and error analysis.
Unit I: INTRODUCTION TO NLP
Natural Language Processing
⊹   Natural language processing (NLP) refers to the branch of computer
    science—and more specifically, the branch of artificial intelligence or
    AI—concerned with giving computers the ability to understand text and
    spoken words in much the same way human beings can.
⊹   NLP combines computational linguistics—rule-based modeling of human
    language—with statistical, machine learning, and deep learning models.
⊹   Together, these technologies enable computers to process human
    language in the form of text or voice data and to ‘understand’ its full
    meaning, complete with the speaker or writer’s intent and sentiment.
      NLP APPLICATIONS
1. Information Extraction
2. Question Answering
3. Sentiment Analysis
4. Machine Translation, and many more:
Speech recognition, intent classification, urgency detection, auto-correct, market intelligence, email filtering, voice assistants and chatbots, targeted advertising, recruitment.
       Information Extraction (IE)
1.   Working with enormous amounts of text data is tedious and time-consuming.
2.   Hence, many companies and organisations rely on Information Extraction techniques to automate manual work with intelligent algorithms.
3.   Information extraction can reduce human effort, cut expenses, and make the process less error-prone and more efficient.
                         Example: IE
From a short report on a cricket match, we can extract structured fields such as:
●   Country – India, Captain – Virat Kohli
●   Batsman – Virat Kohli, Runs – 2
●   Bowler – Kyle Jamieson
●   Match venue – Wellington
●   Match series – New Zealand
●   Series highlight – single fifty, 8 innings, 3 formats
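To make this concrete, below is a minimal sketch of pulling such fields out automatically with spaCy's pretrained named-entity recognizer. It assumes spaCy is installed and the small English model has been fetched with "python -m spacy download en_core_web_sm"; the sample sentence is an illustrative stand-in, not the slide's actual match report.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Virat Kohli was dismissed by Kyle Jamieson for 2 runs in Wellington.")

    for ent in doc.ents:
        # ent.label_ is the predicted entity type (PERSON, GPE, CARDINAL, ...)
        print(ent.text, "->", ent.label_)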
               Question Answering
⊹   Question answering is a critical NLP problem and a
    long-standing artificial intelligence milestone.
⊹   QA systems allow a user to express a question in natural
    language and get an immediate and brief response.
⊹   QA systems are now found in search engines and phone conversational interfaces, and they are fairly good at answering simple factual questions.
⊹   On harder questions, however, they normally only go as far as returning a list of snippets that we, the users, must then browse through to find the answer to our question.
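As a flavour of how simple factual QA looks in code, here is a minimal extractive QA sketch using the Hugging Face transformers pipeline. It assumes transformers and a backend such as PyTorch are installed; the default QA model is downloaded on first use, and the question/context pair is illustrative.

    from transformers import pipeline

    qa = pipeline("question-answering")  # loads a default pretrained QA model
    result = qa(
        question="Where was the match played?",
        context="Virat Kohli scored 2 runs against New Zealand in Wellington.",
    )
    print(result["answer"])  # likely: "Wellington"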
Sentiment Analysis
⊹   Sentiment analysis (or opinion mining) is a natural
    language processing (NLP) technique used to determine
    whether data is positive, negative or neutral.
⊹   Sentiment analysis, as the name suggests, identifies the view or emotion behind a situation. It means analyzing a piece of text, speech, or any other mode of communication to find the emotion or intent behind it.
Suppose a fast-food chain sells a variety of food items such as burgers, pizzas, sandwiches, and milkshakes. It has created a website to sell its food; customers can order any food item from the website and also provide reviews, saying whether they liked the food or hated it.
 ● User Review 1: I love this cheese sandwich, it’s so delicious.
 ● User Review 2: This chicken burger has a very bad taste.
 ● User Review 3: I ordered this pizza today.
Of the three reviews above, the first is definitely positive: it signifies that the customer was really happy with the sandwich. The second is negative, so the company needs to look into its burger department. The third doesn't signify whether the customer is happy or not, so we can consider it a neutral statement.
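A minimal sketch of this three-way classification, using NLTK's VADER analyzer as one possible tool (assumes nltk is installed; the sentiment lexicon is fetched on first run):

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    reviews = [
        "I love this cheese sandwich, it's so delicious.",
        "This chicken burger has a very bad taste.",
        "I ordered this pizza today.",
    ]
    for text in reviews:
        # compound ranges from -1 (most negative) to +1 (most positive)
        score = sia.polarity_scores(text)["compound"]
        label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
        print(label, "->", text)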
Machine Translation
Machine Translation (MT) is the task of automatically
converting one natural language into another, preserving
the meaning of the input text, and producing fluent text in the
output language.
While machine translation is one of the oldest subfields of artificial intelligence
research, the recent shift towards large-scale empirical techniques has led to
very significant improvements in translation quality.
The Stanford Machine Translation group's research interests lie in techniques
that utilize both statistical methods and deep linguistic analyses.
Machine translation: approaches
 ●   Rule-based Machine Translation (RBMT): 1970s-1990s
 ●   Statistical Machine Translation (SMT): 1990s-2010s
 ●   Neural Machine Translation (NMT): 2014-...
Rule-based MT (RBMT)
A rule-based system requires experts’ knowledge about the source and
the target language to develop syntactic, semantic and morphological
rules to achieve the translation.
The Wikipedia article on RBMT includes a basic example of rule-based translation from English to German. The translation needs an English-German dictionary, a rule set for English grammar, and a rule set for German grammar.
An RBMT system contains a pipeline of Natural Language Processing
(NLP) tasks including Tokenization, Part-of-Speech tagging and so on.
Most of these jobs have to be done in both source and target language.
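To make the idea concrete, here is a toy English-to-German sketch; the four-word dictionary and the single noun-capitalization rule are illustrative assumptions, and a real RBMT system such as SYSTRAN uses far richer dictionaries and grammar rules.

    # Toy rule-based translation: dictionary lookup plus one target-grammar rule.
    en_de = {"i": "ich", "see": "sehe", "the": "das", "house": "haus"}
    german_nouns = {"haus"}  # rule data: German nouns are capitalized

    def translate(sentence: str) -> str:
        tokens = sentence.lower().split()          # tokenization
        words = [en_de.get(t, t) for t in tokens]  # dictionary lookup
        return " ".join(w.capitalize() if w in german_nouns else w for w in words)

    print(translate("I see the house"))  # -> "ich sehe das Haus"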
SYSTRAN is one of the oldest machine translation companies. It translates from and to around 20 languages.
SYSTRAN was used for the Apollo-Soyuz project (1973) and by the European Commission (1975).
Advantages
   ●   No bilingual text required
   ●   Domain-independent
   ●   Total control (a possible new rule for every situation)
   ●   Reusability (existing rules of languages can be transferred
       when paired with new languages)
Disadvantages
   ●   Requires good dictionaries
   ●   Manually set rules (requires expertise)
 Statistical MT
This approach uses statistical models based on the analysis of bilingual
text corpora.
It was first introduced in 1955, but it gained interest only after 1988
when the IBM Watson Research Center started using it.
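The flavour of the approach can be seen in a toy implementation of IBM Model 1, the classic word-alignment model from that IBM work; the three-sentence German-English corpus below is an illustrative assumption.

    from collections import defaultdict

    # Toy parallel corpus: (German sentence, English sentence) pairs.
    corpus = [
        (["das", "haus"], ["the", "house"]),
        (["das", "buch"], ["the", "book"]),
        (["ein", "buch"], ["a", "book"]),
    ]

    e_vocab = {e for _, es in corpus for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))  # t(e|f), uniform initialization

    for _ in range(10):  # EM iterations
        count, total = defaultdict(float), defaultdict(float)
        for fs, es in corpus:
            for e in es:
                z = sum(t[(e, f)] for f in fs)  # E-step: normalize over source words
                for f in fs:
                    count[(e, f)] += t[(e, f)] / z
                    total[f] += t[(e, f)] / z
        for (e, f), c in count.items():         # M-step: re-estimate t(e|f)
            t[(e, f)] = c / total[f]

    print(round(t[("house", "haus")], 3))  # grows toward 1.0 as alignments sharpen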
                     SMT Examples
●   Google Translate (between 2006 and 2016, when it announced the switch to NMT)
●   Microsoft Translator (switched to NMT in 2016)
●   Moses: Open source toolkit for statistical machine translation
Advantages
   ●   Less manual work from linguistic experts
   ●   One SMT suitable for more language pairs
   ●   Less out-of-dictionary translation: with the right language
       model, the translation is more fluent
Disadvantages
   ●   Requires bilingual corpus
   ●   Specific errors are hard to fix
   ●   Less suitable for language pairs with big differences in word order
Neural MT
❖   The neural approach uses neural networks to achieve machine
    translation.
❖   Compared to the previous models, NMTs can be built with one
    network instead of a pipeline of separate tasks.
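A minimal sketch of running a pretrained NMT model through the Hugging Face transformers pipeline (assumes transformers, sentencepiece, and a backend such as PyTorch are installed; the Helsinki-NLP/opus-mt-en-de model is downloaded on first use):

    from transformers import pipeline

    # One end-to-end network replaces the RBMT/SMT pipeline of separate tasks.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
    print(translator("The house is small.")[0]["translation_text"])  # e.g. "Das Haus ist klein."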
NMT examples
   ●   Google Translate (from 2016; see the language team at Google AI)
   ●   Microsoft Translator (from 2016; see MT research at Microsoft)
   ●   Translation on Facebook (see NLP at Facebook AI)
   ●   OpenNMT: an open-source neural machine translation system
Advantages
   ●   End-to-end models (no pipeline of specific tasks)
Disadvantages
   ●   Requires bilingual corpus
   ●   Rare word problem
NLP PHASES
     Lexical Analysis
 ●    It involves identifying and analyzing the structure of words. Lexicon of a
      language means the collection of words and phrases in that particular
      language.
 ●    Lexical analysis divides the text into paragraphs, sentences, and words, so we need to perform lexicon normalization.
The two most common lexicon normalization techniques are Stemming and Lemmatization:
 ●    Stemming: Stemming is the process of reducing derived words to their word stem, base, or root form, generally by stripping written suffixes such as “-ing”, “-ly”, “-es”, and “-s”.
 ●    Lemmatization: Lemmatization is the process of reducing a group of words to their lemma, or dictionary form. It takes into account things like POS (Part-of-Speech) tags, the meaning of the word in the sentence, the meaning of the word in nearby sentences, etc., before reducing the word to its lemma.
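A minimal sketch contrasting the two techniques with NLTK (assumes nltk is installed; the WordNet data is fetched on first run):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["studies", "studying", "caring"]:
        # Stemming truncates suffixes; lemmatization maps to a dictionary form.
        print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))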
   Syntactic Analysis
Syntactic analysis is used to check grammar, the arrangement of words, and the interrelationships between words.
     Example: Mumbai goes to the Sara
Here “Mumbai goes to the Sara” does not make any sense, so this sentence is rejected by the syntactic analyzer.
Syntactic parsing involves analyzing the words in a sentence for grammar.
Dependency grammar and Part-of-Speech (POS) tags are the important attributes of syntactic analysis.
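A minimal sketch of POS tagging and dependency parsing with spaCy (assumes spaCy is installed and the en_core_web_sm model has been downloaded):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Mumbai goes to the Sara")

    for token in doc:
        # token.pos_ is the coarse POS tag; token.dep_ the relation to token.head
        print(token.text, token.pos_, token.dep_, "<-", token.head.text)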
                     Semantic analysis
The way we understand what someone has said is an unconscious
process relying on our intuition and knowledge about language itself.
In other words, the way we understand language is heavily based on
meaning and context. Computers need a different approach, however.
The word “semantic” is a linguistic term and means "related to
meaning or logic."
Semantic analysis is the process of understanding the meaning and
interpretation of words, signs and sentence structure.
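A classic instance of semantic analysis is word-sense disambiguation; below is a minimal sketch using NLTK's implementation of the Lesk algorithm (assumes nltk is installed; WordNet is fetched on first run, and the example sentence is illustrative):

    import nltk
    from nltk.wsd import lesk

    nltk.download("wordnet", quiet=True)

    # Which sense of "bank" fits this context: river bank or financial institution?
    context = "I went to the bank to deposit money".split()
    sense = lesk(context, "bank")
    print(sense.name(), "-", sense.definition())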
       Discourse Integration
Discourse integration is closely related to pragmatics (context of the sentence).
Discourse integration is considered the larger context for any smaller part of an NL structure. Natural language is highly complex, and most of the time sequences of text depend on the prior discourse.
This concept occurs often in pragmatic ambiguity. This analysis deals with how the
immediately preceding sentence can affect the meaning and interpretation of the
next sentence. Here, context can be analyzed in a bigger context, such as paragraph
level, document level, and so on.
   Pragmatic Analysis
Pragmatic Analysis is part of the process of extracting information from text.
Specifically, it's the portion that focuses on taking a structured set of text and figuring out its actual meaning.
It comes from the field of linguistics (as a lot of NLP does), where the context of the text is taken into account.
Why is this important? Because much of a text's meaning has to do with the context in which it was said or written.
Ambiguity, and limiting ambiguity, are at the core of natural language processing, so pragmatic analysis is crucial for extracting meaning or information.
    Difficulty In NLP
●   Contextual words and phrases and homonyms
●   Synonyms
●   Irony and sarcasm
●   Ambiguity
●   Errors in text or speech
●   Colloquialisms and slang
●   Domain-specific language
●   Low-resource languages
●   Lack of research and development