NLP Merged

The document outlines a course on Natural Language Processing (NLP) at BITS Pilani, detailing its objectives, evaluation plan, and key concepts such as language models, word embeddings, and parsing. It also provides a course plan, references, and an overview of NLP tools and applications, emphasizing the importance of understanding human language through computational methods. The course aims to equip students with fundamental NLP techniques and their practical applications in technology.

BITS Pilani

Pilani Campus

Natural Language Processing


DSECL ZG530 – Session 1
Dr. Vijayalakshmi Anand, BITS Pilani (vijayalakshmi_anand@pilani.bits-pilani.ac.in)
Date – 23 Nov 2024, Time – 10.40 to 12.40
These slides are prepared by the instructor, with grateful acknowledgement of James Allen and many others who made their course materials freely available online.

Agenda
• Course Objectives
• Course & Evaluation Plan
• Introduction to Natural Language Processing
• Application Areas
• Few Terminologies
• Frequently applied Data Preparation Process for NLP

Objective of course
• To learn the fundamental concepts and techniques of natural language processing (NLP), including language models, word embeddings, part-of-speech tagging and parsing
• To learn the computational properties of natural languages and the commonly used algorithms for processing linguistic information
• To introduce the basic mathematical models and methods used in NLP applications to formulate computational solutions
• To introduce research and development work in natural language processing
Course plan

Text books and Reference books
T1 Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, McGraw Hill
T2 Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA
R1 Allen, James, Natural Language Understanding
R2 Neural Machine Translation by Philipp Koehn
R3 Semantic Web Primer (Information Systems) by Antoniou, Grigoris; Van Harmelen, Frank

Tentative Evaluation Plan

Name                     Weight
Quiz (best 2 out of 3)   10%
Assignment 1 and 2       20%
Mid-term Exam            30%
End Semester Exam        40%

Students are requested to check the Canvas announcements for the details.

What is Natural Language Processing?
◼ Natural Language Processing
◼ Process information contained in natural language text.
◼ Also known as Computational Linguistics (CL), Human Language Technology (HLT), Natural Language Engineering (NLE)
What is it..
◼ Analyze, understand and generate human languages just like humans do.
◼ Applying computational techniques to the language domain.
◼ To explain linguistic theories, and to use the theories to build systems that can be of social use.
◼ Started off as a branch of Artificial Intelligence.
◼ Borrows from Linguistics, Psycholinguistics, Cognitive Science & Statistics.
◼ Make computers learn our language rather than we learn theirs.

Natural Language Processing
• Increase of online data
• Process unstructured data and extract meaningful information
• Challenges
• Solution: Natural Language Processing

Application of NLP

History of NLP
1950 – Alan Turing published his paper
1957 – Chomsky published his book, Syntactic Structures
1958 – LISP
1964 – ELIZA
1966 – AI, NLP, Machine translation
1980 – IBM
Example
Dave: Open the pod bay doors, HAL.
HAL: I am sorry, Dave. I am afraid I can't do that.
Dave: What's the problem?
HAL: I think you know what the problem is just as well as I do.
Dave: I don't know what you're talking about.
HAL: I know that you and Frank were planning to disconnect me, and I'm afraid that's something I cannot allow to happen.

• General speech and language understanding and generation capabilities
• Politeness: emotional intelligence
• Self-awareness: a model of self, including goals and plans
• Belief ascription: modeling others; reasoning about their goals and plans

2020 – Conversational Agents
2022 – ChatGPT

Contd..
HAL: I can tell from the tone of your voice, Dave, that you're upset. Why don't you take a stress pill and get some rest.
[Dave has just drawn another sketch of Dr. Hunter.]
HAL: Can you hold it a bit closer?
[Dave does so.]
HAL: That's Dr. Hunter, isn't it?
Dave: Yes.

• Recognition of emotion from speech
• Vision capability, including visual recognition of emotions and faces
• Also: situational ambiguity

Contd..
To attain the levels of performance we attribute to HAL, we need to be able to define, model, acquire and manipulate:
• Knowledge of the world and of agents in it
• Text meaning
• Intention
Main components of NLP

Natural language understanding
➢ Allows users to interact more naturally with the computer.
➢ Different levels of analysis
• Morphological analysis
• Syntactic analysis
• Semantic analysis
• Discourse analysis

Natural language generation
➢ NLG decides how to respond by
• Deep planning
• Syntactic generation

Steps used for NLU
Morphology analysis
➢ Morphology is the arrangement and relationships of the smallest meaningful units in a language.
➢ Eg:
• Firehouse
• Doghouse
• runs

Lexical analysis
• The lexicon of a language is the collection of words and phrases in that language.
• Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.
• friendful or beautyship → lexically incorrect
• friendship and beautiful → lexically correct

Syntactic analysis
➢ Syntax analysis checks the text for meaningfulness by comparing it with the rules of formal grammar.
➢ Eg:
"the girl go to the school"
"Agra goes to the Poonam"

Semantic analysis
➢ It draws the exact meaning or the dictionary meaning from the text. The text is checked for meaningfulness.
➢ Eg: "Colourless blue idea"
Discourse integration
➢ The meaning of any sentence depends upon the meaning of the sentence just before it.
➢ Eg: "She wanted it."

Pragmatic analysis
➢ How sentences are used in different situations, and how use affects the interpretation of a sentence.
➢ Eg:
➢ "Close the window"
➢ "She cuts banana with a pen"

Introduction to Natural Language Processing

Natural Language Generation
➢ Producing text from computer data
➢ Steps
▪ Discourse planning
▪ Surface realizer
▪ Lexical selection

Different Levels of Language Analysis: Syntax, Semantics, and Pragmatics
Examples:
1. Green frogs have large noses. – pragmatically wrong
2. Green ideas have large noses. – semantically and pragmatically wrong
3. Large have green ideas nose. – all three wrong
4. I are fond in animals (exercise)
5. My cat is studying Linguistics (exercise)
6. Spain is larger than China (exercise)
Why NLP is hard
➢ Ambiguity
➢ Eg:
• Teacher strikes idle kids.
• Sarah gave a bath to her dog wearing a pink t-shirt.
• RBI raises interest rates or RBI raises, interest rates.

• Ambiguities at all levels
I. Structural ambiguities
▪ The professor said on Monday he would give an exam.
▪ Visiting relatives can be a nuisance. (two meanings)
II. Lexical ambiguities
▪ I saw bats.

Contd..
Non-standard English
➢ Doesn't follow the usual rules
➢ Eg: gm, gudnit etc.
Segmentation issues
➢ Eg: The old city-bus stop
Idioms
➢ Eg: dark horse, lose face

World knowledge
➢ Eg: Mary and Sue are sisters. / Mary and Sue are mothers.
Tricky entity names
• Let It Be was recorded.
• Where is A Bug's Life playing?
Some of the tasks in NLP
Sentence segmentation
➢ It breaks a paragraph into separate sentences.
➢ Eg: "Independence Day is one of the important festivals for every Indian citizen. It is celebrated on the 15th of August each year ever since India got independence from the British rule. The day celebrates independence in the true sense."
• "Independence Day is one of the important festivals for every Indian citizen."
• "It is celebrated on the 15th of August each year ever since India got independence from the British rule."
• "The day celebrates independence in the true sense."

Contd..
➢ Tokenization
• Splits longer strings of text into smaller pieces, or tokens.
• Eg: "JavaTpoint offers Corporate Training, Summer Training, Online Training, and Winter Training."
  → "JavaTpoint", "offers", "Corporate", "Training", "Summer", "Training", "Online", "Training", "and", "Winter", "Training", "."
➢ Lemmatization
• Captures canonical forms based on a word's lemma.
• Eg: better → good

Contd..
➢ Stemming
• The process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.
• Eg: running → run
➢ Stop words
• Contribute little to the overall meaning
• Eg: "He is a good boy."

Schematic representation of NLP
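As a minimal sketch, the preprocessing steps above can be tried with NLTK (installation is covered later in this deck); the download package names may vary slightly across NLTK versions:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# One-time downloads: sentence/word tokenizer, stop-word list, WordNet for lemmas.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = ("Independence Day is one of the important festivals for every Indian citizen. "
        "It is celebrated on the 15th of August each year.")

sentences = nltk.sent_tokenize(text)            # sentence segmentation
tokens = nltk.word_tokenize(sentences[0])       # tokenization

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("running"))                  # -> 'run'  (stemming)
print(lemmatizer.lemmatize("better", pos="a"))  # -> 'good' (lemmatization)

# Stop-word removal: keep only content-bearing alphabetic tokens.
stops = set(stopwords.words("english"))
content = [w for w in tokens if w.lower() not in stops and w.isalpha()]
print(content)
```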


Syntactic representations of language
• Most syntactic representations of language are based on the notion of context-free grammars.
• CFGs represent sentence structure in terms of what phrases are subparts of other phrases.
• Often presented in a tree form.

Representations and Understanding

Representations and Understanding

CFG Rules
S  -> NP VP
NP -> N N
NP -> N
VP -> V NP
VP -> V PP
PP -> P NP

Lexicon
Rice  : N
Flies : N, V
Like  : V, P
Sand  : N

NP – Noun Phrase, VP – Verb Phrase, PP – Prepositional Phrase

The Organization of Natural Language Processing Systems
(diagram: Natural Language Understanding and Natural Language Generation components)
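A small sketch of this grammar and lexicon using NLTK's chart parser, assuming the sentence "rice flies like sand" is the intended example; the two parses it finds reflect the lexical ambiguity of "flies" (N or V) and "like" (V or P):

```python
import nltk

# The CFG rules and lexicon from the slide, written in NLTK's grammar syntax.
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> N N | N
VP -> V NP | V PP
PP -> P NP
N  -> 'rice' | 'flies' | 'sand'
V  -> 'flies' | 'like'
P  -> 'like'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("rice flies like sand".split()):
    tree.pretty_print()   # two parse trees: "flies" as a noun vs. as a verb
```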
Evaluating Language Understanding Systems
• What metrics to use?
• How to deal with complex outputs like translations?
• Are the human judgments measuring something real? Are they reliable?
• Is the sample of texts sufficiently representative?
• How reliable or certain are the results?

Contingency Table

Tradeoff between Precision and Recall

NLTK Installation
The Natural Language Toolkit (NLTK) is a platform used for building programs for text analysis.
Open an Anaconda terminal and run:
  pip install nltk
Anaconda and Jupyter are popular data science tools. In Jupyter, console commands can be executed with a '!' sign before the command within the cell:
  ! pip install nltk

NLTK book
NLTK discussion forum
https://www.nltk.org/install.html
NLP Tools
Some commercial tools
• IBM Watson | A pioneer AI platform for businesses
• Google Cloud NLP API | Google technology applied to NLP
• Amazon Comprehend | An AWS service to get insights from text

Open source tools
• Stanford CoreNLP is a popular Java library built and maintained by Stanford University.
• SpaCy – one of the newest open-source Natural Language Processing libraries for Python
• Gensim is a highly specialized Python library that largely deals with topic modeling tasks using algorithms like Latent Dirichlet Allocation (LDA)
• Natural Language Toolkit (NLTK) is the most popular Python library
• Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model released by OpenAI, pre-trained with 175 billion parameters. It is an autocompleting program and is used mainly for predicting text.
• AllenNLP: a powerful tool for prototyping with good text processing capabilities. It automates some of the tasks essential for almost every deep learning model and provides modules like Seq2VecEncoder and Seq2SeqEncoder.
• Berkeley Neural Parser (Python): a high-accuracy parser with models for 11 languages. It cracks the syntactic structure of sentences into nested sub-phrases, enabling easy extraction of information from syntactic constructs.

References
• https://emerj.com/partner-content/nlp-current-applications-and-future-possibilities/
• https://venturebeat.com/2019/04/05/why-nlp-will-be-big-in-2019/
• https://www.nltk.org/book/
• https://www.coursera.org/learn/python-text-mining/home/week/1
• https://openai.com/api/
• https://analyticssteps.com/blogs/top-nlp-tools
• https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
• https://www.cstr.ed.ac.uk/emasters/course/natural_lang.html
• https://web.stanford.edu/class/cs224u/2016/materials/cs224u-2016-intro.pdf
• https://www.mygreatlearning.com/blog/trending-natural-language-processing-applications/

Thank You… ☺
• Q&A
• Suggestions / Feedback
BITS Pilani
Pilani Campus

Natural Language Processing


DSECL ZG530 – Session 2
Dr. Vijayalakshmi Anand, BITS Pilani
Date – 1st June 2024, Time – 1.40pm to 3.40pm
These slides are prepared by the instructor, with grateful acknowledgement of James Allen and many others who made their course materials freely available online.

Content Sequence Plan
M1 Natural Language Understanding and Generation
M2 N-gram Language Modelling
M7 Vector semantics and Embedding
M3 Neural networks and Neural language Models
M4 Part-of-Speech Tagging
M5 Hidden Markov Models and MEMM
M6 Topic Modelling
M8 Grammars and Parsing
M9 Statistical Constituency Parsing
M10 Dependency Parsing
M11 Encoder-Decoder Models, Attention and Contextual Embedding
M12 Word sense disambiguation
M13 Semantic web ontology and Knowledge Graph
M14 Introduction to NLP Applications

Session #2 – N-gram Language Model
• Human word prediction
• Language models
  ➢ What is a language model
  ➢ Why language models
• N-gram language models
  ➢ Unigram, bigram and N-gram
• Evaluation of language models
  ➢ Perplexity
• Smoothing
  ➢ Laplace smoothing
  ➢ Interpolation and backoff
Next word prediction
• We may have the ability to predict future words in an utterance.
• How?
  ❖ Based on domain knowledge: red blood
  ❖ Based on syntactic knowledge: the <adj/noun>
  ❖ Based on lexical knowledge: baked <potato>

Language Models
A model that computes either of these:
• the probability of a sentence, P(W), or
• the probability of an upcoming word, P(wn | w1, w2, …, wn-1)
is called a language model.
Simply, a language model learns to predict the probability of a sequence of words.

Language modelling
1. Machine Translation:
A machine translation system translates one language into another, for example Chinese to English or German to English.
"我在吃" → I am eating / Me am eating / I eat lunch / Eating am I
A language model helps pick the most fluent candidate ("I am eating").


Continued..
2. Spell correction
Example:
"I picked up the phone to answer her fall" → "I picked up the phone to answer her call"

3. Speech recognition
Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format.

Sentiment analysis
Sentiment analysis using a language model involves determining the emotional tone or polarity (e.g., positive, negative, or neutral) of a given piece of text.
Use cases in real life:
• E-commerce: analyzing customer reviews for feedback
• Social media monitoring: understanding public sentiment about a brand
• Healthcare: detecting emotional distress in patients' messages

How to build a language model
• Recall the definition of conditional probabilities:
  P(B|A) = P(A,B)/P(A); rewriting: P(A,B) = P(A)P(B|A)
• More variables:
  P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• The Chain Rule in general:
  P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)


Example
In a factory there are 100 units of a certain product, 5 of which are defective. We pick three units from the 100 units at random. What is the probability that none of them are defective?
Let Ai be the event that the i-th picked unit is not defective, i = 1, 2, 3.

The Chain Rule applied to compute the joint probability of words in a sentence
• P(w1 w2 … wn) = Π_i P(wi | w1 w2 … wi-1)
• For example:
  P("its water is so transparent") =
  P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
• P(most biologists and specialists believe that in fact the mythical unicorn horns derived from the narwhal)

Markov Assumption
• Simplifying assumption (Andrei Markov): limit the history to a fixed number of words, N-1

P(the | its water is so transparent that) ≈ P(the | that)
or
P(the | its water is so transparent that) ≈ P(the | transparent that)


N-gram Language models

• An N-gram is a sequence of N tokens (words).
• Unigram language model (N = 1)
Example:
P(I want to eat Chinese food) ≈ P(I) P(want) P(to) P(eat) P(Chinese) P(food)

Bigram model
N = 2
Example:
P(I want to eat Chinese food) ≈ P(I|<start>) P(want|I) P(to|want) P(eat|to) P(Chinese|eat) P(food|Chinese) P(<end>|food)

N-gram models
• We can extend to trigrams, 4-grams, 5-grams
Advantages:
• No human supervision, easy to extend to more data, allows querying about open-class relations
Disadvantage:
• In general this is an insufficient model of language
  – because language has long-distance dependencies:
  "The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing."


Estimating N-gram Probabilities

Estimating bigram probabilities
• Estimating the probability as the relative frequency is the maximum likelihood estimate (or MLE), because this value makes the observed data maximally likely.
• The Maximum Likelihood Estimate:
  P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1)

An example
P(wi | wi-1) = c(wi-1, wi) / c(wi-1)

Corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
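A minimal sketch that computes these MLE bigram estimates for the three-sentence corpus above:

```python
from collections import Counter

# The three-sentence corpus from the slide, with <s> and </s> markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words[:-1], words[1:]))

def p_mle(w, prev):
    """P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))    # 2/3
print(p_mle("Sam", "am"))   # 1/2
print(p_mle("do", "I"))     # 1/3
```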


Evaluation

How good is our model?
• Does our language model prefer good sentences to bad ones?
  – Assign higher probability to "real" or "frequently observed" sentences than to "ungrammatical" or "rarely observed" sentences.

Experimental setup
• We train the parameters of our model on a training set.
• We test the model's performance on data we haven't seen.
  – A test set is an unseen dataset that is different from our training set, totally unused.
  – An evaluation metric tells us how well our model does on the test set.

Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
  – Put each model in a task
    • spelling corrector, speech recognizer, MT system
  – Run the task, get an accuracy for A and for B
    • How many misspelled words corrected properly
    • How many words translated correctly
  – Compare accuracy for A and B


Difficulty of extrinsic (in-vivo) evaluation of N-gram models
• Extrinsic evaluation
  – Time-consuming; can take days or weeks
  – Bad approximation
    • unless the test data looks just like the training data
  – So generally only useful in pilot experiments
  – Sometimes use intrinsic evaluation: perplexity

Intuition of Perplexity
• The Shannon Game:
  – How well can we predict the next word?
    I always order pizza with cheese and ____ (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100)
    The president of India is ____
    I wrote a ____
  – Unigrams are terrible at this game. (Why?)
• A better model of a text
  – is one which assigns a higher probability to the word that actually occurs

Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:
  PP(W) = P(w1 w2 … wN)^(-1/N)
        = ( 1 / P(w1 w2 … wN) )^(1/N)
Chain rule:
  PP(W) = ( Π_i 1 / P(wi | w1 … wi-1) )^(1/N)
For bigrams:
  PP(W) = ( Π_i 1 / P(wi | wi-1) )^(1/N)

Minimizing perplexity is the same as maximizing probability

Example 1
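A minimal sketch of computing bigram perplexity for one test sentence; the bigram probabilities below are illustrative values, not estimates from a real corpus:

```python
import math

# Assumed toy bigram probabilities for the test sentence.
bigram_p = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33,
    ("want", "english"): 0.0011, ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def perplexity(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    n = len(words) - 1                      # number of predicted words
    log_prob = 0.0
    for prev, w in zip(words[:-1], words[1:]):
        log_prob += math.log(bigram_p[(prev, w)])
    return math.exp(-log_prob / n)          # PP(W) = P(W)^(-1/N)

print(perplexity("i want english food"))
```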


Example 2

Corpora

Lower perplexity = better model
• Training: 38 million words, test: 1.5 million words, WSJ

N-gram Order   Unigram   Bigram   Trigram
Perplexity     962       170      109

Generalization and Zeros


The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
  – In real life, it often doesn't
  – We need to train robust models that generalize!
  – One kind of generalization: zeros!
    • Things that don't ever occur in the training set but occur in the test set

Problems with simple MLE estimations
• Training set:                 • Test set:
  … denied the allegations        … denied the offer
  … denied the reports            … denied the loan
  … denied the claims
  … denied the request
P("offer" | denied the) = 0

Zero probability bigrams
• Bigrams with zero probability
  – mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can't divide by 0)!

Laplace Smoothing (Add-1 smoothing)
• Pretend we saw each word one more time than we did
• Just add one to all the counts!
• MLE estimate:
  PMLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
• Add-1 estimate:
  PAdd-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
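A minimal sketch of the Add-1 estimate, reusing the small "I am Sam" corpus from earlier; V is taken here as the number of word types seen in training:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    w = sent.split()
    unigrams.update(w)
    bigrams.update(zip(w[:-1], w[1:]))

V = len(unigrams)                       # vocabulary size (includes <s>, </s>)

def p_add1(w, prev):
    """P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add1("Sam", "am"))    # seen bigram: (1 + 1) / (2 + V)
print(p_add1("Sam", "I"))     # unseen bigram now gets non-zero mass: 1 / (3 + V)
```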


Example

After Add-1 smoothing

Berkeley Restaurant Project corpus

Raw Bigram Counts

Converting to probabilities

Berkeley Restaurant Corpus, contd…: Laplace smoothed bigram counts

Laplace-smoothed bigrams
V = 1446 in the Berkeley restaurant project corpus

Reconstituted counts

Compare with raw bigram counts

Add-1 estimation is a blunt instrument
• So add-1 isn't used for N-grams:
  – We'll see better methods
• But add-1 is used to smooth other NLP models
  – For text classification
  – In domains where the number of zeros isn't so huge


Backoff and Interpolation
• Sometimes it helps to use less context
  – Condition on less context for contexts you haven't learned much about
• Backoff:
  – use trigram if you have good evidence, otherwise bigram, otherwise unigram
• Interpolation:
  – mix unigram, bigram, trigram
• Interpolation works better

Linear Interpolation
• Involves using different pieces of information to derive a probability
• Simple interpolation:
  P^(wn | wn-2 wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2 wn-1), with Σ λi = 1
• Conditional interpolation: the λ weights can themselves be conditioned on the context
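A minimal sketch of simple linear interpolation; the unigram/bigram/trigram probabilities and the λ weights are assumed values for illustration only (in practice the λs are tuned on held-out data, as the next slide describes):

```python
# Assumed toy maximum-likelihood estimates.
p_uni = {"food": 0.02}
p_bi  = {("english", "food"): 0.5}
p_tri = {("want", "english", "food"): 0.6}

# Interpolation weights; must sum to 1.
l1, l2, l3 = 0.1, 0.3, 0.6

def p_interp(w, w1, w2):
    """P^(w | w1 w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1 w2)."""
    return (l1 * p_uni.get(w, 0.0)
            + l2 * p_bi.get((w2, w), 0.0)
            + l3 * p_tri.get((w1, w2, w), 0.0))

print(p_interp("food", "want", "english"))   # 0.1*0.02 + 0.3*0.5 + 0.6*0.6 = 0.512
```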

How to set the lambdas?
• Use a held-out corpus
  Training Data | Held-Out Data | Test Data
• Choose λs to maximize the probability of held-out data:
  – Fix the N-gram probabilities (on the training data)
  – Then search for λs that give the largest probability to the held-out set

Unknown words: open versus closed vocabulary tasks
• If we know all the words in advance
  – Vocabulary V is fixed
  – Closed vocabulary task
• Often we don't know this
  – Out Of Vocabulary = OOV words
  – Open vocabulary task
• Instead: create an unknown word token <UNK>
  – Training of <UNK> probabilities
    • Create a fixed lexicon L of size V
    • At the text normalization phase, any training word not in L is changed to <UNK>
    • Now we train its probabilities like a normal word
  – At decoding time
    • If text input: use <UNK> probabilities for any word not in training


Huge web-scale n-grams
• How to deal with, e.g., the Google N-gram corpus
• Pruning
  – Only store N-grams with count > threshold
  – Entropy-based pruning
• Efficiency
  – Efficient data structures like tries
  – Bloom filters: approximate language models
  – Store words as indexes, not strings
    • Use Huffman coding to fit large numbers of words into two bytes

Smoothing for Web-scale N-grams
• "Stupid backoff" (Brants et al. 2007)
  S(wi | w(i-k+1 : i-1)) = count(w(i-k+1 : i)) / count(w(i-k+1 : i-1))   if count(w(i-k+1 : i)) > 0
                         = 0.4 · S(wi | w(i-k+2 : i-1))                  otherwise
  S(wi) = count(wi) / N
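A minimal sketch of stupid backoff over a toy corpus (the counts here ignore sentence boundaries, which a real implementation would handle more carefully):

```python
from collections import Counter

corpus = "<s> I like college </s> <s> do I like college </s>".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus[:-1], corpus[1:]))
trigrams = Counter(zip(corpus[:-2], corpus[1:-1], corpus[2:]))
N = len(corpus)

def stupid_backoff(w, w1, w2, alpha=0.4):
    """S(w | w1 w2): relative frequency if the trigram was seen,
    otherwise back off to the bigram (and finally the unigram), scaled by 0.4."""
    if trigrams[(w1, w2, w)] > 0:
        return trigrams[(w1, w2, w)] / bigrams[(w1, w2)]
    if bigrams[(w2, w)] > 0:
        return alpha * bigrams[(w2, w)] / unigrams[w2]
    return alpha * alpha * unigrams[w] / N

print(stupid_backoff("college", "I", "like"))   # trigram seen: 2/2 = 1.0
print(stupid_backoff("college", "do", "like"))  # backs off: 0.4 * 2/2 = 0.4
```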

Exercise 1
1. <s> do ?                 – two-gram
2. <s> I like Hendry ?      – two-gram
3. <s> Do I like ?          – use a trigram model
4. <s> Do I like Hendry <s> – four-gram

Exercise 2
Which of the following sentences is better? Find out using a bigram model.
1. <s> I like college <s>
2. <s> Do I like college? <s>


Thank You… ☺

• Q&A
• Suggestions / Feedback

Natural Language Processing


DSECL ZG530
Dr. Vijayalakshmi Anand, BITS Pilani

Content Sequence Plan
M1 Natural Language Understanding and Generation
M2 N-gram Language Modelling
M7 Vector semantics and Embedding
M3 Neural networks and Neural language Models
M4 Part-of-Speech Tagging
M5 Hidden Markov Models and MEMM
M6 Topic Modelling
M8 Grammars and Parsing
M9 Statistical Constituency Parsing
M10 Dependency Parsing
M11 Encoder-Decoder Models, Attention and Contextual Embedding
M12 Word sense disambiguation
M13 Semantic web ontology and Knowledge Graph
M14 Introduction to NLP Applications

Session 2
Date – 8 Jun 2024, Time – 1.40pm to 3.40pm
These slides are prepared by the instructor, with grateful acknowledgement of James Allen and many others who made their course materials freely available online.
Session contents

• Lexical Semantics
• Vector Semantics
• Word embeddings Lexical Semantics
– Frequency based
– Prediction based

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Stages in Natural Language Processing

Semantic analysis
What is semantic analysis?
Semantic analysis, in general, is a key method for assisting computers in comprehending the meaning of natural language text.


Applications
• Question answering:
  Q: "How tall is Mt. Everest?"
  Ans: "The official height of Mount Everest is 29029 feet"
  ("tall" is similar to "height")
• Plagiarism detection

What do words mean?
• In the N-gram or text classification methods we've seen so far
  – words are just strings (or indices wi in a vocabulary list)
  – That's not very satisfactory!

(CS447 Natural Language Processing, J. Hockenmaier)

Lemmas and senses

• A sense or “concept” is the meaning component of a word


• Lemmas can be polysemous (have multiple senses)

Lexical semantics mouse (N) : lemma

1. any of numerous small rodents...


2. a hand-operated device that controls a cursor...
sense

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Lexical semantics: relations
• Words have relations with each other
  – Synonymy
  – Antonymy
  – Similarity
  – Relatedness: semantic field and frame
  – Connotation
• More generally, a model of word meaning should allow us to draw inferences to address meaning-related tasks like question answering or dialogue.

Synonyms
Words that have the same meaning in some or all contexts:
  – filbert / hazelnut
  – couch / sofa
  – big / large
  – automobile / car
  – vomit / throw up
  – water / H2O
Two lexemes are synonyms if they can be successfully substituted for each other in all situations.

But
There are no examples of perfect synonymy
– Why should that be?
– Even if many aspects of meaning are identical
– Still may not preserve the acceptability based on notions of politeness, slang, register, genre, etc.
Example:
– Water and H2O

Synonymy is a relation between senses rather than words
Consider the words big and large. Are they synonyms?
– How big is that plane?
– Would I be flying on a large or small plane?
How about here:
– Miss Nelson, for instance, became a kind of big sister to Benjamin.
– ?Miss Nelson, for instance, became a kind of large sister to Benjamin.
Why?
– big has a sense that means being older, or grown up
– large lacks this sense


Antonyms
Senses that are opposites with respect to one feature of their meaning; otherwise, they are very similar!
– dark / light
– short / long
– hot / cold
– up / down
– in / out
More formally, antonyms can
– define a binary opposition or sit at opposite ends of a scale (long/short, fast/slow)
– be reverses: rise/fall, up/down

Relation: Similarity
• Words with similar meanings.

word1    word2       similarity
vanish   disappear   9.8
behave   obey        7.3
belief   impression  5.95
muscle   bone        3.65
modest   flexible    0.98
hole     agreement   0.3

SimLex-999 dataset (Hill et al., 2015)

Relation: Word relatedness
• Also called "word association"
• Words can be related in any way, perhaps via a semantic frame or field
Example:
• coffee, tea: similar
• coffee, cup: related, not similar

Semantic field
• Words that
  – cover a particular semantic domain
  – bear structured relations with each other
  – useful for topic modelling
Examples:
• hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
• restaurants: waiter, menu, plate, food, chef
• houses: door, roof, kitchen, family, bed


Semantic frame
• A semantic frame is a set of words that denote perspectives or participants in a particular type of event.
  – E.g. verbs like buy (the event from the perspective of the buyer), sell (from the perspective of the seller), pay (focusing on the monetary aspect), or nouns like buyer.
  – Frames have semantic roles (like buyer, seller, goods, money), and words in a sentence can take on these roles.

Connotation (sentiment)
• Words have affective meanings
  – Positive connotations (happy)
  – Negative connotations (sad)
• Evaluation (sentiment!)
  – Positive evaluation (great, love)
  – Negative evaluation (terrible, hate)

Connotation
Words seem to vary along 3 affective dimensions (Osgood et al., 1957):
– valence: the pleasantness of the stimulus
– arousal: the intensity of emotion provoked by the stimulus
– dominance: the degree of control exerted by the stimulus

            Word        Score    Word       Score
Valence     love        1.000    toxic      0.008
            happy       1.000    nightmare  0.005
Arousal     elated      0.960    mellow     0.069
            frenzy      0.965    napping    0.046
Dominance   powerful    0.991    weak       0.045
            leadership  0.983    empty      0.081

Values from the NRC VAD Lexicon (Mohammad 2018)

Vector Semantics


What does the recent English borrowing "ongchoi" mean?
• Suppose you see these sentences:
  – Ong choi is delicious sautéed with garlic.
  – Ong choi is superb over rice.
  – Ong choi leaves with salty sauces.
• And you've also seen these:
  – …spinach sautéed with garlic over rice
  – Chard stems and leaves are delicious
  – Collard greens and other salty leafy greens
• Conclusion:
  – Ongchoi is a leafy green like spinach, chard, or collard greens
  – We could conclude this based on words like "leaves" and "delicious" and "sauteed"

Vector Semantics
Vector semantics is the standard way to represent word meaning in NLP, helping us model many of the aspects of word meaning.

Word embeddings
• Vectors for representing words are called embeddings
• Each word = a vector (not just "good" or "worse")
• Defining meaning as a point in space based on distribution
• Similar words are "nearby in semantic space"

Word embeddings: why?
Consider sentiment analysis:
– With words, a feature is a word identity
  • Feature 5: 'The previous word was "terrible"'
  • Requires the exact same word to be in training and test
– With embeddings:
  • A feature is a word vector
  • 'The previous word was vector [35,22,17…]'
  • Now in the test set we might see a similar vector [34,21,14]
  • We can generalize to similar but unseen words!!!


Word embeddings: types

1. Frequency based Embedding


– A common baseline model Frequency based Embedding
– Sparse long vectors
– Words are represented by (a simple function of) the counts of nearby words
– dimensions corresponding to words in the vocabulary or documents in a collection
– Count Vector
– TF-IDF Vector
– Co-Occurrence Vector
2. Prediction based Embedding
– Dense vectors
– Representation is created by training a classifier to predict whether a word is likely to
appear nearby
– Word2vec: Skip-gram and CBOW
– GloVe

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Count vector
What is count vectorization?
• A count vectorizer is a method to convert text to numerical data.

Co-occurrence matrix
• We represent how often a word occurs in a document
  • Term-document matrix
• Or how often a word occurs with another word
  • Term-term matrix (word-word co-occurrence matrix or word-context matrix)
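One common implementation of count vectorization is scikit-learn's CountVectorizer; a minimal sketch on the three toy documents used later in the TF-IDF example (the get_feature_names_out call assumes a recent scikit-learn version):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "apple banana apple",
    "banana orange",
    "apple orange orange banana",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # ['apple' 'banana' 'orange']
print(X.toarray())
# [[2 1 0]
#  [0 1 1]
#  [1 1 2]]
```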


Term-Document Matrix
• Each cell: count of word w in document d
• Each document is a count vector in ℕ^V: a column below
• Two documents are similar if their vectors are similar

The words in a term-document matrix
• Each word is a count vector in ℕ^D: a row below
• Two words are similar if their vectors are similar


Term-context matrix for word similarity
• Two words are similar in meaning if their context vectors are similar
• The context could be the document, or smaller contexts, generally a window around the word, for example 4 words to the left and 4 words to the right

Term frequency (tf)
▪ The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d.
▪ We want to use tf when computing query-document match scores. But how?
▪ Raw term frequency is not what we want because:
  ▪ A document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term.
  ▪ But not 10 times more relevant.
▪ Relevance does not increase proportionally with term frequency.

Frequency in document vs. frequency in collection
▪ In addition to term frequency (the frequency of the term in the document) . . .
▪ . . . we also want to use the frequency of the term in the collection for weighting and ranking.

Desired weight for rare terms
▪ Rare terms are more informative than frequent terms.
▪ Consider a term in the query that is rare in the collection (e.g., ARACHNOCENTRIC).
▪ A document containing this term is very likely to be relevant.
▪ → We want high weights for rare terms like ARACHNOCENTRIC.

Desired weight for frequent terms
▪ Frequent terms are less informative than rare terms.
▪ Consider a term in the query that is frequent in the collection (e.g., GOOD, INCREASE, LINE).
▪ A document containing this term is more likely to be relevant than a document that doesn't . . .
▪ . . . but words like GOOD, INCREASE and LINE are not sure indicators of relevance.
▪ → For frequent terms like GOOD, INCREASE and LINE, we want positive weights . . .
▪ . . . but lower weights than for rare terms.

Document frequency
▪ We want high weights for rare terms like ARACHNOCENTRIC.
▪ We want low (positive) weights for frequent words like GOOD, INCREASE and LINE.
▪ We will use document frequency to factor this into computing the matching score.
▪ The document frequency is the number of documents in the collection that the term occurs in.

idf weight
▪ df(t) is the document frequency: the number of documents that t occurs in.
▪ df(t) is an inverse measure of the informativeness of term t.
▪ We define the idf weight of term t as follows:
  idf(t) = log10( N / df(t) )
  (N is the number of documents in the collection.)
▪ idf(t) is a measure of the informativeness of the term.
▪ We use [log N/df(t)] instead of [N/df(t)] to "dampen" the effect of idf.

Examples for idf
Compute idf(t) using the formula above (here N = 1,000,000):

term        df(t)      idf(t)
calpurnia   1          6
animal      100        4
sunday      1,000      3
fly         10,000     2
under       100,000    1
the         1,000,000  0

Effect of idf on ranking
▪ idf affects the ranking of documents for queries with at least two terms.
▪ For example, in the query "arachnocentric line", idf weighting increases the relative weight of ARACHNOCENTRIC and decreases the relative weight of LINE.
▪ idf has little effect on ranking for one-term queries.

tf-idf weighting
▪ The tf-idf weight of a term is the product of its tf weight and its idf weight.
▪ Best known weighting scheme in information retrieval
▪ Note: the "-" in tf-idf is a hyphen, not a minus sign!
▪ Alternative names: tf.idf, tf x idf

Thank You… ☺

• Q&A
• Suggestions / Feedback

Natural Language Processing


DSECL ZG30
Dr.Vijayalakshmi Anand
BITS Pilani BITS-Pilani
Pilani Campus

BITS Pilani, Pilani Campus


Content Sequence Plan
M1 Natural Language Understanding and Generation
M2 N-gram Language Modelling
M7 Vector semantics and Embedding
M3 Neural networks and Neural language Models
M4 Part-of-Speech Tagging
M5 Hidden Markov Models and MEMM
M6 Topic Modelling
M8 Grammars and Parsing
M9 Statistical Constituency Parsing
M10 Dependency Parsing
M11 Encoder-Decoder Models, Attention and Contextual Embedding
M12 Word sense disambiguation
M13 Semantic web ontology and Knowledge Graph
M14 Introduction to NLP Applications

Session 2
Date – 16 Dec 2023, Time – 1.40pm to 3.40pm
These slides are prepared by the instructor, with grateful acknowledgement of James Allen and many others who made their course materials freely available online.

Sparse versus dense vectors
• tf-idf vectors are
  – long (length |V| = 20,000 to 50,000)
  – sparse (most elements are zero)
• Alternative: learn vectors which are
  – short (length 50 to 1000)
  – dense (most elements are non-zero)

Prediction based Embedding


TF-IDF example
Corpus:
Document 1: "apple banana apple"
Document 2: "banana orange"
Document 3: "apple orange orange banana"

Term frequency (tf):
Document 1: apple 2/3 = 0.667, banana 1/3 = 0.333, orange 0
Document 2: banana 1/2 = 0.5, orange 1/2 = 0.5, apple 0
Document 3: apple 1/4 = 0.25, orange 2/4 = 0.5, banana 1/4 = 0.25

Inverse document frequency (idf), with N = 3 documents:
apple:  idf = log(3/2) = log(1.5) ≈ 0.176
banana: idf = log(3/3) = log(1) = 0
orange: idf = log(3/2) = log(1.5) ≈ 0.176

tf-idf = tf × idf:
Document 1: apple 0.667 × 0.176 ≈ 0.117, banana 0, orange 0
Document 2: banana 0, orange 0.5 × 0.176 ≈ 0.088, apple 0
Document 3: apple 0.25 × 0.176 ≈ 0.044, orange 0.5 × 0.176 ≈ 0.088, banana 0

Document vectors (apple, banana, orange):
Document 1: [0.117, 0, 0]
Document 2: [0, 0, 0.088]
Document 3: [0.044, 0, 0.088]

Word embeddings
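A minimal sketch that reproduces the numbers above, using tf = count/length and idf = log10(N/df) as on the slide:

```python
import math

docs = {
    "d1": "apple banana apple".split(),
    "d2": "banana orange".split(),
    "d3": "apple orange orange banana".split(),
}
vocab = ["apple", "banana", "orange"]
N = len(docs)

# Document frequency and idf = log10(N / df).
df  = {t: sum(t in d for d in docs.values()) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

for name, words in docs.items():
    tf = {t: words.count(t) / len(words) for t in vocab}
    print(name, [round(tf[t] * idf[t], 3) for t in vocab])
# d1 [0.117, 0.0, 0.0]
# d2 [0.0, 0.0, 0.088]
# d3 [0.044, 0.0, 0.088]
```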

Common methods for getting short dense vectors
• "Neural Language Model"-inspired models
  – Word2vec (skip-gram, CBOW), GloVe
• Singular Value Decomposition (SVD)
  – A special case of this is called LSA – Latent Semantic Analysis
• Alternative to these "static embeddings":
  – Contextual embeddings (ELMo, BERT)
  – Compute distinct embeddings for a word in its context
  – Separate embeddings for each token of a word

Word2vec – what is word2vec?
• An unsupervised NLP method developed by Google in 2013 (Mikolov et al. 2013)
• Quantifies the relationship between words
• Framework for learning word vectors. Idea:
  – We have a large corpus ("body") of text: a long list of words
  – Every word in a fixed vocabulary is represented by a vector
  – Go through each position t in the text, which has a center word c and context ("outside") words o
  – Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
  – Keep adjusting the word vectors to maximize this probability

http://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture01-wordvecs1.pdf


Word2vec – an example (and types of word2vec)
1. Corpus:
   "I love natural language processing and machine learning"
2. Vocabulary:
   ["I", "love", "natural", "language", "processing", "and", "machine", "learning"]
3. Vector representation:
   "I" -> [0.1, 0.2, 0.3], "love" -> [0.2, 0.3, 0.4], "natural" -> [0.3, 0.4, 0.5], "language" -> [0.4, 0.5, 0.6], "processing" -> [0.5, 0.6, 0.7], "and" -> [0.6, 0.7, 0.8], "machine" -> [0.7, 0.8, 0.9], "learning" -> [0.8, 0.9, 1.0]
4. Context window: define a context window size; assume a window size of 2 (two words on each side of the center word).
5. Probability calculation:
   Center word: "natural"; context words: ["love", "language"]
6. Calculate the probability of each context word given the center word using the word vectors.
7. Adjust the word vectors.
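One way to train such vectors in practice is Gensim's Word2Vec (Gensim appears in the tools list earlier); a minimal sketch on the toy corpus above, assuming Gensim 4.x parameter names (sg=1 selects skip-gram, sg=0 selects CBOW):

```python
from gensim.models import Word2Vec

# Tokenized toy corpus from the slide.
sentences = [["i", "love", "natural", "language", "processing",
              "and", "machine", "learning"]]

# window=2 means two context words on each side of the centre word.
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["natural"])                       # learned dense vector
print(model.wv.most_similar("natural", topn=3))  # nearest words in the toy space
```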

Basic word representation
V = [a, aaron, …, zulu, <UNK>]
1-hot representation

Featurized representation: word embedding

         Man(5391)  Woman(9853)  King(4914)  Queen(7157)  Apple(456)  Orange(6257)
Gender   -1         1            -0.95       0.97         0.00        0.01
Royal    0.01       0.02         0.93        0.95         -0.01       0.00
Age      0.03       0.02         0.70        0.69         0.03        -0.02
Food     0.09       0.01         0.02        0.01         0.95        0.97

I want a glass of orange ____
I want a glass of apple ____

[Mikolov et al., 2013, Linguistic regularities in continuous space word representations] (Andrew Ng)


Skip-Gram Classifier
(assuming a +/- 2 word window)
Training data

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4

Goal: train a classifier that is given a candidate (word, context) pair


(apricot, jam)
(apricot, aardvark)

And assigns each pair a probability:
P(+|w, c)
P(−|w, c) = 1 − P(+|w, c)

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Similarity is computed from the dot product
• Remember: two vectors are similar if they have a high dot product
  – Cosine is just a normalized dot product
• So: Similarity(w, c) ∝ w ∙ c
• We'll need to normalize to get a probability
  – (cosine isn't a probability either)


Turning dot products into probabilities
• Sim(w, c) ≈ w ∙ c
• To turn this into a probability, we'll use the sigmoid from logistic regression:
  P(+|w, c) = σ(c ∙ w) = 1 / (1 + exp(-c ∙ w))

How the Skip-Gram Classifier computes P(+|w, c)
This is for one context word, but we have lots of context words. We'll assume independence and just multiply them:
  P(+|w, c1:L) = Π_{i=1..L} σ(ci ∙ w)
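A minimal numpy sketch of P(+|w, c) = σ(c ∙ w) and of the product over several context words; the embedding values are assumed toy numbers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed toy embeddings: w is the target-word vector,
# the others are context-word vectors (rows of the context matrix C).
w      = np.array([ 0.5, -0.2,  0.8])   # apricot
c_jam  = np.array([ 0.4, -0.1,  0.9])   # real context word
c_aard = np.array([-0.6,  0.7, -0.3])   # noise word

print(sigmoid(w @ c_jam))    # higher P(+ | apricot, jam)
print(sigmoid(w @ c_aard))   # lower  P(+ | apricot, aardvark)

# Several context words: assume independence and multiply.
context = [c_jam, np.array([0.3, 0.0, 0.7])]
print(np.prod([sigmoid(w @ c) for c in context]))
```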

Skip-Gram Training data
…lemon, a [tablespoon of apricot jam, a] pinch…
              c1       c2   [target] c3  c4

Skip-Gram Training data
• For every + example select k negative examples
• For each positive example we'll grab k negative examples, sampling by frequency
• The paper (Mikolov et al., 2013) says that k = 2 to 5 works for large data sets, and k = 5 to 20 for small data sets.


Skip-Gram Training data Skip-gram parameters

• SGNS learns two sets of


embeddings
– Target embeddings matrix W
…lemon, a [tablespoon of apricot jam, a] – Context embedding matrix C
pinch…
• It's common to just add them
c1 c2 [target] c3 c4
together, representing word i as
the vector wi + ci
• Thus the parameters we need to
learn are two matrices W and C,
23
each containing an embedding
for every one of the |V| words in
the vocabulary |V|

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Word2vec training: how to learn vectors (log loss function)
• Given the set of positive and negative training instances, and an initial set of embedding vectors
• The goal of learning is to adjust those word vectors such that we:
  – Maximize the similarity of the target word, context word pairs (w, cpos) drawn from the positive data
  – Minimize the similarity of the (w, cneg) pairs drawn from the negative data.


Loss function for one w with cpos , cneg1 ...cnegk Learning the classifier

• Maximize the similarity of the target with the actual context • How to learn?
words, and minimize the similarity of the target with the k – Stochastic gradient descent!
negative sampled non-neighbor words.
• We’ll adjust the word weights to
– make the positive pairs more likely
– and the negative pairs less likely,
– over the entire training set.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Reminder: gradient descent
One SGD step: w(t+1) = w(t) - η · d/dw L(f(x; w), y)

Intuition of one step of gradient descent
• Tries to shift embeddings
• So the target embeddings (here for apricot) are closer to (have a higher dot product with) context embeddings for nearby words (here jam)
• And further from (have a lower dot product with) context embeddings for noise words that don't occur nearby (here Tolstoy and matrix).

Update equation in SGD
The derivatives of the loss function
Start with randomly initialized C and W matrices, then incrementally do updates.
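A minimal numpy sketch of one SGD step on the skip-gram negative-sampling loss, using the gradient forms given in Jurafsky & Martin; the vectors and the learning rate are assumed toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w, c_pos, c_negs):
    # L = -log sigma(c_pos.w) - sum_k log sigma(-c_neg_k.w)
    return -np.log(sigmoid(c_pos @ w)) - np.log(sigmoid(-(c_negs @ w))).sum()

rng = np.random.default_rng(0)
d = 4
w      = rng.normal(scale=0.5, size=d)       # target embedding (row of W)
c_pos  = rng.normal(scale=0.5, size=d)       # positive context embedding (row of C)
c_negs = rng.normal(scale=0.5, size=(2, d))  # k = 2 sampled noise embeddings
eta = 0.1                                    # learning rate

print("loss before:", loss(w, c_pos, c_negs))

# Gradients of the negative-sampling loss with respect to each vector.
g_cpos = (sigmoid(c_pos @ w) - 1.0) * w
g_cneg = sigmoid(c_negs @ w)[:, None] * w
g_w    = (sigmoid(c_pos @ w) - 1.0) * c_pos + (sigmoid(c_negs @ w)[:, None] * c_negs).sum(axis=0)

# One SGD update: theta(t+1) = theta(t) - eta * gradient
w, c_pos, c_negs = w - eta * g_w, c_pos - eta * g_cpos, c_negs - eta * g_cneg

print("loss after:", loss(w, c_pos, c_negs))   # typically lower after one small step
```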

Skip-gram training example Embedding example

1.Corpus:
"cats like milk"
"dogs like bones”
2.V = ["cats", "like", "milk", "dogs", "bones“]
3: Initialize Embedding Matrices[assumed=2]
4Initialize two matrices 𝑊 and 𝐶 with random values:
𝑊=[cats like milk dogs bones] =

𝐶=[cats like milk dogs bones]=

https://jalammar.github.io/illustrated-word2vec/

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Contd..
5. Generate training data:
   Assume context window size = 1
   ("cats", "like"), ("like", "cats"), ("like", "milk"), ("milk", "like"), ... ("dogs", "like"), ("like", "dogs"), ("like", "bones"), ("bones", "like")
6. Negative sampling:
   Positive example: ("cats", "like")
   Negative examples: ("cats", "bones"), ("cats", "dogs")
7. Train the model

Summary: How to learn word2vec (skip-gram) embeddings
• Start with V random d-dimensional vectors as initial embeddings
• Train a classifier based on embedding similarity
  – Take a corpus and take pairs of words that co-occur as positive examples
  – Take pairs of words that don't co-occur as negative examples
  – Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance
  – Throw away the classifier code and keep the embeddings.

Skip gram example

Contd..
Training model Model architecture

1. Create model Architecture


2. Forward Propagation
3. Error Calculation
4. Weight tuning using backward pass

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Weighted matrix

Skip gram negative example

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


How does negative sampling work
Contd..
• skip-gram with negative sampling (SGNS) • skip-gram with negative sampling (SGNS)
• N = 2 , C=4 (Here C is the context window) • N = 2 , C=4 (Here C is the context window)

Source Credit : https://jalammar.github.io/illustrated-word2vec/ Source Credit : https://jalammar.github.io/illustrated-word2vec/

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Learning skip-gram embeddings
Train using gradient descent
• Just as in logistic regression, the learning algorithm starts with randomly initialized W and C matrices, and then walks through the training corpus using gradient descent to update W and C so as to maximize the objective criteria.
• Thus matrices W and C function as the parameters θ that logistic regression is tuning.
• Once embeddings are learned, we'll have two embeddings for each word wi: ti and ci.
  – We can choose to throw away the C matrix and just keep W, in which case each word i will be represented by the vector ti.
  – Alternatively we can add the two embeddings together, using the summed embedding ti + ci as the new d-dimensional embedding.

CBOW [Continuous Bag Of Words]
What is CBOW?
• Predict a target word given the context words in a sentence
• E.g. "The product is really good"


Working of CBOW with an example
Corpus:
  The product is really good
  The product is wonderful
  The product is awful
Step 1: combine the sentences
  "The product is really good The product is wonderful The product is awful"
Step 2: select the window size

Contd..
Step 3: get a one-hot encoding for each word
Step 4: build the training data
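A minimal sketch of steps 2 to 4: generating (context, target) training pairs with one-hot encodings for this toy corpus, assuming a window size of 1:

```python
# Build (context words -> target word) CBOW training pairs with one-hot vectors.
corpus = [
    "the product is really good",
    "the product is wonderful",
    "the product is awful",
]
window = 1   # one word on each side of the target

tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

pairs = []
for sent in tokens:
    for i, target in enumerate(sent):
        context = sent[max(0, i - window): i] + sent[i + 1: i + 1 + window]
        pairs.append(([one_hot(c) for c in context], one_hot(target)))

print(len(pairs), "training pairs")
print(pairs[1])   # context one-hots for ['the', 'is'], target one-hot for 'product'
```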


Contd.. Cond..

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Word Embedding matrix

Thank you

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


BITS Pilani
Pilani Campus

Natural Language Processing
DSECL ZG530 – Session 6: POS tagging
Prof. Vijayalakshmi, BITS Pilani
Date – 29-06-2024, Time – 1.40 pm to 3.40 pm
These slides are prepared by the instructor, with grateful acknowledgement of James Allen and many others who made their course materials freely available online.

Session 6: POS tagging
❖ What is part-of-speech tagging
❖ Why POS tagging is required
❖ Applications
❖ (Mostly) English word classes
❖ Tag sets
  • Penn Treebank
❖ Part-of-speech tagging: different approaches
  • Rule based
  • Stochastic based
    ❑ HMM part-of-speech tagging
  • Hybrid systems

What is POS Tagging
• The process of assigning a part-of-speech or lexical class marker to each word in a sentence (and all sentences in a collection).
Process
❖ List all possible tags for each word in the sentence
❖ Eg:
❖ Choose the best suitable tag sequence

Why POS?
❖ First step in many applications
❖ POS tells us a lot about a word
❖ Pronunciation depends on POS
❖ To find named entities
❖ Stemming

Applications of POS
❖ Parsing
  Recovering syntactic structures requires correct POS tags
  Eg:

Information retrieval
❖ Word disambiguation
  The ability to determine which meaning of a word is activated by the use of the word in a particular context.
  Eg:
  • I can hear bass sound.
  • He likes to eat grilled bass.


Question answering systems
❖ Automatically answer questions posed by humans in a natural language.
❖ Eg: Does he eat bass?

Why POS tagging is hard
❖ Ambiguity
  Eg:
  • He will race/VB the car.
  • When will the race/NOUN end?
  • The boat floated/VBD … down the river sank
❖ On average, ~2 parts of speech for each word

Parts of Speech
• 8 (ish) traditional parts of speech
  – Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
  – Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
  – Lots of debate within linguistics about the number, nature, and universality of these
    • We'll completely ignore this debate.

POS examples
• N    noun          chair, bandwidth, pacing
• V    verb          study, debate, munch
• ADJ  adjective     purple, tall, ridiculous
• ADV  adverb        unfortunately, slowly
• P    preposition   of, by, to
• PRO  pronoun       I, me, mine
• DET  determiner    the, a, that, those


POS Tagging
• Assigning a label to something or someone
• The process of assigning a part-of-speech to each word in a collection.

  WORD    tag
  the     DET
  koala   N
  put     V
  the     DET
  keys    N
  on      P
  the     DET
  table   N

Open and Closed Classes
• Open class: new ones can be created all the time
  – English has 4: nouns, verbs, adjectives, adverbs
  – Many languages have these 4, but not all!
• Closed class: a small fixed membership
  – Prepositions: of, in, by, …
  – Auxiliaries: may, can, will, had, been, …
  – Pronouns: I, you, she, mine, his, them, …
  – Usually function words (short common words which play a role in grammar)

Closed Class Words
Examples:
– prepositions: on, under, over, …
– particles: up, down, on, off, …
– determiners: a, an, the, …
– pronouns: she, who, I, …
– conjunctions: and, but, or, …
– auxiliary verbs: can, may, should, …
– numerals: one, two, three, third, …

Tagsets
• A set of all POS tags used in a corpus is called a tag set
• There are various standard tag sets to choose from; some have a lot more tags than others
• The choice of tagset is based on the application
• Accurate tagging can be done with even large tag sets


Penn Treebank
❖ Background
  • From the early 90s
  • Developed at the University of Pennsylvania
  • (Marcus, Santorini and Marcinkiewicz, 1993)
❖ Size
  • 40,000 training sentences
  • 2,400 test sentences
❖ Genre
  • Mostly Wall Street Journal news stories and some spoken conversations
❖ Importance
  • Helped launch modern automatic parsing methods.

Penn Treebank POS Tagset

Just for Fun…
• Using Penn Treebank tags, tag the following sentence from the Brown Corpus:
• The grand jury commented on a number of other topics.

Just for Fun…
• Answer:
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
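For comparison, NLTK's off-the-shelf tagger produces Penn-Treebank-style tags for the same sentence; a sketch (download package names can vary slightly across NLTK versions, and the exact tags depend on the pretrained model):

```python
import nltk

# One-time downloads for the tokenizer and the Penn-Treebank-style tagger.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The grand jury commented on a number of other topics."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
#       ('on', 'IN'), ('a', 'DT'), ('number', 'NN'), ('of', 'IN'),
#       ('other', 'JJ'), ('topics', 'NNS'), ('.', '.')]
```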


POS Tagging – Choosing a Tag set
❖ There are so many parts of speech and potential distinctions we can draw
❖ To do POS tagging, we need to choose a standard set of tags to work with
❖ Could pick very coarse tagsets
  ❖ N, V, Adj, Adv.
❖ More commonly used is a finer-grained set, the "Penn Treebank tagset", with 45 tags
  ❖ PRP$, WRB, WP$, VBG

Approaches to POS Tagging
❖ Rule-based approach
  • Uses handcrafted sets of rules to tag input sentences
❖ Statistical approaches
  • Use a training corpus to compute the probability of a tag in a context – HMM tagger
❖ Hybrid systems (e.g. Brill's transformation-based learning)

Rule based POS tagging

❖First stage − In the first stage, it uses a


dictionary to assign each word a list of
potential parts-of-speech.
❖Second stage − In the second stage, it uses Hidden Markov model
large lists of hand-written disambiguation
rules to sort down the list to a single part-of-
speech for each word.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


MARKOV CHAIN: WEATHER
Markov model /Markov chain EXAMPLE

A Markov process is a process that generates a ❖Design a Markov Chain to


sequence of outcomes in such a way that the predict the weather of
probability of the next outcome depends only tomorrow using previous
on the current outcome and not on what information of the past days.
happened earlier. ❖ Our model has only 3 states:
𝑆 = 𝑆1, 𝑆2, 𝑆3
❖𝑆1 = 𝑆𝑢𝑛𝑛𝑦 , 𝑆2 = 𝑅𝑎𝑖𝑛𝑦, 𝑆3
= 𝐶𝑙𝑜𝑢𝑑𝑦.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Contd.. Contd..

❖ State sequence notation: q1, q2, q3, q4, q5, …, where qi ∈ {Sunny, Rainy, Cloudy}.

❖ Markov Property: P(qt+1 | q1, …, qt) = P(qt+1 | qt), i.e., the next state depends only on the current state.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Example Example2

Given that today is Sunny, what’s the probability Assume that yesterday’s weather was Rainy, and
that tomorrow is Sunny and the next day Rainy? today is Cloudy, what is the probability that
tomorrow will be Sunny?

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

WHAT IS A HIDDEN MARKOV


Markov chain Vs HMM
MODEL (HMM)?

❖A Hidden Markov Model, is a stochastic model where Markov chain HMM


the states of the model are hidden. Each state can
emit an output which is observed.
❖Imagine: You were locked in a room for several days
and you were asked about the weather outside. The
only piece of evidence you have is whether the
person who comes into the room bringing your daily
meal is carrying an umbrella or not.
• What is hidden? Sunny, Rainy, Cloudy
• What can you observe? Umbrella or Not
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Hidden Markov Models (Formal)
• States Q = q1, q2, …, qN
• Observations O = o1, o2, …, oN
• Transition probabilities
  – Transition probability matrix A = {aij}
    aij = P(qt = j | qt−1 = i), 1 ≤ i, j ≤ N
• Emission probability / output probability
  – Output probability matrix B = {bi(k)}
    bi(k) = P(Xt = ok | qt = i)
• Special initial probability vector π
    πi = P(q1 = i), 1 ≤ i ≤ N

First-Order HMM Assumptions
• Markov assumption: probability of a state depends only on the state that precedes it
    P(qi | q1 … qi−1) = P(qi | qi−1)

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

How to build a second-order HMM?
• Second-order HMM
  – Current state only depends on previous 2 states
• Example
  – Trigram model over POS tags
  – P(t) = Π_{i=1}^{n} P(ti | ti−1, ti−2)
  – P(w, t) = Π_{i=1}^{n} P(ti | ti−1, ti−2) P(wi | ti)

Markov Chain for Weather
• What is the probability of 4 consecutive warm days?
• Sequence is warm-warm-warm-warm
• And state sequence is 3-3-3-3
• P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 x (0.6)³ = 0.0432
– 3a33a33a33a33 = 0.2 x (0.6)3 = 0.0432

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
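A minimal sketch of this sequence-probability computation: the probability of a state sequence is the initial probability of the first state times the product of the transition probabilities. The warm initial probability 0.2 and the warm-to-warm transition 0.6 are taken from the example above; the remaining entries of the transition matrix are assumed values for illustration.

```python
# Markov chain sketch: P(state sequence) = pi[s1] * product of a[s_{i-1}][s_i]
pi = {"warm": 0.2, "cold": 0.8}                      # only 0.2 is given on the slide; 0.8 is assumed
A = {"warm": {"warm": 0.6, "cold": 0.4},             # a33 = P(warm|warm) = 0.6 from the slide
     "cold": {"warm": 0.5, "cold": 0.5}}             # this row is assumed for illustration

def sequence_prob(states):
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

print(sequence_prob(["warm"] * 4))   # 0.2 * 0.6**3 = 0.0432
```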


POS Tagging as Sequence
Probabilistic Sequence Models
Classification

• We are given a sentence (an “observation” • Probabilistic sequence models allow


or “sequence of observations”) integrating uncertainty over multiple,
– Secretariat is expected to race tomorrow interdependent classifications and collectively
• What is the best sequence of tags that determine the most likely global assignment.
corresponds to this sequence of • standard model
observations?
– Hidden Markov Model (HMM)
• Probabilistic view
– Consider all possible sequences of tags
– Out of this universe of sequences, choose the
tag sequence which is most probable given the
observation sequence of n words w1…wn.
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

How you predict the tags? Statistical POS Tagging

• We want, out of all sequences of n tags t1…tn


• Two types of information are useful the single tag sequence such that
– Relations between words and tags P(t1…tn|w1…wn) is highest.
– Relations between tags and tags
• DT NN, DT JJ NN…

• Hat ^ means “our estimate of the best one”


• Argmaxx f(x) means “the x such that f(x) is
maximized”
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Statistical POS Tagging
❖ This equation should give us the best tag sequence
❖ But how to make it operational? How to compute this value?
❖ Intuition of Bayesian inference:
  • Use Bayes rule to transform this equation into a set of probabilities that are easier to compute (and give the right answer)

Using Bayes Rule
Best tag sequence = argmax over T1,…Tn of PROB(T1,…Tn | w1,…wn)
Estimating the above takes far too much data. Need to do some reasonable approximations.
Bayes Rule:
PROB(A | B) = PROB(B | A) * PROB(A) / PROB(B)
Rewriting:
PROB(T1,…Tn | w1,…wn) = PROB(w1,…wn | T1,…Tn) * PROB(T1,…Tn) / PROB(w1,…wn)
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Contd..
= PROB(w1,…wn | T1,…Tn) * PROB(T1,…Tn) / PROB(w1,…wn)
∝ PROB(w1,…wn | T1,…Tn) * PROB(T1,…Tn)
(the denominator PROB(w1,…wn) is the same for every candidate tag sequence, so it can be dropped)

Independent assumptions
So, we want to find the sequence of tags that maximizes
PROB(T1,…Tn) * PROB(w1,…wn | T1,…Tn)
❖ For tags – use bigram probabilities
  PROB(T1,…Tn) ≈ Π_{i=1,n} PROB(Ti | Ti-1)
  PROB(ART N V N) ≈ PROB(ART | Φ) * PROB(N | ART) * PROB(V | N) * PROB(N | V)
❖ For the second probability: assume a word's tag is independent of the words around it:
  PROB(w1,…wn | T1,…Tn) ≈ Π_{i=1,n} PROB(wi | Ti)

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


POS formula
• Find the sequence of tags that maximizes:
  Π_{i=1,n} PROB(Ti | Ti−1) * PROB(wi | Ti)

POS Tagging using HMM
• States T = t1, t2, …, tN
• Observations W = w1, w2, …, wN
  – Each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
• Transition probabilities
  – Transition probability matrix A = {aij}
    aij = P(ti = j | ti−1 = i), 1 ≤ i, j ≤ N
• Observation likelihoods
  – Output probability matrix B = {bi(k)}
    bi(k) = P(wi = vk | ti = i)
• Special initial probability vector π
    πi = P(t1 = i), 1 ≤ i ≤ N
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Two Kinds of Probabilities Two Kinds of Probabilities

1. State transition probabilities -- p(ti|ti-1) 1. Tag transition probabilities -- p(ti|ti-1)


– State-to-state transition probabilities – Determiners likely to precede adjs and nouns
• That/DT flight/NN
• The/DT yellow/JJ hat/NN
• So we expect P(NN|DT) and P(JJ|DT) to be high
2. Observation/Emission probabilities -- p(wi|ti) – Compute P(NN|DT) by counting in a labeled
corpus:
– Probabilities of observing various values at a
given state

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Two Kinds of Probabilities
2. Word likelihood/emission probabilities p(wi|ti)
  – VBZ (3sg Pres Verb) likely to be “is”
  – Compute P(is|VBZ) by counting in a labeled corpus:

Sample Probabilities

Bigram Probabilities
Bigram(Ti, Tj)   Count(i, i+1)   Prob(Tj|Ti)
• φ,ART          213             .71 (213/300)
• φ,N            87              .29 (87/300)
• φ,V            10              .03 (10/300)
• ART,N          633             1
• N,V            358             .32
• N,N            108             .10
• N,P            366             .33
• V,N            134             .37
• V,P            150             .42
• V,ART          194             .54
• P,ART          226             .62
• P,N            140             .38
• V,V            30              .08

Tag Frequencies
Φ 300   ART 633   N 1102   V 358   P 366
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Sample Lexical Generation Probabilities Two types of stochastic POS tagging

• P(an | ART) .36 ❖Word frequency tagging


• P(an | N) .001 • based on the probability that a word occurs
• P(flies | N) .025 with a particular tag.
• P(flies | V) .076 ❖Tag sequence probabilities
• P(time | N) .063 • calculates the probability of a given sequence
of tags occurring.
• P(time | V) .012
• P(arrow | N) .076
• P(like | N) .012

• P(like | V) .10 BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
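Putting the bigram tag probabilities and the lexical generation probabilities together, a short sketch of the joint score Π P(Ti | Ti−1) P(wi | Ti); the fragment "flies like an" tagged N V ART is just an illustration built from the values listed above (φ is written as PHI in the code).

```python
# Tag-bigram and lexical probabilities copied from the slides (only the entries needed here)
tag_bigram = {("PHI", "N"): 0.29, ("N", "V"): 0.32, ("V", "ART"): 0.54}
lexical = {("flies", "N"): 0.025, ("like", "V"): 0.10, ("an", "ART"): 0.36}

def joint_prob(words, tags):
    """Product over i of P(T_i | T_{i-1}) * P(w_i | T_i), starting from the empty tag PHI."""
    p = 1.0
    prev = "PHI"
    for w, t in zip(words, tags):
        p *= tag_bigram[(prev, t)] * lexical[(w, t)]
        prev = t
    return p

print(joint_prob(["flies", "like", "an"], ["N", "V", "ART"]))  # roughly 4.5e-05
```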


Example1-Some Data on race Disambiguating to race tomorrow

• Secretariat/NNP is/VBZ expected/VBN to/TO


race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB
the/DT reason/NN for/IN the/DT race/NN
for/IN outer/JJ space/NN
• How do we pick the right tag for race in new
data?

1/18/2025 53 1/18/2025 54
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Example2:Statistical POS tagging-


Look Up the Probabilities Whole tag sequence

• P(NN|TO) = .00047 • What is the most likely sequence of tags for


• P(VB|TO) = .83 the given sequence of words w
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO)P(NR|VB)P(race|VB) = .00000027
• P(NN|TO)P(NR|NN)P(race|NN)=.00000000032
• So we (correctly) choose the verb reading

P(DT JJ NN | a smart dog)
= P(DT JJ NN, a smart dog) / P(a smart dog)
∝ P(DT JJ NN) * P(a smart dog | DT JJ NN)
1/18/2025 55
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
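The two competing readings for "race" can be checked in a couple of lines; the products reproduce the numbers on the slide above.

```python
# Verb reading: to/TO race/VB tomorrow/NR
p_vb = 0.83 * 0.00012 * 0.0027        # P(VB|TO) * P(race|VB) * P(NR|VB)
# Noun reading: to/TO race/NN tomorrow/NR
p_nn = 0.00047 * 0.00057 * 0.0012     # P(NN|TO) * P(race|NN) * P(NR|NN)
print(p_vb, p_nn)                     # ~2.7e-07 vs ~3.2e-10, so the verb reading wins
```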
Tag Transition Probability
• Joint probability P(t, w) = P(t) P(w | t)
• P(t) = P(t1, t2, …, tn)
       = P(t1) P(t2 | t1) P(t3 | t2, t1) … P(tn | t1 … tn−1)
       ≈ P(t1) P(t2 | t1) P(t3 | t2) … P(tn | tn−1)   (Markov assumption)
       = Π_{i=1}^{n} P(ti | ti−1)
  e.g., P(DT JJ NN) = P(DT | start) P(JJ | DT) P(NN | JJ)
• Bigram model over POS tags!
  (similarly, we can define an n-gram model over POS tags, usually called a higher-order HMM)

Word likelihood / Emission Probability
• Joint probability P(t, w) = P(t) P(w | t)
• Assume words only depend on their POS tag (independence assumption)
• P(w | t) ≈ P(w1 | t1) P(w2 | t2) … P(wn | tn) = Π_{i=1}^{n} P(wi | ti)
  i.e., P(a smart dog | DT JJ NN) = P(a | DT) P(smart | JJ) P(dog | NN)
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Put them together
• Joint probability P(t, w) = P(t) P(w | t)
• Two independence assumptions
  – Approximate P(t) by a bi- (or N-) gram model
  – Assume each word depends only on its POS tag
• P(t, w) = P(t1) P(t2 | t1) P(t3 | t2) … P(tn | tn−1) · P(w1 | t1) P(w2 | t2) … P(wn | tn)
          = Π_{i=1}^{n} P(wi | ti) P(ti | ti−1)
  e.g., P(a smart dog, DT JJ NN)
      = P(a | DT) P(smart | JJ) P(dog | NN) · P(DT | start) P(JJ | DT) P(NN | JJ)
  (here P(t1 | start) is the initial probability p(t1))

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Example1-HMM Emission probabilities

Will can spot Mary


M V N N
Will Can spot Mary
N M V N

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Transition probabilities

Correct tag sequence: Will/N can/M spot/V Mary/N

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Thank You… ☺

• Q&A
• Suggestions / Feedback

Natural Language Processing

Dr. Vijayalakshmi Anand


BITS Pilani
Pilani Campus BITS Pilani

BITS Pilani, Pilani Campus

Neural network -Introduction Neural network -Introduction

• Revision –Neural Network (Forward and backward pass) • What is neural network
Similar to the human brain that has neurons interconnected to one another,
• Application – Sentiment Analysis artificial neural networks also have neurons that are interconnected to one another
– Using logistic regression in various layers of the networks. These neurons are known as nodes.
– Using Neural Network

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Architecture of neural networks Sentiment example: does y=1 or y=0?

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get
off the couch and start dancing . It sucked me in , and it'll do the same to you .

BITS Pilani, Pilani Campus 5


BITS Pilani, Pilani Campus

Classification: Sentiment Analysis


Classifying sentiment using logistic regression

Sentiment Features

BITS Pilani, Pilani Campus 7


BITS Pilani, Pilani Campus
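A minimal sketch of sentiment classification with logistic regression over hand-designed features; the two features used here (counts of words from small positive/negative lexicons) and the toy training data are assumptions for illustration, not the exact feature set on the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

POS = {"great", "nice", "enjoyable", "good"}
NEG = {"hokey", "second-rate", "sucked", "bad"}

def features(text):
    toks = text.lower().split()
    # x1 = count of positive-lexicon words, x2 = count of negative-lexicon words
    return [sum(t in POS for t in toks), sum(t in NEG for t in toks)]

# Toy training data (assumed)
texts = ["great cast and nice music", "second-rate writing , it sucked",
         "good and enjoyable", "hokey and bad"]
labels = [1, 0, 1, 0]

clf = LogisticRegression().fit(np.array([features(t) for t in texts]), labels)
print(clf.predict([features("the cast is great and the music is nice")]))  # -> [1]
```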
Feedforward nets for simple classification Feedforward networks for NLP: Classification
(each xi is a hand-designed feature)
• Just adding a hidden layer to logistic regression allows the network to use non-
linear interactions between features which may (or may not) improve
performance

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Neural Network - How it works
(Figure: a single computational unit. Input layer x1, x2, x3 plus a bias input +1; weights w1, w2, w3 and bias b; weighted sum z = ∑ wi xi + b; a non-linear activation function (e.g. sigmoid or rectified linear) applied to z gives the output value y = a = σ(z).)
Sigmoid: y = σ(z) = 1 / (1 + e^(−z))

Intuition: Training a 2-layer network
• For every training tuple (x, y)
  – Run forward computation to find our estimate ŷ
  – Run backward computation to update weights:
    • For every output node
      – Compute loss L between true y and the estimated ŷ
      – For every weight w from hidden layer to the output layer
        » Update the weight
    • For every hidden node
      – Assess how much blame it deserves for the current answer
      – For every weight w from input layer to the hidden layer
        » Update the weight
BITS Pilani, Pilani Campus 11
BITS Pilani, Pilani Campus
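A minimal numpy sketch of the 2-layer training loop described above (sigmoid hidden layer, sigmoid output, cross-entropy loss, gradient-descent updates); the layer sizes, learning rate, and toy data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                    # 8 toy examples, 3 hand-designed features (assumed)
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # toy labels

W1, b1 = rng.normal(size=(3, 4)) * 0.1, np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)) * 0.1, np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 0.5

for step in range(500):
    # Forward pass: compute the estimate y_hat
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2).ravel()
    # Backward pass: gradients of the cross-entropy loss w.r.t. each weight
    delta_out = (y_hat - y)[:, None]                  # dL/dz2 for sigmoid + cross-entropy
    dW2 = h.T @ delta_out / len(y)
    db2 = delta_out.mean(axis=0)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)      # blame assigned to each hidden node
    dW1 = X.T @ delta_hid / len(y)
    db1 = delta_hid.mean(axis=0)
    # Gradient-descent updates: move against the gradient
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

print(np.round(y_hat, 2), y)   # predictions approach the labels
```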
Reminder: Loss Function for binary logistic regression
• A measure for how far off the current answer is to the right answer
• Cross-entropy loss for logistic regression:
  L_CE(ŷ, y) = −[ y log ŷ + (1 − y) log(1 − ŷ) ]

Reminder: gradient descent for weight updates
• Use the derivative of the loss function with respect to weights
  d/dw L(f(x; w), y)
• To tell us how to adjust weights for each training item
  – Move them in the opposite direction of the gradient
  w_{t+1} = w_t − η · d/dw L(f(x; w), y)

BITS Pilani, Pilani Campus

Backprop

• For training, we need the derivative of the loss with respect to each
weight in every layer of the network
• But the loss is computed only at the very end of the network!
• Solution: error backpropagation (Rumelhart, Hinton, Williams, 1986)
• Backprop is a special case of backward differentiation
• Which relies on computation graphs.

14
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Forward Propagation Contd..

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Contd..
Error calculation - log loss function
• E = −log(P(wt | wc))
• wt = target word
• wc = context word

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Back propagation

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Contd.. Contd..

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Step2: Updating the weight of w/11: Now Updating the first weight

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Contd.. Example
Corpus = "Ned Stark is the most honourable man"
initial embedding matrix
Current word-context pair = ("Ned", "Stark") ned -0.018 0.404 -0.317
stark 0.204 -0.007 -0.733
Current negative words = "pimples", "zebra", "idiot"
pimples -0.706 0.745 0.002
zebra -0.492 -0.709 0.133
idiot -0.628 -0.073 0.502

one hot vector


ned 1 0 0 0 0 initial context matrix
stark 0 1 0 0 0 ned -0.572 -0.588 -0.501
pimples 0 0 1 0 0
stark 0.116 0.723 -0.689
zebra 0 0 0 1 0
idiot 0 0 0 0 1 pimples -0.94 0.601 0.146
zebra -0.622 0.811 0.64
idiot -0.077 -0.375 -0.056

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Initial embeddings

input embeddings (initial)
ned      -0.018   0.404  -0.317
stark     0.204  -0.007  -0.733
pimples  -0.706   0.745   0.002
zebra    -0.492  -0.709   0.133
idiot    -0.628  -0.073   0.502

context embeddings (initial)
ned      -0.572  -0.588  -0.501
stark     0.116   0.723  -0.689
pimples  -0.94    0.601   0.146
zebra    -0.622   0.811   0.64
idiot    -0.077  -0.375  -0.056

Forward Propagation: Sigmoid output layer
(dot product of each old context embedding with the old input word embedding of "ned" = -0.018  0.404  -0.317)

word      old context embedding        dot product   sigmoid()
stark      0.116   0.723  -0.689        0.508         0.624
pimples   -0.94    0.601   0.146        0.213         0.553
zebra     -0.622   0.811   0.64         0.136         0.534
idiot     -0.077  -0.375  -0.056       -0.132         0.467

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Backward Propagation: Prediction Error
Our current positive word is Stark, so tj = 1 for Stark and tj = 0 for the other negative words (pimples, zebra, idiot).

sigmoid()   t   sig − t (pred error)
0.624       1   -0.376
0.553       0    0.553
0.534       0    0.534
0.467       0    0.467

Derivative of loss w.r.t input word embedding

word      context embedding (initial)   sig − t (pred error)   C * (sig − t)
stark      0.116   0.723  -0.689        -0.376                 -0.04362   -0.271848   0.259064
pimples   -0.94    0.601   0.146         0.553                 -0.51982    0.332353   0.080738
zebra     -0.622   0.811   0.64          0.534                 -0.33215    0.433074   0.34176
idiot     -0.077  -0.375  -0.056         0.467                 -0.03596   -0.175125  -0.02615
sum = derivative of loss w.r.t input word embedding:           -0.93154    0.318454   0.65541

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Input word embedding update

word   old input word embedding    LR     LR * derivative w.r.t input       new i/p word embedding
ned    -0.018   0.404  -0.317      0.05   -0.04658   0.015923   0.032771     0.029   0.388   -0.35

Derivative of loss w.r.t context word embeddings
(i/p word embedding of "ned" = -0.018  0.404  -0.317)

sigmoid − t (pred error)   derivative of loss w.r.t context word embedding
-0.376                      0.007   -0.152    0.119
 0.553                     -0.01     0.223   -0.175
 0.534                     -0.01     0.216   -0.169
 0.467                     -0.008    0.189   -0.148

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Context word embedding update

word      old context word embedding   LR     LR * derivative w.r.t context    new context word embedding
stark      0.116   0.723  -0.689       0.05    0.00035  -0.0076    0.00595      0.116   0.731  -0.695
pimples   -0.94    0.601   0.146       0.05   -0.0005    0.01115  -0.00875     -0.94    0.59    0.155
zebra     -0.622   0.811   0.64        0.05   -0.0005    0.0108   -0.00845     -0.622   0.8     0.648
idiot     -0.077  -0.375  -0.056       0.05   -0.0004    0.00945  -0.0074      -0.077  -0.384  -0.049

Glove
• GloVe stands for global vectors for word representation.
• Unsupervised learning algorithm developed by Stanford for generating word embeddings.
• GloVe captures both global statistics and local statistics of a corpus, in order to come up with word vectors.
• Aggregates a global word-word co-occurrence matrix from a corpus.
• The idea is that learning should be done with ratios of co-occurrence probabilities rather than the probabilities themselves.
• The advantage of GloVe is that, unlike word2vec, GloVe does not rely just on local statistics (local context information of words), but incorporates global statistics (word co-occurrence) to obtain word vectors.
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
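A minimal numpy sketch of one skip-gram negative-sampling update step, following the worked example above (learning rate 0.05, positive context "stark", negatives "pimples", "zebra", "idiot"); small differences from the hand-computed tables are just rounding.

```python
import numpy as np

lr = 0.05
center = np.array([-0.018, 0.404, -0.317])                 # input embedding of "ned"
context = {                                                # old context embeddings
    "stark":   np.array([ 0.116,  0.723, -0.689]),
    "pimples": np.array([-0.940,  0.601,  0.146]),
    "zebra":   np.array([-0.622,  0.811,  0.640]),
    "idiot":   np.array([-0.077, -0.375, -0.056]),
}
targets = {"stark": 1.0, "pimples": 0.0, "zebra": 0.0, "idiot": 0.0}

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

grad_center = np.zeros(3)
for word, c in context.items():
    err = sigmoid(center @ c) - targets[word]     # prediction error (sigma - t)
    grad_center += err * c                        # accumulate dL/d(center)
    context[word] = c - lr * err * center         # update this context embedding

center = center - lr * grad_center                # update the input word embedding
print(np.round(center, 3))                        # ~[ 0.029  0.388 -0.35 ]
print(np.round(context["stark"], 3))              # ~[ 0.116  0.731 -0.695]
```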

Matrix of word-word co-occurrence count Intuition behind GloVe

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


References

• https://aegis4048.github.io/optimize_computational_efficiency_of_skip-
gram_with_negative_sampling#pred_error

• Ch-6 : Speech and Language Processing Daniel Jurafsky and James H. Martin

Thank you

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Neural Language Model


Natural Language Processing

Dr. Vijayalakshmi Anand


BITS Pilani
Pilani Campus BITS Pilani

BITS Pilani, Pilani Campus


Types of language model
• Statistical model
• Neural language model

Probabilistic Language Models
• Probability of a sequence of words: P(W) = P(w1, w2, ..., wT−1, wT)
• Conditional probability of an upcoming word: P(wT | w1, w2, ..., wT−1)
• Chain rule of probability:
  P(w1, w2, ..., wT) = P(w1) P(w2 | w1) P(w3 | w1, w2) ... P(wT | w1, w2, ..., wT−1)
• (n−1)th order Markov assumption:
  P(w1, w2, ..., wT) = Π_{t=1}^{T} P(wt | wt−n+1, ..., wt−1)
• Each p(wi | wi−4, wi−3, wi−2, wi−1) may not have enough statistics to estimate
• We back off to p(wi | wi−3, wi−2, wi−1), p(wi | wi−2, wi−1), etc., all the way to p(wi)
BITS Pilani, Pilani Campus

Example Curse of dimensionality

Consider this sentences


Have a good day
Have a great day

What happens in smoothing ?

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


How to generalize better Neural Probabilistic Language Models

• Learn distributed representation of words • Instead of treating words as tokens, exploit semantic similarity
• What is distributed representation? – Learn a distributed representation of words that will allow sentences like these to be
➢ also known as embedding. seen as similar
➢Eg: The cat is walking in the bedroom.
A dog was walking in the room.
The cat is running in a room.
The dog was running in the bedroom.
etc.
– Use a neural net to represent the conditional probability function
𝑷(𝒘𝒕 |𝒘𝒕−𝒏 , 𝒘𝒕−𝒏+𝟏 , … , 𝒘𝒕−𝟏 )
– Learn the word representation and the probability function simultaneously

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Simple feedforward Neural Language Models Word Embeddings


• one-hot vector representation , e.g., dog = (0,0,0,0,1,0,0,0,0,....), cat = (0,0,0,0,0,0,0,1,0,....)
• Task: predict next word wt • Represent each of the N previous words as a one-hot vector of length |V| one-hot vector , i.e., with one
given prior words wt-1, wt-2, wt-3, … dimension for each word in the vocabulary
• word “toothpaste”, supposing it is V5, i.e., index 5 in the vocabulary: x5 = 1, and xi = 0 for all i ≠ 5
• Output : probability distribution over possible next words.

Embedding matrix One-hot vector

• Problem: Now we’re dealing with sequences of arbitrary length.


• Solution: Sliding windows (of fixed length)
9

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
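A small sketch of the one-hot / embedding-matrix relationship: multiplying the embedding matrix by a one-hot vector just selects one column, so in practice the lookup is done by indexing. The vocabulary size and embedding dimension are assumed for illustration.

```python
import numpy as np

V, d = 10, 4                       # vocabulary size and embedding dimension (assumed)
rng = np.random.default_rng(0)
E = rng.normal(size=(d, V))        # embedding matrix, one column per vocabulary word

x = np.zeros(V)
x[5] = 1.0                         # one-hot vector for the word at index 5 ("toothpaste" in the slide)

print(np.allclose(E @ x, E[:, 5])) # True: E @ one_hot == column lookup
```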


Why Neural LMs work better than N-gram LMs
Training the neural language model
embeddings
• Neural language models represent words in this prior context by their embeddings, • Two ways:
rather than just by their word identity as used in n-gram language models. • Freeze the embedding layer E with initial word2vec values.
• Using embeddings allows neural language models to generalize better to unseen data.
– Freezing means we use word2vec or some other pretraining algorithm to
Training data: compute the initial embedding matrix E, and then hold it constant while we only
• We've seen: I have to make sure that the cat gets fed. Modify W, U, and b, i.e., we don’t update E during language model training
• Never seen: dog gets fed • Learn the embeddings simultaneously with training the network.
Test data:
– Useful when the task the network is designed for (like sentiment classification,
• I forgot to make sure that the dog gets ___ translation, or parsing) places strong constraints on what makes a good
• N-gram LM can't predict "fed"! representation for words.
• Neural LM can use similarity of "cat" and "dog" embeddings to generalize and predict
“fed” after dog

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Neural Language Model
Equations (a sketch of the feedforward LM, following the architecture above; θ = E, W, U, b):
  e = [E x_{t−3}; E x_{t−2}; E x_{t−1}]   (concatenate 3 embeddings for the 3 context words to produce the embedding layer e)
  h = σ(W e + b)
  z = U h
  ŷ = softmax(z)
• One single embedding dictionary E is shared among the one-hot vectors.

Training the neural language model
• Training the parameters to minimize loss will result both in an algorithm for language modeling (a word predictor) and in a new set of embeddings E that can be used as word representations for other tasks.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
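A minimal numpy sketch of the forward pass of this fixed-window feedforward LM (3 context words, shared embedding matrix E, sigmoid hidden layer, softmax output); all sizes and random parameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 20, 8, 16                       # vocab size, embedding dim, hidden size (assumed)
E = rng.normal(size=(d, V)) * 0.1         # shared embedding dictionary
W = rng.normal(size=(H, 3 * d)) * 0.1     # hidden weights for a 3-word window
b = np.zeros(H)
U = rng.normal(size=(V, H)) * 0.1         # output weights

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def softmax(z): e = np.exp(z - z.max()); return e / e.sum()

def next_word_distribution(context_ids):                 # ids of the 3 words before w_t
    e = np.concatenate([E[:, i] for i in context_ids])   # embedding layer e
    h = sigmoid(W @ e + b)
    return softmax(U @ h)                                 # P(w_t | w_{t-3}, w_{t-2}, w_{t-1})

p = next_word_distribution([4, 11, 7])
print(p.shape, p.sum())                                   # (20,) 1.0
```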


Training the neural language model
• Input a very long text, concatenating all the sentences
• Starting with random weights
• Iteratively moving through the text predicting each word wt
• For language modeling, the classes are the words in the vocabulary:
  ŷi = P(wt | wt−n, wt−n+1, …, wt−1)
• At each word wt, we use the cross-entropy (negative log likelihood) loss:
  L_CE = −log ŷ_{wt}
• Gradient Descent update:
  θ ← θ − η ∇θ L_CE
• This gradient can be computed in any standard neural network framework, which will then backpropagate through θ = E, W, U, b.

References
• CH-7 - Speech and Language Processing, Daniel Jurafsky and James H. Martin

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Natural Language Processing


DSECL ZG530 Session 4-Part-of-Speech Tagging (Viterbi ,Maximum entropy)
Date – 6th July 2023
Prof.Vijayalakshmi Anand Time – 1.40pmto 3.40pm
BITS Pilani BITS-Pilani
Pilani Campus These slides are prepared by the instructor, with grateful acknowledgement of James Allen and
many others who made their course materials freely available online.
Session4-POS tagging(Viterbi
,Maximum entropy model)
• The Hidden Markov Model
❖What is POS tagging
• Likelihood Computation:
❖Application of POS tagging
▪ The Forward Algorithm
❖Tag sets –standard tag set
• Decoding: The Viterbi Algorithm
❖Approaches of POS tagging
• Bidirectionalty
❖Introduction to HMM
❖How HMM is used in POS tagging • Maximum entropy model

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Hidden Markov Models Hidden Markov Model (HMM)


It is a sequence model. Oftentimes we want to know what produced the sequence
– the hidden sequence for the observed sequence.
Assigns a label or class to each unit in a sequence, For example,
thus mapping a sequence of observations to – Inferring the words (hidden) from acoustic signal (observed) in speech
recognition
a sequence of labels. – Assigning part-of-speech tags (hidden) to a sentence (sequence of words) – POS
Probabilistic sequence model: given a sequence of tagging.
– Assigning named entity categories (hidden) to a sentence (sequence of words) –
units (e.g. words, letters, morphemes, Named Entity Recognition.

sentences), compute a probability distribution


over possible sequences of labels and choose
the best label sequence.
This is a kind of generative model.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Definition of HMM
States Q = q1, q2, …, qN
Observations O = o1, o2, …, oN
– Each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
Transition probabilities
– Transition probability matrix A = {aij}
  aij = P(qt = j | qt−1 = i), 1 ≤ i, j ≤ N
Observation likelihoods
– Output probability matrix B = {bi(k)}
  bi(k) = P(Xt = ok | qt = i)
Special initial probability vector π
  πi = P(q1 = i), 1 ≤ i ≤ N

Three Problems
Given this framework there are 3 problems that we can pose to an HMM
1. Given an observation sequence, what is the probability of that sequence given a model?
2. Given an observation sequence and a model, what is the most likely state sequence?
3. Given an observation sequence, find the best model parameters for a partially specified model
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Problem 1: Problem 2:
Observation Likelihood
• The probability of a observation sequence given • Most probable state sequence given a model
a model and state sequence and an observation sequence
• Evaluation problem • Decoding problem

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Problem 3:
Solutions
• Infer the best model parameters, given a partial model Problem 1: Forward (learn observation sequence)
and an observation sequence... Problem 2: Viterbi (learn state sequence)
– That is, fill in the A and B tables with the right numbers --
Problem 3: Forward-Backward (learn probabilities)
• the numbers that make the observation sequence most likely
– An instance of EM (Expectation Maximization)
• This is to learn the probabilities!

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Example :HMMs for Ice Cream


Problem 1

You are a climatologist in the year 2799 studying global


warming
You can’t find any records of the weather in Baltimore for
summer of 2007
But you find Jason Eisner’s diary which lists how many ice-
creams Jason ate every day that summer
Your job: figure out how hot it was each day
❖ 1.Consider all possible 3-day weather sequences [H, H, H],
[H, H, C], [H, C, H], …….
2. For each 3-day weather sequence, consider the probability
of the ice cream Consumption sequences [131,]
3.Add all the probability
❖ Not efficient
❖ Forward algorithm
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Problem2 Problem3
.

▪ 1.Consider all possible 3-day weather sequences [H, H, H], [H, H, C], [H, C,
H], ▪ Find :
The start probabilities
2. For each 3-day weather sequence, consider the probability of the ice The transition probabilities
cream Consumption sequences [0,1,2] Emission probabilities
3. Pick out the sequence that has the highest probability from step #2. ▪ Forward –backward algorithm
▪ Not efficient
▪ Viterbi algorithm BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Getting to HMMs Using Bayes Rule


This equation should give us the best tag sequence
Know this.

But how to make it operational? How to compute this


value?
Intuition of Bayesian inference:
– Use Bayes rule to transform this equation into a set of probabilities that are
easier to compute (and give the right answer)

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Likelihood and Prior HMM for Ice Cream
Given
– Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
Produce:
– Hidden Weather Sequence:
H,C,H,H,H,C, C…

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

HMM for Ice Cream Decoding


• Given an observation sequence
❑ 123
• And an HMM
• The task of the decoder
❑ To find the best hidden state sequence most likely to
have produced the observed sequence
• Given the observation sequence
O=(o1o2…oT), and an HMM model Φ = (A,B),
Transition probability
Emission probabilities
how do we choose a corresponding state
H C 1 2 3 sequence Q=(q1q2…qT) that is optimal in
H 0.7 0.3 H 0.2 0.4 0.4 some sense (i.e., best explains the
C 0.4 0.6 C 0.5 0.4 0.1 observations)
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Contd.. Viterbi intuition
• One possibility to find the best sequence is : • We want to compute the joint probability
- For each hidden state sequence Q of the observation sequence together
- HHH, HHC, HCH, with the best state sequence
- Compute P(O|Q)
- Pick the highest one
• Why not?
-
• Instead:
- The Viterbi algorithm
- Is a dynamic programming algorithm

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Hidden Markov Model Hidden Markov Model


Example -1 – Viterbi Algorithm - Initialization Example -1 – Viterbi Algorithm - Recursion

P(C)*P(3|C) = 0.2*0.1 = 0.02 P(C)*P(C|C)*P(1|C) = 0.02*0.5*0.5 = 0.005


C P(H)*P(C|H)*P(1|C) = 0.32*0.4*0.5 = 0.064
C C
0.02

*
*
Hot Cold Hot Cold
<S 0.8 0.2 <S 0.8 0.2
> >
A Hot Col A Hot Col
d d
H
Hot 0.6 0.4 Hot 0.6 0.4
Col 0.5 0.5 0.32 H H
Col 0.5 0.5
P(H)*P(3|H) = 0.8*0.4 = 0.32 d d
B 1 2 3
P(C)*P(H|C)*P(1|H) = 0.02*0.5*0.2 = 0.002 B 1 2 3
Hot .2 .4 .4 P(H)*P(H|H)*P(1|H) = 0.32*0.6*0.2 = 0.0384 Hot .2 .4 .4
Col .5 .4 .1 Col .5 .4 .1
d d
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Hidden Markov Model Hidden Markov Model
Example -1 – Viterbi Algorithm – Termination through Back Trace Example -1 – Viterbi Algorithm

P(C)*P(C|C)*P(3|C) = 0.064*0.5*0.1 = 0.032


P(H)*P(C|H)*P(3|C) = 0.0384*0.4*0.1 = 0.0015
C C C
0.02 0.064

* Best Sequence:
Hot฀ Cold฀
Hot Cold
<S 0.8 0.2
>
A Hot Col
d
H H H
Hot 0.6 0.4
0.32 0.0384 Col 0.5 0.5
d
P(C)*P(H|C)*P(3|H) = 0.064 *0.5*0.2 = 0.0064 B 1 2 3
P(H)*P(H|H)*P(3|H) = 0.0384 *0.6*0.2 = 0.0046 Hot .2 .4 .4
Col .5 .4 .1 Source Credit : Speech and Language Processing - Jurafsky and Martin
d
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
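A minimal sketch of the Viterbi recursion for the ice-cream HMM of Example 1 (initial H 0.8 / C 0.2, transitions H→H 0.6, H→C 0.4, C→H 0.5, C→C 0.5, and the emission table B); the trellis values printed may differ slightly from the slides where rounding was applied.

```python
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    states = list(pi)
    v = [{s: pi[s] * B[s][obs[0]] for s in states}]     # initialization
    back = [{}]
    for o in obs[1:]:                                    # recursion
        v.append({}); back.append({})
        for s in states:
            prev, score = max(((p, v[-2][p] * A[p][s]) for p in states), key=lambda x: x[1])
            v[-1][s] = score * B[s][o]
            back[-1][s] = prev
    # termination: pick the best final state, then follow backpointers
    best = max(v[-1], key=v[-1].get)
    path = [best]
    for bp in reversed(back[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path)), v[-1][best]

print(viterbi([3, 1, 3]))    # best hidden weather sequence and its probability
```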

Hidden Markov Model Hidden Markov Model


POS Tagging – Viterbi Algorithm POS Tagging – Example -2 – Viterbi Algorithm

Source Credit : Speech and Language Processing - Jurafsky and Martin Source Credit : Speech and Language Processing - Jurafsky and Martin

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Hidden Markov Model Hidden Markov Model
POS Tagging – Example -2 – Viterbi Algorithm - Initialization POS Tagging – Example -2 – Viterbi Algorithm - Recursion

VB P(VB | Start)*P(I | VB) = 0.019*0 = 0 VB VB


VB TO NN PPSS
VB TO NN PPSS
<S> 0.019 0.004 0.041 0.067 0
3 <S> 0.019 0.004 0.041 0.067
3
TO P(TO | Start)*P(I | TO) = 0.0043*0 = 0 A VB TO NN PPSS TO TO
A VB TO NN PPSS
VB 0.003 0.035 0.047 0.0070 0 VB 0.0038 0.035 0.047 0.0070
8
S TO 0.83 0 0.0004 0 S TO 0.83 0 0.00047 0
7 NN
NN P(NN | Start)*P(I | NN) = 0.041*0 = 0 NN
NN 0.004 0.016 0.087 0.0045 NN 0.0040 0.016 0.087 0.0045
0 0 PPSS 0.23 0.00079 0.0012 0.00014
PPS 0.23 0.0007 0.0012 0.00014
S 9
PP PP PP B I want to race
SS
P(PPSS | Start)*P(I | PPSS) = 0.067*0.37 B I want to race SS SS VB 0 0.0093 0 0.00012
= 0.0248 VB 0 0.0093 0 0.00012
0.0248 TO 0 0 0.99 0
TO 0 0 0.99 0
P(VB) * P(VB | VB)*P(want | VB) = 0*0.0038*0.0093= 0 NN 0 0.00005 0 0.00057
NN 0 0.000054 0 0.00057 4
P(TO) * P(VB | TO)*P(want | VB) = 0*0.83*0.0093= 0
PPS 0.37 0 0 0 PPS 0.37 0 0 0
P(NN) * P(VB | NN)*P(want | VB) = 0*0.004*0.0093= 0
S S
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
P(PPSS) * P(VB | PPSS)*P(want | VB) = 0.0248*0.23*0.0093= 0.0005

Hidden Markov Model Hidden Markov Model


POS Tagging – Example -2 – Viterbi Algorithm - Recursion POS Tagging – Example -2 – Viterbi Algorithm - Recursion
0.0005
VB TO NN PPSS VB TO NN PPSS
VB VB <S> 0.019 0.004 0.041 0.067 VB VB VB <S> 0.019 0.004 0.041 0.067
3 3
0 0
0
A VB TO NN PPSS A VB TO NN PPSS
TO TO VB 0.003 0.035 0.047 0.0070 TO TO TO VB 0.0038 0.035 0.047 0.0070
8
0 TO 0.83 0 0.0004 0 0 TO 0.83 0 0.00047 0
7 0.16 * 10-7 NN 0.0040 0.016 0.087 0.0045
S S
NN 0.004 0.016 0.087 0.0045
NN NN NN NN NN PPSS 0.23 0.00079 0.0012 0.00014
0
PPS 0.23 0.0007 0.0012 0.00014
0 S 9 0 B I want to race
0 VB 0 0.0093 0 0.00012

PP PP B I want to race PP PP PP TO 0 0 0.99 0


SS SS VB 0 0.0093 0 0.00012 SS SS SS NN 0 0.00005 0 0.00057
4
0.0248 TO 0 0 0.99 0 0.0248
P(VB) * P(NN | VB)*P(want | NN) = 0*0.047*0.000054= 0 P(VB) * P(TO | VB)*P(to | TO) = 0.0005*0.035*0.99= 0.17 * 10-4 PPS 0.37 0 0 0
NN 0 0.00005 0 0.00057 S
P(TO) * P(NN | TO)*P(want | NN) = 0*0.0047*0.000054= 0 4 P(TO) * P(TO | TO)*P(to | TO) = 0*0*0.99= 0
P(NN) * P(NN | NN)*P(want | NN) = 0*0.087*0.000054= 0 PPS 0.37 0 0 0 P(NN) * P(TO | NN)*P(to | TO) = 0.16 * 10-7 *0.016*0.99= 0.13 * 10-10
P(PPSS) * P(NN | PPSS)*P(want | NN) = 0.0012*0.23*0.000054= 0.16 * 10-7 S P(PPSS) * P(TO | PPSS)*P(to | TO) = 0*0.00079*0.99= 0
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Hidden Markov Model Hidden Markov Model
POS Tagging – Example -2 – Viterbi Algorithm - Recursion POS Tagging – Example -2 – Viterbi Algorithm - Recursion
0.0005 0 0.0005 0
VB TO NN PPSS
VB VB VB VB <S> 0.019 0.004 0.041 0.067 VB VB VB VB
3
0 0.17 * 10-4
0 0.17 * 10-4
0 0 VB TO NN PPSS
A VB TO NN PPSS

TO TO TO TO VB 0.003 0.035 0.047 0.0070 TO TO TO TO <S> 0.019 0.004 0.041 0.067


8 3
0 TO 0.83 0 0.0004 0 0 A VB TO NN PPSS
0.16 * 10-7 0 7 0.16 * 10-7 0 VB 0.003 0.035 0.047 0.0070
S NN 0.004 0.016 0.087 0.0045 S 8
NN NN NN NN 0 NN NN NN NN
TO 0.83 0 0.0004 0
PPS 0.23 0.0007 0.0012 0.00014 7
0 S 9 0 NN 0.004 0.016 0.087 0.0045
0 0 0 0 0
PP PP PP PP B I want to race PP PP PP PP PPSB 0.23I 0.0007
want 0.0012
to 0.00014
race
SS SS SS SS SS SS SS SS S 9
VB 0 0.0093 0 0.00012 VB 0 0.0093 0 0.00012
P(VB) * P(VB | VB)*P(race | VB) = 0*0.0038*0.00012 = 0 TO 0 0 0.99 0 P(VB) * P(NN | VB)*P(race | NN) = 0*0.047*0.00057 TO 0 0 0.99 0
0.0248 0.0248
P(TO) * P(VB | TO)*P(race | VB) = 0.17 * 10-4 *0.83*0.00012 NN 0 0.00005 0 0.00057 P(TO) * P(NN | TO)*P(race | NN) = 0.17 * 10-4 *0.00047*0.00057NN 0 0.00005 0 0.00057
= 0.17 * 10-8 = 0.46 * 10-11
4 4
P(NN) * P(VB | NN)*P(race | VB) =0*0.0040*0.00012=0 PPS 0.37 0 0 0 P(NN) * P(NN | NN)*P(race | NN) =0*0.087*0.00057=0 PPS 0.37 0 0 0
P(PPSS) * P(VB | PPSS)*P(race | VB) = 0*0.23*0.00012= 0 S P(PPSS) * P(NN | PPSS)*P(race | NN) = 0*0.0012*0.00057= 0 S
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Hidden Markov Model Hidden Markov Model


POS Tagging – Example -2 – Viterbi Algorithm - Termination POS Tagging – Example -3 – Viterbi Algorithm – Practice
0.0005 0 Transition matrix: P(ti|ti-1) Emission matrix: P(wi|ti)

VB VB VB VB NOUN Verb Det Prep ADV STO a cat doctor in is the very
P
0 0.17 * 10-4
0.17 * 10-8 Noun 0 .5 .4 0 0.1 0 0
0 VB TO NN PPSS <S> .3 .1 .3 .2 .1 0
Verb 0 0 .1 0 .9 0 0
Noun .2 .4 .01 .3 .04 .05
TO TO TO TO <S> 0.019 0.004 0.041 0.067 Det .3 0 0 0 0 .7 0
3 Verb .3 .05 .3 .2 .1 .05
Prep 0 0 0 1.0 0 0 0
0 0
A VB TO NN PPSS Det .9 .01 .01 .01 .07 0
Adv 0 0 0 .1 0 0 .9
0.16 * 10-7 0 VB 0.003 0.035 0.047 0.0070 Prep .4 .05 .4 .1 .05 0
S 8 Adv .1 .5 .1 .1 .1 .1
NN NN NN NN
TO 0.83 0 0.0004 0
7 w1- w2=docto w3=i w4=in STOP
0 0.46 * 10-11 NN 0.004 0.016 0.087 0.0045 =the r s V(Noun,the) = P(Noun|<S>)P(the|Noun) = .3 X 0=0
0 0 0 Noun 0 V(Verb,the) = P(Verb|<S>)P(the|Verb) = .1 X 0=0
V(Det,the) = P(Det|<S>)P(the|Det) = .3 X .7=.21
PP PP PP PP PPSB 0.23I 0.0007
want 0.0012
to 0.00014
race Verb 0
V(Prep,the) = P(Prep|<S>)P(the|Prep) = .2 X .0=0
SS SS SS SS S 9 Det .21
VB 0 0.0093 0 0.00012 V(Adv,the) = P(Adv|<S>)P(the|Adv) = .2 X .0=0
0.0248 TO 0 0 0.99 0 Prep 0
0
NN 0 0.00005 0 0.00057 Adv 0
4
PPS 0.37 0 0 0
S
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Hidden Markov Model Hidden Markov Model
POS Tagging – Example -3 – Viterbi Algorithm – Practice Example -4 – Naïve Search

V(Noun, doctor) = maxt’ V(t’,the)XP(Noun|t′)X P(doctor|Noun)


= max {0, 0, .21 (.9 X.4),0, 0} = .0756
V(Verb, doctor) = maxt’ V(t’,the)XP(Verb|t′)X P(doctor|Verb)
= max {0, 0, .21(.01X.1), 0, 0} = .00021
w1- w2=docto w3=i w4=in STOP Hot Cold
=the r s <S 0.8 0.2
Noun 0 .0756 >
Verb 0 .00021 HHH P(H)*P(1|H)*P(H|H)*P(3|H)*P(H|H)*P(1|H)
A Hot Col
Det .21 0 HHC d
Prep 0 0 w1- w2=docto w3=is w4=in STOP HCC Hot 0.7 0.3
=the r
Adv 0 0 Col 0.4 0.6
Noun 0 .0756 .00151 0 CCC
d
2 CHC P(C)*P(1|C)*P(H|C)*P(3|H)*P(C|H)*P(1|C) 0.0024
Verb 0 .00021 .02721 0 =0.2*0.5*0.4*0.4*0.3*0.5 B 1 2 3
6 .000027 Hot .2 .4 .4
2 CCH
Det .21 0 0 0 Col .5 .4 .1
CHH d
Prep 0 0 0 .00544
Det Noun Verb Prep
3 HCH
Adv 0 0 0 BITS Pilani, Pilani Campus
.00027 BITS Pilani, Pilani Campus
2

Hidden Markov Model Hidden Markov Model


Example -5 – Observation Likelihood by Forward Propagation Example -5 – Observation Likelihood by Forward Propagation - Recursion

P(C)*P(3|C) = 0.2*0.1 = 0.02 P(C)*P(C|C)*P(1|C) = 0.02*0.5*0.5 = 0.005


C P(H)*P(C|H)*P(1|C) = 0.32*0.4*0.5 = 0.064
C C
Efficiently computes the probability of an 0.02
observed sequence given a model
P(sequence | model)
*
Identical to Viterbi; replace the MAX with a
SUM *
Hot Cold
A Hot Col
Hot Cold d <S 0.8 0.2
<S 0.8 0.2 Hot 0.6 0.4 >
A Hot Col
> Col 0.5 0.5 d
H d
Hot 0.6 0.4
0.32 H H
Col 0.5 0.5
P(H)*P(3|H) = 0.8*0.4 = 0.32 d
B 1 2 3
P(C)*P(H|C)*P(1|H) = 0.02*0.5*0.2 = 0.002 B 1 2 3
Hot .2 .4 .4 P(H)*P(H|H)*P(1|H) = 0.32*0.6*0.2 = 0.0384 Hot .2 .4 .4
Col .5 .4 .1 Col .5 .4 .1
d d
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Hidden Markov Model Hidden Markov Model
Example -5 – Observation Likelihood by Forward Propagation Observation Likelihood – Forward Propagation

P(C)*P(C|C)*P(R|C) = 0.069*0.5*0.1 = 0.0035


P(H)*P(C|H)*P(R|C) = 0.0404*0.4*0.1 = 0.0016
C C C C
0.02 0.069 0.0051

* Observation Likelihood
= 0.0168
Hot Cold
<S 0.8 0.2
>
A Hot Col
d
H H H C Hot 0.6 0.4
0.0404 0.0117 Col 0.5 0.5
0.32
d
P(C)*P(H|C)*P(R|H) = 0.069 *0.5*0.2 = 0.0069 B 1 2 3
P(H)*P(H|H)*P(R|H) = 0.0404 *0.6*0.2 = 0.0048 Hot .2 .4 .4
Col .5 .4 .1 Source Credit : Speech and Language Processing - Jurafsky and Martin
d
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Hidden Markov Model


Observation Likelihood – Forward Propagation Forward-Backward
Baum-Welch = Forward-Backward Algorithm (Baum
1972)
Is a special case of the EM or Expectation-Maximization
algorithm
The algorithm will let us learn the transition probabilities
A= {aij} and the emission probabilities B={bi(ot)} of the
HMM

Source Credit : Speech and Language Processing - Jurafsky and Martin


39
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Bidirectionality Bidirectionality

• One problem with the HMM models as • Any sequence model can be turned into a bidirectional
model by using multiple passes.
presented is that they are exclusively run • For example, the first pass would use only part-of-speech
left-to-right. features from already-disambiguated words on the left. In
• Viterbi algorithm still allows present the second pass, tags for all words, including those on the
decisions to be influenced indirectly by right, can be used.
• Alternately, the tagger can be run twice, once left-to-right
future decisions, and once right-to-left.
• It would help even more if a decision about • In Viterbi decoding, the classifier chooses the higher
scoring of the two sequences (left-to-right or right-to-
word wi could directly use information left).
about future tags ti+1 and ti+2. • Modern taggers are generally run bidirectionally.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Some limitations of HMMs
• Unknown words
• First order HMM
  Eg:
    Is clearly marked
    He clearly marked

Maximum Entropy Markov Model
• Turn logistic regression into a discriminative sequence model simply by running it on successive words, using the class assigned to the prior word as a feature in the classification of the next word.
• When we apply logistic regression in this way, it is called a maximum entropy Markov model or MEMM.
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Maximum Entropy Markov Model
• Let the sequence of words be W = w1 … wn and the sequence of tags T = t1 … tn.
• In an HMM, to compute the best tag sequence that maximizes P(T|W) we rely on Bayes' rule and the likelihood P(W|T).
• In an MEMM, by contrast, we compute the posterior P(T|W) directly, training it to discriminate among the possible tag sequences.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Maximum Entropy Markov


Learning MEMM
Model
• Consider tagging just one word. A multinomial logistic regression classifier • Learning in MEMMs relies on the same supervised
could compute the single probability P(ti|wi,ti−1) in a different way than an
HMM learning algorithms we presented for logistic regression.
• HMMs compute likelihood (observation word conditioned on tags) but • Given a sequence of observations, feature functions,
MEMMs compute posterior (tags conditioned on observation words). and corresponding hidden states, we use gradient
descent to train the weights to maximize the log-
likelihood of the training corpus.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Maximum Entropy Markov
MEMM
Model
• Reason to use a discriminative sequence model is that • Janet/NNP will/MD back/VB the/DT bill/NN, when wi is
it’s easier to incorporate a lot of features the word back, would generate the following features

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

How to decode to find this


Training MEMMs
optimal tag sequence ˆ T?
• The most likely sequence of tags is then computed by combining these • Simplest way to turn logistic regression into a sequence
features of the input word wi, its neighbors within l words wi+l i−l, and the
previous k tags ti−1 i−k as follows (using θ to refer to feature weights instead model is to build a local classifier that classifies each
of w to avoid the confusion with w meaning words): word left to right, making a hard classification on the first
word in the sentence, then a hard decision on the
second word, and so on.
• This is called a greedy decoding algorithm

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
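A sketch of greedy left-to-right decoding with a local classifier, as described above; `classify` is a hypothetical stand-in for a trained logistic-regression scorer over features of the word, its neighbours, and the previous tag.

```python
def greedy_decode(words, classify):
    """Tag left to right, making a hard decision at each word.

    `classify(features) -> tag` is a hypothetical stand-in for a trained local
    classifier (e.g. multinomial logistic regression over these features).
    """
    tags = []
    for i, w in enumerate(words):
        feats = {
            "word": w.lower(),
            "suffix3": w[-3:],
            "prev_word": words[i - 1].lower() if i > 0 else "<s>",
            "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
            "prev_tag": tags[-1] if tags else "<s>",   # the MEMM-style feature
        }
        tags.append(classify(feats))
    return tags

# Toy stand-in classifier, just to show the call pattern
lexicon = {"janet": "NNP", "will": "MD", "back": "VB", "the": "DT", "bill": "NN"}
demo = lambda f: lexicon.get(f["word"], "NN")
print(greedy_decode(["Janet", "will", "back", "the", "bill"], demo))
# ['NNP', 'MD', 'VB', 'DT', 'NN']
```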


Issue with greedy algorithm MEMM with Viterbi algorithm

• The problem with the greedy algorithm is that by making • Finding the sequence of part-of-speech tags that is
a hard decision on each word before moving on to the optimal for the whole sentence. Viterbi value of time t for
next word, the classifier can’t use evidence from future state j
decisions.
• Although the greedy algorithm is very fast, and
occasionally has sufficient accuracy to be useful, in • In HMM
general the hard decision causes too great a drop in
performance, and we don’t use it.
• MEMM with the Viterbi algorithm just as with the HMM,
Viterbi finding the sequence of part-of-speech tags that
is optimal for the whole sentence

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Tentative mark distribution
• Q1: Introduction - 3 marks; N-gram LM - 4 marks
• Q2: Neural LM - 3 marks; Word embeddings - 4/5 marks
• Q3: Vector semantics - 6 marks
• Q4: POS tagging - 5 marks
• Q5: HMM/Viterbi - 4/5 marks

Open Source POS Taggers
• Stanford PoS Tagger (Python bindings) - Java based, can be used from Python but difficult to install
• Flair - POS tagger available for Python
• NLTK - its tagger is quite precise (around 97%) and quite fast
• spaCy

References
• https://www.nltk.org/
• https://likegeeks.com/nlp-tutorial-using-python-nltk/
• https://www.guru99.com/pos-tagging-chunking-nltk.html
• https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb
• https://nlp.stanford.edu/software/tagger.shtml
• https://www.forbes.com/sites/mariyayao/2020/01/22/what-are--important-ai--machine-learning-trends-for-2020/#601ce9623239
• https://medium.com/fintechexplained/nlp-text-processing-in-data-science-projects-f083009d78fc

BITS Pilani, Pilani Campus
