NLP Merged
Agenda
• Course Objectives
• Course & Evaluation Plan
• Introduction to Natural Language Processing
• Application Areas
• Few Terminologies
• Frequently applied Data Preparation Process for NLP

Objectives of the course
• To learn the fundamental concepts and techniques of natural language processing (NLP), including language models, word embeddings, part-of-speech tagging and parsing
• To learn the computational properties of natural languages and the commonly used algorithms for processing linguistic information
• To introduce basic mathematical models and methods used in NLP applications to formulate computational solutions
• To introduce research and development work in natural language processing
Course plan
Text books and Reference books
What is Natural Language Processing?
Example
Contd..
HAL: I can tell from the tone of your voice, Dave, that you're upset. Why don't you take a stress pill and get some rest.
[Dave has just drawn another sketch of Dr. Hunter.]
HAL: Can you hold it a bit closer?
[Dave does so.]
HAL: That's Dr. Hunter, isn't it?
Dave: Yes.

Contd..
To attain the levels of performance we attribute to HAL, we need to be able to define, model, acquire and manipulate:
• Knowledge of the world and of the agents in it,
• Text meaning,
• Intention
Main components of NLP
Natural language understanding
Morphological analysis
Lexical analysis
➢Syntax analysis checks the text for meaningfulness by comparing it to the rules of formal grammar.
➢Eg: "the girl go to the school", "Agra goes to the Poonam"

➢Semantic analysis draws the exact meaning, or the dictionary meaning, from the text. The text is checked for meaningfulness.
➢Eg: "Colourless blue idea"
Discourse integration
➢The meaning of any sentence depends upon the meaning of the sentence just before it.
➢Eg: "she wanted it"

Pragmatic analysis
➢Pragmatic analysis looks at how sentences are used in different situations and how use affects the interpretation of the sentence.
➢Eg: "Close the window", "She cuts banana with a pen"
Syntax, Semantics, and Pragmatics
Examples:
1. Green frogs have large noses. — pragmatically wrong
2. Green ideas have large noses. — semantically and pragmatically wrong
3. Large have green ideas nose. — all three wrong

Natural language generation
➢Producing text from computer data
➢Steps:
▪ Discourse planning
▪ Surface realizer
▪ Lexical selection
Contd..
Some of the tasks in NLP
➢Stop Words
• contribute little to the overall meaning
• Eg: "He is a good boy."

Contd..
NP – Noun Phrases
VP – Verb Phrases
PP – Prepositional Phrases
Evaluating Language Understanding Systems
Contingency Table

Anaconda and Jupyter are popular data science tools.
In Jupyter, console commands can be executed by prefixing the command with the '!' sign within a cell.
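For example, inside a notebook cell (a minimal illustration; the package name is only a placeholder):

    !pip install nltk   # runs the shell command "pip install nltk" from within the notebook
    !ls                 # lists the files in the notebook's working directory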
NLP Tools
Some commercial tools
• IBM Watson | A pioneer AI platform for businesses
• Google Cloud NLP API | Google technology applied to NLP
• Amazon Comprehend | An AWS service to get insights from text

Open Source Tools
• Stanford CoreNLP – a popular Java library built and maintained by Stanford University.
• SpaCy – one of the newest open-source Natural Language Processing libraries for Python.
• Gensim – a highly specialized Python library that largely deals with topic modeling tasks, using algorithms like Latent Dirichlet Allocation (LDA).
• AllenNLP – a powerful tool for prototyping with good text processing capabilities. It automates some of the tasks that are essential for almost every deep learning model and provides many modules, such as Seq2VecEncoder and Seq2SeqEncoder.
• Berkeley Neural Parser (Python) – a high-accuracy parser with models for 11 languages. It cracks the syntactic structure of sentences into nested sub-phrases, enabling easy extraction of information from syntactic constructs.
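As a quick illustration of one of these open-source tools, the sketch below (my own example, not from the slides) tags a sentence with spaCy, assuming the small English model en_core_web_sm has been installed:

    import spacy

    # Load a small pretrained English pipeline (install first with:
    #   python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("The official height of Mount Everest is 29029 feet.")
    for token in doc:
        # Each token carries its part-of-speech tag and lemma
        print(token.text, token.pos_, token.lemma_)

    # Named entities found in the sentence
    for ent in doc.ents:
        print(ent.text, ent.label_)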
• Q&A
• Suggestions / Feedback

References / further reading
• https://emerj.com/partner-content/nlp-current-applications-and-future-possibilities/
• https://venturebeat.com/2019/04/05/why-nlp-will-be-big-in-2019/
• https://www.nltk.org/book/
• https://www.coursera.org/learn/python-text-mining/home/week/1
• https://openai.com/api/
• https://analyticssteps.com/blogs/top-nlp-tools
• https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
• https://www.cstr.ed.ac.uk/emasters/course/natural_lang.html
• https://web.stanford.edu/class/cs224u/2016/materials/cs224u-2016-intro.pdf
• https://www.mygreatlearning.com/blog/trending-natural-language-processing-applications/
A model that computes either of these:
• the probability of a sentence, P(W), or
• the probability of an upcoming word, P(wn | w1, w2, …, wn−1),
is called a language model.
Simply, we can say that a language model learns to predict the probability of a sequence of words.

1. Machine Translation:
A machine translation system is used to translate one language into another, for example Chinese to English or German to English.
 我在吃 → "I eat lunch" / "I am eating" / "Me am eating" / "Eating am I"
 (A language model helps choose the most fluent of these candidate translations.)
• For example:
 P("its water is so transparent") =
 P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)

• P(most biologists and specialists believe that in fact the mythical unicorn horns derived from the narwhal)

• Simplifying assumption (the Markov assumption, after Andrei Markov):
 limit the history to a fixed number of words, N−1
 P(the | its water is so transparent that) ≈ P(the | that)
 or
 P(the | its water is so transparent that) ≈ P(the | transparent that)

• Example (unigram approximation):
 P(I want to eat Chinese food) ≈ P(I) P(want) P(to) P(eat) P(Chinese) P(food)
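To make the chain rule and the Markov (bigram) approximation concrete, here is a minimal sketch (my own code, not from the slides) that estimates bigram probabilities from a toy corpus with maximum likelihood and scores a short word sequence:

    from collections import Counter

    corpus = "its water is so transparent that the water is clear".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_bigram(w_prev, w):
        # MLE estimate: P(w | w_prev) = c(w_prev, w) / c(w_prev)
        return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

    def sentence_prob(words):
        # Bigram approximation of the chain rule: P(w1..wn) ~ product of P(wi | wi-1)
        prob = 1.0
        for w_prev, w in zip(words, words[1:]):
            prob *= p_bigram(w_prev, w)
        return prob

    print(sentence_prob("water is so transparent".split()))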
An example: Evaluation
– How well can we predict the next word?
  I always order pizza with cheese and ….
    mushrooms 0.1
    pepperoni 0.1
    anchovies 0.01
  The president of India is ……
– Extrinsic evaluation is time-consuming; it can take days or weeks
– Intrinsic evaluation (perplexity, next) is a bad approximation
  • unless the test data looks just like the training data
Perplexity
Example 1
Chain rule:
For bigrams:
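The perplexity formulas themselves did not survive extraction; for reference, the standard definitions (following Jurafsky & Martin) are:
\[
PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}
\]
and, applying the chain rule with the bigram approximation,
\[
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} .
\]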
• Bigrams with zero probability
 – mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can't divide by 0)!

Laplace (Add-1) smoothing:
• Pretend we saw each word one more time than we did
• Just add one to all the counts!
• MLE estimate:
  P_MLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)
• Add-1 estimate:
  P_Add-1(wi | wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + V)
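A small sketch (my own code, not from the slides) showing add-1 smoothing on bigram counts, where V is the vocabulary size:

    from collections import Counter

    tokens = "the girl go to the school the girl goes home".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(unigrams)  # vocabulary size

    def p_add1(w_prev, w):
        # Add-1 (Laplace) estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

    # An unseen bigram now gets a small non-zero probability
    print(p_add1("girl", "home"))   # unseen bigram
    print(p_add1("the", "girl"))    # bigram seen twice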
Stupid backoff (for large-scale language models):

 S(wi | w_{i−k+1}^{i−1}) = count(w_{i−k+1}^{i}) / count(w_{i−k+1}^{i−1})   if count(w_{i−k+1}^{i}) > 0
                         = 0.4 · S(wi | w_{i−k+2}^{i−1})                    otherwise

 S(wi) = count(wi) / N

• Efficiency
 – Efficient data structures like tries
 – Bloom filters: approximate language models
 – Store words as indexes, not strings
  • Use Huffman coding to fit large numbers of words into two bytes
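A minimal sketch (my own illustration, not the instructor's code) of this backoff score for trigrams, using the 0.4 factor shown above:

    from collections import Counter

    tokens = "i want to eat chinese food i want to eat lunch".split()
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    N = len(tokens)

    def stupid_backoff(w1, w2, w3, alpha=0.4):
        # Use the trigram relative frequency if seen; otherwise back off
        # to the bigram, then to the unigram, multiplying by alpha each time.
        if tri[(w1, w2, w3)] > 0:
            return tri[(w1, w2, w3)] / bi[(w1, w2)]
        if bi[(w2, w3)] > 0:
            return alpha * bi[(w2, w3)] / uni[w2]
        return alpha * alpha * uni[w3] / N

    print(stupid_backoff("want", "to", "eat"))    # seen trigram
    print(stupid_backoff("want", "to", "food"))   # unseen trigram and bigram: backs off to the unigram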
Exercise 1
Exercise 2
• Q&A
• Suggestions / Feedback
Time – 1.40 pm to 3.40 pm

These slides are prepared by the instructor, with grateful acknowledgement of James Allen and many others who made their course materials freely available online.

M11 Encoder-Decoder Models, Attention and Contextual Embedding
M12 Word sense disambiguation
M13 Semantic web ontology and Knowledge Graph
M14 Introduction to NLP Applications
Session contents
• Lexical Semantics
• Vector Semantics
• Word embeddings
 – Frequency based
 – Prediction based

Lexical Semantics
Question answering / Plagiarism detection
Q: "How tall is Mt. Everest?"
Ans: "The official height of Mount Everest is 29029 feet"

• N-gram or text classification methods we've seen so far
 – Words are just strings (or indices wi in a vocabulary list)
 – That's not very satisfactory!
A model of word meaning should allow us to draw inferences to address meaning-related tasks like question answering or dialogue.

Synonyms
• Two lexemes are synonyms if they can be successfully substituted for each other in all situations.
• There are no examples of perfect synonymy
 – Why should that be?
 – Even if many aspects of meaning are identical
 – Still may not preserve the acceptability based on notions of politeness, slang, register, genre, etc.
 – Example: water and H2O

• Consider the words big and large
 – Are they synonyms?
  – How big is that plane?
  – Would I be flying on a large or small plane?
 – How about here:
  – Miss Nelson, for instance, became a kind of big sister to Benjamin.
  – ?Miss Nelson, for instance, became a kind of large sister to Benjamin.
 – Why?
  – big has a sense that means being older, or grown up
  – large lacks this sense
Antonyms
• Senses that are opposites with respect to one feature of their meaning; otherwise, they are very similar!
 – dark / light
 – short / long
 – hot / cold
 – up / down

Word similarity
• Words with similar meanings.
  word1    word2       similarity
  vanish   disappear   9.8
  behave   obey        7.3
  belief   impression  5.95
  muscle   bone        3.65
  modest   flexible    0.98
Semantic frames
• A semantic frame is a set of words that denote perspectives or participants in a particular type of event.
 – E.g. verbs like buy (the event from the perspective of the buyer), sell (from the perspective of the seller), pay (focusing on the monetary aspect), or nouns like buyer.
 – Frames have semantic roles (like buyer, seller, goods, money), and words in a sentence can take on these roles.

Connotation
• Words have affective meanings
 – Positive connotations (happy)
 – Negative connotations (sad)
• Evaluation (sentiment!)
 – Positive evaluation (great, love)
 – Negative evaluation (terrible, hate)
• And you've also seen these:
 – …spinach sautéed with garlic over rice
 – Chard stems and leaves are delicious
 – Collard greens and other salty leafy greens
• Conclusion:
 – Ongchoi is a leafy green like spinach, chard, or collard greens
 – We could conclude this based on words like "leaves" and "delicious" and "sauteed"

Vector semantics is the standard way to represent word meaning in NLP, helping us model many of the aspects of word meaning.
• Vectors for representing words are called embeddings
• Each word = a vector (not just "good" or "worse")
• Defining meaning as a point in space based on distribution
• Similar words are "nearby in semantic space"

Consider sentiment analysis:
 – With words, a feature is a word identity
  • Feature 5: 'The previous word was "terrible"'
  • requires the exact same word to be in training and test
 – With embeddings:
  • Feature is a word vector
  • 'The previous word was vector [35,22,17…]'
  • Now in the test set we might see a similar vector [34,21,14]
  • We can generalize to similar but unseen words!!!
Each cell: count of word w in a document d.
Each document is a count vector in ℕ^|V| (a column below).
Two documents are similar if their vectors are similar.
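As a sketch (my own example, not from the slides) of these count vectors and of vector similarity, using scikit-learn's CountVectorizer and cosine similarity:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "battle of wit and fool",        # toy documents; each becomes a count vector
        "good wit makes a good fool",
        "the battle was long",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)   # term-document counts (documents x vocabulary)

    # Cosine similarity between document vectors: similar word counts -> similar documents
    print(cosine_similarity(X))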
Frequency in document vs. frequency in collection
▪ In addition to term frequency (the frequency of the term in the document) . . .
▪ . . . we also want to use the frequency of the term in the collection for weighting and ranking.

Desired weight for rare terms
▪ Rare terms are more informative than frequent terms.
▪ Consider a term in the query that is rare in the collection (e.g., ARACHNOCENTRIC).
▪ A document containing this term is very likely to be relevant.
▪ → We want high weights for rare terms like ARACHNOCENTRIC.

Desired weight for frequent terms
▪ Frequent terms are less informative than rare terms.
▪ Consider a term in the query that is frequent in the collection (e.g., GOOD, INCREASE, LINE).
▪ A document containing this term is more likely to be relevant than a document that doesn't . . .
▪ . . . but words like GOOD, INCREASE and LINE are not sure indicators of relevance.
▪ → For frequent terms like GOOD, INCREASE and LINE, we want positive weights . . .
▪ . . . but lower weights than for rare terms.

Document frequency
▪ We want high weights for rare terms like ARACHNOCENTRIC.
▪ We want low (positive) weights for frequent words like GOOD, INCREASE and LINE.
▪ We will use document frequency to factor this into computing the matching score.
▪ The document frequency is the number of documents in the collection that the term occurs in.
Effect of idf on ranking
▪ idf affects the ranking of documents for queries with at least two terms.
▪ For example, in the query "arachnocentric line", idf weighting increases the relative weight of ARACHNOCENTRIC and decreases the relative weight of LINE.
▪ idf has little effect on ranking for one-term queries.

tf-idf weighting
▪ The tf-idf weight of a term is the product of its tf weight and its idf weight.
▪ tf-weight
▪ idf-weight
▪ Best known weighting scheme in information retrieval
▪ Note: the "-" in tf-idf is a hyphen, not a minus sign!
▪ Alternative names: tf.idf, tf x idf
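The tf-weight and idf-weight formulas themselves are missing from the extracted slide; the standard log-frequency definitions used with this weighting scheme (following Manning et al. and Jurafsky & Martin) are:
\[
w_{t,d} = \underbrace{(1 + \log_{10} \mathrm{tf}_{t,d})}_{\text{tf-weight}} \times \underbrace{\log_{10}\frac{N}{\mathrm{df}_t}}_{\text{idf-weight}}
\]
where \(\mathrm{tf}_{t,d}\) is the count of term \(t\) in document \(d\), \(\mathrm{df}_t\) is the number of documents containing \(t\), \(N\) is the number of documents in the collection, and the tf-weight is taken as 0 when \(\mathrm{tf}_{t,d}=0\).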
Thank You… ☺
http://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture01-wordvecs1.pdf
[Mikolov et al., 2013, Linguistic regularities in continuous space word representations] Andrew Ng
• Sim(w, c) ≈ w ∙ c
• To turn this into a probability, we'll use the sigmoid from logistic regression:
• This is for one context word, but we have lots of context words. We'll assume independence and just multiply them:

• Given the set of positive and negative training instances, and an initial set of embedding vectors,
• the goal of learning is to adjust those word vectors such that we:
 – Maximize the similarity of the target word, context word pairs (w, c_pos) drawn from the positive data
 – Minimize the similarity of the (w, c_neg) pairs drawn from the negative data.
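The probability formulas referred to above are not in the extracted text; in the standard skip-gram with negative sampling formulation (Jurafsky & Martin, ch. 6) they are:
\[
P(+\mid w, c) = \sigma(c \cdot w) = \frac{1}{1 + e^{-c \cdot w}}, \qquad
P(-\mid w, c) = 1 - \sigma(c \cdot w)
\]
and, assuming independence over the \(L\) words in the context window,
\[
P(+\mid w, c_{1:L}) = \prod_{i=1}^{L} \sigma(c_i \cdot w).
\]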
• Maximize the similarity of the target with the actual context words, and minimize the similarity of the target with the k negative sampled non-neighbor words.
• How to learn?
 – Stochastic gradient descent!
• We'll adjust the word weights to
 – make the positive pairs more likely
 – and the negative pairs less likely,
 – over the entire training set.

Update equation in SGD
The derivatives of the loss function
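The update equation and derivatives themselves did not survive extraction; the standard forms for skip-gram with negative sampling (Jurafsky & Martin, ch. 6) are:
\[
L_{CE} = -\Big[\log \sigma(c_{pos}\cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i}\cdot w)\Big]
\]
\[
\frac{\partial L_{CE}}{\partial c_{pos}} = [\sigma(c_{pos}\cdot w)-1]\,w, \qquad
\frac{\partial L_{CE}}{\partial c_{neg_i}} = \sigma(c_{neg_i}\cdot w)\,w,
\]
\[
\frac{\partial L_{CE}}{\partial w} = [\sigma(c_{pos}\cdot w)-1]\,c_{pos} + \sum_{i=1}^{k}\sigma(c_{neg_i}\cdot w)\,c_{neg_i},
\]
with the SGD update \(\theta^{t+1} = \theta^{t} - \eta \,\nabla_\theta L_{CE}\).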
1. Corpus:
 "cats like milk"
 "dogs like bones"
2. V = ["cats", "like", "milk", "dogs", "bones"]
3. Initialize embedding matrices (assumed dimension = 2)
4. Initialize two matrices W and C with random values:
 W = embeddings of [cats, like, milk, dogs, bones]
https://jalammar.github.io/illustrated-word2vec/
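A minimal numpy sketch (my own illustration, not the instructor's code) of one skip-gram negative-sampling update on this toy vocabulary, with 2-dimensional embeddings as assumed above:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["cats", "like", "milk", "dogs", "bones"]
    idx = {w: i for i, w in enumerate(vocab)}
    dim = 2

    W = rng.normal(scale=0.1, size=(len(vocab), dim))  # target-word embeddings
    C = rng.normal(scale=0.1, size=(len(vocab), dim))  # context-word embeddings

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_update(target, pos, negs, lr=0.1):
        w = W[idx[target]]
        grad_w = np.zeros(dim)
        for c_word, t in [(pos, 1)] + [(n, 0) for n in negs]:
            c = C[idx[c_word]]
            err = sigmoid(np.dot(c, w)) - t      # prediction error (sigma - t)
            grad_w += err * c                    # accumulate dL/dw
            C[idx[c_word]] -= lr * err * w       # update the context embedding
        W[idx[target]] -= lr * grad_w            # update the target embedding

    # ("cats", "like") is a positive pair from the corpus; "bones" is a sampled negative
    sgns_update("cats", "like", ["bones"])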
Training model / Model architecture
Weight matrix

Corpus:
 The product is really good
 The product is wonderful
 The product is awful

Step 1: Combine the sentences
 "The product is really good The product is wonderful The product is awful"

Step 2: Select the window size

Step 3: Get a one-hot encoding for each word

Step 4: Build the training data
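A short sketch (my own code, not from the slides) of steps 2–4: sliding a context window over the combined text and emitting (target, context) training pairs, with a one-hot vector for each word:

    import numpy as np

    text = ("The product is really good The product is wonderful "
            "The product is awful").lower().split()
    vocab = sorted(set(text))
    idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(len(vocab))
        v[idx[word]] = 1.0
        return v

    window = 2  # Step 2: window size
    pairs = []  # Step 4: (target, context) training pairs
    for i, target in enumerate(text):
        for j in range(max(0, i - window), min(len(text), i + window + 1)):
            if j != i:
                pairs.append((target, text[j]))

    print(pairs[:5])
    print(one_hot("good"))  # Step 3: one-hot encoding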
Thank you
Natural Language Processing
DSECL ZG530 – Session 6: POS tagging
Prof. Vijayalakshmi
Date – 29-06-2024
Time – 1.40 pm to 3.40 pm
BITS Pilani, Pilani Campus

These slides are prepared by the instructor, with grateful acknowledgement of James Allen and many others who made their course materials freely available online.
❖ List of all possible tags for each word in a sentence
❖ Eg:

❖ First step in many applications
❖ POS tells us a lot about a word
❖ Pronunciation depends on POS
❖ To find named entities
❖ Stemming
❖ Background
 • From the early 90s
 • Developed at the University of Pennsylvania
 • (Marcus, Santorini and Marcinkiewicz 1993)
❖ Size
 • 40,000 training sentences
 • 2,400 test sentences
❖ Genre
 • Mostly Wall Street Journal news stories and some spoken conversations
❖ Importance
 • Helped launch modern automatic parsing methods.
• Using Penn Treebank tags, tag the following sentence from the Brown Corpus:
Contd..
❖ Markov Property
 • Given that today is Sunny, what's the probability that tomorrow is Sunny and the next day Rainy?
 • Assume that yesterday's weather was Rainy, and today is Cloudy; what is the probability that tomorrow will be Sunny?
PROB(T1,…,Tn | w1,…,wn)
 = PROB(w1,…,wn | T1,…,Tn) * PROB(T1,…,Tn) / PROB(w1,…,wn)

So we want to find the sequence of tags that maximizes
 PROB(T1,…,Tn) * PROB(w1,…,wn | T1,…,Tn)

❖ For tags – use bigram probabilities:
 PROB(T1,…,Tn) ≈ Πi=1,n PROB(Ti | Ti−1)
 PROB(ART N V N) ≈ PROB(ART | Φ) * PROB(N | ART) * PROB(V | N) * PROB(N | V)

• P(like | V) = .10
Example 1 – Some data on "race"
Disambiguating "to race tomorrow"
Transition probabilities
• Revision – Neural Network (forward and backward pass)
• Application – Sentiment Analysis
 – Using logistic regression
 – Using a neural network

What is a neural network?
Similar to the human brain, which has neurons interconnected with one another, artificial neural networks also have neurons that are interconnected with one another in the various layers of the network. These neurons are known as nodes.

Example review:
"It's hokey. There are virtually no surprises, and the writing is second-rate. So why was it so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll do the same to you."
Sentiment Features

Neural Network – How it works
Intuition: Training a 2-layer network
• For every training tuple (x, y)
 – Run forward computation to find our estimate ŷ
 – Run backward computation to update weights:
  • For every output node
   – Compute loss L between true y and the estimated ŷ
   – For every weight w from hidden layer to the output layer
    » Update the weight
  • For every hidden node
   – Assess how much blame it deserves for the current answer
   – For every weight w from input layer to the hidden layer
    » Update the weight

Computational unit (figure): input layer x1, x2, x3 and +1 (bias); weights w1, w2, w3 and bias b; weighted sum z = Σ wi·xi + b; non-linear activation function σ; output value y = a = σ(z).
Sigmoid: y = σ(z) = 1 / (1 + e^−z)
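A compact numpy sketch (my own illustration, not the instructor's code) of this training loop for a 2-layer network with a sigmoid hidden layer, a sigmoid output, and cross-entropy loss:

    import numpy as np

    rng = np.random.default_rng(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Toy data: 4 examples, 3 features, binary labels
    X = rng.normal(size=(4, 3))
    y = np.array([0.0, 1.0, 1.0, 0.0])

    # Weights and biases for the hidden layer (3 -> 4 units) and output layer (4 -> 1)
    W1, b1 = rng.normal(scale=0.5, size=(3, 4)), np.zeros(4)
    W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
    lr = 0.5

    for epoch in range(1000):
        # Forward pass
        h = sigmoid(X @ W1 + b1)                 # hidden activations
        y_hat = sigmoid(h @ W2 + b2).ravel()     # output estimate

        # Backward pass (gradients of the cross-entropy loss w.r.t. the weights)
        delta_out = (y_hat - y).reshape(-1, 1)           # error at the output node
        grad_W2 = h.T @ delta_out
        grad_b2 = delta_out.sum(axis=0)
        delta_hidden = (delta_out @ W2.T) * h * (1 - h)  # "blame" assigned to hidden nodes
        grad_W1 = X.T @ delta_hidden
        grad_b1 = delta_hidden.sum(axis=0)

        # Update the weights in the opposite direction of the gradient
        W2 -= lr * grad_W2; b2 -= lr * grad_b2
        W1 -= lr * grad_W1; b1 -= lr * grad_b1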
Reminder: Loss function for binary logistic regression
• A measure of how far off the current answer is from the right answer
• Cross-entropy loss for logistic regression:

Reminder: gradient descent for weight updates
• Use the derivative of the loss function with respect to the weights, (d/dw) L(f(x; w), y)
• to tell us how to adjust the weights for each training item
 – Move them in the opposite direction of the gradient:
   w(t+1) = w(t) − η (d/dw) L(f(x; w), y)
 – For logistic regression:
Backprop
• For training, we need the derivative of the loss with respect to each
weight in every layer of the network
• But the loss is computed only at the very end of the network!
• Solution: error backpropagation (Rumelhart, Hinton, Williams, 1986)
• Backprop is a special case of backward differentiation
• Which relies on computation graphs.
Forward Propagation
• E = −log(P(w_t | w_c))
• w_t = target word
• w_c = context word

Contd..
Worked example: the current positive (context) word is "stark"; t_j = 1 for "stark" and t_j = 0 for the negative words ("pimples", "zebra", "idiot").
Input word: "ned", embedding [−0.572, −0.588, −0.501].

 context word | sigmoid(c·w) | t | error (σ − t) | context embedding        | C·(σ − t)
 stark        | 0.624        | 1 | −0.376        | [0.116, 0.723, −0.689]   | [−0.04362, −0.271848, 0.259064]
 pimples      | 0.553        | 0 | 0.553         | [−0.94, 0.601, 0.146]    | [−0.51982, 0.332353, 0.080738]
 zebra        | 0.534        | 0 | 0.534         | [−0.622, 0.811, 0.64]    | [−0.33215, 0.433074, 0.34176]
 idiot        | 0.467        | 0 | 0.467         | [−0.077, −0.375, −0.056] | [−0.03596, −0.175125, −0.02615]

Derivative of the loss w.r.t. the input word embedding (sum of the last column): [−0.93154, 0.318454, 0.65541]

Input word embedding update
Derivative of the loss w.r.t. the context word embeddings

• https://aegis4048.github.io/optimize_computational_efficiency_of_skip-gram_with_negative_sampling#pred_error
• Ch. 6: Speech and Language Processing, Daniel Jurafsky and James H. Martin
Thank you
• Each p(w_i | w_{i−4}, w_{i−3}, w_{i−2}, w_{i−1}) may not have enough statistics to estimate
• so we back off to p(w_i | w_{i−3}, w_{i−2}, w_{i−1}), p(w_i | w_{i−2}, w_{i−1}), etc., all the way to p(w_i)
• Learn a distributed representation of words
• What is a distributed representation?
 ➢ Also known as an embedding.
• Instead of treating words as tokens, exploit semantic similarity
 – Learn a distributed representation of words that will allow sentences like these to be seen as similar:
  The cat is walking in the bedroom.
  A dog was walking in the room.
  The cat is running in a room.
  The dog was running in the bedroom.
  etc.
 – Use a neural net to represent the conditional probability function
  P(w_t | w_{t−n}, w_{t−n+1}, …, w_{t−1})
 – Learn the word representation and the probability function simultaneously

Equations:
• The model parameters are θ = (E, W, U, b).
• This gradient can be computed in any standard neural network framework, which will then backpropagate through θ = (E, W, U, b).
Problem 1: Observation Likelihood
• The probability of an observation sequence given a model and state sequence
• Evaluation problem

Problem 2: Decoding
• Most probable state sequence given a model and an observation sequence
• Decoding problem

A naive approach (for the ice-cream example):
▪ 1. Consider all possible 3-day weather sequences [H, H, H], [H, H, C], [H, C, H], …
▪ 2. For each 3-day weather sequence, consider the probability of the ice cream consumption sequence [0, 1, 2]
▪ 3. Pick out the sequence that has the highest probability from step #2.
▪ Not efficient
▪ Viterbi algorithm

Learning problem:
▪ Find:
 – the start probabilities
 – the transition probabilities
 – the emission probabilities
▪ Forward–backward algorithm
Example 1 – Viterbi Algorithm (ice-cream HMM)

Transition probabilities A:
        Hot    Cold
 <S>    0.8    0.2
 Hot    0.6    0.4
 Cold   0.5    0.5

Emission probabilities B:
        1      2      3
 Hot    .2     .4     .4
 Cold   .5     .4     .1

Step 1 (first observation = 3):
 V1(H) = P(H) * P(3|H) = 0.8 * 0.4 = 0.32

Step 2 (second observation = 1):
 P(C) * P(H|C) * P(1|H) = 0.02 * 0.5 * 0.2 = 0.002   (0.02 = V1(C) = 0.2 * 0.1)
 P(H) * P(H|H) * P(1|H) = 0.32 * 0.6 * 0.2 = 0.0384
 → V2(H) = max(0.002, 0.0384) = 0.0384
Hidden Markov Model
Example 1 – Viterbi Algorithm – Termination through Back Trace
(Transition probabilities A and emission probabilities B as above.)

Step 3:
 P(C) * P(H|C) * P(3|H) = 0.064 * 0.5 * 0.2 = 0.0064
 P(H) * P(H|H) * P(3|H) = 0.0384 * 0.6 * 0.2 = 0.0046
Lattice values shown: 0.32, 0.0384; states H H H
Best sequence: H H H

Source Credit: Speech and Language Processing – Jurafsky and Martin
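To make the Viterbi recursion concrete, here is a short sketch (my own code, not the instructor's) that runs Viterbi on the ice-cream HMM above, using the A and B tables and an assumed observation sequence [3, 1, 3]:

    # States, start/transition probabilities A and emission probabilities B
    states = ["Hot", "Cold"]
    start = {"Hot": 0.8, "Cold": 0.2}
    A = {"Hot": {"Hot": 0.6, "Cold": 0.4}, "Cold": {"Hot": 0.5, "Cold": 0.5}}
    B = {"Hot": {1: 0.2, 2: 0.4, 3: 0.4}, "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}

    def viterbi(obs):
        # V[t][s] = probability of the best path ending in state s at time t
        V = [{s: start[s] * B[s][obs[0]] for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            V.append({}); back.append({})
            for s in states:
                prev, p = max(((r, V[t - 1][r] * A[r][s]) for r in states),
                              key=lambda x: x[1])
                V[t][s] = p * B[s][obs[t]]
                back[t][s] = prev
        # Termination: pick the best final state, then trace the back pointers
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.insert(0, back[t][path[0]])
        return path, V[-1][last]

    print(viterbi([3, 1, 3]))   # best weather sequence and its probability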
Example 2 – "the doctor is in" – Viterbi Algorithm

Transition probabilities A (row = previous tag, column = next tag):
        Noun   Verb   Det    Prep   Adv    STOP
 <S>    .3     .1     .3     .2     .1     0
 Noun   .2     .4     .01    .3     .04    .05
 Verb   .3     .05    .3     .2     .1     .05
 Det    .9     .01    .01    .01    .07    0
 Prep   .4     .05    .4     .1     .05    0
 Adv    .1     .5     .1     .1     .1     .1

Emission probabilities B:
        a      cat    doctor  in     is     the    very
 Noun   0      .5     .4      0      .1     0      0
 Verb   0      0      .1      0      .9     0      0
 Det    .3     0      0       0      0      .7     0
 Prep   0      0      0       1.0    0      0      0
 Adv    0      0      0       .1     0      0      .9

First Viterbi column for "the doctor is in" (w1 = the, w2 = doctor, w3 = is, w4 = in):
 V(Noun, the) = P(Noun|<S>) P(the|Noun) = .3 × 0 = 0
 V(Verb, the) = P(Verb|<S>) P(the|Verb) = .1 × 0 = 0
 V(Det, the)  = P(Det|<S>)  P(the|Det)  = .3 × .7 = .21
 V(Prep, the) = P(Prep|<S>) P(the|Prep) = .2 × 0 = 0
 V(Adv, the)  = P(Adv|<S>)  P(the|Adv)  = .1 × 0 = 0

Example – "I want to race" (tags VB, TO, NN, PPSS) – Viterbi Algorithm

Transition probabilities A:
        VB      TO       NN       PPSS
 <S>    .019    .0043    .041     .067
 VB     .0038   .035     .047     .0070
 TO     .83     0        .00047   0
 NN     .0040   .016     .087     .0045
 PPSS   .23     .00079   .0012    .00014

Emission probabilities B:
        I       want     to      race
 VB     0       .0093    0       .00012
 TO     0       0        .99     0
 NN     0       .00005   0       .00057
 PPSS   .37     0        0       0

Viterbi lattice values shown on the slide include 0.0248, 0.17 × 10⁻⁴, 0.16 × 10⁻⁷, 0.17 × 10⁻⁸ and 0.46 × 10⁻¹¹.
Hidden Markov Model
POS Tagging – Example 3 – Viterbi Algorithm – Practice
Example 4 – Naïve Search

Observation Likelihood = 0.0168
(Transition probabilities A and emission probabilities B as in Example 1.)

 P(C) * P(H|C) * P(R|H) = 0.069 * 0.5 * 0.2 = 0.0069
 P(H) * P(H|H) * P(R|H) = 0.0404 * 0.6 * 0.2 = 0.0048
Lattice values shown: 0.32, 0.0404, 0.0117; states H H H C

Source Credit: Speech and Language Processing – Jurafsky and Martin
• One problem with the HMM models as presented is that they are exclusively run left-to-right.
• The Viterbi algorithm still allows present decisions to be influenced indirectly by future decisions.
• It would help even more if a decision about word wi could directly use information about future tags ti+1 and ti+2.

• Any sequence model can be turned into a bidirectional model by using multiple passes.
• For example, the first pass would use only part-of-speech features from already-disambiguated words on the left. In the second pass, tags for all words, including those on the right, can be used.
• Alternately, the tagger can be run twice, once left-to-right and once right-to-left.
• In Viterbi decoding, the classifier chooses the higher scoring of the two sequences (left-to-right or right-to-left).
• Modern taggers are generally run bidirectionally.

• The problem with the greedy algorithm is that by making a hard decision on each word before moving on to the next word, the classifier can't use evidence from future decisions.
• Although the greedy algorithm is very fast, and occasionally has sufficient accuracy to be useful, in general the hard decision causes too great a drop in performance, and we don't use it.

• Instead, we find the sequence of part-of-speech tags that is optimal for the whole sentence, using the Viterbi value at time t for state j:
 – in an HMM, and
 – with an MEMM, using the Viterbi algorithm just as with the HMM: Viterbi finds the sequence of part-of-speech tags that is optimal for the whole sentence.
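The recurrences referred to above did not survive extraction; the standard Viterbi definitions (Jurafsky & Martin, ch. 8) are, for an HMM and an MEMM respectively:
\[
v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad\text{(HMM)}
\]
\[
v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, P(s_j \mid s_i, o_t) \qquad\text{(MEMM)}
\]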
Question paper pattern:
Q1. Introduction – 3 marks; N-gram LM – 4 marks
Q2. Neural LM – 3 marks; Word embeddings – 4/5 marks
Q3. Vector semantics – 6 marks
Q4. POS tagging – 5 marks

Open Source POS Taggers
• Stanford POS Tagger python bindings – Java based, but can be used in Python, though difficult to install
• Flair – POS tagger available for Python
• NLTK – implementation is very precise, around 97%, and quite fast
• SpaCy
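As a quick illustration of one of these taggers, a minimal NLTK sketch (my own example; the resource names for the one-time downloads may vary slightly across NLTK versions):

    import nltk

    # One-time downloads of the tokenizer and the averaged-perceptron tagger
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("The girl goes to the school")
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('girl', 'NN'), ('goes', 'VBZ'), ('to', 'TO'),
    #       ('the', 'DT'), ('school', 'NN')]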