CSE440: NATURAL LANGUAGE
PROCESSING II
Farig Sadeque
Assistant Professor
Department of Computer Science and Engineering
BRAC University
Lecture 5: Sequence Learning
Outline
- Sequence tagging (SLP 8)
- Markov models (SLP Appendix A)
- Recurrent neural networks (SLP 9)
Sequences are common in languages
- Speech recognition
- Group acoustic signal into phonemes
- Group phonemes into words
- Natural language processing
- Part of speech tagging
- our running example
- Named entity recognition
- Information extraction
- Question answering
Parts-of-speech tagging
Why not just make a big table?
- badger is a NOUN, trip is a VERB, etc.
Because part-of-speech changes with the surrounding sequence:
- I saw a badger in the zoo.
- Don’t badger me about it!
- I saw him trip on his shoelaces.
- She said her trip to Greece was amazing.
How big is this ambiguity issue?
Part-of-speech ambiguity
Most words in the English vocabulary are unambiguous.
Part-of-speech ambiguity
But most word tokens in running text are ambiguous! The ambiguous word types tend to be the most frequent words, so they account for most of the tokens.
A big table is still a good start
- Only 30-40% of words in running text are unambiguous.
- What if we build a table for all words and, for ambiguous words, store the most commonly used tag for that word?
- This is called the most frequent tag baseline:
- assign each token the tag it appeared with most frequently in the training data.
- 92.34% accurate on WSJ corpus.
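A minimal sketch of this baseline (the function names and toy data are illustrative, not from the lecture):

    from collections import Counter, defaultdict

    def train_most_frequent_tag(tagged_sentences):
        """Build a word -> most frequent tag lexicon from (word, tag) training data."""
        word_tag_counts = defaultdict(Counter)
        tag_counts = Counter()
        for sentence in tagged_sentences:
            for word, tag in sentence:
                word_tag_counts[word][tag] += 1
                tag_counts[tag] += 1
        lexicon = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
        # Fall back to the globally most frequent tag for unseen words.
        default_tag = tag_counts.most_common(1)[0][0]
        return lexicon, default_tag

    def tag_sentence(words, lexicon, default_tag):
        return [(w, lexicon.get(w, default_tag)) for w in words]

    # Toy usage:
    train = [[("Do", "VB"), ("not", "RB"), ("badger", "VB"), ("me", "PRP")],
             [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("badger", "NN")],
             [("the", "DT"), ("badger", "NN")]]
    lexicon, default_tag = train_most_frequent_tag(train)
    print(tag_sentence(["badger", "me"], lexicon, default_tag))
    # -> [('badger', 'NN'), ('me', 'PRP')], even though "badger" is a verb here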
A big table is still a good start
- What’s the tag for cut? Counts in the training data:
- cut/NN: 10
- cut/VB: 25
- cut/VBD: 13
- cut/VBN: 7
- The most frequent tag baseline always tags cut as VB (25 of 55 occurrences), even when it is actually a noun or a past-tense/past-participle form.
Learning sequence taggers
- To improve over the most frequent tag baseline, we should take advantage of
the sequence.
- Some options we will cover:
- Hidden Markov models
- Parameters estimated by counting (like naïve Bayes)
- Maximum entropy Markov models
- Parameters estimated by logistic regression
- Recurrent neural networks
Hidden Markov Models
- Maximum entropy Markov models (MEMM)
- (Visible) Markov models for PoS tagging
- Training by counting
- Smoothing probabilities
- Handling unknown words
- Viterbi algorithm
Why POS Tagging Must Model Sequences
Our running example:
Secretariat is expected to race tomorrow.
Secretariat is ________
Race is ________
To understand context, we will predict all tags together.
Approach 0: Rule-based baseline
- Assign each word a list of potential POS labels using the dictionary
- Winnow down the list to a single POS label for each word using lists of
hand-written disambiguation rules
You can learn these rules: see Transformation-based Learning: https://dl.acm.org/citation.cfm?id=218367
Approach 1: Maximum entropy Markov models
- Maximum entropy = logistic regression
- Markov models
- Introduced by Andrey Markov
- Limited horizon: the next state depends only on the current state (the Markov assumption)
- How would you implement sequence models in the logistic regression
algorithm that we know?
- Let’s assume we scan the text left to right.
Approach 1 continued
- Add the previously seen tags as features!
- Use gold tags in training
- Use predicted tags in testing
- Other common features
- Words, lemmas in a window [-k, +k]
- Casing info, prefixes, suffixes of these words
- Bigrams containing the current word
See also:
https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/sequences/PartOfSpeechTagger.scala
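A rough sketch of the kind of feature function such a left-to-right MEMM could use (the feature names are illustrative and not taken from the linked tagger):

    def memm_features(words, i, prev_tags, k=2):
        """Features for predicting the tag of words[i]: neighboring words in a
        [-k, +k] window, casing, prefix/suffix, the previous tag, and a bigram.
        prev_tags holds gold tags in training and predicted tags at test time."""
        w = words[i]
        feats = {
            "word=" + w.lower(): 1,
            "is_capitalized=" + str(w[:1].isupper()): 1,
            "prefix3=" + w[:3].lower(): 1,
            "suffix3=" + w[-3:].lower(): 1,
        }
        for j in range(max(0, i - k), min(len(words), i + k + 1)):
            if j != i:
                feats[f"word[{j - i}]=" + words[j].lower()] = 1
        if i > 0:
            feats["prev_tag=" + prev_tags[i - 1]] = 1
            feats["bigram=" + words[i - 1].lower() + "_" + w.lower()] = 1
        return feats

    # Example: features for "race" in the running example
    words = "Secretariat is expected to race tomorrow".split()
    print(memm_features(words, 4, ["NNP", "VBZ", "VBN", "TO"]))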
Approach 1: bidirectional MEMMs
- You can stack MEMMs that traverse the text in opposite directions:
- Left-to-right direction (same as before)
- Right-to-left: uses the prediction(s) of the above system as features!
- What is the problem with the predictions of the left-to-right model here?
- Many state-of-the-art taggers use this approach: CoreNLP, processors,
SVMTool
Approach 2: Hidden (visible) Markov Models
- Let’s put the probability theory we covered in the previous lecture to use!
- The resulting approach is called (visible) Markov model
- “Visible” to distinguish it from the hidden Markov models, where the tags are
unknown
- Imagine implementing a POS tagger for an unstudied language without POS annotations
Approach 2: Hidden (visible) Markov Models
• A sentence contains n words
• t1…tn – an assignment of POS tags to this sentence
• w1…wn – the words in this sentence
• t̂1…tn – the estimate of the optimal tag assignment
Let’s formalize this
We have four probabilities: likelihood, prior, posterior, and marginal likelihood.
- Prior: the probability distribution representing our knowledge or uncertainty before observing the data; here, P(t1…tn)
- Likelihood: the probability of the observed data given the hypothesis; here, P(w1…wn|t1…tn)
- Posterior: the conditional probability of the hypothesis after observing the data; here, P(t1…tn|w1…wn)
- Marginal likelihood: the likelihood summed (integrated) over all hypotheses; here, P(w1…wn). It is the same for every tag sequence, so it does not affect inference.
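Putting these together with Bayes' rule (the standard derivation, in the notation above):

    t̂1…tn = argmax over t1…tn of P(t1…tn | w1…wn)                        (posterior)
           = argmax over t1…tn of P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn)
           = argmax over t1…tn of P(w1…wn | t1…tn) P(t1…tn)              (likelihood × prior)

The marginal likelihood P(w1…wn) is the same for every candidate tag sequence, so dropping it does not change the argmax.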
Three Approximations
- Words are independent of the words around them
- Words depend only on their POS tags, not on the neighboring POS tags
- A tag is dependent only on the previous tag
Replace in the original equation:

    t̂1…tn ≈ argmax over t1…tn of ∏ i=1..n P(wi|ti) P(ti|ti-1)

The P(wi|ti) terms are the word likelihoods; the P(ti|ti-1) terms are the tag transition probabilities.
Computing Tag Transition Probabilities
In the Brown corpus (1M words)
- DT occurs 116,454 times
- DT is followed by NN 56,509 times
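The maximum-likelihood estimate from these counts:

    P(NN|DT) = C(DT, NN) / C(DT) = 56,509 / 116,454 ≈ 0.49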
Computing Word Likelihoods
In the Brown corpus (1M words)
- VBZ occurs 21,627 times
- VBZ is the tag for “is” 10,073 times
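The maximum-likelihood estimate from these counts:

    P(is|VBZ) = C(VBZ, is) / C(VBZ) = 10,073 / 21,627 ≈ 0.47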
Example
Let’s see why VB is preferred in the first case
The first tag transition
- P(NN|TO) = 0.00047
- P(VB|TO) = 0.83
The word likelihood for “race”
- P(race|NN) = 0.00057
- P(race|VB) = 0.00012
The second tag transition
- P(NR|VB) = 0.0027
- P(NR|NN) = 0.0012
Example
P(VB|TO) P(NR|VB) P(race|VB) = 0.83 × 0.0027 × 0.00012 ≈ 0.00000027
P(NN|TO) P(NR|NN) P(race|NN) = 0.00047 × 0.0012 × 0.00057 ≈ 0.00000000032
VB is more likely than NN, even though “race” appears more commonly as a noun!
Training/Testing an HMM
Just like with any machine learning algorithm, there are two steps to building and using an HMM:
- Training:
- Estimating p(ti|ti-1) and p(wi|ti)
- Testing (predicting):
- Estimating the best sequence of tags for a sentence (a sequence of words)
Training: Two Types of Probabilities
A: transition probabilities
- Used to compute the prior probabilities (probability of a tag)
- Often called tag transition probabilities
B: observation likelihoods
- Used to compute the likelihood probabilities (probability of a word given tag)
- Often called word likelihoods
Testing: Viterbi Algorithm
Viterbi algorithm
- Computes the argmax efficiently
- Example of dynamic programming
What is a viterbi?
Illustration of Search Space
This is called a trellis:
- One row for each state (tag)
- One column for each observation (word)
Viterbi Algorithm
Input
- State (or tag) transition probabilities (A)
- Observation (or word) likelihoods (B)
- An observation sequence O
Output
- Most probable state sequence Q together with its probability
Both A and B are matrices with probabilities
Example of A and B matrices
A: the rows are labeled with the conditioning event (the previous tag), e.g., P(PPSS|VB) = 0.0070
B: same layout as A, with the rows again labeled with the conditioning event (the tag), e.g., P(want|NN) = 0.000054
Example Trace
Summary of Viterbi Algorithm
• vt-1(i) – the Viterbi path probability of being in state i at the previous time step t − 1 (i.e., at the previous word)
• aij – the transition probability from previous state qi (i.e., the previous word having POS tag i) to current state qj (i.e., the current word having POS tag j)
• bj(ot) – the state observation likelihood of the observation symbol ot (i.e., the word at position t) given the current state j (i.e., POS tag j)
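These quantities combine into the standard Viterbi recurrence:

    vt(j) = max over i = 1..N of vt-1(i) · aij · bj(ot)

Below is a minimal NumPy sketch of this recurrence (an illustration, not the lecture's reference implementation); it assumes tags and words have already been mapped to integer indices, and that A, B, and the initial distribution pi hold the probabilities described above.

    import numpy as np

    def viterbi(A, B, pi, observations):
        """A[i, j]: transition probability from state i to state j.
        B[j, o]: likelihood of observation o given state j.
        pi[j]: initial probability of state j.
        observations: sequence of integer observation indices.
        Returns the most probable state sequence and its probability."""
        N = A.shape[0]               # number of states (tags)
        T = len(observations)        # number of observations (words)
        v = np.zeros((T, N))         # Viterbi path probabilities
        back = np.zeros((T, N), dtype=int)

        v[0] = pi * B[:, observations[0]]                  # initialization
        for t in range(1, T):
            for j in range(N):
                scores = v[t - 1] * A[:, j] * B[j, observations[t]]
                back[t, j] = np.argmax(scores)             # best previous state
                v[t, j] = scores[back[t, j]]               # v_t(j) = max_i v_{t-1}(i) a_ij b_j(o_t)

        # Follow the backpointers from the best final state.
        path = [int(np.argmax(v[T - 1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        path.reverse()
        return path, float(v[T - 1].max())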
Extending the HMM Algorithm to Trigrams
This is pretty limiting for POS tagging
Let’s extend it to trigrams of tags!
This is better
• tn+1 – end of sentence tag
• We also need virtual tags, t0 and t-1, to be set to the beginning of sentence value.
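With the trigram assumption the objective becomes (the standard trigram HMM formulation):

    t̂1…tn = argmax over t1…tn of [ ∏ i=1..n P(ti|ti-1, ti-2) P(wi|ti) ] P(tn+1|tn)

with t0 and t-1 fixed to the beginning-of-sentence value and tn+1 the end-of-sentence tag.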
TnT
- This is what the TnT (Trigrams’n’Tags) tagger does
- Probably the fastest POS tagger in the world
- Not the best, but pretty close (96% acc)
- http://www.coli.uni-saarland.de/~thorsten/tnt/
Problems with TnT
- The trigram counts needed to estimate P(ti|ti-1ti-2) are very sparse!
Backoff model: linear interpolation
P(ti|ti-1ti-2) = λ3 P̂(ti|ti-1ti-2) + λ2 P̂(ti|ti-1) + λ1 P̂(ti)
λ1 + λ2 + λ3 = 1, to guarantee that the result is a probability. The P̂ terms are maximum-likelihood estimates from counts.
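A small sketch of how the interpolated estimate could be computed from tag n-gram counts (the `counts` layout and the fixed λ values are illustrative assumptions; TnT itself sets the λs by deleted interpolation):

    def ml(num, den):
        """Maximum-likelihood estimate with a guard against zero denominators."""
        return num / den if den else 0.0

    def interpolated_tag_prob(t, prev1, prev2, counts, lambdas):
        """P(ti|ti-1 ti-2) = l3*P^(ti|ti-1 ti-2) + l2*P^(ti|ti-1) + l1*P^(ti).
        counts["uni"], counts["bi"], counts["tri"] map tag n-grams to counts."""
        l1, l2, l3 = lambdas                     # l1 + l2 + l3 == 1
        uni, bi, tri = counts["uni"], counts["bi"], counts["tri"]
        p_uni = ml(uni.get(t, 0), sum(uni.values()))
        p_bi = ml(bi.get((prev1, t), 0), uni.get(prev1, 0))
        p_tri = ml(tri.get((prev2, prev1, t), 0), bi.get((prev2, prev1), 0))
        return l3 * p_tri + l2 * p_bi + l1 * p_uni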
Other Types of Smoothing
• Add one:
– P(wi|ti) = (C(ti, wi) + 1) / (C(ti) + K)
– where K is the number of words with POS tag ti
• Variant of add one (Charniak’s):
– Not a proper probability distribution!
Another Problem for All HMMs
- We multiply many small probabilities here, which quickly underflows floating-point arithmetic
- Standard fix: work in log space and replace the product with a sum, since log ∏ pi = Σ log pi
Yet Another Problem: Unknown Words
- Solution 0 (not great): assume uniform emission probabilities (this is what
“add one” smoothing does)
- You can exclude closed-class POS tags such as…
- This does not use any lexical information such as suffixes
- Solution 1: capture lexical information, e.g., estimate tag probabilities from word suffixes (see the sketch after this list)
- This reduces error rate for unknown words from 40% to 20%
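One common way to capture lexical information is to estimate P(tag | suffix) from the rare words in the training data, which behave most like the unknown words seen at test time. The sketch below is illustrative (function names, cutoffs, and suffix length are assumptions; TnT's actual suffix analysis is more elaborate):

    from collections import Counter, defaultdict

    def suffix_tag_distribution(tagged_words, max_suffix=3, rare_cutoff=1):
        """Estimate P(tag | suffix) from rare words (seen at most rare_cutoff times)."""
        word_freq = Counter(w for w, _ in tagged_words)
        suffix_counts = defaultdict(Counter)
        for word, tag in tagged_words:
            if word_freq[word] <= rare_cutoff:
                suffix_counts[word[-max_suffix:].lower()][tag] += 1
        return {suffix: {tag: c / sum(tags.values()) for tag, c in tags.items()}
                for suffix, tags in suffix_counts.items()}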
Main Disadvantage of HMMs
Hard to add features in the model
- Capitalization, hyphenation, suffixes, etc.
It is possible, but every such feature must be encoded in p(word|tag)
- Redesign the model for every feature!
- MEMMs avoid this limitation, but they take longer to train
Evaluation
- POS tagging accuracy = 100 x (number of correct tags) / (number of words in
dataset)
- Accuracy numbers currently reported for POS tagging are most often between
95% and 97%
- But they are much worse for “unknown” words
Evaluation example
Evaluation
- Accuracy does not work. Why?
- We need precision, recall, F1:
- P = TP/(TP + FP)
- R = TP/(TP + FN)
- F1 = 2PR/(P + R)
- Micro vs. macro F1 measures
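A small sketch of the difference (the per-class counts are made up): micro-F1 pools TP/FP/FN over all classes before computing P, R, and F1, while macro-F1 averages the per-class F1 scores, so rare classes weigh equally.

    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # Hypothetical per-class counts: {class: (TP, FP, FN)}
    counts = {"PER": (90, 10, 10), "LOC": (5, 5, 15)}

    micro = f1(sum(c[0] for c in counts.values()),
               sum(c[1] for c in counts.values()),
               sum(c[2] for c in counts.values()))
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    print(micro, macro)   # ~0.83 vs ~0.62: the frequent class dominates micro-F1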