CSE440: NATURAL LANGUAGE
PROCESSING II
Farig Sadeque
Assistant Professor
Department of Computer Science and Engineering
BRAC University
Lecture 5: Sequence Learning
Outline
- Sequence tagging (SLP 8)
- Markov models (SLP Appendix A)
- Recurrent neural networks (SLP 9)
Sequences are common in languages
- Speech recognition
- Group acoustic signal into phonemes
- Group phonemes into words
- Natural language processing
- Part of speech tagging
- our running example
- Named entity recognition
- Information extraction
- Question answering
Parts-of-speech tagging
Why not just make a big table?
- badger is a NOUN, trip is a VERB, etc.
Because part-of-speech changes with the surrounding sequence:
- I saw a badger in the zoo.
- Don’t badger me about it!
- I saw him trip on his shoelaces.
- She said her trip to Greece was amazing.
How big is this ambiguity issue?
Part-of-speech ambiguity
Most words in the English vocabulary are unambiguous.
Part-of-speech ambiguity
But most word tokens in running text are ambiguous! The ambiguous word types tend to be the most frequent words, so they account for most of the tokens.
A big table is still a good start
- Only 30-40% of words in running text are unambiguous.
- What if we build a table for all words and, for ambiguous words, store the most commonly used tag for that word?
- This is called the most frequent tag baseline:
- assign each token the tag it appeared with most frequently in the training data.
- 92.34% accurate on WSJ corpus.
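A minimal sketch of this baseline (the function names and toy data are illustrative, not from the lecture):

    from collections import Counter, defaultdict

    def train_most_frequent_tag(tagged_sentences):
        """Build a word -> most frequent tag lexicon from (word, tag) training data."""
        word_tag_counts = defaultdict(Counter)
        tag_counts = Counter()
        for sentence in tagged_sentences:
            for word, tag in sentence:
                word_tag_counts[word][tag] += 1
                tag_counts[tag] += 1
        lexicon = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
        # Fall back to the globally most frequent tag for unseen words.
        default_tag = tag_counts.most_common(1)[0][0]
        return lexicon, default_tag

    def tag_sentence(words, lexicon, default_tag):
        return [(w, lexicon.get(w, default_tag)) for w in words]

    # Toy usage:
    train = [[("Do", "VB"), ("not", "RB"), ("badger", "VB"), ("me", "PRP")],
             [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("badger", "NN")],
             [("the", "DT"), ("badger", "NN")]]
    lexicon, default_tag = train_most_frequent_tag(train)
    print(tag_sentence(["badger", "me"], lexicon, default_tag))
    # -> [('badger', 'NN'), ('me', 'PRP')], even though "badger" is a verb here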
A big table is still a good start
- What’s the tag for cut? Counts in the training data:
- cut/NN: 10
- cut/VB: 25
- cut/VBD: 13
- cut/VBN: 7
- The most frequent tag baseline always tags cut as VB (25 of 55 occurrences), even when it is actually a noun or a past-tense/past-participle form.
Learning sequence taggers
- To improve over the most frequent tag baseline, we should take advantage of
the sequence.
- Some options we will cover:
- Hidden Markov models
- Parameters estimated by counting (like naïve Bayes)
- Maximum entropy Markov models
- Parameters estimated by logistic regression
- Recurrent neural networks
Hidden Markov Models
- Maximum entropy Markov models (MEMM)
- (Visible) Markov models for PoS tagging
- Training by counting
- Smoothing probabilities
- Handling unknown words
- Viterbi algorithm
Why POS Tagging Must Model Sequences
Our running example:
Secretariat is expected to race tomorrow.
Secretariat is ________
Race is ________
To understand context, we will predict all tags together.
Approach 0: Rule-based baseline
- Assign each word a list of potential POS labels using the dictionary
- Winnow down the list to a single POS label for each word using lists of
hand-written disambiguation rules
You can learn these rules: see Transformation-based Learning: https://dl.acm.org/citation.cfm?id=218367
Approach 1: Maximum entropy Markov models
- Maximum entropy = logistic regression
- Markov models
- Introduced by Andrey Markov
- Limited horizon: the next state depends only on the current state (the Markov assumption)
- How would you implement sequence models in the logistic regression
algorithm that we know?
- Let’s assume we scan the text left to right.
Approach 1 continued
- Add the previously seen tags as features!
- Use gold tags in training
- Use predicted tags in testing
- Other common features
- Words, lemmas in a window [-k, +k]
- Casing info, prefixes, suffixes of these words
- Bigrams containing the current word
See also:
https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/sequences/PartOfSpeechTagger.scala
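A rough sketch of the kind of feature function such a left-to-right MEMM could use (the feature names are illustrative and not taken from the linked tagger):

    def memm_features(words, i, prev_tags, k=2):
        """Features for predicting the tag of words[i]: neighboring words in a
        [-k, +k] window, casing, prefix/suffix, the previous tag, and a bigram.
        prev_tags holds gold tags in training and predicted tags at test time."""
        w = words[i]
        feats = {
            "word=" + w.lower(): 1,
            "is_capitalized=" + str(w[:1].isupper()): 1,
            "prefix3=" + w[:3].lower(): 1,
            "suffix3=" + w[-3:].lower(): 1,
        }
        for j in range(max(0, i - k), min(len(words), i + k + 1)):
            if j != i:
                feats[f"word[{j - i}]=" + words[j].lower()] = 1
        if i > 0:
            feats["prev_tag=" + prev_tags[i - 1]] = 1
            feats["bigram=" + words[i - 1].lower() + "_" + w.lower()] = 1
        return feats

    # Example: features for "race" in the running example
    words = "Secretariat is expected to race tomorrow".split()
    print(memm_features(words, 4, ["NNP", "VBZ", "VBN", "TO"]))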
Approach 1: bidirectional MEMMs
- You can stack MEMMs that traverse the text in opposite directions:
- Left-to-right direction (same as before)
- Right-to-left: uses the prediction(s) of the above system as features!
- What is the problem with the predictions of the left-to-right model here?
- Many state-of-the-art taggers use this approach: CoreNLP, processors,
SVMTool
Approach 2: Hidden (visible) Markov Models
- Let’s put the probability theory we covered in the previous lecture to use!
- The resulting approach is called (visible) Markov model
- “Visible” to distinguish it from the hidden Markov models, where the tags are
unknown
- Imagine implementing a POS tagger for an unstudied language without POS annotations
Approach 2: Hidden (visible) Markov Models
• A sentence contains n words
• t1…tn – an assignment of POS tags to this sentence
• w1…wn – the words in this sentence
• t̂1…tn – the estimate of the optimal tag assignment
Let’s formalize this
We have four probabilities: likelihood, prior, posterior, and marginal likelihood.
- Prior: the probability distribution representing our knowledge or uncertainty before observing the data; here, P(t1…tn)
- Likelihood: the probability of the observed data given the hypothesis; here, P(w1…wn|t1…tn)
- Posterior: the conditional probability of the hypothesis after observing the data; here, P(t1…tn|w1…wn)
- Marginal likelihood: the likelihood summed (integrated) over all hypotheses; here, P(w1…wn). It is the same for every tag sequence, so it does not affect inference.
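Putting these together with Bayes' rule (the standard derivation, in the notation above):

    t̂1…tn = argmax over t1…tn of P(t1…tn | w1…wn)                        (posterior)
           = argmax over t1…tn of P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn)
           = argmax over t1…tn of P(w1…wn | t1…tn) P(t1…tn)              (likelihood × prior)

The marginal likelihood P(w1…wn) is the same for every candidate tag sequence, so dropping it does not change the argmax.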
Three Approximations
- Words are independent of the words around them
- Words depend only on their POS tags, not on the neighboring POS tags
- A tag is dependent only on the previous tag
Replace in the original equation:

    t̂1…tn ≈ argmax over t1…tn of ∏ i=1..n P(wi|ti) P(ti|ti-1)

The P(wi|ti) terms are the word likelihoods; the P(ti|ti-1) terms are the tag transition probabilities.
Computing Tag Transition Probabilities
In the Brown corpus (1M words)
- DT occurs 116,454 times
- DT is followed by NN 56,509 times
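The maximum-likelihood estimate from these counts:

    P(NN|DT) = C(DT, NN) / C(DT) = 56,509 / 116,454 ≈ 0.49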
Computing Word Likelihoods
In the Brown corpus (1M words)
- VBZ occurs 21,627 times
- VBZ is the tag for “is” 10,073 times
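The maximum-likelihood estimate from these counts:

    P(is|VBZ) = C(VBZ, is) / C(VBZ) = 10,073 / 21,627 ≈ 0.47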
Example
Let’s see why VB is preferred in the first case
The first tag transition
- P(NN|TO) = 0.00047
- P(VB|TO) = 0.83
The word likelihood for “race”
- P(race|NN) = 0.00057
- P(race|VB) = 0.00012
The second tag transition
- P(NR|VB) = 0.0027
- P(NR|NN) = 0.0012
Example
P(VB|TO) P(NR|VB) P(race|VB) = 0.83 × 0.0027 × 0.00012 ≈ 0.00000027
P(NN|TO) P(NR|NN) P(race|NN) = 0.00047 × 0.0012 × 0.00057 ≈ 0.00000000032
VB is more likely than NN, even though “race” appears more commonly as a noun!
Training/Testing an HMM
Just like with any machine learning algorithm, there are two steps to building and using an HMM:
- Training:
- Estimating p(ti|ti-1) and p(wi|ti)
- Testing (predicting):
- Estimating the best sequence of tags for a sentence (a sequence of words)
Training: Two Types of Probabilities
A: transition probabilities
- Used to compute the prior probabilities (probability of a tag)
- Often called tag transition probabilities
B: observation likelihoods
- Used to compute the likelihood probabilities (probability of a word given tag)
- Often called word likelihoods
Testing: Viterbi Algorithm
Viterbi algorithm
- Computes the argmax efficiently
- Example of dynamic programming
What is a viterbi?
Illustration of Search Space
This is called a trellis:
- One row for each state (tag)
- One column for each observation (word)
Viterbi Algorithm
Input
- State (or tag) transition probabilities (A)
- Observation (or word) likelihoods (B)
- An observation sequence O
Output
- Most probable state sequence Q together with its probability
Both A and B are matrices with probabilities
Example of A and B matrices
A: the rows are labeled with the conditioning event (the previous tag), e.g., P(PPSS|VB) = 0.0070
B: same layout as A, with the rows again labeled with the conditioning event (the tag), e.g., P(want|NN) = 0.000054
Example Trace
Summary of Viterbi Algorithm
• vt-1(i) – the Viterbi path probability of being in state i at the previous time step t − 1 (i.e., at the previous word)
• aij – the transition probability from previous state qi (i.e., the previous word having POS tag i) to current state qj (i.e., the current word having POS tag j)
• bj(ot) – the state observation likelihood of the observation symbol ot (i.e., the word at position t) given the current state j (i.e., POS tag j)
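These quantities combine into the standard Viterbi recurrence:

    vt(j) = max over i = 1..N of vt-1(i) · aij · bj(ot)

Below is a minimal NumPy sketch of this recurrence (an illustration, not the lecture's reference implementation); it assumes tags and words have already been mapped to integer indices, and that A, B, and the initial distribution pi hold the probabilities described above.

    import numpy as np

    def viterbi(A, B, pi, observations):
        """A[i, j]: transition probability from state i to state j.
        B[j, o]: likelihood of observation o given state j.
        pi[j]: initial probability of state j.
        observations: sequence of integer observation indices.
        Returns the most probable state sequence and its probability."""
        N = A.shape[0]               # number of states (tags)
        T = len(observations)        # number of observations (words)
        v = np.zeros((T, N))         # Viterbi path probabilities
        back = np.zeros((T, N), dtype=int)

        v[0] = pi * B[:, observations[0]]                  # initialization
        for t in range(1, T):
            for j in range(N):
                scores = v[t - 1] * A[:, j] * B[j, observations[t]]
                back[t, j] = np.argmax(scores)             # best previous state
                v[t, j] = scores[back[t, j]]               # v_t(j) = max_i v_{t-1}(i) a_ij b_j(o_t)

        # Follow the backpointers from the best final state.
        path = [int(np.argmax(v[T - 1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        path.reverse()
        return path, float(v[T - 1].max())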
Extending the HMM Algorithm to Trigrams
This is pretty limiting for POS tagging
Let’s extend it to trigrams of tags!
This is better
• tn+1 – end of sentence tag
• We also need virtual tags, t0 and t-1, to be set to the beginning of sentence value.
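With the trigram assumption the objective becomes (the standard trigram HMM formulation):

    t̂1…tn = argmax over t1…tn of [ ∏ i=1..n P(ti|ti-1, ti-2) P(wi|ti) ] P(tn+1|tn)

with t0 and t-1 fixed to the beginning-of-sentence value and tn+1 the end-of-sentence tag.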
TnT
- This is what the TnT (Trigrams’n’Tags) tagger does
- Probably the fastest POS tagger in the world
- Not the best, but pretty close (96% acc)
- http://www.coli.uni-saarland.de/~thorsten/tnt/
Problems with TnT
- The trigram counts needed to estimate P(ti|ti-1ti-2) are very sparse!
Backoff model: linear interpolation
P(ti|ti-1ti-2) = λ3 P̂(ti|ti-1ti-2) + λ2 P̂(ti|ti-1) + λ1 P̂(ti)
λ1 + λ2 + λ3 = 1, to guarantee that the result is a probability. The P̂ terms are maximum-likelihood estimates from counts.
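A small sketch of how the interpolated estimate could be computed from tag n-gram counts (the `counts` layout and the fixed λ values are illustrative assumptions; TnT itself sets the λs by deleted interpolation):

    def ml(num, den):
        """Maximum-likelihood estimate with a guard against zero denominators."""
        return num / den if den else 0.0

    def interpolated_tag_prob(t, prev1, prev2, counts, lambdas):
        """P(ti|ti-1 ti-2) = l3*P^(ti|ti-1 ti-2) + l2*P^(ti|ti-1) + l1*P^(ti).
        counts["uni"], counts["bi"], counts["tri"] map tag n-grams to counts."""
        l1, l2, l3 = lambdas                     # l1 + l2 + l3 == 1
        uni, bi, tri = counts["uni"], counts["bi"], counts["tri"]
        p_uni = ml(uni.get(t, 0), sum(uni.values()))
        p_bi = ml(bi.get((prev1, t), 0), uni.get(prev1, 0))
        p_tri = ml(tri.get((prev2, prev1, t), 0), bi.get((prev2, prev1), 0))
        return l3 * p_tri + l2 * p_bi + l1 * p_uni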
Other Types of Smoothing
• Add one:
– P(wi|ti) = (C(ti, wi) + 1) / (C(ti) + K)
– where K is the number of words with POS tag ti
• Variant of add one (Charniak’s):
– Not a proper probability distribution!
Another Problem for All HMMs
- We multiply many small probabilities here, which quickly underflows floating-point arithmetic
- Standard fix: work in log space and replace the product with a sum, since log ∏ pi = Σ log pi
Yet Another Problem: Unknown Words
- Solution 0 (not great): assume uniform emission probabilities (this is what
“add one” smoothing does)
- You can exclude closed-class POS tags such as…
- This does not use any lexical information such as suffixes
- Solution 1: capture lexical information, e.g., estimate tag probabilities from word suffixes (see the sketch after this list)
- This reduces error rate for unknown words from 40% to 20%
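One common way to capture lexical information is to estimate P(tag | suffix) from the rare words in the training data, which behave most like the unknown words seen at test time. The sketch below is illustrative (function names, cutoffs, and suffix length are assumptions; TnT's actual suffix analysis is more elaborate):

    from collections import Counter, defaultdict

    def suffix_tag_distribution(tagged_words, max_suffix=3, rare_cutoff=1):
        """Estimate P(tag | suffix) from rare words (seen at most rare_cutoff times)."""
        word_freq = Counter(w for w, _ in tagged_words)
        suffix_counts = defaultdict(Counter)
        for word, tag in tagged_words:
            if word_freq[word] <= rare_cutoff:
                suffix_counts[word[-max_suffix:].lower()][tag] += 1
        return {suffix: {tag: c / sum(tags.values()) for tag, c in tags.items()}
                for suffix, tags in suffix_counts.items()}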
Main Disadvantage of HMMs
Hard to add features in the model
- Capitalization, hyphenation, suffixes, etc.
It is possible, but every such feature must be encoded in p(word|tag)
- Redesign the model for every feature!
- MEMMs avoid this limitation, but they take longer to train
Evaluation
- POS tagging accuracy = 100 x (number of correct tags) / (number of words in
dataset)
- Accuracy numbers currently reported for POS tagging are most often between
95% and 97%
- But they are much worse for “unknown” words
Evaluation example
Evaluation
- Accuracy does not work. Why?
- We need precision, recall, F1:
- P = TP/(TP + FP)
- R = TP/(TP + FN)
- F1 = 2PR/(P + R)
- Micro vs. macro F1 measures
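A small sketch of the difference (the per-class counts are made up): micro-F1 pools TP/FP/FN over all classes before computing P, R, and F1, while macro-F1 averages the per-class F1 scores, so rare classes weigh equally.

    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # Hypothetical per-class counts: {class: (TP, FP, FN)}
    counts = {"PER": (90, 10, 10), "LOC": (5, 5, 15)}

    micro = f1(sum(c[0] for c in counts.values()),
               sum(c[1] for c in counts.values()),
               sum(c[2] for c in counts.values()))
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    print(micro, macro)   # ~0.83 vs ~0.62: the frequent class dominates micro-F1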