Natural Language Processing:

Part-Of-Speech Tagging,
Sequence Labeling, and
Hidden Markov Models (HMMs)

•NLP - POS and HMM 1


Part Of Speech Tagging

• Annotate each word in a sentence with a
part-of-speech marker.
• Lowest level of syntactic analysis.
John saw the saw and decided to take it  to the table.
NNP  VBD DT  NN  CC  VBD     TO VB   PRP IN DT  NN

• Useful for subsequent syntactic parsing and
word sense disambiguation.
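A tagged sentence is naturally represented as a sequence of (word, tag) pairs. The sketch below pairs the example sentence above with its tags and prints it in word/TAG notation. The names `pair_tags` and `tagged` are illustrative assumptions, not part of any library.

```python
# Words and tags from the example sentence, kept as parallel lists.
words = "John saw the saw and decided to take it to the table".split()
tags = "NNP VBD DT NN CC VBD TO VB PRP IN DT NN".split()

def pair_tags(words, tags):
    """Zip words with tags, refusing mismatched lengths."""
    if len(words) != len(tags):
        raise ValueError("each word needs exactly one tag")
    return list(zip(words, tags))

tagged = pair_tags(words, tags)
print(" ".join(f"{w}/{t}" for w, t in tagged))
```

Note that the two occurrences of "saw" receive different tags (VBD and NN), which is the kind of ambiguity a tagger has to resolve from context.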



English POS Tagsets

• Original Brown corpus used a large set of
87 POS tags.
• Most common in NLP today is the Penn
Treebank set of 45 tags.
– Tagset used in these slides.
– Reduced from the Brown set for use in the
context of a parsed corpus (i.e. treebank).
• The C5 tagset used for the British National
Corpus (BNC) has 61 tags.
English Parts of Speech
• Noun (person, place or thing)
– Singular (NN): dog, fork
– Plural (NNS): dogs, forks
– Proper (NNP, NNPS): John, Springfields
– Personal pronoun (PRP): I, you, he, she, it
– Wh-pronoun (WP): who, what
• Verb (actions and processes)
– Base, infinitive (VB): eat
– Past tense (VBD): ate
– Gerund (VBG): eating
– Past participle (VBN): eaten
– Non 3rd person singular present tense (VBP): eat
– 3rd person singular present tense (VBZ): eats
– Modal (MD): should, can
– To (TO): to (to eat)
English Parts of Speech (cont.)
• Adjective (modify nouns)
– Basic (JJ): red, tall
– Comparative (JJR): redder, taller
– Superlative (JJS): reddest, tallest
• Adverb (modify verbs)
– Basic (RB): quickly
– Comparative (RBR): quicker
– Superlative (RBS): quickest
• Preposition (IN): on, in, by, to, with
• Determiner:
– Basic (DT): a, an, the
– WH-determiner (WDT): which, that
• Coordinating Conjunction (CC): and, but, or
• Particle (RP): off (took off), up (put up)



Closed vs. Open Class

• Closed class categories are composed of a
small, fixed set of grammatical function
words for a given language.
– Pronouns, Prepositions, Modals, Determiners,
Particles, Conjunctions
• Open class categories have a large number of
words and new ones are easily invented.
– Nouns (Googler, textlish), Verbs (Google),
Adjectives (geeky), Adverbs (automagically)



Ambiguity in POS Tagging

• “Like” can be a verb or a preposition
– I like/VBP candy.
– Time flies like/IN an arrow.
• “Around” can be a preposition, particle, or
adverb
– I bought it at the shop around/IN the corner.
– I never got around/RP to getting a car.
– A new Prius costs around/RB $25K.
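Ambiguity can be counted per word type (distinct word) or per word token (occurrence). A minimal sketch, assuming a tiny hand-made tagged corpus built from the two "like" examples above; apart from like/VBP and like/IN, the tags and the function name `ambiguity_rates` are illustrative assumptions rather than corpus data.

```python
from collections import defaultdict

# Toy tagged corpus (illustrative only), using the two "like" sentences.
corpus = [
    ("I", "PRP"), ("like", "VBP"), ("candy", "NN"),
    ("Time", "NN"), ("flies", "VBZ"), ("like", "IN"),
    ("an", "DT"), ("arrow", "NN"),
]

def ambiguity_rates(tagged_tokens):
    """Return (fraction of ambiguous types, fraction of ambiguous tokens)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    # A word type is ambiguous if it was seen with more than one tag.
    ambiguous = {w for w, ts in tags_by_word.items() if len(ts) > 1}
    type_rate = len(ambiguous) / len(tags_by_word)
    hits = sum(w.lower() in ambiguous for w, _ in tagged_tokens)
    token_rate = hits / len(tagged_tokens)
    return type_rate, token_rate

type_rate, token_rate = ambiguity_rates(corpus)
print(f"ambiguous types: {type_rate:.1%}, ambiguous tokens: {token_rate:.1%}")
```

Ambiguous words tend to be frequent ones, so the token rate is usually higher than the type rate.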



POS Tagging Process
• Usually assume a separate initial tokenization process that
separates and/or disambiguates punctuation, including
detecting sentence boundaries.
• Degree of ambiguity in English (based on Brown corpus)
– 11.5% of word types are ambiguous.
– 40% of word tokens are ambiguous.
• Average POS tagging disagreement amongst expert human
judges for the Penn Treebank was 3.5%.
– Based on correcting the output of an initial automated tagger,
which was deemed to be more accurate than tagging from scratch.
• Baseline: Picking the most frequent tag for each specific
word type gives about 90% accuracy.
– 93.7% on the Penn Treebank tagset if a model for unknown words is used.
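The most-frequent-tag baseline can be sketched in a few lines. The code below is a hedged illustration, not the tagger behind the accuracy figures above: the training data is a made-up toy list, the function names are hypothetical, and unknown words simply fall back to the overall most frequent tag, which is a much cruder stand-in for a real unknown-word model.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_tokens):
    """Learn each word's most frequent tag, plus a fallback for unknown words."""
    counts = defaultdict(Counter)
    overall = Counter()
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
        overall[tag] += 1
    best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    fallback = overall.most_common(1)[0][0]  # assumption: crude unknown-word model
    return best, fallback

def tag_baseline(words, best, fallback):
    """Tag each word independently, ignoring its context."""
    return [(w, best.get(w.lower(), fallback)) for w in words]

# Toy training data (hypothetical, far too small for real use).
train = [("the", "DT"), ("saw", "NN"), ("John", "NNP"), ("saw", "VBD"),
         ("the", "DT"), ("table", "NN"), ("he", "PRP"), ("saw", "VBD"),
         ("it", "PRP"), ("dog", "NN")]

best, fallback = train_baseline(train)
result = tag_baseline(["John", "saw", "the", "saw"], best, fallback)
print(result)
```

Because each word always receives the same tag, the second "saw" in "John saw the saw" is tagged VBD as well, which shows why the baseline cannot resolve context-dependent ambiguity.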



POS Tagging Approaches
• Rule-Based: Human crafted rules based on lexical
and other linguistic knowledge.
• Learning-Based: Trained on human annotated
corpora like the Penn Treebank.
– Statistical models: Hidden Markov Model (HMM),
Maximum Entropy Markov Model (MEMM),
Conditional Random Field (CRF)
– Rule learning: Transformation Based Learning (TBL)
– Neural networks: Recurrent networks such as Long
Short-Term Memory (LSTM) networks
• Generally, learning-based approaches have been
found to be more effective overall, taking into
account the total amount of human expertise and
effort involved.
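To make the rule-based approach concrete, here is a minimal sketch of a hand-crafted tagger. The lexicon, the suffix rules, and the function name `rule_tag` are all hypothetical and chosen only for illustration; real rule-based taggers use far larger lexicons and context-sensitive rules.

```python
import re

# Hypothetical hand-written resources; a real rule-based tagger uses far more.
LEXICON = {"the": "DT", "a": "DT", "an": "DT", "and": "CC", "to": "TO",
           "it": "PRP", "he": "PRP", "in": "IN", "on": "IN"}
SUFFIX_RULES = [
    (re.compile(r".+ing$"), "VBG"),  # eating
    (re.compile(r".+ed$"), "VBD"),   # decided
    (re.compile(r".+ly$"), "RB"),    # quickly
    (re.compile(r".+s$"), "NNS"),    # dogs
]

def rule_tag(word):
    """Tag one word using the lexicon, capitalization, then suffix rules."""
    lower = word.lower()
    if lower in LEXICON:
        return LEXICON[lower]
    if word[0].isupper():
        return "NNP"
    for pattern, tag in SUFFIX_RULES:
        if pattern.match(lower):
            return tag
    return "NN"  # default guess for anything the rules do not cover

sentence = "John decided to take the dogs".split()
rule_tagged = [(w, rule_tag(w)) for w in sentence]
print(rule_tagged)
```

These rules tag "take" as NN because nothing in them looks at the preceding "to". Covering such cases by hand is where the human effort of rule-based tagging accumulates.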
Classification Learning
• Typical machine learning addresses the problem
of classifying a feature-vector description into a
fixed number of classes.
• There are many standard learning methods for this
task:
– Decision Trees and Rule Learning
– Naïve Bayes and Bayesian Networks
– Logistic Regression / Maximum Entropy (MaxEnt)
– Perceptron and Neural Networks
– Support Vector Machines (SVMs)
– Nearest-Neighbor / Instance-Based
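As an illustration of classifying a feature-vector description, the sketch below maps a word to a small numeric vector and labels it with a 1-nearest-neighbor rule, one of the methods listed above. The four features, the training words, and the function names are assumptions made for this example only.

```python
def features(word):
    """Map a word to a small numeric feature vector (hypothetical features)."""
    return (
        float(word[0].isupper()),    # capitalized?
        float(word.endswith("ed")),  # past-tense-like suffix?
        float(word.endswith("s")),   # plural-like suffix?
        float(len(word) <= 3),       # short word?
    )

def nearest_neighbor(train, word):
    """Return the label of the training word whose features are closest."""
    target = features(word)

    def distance(example):
        vec = features(example[0])
        return sum((a - b) ** 2 for a, b in zip(vec, target))

    return min(train, key=distance)[1]

# Toy labeled examples: one per class.
train_words = [("John", "NNP"), ("walked", "VBD"), ("dogs", "NNS"), ("the", "DT")]

print(nearest_neighbor(train_words, "Mary"))
print(nearest_neighbor(train_words, "jumped"))
```

Each word is classified on its own here, with no information from neighboring words.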



Beyond Classification Learning
• Standard classification problem assumes
individual cases are disconnected and independent
(i.i.d.: independently and identically distributed).
• Many NLP problems do not satisfy this
assumption and involve making many connected
decisions, each resolving a different ambiguity,
but which are mutually dependent.
• More sophisticated learning and inference
techniques are needed to handle such situations in
general.



Sequence Labeling Problem
• Many NLP problems can be viewed as sequence
labeling.
• Each token in a sequence is assigned a label.
• Labels of tokens are dependent on the labels of
other tokens in the sequence, particularly their
neighbors (not i.i.d.).

foo bar blam zonk zonk bar blam



Information Extraction
• Identify phrases in language that refer to specific types of
entities and relations in text.
• Named entity recognition is the task of identifying names of
people, places, organizations, etc. in text.
– [Michael Dell]/person is the CEO of [Dell Computer Corporation]/organization
and lives in [Austin Texas]/place.
• Extract pieces of information relevant to a specific
application, e.g. used car ads:
– For sale, [2002]/year [Toyota]/make [Prius]/model, [20,000 mi]/mileage,
[$15K]/price or best offer. Available starting July 30, 2006.
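Named entity recognition becomes a sequence labeling problem once every token carries a label. A minimal sketch, assuming entity spans given as token indices and an "O" label for tokens outside any entity; both conventions, and the function name `spans_to_labels`, are assumptions of this example.

```python
tokens = ("Michael Dell is the CEO of Dell Computer Corporation "
          "and lives in Austin Texas .").split()

# (start, end, type) spans over token indices, end exclusive.
spans = [(0, 2, "person"), (6, 9, "organization"), (12, 14, "place")]

def spans_to_labels(n_tokens, spans, outside="O"):
    """Turn entity spans into one label per token."""
    labels = [outside] * n_tokens
    for start, end, kind in spans:
        for i in range(start, end):
            labels[i] = kind
    return labels

labels = spans_to_labels(len(tokens), spans)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

The token "Dell" is labeled person at one position and organization at another, so the correct label depends on the surrounding tokens and not on the word alone.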



Semantic Role Labeling
• For each clause, determine the semantic role
played by each noun phrase that is an
argument to the verb.
– [John]/agent drove [Mary]/patient from [Austin]/source to [Dallas]/destination
in [his Toyota Prius]/instrument.
– [The hammer]/instrument broke [the window]/patient.
• Also referred to as “case role analysis,”
“thematic analysis,” and “shallow semantic
parsing.”
Bioinformatics

• Sequence labeling is also valuable in labeling
genetic sequences in genome analysis.
– Region labels: exon, intron
– AGCTAACGTTCGATACGGATTACAGCCT



Problems with Sequence Labeling as
Classification
• Not easy to integrate information from the
categories of tokens on both sides.
• Difficult to propagate uncertainty between
decisions and “collectively” determine the
most likely joint assignment of categories to
all of the tokens in a sequence.



Probabilistic Sequence Models

• Probabilistic sequence models allow
integrating uncertainty over multiple,
interdependent classifications and
collectively determining the most likely
global assignment.
• Two standard models
– Hidden Markov Model (HMM)
– Conditional Random Field (CRF)



Markov Model / Markov Chain

• A finite state machine with probabilistic
state transitions.
• Makes the Markov assumption that the next state
depends only on the current state and is
independent of the previous history.
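Written out, the Markov assumption says that conditioning on the full history adds nothing beyond the current state. The notation is an assumption, since the slides do not fix symbols: q_t denotes the state at step t, and T is the sequence length.

```latex
P(q_{t+1} \mid q_1, q_2, \ldots, q_t) = P(q_{t+1} \mid q_t)
```

Applying the chain rule together with this assumption, the probability of a whole state sequence factors into a product of single-step transition probabilities:

```latex
P(q_1, q_2, \ldots, q_T) = P(q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})
```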



Sample Markov Model for POS

[Figure: Markov chain over the states start, Det, Noun, PropNoun, Verb, and stop, with a probability on each transition edge.]
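Under a Markov chain, the probability of a tag sequence is the product of the transition probabilities along its path from start to stop. The sketch below is a minimal illustration. The transition table is an assumption chosen so that each row sums to 1 and is not claimed to match the edges of the figure; the function name `sequence_probability` is likewise hypothetical.

```python
# Hypothetical transition probabilities P(next | current); each row sums to 1.
TRANSITIONS = {
    "start":    {"Det": 0.5, "PropNoun": 0.4, "Verb": 0.1},
    "Det":      {"Noun": 0.95, "PropNoun": 0.05},
    "Noun":     {"Verb": 0.8, "Noun": 0.1, "stop": 0.1},
    "PropNoun": {"Verb": 0.8, "Noun": 0.1, "stop": 0.1},
    "Verb":     {"Det": 0.25, "PropNoun": 0.25, "stop": 0.5},
}

def sequence_probability(states, transitions=TRANSITIONS):
    """Probability of the path start -> states... -> stop."""
    path = ["start", *states, "stop"]
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        # A transition missing from the table has probability 0.
        prob *= transitions.get(current, {}).get(nxt, 0.0)
    return prob

# 0.4 * 0.8 * 0.25 * 0.95 * 0.1 under the table above.
p = sequence_probability(["PropNoun", "Verb", "Det", "Noun"])
print(p)
```

Each factor depends only on the current state and the next one, which is exactly the Markov assumption; a sequence that uses a transition absent from the table gets probability 0.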
Refer to the POS and basic HMM material, then proceed
with this example.
