Knowledge Based NLP and Statistical NLP
Each has its place
Knowledge Based NLP: a linguist writes rules; the computer applies them.
Statistical NLP: the computer learns rules/probabilities from a corpus.
"Science without religion is lame; religion without science is blind" – Einstein
NLP = Computation + Linguistics
NLP without Linguistics is blind
and
NLP without Computation is lame
Key difference between Statistical/ML-based NLP and Knowledge-based/linguistics-based NLP
Stat NLP: speed and robustness are the main concerns
KB NLP: phenomena based
Example: for boys, toys, toes, remove "s" to get the root
But how about foxes, boxes, ladies? (see the rule sketch below)
Understand the phenomenon: go deeper
Slower processing
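A minimal sketch, in Python, of the kind of suffix-handling rules a knowledge-based stemmer must encode once the naive "remove s" rule breaks down; the rule set and function name are illustrative only, not an exhaustive stemmer:

```python
# Illustrative suffix-stripping rules for English plurals (not exhaustive).
def strip_plural(word):
    if word.endswith("ies") and len(word) > 4:                # ladies -> lady
        return word[:-3] + "y"
    if word.endswith(("xes", "ses", "zes", "ches", "shes")):  # foxes, boxes -> fox, box
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):        # boys, toys, toes -> boy, toy, toe
        return word[:-1]
    return word

print([strip_plural(w) for w in ["boys", "toys", "toes", "foxes", "boxes", "ladies"]])
# ['boy', 'toy', 'toe', 'fox', 'box', 'lady']
```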
Noisy Channel Model
(wn, wn-1, … , w1)  →  Noisy Channel  →  (tm, tm-1, … , t1)
Sequence w is transformed into sequence t.
Bayesian Decision Theory and the Noisy Channel Model are close to each other.
Bayes' Theorem: given the random variables A and B,
P(A|B) = P(A) P(B|A) / P(B)
P(A|B): posterior probability
P(A): prior probability
P(B|A): likelihood
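As a quick worked illustration of the formula (the numbers below are invented for illustration only):

```python
# Toy numbers: P(A) = 0.01, P(B|A) = 0.9, P(B) = 0.05
prior = 0.01          # P(A)
likelihood = 0.9      # P(B|A)
evidence = 0.05       # P(B)

posterior = prior * likelihood / evidence   # P(A|B) = P(A) P(B|A) / P(B)
print(posterior)                            # 0.18
```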
Discriminative vs. Generative Model
W* = argmax_W P(W|SS)
Discriminative model: compute directly from P(W|SS)
Generative model: compute from P(W).P(SS|W)
Corpus
A collection of text, called a corpus, is used for collecting various language data.
With annotation: more information, but manual-labor intensive
Practice: label automatically, then correct manually
The famous Brown Corpus contains 1 million tagged words.
Switchboard: a very famous corpus with 2400 conversations, 543 speakers, many US dialects, annotated with orthography and phonetics
Example-1 of Application of Noisy Channel Model: Probabilistic Speech Recognition (Isolated Word) [8]
Problem Definition: Given a sequence of speech signals, identify the words.
2 steps:
Segmentation (Word Boundary Detection)
Identify the word
Isolated Word Recognition:
Identify W given SS (speech signal)
Ŵ = argmax_W P(W|SS)
Identifying the word
Ŵ = argmax_W P(W|SS)
  = argmax_W P(W) P(SS|W)
P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable!
P(W) = prior probability, called the "language model"
P(W) = #(times W appears in the corpus) / #(words in the corpus)
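A small illustrative sketch of estimating the language-model prior P(W) from corpus counts and combining it with a phonological-model likelihood in the argmax; the toy corpus and the likelihood table are invented, not from the lecture:

```python
from collections import Counter

# Toy corpus (invented); in practice this would be a large corpus such as Brown.
corpus = "call me tomorrow call the office tomorrow".split()
counts = Counter(corpus)
total = len(corpus)

def prior(w):
    # P(W) = #(times W appears in the corpus) / #(words in the corpus)
    return counts[w] / total

# Hypothetical phonological-model scores P(SS|W) for one observed signal SS.
likelihood = {"call": 0.6, "tall": 0.3, "hall": 0.1}

# W* = argmax_W P(W) * P(SS|W); unseen words get prior 0 here (no smoothing).
best = max(likelihood, key=lambda w: prior(w) * likelihood[w])
print(best)   # "call"
```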
Pronunciation Dictionary
Pronunciation automaton for the word "Tomato" (states s1 … s7):
t (1.0) → o (1.0) → m (1.0) → branch: ae (0.73, upper path via s4) or aa (0.27, lower path via s5) → t (1.0) → o (1.0) → end
P(SS|W) is maintained in this way.
P(t o m ae t o | Word is "tomato") = product of arc probabilities
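A minimal sketch of the arc-probability product for the automaton above; the helper function is illustrative, and only the 0.73/0.27 branch probabilities come from the slide:

```python
from functools import reduce

def path_probability(phones, branch_prob):
    # Every arc of the "tomato" automaton has probability 1.0
    # except the ae/aa branch, whose probabilities are given in branch_prob.
    probs = [branch_prob.get(p, 1.0) for p in phones]
    return reduce(lambda a, b: a * b, probs, 1.0)

branch_prob = {"ae": 0.73, "aa": 0.27}

print(path_probability(["t", "o", "m", "ae", "t", "o"], branch_prob))  # 0.73
print(path_probability(["t", "o", "m", "aa", "t", "o"], branch_prob))  # 0.27
```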
Example Problem-2
Analyse the sentiment of the text: positive or negative polarity
Challenges:
Unclean corpora
Thwarted expression: "The movie has everything: cast, drama, scene, photography, story; the director has managed to make a mess of all this."
Sarcasm: "The movie has everything: cast, drama, scene, photography, story; see at your own risk."
Sentiment Classification
Positive, negative, neutral – 3 classes
Create a representation for the document
Classify the representation
The most popular way of representing a document is as a feature vector (indicator sequence), as sketched below.
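A minimal sketch of such an indicator (feature) vector over a fixed vocabulary; the vocabulary and example document are invented for illustration:

```python
# Fixed feature vocabulary (illustrative); 1 if the token occurs in the document, else 0.
vocabulary = ["good", "bad", "excellent", "boring", "mess"]

def indicator_vector(document, vocab=vocabulary):
    tokens = set(document.lower().split())
    return [1 if term in tokens else 0 for term in vocab]

print(indicator_vector("The movie was good but the ending was boring"))
# [1, 0, 0, 1, 0]
```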
Established Techniques
Naïve Bayes Classifier (NBC)
Support Vector Machines (SVM)
Neural Networks
K nearest neighbor classifier
Latent Semantic Indexing
Decision Tree ID3
Concept based indexing
Successful Approaches
The following are successful approaches as reported in the literature:
NBC – simple to understand and implement
SVM – complex, requires foundations of perceptrons
Mathematical Setting
Indicator/feature vectors to be formed
We have a training set:
A: Positive Sentiment Docs
B: Negative Sentiment Docs
Let the classes of positive and negative documents be C+ and C-, respectively.
Given a new document D, label it positive if P(C+|D) > P(C-|D)
Prior Probability
Document   Vector   Classification
D1         V1       +
D2         V2       -
D3         V3       +
..         ..       ..
D4000      V4000    -
Let T = total no. of documents, and let |+| = M, so |-| = T - M.
P(D being positive) = M/T
Prior probability is calculated without considering any features of the new document.
Apply Bayes Theorem
Steps followed for the NBC algorithm:
Calculate the prior probabilities of the classes, P(C+) and P(C-).
Calculate the feature probabilities of the new document, P(D|C+) and P(D|C-).
The probability of a document D belonging to a class C can be calculated by Bayes' Theorem as follows:
P(C|D) = P(C) * P(D|C) / P(D)
Document belongs to C+ if
P(C+) * P(D|C+) > P(C-) * P(D|C-)
Calculating P(D|C+)
P(D|C+) is the probability of document D given class C+. This is calculated as follows:
Identify a set of features/indicators to evaluate a document and generate a feature vector VD = <x1, x2, x3, …, xn>
Hence, P(D|C+) = P(VD|C+)
= P(<x1, x2, x3, …, xn> | C+)
= |<x1, x2, x3, …, xn>, C+| / |C+|
Based on the assumption that all features are independently and identically distributed (i.i.d.):
P(<x1, x2, x3, …, xn> | C+)
= P(x1|C+) * P(x2|C+) * P(x3|C+) * … * P(xn|C+)
= ∏ i=1..n P(xi|C+)
P(xi|C+) can now be calculated as |xi, C+| / |C+|, i.e., the number of C+ documents in which feature xi occurs divided by the number of C+ documents.
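Putting the NBC steps together, a compact illustrative sketch: priors from class counts, P(xi|C) from feature counts within each class (with add-one smoothing, an assumption made here so unseen features do not zero out the product), and the comparison of P(C+) * P(D|C+) against P(C-) * P(D|C-). The toy training documents are invented:

```python
from collections import Counter

# Toy training data (invented): (tokens, label) pairs.
train = [
    ("a great and moving film".split(), "+"),
    ("brilliant cast and story".split(), "+"),
    ("a boring mess of a film".split(), "-"),
    ("dull story and weak cast".split(), "-"),
]

labels = [y for _, y in train]
prior = {c: labels.count(c) / len(labels) for c in ("+", "-")}   # P(C)

vocab = {w for doc, _ in train for w in doc}
feat_counts = {c: Counter(w for doc, y in train if y == c for w in doc) for c in ("+", "-")}
totals = {c: sum(feat_counts[c].values()) for c in ("+", "-")}

def likelihood(word, c):
    # P(xi|C) with add-one smoothing (an assumption of this sketch).
    return (feat_counts[c][word] + 1) / (totals[c] + len(vocab))

def classify(tokens):
    scores = {}
    for c in ("+", "-"):
        score = prior[c]
        for w in tokens:
            score *= likelihood(w, c)        # naive (i.i.d.) feature assumption
        scores[c] = score
    return max(scores, key=scores.get)       # label with larger P(C) * P(D|C)

print(classify("a brilliant and moving story".split()))   # "+"
```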
Baseline Accuracy
Using just tokens as features: 80% accuracy
That is a 20% probability of a document being misclassified
On large sets this is significant
To improve accuracy…
Clean the corpora
POS-tag the text
Concentrate on critical POS tags (e.g., adjectives), as in the sketch after this list
Remove 'objective' (non-opinion-bearing) sentences
Do aggregation
Use minimal to sophisticated NLP
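Concentrating on critical POS tags can be sketched with NLTK's off-the-shelf tagger (this assumes nltk is installed along with its tagger model; the example sentence is invented):

```python
import nltk  # assumes nltk and its part-of-speech tagger model are available

sentence = "The direction is brilliant but the plot is painfully dull"
tagged = nltk.pos_tag(sentence.split())            # [(word, POS tag), ...]

# Keep only adjectives (Penn Treebank tags JJ, JJR, JJS) as sentiment-bearing features.
adjectives = [w for w, tag in tagged if tag.startswith("JJ")]
print(adjectives)   # e.g. ['brilliant', 'dull']
```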
Allied Disciplines
Philosophy: Semantics, meaning of "meaning", logic (syllogism)
Linguistics: Study of syntax, lexicon, lexical semantics, etc.
Probability and Statistics: Corpus linguistics, testing of hypotheses, system evaluation
Cognitive Science: Computational models of language processing, language acquisition
Psychology: Behavioristic insights into language processing, psychological models
Brain Science: Language processing areas in the brain
Physics: Information theory, entropy, random fields
Computer Sc. & Engg.: Systems for NLP