Lec 2

The document discusses the distinctions between Knowledge-Based NLP and Statistical NLP, highlighting their respective strengths and weaknesses. It explains concepts such as the Noisy Channel Model, Bayesian Decision Theory, and various techniques for sentiment analysis, including the Naïve Bayes Classifier. Additionally, it emphasizes the importance of corpus data and allied disciplines that contribute to the field of Natural Language Processing.


Knowledge Based NLP and Statistical NLP

Each has its place.

Knowledge Based NLP: a linguist writes the rules, which the computer applies.
Statistical NLP: the computer learns rules/probabilities from a corpus.
"Science without religion is blind;
religion without science is lame." - Einstein

NLP = Computation + Linguistics

NLP without Linguistics is blind,
and NLP without Computation is lame.
Key difference between Statistical/ML-based NLP
and Knowledge-based/linguistics-based NLP

- Stat NLP: speed and robustness are the main concerns
- KB NLP: phenomena based
- Example: boys, toys, toes
  - To get the root, remove "s"
  - How about foxes, boxes, ladies?
- Understand the phenomena: go deeper (see the sketch below)
- Slower processing
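
To make the "remove s" example concrete, here is a minimal Python sketch (not from the lecture) contrasting the naive rule with slightly deeper, phenomenon-aware rules; the function names and the rule set are illustrative only.

    def naive_stem(word):
        # Knowledge-free rule: just strip a trailing "s".
        return word[:-1] if word.endswith("s") else word

    def rule_based_stem(word):
        # Phenomenon-aware rules: handle "ies", "es" after sibilants, then plain "s".
        if word.endswith("ies"):
            return word[:-3] + "y"        # ladies -> lady
        if word.endswith("es") and word[-3] in "sxz":
            return word[:-2]              # foxes -> fox, boxes -> box
        if word.endswith("s"):
            return word[:-1]              # boys -> boy, toys -> toy, toes -> toe
        return word

    for w in ["boys", "toys", "toes", "foxes", "boxes", "ladies"]:
        print(w, naive_stem(w), rule_based_stem(w))

The deeper rules capture the phenomena correctly, but at the cost of more analysis per word, which is the speed/robustness trade-off noted above.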
Noisy Channel Model

W -> Noisy Channel -> T

The sequence W = (w_n, w_{n-1}, ..., w_1) is transformed by the channel
into the sequence T = (t_m, t_{m-1}, ..., t_1).
Bayesian Decision Theory and Noisy
Channel Model are close to each other

- Bayes' theorem: given the random variables A and B,

  P(A|B) = P(A) P(B|A) / P(B)

  P(A|B): posterior probability
  P(A): prior probability
  P(B|A): likelihood
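
A tiny numeric illustration of the theorem; all probability values below are made up, and P(B) is obtained by marginalising over A.

    p_A = 0.3              # prior P(A)           (made-up value)
    p_B_given_A = 0.8      # likelihood P(B|A)    (made-up value)
    p_B_given_notA = 0.2   # likelihood P(B|~A)   (made-up value)

    p_B = p_A * p_B_given_A + (1 - p_A) * p_B_given_notA   # P(B) by total probability
    p_A_given_B = p_A * p_B_given_A / p_B                  # posterior P(A|B)
    print(round(p_A_given_B, 3))                           # 0.24 / 0.38 ~= 0.632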
Discriminative vs. Generative Model

W* = argmax_W P(W|SS)

- Discriminative model: compute P(W|SS) directly
- Generative model: compute P(W) . P(SS|W)
Corpus

- A collection of text, called a corpus, is used for collecting various language data
- With annotation: more information, but manual-labor intensive
- Practice: label automatically, then correct manually
- The famous Brown Corpus contains 1 million tagged words
- Switchboard: a very famous corpus of 2400 conversations, 543 speakers, many US dialects, annotated with orthography and phonetics
Example-1 of Application of Noisy Channel Model:
Probabilistic Speech Recognition (Isolated Word) [8]

- Problem definition: given a sequence of speech signals, identify the words
- 2 steps:
  - Segmentation (word boundary detection)
  - Identify the word
- Isolated word recognition: identify W given the speech signal SS

  W^ = argmax_W P(W|SS)
Identifying the word

W^ = argmax_W P(W|SS)
   = argmax_W P(W) P(SS|W)

- P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable!
- P(W) = prior probability, called the "language model"

  P(W) = (# times W appears in the corpus) / (# words in the corpus)
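
A minimal sketch of this count-based prior and the argmax decision. The toy corpus and the phonological-model scores P(SS|W) are hypothetical; a real recogniser would compute the latter from the pronunciation automaton shown next.

    from collections import Counter

    corpus = "i want to eat a tomato i want a potato".split()
    counts = Counter(corpus)
    total = len(corpus)

    def prior(w):
        # P(W) = (# times W appears in the corpus) / (# words in the corpus)
        return counts[w] / total

    # Hypothetical phonological-model likelihoods P(SS|W) for one speech signal SS.
    likelihood = {"tomato": 0.020, "potato": 0.015, "tornado": 0.018}

    best = max(likelihood, key=lambda w: prior(w) * likelihood[w])
    print(best)   # "tomato": 0.1 * 0.020 beats "potato" (0.1 * 0.015) and "tornado" (prior 0)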
Pronunciation Dictionary

Pronunciation automaton for the word "tomato": states s1 ... s7, with the arc path
t -> o -> m -> (ae with probability 0.73 / aa with probability 0.27) -> t -> o -> end,
and all other arc probabilities equal to 1.0.

- P(SS|W) is maintained in this way.
- P(t o m ae t o | word is "tomato") = product of arc probabilities
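
A small sketch of scoring a pronunciation as the product of arc probabilities. The only numbers taken from the slide are 0.73 for "ae" and 0.27 for "aa"; all other arcs are 1.0.

    def path_probability(arc_probs):
        # P(phoneme sequence | word) = product of the arc probabilities along the path
        p = 1.0
        for prob in arc_probs:
            p *= prob
        return p

    # "t o m ae t o" for "tomato": only the ae/aa arc is non-deterministic.
    print(path_probability([1.0, 1.0, 1.0, 0.73, 1.0, 1.0]))   # 0.73  (the "ae" pronunciation)
    print(path_probability([1.0, 1.0, 1.0, 0.27, 1.0, 1.0]))   # 0.27  (the "aa" pronunciation)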
Example Problem-2

- Analyse the sentiment of the text
- Positive or negative polarity
- Challenges:
  - Unclean corpora
  - Thwarted expression: "The movie has everything: cast, drama, scene, photography, story; the director has managed to make a mess of all this."
  - Sarcasm: "The movie has everything: cast, drama, scene, photography, story; see at your own risk."
Sentiment Classification

- Positive, negative, neutral: 3 classes
- Create a representation for the document
- Classify the representation

The most popular way of representing a document is a feature vector (indicator sequence), as sketched below.
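
A minimal sketch of such an indicator vector over a small, hand-picked vocabulary; the vocabulary here is hypothetical, and in practice it would be built from the training corpus.

    vocabulary = ["good", "bad", "excellent", "boring", "mess", "drama"]   # hypothetical feature set

    def indicator_vector(document):
        tokens = set(document.lower().split())
        return [1 if term in tokens else 0 for term in vocabulary]

    print(indicator_vector("The drama was excellent"))   # [0, 0, 1, 0, 0, 1]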
Established Techniques

- Naïve Bayes Classifier (NBC)
- Support Vector Machines (SVM)
- Neural Networks
- K-nearest neighbour classifier
- Latent Semantic Indexing
- Decision Tree (ID3)
- Concept-based indexing
Successful Approaches

The following are successful approaches as reported in the literature:

- NBC: simple to understand and implement
- SVM: complex, requires the foundations of perceptrons
Mathematical Setting

Indicator/feature vectors are to be formed.

We have a training set:
A: positive sentiment docs
B: negative sentiment docs

Let the classes of positive and negative documents be C+ and C-, respectively.

Given a new document D, label it positive if P(C+|D) > P(C-|D).
Prior Probability

Document   Vector   Classification
D1         V1       +
D2         V2       -
D3         V3       +
..         ..       ..
D4000      V4000    -

Let T = total number of documents, and let |+| = M, so |-| = T - M.

P(D being positive) = M/T

The prior probability is calculated without considering any features of the new document.
Apply Bayes Theorem

Steps followed for the NBC algorithm:
- Calculate the prior probabilities of the classes, P(C+) and P(C-)
- Calculate the feature probabilities of the new document, P(D|C+) and P(D|C-)
- The probability of a document D belonging to a class C is given by Bayes' theorem:

  P(C|D) = P(C) * P(D|C) / P(D)

- The document belongs to C+ if

  P(C+) * P(D|C+) > P(C-) * P(D|C-)
Calculating P(D|C+)

P(D|C+) is the probability of document D given the class C+. It is calculated as follows:

- Identify a set of features/indicators to evaluate a document and generate a feature vector VD = <x1, x2, x3, ..., xn>
- Hence, P(D|C+) = P(VD|C+)
                 = P(<x1, x2, x3, ..., xn> | C+)
                 = |<x1, x2, x3, ..., xn>, C+| / |C+|
- Under the assumption that all features are independently, identically distributed (IID):
  P(<x1, x2, x3, ..., xn> | C+)
    = P(x1|C+) * P(x2|C+) * P(x3|C+) * ... * P(xn|C+)
    = prod_{i=1..n} P(xi|C+)
- P(xi|C+) can now be calculated as |xi|/|C+| (a small end-to-end sketch follows below)
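
Putting the steps together, here is a minimal Naïve Bayes sketch. The four-document training set is hypothetical; it uses binary (indicator) features, and it adds simple add-one smoothing for unseen features, which the slides do not discuss.

    from collections import Counter, defaultdict

    train = [
        ("a gripping story with a great cast", "+"),
        ("wonderful photography and fine drama", "+"),
        ("a mess of a movie", "-"),
        ("boring story see at your own risk", "-"),
    ]

    doc_count = Counter(label for _, label in train)      # |C+| and |C-|
    word_count = defaultdict(Counter)                     # per-class feature counts
    for text, label in train:
        for w in set(text.split()):                       # binary (indicator) features
            word_count[label][w] += 1

    def score(text, label):
        s = doc_count[label] / len(train)                 # prior P(C)
        for w in set(text.split()):
            # add-one smoothed estimate of P(x_i | C) = (count + 1) / (|C| + 2)
            s *= (word_count[label][w] + 1) / (doc_count[label] + 2)
        return s

    def classify(text):
        return "+" if score(text, "+") > score(text, "-") else "-"

    print(classify("great cast and fine story"))          # expected: +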
Baseline Accuracy

- Just on tokens as features: 80% accuracy
- 20% probability of a document being misclassified
- On large sets this is significant
To improve accuracy...

- Clean the corpora
- POS tag
- Concentrate on critical POS tags (e.g. adjectives), as in the sketch below
- Remove 'objective' sentences ('of' ones)
- Do aggregation

Use minimal to sophisticated NLP.
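
A sketch of the "concentrate on critical POS tags" idea, assuming NLTK is installed and its tokenizer/tagger resources have been downloaded (resource names may differ slightly across NLTK versions).

    import nltk   # assumes: pip install nltk; nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    def adjective_features(text):
        tokens = nltk.word_tokenize(text)
        tagged = nltk.pos_tag(tokens)                     # Penn Treebank tags
        return [w.lower() for w, tag in tagged if tag.startswith("JJ")]   # JJ, JJR, JJS = adjectives

    print(adjective_features("The photography is stunning but the plot is a boring mess"))
    # typically something like ['stunning', 'boring']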


Allied Disciplines

Philosophy: Semantics, meaning of "meaning", logic (syllogism)
Linguistics: Study of syntax, lexicon, lexical semantics, etc.
Probability and Statistics: Corpus linguistics, testing of hypotheses, system evaluation
Cognitive Science: Computational models of language processing, language acquisition
Psychology: Behavioristic insights into language processing, psychological models
Brain Science: Language processing areas in the brain
Physics: Information theory, entropy, random fields
Computer Sc. & Engg.: Systems for NLP