Bag of Words Models
Bonan Min
bonanmin@gmail.com
Some slides are based on class materials from Ralph Grishman, Thien Huu Nguyen
Bag of Words Models
When do we need elaborate linguistic analysis?
Look at NLP applications
◦ document retrieval (a.k.a., information retrieval)
◦ opinion mining
◦ association mining
See how far we can get with document-level bag-of-words models
◦ and introduce some of our mathematical approaches
Application 1: Information Retrieval
Task: given query = list of keywords, identify and rank relevant
documents from collection
Basic idea:
◦ Find documents whose set of words most closely matches words in query
Vector Space Model
Suppose the document collection has n distinct words, w1, …, wn
Each document is characterized by an n-dimensional vector whose ith
component is the frequency of word wi in the document
Example
◦ D1 = [The cat chased the mouse.]
◦ D2 = [The dog chased the cat.]
◦ W = [The, chased, dog, cat, mouse] (n = 5)
◦ V1 = [ 2 , 1 , 0 , 1 , 1 ]
◦ V2 = [ 2 , 1 , 1 , 1 , 0 ]
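As a minimal sketch (plain Python, with an illustrative term_vector helper that is not part of any library), the vectors V1 and V2 can be built by counting vocabulary words:

```python
from collections import Counter

def term_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(w.lower().strip(".") for w in doc.split())
    return [counts[w.lower()] for w in vocab]

W = ["The", "chased", "dog", "cat", "mouse"]
print(term_vector("The cat chased the mouse.", W))  # V1 = [2, 1, 0, 1, 1]
print(term_vector("The dog chased the cat.", W))    # V2 = [2, 1, 1, 1, 0]
```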
Weighting the Components
Unusual words like "elephant" determine the topic much more than
common words such as "the" or "have"
◦ can ignore words on a stop list or
◦ weight each term frequency tfi by its inverse document frequency idfi
wi = tfi × idfi , where idfi = log ( N / ni ),
N = size of collection and ni = number of documents containing term i
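A small sketch of this weighting, assuming the common idfi = log(N / ni) variant (real systems often add smoothing, e.g. log(N / (1 + ni))); the helper name is illustrative:

```python
import math

def tf_idf(tf_vector, doc_freqs, N):
    """Weight each raw term frequency by log(N / n_i)."""
    return [tf * math.log(N / n_i) for tf, n_i in zip(tf_vector, doc_freqs)]

# Toy collection of N = 2 documents from the previous example:
# "the", "chased", "cat" occur in both documents; "dog" and "mouse" in one each.
doc_freqs = [2, 2, 1, 1, 1]
print(tf_idf([2, 1, 0, 1, 1], doc_freqs, N=2))
# Terms appearing in every document get idf = log(1) = 0.
```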
Cosine Similarity
Define a similarity metric between topic vectors
A common choice is cosine similarity (normalized dot product):
sim(A, B) = Σi ai bi / ( √(Σi ai²) · √(Σi bi²) )
[Figure: two term vectors plotted on axes w1 and w2, with the angle between them]
The cosine similarity metric is the cosine of the angle between the term
vectors
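A direct implementation of the formula, applied to the example vectors V1 and V2 (a sketch; in practice libraries such as scikit-learn provide this):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

V1 = [2, 1, 0, 1, 1]
V2 = [2, 1, 1, 1, 0]
print(cosine_similarity(V1, V2))  # ~0.857: the two example documents are very similar
```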
Verdict: a Success
For heterogeneous text collections, the vector space model, tf-idf
weighting, and cosine similarity have been the basis for successful
document retrieval for over 50 years
◦ stemming required for some languages
◦ limited resolution: returns documents, not answers
Application 2: Opinion Mining
Task: judge whether a document expresses a positive or negative opinion
(or no opinion) about an object or topic
◦ classification task
◦ valuable for producers/marketers of all sorts of products
Simple strategy: rule-based approach
◦ make lists of positive and negative words
◦ see which predominate in a given document (and mark as 'no opinion' if there
are few words of either type); see the sketch after this list
◦ problem: hard to make such lists
◦ hard to switch to different domains/labels/languages
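A minimal sketch of this strategy; the tiny word lists below are purely illustrative (real lexicons contain thousands of entries):

```python
POSITIVE = {"great", "excellent", "good", "reliable", "cheap"}
NEGATIVE = {"bad", "poor", "failed", "broken", "expensive"}

def rule_based_opinion(text, margin=1):
    """Label a document by whichever word list predominates."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos - neg >= margin:
        return "positive"
    if neg - pos >= margin:
        return "negative"
    return "no opinion"

print(rule_based_opinion("great camera and excellent battery"))  # positive
```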
Training a Classification Model
Supersedes the rule-based approach
◦ A generic (task-independent) learning algorithm trains a
classifier/function/model from a set of labeled examples
◦ The classifier learns, from these labeled examples, the characteristics a
new text should have in order to be assigned a given label
Advantages
◦ Annotating/locating training examples is cheaper than writing rules
◦ Easier updates to changing conditions (annotate more data with new labels
for new domains)
Naive Bayes Classification
Identify the most likely class:
s = argmaxt ∈ {pos, neg} P( t | W )
Use Bayes' rule:
argmaxt P( t | W ) = argmaxt P( W | t ) P( t ) / P( W )
= argmaxt P( W | t ) P( t )    (P( W ) doesn't change when changing t, so we drop it)
= argmaxt P( w1, ..., wn | t ) P( t )
= argmaxt Πi P( wi | t ) P( t )
based on the naïve assumption of independence of the word probabilities (given the class)
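A sketch of this decision rule in log space (to avoid underflow when multiplying many small probabilities); the priors and conditional probabilities are assumed to have been estimated already, as on the next slide, and the toy numbers below are made up:

```python
import math

def nb_classify(words, priors, cond_probs):
    """Return argmax_t P(t) * prod_i P(w_i | t), computed as a sum of logs."""
    best_label, best_score = None, float("-inf")
    for t, prior in priors.items():
        score = math.log(prior)
        for w in words:
            score += math.log(cond_probs[t].get(w, 1e-9))  # tiny floor; smoothing comes later
        if score > best_score:
            best_label, best_score = t, score
    return best_label

priors = {"pos": 0.5, "neg": 0.5}
cond_probs = {"pos": {"great": 0.30, "failed": 0.01},
              "neg": {"great": 0.05, "failed": 0.20}}
print(nb_classify(["great"], priors, cond_probs))  # pos
```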
Training
Estimate probabilities from the training corpus (N documents) using
maximum likelihood estimators
P ( t ) = count (docs labeled t) / N
P ( wi | t ) = count ( docs labeled t containing wi ) / count ( docs labeled t )
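A sketch of these estimates over a list of (words, label) pairs, in the Bernoulli style of the next slide (each word counts at most once per document); the function name is illustrative:

```python
from collections import defaultdict

def train_bernoulli_nb(labeled_docs):
    """labeled_docs: list of (list_of_words, label). Returns MLE P(t) and P(w | t)."""
    doc_counts = defaultdict(int)                             # count(docs labeled t)
    word_doc_counts = defaultdict(lambda: defaultdict(int))   # count(docs labeled t containing w)
    for words, label in labeled_docs:
        doc_counts[label] += 1
        for w in set(words):                                  # presence, not frequency
            word_doc_counts[label][w] += 1
    N = len(labeled_docs)
    priors = {t: c / N for t, c in doc_counts.items()}
    cond_probs = {t: {w: c / doc_counts[t] for w, c in word_doc_counts[t].items()}
                  for t in doc_counts}
    return priors, cond_probs
```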
Text Classification: Flavors
Bernoulli model: use presence (/ absence) of a term in a
document as feature
◦ formulas on previous slide
Multinomial model: based on frequency of terms in
documents:
◦ P ( t ) = total length of docs labeled t / total size of corpus
◦ P ( wi | t ) = count ( instances of wi in docs labeled t ) / total length of docs labeled t
Better performance on long documents
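The multinomial variant, sketched under the same illustrative conventions, replaces document counts with token counts:

```python
from collections import defaultdict

def train_multinomial_nb(labeled_docs):
    """Multinomial flavor: estimates based on term frequencies, not presence/absence."""
    token_counts = defaultdict(int)                        # total length of docs labeled t
    word_counts = defaultdict(lambda: defaultdict(int))    # instances of w in docs labeled t
    for words, label in labeled_docs:
        token_counts[label] += len(words)
        for w in words:
            word_counts[label][w] += 1
    corpus_size = sum(token_counts.values())
    priors = {t: n / corpus_size for t, n in token_counts.items()}
    cond_probs = {t: {w: c / token_counts[t] for w, c in word_counts[t].items()}
                  for t in token_counts}
    return priors, cond_probs
```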
The Importance of Smoothing
Suppose a glowing review, SLP2 (with lots of positive words), includes
one word, "mathematical", previously seen only in negative reviews. Then
P ( positive | SLP2 ) = 0
because P ( "mathematical" | positive ) = 0
The maximum likelihood estimate is poor when there is very little data
We need to ‘smooth’ the probabilities to avoid this problem
Add-One (Laplace) Smoothing
A simple remedy is to add 1 to each count
◦ for the conditional probabilities P( w | t ): Add 1 to each c(w, t)
◦ Increase the denominator by number of unique words (|V|). That is,
add |V| to c(t) to keep them as probabilities (sum up to 1)
so that Σw∈V P( w | t ) = 1
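A one-line sketch of the smoothed estimate (the helper name is illustrative):

```python
def smoothed_cond_prob(c_wt, c_t, vocab_size):
    """Add-one (Laplace) smoothing: P(w | t) = (c(w, t) + 1) / (c(t) + |V|)."""
    return (c_wt + 1) / (c_t + vocab_size)

# A word never seen with class t now gets a small nonzero probability:
print(smoothed_cond_prob(0, 100, 5000))   # ~0.000196 instead of 0
print(smoothed_cond_prob(10, 100, 5000))  # ~0.00216
```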
An Example
[Worked example table: per-class word counts, including zero counts, and the resulting smoothed estimates of P( t ) and P( w | t ).]
Some Useful Resources Using NLTK
Sentiment Analysis with Python NLTK Text Classification
◦ http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier)
◦ http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier
Problems with Bag-of-Words Models
Ambiguous terms: is “low” a positive or a negative term?
◦ “low” can be positive: “low price”
◦ or negative: “low quality”
Negation: How to handle “the equipment never failed”? A trick:
◦ modify words following negation
“the equipment never NOT_failed”
◦ treat them as a separate ‘negated’ vocabulary
How far can this trick go?
◦ e.g., "the equipment never failed and was cheap to run"
→ "the equipment never NOT_failed NOT_and NOT_was NOT_cheap NOT_to NOT_run"
◦ have to determine the scope of negation (a minimal sketch of the basic trick follows)
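A minimal sketch of the marking trick, using the crude heuristic of negating every token from a negation word up to the next punctuation mark (choosing a better scope is exactly the open problem noted above):

```python
import re

NEGATORS = {"not", "never", "no"}

def mark_negation(text):
    """Prefix tokens that follow a negation word with NOT_, up to the next
    punctuation mark -- a rough approximation of negation scope."""
    tokens = re.findall(r"\w+|[.,!?;]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in {".", ",", "!", "?", ";"}:
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in NEGATORS:
                negating = True
    return " ".join(out)

print(mark_negation("The equipment never failed and was cheap to run."))
# the equipment never NOT_failed NOT_and NOT_was NOT_cheap NOT_to NOT_run .
```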
Verdict: Mixed
A simple bag-of-words strategy with an NB model works quite well for
simple reviews referring to a single item
◦ Very fast, low storage requirements
◦ Robust to irrelevant features
◦ Irrelevant features cancel each other without affecting results
◦ Very good in domains with many equally important features
◦ Optimal if the independence assumptions hold
◦ If the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
but fails
◦ for ambiguous terms
◦ for negation
◦ for comparative reviews
◦ to reveal aspects of an opinion
◦ the car looked great and handled well, but the wheels kept falling off
Application 3: Association Mining
Goal: find interesting relationships among attributes of an object in a
large collection …
Objects with attribute A also have attribute B
◦ e.g., “people who bought A also bought B”
For text: documents with term A also have term B
◦ widely used in scientific and medical literature
Bag-of-Words
Simplest approach
◦ look for words x and y for which
frequency (x and y in same document) >> frequency of x * frequency of y
◦ Or use (pointwise) mutual information: PMI(x, y) = log [ P(x, y) / ( P(x) P(y) ) ]
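A sketch of the document-level co-occurrence test via pointwise mutual information (the helper name is illustrative; documents are represented as sets of terms):

```python
import math

def pmi(docs, x, y):
    """log [ P(x, y) / (P(x) P(y)) ], with probabilities estimated from document counts."""
    N = len(docs)
    n_x = sum(x in d for d in docs)
    n_y = sum(y in d for d in docs)
    n_xy = sum(x in d and y in d for d in docs)
    if 0 in (n_x, n_y, n_xy):
        return float("-inf")
    return math.log((n_xy / N) / ((n_x / N) * (n_y / N)))

docs = [{"gene", "protein"}, {"gene", "protein", "cell"}, {"market", "stock"}]
print(pmi(docs, "gene", "protein"))  # > 0: the terms co-occur more often than chance
```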
Doesn’t work well
◦ want to find names (of companies, products, genes), not individual words
◦ interested in specific types of terms
◦ want to learn from a few examples
◦ need contexts to avoid noise
Beyond Bag-of-Words Models
Effective Text Association Mining Needs
◦ Name recognition
◦ Term classification
◦ Ability to learn patterns (lexical sequence or syntactic)
Semantic and syntactic analyzers at varying levels can help
◦ Example from the medical literature: "the duration of diabetes mellitus was
the significant risk factor for cataracts"
Conclusion
We have reviewed bag-of-words models in the context of three tasks
◦ Document retrieval
◦ Opinion mining
◦ Association mining
Some tasks can be handled effectively (and very simply) by bag-of-words models,
but most benefit from an analysis of language structure