DS 207: Introduction to
Natural Language Processing
Representing Words
Danish Pruthi
Quick feedback: how is the pace of teaching?
• Too slow 🥱
• Just about right 😎
• Too fast 🤕
Representing words
• Ideally, we want to represent the meaning (or idea) conveyed by that word
• Synonyms (and related words) should be represented similarly
• WordNet
WordNet: Hyponyms & Hypernyms
• Not scalable, subjective, labor-intensive, cannot compute similarities …
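For concreteness, WordNet's hand-curated relations can be queried with NLTK. This is a small illustrative sketch; it assumes the WordNet corpus has already been downloaded with nltk.download("wordnet").

```python
# Illustrative sketch of WordNet lookups via NLTK
# (assumes nltk is installed and the WordNet corpus has been downloaded).
from nltk.corpus import wordnet as wn

coffee = wn.synsets("coffee")[0]   # first sense of "coffee"
print(coffee.definition())         # hand-written gloss
print(coffee.hypernyms())          # more general concepts (e.g., a beverage)
print(coffee.hyponyms())           # more specific concepts
print(coffee.lemma_names())        # synonyms grouped in this synset
```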
"You shall know a word by the company it keeps"
John Rupert Firth, 1957
• I offered her some random
• It took some time for my random to brew
• I can not live without drinking random in the morning
Word {vectors / representations / embeddings}
coffee ≈ [-3.2, -2.9, 1.0, 2.2, 0.6, …]
tea ≈ [-3.8, -2.0, 1.1, 2.3, 0.5, …]
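Using only the five dimensions shown above (an illustrative fragment, not trained embeddings), one can check that the two vectors point in nearly the same direction:

```python
import numpy as np

# First five dimensions of the illustrative vectors above (not trained embeddings).
coffee = np.array([-3.2, -2.9, 1.0, 2.2, 0.6])
tea    = np.array([-3.8, -2.0, 1.1, 2.3, 0.5])

cosine = coffee @ tea / (np.linalg.norm(coffee) * np.linalg.norm(tea))
print(f"cosine(coffee, tea) = {cosine:.3f}")   # close to 1.0 => very similar
```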
Word2vec
• Method for learning word vectors
(Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, Jeffrey Dean. NeurIPS 2013)
• Key idea:
• Use a large corpus of text
• For a center word c (with vector v_c) and each context ("outside") word o (with vector u_o)
• Use the similarity of the vectors v_c and u_o to compute the probability of o occurring in
the context of c (and vice versa)
• Keep adjusting (aka learning) the word vectors to optimize the probability
Skipgram model: predicting context
I can not live without drinking coffee in the morning as …
I offered her a cup of coffee to drink
It took some time for my coffee to brew
Free supervision (aka self-supervision)!
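A minimal sketch of how such (center, context) training pairs can be read off raw text; whitespace tokenization and the window size m are simplifying assumptions:

```python
# Minimal sketch: extracting (center, context) skip-gram pairs from raw text.
# Whitespace tokenization and a fixed window size m are simplifying assumptions.
def skipgram_pairs(sentence, m=2):
    tokens = sentence.lower().split()
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs("It took some time for my coffee to brew"))
```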
Skipgram objective
$$L(\theta) = \prod_{t=1}^{T} P(\text{context} \mid w_t;\, \theta)$$

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t;\, \theta)$$
Skipgram objective
$P(w_{t+j} \mid w_t; \theta)$, also written $P(w_o \mid w_c)$ or simply $P(o \mid c)$

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
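A sketch of this softmax with randomly initialized vectors; the vocabulary size and embedding dimension below are arbitrary illustrative choices:

```python
import numpy as np

# Sketch of P(o | c) = exp(u_o^T v_c) / sum_w exp(u_w^T v_c).
# Vocabulary size and embedding dimension are arbitrary illustrative choices.
rng = np.random.default_rng(0)
V, d = 10_000, 100                 # vocabulary size, embedding dimension
U = rng.normal(size=(V, d))        # "outside" vectors u_w, one row per word
v_c = rng.normal(size=d)           # center word vector

scores = U @ v_c                   # u_w^T v_c for every word w
scores -= scores.max()             # subtract max for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()

o = 42                             # index of some context word
print(probs[o], probs.sum())       # P(o | c); probabilities sum to 1
```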
Skipgram objective w/ negative sampling
• For efficiency, one can use negative sampling:
$$\arg\max_\theta \prod_{(c,o) \in D} P(D = 1 \mid c, o; \theta) \prod_{(c,o) \in D'} P(D = 0 \mid c, o; \theta)$$

$$\arg\max_\theta \prod_{(c,o) \in D} \sigma(u_o^\top v_c) \prod_{(c,o) \in D'} \left(1 - \sigma(u_o^\top v_c)\right)$$

$$\arg\max_\theta \sum_{(c,o) \in D} \log \sigma(u_o^\top v_c) + \sum_{(c,o) \in D'} \log \sigma(-u_o^\top v_c)$$
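A sketch of the negative-sampling objective for a single (c, o) pair; sampling negatives uniformly (rather than from the 3/4-power unigram distribution used by word2vec) is a simplification:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Negative-sampling log-likelihood for one positive (c, o) pair and k sampled negatives.
# Uniform negative sampling is a simplification; word2vec samples negatives from a
# unigram distribution raised to the 3/4 power.
def neg_sampling_logprob(v_c, u_o, U, k=5, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    neg_ids = rng.integers(0, U.shape[0], size=k)        # indices of "fake" context words
    pos_term = np.log(sigmoid(u_o @ v_c))                 # pull the real pair together
    neg_term = np.log(sigmoid(-U[neg_ids] @ v_c)).sum()   # push sampled pairs apart
    return pos_term + neg_term                            # quantity to maximize

# Example with random vectors (sizes are arbitrary for illustration)
rng = np.random.default_rng(0)
vocab_size, d = 1000, 50
U = rng.normal(size=(vocab_size, d))      # "outside" vectors
Vmat = rng.normal(size=(vocab_size, d))   # center-word vectors
print(neg_sampling_logprob(Vmat[3], U[7], U))
```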
Continuous Bag-of-words
(Mikolov et al. 2013)
• Predict word based on sum of surrounding embeddings
[Diagram: the embeddings of the context words "giving", "a", "at", "the" are looked up and summed; multiplying the sum by W gives scores over the vocabulary, a softmax turns the scores into probabilities, and the loss is computed against the target word "talk".]
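A sketch of this CBOW forward pass; the toy vocabulary and the randomly initialized matrices E (embeddings) and W (output weights) are illustrative assumptions:

```python
import numpy as np

# Sketch of the CBOW forward pass for "giving a *** at the" -> "talk".
# The vocabulary and the matrices E and W are illustrative.
vocab = {"giving": 0, "a": 1, "at": 2, "the": 3, "talk": 4}
V, d = len(vocab), 8
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))                      # input embeddings (lookup table)
W = rng.normal(size=(V, d))                      # output weights

context = ["giving", "a", "at", "the"]
h = sum(E[vocab[w]] for w in context)            # lookup + sum of context embeddings
scores = W @ h                                   # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                             # softmax over the vocabulary
loss = -np.log(probs[vocab["talk"]])             # cross-entropy against target "talk"
print(loss)
```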
Skip-gram
(Mikolov et al. 2013)
• Predict each word in the context given the word
[Diagram: the embedding of the center word "talk" is looked up and multiplied by W to get scores; a separate loss is computed for each context word "giving", "a", "at", "the".]
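A mirror-image sketch of the skip-gram forward pass, under the same illustrative assumptions (toy vocabulary, random E and W) as the CBOW sketch above:

```python
import numpy as np

# Sketch of the skip-gram forward pass: predict each context word from "talk".
# Vocabulary and weight matrices are illustrative, as in the CBOW sketch.
vocab = {"giving": 0, "a": 1, "at": 2, "the": 3, "talk": 4}
V, d = len(vocab), 8
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))                      # input embeddings
W = rng.normal(size=(V, d))                      # output weights

h = E[vocab["talk"]]                             # lookup of the center word only
scores = W @ h
probs = np.exp(scores - scores.max())
probs /= probs.sum()                             # softmax over the vocabulary

context = ["giving", "a", "at", "the"]
loss = -sum(np.log(probs[vocab[w]]) for w in context)   # sum of per-context-word losses
print(loss)
```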
After training … word analogies
• Man : Woman :: King : ??
• Big : Biggest :: Bad : ??
• Man : Programmer :: Woman : ??
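These analogies can be reproduced with pretrained vectors, e.g. via gensim; the model name below assumes the large (>1 GB) "word2vec-google-news-300" download from gensim-data is available:

```python
# Sketch using gensim's pretrained vectors; assumes the gensim-data
# "word2vec-google-news-300" model (>1 GB download) is available.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# man : woman :: king : ?   ->   vector(king) - vector(man) + vector(woman)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The same arithmetic also surfaces biases, e.g. man : programmer :: woman : ?
print(wv.most_similar(positive=["programmer", "woman"], negative=["man"], topn=1))
```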
After training …
Extensions to phrases
• w2v("British Airways") - w2v("Britain") + w2v("India") ≈ w2v("Air India")
• w2v("Steve Balmer") - w2v("Microsoft") + w2v("Google") ≈ w2v("Larry Page")
Also additive compositionality
Word similarities
WordSimilarity-353
Count-based word vectors
• We've studied two extremes so far:
• One-hot vectors, and
• Dense word embeddings
Co-occurrence matrices
Figure from Jurafsky & Martin
• Which similarity metric is a better choice? Dot product or cosine similarity?
sim(cherry, information) = 0.018
sim(digital, information) = 0.996
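A sketch of why cosine is preferable on raw counts: dot products grow with word frequency, while cosine normalizes away vector length. The counts below approximate the Jurafsky & Martin example and should be treated as illustrative:

```python
import numpy as np

# Approximate co-occurrence counts in the spirit of the Jurafsky & Martin example
# (columns: computer, data, result, pie, sugar); treat the exact numbers as illustrative.
cherry      = np.array([2, 8, 9, 442, 25])
digital     = np.array([1670, 1683, 85, 5, 4])
information = np.array([3325, 3982, 378, 5, 13])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Raw dot products are dominated by how frequent the words are ...
print(cherry @ information, digital @ information)
# ... while cosine normalizes away vector length:
print(cosine(cherry, information))   # ≈ 0.02, cf. sim(cherry, information) above
print(cosine(digital, information))  # ≈ 0.996, cf. sim(digital, information) above
```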
Point-wise Mutual Information
$$\mathrm{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}$$

Positive Point-wise Mutual Information

$$\mathrm{PPMI}(w, c) = \max\!\left(\log_2 \frac{P(w, c)}{P(w)\,P(c)},\; 0\right)$$
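A sketch of computing PPMI from a word-by-context co-occurrence count matrix; the tiny count matrix at the bottom is purely illustrative:

```python
import numpy as np

# Sketch: PPMI from a word-by-context co-occurrence count matrix (illustrative counts).
def ppmi(counts, eps=1e-12):
    total = counts.sum()
    p_wc = counts / total                       # joint P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)       # marginal P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)       # marginal P(c)
    pmi = np.log2((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0)                   # PPMI: clip negative PMI values to 0

counts = np.array([[2.0, 8.0, 442.0],
                   [1670.0, 1683.0, 5.0]])
print(ppmi(counts))
```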
Point-wise Mutual Information
Values from Jurafsky & Martin, Speech & Language Processing
Summary of word representations
• We started with word identities
• Manual efforts to write down meanings and relationships among words (e.g., WordNet)
• Dense representations based on the idea that meanings of words can be inferred from
the context in which they occur
• Capture interesting semantic and syntactic relationships, but also biases!
• Other count-based methods (co-occurrence counts, PMI/PPMI)