
DS 207: Introduction to

Natural Language Processing

Representing Words
Danish Pruthi
Quick feedback: how is the pace of teaching?
• Too slow 🥱
• Just about right 😎
• Too fast 🤕

2
Representing words
• Ideally, we want to represent the meaning (or idea) conveyed by that word
• Synonyms (and related words) should be represented similarly
• WordNet

3
WordNet: Hyponyms & Hypernyms

• Not scalable, subjective, labor intensive, cannot compute similarities …


4
"You shall know a word by the company it keeps"
John Rupert Firth, 1957

• I offered her some random

• It took some time for my random to brew

• I can not live without drinking random in the morning

5
Word {vectors/representation/embedding}

coffee = [-3.2, -2.9, 1.0, 2.2, 0.6, …]
tea    = [-3.8, -2.0, 1.1, 2.3, 0.5, …]

6
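As a quick sanity check of the idea that related words get similar vectors, here is a minimal NumPy sketch computing the cosine similarity of the two (illustrative, truncated) example vectors above:

```python
import numpy as np

# Illustrative 5-dimensional slices of the example vectors above;
# real word2vec embeddings typically have 100-300 dimensions.
coffee = np.array([-3.2, -2.9, 1.0, 2.2, 0.6])
tea    = np.array([-3.8, -2.0, 1.1, 2.3, 0.5])

def cosine(a, b):
    """Cosine similarity: dot product of length-normalized vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(coffee, tea))  # ~0.98, i.e. the two words are represented similarly
```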
Word2vec
• Method for learning word vectors
(Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, Jeffrey Dean. NeurIPS 2013)

• Key idea:
• Use a large corpus of text
• Let vc be the vector of the center word and vo the vector of a context ("outside") word
• Use the similarity of the word vectors vc and vo to compute the probability of a word
being used in the context of the center word (and vice versa)

• Keep adjusting (aka learning) the word vectors to maximize this probability

7
Skipgram model: predicting context

I can not live without drinking coffee in the morning as …

I offered her a cup of coffee to drink

It took some time for my coffee to brew

Free supervision (aka self supervision)!

8
Skipgram objective


$L(\theta) = \prod_{t=1}^{T} P(\text{context} \mid w_t;\, \theta)$

$L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t;\, \theta)$

9
Skipgram objective

$P(w_{t+j} \mid w_t;\, \theta)$   or …   $P(w_o \mid w_c)$   or …   $P(o \mid c)$

$P(o \mid c) = \dfrac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$

10
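A minimal sketch of this softmax probability, assuming U and V are matrices holding the "outside" vectors u_w and "center" vectors v_w (these array names are assumptions, not from the slides):

```python
import numpy as np

def softmax_prob(o, c, U, V):
    """P(o | c) for the skipgram model.

    U: |V| x d matrix of "outside" (context) vectors u_w
    V: |V| x d matrix of "center" vectors v_w
    o, c: integer word indices
    """
    scores = U @ V[c]                     # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()                # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]
```

Note the denominator sums over the entire vocabulary, which is exactly the cost that negative sampling (next slide) avoids.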
Skipgram objective w/ negative sampling
• For efficiency, one can use negative sampling:

$\arg\max_{\theta} \; \prod_{(c,o) \in D} P(D = 1 \mid c, o;\, \theta) \prod_{(c,o) \in D'} P(D = 0 \mid c, o;\, \theta)$

$\arg\max_{\theta} \; \prod_{(c,o) \in D} \sigma(u_o^{\top} v_c) \prod_{(c,o) \in D'} \bigl(1 - \sigma(u_o^{\top} v_c)\bigr)$

$\arg\max_{\theta} \; \sum_{(c,o) \in D} \log \sigma(u_o^{\top} v_c) + \sum_{(c,o) \in D'} \log \sigma(-u_o^{\top} v_c)$



11
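A minimal sketch of the final (log) form of this objective for a single (center, context) pair and a handful of sampled negative words; the array names U, V and the list of negatives are assumptions, not part of the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(c, o, negatives, U, V):
    """log sigma(u_o^T v_c) + sum_k log sigma(-u_k^T v_c)
    for one (center c, context o) pair and k sampled negative words."""
    pos = np.log(sigmoid(U[o] @ V[c]))                              # true pair from D
    neg = sum(np.log(sigmoid(-U[k] @ V[c])) for k in negatives)     # sampled pairs from D'
    return pos + neg   # maximized during training (or minimize its negation)
```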
Continuous Bag-of-words
(Mikolov et al. 2013)
• Predict word based on sum of surrounding embeddings

[Figure: context "giving a *** at the" → lookup the embedding of each context word → sum (+) → multiply by W → scores → softmax → probs → loss against the target word "talk"]
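A minimal sketch of the pictured CBOW forward pass, assuming an embedding table E and an output matrix W (hypothetical names): sum the context embeddings, score the vocabulary, and take the cross-entropy loss against the target word.

```python
import numpy as np

def cbow_loss(context_ids, target_id, E, W):
    """CBOW: predict the target word from the sum of its context embeddings.

    E: |V| x d embedding (lookup) table
    W: |V| x d output matrix producing one score per vocabulary word
    """
    h = E[context_ids].sum(axis=0)        # lookup + sum ("giving", "a", "at", "the")
    scores = W @ h                        # one score per vocabulary word
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[target_id])      # negative log-likelihood of "talk"
```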
Skip-gram
(Mikolov et al. 2013)
• Predict each word in the context given the word

[Figure: center word "talk" → lookup its embedding → multiply by W → scores → loss against each context word "giving", "a", "at", "the"]
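The mirror-image sketch for skip-gram, reusing the same hypothetical E and W: score the whole vocabulary from the center word's embedding and sum the loss over each context word.

```python
import numpy as np

def skipgram_loss(center_id, context_ids, E, W):
    """Skip-gram: predict every context word from the center word."""
    v = E[center_id]                      # lookup the center word ("talk")
    scores = W @ v                        # one score per vocabulary word
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    # one loss term per context word ("giving", "a", "at", "the")
    return -sum(np.log(probs[o]) for o in context_ids)
```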
After training … word analogies
• Man : Woman : : King : ??
• Big : Biggest : : Bad : ??
• Man : Programmer : : Woman : ??

14
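Such analogies are typically solved by vector arithmetic followed by a nearest-neighbor search. A minimal sketch, assuming vectors is a dict from word to NumPy array (a hypothetical structure, not from the slides):

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ?  by finding the word closest to (b - a + c)."""
    query = vectors[b] - vectors[a] + vectors[c]
    query /= np.linalg.norm(query)
    best, best_sim = None, -np.inf
    for w, v in vectors.items():
        if w in (a, b, c):                # standard trick: skip the query words themselves
            continue
        sim = np.dot(query, v / np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# e.g. analogy("man", "woman", "king", vectors) is expected to return "queen"
```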
After training …

15
Extensions to phrases
• w2v("British Airways") - w2v("Britain") + w2v("India") ≈ w2v("Air India")
• w2v("Steve Balmer") - w2v("Microsoft") + w2v("Google") ≈ w2v("Larry Page")

16
Also additive compositionality

17
Word similarities

WordSimilarity-353

18
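Word vectors are commonly scored on WordSimilarity-353 by correlating human similarity judgments with cosine similarities of the learned vectors. A hedged sketch, assuming pairs is a list of (word1, word2, human_score) tuples and vectors maps words to NumPy arrays:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_wordsim(pairs, vectors):
    """Spearman correlation between human scores and model cosine similarities."""
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
            v1, v2 = vectors[w1], vectors[w2]
            cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
            human.append(score)
            model.append(cos)
    rho, _ = spearmanr(human, model)
    return rho
```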
Count-based word vectors
• We've studied two extremes so far:
• One-hot vectors, and
• Dense word embeddings

19
Co-occurrence matrices

Figure from Jurafsky & Martin

• Which similarity metric is a better choice? Dot product or cosine similarity?


sim(cherry, information) = 0.018
sim(digital, information) = 0.996

20
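To see why cosine is the better choice, here is a small sketch with illustrative co-occurrence counts (approximate, not the exact figures from Jurafsky & Martin): raw dot products are dominated by frequent words like "information", while cosine normalizes vector length away.

```python
import numpy as np

# Illustrative co-occurrence counts; rows = target words,
# columns = context words [pie, data, computer]
cherry      = np.array([442.0,    8.0,    2.0])
digital     = np.array([  5.0, 1683.0, 1670.0])
information = np.array([  5.0, 3982.0, 3325.0])

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot products are inflated for any pair of frequent words;
# cosine recovers the intuitive result: cherry is unlike information.
print(np.dot(cherry, information), np.dot(digital, information))
print(cosine(cherry, information), cosine(digital, information))  # ~0.018 vs ~0.996
```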
Point-wise Mutual Information
$\mathrm{PMI}(w, c) = \log_2 \dfrac{P(w, c)}{P(w)\,P(c)}$

Positive Point-wise Mutual Information

$\mathrm{PPMI}(w, c) = \max\!\left(\log_2 \dfrac{P(w, c)}{P(w)\,P(c)},\; 0\right)$

21
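A minimal sketch of computing PPMI for every cell of a co-occurrence count matrix (the function and variable names are mine, not from the slides):

```python
import numpy as np

def ppmi(counts):
    """Positive PMI for a |W| x |C| co-occurrence count matrix."""
    total = counts.sum()
    p_wc = counts / total                       # joint P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)       # marginal P(w), per row
    p_c = p_wc.sum(axis=0, keepdims=True)       # marginal P(c), per column
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))       # zero counts give -inf here
    return np.maximum(pmi, 0.0)                 # clamp negatives (and -inf) to 0
```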
Point-wise Mutual Information

Values from Jurafsky & Martin


Speech & Language Processing
22
Point-wise Mutual Information

23
Summary of word representations
• We started with word identities

• Manual efforts to write down meanings, relationships among words

• Dense representations based on the idea that meanings of words can be inferred from
the context in which they occur

• Capture interesting semantic and syntactic relationships, but also biases!

• Other count-based methods

24
