
TTIC 31190: Natural Language Processing

Lecture 9: Language Modeling

Fall 2023
Announcements
• Freda’s office hour this week
• Thu 1:30-2:30 pm, TTIC 4th floor open space

• TA (Jiamin Yang) Tutorial Sessions & Office Hours


• Fridays 3 pm – 4 pm; TTIC Room 530
• This week: HMM & CRF
• Office hour 4 pm – 5 pm

• Assignment 2 due on Nov 2, 11:59 pm


Recap
• Neural Networks
• Multi-Layer Perceptron (MLP)
• Convolutional neural network (CNN)
• Recurrent neural network (RNN)
• Transformer (Attention Is All You Need)

• Sequence Labeling (structured prediction)
  • Hidden Markov Model (HMM)
  • Conditional Random Field (CRF)
A bottle of tezgüino is on the table.
Everybody likes tezgüino.          (tezgüino = ?)
Don’t have tezgüino before you drive.
We make tezgüino out of corn.
CBOW (Continuous Bag-of-Words): learn representations that
predict a word given context
word2vec
A bottle of _____ is on the table.
Everybody likes _____.
Don't have _____ before you drive.
We make _____ out of corn.

Language Modeling
Language Modeling
• The Shannon game [Shannon 1951]:
How well can you predict the next
letter?
Language Modeling
“I challenge the claim that next-token prediction cannot
surpass human performance. On the surface, it looks like it
cannot. It looks like if you just learn to imitate, to predict what
people do, it means that you can only copy people. But here is
a counter argument for why it might not be quite so. If your
base neural net is smart enough, you just ask it — What would
a person with great insight, wisdom, and capability do?”

“It's actually a much deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token. It's not statistics.”

[Src: https://www.dwarkeshpatel.com/p/ilya-sutskever#details]
Language Models
• Language Model: a probability distribution over strings in a
language.
Language Modeling
• Language Modeling: the task of estimating this distribution
from data

• Define a statistical model with parameters θ

• Maximize the likelihood of the training data with respect to θ
Language Modeling
• Language Modeling: assign probabilities to token sequences
• Why?
• machine translation:
• P(turn the camera off) > P(put the camera out)
• speech recognition:
• P(be back soonish) > P(be bassoon dish)
• spelling/grammar correction:
• The office is about fifteen minuets from my house
• P(about fifteen minutes from) > P(about fifteen minuets from)
• assistive writing, dialogue systems, question answering, etc.!
Impact of size of language model training data (in words) on quality of
Arabic-English statistical machine translation system

ca. 2007
Nowadays: large language models
Language Models are Everywhere
Language Modeling
• Goal: compute the probability of a sequence of words:
  P(W) = P(w1, w2, ..., wn)

• Related task: probability of the next word given the previous words:
  P(wn | w1, w2, ..., wn-1)

• A model that computes either of these,
  P(W)   or   P(wn | w1, ..., wn-1),
  is called a language model (LM)
[SLP3: Chapter 3]
Language Modeling
• Building language models
• Generating from a language model
• Evaluating a language model

• Count-based language models
  • MLE estimation
  • Smoothing
• Neural language models
  • Feed-forward models
  • RNN models
  • Attention models
Language Modeling

How do we model?
Chain Rule
• Chain rule of probability:
  P(A, B) = P(A) P(B | A)

• In general, for a sequence:
  P(w1, w2, ..., wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) ... P(wn | w1, ..., wn-1)
Chain Rule
• Factor joint probability into a product of conditional probabilities:
  P(w1, ..., wn) = P(w1) P(w2 | w1) ... P(wn | w1, ..., wn-1)

• We have not yet made any independence assumptions

[SLP3: Chapter 3]
Chain Rule
• Factor joint probability into a product of conditional probabilities

• For example, "the cat sat on the mat":
  P(the cat sat on the mat) = P(the) × P(cat | the) × P(sat | the cat) × P(on | the cat sat) × P(the | the cat sat on) × P(mat | the cat sat on the)
[SLP3: Chapter 3]
Important Detail: Modeling Length
• a language model assigns probabilities to token sequences
• sequences can be any length, so the probabilities should sum to 1 across all possible sequences of all possible lengths
• usually length is modeled by including a "stop symbol" </s> at the end of the sequence and using "stopping probabilities"
• a "start symbol" <s> is also assumed to be at the beginning

• our language model with start/stop symbols:
  P(w1, ..., wn, </s>) = P(w1 | <s>) P(w2 | <s>, w1) ... P(</s> | <s>, w1, ..., wn)
Why Stopping Probabilities?
• our language model:
  P(w1, ..., wn, </s>)

• we need to ensure:
  the probabilities of all sequences, of all lengths, sum to 1

• consider removing stopping probabilities:
Without Stopping Probabilities
• without stopping probabilities, sums of probabilities over all possible length-1 and length-2 sequences:

  length = 1:  Σ_w P(w) = 1
  length = 2:  Σ_{w1} Σ_{w2} P(w1) P(w2 | w1) = 1

• uh oh… every length contributes total probability 1, so the sum over all lengths is greater than 1
With Stopping Probabilities
• With the stop symbol, probability mass is shared across sequences of different lengths, so the total can sum to 1

• The stop symbol also gives a signal to stop during generation
  • e.g., machine translation, automatic summarization
Other Ways of Modeling Length
• alternatively, we can model the length n explicitly (e.g., using a zero-
truncated Poisson distribution):
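A minimal sketch of this alternative, assuming a made-up five-word unigram model: draw the length n from a zero-truncated Poisson (sample from a Poisson and reject zeros), then generate exactly n tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_length(rate: float) -> int:
    """Zero-truncated Poisson: sample from Poisson(rate), rejecting zeros."""
    while True:
        n = rng.poisson(rate)
        if n > 0:
            return n

# hypothetical unigram model used only to fill in the n sampled positions
vocab = ["the", "cat", "sat", "on", "mat"]
probs = [0.3, 0.2, 0.2, 0.2, 0.1]

n = sample_length(rate=6.0)
sentence = rng.choice(vocab, size=n, p=probs)
print(n, " ".join(sentence))
```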
Estimating Language Model Probabilities
<s> I do not like green eggs and ham </s>

• let's use maximum likelihood estimation (MLE):
  P(wi | w1, ..., wi-1) = count(w1, ..., wi-1, wi) / count(w1, ..., wi-1)

• problem: we'll never have enough data!
Estimating Language Model Probabilities
• Suppose we have a vocabulary of size V; how many sequences of length n do we have?  V^n

Typical English vocabulary ~ 40k words

Even sentences of length <= 11 result in more than 4 * 10^50 sequences.
Too many to count! (# of atoms in the earth ~ 10^50)
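As a quick sanity check of the numbers on this slide (assuming a 40,000-word vocabulary and sentences of exactly 11 words):

```python
V = 40_000              # assumed vocabulary size (~40k words, as on the slide)
n = 11                  # sentence length
print(f"{V ** n:.2e}")  # ~4.19e+50 sequences, i.e. more than 4 * 10^50
```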
Markov Assumption
• Independence assumption: the next word only depends on the most recent past
• Reduces the number of estimated parameters in exchange for modeling capacity

  P(wi | w1, ..., wi-1) ≈ P(wi | wi-k, ..., wi-1)     (condition only on the most recent k words)

Andrey Markov
Markov Assumption
• Independence assumption: the next word only depends on the most recent past

  P(wi | w1, ..., wi-1) ≈ P(wi | wi-k, ..., wi-1)     (condition only on the most recent k words)

• 1st-order Markov: k = 1, i.e., P(wi | wi-1)
• 2nd-order Markov: k = 2, i.e., P(wi | wi-2, wi-1)
n-gram Language Models
• unigram language model:
  P(wi | w1, ..., wi-1) ≈ P(wi)

• bigram language model:
  P(wi | w1, ..., wi-1) ≈ P(wi | wi-1)

• trigram language model:
  P(wi | w1, ..., wi-1) ≈ P(wi | wi-2, wi-1)
n-gram Language Models
• unigram language model

• Example sentences generated by a unigram model trained on financial news:

  "fifth an of futures the an incorporated a a the inflation most dollars quarter in is mass"
  "thrift did eighty said hard 'm july bullish"
  "that or limited the"
n-gram Language Models
• bigram language model

• Example sentences generated by a bigram model trained on financial news:

  "texaco rose one in this issue is pursuing growth in a boiler house said mr. gurria mexico 's motion control proposal without permission from five hundred fifty five yen"
  "outside new car parking lot of the agreement reached"
  "this would be a record november"
n-gram Language Models

[SLP3: Chapter 3]
Estimating Bigram Probabilities
• maximum likelihood estimate (MLE):
  P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
[SLP3: Chapter 3]
An Example
training data:
  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>

MLE estimator:  P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

a few estimated bigram probabilities:
  P(I | <s>) = 2/3      P(Sam | <s>) = 1/3     P(am | I) = 2/3
  P(</s> | Sam) = 1/2   P(Sam | am) = 1/2      P(do | I) = 1/3
[SLP3: Chapter 3]
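A minimal sketch of the MLE computation on the toy corpus above; the helper name p_mle is made up for illustration.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

context_counts, bigram_counts = Counter(), Counter()
for line in corpus:
    tokens = line.split()
    context_counts.update(tokens[:-1])               # every token that can act as a context
    bigram_counts.update(zip(tokens, tokens[1:]))    # adjacent pairs

def p_mle(word, prev):
    """MLE bigram probability: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("am", "I"))    # 2/3
print(p_mle("Sam", "am"))  # 1/2
```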
Generating from a Language Model
Generating from a Language Model
• Bigram model

• Generate the first word by sampling from P(w | <s>)
• Generate the second word by sampling from P(w | first word)
• Generate the third word by sampling from P(w | second word)
• … (continue until </s> is sampled)
[SLP3: Chapter 3]
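A sketch of this sampling loop for a bigram model estimated on the same toy corpus (the names next_words and generate are made up for this example):

```python
import random
from collections import Counter, defaultdict

random.seed(0)

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

next_words = defaultdict(Counter)          # next_words[prev][w] = count(prev, w)
for line in corpus:
    tokens = line.split()
    for prev, w in zip(tokens, tokens[1:]):
        next_words[prev][w] += 1

def generate(max_len=20):
    """Start at <s>, repeatedly sample the next word from P(w | prev), stop at </s>."""
    prev, words = "<s>", []
    for _ in range(max_len):
        candidates, counts = zip(*next_words[prev].items())
        word = random.choices(candidates, weights=counts)[0]
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return " ".join(words)

print(generate())   # e.g. "I am Sam" or "Sam I am"
```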
Generating from a Language Model
• Trigram model: condition each word on the previous two words

• Generate the first word
• Generate the second word
• Generate the third word
• …
Generating from a Language Model

• Typical n-gram LMs are not sufficient to handle long-range dependencies:

  "The computer(s) that I just put into the machine room on the fifth floor is (are) crashing."
Generating from a Language Model
• GPT-4 generations, given a prefix / prompt

• Modern language models can take much longer context!
Generating from a Language Model (more)
• Greedy search: choose the most likely word at every step
To predict the next word given the previous two words wi-2 and wi-1, pick argmax_w P(w | wi-2, wi-1)

[src: https://blog.allenai.org/a-guide-to-language-model-sampling-in-allennlp-3b1239274bc3]
Generating from a Language Model (more)
• Top-k sampling: sample from only the k most probable next words (renormalized)
• Top-p (nucleus) sampling: sample from the smallest set of words whose cumulative probability exceeds p (renormalized)

[src: https://blog.allenai.org/a-guide-to-language-model-sampling-in-allennlp-3b1239274bc3]
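A rough numpy sketch of the two truncation strategies, over a made-up next-token distribution; both functions return a sampled token index.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k):
    """Keep the k most probable tokens, renormalize, and sample."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]            # indices of the k largest probabilities
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return rng.choice(len(probs), p=mask / mask.sum())

def top_p_sample(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches p, renormalize, and sample."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return rng.choice(len(probs), p=mask / mask.sum())

dist = [0.5, 0.2, 0.15, 0.1, 0.05]           # toy distribution over a 5-word vocabulary
print(top_k_sample(dist, k=2), top_p_sample(dist, p=0.9))
```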
Evaluating Language Models
Evaluating Language Models
Extrinsic (task-based) evaluation
• use language model in a system for some task, see if performance
improves
• downsides:
• can be time-consuming depending on task/system
• changing the language model might require changing how it’s used in the
system in order to improve performance
Evaluating Language Models
Intrinsic evaluation
• compute probability of held-out data
• standard metric: perplexity
• downside:
• may not correlate with system performance on downstream tasks
Probability of Held-out Data
• probability of held-out sentences s1, ..., sm:
  P(s1) · P(s2) · ... · P(sm)

• let's work with log-probabilities:
  log P(s1) + log P(s2) + ... + log P(sm)

• divide by the number of words M (including stop symbols) in the held-out sentences:
  l = (1/M) Σ_i log P(si)
Probability → Perplexity
• average token log-probability of held-out data:
  l = (1/M) Σ_i log2 P(si)

• perplexity:
  PPL = 2^(-l)      (the exponent -l is the cross entropy of the held-out data under the model)

• the lower the perplexity, the better the model
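A small worked example of the formula, using a deliberately trivial model that assigns every token probability 1/V (V assumed to be 20); the perplexity comes out to exactly V, which previews the branching-factor view below.

```python
import math

V = 20                                    # assumed vocabulary size
held_out = [["the", "cat", "sat", "</s>"],
            ["a", "dog", "barked", "loudly", "</s>"]]

# log2-probability of a sentence under the uniform model: every token gets 1/V
def log2_prob(tokens):
    return sum(math.log2(1.0 / V) for _ in tokens)

total = sum(log2_prob(s) for s in held_out)
M = sum(len(s) for s in held_out)         # number of predicted tokens, incl. </s>
avg_logprob = total / M                   # average per-token log2-probability
ppl = 2 ** (-avg_logprob)
print(ppl)                                # 20.0
```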


Perplexity (PPL)
• Measure how well a language model (LM) predicts the true data

• What is the intuition behind it?


Perplexity as Branching Factor
• given a vocabulary V, consider this uniform bigram language model:
  P(w | w') = 1 / |V|   for every w, w'

• perplexity of any sequence under this model?  |V|

Perplexity Example
• train: 38 million tokens (Wall Street Journal text)
• test: 1.5 million tokens
• vocabulary size: 19,979
  n-gram order:   unigram   bigram   trigram
  perplexity:         962      170       109

• though vocabulary size is ~20K, the trigram model is (roughly) considering 109 choices per position on average

[SLP3: Chapter 3]
Perplexity Example

[Src: https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word]
training data:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

test data:
<s> I like green eggs and ham </s>

problem: the bigram "I like" never appears in the training data, so P(like | I) = 0!

probability of the test sequence is 0, so its log-probability is -∞, so perplexity is infinite!
Smoothing
Smoothing
• instead of MLE, which leads to zeros, use a different estimation
method that leads to “smoother” distributions (fewer zeros)

(figure: example bigram probabilities for words such as "am", "do", "was", "like", "is", "green", before and after smoothing)
Smoothing
• Handle sparsity by making sure all probabilities are non-zero in our model

• Additive: add a small amount to all counts (then renormalize)
• Interpolation: use a combination of different granularities of n-grams
• Discounting: redistribute probability mass from observed n-grams to unobserved ones
“Add-1” estimation
• just add 1 to all counts!
• also called Laplace smoothing
• MLE estimate:
  P_MLE(wi | wi-1) = count(wi-1, wi) / count(wi-1)

• Add-1 estimate:
  P_Add-1(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V),   where V is the vocabulary size

• simple and avoids zeros, but doesn't work as well as other methods

[SLP3: Chapter 3]
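A sketch of the add-1 estimate on the same toy corpus; note that the previously impossible bigram "I like" now gets a small nonzero probability.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

context_counts, bigram_counts, vocab = Counter(), Counter(), set()
for line in corpus:
    tokens = line.split()
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))
    vocab.update(tokens[1:])          # words that can be predicted (includes </s>)

V = len(vocab)

def p_add1(word, prev):
    """Add-1 (Laplace) estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)

print(p_add1("am", "I"))    # seen bigram:   (2 + 1) / (3 + V)
print(p_add1("like", "I"))  # unseen bigram: (0 + 1) / (3 + V), no longer zero
```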
“Add-1” estimation
• (Berkeley restaurant corpus) Out of 9222 sentences
• Raw bigram counts

[SLP3: Chapter 3]
“Add-1” estimation
• (Berkeley restaurant corpus) Out of 9222 sentences
• Smoothed bigram counts

[SLP3: Chapter 3]
“Add-1” estimation
• (Berkeley restaurant corpus) Out of 9222 sentences
• Smoothed bigram probabilities

[SLP3: Chapter 3]
Backoff and Interpolation
• use multiple n-gram sizes in the same language model

• backoff:
• use trigram model if its probability is nonzero
• otherwise, use bigram model if its probability is nonzero
• otherwise, use unigram

• interpolation:
• mixture of unigram, bigram, and trigram models

• interpolation tends to work better

[SLP3: Chapter 3]
Linear Interpolation
• estimate unigram/bigram/trigram models using MLE, then combine them:
  P_hat(wi | wi-2, wi-1) = λ1 P(wi | wi-2, wi-1) + λ2 P(wi | wi-1) + λ3 P(wi),   with λ1 + λ2 + λ3 = 1

• the lambdas can be estimated using development data

• they can also be a function of the context
  • intuitively, we may want λ1 to be larger when count(wi-2, wi-1) is large
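A toy sketch with made-up MLE numbers for the context "the cat ___", just to show how interpolation rescues an unseen trigram:

```python
# hypothetical MLE estimates (not from any real corpus)
p_tri = {"sat": 0.0, "is": 0.5, "ran": 0.5}      # P(w | the, cat); "sat" unseen as a trigram
p_bi  = {"sat": 0.2, "is": 0.4, "ran": 0.4}      # P(w | cat)
p_uni = {"sat": 0.05, "is": 0.30, "ran": 0.10}   # P(w)

lambdas = (0.6, 0.3, 0.1)   # (trigram, bigram, unigram) weights summing to 1, tuned on dev data

def p_interpolated(w):
    l3, l2, l1 = lambdas
    return l3 * p_tri[w] + l2 * p_bi[w] + l1 * p_uni[w]

print(p_interpolated("sat"))  # 0.6*0.0 + 0.3*0.2 + 0.1*0.05 = 0.065, nonzero despite the unseen trigram
```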


Kneser-Ney Smoothing
• widely used and effective

• a few components:
  • absolute discounting
  • interpolation with continuation probabilities

• best variant seems to be "modified Kneser-Ney" -- see Chen and Goodman (1998)

[SLP3: Chapter 3]
Absolute Discounting

observed bigrams have counts that are overestimated


unobserved bigrams have counts that are underestimated
Absolute Discounting
• subtract a discount d from each numerator count
• use the original counts for the denominator

  P_AbsDiscount(wi | wi-1) = max(count(wi-1, wi) - d, 0) / count(wi-1) + λ(wi-1) P(wi)

• so there's some "missing probability mass"
• the lambda function is defined to make things normalize correctly
[SLP3: Chapter 3]
Continuation Probabilities
• “I can’t see without my reading _______”
• suppose we are interpolating bigram and unigram distributions here
• “Kong” is more common than “glasses”
• but “Kong” almost always follows “Hong”
• “glasses” is more likely to follow a variety of previous words!

• the unigram probability is most useful when we haven't seen the bigram

• instead of the unigram P(x) ("How likely is x?"), use P_CONTINUATION(x) ("How likely is x to appear as a novel continuation?")
Continuation Probabilities
• how likely is x to be a novel continuation?
  count the number of word types that appeared before x:  |{w : count(w, x) > 0}|

• normalize by the total number of bigram types:
  P_CONTINUATION(x) = |{w : count(w, x) > 0}| / |{(w', w) : count(w', w) > 0}|
Kneser-Ney Smoothing
• Interpolated Kneser-Ney:
  P_KN(wi | wi-1) = max(count(wi-1, wi) - d, 0) / count(wi-1) + λ(wi-1) P_CONTINUATION(wi)

• again, the lambda function is defined to make things normalize correctly

• this is the bigram version; recursive versions exist for higher orders

[SLP3: Chapter 3]
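A rough sketch of the bigram version on the toy corpus, combining absolute discounting with continuation probabilities (d = 0.75 is an assumed discount; the function names are made up):

```python
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

d = 0.75                                   # assumed absolute discount
context_counts, bigram_counts = Counter(), Counter()
preceding = defaultdict(set)               # preceding[w] = distinct words seen before w
for line in corpus:
    tokens = line.split()
    for prev, w in zip(tokens, tokens[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, w)] += 1
        preceding[w].add(prev)

num_bigram_types = len(bigram_counts)

def p_continuation(w):
    """Fraction of all bigram types that w completes."""
    return len(preceding[w]) / num_bigram_types

def p_kn(w, prev):
    """Interpolated Kneser-Ney, bigram version: discounted bigram + lambda(prev) * continuation."""
    discounted = max(bigram_counts[(prev, w)] - d, 0) / context_counts[prev]
    n_types_after_prev = sum(1 for (p, _) in bigram_counts if p == prev)
    lam = d * n_types_after_prev / context_counts[prev]   # mass freed by discounting
    return discounted + lam * p_continuation(w)

print(p_kn("am", "I"), p_kn("like", "I"))   # seen vs. unseen bigram, both nonzero
```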
Huge Web-scale n-grams
• Google n-gram release, August 2006

https://blog.research.google/2006/08/all-our-n-gram-are-belong-to-you.html
Huge Web-scale n-grams
• How to deal with huge n-gram collections like the Google N-gram corpus?
• Pruning
• Only store N-grams with count > threshold.
• Remove singletons of higher-order n-grams
• Entropy-based pruning
• Efficiency
• Efficient data structures like tries
• Bloom filters: approximate language models
• Store words as indexes, not strings
• Use Huffman coding to fit large numbers of words into two bytes
• Quantize probabilities (4-8 bits instead of 8-byte float)
Smoothing for Web-scale Models
• "Stupid backoff" (Brants et al., 2007): no discounting; use the n-gram's relative frequency if its count is > 0, otherwise back off to the (n-1)-gram score multiplied by a fixed weight (0.4); the resulting scores are not normalized probabilities
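A sketch of the scoring rule with assumed toy counts; S is a score rather than a normalized probability, and 0.4 is the fixed backoff weight from the paper.

```python
from collections import Counter

ALPHA = 0.4   # fixed backoff weight

def score(ngram, counts, total_unigrams):
    """Stupid backoff: relative frequency if the n-gram was seen,
    otherwise 0.4 times the score of the shortened (n-1)-gram."""
    if len(ngram) == 1:
        return counts[ngram] / total_unigrams
    if counts[ngram] > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return ALPHA * score(ngram[1:], counts, total_unigrams)

# made-up counts for illustration
counts = Counter({("the", "cat"): 5, ("cat", "sat"): 2,
                  ("the",): 20, ("cat",): 7, ("sat",): 3})
total_unigrams = 1000

print(score(("the", "cat", "sat"), counts, total_unigrams))
# trigram unseen -> 0.4 * count(cat sat) / count(cat) = 0.4 * 2/7
```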
Closed Vocabulary:
• smoothing avoids zeros for unknown n-grams (n > 1), not unknown words!
• if there are unknown words in the test data, smoothing does not help
  • probability of test data is still zero
• we must know the full vocabulary ahead of time (for both training and held-out data!)

Open Vocabulary:
• create an unknown word symbol "<UNK>"
• at training time:
  – replace some rare words with <UNK>
  – then estimate probabilities as though <UNK> is a normal word
• at test time:
  – replace unknown words with <UNK>
• when comparing open-vocabulary language models, make sure the
vocabularies match!

• world’s best language model (every word is <UNK>):


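A sketch of the open-vocabulary recipe (the min_count threshold and function names are choices made for this example, not a fixed standard):

```python
from collections import Counter

def apply_unk_training(sentences, min_count=2):
    """Replace training words seen fewer than min_count times with <UNK>;
    return the resulting vocabulary and the converted sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {w for w, c in counts.items() if c >= min_count} | {"<UNK>"}
    converted = [" ".join(w if w in vocab else "<UNK>" for w in s.split())
                 for s in sentences]
    return vocab, converted

def apply_unk_test(sentence, vocab):
    """At test time, map out-of-vocabulary words to <UNK>."""
    return " ".join(w if w in vocab else "<UNK>" for w in sentence.split())

vocab, train = apply_unk_training(["I am Sam", "Sam I am",
                                   "I do not like green eggs and ham"])
print(train)                                     # rare words become <UNK>
print(apply_unk_test("I like tezgüino", vocab))  # unseen word -> <UNK>
```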
Language Modeling Toolkits
• SRILM
http://www.speech.sri.com/projects/srilm/

• KenLM
https://kheafield.com/code/kenlm/
Next Token Prediction Solves AI?
• "Next Token Prediction SOLVES AI", Says OpenAI Founder ... ?
Language Modeling
• Building language models
• Generating from a language model
• Evaluating a language model

• Count-based language models
  • MLE estimation
  • Smoothing
• Neural language models
  • Feed-forward models
  • RNN models
  • Attention models
Summary
• language modeling:
• compute probabilities of token sequences
• length of sequence must be modeled probabilistically (usually with a stop
symbol at the end)
• typically, use the chain rule to factor the joint into a product of conditionals, one for each token in order from left to right:
  P(w1, ..., wn, </s>) = P(w1 | <s>) P(w2 | <s>, w1) ... P(</s> | <s>, w1, ..., wn)
Summary
• n-gram language models:
• let each conditional probability depend on only the most recent n-1 tokens, e.g., trigram:
  P(wi | w1, ..., wi-1) ≈ P(wi | wi-2, wi-1)

• we can use maximum likelihood estimation to estimate n-gram probabilities from data, e.g., for a bigram model:
  P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
Summary
• evaluation of language models
• extrinsic: use model in a system for a downstream task
• intrinsic: compute probability of held-out data (standard metric: perplexity)

• perplexity:
  • compute l = the average log2-probability of held-out tokens; perplexity is 2^(-l)
  • lower perplexity → better language model
  • can be interpreted as the effective number of choices per position on average
Summary
Smoothing
• add-1 estimation: add 1 (or some small number) to all counts, then normalize

• backoff: if high order n-gram has been seen, use its probability, otherwise “back
off” to lower order n-grams

• interpolation: weighted mixture of n-gram models of various sizes:

• weights can depend on context


Summary
Smoothing
• absolute discounting:
• observed n-grams have counts that are overestimated
• unobserved n-grams have counts that are underestimated
• subtract a constant from counts, normalize using interpolation with a lower order n-
gram model
• continuation probabilities:
• captures how likely it is for a word to form a novel continuation of the preceding words
• likely more helpful than simple unigram probabilities when interpolating with a bigram model
• Kneser-Ney smoothing:
• combines absolute discounting and continuation probabilities via interpolation
• stupid backoff:
• simple, scales well to very large corpora
Summary
• closed vs. open vocabulary language modeling
• when comparing language models, be mindful of vocabularies!
