Chapter 2
Let’s begin with one of the oldest yet still effective techniques for converting text into usable data for
machine learning: bag of words.
Machines don’t process text like humans. To train a machine learning classifier, we first convert
documents into numbers. Each document becomes a feature vector, where each feature is a scalar
value.
A common and effective approach to convert a collection of documents into feature vectors is the
bag of words (BoW). Here’s how it works for a collection of 10 documents as an example:
ID Text
1 Movies are fun for everyone.
2 Watching movies is great fun.
3 Enjoy a great movie today.
4 Research is interesting and important.
5 Learning math is very important.
6 Science discovery is interesting.
7 Rock is great to listen to.
8 Listen to music for fun.
9 Music is fun for everyone.
10 Listen to folk music!
A collection of text documents used in machine learning is called a corpus. The bag of words
method applied to a corpus involves two key steps:
1. Create a vocabulary: List all unique words in the corpus to create the vocabulary.
2. Vectorize documents: Convert each document into a feature vector, where each
dimension represents a word from the vocabulary. The value indicates the word’s
presence, absence, or frequency in the document.
For the 10-document corpus, the vocabulary is built by listing all unique words in alphabetical
order. This involves removing punctuation, converting words to lowercase, and eliminating
duplicates. After processing, we get the following 26-word vocabulary:

a, and, are, discovery, enjoy, everyone, folk, for, fun, great, important, interesting, is, learning,
listen, math, movie, movies, music, research, rock, science, to, today, very, watching
Splitting a document into small indivisible parts is called tokenization, and each part is a token.
There are different ways to tokenize. We tokenized our 10-document corpus by words. Sometimes,
it’s useful to break words into smaller units, called subwords, to keep the vocabulary size
manageable. For instance, instead of including “interesting” in the vocabulary, we might split it into
“interest” and “-ing.” A common method for subword tokenization, which we’ll cover later in this
chapter, is byte-pair encoding. The choice of tokenization method depends on the language, dataset,
and model.
A count of all English word surface forms—like do, does, doing, and did—reveals
several million possibilities. Languages with more complex morphology have even
greater numbers. A Finnish noun alone can take 2,000–3,000 different forms to express
various case and number combinations. Using subwords offers a practical solution, as
storing every surface form in the vocabulary would consume excessive memory and
computational resources.
Words are a type of token, so “token” and “word” are often used interchangeably as the smallest
indivisible units of a document. When a distinction is important, context will make it clear. While
the bag-of-words approach can handle both words and subwords, it was originally designed for
words—hence the name.
Feature vectors can be organized into a document-term matrix. Here, rows represent documents,
and columns represent tokens. Below is a partial document-term matrix for a 10-document corpus.
It includes only a subset of tokens to fit within the page width:
In the document-term matrix above, a value of 1 means the token appears in the document, while
0 means it does not. For instance, the feature vector 𝐱2 for document 2 (“Watching movies is great
fun.”) is:
𝐱2 = [0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1]⊤ .
In natural languages, word frequencies follow Zipf’s Law. This law states that a word’s
frequency is inversely proportional to its rank in a frequency table—for instance, the
second most frequent word appears half as often as the most frequent one. As a result,
most words in a document-term matrix are rare, creating a sparse matrix with
numerous zeros.
A neural network can be trained to predict a document’s topic using these feature vectors. Let’s do
that. The first step is to assign labels to the documents, a process known as labeling. Labeling can
be done manually or assisted by an algorithm. When algorithms are used, human validation is often
needed to confirm accuracy. Here, we will manually label the documents by reading each one and
choosing the most suitable topic from the three options.
Advanced chat LMs enable highly accurate automated document labeling through a
panel of expert models. Using three LLMs, when two or more assign the same label to
a document, that label is adopted. If all three disagree, either a human can decide or a
fourth model can break the tie. In many business contexts, manual labeling is becoming
obsolete, as LLMs offer faster and more reliable labeling.
We have three classes: 1 for cinema, 2 for music, and 3 for science.4 While binary classifiers typically
use the sigmoid activation function with the binary cross-entropy loss, as discussed in
Section 1.7, tasks involving three or more classes generally employ the softmax activation function
paired with the cross-entropy loss.
The softmax function is defined as:

    softmax(z, k) = e^{z^{(k)}} / Σ_{j=1}^{D} e^{z^{(j)}}

Here, z is a D-dimensional vector of logits, k is the index for which the softmax is computed, and e
is Euler’s number. Logits are the raw outputs of a neural network, prior to applying an activation
function, as shown below:
The figure shows the output layer of a neural network, labeled o. The logits z_o^{(k)}, for k ∈ {1, 2, 3},
are the values in light green. These represent the outputs of the units before the activation function
is applied. The vector z_o is expressed as z_o = [z_o^{(1)}, z_o^{(2)}, z_o^{(3)}]^⊤.
For instance, the softmax for unit 2 of layer o in the figure is calculated as:

    softmax(z_o, 2) = e^{z_o^{(2)}} / ( e^{z_o^{(1)}} + e^{z_o^{(2)}} + e^{z_o^{(3)}} )

⁴Class labels in classification are arbitrary and unordered. You can assign numbers to the classes in any way, and
the model’s performance won’t change as long as the mapping is consistent for all examples.
The softmax function transforms a vector into a discrete probability distribution (DPD),
ensuring that Σ_{k=1}^{D} softmax(z, k) = 1. A DPD assigns probabilities to values in a finite set, with their
sum equaling 1. A finite set contains a countable number of distinct elements. For instance, in a
classification task with classes 1, 2, and 3, these classes constitute a finite set. The softmax function
maps each class to a probability, with these probabilities summing to 1.
Let’s compute the probabilities step by step. Assume we have three logits, 𝐳 = [2.0,1.0,0.5]⊤ ,
representing a document’s classification into cinema, music, or science.
First, calculate e^{z^{(k)}} for each logit:

    e^{z^{(1)}} = e^{2.0} ≈ 7.39,
    e^{z^{(2)}} = e^{1.0} ≈ 2.72,
    e^{z^{(3)}} = e^{0.5} ≈ 1.65

Next, sum these values: Σ_{j=1}^{3} e^{z^{(j)}} = 7.39 + 2.72 + 1.65 ≈ 11.76.

Now use the softmax formula, softmax(z, k) = e^{z^{(k)}} / Σ_{j=1}^{3} e^{z^{(j)}}, to compute the probabilities:

    softmax(z, 1) = 7.39 / 11.76 ≈ 0.63 (cinema),
    softmax(z, 2) = 2.72 / 11.76 ≈ 0.23 (music),
    softmax(z, 3) = 1.65 / 11.76 ≈ 0.14 (science)
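As a quick check (not part of the book’s code), PyTorch’s built-in softmax gives the same numbers:

import torch

z = torch.tensor([2.0, 1.0, 0.5])
print(torch.softmax(z, dim=0))  # approximately 0.63, 0.23, 0.14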
The softmax outputs are better characterized as “probability scores” rather than true
probabilities. While these values sum to one and resemble class likelihoods, they don’t
represent statistical probabilities of classes. Neural networks tend to produce outputs
that diverge from true class probabilities. By contrast, models like logistic regression
and Naïve Bayes generate genuine probabilities. For simplicity, though, I will refer to
these outputs as “probabilities” throughout this book.
The cross-entropy loss measures how well predicted probabilities match the true distribution.
The true distribution is typically a one-hot vector with a single element equal to 1 (the correct
class) and zeros elsewhere. For example, a one-hot encoding with three classes looks like:

    class 1: y = [1, 0, 0]^⊤,   class 2: y = [0, 1, 0]^⊤,   class 3: y = [0, 0, 1]^⊤

The cross-entropy loss for a single example is then:

    loss = − Σ_{k=1}^{C} y^{(k)} log( ỹ^{(k)} ),
where 𝐶 is the number of classes, 𝐲 is the one-hot encoded true label, and 𝐲̃ is the predicted
probabilities. Here, 𝑦 (𝑘) and 𝑦̃ (𝑘) represent the 𝑘 th elements of 𝐲 and 𝐲̃, respectively.
Since 𝐲 is one-hot encoded, only the term corresponding to the correct class contributes to the
summation. The summation thus simplifies by retaining only that single term. Let’s simplify it.
Suppose the correct class is 𝑐, so 𝑦 (𝑐) = 1 and 𝑦 (𝑘) = 0 for all 𝑘 ≠ 𝑐. In the summation, only the term
where 𝑘 = 𝑐 will be non-zero. The equation simplifies to:

    loss = − log( ỹ^{(c)} )
This simplified form indicates that the loss corresponds to the negative logarithm of the probability
assigned to the correct class. For 𝑁 examples, the average loss is:
    loss = − (1/N) Σ_{i=1}^{N} log( ŷ_i^{(c_i)} ),

where c_i is the index of the correct class for the i-th example, and ŷ_i is the vector of predicted
probabilities for that example.
When used with softmax in the output layer, cross-entropy loss guides the network to assign high
probabilities to correct classes while reducing probabilities for incorrect ones.
For a document classification example with three classes (cinema, music, and science), the network
generates three logits. These logits are passed through the softmax function to convert them into
probabilities for each class. The cross-entropy loss is then calculated between these scores and the
true one-hot encoded labels.
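As an aside (not from the book’s notebook), this relationship can be verified directly in PyTorch; note that nn.CrossEntropyLoss takes raw logits because it applies the softmax internally:

import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.5]])  # one example, three classes
target = torch.tensor([0])                # the correct class is cinema (index 0)

loss = nn.CrossEntropyLoss()(logits, target)
manual = -torch.log(torch.softmax(logits, dim=1)[0, 0])
print(loss.item(), manual.item())         # both are about 0.46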
Let’s illustrate this by training a simple 2-layer neural network to classify documents into three
classes. We begin by importing dependencies, setting a random seed, and defining the dataset:
import re

import torch
import torch.nn as nn

torch.manual_seed(42)  ➊

docs = [
    "Movies are fun for everyone.",
    "Watching movies is great fun.",
    ...
    "Listen to folk music!"
]
labels = [1, 1, 1, 3, 3, 3, 2, 2, 2, 2]
num_classes = len(set(labels))
Setting the random seed in line ➊ ensures consistent random number generation across PyTorch
runs. This guarantees reproducibility, allowing you to attribute performance changes to code or
hyperparameter modifications rather than random variations. Reproducibility is also essential for
teamwork, enabling collaborators to examine issues under identical conditions.
Next, we convert documents into a bag of words using two methods: tokenize, which splits input
text into lowercase words, and get_vocabulary, which constructs the vocabulary:
def tokenize(text):
    return re.findall(r"\w+", text.lower())  ➊

def get_vocabulary(texts):
    tokens = {token for text in texts for token in tokenize(text)}  ➋
    return {word: idx for idx, word in enumerate(sorted(tokens))}  ➌

vocabulary = get_vocabulary(docs)
vocabulary = get_vocabulary(docs)
In line ➊, the regular expression \w+ extracts individual words from the text. A regular expression
is a sequence of characters used to define a search pattern. The pattern \w+ matches sequences of
“word characters,” such as letters, digits, and underscores.
The findall function from Python’s re module applies the regular expression and returns a list of
all matches in the input string. In this case, it extracts all words.
In line ➋, the corpus is converted into a set of tokens by iterating through each document and
extracting words using the same regular expression. In line ➌, these tokens are sorted
alphabetically and mapped to unique indices, forming a vocabulary.
Once the vocabulary is built, the next step is to define the feature extraction function that converts
a document into a feature vector:
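The listing for this function is not reproduced here; a minimal sketch consistent with how it is used below (a binary indicator per vocabulary token, reusing the tokenize function defined above) might look like this:

def doc_to_bow(doc, vocabulary):
    # 1 if the vocabulary token occurs in the document, 0 otherwise
    tokens = set(tokenize(doc))
    bow = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            bow[vocabulary[token]] = 1
    return bow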
The doc_to_bow function takes a document string and a vocabulary and returns the bag-of-words
representation of the document.
vectors = torch.tensor(
    [doc_to_bow(doc, vocabulary) for doc in docs],
    dtype=torch.float32
)
labels = torch.tensor(labels, dtype=torch.long) - 1  ➊
The vectors tensor with shape (10, 26) represents 10 documents as rows and 26 vocabulary
tokens as columns, while the labels tensor of shape (10,) contains the class label for each
document. The labels use integer indices rather than one-hot encoding since PyTorch’s cross-
entropy loss function (nn.CrossEntropyLoss) expects this format.
Line ➊ uses torch.long to cast labels to 64-bit integers. The -1 adjustment converts our original
classes 1, 2, 3 to indices 0, 1, 2, which aligns with PyTorch’s expectation that class indices begin at
0 for models and loss functions like CrossEntropyLoss.
PyTorch provides two APIs for model definition: the sequential API and the module API. While
we used the straightforward nn.Sequential API to define our model in Section 1.8, we’ll now
explore building a multilayer perceptron using the more versatile nn.Module API:
input_dim = len(vocabulary)
hidden_dim = 50
output_dim = num_classes
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)     ➊
        x = self.relu(x)    ➋
        return self.fc2(x)  ➌
The SimpleClassifier class implements a feedforward neural network with two layers. Its
constructor defines the network components:
1. A fully connected layer, self.fc1, maps the input of size input_dim (equal to the
vocabulary size) to 50 (hidden_dim) outputs.
2. A ReLU activation function to introduce non-linearity.
3. A second fully connected layer, self.fc2, reduces the 50 intermediate outputs to
output_dim, the number of unique labels.
The forward method describes the forward pass, where inputs flow through the layers:
• In line ➊, the input x of shape (10, 26) is passed to the first fully connected layer,
transforming it to shape (10, 50).
• In line ➋, output from this layer is fed through the ReLU activation function, keeping the
shape (10, 50).
• In line ➌, the result is sent to the second fully connected layer, reducing it from shape
(10, 50) to (10, 3), producing the model’s final output with logits.
The forward method is called automatically when you pass input data to the model instance, like
this: model(input).
With our model defined, the next steps, as outlined in Section 1.8, are to define the loss function,
choose the gradient descent algorithm, and set up the training loop:
model = SimpleClassifier(input_dim, hidden_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
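The loop itself is not reproduced here; since the text below notes it is identical to the one in Section 1.8, a standard PyTorch loop along these lines (the number of epochs is an assumption) would do:

for epoch in range(1000):              # epoch count chosen arbitrarily for this sketch
    optimizer.zero_grad()              # clear gradients from the previous step
    outputs = model(vectors)           # forward pass over all 10 training documents
    loss = criterion(outputs, labels)  # cross-entropy between logits and class indices
    loss.backward()                    # backpropagation
    optimizer.step()                   # parameter update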
As you can see, the training loop is identical to the one in Section 1.8. Once the training is complete,
we can test the model on a new document:
new_docs = [
    "Listening to rock music is fun.",
    "I love science very much."
]
class_names = ["Cinema", "Music", "Science"]
new_doc_vectors = torch.tensor(
    [doc_to_bow(new_doc, vocabulary) for new_doc in new_docs],
    dtype=torch.float32
)
with torch.no_grad():  ➊
    outputs = model(new_doc_vectors)  ➋
    predicted_ids = torch.argmax(outputs, dim=1) + 1  ➌
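The code that prints the predictions is not shown here; one minimal way to display them might be:

for doc, predicted_id in zip(new_docs, predicted_ids):
    print(f"{doc} -> {class_names[predicted_id.item() - 1]}")  # predicted_id is 1-based after the +1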
Output:
The torch.no_grad() statement in line ➊ disables the default gradient tracking. While gradients
are essential during training to update model parameters, they’re unnecessary during testing or
inference. Since these phases don’t involve parameter updates, disabling gradient tracking
conserves memory and speeds up computation. Note that the terms “inference” and “evaluation”
are often used interchangeably when referring to generating predictions on unseen data.
In line ➋, the model processes all inputs simultaneously during inference, just as it does during
training. This parallel processing approach leverages vectorized operations, substantially reducing
computation time compared to processing inputs one by one.
We only care about the final label, not the logits returned by the model. In line ➌, torch.argmax
identifies the highest logit’s index, corresponding to the predicted class. Adding 1 compensates for
the earlier shift from 1-based to 0-based indexing.
While the bag-of-words approach offers simplicity and practicality, it has notable limitations. Most
significantly, it fails to capture token order or context. Consider how “the cat chased the dog” and
“the dog chased the cat” yield identical representations, despite conveying opposite meanings.
N-grams provide one solution to this challenge. An n-gram consists of 𝑛 consecutive tokens from
text. Consider the sentence “Movies are fun for everyone”—its bigrams (2-grams) include “Movies
are,” “are fun,” “fun for,” and “for everyone.” By preserving sequences of tokens, n-grams retain
contextual information that individual tokens cannot capture.
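As a quick illustration (not from the book’s notebooks), bigrams can be produced by pairing each token with its successor, reusing the tokenize function defined earlier:

tokens = tokenize("Movies are fun for everyone.")
bigrams = [f"{tokens[i]} {tokens[i + 1]}" for i in range(len(tokens) - 1)]
print(bigrams)  # ['movies are', 'are fun', 'fun for', 'for everyone']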
However, using n-grams comes at a cost. The vocabulary expands considerably, increasing the
computational cost of model training. Additionally, the model requires larger datasets to effectively
learn weights for the expanded set of possible n-grams.
Word embeddings address this challenge by representing semantically similar words as similar
vectors.
To see the difference, let’s first look at the one-hot word vectors behind a BoW vector. Below are the
BoW vector of document 3, “Enjoy a great movie today.”, and the one-hot vectors of its words:

BoW    1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
enjoy  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a      1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
great  0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
movie  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
today  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
As you can see, a bag-of-words vector of a document is the sum of the one-hot vectors of its words. Now,
let’s examine the one-hot vectors and the BoW vector for the text “Films are my passion.”:
BoW      0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
films    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
are      0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
my       0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
passion  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
There are two key problems here. First, even when a word exists in the training data and
vocabulary, one-hot encoding reduces it to a single 1 in a vector of zeros, giving the classifier almost
no meaningful information to learn from.
Second, in the above document, most one-hot encoded word vectors add no value since three out
of four become zero vectors—representing words missing from the vocabulary.
A better approach would let the model understand that “films,” though unseen in training, shares
semantic meaning with “movies.” This would allow the feature vector for “films” to be processed
similarly to “movies.” Such an approach requires word representations that capture semantic
relationships between words.
Word embeddings overcome the limitations of the bag-of-words model by representing words as
dense vectors rather than sparse one-hot vectors. These lower-dimensional representations
contain mostly non-zero values, with similar words having embeddings that exhibit high cosine
similarity. The embeddings are learned from vast unlabeled datasets spanning millions to
hundreds of millions of documents.
Word2vec, a widely used algorithm for learning word embeddings, exists in two variants: continuous
bag of words (CBOW) and skip-gram. We’ll examine the skip-gram formulation.
Skip-grams are formed by breaking text into sequences where one word is missing. For instance,
in the sentence “Professor Alan Turing’s * advanced computer science,” the skipped word, marked as
*, could be guessed as “work,” “research,” “contributions,” “studies,” or “theories.” These words
aren’t exact synonyms but fit the context and share related meanings. A model can be trained to
predict the missing word from its surrounding context words, and through this process, it learns
the semantic relationships between words.
The process can also work in reverse: the skipped word can be used to predict its context words.
This is the basis of the skip-gram algorithm.
The skip-gram size specifies how many context words are included. For a size of five, this means
two words before and two after the skipped word. Here are examples of skip-grams of size five
from our sentence, with different words skipped (marked as *):

    “professor alan * research advanced” (skipped word: “turing’s”)
    “alan turing’s * advanced computer” (skipped word: “research”)
    “turing’s research * computer science” (skipped word: “advanced”)
If the corpus vocabulary contains 10,000 words, the skip-gram model with an embedding layer of
300 units is depicted below:
This is a skip-gram model with a skip-gram size of 5 and the embedding layer of 300 units. As you
can see, the model uses a one-hot encoded skipped word to predict a context word, processing the
input through two consecutive fully connected layers. It doesn’t predict all context words at once
but makes separate predictions for each.
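The figure itself is not reproduced here. As a rough PyTorch sketch under these assumptions (a 10,000-word vocabulary, a 300-unit embedding layer, and softmax applied later inside the loss), the architecture corresponds to two fully connected layers:

import torch.nn as nn

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size=10_000, embedding_dim=300):
        super().__init__()
        # first layer: one-hot encoded skipped word -> 300-dimensional embedding
        self.embedding = nn.Linear(vocab_size, embedding_dim, bias=False)
        # second layer: embedding -> logits over the whole vocabulary
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, one_hot_word):
        # softmax is applied by the cross-entropy loss during training
        return self.output(self.embedding(one_hot_word))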
Here’s how it works for the skip-gram “professor alan * research advanced” and the skipped word
“turing’s”. We transform the skip-gram into 4 training pairs: (turing’s, professor), (turing’s, alan),
(turing’s, research), and (turing’s, advanced).
For each pair of skipped and context words, say (turing’s, professor), the model:
1. encodes the skipped word as a one-hot vector over the vocabulary,
2. passes it through the embedding layer to obtain a 300-dimensional vector, and
3. passes that vector through the output layer and softmax to produce a probability for every word in the vocabulary.
For a given context word, the output layer produces a probability vector across the vocabulary.
Each value represents how likely that vocabulary word is to be the context word.
A curious reader might notice: if the input for each training pair remains constant—say,
“turing’s”—why would the output differ? That’s a great observation! The output will indeed be
identical for the same input. However, the loss calculation varies depending on each context word.
When using chat LMs, you may notice that the same question often yields different
answers. While this might suggest the model is non-deterministic, that’s not accurate.
An LLM is fundamentally a neural network, similar to a skip-gram model but with far
more parameters. The apparent randomness comes from the way these models are
used to generate text. During generation, words are sampled based on their predicted
probabilities. Though higher-probability words are more likely to be chosen, lower-
probability ones may still be selected. This sampling process creates the variations we
observe in responses. We will talk about sampling in Chapter 5.
The skip-gram model uses cross-entropy as its loss function, just as in the three-class text classifier
discussed earlier, but handles 10,000 classes—one for each word in the vocabulary. For each skip-
gram in the training set, the model computes losses separately for each context word, such as the
four words surrounding “turing’s,” then averages these losses to receive feedback on all context
word predictions simultaneously.
This training approach enables the model to capture meaningful word relationships, even when
working with the same input across different training pairs.
Here’s an example. For the input word “turing’s,” suppose the model assigns these probabilities to
different vocabulary words: professor (0.1), alan (0.15), research (0.2), advanced (0.05). When
training the model, each input-target word pair contributes to the loss function. For example, when
“turing’s” appears with “professor” in the training data, the loss works to increase the score of 0.1.
Similarly, when paired with “alan,” the loss works to increase 0.15, with “research” to increase 0.2,
and with “advanced” to increase 0.05.
During backpropagation, the model adjusts its weights to make these scores higher for the given
context words. For instance, the updated scores might be: professor: 0.11, alan: 0.17, research: 0.22,
advanced: 0.07, while the scores for other vocabulary words decrease slightly.
Once training is complete, the output layer is discarded. The embedding layer then serves as the
new output layer. When given a one-hot encoded input word, the model produces a 300-
dimensional vector—this is the word embedding.
Word2vec is just one method for learning word embeddings from large, unlabeled text corpora.
Other methods, such as GloVe and FastText, offer alternative approaches, focusing on capturing
global co-occurrence statistics or subword information to create more robust embeddings.
Using word embeddings to represent texts offers clear advantages over bag of words. One
advantage is dimensionality reduction, which compresses the word representation from the size
of the vocabulary (as in one-hot encoding) to a small vector, typically between 100 and 1000
dimensions. This makes it feasible to process very large corpora in machine learning tasks.
Semantic similarity is another advantage of word embeddings. Words with similar meanings are
mapped to vectors that are close to each other in the embedding space. For example, consider word
embeddings trained by Google on a news corpus containing about 100 billion words. 5 In the graph
below, “Moscow” and “Beijing,” or “Russia” and “China,” are represented by points located near one
another. This reflects their semantic relationships:
The skip-gram model captures semantic similarity when words occur in similar contexts, even
without direct co-occurrence. For instance, if the model produces different probabilities for “films”
and “movies,” the loss function drives it to predict similar ones, since context words for these terms
frequently overlap. Through backpropagation, the embedding layer outputs for these words
converge.
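As an optional aside (not part of the book’s code), embeddings like these can be explored with the gensim library; the file name below is an assumption based on the footnote:

from gensim.models import KeyedVectors

# assumes the GoogleNews vectors mentioned in the footnote have been downloaded;
# depending on the gensim version, the .gz file may need to be decompressed first
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)
print(vectors.most_similar("movies", topn=3))  # semantically close words
print(vectors.similarity("films", "movies"))   # high cosine similarity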
5These embeddings can be found online by using the “GoogleNews-vectors-negative300.bin.gz” query. A backup is
available on the book’s wiki at thelmbook.com/data/word-vectors.
For resources on PCA and other dimensionality reduction methods, check the
recommended material listed on the book’s wiki.
Word embeddings capture the meaning of words and their relationships to other words. They are
fundamental to many natural language processing (NLP) tasks. Neural language models, for
example, encode documents as matrices of word embeddings. Each row corresponds to a word’s
embedding vector, and its position reflects the word’s position in the document.
Modern language models, though, often use subwords—tokens smaller than complete words.
Before moving on to language models—the main topic of this book—let’s first examine byte-pair
encoding, a widely used subword tokenization method.
Initially a data compression technique, BPE was adapted for NLP by treating words as sequences
of characters. It merges the most frequent symbol pairs—characters or subwords—into new
subword units. This continues until the vocabulary reaches the target size.
1. Initialization: Use a text corpus. Split each word in the corpus into individual characters.
For example, the word “hello” becomes “h e l l o”. The initial vocabulary consists of all
unique characters in the corpus.
2. Iterative merging:
o Select the most frequent symbol pair: Identify the symbol pair with the
highest count across the entire corpus. For instance, if “l l” occurs most
frequently, it will be selected.
o Merge the selected pair: Replace all occurrences of the most frequent symbol
pair with a new single merged symbol. For example, “l l” would be replaced
with a new merged symbol “ll”. The word “h e l l o” now becomes “h e ll o”.
o Update the vocabulary: Add the new merged symbol to the vocabulary. The
vocabulary now includes the original characters and the new symbol “ll”.
3. Repeat: Continue the iterative merging until the vocabulary reaches the desired size.
The algorithm is simple, but implementing it directly on large corpora is inefficient. Recomputing
symbol pairs or updating the entire corpus after each merge is computationally expensive.
A more efficient approach initializes the vocabulary with all unique words in the corpus and their
counts. Pair counts are calculated using these word counts, and the vocabulary is updated
iteratively by merging the most popular pairs. Let’s write the code:
from collections import defaultdict

def initialize_vocabulary(corpus):
    vocabulary = defaultdict(int)
    charset = set()
    for word in corpus:
        word_with_marker = '_' + word  ➊
        characters = list(word_with_marker)  ➋
        charset.update(characters)  ➌
        tokenized_word = ' '.join(characters)  ➍
        vocabulary[tokenized_word] += 1  ➎
    return vocabulary, charset
The function generates a vocabulary that represents words as sequences of characters and tracks
their counts. Given a corpus (a list of words), it returns two outputs: vocabulary, a dictionary
mapping each word—tokenized with spaces between characters—to its count, and charset, a set
of all unique characters present in the corpus.
• Line ➊ adds a word boundary marker “_” to the start of each word to differentiate
subwords at the beginning from those in the middle. For example, “_re” in “restart” is
distinct from “re” in “agree.” This helps rebuild sentences from tokens generated using
the model. When a token starts with “_”, it marks the beginning of a new word, requiring
a space to be added before it.
• Line ➋ splits each word into individual characters.
• Line ➌ updates charset with any new characters encountered in the word.
• Line ➍ joins characters with spaces to create a tokenized version of the word. For
example, the word “hello” becomes _ h e l l o.
• Line ➎ adds tokenized_word to vocabulary with its count incremented.
After the initialization, BPE iteratively merges the most frequent pairs of tokens (bigrams) in the
vocabulary. By removing spaces between these pairs, it forms progressively longer tokens.
def get_pair_counts(vocabulary):
    pair_counts = defaultdict(int)
    for tokenized_word, count in vocabulary.items():
        tokens = tokenized_word.split()  ➊
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])  ➋
            pair_counts[pair] += count  ➌
    return pair_counts
The function counts how often adjacent token pairs appear in the tokenized vocabulary words. The
input vocabulary maps tokenized words to their counts, and the output is a dictionary of token
pairs and their total counts.
For each tokenized_word in vocabulary, we split it into tokens in line ➊. A nested loop forms
adjacent token pairs in line ➋ and increments their count by the word’s count in line ➌.
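The next step is a function that applies a selected merge to the vocabulary. Its listing is not reproduced here; a sketch consistent with the description that follows (the name merge_pair is an assumption) could be:

import re

def merge_pair(vocabulary, pair):
    new_vocabulary = {}
    bigram = re.escape(' '.join(pair))  ➊
    # match the pair only when it is bounded by whitespace or the string ends
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')  ➋
    for tokenized_word, count in vocabulary.items():
        new_tokenized_word = pattern.sub(''.join(pair), tokenized_word)  ➌
        new_vocabulary[new_tokenized_word] = count
    return new_vocabulary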
The function merges the input token pair in all tokenized words from the vocabulary. It returns a
new vocabulary where every occurrence of the pair is merged into a single token. For example, if
the pair is ('e', 'l') and a tokenized word is "_ h e l l o", merging 'e' and 'l' removes
the space between them, resulting in "_ h el l o".
In line ➊, the re.escape function automatically adds backslashes to special characters in a string
(like ., *, or ?), so they are interpreted as literal characters rather than having their special meaning
in regular expressions.
The regular expression in line ➋ matches only whole token pairs. It ensures the bigram is not part
of a larger word by checking for the absence of non-whitespace characters immediately before and
after the match. For instance "good morning" matches in "this is good morning", but not in
"thisisgood morning", where "good" is part of "thisisgood".
Finally, in line ➌, the function uses pattern.sub() to replace all occurrences of the matched
pattern with the joined pair, creating the new tokenized word.
The function below implements the BPE algorithm, merging the most frequent token pairs
iteratively until no merges remain or the target vocabulary size is reached:
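The listing is not reproduced here; a sketch consistent with the description that follows, reusing the hypothetical merge_pair from above and with markers ➊–➏ placed to match the explanation, could be:

def byte_pair_encoding(corpus, vocab_size):
    vocabulary, charset = initialize_vocabulary(corpus)
    merges = []
    tokens = set(charset)
    while len(tokens) < vocab_size:  ➊
        pair_counts = get_pair_counts(vocabulary)
        if not pair_counts:  ➋
            break
        most_frequent_pair = max(pair_counts, key=pair_counts.get)  ➌
        vocabulary = merge_pair(vocabulary, most_frequent_pair)  ➍
        new_token = ''.join(most_frequent_pair)  ➎
        tokens.add(new_token)  ➏
        merges.append(most_frequent_pair)
    return vocabulary, merges, charset, tokens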
This function processes a corpus to produce the components needed for a tokenizer. It initializes
the vocabulary and character set, creates an empty merges list for storing merge operations, and
sets tokens to the initial character set. Over time, tokens grows to include all unique tokens the
tokenizer will be able to generate.
The loop in line ➊ continues until the number of tokens supported by the tokenizer reaches
vocab_size or no pairs remain to merge. Line ➋ checks if there are no more valid pairs, in which
case the loop exits. Line ➌ finds the most frequent token pair, which is merged throughout the
vocabulary in line ➍ to create a new token in line ➎. This new token is added to the tokens set in
line ➏, and the merge is recorded in merges.
The function returns four outputs: the updated vocabulary, the list of merge operations, the original
character set, and the final set of unique tokens.
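The word-level tokenization function is also not shown; a sketch consistent with the description below (the unk_token default is an assumption) might be:

def tokenize_word(word, merges, vocabulary, charset, unk_token="<UNK>"):
    word = '_' + word
    if word in vocabulary:
        return [word]
    # split into characters, replacing anything outside the known character set
    tokens = [char if char in charset else unk_token for char in word]
    for left, right in merges:  ➍
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:  ➎
                tokens[i:i + 2] = [left + right]
            else:
                i += 1
    return tokens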
This function tokenizes a word using merges, vocabulary, and charset from
byte_pair_encoding. The word is first prefixed. If the prefixed word exists in the vocabulary, it
returns it as the only token. Otherwise, the word is split into characters, with any character not in
charset replaced by unk_token. These characters are then iteratively merged using the order of
rules in merges.
To tokenize a text, we first split it into words based on spaces and then tokenize each word
individually. The thelmbook.com/nb/2.1 notebook contains code for training a BPE tokenizer by
using a news corpus.
The tokenized version of the sentence “Let’s proceed to the language modeling chapter.” using the
tokenizer trained in the notebook, is:
Here, “let’s” and “modeling” were broken into subwords. This indicates their relative rarity in the
training data and the small target vocabulary size of 5,000 tokens.
As you might have observed, the tokenize_word algorithm is inefficient, because it iterates over
all merges in line ➍ and checks every token pair in line ➎. This creates slow nested loops. However,
with vocabularies often exceeding 100,000 tokens in modern language models, most input words
are already in the vocabulary, avoiding subword tokenization altogether. The notebook includes an
optimized version of tokenize_word that uses caching and a precomputed data structure to
eliminate nested loops. Running this optimized version on the same input reduces tokenization
time from 0.0549 seconds to 0.0037 seconds. While the exact numbers depend on your system and
input, the optimized approach will consistently be faster.
For languages without spaces, like Chinese, or for multilingual models, the initial space-based
tokenization is typically skipped. Instead, the text is split into individual characters. From there,
BPE proceeds as usual, merging the most frequent character or token pairs to form subwords.
We’re now ready to explore the core ideas and methods of language modeling. We’ll start with
traditional count-based methods and cover neural network-based techniques in later chapters.
A language model estimates Pr(t | s), the conditional probability distribution over the vocabulary
for the next token t given a sequence s of preceding tokens. A conditional probability quantifies the
likelihood of one event occurring given that another has already occurred. In language models, it
reflects the probability of a specific token being the next one, given the preceding sequence of
tokens. This sequence is often referred to as the input sequence, context, or prompt.
We will select different notations, ranging from concise to detailed, based on the context.
For any token 𝑡 and sequence 𝐬, the conditional probability satisfies Pr(𝑡|𝐬) ≥ 0, meaning
probabilities are always non-negative. Furthermore, the probabilities for all possible next tokens
in the vocabulary 𝒱 must sum to 1: Σ_{t∈𝒱} Pr(t|s) = 1. This ensures the model outputs a valid
probability distribution over the vocabulary.
To illustrate, let’s consider an example with a vocabulary 𝒱 containing 5 words: “are,” “cool,”
“language,” “models,” and “useless.” For the sequence 𝐬 = (language, models, are), a language
model could output the following probabilities for each possible next word in 𝒱:
The illustration demonstrates how the language model assigns probabilities across its vocabulary
for each potential next word, with “cool” receiving the highest probability. These probabilities sum
to 1, forming a valid discrete probability distribution.
This type of model is an autoregressive language model, also known as a causal language
model. Autoregression involves predicting an element in a sequence using only its predecessors.
Such models excel at text generation and include Transformer-based chat LMs and all language
models discussed in this book.
Before neural networks became standard for language modeling, traditional methods relied on
statistical techniques. These count-based models, still used in tools like smartphone autocomplete
systems, estimate the probability of word sequences based on word or n-gram frequency counts
learned from a corpus. To understand these methods better, let’s implement a simple count-based
language model.
A trigram model, for example, estimates the probability of a token given the two preceding tokens
with the maximum likelihood estimate (MLE):

    Pr(t_i | t_{i-2}, t_{i-1}) = C(t_{i-2}, t_{i-1}, t_i) / C(t_{i-2}, t_{i-1}),    (2.4)
where 𝐶(⋅) denotes the count of occurrences of an n-gram in the training data.
For instance, if the trigram “language models rock” appears 50 times in the corpus and “language
models” appears 200 times overall, then:
    Pr(rock | language, models) = 50 / 200 = 0.25
This means that “rock” follows “language models” 25% of the time in our training data.
However, a limited-size corpus poses a problem: some n-grams we may encounter in practice might
not appear in the training data. For instance, if the trigram “language models sing” never appears
in our corpus, its probability would be zero according to the MLE:
    Pr(sing | language, models) = 0 / 200 = 0
This is problematic because it assigns a zero probability to any sequence containing an unseen n-
gram, even if it is a valid phrase. To solve this, several techniques have been developed, one of which
is backoff. The idea is simple: if a higher-order n-gram (e.g., trigram) is not observed, we “back off”
to a lower-order n-gram (e.g., bigram). The probability Pr(t_i | t_{i-2}, t_{i-1}) is given by one of the
following expressions, depending on whether the trigram has been observed:

    Pr(t_i | t_{i-2}, t_{i-1}) = C(t_{i-2}, t_{i-1}, t_i) / C(t_{i-2}, t_{i-1})   if C(t_{i-2}, t_{i-1}, t_i) > 0,
    Pr(t_i | t_{i-2}, t_{i-1}) = Pr(t_i | t_{i-1})                                 otherwise.

Here, C(t_{i-2}, t_{i-1}, t_i) is the count of the trigram (t_{i-2}, t_{i-1}, t_i), while C(t_{i-2}, t_{i-1}) and C(t_{i-1}, t_i) are the
counts of the bigrams (t_{i-2}, t_{i-1}) and (t_{i-1}, t_i), respectively. The lower-order probabilities are
estimated as:

    Pr(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}),    Pr(t_i) = (C(t_i) + 1) / (W + V),

where C(t_i) is the count of the token t_i, W is the total number of tokens in the corpus, and V is the
vocabulary size. If the bigram count is zero as well, the model backs off further to the unigram
probability. Even at the unigram level, however, a token that never appeared in training would
otherwise receive a zero probability.
Laplace smoothing solves this by adding 1 to each token count, ensuring all tokens, including
unseen ones, receive a small, non-zero probability. The denominator is adjusted by adding 𝑉 to
account for the extra counts introduced in the numerator.
Now, let’s implement a language model with backoff in the CountLanguageModel class:
class CountLanguageModel:
    def __init__(self, n):  ➊
        self.n = n
        self.ngram_counts = [{} for _ in range(n)]  ➋
        self.total_unigrams = 0

    def predict_next_token(self, context):
        for n in range(self.n, 1, -1):  ➍
            if len(context) >= n - 1:  ➎
                context_n = tuple(context[-(n - 1):])  ➏
                counts = self.ngram_counts[n - 1].get(context_n)
                if counts:
                    return max(counts.items(), key=lambda x: x[1])[0]
        unigram_counts = self.ngram_counts[0].get(())
        if unigram_counts:
            return max(unigram_counts.items(), key=lambda x: x[1])[0]
        return None
In line ➊, the model is initialized with an n parameter, defining the maximum n-gram order (e.g.,
n=3 for trigrams). The ngram_counts list in line ➋ stores n-gram frequency dictionaries for
unigrams, bigrams, trigrams, etc., populated during training. For n=3, given the corpus “Language
models are powerful. Language models are useful.” in lowercase with punctuation removed,
self.ngram_counts would contain:
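The contents shown in the book are not reproduced here, but running the train function shown below on that corpus would produce counts along these lines:

# self.ngram_counts[0] — unigram counts (empty-tuple context)
{(): {"language": 2, "models": 2, "are": 2, "powerful": 1, "useful": 1}}
# self.ngram_counts[1] — bigram counts
{("language",): {"models": 2}, ("models",): {"are": 2},
 ("are",): {"powerful": 1, "useful": 1}, ("powerful",): {"language": 1}}
# self.ngram_counts[2] — trigram counts
{("language", "models"): {"are": 2}, ("models", "are"): {"powerful": 1, "useful": 1},
 ("are", "powerful"): {"language": 1}, ("powerful", "language"): {"models": 1}}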
The predict_next_token method uses backoff to predict the next token. Starting from the
highest n-gram order in line ➍, it checks if the context contains enough tokens for this n-gram order
in line ➎. If so, it extracts the relevant context in line ➏ and attempts to find a match in
ngram_counts. If no match is found, it backs off to lower-order n-grams or defaults to unigram
counts. For instance, given context=["language", "models", "are"] and n=3, the method takes the
last two tokens, ("models", "are"), and looks them up in ngram_counts[2].
If a matching context is found, the method returns the token with the highest frequency for that
context. For input ["language", "models"] it will return "are", the token with highest count
among values for the key ("language", "models") in ngram_counts[2]. However, for the
input ["english", "language"] it will not find the key ("english", "language") in
ngram_counts[2], so it will back off to ngram_counts[1] and return "models", the token with
highest count among values for the key ("language",).
def train(model, tokens):
    for n in range(1, model.n + 1):  ➊
        counts = model.ngram_counts[n - 1]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])  ➋
            next_token = tokens[i + n - 1]  ➌
            if context not in counts:
                counts[context] = defaultdict(int)
            counts[context][next_token] = counts[context][next_token] + 1
    model.total_unigrams = len(tokens)  # total token count, used for the smoothed unigram probability
The train method takes a model (an instance of CountLanguageModel) and a list of tokens (the
training corpus) as input. It populates the n-gram counts in the model using these tokens.
In line ➊, the method iterates over n-gram orders from 1 to model.n (inclusive). For each n, it
generates n-grams of that order from the token sequence and counts their occurrences.
Lines ➋ and ➌ extract contexts and their subsequent tokens to build a nested dictionary where
each context maps to a dictionary of following tokens and their counts. These counts are stored in
model.ngram_counts, which the predict_next_token method later uses to make predictions
based on context.
set_seed(42)
n = set_hyperparameters()
data_url = "https://www.thelmbook.com/data/brown"
train_corpus, test_corpus = download_and_prepare_data(data_url)

model = CountLanguageModel(n)
train(model, train_corpus)

contexts = [
    "i will build a",
    "the best place to",
    "she was riding a"
]
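The notebook code that queries the model and prints the results is not reproduced here; a minimal version of that loop (the output formatting is an assumption) might be:

for context in contexts:
    tokens = context.split()  # the corpus is lowercased and word-tokenized, so a split suffices
    next_token = model.predict_next_token(tokens)
    print(f"{context} -> {next_token}")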
The full implementation of this model, including the methods to retrieve and process the training
data, can be found in the thelmbook.com/nb/2.2 notebook. Within the
download_and_prepare_data method, the corpus is downloaded, converted to lowercase,
tokenized into words, and divided into training and test partitions with a 90/10 split. Let’s take a
moment to understand why this last step is critical.
In machine learning, using the entire dataset for training leaves no way to evaluate whether the
model generalizes well. A frequent issue is overfitting, where the model excels on training data
but struggles to make accurate predictions on unseen, new data.
Partitioning the dataset into training and test sets is a standard practice to control overfitting. It
involves two steps: (1) shuffling the data and (2) splitting it into two subsets. The larger subset,
called the training data, is used for training the model, while the smaller subset, called the test data,
is used to evaluate the model’s performance on unseen examples.
The test set requires sufficient size to reliably estimate model performance. A test ratio
of 0.1 to 0.3 (10% to 30% of the entire dataset) is common, though this varies with
dataset size. For very large datasets, even a smaller test set can provide reliable
performance estimates.
The training data comes from the Brown Corpus, a collection of over 1 million words from
American English texts published in 1961. This corpus is frequently used in linguistic studies.
When you run the code, you will see the following output:
Ignore the perplexity number for now; we’ll discuss it shortly. Count-based language models can
produce reasonable immediate continuations, making them effective for autocomplete systems.
However, they have notable limitations. These models generally work with word-tokenized
corpora, as their n-gram size is typically small (up to 𝑛 = 5). Extending beyond this would require
too much memory and lead to slower processing. Subword tokenization, while potentially more
efficient, often results in many n-grams that represent only fragments of words, degrading the
quality of next-word predictions.
Word-level tokenization creates a significant drawback: count-based models cannot handle out-of-
vocabulary (OOV) words. This is similar to the issue seen in the bag-of-words approach discussed
in Section 2.1. For example, consider the context: “according to WHO, COVID-19 is a.” If “COVID-19”
wasn’t in the training data, the model would back off repeatedly until it relies only on “is a.” This
severely limits the context and makes meaningful predictions unlikely.
Short contexts make it impossible for count-based models to capture long-range dependencies in
language. While modern Transformer models also have context size limits, these can handle
thousands of tokens. In contrast, training a count-based model with a long context, such as 1000
tokens, would require storing counts for all n-grams from 𝑛 = 1 to 𝑛 = 1000, which is infeasible
due to memory constraints.
Another limitation is that count-based models cannot be adapted for downstream tasks after
training. Their n-gram counts are fixed, and any adjustment requires retraining on new data.
These limitations have led to the development of advanced methods, particularly neural network-
based language models, which solve these problems and have largely replaced count-based models
in modern natural language processing. Approaches like recurrent neural networks and
transformers, which we will discuss in the next two chapters, handle longer contexts effectively,
producing more coherent and context-aware text. Before exploring these methods, let’s look at how
to evaluate a language model’s quality.
2.6.1. Perplexity
Perplexity is a widely used metric for evaluating language models. It measures how well a model
predicts a text. Lower perplexity values indicate a better model—one that is more confident in its
predictions. Perplexity is defined as the exponential of the average negative log-likelihood per
token in the test set:
    Perplexity(𝒟, k) = exp( −(1/D) Σ_{i=1}^{D} log Pr(t_i | t_{max(1, i−k)}, …, t_{i−1}) )    (2.5)
Here, 𝒟 represents the test set, 𝐷 is the total number of tokens in it, 𝑡𝑖 is the 𝑖 th token, and
Pr(𝑡𝑖 |𝑡max(1,𝑖−𝑘 ) , … , 𝑡𝑖−1 ) is the probability the model assigns to 𝑡𝑖 given its preceding context
window of size 𝑘, where max(1, 𝑖 − 𝑘) ensures we start from the first token when there aren’t
enough preceding tokens to fill the context window. The notations exp(𝑥) and 𝑒 𝑥 , where 𝑒 is Euler’s
number, are equivalent.
Perplexity can be interpreted as the number of choices a model considers when predicting the next
token. For example, a perplexity of 10 means the model behaves as if choosing from 10 equally
likely options on average.
Let’s calculate perplexity using the example text with word-level tokenization: “We are evaluating
a language model for English.” To keep things simple, we assume a context of three words. We begin
by determining the probability of each word based on its preceding context of three words as
provided by the model. Here are the probabilities:
Pr(We) = 0.10
Pr(are ∣ We) = 0.20
Pr(evaluating ∣ We, are) = 0.05
Pr(a ∣ We, are, evaluating) = 0.50
Pr(language ∣ are, evaluating, a) = 0.30
Pr(model ∣ evaluating, a, language) = 0.40
Pr(for ∣ a, language, model) = 0.15
Pr(English ∣ language, model, for) = 0.25
Using the probabilities, we compute the negative log-likelihood for each word:

    −log 0.10 ≈ 2.30, −log 0.20 ≈ 1.61, −log 0.05 ≈ 3.00, −log 0.50 ≈ 0.69,
    −log 0.30 ≈ 1.20, −log 0.40 ≈ 0.92, −log 0.15 ≈ 1.90, −log 0.25 ≈ 1.39

Next, we sum these values (≈ 13.00) and divide the sum by the number of words (8) to get the
average negative log-likelihood, ≈ 1.63. The perplexity is the exponential of this average:

    e^{1.63} ≈ 5.10
So, the model’s perplexity on this text, using a 3-word context, is about 5.10. This means that, on
average, the model acts as if it selects from roughly 5 equally likely options for each word in the
given context. A perplexity of 5.10 is relatively low, showing the model is confident in its predictions
for this sequence. But for a proper evaluation, perplexity must be measured on a much larger test
set.
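As a quick check (not part of the book’s code), the same arithmetic can be reproduced in a few lines of Python:

import math

probs = [0.10, 0.20, 0.05, 0.50, 0.30, 0.40, 0.15, 0.25]
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
print(round(avg_nll, 2), round(math.exp(avg_nll), 2))
# prints 1.63 and 5.08; the 5.10 above comes from exponentiating the rounded 1.63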
Now, let’s calculate the perplexity of the count-based model from the previous section. To do this,
the model must be updated to return the probability of a token given a specific context. Add this
function to the CountLanguageModel we implemented earlier:
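The book’s listing is not reproduced here; a sketch of get_probability consistent with the description that follows (the ➊–➏ markers are placed to match it) might be:

def get_probability(self, token, context):
    for n in range(self.n, 1, -1):  ➊
        if len(context) >= n - 1:
            context_n = tuple(context[-(n - 1):])  ➋
            counts = self.ngram_counts[n - 1].get(context_n)
            if counts:
                total = sum(counts.values())  ➌
                # note: a token never seen after this context gets probability 0 here;
                # the notebook's version may handle that case differently
                return counts.get(token, 0) / total  ➍
    unigram_counts = self.ngram_counts[0].get((), {})  ➎
    count = unigram_counts.get(token, 0)
    vocab_size = len(unigram_counts)
    return (count + 1) / (self.total_unigrams + vocab_size)  ➏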
Unlike predict_next_token, which finds the most probable token, get_probability calculates
a token’s probability. In line ➌, total is the sum of counts for tokens following the context, acting
as the denominator. Line ➍ divides the token count by total to compute its probability. If no
higher-order match exists, line ➏ uses add-one smoothing with unigram counts.
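The function that computes perplexity over a token sequence is also not reproduced; a sketch matching the line-by-line description below (the name compute_perplexity is an assumption) could be:

import math

def compute_perplexity(model, tokens, context_size):
    log_likelihood = 0.0
    for i in range(len(tokens)):  ➊
        context = tokens[max(0, i - context_size):i]  ➋
        probability = model.get_probability(tokens[i], context)
        log_likelihood += math.log(probability)  ➌
    avg_log_likelihood = log_likelihood / len(tokens)  ➍
    return math.exp(-avg_log_likelihood)  ➎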
In line ➊, the function iterates through each token in the sequence. For every token:
• Line ➋ extracts its context, using up to context_size tokens before it. The expression
max(0, i - context_size) ensures indices stay within bounds.
• In line ➌, the log of the token’s probability is added to the cumulative log-likelihood. The
get_probability method from the model handles the probability calculation.
Once all tokens are processed, line ➍ computes the average log-likelihood by dividing the total log-
likelihood by the number of tokens.
Finally, in line ➎, the perplexity is computed as the exponential of the negative average log-
likelihood, as described in Equation 2.5.
By applying this method to the test_corpus sequence from the thelmbook.com/nb/2.2 notebook,
we observe the following output:
This perplexity is very high. For example, GPT-2 has a perplexity of about 20, while modern LLMs
achieve values below 10. Later, we’ll compute perplexities for RNN and Transformer-based models
and compare them with the perplexity of this count-based model.
2.6.2. ROUGE
Perplexity is a standard metric used to evaluate language models trained on large, unlabeled
datasets by measuring how well they predict the next token in a given context. These models are
referred to as pretrained models or base models. As we will discuss in the chapter on large
language models, their ability to perform specific tasks or answer questions comes from
supervised finetuning. This additional training uses a smaller, labeled dataset where input
contexts are matched with target outputs, such as answers or task-specific results. This enables
capabilities like translation or summarization.
Perplexity is not an ideal metric for evaluating a finetuned model. Instead, metrics are needed that
compare the model’s output to reference texts, often called ground truth. A common choice is
ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE is widely used for tasks like
summarization and machine translation. It evaluates text quality by measuring overlaps, such as
tokens or n-grams, between the generated text and the reference.
ROUGE has several variants, each focusing on different aspects of text similarity. Here, we’ll discuss
three widely used ones: ROUGE-1, ROUGE-N, and ROUGE-L.
ROUGE-N evaluates the overlap of n-grams between the generated and reference texts, with N
indicating the length of the n-gram. One of the most commonly used versions is ROUGE-1.
ROUGE-1 measures the overlap of unigrams (single tokens) between the generated and reference
texts. As a recall-focused metric (hence the “R” in ROUGE), it assesses how much of the reference
text is captured in the generated output.
Recall is the ratio of matching tokens to the total number of tokens in the reference text:

    ROUGE-1 = ( Σ_{(g,r)∈𝒟} Σ_{t∈r} count(t, g) ) / ( Σ_{(g,r)∈𝒟} |r| )

Here, 𝒟 is the dataset of (generated text, reference text) pairs, count(t, g) counts how often a token
t from the reference text r appears in the generated text g, and the denominator is the total token
count across all reference texts.
A ROUGE-1 score of 0.67 means roughly two-thirds of the words from the reference text appear in
the generated text. However, this number alone has little value. ROUGE scores are only useful for
comparing how different language models perform on the same test set, as they indicate which
model more effectively captures the content of the reference texts.
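As a small illustration (not from the book’s notebooks), a unigram-recall score in the spirit of ROUGE-1 can be computed for a single text pair as follows, using the common clipped-count convention:

from collections import Counter

def rouge_1_recall(generated, reference):
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # clip each reference token's count by how often it appears in the generated text
    overlap = sum(min(count, gen_counts[token]) for token, count in ref_counts.items())
    return overlap / sum(ref_counts.values())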
ROUGE-N extends the ROUGE metric from unigrams to n-grams while using the same formula.
ROUGE-L relies on the longest common subsequence (𝐋𝐂𝐒). This identifies the longest sequence
of tokens that appear in both the generated and reference texts in the same order, without needing
to be adjacent.
Let L_g be the length of the generated text and L_r the length of the reference text. Then:

    recall_LCS = LCS(g, r) / L_r,    precision_LCS = LCS(g, r) / L_g

Here, LCS(g, r) is the number of tokens in the longest common subsequence of the generated text g
and the reference text r. The recall measures the proportion of the reference text captured by the
LCS, while the precision measures the proportion of the generated text that matches the reference.
Recall and precision are combined into a single metric as follows:

    ROUGE-L = ( (1 + β²) · recall_LCS · precision_LCS ) / ( recall_LCS + β² · precision_LCS )

Here, β controls the trade-off between precision and recall in the ROUGE-L score. Since ROUGE
favors recall, β is usually set high. In Chin-Yew Lin’s 2004 paper⁶, β was set to 8.
6Chin-Yew Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” Proceedings of the Workshop on Text
Summarization Branches Out, 2004.
Let’s revisit the two texts used to illustrate ROUGE-1. For these sentences, there are two valid
longest common subsequences, each with a length of 5 words:
LCS 1: Large, language, models, are, text
LCS 2: Large, language, models, are, processing
Both subsequences are the longest sequences of tokens that appear in the same order in both
sentences, but not necessarily consecutively. When multiple LCS options exist, ROUGE-L can use
any of them since their lengths are identical.
Here’s how the calculations work here. The length of the LCS is 5 words. The reference text is 9
words long, and the generated text is 8 words long. Thus, recall and precision are:
    recall_LCS = 5/9 ≈ 0.56,    precision_LCS = 5/8 ≈ 0.63
With β = 8, ROUGE-L is then given by:

    ROUGE-L = ( (1 + 8²) · 0.56 · 0.63 ) / ( 0.56 + 8² · 0.63 ) ≈ 0.56
ROUGE scores range from 0 to 1, where 1 means a perfect match between generated and reference
texts. However, even excellent summaries or translations rarely approach 1 in practice.
• ROUGE-1 and ROUGE-2 are standard starting points. ROUGE-1 checks overall content
similarity using unigram overlap, while ROUGE-2 evaluates local fluency and phrase
accuracy using bigram matches.
• For tasks like summarization or paraphrasing, where word order can vary, ROUGE-L is
helpful. It detects similarity despite rewording, making it suitable for linguistic variation.
A combination of metrics, such as ROUGE-1, ROUGE-2, and ROUGE-L, often gives a more balanced
evaluation, covering both content overlap and structural flexibility.
Keep in mind, though, that ROUGE has limitations. It measures lexical overlap but not semantic
similarity or factual correctness. To address these gaps, ROUGE is often paired with human
evaluations or other methods for a fuller assessment of text quality.
Automated metrics are useful, but human evaluation is still necessary to assess language models.
Humans can evaluate qualities that automated metrics often miss, like fluency and accuracy. Two
common approaches for human evaluation are Likert scale ratings and Elo ratings.
Likert scale ratings involve assigning scores to outputs using a fixed, typically symmetric scale.
Raters judge the quality by selecting a score, often from −2 to 2, where each scale point corresponds
to a descriptive label. For instance, −2 could mean “Strongly Disagree” or “Poor,” while 2 might
mean “Strongly Agree” or “Excellent.” The scale is symmetric because it includes equal levels of
agreement and disagreement around a neutral midpoint, making positive and negative responses
easier to interpret.
Likert scales are flexible for evaluating different aspects of language model outputs, such as fluency,
coherence, relevance, and accuracy. For example, a rater could separately rate a sentence’s
grammatical correctness and its relevance to a prompt, both on a scale from −2 to 2.
However, the method has limitations. One issue is central tendency bias, where some raters avoid
extreme scores and stick to the middle of the scale. Another challenge is inconsistency in how raters
interpret the scale—some may reserve a 2 for exceptional outputs, while others may assign it to
any high-quality response.
To mitigate these issues, researchers often involve multiple raters, phrase similar questions in
different ways for the same rater, and use detailed rubrics that clearly define each scale point.
Let’s illustrate Likert scale evaluation using a scenario where machine-generated summaries of
news articles are assessed.
Human raters compare a model-generated summary to the original article. They rate it on three
aspects using 5-point Likert scales: coherence, informativeness, and factual accuracy.
For example, consider the news article on the left and the generated summary on the right:
Raters assess the summary based on how effectively it meets these three criteria.
The scale for assessing coherence—that is, how well-organized, readable, and logically connected
the summary is—might look like this:
The scale for assessing informativeness, that is, how well the summary captures the essence and
main points of the original article, might look this way:
The scale for assessing factual accuracy—that is, how precisely the summary represents facts and
data from the original article—might look like this:
In this example, raters would select one option for each of the three aspects. The use of descriptive
anchors at each point on the scale helps standardize understanding among raters.
After collecting ratings from multiple raters across various summaries, researchers analyze the
data through several approaches:
• Calculate average scores for each aspect across all summaries to get an overall
performance measure.
• Compare scores across different versions of the model to track improvements.
• Analyze the correlation between different aspects (e.g., is high coherence associated with
high factual accuracy?).
Although Likert scale ratings were originally intended for humans, the rise of advanced
chat LMs means raters can now be either humans or language models.
Pairwise comparison is a method where two outputs are evaluated side-by-side, and the better
one is chosen based on specific criteria. This simplifies decision-making, especially when outputs
are of similar quality or changes are minor.
The method builds on the principle that relative judgments are easier than absolute ones. Binary
choices often produce more consistent and reliable results than absolute ratings.
In practice, raters compare pairs of outputs, such as translations, summaries, or answers, and
decide which is better based on criteria like coherence, informativeness, or factual accuracy.
For example, in machine translation evaluation, raters compare pairs of translations for the same
source sentence, selecting which one better preserves the original meaning in the target language.
By repeating this process across many pairs, evaluators can compare different models or model
versions.
Pairwise comparison helps rank models or model versions by having each rater evaluate multiple
pairs, with each model compared against others several times. This repetition minimizes individual
biases, resulting in more reliable evaluations. A related approach is ranking, where a rater orders
multiple responses by quality. Ranking requires less effort than pairwise comparisons while still
capturing relative quality.
Results from pairwise comparisons are typically analyzed statistically to determine significant
differences between models. A common method for this analysis is the Elo rating system.
Elo ratings, originally created by Arpad Elo for ranking chess players, can be adapted for language
model evaluation. The system assigns ratings based on “wins” and “losses” in direct comparisons,
quantifying relative model performance.
In language model evaluation, all models typically start with an initial rating, often set to 1500.
When two models are compared, the probability of one model “winning” is calculated using their
current ratings. After each comparison, their ratings are updated based on the actual result versus
the expected result.
The probability of model 𝐴 with rating Elo(𝐴) winning against model 𝐵 with rating Elo(𝐵) is:
    Pr(A wins) = 1 / ( 1 + 10^{(Elo(B) − Elo(A)) / 400} )
The value 400 in the Elo formula acts as a scaling factor, creating a logarithmic
relationship between rating differences and winning probabilities. Arpad Elo chose this
number to ensure that a 400-point rating difference reflects 10:1 odds in favor of the
higher-rated chess player. While originally designed for chess, this scaling factor has
proven effective in other contexts, including language model evaluation.
After each comparison, the rating of model A is updated as:

    Elo_new(A) = Elo(A) + k · (score(A) − Pr(A wins)),

where k (typically between 4 and 32) controls the maximum rating change, and score(A) reflects
the outcome: 1 for a win, 0 for a loss, and 0.5 for a draw.
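A minimal sketch of these two formulas in code (the function names and the k value are our assumptions) might be:

def expected_score(rating_a, rating_b):
    # probability that model A beats model B given their current ratings
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=16):
    # score_a: 1 if A wins, 0 if A loses, 0.5 for a draw
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

For example, update_elo(1500, 1500, 1) returns (1508.0, 1492.0): with k = 16, the winner of a match between equally rated models gains eight points and the loser gives up the same amount.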
Consider an example with three models: LM1 , LM2 , and LM3 . We’ll evaluate them based on their
ability to generate coherent text continuations. Assume their initial ratings are:
Elo(LM1 ) = 1500
Elo(LM2 ) = 1500
Elo(LM3 ) = 1500
Consider this prompt: “The scientists were shocked when they discovered…”
Continuation by LM1 : “…a new species of butterfly in the Amazon rainforest. Its wings were unlike
anything they had ever seen before.”
Continuation by LM2 : “…that the ancient artifact they unearthed was emitting a faint, pulsating light.
They couldn’t explain its source.”
Continuation by LM3 : “…the results of their experiment contradicted everything they thought they
knew about quantum mechanics.”
Let’s say we conduct pairwise comparisons and get the following results:
Elo(LM1 ) = 1499
Elo(LM2 ) = 1470
Elo(LM3 ) = 1531
Elo ratings quantify how models perform relative to each other. In this case, LM3 is the strongest,
followed by LM1 , with LM2 ranking last.
Performance isn’t judged from a single match. Instead, multiple pairwise matches are used. This
limits the effects of random fluctuations or biases in individual comparisons, giving a better
estimate of each model’s performance.
A variety of prompts or inputs ensures evaluation across different contexts and tasks. When human
raters are involved, several raters assess each comparison to reduce individual bias.
To avoid order effects, both the sequence of comparisons and the presentation of outputs are
randomized. Elo ratings are updated after every comparison.
How many matches are needed until the results are reliable? There’s no universal number that
applies to all cases. As a general guideline, some researchers suggest that each model should
participate in at least 100–200 comparisons before considering the Elo ratings stable and ideally
500+ comparisons for high confidence. However, for high-stakes evaluations or when comparing
very similar models, thousands of comparisons may be necessary.
Statistical methods can be used to calculate a confidence interval for a model’s Elo
rating. Explaining these techniques is beyond the scope of this book. For those
interested, the Bradley–Terry model and bootstrap resampling are good starting
points. Both are well-documented, with resources linked on the book’s wiki.
Elo ratings provide a continuous scale for ranking models, making it easier to track incremental
improvements. The system rewards wins against strong opponents more than wins against weaker
ones, and it can handle incomplete comparison data, meaning not every model needs to be
compared against every other model. However, the choice of 𝑘 significantly affects rating volatility;
a poorly chosen 𝑘 can undermine the evaluation’s stability.
To address these limitations, Elo ratings are often used alongside other evaluation methods. For
instance, researchers might use Elo ratings for ranking models in pairwise comparisons, while
collecting Likert scale ratings to assess absolute quality. This combined approach yields a more
comprehensive view of a language model’s performance.
Now that we’ve covered language modeling and evaluation methods, let’s explore a more advanced
model architecture: recurrent neural networks (RNNs). RNNs made significant progress in
processing text. They introduced the ability to maintain context over long sequences, allowing for
the creation of more powerful language models.