Course 2: Tokenization
1
What is tokenization?
Turning text...
I love playing soccer!
...into tokens
['I', 'love', 'play', 'ing', 'soccer', '!']
Course 2: Tokenization 2
Historical Notions
Course 2: Tokenization 3
Tokenization Origins
The word token comes from linguistics
"non-empty contiguous sequence of graphemes or phonemes in a document"
Split text on blanks
Course 2: Tokenization 4
Tokenization Origins
old_tokenize("I love playing soccer!") = ['I', 'love', 'playing', 'soccer!']
Different from word-forms
dámelo → da / me / lo (= give / me / it)
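A minimal sketch of the blank-splitting old_tokenize shown above (whitespace split only; the function name is the one used in the example):

def old_tokenize(text):
    # Historical notion of a token: split on blanks only, punctuation stays attached
    return text.split()

print(old_tokenize("I love playing soccer!"))
# ['I', 'love', 'playing', 'soccer!']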
Course 2: Tokenization 5
Tokenization Origins
Natural language is split into...
Sentences, utterances, documents... (macroscopic)
that are split into...
Tokens, word-forms... (microscopic)
→ Used for linguistic tasks (POS tagging, syntax parsing,...)
Course 2: Tokenization 6
Tokenization & ML
ML-based NLP (mostly) relies on sub-word tokenization:
Gives better performance
Provides the fixed-size vocabulary that models often require
Mitigates the Out-Of-Vocabulary (OOV) issue
Course 2: Tokenization 7
Tokenization & ML
Evolution of modeling complexity w.r.t. the sequence length n:
Model type     Year   Complexity
Tf-Idf         1972   O(1)
RNNs           ~1985  O(n)
Transformers   2017   O(n²)
→ Long sequences (e.g. character-level) are prohibitive
Course 2: Tokenization 8
Modern framework
Pre-tokenization
"I'm lovin' it!" -> ["i", "am", "loving", "it", "!"]
Normalization
Rules around punctuation (spacing around ":", "!", ...)
Spelling correction ( "imo" -> "in my opinion" )
Named entities ( "covid" -> "COVID-19" )
...
Rule-based segmentation
Blanks, punctuation, ...
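A toy pre-tokenizer combining the two steps above (normalization, then rule-based segmentation); the rewrite table and the punctuation rule are illustrative assumptions, not the rules of any particular library:

import re

# Illustrative normalization table (spelling / named-entity rewrites)
REWRITES = {"i'm": "i am", "lovin'": "loving", "imo": "in my opinion", "covid": "COVID-19"}

def pre_tokenize(text):
    # Normalization: lowercase, then apply the rewrite rules word by word
    text = " ".join(REWRITES.get(w, w) for w in text.lower().split())
    # Rule-based segmentation: split on blanks and isolate punctuation
    return re.findall(r"\w+|[^\w\s]", text)

print(pre_tokenize("I'm lovin' it!"))
# ['i', 'am', 'loving', 'it', '!']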
Course 2: Tokenization 9
Modern framework
Tokenization -> ["i", "am", "lov", "##ing", "it", "!"]
Splits units at the subword level
Uses a fixed vocabulary
Trained on text samples
Applied in inference mode as a pre-processing step
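For instance, with the Hugging Face transformers library (assuming it is installed and the bert-base-uncased WordPiece checkpoint can be downloaded), a trained tokenizer is applied like this:

from transformers import AutoTokenizer

# Load a subword tokenizer that was trained once on large text samples
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Applied deterministically as a pre-processing step at inference time
print(tokenizer.tokenize("I love playing soccer!"))
# e.g. ['i', 'love', 'playing', 'soccer', '!'] (exact split depends on the learned vocabulary)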
Course 2: Tokenization 10
Sub-word Tokenization
Course 2: Tokenization 11
Granularity
Course 2: Tokenization 12
Granularity
→ Trade-off between short sequences and reasonable vocabulary size
Fertility
For a string sequence s: fertility(s) = #tokens(s) / #words(s), i.e. the average number of subword tokens produced per word
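A minimal sketch of this measure, with hard-coded example outputs standing in for a real tokenizer:

def fertility(words, tokenized):
    # Average number of subword tokens produced per pre-tokenized word
    return sum(len(toks) for toks in tokenized) / len(words)

words = ["i", "love", "playing", "soccer", "!"]
tokenized = [["i"], ["love"], ["play", "##ing"], ["soccer"], ["!"]]
print(fertility(words, tokenized))  # 6 tokens / 5 words = 1.2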
Course 2: Tokenization 13
Algorithms
Course 2: Tokenization 14
Byte-Pair Encoding (BPE)
Let's encode " aaabdaaabac" in an optimized way:
Observed pairs: { aa , ab , bd , da , ba , ac}
Observed occurences: { aa : 4, ab : 2, bd : 1, da : 1, ba : 1, ac: 1}
Set X = aa
Encode aaabdaaabac → XabdXabac
Start again from XabdXabac
Course 2: Tokenization 15
Byte-Pair Encoding (BPE)
(current rules: aa → X )
Let's encode " XabdXabac" in an optimized way:
Observed pairs: { Xa , ab , bd , dX , ba , ac}
Observed occurences: { Xa : 2, ab: 2, bd : 1, dX : 1, ba : 1, ac: 1}
Set Y = ab
Encode XabdXabac → XYdXYac
Start again from XYdXYac
Course 2: Tokenization 16
Byte-Pair Encoding (BPE)
(current rules: aa → X , ab → Y )
Let's encode " XYdXYac" in an optimized way:
Observed pairs: { XY , Yd , dX , Ya , ac}
Observed occurences: { XY : 2, Yd : 1, dX : 1, Ya : 1, ac: 1}
Set Z = XY
Encode XYdXYac → ZdZac
Start again from ZdZac
Course 2: Tokenization 17
Byte-Pair Encoding (BPE)
(current rules: aa → X , ab → Y , XY → Z )
Let's encode " ZdZac" in an optimized way:
Observed pairs: { Zd , dZ , Za , ac}
Observed occurences: { Zd : 1, dZ : 1, Za : 1, ac: 1}
All pairs are unique => END
Course 2: Tokenization 18
Byte-Pair Encoding (BPE)
Final encoding: aaabdaaabac → ZdZac
with merge rules:
1. aa → X
2. ab → Y
3. XY → Z
Decoding: follow merge rules in opposite order
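A minimal sketch of this encode/decode procedure on the toy example (it only works while every symbol is a single character, as here):

def bpe_encode(text, merges):
    # Apply each merge rule, in the order it was learned
    for pair, symbol in merges:
        text = text.replace(pair, symbol)
    return text

def bpe_decode(text, merges):
    # Undo the merges in the opposite order
    for pair, symbol in reversed(merges):
        text = text.replace(symbol, pair)
    return text

merges = [("aa", "X"), ("ab", "Y"), ("XY", "Z")]
encoded = bpe_encode("aaabdaaabac", merges)
print(encoded)                      # ZdZac
print(bpe_decode(encoded, merges))  # aaabdaaabac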
Course 2: Tokenization 19
BPE Training - pre-tokenization
training_sentences = [
"Education is very important!",
"A cat and a dog live on an island",
"We'll be landing in Cabo Verde",
]
=>
pretokenized = ["education_", "is_", "very_", "important_", "!_", "a_",
"cat_", "and_", "a_", "dog_", "live_", "on_", "an_", "island_",
"we", "'", "ll_", "be_", "landing_", "in_", "cabo_" "Verde_"
]
Course 2: Tokenization 20
BPE Training - iteration 1
tokenized = [
['e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'a', 'n', 't', '_'], ['!', '_'],
['a', '_'], ['c', 'a', 't', '_'], ['a', 'n', 'd', '_'], ..., ['o', 'n', '_'], ['a', 'n', '_'], ['i', 's', 'l', 'a', 'n', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'a', 'n', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]
→ Most common pair: "an"
tokenized = [
['e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a', '_'], ['c', 'a', 't', '_'], ['an', 'd', '_'], ..., ['o', 'n', '_'], ['an', '_'], ['i', 's', 'l', 'an', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'an', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]
Course 2: Tokenization 21
BPE Training - iteration 2
tokenized = [
['e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a', '_'], ['c', 'a', 't', '_'], ['an', 'd', '_'], ..., ['o', 'n', '_'], ['an', '_'], ['i', 's', 'l', 'an', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'an', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]
→ Most common pair: "ca"
tokenized = [
['e', 'd', 'u', 'ca', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a', '_'], ['ca', 't', '_'], ['an', 'd', '_'], ..., ['o', 'n', '_'], ['an', '_'], ['i', 's', 'l', 'an', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'an', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]
Course 2: Tokenization 22
BPE Training - iteration 14 (final)
tokenized = [
['e', 'd', 'u', 'cat', 'i', 'on_'], ['is', '_'], ['ver', 'y', '_'], ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a_'], ['cat', '_'], ['and_'], ['a_'], ..., ['on_'], ['an', '_'], ['is', 'l', 'and_'],
['w', 'e'], ["'"], ..., ['l', 'and', 'i', 'n', 'g_'], ['i', 'n_'], ['ca', 'b', 'o', '_'], ['ver', 'd', 'e_']
]
"Created" tokens:
['an', 'ca', 'n_', 've', 'and', 'cat', 'on_', 'is', 'ver', 'a_', 'and_', 'g_', 'e_']
→ Common English words emerge (a, and, on, is, ...)
→ Note the distinction between and (word-internal) and and_ (word-final, marked by _)
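A compact sketch of this training loop (count adjacent pairs, merge the most frequent, repeat), run on the pre-tokenized words above; the number of merges is a parameter:

from collections import Counter

def bpe_train(words, num_merges):
    # Start from character-level splits of the pre-tokenized words
    splits = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs across the corpus
        pairs = Counter()
        for toks in splits:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Merge the most frequent pair everywhere
        for toks in splits:
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, splits

words = ["education_", "is_", "very_", "important_", "!_", "a_", "cat_",
         "and_", "a_", "dog_", "live_", "on_", "an_", "island_",
         "we", "'", "ll_", "be_", "landing_", "in_", "cabo_", "verde_"]
merges, splits = bpe_train(words, num_merges=14)
print(merges[:2])  # e.g. [('a', 'n'), ('c', 'a')] (ties may break differently)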
Course 2: Tokenization 23
BPE - Granularity
Course 2: Tokenization 24
WordPiece
Based on merge rules too
Initial processing is different:
BPE:
["education", "is"] => [["e", "d", "u", ..., "n", "_"], ["i", "s", "_"]]
WordPiece:
["education", "is"] => [["e", "##d", "##u", "##c",...], ["i", "##s"]]
Course 2: Tokenization 25
WordPiece
Pairs are scored using the following score function:
score(a, b) = freq(ab) / (freq(a) × freq(b))
If a and b are common, the pair is less likely to be merged
ex: dream / ##ing → not merged
If a and b are rare but ab is common, the pair is more likely to be merged
ex: pulv / ##erise → pulverise
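A toy sketch of this scoring rule; the frequencies are made-up counts for illustration:

def wordpiece_score(pair_freq, left_freq, right_freq):
    # score(a, b) = freq(ab) / (freq(a) * freq(b))
    return pair_freq / (left_freq * right_freq)

# "dream" and "##ing" are both very frequent -> low score, not merged
print(wordpiece_score(pair_freq=500, left_freq=10_000, right_freq=50_000))  # 1e-06
# "pulv" and "##erise" are rare but almost always appear together -> high score
print(wordpiece_score(pair_freq=90, left_freq=100, right_freq=95))          # ~0.0095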
Course 2: Tokenization 26
Unigram
Unigram works in the opposite direction:
Start from a (too) big subword vocabulary
Gradually eliminate tokens that won't be missed
Score all possible segmentations and take max:
Ex: brew
S(b / r / e / w) → P(b) × P(r) × P(e) × P(w) = 0.024
S(br / e / w) → P(br) × P(e) × P(w) = 0.031
...
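A brute-force sketch of this scoring step on a short string; the probability table (like the 0.024 / 0.031 values above) is purely illustrative:

from math import prod

# Illustrative unigram probabilities (not a real trained vocabulary)
P = {"b": 0.3, "r": 0.2, "e": 0.4, "w": 0.1, "br": 0.31}

def score(segmentation):
    # S(t1 / ... / tk) = P(t1) x ... x P(tk); 0 if a token is out of vocab
    return prod(P.get(tok, 0.0) for tok in segmentation)

def all_segmentations(s):
    # Enumerate every split of s into contiguous non-empty tokens
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        for rest in all_segmentations(s[i:]):
            yield [s[:i]] + rest

best = max(all_segmentations("brew"), key=score)
print(best, score(best))  # ['br', 'e', 'w'] ~0.0124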
Course 2: Tokenization 27
Unigram - Inference
A string of length n has 2^(n-1) possible segmentations
→ Unigram uses the Viterbi algorithm:
Observation: for all positions i < j,
if the optimal segmentation of the first i characters is known...
... then any segmentation of the first j characters whose last token starts at position i+1,
but whose first i characters are not segmented optimally, is suboptimal
Course 2: Tokenization 28
Unigram - Inference
Example: email
Starting from letter e
For all ending letters, what is the best segmentation if last token
starts from e?
S(e) = 0.15
S(em) = 0.02
...
S(email) = 0.001
Course 2: Tokenization 29
Unigram - Inference
Example: email
Starting from letter m
For all ending letters, what is the best segmentation if last token
starts from m ?
S(e / m) = 0.1
...
S(e / mail) = 0.2
Remark: we've seen S(em) and S(e / m) → we know the best
segmentation that ends at m!
Course 2: Tokenization 30
Unigram - Inference
Example: email
Starting from letter a
For all ending letters, what is the best segmentation if last token
starts from a ? (hence after e / m )
S(e / m / a) = 0.023
...
S(e / m / ail) = 0 (ail is not in the vocab)
Remark: we've seen S(ema), ..., S(e / m / a) → we know the best
segmentation that ends at a! (here: e / ma is best)
Course 2: Tokenization 31
Unigram - Inference
Example: email
Starting from letter i
For all ending letters, what is the best segmentation if last token
starts from i ? (hence after e / ma )
S(e / ma / i) = 0.004
S(e / ma / il) = 0.03
Remark: we only have 2 candidates left! (here: e / ma / i is the best segmentation ending at i)
Course 2: Tokenization 32
Unigram - Inference
Example: email
Starting from letter l
For all ending letters, what is the best segmentation if last token
starts from l? (hence after e / ma / i)
S(e / ma / i / l) = 0.002
Takeaway: at each start position, we already know the best segmentation
up to that position => we only need to explore the tokens starting there
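A sketch of this Viterbi pass on email; the token probabilities are made up (so the numbers differ from the illustrative 0.15, 0.2, ... scores above):

# Illustrative unigram probabilities; real ones come from Unigram training
P = {"e": 0.5, "m": 0.2, "a": 0.3, "i": 0.1, "l": 0.2,
     "ma": 0.4, "mail": 0.8, "em": 0.05}

def viterbi_segment(s):
    # best[i] = (score, segmentation) of the best segmentation of s[:i]
    best = [(1.0, [])] + [(0.0, None)] * len(s)
    for end in range(1, len(s) + 1):
        for start in range(end):
            token = s[start:end]
            if token in P and best[start][1] is not None:
                cand = best[start][0] * P[token]
                if cand > best[end][0]:
                    best[end] = (cand, best[start][1] + [token])
    return best[-1]

print(viterbi_segment("email"))  # (0.4, ['e', 'mail'])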
Course 2: Tokenization 33
Unigram - Training (≈)
Start from a very big vocabulary
Infer on all pre-tokenized units and get the total score as: total score = Σ_units log S(unit), where S(unit) is the score of the unit's best segmentation
Course 2: Tokenization 34
Unigram - Training (≈)
Start from a very big vocabulary
Infer on all pre-tokenized units and get the total score as: total score = Σ_units log S(unit)
For each token t, compute the total score obtained when t is removed from the vocabulary
Get rid of the 20% of tokens that least decrease the score when
removed
Iterate (until you have desired vocabulary size)
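A rough sketch of this pruning loop, under the (≈) simplification that token probabilities stay fixed (real Unigram training re-estimates them with EM) and that the vocabulary contains every single character of the corpus:

from math import log

def viterbi_score(word, probs):
    # Log-probability of the best segmentation of `word` under `probs`
    best = [0.0] + [float("-inf")] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            tok = word[start:end]
            if tok in probs:
                best[end] = max(best[end], best[start] + log(probs[tok]))
    return best[-1]

def prune(vocab, corpus, keep_ratio=0.8):
    # keep_ratio=0.8 <=> remove the 20% of tokens that least decrease the total score
    total = sum(viterbi_score(w, vocab) for w in corpus)
    losses = {}
    for tok in vocab:
        if len(tok) == 1:
            continue  # always keep single characters so every word stays segmentable
        reduced = {t: p for t, p in vocab.items() if t != tok}
        losses[tok] = total - sum(viterbi_score(w, reduced) for w in corpus)
    n_remove = int(len(losses) * (1 - keep_ratio))
    removed = sorted(losses, key=losses.get)[:n_remove]
    return {t: p for t, p in vocab.items() if t not in removed}

corpus = ["email", "mail", "mailing"]
vocab = {"e": 0.1, "m": 0.1, "a": 0.1, "i": 0.1, "l": 0.1, "n": 0.1, "g": 0.1,
         "mail": 0.3, "ing": 0.2, "em": 0.05}
# keep_ratio lowered here only so this tiny example removes something
print(sorted(prune(vocab, corpus, keep_ratio=0.5)))
# 'em' is dropped first: it never appears in a best segmentation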
Course 2: Tokenization 35
Limits & Alternatives
Course 2: Tokenization 36
Limits
Fixed vocabulary...
... leads to OOV (out-of-vocabulary)
... scales poorly to 100+ languages (and scripts)
... can cause over-segmentation
... is not robust to misspellings
bpe("artificial intelligence is real") => 'artificial', 'intelligence', 'is', 'real'
bpe("aritificial inteligense is reaal") =>
'ari', '##ti', '##fi', '##cial', 'intel', '##igen', '##se', 'is', 're', '##aa', '##l'
Course 2: Tokenization 37
Alternatives - BPE dropout (Provilkov et al.)
→ Randomly drops some merges while tokenizing the training data, so the same word gets several possible segmentations
=> makes models more robust to misspellings
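A toy illustration of the idea (not the authors' implementation): while applying the merges, each candidate merge is skipped with probability p, so the same word is segmented differently across training batches. The merge table below is made up:

import random

def bpe_dropout_encode(word, merges, p=0.1):
    # Apply BPE merges in learned order, but skip each candidate merge with probability p
    toks = list(word)
    for a, b in merges:
        i = 0
        while i < len(toks) - 1:
            if toks[i] == a and toks[i + 1] == b and random.random() >= p:
                toks[i:i + 2] = [a + b]
            else:
                i += 1
    return toks

merges = [("i", "n"), ("in", "g"), ("l", "o"), ("lo", "v"), ("lov", "ing")]
for _ in range(3):
    print(bpe_dropout_encode("loving", merges, p=0.3))
# Outputs vary, e.g. ['loving'], ['lov', 'in', 'g'], ['lo', 'v', 'ing'], ...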
Course 2: Tokenization 38
Alternatives - CharacterBERT (El Boukkouri et al.)
Course 2: Tokenization 39
Alternatives - ByT5 (Xue et al.)
Feeds bytes (~characters) directly to the model
=> more robust and data-efficient BUT ~10 times slower and more
hardware-hungry
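A toy look at byte-level inputs (plain UTF-8 bytes; not ByT5's exact id mapping, which shifts ids to reserve special tokens):

text = "I love playing soccer!"
byte_ids = list(text.encode("utf-8"))
print(len(text), len(byte_ids))  # 22 22 (pure ASCII: one byte per character)
print(byte_ids[:6])              # [73, 32, 108, 111, 118, 101]

# Non-ASCII characters expand to several bytes, so sequences get even longer
print(list("é".encode("utf-8")))  # [195, 169]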
Course 2: Tokenization 40
Neural tokenization - CANINE (Clark et al.)
Downsamples the character sequence into one that is 4 times shorter
Course 2: Tokenization 41
Neural tokenization - MANTa (Godey et al.)
Lets the language model learn its own segmentation end-to-end
Course 2: Tokenization 42
Neural tokenization - MEGABYTE (Yu et al.)
Encodes the byte sequence into patches and then decodes autoregressively
Course 2: Tokenization 43
Takeaways
Tokenization: Art of splitting sentences/words into meaningful
smaller units
In ML: subword tokenization is (very) commonly used
Three main algorithms
BPE: iteratively learn most frequent merges
WordPiece: BPE with adjusted frequency score
Unigram: Start big and remove tokens that won't be missed
When facing noisy and/or complex text, alternatives exist
Course 2: Tokenization 44