
Course 2: Tokenization

1
What is tokenization?
Turning text...

I love playing soccer!

...into tokens

['I', 'love', 'play', 'ing', 'soccer', '!']

Course 2: Tokenization 2
Historical Notions

Course 2: Tokenization 3
Tokenization Origins
The word token comes from linguistics

“non-empty contiguous sequence of graphemes or phonemes in a document”

Split text on blanks

Course 2: Tokenization 4
Tokenization Origins

old_tokenize("I love playing soccer!") = ['I', 'love', 'playing', 'soccer!']

Different from word-forms:

dámelo → da / me / lo (= give / me / it)

Course 2: Tokenization 5
Tokenization Origins
Natural language is split into...

Sentences, utterances, documents... (macroscopic level)

...that are split into...

Tokens, word-forms... (microscopic level)

→ Used for linguistic tasks (POS tagging, syntax parsing, ...)

Course 2: Tokenization 6
Tokenization & ML
ML-based NLP (mostly) relies on sub-word tokenization:

Gives better performance
A fixed-size vocabulary is often required
Mitigates the Out-Of-Vocabulary (OOV) issue of word-level vocabularies

Course 2: Tokenization 7
Tokenization & ML
Evolution of modeling complexity w.r.t. the sequence length n

Model Type     Year    Complexity
Tf-Idf         1972    O(1)
RNNs           ~1985   O(n)
Transformers   2017    O(n²)

→ Long sequences (e.g. character-level) are prohibitive

Course 2: Tokenization 8
Modern framework
Pre-tokenization
"I'm lovin' it!" -> ["i", "am", "loving", "it", "!"]

Normalization
Rules around punctuation ( _:_ , _! , ...)
Spelling correction ( "imo" -> "in my opinion" )
Named entities ( "covid" -> "COVID-19" )
...
Rule-based segmentation
Blanks, punctuation, ...
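
A minimal sketch of such a pre-tokenization step (the normalization rules and the regular expression below are illustrative assumptions, not the exact rules used here):

import re

# Illustrative normalization rules (assumed for this example)
NORMALIZATION_RULES = {
    "i'm": "i am",
    "lovin'": "loving",
    "imo": "in my opinion",
    "covid": "COVID-19",
}

def pre_tokenize(text: str) -> list[str]:
    """Normalize, then segment on blanks and punctuation."""
    text = text.lower()
    for source, target in NORMALIZATION_RULES.items():
        text = text.replace(source, target)
    # Rule-based segmentation: words vs. single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

print(pre_tokenize("I'm lovin' it!"))  # ['i', 'am', 'loving', 'it', '!']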
Course 2: Tokenization 9
Modern framework
Tokenization -> ["i", "am", "lov", "##ing", "it", "!"]
Split units at subword level
Fixed vocabulary
Trained on text samples
Used in inference mode at pre-processing time
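
For instance, with a pre-trained WordPiece tokenizer from the Hugging Face transformers library (assuming the package is installed and the bert-base-uncased vocabulary can be downloaded; the exact split depends on that learned vocabulary):

from transformers import AutoTokenizer

# Load a tokenizer with a fixed, already-trained subword vocabulary
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Inference mode: rare or unseen words are split into subwords,
# with continuation pieces marked by "##"
print(tokenizer.tokenize("I love playing soccer!"))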

Course 2: Tokenization 10
Sub-word Tokenization

Course 2: Tokenization 11
Granularity

Course 2: Tokenization 12
Granularity
→ Trade-off between short sequences and reasonable vocabulary size

Fertility
For a string sequence s: fertility(s) = (number of tokens produced) / (number of pre-tokenized words), i.e. the average number of tokens per word
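
A tiny sketch of this metric, assuming fertility is measured as tokens per pre-tokenized word:

def fertility(tokens: list[str], words: list[str]) -> float:
    """Average number of tokens produced per pre-tokenized word."""
    return len(tokens) / len(words)

# 6 tokens for 5 words -> fertility of 1.2
print(fertility(['I', 'love', 'play', 'ing', 'soccer', '!'],
                ['I', 'love', 'playing', 'soccer', '!']))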

Course 2: Tokenization 13
Algorithms

Course 2: Tokenization 14
Byte-Pair Encoding (BPE)
Let's encode "aaabdaaabac" in an optimized way:

Observed pairs: {aa, ab, bd, da, ba, ac}
Observed occurrences: {aa: 4, ab: 2, bd: 1, da: 1, ba: 1, ac: 1}
Set X = aa
Encode aaabdaaabac → XabdXabac
Start again from XabdXabac

Course 2: Tokenization 15
Byte-Pair Encoding (BPE)
(current rules: aa → X)
Let's encode "XabdXabac" in an optimized way:

Observed pairs: {Xa, ab, bd, dX, ba, ac}
Observed occurrences: {Xa: 2, ab: 2, bd: 1, dX: 1, ba: 1, ac: 1}
Set Y = ab
Encode XabdXabac → XYdXYac
Start again from XYdXYac

Course 2: Tokenization 16
Byte-Pair Encoding (BPE)
(current rules: aa → X, ab → Y)
Let's encode "XYdXYac" in an optimized way:

Observed pairs: {XY, Yd, dX, Ya, ac}
Observed occurrences: {XY: 2, Yd: 1, dX: 1, Ya: 1, ac: 1}
Set Z = XY
Encode XYdXYac → ZdZac
Start again from ZdZac

Course 2: Tokenization 17
Byte-Pair Encoding (BPE)
(current rules: aa → X, ab → Y, XY → Z)
Let's encode "ZdZac" in an optimized way:

Observed pairs: {Zd, dZ, Za, ac}
Observed occurrences: {Zd: 1, dZ: 1, Za: 1, ac: 1}
All pairs are unique => END

Course 2: Tokenization 18
Byte-Pair Encoding (BPE)
Final encoding: aaabdaaabac → ZdZac

with merge rules:

1. aa → X
2. ab → Y
3. XY → Z

Decoding: follow merge rules in opposite order
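
A minimal pure-Python sketch of this compression-style BPE loop, reproducing the example above (the function name and the pool of fresh symbols are illustrative):

from collections import Counter

def bpe_compress(text: str):
    """Repeatedly replace the most frequent symbol pair with a new symbol."""
    merge_rules = []              # list of (pair, new_symbol)
    new_symbols = iter("XYZUVW")  # fresh symbols for this small example
    while True:
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        pair, count = pairs.most_common(1)[0] if pairs else ("", 0)
        if count < 2:             # all pairs are unique -> stop
            return text, merge_rules
        symbol = next(new_symbols)
        merge_rules.append((pair, symbol))
        text = text.replace(pair, symbol)

encoded, rules = bpe_compress("aaabdaaabac")
print(encoded)  # ZdZac
print(rules)    # three merge rules (ties are broken by first occurrence,
                # so they may differ slightly from the aa/ab/XY choice above)

# Decoding: apply the merge rules in reverse order
for pair, symbol in reversed(rules):
    encoded = encoded.replace(symbol, pair)
print(encoded)  # aaabdaaabac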

Course 2: Tokenization 19
BPE Training - pre-tokenization
training_sentences = [
"Education is very important!",
"A cat and a dog live on an island",
"We'll be landing in Cabo Verde",
]

=>

pretokenized = ["education_", "is_", "very_", "important_", "!_", "a_",
    "cat_", "and_", "a_", "dog_", "live_", "on_", "an_", "island_",
    "we", "'", "ll_", "be_", "landing_", "in_", "cabo_", "verde_"
]

Course 2: Tokenization 20
BPE Training - iteration 1
tokenized = [
['e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'a', 'n', 't', '_'], ['!', '_'],
['a', '_'], ['c', 'a', 't', '_'], ['a', 'n', 'd', '_'], ..., ['o', 'n', '_'], ['a', 'n', '_'], ['i', 's', 'l', 'a', 'n', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'a', 'n', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]

→ Most common pair: "an"


tokenized = [
['e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a', '_'], ['c', 'a', 't', '_'], ['an', 'd', '_'], ..., ['o', 'n', '_'], ['an', '_'], ['i', 's', 'l', 'an', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'an', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]

Course 2: Tokenization 21
BPE Training - iteration 2
tokenized = [
['e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a', '_'], ['c', 'a', 't', '_'], ['an', 'd', '_'], ..., ['o', 'n', '_'], ['an', '_'], ['i', 's', 'l', 'an', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'an', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]

→ Most common pair: "ca"


tokenized = [
['e', 'd', 'u', 'ca', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a', '_'], ['ca', 't', '_'], ['an', 'd', '_'], ..., ['o', 'n', '_'], ['an', '_'], ['i', 's', 'l', 'an', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'an', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]

Course 2: Tokenization 22
BPE Training - iteration 14 (final)
tokenized = [
['e', 'd', 'u', 'cat', 'i', 'on_'], ['is', '_'], ['ver', 'y', '_'], ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a_'], ['cat', '_'], ['and_'], ['a_'], ..., ['on_'], ['an', '_'], ['is', 'l', 'and_'],
['w', 'e'], ["'"], ..., ['l', 'and', 'i', 'n', 'g_'], ['i', 'n_'], ['ca', 'b', 'o', '_'], ['ver', 'd', 'e_']
]

"Created" tokens:

['an', 'ca', 'n_', 've', 'and', 'cat', 'on_', 'is', 'ver', 'a_', 'and_', 'g_', 'e_']

→ Common English words emerge (a, and, on, is, ...)
→ Note the distinction between and (word-internal) and and_ (word-final, marked by _)
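
A rough sketch of this training loop in pure Python (the function name is illustrative; ties between equally frequent pairs may be resolved differently than on the slides):

from collections import Counter

def bpe_train(pretokenized: list[str], num_merges: int):
    """Learn BPE merge rules from pre-tokenized units."""
    # Start from character-level splits, e.g. "cat_" -> ['c', 'a', 't', '_']
    words = [list(w) for w in pretokenized]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs across the corpus
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the new merge rule everywhere
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words

pretokenized = ["education_", "is_", "very_", "important_", "!_", "a_",
                "cat_", "and_", "a_", "dog_", "live_", "on_", "an_",
                "island_", "we", "'", "ll_", "be_", "landing_", "in_",
                "cabo_", "verde_"]
merges, words = bpe_train(pretokenized, num_merges=14)
print(merges[:2])  # [('a', 'n'), ('c', 'a')], matching iterations 1 and 2 above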

Course 2: Tokenization 23
BPE - Granularity

Course 2: Tokenization 24
WordPiece
Based on merge rules too
Initial processing is different:

BPE:

["education", "is"] => [["e", "d", "u", ..., "n", "_"], ["i", "s", "_"]]

WordPiece:

["education", "is"] => [["e", "##d", "##u", "##c",...], ["i", "##s"]]

Course 2: Tokenization 25
WordPiece
Pairs are scored using this score function:

score(a, b) = count(ab) / (count(a) × count(b))

if a and b are individually common, the pair is less likely to merge
ex: dream / ##ing → not merged
if a and b are rare but ab is common, the pair is more likely to merge
ex: pulv / ##erise → pulverise
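
A small sketch of this scoring rule (the counts are made-up, for illustration only):

def wordpiece_score(pair_count: int, first_count: int, second_count: int) -> float:
    """score(a, b) = count(ab) / (count(a) * count(b))"""
    return pair_count / (first_count * second_count)

# "dream" + "##ing": both parts are very frequent -> low score, no merge
print(wordpiece_score(pair_count=200, first_count=5_000, second_count=80_000))  # 5e-07
# "pulv" + "##erise": both parts are rare but almost always co-occur -> high score, merge
print(wordpiece_score(pair_count=150, first_count=200, second_count=300))       # 0.0025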

Course 2: Tokenization 26
Unigram
Unigram works in the opposite direction:

Start from a (too) big subword vocabulary
Gradually eliminate tokens that won't be missed
Score all possible segmentations and take the max:
Ex: brew
S(b / r / e / w) → P(b) × P(r) × P(e) × P(w) = 0.024
S(br / e / w) → P(br) × P(e) × P(w) = 0.031
...
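
A brute-force sketch of this scoring (the vocabulary and its probabilities are made-up illustrations):

from itertools import product

# Hypothetical unigram probabilities (illustrative, not from a real corpus)
PROBS = {"b": 0.3, "r": 0.2, "e": 0.4, "w": 0.1, "br": 0.5}

def all_segmentations(word: str):
    """Yield every way of cutting the word into in-vocabulary tokens."""
    # Each of the len(word) - 1 boundaries is either a cut or not
    for cuts in product([False, True], repeat=len(word) - 1):
        pieces, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:
                pieces.append(word[start:i])
                start = i
        pieces.append(word[start:])
        if all(piece in PROBS for piece in pieces):
            yield pieces

def score(pieces: list[str]) -> float:
    result = 1.0
    for piece in pieces:
        result *= PROBS[piece]
    return result

best = max(all_segmentations("brew"), key=score)
print(best, score(best))  # ['br', 'e', 'w'] ≈ 0.02 with these made-up probabilities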

Course 2: Tokenization 27
Unigram - Inference
A string of length n has 2^(n-1) possible segmentations

→ Unigram uses the Viterbi algorithm:

Observation:
for all indexes i ≤ j:
if the optimal segmentation of the prefix ending at i is known...
...then any segmentation that reaches j through a different (suboptimal) segmentation of that prefix
is itself suboptimal

Course 2: Tokenization 28
Unigram - Inference
Example: email

Starting from letter e


For all ending letters, what is the best segmentation if last token
starts from e?
S( e) = 0.15
S( em ) = 0.02
...
S( email ) = 0.001

Course 2: Tokenization 29
Unigram - Inference
Example: email

Starting from letter m


For all ending letters, what is the best segmentation if last token
starts from m ?
S( e / m ) = 0.1
...
S( e / mail ) = 0.2
Remark: we've seen S( em ) and S( e / m ) → we know the best
segmentation that ends at m !
Course 2: Tokenization 30
Unigram - Inference
Example: email

Starting from letter a


For all ending letters, what is the best segmentation if last token
starts from a ? (hence after e / m )
S( e / m / a ) = 0.023
...
S( e / m / ail ) → impossible ( ail is not in the vocab)
Remark: we've seen S( ema ), ..., S( e / m / a ) → we know the best
segmentation that ends at a ! (here: e / ma is best)
Course 2: Tokenization 31
Unigram - Inference
Example: email

Starting from letter i


For all ending letters, what is the best segmentation if last token
starts from i ? (hence after e / ma )
S( e / ma / i ) = 0.004
S( e / ma / il ) = 0.03
Remark: we only have 2 candidates left! (here: ema / i is best)

Course 2: Tokenization 32
Unigram - Inference
Example: email

Starting from letter l


For all ending letters, what is the best segmentation if last token
starts from l ? (hence after ema / i )
S( ema / i / l ) = 0.002

Takeaway: at each start position, we already know the best segmentation up to that position => we only need to explore tokens starting there
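
A Viterbi-style sketch of this inference (the vocabulary probabilities are made-up and best_segmentation is an illustrative name, not an existing API):

import math

# Hypothetical unigram probabilities, for illustration only
PROBS = {"e": 0.1, "m": 0.05, "a": 0.08, "i": 0.04, "l": 0.06,
         "em": 0.01, "ma": 0.09, "ai": 0.01, "il": 0.07, "mail": 0.12, "email": 0.001}

def best_segmentation(word: str, probs: dict[str, float] = PROBS):
    """best[i] holds the best segmentation (and log-score) of word[:i]."""
    n = len(word)
    best_score = [-math.inf] * (n + 1)
    best_split = [[] for _ in range(n + 1)]
    best_score[0] = 0.0  # empty prefix: log-probability 0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best_score[start] > -math.inf:
                candidate = best_score[start] + math.log(probs[piece])
                if candidate > best_score[end]:
                    best_score[end] = candidate
                    best_split[end] = best_split[start] + [piece]
    return best_split[n], math.exp(best_score[n])

print(best_segmentation("email"))  # (['e', 'mail'], ~0.012) with these made-up probabilities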

Course 2: Tokenization 33
Unigram - Training (≈)
Start from a very big vocabulary
Infer on all pre-tokenized units and get the total score as:
Total ≈ Σ over units u of count(u) × log S(u)

Course 2: Tokenization 34
Unigram - Training (≈)
Start from a very big vocabulary
Infer on all pre-tokenized units and get the total score as:
Total ≈ Σ over units u of count(u) × log S(u)

For every token t, compute the drop in total score when t is removed from the vocabulary
Get rid of the 20% of tokens that least decrease the score when removed
Iterate (until you reach the desired vocabulary size)
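
A rough sketch of this pruning loop, reusing the hypothetical best_segmentation helper from the Viterbi sketch above (the 20% ratio comes from the slide; everything else is an assumption):

import math

def total_score(unit_counts: dict[str, int], probs: dict[str, float]) -> float:
    """Frequency-weighted sum of best segmentation log-scores."""
    total = 0.0
    for unit, count in unit_counts.items():
        _, p = best_segmentation(unit, probs)  # hypothetical helper from the sketch above
        total += count * math.log(p) if p > 0 else -math.inf
    return total

def prune_vocab(probs: dict[str, float], unit_counts: dict[str, int], remove_ratio: float = 0.2):
    """Drop the tokens whose removal hurts the total score the least."""
    base = total_score(unit_counts, probs)
    impact = {}
    for token in probs:
        if len(token) == 1:  # keep single characters so every unit stays segmentable
            continue
        reduced = {t: p for t, p in probs.items() if t != token}
        impact[token] = base - total_score(unit_counts, reduced)
    # Smallest impact first -> removed first
    removable = sorted(impact, key=impact.get)
    to_remove = set(removable[:int(len(probs) * remove_ratio)])
    return {t: p for t, p in probs.items() if t not in to_remove}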
Course 2: Tokenization 35
Limits & Alternatives

Course 2: Tokenization 36
Limits
Fixed vocabulary...
... leads to OOV (out-of-vocabulary)
... scales poorly to 100+ languages (and scripts)
... can cause over-segmentation
... is not robust to misspellings

bpe("artificial intelligence is real") => 'artificial', 'intelligence', 'is', 'real'

bpe("aritificial inteligense is reaal") =>


'ari', '##ti', '##fi', '##cial', 'intel', '##igen', '##se', 'is', 're', '##aa', '##l'

Course 2: Tokenization 37
Alternatives - BPE dropout (Provilkov et al.)
→ Randomly drops some of the learned merges while tokenizing the training data, so the model sees several segmentations of the same word

=> makes models more robust to misspellings


Course 2: Tokenization 38
Alternatives - CharacterBERT (El Boukkouri et al.)

Course 2: Tokenization 39
Alternatives - ByT5 (Xue et al.)
Feeds bytes (~characters) directly to the model

=> more robust and data-efficient, BUT ~10 times slower and more hardware-hungry

Course 2: Tokenization 40
Neural tokenization - CANINE (Clark et al.)
Downsamples the character sequence (by a factor of 4) into a shorter sequence

Course 2: Tokenization 41
Neural tokenization - MANTa (Godey et al.)
Allows the language model to learn its own mapping

Course 2: Tokenization 42
Neural tokenization - MEGABYTE (Yu et al.)
Encodes fixed-size byte patches with a global model, then decodes the bytes of each patch autoregressively with a smaller local model

Course 2: Tokenization 43
Takeaways
Tokenization: the art of splitting sentences/words into smaller meaningful units
In ML: subword tokenization is (very) commonly used
Three main algorithms:
BPE: iteratively learn the most frequent merges
WordPiece: BPE with an adjusted frequency score
Unigram: start big and remove tokens that won't be missed
When facing noisy and/or complex text, alternatives exist

Course 2: Tokenization 44
