Course 2: Tokenization
1
What is tokenization?
Turning text...
I love playing soccer!
...into tokens
['I', 'love', 'play', 'ing', 'soccer', '!']
Course 2: Tokenization 2
Historical Notions
Course 2: Tokenization 3
Tokenization Origins
The word token comes from linguistics
"non-empty contiguous sequence of graphemes or phonemes in a document"
Split text on blanks
Course 2: Tokenization 4
Tokenization Origins
old_tokenize("I love playing soccer!") = ['I', 'love', 'playing', 'soccer!']
Different from word-forms
dámelo → da / me / lo (= give / me / it)
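A minimal sketch of the blank-splitting old_tokenize shown above (whitespace split only; the function name is the one used in the example):

def old_tokenize(text):
    # Historical notion of a token: split on blanks only, punctuation stays attached
    return text.split()

print(old_tokenize("I love playing soccer!"))
# ['I', 'love', 'playing', 'soccer!']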
Course 2: Tokenization 5
Tokenization Origins
Natural language is split into...
Sentences, utterances, documents... (macroscopic)
that are split into...
Tokens, word-forms... (microscopic)
→ Used for linguistic tasks (POS tagging, syntax parsing,...)
Course 2: Tokenization 6
Tokenization & ML
ML-based NLP (mostly) relies on sub-word tokenization:
Gives better performance
Provides the fixed-size vocabulary that models often require
Mitigates the Out-Of-Vocabulary (OOV) issue
Course 2: Tokenization 7
Tokenization & ML
Evolution of modeling complexity w.r.t. the sequence length n:
Model type     Year   Complexity
Tf-Idf         1972   O(1)
RNNs           ~1985  O(n)
Transformers   2017   O(n²)
→ Long sequences (e.g. character-level) are prohibitive
Course 2: Tokenization 8
Modern framework
Pre-tokenization
"I'm lovin' it!" -> ["i", "am", "loving", "it", "!"]
Normalization
Rules around punctuation (spacing around ":", "!", ...)
Spelling correction ( "imo" -> "in my opinion" )
Named entities ( "covid" -> "COVID-19" )
...
Rule-based segmentation
Blanks, punctuation, ...
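A toy pre-tokenizer combining the two steps above (normalization, then rule-based segmentation); the rewrite table and the punctuation rule are illustrative assumptions, not the rules of any particular library:

import re

# Illustrative normalization table (spelling / named-entity rewrites)
REWRITES = {"i'm": "i am", "lovin'": "loving", "imo": "in my opinion", "covid": "COVID-19"}

def pre_tokenize(text):
    # Normalization: lowercase, then apply the rewrite rules word by word
    text = " ".join(REWRITES.get(w, w) for w in text.lower().split())
    # Rule-based segmentation: split on blanks and isolate punctuation
    return re.findall(r"\w+|[^\w\s]", text)

print(pre_tokenize("I'm lovin' it!"))
# ['i', 'am', 'loving', 'it', '!']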
Course 2: Tokenization 9
Modern framework
Tokenization -> ["i", "am", "lov", "##ing", "it", "!"]
Splits units at the subword level
Uses a fixed vocabulary
Trained on text samples
Applied in inference mode as a pre-processing step
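For instance, with the Hugging Face transformers library (assuming it is installed and the bert-base-uncased WordPiece checkpoint can be downloaded), a trained tokenizer is applied like this:

from transformers import AutoTokenizer

# Load a subword tokenizer that was trained once on large text samples
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Applied deterministically as a pre-processing step at inference time
print(tokenizer.tokenize("I love playing soccer!"))
# e.g. ['i', 'love', 'playing', 'soccer', '!'] (exact split depends on the learned vocabulary)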
Course 2: Tokenization 10
Sub-word Tokenization
Course 2: Tokenization 11
Granularity
Course 2: Tokenization 12
Granularity
→ Trade-off between short sequences and reasonable vocabulary size
Fertility
For a string sequence s: fertility(s) = #tokens(s) / #words(s), i.e. the average number of subword tokens produced per word
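A minimal sketch of this measure, with hard-coded example outputs standing in for a real tokenizer:

def fertility(words, tokenized):
    # Average number of subword tokens produced per pre-tokenized word
    return sum(len(toks) for toks in tokenized) / len(words)

words = ["i", "love", "playing", "soccer", "!"]
tokenized = [["i"], ["love"], ["play", "##ing"], ["soccer"], ["!"]]
print(fertility(words, tokenized))  # 6 tokens / 5 words = 1.2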
Course 2: Tokenization 13
Algorithms
Course 2: Tokenization 14
Byte-Pair Encoding (BPE)
Let's encode " aaabdaaabac" in an optimized way:
Observed pairs: { aa , ab , bd , da , ba , ac}
Observed occurences: { aa : 4, ab : 2, bd : 1, da : 1, ba : 1, ac: 1}
Set X = aa
Encode aaabdaaabac → XabdXabac
Start again from XabdXabac
Course 2: Tokenization 15
Byte-Pair Encoding (BPE)
(current rules: aa → X )
Let's encode " XabdXabac" in an optimized way:
Observed pairs: { Xa , ab , bd , dX , ba , ac}
Observed occurences: { Xa : 2, ab: 2, bd : 1, dX : 1, ba : 1, ac: 1}
Set Y = ab
Encode XabdXabac → XYdXYac
Start again from XYdXYac
Course 2: Tokenization 16
Byte-Pair Encoding (BPE)
(current rules: aa → X , ab → Y )
Let's encode " XYdXYac" in an optimized way:
Observed pairs: { XY , Yd , dX , Ya , ac}
Observed occurences: { XY : 2, Yd : 1, dX : 1, Ya : 1, ac: 1}
Set Z = XY
Encode XYdXYac → ZdZac
Start again from ZdZac
Course 2: Tokenization 17
Byte-Pair Encoding (BPE)
(current rules: aa → X , ab → Y , XY → Z )
Let's encode " ZdZac" in an optimized way:
Observed pairs: { Zd , dZ , Za , ac}
Observed occurences: { Zd : 1, dZ : 1, Za : 1, ac: 1}
All pairs are unique => END
Course 2: Tokenization 18
Byte-Pair Encoding (BPE)
Final encoding: aaabdaaabac → ZdZac
with merge rules:
1. aa → X
2. ab → Y
3. XY → Z
Decoding: follow merge rules in opposite order
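A minimal sketch of this encode/decode procedure on the toy example (it only works while every symbol is a single character, as here):

def bpe_encode(text, merges):
    # Apply each merge rule, in the order it was learned
    for pair, symbol in merges:
        text = text.replace(pair, symbol)
    return text

def bpe_decode(text, merges):
    # Undo the merges in the opposite order
    for pair, symbol in reversed(merges):
        text = text.replace(symbol, pair)
    return text

merges = [("aa", "X"), ("ab", "Y"), ("XY", "Z")]
encoded = bpe_encode("aaabdaaabac", merges)
print(encoded)                      # ZdZac
print(bpe_decode(encoded, merges))  # aaabdaaabac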
Course 2: Tokenization 19
BPE Training - pre-tokenization
training_sentences = [
"Education is very important!",
"A cat and a dog live on an island",
"We'll be landing in Cabo Verde",
]
=>
pretokenized = ["education_", "is_", "very_", "important_", "!_", "a_",
"cat_", "and_", "a_", "dog_", "live_", "on_", "an_", "island_",
"we", "'", "ll_", "be_", "landing_", "in_", "cabo_" "Verde_"
]
Course 2: Tokenization 20
BPE Training - iteration 1
tokenized = [
['e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'a', 'n', 't', '_'], ['!', '_'],
['a', '_'], ['c', 'a', 't', '_'], ['a', 'n', 'd', '_'], ..., ['o', 'n', '_'], ['a', 'n', '_'], ['i', 's', 'l', 'a', 'n', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'a', 'n', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]
→ Most common pair: "an"
tokenized = [
['e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a', '_'], ['c', 'a', 't', '_'], ['an', 'd', '_'], ..., ['o', 'n', '_'], ['an', '_'], ['i', 's', 'l', 'an', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'an', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]
Course 2: Tokenization 21
BPE Training - iteration 2
tokenized = [
['e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a', '_'], ['c', 'a', 't', '_'], ['an', 'd', '_'], ..., ['o', 'n', '_'], ['an', '_'], ['i', 's', 'l', 'an', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'an', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]
→ Most common pair: "ca"
tokenized = [
['e', 'd', 'u', 'ca', 't', 'i', 'o', 'n', '_'], ..., ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a', '_'], ['ca', 't', '_'], ['an', 'd', '_'], ..., ['o', 'n', '_'], ['an', '_'], ['i', 's', 'l', 'an', 'd', '_'],
['w', 'e'], ["'"], ['l', 'l', '_'], ['b', 'e', '_'], ['l', 'an', 'd', 'i', 'n', 'g', '_'], ..., ['v', 'e', 'r', 'd', 'e', '_']
]
Course 2: Tokenization 22
BPE Training - iteration 14 (final)
tokenized = [
['e', 'd', 'u', 'cat', 'i', 'on_'], ['is', '_'], ['ver', 'y', '_'], ['i', 'm', 'p', 'o', 'r', 't', 'an', 't', '_'], ['!', '_'],
['a_'], ['cat', '_'], ['and_'], ['a_'], ..., ['on_'], ['an', '_'], ['is', 'l', 'and_'],
['w', 'e'], ["'"], ..., ['l', 'and', 'i', 'n', 'g_'], ['i', 'n_'], ['ca', 'b', 'o', '_'], ['ver', 'd', 'e_']
]
"Created" tokens:
['an', 'ca', 'n_', 've', 'and', 'cat', 'on_', 'is', 'ver', 'a_', 'and_', 'g_', 'e_']
→ Common English words emerge (a, and, on, is, ...)
→ Note the distinction between and (word-internal) and and_ (word-final, marked by _)
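A compact sketch of this training loop (count adjacent pairs, merge the most frequent, repeat), run on the pre-tokenized words above; the number of merges is a parameter:

from collections import Counter

def bpe_train(words, num_merges):
    # Start from character-level splits of the pre-tokenized words
    splits = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs across the corpus
        pairs = Counter()
        for toks in splits:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Merge the most frequent pair everywhere
        for toks in splits:
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, splits

words = ["education_", "is_", "very_", "important_", "!_", "a_", "cat_",
         "and_", "a_", "dog_", "live_", "on_", "an_", "island_",
         "we", "'", "ll_", "be_", "landing_", "in_", "cabo_", "verde_"]
merges, splits = bpe_train(words, num_merges=14)
print(merges[:2])  # e.g. [('a', 'n'), ('c', 'a')] (ties may break differently)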
Course 2: Tokenization 23
BPE - Granularity
Course 2: Tokenization 24
WordPiece
Based on merge rules too
Initial processing is different:
BPE:
["education", "is"] => [["e", "d", "u", ..., "n", "_"], ["i", "s", "_"]]
WordPiece:
["education", "is"] => [["e", "##d", "##u", "##c",...], ["i", "##s"]]
Course 2: Tokenization 25
WordPiece
Pairs are scored using the following score function:
score(a, b) = freq(ab) / (freq(a) × freq(b))
If a and b are common, the pair is less likely to be merged
ex: dream / ##ing → not merged
If a and b are rare but ab is common, the pair is more likely to be merged
ex: pulv / ##erise → pulverise
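A toy sketch of this scoring rule; the frequencies are made-up counts for illustration:

def wordpiece_score(pair_freq, left_freq, right_freq):
    # score(a, b) = freq(ab) / (freq(a) * freq(b))
    return pair_freq / (left_freq * right_freq)

# "dream" and "##ing" are both very frequent -> low score, not merged
print(wordpiece_score(pair_freq=500, left_freq=10_000, right_freq=50_000))  # 1e-06
# "pulv" and "##erise" are rare but almost always appear together -> high score
print(wordpiece_score(pair_freq=90, left_freq=100, right_freq=95))          # ~0.0095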
Course 2: Tokenization 26
Unigram
Unigram works in the opposite direction:
Start from a (too) big subword vocabulary
Gradually eliminate tokens that won't be missed
Score all possible segmentations and take max:
Ex: brew
S(b / r / e / w) → P(b) × P(r) × P(e) × P(w) = 0.024
S(br / e / w) → P(br) × P(e) × P(w) = 0.031
...
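A brute-force sketch of this scoring step on a short string; the probability table (like the 0.024 / 0.031 values above) is purely illustrative:

from math import prod

# Illustrative unigram probabilities (not a real trained vocabulary)
P = {"b": 0.3, "r": 0.2, "e": 0.4, "w": 0.1, "br": 0.31}

def score(segmentation):
    # S(t1 / ... / tk) = P(t1) x ... x P(tk); 0 if a token is out of vocab
    return prod(P.get(tok, 0.0) for tok in segmentation)

def all_segmentations(s):
    # Enumerate every split of s into contiguous non-empty tokens
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        for rest in all_segmentations(s[i:]):
            yield [s[:i]] + rest

best = max(all_segmentations("brew"), key=score)
print(best, score(best))  # ['br', 'e', 'w'] ~0.0124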
Course 2: Tokenization 27
Unigram - Inference
A string of length n has 2^(n-1) possible segmentations
→ Unigram uses the Viterbi algorithm:
Observation: for all positions i < j,
if the optimal segmentation of the first i characters is known...
... then any segmentation of the first j characters whose last token starts at position i+1,
but whose first i characters are not segmented optimally, is suboptimal
Course 2: Tokenization 28
Unigram - Inference
Example: email
Starting from letter e
For all ending letters, what is the best segmentation if last token
starts from e?
S(e) = 0.15
S(em) = 0.02
...
S(email) = 0.001
Course 2: Tokenization 29
Unigram - Inference
Example: email
Starting from letter m
For all ending letters, what is the best segmentation if last token
starts from m ?
S(e / m) = 0.1
...
S(e / mail) = 0.2
Remark: we've seen S(em) and S(e / m) → we know the best
segmentation that ends at m!
Course 2: Tokenization 30
Unigram - Inference
Example: email
Starting from letter a
For all ending letters, what is the best segmentation if last token
starts from a ? (hence after e / m )
S(e / m / a) = 0.023
...
S(e / m / ail) = 0 (ail is not in the vocab)
Remark: we've seen S(ema), ..., S(e / m / a) → we know the best
segmentation that ends at a! (here: e / ma is best)
Course 2: Tokenization 31
Unigram - Inference
Example: email
Starting from letter i
For all ending letters, what is the best segmentation if last token
starts from i ? (hence after e / ma )
S(e / ma / i) = 0.004
S(e / ma / il) = 0.03
Remark: we only have 2 candidates left! (here: e / ma / i is the best segmentation ending at i)
Course 2: Tokenization 32
Unigram - Inference
Example: email
Starting from letter l
For all ending letters, what is the best segmentation if last token
starts from l? (hence after e / ma / i)
S(e / ma / i / l) = 0.002
Takeaway: at each start position, we already know the best segmentation
up to that position => we only need to explore the tokens starting there
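A sketch of this Viterbi pass on email; the token probabilities are made up (so the numbers differ from the illustrative 0.15, 0.2, ... scores above):

# Illustrative unigram probabilities; real ones come from Unigram training
P = {"e": 0.5, "m": 0.2, "a": 0.3, "i": 0.1, "l": 0.2,
     "ma": 0.4, "mail": 0.8, "em": 0.05}

def viterbi_segment(s):
    # best[i] = (score, segmentation) of the best segmentation of s[:i]
    best = [(1.0, [])] + [(0.0, None)] * len(s)
    for end in range(1, len(s) + 1):
        for start in range(end):
            token = s[start:end]
            if token in P and best[start][1] is not None:
                cand = best[start][0] * P[token]
                if cand > best[end][0]:
                    best[end] = (cand, best[start][1] + [token])
    return best[-1]

print(viterbi_segment("email"))  # (0.4, ['e', 'mail'])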
Course 2: Tokenization 33
Unigram - Training (≈)
Start from a very big vocabulary
Infer on all pre-tokenized units and get the total score as: total score = Σ_units log S(unit), where S(unit) is the score of the unit's best segmentation
Course 2: Tokenization 34
Unigram - Training (≈)
Start from a very big vocabulary
Infer on all pre-tokenized units and get the total score as: total score = Σ_units log S(unit)
For each token t, compute the total score obtained when t is removed from the vocabulary
Get rid of the 20% of tokens that least decrease the score when
removed
Iterate (until you have desired vocabulary size)
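A rough sketch of this pruning loop, under the (≈) simplification that token probabilities stay fixed (real Unigram training re-estimates them with EM) and that the vocabulary contains every single character of the corpus:

from math import log

def viterbi_score(word, probs):
    # Log-probability of the best segmentation of `word` under `probs`
    best = [0.0] + [float("-inf")] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            tok = word[start:end]
            if tok in probs:
                best[end] = max(best[end], best[start] + log(probs[tok]))
    return best[-1]

def prune(vocab, corpus, keep_ratio=0.8):
    # keep_ratio=0.8 <=> remove the 20% of tokens that least decrease the total score
    total = sum(viterbi_score(w, vocab) for w in corpus)
    losses = {}
    for tok in vocab:
        if len(tok) == 1:
            continue  # always keep single characters so every word stays segmentable
        reduced = {t: p for t, p in vocab.items() if t != tok}
        losses[tok] = total - sum(viterbi_score(w, reduced) for w in corpus)
    n_remove = int(len(losses) * (1 - keep_ratio))
    removed = sorted(losses, key=losses.get)[:n_remove]
    return {t: p for t, p in vocab.items() if t not in removed}

corpus = ["email", "mail", "mailing"]
vocab = {"e": 0.1, "m": 0.1, "a": 0.1, "i": 0.1, "l": 0.1, "n": 0.1, "g": 0.1,
         "mail": 0.3, "ing": 0.2, "em": 0.05}
# keep_ratio lowered here only so this tiny example removes something
print(sorted(prune(vocab, corpus, keep_ratio=0.5)))
# 'em' is dropped first: it never appears in a best segmentation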
Course 2: Tokenization 35
Limits & Alternatives
Course 2: Tokenization 36
Limits
Fixed vocabulary...
... leads to OOV (out-of-vocabulary)
... scales poorly to 100+ languages (and scripts)
... can cause over-segmentation
... is not robust to misspellings
bpe("artificial intelligence is real") => 'artificial', 'intelligence', 'is', 'real'
bpe("aritificial inteligense is reaal") =>
'ari', '##ti', '##fi', '##cial', 'intel', '##igen', '##se', 'is', 're', '##aa', '##l'
Course 2: Tokenization 37
Alternatives - BPE dropout (Provilkov et al.)
→ Randomly drops some merges while tokenizing the training data, so the same word gets several possible segmentations
=> makes models more robust to misspellings
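A toy illustration of the idea (not the authors' implementation): while applying the merges, each candidate merge is skipped with probability p, so the same word is segmented differently across training batches. The merge table below is made up:

import random

def bpe_dropout_encode(word, merges, p=0.1):
    # Apply BPE merges in learned order, but skip each candidate merge with probability p
    toks = list(word)
    for a, b in merges:
        i = 0
        while i < len(toks) - 1:
            if toks[i] == a and toks[i + 1] == b and random.random() >= p:
                toks[i:i + 2] = [a + b]
            else:
                i += 1
    return toks

merges = [("i", "n"), ("in", "g"), ("l", "o"), ("lo", "v"), ("lov", "ing")]
for _ in range(3):
    print(bpe_dropout_encode("loving", merges, p=0.3))
# Outputs vary, e.g. ['loving'], ['lov', 'in', 'g'], ['lo', 'v', 'ing'], ...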
Course 2: Tokenization 38
Alternatives - CharacterBERT (El Boukkouri et al.)
Course 2: Tokenization 39
Alternatives - ByT5 (Xue et al.)
Feeds bytes (~characters) directly to the model
=> more robust and data-efficient BUT ~10 times slower and more
hardware-hungry
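A toy look at byte-level inputs (plain UTF-8 bytes; not ByT5's exact id mapping, which shifts ids to reserve special tokens):

text = "I love playing soccer!"
byte_ids = list(text.encode("utf-8"))
print(len(text), len(byte_ids))  # 22 22 (pure ASCII: one byte per character)
print(byte_ids[:6])              # [73, 32, 108, 111, 118, 101]

# Non-ASCII characters expand to several bytes, so sequences get even longer
print(list("é".encode("utf-8")))  # [195, 169]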
Course 2: Tokenization 40
Neural tokenization - CANINE (Clark et al.)
Downsamples the character sequence into one that is 4 times shorter
Course 2: Tokenization 41
Neural tokenization - MANTa (Godey et al.)
Lets the language model learn its own segmentation end-to-end
Course 2: Tokenization 42
Neural tokenization - MEGABYTE (Yu et al.)
Encodes the byte sequence into patches and then decodes autoregressively
Course 2: Tokenization 43
Takeaways
Tokenization: Art of splitting sentences/words into meaningful
smaller units
In ML: subword tokenization is (very) commonly used
Three main algorithms
BPE: iteratively learn most frequent merges
WordPiece: BPE with adjusted frequency score
Unigram: Start big and remove tokens that won't be missed
When facing noisy and/or complex text, alternatives exist
Course 2: Tokenization 44