This is the official codebase for the EMNLP submission titled: Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models.
Here we provide the modified tokenization files for RoBERTa and BART. Inside each file we have explicitly marked the changes we made over the original file, so that you can undo them to switch back to the default tokenization for comparison.
The file tokenization_bart.py is originally present here and the file tokenization_roberta.py is present here. In brief, we have made the following changes to RoBERTa and BART:
- We create a new dictionary for the added vocabulary to efficiently search for the longest substring in the newly added vocab, as explained in tokenization_roberta.py and tokenization_bart.py.
- We add one function and two helper functions to conduct the longest match recursively, based on the implementation of FLOTA. This is described in tokenization_roberta.py and tokenization_bart.py.
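The two changes above can be illustrated with a minimal, self-contained sketch. This is not the repo's actual code: the function names (`build_added_vocab`, `longest_match`, `flota_split`) are hypothetical, and a real tokenizer would fall back to standard BPE for spans with no added-vocab match rather than returning them whole.

```python
def build_added_vocab(tokens):
    """Index newly added tokens by length so that the longest
    candidates are tried first during matching."""
    by_length = {}
    for tok in tokens:
        by_length.setdefault(len(tok), set()).add(tok)
    return by_length


def longest_match(word, by_length):
    """Return (start, token) for the longest added-vocab substring
    of `word`, or None if no added token occurs in it."""
    for length in sorted(by_length, reverse=True):
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in by_length[length]:
                return start, piece
    return None


def flota_split(word, by_length):
    """Recursively split `word` around the longest added-vocab match,
    in the spirit of FLOTA. Spans with no match are returned as-is
    (a real tokenizer would hand them to default BPE instead)."""
    if not word:
        return []
    match = longest_match(word, by_length)
    if match is None:
        return [word]
    start, piece = match
    return (flota_split(word[:start], by_length)
            + [piece]
            + flota_split(word[start + len(piece):], by_length))
```

For example, with an added vocabulary of `{"token", "ization"}`, `flota_split("tokenization", ...)` first carves out the longer match `"ization"`, then recurses on the remaining prefix `"token"`, yielding `["token", "ization"]`.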