This is the official codebase for the EMNLP submission titled: Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models.
Here we provide the modified tokenization files for RoBERTa and BART. Inside each file we have explicitly marked the changes we made over the original file, so that you can undo them to switch back to the default tokenization for comparison.
The file tokenization_bart.py is originally present here and the file tokenization_roberta.py is present here. In brief, we have made the following changes to RoBERTa and BART:
- We create a new dictionary for the added vocabulary to efficiently search for the longest substring in the newly added vocab, as explained in tokenization_roberta.py and tokenization_bart.py.
- We add one function and two helper functions to conduct the longest match recursively, based on the implementation of FLOTA. This is described in tokenization_roberta.py and tokenization_bart.py.
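The two changes above can be illustrated with a minimal, self-contained sketch. This is not the repo's actual code: the function names (`build_added_vocab`, `longest_match`, `flota_split`) are hypothetical, and a real tokenizer would fall back to standard BPE for spans with no added-vocab match rather than returning them whole.

```python
def build_added_vocab(tokens):
    """Index newly added tokens by length so that the longest
    candidates are tried first during matching."""
    by_length = {}
    for tok in tokens:
        by_length.setdefault(len(tok), set()).add(tok)
    return by_length


def longest_match(word, by_length):
    """Return (start, token) for the longest added-vocab substring
    of `word`, or None if no added token occurs in it."""
    for length in sorted(by_length, reverse=True):
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in by_length[length]:
                return start, piece
    return None


def flota_split(word, by_length):
    """Recursively split `word` around the longest added-vocab match,
    in the spirit of FLOTA. Spans with no match are returned as-is
    (a real tokenizer would hand them to default BPE instead)."""
    if not word:
        return []
    match = longest_match(word, by_length)
    if match is None:
        return [word]
    start, piece = match
    return (flota_split(word[:start], by_length)
            + [piece]
            + flota_split(word[start + len(piece):], by_length))
```

For example, with an added vocabulary of `{"token", "ization"}`, `flota_split("tokenization", ...)` first carves out the longer match `"ization"`, then recurses on the remaining prefix `"token"`, yielding `["token", "ization"]`.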