
This is the official codebase for the EMNLP Findings submission titled: Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models.

Here we provide the modified tokenization files specific to the RoBERTa and BART models. Inside each file we have explicitly marked the changes we made over the original file, so that you can undo them and switch back to the default tokenization for comparison.

The files tokenization_bart.py and tokenization_roberta.py are modified versions of the corresponding tokenizer files in the Hugging Face transformers library. In brief, we have made the following changes to RoBERTa and BART:

  1. We create a new dictionary for the added vocabulary, to efficiently search for the longest matching substring in the newly added vocabulary, as explained in tokenization_roberta.py and tokenization_bart.py (a minimal sketch of this lookup appears after this list).

  2. We add one function and two helper functions that conduct the longest match recursively, based on the implementation of FLOTA. This is described in tokenization_roberta.py and tokenization_bart.py (see the second sketch after this list).
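
To illustrate change (1), here is a minimal Python sketch of such a lookup structure. The names `build_added_vocab_index` and `longest_added_substring` are illustrative, not the repository's actual identifiers; the idea is simply to group the added tokens by length so that the search can scan candidate lengths from longest to shortest.

```python
def build_added_vocab_index(added_tokens):
    """Group added-vocabulary tokens by length for fast longest-substring lookup.

    Hypothetical helper, not the repository's exact code: maps token length
    to the set of added tokens of that length.
    """
    index = {}
    for tok in added_tokens:
        index.setdefault(len(tok), set()).add(tok)
    return index


def longest_added_substring(word, index):
    """Return (start, token) for the longest added-vocab substring of `word`,
    or None if no added token occurs inside `word`."""
    for length in sorted(index, reverse=True):  # try longest tokens first
        if length > len(word):
            continue
        candidates = index[length]
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in candidates:
                return start, piece
    return None
```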
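And for change (2), a hedged sketch of a FLOTA-style recursive longest match, reusing the hypothetical `longest_added_substring` helper above. Here `fallback_tokenize` is a stand-in for the model's default BPE routine, not a real transformers API; the actual functions in tokenization_roberta.py and tokenization_bart.py differ in detail.

```python
def flota_style_tokenize(word, index, fallback_tokenize):
    """Recursively split `word` at the longest added-vocab match (FLOTA-style).

    Any remaining span containing no added-vocab token is handed to
    `fallback_tokenize`, the placeholder for default BPE tokenization.
    """
    match = longest_added_substring(word, index)
    if match is None:
        return fallback_tokenize(word) if word else []
    start, piece = match
    prefix, suffix = word[:start], word[start + len(piece):]
    return (flota_style_tokenize(prefix, index, fallback_tokenize)
            + [piece]
            + flota_style_tokenize(suffix, index, fallback_tokenize))


# Example: with added vocabulary {"token", "ization"},
# flota_style_tokenize("tokenization", idx, list) yields ["token", "ization"],
# since the longest match "ization" is split off first and the prefix
# "token" is then matched recursively.
```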
