Code for the ACL 2023 paper "Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation". [paper]
- python version == 3.7.0
- fairseq version == 0.12.2
- pytorch version == 1.13.1
- sacrebleu version == 1.5.1
- admin-torch version == 0.1.0
bash train_ende_big_teacher.sh
bash train_ende_vanilla_word_kd.sh
bash train_ende_word_kd_wo_corr.sh
bash train_ende_word_kd_wo_top1.sh
bash train_ende_student_baseline.sh
bash train_ende_word_kd_topk_info.sh
bash train_ende_tie_kd.sh
We use checkpoint averaging when evaluating all student models (averaging the last 5 checkpoints):
bash average_ckpts.sh
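For readers unfamiliar with checkpoint averaging, the idea can be sketched in a few lines. This is a minimal illustration only, not the contents of `average_ckpts.sh`: it assumes each checkpoint is a plain dict mapping parameter names to lists of floats, and the function name `average_checkpoints` is hypothetical. Real fairseq checkpoints hold tensors and extra metadata.

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of parameters across checkpoints.

    Assumption: every checkpoint is a dict {param_name: [float, ...]}
    with identical keys and shapes (a stand-in for tensor state dicts).
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints have different parameter names")
    n = len(state_dicts)
    averaged = {}
    for name in keys:
        # zip(*...) lines up the i-th value of this parameter in every checkpoint
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


# Toy example with two "checkpoints"
ckpt_a = {"w": [1.0, 3.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 5.0], "b": [2.0]}
avg = average_checkpoints([ckpt_a, ckpt_b])
```

The averaged parameters are then loaded as a single model for decoding.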
Then we use fairseq-interactive to generate translations with the averaged model:
bash interactive_ende.sh
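If you need the plain hypotheses from raw `fairseq-interactive` output, a sketch like the following may help. It assumes the usual tab-separated log format in which hypothesis lines look like `H-<id>\t<score>\t<text>`; the helper name `extract_hypotheses` is hypothetical, and `interactive_ende.sh` may already do this step for you.

```python
def extract_hypotheses(log_lines):
    """Collect hypothesis text from lines of the form 'H-<id>\\t<score>\\t<text>'."""
    hyps = {}
    for line in log_lines:
        if not line.startswith("H-"):
            continue  # skip source (S-), detokenized (D-), and score (P-) lines
        tag, _score, text = line.rstrip("\n").split("\t", 2)
        hyps[int(tag[2:])] = text
    # Return hypotheses in sentence-id order
    return [hyps[i] for i in sorted(hyps)]


# Toy log resembling fairseq-interactive output (values are made up)
sample_log = [
    "S-0\tHello world\n",
    "H-0\t-0.51\tHallo Welt\n",
    "D-0\t-0.51\tHallo Welt\n",
    "S-1\tGood morning\n",
    "H-1\t-0.32\tGuten Morgen\n",
]
hypotheses = extract_hypotheses(sample_log)
```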
We use multi-bleu.perl from mosesdecoder to compute tokenized BLEU scores of the translations. We also report COMET (Unbabel/wmt20-comet-da) as a complementary, model-based metric.
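To make "tokenized BLEU" concrete, here is a simplified sketch of corpus-level BLEU over whitespace-tokenized text. It assumes a single reference per sentence and no smoothing, and the function names are hypothetical. It is meant for intuition only; use multi-bleu.perl for reported numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            h, r = ngram_counts(hyp_tok, n), ngram_counts(ref_tok, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no match (or the output is empty)
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty applies only when the output is shorter than the reference
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


score = corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"])
```

Because the score depends on how the text is tokenized, BLEU values are only comparable when hypotheses and references are tokenized the same way.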
Please cite this paper if you use this repo.
@article{zhang2023towards,
title={Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation},
author={Zhang, Songming and Liang, Yunlong and Wang, Shuaibo and Han, Wenjuan and Liu, Jian and Xu, Jinan and Chen, Yufeng},
journal={arXiv preprint arXiv:2305.08096},
year={2023}
}