NMT-KD

Code for the ACL 2023 paper "Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation". [paper]

Requirements

python version == 3.7.0
fairseq version == 0.12.2
pytorch version == 1.13.1
sacrebleu version == 1.5.1
admin-torch version == 0.1.0
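
To confirm that an environment matches the pins above, a small check like the following may help. It is only a sketch: the distribution names (`fairseq`, `torch`, `sacrebleu`, `admin-torch`) are assumed to be the names the packages are installed under, and `check_versions` is a hypothetical helper that is not part of this repository.

```python
import sys

try:  # Python >= 3.8
    from importlib.metadata import version, PackageNotFoundError
except ImportError:  # Python 3.7: fall back to setuptools' pkg_resources
    from pkg_resources import get_distribution
    from pkg_resources import DistributionNotFound as PackageNotFoundError

    def version(name):
        return get_distribution(name).version

# Versions listed in the Requirements section (distribution names are assumed).
PINNED = {
    "fairseq": "0.12.2",
    "torch": "1.13.1",
    "sacrebleu": "1.5.1",
    "admin-torch": "0.1.0",
}


def check_versions(pinned, lookup=version):
    """Return {package: (expected, found_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu117" on PyTorch wheels.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    print("python", sys.version.split()[0], "(README pins 3.7.0)")
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result from `check_versions(PINNED)` means every pinned package was found at the listed version.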

Analysis Experiments (Take WMT14 En-De as an example)

Step 0: Train a teacher model (Transformer-big, 300k steps)

bash train_ende_big_teacher.sh

Step 1: Removing information from word-level KD

(a) Vanilla word-level KD

bash train_ende_vanilla_word_kd.sh
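
For orientation, word-level KD is commonly formulated as minimizing, at every target position, the KL divergence from the teacher's output distribution to the student's. The pure-Python sketch below illustrates that general objective on toy logits. It is not the loss implemented in `fairseq-kd`, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_kd_loss(teacher_logits, student_logits):
    """Mean over target positions of KL(teacher || student).

    Both arguments are lists of per-position logit lists with the same shape
    (positions x vocabulary).
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # teacher distribution at this position
        q = softmax(s_row)  # student distribution at this position
        total += sum(
            pi * (math.log(pi) - math.log(qi))
            for pi, qi in zip(p, q)
            if pi > 0.0
        )
    return total / len(teacher_logits)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
    student = [[1.0, 1.0, 0.0], [0.0, 1.5, 0.5]]
    print(f"KD loss: {word_level_kd_loss(teacher, student):.4f}")
```

The loss is zero when the student reproduces the teacher's distribution at every position and positive otherwise.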

(b) Removing the correlation information

bash train_ende_word_kd_wo_corr.sh

(c) Removing the top-1 information

bash train_ende_word_kd_wo_top1.sh

(d) Student baseline (no KD)

bash train_ende_student_baseline.sh

Step 2: Expand top-1 to top-k information

bash train_ende_word_kd_topk_info.sh

Method: Top-1 Information Enhanced KD (TIE-KD)

bash train_ende_tie_kd.sh

Evaluation

Model Averaging

We average the last 5 checkpoints of every student model before evaluation:

bash average_ckpts.sh
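
Conceptually, checkpoint averaging takes the element-wise mean of each parameter across the selected checkpoints. The sketch below shows this on plain Python dictionaries of float lists. The contents of `average_ckpts.sh` are not shown here, so this is an illustration of the idea rather than what the script does, and `average_checkpoints` is a hypothetical name.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts that share the same keys.

    Each checkpoint maps a parameter name to a flat list of floats.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    keys = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != keys:
            raise KeyError("checkpoints do not share the same parameter names")
    count = len(checkpoints)
    averaged = {}
    for key in keys:
        # zip(*...) walks the same index of this parameter in every checkpoint.
        columns = zip(*(ckpt[key] for ckpt in checkpoints))
        averaged[key] = [sum(values) / count for values in columns]
    return averaged


if __name__ == "__main__":
    last_two = [
        {"w": [1.0, 3.0], "b": [0.0]},
        {"w": [3.0, 5.0], "b": [1.0]},
    ]
    print(average_checkpoints(last_two))
```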

Generation

Then we use fairseq-interactive to generate translations with the averaged model:

bash interactive_ende.sh

We calculate tokenized BLEU scores of the translations with multi-bleu.perl from mosesdecoder. We also report COMET (Unbabel/wmt20-comet-da) for a more convincing evaluation.
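
As a reminder of what tokenized BLEU measures, the sketch below computes corpus-level BLEU on whitespace-tokenized text with a single reference per sentence: the geometric mean of clipped 1- to 4-gram precisions multiplied by a brevity penalty. It is a simplified stand-in for illustration only. Reported scores should still come from multi-bleu.perl, and `corpus_bleu` is a hypothetical helper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) for already tokenized sentences."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            totals[n - 1] += sum(h_counts.values())
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on the mat"]
    refs = ["the cat sat on a mat"]
    print(f"BLEU = {corpus_bleu(hyps, refs):.2f}")
```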

Citation

Please cite this paper if you use this repo.

@article{zhang2023towards,
  title={Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation},
  author={Zhang, Songming and Liang, Yunlong and Wang, Shuaibo and Han, Wenjuan and Liu, Jian and Xu, Jinan and Chen, Yufeng},
  journal={arXiv preprint arXiv:2305.08096},
  year={2023}
}
