Sequence-Level Knowledge Distillation

EMNLP 2016 · Yoon Kim · Alexander M. Rush

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
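The word-level variant mentioned above trains the student against the teacher's per-token output distribution rather than the one-hot reference. A minimal sketch of that loss, assuming teacher and student probabilities over the vocabulary are already available for each target position (the function name and plain-list representation are illustrative, not from the paper):

```python
import math

def word_kd_loss(teacher_probs, student_probs):
    """Word-level knowledge distillation loss (illustrative sketch):
    cross-entropy between the teacher's soft distribution and the
    student's distribution, summed over target positions.

    teacher_probs, student_probs: lists of per-token probability
    vectors over the vocabulary, one vector per target position.
    """
    loss = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        # Cross-entropy H(teacher, student) for this token position.
        loss -= sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
    return loss

# When the student matches the teacher exactly, the loss reduces to
# the teacher's entropy (log 2 for a uniform two-way distribution).
loss = word_kd_loss([[0.5, 0.5]], [[0.5, 0.5]])
```

The sequence-level versions replace these per-token soft targets with full output sequences decoded from the teacher (e.g., its beam-search output), on which the student is then trained with ordinary cross-entropy.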



| Task | Dataset | Model | Metric | Value | Global rank |
|---|---|---|---|---|---|
| Machine Translation | IWSLT2015 Thai-English | Seq-KD + Seq-Inter + Word-KD | BLEU score | 14.2 | #1 |
| Machine Translation | WMT2014 English-German | Seq-KD + Seq-Inter + Word-KD | BLEU score | 18.5 | #26 |