Proactive Sequence Generator via Knowledge Acquisition
Sequence-to-sequence models such as transformers, now used across a wide variety of NLP tasks, typically need very high capacity to perform well. In production, however, memory size and inference speed are both strictly constrained. To address this problem, Knowledge Distillation (KD), a technique for training small models to mimic larger pre-trained models, has drawn a lot of attention. KD essentially attempts to maximize recall, i.e., to rank the teacher model's top-k tokens as high as possible, whereas precision is more important for sequence generation because of exposure bias. Motivated by this, we develop Knowledge Acquisition (KA), in which student models receive log q(y_t|y_{<t},x) as a reward when producing the next token y_t given the previous tokens y_{<t} and the source sentence x. We demonstrate the effectiveness of our approach on the WMT’17 De-En and IWSLT’15 Th-En translation tasks, where it gains 0.7 to 1.1 BLEU over token-level knowledge distillation.
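The contrast between the two objectives can be sketched for a single decoding step. This is a minimal illustration rather than the authors' implementation: it assumes q is the teacher's next-token distribution and p the student's, and it assumes the KA reward is averaged over the student's own distribution, which the abstract does not state explicitly. The function names `kd_loss` and `ka_loss` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kd_loss(student_logits, teacher_logits):
    """Token-level KD: cross-entropy -sum_y q(y) * log p(y).

    Weighted by the teacher q, so the student is pushed to cover
    every token the teacher rates highly (recall-oriented).
    """
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return -sum(math.exp(lq) * lp for lq, lp in zip(log_q, log_p))


def ka_loss(student_logits, teacher_logits):
    """Assumed KA objective: negative expected reward -sum_y p(y) * log q(y).

    Weighted by the student p, so the student is penalized for putting
    mass on tokens the teacher considers unlikely (precision-oriented).
    """
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return -sum(math.exp(lp) * lq for lp, lq in zip(log_p, log_q))


# Toy three-token vocabulary; the teacher strongly rejects the last token.
teacher = [5.0, 0.0, -10.0]
student_spread = [5.0, 0.0, 0.0]     # leaks mass onto the rejected token
student_peaky = [5.0, -10.0, -10.0]  # concentrates on the teacher's top token

print("KA loss, spread student:", ka_loss(student_spread, teacher))
print("KA loss, peaky student: ", ka_loss(student_peaky, teacher))
```

Under these assumptions, the student that leaks probability onto the token the teacher rejects incurs the larger KA loss, which is the precision-oriented behavior the abstract describes. When student and teacher agree exactly, both losses reduce to the entropy of the shared distribution.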