Block Skim Transformer for Efficient Question Answering

1 Jan 2021  ·  Yue Guan, Jingwen Leng, Yuhao Zhu, Minyi Guo

Transformer-based encoder models have achieved promising results on natural language processing (NLP) tasks including question answering (QA). Different from sequence classification or language modeling tasks, hidden states at all positions are used for the final classification in QA. However, we do not always need the full context to answer the posed question. Following this idea, we propose Block Skim Transformer (BST) to improve and accelerate the processing of transformer QA models. The key idea of BST is to identify the context blocks that must be further processed and the blocks that can be safely rejected early during inference. Critically, we learn such information from the self-attention weights. As a result, the model's hidden states are pruned along the sequence dimension, achieving significant inference speedup. We also show that this extra training objective improves model performance. As a plug-in for transformer-based QA models, BST is compatible with other model compression methods without changing existing network architectures. BST improves QA model performance on different datasets and achieves a $1.6\times$ speedup on the $\text{BERT}_{\text{large}}$ model.
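For illustration, below is a minimal PyTorch sketch of the block-skimming idea described in the abstract: per-block relevance scores are predicted from a layer's self-attention weights, and low-scoring context blocks are dropped along the sequence dimension. The module and parameter names (`BlockSkimGate`, `block_size`, `keep_threshold`) and the simple linear scorer are assumptions for exposition, not the paper's exact architecture.

```python
# A minimal sketch of block skimming, assuming a simple attention-based scorer.
# Not the authors' implementation; names and the scorer design are illustrative.
import torch
import torch.nn as nn


class BlockSkimGate(nn.Module):
    """Scores fixed-size context blocks from self-attention weights and
    builds a mask that prunes low-scoring blocks along the sequence dimension."""

    def __init__(self, num_heads: int, block_size: int = 32):
        super().__init__()
        self.block_size = block_size
        # Hypothetical scorer: maps per-head attention statistics to a block score.
        self.scorer = nn.Linear(num_heads, 1)

    def forward(self, hidden_states, attn_weights, keep_threshold: float = 0.5):
        # hidden_states: (batch, seq_len, hidden)
        # attn_weights:  (batch, num_heads, seq_len, seq_len), seq_len divisible by block_size
        bsz, seq_len, _ = hidden_states.shape
        num_blocks = seq_len // self.block_size

        # Attention mass each token receives from all queries, per head.
        received = attn_weights.sum(dim=2)                       # (batch, heads, seq_len)
        received = received.view(bsz, -1, num_blocks, self.block_size)
        block_feats = received.mean(dim=-1)                      # (batch, heads, num_blocks)

        # Per-block "keep" probability from the learned scorer.
        keep_prob = torch.sigmoid(
            self.scorer(block_feats.transpose(1, 2))             # (batch, num_blocks, 1)
        ).squeeze(-1)                                            # (batch, num_blocks)

        # Expand the block decision to a token-level mask for pruning hidden states.
        keep_blocks = keep_prob > keep_threshold                 # (batch, num_blocks)
        token_mask = keep_blocks.repeat_interleave(self.block_size, dim=1)
        return keep_prob, token_mask
```

In training, `keep_prob` would be supervised by an auxiliary objective (e.g., whether a block contains the answer span), and at inference the `token_mask` is used to drop rejected blocks before later transformer layers, which is where the speedup comes from.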
