ERNIE-SPARSE: Robust Efficient Transformer Through Hierarchically Unifying Isolated Information

29 Sep 2021  ·  Yang Liu, Jiaxiang Liu, Yuxiang Lu, Shikun Feng, Yu Sun, Zhida Feng, Li Chen, Hao Tian, Hua Wu, Haifeng Wang

The Sparse Transformer has recently attracted a lot of attention due to its ability to reduce the quadratic dependency on sequence length. In this paper, we argue that two factors can affect the robustness of the Sparse Transformer and cause performance degradation. The first is information bottleneck sensitivity, caused by a key feature of the Sparse Transformer: only a small number of global tokens can attend to all other tokens. The second is sparse pattern sensitivity, caused by the different token connections formed under different sparse patterns. To address these issues, we propose a well-designed model named ERNIE-SPARSE. It consists of two distinct parts: (i) a Hierarchical Sparse Transformer (HST) mechanism, which introduces special tokens to sequentially model local and global information. This method is not affected by bottleneck size and improves model robustness and performance. (ii) A Sparse-Attention-Oriented Regularization (SAOR) method, the first robust training method designed for the Sparse Transformer, which increases model robustness by forcing the output distributions of transformers with different sparse patterns to be consistent with each other. To evaluate the effectiveness of ERNIE-SPARSE, we perform extensive experiments. First, we evaluate it on Long Range Arena (LRA), a multi-modal long-sequence modeling benchmark. Experimental results demonstrate that ERNIE-SPARSE significantly outperforms a variety of strong baselines, including dense attention and other efficient sparse attention methods, achieving an improvement of 2.77% (57.78% vs. 55.01%). Second, to further show the effectiveness of our method, we pretrain ERNIE-SPARSE and verify it on 3 text classification and 2 QA downstream tasks, achieving improvements of 0.83% on the classification benchmark (92.46% vs. 91.63%) and 3.27% on the QA benchmark (74.7% vs. 71.43%). These results further demonstrate its superior performance.
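
As a concrete illustration of the SAOR idea, the sketch below implements a consistency objective between two forward passes of the same model under different sparse attention patterns, adding a symmetric KL term to the usual task loss. The `model(inputs, sparse_pattern=...)` interface, the `saor_loss` helper, and the weighting factor `alpha` are illustrative assumptions, not the paper's actual API or loss definition.

```python
# Minimal sketch of SAOR-style consistency regularization.
# Assumptions: `model` accepts a `sparse_pattern` keyword argument and
# returns classification logits; the paper's exact loss form may differ.
import torch.nn.functional as F


def saor_loss(model, inputs, labels, pattern_a, pattern_b, alpha=1.0):
    """Task loss plus a symmetric KL term that pushes the output
    distributions under two different sparse patterns to agree."""
    logits_a = model(inputs, sparse_pattern=pattern_a)
    logits_b = model(inputs, sparse_pattern=pattern_b)

    # Supervised cross-entropy, averaged over both sparse patterns.
    ce = 0.5 * (F.cross_entropy(logits_a, labels) +
                F.cross_entropy(logits_b, labels))

    # Symmetric KL divergence between the two output distributions.
    log_p_a = F.log_softmax(logits_a, dim=-1)
    log_p_b = F.log_softmax(logits_b, dim=-1)
    kl = 0.5 * (F.kl_div(log_p_a, log_p_b, log_target=True, reduction="batchmean") +
                F.kl_div(log_p_b, log_p_a, log_target=True, reduction="batchmean"))

    return ce + alpha * kl
```

In this sketch, gradients flow through both forward passes, so both sparse patterns are trained toward a shared output distribution; the abstract describes the consistency objective but not the specific divergence used, so the symmetric KL here is an assumed choice.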
