Predictive Attention Transformer: Improving Transformer with Attention Map Prediction

1 Jan 2021 · Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Yunhai Tong

The Transformer is a ubiquitous model for natural language processing and has also attracted wide attention in other domains such as computer vision. The self-attention maps, learned independently for each layer, are indispensable for a transformer model to encode the dependencies among input tokens; however, learning them effectively remains a challenging problem. In this paper, we address this problem and propose a novel approach to improve self-attention through supplementary prediction modules. The underlying assumption is that the attention structures in the current layer should not be completely independent of those in the previous layer. Instead, we model their dependencies via a chain of prediction models that take the previous layer's attention maps as input and predict the attention maps of the new layer through convolutional neural networks. Specifically, we propose the Predictive Attention Transformer and obtain significant performance gains on a variety of tasks on top of multiple state-of-the-art models. On the GLUE benchmark, the average scores of BERT-Base and BERT-Large are lifted by 4.1 and 2.5 points, respectively. For machine translation, it consistently improves the BLEU score of a vanilla Transformer on the IWSLT'14 De-En dataset across different model sizes. For ImageNet classification, we achieve a significant improvement over a strong backbone model with comparable capacity.
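
The following is a minimal PyTorch sketch of the idea described in the abstract: the previous layer's attention maps are treated as a stack of 2D feature maps (one channel per head), a small convolution predicts the current layer's maps, and the prediction is blended with the attention computed from the current layer's queries and keys. The module name `PredictiveAttention`, the single-layer convolutional predictor, and the learned scalar gate `alpha` are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn


class PredictiveAttention(nn.Module):
    """Sketch of attention-map prediction between transformer layers.

    The previous layer's attention maps (batch, heads, seq, seq) are viewed
    as a multi-channel image; a 2D convolution over the head channels
    predicts the current layer's maps, which are then mixed with the
    attention computed from the current layer's queries and keys.
    """

    def __init__(self, d_model, n_heads, kernel_size=3):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Convolutional predictor: previous attention maps -> predicted maps.
        self.predict = nn.Conv2d(n_heads, n_heads, kernel_size,
                                 padding=kernel_size // 2)
        # Learned mixing weight between computed and predicted attention
        # (assumed scalar gate; the paper may parameterize this differently).
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x, prev_attn=None):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        # Standard scaled dot-product attention for the current layer.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)  # (b, heads, t, t)

        if prev_attn is not None:
            # Predict this layer's attention from the previous layer's maps.
            pred = self.predict(prev_attn).softmax(dim=-1)
            gate = torch.sigmoid(self.alpha)
            attn = gate * attn + (1 - gate) * pred

        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out), attn  # attn feeds the next layer's prediction
```

In a stack of such layers, the `attn` returned by one layer would be passed as `prev_attn` to the next, forming the chain of prediction models the abstract describes.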

