Analyzing the Implicit Position Encoding Ability of Transformer Decoder

29 Sep 2021  ·  Ziyang Luo, Yadong Xi, Jing Ma, Xiaoxi Mao, Changjie Fan ·

A common limitation of the Transformer Encoder's self-attention mechanism is that it cannot automatically capture word order information, so explicit position encodings must be fed into the model. The Transformer Decoder, on the other hand, is naturally sensitive to word order because of its auto-regressive attention masks. In this work, based on an analysis of the implicit position encoding power of the Transformer Decoder, we derive the condition that at least two layers are required for the Decoder to encode word positions. To examine the correlation between the Decoder's implicit position encoding and the Encoder's explicit position encodings, we conduct extensive experiments on two large Wikipedia datasets; the results demonstrate that all kinds of explicit position encoding mechanisms improve the performance of the Decoder, but the gap for learnable position embeddings is smaller than for the others. To make use of this implicit position encoding ability, we propose a new model, called DecBERT, and fine-tune it on the GLUE benchmark. Experimental results show that (1) the implicit position encoding ability is strong enough to enhance language modeling and perform well on downstream tasks, and (2) our model accelerates pre-training and achieves better performance than the baseline systems when pre-trained with the same amount of computational resources.
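To illustrate the mechanism the abstract refers to, below is a minimal sketch (not from the paper) of auto-regressive masked self-attention in PyTorch; the function names and tensor shapes are illustrative assumptions. The lower-triangular mask gives each position a different attention context (its own prefix), which is the intuition behind the Decoder being order-sensitive even without explicit position encodings.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular boolean mask: position i may only attend to positions j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def masked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    # Scaled dot-product attention; disallowed positions are set to -inf before softmax.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(5, 16)  # 5 tokens, hidden size 16 (illustrative sizes)
out = masked_attention(x, x, x, causal_mask(5))
print(out.shape)        # torch.Size([5, 16])
```

Unlike fully visible self-attention, the output at position i only mixes tokens 0..i, so permuting the input sequence changes the outputs in a position-dependent way.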
