Moreover, in contrast to RNNs, the Transformer model is not computationally universal, limiting its theoretical expressivity. In this paper we propose the Universal Transformer, which addresses these practical and theoretical shortcomings, and we show that it leads to improved performance on several tasks. We further employ an adaptive computation time (ACT) mechanism to allow the model to dynamically adjust the number of times the representation of each position in a sequence is revised.
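The per-position dynamic halting described above can be sketched as follows. This is a minimal NumPy illustration of ACT-style halting, not the authors' implementation: `halt_prob_fn` and `step_fn` are hypothetical stand-ins for the model's learned halting unit and its recurrent transition function, and the halting rule (stop a position once its cumulative halting probability would exceed `1 - threshold`, using the remainder as that position's final mixture weight) follows the general ACT scheme.

```python
import numpy as np

def act_halting(states, halt_prob_fn, step_fn, max_steps=10, threshold=0.01):
    """ACT-style dynamic halting: each position is revised until its
    cumulative halting probability exceeds 1 - threshold; the output is
    a halting-probability-weighted mixture of the revised states."""
    n = states.shape[0]
    halting = np.zeros(n)            # cumulative halting probability per position
    remainders = np.zeros(n)         # leftover probability mass at the halting step
    n_updates = np.zeros(n)          # how many times each position was revised
    out = np.zeros_like(states)
    still_running = np.ones(n, dtype=bool)
    for _ in range(max_steps):
        p = halt_prob_fn(states)     # per-position halting probability this step
        # positions whose cumulative probability would cross the threshold halt now
        new_halted = still_running & (halting + p > 1 - threshold)
        still_running_next = still_running & ~new_halted
        remainders[new_halted] = 1 - halting[new_halted]
        halting[still_running_next] += p[still_running_next]
        n_updates[still_running] += 1
        # newly halted positions contribute their remainder; running ones contribute p
        w = np.where(new_halted, remainders,
                     np.where(still_running_next, p, 0.0))
        states = step_fn(states)     # one more revision of every position
        out = out + w[:, None] * states
        still_running = still_running_next
        if not still_running.any():
            break
    return out, n_updates
```

With a constant halting probability of 0.6 and an identity transition, every position halts after two steps with mixture weights 0.6 and 0.4, so the output equals the (unchanged) state.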
| Task | Dataset | Model | Metric | Value | Global rank |
| --- | --- | --- | --- | --- | --- |
| Machine Translation | WMT 2014 EN-DE | Universal Transformer base | BLEU | 28.9 | #1 |
| Machine Translation | WMT 2014 English-German | Universal Transformer base | BLEU | 28.9 | #10 |