RealFormer: Transformer Likes Residual Attention

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at
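The core idea behind RealFormer is to add a residual "skip edge" on the attention scores themselves: each layer's raw, pre-softmax attention logits are added to those of the previous layer before the softmax is applied. The following is a minimal NumPy sketch of that idea for a single attention head; the function names and shapes are illustrative, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """Single-head attention with RealFormer-style residual attention:
    the previous layer's raw (pre-softmax) scores are added to this
    layer's scores before the softmax. Illustrative sketch only."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)  # (seq, seq) logits
    if prev_scores is not None:
        scores = scores + prev_scores              # residual skip edge
    probs = softmax(scores, axis=-1)
    return probs @ v, scores                       # thread scores onward

# Toy usage: chain two layers, passing the raw scores through.
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((4, 8))
out1, s1 = residual_attention(q, k, v)
out2, s2 = residual_attention(q, k, v, prev_scores=s1)
```

In a full model, each encoder layer would compute its own queries and keys from its input; the only change relative to a canonical Transformer layer is the extra addition of `prev_scores`, which introduces no new parameters.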

Findings (ACL) 2021
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | RealFormer | Accuracy | 59.83% | # 24 |
| Semantic Textual Similarity | MRPC | RealFormer | Accuracy | 87.01% | # 26 |
| Semantic Textual Similarity | MRPC | RealFormer | F1 | 90.91% | # 9 |
| Natural Language Inference | MultiNLI | RealFormer | Matched | 86.28 | # 21 |
| Natural Language Inference | MultiNLI | RealFormer | Mismatched | 86.34 | # 13 |
| Natural Language Inference | QNLI | RealFormer | Accuracy | 91.89% | # 24 |
| Natural Language Inference | RTE | RealFormer | Accuracy | 73.65% | # 30 |
| Sentiment Analysis | SST-2 Binary classification | RealFormer | Accuracy | 94.04 | # 33 |
| Semantic Textual Similarity | STS Benchmark | RealFormer | Pearson Correlation | 0.9011 | # 16 |
| Semantic Textual Similarity | STS Benchmark | RealFormer | Spearman Correlation | 0.8988 | # 5 |