RealFormer: Transformer Likes Residual Attention

21 Dec 2020 · Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple Residual Attention Layer Transformer architecture that significantly outperforms canonical Transformers on a spectrum of tasks including Masked Language Modeling, GLUE, and SQuAD... Qualitatively, RealFormer is easy to implement and requires minimal hyper-parameter tuning. It also stabilizes training and leads to models with sparser attention. Code will be open-sourced upon paper acceptance.
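As a rough sketch of the residual-attention idea described in the abstract: each attention layer adds the previous layer's pre-softmax attention scores to its own before applying the softmax, forming a "skip edge" over attention scores across layers. The single-head, single-example setup and names such as `residual_attention` and `prev_scores` below are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """Single-head scaled dot-product attention with a residual score edge.

    q, k, v: arrays of shape (seq_len, d_head).
    prev_scores: raw (pre-softmax) scores from the previous layer, or None.
    Returns the attention output and this layer's raw scores, which the
    next layer receives as its residual input.
    """
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)      # (seq_len, seq_len)
    if prev_scores is not None:
        scores = scores + prev_scores       # residual "skip edge" on scores
    weights = softmax(scores, axis=-1)
    return weights @ v, scores

# Usage: thread the raw scores through a toy stack of layers.
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))                # toy (seq_len=8, d_head=16) input
prev = None
for _ in range(3):                          # three toy "layers"
    h, prev = residual_attention(h, h, h, prev_scores=prev)
```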

Task | Dataset | Model | Metric | Value | Global Rank
Linguistic Acceptability | CoLA | RealFormer | Accuracy | 59.83% | #16
Semantic Textual Similarity | MRPC | RealFormer | Accuracy | 87.01% | #19
Semantic Textual Similarity | MRPC | RealFormer | F1 | 90.91% | #7
Natural Language Inference | MultiNLI | RealFormer | Matched | 86.28 | #18
Natural Language Inference | MultiNLI | RealFormer | Mismatched | 86.34 | #13
Natural Language Inference | QNLI | RealFormer | Accuracy | 91.89% | #18
Paraphrase Identification | Quora Question Pairs | RealFormer | Accuracy | 91.34 | #1
Paraphrase Identification | Quora Question Pairs | RealFormer | F1 | 88.28 | #2
Natural Language Inference | RTE | RealFormer | Accuracy | 73.65% | #17
Sentiment Analysis | SST-2 Binary classification | RealFormer | Accuracy | 94.04 | #25
Semantic Textual Similarity | STS Benchmark | RealFormer | Pearson Correlation | 0.9011 | #14
Semantic Textual Similarity | STS Benchmark | RealFormer | Spearman Correlation | 0.8988 | #4