DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper, we propose a new model architecture, DeBERTa (Decoding-enhanced BERT with disentangled attention), that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%), and on RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a clear margin (90.3 versus 89.8).
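The disentangled attention described above can be illustrated with a minimal sketch, not the official implementation: each token has a content vector, relative positions have their own embeddings, and the attention score is the sum of content-to-content, content-to-position, and position-to-content terms. The toy sizes, weight names, and clipping helper below are illustrative assumptions.

```python
import numpy as np

# Minimal single-head sketch of disentangled attention (illustrative only).
# Score(i, j) = content-to-content + content-to-position + position-to-content.

rng = np.random.default_rng(0)
seq_len, d, k = 4, 8, 3  # hypothetical toy sizes; k = max relative distance

H = rng.normal(size=(seq_len, d))    # content vectors, one per token
P = rng.normal(size=(2 * k, d))      # relative-position embeddings

Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))    # content projections
Wqr, Wkr = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # position projections

def rel_index(i, j):
    # Clipped relative distance delta(i, j) mapped into [0, 2k)
    return int(np.clip(i - j, -k, k - 1) + k)

Qc, Kc = H @ Wq, H @ Wk    # content queries / keys
Qr, Kr = P @ Wqr, P @ Wkr  # position queries / keys

A = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        c2c = Qc[i] @ Kc[j]                  # content-to-content
        c2p = Qc[i] @ Kr[rel_index(i, j)]    # content-to-position
        p2c = Kc[j] @ Qr[rel_index(j, i)]    # position-to-content
        A[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)  # scale over 3 terms

# Softmax over keys, then weighted sum of (projected) content vectors
attn = np.exp(A - A.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ (H @ rng.normal(size=(d, d)))
```

Note the scaling factor uses 3d rather than d, reflecting that three score terms are summed; the enhanced mask decoder then injects absolute positions separately, just before the masked-token prediction layer.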

PDF Abstract (ICLR 2021)
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | BoolQ | DeBERTa-1.5B | Accuracy | 90.4 | # 5 |
| Linguistic Acceptability | CoLA Dev | DeBERTa (large) | Accuracy | 69.5 | # 3 |
| Natural Language Inference | CommitmentBank | DeBERTa-1.5B | F1 | 94.9 | # 2 |
| Natural Language Inference | CommitmentBank | DeBERTa-1.5B | Accuracy | 97.2 | # 2 |
| Question Answering | COPA | DeBERTa-Ensemble | Accuracy | 98.4 | # 2 |
| Question Answering | COPA | DeBERTa-1.5B | Accuracy | 96.8 | # 3 |
| Natural Language Inference | MRPC Dev | DeBERTa (large) | Accuracy | 92.5 | # 1 |
| Natural Language Inference | MultiNLI | DeBERTa (large) | Matched | 91.1 | # 5 |
| Natural Language Inference | MultiNLI | DeBERTa (large) | Mismatched | 91.1 | # 4 |
| Question Answering | MultiRC | DeBERTa-1.5B | F1 | 88.2 | # 2 |
| Question Answering | MultiRC | DeBERTa-1.5B | EM | 63.7 | # 2 |
| Natural Language Inference | QNLI | DeBERTa (large) | Accuracy | 95.3% | # 10 |
| Question Answering | Quora Question Pairs | DeBERTa (large) | Accuracy | 92.3% | # 1 |
| Reading Comprehension | RACE | DeBERTa-large | Accuracy | 86.8 | # 5 |
| Common Sense Reasoning | ReCoRD | DeBERTa-1.5B | F1 | 94.5 | # 2 |
| Common Sense Reasoning | ReCoRD | DeBERTa-1.5B | EM | 94.1 | # 1 |
| Natural Language Inference | RTE | DeBERTa-1.5B | Accuracy | 93.2% | # 2 |
| Question Answering | SQuAD 2.0 | DeBERTa-large | EM | 88.0 | # 73 |
| Question Answering | SQuAD 2.0 | DeBERTa-large | F1 | 90.7 | # 78 |
| Sentiment Analysis | SST-2 Binary classification | DeBERTa (large) | Accuracy | 96.5 | # 14 |
| Semantic Textual Similarity | STS Benchmark | DeBERTa (large) | Accuracy | 92.5 | # 1 |
| Common Sense Reasoning | SWAG | DeBERTa-large | Test | 90.8 | # 1 |
| Coreference Resolution | Winograd Schema Challenge | DeBERTa-1.5B | Accuracy | 95.9 | # 1 |
| Natural Language Inference | WNLI | DeBERTa | Accuracy | 94.5% | # 1 |
| Word Sense Disambiguation | Words in Context | DeBERTa-1.5B | Accuracy | 76.4 | # 5 |
| Word Sense Disambiguation | Words in Context | DeBERTa-Ensemble | Accuracy | 77.5 | # 3 |