DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture, DeBERTa (Decoding-enhanced BERT with disentangled attention), that improves the BERT and RoBERTa models using two novel techniques: a disentangled attention mechanism and an enhanced mask decoder.
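The first technique can be sketched compactly. Below is a minimal single-head NumPy sketch of the disentangled attention score, which sums content-to-content, content-to-position, and position-to-content terms and scales by 1/sqrt(3d) as described in the paper. The variable names are illustrative, and the omissions (no masking, no multiple heads, no value projection or output layer) are simplifications of mine, not the paper's implementation.

```python
import numpy as np

def disentangled_attention_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k):
    """Disentangled attention scores (single head, no masking).

    H: (n, d) content vectors for n tokens.
    P: (2k, d) relative-position embeddings for distances in [-k, k).
    Wq_c, Wk_c, Wq_r, Wk_r: (d, d) projection matrices (illustrative names).
    Returns an (n, n) matrix of unnormalized attention scores.
    """
    n, d = H.shape
    Qc, Kc = H @ Wq_c, H @ Wk_c   # content query/key projections
    Qr, Kr = P @ Wq_r, P @ Wk_r   # relative-position query/key projections

    # delta(i, j): relative distance i - j, shifted and clamped into [0, 2k)
    idx = np.arange(n)
    delta = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

    c2c = Qc @ Kc.T                                           # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)        # content-to-position
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T      # position-to-content

    return (c2c + c2p + p2c) / np.sqrt(3 * d)

# Toy usage with random weights (shapes only; not trained parameters):
rng = np.random.default_rng(0)
n, d, k = 8, 16, 4
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
Wq_c, Wk_c, Wq_r, Wk_r = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
scores = disentangled_attention_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k)
scores -= scores.max(axis=-1, keepdims=True)                   # numerical stability
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
```

The key design point is that each token carries two representations, content and relative position, and the cross terms let the model weight "what" and "where" separately rather than adding positional information into the content embedding once at the input.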

Published at ICLR 2021.

Results from the Paper


Task | Dataset | Model | Metric | Value | Global Rank
Question Answering | BoolQ | DeBERTa-1.5B | Accuracy | 90.4 | #2
Linguistic Acceptability | CoLA Dev | DeBERTa (large) | Accuracy | 69.5 | #1
Natural Language Inference | CommitmentBank | DeBERTa-1.5B | F1 | 94.9 | #1
Natural Language Inference | CommitmentBank | DeBERTa-1.5B | Accuracy | 97.2 | #1
Named Entity Recognition | CoNLL 2003 NER dev | DeBERTa (large) | F1 | 93.8 | #1
Question Answering | COPA | DeBERTa-Ensemble | Accuracy | 98.4 | #1
Question Answering | COPA | DeBERTa-1.5B | Accuracy | 96.8 | #2
Natural Language Inference | MRPC Dev | DeBERTa (large) | Accuracy | 92.5 | #1
Natural Language Inference | MultiNLI | DeBERTa (large) | Matched Accuracy | 91.1 | #4
Natural Language Inference | MultiNLI | DeBERTa (large) | Mismatched Accuracy | 91.1 | #3
Question Answering | MultiRC | DeBERTa-1.5B | F1a | 88.2 | #1
Question Answering | MultiRC | DeBERTa-1.5B | EM | 63.7 | #1
Natural Language Inference | QNLI | DeBERTa (large) | Accuracy | 95.3% | #6
Question Answering | Quora Question Pairs | DeBERTa (large) | Accuracy | 92.3% | #1
Reading Comprehension | RACE | DeBERTa (large) | Accuracy | 86.8 | #5
Common Sense Reasoning | ReCoRD | DeBERTa-1.5B | F1 | 94.5 | #1
Common Sense Reasoning | ReCoRD | DeBERTa-1.5B | Accuracy | 94.1 | #1
Natural Language Inference | RTE | DeBERTa-1.5B | Accuracy | 93.2% | #1
Question Answering | SQuAD 1.1 dev | DeBERTa (large) | EM | 90.1 | #1
Question Answering | SQuAD 1.1 dev | DeBERTa (large) | F1 | 95.5 | #3
Question Answering | SQuAD 2.0 | DeBERTa (large) | EM | 88.0 | #52
Question Answering | SQuAD 2.0 | DeBERTa (large) | F1 | 90.7 | #62
Sentiment Analysis | SST-2 Binary Classification | DeBERTa (large) | Accuracy | 96.5 | #8
Semantic Textual Similarity | STS Benchmark | DeBERTa (large) | Accuracy | 92.5 | #1
Common Sense Reasoning | SWAG | DeBERTa (large) | Test Accuracy | 90.8 | #1
Coreference Resolution | Winograd Schema Challenge | DeBERTa-1.5B | Accuracy | 95.9 | #1
Natural Language Inference | WNLI | DeBERTa | Accuracy | 94.5% | #1
Word Sense Disambiguation | Words in Context | DeBERTa-Ensemble | Accuracy | 77.5 | #2
Word Sense Disambiguation | Words in Context | DeBERTa-1.5B | Accuracy | 76.4 | #4

Methods used in the Paper


Method | Type
BPE | Subword Segmentation
SentencePiece | Tokenizers
GLU | Activation Functions
Adafactor | Stochastic Optimization
Inverse Square Root Schedule | Learning Rate Schedules
T5 | Transformers
Weight Decay | Regularization
Adam | Stochastic Optimization
Multi-Head Attention | Attention Modules
Dropout | Regularization
GELU | Activation Functions
Attention Dropout | Regularization
Linear Warmup With Linear Decay | Learning Rate Schedules
Dense Connections | Feedforward Networks
Layer Normalization | Normalization
Scaled Dot-Product Attention | Attention Mechanisms
WordPiece | Subword Segmentation
Residual Connection | Skip Connections
BERT | Language Models
RoBERTa | Transformers
Softmax | Output Functions
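As an illustration of one entry in this list, here is a hedged sketch of a "Linear Warmup With Linear Decay" learning-rate schedule. The function and parameter names are mine, and the values are placeholders; the paper's actual training hyperparameters are not reproduced here.

```python
def linear_warmup_linear_decay(step, warmup_steps, total_steps, peak_lr):
    """Ramp the learning rate linearly to peak_lr, then decay linearly to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)                  # warmup phase
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)   # decay phase

# Example: 10k warmup steps out of 100k total, peaking at 1e-4.
lrs = [linear_warmup_linear_decay(s, 10_000, 100_000, 1e-4)
       for s in range(0, 100_001, 20_000)]
```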