DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper, we propose a new model architecture, DeBERTa (Decoding-enhanced BERT with disentangled attention), that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%), and on RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a clear margin (90.3 versus 89.8).
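The disentangled attention described above can be illustrated with a minimal sketch, not the official implementation: each token has a content vector, relative positions have their own embeddings, and the attention score is the sum of content-to-content, content-to-position, and position-to-content terms. The toy sizes, weight names, and clipping helper below are illustrative assumptions.

```python
import numpy as np

# Minimal single-head sketch of disentangled attention (illustrative only).
# Score(i, j) = content-to-content + content-to-position + position-to-content.

rng = np.random.default_rng(0)
seq_len, d, k = 4, 8, 3  # hypothetical toy sizes; k = max relative distance

H = rng.normal(size=(seq_len, d))    # content vectors, one per token
P = rng.normal(size=(2 * k, d))      # relative-position embeddings

Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))    # content projections
Wqr, Wkr = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # position projections

def rel_index(i, j):
    # Clipped relative distance delta(i, j) mapped into [0, 2k)
    return int(np.clip(i - j, -k, k - 1) + k)

Qc, Kc = H @ Wq, H @ Wk    # content queries / keys
Qr, Kr = P @ Wqr, P @ Wkr  # position queries / keys

A = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        c2c = Qc[i] @ Kc[j]                  # content-to-content
        c2p = Qc[i] @ Kr[rel_index(i, j)]    # content-to-position
        p2c = Kc[j] @ Qr[rel_index(j, i)]    # position-to-content
        A[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)  # scale over 3 terms

# Softmax over keys, then weighted sum of (projected) content vectors
attn = np.exp(A - A.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ (H @ rng.normal(size=(d, d)))
```

Note the scaling factor uses 3d rather than d, reflecting that three score terms are summed; the enhanced mask decoder then injects absolute positions separately, just before the masked-token prediction layer.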

PDF Abstract (ICLR 2021)
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | BoolQ | DeBERTa-1.5B | Accuracy | 90.4 | # 5 |
| Linguistic Acceptability | CoLA Dev | DeBERTa (large) | Accuracy | 69.5 | # 3 |
| Natural Language Inference | CommitmentBank | DeBERTa-1.5B | F1 | 94.9 | # 2 |
| Natural Language Inference | CommitmentBank | DeBERTa-1.5B | Accuracy | 97.2 | # 2 |
| Question Answering | COPA | DeBERTa-Ensemble | Accuracy | 98.4 | # 2 |
| Question Answering | COPA | DeBERTa-1.5B | Accuracy | 96.8 | # 3 |
| Natural Language Inference | MRPC Dev | DeBERTa (large) | Accuracy | 92.5 | # 1 |
| Natural Language Inference | MultiNLI | DeBERTa (large) | Matched | 91.1 | # 5 |
| Natural Language Inference | MultiNLI | DeBERTa (large) | Mismatched | 91.1 | # 4 |
| Question Answering | MultiRC | DeBERTa-1.5B | F1 | 88.2 | # 2 |
| Question Answering | MultiRC | DeBERTa-1.5B | EM | 63.7 | # 2 |
| Natural Language Inference | QNLI | DeBERTa (large) | Accuracy | 95.3% | # 10 |
| Question Answering | Quora Question Pairs | DeBERTa (large) | Accuracy | 92.3% | # 1 |
| Reading Comprehension | RACE | DeBERTa-large | Accuracy | 86.8 | # 5 |
| Common Sense Reasoning | ReCoRD | DeBERTa-1.5B | F1 | 94.5 | # 2 |
| Common Sense Reasoning | ReCoRD | DeBERTa-1.5B | EM | 94.1 | # 1 |
| Natural Language Inference | RTE | DeBERTa-1.5B | Accuracy | 93.2% | # 2 |
| Question Answering | SQuAD 2.0 | DeBERTa-large | EM | 88.0 | # 73 |
| Question Answering | SQuAD 2.0 | DeBERTa-large | F1 | 90.7 | # 78 |
| Sentiment Analysis | SST-2 Binary classification | DeBERTa (large) | Accuracy | 96.5 | # 14 |
| Semantic Textual Similarity | STS Benchmark | DeBERTa (large) | Accuracy | 92.5 | # 1 |
| Common Sense Reasoning | SWAG | DeBERTa-large | Test | 90.8 | # 1 |
| Coreference Resolution | Winograd Schema Challenge | DeBERTa-1.5B | Accuracy | 95.9 | # 1 |
| Natural Language Inference | WNLI | DeBERTa | Accuracy | 94.5% | # 1 |
| Word Sense Disambiguation | Words in Context | DeBERTa-1.5B | Accuracy | 76.4 | # 5 |
| Word Sense Disambiguation | Words in Context | DeBERTa-Ensemble | Accuracy | 77.5 | # 3 |