TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE (%)	GLOBAL RANK
Natural Language Inference	MRPC	DeBERTaV3 Large	Accuracy	92.2	#1
Natural Language Inference	QNLI	DeBERTaV3 Large	Accuracy	96	#8
Natural Language Inference	RTE	DeBERTaV3 Large	Accuracy	92.7	#7
Question Answering	SWAG	DeBERTaV3 Large	Accuracy	93.4	#1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/debertav3-improving-deberta-using-electra/natural-language-inference-on-mrpc)](https://paperswithcode.com/sota/natural-language-inference-on-mrpc?p=debertav3-improving-deberta-using-electra)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/debertav3-improving-deberta-using-electra/question-answering-on-swag)](https://paperswithcode.com/sota/question-answering-on-swag?p=debertav3-improving-deberta-using-electra)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/debertav3-improving-deberta-using-electra/natural-language-inference-on-rte)](https://paperswithcode.com/sota/natural-language-inference-on-rte?p=debertav3-improving-deberta-using-electra)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/debertav3-improving-deberta-using-electra/natural-language-inference-on-qnli)](https://paperswithcode.com/sota/natural-language-inference-on-qnli?p=debertav3-improving-deberta-using-electra)`
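The four badge snippets above share one URL pattern, so they can be generated rather than copied by hand. The following is a small sketch under that assumption: the pattern is inferred only from the four lines listed here, and the function and constant names are hypothetical, not part of any Papers with Code API.

```python
# Sketch: rebuild the badge Markdown from a paper slug and a benchmark slug.
# The URL layout is inferred from the badge lines on this page (an assumption).
BADGE_ENDPOINT = (
    "https://img.shields.io/endpoint.svg"
    "?url=https://paperswithcode.com/badge"
)
SOTA_BASE = "https://paperswithcode.com/sota"


def pwc_badge_markdown(paper_slug: str, benchmark_slug: str) -> str:
    """Return a Markdown image link in the same shape as the badges above."""
    badge_url = f"{BADGE_ENDPOINT}/{paper_slug}/{benchmark_slug}"
    leaderboard_url = f"{SOTA_BASE}/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({leaderboard_url})"


PAPER = "debertav3-improving-deberta-using-electra"
BENCHMARKS = [
    "natural-language-inference-on-mrpc",
    "question-answering-on-swag",
    "natural-language-inference-on-rte",
    "natural-language-inference-on-qnli",
]

if __name__ == "__main__":
    for benchmark in BENCHMARKS:
        print(pwc_badge_markdown(PAPER, benchmark))
```

Paste each printed line into a README to show the corresponding leaderboard badge.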

DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

18 Nov 2021 · Pengcheng He, Jianfeng Gao, Weizhu Chen

This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among the models with a similar structure. Furthermore, we have pre-trained a multi-lingual model mDeBERTa and observed a larger improvement over strong baselines compared to English models. For example, the mDeBERTa Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI and a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. We have made our pre-trained models and inference code publicly available at https://github.com/microsoft/DeBERTa.
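The tug-of-war described in the abstract, and how gradient-disentangled embedding sharing avoids it, can be illustrated with a toy sketch. The abstract does not spell out the mechanism; the sketch assumes the discriminator embedding is a stop-gradient copy of the generator embedding plus a residual table initialized to zero, so that the discriminator loss updates only the residual. The quadratic losses, the target vectors, and all function names are hypothetical stand-ins for the real MLM and RTD objectives, and gradients are written out by hand instead of using an autograd library.

```python
# Toy sketch (assumptions throughout): each loss is 0.5 * ||emb - target||^2,
# so its gradient with respect to the embedding is simply (emb - target).

def es_step(shared, target_g, target_d, lr=0.1, lam=1.0):
    """Vanilla embedding sharing: one table receives both gradients."""
    return [
        e - lr * ((e - tg) + lam * (e - td))
        for e, tg, td in zip(shared, target_g, target_d)
    ]


def gdes_step(e_g, e_delta, target_g, target_d, lr=0.1, lam=1.0):
    """Gradient-disentangled sharing: E_D = stop_grad(E_G) + E_delta."""
    # Generator update: E_G sees only the generator loss gradient.
    new_e_g = [e - lr * (e - tg) for e, tg in zip(e_g, target_g)]
    # Discriminator forward pass reads E_G but treats it as a constant.
    e_d = [g + d for g, d in zip(new_e_g, e_delta)]
    # Discriminator update: its gradient flows into the residual only.
    new_e_delta = [
        d - lr * lam * (ed - td)
        for d, ed, td in zip(e_delta, e_d, target_d)
    ]
    return new_e_g, new_e_delta


if __name__ == "__main__":
    target_g, target_d = [1.0, -1.0], [-1.0, 1.0]  # conflicting pulls
    shared = [0.5, 0.5]
    e_g, e_delta = [0.5, 0.5], [0.0, 0.0]
    for _ in range(500):
        shared = es_step(shared, target_g, target_d)
        e_g, e_delta = gdes_step(e_g, e_delta, target_g, target_d)
    print("shared table:", shared)
    print("generator table:", e_g)
    print("discriminator table:", [g + d for g, d in zip(e_g, e_delta)])
```

In this toy setting the shared table settles at a compromise between the two targets, while under the disentangled scheme the generator table reaches its own target and the residual absorbs whatever the discriminator needs on top of it.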

Code

microsoft/DeBERTa official

1,842

stareru/csqa_debertav3

Tasks

Language Modelling

Natural Language Inference

Natural Language Understanding

Question Answering

XLM-R

Datasets

GLUE

SST

SQuAD

MultiNLI

SST-2

QNLI

MRPC

CoLA

SuperGLUE

RACE

XNLI

SWAG

ReCoRD

CC100

RTE

Results from the Paper

Ranked #1 on Natural Language Inference on MRPC

Methods

Adam • Attention Dropout • DeBERTa • Dense Connections • Disentangled Attention Mechanism • Dropout • ELECTRA • GELU • Layer Normalization • Linear Layer • Linear Warmup With Linear Decay • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Weight Decay • WordPiece • XLM-R
