TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Linguistic Acceptability	CoLA	ALBERT	Accuracy	69.1%	# 14
Common Sense Reasoning	CommonsenseQA	Albert Lan et al. (2020) (ensemble)	Accuracy	76.5	# 10
Multi-task Language Understanding	MMLU	ALBERT-xxlarge 223M (fine-tuned)	Average (%)	27.1	# 97
Semantic Textual Similarity	MRPC	ALBERT	Accuracy	93.4%	# 2
Natural Language Inference	MultiNLI	ALBERT	Matched	91.3	# 5
Multimodal Intent Recognition	PhotoChat	ALBERT-base	F1	52.2	# 6
Multimodal Intent Recognition	PhotoChat	ALBERT-base	Precision	44.8	# 6
Multimodal Intent Recognition	PhotoChat	ALBERT-base	Recall	62.7	# 3
Natural Language Inference	QNLI	ALBERT	Accuracy	99.2%	# 1
Question Answering	Quora Question Pairs	ALBERT	Accuracy	90.5%	# 3
Natural Language Inference	RTE	ALBERT	Accuracy	89.2%	# 16
Question Answering	SQuAD2.0	ALBERT (single model)	EM	88.107	# 64
Question Answering	SQuAD2.0	ALBERT (single model)	F1	90.902	# 67
Question Answering	SQuAD2.0	ALBERT (ensemble model)	EM	89.731	# 27
Question Answering	SQuAD2.0	ALBERT (ensemble model)	F1	92.215	# 28
Question Answering	SQuAD2.0 dev	ALBERT xxlarge	F1	88.1	# 4
Question Answering	SQuAD2.0 dev	ALBERT xxlarge	EM	85.1	# 4
Question Answering	SQuAD2.0 dev	ALBERT xlarge	F1	85.9	# 7
Question Answering	SQuAD2.0 dev	ALBERT xlarge	EM	83.1	# 6
Question Answering	SQuAD2.0 dev	ALBERT base	F1	79.1	# 10
Question Answering	SQuAD2.0 dev	ALBERT base	EM	76.1	# 9
Question Answering	SQuAD2.0 dev	ALBERT large	F1	82.1	# 9
Question Answering	SQuAD2.0 dev	ALBERT large	EM	79.0	# 8
Sentiment Analysis	SST-2 Binary classification	ALBERT	Accuracy	97.1	# 5
Semantic Textual Similarity	STS Benchmark	ALBERT	Pearson Correlation	0.925	# 4
Natural Language Inference	WNLI	ALBERT	Accuracy	91.8	# 5
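
The three PhotoChat rows above can be cross-checked against each other. The sketch below assumes the listed F1 is the plain harmonic mean of the listed precision and recall; the page does not say how the scores were aggregated, so treat this as a sanity check rather than the evaluation procedure. The helper name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as listed for ALBERT-base on PhotoChat.
photochat_f1 = f1_score(44.8, 62.7)
print(round(photochat_f1, 2))  # 52.26
```

The result is close to the listed 52.2. The small gap is compatible with the inputs having been rounded to one decimal place, though the page does not confirm this.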

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

ICLR 2020 · Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.
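
The abstract does not name the two parameter-reduction techniques; in the paper they are factorized embedding parameterization and cross-layer parameter sharing. The following is a rough counting sketch of why each one shrinks a model. The vocabulary, hidden, and embedding sizes are illustrative assumptions, the function names are hypothetical, and biases and layer-normalization weights are ignored, so the totals are not the paper's reported parameter counts.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Count token-embedding weights.

    Without factorization the table is V x H. With factorization it is
    V x E followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(hidden_size: int, num_layers: int,
                   share_layers: bool = False) -> int:
    """Approximate Transformer encoder weights (no biases, no LayerNorm)."""
    attention = 4 * hidden_size * hidden_size            # Q, K, V, output projections
    feed_forward = 2 * hidden_size * (4 * hidden_size)   # two linear layers, 4x expansion
    per_layer = attention + feed_forward
    # With cross-layer sharing, one set of layer weights is reused at every depth.
    return per_layer if share_layers else per_layer * num_layers


# Assumed sizes for illustration only.
V, H, E, L = 30_000, 1024, 128, 24

print(embedding_params(V, H))                   # unfactorized embedding table
print(embedding_params(V, H, E))                # factorized embedding table
print(encoder_params(H, L))                     # independent weights per layer
print(encoder_params(H, L, share_layers=True))  # one shared set of layer weights
```

Under these assumptions the factorized embedding is roughly an eighth the size of the unfactorized one, and sharing divides the encoder weight count by the number of layers. Sharing reduces stored parameters only; every layer is still executed in the forward pass.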

Code

Repository	Stars
google-research/ALBERT (official)	3,210
huggingface/transformers	124,793
tensorflow/models	72,088
PaddlePaddle/PaddleNLP	11,398
brightmart/albert_zh	3,902

See all 48 implementations

Tasks

Common Sense Reasoning

Linguistic Acceptability

Multimodal Intent Recognition

Multi-task Language Understanding

Natural Language Inference

Question Answering

Self-Supervised Learning

Semantic Textual Similarity

Sentence

Datasets

GLUE

SST

SQuAD

MultiNLI

SST-2

QNLI

MRPC

MMLU

CoLA

RACE

CommonsenseQA

Quora

Quora Question Pairs

RTE

STS Benchmark

PhotoChat

WNLI

Results from the Paper

Ranked #1 on Natural Language Inference on QNLI

Results from Other Papers

Task	Dataset	Model	Metric Name	Metric Value	Rank
Common Sense Reasoning	CommonsenseQA	Albert Lan et al. (2020) (ensemble)	Accuracy	76.5	# 10

Methods

Adam • ALBERT • Attention Dropout • BERT • Dense Connections • Dropout • GELU • LAMB • Layer Normalization • Linear Layer • Linear Warmup With Linear Decay • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • SPEED • Weight Decay • WordPiece
