ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times...

Venue: ICLR 2020

Results from the Paper


TASK                         DATASET                      MODEL                    METRIC NAME          METRIC VALUE  GLOBAL RANK
Linguistic Acceptability     CoLA                         ALBERT                   Accuracy             69.1%         # 2
Semantic Textual Similarity  MRPC                         ALBERT                   Accuracy             93.4%         # 2
Natural Language Inference   MultiNLI                     ALBERT                   Matched              91.3          # 3
Natural Language Inference   QNLI                         ALBERT                   Accuracy             99.2%         # 1
Question Answering           Quora Question Pairs         ALBERT                   Accuracy             90.5%         # 2
Natural Language Inference   RTE                          ALBERT                   Accuracy             89.2%         # 4
Question Answering           SQuAD2.0                     ALBERT (ensemble model)  EM                   89.731        # 19
Question Answering           SQuAD2.0                     ALBERT (ensemble model)  F1                   92.215        # 21
Question Answering           SQuAD2.0                     ALBERT (single model)    EM                   88.107        # 49
Question Answering           SQuAD2.0                     ALBERT (single model)    F1                   90.902        # 54
Question Answering           SQuAD2.0 dev                 ALBERT xlarge            F1                   85.9          # 7
Question Answering           SQuAD2.0 dev                 ALBERT xlarge            EM                   83.1          # 5
Question Answering           SQuAD2.0 dev                 ALBERT xxlarge           F1                   88.1          # 4
Question Answering           SQuAD2.0 dev                 ALBERT xxlarge           EM                   85.1          # 4
Question Answering           SQuAD2.0 dev                 ALBERT large             F1                   82.1          # 9
Question Answering           SQuAD2.0 dev                 ALBERT large             EM                   79            # 7
Question Answering           SQuAD2.0 dev                 ALBERT base              F1                   79.1          # 11
Question Answering           SQuAD2.0 dev                 ALBERT base              EM                   76.1          # 9
Sentiment Analysis           SST-2 Binary classification  ALBERT                   Accuracy             97.1          # 3
Semantic Textual Similarity  STS Benchmark                ALBERT                   Pearson Correlation  0.925         # 2
Natural Language Inference   WNLI                         ALBERT                   Accuracy             91.8%         # 5

Methods used in the Paper


METHOD                           TYPE
Weight Decay                     Regularization
Dropout                          Regularization
Attention Dropout                Regularization
Linear Warmup With Linear Decay  Learning Rate Schedules
BERT                             Language Models
Residual Connection              Skip Connections
Adam                             Stochastic Optimization
LAMB                             Large Batch Optimization
GELU                             Activation Functions
Dense Connections                Feedforward Networks
Multi-Head Attention             Attention Modules
WordPiece                        Subword Segmentation
Softmax                          Output Functions
Layer Normalization              Normalization
Scaled Dot-Product Attention     Attention Mechanisms
ALBERT                           Transformers
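
Of the methods listed above, "Linear Warmup With Linear Decay" is the one whose name says least about how it works, so here is a minimal Python sketch of the general idea: the learning rate rises linearly from zero to a peak over a warmup phase, then falls linearly back to zero. The function name, its parameters, and the example values are hypothetical and chosen for illustration; they are not the hyperparameters used in the paper.

```python
def linear_warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Return the learning rate at a given 0-indexed training step.

    Illustrative sketch only: all names and values here are assumptions,
    not settings taken from the ALBERT paper.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must be greater than warmup_steps")
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps,
    # and stay at 0 for any step beyond total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Hypothetical schedule: 10 warmup steps, 110 steps in total.
    for s in (0, 5, 10, 60, 110):
        print(s, linear_warmup_linear_decay(s, 1.0, 10, 110))
```

In practice a schedule like this is usually supplied by the training framework rather than written by hand; the sketch is only meant to make the shape of the curve concrete.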