XLNet: Generalized Autoregressive Pretraining for Language Understanding

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
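For reference, the permutation language modeling objective mentioned above can be written as the following expected log-likelihood (notation follows the paper: z is a factorization order sampled from the set Z_T of all permutations of a length-T sequence):

```latex
\max_{\theta}\;
\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}
\left[\sum_{t=1}^{T}\log p_{\theta}\!\left(x_{z_t}\,\middle|\,\mathbf{x}_{\mathbf{z}_{<t}}\right)\right]
```

Because the model parameters are shared across all factorization orders, each position learns to use context from both sides, while the objective remains autoregressive and so avoids BERT's masking-based corruption.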

NeurIPS 2019
| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|------|---------|-------|-------------|--------------|-------------|
| Humor Detection | 200k Short Texts for Humor Detection | XLNet Large Cased | F1-score | 0.920 | # 2 |
| Text Classification | AG News | XLNet | Error | 4.45 | # 1 |
| Text Classification | Amazon-2 | XLNet | Error | 2.11 | # 1 |
| Text Classification | Amazon-5 | XLNet | Error | 31.67 | # 1 |
| Natural Language Inference | ANLI test | XLNet (Large) | A1 | 70.3 | # 4 |
| Natural Language Inference | ANLI test | XLNet (Large) | A2 | 50.9 | # 2 |
| Natural Language Inference | ANLI test | XLNet (Large) | A3 | 49.4 | # 1 |
| Document Ranking | ClueWeb09-B | XLNet | nDCG@20 | 31.10 | # 1 |
| Document Ranking | ClueWeb09-B | XLNet | ERR@20 | 20.28 | # 1 |
| Linguistic Acceptability | CoLA | XLNet (single model) | Accuracy | 69% | # 3 |
| Text Classification | DBpedia | XLNet | Error | 0.62 | # 1 |
| Text Classification | IMDb | XLNet | Accuracy (2 classes) | 96.8 | # 1 |
| Text Classification | IMDb | XLNet | Accuracy (10 classes) | - | # 3 |
| Sentiment Analysis | IMDb | XLNet | Accuracy | 96.21 | # 2 |
| Semantic Textual Similarity | MRPC | XLNet (single model) | Accuracy | 90.8% | # 6 |
| Natural Language Inference | MultiNLI | XLNet (single model) | Matched | 90.8 | # 6 |
| Natural Language Inference | QNLI | XLNet (single model) | Accuracy | 94.9% | # 7 |
| Paraphrase Identification | Quora Question Pairs | XLNet-Large (ensemble) | Accuracy | 90.3 | # 1 |
| Paraphrase Identification | Quora Question Pairs | XLNet-Large (ensemble) | F1 | 74.2 | # 1 |
| Question Answering | Quora Question Pairs | XLNet (single model) | Accuracy | 92.3% | # 1 |
| Question Answering | RACE | XLNet | RACE-m | 85.45 | # 1 |
| Question Answering | RACE | XLNet | RACE-h | 80.21 | # 1 |
| Question Answering | RACE | XLNet | RACE | 81.75 | # 1 |
| Reading Comprehension | RACE | XLNet | Accuracy | 85.4 | # 7 |
| Reading Comprehension | RACE | XLNet | Accuracy (High) | 84.0 | # 5 |
| Reading Comprehension | RACE | XLNet | Accuracy (Middle) | 88.6 | # 4 |
| Natural Language Inference | RTE | XLNet (single model) | Accuracy | 85.9% | # 7 |
| Semantic Textual Similarity | SentEval | XLNet-Large | MRPC | 93.0/90.7 | # 1 |
| Semantic Textual Similarity | SentEval | XLNet-Large | SICK-R | - | # 3 |
| Semantic Textual Similarity | SentEval | XLNet-Large | SICK-E | - | # 3 |
| Semantic Textual Similarity | SentEval | XLNet-Large | STS | 91.6/91.1* | # 1 |
| Question Answering | SQuAD1.1 | XLNet (single model) | EM | 89.898 | # 3 |
| Question Answering | SQuAD1.1 | XLNet (single model) | F1 | 95.080 | # 3 |
| Question Answering | SQuAD1.1 dev | XLNet (single model) | EM | 89.7 | # 5 |
| Question Answering | SQuAD1.1 dev | XLNet (single model) | F1 | 95.1 | # 4 |
| Question Answering | SQuAD2.0 | XLNet (single model) | EM | 87.926 | # 55 |
| Question Answering | SQuAD2.0 | XLNet (single model) | F1 | 90.689 | # 63 |
| Question Answering | SQuAD2.0 dev | XLNet (single model) | F1 | 90.6 | # 1 |
| Question Answering | SQuAD2.0 dev | XLNet (single model) | EM | 87.9 | # 1 |
| Sentiment Analysis | SST-2 Binary classification | XLNet-Large (ensemble) | Accuracy | 96.8 | # 6 |
| Sentiment Analysis | SST-2 Binary classification | XLNet (single model) | Accuracy | 97 | # 4 |
| Semantic Textual Similarity | STS Benchmark | XLNet (single model) | Pearson Correlation | 0.925 | # 2 |
| Natural Language Inference | WNLI | XLNet | Accuracy | 92.5% | # 3 |
| Text Classification | Yelp-2 | XLNet | Accuracy | 98.63% | # 1 |
| Text Classification | Yelp-5 | XLNet | Accuracy | 72.95% | # 2 |
| Sentiment Analysis | Yelp Binary classification | XLNet | Error | 1.55 | # 1 |
| Sentiment Analysis | Yelp Fine-grained classification | XLNet | Error | 27.80 | # 1 |

Methods used in the Paper


| METHOD | TYPE |
|--------|------|
| Cosine Annealing | Learning Rate Schedules |
| Variational Dropout | Regularization |
| ReLU | Activation Functions |
| Adaptive Input Representations | Input Embedding Factorization |
| Adaptive Softmax | Output Functions |
| Linear Warmup With Cosine Annealing | Learning Rate Schedules |
| Transformer-XL | Transformers |
| Residual Connection | Skip Connections |
| Linear Warmup With Linear Decay | Learning Rate Schedules |
| BPE | Subword Segmentation |
| SentencePiece | Tokenizers |
| GELU | Activation Functions |
| Dense Connections | Feedforward Networks |
| Adam | Stochastic Optimization |
| Softmax | Output Functions |
| Dropout | Regularization |
| Multi-Head Attention | Attention Modules |
| Layer Normalization | Normalization |
| Scaled Dot-Product Attention | Attention Mechanisms |
| XLNet | Transformers |
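For readers who want a concrete picture of the attention components listed above, here is a minimal NumPy sketch of scaled dot-product attention. It illustrates the standard formula softmax(Q K^T / sqrt(d_k)) V; the function name and toy inputs are made up for the example and this is not code from the paper.

```python
# Minimal NumPy sketch of scaled dot-product attention (illustrative only,
# not the authors' implementation).
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (seq_len, d_k) arrays. mask: optional (seq_len, seq_len) array
    added to the scores, e.g. large negative values to block positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity scores
    if mask is not None:
        scores = scores + mask                      # e.g. a causal or permutation mask
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy usage: 5 positions, 8-dimensional projections
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```

In XLNet, this same attention primitive is used inside the Transformer-XL backbone; the permutation language modeling objective is realized by controlling which positions each query may attend to through masks like the optional `mask` argument above.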