ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective.
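
As a rough illustration of the contrast the title draws, the sketch below corrupts a toy token sequence in two ways: BERT-style masking, where the model must reconstruct the original tokens at the masked positions, and ELECTRA-style replaced token detection, where a discriminator predicts for every token whether it was replaced. This is a minimal, self-contained Python sketch; the function names and the trivial sampler stand in for ELECTRA's small generator network and are not the authors' code.

    import random

    def mlm_corrupt(tokens, mask_token="[MASK]", mask_prob=0.15, rng=None):
        # BERT-style corruption: hide a subset of tokens behind [MASK]; the model
        # is trained to reconstruct the original token at each masked position.
        rng = rng or random.Random(0)
        corrupted, targets = [], []
        for tok in tokens:
            if rng.random() < mask_prob:
                corrupted.append(mask_token)
                targets.append(tok)       # reconstruction target
            else:
                corrupted.append(tok)
                targets.append(None)      # position is not scored
        return corrupted, targets

    def rtd_corrupt(tokens, sampler, replace_prob=0.15, rng=None):
        # ELECTRA-style corruption: swap a subset of tokens for plausible alternatives
        # (in the paper these come from a small generator network; `sampler` is a stand-in).
        # The discriminator predicts, for every token, replaced (1) or original (0).
        rng = rng or random.Random(0)
        corrupted, labels = [], []
        for tok in tokens:
            if rng.random() < replace_prob:
                new_tok = sampler(tok)
                corrupted.append(new_tok)
                # A sampled token that happens to equal the original counts as original.
                labels.append(1 if new_tok != tok else 0)
            else:
                corrupted.append(tok)
                labels.append(0)
        return corrupted, labels

    tokens = "the chef cooked the meal".split()
    print(mlm_corrupt(tokens))
    print(rtd_corrupt(tokens, sampler=lambda tok: "ate"))

The paper's argument for this setup is that the discriminator receives a learning signal at every input position, rather than only at the small masked subset that an MLM objective scores.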

ICLR 2020
TASK                 DATASET                MODEL     METRIC     VALUE    GLOBAL RANK
Question Answering   Quora Question Pairs   ELECTRA   Accuracy   90.1%    #7

Methods used in the Paper


METHOD                                 TYPE
Cosine Annealing                       Learning Rate Schedules
Residual Connection                    Skip Connections
Attention Dropout                      Regularization
Linear Warmup With Linear Decay        Learning Rate Schedules
Linear Warmup With Cosine Annealing    Learning Rate Schedules
ELECTRA                                Transformers
RoBERTa                                Transformers
SentencePiece                          Tokenizers
BPE                                    Subword Segmentation
Dense Connections                      Feedforward Networks
Weight Decay                           Regularization
WordPiece                              Subword Segmentation
Softmax                                Output Functions
Dropout                                Regularization
Discriminative Fine-Tuning             Fine-Tuning
GELU                                   Activation Functions
Adam                                   Stochastic Optimization
GPT                                    Transformers
Multi-Head Attention                   Attention Modules
Layer Normalization                    Normalization
Scaled Dot-Product Attention           Attention Mechanisms
XLNet                                  Transformers
BERT                                   Language Models
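
To make one of the listed training components concrete, here is a minimal sketch of the Linear Warmup With Linear Decay learning-rate schedule. The parameter names and the example constants are illustrative placeholders, not the values used to train ELECTRA.

    def linear_warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
        # Ramp the learning rate linearly from 0 to peak_lr over the warmup phase,
        # then decay it linearly back toward 0 by the final training step.
        if step < warmup_steps:
            return peak_lr * step / max(1, warmup_steps)
        remaining = max(0, total_steps - step)
        return peak_lr * remaining / max(1, total_steps - warmup_steps)

    # Example with placeholder constants (not the paper's settings):
    for step in (0, 5_000, 10_000, 55_000, 100_000):
        print(step, round(linear_warmup_linear_decay(step, peak_lr=5e-4,
                                                     warmup_steps=10_000,
                                                     total_steps=100_000), 6))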