MASS: Masked Sequence to Sequence Pre-training for Language Generation

7 May 2019 · Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu

Pre-training and fine-tuning, e.g., BERT, have achieved great success in language understanding by transferring knowledge from a rich-resource pre-training task to low/zero-resource downstream tasks. Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for encoder-decoder based language generation tasks...
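As a rough illustration of the pre-training objective described above, the sketch below builds one MASS-style training example: a contiguous fragment of the sentence is replaced by [MASK] tokens on the encoder side, and the decoder is trained to predict that fragment given its right-shifted version. The function name mass_mask, the plain-string tokens, and the ~50% mask-ratio default are illustrative assumptions, not the authors' released code.

```python
import random

MASK = "[MASK]"

def mass_mask(tokens, mask_ratio=0.5, rng=random):
    """Build one MASS-style training pair from a tokenized sentence (a sketch,
    not the paper's implementation).

    The encoder sees the sentence with a contiguous fragment replaced by [MASK];
    the decoder is fed the fragment shifted right by one position (teacher
    forcing) and must predict the fragment tokens.
    """
    m = len(tokens)
    k = max(1, int(m * mask_ratio))          # fragment length (~50% of tokens)
    u = rng.randrange(0, m - k + 1)          # fragment start position
    fragment = tokens[u:u + k]               # tokens the decoder must predict

    encoder_input = tokens[:u] + [MASK] * k + tokens[u + k:]
    decoder_input = [MASK] + fragment[:-1]   # shifted target for teacher forcing
    decoder_target = fragment
    return encoder_input, decoder_input, decoder_target


if __name__ == "__main__":
    sent = "we propose masked sequence to sequence pre training".split()
    enc_in, dec_in, dec_out = mass_mask(sent, rng=random.Random(0))
    print("encoder input :", enc_in)
    print("decoder input :", dec_in)
    print("decoder target:", dec_out)
```

Because the encoder must represent the unmasked context while the decoder must generate the masked fragment, this single objective jointly pre-trains both sides of an encoder-decoder model, which is what makes it suitable for the generation tasks benchmarked below.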

TASK | DATASET | MODEL | METRIC | VALUE | GLOBAL RANK
Text Summarization | GigaWord | MASS | ROUGE-1 | 38.73 | #12
Text Summarization | GigaWord | MASS | ROUGE-2 | 19.71 | #12
Text Summarization | GigaWord | MASS | ROUGE-L | 35.96 | #12
Unsupervised Machine Translation | WMT2014 English-French | MASS (6-layer Transformer) | BLEU | 37.5 | #1
Unsupervised Machine Translation | WMT2014 French-English | MASS (6-layer Transformer) | BLEU | 34.9 | #2
Unsupervised Machine Translation | WMT2016 English-German | MASS (6-layer Transformer) | BLEU | 28.3 | #2
Unsupervised Machine Translation | WMT2016 English-Romanian | MASS (6-layer Transformer) | BLEU | 35.2 | #3
Unsupervised Machine Translation | WMT2016 German-English | MASS (6-layer Transformer) | BLEU | 35.2 | #2
Unsupervised Machine Translation | WMT2016 Romanian-English | MASS (6-layer Transformer) | BLEU | 33.1 | #2

Methods used in the Paper


METHOD | TYPE
Weight Decay | Regularization
Residual Connection | Skip Connections
Adam | Stochastic Optimization
Layer Normalization | Normalization
Softmax | Output Functions
Scaled Dot-Product Attention | Attention Mechanisms
Dropout | Regularization
GELU | Activation Functions
Multi-Head Attention | Attention Modules
Attention Dropout | Regularization
WordPiece | Subword Segmentation
Linear Warmup With Linear Decay | Learning Rate Schedules
Dense Connections | Feedforward Networks
BERT | Language Models
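For reference, the scaled dot-product attention listed above is the building block of the multi-head attention used in the Transformer encoder and decoder that MASS pre-trains. The NumPy snippet below is an illustrative sketch of the standard formula softmax(QK^T / sqrt(d_k)) V, not code from the paper.

```python
import math
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (batch, q_len, d_k), K: (batch, k_len, d_k), V: (batch, k_len, d_v).
    Returns (batch, q_len, d_v). Illustrative NumPy sketch only.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / math.sqrt(d_k)     # (batch, q_len, k_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((1, 3, 4))
    K = rng.standard_normal((1, 5, 4))
    V = rng.standard_normal((1, 5, 4))
    print(scaled_dot_product_attention(Q, K, V).shape)      # (1, 3, 4)
```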