Muppet: Massive Multi-task Representations with Pre-Finetuning

We propose pre-finetuning, an additional large-scale learning stage between language model pre-training and fine-tuning. Pre-finetuning is massively multi-task learning (around 50 datasets, over 4.8 million total labeled examples), and is designed to encourage learning of representations that generalize better to many different tasks. We show that pre-finetuning consistently improves performance for pretrained discriminators (e.g., RoBERTa) and generation models (e.g., BART) on a wide range of tasks (sentence prediction, commonsense reasoning, machine reading comprehension, etc.), while also significantly improving sample efficiency during fine-tuning. We also show that large-scale multi-tasking is crucial: pre-finetuning can hurt performance when only a few tasks are used, up to a critical point (usually above 15), after which performance improves linearly in the number of tasks.
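The core idea above — one shared encoder trained jointly across many labeled tasks, each with its own output head — can be sketched in a few lines of PyTorch. This is a hypothetical, minimal illustration, not the paper's implementation: the toy linear encoder stands in for RoBERTa/BART, and the task names, dataset sizes, and synthetic data are invented for the example. Batches are sampled in proportion to dataset size, so larger tasks contribute more updates without separate training loops.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

HIDDEN = 32
tasks = {                      # task name -> (num_classes, dataset size); illustrative values
    "sentiment": (2, 500),
    "nli":       (3, 300),
    "qa_bool":   (2, 200),
}

# Shared encoder (a stand-in for a pretrained transformer) plus one
# classification head per task.
encoder = nn.Sequential(nn.Linear(16, HIDDEN), nn.ReLU())
heads = nn.ModuleDict({t: nn.Linear(HIDDEN, c) for t, (c, _) in tasks.items()})
params = list(encoder.parameters()) + list(heads.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

# Sample tasks proportionally to dataset size so large datasets
# are visited more often within a single multi-task loop.
sizes = torch.tensor([float(s) for _, s in tasks.values()])
probs = sizes / sizes.sum()
names = list(tasks)

loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    t = names[torch.multinomial(probs, 1).item()]
    num_classes, _ = tasks[t]
    x = torch.randn(8, 16)                  # toy "examples" (batch of 8)
    y = torch.randint(num_classes, (8,))    # toy labels for this task
    loss = loss_fn(heads[t](encoder(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss on task {t}: {loss.item():.3f}")
```

After pre-finetuning in this style, the per-task heads would be discarded and the shared encoder fine-tuned on each downstream task individually.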

EMNLP 2021

Results from the Paper

Ranked #3 on Text Summarization on GigaWord (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | BoolQ | MUPPET RoBERTa Large | Accuracy | 87.5 | #8 |
| Question Answering | BoolQ | MUPPET RoBERTa Base | Accuracy | 83.8 | #13 |
| Abstractive Text Summarization | CNN / Daily Mail | MUPPET BART Large | ROUGE-1 | 44.45 | #10 |
| Abstractive Text Summarization | CNN / Daily Mail | MUPPET BART Large | ROUGE-2 | 21.25 | #17 |
| Abstractive Text Summarization | CNN / Daily Mail | MUPPET BART Large | ROUGE-L | 41.4 | #8 |
| Common Sense Reasoning | CommonsenseQA | MUPPET RoBERTa Large | Accuracy | 79.2 | #5 |
| Text Summarization | GigaWord | MUPPET BART Large | ROUGE-1 | 40.4 | #3 |
| Text Summarization | GigaWord | MUPPET BART Large | ROUGE-2 | 20.54 | #4 |
| Text Summarization | GigaWord | MUPPET BART Large | ROUGE-L | 36.21 | #15 |
| Sentence Completion | HellaSwag | MUPPET RoBERTa Large | Accuracy | 86.4 | #5 |
| Text Summarization | Reddit TIFU | MUPPET BART Large | ROUGE-1 | 30.3 | #3 |
| Text Summarization | Reddit TIFU | MUPPET BART Large | ROUGE-2 | 11.25 | #1 |
| Text Summarization | Reddit TIFU | MUPPET BART Large | ROUGE-L | 24.92 | #2 |
| Natural Language Inference | RTE | MUPPET RoBERTa Large | Accuracy | 92.8% | #3 |
| Sentiment Analysis | SST-2 (binary classification) | MUPPET RoBERTa Large | Accuracy | 97.4 | #3 |
| Sentiment Analysis | SST-2 (binary classification) | MUPPET RoBERTa Base | Accuracy | 96.7 | #11 |
