Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective on self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto frontier by outperforming T5 and GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised, fine-tuning-based NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for UL2 20B and Flan-UL2 20B.
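
To make the Mixture-of-Denoisers and mode-switching ideas more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes three hypothetical denoising modes, each tagged with a mode token prepended to the input. The token names (`[R]`, `[X]`, `[S]`), the span lengths, the corruption rates, and the `<extra_id_k>` sentinel format are illustrative assumptions and are not stated in the abstract. Spans have a fixed length here for simplicity.

```python
import random

# Illustrative assumption: three denoising modes keyed by a mode token.
# None of these names or numbers come from the abstract.
DENOISERS = {
    "[R]": {"mean_span": 3, "corrupt_rate": 0.15},  # short spans, low rate
    "[X]": {"mean_span": 32, "corrupt_rate": 0.5},  # long spans, high rate
    "[S]": None,                                    # prefix -> suffix split
}


def span_corrupt(tokens, mean_span, corrupt_rate, rng):
    """Mask non-overlapping spans; return (inputs, targets)."""
    n = len(tokens)
    num_noise = max(1, round(n * corrupt_rate))       # tokens to mask
    num_spans = max(1, round(num_noise / mean_span))  # how many spans
    span_len = max(1, num_noise // num_spans)         # fixed span length
    chunk = n // num_spans                            # one span per chunk
    inputs, targets = [], []
    for k in range(num_spans):
        lo = k * chunk
        hi = n if k == num_spans - 1 else lo + chunk
        length = min(span_len, hi - lo)
        start = rng.randint(lo, hi - length)          # inclusive bounds
        sentinel = f"<extra_id_{k}>"
        # Inputs keep the context and a sentinel in place of the span.
        inputs += tokens[lo:start] + [sentinel] + tokens[start + length:hi]
        # Targets list each sentinel followed by the tokens it replaced.
        targets += [sentinel] + tokens[start:start + length]
    return inputs, targets


def prefix_split(tokens, rng):
    """Sequential denoising: condition on a prefix, predict the rest."""
    cut = rng.randint(1, max(1, len(tokens) - 1))
    return tokens[:cut], tokens[cut:]


def make_example(tokens, rng):
    """Sample one mode uniformly and build a (mode-tagged input, target) pair."""
    mode = rng.choice(list(DENOISERS))
    cfg = DENOISERS[mode]
    if cfg is None:
        inputs, targets = prefix_split(tokens, rng)
    else:
        inputs, targets = span_corrupt(tokens, rng=rng, **cfg)
    return [mode] + inputs, targets


if __name__ == "__main__":
    rng = random.Random(0)
    words = "the quick brown fox jumps over the lazy dog again and again".split()
    inputs, targets = make_example(words, rng)
    print(inputs)
    print(targets)
```

Under this sketch, mode switching at fine-tuning time would amount to prepending the mode token that best matches the downstream task to each input, so the model is steered toward the corresponding pre-training behaviour.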

Results from the Paper


 Ranked #1 on Long-range modeling on SCROLLS (CNLI metric)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Challenge) | UL2 20B (chain-of-thought + self-consistency) | Accuracy | 49.5 | # 35 |
| Common Sense Reasoning | ARC (Challenge) | UL2 20B (zero-shot) | Accuracy | 29.8 | # 51 |
| Common Sense Reasoning | ARC (Challenge) | UL2 20B (chain-of-thought) | Accuracy | 42.9 | # 43 |
| Common Sense Reasoning | ARC (Easy) | UL2 20B (chain-of-thought + self-consistency) | Accuracy | 69.8 | # 34 |
| Common Sense Reasoning | ARC (Easy) | UL2 20B (0-shot) | Accuracy | 32.2 | # 46 |
| Common Sense Reasoning | ARC (Easy) | UL2 20B (chain-of-thought) | Accuracy | 38.4 | # 44 |
| Question Answering | BoolQ | UL2 20B (0-shot) | Accuracy | 63.1 | # 50 |
| Question Answering | BoolQ | UL2 20B (fine-tuned) | Accuracy | 90.8 | # 7 |
| Common Sense Reasoning | CommonsenseQA | UL2 20B (chain-of-thought + self-consistency) | Accuracy | 55.7 | # 32 |
| Common Sense Reasoning | CommonsenseQA | UL2 20B (zero-shot) | Accuracy | 34.2 | # 36 |
| Common Sense Reasoning | CommonsenseQA | UL2 20B (chain-of-thought) | Accuracy | 51.4 | # 34 |
| Question Answering | COPA | UL2 20B (0-shot) | Accuracy | 85 | # 29 |
| Question Answering | COPA | UL2 20B (fine-tuned) | Accuracy | 99 | # 4 |
| Arithmetic Reasoning | GSM8K | UL2 20B (chain-of-thought) | Accuracy | 4.4 | # 150 |
| Arithmetic Reasoning | GSM8K | UL2 20B (chain-of-thought) | Parameters (Billion) | 20 | # 73 |
| Arithmetic Reasoning | GSM8K | UL2 20B (0-shot) | Accuracy | 4.1 | # 151 |
| Arithmetic Reasoning | GSM8K | UL2 20B (0-shot) | Parameters (Billion) | 20 | # 73 |
| Multi-task Language Understanding | MMLU | FLAN-UL2 20B (chain-of-thought) | Average (%) | 52.2 | # 68 |
| Multi-task Language Understanding | MMLU | FLAN-UL2 20B (5-shot) | Average (%) | 55.7 | # 62 |
| Multi-task Language Understanding | MMLU | UL2 20B (5-shot) | Average (%) | 39.2 | # 84 |
| Natural Language Inference | RTE | UL2 20B (0-shot) | Accuracy | 60.7% | # 71 |
| Natural Language Inference | RTE | UL2 20B (fine-tuned) | Accuracy | 92.1% | # 10 |
| Long-range modeling | SCROLLS | UL2 | GovRep | 53.6 / 26.1 / 28.8 | # 8 |
| Long-range modeling | SCROLLS | UL2 | SumScr | 32.9 / 7.8 / 19.4 | # 8 |
| Long-range modeling | SCROLLS | UL2 | QMSum | 31.1 / 8.5 / 20.4 | # 8 |
| Long-range modeling | SCROLLS | UL2 | Qspr | 37.6 | # 7 |
| Long-range modeling | SCROLLS | UL2 | Nrtv | 24.2 | # 5 |
| Long-range modeling | SCROLLS | UL2 | QALT EM-T/H | 45.8 / 40.7 | # 2 |
| Long-range modeling | SCROLLS | UL2 | Avg. | 37.87 | # 7 |
| Long-range modeling | SCROLLS | UL2 20B | CNLI | 88.7 | # 1 |
| Coreference Resolution | Winograd Schema Challenge | UL2 20B (fine-tuned) | Accuracy | 98.1 | # 3 |
| Coreference Resolution | Winograd Schema Challenge | UL2 20B (0-shot) | Accuracy | 79.9 | # 23 |
| Word Sense Disambiguation | Words in Context | UL2 20B (fine-tuned) | Accuracy | 77.3 | # 6 |
| Word Sense Disambiguation | Words in Context | UL2 20B (0-shot) | Accuracy | 49.8 | # 34 |

Methods