Finetuned Language Models Are Zero-Shot Learners

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
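
To make "tasks verbalized via natural language instruction templates" concrete, the following is a minimal sketch of how a single natural language inference example might be turned into several instruction-style input/target pairs. The templates, the `NLIExample` class, and the helper names are hypothetical and chosen only for illustration; they are not the templates or code used for FLAN.

```python
from dataclasses import dataclass


@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # "yes" or "no"


# Hypothetical templates for illustration only; not the paper's actual templates.
TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes or no.",
    "Read the following and decide whether the hypothesis can be inferred "
    "from the premise.\nPremise: {premise}\nHypothesis: {hypothesis}\n"
    "Answer (yes or no):",
]


def verbalize(example, template_index):
    """Turn a structured example into an (input text, target text) pair."""
    template = TEMPLATES[template_index % len(TEMPLATES)]
    prompt = template.format(premise=example.premise, hypothesis=example.hypothesis)
    return prompt, example.label


def build_finetuning_pairs(examples):
    """Pair every example with every template to vary the instruction phrasing."""
    return [verbalize(ex, i) for ex in examples for i in range(len(TEMPLATES))]


example = NLIExample(
    premise="The dog is sleeping on the couch.",
    hypothesis="An animal is resting.",
    label="yes",
)
pairs = build_finetuning_pairs([example])
for prompt, target in pairs:
    print(prompt)
    print("->", target)
    print()
```

Each resulting pair could then serve as one finetuning example. Using more than one phrasing per task is a design choice of this sketch, intended to keep the model from latching onto a single wording.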

ICLR 2022

Results from the Paper

 Ranked #1 on Common Sense Reasoning on ReCoRD (Accuracy metric)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | ARC-c | FLAN 137B zero-shot | Accuracy | 63.1 | #1 |
| Question Answering | ARC-e | FLAN 137B zero-shot | Accuracy | 79.6 | #1 |
| Question Answering | BoolQ | FLAN 137B zero-shot | Accuracy | 82.9 | #7 |
| Question Answering | COPA | FLAN 137B zero-shot | Accuracy | 91.0 | #6 |
| Sentence Completion | HellaSwag | FLAN 137B zero-shot | Accuracy | 56.7 | #9 |
| Question Answering | MultiRC | FLAN 137B zero-shot | F1a | 77.5 | #4 |
| Question Answering | NaturalQA | FLAN 137B zero-shot | EM | 20.7 | #2 |
| Question Answering | OBQA | FLAN 137B zero-shot | Accuracy | 78.4 | #1 |
| Question Answering | PIQA | FLAN 137B zero-shot | Accuracy | 80.5 | #2 |
| Common Sense Reasoning | ReCoRD | FLAN 137B zero-shot | Accuracy | 72.5 | #1 |
| Natural Language Inference | RTE | FLAN 137B zero-shot | Accuracy | 84.1 | #12 |
| Question Answering | StoryCloze | FLAN 137B zero-shot | Accuracy | 93.4 | #1 |
| Question Answering | TriviaQA | FLAN 137B zero-shot | EM | 56.7 | #14 |
| Common Sense Reasoning | Winograd Schema Challenge | FLAN 137B zero-shot | Score | 71.2 | #3 |
| Machine Translation | WMT2014 English-French | FLAN 137B zero-shot | BLEU score | 34 | #45 |
| Machine Translation | WMT2014 French-English | FLAN 137B zero-shot | BLEU score | 36.5 | #1 |
| Machine Translation | WMT2016 English-German | FLAN 137B zero-shot | BLEU score | 27.0 | #5 |
| Machine Translation | WMT2016 English-Romanian | FLAN 137B zero-shot | BLEU score | 18.4 | #19 |
| Machine Translation | WMT2016 German-English | FLAN 137B zero-shot | BLEU score | 39.8 | #1 |
| Machine Translation | WMT2016 Romanian-English | FLAN 137B zero-shot | BLEU score | 36.7 | #2 |
| Common Sense Reasoning | WSC273 | FLAN 137B zero-shot | Accuracy | 80.8 | #1 |