Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
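Since the Flan-T5 checkpoints are publicly released, the simplest way to try instruction-following and chain-of-thought behavior is to run one of them directly. The sketch below loads a checkpoint via Hugging Face transformers (assuming the models are published on the Hub under the `google/flan-t5-*` names) and issues a zero-shot instruction plus a zero-shot CoT prompt; the prompts are illustrative, not taken from the paper's evaluation suite.

```python
# Minimal zero-shot / chain-of-thought inference with a released Flan-T5 checkpoint.
# Assumes the checkpoint is available on the Hugging Face Hub as "google/flan-t5-base";
# prompts below are illustrative examples, not the paper's benchmark prompts.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Zero-shot instruction following.
print(generate("Translate to German: How old are you?"))

# Zero-shot chain-of-thought prompting: ask the model to reason before answering.
print(generate(
    "Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls "
    "are blue. How many blue golf balls are there? Let's think step by step."
))
```

Larger checkpoints (Flan-T5-XL 3B, Flan-T5-XXL 11B) follow the same interface and, as the MMLU rows in the table below indicate, give noticeably stronger results than the smaller sizes.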

| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Multi-task Language Understanding | BBH-alg | PaLM 540B | Average (%) | 38.3 | 7 |
| Multi-task Language Understanding | BBH-alg | Flan-PaLM 540B (3-shot, finetuned, CoT + SC) | Average (%) | 66.5 | 2 |
| Multi-task Language Understanding | BBH-alg | Flan-PaLM 540B (3-shot, finetuned) | Average (%) | 48.2 | 6 |
| Multi-task Language Understanding | BBH-alg | Flan-PaLM 540B (3-shot, finetuned, CoT) | Average (%) | 61.3 | 4 |
| Multi-task Language Understanding | BBH-alg | PaLM 540B (CoT + self-consistency) | Average (%) | 62.2 | 3 |
| Multi-task Language Understanding | BBH-alg | PaLM 540B (CoT) | Average (%) | 57.6 | 5 |
| Multi-task Language Understanding | BBH-nlp | Flan-PaLM 540B (5-shot, finetuned) | Average (%) | 70.0 | 6 |
| Multi-task Language Understanding | BBH-nlp | PaLM 540B | Average (%) | 62.7 | 7 |
| Multi-task Language Understanding | BBH-nlp | Flan-PaLM 540B (3-shot, finetuned, CoT) | Average (%) | 72.4 | 4 |
| Multi-task Language Understanding | BBH-nlp | Flan-PaLM 540B (3-shot, finetuned, CoT + SC) | Average (%) | 78.4 | 1 |
| Multi-task Language Understanding | BBH-nlp | PaLM 540B (CoT + self-consistency) | Average (%) | 78.2 | 2 |
| Multi-task Language Understanding | BBH-nlp | PaLM 540B (CoT) | Average (%) | 71.2 | 5 |
| Multi-task Language Understanding | MGSM | text-davinci-002 | Average (%) | 23.7 | 10 |
| Multi-task Language Understanding | MGSM | GPT-3 Davinci 175B | Average (%) | 5.7 | 12 |
| Multi-task Language Understanding | MGSM | Flan-PaLM 540B (8-shot, finetuned, CoT + SC) | Average (%) | 72.0 | 3 |
| Multi-task Language Understanding | MGSM | Flan-PaLM 540B (8-shot, finetuned) | Average (%) | 21.2 | 11 |
| Multi-task Language Understanding | MGSM | Flan-U-PaLM 540B (CoT) | Average (%) | 60.4 | 4 |
| Multi-task Language Understanding | MGSM | Flan-PaLM 540B (8-shot, finetuned, CoT) | Average (%) | 57.0 | 5 |
| Multi-task Language Understanding | MGSM | code-davinci-002 | Average (%) | 35 | 9 |
| Multi-task Language Understanding | MGSM | text-davinci-003 | Average (%) | 36 | 8 |
| Multi-task Language Understanding | MMLU | text-davinci-002 175B (5-shot) | Average (%) | 63.1 | 45 |
| Multi-task Language Understanding | MMLU | Flan-cont-PaLM 62B (CoT) | Average (%) | 62 | 48 |
| Multi-task Language Understanding | MMLU | Flan-T5-Large 780M | Average (%) | 45.1 | 71 |
| Multi-task Language Understanding | MMLU | Flan-PaLM 540B (CoT) | Average (%) | 70.9 | 26 |
| Multi-task Language Understanding | MMLU | Flan-PaLM (5-shot, finetuned) | Average (%) | 72.2 | 23 |
| Multi-task Language Understanding | MMLU | Flan-T5-Base 250M | Average (%) | 35.9 | 84 |
| Multi-task Language Understanding | MMLU | Flan-T5-Small 80M | Average (%) | 28.7 | 91 |
| Multi-task Language Understanding | MMLU | Flan-PaLM 8B | Average (%) | 49.3 | 65 |
| Multi-task Language Understanding | MMLU | Flan-PaLM (5-shot, finetuned, CoT) | Average (%) | 70.2 | 30 |
| Multi-task Language Understanding | MMLU | Flan-PaLM 540B | Average (%) | 73.5 | 21 |
| Multi-task Language Understanding | MMLU | Flan-U-PaLM 540B | Average (%) | 74.1 | 19 |
| Multi-task Language Understanding | MMLU | GPT-3 Davinci 175B (5-shot) | Average (%) | 39.7 | 77 |
| Multi-task Language Understanding | MMLU | GPT-3 Davinci 175B (CoT) | Average (%) | 59.5 | 52 |
| Multi-task Language Understanding | MMLU | Flan-T5-Small 80M (CoT) | Average (%) | 12.1 | 107 |
| Multi-task Language Understanding | MMLU | Flan-T5-Base 250M (CoT) | Average (%) | 33.7 | 85 |
| Multi-task Language Understanding | MMLU | Flan-T5-Large 780M (CoT) | Average (%) | 40.5 | 76 |
| Multi-task Language Understanding | MMLU | Flan-T5-XL 3B (CoT) | Average (%) | 45.5 | 69 |
| Multi-task Language Understanding | MMLU | Flan-T5-XXL 11B (CoT) | Average (%) | 48.6 | 67 |
| Multi-task Language Understanding | MMLU | Flan-T5-XL 3B | Average (%) | 52.4 | 63 |
| Multi-task Language Understanding | MMLU | Flan-T5-XXL 11B | Average (%) | 55.1 | 58 |
| Multi-task Language Understanding | MMLU | Flan-cont-PaLM 62B | Average (%) | 66.1 | 40 |
| Multi-task Language Understanding | MMLU | Flan-U-PaLM 540B (CoT) | Average (%) | 69.8 | 32 |
| Multi-task Language Understanding | MMLU | code-davinci-002 175B (CoT) | Average (%) | 64.5 | 43 |
| Multi-task Language Understanding | MMLU | Flan-PaLM | Average (%) | 56.9 | 55 |
| Multi-task Language Understanding | MMLU | code-davinci-002 175B (5-shot) | Average (%) | 68.2 | 37 |
| Multi-task Language Understanding | MMLU | text-davinci-003 175B (CoT) | Average (%) | 64.6 | 42 |
| Multi-task Language Understanding | MMLU | text-davinci-003 175B (5-shot) | Average (%) | 64.8 | 41 |
| Multi-task Language Understanding | MMLU | text-davinci-002 175B (CoT) | Average (%) | 60 | 50 |
| Cross-Lingual Question Answering | TyDiQA-GoldP | Flan-U-PaLM 540B (direct prompting) | EM | 68.3 | 3 |
| Cross-Lingual Question Answering | TyDiQA-GoldP | Flan-PaLM 540B (direct prompting) | EM | 67.8 | 4 |
| Coreference Resolution | Winograd Schema Challenge | Flan-T5-XXL (zero-shot) | Accuracy | 89.82 | 10 |
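Several rows above are marked "CoT + SC", i.e. chain-of-thought prompting combined with self-consistency: instead of a single greedy decode, multiple reasoning paths are sampled and the final answers are aggregated by majority vote. Below is a minimal sketch of that decoding strategy using a Flan-T5 checkpoint; the sampling parameters, number of paths, and the answer-extraction heuristic are assumptions for illustration, not the configuration used in the paper.

```python
# Chain-of-thought prompting with self-consistency: sample several reasoning paths
# and majority-vote over the extracted final answers. Sampling settings and the
# answer-extraction regex are illustrative assumptions, not the paper's setup.
import re
from collections import Counter

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-large"  # any Flan-T5 size works; larger scores higher
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def self_consistent_answer(question: str, num_paths: int = 10) -> str:
    prompt = f"Q: {question}\nA: Let's think step by step."
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample multiple diverse reasoning paths instead of one greedy decode.
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=128,
        num_return_sequences=num_paths,
    )
    answers = []
    for ids in output_ids:
        text = tokenizer.decode(ids, skip_special_tokens=True)
        numbers = re.findall(r"-?\d+\.?\d*", text)
        if numbers:  # heuristic: treat the last number in the output as the answer
            answers.append(numbers[-1])
    # Majority vote across the sampled reasoning paths.
    return Counter(answers).most_common(1)[0][0] if answers else ""

print(self_consistent_answer(
    "There are 3 cars in the parking lot and 2 more arrive. How many cars are there?"
))
```

The benefit of self-consistency is visible in the table: for example, Flan-PaLM 540B on BBH-nlp improves from 72.4% with a single CoT path to 78.4% with CoT + SC.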
