Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system that enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance increased steeply as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and potential mitigation strategies.
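
As a rough illustration of the few-shot setting the abstract refers to, the sketch below assembles a k-shot prompt from solved examples followed by an unsolved query. The function name `build_few_shot_prompt` and the "Q:/A:" template are hypothetical choices for illustration only; the source does not specify the prompt format used for PaLM.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    examples: Sequence[Tuple[str, str]],
    query: str,
    k: int = 5,
) -> str:
    """Concatenate k solved examples followed by the unsolved query.

    The "Q:/A:" template is an assumption made for illustration; it is
    not taken from the paper.
    """
    if k > len(examples):
        raise ValueError("not enough examples for the requested k")
    # Each shot pairs an input with its known answer.
    shots = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    # The final block leaves the answer empty for the model to complete.
    shots.append(f"Q: {query}\nA:")
    return "\n\n".join(shots)
```

With k=0 the same function yields a zero-shot prompt containing only the query, which matches the zero-shot, one-shot, and few-shot distinction used in the results.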

Google Research 2022
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Memorization | BIG-bench (Hindu Knowledge) | PaLM-540B (few-shot, k=5) | Accuracy | 95.4 | #1 |
| Memorization | BIG-bench (Hindu Knowledge) | PaLM-62B (few-shot, k=5) | Accuracy | 77.7 | #3 |
| Common Sense Reasoning | BIG-bench (Known Unknowns) | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | #1 |
| Auto Debugging | BIG-bench Lite | PaLM 62B (few-shot, k=5) | Exact string match | 38.2 | #1 |
| Auto Debugging | BIG-bench Lite | PaLM 540B (few-shot, k=5) | Exact string match | 38.2 | #1 |
| Auto Debugging | BIG-bench Lite | PaLM 8B (few-shot, k=5) | Exact string match | 14.7 | #3 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Novel Concepts) | PaLM-62B (few-shot, k=5) | Accuracy | 59.4 | #3 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Novel Concepts) | PaLM-540B (few-shot, k=5) | Accuracy | 71.9 | #1 |
| Logical Reasoning | BIG-bench (StrategyQA) | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | #1 |
| Logical Reasoning | BIG-bench (StrategyQA) | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | #3 |
| Common Sense Reasoning | BIG-bench (Winowhy) | PaLM-540B (few-shot, k=5) | Accuracy | 65.9 | #1 |
| Common Sense Reasoning | BIG-bench (Winowhy) | PaLM-62B (few-shot, k=5) | Accuracy | 61.0 | #3 |
| Question Answering | BoolQ | PaLM 540B (fine-tuned) | Accuracy | 92.2 | #3 |
| Natural Language Inference | CommitmentBank | PaLM 540B (finetuned) | F1 | 100 | #1 |
| | | | Accuracy | 100 | #1 |
| Question Answering | COPA | PaLM 540B (finetuned) | Accuracy | 100 | #1 |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-540B | ROUGE-2 | 21.2 | #2 |
| | | | Parameters | 540 B | #2 |
| Extreme Summarization | GEM-XSum | T5-XXL | ROUGE-2 | 21.0 | #3 |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-62B | ROUGE-2 | 18.5 | #4 |
| | | | Parameters | 62 B | #3 |
| Sentence Completion | HellaSwag | PaLM-540B (Few-Shot) | Accuracy | 83.8 | #29 |
| Sentence Completion | HellaSwag | PaLM-540B (0-shot) | Accuracy | 83.4 | #32 |
| Sentence Completion | HellaSwag | PaLM-540B (1-shot) | Accuracy | 83.6 | #30 |
| Language Modelling | LAMBADA | PaLM-540B (Zero-Shot) | Accuracy | 77.9 | #15 |
| Language Modelling | LAMBADA | PaLM-540B (Few-Shot) | Accuracy | 89.7 | #1 |
| Language Modelling | LAMBADA | PaLM-540B (One-Shot) | Accuracy | 81.8 | #9 |
| Code Generation | MBPP | PaLM Coder 540B | Accuracy | 47 | #65 |
| Code Generation | MBPP | PaLM 540B | Accuracy | 36.8 | #80 |
| Multi-task Language Understanding | MGSM | PaLM 540B | Average (%) | 55.0 | #6 |
| Multi-task Language Understanding | MMLU | PaLM | Average (%) | 69.3 | #40 |
| Question Answering | MultiRC | PaLM 540B (finetuned) | F1 | 90.1 | #1 |
| | | | EM | 69.2 | #1 |
| Question Answering | Natural Questions | PaLM-540B (Zero-Shot) | EM | 21.2 | #43 |
| Question Answering | Natural Questions | PaLM-540B (Few-Shot, k=64) | EM | 39.6 | #26 |
| Question Answering | Natural Questions | PaLM-540B (One-Shot) | EM | 29.3 | #35 |
| Question Answering | OBQA | PaLM 540B (zero-shot) | Accuracy | 53.4 | #8 |
| Question Answering | OBQA | PaLM 62B (zero-shot) | Accuracy | 50.4 | #9 |
| Reading Comprehension | RACE | PaLM 8B (zero-shot) | Accuracy (High) | 42.3 | #14 |
| | | | Accuracy (Middle) | 57.9 | #14 |
| Reading Comprehension | RACE | PaLM 540B (zero-shot) | Accuracy (High) | 49.1 | #8 |
| | | | Accuracy (Middle) | 68.1 | #7 |
| Reading Comprehension | RACE | PaLM 62B (zero-shot) | Accuracy (High) | 47.5 | #10 |
| | | | Accuracy (Middle) | 64.3 | #9 |
| Common Sense Reasoning | ReCoRD | PaLM 540B (finetuned) | F1 | 94.6 | #2 |
| | | | EM | 94.0 | #4 |
| Natural Language Inference | RTE | PaLM 540B (fine-tuned) | Accuracy | 95.7% | #2 |
| Natural Language Inference | RTE | PaLM 540B (0-shot) | Accuracy | 72.9% | #49 |
| Natural Language Inference | RTE | PaLM 540B (1-shot) | Accuracy | 78.7% | #41 |
| Natural Language Inference | RTE | PaLM 540B (5-shot) | Accuracy | 79.6% | #38 |
| Question Answering | TriviaQA | PaLM-540B (Few-Shot) | EM | 81.4 | #11 |
| Question Answering | TriviaQA | PaLM-540B (One-Shot) | EM | 81.4 | #11 |
| Question Answering | TriviaQA | PaLM-540B (Zero-Shot) | EM | 76.9 | #17 |
| Cross-Lingual Question Answering | TyDiQA-GoldP | PaLM-540B (CoT) | EM | 52.9 | #7 |
| Question Answering | WebQuestions | PaLM-540B (Few-Shot) | EM | 43.5 | #5 |
| Question Answering | WebQuestions | PaLM-540B (One-Shot) | EM | 22.6 | #14 |
| Question Answering | WebQuestions | PaLM-540B (Zero-Shot) | EM | 10.6 | #18 |
| Coreference Resolution | Winograd Schema Challenge | PaLM 540B (5-shot) | Accuracy | 89.5 | #11 |
| Coreference Resolution | Winograd Schema Challenge | PaLM 540B (0-shot) | Accuracy | 89.1 | #12 |
| Coreference Resolution | Winograd Schema Challenge | PaLM 540B (fine-tuned) | Accuracy | 100 | #1 |
| Coreference Resolution | Winograd Schema Challenge | PaLM 540B (1-shot) | Accuracy | 86.3 | #16 |
| Common Sense Reasoning | WinoGrande | PaLM 62B (0-shot) | Accuracy | 77.0 | #21 |
| Common Sense Reasoning | WinoGrande | PaLM 540B (0-shot) | Accuracy | 81.1 | #15 |
| Common Sense Reasoning | WinoGrande | PaLM-cont 62B (0-shot) | Accuracy | 77.0 | #21 |
| Word Sense Disambiguation | Words in Context | PaLM 540B (finetuned) | Accuracy | 78.8 | #2 |
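
Several rows above report exact match (EM) or exact string match scores. As a minimal sketch of how such a percentage could be computed, assuming predictions and references are plain strings, the hypothetical helper below counts identical pairs. Stripping surrounding whitespace is an assumption for illustration; the source does not describe the normalization applied by each benchmark, and real benchmark scorers often normalize further.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions identical to their reference.

    Only surrounding whitespace is stripped before comparison; this is an
    illustrative assumption, not the benchmarks' official scoring rule.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    # Count positions where the stripped strings agree exactly.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

Because the comparison is all-or-nothing per example, a nearly correct answer scores the same as a completely wrong one, which may be one reason EM values in the table can sit well below accuracy-style metrics on related tasks.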

Methods