Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state of the art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and potential mitigation strategies.
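To make the few-shot setup concrete: the model is given k solved examples in its context window and must complete an unsolved query, with no gradient updates. The sketch below is illustrative only; the task, prompt format, and helper function are assumptions, not code from the paper.

```python
# Minimal sketch of few-shot (in-context) prompting, assuming a simple
# Q/A prompt format. Nothing here is PaLM-specific.

def build_few_shot_prompt(examples, query, k=5):
    """Concatenate k solved examples before the query so the model can
    infer the task format purely from context, without any finetuning."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Peru?", k=2)
print(prompt)
# The model's continuation after the final "A:" is taken as its answer.
# k=0 and k=1 correspond to the zero-shot and one-shot rows in the
# results table below.
```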

Google Research, 2022

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Novel Concepts | BIG-bench | PaLM-540B (few-shot, k=5) | Accuracy | 71.9 | # 1 |
| StrategyQA | BIG-bench | PaLM-62B (few-shot, k=5) | Accuracy | 65.4 | # 3 |
| Winowhy | BIG-bench | PaLM-540B (few-shot, k=5) | Accuracy | 65.9 | # 1 |
| Winowhy | BIG-bench | PaLM-62B (few-shot, k=5) | Accuracy | 61.0 | # 3 |
| Known Unknowns | BIG-bench | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | # 1 |
| Hindu Knowledge | BIG-bench | PaLM-540B (few-shot, k=5) | Accuracy | 95.4 | # 1 |
| Hindu Knowledge | BIG-bench | PaLM-62B (few-shot, k=5) | Accuracy | 77.7 | # 3 |
| Logic Grid Puzzle | BIG-bench | PaLM-540B (few-shot, k=5) | Accuracy | 42.4 | # 2 |
| Logic Grid Puzzle | BIG-bench | PaLM-62B (few-shot, k=5) | Accuracy | 36.5 | # 3 |
| Novel Concepts | BIG-bench | PaLM-62B (few-shot, k=5) | Accuracy | 59.4 | # 3 |
| StrategyQA | BIG-bench | PaLM-540B (few-shot, k=5) | Accuracy | 73.9 | # 1 |
| Auto Debugging | BIG-bench Lite | PaLM 8B (few-shot, k=5) | Exact string match | 14.7 | # 3 |
| Auto Debugging | BIG-bench Lite | PaLM 540B (few-shot, k=5) | Exact string match | 38.2 | # 1 |
| Auto Debugging | BIG-bench Lite | PaLM 62B (few-shot, k=5) | Exact string match | 38.2 | # 1 |
| Question Answering | BoolQ | PaLM 540B (finetuned) | Accuracy | 92.2 | # 2 |
| Natural Language Inference | CommitmentBank | PaLM 540B (finetuned) | F1 | 100 | # 1 |
| Natural Language Inference | CommitmentBank | PaLM 540B (finetuned) | Accuracy | 100 | # 1 |
| Question Answering | COPA | PaLM 540B (finetuned) | Accuracy | 100 | # 1 |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-540B | ROUGE-2 | 21.2 | # 2 |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-540B | Parameters | 540 B | # 2 |
| Extreme Summarization | GEM-XSum | T5-XXL | ROUGE-2 | 21.0 | # 3 |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-62B | ROUGE-2 | 18.5 | # 4 |
| Extreme Summarization | GEM-XSum | PaLM (finetuning)-62B | Parameters | 62 B | # 3 |
| Sentence Completion | HellaSwag | PaLM-540B (Few-Shot) | Accuracy | 83.8 | # 5 |
| Sentence Completion | HellaSwag | PaLM-540B (Zero-Shot) | Accuracy | 83.4 | # 7 |
| Sentence Completion | HellaSwag | PaLM-540B (One-Shot) | Accuracy | 83.6 | # 6 |
| Code Generation | HumanEval | PaLM 8B | Pass@1 | 3.6 | # 20 |
| Code Generation | HumanEval | PaLM 8B | Pass@100 | 18.7 | # 13 |
| Code Generation | HumanEval | PaLM-cont 62B | Pass@1 | 23.7 | # 9 |
| Code Generation | HumanEval | PaLM 62B | Pass@1 | 15.9 | # 15 |
| Code Generation | HumanEval | PaLM 62B | Pass@100 | 46.3 | # 9 |
| Code Generation | HumanEval | PaLM 540B | Pass@1 | 26.2 | # 8 |
| Code Generation | HumanEval | PaLM 540B | Pass@100 | 76.2 | # 2 |
| Language Modelling | LAMBADA | PaLM-540B (Few-Shot) | Accuracy | 89.7 | # 1 |
| Language Modelling | LAMBADA | PaLM-540B (One-Shot) | Accuracy | 81.8 | # 5 |
| Language Modelling | LAMBADA | PaLM-540B (Zero-Shot) | Accuracy | 77.9 | # 10 |
| Multi-task Language Understanding | MGSM | PaLM 540B | Average (%) | 55.0 | # 4 |
| Multi-task Language Understanding | MMLU | PaLM 540B (few-shot, k=5) | Humanities | 77.0 | # 1 |
| Multi-task Language Understanding | MMLU | PaLM 540B (few-shot, k=5) | Average (%) | 69.3 | # 12 |
| Multi-task Language Understanding | MMLU | PaLM 540B (few-shot, k=5) | Parameters (Billions) | 540 | # 40 |
| Multi-task Language Understanding | MMLU | PaLM 540B (few-shot, k=5) | STEM | 55.6 | # 7 |
| Multi-task Language Understanding | MMLU | PaLM 540B (few-shot, k=5) | Social Sciences | 81.0 | # 1 |
| Multi-task Language Understanding | MMLU | PaLM 540B (few-shot, k=5) | Other | 69.6 | # 5 |
| Multi-task Language Understanding | MMLU | PaLM 540B (few-shot, k=5) | Tokens (Billions) | 780 | # 5 |
| Question Answering | MultiRC | PaLM 540B (finetuned) | F1 | 90.1 | # 1 |
| Question Answering | MultiRC | PaLM 540B (finetuned) | EM | 69.2 | # 1 |
| Question Answering | Natural Questions | PaLM-540B (Zero-Shot) | EM | 21.2 | # 29 |
| Question Answering | Natural Questions | PaLM-540B (One-Shot) | EM | 29.3 | # 24 |
| Question Answering | Natural Questions | PaLM-540B (Few-Shot, k=64) | EM | 39.6 | # 18 |
| Question Answering | OBQA | PaLM 540B (zero-shot) | Accuracy | 53.4 | # 7 |
| Question Answering | OBQA | PaLM 62B (zero-shot) | Accuracy | 50.4 | # 8 |
| Reading Comprehension | RACE | PaLM 62B (zero-shot) | Accuracy (High) | 47.5 | # 10 |
| Reading Comprehension | RACE | PaLM 62B (zero-shot) | Accuracy (Middle) | 64.3 | # 9 |
| Reading Comprehension | RACE | PaLM 540B (zero-shot) | Accuracy (High) | 49.1 | # 8 |
| Reading Comprehension | RACE | PaLM 540B (zero-shot) | Accuracy (Middle) | 68.1 | # 7 |
| Reading Comprehension | RACE | PaLM 8B (zero-shot) | Accuracy (High) | 42.3 | # 14 |
| Reading Comprehension | RACE | PaLM 8B (zero-shot) | Accuracy (Middle) | 57.9 | # 14 |
| Common Sense Reasoning | ReCoRD | PaLM 540B (finetuned) | F1 | 94.6 | # 1 |
| Common Sense Reasoning | ReCoRD | PaLM 540B (finetuned) | EM | 94.0 | # 2 |
| Natural Language Inference | RTE | PaLM 540B (finetuned) | Accuracy | 95.7 | # 1 |
| Question Answering | TriviaQA | PaLM-540B (Few-Shot) | EM | 81.4 | # 1 |
| Question Answering | TriviaQA | PaLM-540B (Zero-Shot) | EM | 76.9 | # 4 |
| Question Answering | TriviaQA | PaLM-540B (One-Shot) | EM | 81.4 | # 1 |
| Cross-Lingual Question Answering | TyDiQA-GoldP | PaLM-540B (CoT) | EM | 52.9 | # 7 |
| Question Answering | WebQuestions | PaLM-540B (Few-Shot) | EM | 43.5 | # 5 |
| Question Answering | WebQuestions | PaLM-540B (Zero-Shot) | EM | 10.6 | # 14 |
| Question Answering | WebQuestions | PaLM-540B (One-Shot) | EM | 22.6 | # 11 |
| Common Sense Reasoning | WinoGrande | PaLM 62B (zero-shot) | Accuracy | 77.0 | # 4 |
| Common Sense Reasoning | WinoGrande | PaLM-cont 62B (zero-shot) | Accuracy | 77.0 | # 4 |
| Common Sense Reasoning | WinoGrande | PaLM 540B (zero-shot) | Accuracy | 81.1 | # 3 |
| Word Sense Disambiguation | Words in Context | PaLM 540B (finetuned) | Accuracy | 78.8 | # 2 |
| Coreference Resolution | WSC | PaLM 540B | Accuracy | 89.5 | # 1 |
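The HumanEval rows above report Pass@k, which the HumanEval benchmark (Chen et al., 2021) estimates by generating n samples per problem, counting the c that pass the unit tests, and computing the probability that at least one of k randomly drawn samples is correct. A minimal sketch of that standard estimator follows; the sample counts in the usage example are hypothetical, not figures from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al.,
    2021): 1 - C(n-c, k) / C(n, k), the chance that a random size-k
    subset of the n generations contains at least one correct sample."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 55 pass the tests.
print(round(pass_at_k(200, 55, 1), 3))    # 0.275 (equals c/n for k=1)
print(round(pass_at_k(200, 55, 100), 3))  # approaches 1.0 for large k
```

Because the estimator uses exact binomial coefficients on Python integers, it avoids the numerical issues of computing k! terms in floating point directly.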
