Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
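The few-shot setting described above can be made concrete with a small sketch: the task is specified entirely in the prompt text as K solved demonstrations followed by a query, and the model's weights receive no gradient updates. The prompt layout below is an illustration for this abstract, not the paper's exact template.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a few-shot prompt: an instruction, K solved examples, then the query.

    The model sees this as plain text; the "learning" happens only in-context,
    with no fine-tuning or gradient updates.
    """
    lines = [instruction, ""]
    for example_input, example_output in demonstrations:
        lines.append(f"Q: {example_input}")
        lines.append(f"A: {example_output}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model completes the answer from here
    return "\n".join(lines)

# Example: 3-digit arithmetic, one of the on-the-fly tasks the paper tests.
prompt = build_few_shot_prompt(
    "Add the two numbers.",
    [("123 + 456", "579"), ("700 + 201", "901")],
    "314 + 265",
)
print(prompt)
```

Zero-shot and one-shot evaluation differ only in passing zero or one demonstration to the same prompt builder.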

PDF Abstract NeurIPS 2020

Results from the Paper


Ranked #1 on Language Modelling on The Pile (using extra training data)
| Task | Dataset | Model | Metric | Value | Rank |
|---|---|---|---|---|---|
| Natural Language Inference | ANLI test | GPT-3 | A1 | 36.8 | #5 |
|  |  |  | A2 | 34 | #5 |
|  |  |  | A3 | 40.2 | #5 |
| Common Sense Reasoning | ARC (Challenge) | GPT-3 175B (Few-Shot) | Accuracy | 51.5 | #1 |
| Common Sense Reasoning | ARC (Easy) | GPT-3 175B (Few-Shot) | Accuracy | 70.1 | #1 |
| Question Answering | BoolQ | GPT-3 175B (Few-Shot) | Accuracy | 76.4 | #8 |
| Natural Language Inference | CommitmentBank | GPT-3 175B (Few-Shot) | F1 | 52 | #4 |
|  |  |  | Accuracy | 75.6 | #4 |
| Question Answering | COPA | GPT-3 175B (Few-Shot) | Accuracy | 92 | #5 |
| Zero-Shot Learning | COPA | GPT-3 | Accuracy | 73.0 | #4 |
| Question Answering | CoQA | GPT-3 175B (Few-Shot) | Overall | 85 | #1 |
| Question Answering | DROP Test | GPT-3 175B (Few-Shot) | F1 | 36.5 | #9 |
| Sentence Completion | HellaSwag | GPT-3 175B (Few-Shot) | Accuracy | 79.3 | #8 |
| Zero-Shot Learning | HellaSwag | GPT-3 | Accuracy | 51.0 | #2 |
| Language Modelling | LAMBADA | GPT-3 175B (Few-Shot) | Accuracy | 86.4 | #3 |
|  |  |  | Perplexity | 1.92 | #1 |
| Multi-task Language Understanding | MMLU | GPT-3 175B (few-shot, k=5) | Humanities | 40.8 | #6 |
|  |  |  | Average (%) | 43.9 | #5 |
|  |  |  | Parameters (Billions) | 175 | #16 |
|  |  |  | STEM | 36.7 | #5 |
|  |  |  | Social Sciences | 50.4 | #5 |
|  |  |  | Other | 48.8 | #5 |
|  |  |  | Tokens (Billions) | 300 | #2 |
| Multi-task Language Understanding | MMLU | GPT-3 6.7B (fine-tuned) | Humanities | 42.1 | #5 |
|  |  |  | Average (%) | 43.2 | #6 |
|  |  |  | Parameters (Billions) | 6.7 | #8 |
|  |  |  | STEM | 35.1 | #6 |
|  |  |  | Social Sciences | 49.2 | #6 |
|  |  |  | Other | 46.9 | #6 |
|  |  |  | Tokens (Billions) | 300 | #2 |
| Multi-task Language Understanding | MMLU | GPT-3 6.7B (few-shot, k=5) | Humanities | 26.1 | #17 |
|  |  |  | Average (%) | 24.9 | #19 |
|  |  |  | Parameters (Billions) | 6.7 | #8 |
|  |  |  | STEM | 25.6 | #17 |
|  |  |  | Social Sciences | 21.6 | #19 |
|  |  |  | Other | 25.5 | #15 |
|  |  |  | Tokens (Billions) | 300 | #2 |
| Multi-task Language Understanding | MMLU | GPT-3 2.7B (few-shot, k=5) | Humanities | 24.4 | #19 |
|  |  |  | Average (%) | 25.9 | #16 |
|  |  |  | Parameters (Billions) | 2.7 | #6 |
|  |  |  | STEM | 26.0 | #14 |
|  |  |  | Social Sciences | 30.9 | #10 |
|  |  |  | Other | 24.1 | #18 |
|  |  |  | Tokens (Billions) | 300 | #2 |
| Multi-task Language Understanding | MMLU | GPT-3 (fine-tuned) | Humanities | 52.5 | #3 |
|  |  |  | Average (%) | 53.9 | #3 |
|  |  |  | Parameters (Billions) | 175 | #16 |
|  |  |  | STEM | 41.4 | #3 |
|  |  |  | Social Sciences | 63.9 | #3 |
|  |  |  | Other | 57.9 | #3 |
|  |  |  | Tokens (Billions) | 300 | #2 |
| Multi-task Language Understanding | MMLU | GPT-3 13B (few-shot, k=5) | Humanities | 27.1 | #15 |
|  |  |  | Average (%) | 26 | #15 |
|  |  |  | Parameters (Billions) | 13 | #12 |
|  |  |  | STEM | 24.3 | #19 |
|  |  |  | Social Sciences | 25.6 | #16 |
|  |  |  | Other | 26.5 | #14 |
|  |  |  | Tokens (Billions) | 300 | #2 |
| Question Answering | MultiRC | GPT-3 175B (Few-Shot) | F1a | 75.4 | #5 |
| Question Answering | Natural Questions | GPT-3 175B (Few-Shot) | EM | 29.9 | #3 |
| Question Answering | OpenBookQA | GPT-3 175B (Few-Shot) | Accuracy | 65.4 | #3 |
| Language Modelling | Penn Treebank (Word Level) | GPT-3 (Zero-Shot) | Test perplexity | 20.5 | #1 |
|  |  |  | Params | 175000M | #1 |
| Question Answering | PIQA | GPT-3 175B (Few-Shot) | Accuracy | 82.8 | #1 |
| Zero-Shot Learning | PIQA | GPT-3 | Accuracy | 72.9 | #2 |
| Question Answering | QuAC | GPT-3 175B (Few-Shot) | F1 | 44.3 | #2 |
| Question Answering | RACE | GPT-3 175B (Few-Shot) | RACE-m | 58.1 | #6 |
|  |  |  | RACE-h | 46.8 | #5 |
| Zero-Shot Learning | ReCoRD | GPT-3 | Accuracy | 82.1 | #1 |
| Natural Language Inference | RTE | GPT-3 175B (Few-Shot) | Accuracy | 69% | #24 |
| Zero-Shot Learning | StoryCloze | GPT-3 | Accuracy | 72.4 | #2 |
| Question Answering | Story Cloze Test | GPT-3 175B (Few-Shot) | Accuracy | 87.7 | #2 |
| Language Modelling | The Pile | GPT-3 (Zero-Shot) | Bits per byte | 0.7177 | #1 |
| Question Answering | TriviaQA | GPT-3 175B (Few-Shot) | EM | 71.2 | #9 |
| Question Answering | WebQuestions | GPT-3-175B (One-Shot) | EM | 25.3 | #7 |
| Question Answering | WebQuestions | GPT-3-175B (Zero-Shot) | EM | 14.4 | #10 |
| Question Answering | WebQuestions | GPT-3-175B (Few-Shot) | EM | 41.5 | #4 |
| Coreference Resolution | Winograd Schema Challenge | GPT-3 175B (Few-Shot) | Accuracy | 80.1 | #3 |
| Zero-Shot Learning | Winogrande | GPT-3 | Accuracy | 57.4 | #1 |
| Unsupervised Machine Translation | WMT2014 English-French | GPT-3 175B (Few-Shot) | BLEU | 32.6 | #5 |
| Unsupervised Machine Translation | WMT2014 French-English | GPT-3 175B (Few-Shot) | BLEU | 39.2 | #1 |
| Unsupervised Machine Translation | WMT2016 English-German | GPT-3 175B (Few-Shot) | BLEU | 29.7 | #1 |
| Unsupervised Machine Translation | WMT2016 English-Romanian | GPT-3 175B (Few-Shot) | BLEU | 21 | #1 |
| Unsupervised Machine Translation | WMT2016 German-English | GPT-3 175B (Few-Shot) | BLEU | 40.6 | #1 |
| Unsupervised Machine Translation | WMT2016 Romanian-English | GPT-3 175B (Few-Shot) | BLEU | 39.5 | #1 |
| Word Sense Disambiguation | Words in Context | GPT-3 175B (Few-Shot) | Accuracy | 49.4 | #6 |
| Coreference Resolution | WSC | GPT-3 175B (Few-Shot) | Accuracy | 80.1 | #2 |
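The language-modelling rows report two related metrics: perplexity (LAMBADA, Penn Treebank) and bits per byte (The Pile). Both derive from the model's cross-entropy loss; a minimal sketch of the standard conversions (these are textbook definitions, not code from the paper):

```python
import math

def perplexity(mean_nll_per_token):
    """Perplexity is the exponentiated mean negative log-likelihood
    (in nats per token). A perfect model has perplexity 1."""
    return math.exp(mean_nll_per_token)

def bits_per_byte(total_nll_nats, total_bytes):
    """Corpus-level loss converted from nats to bits (divide by ln 2),
    then normalized per UTF-8 byte rather than per token, so models
    with different tokenizers are comparable."""
    return total_nll_nats / (total_bytes * math.log(2))
```

Normalizing per byte rather than per token is what makes The Pile's metric tokenizer-independent, which matters when comparing models with different vocabularies.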
