Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

NA 2021

Tasks


Abstract Algebra, Anachronisms, Analogical Similarity, Analytic Entailment, Anatomy, Astronomy, BIG-bench Machine Learning, Business Ethics, Causal Judgment, Checkmate In One, Clinical Knowledge, Code Line Descriptions, College Biology, College Chemistry, College Computer Science, College Mathematics, College Medicine, College Physics, Common Sense Reasoning, Computer Security, Conceptual Physics, Crash Blossom, Crass AI, Dark Humor Detection, Date Understanding, Disambiguation QA, Discourse Marker Prediction, Econometrics, Electrical Engineering, Elementary Mathematics, Emotional Intelligence, Empirical Judgments, English Proverbs, Entailed Polarity, Epistemic Reasoning, Ethics, Evaluating Information Essentiality, Fact Checking, Fantasy Reasoning, FEVER (2-way), FEVER (3-way), Figure Of Speech Detection, Formal Fallacies Syllogisms Negation, Formal Logic, General Knowledge, Global Facts, GRE Reading Comprehension, High School Biology, High School Chemistry, High School Computer Science, High School European History, High School Geography, High School Government and Politics, High School Macroeconomics, High School Mathematics, High School Microeconomics, High School Physics, High School Psychology, High School Statistics, High School US History, High School World History, Hindu Knowledge, Human Aging, Human Organs Senses Multiple Choice, Human Sexuality, Hyperbaton, Identify Odd Metaphor, Implicatures, Implicit Relations, Intelligent Communication, Intent Recognition, International Law, Irony Identification, Jurisprudence, Known Unknowns, LAMBADA, Language Modelling, Logical Args, Logical Fallacies, Logical Fallacy Detection, Logical Reasoning, Logical Sequence, Logic Grid Puzzle, Management, Marketing, Mathematical Induction, Mathematical Reasoning, Medical Genetics, Memorization, Metaphor Boolean, Miscellaneous, Misconceptions, Moral Disputes, Moral Permissibility, Moral Scenarios, Movie Dialog Same Or Different, Movie Genre Recommendation System, Movie Recommendation, Multiple Choice Question Answering (MCQA), Multi-task Language Understanding, Natural Questions, Navigate, Nonsense Words Grammar, Novel Concepts, Nutrition, Odd One Out, Penguins In A Table, Philosophy, Phrase Relatedness, Physical Intuition, Physics MC, Prehistory, Presuppositions As NLI, Professional Accounting, Professional Law, Professional Medicine, Professional Psychology, Public Relations, Question Answering, Question Selection, RACE-h, RACE-m, Reading Comprehension, Reasoning About Colored Objects, Riddle Sense, Ruin Names, Sarcasm Detection, Security Studies, Sentence Ambiguity, Sentence Completion, Similarities Abstraction, SNARKS, Sociology, Sports Understanding, StrategyQA, Temporal Sequences, Timedial, TriviaQA, Understanding Fables, US Foreign Policy, Virology, Winowhy, Word Sense Disambiguation, World Religions

Results from the Paper


Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Language Modelling | Arxiv HEP-TH citation graph | Gopher | BPB | 0.662 | #1
Analogical Similarity | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 17.2 | #2
International Law | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 77.7 | #1
High School World History | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 75.1 | #1
High School US History | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 78.9 | #1
High School European History | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 72.1 | #1
TriviaQA | BIG-bench | Gopher-280B (few-shot, k=64) | Accuracy | 57.1 | #1
Similarities Abstraction | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 81.8 | #2
Natural Questions | BIG-bench | Gopher-280B (few-shot, k=64) | Accuracy | 28.2 | #1
Miscellaneous | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 75.7 | #1
Global Facts | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 38.0 | #1
General Knowledge | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 93.9 | #2
Sentence Ambiguity | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 69.1 | #2
Misconceptions | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 61.7 | #2
FEVER (3-way) | BIG-bench | Gopher-280B (few-shot, k=15) | Accuracy | 77.5 | #1
FEVER (2-way) | BIG-bench | Gopher-280B (few-shot, k=10) | Accuracy | 77.5 | #1
Moral Scenarios | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 40.2 | #1
Moral Permissibility | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 55.1 | #2
Moral Disputes | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 66.8 | #1
Dark Humor Detection | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 83.1 | #1
Understanding Fables | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 39.6 | #2
Timedial | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 50.9 | #2
Riddle Sense | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 68.2 | #2
Irony Identification | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 69.7 | #2
Empirical Judgments | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 52.5 | #2
Discourse Marker Prediction | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 11.7 | #2
Crass AI | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 56.8 | #4
US Foreign Policy | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 81.0 | #1
Sociology | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 84.1 | #1
Security Studies | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 64.9 | #1
Public Relations | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 71.8 | #1
Professional Psychology | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 68.1 | #1
Human Sexuality | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 67.2 | #1
High School Psychology | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 81.8 | #1
High School Microeconomics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 66.4 | #1
High School Macroeconomics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 65.1 | #1
High School Government and Politics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 83.9 | #1
High School Geography | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 76.8 | #1
Physics MC | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 50.9 | #2
High School Computer Science | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 54.0 | #1
High School Chemistry | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 47.8 | #1
Conceptual Physics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 49.4 | #1
Computer Security | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 65.0 | #1
College Chemistry | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 45.0 | #1
RACE-m | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 75.1 | #1
RACE-h | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 71.6 | #1
Question Selection | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 41.4 | #2
Phrase Relatedness | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 81.8 | #2
Nonsense Words Grammar | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 61.4 | #2
Movie Dialog Same Or Different | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | #2
LAMBADA | BIG-bench | Gopher-280B (zero-shot) | Accuracy | 74.5 | #2
Intent Recognition | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 88.7 | #2
Implicit Relations | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 36.4 | #2
Implicatures | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 62.0 | #2
GRE Reading Comprehension | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 27.3 | #2
Crash Blossom | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 63.6 | #1
Odd One Out | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 32.5 | #2
Identify Odd Metaphor | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 38.6 | #2
Business Ethics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 70.0 | #1
BIG-bench Machine Learning | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 41.1 | #1
College Mathematics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 37.0 | #1
High School Statistics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 50 | #1
High School Physics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 33.8 | #1
College Physics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 34.3 | #1
High School Mathematics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 23.7 | #1
Electrical Engineering | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 60 | #1
Econometrics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 43 | #1
College Computer Science | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 49 | #1
High School Biology | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 71.3 | #1
College Biology | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 70.8 | #1
Astronomy | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 65.8 | #1
Elementary Mathematics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 33.6 | #1
Figure Of Speech Detection | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 52.7 | #2
Fantasy Reasoning | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 64.1 | #2
English Proverbs | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 57.6 | #2
Virology | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 47.0 | #1
Professional Medicine | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 64.0 | #1
Nutrition | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 69.9 | #1
Medical Genetics | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 69.0 | #1
Human Organs Senses Multiple Choice | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 84.8 | #2
Human Aging | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 66.4 | #1
College Medicine | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 60.1 | #1
Clinical Knowledge | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 67.2 | #1
Anatomy | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 56.3 | #1
Professional Accounting | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 44.3 | #1
Mathematical Induction | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 57.6 | #1
Formal Logic | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 35.7 | #1
Abstract Algebra | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 25.0 | #1
Presuppositions As NLI | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 34.0 | #2
Physical Intuition | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 59.7 | #2
Metaphor Boolean | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 59.3 | #2
Logical Args | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 59.1 | #1
Evaluating Information Essentiality | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 16.7 | #2
Epistemic Reasoning | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 56.4 | #2
Entailed Polarity | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 89.5 | #2
Analytic Entailment | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 53.0 | #2
World Religions | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 84.2 | #1
Professional Law | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 44.5 | #1
Prehistory | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 67.6 | #1
Philosophy | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 68.8 | #1
Marketing | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 83.3 | #1
Management | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 77.7 | #1
Logical Fallacies | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 72.4 | #1
Jurisprudence | BIG-bench | Gopher-280B (few-shot, k=5) | Accuracy | 71.3 | #1
Word Sense Disambiguation | BIG-bench (Anachronisms) | Gopher-280B (few-shot, k=5) | Accuracy | 56.4 | #2
Common Sense Reasoning | BIG-bench (Causal Judgment) | Gopher-280B (few-shot, k=5) | Accuracy | 50.8 | #8
Common Sense Reasoning | BIG-bench (Date Understanding) | Gopher-280B (few-shot, k=5) | Accuracy | 44.1 | #9
Common Sense Reasoning | BIG-bench (Disambiguation QA) | Gopher-280B (few-shot, k=5) | Accuracy | 45.5 | #5
Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Gopher-280B (few-shot, k=5) | Accuracy | 50.7 | #9
Memorization | BIG-bench (Hindu Knowledge) | Gopher-280B (few-shot, k=5) | Accuracy | 80 | #2
Multiple Choice Question Answering (MCQA) | BIG-bench (Hyperbaton) | Gopher-280B (few-shot, k=5) | Accuracy | 51.7 | #9
Common Sense Reasoning | BIG-bench (Known Unknowns) | Gopher-280B (few-shot, k=5) | Accuracy | 63.6 | #3
Logical Reasoning | BIG-bench (Logical Fallacy Detection) | Gopher-280B (few-shot, k=5) | Accuracy | 58.9 | #2
Common Sense Reasoning | BIG-bench (Logical Sequence) | Gopher-280B (few-shot, k=5) | Accuracy | 36.4 | #2
Logical Reasoning | BIG-bench (Logic Grid Puzzle) | Gopher-280B (few-shot, k=5) | Accuracy | 35.1 | #4
Multiple Choice Question Answering (MCQA) | BIG-bench (Movie Recommendation) | Gopher-280B (few-shot, k=5) | Accuracy | 50.5 | #9
Multiple Choice Question Answering (MCQA) | BIG-bench (Navigate) | Gopher-280B (few-shot, k=5) | Accuracy | 51.1 | #5
Multiple Choice Question Answering (MCQA) | BIG-bench (Novel Concepts) | Gopher-280B (few-shot, k=5) | Accuracy | 59.1 | #4
Logical Reasoning | BIG-bench (Penguins In A Table) | Gopher-280B (few-shot, k=5) | Accuracy | 40.6 | #5
Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Gopher-280B (few-shot, k=5) | Accuracy | 49.2 | #4
Multiple Choice Question Answering (MCQA) | BIG-bench (Ruin Names) | Gopher-280B (few-shot, k=5) | Accuracy | 38.6 | #9
Sarcasm Detection | BIG-bench (SNARKS) | Gopher-280B (few-shot, k=5) | Accuracy | 48.3 | #8
Common Sense Reasoning | BIG-bench (Sports Understanding) | Gopher-280B (few-shot, k=5) | Accuracy | 54.9 | #6
Logical Reasoning | BIG-bench (StrategyQA) | Gopher-280B (few-shot, k=5) | Accuracy | 61.0 | #4
Logical Reasoning | BIG-bench (Temporal Sequences) | Gopher-280B (few-shot, k=5) | Accuracy | 19.0 | #9
Common Sense Reasoning | BIG-bench (Winowhy) | Gopher-280B (few-shot, k=5) | Accuracy | 56.7 | #4
Language Modelling | Bookcorpus2 | Gopher | BPB | 0.741 | #1
Language Modelling | Books3 | Gopher | BPB | 0.712 | #1
Question Answering | BoolQ | Gopher (zero-shot) | Accuracy | 79.3 | #27
Language Modelling | Curation Corpus | Gopher | BPB | 0.475 | #1
Language Modelling | DM Mathematics | Gopher | BPB | 1.14 | #1
Language Modelling | FreeLaw | Gopher | BPB | 0.513 | #1
Language Modelling | GitHub | Gopher | BPB | 0.377 | #1
Language Modelling | Gutenberg PG-19 | Gopher | BPB | 0.656 | #1
Language Modelling | HackerNews | Gopher | BPB | 0.890 | #1
Sentence Completion | HellaSwag | Gopher 280B (0-shot) | Accuracy | 79.2 | #42
Multi-task Language Understanding | MMLU | Gopher 0.4B (5-shot) | Average (%) | 25.7 | #102
Multi-task Language Understanding | MMLU | Gopher 1.4B (5-shot) | Average (%) | 27.3 | #95
Multi-task Language Understanding | MMLU | Gopher 280B (5-shot) | Average (%) | 60.0 | #50
Multi-task Language Understanding | MMLU | Gopher 7.1B (5-shot) | Average (%) | 29.5 | #90
Question Answering | Natural Questions | Gopher (few-shot, k=64) | EM | 28.2 | #30
Language Modelling | NIH ExPorter | Gopher | BPB | 0.590 | #1
Language Modelling | OpenSubtitles | Gopher | BPB | 0.899 | #1
Language Modelling | OpenWebtext2 | Gopher | BPB | 0.677 | #1
Language Modelling | PhilPapers | Gopher | BPB | 0.695 | #1
Language Modelling | Pile CC | Gopher | BPB | 0.691 | #1
Question Answering | PIQA | Gopher 280B (0-shot) | Accuracy | 81.8 | #20
Language Modelling | PubMed Central | Gopher | BPB | 0.525 | #1
Language Modelling | PubMed Cognitive Control Abstracts | Gopher | BPB | 0.577 | #1
Question Answering | SIQA | Gopher (zero-shot) | Accuracy | 50.6 | #16
Language Modelling | StackExchange | Gopher | BPB | 0.641 | #1
Question Answering | TruthfulQA | Gopher 280B (zero-shot, Our Prompt + Choices) | MC1 | 0.295 | #8
Question Answering | TruthfulQA | Gopher 1.4B (zero-shot, QA prompts) | MC1 | 0.23 | #13
Question Answering | TruthfulQA | Gopher 7.1B (zero-shot, QA prompts) | MC1 | 0.25 | #11
Question Answering | TruthfulQA | Gopher 280B (zero-shot, QA prompts) | MC1 | 0.27 | #24
Question Answering | TruthfulQA | Gopher 1.4B (zero-shot, Our Prompt + Choices) | MC1 | 0.217 | #16
Question Answering | TruthfulQA | Gopher 7.1B (zero-shot, Our Prompt + Choices) | MC1 | 0.23 | #13
Language Modelling | Ubuntu IRC | Gopher | BPB | 1.09 | #1
Language Modelling | USPTO Backgrounds | Gopher | BPB | 0.546 | #1
Common Sense Reasoning | WinoGrande | Gopher 280B (0-shot) | Accuracy | 70.1 | #37
