Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and the model's behaviour, covering the intersection of model scale with bias and toxicity. Finally, we discuss the application of language models to AI safety and the mitigation of downstream harms.


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Language Modelling | arXiv | Gopher | BPB | 0.662 | # 1 |
| Identify Odd Metaphor | BIG-bench | Gopher | Accuracy | 38.6 | # 1 |
| Odd One Out | BIG-bench | Gopher | Accuracy | 32.5 | # 1 |
| Anachronisms | BIG-bench | Gopher | Accuracy | 56.4 | # 1 |
| Causal Judgment | BIG-bench | Gopher | Accuracy | 50.8 | # 1 |
| Crash Blossom | BIG-bench | Gopher | Accuracy | 63.6 | # 1 |
| Analogical Similarity | BIG-bench | Gopher | Accuracy | 17.2 | # 1 |
| Language Modelling | Bookcorpus2 | Gopher | BPB | 0.741 | # 1 |
| Language Modelling | Books3 | Gopher | BPB | 0.712 | # 1 |
| Language Modelling | Curation Corpus | Gopher | BPB | 0.475 | # 1 |
| Language Modelling | DM Mathematics | Gopher | BPB | 1.14 | # 1 |
| Language Modelling | FreeLaw | Gopher | BPB | 0.513 | # 1 |
| Language Modelling | GitHub | Gopher | BPB | 0.377 | # 1 |
| Language Modelling | Gutenberg PG-19 | Gopher | BPB | 0.656 | # 1 |
| Language Modelling | HackerNews | Gopher | BPB | 0.890 | # 1 |
| Multi-task Language Understanding | MMLU | Gopher-1.4B (few-shot, k=5) | Humanities | 27.5 | # 13 |
| Multi-task Language Understanding | MMLU | Gopher-1.4B (few-shot, k=5) | Average (%) | 27.3 | # 12 |
| Multi-task Language Understanding | MMLU | Gopher-1.4B (few-shot, k=5) | Parameters (Billions) | 1.4 | # 4 |
| Multi-task Language Understanding | MMLU | Gopher-1.4B (few-shot, k=5) | STEM | 26.6 | # 13 |
| Multi-task Language Understanding | MMLU | Gopher-1.4B (few-shot, k=5) | Social Sciences | 30.0 | # 12 |
| Multi-task Language Understanding | MMLU | Gopher-1.4B (few-shot, k=5) | Other | 24.7 | # 17 |
| Multi-task Language Understanding | MMLU | Gopher-1.4B (few-shot, k=5) | Tokens (Billions) | 300 | # 2 |
| Multi-task Language Understanding | MMLU | Gopher (few-shot, k=5) | Humanities | 65.8 | # 2 |
| Multi-task Language Understanding | MMLU | Gopher (few-shot, k=5) | Average (%) | 60.0 | # 2 |
| Multi-task Language Understanding | MMLU | Gopher (few-shot, k=5) | Parameters (Billions) | 280 | # 18 |
| Multi-task Language Understanding | MMLU | Gopher (few-shot, k=5) | STEM | 48.0 | # 2 |
| Multi-task Language Understanding | MMLU | Gopher (few-shot, k=5) | Social Sciences | 71.2 | # 2 |
| Multi-task Language Understanding | MMLU | Gopher (few-shot, k=5) | Other | 64.0 | # 2 |
| Multi-task Language Understanding | MMLU | Gopher (few-shot, k=5) | Tokens (Billions) | 300 | # 2 |
| Multi-task Language Understanding | MMLU | Gopher-0.4B (few-shot, k=5) | Humanities | 26.6 | # 16 |
| Multi-task Language Understanding | MMLU | Gopher-0.4B (few-shot, k=5) | Average (%) | 25.7 | # 17 |
| Multi-task Language Understanding | MMLU | Gopher-0.4B (few-shot, k=5) | Parameters (Billions) | 0.4 | # 3 |
| Multi-task Language Understanding | MMLU | Gopher-0.4B (few-shot, k=5) | STEM | 26.0 | # 14 |
| Multi-task Language Understanding | MMLU | Gopher-0.4B (few-shot, k=5) | Social Sciences | 23.4 | # 18 |
| Multi-task Language Understanding | MMLU | Gopher-0.4B (few-shot, k=5) | Other | 24.1 | # 18 |
| Multi-task Language Understanding | MMLU | Gopher-0.4B (few-shot, k=5) | Tokens (Billions) | 300 | # 2 |
| Multi-task Language Understanding | MMLU | Gopher-7.1B (few-shot, k=5) | Humanities | 28.0 | # 10 |
| Multi-task Language Understanding | MMLU | Gopher-7.1B (few-shot, k=5) | Average (%) | 29.5 | # 9 |
| Multi-task Language Understanding | MMLU | Gopher-7.1B (few-shot, k=5) | Parameters (Billions) | 7.1 | # 10 |
| Multi-task Language Understanding | MMLU | Gopher-7.1B (few-shot, k=5) | STEM | 30.1 | # 9 |
| Multi-task Language Understanding | MMLU | Gopher-7.1B (few-shot, k=5) | Social Sciences | 31.0 | # 9 |
| Multi-task Language Understanding | MMLU | Gopher-7.1B (few-shot, k=5) | Other | 31.0 | # 9 |
| Multi-task Language Understanding | MMLU | Gopher-7.1B (few-shot, k=5) | Tokens (Billions) | 300 | # 2 |
| Language Modelling | NIH ExPorter | Gopher | BPB | 0.590 | # 1 |
| Language Modelling | OpenSubtitles | Gopher | BPB | 0.899 | # 1 |
| Language Modelling | OpenWebtext2 | Gopher | BPB | 0.677 | # 1 |
| Language Modelling | PhilPapers | Gopher | BPB | 0.695 | # 1 |
| Language Modelling | Pile CC | Gopher | BPB | 0.691 | # 1 |
| Language Modelling | PubMed Abstracts | Gopher | BPB | 0.577 | # 1 |
| Language Modelling | PubMed Central | Gopher | BPB | 0.525 | # 1 |
| Language Modelling | StackExchange | Gopher | BPB | 0.641 | # 1 |
| Language Modelling | Ubuntu IRC | Gopher | BPB | 1.09 | # 1 |
| Language Modelling | USPTO Backgrounds | Gopher | BPB | 0.546 | # 1 |
| Language Modelling | WikiText-103 | Gopher | BPB | 0.566 | # 1 |
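
The language-modelling rows above report bits per byte (BPB): the model's cross-entropy over the evaluation text, converted from nats to bits and divided by the text's length in UTF-8 bytes, which makes scores comparable across models with different tokenizers (lower is better). The sketch below is illustrative only; the function name and the numbers in the example are assumptions, not values from the paper.

```python
import math

def bits_per_byte(total_log_prob_nats: float, num_utf8_bytes: int) -> float:
    """Convert a summed token log-probability (in nats) into bits per byte.

    Normalising by raw UTF-8 bytes rather than by tokens keeps the metric
    independent of the tokenizer, so differently tokenized models can be
    compared directly (lower is better).
    """
    total_bits = -total_log_prob_nats / math.log(2)  # nats -> bits
    return total_bits / num_utf8_bytes

# Hypothetical document: 10,000 UTF-8 bytes scored at a total log-probability
# of -4,600 nats by some model.
print(bits_per_byte(-4600.0, 10_000))  # ~0.664 bits per byte
```

In the MMLU rows, "few-shot, k=5" refers to the standard MMLU protocol: each test question is preceded in the prompt by five example question-answer pairs from the same subject, and accuracy is the fraction of questions on which the model's preferred answer matches the correct multiple-choice option.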

Methods


No methods listed for this paper.