Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies, including novel results on memorization, the effect of term frequency on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights into LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at https://github.com/EleutherAI/pythia.
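The checkpoints are intended to be used like any other causal language model. The sketch below is a minimal example of loading one intermediate checkpoint with Hugging Face Transformers; the repository name "EleutherAI/pythia-70m-deduped" and the per-step revision tag "step3000" are assumptions about how the suite is hosted on the Hugging Face Hub, so adjust them to the model size and training step you want to study.

```python
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Assumed Hub layout: one repository per model size, one revision tag per
# saved training step (e.g. "step3000"). Change these to inspect other
# points along the training trajectory.
REPO = "EleutherAI/pythia-70m-deduped"
REVISION = "step3000"

model = GPTNeoXForCausalLM.from_pretrained(REPO, revision=REVISION)
tokenizer = AutoTokenizer.from_pretrained(REPO, revision=REVISION)

# Quick sanity check: generate a short continuation from the checkpoint.
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(tokens[0]))
```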


Results from the Paper


Ranked #4 on Language Modelling on LAMBADA (Perplexity metric)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Challenge) | Pythia 12B (0-shot) | Accuracy | 31.8 | #45 |
| Common Sense Reasoning | ARC (Challenge) | Pythia 12B (5-shot) | Accuracy | 36.8 | #43 |
| Common Sense Reasoning | ARC (Easy) | Pythia 12B (5-shot) | Accuracy | 71.5 | #24 |
| Common Sense Reasoning | ARC (Easy) | Pythia 12B (0-shot) | Accuracy | 70.2 | #29 |
| Language Modelling | LAMBADA | Pythia 12B (0-shot) | Perplexity | 3.92 | #4 |
| Language Modelling | LAMBADA | Pythia 6.9B (0-shot) | Accuracy | 67.28 | #26 |
| Language Modelling | LAMBADA | Pythia 12B (0-shot) | Accuracy | 70.46 | #22 |
| Language Modelling | LAMBADA | Pythia 6.9B (0-shot) | Perplexity | 4.45 | #8 |
| Question Answering | PIQA | Pythia 12B (0-shot) | Accuracy | 76 | #41 |
| Question Answering | PIQA | Pythia 12B (5-shot) | Accuracy | 76.7 | #39 |
| Question Answering | PIQA | Pythia 6.9B (0-shot) | Accuracy | 75.2 | #44 |
| Question Answering | PIQA | Pythia 1B (5-shot) | Accuracy | 70.4 | #52 |
| Coreference Resolution | Winograd Schema Challenge | Pythia 12B (0-shot) | Accuracy | 54.8 | #71 |
| Coreference Resolution | Winograd Schema Challenge | Pythia 12B (5-shot) | Accuracy | 36.5 | #80 |
| Coreference Resolution | Winograd Schema Challenge | Pythia 6.9B (0-shot) | Accuracy | 36.5 | #80 |
| Coreference Resolution | Winograd Schema Challenge | Pythia 2.8B (0-shot) | Accuracy | 38.5 | #79 |
| Common Sense Reasoning | WinoGrande | Pythia 6.9B (0-shot) | Accuracy | 60.9 | #45 |
| Common Sense Reasoning | WinoGrande | Pythia 12B (0-shot) | Accuracy | 63.9 | #43 |
| Common Sense Reasoning | WinoGrande | Pythia 12B (5-shot) | Accuracy | 66.6 | #39 |
| Common Sense Reasoning | WinoGrande | Pythia 2.8B (0-shot) | Accuracy | 59.4 | #48 |
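Zero- and few-shot scores of this kind can be computed with EleutherAI's lm-evaluation-harness, which the Pythia release pairs with the models. Below is a minimal sketch assuming a recent (v0.4+) harness that exposes simple_evaluate; the task identifiers used ("lambada_openai", "piqa", "winogrande", "arc_easy", "arc_challenge") are assumptions about the harness's task registry, and the exact numbers will depend on the harness version and prompt formatting.

```python
import lm_eval  # EleutherAI lm-evaluation-harness, assumed v0.4+

# Evaluate Pythia 12B zero-shot on a few of the benchmarks listed above.
# Set num_fewshot=5 to approximate the 5-shot rows instead.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-12b",
    tasks=["lambada_openai", "piqa", "winogrande", "arc_easy", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metric dictionaries (accuracy, perplexity, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```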

Methods