Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies, including novel results on memorization, the effect of term frequency on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights into LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at https://github.com/EleutherAI/pythia.
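The checkpoints are intended to be used like any other causal language model. The sketch below is a minimal example of loading one intermediate checkpoint with Hugging Face Transformers; the repository name "EleutherAI/pythia-70m-deduped" and the per-step revision tag "step3000" are assumptions about how the suite is hosted on the Hugging Face Hub, so adjust them to the model size and training step you want to study.

```python
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Assumed Hub layout: one repository per model size, one revision tag per
# saved training step (e.g. "step3000"). Change these to inspect other
# points along the training trajectory.
REPO = "EleutherAI/pythia-70m-deduped"
REVISION = "step3000"

model = GPTNeoXForCausalLM.from_pretrained(REPO, revision=REVISION)
tokenizer = AutoTokenizer.from_pretrained(REPO, revision=REVISION)

# Quick sanity check: generate a short continuation from the checkpoint.
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(tokens[0]))
```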


Results from the Paper


Ranked #4 on Language Modelling on LAMBADA (Perplexity metric)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Challenge) | Pythia 12B (0-shot) | Accuracy | 31.8 | #45 |
| Common Sense Reasoning | ARC (Challenge) | Pythia 12B (5-shot) | Accuracy | 36.8 | #43 |
| Common Sense Reasoning | ARC (Easy) | Pythia 12B (5-shot) | Accuracy | 71.5 | #24 |
| Common Sense Reasoning | ARC (Easy) | Pythia 12B (0-shot) | Accuracy | 70.2 | #29 |
| Language Modelling | LAMBADA | Pythia 12B (0-shot) | Perplexity | 3.92 | #4 |
| Language Modelling | LAMBADA | Pythia 6.9B (0-shot) | Accuracy | 67.28 | #26 |
| Language Modelling | LAMBADA | Pythia 12B (0-shot) | Accuracy | 70.46 | #22 |
| Language Modelling | LAMBADA | Pythia 6.9B (0-shot) | Perplexity | 4.45 | #8 |
| Question Answering | PIQA | Pythia 12B (0-shot) | Accuracy | 76 | #41 |
| Question Answering | PIQA | Pythia 12B (5-shot) | Accuracy | 76.7 | #39 |
| Question Answering | PIQA | Pythia 6.9B (0-shot) | Accuracy | 75.2 | #44 |
| Question Answering | PIQA | Pythia 1B (5-shot) | Accuracy | 70.4 | #52 |
| Coreference Resolution | Winograd Schema Challenge | Pythia 12B (0-shot) | Accuracy | 54.8 | #71 |
| Coreference Resolution | Winograd Schema Challenge | Pythia 12B (5-shot) | Accuracy | 36.5 | #80 |
| Coreference Resolution | Winograd Schema Challenge | Pythia 6.9B (0-shot) | Accuracy | 36.5 | #80 |
| Coreference Resolution | Winograd Schema Challenge | Pythia 2.8B (0-shot) | Accuracy | 38.5 | #79 |
| Common Sense Reasoning | WinoGrande | Pythia 6.9B (0-shot) | Accuracy | 60.9 | #45 |
| Common Sense Reasoning | WinoGrande | Pythia 12B (0-shot) | Accuracy | 63.9 | #43 |
| Common Sense Reasoning | WinoGrande | Pythia 12B (5-shot) | Accuracy | 66.6 | #39 |
| Common Sense Reasoning | WinoGrande | Pythia 2.8B (0-shot) | Accuracy | 59.4 | #48 |
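Zero- and few-shot scores of this kind can be computed with EleutherAI's lm-evaluation-harness, which the Pythia release pairs with the models. Below is a minimal sketch assuming a recent (v0.4+) harness that exposes simple_evaluate; the task identifiers used ("lambada_openai", "piqa", "winogrande", "arc_easy", "arc_challenge") are assumptions about the harness's task registry, and the exact numbers will depend on the harness version and prompt formatting.

```python
import lm_eval  # EleutherAI lm-evaluation-harness, assumed v0.4+

# Evaluate Pythia 12B zero-shot on a few of the benchmarks listed above.
# Set num_fewshot=5 to approximate the 5-shot rows instead.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-12b",
    tasks=["lambada_openai", "piqa", "winogrande", "arc_easy", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metric dictionaries (accuracy, perplexity, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```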

Methods