The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present *the Pile*: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
Tasks
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Language Modelling | The Pile | GPT-3 Davinci 175B (pre-trained) | Bits per byte | 0.7177 | #3
Language Modelling | The Pile | GPT-3 Curie 6.7B (pre-trained) | Bits per byte | 0.7980 | #5
Language Modelling | The Pile | GPT-3 Babbage 1.3B (pre-trained) | Bits per byte | 0.8718 | #7
Language Modelling | The Pile | GPT-3 Ada 350M (pre-trained) | Bits per byte | 0.9631 | #8
Language Modelling | The Pile | GPT-2 XL 1.5B (pre-trained) | Bits per byte | 1.0468 | #9
Language Modelling | The Pile | GPT-2 Large 774M (pre-trained) | Bits per byte | 1.0828 | #10
Language Modelling | The Pile | GPT-2 Medium 355M (pre-trained) | Bits per byte | 1.0928 | #11
Language Modelling | The Pile | GPT-2 Small 124M (pre-trained) | Bits per byte | 1.2253 | #12
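The metric reported above, bits per byte (BPB), normalizes a model's per-token cross-entropy by the UTF-8 byte count of the evaluated text, which makes models with different tokenizers directly comparable. Below is a minimal sketch of that conversion; the function name and the loss/length figures in the usage example are hypothetical, chosen for illustration and not taken from the leaderboard above.

```python
import math

def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) into bits per byte.

    Total nats over the text = loss * n_tokens; dividing by ln(2) converts
    nats to bits, and dividing by the UTF-8 byte count of the same text
    removes the dependence on the tokenizer's segmentation.
    """
    total_bits = loss_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# Hypothetical example: a model with a mean loss of 2.0 nats/token over
# text that tokenizes to 1,000 tokens occupying 4,000 UTF-8 bytes.
print(bits_per_byte(2.0, 1_000, 4_000))  # ~0.7213 bits per byte
```

Because BPB charges a model for the raw bytes it must account for rather than for tokenizer-defined units, a model with a coarser vocabulary (fewer, longer tokens) gains no artificial advantage over one with a finer vocabulary.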