WebText

Introduced by Radford et al. in Language Models are Unsupervised Multitask Learners

WebText is an internal OpenAI corpus created by scraping web pages with an emphasis on document quality. The authors scraped all outbound links from Reddit that had received at least 3 karma, treating that threshold as a heuristic indicator of whether other users found the link interesting, educational, or just funny.

WebText contains the text subset of these 45 million links. It consists of over 8 million documents totaling 40 GB of text. All Wikipedia documents were removed from WebText because Wikipedia is a common data source for other datasets.
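The selection rule described above (keep outbound links with at least 3 karma, exclude Wikipedia) can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the `(url, karma)` input shape, the function names, and the choice to detect Wikipedia by hostname are all assumptions made for the example.

```python
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold stated in the description; the constant name is hypothetical


def is_wikipedia(url: str) -> bool:
    """Assumption: a Wikipedia document is identified by its hostname."""
    host = (urlparse(url).hostname or "").lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def select_links(posts):
    """Filter hypothetical (url, karma) pairs using the two rules described above."""
    kept = []
    for url, karma in posts:
        if karma < MIN_KARMA:
            continue  # too little karma to count as a quality signal
        if is_wikipedia(url):
            continue  # Wikipedia is excluded from the corpus
        kept.append(url)
    return kept


if __name__ == "__main__":
    sample = [
        ("https://example.com/a", 5),
        ("https://example.com/b", 2),
        ("https://en.wikipedia.org/wiki/X", 10),
        ("https://example.org/c", 3),
    ]
    print(select_links(sample))
```

Extracting the text from each kept page is a separate step that this sketch does not cover.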

Similar Datasets

The Pile

OpenWebText

License

Private

Modalities

Texts

Languages

English