WebText is an internal OpenAI corpus created by scraping web pages with emphasis on document quality. The authors scraped all outbound links from Reddit which received at least 3 karma. The authors used the approach as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.