OpenWebText

Introduced by Aaron Gokaslan et al. in OpenWebText corpus

OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes (38GB).

Source: RoBERTa: A Robustly Optimized BERT Pretraining Approach
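
The selection rule in the description (keep URLs shared on Reddit with at least three upvotes) can be illustrated with a minimal sketch. This is not the actual OpenWebText pipeline: the record layout (`url` and `upvotes` keys), the function name `select_urls`, and the sample data are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List

# Threshold taken from the dataset description: "at least three upvotes".
MIN_UPVOTES = 3


def select_urls(submissions: Iterable[Dict], min_upvotes: int = MIN_UPVOTES) -> List[str]:
    """Return the URLs of submissions that meet the upvote threshold.

    Each submission is assumed (hypothetically) to be a dict with
    a "url" string and an integer "upvotes" count.
    """
    return [s["url"] for s in submissions if s["upvotes"] >= min_upvotes]


# Hypothetical sample records, not real data.
sample = [
    {"url": "https://example.com/a", "upvotes": 5},
    {"url": "https://example.com/b", "upvotes": 2},
    {"url": "https://example.com/c", "upvotes": 3},
]

kept = select_urls(sample)
print(kept)
```

The threshold is inclusive, so a submission with exactly three upvotes is kept. Extracting the page text from each selected URL is a separate step that this sketch does not cover.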
