OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes. The corpus is 38GB.
Source: RoBERTa: A Robustly Optimized BERT Pretraining Approach