CCNet is a dataset extracted from Common Crawl with a different filtering process than for OSCAR. It was built using a language model trained on Wikipedia, in order to filter out bad quality texts such as code or tables. CCNet contains longer documents on average compared to OSCAR with smaller—and often noisier—documents weeded out.

Source: Martin et al

Papers


Paper Code Results Date Stars

Dataset Loaders


Tasks


Similar Datasets


License


Modalities


Languages