CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

LREC 2020. Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave

Pre-training text representations has led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora, as long as the quality of those corpora is preserved...

