no code implementations • LREC 2020 • Veronika Laippala, Samuel R{\"o}nnqvist, Saara Hellstr{\"o}m, Juhani Luotolahti, Liina Repo, Anna Salmela, Valtteri Skantsi, Sampo Pyysalo
However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents.