Search Results for author: Valtteri Skantsi

Found 6 papers, 2 papers with code

Towards better structured and less noisy Web data: Oscar with Register annotations

no code implementations COLING (WNUT) 2022 Veronika Laippala, Anna Salmela, Samuel Rönnqvist, Alham Fikri Aji, Li-Hsin Chang, Asma Dhifallah, Larissa Goulart, Henna Kortelainen, Marc Pàmies, Deise Prina Dutra, Valtteri Skantsi, Lintang Sutawika, Sampo Pyysalo

Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process.

Multilingual and Zero-Shot is Closing in on Monolingual Web Register Classification

no code implementations NoDaLiDa 2021 Samuel Rönnqvist, Valtteri Skantsi, Miika Oinonen, Veronika Laippala

This article studies register classification of documents from the unrestricted web, such as news articles or opinion blogs, in a multilingual setting, exploring both the benefit of training on multiple languages and the capabilities for zero-shot cross-lingual transfer.

XLM-R Zero-Shot Cross-Lingual Transfer

Finnish Paraphrase Corpus

1 code implementation NoDaLiDa 2021 Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Jenna Saarni, Maija Sevón, Otto Tarkka

Out of all paraphrase pairs in our corpus 98% are manually classified to be paraphrases at least in their given context, if not in all contexts.

From Web Crawl to Clean Register-Annotated Corpora

no code implementations LREC 2020 Veronika Laippala, Samuel Rönnqvist, Saara Hellström, Juhani Luotolahti, Liina Repo, Anna Salmela, Valtteri Skantsi, Sampo Pyysalo

However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents.
