1 code implementation • VarDial (COLING) 2022 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén
This article describes the language identification approach used by the SUKI team in the Identification of Languages and Dialects of Italy and the French Cross-Domain Dialect Identification shared tasks organized as part of the VarDial workshop 2022.
no code implementations • EACL (VarDial) 2021 • Bharathi Raja Chakravarthi, Mihaela Gaman, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Ruba Priyadharshini, Christoph Purschke, Eswari Rajagopal, Yves Scherrer, Marcos Zampieri
This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021.
no code implementations • EACL (VarDial) 2021 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén
This article describes the experiments and systems developed by the SUKI team for the second edition of the Romanian Dialect Identification (RDI) shared task which was organized as part of the 2021 VarDial Evaluation Campaign.
no code implementations • LREC 2022 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén
This paper introduces HeLI-OTS, an off-the-shelf text language identification tool using the HeLI language identification method.
no code implementations • VarDial (COLING) 2020 • Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri
This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.
no code implementations • VarDial (COLING) 2020 • Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén
This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task.
no code implementations • VarDial (COLING) 2020 • Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén
In this paper we describe the systems we used when participating in the VarDial Evaluation Campaign organized as part of the 7th workshop on NLP for similar languages, varieties and dialects.
no code implementations • 24 Mar 2022 • Anssi Moisio, Dejan Porjazovski, Aku Rouhe, Yaroslav Getman, Anja Virkkunen, Tamás Grósz, Krister Lindén, Mikko Kurimo
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus.
no code implementations • 4 Dec 2020 • Krister Lindén, Tommi Jauhiainen, Sam Hardwick
Sentiment analysis and opinion mining is an important task with obvious application areas in social media, e.g., when indicating hate speech and fake news.
no code implementations • 27 Aug 2020 • Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén
This article introduces the Wanca 2017 corpus of texts crawled from the internet from which the sentences in rare Uralic languages for the use of the Uralic Language Identification (ULI) 2020 shared task were collected.
no code implementations • LREC 2020 • Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajič, Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Sandford Pedersen, Inguna Skadiņa, Marko Tadić, Dan Tufiş, Tamás Váradi, Kadri Vider, Andy Way, François Yvon
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality.
2 code implementations • 12 Aug 2019 • Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, Krister Lindén
We present a corpus of Finnish news articles with a manually prepared named entity annotation.
no code implementations • 26 Mar 2019 • Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen
This article describes an unsupervised language model adaptation approach that can be used to enhance the performance of language identification methods.
no code implementations • WS 2019 • Tommi Jauhiainen, Heidi Jauhiainen, Tero Alstola, Krister Lindén
This article introduces a corpus of cuneiform texts from which the dataset for the use of the Cuneiform Language Identification (CLI) 2019 shared task was derived as well as some preliminary language identification experiments conducted using that corpus.
1 code implementation • 22 Apr 2018 • Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén
Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in.