no code implementations • EMNLP (BlackboxNLP) 2020 • Hande Celikkanat, Sami Virpioja, Jörg Tiedemann, Marianna Apidianaki
Contextualized word representations encode rich information about syntax and semantics, alongside specificities of each context of use.
no code implementations • NAACL (SIGMORPHON) 2022 • Aku Rouhe, Stig-Arne Grönroos, Sami Virpioja, Mathias Creutz, Mikko Kurimo
Our approach is to pre-segment the input data for a neural sequence-to-sequence model with the unsupervised method.
Ranked #1 on Morpheme Segmentation on UniMorph 4.0 (f1 macro avg (subtask 2) metric)
no code implementations • NAACL (AmericasNLP) 2021 • Raúl Vázquez, Yves Scherrer, Sami Virpioja, Jörg Tiedemann
The University of Helsinki participated in the AmericasNLP shared task for all ten language pairs.
no code implementations • EURALI (LREC) 2022 • Juho Leinonen, Niko Partanen, Sami Virpioja, Mikko Kurimo
Cross-language forced alignment is a solution for linguists who create speech corpora for very low-resource languages.
1 code implementation • NoDaLiDa 2021 • Juho Leinonen, Sami Virpioja, Mikko Kurimo
Forced alignment is an effective process to speed up linguistic research.
no code implementations • NoDaLiDa 2021 • Mikko Aulamo, Sami Virpioja, Yves Scherrer, Jörg Tiedemann
Evaluating the results on an in-domain test set and a small out-of-domain set, we find that RBMT backtranslation clearly outperforms NMT backtranslation on the out-of-domain test set, and also slightly on the in-domain data, for which the NMT backtranslation model itself produced clearly better BLEU scores than the RBMT system.
no code implementations • WMT (EMNLP) 2020 • Yves Scherrer, Stig-Arne Grönroos, Sami Virpioja
This paper describes the joint participation of the University of Helsinki and Aalto University in two shared tasks of WMT 2020: news translation between Inuktitut and English, and low-resource translation between German and Upper Sorbian.
1 code implementation • 10 Apr 2023 • Aarne Talman, Hande Celikkanat, Sami Virpioja, Markus Heinonen, Jörg Tiedemann
This paper introduces Bayesian uncertainty modeling using Stochastic Weight Averaging-Gaussian (SWAG) in Natural Language Understanding (NLU) tasks.
2 code implementations • 4 Dec 2022 • Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Nieminen, Alessandro Raganato, Yves Scherrer, Raul Vazquez, Sami Virpioja
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows.
1 code implementation • 19 Aug 2020 • Katri Leino, Juho Leinonen, Mittul Singh, Sami Virpioja, Mikko Kurimo
Using this corpus, we also construct a retrieval-based evaluation task for Finnish chatbot development.
no code implementations • LREC 2020 • Mittul Singh, Peter Smit, Sami Virpioja, Mikko Kurimo
We, however, show that for character-based NNLMs, only pretraining with a related language improves the ASR performance, and using an unrelated language may deteriorate it.
Automatic Speech Recognition (ASR)
no code implementations • ACL 2020 • Mikko Aulamo, Sami Virpioja, Jörg Tiedemann
We demonstrate the effectiveness of OpusFilter on the example of a Finnish-English news translation task based on noisy web-crawled training data.
1 code implementation • 28 May 2020 • Mittul Singh, Sami Virpioja, Peter Smit, Mikko Kurimo
On these tasks, interpolating the baseline RNNLM approximation and a conventional LM outperforms the conventional LM in terms of the Maximum Term Weighted Value for single-character subwords.
no code implementations • LREC 2020 • Mikko Aulamo, Umut Sulubacak, Sami Virpioja, Jörg Tiedemann
We show the use of these tools in parallel corpus creation and data diagnostics.
1 code implementation • 8 Apr 2020 • Stig-Arne Grönroos, Sami Virpioja, Mikko Kurimo
There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary.
1 code implementation • LREC 2020 • Stig-Arne Grönroos, Sami Virpioja, Mikko Kurimo
Using English, Finnish, North Sami, and Turkish data sets, we show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model than its original recursive training algorithm.
Automatic Speech Recognition (ASR)
no code implementations • WS 2019 • Yves Scherrer, Raúl Vázquez, Sami Virpioja
This paper describes the University of Helsinki Language Technology group's participation in the WMT 2019 similar language translation task.
no code implementations • WS 2019 • Aarne Talman, Umut Sulubacak, Raúl Vázquez, Yves Scherrer, Sami Virpioja, Alessandro Raganato, Arvi Hurskainen, Jörg Tiedemann
In this paper, we present the University of Helsinki submissions to the WMT 2019 shared task on news translation in three language pairs: English-German, English-Finnish and Finnish-English.
no code implementations • WS 2018 • Stig-Arne Grönroos, Sami Virpioja, Mikko Kurimo
This article describes the Aalto University entry to the WMT18 News Translation Shared Task.
no code implementations • 13 Jul 2017 • Seppo Enarvi, Peter Smit, Sami Virpioja, Mikko Kurimo
Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundreds of thousands of words.
Automatic Speech Recognition (ASR)