no code implementations • SemEval (NAACL) 2022 • Sami Itkonen, Jörg Tiedemann, Mathias Creutz
This paper describes the University of Helsinki submission to the SemEval 2022 task on multilingual idiomaticity detection.
1 code implementation • WMT (EMNLP) 2020 • Yves Scherrer, Alessandro Raganato, Jörg Tiedemann
This paper reports on our participation with the MUCOW test suite at the WMT 2020 news translation task.
no code implementations • VarDial (COLING) 2020 • Janine Siewert, Yves Scherrer, Martijn Wieling, Jörg Tiedemann
We present a new comprehensive dataset for the unstandardised West Germanic language Low Saxon, covering the last two centuries, the majority of modern dialects and various genres; it will be made openly available in connection with the final version of this paper.
no code implementations • EACL (BSNLP) 2021 • Anna Dmitrieva, Jörg Tiedemann
Parallel language corpora where regular texts are aligned with their simplified versions can be used in both natural language processing and theoretical linguistic studies.
no code implementations • WS (NoDaLiDa) 2019 • Mikko Aulamo, Jörg Tiedemann
This paper presents a flexible and powerful system for creating parallel corpora and for running neural machine translation services.
no code implementations • COLING 2022 • Raúl Vázquez, Hande Celikkanat, Vinit Ravishankar, Mathias Creutz, Jörg Tiedemann
We analyze the learning dynamics of neural language and translation models using Loss Change Allocation (LCA), an indicator that enables a fine-grained analysis of parameter updates when optimizing for the loss function.
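As a rough illustration of how LCA attributes loss change to individual parameters, here is a minimal first-order sketch in PyTorch (the original method uses higher-order integration along the training path; `lca_step` and its arguments are hypothetical names, and all parameters are assumed to receive gradients):

```python
# Minimal first-order LCA sketch: the loss change at one training step is
# decomposed as delta_L ~= sum_i grad_i * delta_theta_i, so each parameter
# receives the share grad_i * delta_theta_i (negative = helped reduce loss).
import torch

def lca_step(model, loss_fn, batch, optimizer):
    inputs, targets = batch
    before = [p.detach().clone() for p in model.parameters()]

    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    optimizer.step()

    # Per-parameter allocation of the loss change for this step.
    return [g * (p.detach() - b)
            for g, p, b in zip(grads, model.parameters(), before)]
```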
no code implementations • EAMT 2020 • Jörg Tiedemann, Santhosh Thottingal
This paper presents OPUS-MT, a project that focuses on the development of free resources and tools for machine translation.
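The released models can be used directly; for example, the OPUS-MT models published on the Hugging Face hub load through the standard Marian classes (shown here with the English-Finnish model):

```python
# Translate with a released OPUS-MT model via Hugging Face Transformers.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fi"  # English -> Finnish
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["This is a test sentence."], return_tensors="pt", padding=True)
outputs = model.generate(**batch)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```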
no code implementations • LREC 2022 • Teemu Vahtola, Eetu Sjöblom, Jörg Tiedemann, Mathias Creutz
Noisy labels in training data present a challenging issue in classification tasks, misleading a model towards incorrect decisions during training.
no code implementations • EAMT 2022 • Raúl Vázquez, Michele Boggia, Alessandro Raganato, Niki A. Loppi, Stig-Arne Grönroos, Jörg Tiedemann
We describe the enhancement of a multilingual NMT toolkit developed as part of the FoTran project.
no code implementations • NAACL (AmericasNLP) 2021 • Raúl Vázquez, Yves Scherrer, Sami Virpioja, Jörg Tiedemann
The University of Helsinki participated in the AmericasNLP shared task for all ten language pairs.
no code implementations • WMT (EMNLP) 2020 • Jörg Tiedemann
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages, together with tools for creating state-of-the-art translation models from that collection.
no code implementations • EMNLP 2021 • Alessandro Raganato, Raúl Vázquez, Mathias Creutz, Jörg Tiedemann
In this paper, we investigate the benefits of an explicit alignment to language labels in Transformer-based MNMT models in the zero-shot context, by jointly training one cross attention head with word alignment supervision to stress the focus on the target language label.
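A hedged sketch of the kind of auxiliary objective this describes: the attention distribution of one chosen cross-attention head is pushed towards gold word alignments (the tensor shapes, `alignment_loss`, and the weighting term `lambda_align` are illustrative assumptions, not the paper's exact formulation):

```python
import torch

def alignment_loss(attn, align, eps=1e-9):
    """attn: (batch, tgt_len, src_len) attention of the supervised head;
    align: 0/1 gold word-alignment matrix of the same shape."""
    # Normalise gold alignments into a distribution per target token.
    gold = align / (align.sum(dim=-1, keepdim=True) + eps)
    # Cross-entropy between gold alignments and the head's attention,
    # averaged over target tokens that have at least one aligned source.
    has_align = align.sum(dim=-1) > 0
    ce = -(gold * (attn + eps).log()).sum(dim=-1)
    return ce[has_align].mean()

# Joint objective: total = nmt_loss + lambda_align * alignment_loss(attn, align)
```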
1 code implementation • EMNLP (BlackboxNLP) 2020 • Hande Celikkanat, Sami Virpioja, Jörg Tiedemann, Marianna Apidianaki
Contextualized word representations encode rich information about syntax and semantics, alongside specificities of each context of use.
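A common way to test such claims is a probing classifier: freeze the representations and check whether a simple model can predict a linguistic property from them. A minimal sketch with stand-in data (the vectors and labels are random placeholders for real contextual embeddings and annotations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))   # placeholder for contextual token vectors
y = rng.integers(0, 2, size=1000)  # placeholder property labels

# A linear probe: high accuracy suggests the property is linearly
# recoverable from the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
print("probe accuracy:", probe.score(X[800:], y[800:]))
```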
no code implementations • NoDaLiDa 2021 • Mikko Aulamo, Sami Virpioja, Yves Scherrer, Jörg Tiedemann
Evaluating the results on an in-domain test set and a small out-of-domain set, we find that RBMT backtranslation clearly outperforms NMT backtranslation on the out-of-domain test set, and slightly even on the in-domain data, although the NMT backtranslation model itself achieved clearly better BLEU scores than the RBMT system.
no code implementations • EAMT 2020 • Maarit Koponen, Umut Sulubacak, Kaisa Vitikainen, Jörg Tiedemann
This paper presents a user evaluation of machine translation and post-editing for TV subtitles.
1 code implementation • 26 Sep 2024 • Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, Barry Haddow
In this work, we introduce EMMA-500, a large-scale multilingual language model continually pre-trained on texts across 546 languages for enhanced multilingual performance, with a focus on improving language coverage for low-resource languages.
1 code implementation • 22 Jul 2024 • Zihao Li, Shaoxiong Ji, Timothee Mickus, Vincent Segonne, Jörg Tiedemann
We ensure that training data and model architectures are comparable, and discuss the downstream performance across 6 languages that we observe in probing and fine-tuning scenarios.
no code implementations • 25 Mar 2024 • Shaoxiong Ji, Timothee Mickus, Vincent Segonne, Jörg Tiedemann
We furthermore provide evidence, through similarity measures and an investigation of model parameters, that this lack of positive influence is due to output separability, which we argue is useful for machine translation but detrimental elsewhere.
no code implementations • 20 Mar 2024 • Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer Van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive.
no code implementations • 12 Mar 2024 • Timothee Mickus, Elaine Zosa, Raúl Vázquez, Teemu Vahtola, Jörg Tiedemann, Vincent Segonne, Alessandro Raganato, Marianna Apidianaki
This paper presents the results of the SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate.
1 code implementation • 12 Mar 2024 • Timothee Mickus, Stig-Arne Grönroos, Joseph Attieh, Michele Boggia, Ona de Gibert, Shaoxiong Ji, Niki Andreas Loppi, Alessandro Raganato, Raúl Vázquez, Jörg Tiedemann
NLP in the age of monolithic large language models is approaching its limits in terms of size and information that can be handled.
no code implementations • 24 Jan 2024 • Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André F. T. Martins, Hinrich Schütze
Large language models (LLMs) have advanced the state of the art in natural language processing.
no code implementations • 20 Apr 2023 • Shaoxiong Ji, Tianlin Zhang, Kailai Yang, Sophia Ananiadou, Erik Cambria, Jörg Tiedemann
In the mental health domain, domain-specific language models are pretrained and released, which facilitates the early detection of mental health conditions.
1 code implementation • 10 Apr 2023 • Aarne Talman, Hande Celikkanat, Sami Virpioja, Markus Heinonen, Jörg Tiedemann
This paper introduces Bayesian uncertainty modeling using Stochastic Weight Averaging-Gaussian (SWAG) in Natural Language Understanding (NLU) tasks.
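In outline, SWAG fits a Gaussian over the weights visited late in training and samples from it at test time. A diagonal-only sketch (the full method also keeps a low-rank covariance term; `DiagonalSWAG` is an illustrative name):

```python
import torch

class DiagonalSWAG:
    """Collects flattened model weights along the SGD trajectory and
    samples from a Gaussian with their running mean and variance."""
    def __init__(self):
        self.n, self.mean, self.sq_mean = 0, None, None

    def collect(self, model):
        flat = torch.cat([p.detach().flatten() for p in model.parameters()])
        if self.mean is None:
            self.mean, self.sq_mean = flat.clone(), flat ** 2
        else:
            self.mean = (self.n * self.mean + flat) / (self.n + 1)
            self.sq_mean = (self.n * self.sq_mean + flat ** 2) / (self.n + 1)
        self.n += 1

    def sample(self):
        var = (self.sq_mean - self.mean ** 2).clamp(min=1e-12)
        # Unflatten back into the model's parameters before evaluating.
        return self.mean + var.sqrt() * torch.randn_like(self.mean)
```

Predictions are then averaged over several sampled weight vectors, and the spread of those predictions provides the uncertainty estimate.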
2 code implementations • 4 Dec 2022 • Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Nieminen, Alessandro Raganato, Yves Scherrer, Raúl Vázquez, Sami Virpioja
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows.
no code implementations • COLING 2022 • Khalid Alnajjar, Mika Hämäläinen, Jörg Tiedemann, Jorma Laaksonen, Mikko Kurimo
Our results show that the model correctly detects whether an utterance is humorous 78% of the time and predicts how long the audience's laughter should last with a mean absolute error of 600 milliseconds.
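A hedged sketch of a joint architecture consistent with those two outputs, assuming a precomputed utterance encoding (the class and layer names are hypothetical; the paper's actual models may differ):

```python
import torch.nn as nn

class HumorModel(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.classify = nn.Linear(dim, 2)  # humorous vs. not humorous
        self.duration = nn.Linear(dim, 1)  # predicted laughter length (ms)

    def forward(self, utterance_vec):
        return self.classify(utterance_vec), self.duration(utterance_vec)

# Training would pair cross-entropy on the class head with an L1 loss on
# the duration head, matching the mean-absolute-error evaluation above.
```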
1 code implementation • *SEM (NAACL) 2022 • Aarne Talman, Marianna Apidianaki, Stergios Chatzikyriakidis, Jörg Tiedemann
A central question in natural language understanding (NLU) research is whether high performance demonstrates the models' strong reasoning capabilities.
no code implementations • RANLP 2013 • Jörg Tiedemann, Preslav Nakov
This paper provides an analysis of character-level machine translation models used in pivot-based translation when applied to sparse and noisy datasets, such as crowdsourced movie subtitles.
1 code implementation • NoDaLiDa 2021 • Aarne Talman, Marianna Apidianaki, Stergios Chatzikyriakidis, Jörg Tiedemann
We propose a new diagnostics test suite that makes it possible to assess whether a dataset constitutes a good testbed for evaluating the models' meaning understanding capabilities.
1 code implementation • COLING 2020 • Emily Öhman, Marc Pàmies, Kaisla Kajava, Jörg Tiedemann
We introduce XED, a multilingual fine-grained emotion dataset.
1 code implementation • 13 Oct 2020 • Jörg Tiedemann
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages, together with tools for creating state-of-the-art translation models from that collection.
no code implementations • SemEval 2020 • Marc Pàmies, Emily Öhman, Kaisla Kajava, Jörg Tiedemann
This paper presents the different models submitted by the LT@Helsinki team for the SemEval 2020 Shared Task 12.
no code implementations • Findings of the Association for Computational Linguistics 2020 • Alessandro Raganato, Yves Scherrer, Jörg Tiedemann
Transformer-based models have brought a radical change to neural machine translation.
no code implementations • 28 Nov 2019 • Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann
Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data.
Ranked #4 on Multimodal Machine Translation on Multi30K
no code implementations • WS 2016 • Liane Guillou, Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, Mauro Cettolo, Bonnie Webber, Andrei Popescu-Belis
We describe the design, the evaluation setup, and the results of the 2016 WMT shared task on cross-lingual pronoun prediction.
1 code implementation • WS (NoDaLiDa) 2019 • Aarne Talman, Antti Suni, Hande Celikkanat, Sofoklis Kakouros, Jörg Tiedemann, Martti Vainio
In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text.
Ranked #1 on Prosody Prediction on Helsinki Prosody Corpus
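To make the task framing concrete, prominence prediction can be cast as word-level sequence labelling; the sketch below reads one sentence per blank-line-separated block (the word<TAB>label file layout is an assumption for illustration, not necessarily the corpus's documented format):

```python
def read_prosody_file(path):
    """Parse word/prominence pairs into sentences for sequence labelling."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:              # blank line ends a sentence
                if current:
                    sentences.append(current)
                current = []
            else:
                word, label = line.split("\t")
                current.append((word, int(label)))
    if current:
        sentences.append(current)
    return sentences
```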
no code implementations • WS 2019 • Aarne Talman, Umut Sulubacak, Raúl Vázquez, Yves Scherrer, Sami Virpioja, Alessandro Raganato, Arvi Hurskainen, Jörg Tiedemann
In this paper, we present the University of Helsinki submissions to the WMT 2019 shared task on news translation in three language pairs: English-German, English-Finnish and Finnish-English.
no code implementations • CL 2019 • Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, Isabelle Augenstein
If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations.
1 code implementation • WS 2019 • Raúl Vázquez, Alessandro Raganato, Jörg Tiedemann, Mathias Creutz
In this paper, we propose a multilingual encoder-decoder architecture that obtains multilingual sentence representations by incorporating an intermediate "attention bridge" shared across all languages.
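In essence, the bridge pools variable-length encoder states into a fixed number of language-independent vectors via structured self-attention. A minimal sketch (the dimensions and the `AttentionBridge` name are illustrative):

```python
import torch
import torch.nn as nn

class AttentionBridge(nn.Module):
    """Pools encoder states H (batch, seq_len, dim) into k shared vectors."""
    def __init__(self, dim=512, hidden=256, k=10):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, k, bias=False)

    def forward(self, H):
        # Attention over the sequence axis for each of the k bridge heads.
        A = torch.softmax(self.w2(torch.tanh(self.w1(H))), dim=1)  # (b, seq, k)
        return A.transpose(1, 2) @ H  # (b, k, dim), fixed size for any input
```

Because the bridge output has a fixed shape, any decoder can attend to it regardless of the source language, which is what makes it shareable.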
no code implementations • IWSLT (EMNLP) 2018 • Umut Sulubacak, Jörg Tiedemann, Aku Rouhe, Stig-Arne Grönroos, Mikko Kurimo
In this paper, we also describe the experiments leading up to our final systems.
Automatic Speech Recognition (ASR) +4
no code implementations • WS 2018 • Stig-Arne Grönroos, Benoit Huet, Mikko Kurimo, Jorma Laaksonen, Bernard Merialdo, Phu Pham, Mats Sjöberg, Umut Sulubacak, Jörg Tiedemann, Raphael Troncy, Raúl Vázquez
Our experiments show that the effect of the visual features in our system is small.
1 code implementation • 27 Aug 2018 • Aarne Talman, Anssi Yli-Jyrä, Jörg Tiedemann
We show that the sentence embeddings learned in this way can be utilized in a wide variety of transfer learning tasks, outperforming InferSent on 7 out of 10 and SkipThought on 8 out of 9 SentEval sentence embedding evaluation tasks.
Ranked #5 on Natural Language Inference on SciTail
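The general recipe behind such transferable embeddings is a recurrent sentence encoder with pooling, trained on NLI; a minimal single-layer sketch (the actual model stacks several BiLSTM layers with max pooling, so this is only the core idea):

```python
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=600):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        states, _ = self.lstm(self.embed(token_ids))
        # Max-pool over time: one fixed-size embedding per sentence.
        return states.max(dim=1).values
```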
no code implementations • WS 2019 • Jörg Tiedemann, Yves Scherrer
In this paper, we investigate whether multilingual neural translation models learn stronger semantic abstractions of sentences than bilingual ones.
no code implementations • 1 Feb 2018 • Jörg Tiedemann
Translations capture important information about languages that can be used as implicit supervision in learning linguistic properties and semantic representations.
1 code implementation • WS 2017 • Robert Östling, Yves Scherrer, Jörg Tiedemann, Gongbo Tang, Tommi Nieminen
We also discuss our submissions for English-Latvian, English-Chinese and Chinese-English.
no code implementations • WS 2017 • Jörg Tiedemann, Yves Scherrer
We investigate the use of extended context in attention-based neural machine translation.
no code implementations • 18 Aug 2017 • Jörg Tiedemann
This paper describes the submission from the University of Helsinki to the shared task on cross-lingual dependency parsing at VarDial 2017.
no code implementations • 18 Aug 2017 • Robert Östling, Jörg Tiedemann
Neural machine translation (NMT) approaches have improved the state of the art in many machine translation settings over the last couple of years, but they require large amounts of training data to produce sensible output.
1 code implementation • IJCNLP 2017 • Yan Shao, Christian Hardmeier, Jörg Tiedemann, Joakim Nivre
We present a character-based model for joint segmentation and POS tagging for Chinese.
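The usual trick in such joint models is to fuse segmentation boundaries and POS into one tag per character, which a single sequence labeller can then predict; a small sketch of that encoding (tag names follow common Chinese treebank conventions, and the helper is illustrative):

```python
def to_joint_tags(words_with_pos):
    """Turn (word, POS) pairs into one boundary+POS tag per character."""
    tags = []
    for word, pos in words_with_pos:
        tags += [f"B-{pos}"] + [f"I-{pos}"] * (len(word) - 1)
    return tags

# "I like cats" segmented and tagged in Chinese:
print(to_joint_tags([("我", "PN"), ("喜欢", "VV"), ("猫", "NN")]))
# -> ['B-PN', 'B-VV', 'I-VV', 'B-NN']
```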
no code implementations • 22 Dec 2016 • Robert Östling, Jörg Tiedemann
Most existing models for multilingual natural language processing (NLP) treat language as a discrete category, and make predictions for either one language or the other.