no code implementations • ParlaCLARIN (LREC) 2022 • Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden
In ParlaMint I, a CLARIN-ERIC supported project in pandemic times, a set of comparable and uniformly annotated multilingual corpora for 17 national parliaments was developed and released in 2021.
no code implementations • ParlaCLARIN (LREC) 2022 • Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec
This paper presents our bootstrapping efforts of producing the first large freely available Croatian automatic speech recognition (ASR) dataset, 1,816 hours in size, obtained from parliamentary transcripts and recordings from the ParlaMint corpus.
no code implementations • COLING (PEOPLES) 2020 • Nikola Ljubešić, Ilia Markov, Darja Fišer, Walter Daelemans
We further showcase the usage of the lexicons by calculating the difference in emotion distributions in texts containing and not containing socially unacceptable discourse, comparing them across four languages (English, Croatian, Dutch, Slovene) and two topics (migrants and LGBT).
no code implementations • VarDial (COLING) 2020 • Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri
This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.
no code implementations • VarDial (COLING) 2020 • Yves Scherrer, Nikola Ljubešić
This paper describes the Helsinki-Ljubljana contribution to the VarDial shared task on social media variety geolocation.
no code implementations • WNUT (ACL) 2021 • Yves Scherrer, Nikola Ljubešić
This paper describes the HEL-LJU submissions to the MultiLexNorm shared task on multilingual lexical normalization.
1 code implementation • EMNLP (WNUT) 2021 • Rob van der Goot, Alan Ramponi, Arkaitz Zubiaga, Barbara Plank, Benjamin Muller, Iñaki San Vicente Roncal, Nikola Ljubešić, Özlem Çetinoğlu, Rahmad Mahendra, Talha Çolakoğlu, Timothy Baldwin, Tommaso Caselli, Wladimir Sidorenko
This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation.
no code implementations • RANLP 2021 • Filip Markoski, Elena Markoska, Nikola Ljubešić, Eftim Zdravevski, Ljupco Kocarev
There is a shortage of high-quality corpora for South-Slavic languages.
no code implementations • EACL (BSNLP) 2021 • Nikola Ljubešić, Davor Lauc
In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.
no code implementations • LREC (LAW) 2022 • Kaja Dobrovoljc, Nikola Ljubešić
The process was based on the initial revision and documentation of the language-specific UD annotation guidelines for Slovenian and the corresponding modification of the original SSJ annotations, followed by a two-stage annotation campaign, in which two new subsets have been added, the previously unreleased sentences from the ssj500k corpus and the Slovenian subset of the ELEXIS parallel corpus.
no code implementations • EACL (VarDial) 2021 • Bharathi Raja Chakravarthi, Mihaela Gaman, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Ruba Priyadharshini, Christoph Purschke, Eswari Rajagopal, Yves Scherrer, Marcos Zampieri
This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021.
no code implementations • EACL (VarDial) 2021 • Yves Scherrer, Nikola Ljubešić
This paper describes the Helsinki–Ljubljana contribution to the VarDial 2021 shared task on social media variety geolocation.
no code implementations • EACL (WASSA) 2021 • Ilia Markov, Nikola Ljubešić, Darja Fišer, Walter Daelemans
In this paper, we describe experiments designed to evaluate the impact of stylometric and emotion-based features on hate speech detection: the task of classifying textual content into hate or non-hate speech classes.
no code implementations • EAMT 2022 • Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza
We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages.
no code implementations • 2 Dec 2024 • Nikola Ljubešić, Taja Kuzman, Ivana Filipović Petrović, Jelena Parizoska, Petya Osenova
This paper introduces the CLASSLA-Express workshop series as an innovative approach to disseminating linguistic resources and infrastructure provided by the CLASSLA Knowledge Centre for South Slavic languages and the Slovenian CLARIN.SI infrastructure.
1 code implementation • 29 Nov 2024 • Taja Kuzman, Nikola Ljubešić
To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation.
no code implementations • 23 Sep 2024 • Nikola Ljubešić, Peter Rupnik, Danijel Koržinek
In this paper, we present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages based on transcripts of parliamentary proceedings and their recordings.
no code implementations • 12 May 2024 • Çağrı Çöltekin, Matyáš Kopp, Katja Meden, Vaidas Morkevicius, Nikola Ljubešić, Tomaž Erjavec
We introduce a dataset on political orientation and power position identification.
1 code implementation • 8 Apr 2024 • Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord
The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed.
no code implementations • 19 Mar 2024 • Nikola Ljubešić, Taja Kuzman
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, thereby covering the whole spectrum of official languages in the South Slavic language space.
no code implementations • 13 Mar 2024 • Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
Large, curated, web-crawled corpora play a vital role in training language models (LMs).
2 code implementations • arXiv 2023 • Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages.
Ranked #1 on Named Entity Recognition (NER) on UNER v1 (Danish)
no code implementations • 18 Sep 2023 • Michal Mochtak, Peter Rupnik, Nikola Ljubešić
The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment, which are used in a series of experiments focused on training a robust sentiment identifier for parliamentary proceedings.
1 code implementation • 8 Aug 2023 • Luka Terčon, Nikola Ljubešić
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline.
no code implementations • 31 May 2023 • Noëmi Aepli, Çağrı Çöltekin, Rob van der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, Marcos Zampieri
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023.
no code implementations • 7 Mar 2023 • Taja Kuzman, Igor Mozetič, Nikola Ljubešić
Results show that ChatGPT outperforms the fine-tuned model when applied to a dataset that neither of the models had seen before.
1 code implementation • 16 Mar 2022 • Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, Hinrich Schütze
While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone.
no code implementations • LREC 2022 • Taja Kuzman, Peter Rupnik, Nikola Ljubešić
This paper presents GINCO, a new training dataset for automatic genre identification, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words.
no code implementations • 19 Apr 2021 • Nikola Ljubešić, Davor Lauc
In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.
no code implementations • EMNLP 2020 • Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, Marcos Zampieri
In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories.
1 code implementation • LREC 2020 • Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, Kristiina Vaik
State-of-the-art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists.
no code implementations • 5 Jun 2019 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec
In this paper we present datasets of Facebook comment threads to mainstream media posts in Slovene and English developed inside the Slovene national project FRENK which cover two topics, migrants and LGBT, and are manually annotated for different types of socially unacceptable discourse (SUD).
no code implementations • 5 Jun 2019 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec
This paper presents a dataset and supervised learning experiments for term extraction from Slovene academic texts.
1 code implementation • 9 Jul 2018 • Nikola Ljubešić, Darja Fišer, Anita Peti-Stantić
We show that the notions of concreteness and imageability are highly predictable both within and across languages, with a moderate loss of up to 20% in correlation when predicting across languages.
1 code implementation • ACL 2018 • Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, Barbara Plank
Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform-dependent.