no code implementations • EACL (VarDial) 2021 • Bharathi Raja Chakravarthi, Mihaela Gaman, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Ruba Priyadharshini, Christoph Purschke, Eswari Rajagopal, Yves Scherrer, Marcos Zampieri
This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021.
no code implementations • COLING (PEOPLES) 2020 • Nikola Ljubešić, Ilia Markov, Darja Fišer, Walter Daelemans
We further showcase the usage of the lexicons by calculating the difference in emotion distributions in texts containing and not containing socially unacceptable discourse, comparing them across four languages (English, Croatian, Dutch, Slovene) and two topics (migrants and LGBT).
no code implementations • RANLP 2021 • Filip Markoski, Elena Markoska, Nikola Ljubešić, Eftim Zdravevski, Ljupco Kocarev
There is a shortage of high-quality corpora for South-Slavic languages.
1 code implementation • EMNLP (WNUT) 2021 • Rob van der Goot, Alan Ramponi, Arkaitz Zubiaga, Barbara Plank, Benjamin Muller, Iñaki San Vicente Roncal, Nikola Ljubešić, Özlem Çetinoğlu, Rahmad Mahendra, Talha Çolakoğlu, Timothy Baldwin, Tommaso Caselli, Wladimir Sidorenko
This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation.
no code implementations • WNUT (ACL) 2021 • Yves Scherrer, Nikola Ljubešić
This paper describes the HEL-LJU submissions to the MultiLexNorm shared task on multilingual lexical normalization.
no code implementations • VarDial (COLING) 2020 • Yves Scherrer, Nikola Ljubešić
This paper describes the Helsinki-Ljubljana contribution to the VarDial shared task on social media variety geolocation.
no code implementations • VarDial (COLING) 2020 • Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri
This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.
no code implementations • EACL (WASSA) 2021 • Ilia Markov, Nikola Ljubešić, Darja Fišer, Walter Daelemans
In this paper, we describe experiments designed to evaluate the impact of stylometric and emotion-based features on hate speech detection: the task of classifying textual content into hate or non-hate speech classes.
no code implementations • EACL (BSNLP) 2021 • Nikola Ljubešić, Davor Lauc
In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.
no code implementations • EACL (VarDial) 2021 • Yves Scherrer, Nikola Ljubešić
This paper describes the Helsinki–Ljubljana contribution to the VarDial 2021 shared task on social media variety geolocation.
no code implementations • 16 Mar 2022 • Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, Hinrich Schütze
Geographic linguistic features are commonly used to improve the performance of pretrained language models (PLMs) on NLP tasks where geographic knowledge is intuitively beneficial (e.g., geolocation prediction and dialect feature prediction).
no code implementations • 11 Jan 2022 • Taja Kuzman, Peter Rupnik, Nikola Ljubešić
This paper presents a new training dataset for automatic genre identification, GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words.
no code implementations • 19 Apr 2021 • Nikola Ljubešić, Davor Lauc
In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.
no code implementations • EMNLP 2020 • Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, Marcos Zampieri
In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories.
1 code implementation • LREC 2020 • Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, Kristiina Vaik
State-of-the-art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists.
no code implementations • 5 Jun 2019 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec
In this paper we present datasets of Facebook comment threads to mainstream media posts in Slovene and English, developed inside the Slovene national project FRENK, which cover two topics (migrants and LGBT) and are manually annotated for different types of socially unacceptable discourse (SUD).
no code implementations • 5 Jun 2019 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec
This paper presents a dataset and supervised learning experiments for term extraction from Slovene academic texts.
1 code implementation • 9 Jul 2018 • Nikola Ljubešić, Darja Fišer, Anita Peti-Stantić
We show that the notions of concreteness and imageability are highly predictable both within and across languages, with a moderate loss of up to 20% in correlation when predicting across languages.
1 code implementation • ACL 2018 • Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, Barbara Plank
Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform-dependent.