Search Results for author: Nikola Ljubešić

Found 31 papers, 7 papers with code

Exploring Stylometric and Emotion-Based Features for Multilingual Cross-Domain Hate Speech Detection

no code implementations • EACL (WASSA) 2021 • Ilia Markov, Nikola Ljubešić, Darja Fišer, Walter Daelemans

In this paper, we describe experiments designed to evaluate the impact of stylometric and emotion-based features on hate speech detection: the task of classifying textual content into hate or non-hate speech classes.

Hate Speech Detection

Extending the SSJ Universal Dependencies Treebank for Slovenian: Was It Worth It?

no code implementations • LREC (LAW) 2022 • Kaja Dobrovoljc, Nikola Ljubešić

The process was based on the initial revision and documentation of the language-specific UD annotation guidelines for Slovenian and the corresponding modification of the original SSJ annotations, followed by a two-stage annotation campaign, in which two new subsets have been added, the previously unreleased sentences from the ssj500k corpus and the Slovenian subset of the ELEXIS parallel corpus.

Dependency Parsing

Cultural Topic Modelling over Novel Wikipedia Corpora for South-Slavic Languages

no code implementations • RANLP 2021 • Filip Markoski, Elena Markoska, Nikola Ljubešić, Eftim Zdravevski, Ljupco Kocarev

There is a shortage of high-quality corpora for South-Slavic languages.

Cultural Vocal Bursts Intensity Prediction

MultiLexNorm: A Shared Task on Multilingual Lexical Normalization

1 code implementation • EMNLP (WNUT) 2021 • Rob van der Goot, Alan Ramponi, Arkaitz Zubiaga, Barbara Plank, Benjamin Muller, Iñaki San Vicente Roncal, Nikola Ljubešić, Özlem Çetinoğlu, Rahmad Mahendra, Talha Çolakoğlu, Timothy Baldwin, Tommaso Caselli, Wladimir Sidorenko

This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation.

Dependency Parsing Lexical Normalization +2

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

no code implementations • EAMT 2022 • Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages.

The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene

no code implementations • COLING (PEOPLES) 2020 • Nikola Ljubešić, Ilia Markov, Darja Fišer, Walter Daelemans

We further showcase the usage of the lexicons by calculating the difference in emotion distributions in texts containing and not containing socially unacceptable discourse, comparing them across four languages (English, Croatian, Dutch, Slovene) and two topics (migrants and LGBT).

Translation

Findings of the VarDial Evaluation Campaign 2021

no code implementations • EACL (VarDial) 2021 • Bharathi Raja Chakravarthi, Mihaela Gaman, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Ruba Priyadharshini, Christoph Purschke, Eswari Rajagopal, Yves Scherrer, Marcos Zampieri

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021.

Dialect Identification

Social Media Variety Geolocation with geoBERT

no code implementations • EACL (VarDial) 2021 • Yves Scherrer, Nikola Ljubešić

This paper describes the Helsinki–Ljubljana contribution to the VarDial 2021 shared task on social media variety geolocation.

regression

BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

no code implementations • EACL (BSNLP) 2021 • Nikola Ljubešić, Davor Lauc

In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.

Commonsense Causal Reasoning Language Modelling +5

ParlaSpeech-HR - a Freely Available ASR Dataset for Croatian Bootstrapped from the ParlaMint Corpus

no code implementations • ParlaCLARIN (LREC) 2022 • Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec

This paper presents our bootstrapping efforts of producing the first large freely available Croatian automatic speech recognition (ASR) dataset, 1,816 hours in size, obtained from parliamentary transcripts and recordings from the ParlaMint corpus.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models

no code implementations • VarDial (COLING) 2020 • Yves Scherrer, Nikola Ljubešić

This paper describes the Helsinki-Ljubljana contribution to the VarDial shared task on social media variety geolocation.

Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization

no code implementations • WNUT (ACL) 2021 • Yves Scherrer, Nikola Ljubešić

This paper describes the HEL-LJU submissions to the MultiLexNorm shared task on multilingual lexical normalization.

Lexical Normalization token-classification +1

ParlaMint II: The Show Must Go On

no code implementations • ParlaCLARIN (LREC) 2022 • Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden

In ParlaMint I, a CLARIN-ERIC supported project in pandemic times, a set of comparable and uniformly annotated multilingual corpora for 17 national parliaments were developed and released in 2021.

A Report on the VarDial Evaluation Campaign 2020

no code implementations • VarDial (COLING) 2020 • Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

1 code implementation • 8 Apr 2024 • Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord

The world of language models is going through turbulent times, better and ever larger models are coming out at an unprecedented speed.

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

no code implementations • 19 Mar 2024 • Nikola Ljubešić, Taja Kuzman

This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space.

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

no code implementations • 13 Mar 2024 • Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

Large, curated, web-crawled corpora play a vital role in training language models (LMs).

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

1 code implementation • arXiv 2023 • Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages.

Ranked #1 on Named Entity Recognition (NER) on UNER v1 (Danish)

Cross-Lingual NER Multilingual Named Entity Recognition +3

The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings

no code implementations • 18 Sep 2023 • Michal Mochtak, Peter Rupnik, Nikola Ljubešić

The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment, which are used in a series of experiments focused on training a robust sentiment identifier for parliamentary proceedings.

Decision Making Language Modelling +1

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

1 code implementation • 8 Aug 2023 • Luka Terčon, Nikola Ljubešić

We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline.

Findings of the VarDial Evaluation Campaign 2023

no code implementations • 31 May 2023 • Noëmi Aepli, Çağrı Çöltekin, Rob van der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, Marcos Zampieri

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023.

Intent Detection

ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification

no code implementations • 7 Mar 2023 • Taja Kuzman, Igor Mozetič, Nikola Ljubešić

Results show that ChatGPT outperforms the fine-tuned model when applied to the dataset which was not seen before by either of the models.

Language Modelling text-classification +3

Geographic Adaptation of Pretrained Language Models

no code implementations • 16 Mar 2022 • Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, Hinrich Schütze

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone.

Language Identification Language Modelling +2

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

no code implementations • LREC 2022 • Taja Kuzman, Peter Rupnik, Nikola Ljubešić

This paper presents a new training dataset for automatic genre identification GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words.

BERTić -- The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

no code implementations • 19 Apr 2021 • Nikola Ljubešić, Davor Lauc

In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.

Commonsense Causal Reasoning Language Modelling +5

Findings of the 2020 Conference on Machine Translation (WMT20)

no code implementations • EMNLP 2020 • Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, Marcos Zampieri

In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories.

Machine Translation Translation

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context

1 code implementation • LREC 2020 • Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, Kristiina Vaik

State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists.

Word Embeddings Word Sense Disambiguation +1

KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning

no code implementations • 5 Jun 2019 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

This paper presents a dataset and supervised learning experiments for term extraction from Slovene academic texts.

BIG-bench Machine Learning Term Extraction

The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English

no code implementations • 5 Jun 2019 • Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

In this paper we present datasets of Facebook comment threads to mainstream media posts in Slovene and English developed inside the Slovene national project FRENK which cover two topics, migrants and LGBT, and are manually annotated for different types of socially unacceptable discourse (SUD).

Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings

1 code implementation • 9 Jul 2018 • Nikola Ljubešić, Darja Fišer, Anita Peti-Stantić

We show that the notions of concreteness and imageability are highly predictable both within and across languages, with a moderate loss of up to 20% in correlation when predicting across languages.

Cross-Lingual Transfer Word Embeddings

Bleaching Text: Abstract Features for Cross-lingual Gender Prediction

1 code implementation • ACL 2018 • Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, Barbara Plank

Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform-dependent.

Gender Prediction
