Search Results for author: Nikola Ljubešić

Found 35 papers, 9 papers with code

ParlaMint II: The Show Must Go On

no code implementations ParlaCLARIN (LREC) 2022 Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden

In ParlaMint I, a CLARIN-ERIC supported project in pandemic times, a set of comparable and uniformly annotated multilingual corpora for 17 national parliaments were developed and released in 2021.

ParlaSpeech-HR - a Freely Available ASR Dataset for Croatian Bootstrapped from the ParlaMint Corpus

no code implementations ParlaCLARIN (LREC) 2022 Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec

This paper presents our bootstrapping efforts of producing the first large freely available Croatian automatic speech recognition (ASR) dataset, 1,816 hours in size, obtained from parliamentary transcripts and recordings from the ParlaMint corpus.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene

no code implementations COLING (PEOPLES) 2020 Nikola Ljubešić, Ilia Markov, Darja Fišer, Walter Daelemans

We further showcase the usage of the lexicons by calculating the difference in emotion distributions in texts containing and not containing socially unacceptable discourse, comparing them across four languages (English, Croatian, Dutch, Slovene) and two topics (migrants and LGBT).


A Report on the VarDial Evaluation Campaign 2020

no code implementations VarDial (COLING) 2020 Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification

HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models

no code implementations VarDial (COLING) 2020 Yves Scherrer, Nikola Ljubešić

This paper describes the Helsinki-Ljubljana contribution to the VarDial shared task on social media variety geolocation.

BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

no code implementations EACL (BSNLP) 2021 Nikola Ljubešić, Davor Lauc

In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.

Commonsense Causal Reasoning Language Modeling +6

Extending the SSJ Universal Dependencies Treebank for Slovenian: Was It Worth It?

no code implementations LREC (LAW) 2022 Kaja Dobrovoljc, Nikola Ljubešić

The process was based on the initial revision and documentation of the language-specific UD annotation guidelines for Slovenian and the corresponding modification of the original SSJ annotations, followed by a two-stage annotation campaign, in which two new subsets have been added, the previously unreleased sentences from the ssj500k corpus and the Slovenian subset of the ELEXIS parallel corpus.

Dependency Parsing

Social Media Variety Geolocation with geoBERT

no code implementations EACL (VarDial) 2021 Yves Scherrer, Nikola Ljubešić

This paper describes the Helsinki–Ljubljana contribution to the VarDial 2021 shared task on social media variety geolocation.


Exploring Stylometric and Emotion-Based Features for Multilingual Cross-Domain Hate Speech Detection

no code implementations EACL (WASSA) 2021 Ilia Markov, Nikola Ljubešić, Darja Fišer, Walter Daelemans

In this paper, we describe experiments designed to evaluate the impact of stylometric and emotion-based features on hate speech detection: the task of classifying textual content into hate or non-hate speech classes.

Hate Speech Detection

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

no code implementations EAMT 2022 Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages.

CLASSLA-Express: a Train of CLARIN.SI Workshops on Language Resources and Tools with Easily Expanding Route

no code implementations 2 Dec 2024 Nikola Ljubešić, Taja Kuzman, Ivana Filipović Petrović, Jelena Parizoska, Petya Osenova

This paper introduces the CLASSLA-Express workshop series as an innovative approach to disseminating linguistic resources and infrastructure provided by the CLASSLA Knowledge Centre for South Slavic languages and the Slovenian CLARIN.SI infrastructure.

LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

1 code implementation 29 Nov 2024 Taja Kuzman, Nikola Ljubešić

To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation.

News Classification text-classification +2

The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

no code implementations 23 Sep 2024 Nikola Ljubešić, Peter Rupnik, Danijel Koržinek

In this paper, we present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages based on transcripts of parliamentary proceedings and their recordings.

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

1 code implementation 8 Apr 2024 Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord

The world of language models is going through turbulent times, better and ever larger models are coming out at an unprecedented speed.

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

no code implementations 19 Mar 2024 Nikola Ljubešić, Taja Kuzman

This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space.

The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings

no code implementations 18 Sep 2023 Michal Mochtak, Peter Rupnik, Nikola Ljubešić

The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment, which are used in a series of experiments focused on training a robust sentiment identifier for parliamentary proceedings.

Decision Making Language Modeling +2

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

1 code implementation 8 Aug 2023 Luka Terčon, Nikola Ljubešić

We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline.

ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification

no code implementations 7 Mar 2023 Taja Kuzman, Igor Mozetič, Nikola Ljubešić

Results show that ChatGPT outperforms the fine-tuned model when applied to the dataset which was not seen before by either of the models.

Language Modeling Language Modelling +4

Geographic Adaptation of Pretrained Language Models

1 code implementation 16 Mar 2022 Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, Hinrich Schütze

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone.

Language Identification Language Modeling +3

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

no code implementations LREC 2022 Taja Kuzman, Peter Rupnik, Nikola Ljubešić

This paper presents a new training dataset for automatic genre identification GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words.

BERTić -- The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

no code implementations 19 Apr 2021 Nikola Ljubešić, Davor Lauc

In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.

Commonsense Causal Reasoning Language Modeling +6

The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English

no code implementations 5 Jun 2019 Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

In this paper we present datasets of Facebook comment threads to mainstream media posts in Slovene and English developed inside the Slovene national project FRENK which cover two topics, migrants and LGBT, and are manually annotated for different types of socially unacceptable discourse (SUD).

Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings

1 code implementation 9 Jul 2018 Nikola Ljubešić, Darja Fišer, Anita Peti-Stantić

We show that the notions of concreteness and imageability are highly predictable both within and across languages, with a moderate loss of up to 20% in correlation when predicting across languages.

Cross-Lingual Transfer Word Embeddings

Bleaching Text: Abstract Features for Cross-lingual Gender Prediction

1 code implementation ACL 2018 Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, Barbara Plank

Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform-dependent.

Gender Prediction
