Search Results for author: Nikola Ljubešić

Found 29 papers, 6 papers with code

Exploring Stylometric and Emotion-Based Features for Multilingual Cross-Domain Hate Speech Detection

no code implementations EACL (WASSA) 2021 Ilia Markov, Nikola Ljubešić, Darja Fišer, Walter Daelemans

In this paper, we describe experiments designed to evaluate the impact of stylometric and emotion-based features on hate speech detection: the task of classifying textual content into hate or non-hate speech classes.

Hate Speech Detection

Social Media Variety Geolocation with geoBERT

no code implementations EACL (VarDial) 2021 Yves Scherrer, Nikola Ljubešić

This paper describes the Helsinki–Ljubljana contribution to the VarDial 2021 shared task on social media variety geolocation.

regression

BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

no code implementations EACL (BSNLP) 2021 Nikola Ljubešić, Davor Lauc

In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.

Commonsense Causal Reasoning Language Modelling +5

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

no code implementations EAMT 2022 Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages.

ParlaMint II: The Show Must Go On

no code implementations ParlaCLARIN (LREC) 2022 Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden

In ParlaMint I, a CLARIN-ERIC supported project in pandemic times, a set of comparable and uniformly annotated multilingual corpora for 17 national parliaments was developed and released in 2021.

Extending the SSJ Universal Dependencies Treebank for Slovenian: Was It Worth It?

no code implementations LREC (LAW) 2022 Kaja Dobrovoljc, Nikola Ljubešić

The process was based on the initial revision and documentation of the language-specific UD annotation guidelines for Slovenian and the corresponding modification of the original SSJ annotations, followed by a two-stage annotation campaign, in which two new subsets have been added, the previously unreleased sentences from the ssj500k corpus and the Slovenian subset of the ELEXIS parallel corpus.

Dependency Parsing

ParlaSpeech-HR - a Freely Available ASR Dataset for Croatian Bootstrapped from the ParlaMint Corpus

no code implementations ParlaCLARIN (LREC) 2022 Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec

This paper presents our bootstrapping efforts of producing the first large freely available Croatian automatic speech recognition (ASR) dataset, 1,816 hours in size, obtained from parliamentary transcripts and recordings from the ParlaMint corpus.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene

no code implementations COLING (PEOPLES) 2020 Nikola Ljubešić, Ilia Markov, Darja Fišer, Walter Daelemans

We further showcase the usage of the lexicons by calculating the difference in emotion distributions in texts containing and not containing socially unacceptable discourse, comparing them across four languages (English, Croatian, Dutch, Slovene) and two topics (migrants and LGBT).

Translation

A Report on the VarDial Evaluation Campaign 2020

no code implementations VarDial (COLING) 2020 Mihaela Gaman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, Marcos Zampieri

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020.

Dialect Identification

HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models

no code implementations VarDial (COLING) 2020 Yves Scherrer, Nikola Ljubešić

This paper describes the Helsinki-Ljubljana contribution to the VarDial shared task on social media variety geolocation.

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

1 code implementation 8 Aug 2023 Luka Terčon, Nikola Ljubešić

We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline.

ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification

no code implementations 7 Mar 2023 Taja Kuzman, Igor Mozetič, Nikola Ljubešić

Results show that ChatGPT outperforms the fine-tuned model when applied to the dataset which was not seen before by either of the models.

Language Modelling text-classification +3

Geographic Adaptation of Pretrained Language Models

no code implementations 16 Mar 2022 Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, Hinrich Schütze

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone.

Language Identification Language Modelling +2

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

no code implementations LREC 2022 Taja Kuzman, Peter Rupnik, Nikola Ljubešić

This paper presents a new training dataset for automatic genre identification, GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words.

BERTić -- The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

no code implementations 19 Apr 2021 Nikola Ljubešić, Davor Lauc

In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains.

Commonsense Causal Reasoning Language Modelling +5

The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English

no code implementations 5 Jun 2019 Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

In this paper we present datasets of Facebook comment threads to mainstream media posts in Slovene and English developed inside the Slovene national project FRENK which cover two topics, migrants and LGBT, and are manually annotated for different types of socially unacceptable discourse (SUD).

Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings

1 code implementation 9 Jul 2018 Nikola Ljubešić, Darja Fišer, Anita Peti-Stantić

We show that the notions of concreteness and imageability are highly predictable both within and across languages, with a moderate loss of up to 20% in correlation when predicting across languages.

Cross-Lingual Transfer Word Embeddings

Bleaching Text: Abstract Features for Cross-lingual Gender Prediction

1 code implementation ACL 2018 Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, Barbara Plank

Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform-dependent.

Gender Prediction
