Search Results for author: Peter Rupnik

Found 7 papers, 1 paper with code

ParlaSpeech-HR - a Freely Available ASR Dataset for Croatian Bootstrapped from the ParlaMint Corpus

no code implementations ParlaCLARIN (LREC) 2022 Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec

This paper presents our bootstrapping efforts of producing the first large freely available Croatian automatic speech recognition (ASR) dataset, 1,816 hours in size, obtained from parliamentary transcripts and recordings from the ParlaMint corpus.

Automatic Speech Recognition, Automatic Speech Recognition (ASR) +1

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

no code implementations EAMT 2022 Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages.

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

1 code implementation 8 Apr 2024 Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord

The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed.

The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings

no code implementations 18 Sep 2023 Michal Mochtak, Peter Rupnik, Nikola Ljubešić

The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment, which are used in a series of experiments focused on training a robust sentiment identifier for parliamentary proceedings.

Decision Making, Language Modelling +1

The ParlaSent-BCS dataset of sentiment-annotated parliamentary debates from Bosnia-Herzegovina, Croatia, and Serbia

no code implementations 2 Jun 2022 Michal Mochtak, Peter Rupnik, Nikola Ljubešić

A six-level schema is applied to the data with the aim of training a classification model for the detection of sentiment in parliamentary proceedings.

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

no code implementations LREC 2022 Taja Kuzman, Peter Rupnik, Nikola Ljubešić

This paper presents a new training dataset for automatic genre identification, GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words.
