no code implementations • ParlaCLARIN (LREC) 2022 • Baybars Külebi, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas
This corpus has already been used in the training of state-of-the-art ASR systems and proof-of-concept text-to-speech (TTS) models.
Automatic Speech Recognition (ASR)
no code implementations • 17 Oct 2024 • José Giraldo, Martí Llopart-Font, Alex Peiró-Lilja, Carme Armentano-Oller, Gerard Sant, Baybars Külebi
High-quality audio data is a critical prerequisite for training robust text-to-speech models, which often limits the use of opportunistic or crowdsourced datasets.
no code implementations • 3 Dec 2021 • Carlos Rodriguez-Penagos, Carme Armentano-Oller, Marta Villegas, Maite Melero, Aitor Gonzalez, Ona de Gibert Bonet, Casimiro Pio Carrino
The Catalan Language Understanding Benchmark (CLUB) encompasses various datasets representative of different NLU tasks that enable accurate evaluations of language models, following the General Language Understanding Evaluation (GLUE) example.
no code implementations • Findings (ACL) 2021 • Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, Marta Villegas
For this, we: (1) build a clean, high-quality textual Catalan corpus (CaText), the largest to date (but only a fraction of the usual size of the previous work in monolingual language models), (2) train a Transformer-based language model for Catalan (BERTa), and (3) devise a thorough evaluation in a diversity of settings, comprising a complete array of downstream tasks, namely, Part of Speech Tagging, Named Entity Recognition and Classification, Text Classification, Question Answering, and Semantic Textual Similarity, with most of the corresponding datasets being created ex novo.
2 code implementations • 15 Jul 2021 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquín Silveira-Ocampo, Casimiro Pio Carrino, Aitor Gonzalez-Agirre, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas
This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community.