no code implementations • WS 2019 • Casimiro Pio Carrino, Bardia Rafieian, Marta R. Costa-jussà, José A. R. Fonollosa
Our best submitted system ranked 2nd and 3rd for the Spanish-English and English-Spanish translation directions, respectively.
3 code implementations • 11 Dec 2019 • Casimiro Pio Carrino, Marta R. Costa-jussà, José A. R. Fonollosa
We then used this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model.
no code implementations • LREC 2020 • Casimiro Pio Carrino, Marta R. Costa-jussà, José A. R. Fonollosa
We then used this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model.
no code implementations • 25 Feb 2021 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Casimiro Pio Carrino, Ona de Gibert, Aitor Gonzalez-Agirre, Marta Villegas
We computed both Word and Sub-word Embeddings using FastText.
2 code implementations • 15 Jul 2021 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquín Silveira-Ocampo, Casimiro Pio Carrino, Aitor Gonzalez-Agirre, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas
This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community.
no code implementations • Findings (ACL) 2021 • Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, Marta Villegas
For this, we: (1) build a clean, high-quality textual Catalan corpus (CaText), the largest to date (but only a fraction of the usual size of the previous work in monolingual language models), (2) train a Transformer-based language model for Catalan (BERTa), and (3) devise a thorough evaluation in a diversity of settings, comprising a complete array of downstream tasks, namely, Part of Speech Tagging, Named Entity Recognition and Classification, Text Classification, Question Answering, and Semantic Textual Similarity, with most of the corresponding datasets being created ex novo.
no code implementations • 8 Sep 2021 • Casimiro Pio Carrino, Jordi Armengol-Estapé, Asier Gutiérrez-Fandiño, Joan Llop-Palao, Marc Pàmies, Aitor Gonzalez-Agirre, Marta Villegas
To the best of our knowledge, we provide the first biomedical and clinical transformer-based pretrained language models for Spanish, intending to boost native Spanish NLP applications in biomedicine.
no code implementations • 16 Sep 2021 • Casimiro Pio Carrino, Jordi Armengol-Estapé, Ona de Gibert Bonet, Asier Gutiérrez-Fandiño, Aitor Gonzalez-Agirre, Martin Krallinger, Marta Villegas
We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text.
1 code implementation • 29 Sep 2023 • Casimiro Pio Carrino, Carlos Escolano, José A. R. Fonollosa
Our approach seeks to enhance cross-lingual QA transfer using a high-performing multilingual model trained on a large-scale dataset, complemented by a few thousand aligned QA examples across languages.
1 code implementation • BioNLP (ACL) 2022 • Casimiro Pio Carrino, Joan Llop, Marc Pàmies, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Joaquín Silveira-Ocampo, Alfonso Valencia, Aitor Gonzalez-Agirre, Marta Villegas
This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora consisting of a total of 1.1B tokens and an EHR corpus of 95M tokens.