Search Results for author: Ona de Gibert Bonet

Found 8 papers, 2 papers with code

Quality versus Quantity: Building Catalan-English MT Resources

no code implementations • SIGUL (LREC) 2022 • Ona de Gibert Bonet, Ksenia Kharitonova, Blanca Calvo Figueras, Jordi Armengol-Estapé, Maite Melero

In this work, we make the case of quality over quantity when training a MT system for a medium-to-low-resource language pair, namely Catalan-English.

Cross-Lingual Transfer Transfer Learning +1

Paper
Add Code

Transfer Learning with Shallow Decoders: BSC at WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task

1 code implementation • WMT (EMNLP) 2021 • Ksenia Kharitonova, Ona de Gibert Bonet, Jordi Armengol-Estapé, Mar Rodriguez i Alvarez, Maite Melero

This paper describes the participation of the BSC team in the WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task.

Language Modelling Machine Translation +2

Paper
Code

Unsupervised Machine Translation in Real-World Scenarios

no code implementations • LREC 2022 • Ona de Gibert Bonet, Iakes Goenaga, Jordi Armengol-Estapé, Olatz Perez-de-Viñaspre, Carla Parra Escartín, Marina Sanchez, Mārcis Pinnis, Gorka Labaka, Maite Melero

In this work, we present the work that has been carried on in the MT4All CEF project and the resources that it has generated by leveraging recent research carried out in the field of unsupervised learning.

Translation Unsupervised Machine Translation

Paper
Add Code

Spanish Datasets for Sensitive Entity Detection in the Legal Domain

no code implementations • LREC 2022 • Ona de Gibert Bonet, Aitor García Pablos, Montse Cuadros, Maite Melero

In order to assess the quality of the generated datasets, we have used them to fine-tune a battery of entity-detection models, using as foundation different pre-trained language models: one multilingual, two general-domain monolingual and one in-domain monolingual.

De-identification

Paper
Add Code

The Catalan Language CLUB

no code implementations • 3 Dec 2021 • Carlos Rodriguez-Penagos, Carme Armentano-Oller, Marta Villegas, Maite Melero, Aitor Gonzalez, Ona de Gibert Bonet, Casimiro Carrino Pio

The Catalan Language Understanding Benchmark (CLUB) encompasses various datasets representative of different NLU tasks that enable accurate evaluations of language models, following the General Language Understanding Evaluation (GLUE) example.

Paper
Add Code

Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

no code implementations • 16 Sep 2021 • Casimiro Pio Carrino, Jordi Armengol-Estapé, Ona de Gibert Bonet, Asier Gutiérrez-Fandiño, Aitor Gonzalez-Agirre, Martin Krallinger, Marta Villegas

We introduce CoWeSe (the Corpus Web Salud Espa\~nol), the largest Spanish biomedical corpus to date, consisting of 4. 5GB (about 750M tokens) of clean plain text.

Paper
Add Code

On the Multilingual Capabilities of Very Large-Scale English Language Models

2 code implementations • LREC 2022 • Jordi Armengol-Estapé, Ona de Gibert Bonet, Maite Melero

Generative Pre-trained Transformers (GPTs) have recently been scaled to unprecedented sizes in the history of machine learning.

Extractive Question-Answering Few-Shot Learning +3

Paper
Code

Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan

no code implementations • Findings (ACL) 2021 • Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, Marta Villegas

For this, we: (1) build a clean, high-quality textual Catalan corpus (CaText), the largest to date (but only a fraction of the usual size of the previous work in monolingual language models), (2) train a Transformer-based language model for Catalan (BERTa), and (3) devise a thorough evaluation in a diversity of settings, comprising a complete array of downstream tasks, namely, Part of Speech Tagging, Named Entity Recognition and Classification, Text Classification, Question Answering, and Semantic Textual Similarity, with most of the corresponding datasets being created ex novo.

Language Modelling named-entity-recognition +7

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.