Search Results for author: Marta Villegas

Found 19 papers, 7 papers with code

ParlamentParla: A Speech Corpus of Catalan Parliamentary Sessions

no code implementations • ParlaCLARIN (LREC) 2022 • Baybars Kulebi, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas

This corpus has already been used to train state-of-the-art ASR systems and proof-of-concept text-to-speech (TTS) models.

Automatic Speech Recognition (ASR) +1

Pretrained Biomedical Language Models for Clinical NLP in Spanish

1 code implementation • BioNLP (ACL) 2022 • Casimiro Pio Carrino, Joan Llop, Marc Pàmies, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Joaquín Silveira-Ocampo, Alfonso Valencia, Aitor Gonzalez-Agirre, Marta Villegas

This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora consisting of a total of 1.1B tokens and an EHR corpus of 95M tokens.

NER

The Catalan Language CLUB

no code implementations • 3 Dec 2021 • Carlos Rodriguez-Penagos, Carme Armentano-Oller, Marta Villegas, Maite Melero, Aitor Gonzalez, Ona de Gibert Bonet, Casimiro Carrino Pio

The Catalan Language Understanding Benchmark (CLUB) encompasses various datasets representative of different NLU tasks that enable accurate evaluations of language models, following the General Language Understanding Evaluation (GLUE) example.

Spanish Legalese Language Model and Corpora

1 code implementation • 23 Oct 2021 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Aitor Gonzalez-Agirre, Marta Villegas

Many language models exist for English, in keeping with its worldwide relevance.

Language Modelling

Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

no code implementations • 16 Sep 2021 • Casimiro Pio Carrino, Jordi Armengol-Estapé, Ona de Gibert Bonet, Asier Gutiérrez-Fandiño, Aitor Gonzalez-Agirre, Martin Krallinger, Marta Villegas

We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text.

Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario

no code implementations • 8 Sep 2021 • Casimiro Pio Carrino, Jordi Armengol-Estapé, Asier Gutiérrez-Fandiño, Joan Llop-Palao, Marc Pàmies, Aitor Gonzalez-Agirre, Marta Villegas

To the best of our knowledge, we provide the first biomedical and clinical transformer-based pretrained language models for Spanish, intending to boost native Spanish NLP applications in biomedicine.

Named Entity Recognition +1

Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan

no code implementations • Findings (ACL) 2021 • Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, Marta Villegas

For this, we: (1) build a clean, high-quality textual Catalan corpus (CaText), the largest to date (but only a fraction of the usual size of the previous work in monolingual language models), (2) train a Transformer-based language model for Catalan (BERTa), and (3) devise a thorough evaluation in a diversity of settings, comprising a complete array of downstream tasks, namely, Part of Speech Tagging, Named Entity Recognition and Classification, Text Classification, Question Answering, and Semantic Textual Similarity, with most of the corresponding datasets being created ex novo.

Language Modelling, Named Entity Recognition +7

MarIA: Spanish Language Models

2 code implementations • 15 Jul 2021 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquín Silveira-Ocampo, Casimiro Pio Carrino, Aitor Gonzalez-Agirre, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas

This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community.

Extractive Question-Answering, Question Answering

Overview of BioASQ 2020: The eighth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

no code implementations • 28 Jun 2021 • Anastasios Nentidis, Anastasia Krithara, Konstantinos Bougiatiotis, Martin Krallinger, Carlos Rodriguez-Penagos, Marta Villegas, Georgios Paliouras

In this paper, we present an overview of the eighth edition of the BioASQ challenge, which ran as a lab in the Conference and Labs of the Evaluation Forum (CLEF) 2020.

Question Answering

Persistent Homology Captures the Generalization of Neural Networks Without A Validation Set

1 code implementation • NeurIPS 2021 • Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, Marta Villegas

The training of neural networks is usually monitored with a validation (holdout) set to estimate the generalization of the model.

Holdout Set

Spanish Biomedical and Clinical Language Embeddings

no code implementations • 25 Feb 2021 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Casimiro Pio Carrino, Ona de Gibert, Aitor Gonzalez-Agirre, Marta Villegas

We computed both Word and Sub-word Embeddings using FastText.

Word Embeddings

Characterizing and Measuring the Similarity of Neural Networks with Persistent Homology

1 code implementation • NeurIPS 2021 • David Pérez-Fernández, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marta Villegas

Characterizing the structural properties of neural networks is crucial yet poorly understood, and there are no well-established similarity measures between networks.

Topological Data Analysis

A Vulnerability Study on Academic Collaboration Networks Based on Network Dynamics

1 code implementation • 21 Dec 2020 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marta Villegas

Email can be one of the most fruitful attack vectors against research institutions, since email accounts also give access to all other accounts and thus to all private information.

Cryptography and Security, Social and Information Networks

PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track

no code implementations • WS 2019 • Aitor Gonzalez-Agirre, Montserrat Marimon, Ander Intxaurrondo, Obdulia Rabal, Marta Villegas, Martin Krallinger

We foresee that the PharmaCoNER annotation guidelines, corpus and participant systems will foster the development of new resources for clinical and biomedical text mining systems of Spanish medical data.

Named Entity Recognition +1

Medical Word Embeddings for Spanish: Development and Evaluation

no code implementations • WS 2019 • Felipe Soares, Marta Villegas, Aitor Gonzalez-Agirre, Martin Krallinger, Jordi Armengol-Estapé

We performed intrinsic evaluation with our adapted datasets, as well as extrinsic evaluation with a named entity recognition system, using a general-domain embedding as a baseline.

Named Entity Recognition +2

Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries

1 code implementation • LREC 2016 • Marta Villegas, Maite Melero, Núria Bel, Jorge Gracia

The experiments presented here exploit the properties of the Apertium RDF Graph, principally cycle density and nodes' degree, to automatically generate new translation relations between words, and therefore to enrich existing bilingual dictionaries with new entries.

Translation

Metadata as Linked Open Data: mapping disparate XML metadata registries into one RDF/OWL registry.

no code implementations • LREC 2014 • Marta Villegas, Maite Melero, Núria Bel

The proliferation of different metadata schemas and models poses serious problems of interoperability.

The IULA Treebank

no code implementations • LREC 2012 • Montserrat Marimon, Beatriz Fisas, Núria Bel, Jorge Vivaldi, Sergi Torner, Mercè Lorente, Silvia Vázquez, Marta Villegas

In this paper we have focused on describing the work done for defining the annotation process and the treebank design principles.

POS

Using Language Resources in Humanities research

no code implementations • LREC 2012 • Marta Villegas, Nuria Bel, Carlos Gonzalo, Amparo Moreno, Nuria Simelio

In this paper we present two real cases, in the fields of discourse analysis of newspapers and communication research, which demonstrate the impact of Language Resources (LR) and NLP in the humanities.
