Search Results for author: Jordi Armengol-Estapé

Found 22 papers, 12 papers with code

Enriching the Transformer with Linguistic Factors for Low-Resource Machine Translation

no code implementations • RANLP 2021 • Jordi Armengol-Estapé, Marta R. Costa-jussà, Carlos Escolano

Introducing factors, that is to say, word features such as linguistic information referring to the source tokens, is known to improve the results of neural machine translation systems in certain settings, typically in recurrent architectures.

Machine Translation Translation

Paper
Add Code

A Vulnerability Study on Academic Collaboration Networks Based on Network Dynamics

1 code implementation • 21 Dec 2020 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marta Villegas

Email can be one of the most fruitful attack vectors of research institutions as they also contain access to all accounts and thus to all private information.

Cryptography and Security Social and Information Networks

Paper
Code

Characterizing and Measuring the Similarity of Neural Networks with Persistent Homology

1 code implementation • NeurIPS 2021 • David Pérez-Fernández, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marta Villegas

Characterizing the structural properties of neural networks is crucial yet poorly understood, and there are no well-established similarity measures between networks.

Topological Data Analysis

Paper
Code

Spanish Biomedical and Clinical Language Embeddings

no code implementations • 25 Feb 2021 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Casimiro Pio Carrino, Ona de Gibert, Aitor Gonzalez-Agirre, Marta Villegas

We computed both Word and Sub-word Embeddings using FastText.

Word Embeddings

Paper
Add Code

Persistent Homology Captures the Generalization of Neural Networks Without A Validation Set

1 code implementation • NeurIPS 2021 • Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, Marta Villegas

The training of neural networks is usually monitored with a validation (holdout) set to estimate the generalization of the model.

Holdout Set

Paper
Code

MarIA: Spanish Language Models

2 code implementations • 15 Jul 2021 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquín Silveira-Ocampo, Casimiro Pio Carrino, Aitor Gonzalez-Agirre, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas

This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community.

Extractive Question-Answering Question Answering

244

Paper
Code

Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan

no code implementations • Findings (ACL) 2021 • Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, Marta Villegas

For this, we: (1) build a clean, high-quality textual Catalan corpus (CaText), the largest to date (but only a fraction of the usual size of the previous work in monolingual language models), (2) train a Transformer-based language model for Catalan (BERTa), and (3) devise a thorough evaluation in a diversity of settings, comprising a complete array of downstream tasks, namely, Part of Speech Tagging, Named Entity Recognition and Classification, Text Classification, Question Answering, and Semantic Textual Similarity, with most of the corresponding datasets being created ex novo.

Language Modelling named-entity-recognition +7

Paper
Add Code

Learning C to x86 Translation: An Experiment in Neural Compilation

1 code implementation • NeurIPS Workshop AIPLANS 2021 • Jordi Armengol-Estapé, Michael F. P. O'Boyle

Deep learning has had a significant impact on many fields.

C++ code Code Translation +1

Paper
Code

On the Multilingual Capabilities of Very Large-Scale English Language Models

2 code implementations • LREC 2022 • Jordi Armengol-Estapé, Ona de Gibert Bonet, Maite Melero

Generative Pre-trained Transformers (GPTs) have recently been scaled to unprecedented sizes in the history of machine learning.

Extractive Question-Answering Few-Shot Learning +3

Paper
Code

Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario

no code implementations • 8 Sep 2021 • Casimiro Pio Carrino, Jordi Armengol-Estapé, Asier Gutiérrez-Fandiño, Joan Llop-Palao, Marc Pàmies, Aitor Gonzalez-Agirre, Marta Villegas

To the best of our knowledge, we provide the first biomedical and clinical transformer-based pretrained language models for Spanish, intending to boost native Spanish NLP applications in biomedicine.

named-entity-recognition Named Entity Recognition +1

Paper
Add Code

Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

no code implementations • 16 Sep 2021 • Casimiro Pio Carrino, Jordi Armengol-Estapé, Ona de Gibert Bonet, Asier Gutiérrez-Fandiño, Aitor Gonzalez-Agirre, Martin Krallinger, Marta Villegas

We introduce CoWeSe (the Corpus Web Salud Espa\~nol), the largest Spanish biomedical corpus to date, consisting of 4. 5GB (about 750M tokens) of clean plain text.

Paper
Add Code

Spanish Legalese Language Model and Corpora

1 code implementation • 23 Oct 2021 • Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Aitor Gonzalez-Agirre, Marta Villegas

There are many Language Models for the English language according to its worldwide relevance.

Language Modelling

Paper
Code

FinEAS: Financial Embedding Analysis of Sentiment

1 code implementation • 31 Oct 2021 • Asier Gutiérrez-Fandiño, Miquel Noguer i Alonso, Petter Kolm, Jordi Armengol-Estapé

We introduce a new language representation model in finance called Financial Embedding Analysis of Sentiment (FinEAS).

Sentence Sentence Embeddings +4

Paper
Code

The Large Labelled Logo Dataset (L3D): A Multipurpose and Hand-Labelled Continuously Growing Dataset

1 code implementation • 10 Dec 2021 • Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé

In this work, we present the Large Labelled Logo Dataset (L3D), a multipurpose, hand-labelled, continuously growing dataset.

Ranked #1 on Image Classification on Large Labelled Logo Dataset (L3D) (Eval F1 metric)

Classification Image Classification

Paper
Code

Sequence-to-Sequence Resources for Catalan

1 code implementation • 14 Feb 2022 • Ona de Gibert, Ksenia Kharitonova, Blanca Calvo Figueras, Jordi Armengol-Estapé, Maite Melero

In this work, we introduce sequence-to-sequence language resources for Catalan, a moderately under-resourced language, towards two tasks, namely: Summarization and Machine Translation (MT).

Abstractive Text Summarization Machine Translation +1

Paper
Code

esCorpius: A Massive Spanish Crawling Corpus

no code implementations • 30 Jun 2022 • Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas

However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages, or present a low quality derived from sub-optimal cleaning and deduplication.

Language Modelling

Paper
Add Code

SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly

no code implementations • 21 May 2023 • Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, Michael F. P. O'Boyle

SLaDe is up to 6 times more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler and up to 4 times more accurate than the large language model ChatGPT and generates significantly more readable code than both.

Language Modelling Large Language Model

Paper
Add Code

Forklift: An Extensible Neural Lifter

no code implementations • 1 Apr 2024 • Jordi Armengol-Estapé, Rodrigo C. O. Rocha, Jackson Woodruff, Pasquale Minervini, Michael F. P. O'Boyle

The escalating demand to migrate legacy software across different Instruction Set Architectures (ISAs) has driven the development of assembly-to-assembly translators to map between their respective assembly languages.

Decoder

Paper
Add Code

Transfer Learning with Shallow Decoders: BSC at WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task

1 code implementation • WMT (EMNLP) 2021 • Ksenia Kharitonova, Ona de Gibert Bonet, Jordi Armengol-Estapé, Mar Rodriguez i Alvarez, Maite Melero

This paper describes the participation of the BSC team in the WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task.

Decoder Language Modelling +3

Paper
Code

Pretrained Biomedical Language Models for Clinical NLP in Spanish

1 code implementation • BioNLP (ACL) 2022 • Casimiro Pio Carrino, Joan Llop, Marc Pàmies, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Joaquín Silveira-Ocampo, Alfonso Valencia, Aitor Gonzalez-Agirre, Marta Villegas

This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora consisting of a total of 1. 1B tokens and an EHR corpus of 95M tokens.

NER

Paper
Code

Unsupervised Machine Translation in Real-World Scenarios

no code implementations • LREC 2022 • Ona de Gibert Bonet, Iakes Goenaga, Jordi Armengol-Estapé, Olatz Perez-de-Viñaspre, Carla Parra Escartín, Marina Sanchez, Mārcis Pinnis, Gorka Labaka, Maite Melero

In this work, we present the work that has been carried on in the MT4All CEF project and the resources that it has generated by leveraging recent research carried out in the field of unsupervised learning.

Translation Unsupervised Machine Translation

Paper
Add Code

Quality versus Quantity: Building Catalan-English MT Resources

no code implementations • SIGUL (LREC) 2022 • Ona de Gibert Bonet, Ksenia Kharitonova, Blanca Calvo Figueras, Jordi Armengol-Estapé, Maite Melero

In this work, we make the case of quality over quantity when training a MT system for a medium-to-low-resource language pair, namely Catalan-English.

Cross-Lingual Transfer Transfer Learning +1

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.