Search Results for author: Gema Ramírez-Sánchez

Found 10 papers, 3 papers with code

Bicleaner AI: Bicleaner Goes Neural

1 code implementation • LREC 2022 • Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Marta Bañón, Sergio Ortiz Rojas

This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora.

Binary Classification, Machine Translation, +2

MultiTraiNMT: Training Materials to Approach Neural Machine Translation from Scratch

no code implementations • TRITON 2021 • Gema Ramírez-Sánchez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Caroline Rossi, Dorothy Kenny, Riccardo Superbo, Pilar Sánchez-Gijón, Olga Torres-Hostench

The MultiTraiNMT Erasmus+ project aims at developing an open innovative syllabus in neural machine translation (NMT) for language learners and translators as multilingual citizens.

Machine Translation, NMT, +1

The EuroPat Corpus: A Parallel Corpus of European Patent Data

no code implementations • LREC 2022 • Kenneth Heafield, Elaine Farrow, Jelmer Van der Linde, Gema Ramírez-Sánchez, Dion Wiggins

We present the EuroPat corpus of patent-specific parallel data for 6 official European languages paired with English: German, Spanish, French, Croatian, Norwegian, and Polish.

Machine Translation, Translation

Human evaluation of web-crawled parallel corpora for machine translation

no code implementations • HumEval (ACL) 2022 • Gema Ramírez-Sánchez, Marta Bañón, Jaume Zaragoza-Bernabeu, Sergio Ortiz Rojas

Quality assessment has been an ongoing activity of the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages.

Machine Translation, Translation

Bifixer and Bicleaner: two open-source tools to clean your parallel data

1 code implementation • EAMT 2020 • Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón, Sergio Ortiz Rojas

This paper shows the utility of two open-source tools designed for parallel data cleaning: Bifixer and Bicleaner.

Machine Translation, Translation

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

no code implementations • EAMT 2022 • Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages.

FastSpell: the LangId Magic Spell

no code implementations • 12 Apr 2024 • Marta Bañón, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Sergio Ortiz-Rojas

Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts.

Language Identification

A New Massive Multilingual Dataset for High-Performance Language Technologies

no code implementations • 20 Mar 2024 • Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer Van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive.

Language Modelling, Machine Translation, +2

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

no code implementations • 13 Mar 2024 • Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

Large, curated, web-crawled corpora play a vital role in training language models (LMs).

OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

2 code implementations • 24 Nov 2023 • Nikolay Bogoychev, Jelmer Van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo

Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field.

Data Augmentation, Machine Translation, +2
