1 code implementation • LREC 2022 • Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Marta Bañón, Sergio Ortiz Rojas
This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora.
no code implementations • TRITON 2021 • Gema Ramírez-Sánchez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Caroline Rossi, Dorothy Kenny, Riccardo Superbo, Pilar Sánchez-Gijón, Olga Torres-Hostench
The MultiTraiNMT Erasmus+ project aims to develop an open, innovative syllabus in neural machine translation (NMT) for language learners and translators as multilingual citizens.
no code implementations • LREC 2022 • Kenneth Heafield, Elaine Farrow, Jelmer Van der Linde, Gema Ramírez-Sánchez, Dion Wiggins
We present the EuroPat corpus of patent-specific parallel data for 6 official European languages paired with English: German, Spanish, French, Croatian, Norwegian, and Polish.
no code implementations • HumEval (ACL) 2022 • Gema Ramírez-Sánchez, Marta Bañón, Jaume Zaragoza-Bernabeu, Sergio Ortiz Rojas
Quality assessment has been an ongoing activity of the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages.
1 code implementation • EAMT 2020 • Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón, Sergio Ortiz Rojas
This paper shows the utility of two open-source tools designed for parallel data cleaning: Bifixer and Bicleaner.
no code implementations • EAMT 2022 • Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza
We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages.
no code implementations • 12 Apr 2024 • Marta Bañón, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Sergio Ortiz-Rojas
Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts.
no code implementations • 20 Mar 2024 • Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer Van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive.
no code implementations • 13 Mar 2024 • Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
Large, curated, web-crawled corpora play a vital role in training language models (LMs).
2 code implementations • 24 Nov 2023 • Nikolay Bogoychev, Jelmer Van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo
Developing high-quality machine translation systems is a labour-intensive, challenging and confusing process for newcomers to the field.