1 code implementation • ACL 2022 • Shu Okabe, Laurent Besacier, François Yvon
Word and morpheme segmentation are fundamental steps of language documentation, as they make it possible to discover lexical units in a language for which the lexicon is unknown.
no code implementations • JEP/TALN/RECITAL 2021 • Guillaume Wisniewski, Lichao Zhu, Nicolas Ballier, François Yvon
This article presents the first results of an ongoing study on gender bias in training corpora and in neural machine translation systems.
no code implementations • JEP/TALN/RECITAL 2021 • François Buet, François Yvon
One way to perform automatic monolingual subtitling is to combine a speech recognition system with a model that translates the transcript into subtitles.
no code implementations • WMT (EMNLP) 2020 • Minh Quang Pham, Jitao Xu, Josep Crego, François Yvon, Jean Senellart
Priming is a well-known and well-studied psychological phenomenon based on the prior presentation of one stimulus (cue) to influence the processing of a response.
1 code implementation • WMT (EMNLP) 2020 • Minh Quang Pham, Josep Maria Crego, François Yvon, Jean Senellart
Domain adaptation is an old and vexing problem for machine translation systems.
no code implementations • WMT (EMNLP) 2020 • Sadaf Abdul Rauf, José Carlos Rosales Núñez, Minh Quang Pham, François Yvon
This paper describes LIMSI’s submissions to the translation shared tasks at WMT’20.
no code implementations • WMT (EMNLP) 2021 • Jitao Xu, Minh Quang Pham, Sadaf Abdul Rauf, François Yvon
This paper describes LISN’s submissions to two shared tasks at WMT’21.
no code implementations • EAMT 2022 • Minh-Quang Pham, Josep Crego, François Yvon
In this paper, we study dynamic data selection strategies that are able to automatically re-evaluate the usefulness of data samples and to evolve a data selection policy in the course of training.
no code implementations • JEP/TALN/RECITAL 2022 • Lichao Zhu, Guillaume Wisniewski, Nicolas Ballier, François Yvon
This work presents two series of experiments aimed at identifying information flows in neural machine translation systems.
no code implementations • Findings (NAACL) 2022 • Minh-Quang Pham, François Yvon, Josep Crego
Multidomain and multilingual machine translation often rely on parameter sharing strategies, where large portions of the network are meant to capture the commonalities of the tasks at hand, while smaller parts are reserved to model the peculiarities of a language or a domain.
no code implementations • MTSummit 2021 • Anh Khoa Ngo Ho, François Yvon
Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for example, to train statistical machine translation systems, to learn bilingual dictionaries, or to perform quality estimation.
no code implementations • JEP/TALN/RECITAL 2022 • Shu Okabe, François Yvon
Automatic segmentation into words and morphemes is a crucial step in the language documentation process.
no code implementations • JEP/TALN/RECITAL 2022 • Nicolas Devatine, Caio Corro, François Yvon
This article addresses the cross-lingual transfer of dependency parsers and studies methods for limiting the potentially harmful effect on transfer of word-order divergences between the source and target languages.
no code implementations • EMNLP (IWSLT) 2019 • MinhQuang Pham, Josep Crego, François Yvon, Jean Senellart
Supervised machine translation works well when the train and test data are sampled from the same distribution.
no code implementations • IWSLT 2016 • Franck Burlot, Elena Knyazeva, Thomas Lavergne, François Yvon
This paper describes a two-step machine translation system that addresses the issue of translating into a morphologically rich language (English to Czech) by separately performing the translation and the generation of target morphology.
no code implementations • IWSLT 2016 • Franck Burlot, Matthieu Labeau, Elena Knyazeva, Thomas Lavergne, Alexandre Allauzen, François Yvon
This paper describes LIMSI’s submission to the MT track of IWSLT 2016.
no code implementations • 23 Dec 2024 • Ziqian Peng, Rachel Bawden, François Yvon
Transformer architectures are increasingly effective at processing and generating very long chunks of text, opening new perspectives for document-level machine translation (MT).
2 code implementations • 31 Oct 2024 • Amir Hossein Kargaran, François Yvon, Hinrich Schütze
The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models.
1 code implementation • 8 Oct 2024 • Amir Hossein Kargaran, Ali Modarressi, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs.
1 code implementation • 25 Sep 2024 • Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Ayyoob Imani, Orgest Xhelili, Haotian Ye, Chunlan Ma, François Yvon, Hinrich Schütze
However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance.
no code implementations • 11 Sep 2024 • Matthieu Dubois, François Yvon, Pablo Piantanida
The dissemination of Large Language Models (LLMs), trained at scale, and endowed with powerful text-generating abilities has vastly increased the threats posed by generative AI technologies by reducing the cost of producing harmful, toxic, faked or forged content.
1 code implementation • 10 Jun 2024 • Amir Hossein Kargaran, François Yvon, Hinrich Schütze
This method uses the LID itself to identify the features that require masking and does not rely on any external resource.
no code implementations • 23 May 2024 • Maxime Bouthors, Josep Crego, François Yvon
Retrieval-augmented machine translation leverages examples from a translation memory by retrieving similar instances.
no code implementations • 23 May 2024 • Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou
Third, we present the Language Model Evaluation Harness (lm-eval): an open source library for independent, reproducible, and extensible evaluation of language models that seeks to address these issues.
1 code implementation • 1 Feb 2024 • Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F. T. Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware.
3 code implementations • 24 Oct 2023 • Amir Hossein Kargaran, Ayyoob Imani, François Yvon, Hinrich Schütze
Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages.
Ranked #1 on Language Identification on GlotLID-C
no code implementations • 21 Oct 2023 • Alban Petit, Caio Corro, François Yvon
In many Natural Language Processing applications, neural networks have been found to fail to generalize on out-of-distribution examples.
1 code implementation • 13 Oct 2023 • Maxime Bouthors, Josep Crego, François Yvon
Retrieval-Augmented Machine Translation (RAMT) is attracting growing attention.
1 code implementation • 23 Sep 2023 • Amir Hossein Kargaran, François Yvon, Hinrich Schütze
We present GlotScript, an open resource and tool for low resource writing system identification.
1 code implementation • 1 Jun 2023 • Josep Crego, Jitao Xu, François Yvon
In our globalized world, a growing number of situations arise where people are required to communicate in one or several foreign languages.
1 code implementation • 31 May 2023 • Dávid Javorský, Ondřej Bojar, François Yvon
Many NLP tasks require automatically identifying the most significant words in a text.
1 code implementation • 20 May 2023 • Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André F. T. Martins, François Yvon, Hinrich Schütze
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages.
1 code implementation • 3 Mar 2023 • Rachel Bawden, François Yvon
The NLP community recently saw the release of a new large open-access multilingual language model, BLOOM (BigScience et al., 2022) covering 46 languages.
7 code implementations • 9 Nov 2022 • BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Daniel McDuff, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, Thomas Wolf
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions.
1 code implementation • 24 Oct 2022 • Jitao Xu, Josep Crego, François Yvon
Machine Translation (MT) is usually viewed as a one-shot process that generates the target language equivalent of some source text from scratch.
1 code implementation • 18 Oct 2022 • Ayyoob Imani, Silvia Severini, Masoud Jalili Sabet, François Yvon, Hinrich Schütze
An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages.
1 code implementation • 12 Oct 2022 • Jitao Xu, Josep Crego, François Yvon
Non-autoregressive machine translation (NAT) has recently made great progress.
1 code implementation • LREC 2022 • Alina Karakanta, François Buet, Mauro Cettolo, François Yvon
Subtitle segmentation can be evaluated with sequence segmentation metrics against a human reference.
1 code implementation • IWSLT (ACL) 2022 • Jitao Xu, François Buet, Josep Crego, Elise Bertin-Lemée, François Yvon
As the amount of audio-visual content increases, the need to develop automatic captioning and subtitling solutions to match the expectations of a growing international audience appears as the only viable way to boost throughput and lower the related post-production costs.
no code implementations • Findings (ACL) 2022 • Ayyoob Imani, Lütfi Kerem Şenel, Masoud Jalili Sabet, François Yvon, Hinrich Schütze
First, we create a multiparallel word alignment graph, joining all bilingual word alignment pairs in one graph.
no code implementations • EMNLP (BlackboxNLP) 2021 • Guillaume Wisniewski, Lichao Zhu, Nicolas Ballier, François Yvon
This paper aims to identify the information flow in state-of-the-art machine translation systems, taking as an example the transfer of gender when translating from French into English.
1 code implementation • EMNLP 2021 • Jitao Xu, François Yvon
Machine translation is generally understood as generating one target text from an input source document.
1 code implementation • EMNLP 2021 • Ayyoob Imani, Masoud Jalili Sabet, Lütfi Kerem Şenel, Philipp Dufter, François Yvon, Hinrich Schütze
With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently.
no code implementations • NAACL (CALCS) 2021 • Jitao Xu, François Yvon
Code-Switching (CSW) is a common phenomenon that occurs in multilingual geographic or social contexts, which raises challenging problems for natural language processing tools.
no code implementations • AMTA 2020 • Anh Khoa Ngo Ho, François Yvon
Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for instance, to learn bilingual dictionaries, to train statistical machine translation systems or to perform quality estimation.
no code implementations • EMNLP (IWSLT) 2019 • Anh Khoa Ngo Ho, François Yvon
Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for instance, to learn bilingual dictionaries, to train statistical machine translation systems, or to perform quality estimation.
3 code implementations • Findings of the Association for Computational Linguistics 2020 • Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich Schütze
We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners, even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.
no code implementations • LREC 2020 • Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajič, Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Sandford Pedersen, Inguna Skadiņa, Marko Tadić, Dan Tufiş, Tamás Váradi, Kadri Vider, Andy Way, François Yvon
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality.
1 code implementation • WS 2018 • Franck Burlot, François Yvon
Our findings confirm that back-translation is very effective and give new explanations as to why this is the case.
no code implementations • 18 Jun 2018 • Pierre Godard, Marcely Zanon-Boito, Lucas Ondel, Alexandre Berard, François Yvon, Aline Villavicencio, Laurent Besacier
We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal of automatically identifying lexical units in a low-resource, unwritten language (UL).
no code implementations • 16 Feb 2018 • Lucas Ondel, Pierre Godard, Laurent Besacier, Elin Larsen, Mark Hasegawa-Johnson, Odette Scharenborg, Emmanuel Dupoux, Lukas Burget, François Yvon, Sanjeev Khudanpur
Developing speech technologies for low-resource languages has become a very active research field over the last decade.