no code implementations • JEP/TALN/RECITAL 2022 • Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot, Djamé Seddah
In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages.
1 code implementation • EMNLP (WNUT) 2021 • Rob van der Goot, Alan Ramponi, Arkaitz Zubiaga, Barbara Plank, Benjamin Muller, Iñaki San Vicente Roncal, Nikola Ljubešić, Özlem Çetinoğlu, Rahmad Mahendra, Talha Çolakoğlu, Timothy Baldwin, Tommaso Caselli, Wladimir Sidorenko
This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation.
1 code implementation • 13 Dec 2024 • Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness.
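The core idea, grouping raw bytes into variable-length patches at points of high next-byte uncertainty, can be sketched in miniature. The static unigram surprisal model and the threshold below are illustrative assumptions; BLT uses a learned entropy model.

```python
import math
from collections import Counter

def entropy_patches(data: bytes, threshold: float = 6.0):
    """Toy sketch: split a byte stream into patches, starting a new
    patch whenever the next byte is 'surprising' under a unigram
    model estimated from the stream itself. (An assumption of this
    sketch; the actual method uses a learned entropy model.)"""
    counts = Counter(data)
    total = len(data)
    # Per-byte surprisal in bits under the unigram model.
    surprisal = {b: -math.log2(counts[b] / total) for b in counts}
    patches, current = [], bytearray()
    for b in data:
        if current and surprisal[b] > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

# Rare bytes (X, Y) open new patches; frequent bytes extend them.
patches = entropy_patches(b"aaaaaaaaXaaaaaaaaYaaaa", threshold=3.0)
```

Dynamic patching is what lets such a model spend more compute where the byte stream is unpredictable and less where it is repetitive.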
no code implementations • 2 Oct 2024 • Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu
We focus on mathematical reasoning and, without any in-language math data, facilitate cross-lingual transfer by composing language and math capabilities.
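One way to compose two fine-tuned experts is to swap whole transformer layers between them. The sketch below replaces the outermost layers of a math expert with those of a language expert; the parameter naming scheme and the choice of how many layers to swap are assumptions for illustration, not the paper's exact recipe.

```python
def swap_layers(math_expert: dict, lang_expert: dict,
                n_layers: int, n_outer: int = 2) -> dict:
    """Toy sketch of layer swapping: start from a math-fine-tuned
    model and replace its outermost transformer layers (closest to
    input and output) with the corresponding layers of a
    language-fine-tuned model. The key format `layers.<i>.<name>`
    is a hypothetical naming convention."""
    merged = dict(math_expert)
    outer = set(range(n_outer)) | set(range(n_layers - n_outer, n_layers))
    for key, value in lang_expert.items():
        if key.startswith("layers."):
            layer_idx = int(key.split(".")[1])
            if layer_idx in outer:
                merged[key] = value
    return merged
```

The intuition is that layers near the input and output carry more language-specific processing, while middle layers carry more task (here, math) capability.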
1 code implementation • 8 Feb 2024 • Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-Jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux
Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speech units.
Ranked #1 on Language Modelling on SALMon (using extra training data)
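Extending a text LM to speech amounts to training on a single token stream that mixes text tokens with discrete speech units. A minimal sketch of such interleaving, where the `[TEXT]`/`[SPEECH]` tags and the fixed span length are illustrative assumptions (the actual training aligns spans at word boundaries):

```python
def interleave(text_tokens, speech_units, span=3):
    """Toy sketch of text/speech interleaving: alternate spans of
    text tokens and discrete speech-unit tokens in one stream,
    marking each modality switch with a tag token."""
    out = []
    i = j = 0
    turn_text = True
    while i < len(text_tokens) or j < len(speech_units):
        if turn_text and i < len(text_tokens):
            out.append("[TEXT]")
            out.extend(text_tokens[i:i + span])
            i += span
        elif j < len(speech_units):
            out.append("[SPEECH]")
            out.extend(speech_units[j:j + span])
            j += span
        turn_text = not turn_text
    return out
```

Training on such mixed sequences is what lets a single model continue a prompt in either modality.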
1 code implementation • 5 Sep 2023 • Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, Armen Aghajanyan
It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs.
Ranked #2 on Text-to-Image Generation on COCO
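The contrastive-decoding idea mentioned above can be sketched for a single step: keep only tokens the conditional model finds plausible, then prefer the token whose probability rises most relative to an unconditional (or weaker) distribution. Plain dicts stand in for model logits here, and the plausibility cutoff `alpha` is an assumption of this sketch, not the paper's exact formulation.

```python
import math

def contrastive_step(p_cond: dict, p_uncond: dict, alpha: float = 0.1):
    """Toy sketch of one contrastive-decoding step: restrict to
    tokens whose conditional probability is within a factor alpha
    of the best token, then pick the token maximizing
    log p_cond - log p_uncond."""
    cutoff = alpha * max(p_cond.values())
    plausible = [t for t, p in p_cond.items() if p >= cutoff]
    return max(plausible,
               key=lambda t: math.log(p_cond[t]) - math.log(p_uncond[t]))
```

The plausibility filter prevents the contrastive objective from promoting low-probability tokens that merely happen to be rare under the weaker distribution.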
2 code implementations • 31 Aug 2023 • Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa
We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs).
1 code implementation • 31 Aug 2023 • Benjamin Muller, Belen Alastruey, Prangthip Hansanti, Elahe Kalbassi, Christophe Ropers, Eric Michael Smith, Adina Williams, Luke Zettlemoyer, Pierre Andrews, Marta R. Costa-jussà
We showcase it to report gender representation in WMT training data and development data for the News task, confirming that current data is skewed towards masculine representation.
no code implementations • 23 May 2023 • Benjamin Muller, John Wieting, Jonathan H. Clark, Tom Kwiatkowski, Sebastian Ruder, Livio Baldini Soares, Roee Aharoni, Jonathan Herzig, Xinyi Wang
Based on these models, we improve the attribution level of a cross-lingual question-answering system.
1 code implementation • 23 Feb 2023 • Asım Ersoy, Gerson Vizcarra, Tasmiah Tahsin Mayeesha, Benjamin Muller
Multilingual generative language models (LMs) are increasingly fluent in a large variety of languages.
no code implementations • 4 Dec 2022 • Benjamin Muller, Deepanshu Gupta, Siddharth Patwardhan, Jean-Philippe Fauconnier, David Vandyke, Sachin Agarwal
For a given language, we are able to predict zero-shot performance, which increases logarithmically with the number of few-shot target-language data points.
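The logarithmic-scaling claim can be illustrated with a simple least-squares fit of score = a·ln(n) + b over (data-point count, score) pairs. The fit below and its synthetic data are illustrative assumptions, not the paper's estimator or results.

```python
import math

def fit_log_curve(ns, scores):
    """Toy sketch: fit score = a * ln(n) + b by ordinary least
    squares, so performance at unseen data sizes can be
    extrapolated."""
    xs = [math.log(n) for n in ns]
    mx = sum(xs) / len(xs)
    my = sum(scores) / len(scores)
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, scores))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Synthetic points lying exactly on a log curve: +0.15 per decade.
a, b = fit_log_curve([10, 100, 1000], [0.40, 0.55, 0.70])
```

With such a fit in hand, predicting performance at a new target-language data budget is a single evaluation of a·ln(n) + b.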
no code implementations • 14 Oct 2021 • Benjamin Muller, Luca Soldaini, Rik Koncel-Kedziorski, Eric Lind, Alessandro Moschitti
Our cross-lingual generative system outperforms answer sentence selection baselines for all five languages, and monolingual generative pipelines for three of the five languages studied.
1 code implementation • EACL 2021 • Benjamin Muller, Yanai Elazar, Benoît Sagot, Djamé Seddah
Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning.
1 code implementation • NAACL 2021 • Benjamin Muller, Antonis Anastasopoulos, Benoît Sagot, Djamé Seddah
Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages.
no code implementations • ACL 2020 • Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot, Abhishek Srivastava
We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching.
no code implementations • JEP/TALN/RECITAL 2020 • Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Benoît Sagot, Djamé Seddah
The practical use of these models, in every language except English, was therefore limited. The recent release of several monolingual BERT-based models (Devlin et al., 2019), notably for French, demonstrated the value of such models by improving the state of the art on all evaluated tasks.
no code implementations • LREC 2020 • Pedro Javier Ortiz Suárez, Yoann Dupont, Benjamin Muller, Laurent Romary, Benoît Sagot
The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French.
no code implementations • 1 May 2020 • Benjamin Muller, Benoit Sagot, Djamé Seddah
Building natural language processing systems for non-standardized, low-resource languages is a difficult challenge.
8 code implementations • ACL 2020 • Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot
We show that the use of web crawled data is preferable to the use of Wikipedia data.
Ranked #1 on Natural Language Inference on XNLI French
no code implementations • WS 2019 • Benjamin Muller, Benoit Sagot, Djamé Seddah
In this article, focusing on User Generated Content (UGC), we study the ability of BERT to perform lexical normalisation.
no code implementations • CONLL 2018 • Ganesh Jawahar, Benjamin Muller, Amal Fethi, Louis Martin, Éric Villemonte de la Clergerie, Benoît Sagot, Djamé Seddah
We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an in-domain version of ELMo features (Peters et al., 2018), which provide context-dependent word representations, and disambiguated, embedded, morphosyntactic features from lexicons (Sagot, 2018), which complement the existing feature set.
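The biaffine scoring at the heart of this parser assigns each (head, dependent) pair a score of the form h<sup>T</sup>U d + bias·d. A minimal sketch with plain Python lists standing in for the learned tensors (the ELMo and lexicon features from the paper are omitted here):

```python
def biaffine_scores(heads, deps, U, bias):
    """Toy sketch of biaffine arc scoring: for head vector h and
    dependent vector d, score(h, d) = h^T U d + bias . d.
    Returns a |heads| x |deps| score matrix."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scores = []
    for h in heads:
        row = []
        for d in deps:
            Ud = [dot(r, d) for r in U]  # matrix-vector product U @ d
            row.append(dot(h, Ud) + dot(bias, d))
        scores.append(row)
    return scores
```

In the full parser these vectors come from a BiLSTM encoder, and the highest-scoring head per dependent (subject to tree constraints) defines the predicted parse.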