no code implementations • 22 Nov 2024 • Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, Beren Millidge
In this technical report, we present the Zamba2 series -- a suite of 1.2B, 2.7B, and 7.4B parameter hybrid Mamba2-transformer models that achieve state-of-the-art performance against the leading open-weights models of their class, while delivering substantial gains in inference latency, throughput, and memory efficiency.
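As a purely illustrative sketch of what a "hybrid" stack means here -- cheap sequence-mixing blocks interleaved with occasional full-attention blocks -- and not the actual Zamba2 block layout, parameterization, or weight-sharing scheme (all names and the `attn_every` ratio below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_block(x):
    """Stand-in for a Mamba2-style state-space block: a causal, per-channel
    linear recurrence over the sequence (purely illustrative)."""
    seq_len, d = x.shape
    decay = 0.9
    h = np.zeros(d)
    out = np.empty_like(x)
    for t in range(seq_len):
        h = decay * h + (1 - decay) * x[t]
        out[t] = h
    return x + out  # residual connection

def attention_block(x):
    """Stand-in for a full (quadratic) causal self-attention block."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ x  # residual connection

def hybrid_stack(x, n_blocks=12, attn_every=6):
    """Interleave cheap SSM blocks with occasional attention blocks."""
    for i in range(n_blocks):
        x = attention_block(x) if (i + 1) % attn_every == 0 else ssm_block(x)
    return x

tokens = rng.standard_normal((16, 32))  # (sequence length, model width)
print(hybrid_stack(tokens).shape)       # (16, 32)
```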
1 code implementation • 7 Aug 2024 • Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge
Self-attention is the core mathematical operation of modern transformer architectures and is also a significant computational bottleneck due to its quadratic complexity in the sequence length.
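As a minimal NumPy sketch of where the quadratic cost comes from: the score matrix below has shape (n, n), so compute and memory grow as n^2 in the sequence length n (function and variable names are illustrative, not the paper's implementation):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Naive single-head self-attention. The (n, n) score matrix is the
    source of the quadratic cost in sequence length n."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # shape (n, n): O(n^2) time and memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 1024, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (1024, 64); doubling n quadruples the score-matrix work
```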
1 code implementation • 4 Jun 2024 • Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, Adam Ibrahim, James Whittington, Quentin Anthony
The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly.
no code implementations • 26 May 2024 • Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge
Zamba is pretrained in two phases: the first phase uses existing web datasets, while the second anneals the model on high-quality instruct and synthetic datasets and is characterized by a rapid learning-rate decay.
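As a minimal sketch of what such a two-phase schedule with a rapid annealing decay can look like -- the step counts, learning rates, and cosine shape below are placeholder assumptions, not Zamba's actual schedule:

```python
import numpy as np

def two_phase_lr(step, phase1_steps=9000, phase2_steps=1000,
                 base_lr=1.5e-4, final_lr=1.5e-6):
    """Illustrative two-phase schedule: a long first phase at the base
    learning rate, then a short annealing phase with a rapid decay."""
    if step < phase1_steps:
        return base_lr                        # phase 1: web-data pretraining
    t = min((step - phase1_steps) / phase2_steps, 1.0)  # phase 2: annealing
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + np.cos(np.pi * t))

print(two_phase_lr(0), two_phase_lr(9500), two_phase_lr(10000))
```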
no code implementations • 23 Oct 2023 • Mahan Fathi, Clement Gehring, Jonathan Pilault, David Kanaa, Pierre-Luc Bacon, Ross Goroshin
Koopman representations aim to learn features of nonlinear dynamical systems (NLDS) which lead to linear dynamics in the latent space.
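As a minimal worked sketch of the Koopman idea on a toy system with a known finite-dimensional lifting -- here the encoder is hand-picked rather than learned, unlike in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, mu = 0.9, 0.5

def f(x):
    """Toy nonlinear system with a known finite-dimensional Koopman embedding."""
    return np.array([lam * x[0], mu * x[1] + (lam ** 2 - mu) * x[0] ** 2])

def encode(x):
    """Hand-picked lifting [x1, x2, x1^2]; in the paper's setting this is learned."""
    return np.array([x[0], x[1], x[0] ** 2])

# Collect a trajectory and fit the Koopman operator K by least squares so that
# encode(x_{t+1}) ~= encode(x_t) @ K.
xs = [rng.standard_normal(2)]
for _ in range(200):
    xs.append(f(xs[-1]))
Z = np.stack([encode(x) for x in xs])
K, *_ = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)

# Multi-step prediction is now repeated matrix multiplication in latent space.
z = encode(xs[0])
for _ in range(10):
    z = z @ K
print(np.allclose(z, encode(xs[10]), atol=1e-4))  # True: linear rollout matches
```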
no code implementations • 4 Jul 2023 • Jonathan Pilault, Can Liu, Mohit Bansal, Markus Dreyer
Prompts have been shown to be an effective method to adapt a frozen Pretrained Language Model (PLM) to perform well on downstream tasks.
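As a minimal sketch of the general prompt-tuning setup, in which a small set of continuous prompt embeddings is the only trainable component while the PLM stays frozen -- all shapes and names are illustrative, not the paper's specific method:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, prompt_len, seq_len = 32, 8, 20

# Frozen "PLM": a fixed random projection standing in for the pretrained
# network; its weights are never updated.
frozen_w = rng.standard_normal((d_model, d_model))

def frozen_plm(embeddings):
    return np.tanh(embeddings @ frozen_w)

# The only trainable parameters in prompt tuning: continuous prompt
# embeddings prepended to every input.
prompt = rng.standard_normal((prompt_len, d_model)) * 0.01

def forward(input_embeddings, prompt):
    x = np.concatenate([prompt, input_embeddings], axis=0)  # (prompt_len + seq_len, d)
    return frozen_plm(x)

tokens = rng.standard_normal((seq_len, d_model))
print(forward(tokens, prompt).shape)  # (28, 32): prompt tokens + input tokens
```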
1 code implementation • 27 Apr 2023 • Joo Hyung Lee, Wonpyo Park, Nicole Mitchell, Jonathan Pilault, Johan Obando-Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart Bik, Woohyun Han, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci
This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research.
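JaxPruner exposes its own configuration-driven interface; the sketch below shows only the underlying idea of one-shot magnitude pruning in plain NumPy and does not use or imitate JaxPruner's API:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries until the requested fraction
    of weights is zero; return the pruned weights and the binary mask."""
    k = int(sparsity * weights.size)
    threshold = np.sort(np.abs(weights), axis=None)[k]
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128))
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(1.0 - mask.mean())  # ~0.9 of the weights are now zero
```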
no code implementations • 24 Jan 2023 • Jonathan Pilault, Xavier Garcia, Arthur Bražinskas, Orhan Firat
Crosslingual conditional generation (e.g., machine translation) has long enjoyed the benefits of scaling.
no code implementations • 14 Oct 2022 • Jonathan Pilault, Michael Galkin, Bahare Fatemi, Perouz Taslakian, David Vasquez, Christopher Pal
While using our new path-finding algorithm as a pretraining signal provides 2-3% MRR improvements, we show that pretraining on all signals together gives the best knowledge graph completion results.
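For reference, MRR (mean reciprocal rank) averages the reciprocal rank of the correct entity over test queries; a minimal sketch of the metric, with a random stand-in for the scoring model:

```python
import numpy as np

def mean_reciprocal_rank(scores, true_indices):
    """scores: (n_queries, n_entities) candidate scores per query;
    true_indices: index of the correct entity for each query."""
    # Rank of the true entity = 1 + number of candidates scored strictly higher.
    true_scores = scores[np.arange(len(true_indices)), true_indices]
    ranks = 1 + (scores > true_scores[:, None]).sum(axis=1)
    return float(np.mean(1.0 / ranks))

rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 100))
true_indices = rng.integers(0, 100, size=5)
print(mean_reciprocal_rank(scores, true_indices))
```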
no code implementations • NeurIPS Workshop AIPLANS 2021 • Torsten Scholak, Jonathan Pilault, Joey Velez-Ginorio
This paper explores the capabilities of current transformer-based language models for program evaluation of simple functional programming languages.
no code implementations • 1 Jan 2021 • Jonathan Pilault, Jaehong Park, Christopher Pal
We introduce Mem2Mem, a memory-to-memory mechanism for hierarchical recurrent neural network-based encoder-decoder architectures, and we explore its use for abstractive document summarization.
no code implementations • 21 Oct 2020 • Jaehong Park, Jonathan Pilault, Christopher Pal
We introduce Mem2Mem, a memory-to-memory mechanism for hierarchical recurrent neural network-based encoder-decoder architectures, and we explore its use for abstractive document summarization.
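As a loose illustration of the memory-transfer idea -- a fixed-size bank of encoder states handed to the decoder -- and not the actual Mem2Mem mechanism or its hierarchical RNN setting:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, memory_slots = 50, 16, 8

# Encoder-side hidden states (stand-ins for a hierarchical RNN encoder's outputs).
encoder_states = rng.standard_normal((seq_len, d))

def compress_to_memory(states, n_slots):
    """Score each encoder state with a (here random) vector and keep the
    top-scoring states as a compact memory bank for the decoder to attend to."""
    scorer = rng.standard_normal(states.shape[1])
    scores = states @ scorer
    top = np.argsort(scores)[-n_slots:]
    return states[np.sort(top)]

decoder_memory = compress_to_memory(encoder_states, memory_slots)
print(decoder_memory.shape)  # (8, 16): fixed-size memory passed encoder -> decoder
```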
1 code implementation • ICLR 2021 • Jonathan Pilault, Amine Elhattami, Christopher Pal
Through this construction (a hypernetwork adapter), we achieve more efficient parameter sharing and mitigate forgetting by keeping half of the weights of a pretrained model fixed.
Ranked #1 on Natural Language Inference on SciTail
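As a minimal sketch of the hypernetwork-adapter idea described above -- a small hypernetwork generates task-conditioned adapter weights while the pretrained layer stays frozen; all shapes and names are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck, d_task = 64, 8, 4

# Frozen pretrained layer (weights never updated).
frozen_w = rng.standard_normal((d_model, d_model))

# Hypernetwork parameters: map a task embedding to the adapter's weights.
hyper_down = rng.standard_normal((d_task, d_model * d_bottleneck)) * 0.01
hyper_up = rng.standard_normal((d_task, d_bottleneck * d_model)) * 0.01

def adapter_from_task(task_emb):
    """Generate a bottleneck adapter's weights from a task embedding, so all
    tasks share the hypernetwork instead of storing separate adapters."""
    w_down = (task_emb @ hyper_down).reshape(d_model, d_bottleneck)
    w_up = (task_emb @ hyper_up).reshape(d_bottleneck, d_model)
    return w_down, w_up

def layer(x, task_emb):
    h = x @ frozen_w                                 # frozen pretrained computation
    w_down, w_up = adapter_from_task(task_emb)
    return h + np.maximum(h @ w_down, 0) @ w_up      # task-conditioned adapter, residual

x = rng.standard_normal((10, d_model))
task_emb = rng.standard_normal(d_task)
print(layer(x, task_emb).shape)  # (10, 64)
```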
no code implementations • 21 Feb 2020 • Jonathan Pilault, Jae-hong Park, Christopher Pal
In this work, we investigate the performance of untrained, randomly initialized encoders in a general class of sequence-to-sequence models and compare their performance with that of fully trained encoders on the task of abstractive summarization.
1 code implementation • EMNLP 2020 • Sandeep Subramanian, Raymond Li, Jonathan Pilault, Christopher Pal
We present a neural abstractive summarization method that produces summaries of long documents exceeding several thousand words.
Ranked #19 on Text Summarization on PubMed
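The approach pairs an extractive selection stage with an abstractive transformer language model; below is a minimal, library-free sketch of that two-stage shape, where the scoring heuristic and the join-based "abstractor" are placeholders rather than the paper's model:

```python
import numpy as np

def extract_salient(sentences, query_terms, k=3):
    """Toy extractive step: score each sentence by overlap with a few query
    terms and keep the top-k, preserving document order."""
    scores = [sum(term in s.lower() for term in query_terms) for s in sentences]
    top = sorted(np.argsort(scores)[-k:])
    return [sentences[i] for i in top]

def abstract(selected_sentences):
    """Placeholder for the abstractive step: in the paper's setting a
    transformer language model conditions on the extracted sentences and
    generates the summary; here we just join them to keep the sketch runnable."""
    return " ".join(selected_sentences)

document = [
    "Long documents often exceed the context window of standard models.",
    "We first select the most relevant sentences from the document.",
    "Unrelated filler text appears throughout long scientific papers.",
    "A language model then rewrites the selection into a fluent summary.",
]
selected = extract_salient(document, query_terms=["summary", "select", "model"], k=2)
print(abstract(selected))
```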