1 code implementation • 17 Jul 2023 • Tri Dao
We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low occupancy or unnecessary shared-memory reads and writes.
3 code implementations • 9 May 2023 • Raymond Li, Loubna Ben allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries
The BigCode community, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B-parameter models with 8K context length, infilling capabilities, and fast large-batch inference enabled by multi-query attention.
1 code implementation • 16 Mar 2023 • Michael Zhang, Khaled K. Saab, Michael Poli, Tri Dao, Karan Goel, Christopher Ré
For expressivity, we propose a new SSM parameterization based on the companion matrix -- a canonical representation for discrete-time processes -- which enables SpaceTime's SSM layers to learn desirable autoregressive processes.
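A minimal NumPy sketch of the companion-matrix idea behind this parameterization; the shapes, the simple recurrence, and all variable names are illustrative assumptions, not SpaceTime's actual layer:

```python
import numpy as np

def companion(a):
    """Companion matrix: a shift (ones on the subdiagonal) plus the learned
    coefficient vector `a` in the last column -- one standard convention."""
    d = len(a)
    A = np.zeros((d, d))
    A[1:, :-1] = np.eye(d - 1)
    A[:, -1] = a
    return A

# Toy discrete-time SSM x_{k+1} = A x_k + B u_k, y_k = C x_k with a companion A,
# which is the structure that lets the state track autoregressive dynamics.
d, rng = 4, np.random.default_rng(0)
A = companion(0.1 * rng.standard_normal(d))
B, C = rng.standard_normal((d, 1)), rng.standard_normal((1, d))

x, ys = np.zeros((d, 1)), []
for u_k in rng.standard_normal(32):      # a toy input sequence
    x = A @ x + B * u_k
    ys.append(float(C @ x))
```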
3 code implementations • 21 Feb 2023 • Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré
Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale.
Ranked #41 on Question Answering on BoolQ
1 code implementation • 13 Feb 2023 • Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré
We find that a key requirement to achieving high performance is keeping the convolution kernels smooth.
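For context, a short PyTorch sketch of an FFT-based long convolution with a crude kernel-smoothing step (a simple moving average); this only illustrates the kind of operation being regularized, not the paper's parameterization:

```python
import torch
import torch.nn.functional as F

def fft_conv(u, k):
    """Causal long convolution y = u * k via FFT in O(L log L)."""
    L = u.shape[-1]
    n = 2 * L  # zero-pad to avoid circular wrap-around
    y = torch.fft.irfft(torch.fft.rfft(u, n=n) * torch.fft.rfft(k, n=n), n=n)
    return y[..., :L]

def smooth(k, width=5):
    """One simple way to keep a learned kernel smooth: average neighbouring taps
    (illustrative; the paper studies regularizing learned long kernels)."""
    w = torch.ones(1, 1, width) / width
    return F.conv1d(k[None, None, :], w, padding=width // 2)[0, 0, :k.shape[-1]]

L = 1024
u = torch.randn(L)
k = smooth(torch.randn(L))
y = fft_conv(u, k)
```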
3 code implementations • 28 Dec 2022 • Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré
First, we use synthetic language modeling tasks to understand the gap between SSMs and attention.
Ranked #5 on Language Modelling on WikiText-103 (using extra training data)
1 code implementation • 26 Nov 2022 • Michael Poli, Stefano Massaroli, Federico Berto, Jinkyoo Park, Tri Dao, Christopher Ré, Stefano Ermon
Instead, this work introduces a blueprint for frequency domain learning through a single transform: transform once (T1).
1 code implementation • 12 Oct 2022 • Eric Nguyen, Karan Goel, Albert Gu, Gordon W. Downs, Preey Shah, Tri Dao, Stephen A. Baccus, Christopher Ré
On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by $1.5\%$ when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D.
no code implementations • 28 Sep 2022 • Chenlin Meng, Linqi Zhou, Kristy Choi, Tri Dao, Stefano Ermon
Normalizing flows model complex probability distributions using maps obtained by composing invertible layers.
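A toy illustration of that composition and the change-of-variables bookkeeping it requires (elementwise affine layers are used purely for simplicity; the paper's contribution is building the invertible layers from butterfly matrices, which this sketch does not attempt):

```python
import numpy as np

# Toy normalizing flow: compose invertible elementwise-affine layers and
# accumulate the log|det Jacobian| needed for the change-of-variables formula.
class AffineLayer:
    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift          # scale must be nonzero

    def forward(self, x):
        return x * self.scale + self.shift, np.sum(np.log(np.abs(self.scale)))

    def inverse(self, y):
        return (y - self.shift) / self.scale

layers = [AffineLayer(np.array([2.0, 0.5]), np.array([0.1, -0.3])),
          AffineLayer(np.array([1.5, 3.0]), np.array([0.0, 1.0]))]

x = np.array([0.2, -1.0])
z, logdet = x, 0.0
for layer in layers:
    z, ld = layer.forward(z)
    logdet += ld

# log p(x) = log N(z; 0, I) + log|det df/dx|  (change of variables)
log_px = -0.5 * np.sum(z**2) - 0.5 * len(z) * np.log(2 * np.pi) + logdet
```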
1 code implementation • 2 Jun 2022 • Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang
Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks.
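As a generic illustration of communication compression (not this paper's scheme with guarantees), a top-k sparsification sketch that transmits only a small fraction of the entries:

```python
import numpy as np

def topk_compress(g, k):
    """Generic top-k sparsification: keep only the k largest-magnitude entries
    (indices + values). One common compression scheme, shown only to illustrate
    the communication bottleneck being attacked, not this paper's algorithm."""
    idx = np.argpartition(np.abs(g), -k)[-k:]
    return idx, g[idx]

def topk_decompress(idx, vals, n):
    g_hat = np.zeros(n)
    g_hat[idx] = vals
    return g_hat

g = np.random.default_rng(0).standard_normal(10_000)
idx, vals = topk_compress(g, k=100)          # ~1% of the original traffic
g_hat = topk_decompress(idx, vals, g.size)
```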
1 code implementation • 2 Jun 2022 • Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang
Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network.
5 code implementations • 27 May 2022 • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.
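A naive PyTorch reference for what block-sparse attention computes: whole tiles of the score matrix are masked out before the softmax. This shows only the computation; the speed of FlashAttention comes from its IO-aware tiled kernel, which is not reproduced here:

```python
import torch

def blocksparse_attention_reference(q, k, v, block_mask, block=64):
    """Reference block-sparse attention: drop whole (block x block) tiles of the
    score matrix before softmax, then attend as usual."""
    L, d = q.shape
    scores = (q @ k.T) / d**0.5
    mask = block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    scores = scores.masked_fill(~mask[:L, :L], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

L, d, block = 256, 64, 64
q, k, v = (torch.randn(L, d) for _ in range(3))
block_mask = torch.rand(L // block, L // block) > 0.5   # keep roughly half the tiles
block_mask |= torch.eye(L // block, dtype=torch.bool)   # always keep diagonal tiles
out = blocksparse_attention_reference(q, k, v, block_mask, block)
```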
1 code implementation • 1 Apr 2022 • Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré
To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms).
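A small NumPy sketch of a Monarch-style product for n = m² (two block-diagonal factors interleaved with a fixed reshape-transpose permutation); the conventions here are simplified assumptions, not the paper's exact parameterization:

```python
import numpy as np

def block_diag(blocks):
    """Dense block-diagonal matrix assembled from square blocks."""
    n = sum(b.shape[0] for b in blocks)
    out, i = np.zeros((n, n)), 0
    for b in blocks:
        out[i:i + b.shape[0], i:i + b.shape[0]] = b
        i += b.shape[0]
    return out

m = 4
n = m * m
rng = np.random.default_rng(0)
L = block_diag([rng.standard_normal((m, m)) for _ in range(m)])   # block-diagonal factor
R = block_diag([rng.standard_normal((m, m)) for _ in range(m)])   # block-diagonal factor
perm = np.arange(n).reshape(m, m).T.reshape(-1)                   # reshape-transpose permutation
P = np.eye(n)[perm]                                               # an involution, so P = P^T

M = P.T @ L @ P @ R              # dense Monarch-style matrix, built only for checking

# Applying M reduces to the two block-diagonal factors plus index permutations
# (dense L and R are used below only to keep the sketch short).
x = rng.standard_normal(n)
y = (R @ x)[perm]                # apply R, then the permutation
y = (L @ y)[perm]                # apply L, then the (self-inverse) permutation
assert np.allclose(M @ x, y)
```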
1 code implementation • ICLR 2022 • Tri Dao, Beidi Chen, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, Christopher Ré
To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices.
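A small sketch of that fixed butterfly structure: log₂(n) sparse factors, each mixing pairs of coordinates at a different stride, whose product is generically dense despite having only O(n log n) parameters (construction details here are illustrative):

```python
import numpy as np

def butterfly_factor(n, stride, rng):
    """One butterfly factor: each output mixes exactly two inputs `stride` apart,
    so the factor has 2n nonzeros (illustrative construction)."""
    B = np.zeros((n, n))
    for i in range(n):
        j = i ^ stride            # partner index (bit flip), as in FFT butterflies
        B[i, i], B[i, j] = rng.standard_normal(2)
    return B

n = 16
rng = np.random.default_rng(0)
factors = [butterfly_factor(n, 2**k, rng) for k in range(int(np.log2(n)))]

# Product of log2(n) sparse factors: O(n log n) parameters, yet generically dense,
# which is why the fixed butterfly sparsity pattern is a convenient set to optimize over.
M = np.linalg.multi_dot(factors)
print(M.shape, sum(int(np.count_nonzero(B)) for B in factors), "nonzeros in the factors")
```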
1 code implementation • NeurIPS 2021 • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré
Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences.
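A toy NumPy decomposition showing why the two structures are complementary: a rank-r term captures the diffuse part of a matrix, and a sparse correction captures the few large entries it misses. This is only a sketch of the principle; Scatterbrain itself builds the low-rank part from kernel features and the sparse part from locality-sensitive hashing, which is not attempted here:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))   # diffuse low-rank part
A[rng.integers(0, 64, 50), rng.integers(0, 64, 50)] += 5.0        # a few large "spiky" entries

U, s, Vt = np.linalg.svd(A)
r = 8
low_rank = U[:, :r] * s[:r] @ Vt[:r]                              # rank-r approximation

resid = A - low_rank
thresh = np.quantile(np.abs(resid), 0.99)                         # keep the top ~1% of entries
sparse = np.where(np.abs(resid) >= thresh, resid, 0.0)

err_lr = np.linalg.norm(A - low_rank)
err_both = np.linalg.norm(A - low_rank - sparse)
print(f"low-rank only: {err_lr:.2f}   low-rank + sparse: {err_both:.2f}")
```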
2 code implementations • NeurIPS 2021 • Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher Ré
Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency.
Ranked #2 on Sequential Image Classification on Sequential MNIST
1 code implementation • ICLR 2021 • Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, Lester Mackey
A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model.
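The baseline setup described here, in a short PyTorch sketch (the standard temperature-scaled distillation loss; this is only the starting point, not the paper's own method):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard distillation objective: a mix of cross-entropy on the labels and
    KL divergence to the teacher's temperature-softened class probabilities."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    return alpha * hard + (1 - alpha) * soft

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```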
3 code implementations • NeurIPS 2021 • Nicholas Roberts, Mikhail Khodak, Tri Dao, Liam Li, Christopher Ré, Ameet Talwalkar
An important goal of AutoML is to automate away the design of neural networks on new tasks in under-explored domains.
no code implementations • 1 Jan 2021 • Nicholas Carl Roberts, Mikhail Khodak, Tri Dao, Liam Li, Nina Balcan, Christopher Re, Ameet Talwalkar
An important goal of neural architecture search (NAS) is to automate away the design of neural networks on new tasks in under-explored domains, thus helping to democratize machine learning.
no code implementations • ICLR 2021 • Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, Christopher Re
Recent advances by practitioners in the deep learning community have breathed new life into Locality Sensitive Hashing (LSH), using it to reduce memory and time bottlenecks in neural network (NN) training.
2 code implementations • ICLR 2020 • Tri Dao, Nimit S. Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, Christopher Ré
Modern neural network architectures use structured linear transformations, such as low-rank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps.
2 code implementations • NeurIPS 2020 • Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re
A central problem in learning from sequential data is representing cumulative history in an incremental fashion as more data is processed.
Ranked #8 on Sequential Image Classification on Sequential MNIST
1 code implementation • NeurIPS 2019 • Jonathan Kuck, Tri Dao, Hamid Rezatofighi, Ashish Sabharwal, Stefano Ermon
Computing the permanent of a non-negative matrix is a core problem with practical applications ranging from target tracking to statistical thermodynamics.
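For scale, an exact-permanent sketch via Ryser's formula; its 2^n cost is exactly why approximation and sampling are needed at practical problem sizes:

```python
import numpy as np
from itertools import combinations

def permanent_ryser(A):
    """Exact permanent via Ryser's formula, O(2^n * n^2): fine for tiny matrices,
    hopeless at scale."""
    n = A.shape[0]
    total = 0.0
    for size in range(1, n + 1):
        for cols in combinations(range(n), size):
            total += (-1) ** size * np.prod(A[:, cols].sum(axis=1))
    return (-1) ** n * total

A = np.random.default_rng(0).random((6, 6))
print(permanent_ryser(A))
```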
1 code implementation • NeurIPS 2019 • Avner May, Jian Zhang, Tri Dao, Christopher Ré
Finally, we show that by using the eigenspace overlap score as a selection criterion among compressed embeddings drawn from a representative set, we can efficiently identify the better-performing embedding with up to $2\times$ lower selection error rates than the next best measure of compression quality, while avoiding the cost of training a model for each task of interest.
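A minimal sketch of computing an eigenspace overlap score between an embedding matrix and its compressed version; the normalization convention and the toy quantization used as "compression" are assumptions on my part:

```python
import numpy as np

def eigenspace_overlap(X, X_tilde):
    """Eigenspace overlap between two embedding matrices: squared Frobenius norm
    of U^T U_tilde for orthonormal bases of their column spans, normalized to [0, 1]."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    return np.linalg.norm(U.T @ U_t, "fro") ** 2 / max(U.shape[1], U_t.shape[1])

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 300))       # "uncompressed" embeddings
X_compressed = np.round(X * 4) / 4           # toy compression: coarse quantization
print(eigenspace_overlap(X, X_compressed))   # close to 1 => subspaces well preserved
```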
1 code implementation • 14 Mar 2019 • Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, Christopher Ré
Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions.
1 code implementation • 31 Oct 2018 • Jian Zhang, Avner May, Tri Dao, Christopher Ré
We investigate how to train kernel approximation methods that generalize well under a memory budget.
1 code implementation • NeurIPS 2018 • Anna T. Thomas, Albert Gu, Tri Dao, Atri Rudra, Christopher Ré
The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual.
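A classic instance of the framework, sketched in NumPy: a Toeplitz matrix has displacement rank at most 2 under the shift operators (Z_1, Z_{-1}), so it is captured by two displacement operators plus a rank-2 residual (example only):

```python
import numpy as np

def shift(n, f):
    """Unit f-circulant shift Z_f: ones on the subdiagonal, f in the top-right corner."""
    Z = np.diag(np.ones(n - 1), k=-1)
    Z[0, -1] = f
    return Z

n = 8
rng = np.random.default_rng(0)
t = rng.standard_normal(2 * n - 1)                                  # Toeplitz diagonals
T = np.array([[t[i - j + n - 1] for j in range(n)] for i in range(n)])

# Sylvester displacement Z_1 T - T Z_{-1} has rank <= 2 for any Toeplitz T.
D = shift(n, 1.0) @ T - T @ shift(n, -1.0)
print(np.linalg.matrix_rank(D))                                     # prints 2
```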
no code implementations • 16 Mar 2018 • Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, Christopher Ré
Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines.
no code implementations • NeurIPS 2017 • Tri Dao, Christopher De Sa, Christopher Ré
We show that deterministic feature maps can be constructed, for any $\gamma > 0$, to achieve error $\epsilon$ with $O(e^{e^\gamma} + \epsilon^{-1/\gamma})$ samples as $\epsilon$ goes to 0.
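A one-dimensional sketch of the deterministic-feature idea, using Gauss-Hermite quadrature nodes in place of random frequencies for the Gaussian kernel; the paper handles the general multi-dimensional constructions and the sample-complexity bound quoted above:

```python
import numpy as np

# Deterministic feature map for the 1-D Gaussian kernel
# k(x, y) = exp(-(x - y)^2 / 2) = E_{w~N(0,1)}[cos(w (x - y))],
# with the expectation replaced by Gauss-Hermite quadrature.
def quadrature_features(x, m=16):
    t, v = np.polynomial.hermite.hermgauss(m)        # Gauss-Hermite nodes/weights
    omega = np.sqrt(2.0) * t                         # frequencies for N(0, 1)
    p = v / np.sqrt(np.pi)                           # quadrature weights sum to ~1
    return np.concatenate([np.sqrt(p) * np.cos(omega * x),
                           np.sqrt(p) * np.sin(omega * x)])

x, y = 0.3, -0.8
approx = quadrature_features(x) @ quadrature_features(y)
exact = np.exp(-0.5 * (x - y) ** 2)
print(approx, exact)                                 # agree to many decimal places
```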