Search Results for author: Tri Dao

Found 37 papers, 30 papers with code

Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

1 code implementation • 5 Mar 2024 • Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics.

101

Paper
Code

StarCoder 2 and The Stack v2: The Next Generation

no code implementations • 29 Feb 2024 • Anton Lozhkov, Raymond Li, Loubna Ben allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size.

Ranked #24 on Code Generation on MBPP

Code Completion Code Generation +1

Paper
Add Code

BitDelta: Your Fine-Tune May Only Be Worth One Bit

1 code implementation • 15 Feb 2024 • James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.

158

Paper
Code

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

1 code implementation • 19 Jan 2024 • Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao

We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration.

1,837

Paper
Code

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

14 code implementations • 1 Dec 2023 • Albert Gu, Tri Dao

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module.

Ranked #25 on Language Modelling on LAMBADA

2D Pose Estimation Common Sense Reasoning +2

9,160

Paper
Code

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

1 code implementation • 26 Oct 2023 • Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen

We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising LLM's quality or in-context learning ability.

In-Context Learning

208

Paper
Code

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

6 code implementations • 17 Jul 2023 • Tri Dao

We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes.

Language Modelling

77,603

Paper
Code

StarCoder: may the source be with you!

4 code implementations • 9 May 2023 • Raymond Li, Loubna Ben allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15. 5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention.

Ranked #42 on Code Generation on MBPP

8k Code Generation

7,095

Paper
Code

Effectively Modeling Time Series with Simple Discrete State Spaces

1 code implementation • 16 Mar 2023 • Michael Zhang, Khaled K. Saab, Michael Poli, Tri Dao, Karan Goel, Christopher Ré

For expressivity, we propose a new SSM parameterization based on the companion matrix -- a canonical representation for discrete-time processes -- which enables SpaceTime's SSM layers to learn desirable autoregressive processes.

Time Series Time Series Classification

152

Paper
Code

Hyena Hierarchy: Towards Larger Convolutional Language Models

5 code implementations • 21 Feb 2023 • Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale.

Ranked #37 on Language Modelling on WikiText-103

2k 8k +2

836

Paper
Code

Simple Hardware-Efficient Long Convolutions for Sequence Modeling

1 code implementation • 13 Feb 2023 • Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré

We find that a key requirement to achieving high performance is keeping the convolution kernels smooth.

Image Classification Language Modelling

836

Paper
Code

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

3 code implementations • 28 Dec 2022 • Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

First, we use synthetic language modeling tasks to understand the gap between SSMs and attention.

Ranked #2 on Language Modelling on The Pile (Test perplexity metric)

8k Coreference Resolution +5

836

Paper
Code

Transform Once: Efficient Operator Learning in Frequency Domain

1 code implementation • 26 Nov 2022 • Michael Poli, Stefano Massaroli, Federico Berto, Jinykoo Park, Tri Dao, Christopher Ré, Stefano Ermon

Instead, this work introduces a blueprint for frequency domain learning through a single transform: transform once (T1).

Dimensionality Reduction Operator learning

Paper
Code

S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces

1 code implementation • 12 Oct 2022 • Eric Nguyen, Karan Goel, Albert Gu, Gordon W. Downs, Preey Shah, Tri Dao, Stephen A. Baccus, Christopher Ré

On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by $1. 5\%$ when training with a $1$D sequence of patches, and matches ConvNeXt when modeling images in $2$D.

Inductive Bias Video Classification

2,087

Paper
Code

ButterflyFlow: Building Invertible Layers with Butterfly Matrices

no code implementations • 28 Sep 2022 • Chenlin Meng, Linqi Zhou, Kristy Choi, Tri Dao, Stefano Ermon

Normalizing flows model complex probability distributions using maps obtained by composing invertible layers.

Density Estimation

Paper
Add Code

Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees

1 code implementation • 2 Jun 2022 • Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang

Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks.

Paper
Code

Decentralized Training of Foundation Models in Heterogeneous Environments

1 code implementation • 2 Jun 2022 • Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang

Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network.

Scheduling

Paper
Code

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

9 code implementations • 27 May 2022 • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.

16k 4k +3

77,603

Paper
Code

Monarch: Expressive Structured Matrices for Efficient and Accurate Training

1 code implementation • 1 Apr 2022 • Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré

To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms).

Language Modelling MRI Reconstruction

173

Paper
Code

Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

1 code implementation • ICLR 2022 • Tri Dao, Beidi Chen, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, Christopher Ré

To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices.

Language Modelling

173

Paper
Code

Scatterbrain: Unifying Sparse and Low-rank Attention Approximation

1 code implementation • NeurIPS 2021 • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré

Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences.

Image Generation Language Modelling

173

Paper
Code

Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers

2 code implementations • NeurIPS 2021 • Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher Ré

Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency.

Ranked #2 on Sequential Image Classification on Sequential MNIST

Computational Efficiency Memorization +3

2,087

Paper
Code

Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers

no code implementations • NeurIPS 2021 • Albert Gu, Isys Johnson, Karan Goel, Khaled Kamal Saab, Tri Dao, Atri Rudra, Christopher Re

Computational Efficiency Memorization +3

Paper
Add Code

Scatterbrain: Unifying Sparse and Low-rank Attention

1 code implementation • NeurIPS 2021 • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré

Image Generation Language Modelling

173

Paper
Code

Knowledge Distillation as Semiparametric Inference

1 code implementation • ICLR 2021 • Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, Lester Mackey

A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model.

Knowledge Distillation Model Compression

Paper
Code

Rethinking Neural Operations for Diverse Tasks

3 code implementations • NeurIPS 2021 • Nicholas Roberts, Mikhail Khodak, Tri Dao, Liam Li, Christopher Ré, Ameet Talwalkar

An important goal of AutoML is to automate-away the design of neural networks on new tasks in under-explored domains.

Image Classification Inductive Bias +3

Paper
Code

MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training

no code implementations • ICLR 2021 • Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, Christopher Re

Recent advances by practitioners in the deep learning community have breathed new life into Locality Sensitive Hashing (LSH), using it to reduce memory and time bottlenecks in neural network (NN) training.

Efficient Neural Network Language Modelling +2

Paper
Add Code

Searching for Convolutions and a More Ambitious NAS

no code implementations • 1 Jan 2021 • Nicholas Carl Roberts, Mikhail Khodak, Tri Dao, Liam Li, Nina Balcan, Christopher Re, Ameet Talwalkar

An important goal of neural architecture search (NAS) is to automate-away the design of neural networks on new tasks in under-explored domains, thus helping to democratize machine learning.

Neural Architecture Search

Paper
Add Code

Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

2 code implementations • ICLR 2020 • Tri Dao, Nimit S. Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, Christopher Ré

Modern neural network architectures use structured linear transformations, such as low-rank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps.

Image Classification speech-recognition +1

152

Paper
Code

HiPPO: Recurrent Memory with Optimal Polynomial Projections

2 code implementations • NeurIPS 2020 • Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re

A central problem in learning from sequential data is representing cumulative history in an incremental fashion as more data is processed.

Ranked #8 on Sequential Image Classification on Sequential MNIST

Permuted-MNIST Sequential Image Classification +2

134

Paper
Code

Approximating the Permanent by Sampling from Adaptive Partitions

1 code implementation • NeurIPS 2019 • Jonathan Kuck, Tri Dao, Hamid Rezatofighi, Ashish Sabharwal, Stefano Ermon

Computing the permanent of a non-negative matrix is a core problem with practical applications ranging from target tracking to statistical thermodynamics.

Paper
Code

On the Downstream Performance of Compressed Word Embeddings

1 code implementation • NeurIPS 2019 • Avner May, Jian Zhang, Tri Dao, Christopher Ré

Finally, we show that by using the eigenspace overlap score as a selection criterion between embeddings drawn from a representative set we compressed, we can efficiently identify the better performing embedding with up to $2\times$ lower selection error rates than the next best measure of compression quality, and avoid the cost of training a model for each task of interest.

Generalization Bounds Quantization +1

Paper
Code

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

1 code implementation • 14 Mar 2019 • Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, Christopher Ré

Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions.

BIG-bench Machine Learning

152

Paper
Code

Low-Precision Random Fourier Features for Memory-Constrained Kernel Approximation

1 code implementation • 31 Oct 2018 • Jian Zhang, Avner May, Tri Dao, Christopher Ré

We investigate how to train kernel approximation methods that generalize well under a memory budget.

Quantization

Paper
Code

Learning Compressed Transforms with Low Displacement Rank

1 code implementation • NeurIPS 2018 • Anna T. Thomas, Albert Gu, Tri Dao, Atri Rudra, Christopher Ré

The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual.

Image Classification Language Modelling

Paper
Code

A Kernel Theory of Modern Data Augmentation

no code implementations • 16 Mar 2018 • Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, Christopher Ré

Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines.

BIG-bench Machine Learning Data Augmentation

Paper
Add Code

Gaussian Quadrature for Kernel Features

no code implementations • NeurIPS 2017 • Tri Dao, Christopher De Sa, Christopher Ré

We show that deterministic feature maps can be constructed, for any $\gamma > 0$, to achieve error $\epsilon$ with $O(e^{e^\gamma} + \epsilon^{-1/\gamma})$ samples as $\epsilon$ goes to 0.

speech-recognition Speech Recognition

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.