Search Results for author: Tri Dao

Found 31 papers, 25 papers with code

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

1 code implementation 17 Jul 2023 Tri Dao

We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low occupancy or unnecessary shared memory reads/writes.
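
For orientation, a minimal usage sketch is below. It assumes the flash-attn (v2) package is installed and a CUDA GPU with fp16 support is available (neither is stated in this listing), and compares the fused kernel against a naive PyTorch reference.

```python
# Sketch: fused FlashAttention call vs. a naive O(S^2)-memory reference.
import math
import torch
from flash_attn import flash_attn_func

B, S, H, D = 2, 1024, 8, 64                        # batch, seq len, heads, head dim
q, k, v = (torch.randn(B, S, H, D, device="cuda", dtype=torch.float16) for _ in range(3))

out = flash_attn_func(q, k, v, causal=True)        # fused, IO-aware attention, (B, S, H, D)

# Naive reference in (B, H, S, D) layout, computed in float32.
qt, kt, vt = (t.transpose(1, 2).float() for t in (q, k, v))
scores = qt @ kt.transpose(-2, -1) / math.sqrt(D)
mask = torch.triu(torch.ones(S, S, device="cuda", dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))   # causal masking
ref = (scores.softmax(-1) @ vt).transpose(1, 2).to(q.dtype)
print((out - ref).abs().max())                     # small fp16-level discrepancy expected
```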

Language Modelling Video Generation

Effectively Modeling Time Series with Simple Discrete State Spaces

1 code implementation 16 Mar 2023 Michael Zhang, Khaled K. Saab, Michael Poli, Tri Dao, Karan Goel, Christopher Ré

For expressivity, we propose a new SSM parameterization based on the companion matrix -- a canonical representation for discrete-time processes -- which enables SpaceTime's SSM layers to learn desirable autoregressive processes.
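
As a toy illustration of the companion-matrix idea (the coefficients and dimensions below are made up, and this is not the SpaceTime code), a companion-form state transition reproduces an AR(p) recurrence:

```python
# Companion-form transition: the state holds the last p outputs, and the first
# row of A carries the AR coefficients, so x[0] after each step is the next
# output of the AR(p) process x_t = a_1 x_{t-1} + ... + a_p x_{t-p}.
import numpy as np

a = np.array([0.5, -0.2, 0.1])           # hypothetical AR(3) coefficients
p = len(a)
A = np.zeros((p, p))
A[0, :] = a                              # first row: AR coefficients
A[1:, :-1] = np.eye(p - 1)               # sub-diagonal: shift past outputs down

x = np.array([1.0, 0.0, 0.0])            # state = [x_{t-1}, x_{t-2}, x_{t-3}]
for _ in range(5):
    x = A @ x
    print(x[0])                          # next AR(3) output
```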

Time Series Time Series Classification

S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces

1 code implementation 12 Oct 2022 Eric Nguyen, Karan Goel, Albert Gu, Gordon W. Downs, Preey Shah, Tri Dao, Stephen A. Baccus, Christopher Ré

On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by $1.5\%$ when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D.

Inductive Bias Video Classification

ButterflyFlow: Building Invertible Layers with Butterfly Matrices

no code implementations 28 Sep 2022 Chenlin Meng, Linqi Zhou, Kristy Choi, Tri Dao, Stefano Ermon

Normalizing flows model complex probability distributions using maps obtained by composing invertible layers.
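
For context, the change-of-variables identity such flows optimize (standard flow material, not specific to ButterflyFlow) is:

```latex
% Log-likelihood of a normalizing flow f = f_K \circ \cdots \circ f_1 mapping data x
% to a base density p_Z, with intermediates z_0 = x and z_k = f_k(z_{k-1}):
\log p_X(x) \;=\; \log p_Z(z_K) \;+\; \sum_{k=1}^{K} \log\left|\det\frac{\partial f_k}{\partial z_{k-1}}\right|
```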

Density Estimation

Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees

1 code implementation 2 Jun 2022 Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Ré, Ce Zhang

Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks.

Decentralized Training of Foundation Models in Heterogeneous Environments

1 code implementation 2 Jun 2022 Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Ré, Ce Zhang

Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network.

Scheduling

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

5 code implementations 27 May 2022 Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.
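
A rough sketch of what block-sparse attention means is below; the banded block layout and block size are hypothetical choices for illustration, not the paper's configuration.

```python
# Toy block-sparse attention in pure PyTorch: score blocks outside `block_mask`
# are never attended to; a fused kernel would skip the masked blocks entirely.
import math
import torch

S, D, BLK = 256, 64, 64                       # sequence length, head dim, block size
q, k, v = (torch.randn(S, D) for _ in range(3))
nb = S // BLK

# Hypothetical banded layout: each query block attends to itself and its left neighbor.
block_mask = torch.zeros(nb, nb, dtype=torch.bool)
for i in range(nb):
    block_mask[i, max(i - 1, 0):i + 1] = True

dense_mask = block_mask.repeat_interleave(BLK, 0).repeat_interleave(BLK, 1)
scores = (q @ k.T) / math.sqrt(D)
scores = scores.masked_fill(~dense_mask, float("-inf"))
out = scores.softmax(-1) @ v                  # (S, D)
```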

Document Classification Image Classification +1

Monarch: Expressive Structured Matrices for Efficient and Accurate Training

1 code implementation 1 Apr 2022 Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré

To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms).
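
A minimal sketch of this structure follows; the interleaving permutation convention and shapes are assumptions for illustration, not the official parameterization (see the paper's code for that).

```python
# Monarch-style multiply: block-diagonal R, fixed "reshape-transpose" permutation,
# block-diagonal L, permutation again. A structured implementation multiplies
# block-by-block in O(n * sqrt(n)) time; dense block-diagonal matrices are
# materialized here only for clarity.
import torch

n = 16
b = 4                                   # sqrt(n) blocks, each b x b
R = torch.block_diag(*[torch.randn(b, b) for _ in range(b)])
L = torch.block_diag(*[torch.randn(b, b) for _ in range(b)])

def perm(x):
    # View the length-n vector as a (b, b) grid, transpose, flatten back.
    return x.reshape(b, b).T.reshape(n)

x = torch.randn(n)
y = perm(L @ perm(R @ x))               # the dense equivalent is a single n x n matmul
```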

Language Modelling MRI Reconstruction

Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

1 code implementation ICLR 2022 Tri Dao, Beidi Chen, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, Christopher Ré

To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices.

Language Modelling

Scatterbrain: Unifying Sparse and Low-rank Attention Approximation

1 code implementation NeurIPS 2021 Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré

Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences.
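
Schematically (notation mine, with the feature map and sparse support left abstract rather than spelling out the paper's estimators), the unification approximates the unnormalized attention matrix as a low-rank term plus a sparse correction on a support $\Omega$:

```latex
% A = \exp(QK^\top/\sqrt{d}) is approximated by a kernel-feature low-rank term
% plus a sparse correction that fixes up the entries in \Omega:
A \;\approx\; \underbrace{\phi(Q)\,\phi(K)^\top}_{\text{low rank}} + \underbrace{S}_{\text{sparse}},
\qquad
S_{ij} = \begin{cases} A_{ij} - \bigl[\phi(Q)\phi(K)^\top\bigr]_{ij} & (i,j)\in\Omega \\ 0 & \text{otherwise.} \end{cases}
```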

Image Generation Language Modelling

Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers

2 code implementations NeurIPS 2021 Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher Ré

Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency.
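
For reference, the linear state-space model these layers are built on (standard continuous-time form) is:

```latex
% Linear state space model with input u(t), state x(t), output y(t):
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)
% After discretization with step \Delta, it unrolls either as a linear recurrence
% (RNN view) or as a convolution with a fixed kernel (CNN view), which is what
% lets a single layer inherit the strengths of each family.
```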

Memorization Sequential Image Classification +2

Knowledge Distillation as Semiparametric Inference

1 code implementation ICLR 2021 Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, Lester Mackey

A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model.
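
For background, the classic distillation objective that "mimicking class probabilities" usually refers to is sketched below; the temperature and mixing weight are illustrative hyperparameters, and this is the standard Hinton-style loss rather than the semiparametric estimator developed in this paper.

```python
# Classic knowledge-distillation loss: soften teacher/student logits with a
# temperature T and blend the KL term with cross-entropy on the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```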

Knowledge Distillation Model Compression

Searching for Convolutions and a More Ambitious NAS

no code implementations 1 Jan 2021 Nicholas Carl Roberts, Mikhail Khodak, Tri Dao, Liam Li, Nina Balcan, Christopher Ré, Ameet Talwalkar

An important goal of neural architecture search (NAS) is to automate away the design of neural networks on new tasks in under-explored domains, thus helping to democratize machine learning.

Neural Architecture Search

MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training

no code implementations ICLR 2021 Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, Christopher Ré

Recent advances by practitioners in the deep learning community have breathed new life into Locality Sensitive Hashing (LSH), using it to reduce memory and time bottlenecks in neural network (NN) training.

Efficient Neural Network Language Modelling +2

Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

2 code implementations ICLR 2020 Tri Dao, Nimit S. Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, Christopher Ré

Modern neural network architectures use structured linear transformations, such as low-rank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps.

Image Classification speech-recognition +1

Approximating the Permanent by Sampling from Adaptive Partitions

1 code implementation NeurIPS 2019 Jonathan Kuck, Tri Dao, Hamid Rezatofighi, Ashish Sabharwal, Stefano Ermon

Computing the permanent of a non-negative matrix is a core problem with practical applications ranging from target tracking to statistical thermodynamics.
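
As a reminder of the object being approximated: the permanent sums products over all permutations, like the determinant but without sign terms, so brute-force evaluation is exponential (hence the need for sampling schemes like the one proposed here). A tiny reference implementation:

```python
# Brute-force permanent: sum over all n! permutations of products of entries.
# Practical only for very small n.
from itertools import permutations
import numpy as np

def permanent(A):
    n = A.shape[0]
    return sum(np.prod([A[i, s[i]] for i in range(n)]) for s in permutations(range(n)))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(permanent(A))                      # 1*4 + 2*3 = 10
```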

On the Downstream Performance of Compressed Word Embeddings

1 code implementation NeurIPS 2019 Avner May, Jian Zhang, Tri Dao, Christopher Ré

Finally, we show that by using the eigenspace overlap score as a selection criterion between embeddings drawn from a representative set we compressed, we can efficiently identify the better performing embedding with up to $2\times$ lower selection error rates than the next best measure of compression quality, and avoid the cost of training a model for each task of interest.

Generalization Bounds Quantization +1

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

1 code implementation 14 Mar 2019 Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, Christopher Ré

Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions.
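
A minimal sketch of the butterfly structure these factorizations generalize is below; it uses fixed ±1 twiddles (so the product is a Hadamard-style transform), whereas the paper learns the factor entries to recover transforms such as the DFT.

```python
# Butterfly factors for n = 8: each factor mixes index pairs that differ in one
# bit and has exactly two nonzeros per row; applying the log2(n) sparse factors
# costs O(n log n), even though their product is a dense n x n matrix.
import numpy as np

def butterfly_factor(n, block):
    """Block-diagonal matrix of 2x2 butterflies pairing i with i + block//2."""
    B = np.zeros((n, n))
    half = block // 2
    for start in range(0, n, block):
        for i in range(half):
            a, b = start + i, start + i + half
            B[a, a], B[a, b] = 1.0, 1.0     # fixed +/-1 twiddles (illustrative)
            B[b, a], B[b, b] = 1.0, -1.0
    return B

n = 8
factors = [butterfly_factor(n, blk) for blk in (2, 4, 8)]
M = np.linalg.multi_dot(factors[::-1])      # dense product of sparse factors
```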

BIG-bench Machine Learning

Low-Precision Random Fourier Features for Memory-Constrained Kernel Approximation

1 code implementation 31 Oct 2018 Jian Zhang, Avner May, Tri Dao, Christopher Ré

We investigate how to train kernel approximation methods that generalize well under a memory budget.
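
As background, a random Fourier feature map for the Gaussian kernel (the kind of features whose precision is reduced in this work; the bandwidth and dimensions below are illustrative) looks like:

```python
# Random Fourier features for k(x, y) = exp(-||x - y||^2 / 2): the inner product
# z(x)^T z(y) is an unbiased estimate of k(x, y); low-precision variants quantize z.
import numpy as np

rng = np.random.default_rng(0)
d, D = 16, 2048                            # input dim, number of random features
W = rng.normal(size=(D, d))                # frequencies ~ N(0, I) for unit bandwidth
b = rng.uniform(0, 2 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x = 0.2 * rng.normal(size=d)
y = x + 0.05 * rng.normal(size=d)
print(z(x) @ z(y), np.exp(-np.linalg.norm(x - y) ** 2 / 2))   # estimate vs. exact kernel
```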

Quantization

Learning Compressed Transforms with Low Displacement Rank

1 code implementation NeurIPS 2018 Anna T. Thomas, Albert Gu, Tri Dao, Atri Rudra, Christopher Ré

The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual.
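
Concretely, the Sylvester-type displacement equation this framework is built on can be written as:

```latex
% A structured matrix M is represented through displacement operators A, B and a
% rank-r residual G H^\top with r \ll n:
\nabla_{A,B}(M) \;=\; A M - M B \;=\; G H^\top, \qquad G, H \in \mathbb{R}^{n \times r}
```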

Image Classification Language Modelling

A Kernel Theory of Modern Data Augmentation

no code implementations 16 Mar 2018 Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, Christopher Ré

Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines.

BIG-bench Machine Learning Data Augmentation

Gaussian Quadrature for Kernel Features

no code implementations NeurIPS 2017 Tri Dao, Christopher De Sa, Christopher Ré

We show that deterministic feature maps can be constructed, for any $\gamma > 0$, to achieve error $\epsilon$ with $O(e^{e^\gamma} + \epsilon^{-1/\gamma})$ samples as $\epsilon$ goes to 0.

speech-recognition Speech Recognition
