no code implementations • 3 Sep 2024 • Ricardo Buitrago Ruiz, Tanya Marwah, Albert Gu, Andrej Risteski
Data-driven techniques have emerged as a promising alternative to traditional numerical methods for solving partial differential equations (PDEs).
1 code implementation • 19 Aug 2024 • Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, Albert Gu
In this work, we present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
1 code implementation • 13 Jul 2024 • Sukjun Hwang, Aakash Lahoti, Tri Dao, Albert Gu
We identify a key axis of matrix parameterizations termed sequence alignment, which increases the flexibility and performance of matrix mixers, providing insights into the strong performance of Transformers and recent SSMs such as Mamba.
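A "matrix mixer" here is a sequence layer whose token mixing can be written as an $L \times L$ matrix applied across the sequence dimension. The toy NumPy sketch below is illustrative only (not the paper's code): it contrasts a dense, data-dependent attention-style mixer with a structured, causal SSM-style mixer built from a decaying linear recurrence.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# A sequence mixer applies an L x L matrix M across the sequence dimension.
L, d = 6, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))

# Attention-style mixer: M is dense and data-dependent.
Q, K = X @ rng.normal(size=(d, d)), X @ rng.normal(size=(d, d))
M_attn = softmax(Q @ K.T / np.sqrt(d))

# SSM-style mixer: M is structured (causal, geometric decay),
# i.e. the matrix form of the recurrence h_t = 0.9 * h_{t-1} + x_t.
M_ssm = np.tril(0.9 ** (np.arange(L)[:, None] - np.arange(L)[None, :]))

Y_attn, Y_ssm = M_attn @ X, M_ssm @ X
```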
1 code implementation • 12 Jun 2024 • Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro
On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average.
2 code implementations • 31 May 2024 • Tri Dao, Albert Gu
While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale.
4 code implementations • 5 Mar 2024 • Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov
Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics.
3 code implementations • 29 Feb 2024 • Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando de Freitas, Caglar Gulcehre
Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale.
28 code implementations • 1 Dec 2023 • Albert Gu, Tri Dao
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module.
Ranked #25 on Language Modelling on LAMBADA
no code implementations • 15 Sep 2023 • Haozhe Shan, Albert Gu, Zhong Meng, Weiran Wang, Krzysztof Choromanski, Tara Sainath
Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems.
10 code implementations • 11 Mar 2023 • Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, Soham De
Recurrent Neural Networks (RNNs) offer fast inference on long sequences but are hard to optimize and slow to train.
2 code implementations • NeurIPS 2023 • Chris Lu, Yannick Schroecker, Albert Gu, Emilio Parisotto, Jakob Foerster, Satinder Singh, Feryal Behbahani
We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel, allowing us to tackle reinforcement learning tasks.
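As a rough illustration of the mechanism being described (not the paper's implementation), the NumPy sketch below computes a linear recurrence with a parallel prefix scan and uses episode-boundary flags to drop the carried state, which is the property that makes such a layer convenient for reinforcement learning.

```python
import numpy as np

def scan_linear_recurrence(a, b, reset):
    """Compute h_t = a_t * h_{t-1} + b_t (with h_{-1} = 0), dropping the
    carried state wherever reset[t] is True, via a Hillis-Steele prefix scan
    over the associative operator (a1, b1) . (a2, b2) = (a1*a2, a2*b1 + b2)."""
    A = np.where(reset, 0.0, a)      # a zeroed coefficient discards the carry
    B = b.astype(float).copy()
    n, offset = len(a), 1
    while offset < n:
        A_prev, B_prev = A.copy(), B.copy()
        A[offset:] = A_prev[:-offset] * A_prev[offset:]
        B[offset:] = A_prev[offset:] * B_prev[:-offset] + B_prev[offset:]
        offset *= 2
    return B                          # B[t] now equals h_t

# sanity check against the naive sequential recurrence
rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)
reset = np.array([1, 0, 0, 1, 0, 0, 0, 0], dtype=bool)
h, ref = 0.0, []
for t in range(8):
    h = a[t] * (0.0 if reset[t] else h) + b[t]
    ref.append(h)
assert np.allclose(scan_linear_recurrence(a, b, reset), ref)
```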
1 code implementation • 25 Jan 2023 • David M. Knigge, David W. Romero, Albert Gu, Efstratios Gavves, Erik J. Bekkers, Jakub M. Tomczak, Mark Hoogendoorn, Jan-Jakob Sonke
Performant Convolutional Neural Network (CNN) architectures must be tailored to specific tasks in order to consider the length, resolution, and dimensionality of the input data.
1 code implementation • 20 Dec 2022 • Junxiong Wang, Jing Nathan Yan, Albert Gu, Alexander M. Rush
Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
1 code implementation • 12 Oct 2022 • Eric Nguyen, Karan Goel, Albert Gu, Gordon W. Downs, Preey Shah, Tri Dao, Stephen A. Baccus, Christopher Ré
On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by $1.5\%$ when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D.
1 code implementation • 24 Jun 2022 • Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, Christopher Ré
Linear time-invariant state space models (SSMs) are a classical model family from engineering and statistics that has recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). (The standard state-space form is written out after this entry.)
Ranked #2 on Long-range modeling on LRA
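For reference, the linear time-invariant state space model referred to above is conventionally written as

$$x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t),$$

where $u$ is the input signal, $x$ the latent state, and $y$ the output; S4 and its successors impose structure on $A$ so that this map can be computed efficiently at long sequence lengths.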
2 code implementations • 23 Jun 2022 • Albert Gu, Ankit Gupta, Karan Goel, Christopher Ré
On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model when using a specific initialization based on approximating S4's matrix.
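A minimal NumPy sketch of the idea (diagonal state matrix, computed in convolution form) follows; the zero-order-hold discretization is standard, but the initialization shown is only illustrative and is not the exact DSS/S4D recipe.

```python
import numpy as np

def diag_ssm_apply(u, A, B, C, dt):
    """Run a single-channel diagonal SSM over input u via its convolution kernel.
    Continuous model: x'(t) = diag(A) x(t) + B u(t), y(t) = C x(t),
    discretized with zero-order hold at step size dt."""
    L = len(u)
    Abar = np.exp(A * dt)                    # ZOH is exact when A is diagonal
    Bbar = (Abar - 1.0) / A * B
    powers = Abar[None, :] ** np.arange(L)[:, None]       # (L, N) table of Abar^l
    K = (powers * (C * Bbar)[None, :]).sum(axis=1).real   # K_l = sum_n C_n Abar_n^l Bbar_n
    return np.convolve(u, K)[:L]             # causal convolution y_t = sum_l K_l u_{t-l}

# illustrative diagonal initialization (not the paper's exact recipe)
N = 16
A = -0.5 + 1j * np.pi * np.arange(N)
B = np.ones(N, dtype=complex)
C = np.random.default_rng(0).normal(size=N) + 0j
y = diag_ssm_apply(np.sin(np.linspace(0, 8 * np.pi, 256)), A, B, C, dt=0.01)
```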
1 code implementation • 7 Jun 2022 • David W. Romero, David M. Knigge, Albert Gu, Erik J. Bekkers, Efstratios Gavves, Jakub M. Tomczak, Mark Hoogendoorn
The use of Convolutional Neural Networks (CNNs) is widespread in Deep Learning due to a range of desirable model properties which result in an efficient and effective machine learning framework.
2 code implementations • 27 Mar 2022 • Ankit Gupta, Albert Gu, Jonathan Berant
Modeling long range dependencies in sequential data is a fundamental step towards attaining human-level performance in many modalities such as text, vision, audio and video.
6 code implementations • 20 Feb 2022 • Karan Goel, Albert Gu, Chris Donahue, Christopher Ré
SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting.
7 code implementations • ICLR 2022 • Albert Gu, Karan Goel, Christopher Ré
A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies.
2 code implementations • NeurIPS 2021 • Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher Ré
Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency.
Ranked #2 on Sequential Image Classification on Sequential MNIST
1 code implementation • 7 Jun 2021 • Ines Chami, Albert Gu, Dat Nguyen, Christopher Ré
Given directions, PCA relies on: (1) a parameterization of subspaces spanned by these directions, (2) a method of projection onto subspaces that preserves information in these directions, and (3) an objective to optimize, namely the variance explained by projections.
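In the Euclidean setting those three ingredients reduce to ordinary PCA via the SVD, as in the toy sketch below; the paper's contribution is replacing each ingredient with a hyperbolic analogue, which this example does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy data
Xc = X - X.mean(axis=0)

# (1) directions: top-k right singular vectors parameterize the subspace
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
W = Vt[:k]                       # (k, d) orthonormal directions

# (2) projection onto the spanned subspace
Z = Xc @ W.T                     # coordinates within the subspace
X_proj = Z @ W                   # projection back into R^d

# (3) objective: variance explained by the projection
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(f"fraction of variance explained by {k} directions: {explained:.3f}")
```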
2 code implementations • ICLR 2020 • Tri Dao, Nimit S. Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, Christopher Ré
Modern neural network architectures use structured linear transformations, such as low-rank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps.
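As a toy illustration of the trade-off (not code from the paper), a rank-$r$ parameterization stores $2nr$ instead of $n^2$ numbers and multiplies in $O(nr)$ instead of $O(n^2)$ time:

```python
import numpy as np

n, r = 1024, 16
rng = np.random.default_rng(0)
U, V = rng.normal(size=(n, r)), rng.normal(size=(r, n))   # rank-r factors
x = rng.normal(size=n)

# structured (low-rank) map: 2*n*r parameters, O(n*r) multiply
y_structured = U @ (V @ x)

# dense map: n*n parameters, O(n^2) multiply (materialized only to check equivalence)
dense = U @ V
assert np.allclose(dense @ x, y_structured)
```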
1 code implementation • NeurIPS 2020 • Nimit S. Sohoni, Jared A. Dunnmon, Geoffrey Angus, Albert Gu, Christopher Ré
As the subclass labels are frequently unavailable, models trained using only the coarser-grained class labels often exhibit highly variable performance across different subclasses.
2 code implementations • NeurIPS 2020 • Ines Chami, Albert Gu, Vaggos Chatziafratis, Christopher Ré
Recently, Dasgupta reframed hierarchical clustering (HC) as a discrete optimization problem by introducing a global cost function measuring the quality of a given tree.
2 code implementations • NeurIPS 2020 • Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re
A central problem in learning from sequential data is representing cumulative history in an incremental fashion as more data is processed.
Ranked #8 on Sequential Image Classification on Sequential MNIST
1 code implementation • ICLR 2021 • Karan Goel, Albert Gu, Yixuan Li, Christopher Ré
Particularly concerning are models with inconsistent performance on specific subgroups of a class, e.g., exhibiting disparities in skin cancer classification in the presence or absence of a spurious bandage.
1 code implementation • ICML 2020 • Albert Gu, Caglar Gulcehre, Tom Le Paine, Matt Hoffman, Razvan Pascanu
Gating mechanisms are widely used in neural network models, where they allow gradients to backpropagate more easily through depth or time.
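A generic example of such a gate follows (illustrative only; the paper studies refinements to gate parameterization rather than this vanilla form). The gate interpolates between the previous state and a new candidate, giving gradients a near-identity path through time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_update(h_prev, x, Wg, Wc):
    """One gated recurrent step: the gate g decides how much of the old state
    to keep versus the new candidate."""
    z = np.concatenate([h_prev, x])
    g = sigmoid(Wg @ z)            # gate in (0, 1)
    c = np.tanh(Wc @ z)            # candidate state
    return (1 - g) * h_prev + g * c

rng = np.random.default_rng(0)
h, x = rng.normal(size=4), rng.normal(size=3)
Wg, Wc = rng.normal(size=(4, 7)), rng.normal(size=(4, 7))
h_next = gated_update(h, x, Wg, Wc)
```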
no code implementations • ICLR 2019 • Albert Gu, Frederic Sala, Beliz Gunel, Christopher Ré
The quality of the representations achieved by embeddings is determined by how well the geometry of the embedding space matches the structure of the data.
1 code implementation • 14 Mar 2019 • Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, Christopher Ré
Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions.
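A familiar instance, shown below as a small NumPy sketch rather than anything from the paper, is computing a linear convolution in $O(n \log n)$ time by going through the FFT:

```python
import numpy as np

def fft_convolve(a, b):
    """Linear convolution of a and b in O(n log n) time using the FFT."""
    n = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

a, b = np.arange(4.0), np.array([1.0, -1.0, 2.0])
assert np.allclose(fft_convolve(a, b), np.convolve(a, b))
```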
1 code implementation • NeurIPS 2018 • Anna T. Thomas, Albert Gu, Tri Dao, Atri Rudra, Christopher Ré
The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual.
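In the standard (Sylvester-type) formulation, a matrix $M \in \mathbb{R}^{n \times n}$ has displacement rank $r$ with respect to operator matrices $A$ and $B$ if

$$\nabla_{A,B}(M) = AM - MB = GH^\top, \qquad G, H \in \mathbb{R}^{n \times r},$$

with $r \ll n$; classical choices of $A$ and $B$ recover familiar classes such as Toeplitz-like and Vandermonde-like matrices.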
3 code implementations • ICML 2018 • Christopher De Sa, Albert Gu, Christopher Ré, Frederic Sala
Given a tree, we give a combinatorial construction that embeds the tree in hyperbolic space with arbitrarily low distortion without using optimization.
no code implementations • 16 Mar 2018 • Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, Christopher Ré
Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines.
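The simplest concrete instance (a toy sketch, not from the paper) is adding horizontally flipped copies of images, a transformation that preserves class labels for most natural-image tasks:

```python
import numpy as np

def augment_with_flips(images, labels):
    """Double the training set with horizontal flips, a class-preserving transform."""
    flipped = images[:, :, ::-1]        # mirror each H x W image left-right
    return np.concatenate([images, flipped]), np.concatenate([labels, labels])

images = np.random.default_rng(0).random(size=(8, 28, 28))
labels = np.arange(8) % 2
aug_x, aug_y = augment_with_flips(images, labels)
assert aug_x.shape == (16, 28, 28) and aug_y.shape == (16,)
```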