no code implementations • 3 Oct 2024 • Michael E. Sander, Gabriel Peyré
Causal Transformers are trained to predict the next token for a given context.
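For intuition, here is a minimal sketch of the standard causal next-token training objective in PyTorch; the tiny one-layer model, vocabulary size, and random tokens are placeholder assumptions, not the setup studied in the paper.

```python
# Minimal sketch of the causal next-token objective: position t predicts
# token t+1 and attends only to positions <= t. The tiny one-layer model
# and random tokens below are placeholders, not the paper's setup.
import torch
import torch.nn as nn

vocab, d, seq, batch = 100, 32, 16, 8
embed = nn.Embedding(vocab, d)
block = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
head = nn.Linear(d, vocab)

tokens = torch.randint(0, vocab, (batch, seq))
inputs, targets = tokens[:, :-1], tokens[:, 1:]            # shift by one position

# Causal mask: -inf above the diagonal blocks attention to future positions.
causal_mask = torch.triu(torch.full((seq - 1, seq - 1), float("-inf")), diagonal=1)
logits = head(block(embed(inputs), src_mask=causal_mask))  # (batch, seq-1, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
```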
no code implementations • 8 Feb 2024 • Michael E. Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, Gabriel Peyré
More precisely, focusing on commuting orthogonal matrices $W$, we first show that a trained one-layer linear Transformer implements one step of gradient descent for the minimization of an inner objective function, when considering augmented tokens.
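To make the statement concrete, here is a small numerical sketch of the equivalence, with hand-set attention weights, a random orthogonal $W$, and a step size $\eta$ as illustrative assumptions (not the trained model analyzed in the paper): a one-layer linear attention over augmented tokens $(s_t, s_{t+1})$ reproduces one gradient-descent step, from zero, on an inner least-squares objective.

```python
# Sketch: one-layer *linear* attention on augmented tokens (s_t, s_{t+1})
# reproduces one gradient-descent step (from W'=0, step size eta) on the
# inner objective  L(W') = 1/2 * sum_t ||s_{t+1} - W' s_t||^2.
# Hand-set weights and a random orthogonal W; not the trained model.
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 4, 12, 0.1

W, _ = np.linalg.qr(rng.normal(size=(d, d)))      # random orthogonal context matrix
s = [rng.normal(size=d)]
for _ in range(T):
    s.append(W @ s[-1])                           # s[t+1] = W s[t]
s = np.stack(s)

# One explicit gradient step on the inner objective, starting from W' = 0.
grad_at_zero = -sum(np.outer(s[t + 1], s[t]) for t in range(T - 1))
W_one_step = -eta * grad_at_zero
pred_gd = W_one_step @ s[T - 1]                   # prediction of s[T]

# Same prediction from linear attention (no softmax) with hand-set weights
# that read key s_t and value eta * s_{t+1} out of each augmented token,
# queried with s[T-1].
attn = sum((s[T - 1] @ s[t]) * (eta * s[t + 1]) for t in range(T - 1))

print(np.allclose(pred_gd, attn))                 # True
```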
1 code implementation • 3 Sep 2023 • Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau
Our results are valid for a finite training time, and also as the training time tends to infinity, provided that the network satisfies a Polyak-Łojasiewicz condition.
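For reference, the Polyak-Łojasiewicz condition is the standard inequality requiring, for some $\mu > 0$ and with $\mathcal{L}^\star$ the infimum of the training loss $\mathcal{L}$,

$$\frac{1}{2}\,\|\nabla \mathcal{L}(\theta)\|^2 \;\ge\; \mu\,\bigl(\mathcal{L}(\theta) - \mathcal{L}^\star\bigr) \quad \text{for all parameters } \theta.$$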
no code implementations • 2 Feb 2023 • Michael E. Sander, Joan Puigcerver, Josip Djolonga, Gabriel Peyré, Mathieu Blondel
In this paper, we propose new differentiable and sparse top-k operators.
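As one illustration of a sparse relaxation of top-k (not necessarily the operators proposed in the paper): the Euclidean projection of the scores onto the capped simplex $\{w : 0 \le w \le 1, \sum_i w_i = k\}$ returns sparse weights summing to $k$. The sketch below computes only the forward pass, solving for the threshold by bisection, and leaves the backward pass (e.g. implicit differentiation through the solve) aside.

```python
# Sketch of one sparse relaxation of top-k: Euclidean projection of scores
# onto the capped simplex {w : 0 <= w <= 1, sum(w) = k}, solved by bisection
# on the threshold tau. Forward pass only; an illustration, not necessarily
# the operators of the paper, and gradients (e.g. via implicit
# differentiation) are omitted.
import numpy as np

def sparse_topk_weights(scores, k, iters=60):
    lo, hi = scores.min() - 1.0, scores.max()      # bracket the threshold tau
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        w = np.clip(scores - tau, 0.0, 1.0)
        if w.sum() > k:                            # sum is decreasing in tau
            lo = tau
        else:
            hi = tau
    return np.clip(scores - 0.5 * (lo + hi), 0.0, 1.0)

scores = np.array([1.0, 0.9, 0.85, 0.1, 0.05])
# Entries far from the top-k get weight exactly 0; entries near the
# boundary receive fractional weight. Here: approx [0.75, 0.65, 0.60, 0, 0].
print(sparse_topk_weights(scores, k=2))
```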
no code implementations • 13 Oct 2022 • Samy Jelassi, Michael E. Sander, Yuanzhi Li
On the theoretical side, we consider a binary classification task and show that, although the learning problem admits multiple generalizing solutions, our model implicitly learns the spatial structure of the dataset while generalizing; we call this phenomenon patch association.
no code implementations • 29 May 2022 • Michael E. Sander, Pierre Ablin, Gabriel Peyré
As a byproduct of our analysis, we consider the use of a memory-free discrete adjoint method to train a ResNet, recovering the activations on the fly through a backward pass of the network, and show that this method theoretically succeeds at large depth if the residual functions are Lipschitz with respect to the input.
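The recovery step can be illustrated in isolation: given only the output of a residual block $x_{n+1} = x_n + f(x_n)$, the input is the fixed point of $x \mapsto x_{n+1} - f(x)$, which can be found by iteration when $f$ is a contraction. The toy contractive $f$ below is an assumption for the sketch, not the paper's architecture or training procedure.

```python
# Sketch: recovering the input activation of a residual block
#   x_{n+1} = x_n + f(x_n)
# from its output alone, via the fixed-point iteration  x <- x_{n+1} - f(x).
# Converges when f is Lipschitz with constant < 1; the toy contractive f
# below (spectral norm of A forced to 0.5) is a placeholder.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
A = 0.5 * A / np.linalg.norm(A, 2)                 # ||A||_2 = 0.5 => f is 0.5-Lipschitz

def f(x):
    return np.tanh(A @ x)

x_n = rng.normal(size=d)
x_np1 = x_n + f(x_n)                               # forward pass of the block

x = x_np1.copy()                                   # recover x_n without having stored it
for _ in range(60):
    x = x_np1 - f(x)

print(np.linalg.norm(x - x_n))                     # ~1e-16: activation recovered
```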
1 code implementation • 22 Oct 2021 • Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré
We show that the row-wise stochastic attention matrices of classical Transformers approach doubly stochastic matrices as the number of epochs increases, which justifies the use of Sinkhorn normalization as an informative prior.
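For reference, Sinkhorn normalization alternates row and column normalizations of the exponentiated attention scores, converging to a doubly stochastic matrix; the random logits and fixed iteration count below are placeholder choices for this toy illustration.

```python
# Sketch: Sinkhorn normalization of an attention matrix. Softmax normalizes
# rows only; alternating row and column normalizations of exp(logits)
# converges to a doubly stochastic matrix. Random logits and a fixed
# iteration count are placeholder choices.
import numpy as np

rng = np.random.default_rng(0)
n = 6
logits = rng.normal(size=(n, n))                   # self-attention scores

K = np.exp(logits)
for _ in range(100):                               # Sinkhorn iterations
    K = K / K.sum(axis=1, keepdims=True)           # normalize rows
    K = K / K.sum(axis=0, keepdims=True)           # normalize columns

# Both row and column sums are (numerically) 1: K is doubly stochastic.
print(np.abs(K.sum(axis=1) - 1).max(), np.abs(K.sum(axis=0) - 1).max())
```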
1 code implementation • 15 Feb 2021 • Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré
We show on CIFAR and ImageNet that Momentum ResNets match the accuracy of ResNets while having a much smaller memory footprint, and that pre-trained Momentum ResNets are promising for fine-tuning models.
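The memory saving comes from the invertibility of the momentum residual update: $(x_n, v_n)$ can be recomputed exactly from $(x_{n+1}, v_{n+1})$ instead of being stored. A minimal sketch, with a toy residual function and momentum $\gamma = 0.9$ as placeholder choices:

```python
# Sketch: the momentum residual update
#   v_{n+1} = gamma * v_n + (1 - gamma) * f(x_n),   x_{n+1} = x_n + v_{n+1}
# is exactly invertible, so (x_n, v_n) can be recomputed from
# (x_{n+1}, v_{n+1}) instead of being stored. Toy f and gamma are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 5, 0.9
A = rng.normal(size=(d, d)) / np.sqrt(d)

def f(x):
    return np.tanh(A @ x)                          # toy residual function

x_n, v_n = rng.normal(size=d), rng.normal(size=d)

# Forward (momentum) step.
v_np1 = gamma * v_n + (1 - gamma) * f(x_n)
x_np1 = x_n + v_np1

# Exact inversion: recover (x_n, v_n) from (x_{n+1}, v_{n+1}) alone.
x_rec = x_np1 - v_np1
v_rec = (v_np1 - (1 - gamma) * f(x_rec)) / gamma

print(np.allclose(x_rec, x_n), np.allclose(v_rec, v_n))   # True True
```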