Search Results for author: Michael E. Sander

Found 8 papers, 3 papers with code

Towards Understanding the Universality of Transformers for Next-Token Prediction

no code implementations • 3 Oct 2024 • Michael E. Sander, Gabriel Peyré

Causal Transformers are trained to predict the next token for a given context.
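A minimal sketch of that training setup, with toy dimensions, a single attention layer and random weights chosen purely for illustration (nothing below is taken from the paper): a causal mask lets position t attend only to positions up to t, and the loss is the cross-entropy between the prediction at t and the token at t+1.

# Illustrative sketch only: causal next-token prediction with a toy one-layer attention model.
import numpy as np

rng = np.random.default_rng(0)
V, T, d = 10, 8, 16                     # vocab size, sequence length, width

tokens = rng.integers(0, V, size=T)     # a toy token sequence
E = rng.standard_normal((V, d)) * 0.1   # embedding table
W_qkv = rng.standard_normal((d, 3 * d)) * 0.1
W_out = rng.standard_normal((d, V)) * 0.1

X = E[tokens]                           # (T, d) token embeddings
Q, K, Vv = np.split(X @ W_qkv, 3, axis=1)

scores = Q @ K.T / np.sqrt(d)
causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[causal_mask] = -np.inf           # position t only sees positions <= t
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)       # row-stochastic attention

logits = (A @ Vv) @ W_out               # (T, V)
# next-token cross-entropy: prediction at position t is scored against token t+1
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(T - 1), tokens[1:]].mean()
print(f"next-token loss: {loss:.3f}")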

How do Transformers perform In-Context Autoregressive Learning?

no code implementations • 8 Feb 2024 • Michael E. Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, Gabriel Peyré

More precisely, focusing on commuting orthogonal matrices $W$, we first show that a trained one-layer linear Transformer implements one step of gradient descent for the minimization of an inner objective function, when considering augmented tokens.

Language Modelling
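A hedged sketch of the equivalence stated in the abstract, with toy dimensions and a single random orthogonal W chosen here for illustration: one gradient-descent step from zero on the inner least-squares objective coincides with a linear-attention read-out over augmented tokens (query s_T, keys s_t, values s_{t+1}).

# Illustrative sketch: one step of gradient descent (from zero, step size eta)
# on the inner objective L(W') = 1/2 * sum_t ||s_{t+1} - W' s_t||^2
# equals a linear-attention prediction over augmented tokens.
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 4, 32, 0.05

W, _ = np.linalg.qr(rng.standard_normal((d, d)))   # unknown orthogonal matrix
s = [rng.standard_normal(d)]
for _ in range(T):
    s.append(W @ s[-1])                            # context: s_{t+1} = W s_t
s = np.stack(s)                                    # (T+1, d)

# One gradient-descent step on L(W') starting from W' = 0:
grad_at_zero = -sum(np.outer(s[t + 1], s[t]) for t in range(T))
W_1 = -eta * grad_at_zero                          # = eta * sum_t s_{t+1} s_t^T
pred_gd = W_1 @ s[T]                               # prediction of s_{T+1}

# Same prediction written as linear attention over augmented tokens:
# query = s_T, keys = s_t, values = s_{t+1}.
pred_attn = eta * sum((s[t] @ s[T]) * s[t + 1] for t in range(T))

print(np.allclose(pred_gd, pred_attn))             # True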

Implicit regularization of deep residual networks towards neural ODEs

1 code implementation • 3 Sep 2023 • Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau

Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Lojasiewicz condition.

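A numerical sketch of the neural-ODE limit behind this paper (the residual map f below is an arbitrary smooth function, not a trained network): a ResNet whose residuals are scaled by 1/L is the explicit Euler discretization of dx/dt = f(x) on [0, 1], and its output approaches the ODE flow as the depth L grows.

# Illustrative sketch: x_{k+1} = x_k + (1/L) f(x_k) is an Euler scheme for dx/dt = f(x).
import numpy as np

A = np.array([[0.0, -1.0], [1.0, 0.0]])   # toy weight; f(x) = tanh(A x)
f = lambda x: np.tanh(A @ x)

def resnet(x, L):
    for _ in range(L):                    # L residual blocks, shared weights
        x = x + f(x) / L                  # Euler step of size 1/L
    return x

x0 = np.array([1.0, 0.0])
reference = resnet(x0, 100_000)           # very fine discretization ~ ODE solution
for L in (4, 16, 64, 256):
    err = np.linalg.norm(resnet(x0, L) - reference)
    print(f"L = {L:4d}   |ResNet(x0) - ODE flow| ~ {err:.2e}")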

Vision Transformers provably learn spatial structure

no code implementations • 13 Oct 2022 • Samy Jelassi, Michael E. Sander, Yuanzhi Li

On the theoretical side, we consider a binary classification task and show that while the learning problem admits multiple solutions that generalize, our model implicitly learns the spatial structure of the dataset while generalizing: we call this phenomenon patch association.

Binary Classification • Inductive Bias

Do Residual Neural Networks discretize Neural Ordinary Differential Equations?

no code implementations • 29 May 2022 • Michael E. Sander, Pierre Ablin, Gabriel Peyré

As a byproduct of our analysis, we consider the use of a memory-free discrete adjoint method to train a ResNet by recovering the activations on the fly through a backward pass of the network, and show that this method theoretically succeeds at large depth if the residual functions are Lipschitz with respect to the input.
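A hedged sketch of that memory-free reconstruction (the residual function f and the dimensions below are illustrative stand-ins): when f has a small Lipschitz constant, for instance because residuals are scaled by the depth, x_n can be recovered from x_{n+1} = x_n + f(x_n) by a fixed-point iteration, so activations need not be stored for the backward pass.

# Illustrative sketch: recovering ResNet activations on the fly by inverting each block.
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 50
A = rng.standard_normal((d, d)) / np.sqrt(d)
f = lambda x: np.tanh(A @ x) / L          # residual function, Lipschitz constant ~ 1/L

x = rng.standard_normal(d)
activations = [x]
for _ in range(L):                        # forward pass: x_{n+1} = x_n + f(x_n)
    x = x + f(x)
    activations.append(x)

def invert_block(x_next, n_iter=30):
    """Recover x_n from x_{n+1} by the fixed-point iteration x <- x_{n+1} - f(x)."""
    x = x_next
    for _ in range(n_iter):
        x = x_next - f(x)
    return x

# Reconstruct activations on the fly, from the output back to the input.
x_rec = activations[-1]
for n in reversed(range(L)):
    x_rec = invert_block(x_rec)
    assert np.allclose(x_rec, activations[n], atol=1e-8)
print("all activations recovered without storing them")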

Sinkformers: Transformers with Doubly Stochastic Attention

1 code implementation • 22 Oct 2021 • Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

We show that the row-wise stochastic attention matrices in classical Transformers get close to doubly stochastic matrices as the number of epochs increases, justifying the use of Sinkhorn normalization as an informative prior.

Image Classification
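A minimal sketch of Sinkhorn-normalized attention as described in the abstract (toy dimensions, random queries and keys, iteration count chosen for illustration): alternating row and column normalizations of exp(scores) drives the attention matrix towards a doubly stochastic one, in contrast to the single row-wise softmax of standard attention.

# Illustrative sketch: Sinkhorn normalization of an attention matrix.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))

scores = Q @ K.T / np.sqrt(d)

def sinkhorn_attention(scores, n_iter=20):
    A = np.exp(scores - scores.max())     # positive matrix
    for _ in range(n_iter):
        A /= A.sum(axis=1, keepdims=True) # row normalization (softmax-like)
        A /= A.sum(axis=0, keepdims=True) # column normalization
    return A

A = sinkhorn_attention(scores)
print("row sums:", np.round(A.sum(axis=1), 4))
print("col sums:", np.round(A.sum(axis=0), 4))   # both close to 1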

Momentum Residual Neural Networks

1 code implementation • 15 Feb 2021 • Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

We show on CIFAR and ImageNet that Momentum ResNets have the same accuracy as ResNets, while having a much smaller memory footprint, and show that pre-trained Momentum ResNets are promising for fine-tuning models.

Image Classification
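A hedged sketch of why the memory footprint mentioned above can be small (f, gamma and the dimensions below are illustrative stand-ins): the momentum residual update is exactly invertible, so activations can be recomputed from the output during the backward pass instead of being stored.

# Illustrative sketch: a momentum residual block
#   v <- gamma*v + (1 - gamma)*f(x),  x <- x + v
# is exactly invertible, so the forward activations can be reconstructed.
import numpy as np

rng = np.random.default_rng(0)
d, L, gamma = 8, 20, 0.9
A = rng.standard_normal((d, d)) / np.sqrt(d)
f = lambda x: np.tanh(A @ x)

def forward(x, v):
    for _ in range(L):
        v = gamma * v + (1 - gamma) * f(x)
        x = x + v
    return x, v

def inverse(x, v):
    for _ in range(L):
        x = x - v                              # invert the x-update
        v = (v - (1 - gamma) * f(x)) / gamma   # invert the v-update
    return x, v

x0, v0 = rng.standard_normal(d), np.zeros(d)
xL, vL = forward(x0, v0)
x_rec, v_rec = inverse(xL, vL)
print(np.allclose(x_rec, x0), np.allclose(v_rec, v0))   # True True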
