Search Results for author: Michael E. Sander

Found 7 papers, 3 papers with code

How do Transformers perform In-Context Autoregressive Learning?

no code implementations • 8 Feb 2024 • Michael E. Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, Gabriel Peyré

More precisely, focusing on commuting orthogonal matrices $W$, we first show that a trained one-layer linear Transformer implements one step of gradient descent for the minimization of an inner objective function, when considering augmented tokens.

Language Modelling
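
The "one step of gradient descent on an inner objective" claim can be illustrated with a toy NumPy sketch: a context generated by s_{t+1} = W s_t with an orthogonal W, and a next-token prediction obtained from one gradient-descent step (from zero) on an inner least-squares objective over the context. The step size and the objective below are illustrative assumptions, not the paper's exact construction with augmented tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 32                      # token dimension, context length

# Orthogonal matrix W generating the autoregressive context s_{t+1} = W s_t
W, _ = np.linalg.qr(rng.normal(size=(d, d)))

s = [rng.normal(size=d)]
for _ in range(T):
    s.append(W @ s[-1])
s = np.stack(s)                   # shape (T+1, d)

# Inner objective over the context: L(M) = 0.5 * sum_t ||s_{t+1} - M s_t||^2
# One gradient-descent step from M = 0 gives M_1 = eta * sum_t s_{t+1} s_t^T.
eta = 1.0 / T                     # illustrative step size
M1 = eta * sum(np.outer(s[t + 1], s[t]) for t in range(T))

# Predict the next token from the one-step estimate (diagnostic only)
pred = M1 @ s[-1]
true = W @ s[-1]
print("relative prediction error after one inner GD step:",
      np.linalg.norm(pred - true) / np.linalg.norm(true))
```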

Implicit regularization of deep residual networks towards neural ODEs

1 code implementation • 3 Sep 2023 • Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau

Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Lojasiewicz condition.

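The residual-network/neural-ODE correspondence behind this result can be illustrated numerically: with residuals scaled by 1/depth, a depth-L ResNet is a forward-Euler discretization of an ODE, and its output approaches the ODE solution as L grows. The sketch below uses a fixed tanh residual function for illustration; it is not the trained networks studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d)) / np.sqrt(d)   # illustrative fixed weights

def f(x):
    """Residual function shared across layers (purely illustrative)."""
    return np.tanh(A @ x)

def resnet(x, depth):
    # Scaled residual network: x_{k+1} = x_k + (1/depth) * f(x_k),
    # i.e. a forward-Euler discretization of dx/dt = f(x) on [0, 1].
    for _ in range(depth):
        x = x + f(x) / depth
    return x

x0 = rng.normal(size=d)
reference = resnet(x0, 100_000)            # fine discretization ~ ODE solution
for depth in (4, 16, 64, 256, 1024):
    err = np.linalg.norm(resnet(x0, depth) - reference)
    print(f"depth {depth:5d}: distance to ODE solution ~ {err:.2e}")
```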

Vision Transformers provably learn spatial structure

no code implementations • 13 Oct 2022 • Samy Jelassi, Michael E. Sander, Yuanzhi Li

On the theoretical side, we consider a binary classification task and show that while the learning problem admits multiple solutions that generalize, our model implicitly learns the spatial structure of the dataset while generalizing: we call this phenomenon patch association.

Binary Classification • Inductive Bias

Do Residual Neural Networks discretize Neural Ordinary Differential Equations?

no code implementations • 29 May 2022 • Michael E. Sander, Pierre Ablin, Gabriel Peyré

As a byproduct of our analysis, we consider the use of a memory-free discrete adjoint method to train a ResNet by recovering the activations on the fly through a backward pass of the network, and show that this method theoretically succeeds at large depth if the residual functions are Lipschitz with respect to the input.
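
One way to see how activations can be recovered "on the fly" without storing them: if the residual update is x_{n+1} = x_n + f(x_n) with f a contraction (small Lipschitz constant, as happens when residuals scale with depth), then x_n can be reconstructed from x_{n+1} by a fixed-point iteration on x = x_{n+1} - f(x). The sketch below is a generic illustration of this inversion idea, not the paper's exact adjoint method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 5, 50
A = 0.1 * rng.normal(size=(d, d))      # small weights -> f is a contraction

def f(x):
    return np.tanh(A @ x)

# Forward pass of a chain of residual blocks, keeping only the final activation
x = rng.normal(size=d)
x_first = x.copy()
for _ in range(depth):
    x = x + f(x)

# Backward reconstruction: invert x_{n+1} = x_n + f(x_n) layer by layer
# by solving the fixed-point equation x_n = x_{n+1} - f(x_n).
for _ in range(depth):
    y = x.copy()                        # initial guess for x_n
    for _ in range(30):                 # fixed-point iterations
        y = x - f(y)
    x = y

print("reconstruction error:", np.linalg.norm(x - x_first))
```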

Sinkformers: Transformers with Doubly Stochastic Attention

1 code implementation • 22 Oct 2021 • Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

We show that the row-wise stochastic attention matrices in classical Transformers get close to doubly stochastic matrices as the number of epochs increases, justifying the use of Sinkhorn normalization as an informative prior.

Image Classification
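
The core idea is replacing the row-wise softmax in attention with Sinkhorn normalization, which alternates row and column normalizations of the exponentiated score matrix and converges to a doubly stochastic matrix. Below is a minimal NumPy sketch of that normalization step only, not the full Sinkformer architecture; the shapes and score matrix are illustrative.

```python
import numpy as np

def softmax_rows(S):
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

def sinkhorn(S, n_iter=20):
    """Alternate row and column normalizations of exp(S) until the
    attention matrix is (approximately) doubly stochastic."""
    K = np.exp(S - S.max())
    for _ in range(n_iter):
        K = K / K.sum(axis=1, keepdims=True)   # rows sum to 1
        K = K / K.sum(axis=0, keepdims=True)   # columns sum to 1
    return K

rng = np.random.default_rng(0)
n, d = 6, 4
Q, Kmat = rng.normal(size=(n, d)), rng.normal(size=(n, d))
S = Q @ Kmat.T / np.sqrt(d)                    # attention scores

A_soft = softmax_rows(S)                       # classical (row-stochastic) attention
A_sink = sinkhorn(S)                           # doubly stochastic attention
print("softmax  column sums:", A_soft.sum(axis=0).round(2))
print("sinkhorn column sums:", A_sink.sum(axis=0).round(2))
print("sinkhorn row sums:   ", A_sink.sum(axis=1).round(2))
```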

Momentum Residual Neural Networks

1 code implementation • 15 Feb 2021 • Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

We show on CIFAR and ImageNet that Momentum ResNets have the same accuracy as ResNets, while having a much smaller memory footprint, and show that pre-trained Momentum ResNets are promising for fine-tuning models.

Image Classification
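
The smaller memory footprint comes from the forward rule being invertible: with a velocity term, v_{n+1} = γ v_n + (1-γ) f(x_n) and x_{n+1} = x_n + v_{n+1}, the pair (x_n, v_n) can be recovered exactly from (x_{n+1}, v_{n+1}), so activations need not be stored for backpropagation. A minimal NumPy sketch of that forward/inverse pair (illustrative weights, not the trained models evaluated on CIFAR and ImageNet):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth, gamma = 4, 20, 0.9
A = rng.normal(size=(d, d)) / np.sqrt(d)

def f(x):
    return np.tanh(A @ x)

def forward(x, v):
    # Momentum residual update: v <- gamma*v + (1-gamma)*f(x); x <- x + v
    v = gamma * v + (1 - gamma) * f(x)
    x = x + v
    return x, v

def inverse(x, v):
    # Exact inversion: recover the previous (x, v) without stored activations
    x = x - v
    v = (v - (1 - gamma) * f(x)) / gamma
    return x, v

x0, v0 = rng.normal(size=d), np.zeros(d)
x, v = x0, v0
for _ in range(depth):
    x, v = forward(x, v)
for _ in range(depth):
    x, v = inverse(x, v)

print("x reconstruction error:", np.linalg.norm(x - x0))
print("v reconstruction error:", np.linalg.norm(v - v0))
```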
