no code implementations • 11 Sep 2023 • Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento
Transformers have become the dominant model in deep learning, but the reason for their superior performance is poorly understood.
no code implementations • 11 Jan 2023 • Max Vladymyrov, Andrey Zhmoginov, Mark Sandler
We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes.
no code implementations • 5 Jan 2023 • Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Nolan Miller
In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging (SWA), we show that our proposed model matches the observed training trajectories on ImageNet.
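For reference, the two weight-averaging schemes named in this snippet follow standard update rules; the toy sketch below applies both to a synthetic weight trajectory (the trajectory model proposed in the paper is not reproduced here).

```python
import numpy as np

# Toy "training trajectory": a sequence of weight vectors, one snapshot
# per step (a stand-in for checkpoints of a real model).
rng = np.random.default_rng(2)
trajectory = np.cumsum(rng.normal(size=(100, 10)), axis=0)

# Exponential Moving Average (EMA): recursive, recency-weighted average.
decay = 0.99
w_ema = trajectory[0].copy()
for w in trajectory[1:]:
    w_ema = decay * w_ema + (1 - decay) * w

# Stochastic Weight Averaging (SWA): uniform average of snapshots,
# typically taken over the tail of training.
w_swa = trajectory[50:].mean(axis=0)

print(w_ema.shape, w_swa.shape)
```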
1 code implementation • 15 Dec 2022 • Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov
We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient-descent (GD) on a regression loss.
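The construction referenced here rests on an algebraic identity: the prediction of a linear model after one gradient-descent step on the in-context examples can be written as the output of a single softmax-free (linear) self-attention operation. Below is a minimal numpy sketch of that identity on random data; the dimensions, learning rate, and the way the residuals are packed into values are illustrative assumptions, not the paper's exact token and weight construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 5, 3, 8
X = rng.normal(size=(n, d_in))        # in-context inputs x_j
Y = rng.normal(size=(n, d_out))       # in-context targets y_j
x_q = rng.normal(size=d_in)           # query input
W = rng.normal(size=(d_out, d_in))    # current linear model
eta = 0.1                             # GD learning rate

# (1) Prediction after one explicit gradient-descent step on the
#     squared regression loss 0.5 * sum_j ||W x_j - y_j||^2.
grad = (X @ W.T - Y).T @ X            # sum_j (W x_j - y_j) x_j^T
pred_gd = (W - eta * grad) @ x_q

# (2) The same prediction written as one linear (softmax-free)
#     self-attention update: values are the residuals y_j - W x_j,
#     keys are x_j, the query is x_q, and the attention output is
#     added to the current prediction W x_q.
values = Y - X @ W.T                  # v_j = y_j - W x_j
attn = eta * values.T @ (X @ x_q)     # sum_j v_j * (x_j . x_q), scaled by eta
pred_attn = W @ x_q + attn

print(np.allclose(pred_gd, pred_attn))  # True
```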
no code implementations • CVPR 2023 • Andrey Zhmoginov, Mark Sandler, Nolan Miller, Gus Kristiansen, Max Vladymyrov
We study the effects of data and model architecture heterogeneity and the impact of the underlying communication graph topology on learning efficiency, and show that our agents can significantly improve their performance compared to learning in isolation.
no code implementations • CVPR 2022 • Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Andrew Jackson
In this paper we propose augmenting Vision Transformer models with learnable memory tokens.
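One way to picture learnable memory tokens, sketched below under assumed dimensions: a small set of trainable embeddings is appended to the keys and values of a self-attention layer, so the input tokens can attend to them without changing the output sequence length. This is an illustrative single-head sketch, not the paper's full fine-tuning recipe.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
d, n_tokens, n_mem = 16, 8, 4

tokens = rng.normal(size=(n_tokens, d))       # class + patch tokens
memory = rng.normal(size=(n_mem, d)) * 0.02   # learnable memory tokens (trainable params)

# Self-attention where the input tokens attend over themselves plus the
# memory tokens; only `tokens` produce outputs here.
Wq = rng.normal(size=(d, d)); Wk = rng.normal(size=(d, d)); Wv = rng.normal(size=(d, d))
kv = np.concatenate([tokens, memory], axis=0)  # keys/values include memory
q, k, v = tokens @ Wq, kv @ Wk, kv @ Wv
out = softmax(q @ k.T / np.sqrt(d)) @ v

print(out.shape)  # (8, 16): same sequence length as the input tokens
```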
1 code implementation • ICLR 2022 • Utku Evci, Bart van Merriënboer, Thomas Unterthiner, Max Vladymyrov, Fabian Pedregosa
The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified.
1 code implementation • 11 Jan 2022 • Andrey Zhmoginov, Mark Sandler, Max Vladymyrov
In this work we propose a HyperTransformer, a Transformer-based model for supervised and semi-supervised few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples.
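A heavily simplified sketch of the HyperTransformer idea, assuming a single self-attention layer as the weight generator and a single 3x3 conv layer as the target: support features and one-hot labels are encoded as tokens, mixed by attention, pooled, and mapped to a flat weight vector. All shapes, the pooling, and the linear readout are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
n_way, k_shot, d_feat = 5, 3, 32          # few-shot episode: 5 classes, 3 shots
d_model = 64
kernel = (3, 3, 1, 8)                     # conv layer to generate: 3x3, 1 -> 8 channels
n_weights = int(np.prod(kernel))

# Support set: per-sample feature vector plus a one-hot label embedding.
feats = rng.normal(size=(n_way * k_shot, d_feat))
labels = np.repeat(np.eye(n_way), k_shot, axis=0)
tokens = np.concatenate([feats, labels], axis=1) @ rng.normal(size=(d_feat + n_way, d_model))

# One self-attention layer over support tokens (stand-in for the weight generator).
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
att = softmax((tokens @ Wq) @ (tokens @ Wk).T / np.sqrt(d_model)) @ (tokens @ Wv)

# Pool over the support set and map to a flat weight vector for the CNN layer.
W_out = rng.normal(size=(d_model, n_weights)) * 0.01
conv_weights = (att.mean(axis=0) @ W_out).reshape(kernel)

print(conv_weights.shape)  # (3, 3, 1, 8): generated conv kernel
```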
no code implementations • 29 Sep 2021 • Andrey Zhmoginov, Max Vladymyrov, Mark Sandler
In this work we propose a HyperTransformer, a Transformer-based model that generates all weights of a CNN model directly from the support samples.
1 code implementation • 10 Apr 2021 • Mark Sandler, Max Vladymyrov, Andrey Zhmoginov, Nolan Miller, Andrew Jackson, Tom Madams, Blaise Agüera y Arcas
We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule.
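As a rough illustration of this two-state view (not the paper's learned, generalized update rules), the sketch below carries one forward state per neuron (activations) and one backward state (gradients), with the backward state and the weight updates obtained from the chain rule; the network size and loss are arbitrary choices, and the result is checked against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)                 # input
t = rng.normal(size=2)                 # target
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(2, 3))

# Forward state: activations propagated layer by layer.
h = np.tanh(W1 @ x)
y = W2 @ h

# Backward state: a second per-neuron state holding gradients,
# updated by the chain rule in the reverse direction.
g_y = y - t                            # dL/dy for L = 0.5 * ||y - t||^2
g_h = (W2.T @ g_y) * (1 - h**2)        # dL/dh through tanh

# Weight updates are local products of the two states.
dW2 = np.outer(g_y, h)
dW1 = np.outer(g_h, x)

# Check one entry of dW1 against a finite-difference estimate of the loss.
def loss(W1_):
    return 0.5 * np.sum((W2 @ np.tanh(W1_ @ x) - t) ** 2)

eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
fd = (loss(W1 + E) - loss(W1 - E)) / (2 * eps)
print(np.isclose(fd, dW1[0, 0]))       # True
```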
no code implementations • 6 Nov 2020 • Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley
Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains.
no code implementations • NeurIPS 2019 • Max Vladymyrov
However, due to a complicated nonconvex objective function, these methods can easily get stuck in local minima and their embedding quality can be poor.
no code implementations • NeurIPS 2015 • Miguel A. Carreira-Perpinan, Max Vladymyrov
This has two advantages: 1) The algorithm is universal in that a specific learning algorithm for any choice of embedding and mapping can be constructed by simply reusing existing algorithms for the embedding and for the mapping.
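One way to read the "reuse existing algorithms" claim is as an alternating scheme over auxiliary coordinates: a free-embedding step updates latent coordinates Z, and a mapping step refits a parametric map to (X, Z). The sketch below uses a toy embedding objective and a linear least-squares mapping purely as stand-ins for those existing algorithms; both could be swapped for any off-the-shelf embedding and mapping.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))            # high-dimensional inputs
d_low, mu, n_iters = 2, 10.0, 20          # latent dim, coupling weight, iterations

def embedding_step(X, Z, F, mu):
    # One gradient step on a toy embedding objective ||X X^T - Z Z^T||^2
    # plus the coupling term mu * ||Z - F||^2 that ties Z to the mapping.
    G = 4 * (Z @ Z.T - X @ X.T) @ Z + 2 * mu * (Z - F)
    return Z - 1e-4 * G

def mapping_step(X, Z):
    # Fit a linear map W by least squares so that X W ~= Z.
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

Z = rng.normal(size=(200, d_low)) * 0.01   # free (auxiliary) coordinates
for _ in range(n_iters):
    W = mapping_step(X, Z)                 # reuse a mapping algorithm
    Z = embedding_step(X, Z, X @ W, mu)    # reuse an embedding algorithm

print((X @ W).shape)                       # (200, 2): parametric embedding of X
```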