Search Results for author: Yassir Akram

Found 7 papers, 3 papers with code

Weight decay induces low-rank attention layers

no code implementations • 31 Oct 2024 • Seijin Kobayashi, Yassir Akram, Johannes von Oswald

The effect of regularizers such as weight decay when training deep neural networks is not well understood.

L2 Regularization • Language Modelling
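
No code is listed for this paper. Purely as an illustration of the kind of quantity the title refers to, the minimal PyTorch sketch below (names, sizes, and tolerance are hypothetical, not the paper's protocol) probes the effective rank of the query-key product W_Q W_K^T of an attention layer, a quantity one might compare across weight-decay strengths:

```python
import torch

def effective_rank(w_q: torch.Tensor, w_k: torch.Tensor, rel_tol: float = 1e-3) -> int:
    """Number of singular values of W_Q @ W_K^T above rel_tol times the largest one."""
    s = torch.linalg.svdvals(w_q @ w_k.T)  # singular values, descending
    return int((s > rel_tol * s[0]).sum())

# Random (d_model x d_head) projections stand in for a trained layer's weights;
# W_Q @ W_K^T is the bilinear form that produces the attention scores.
d_model, d_head = 64, 16
w_q, w_k = torch.randn(d_model, d_head), torch.randn(d_model, d_head)
print(effective_rank(w_q, w_k))  # at most d_head here; the interesting case is a trained, weight-decayed layer
```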

Learning Randomized Algorithms with Transformers

no code implementations • 20 Aug 2024 • Johannes von Oswald, Seijin Kobayashi, Yassir Akram, Angelika Steger

Randomization is a powerful tool that endows algorithms with remarkable properties.

When can transformers compositionally generalize in-context?

no code implementations • 17 Jul 2024 • Seijin Kobayashi, Simon Schug, Yassir Akram, Florian Redhardt, Johannes von Oswald, Razvan Pascanu, Guillaume Lajoie, João Sacramento

Under what circumstances can transformers compositionally generalize from a subset of tasks to all possible combinations of tasks that share similar components?

Attention as a Hypernetwork

1 code implementation • 9 Jun 2024 • Simon Schug, Seijin Kobayashi, Yassir Akram, João Sacramento, Razvan Pascanu

To further examine the hypothesis that the intrinsic hypernetwork of multi-head attention supports compositional generalization, we ablate whether making the hypernetwork-generated linear value network nonlinear strengthens compositionality.
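
As a minimal numerical sketch of the view the title and abstract allude to (dimensions and variable names are arbitrary; this is not the paper's released code), one can check that standard multi-head attention equals a data-dependent combination of per-head linear value networks, with the attention scores acting as the hypernetwork's latent code:

```python
import torch

torch.manual_seed(0)
n, d_model, n_heads = 5, 32, 4
d_head = d_model // n_heads
x = torch.randn(n, d_model)

# Per-head projections; w_o holds the per-head blocks of the usual output projection W_O.
w_q = torch.randn(n_heads, d_model, d_head)
w_k = torch.randn(n_heads, d_model, d_head)
w_v = torch.randn(n_heads, d_model, d_head)
w_o = torch.randn(n_heads, d_head, d_model)

attn = torch.softmax((x @ w_q) @ (x @ w_k).transpose(-1, -2) / d_head**0.5, dim=-1)

# (1) Standard multi-head attention: attend over values, then project each head out and sum.
standard = ((attn @ (x @ w_v)) @ w_o).sum(0)

# (2) Hypernetwork view: for each query, the attention scores configure a
#     data-dependent linear "value network" W_V,h W_O,h applied to the inputs.
per_head_value_net = w_v @ w_o                       # (n_heads, d_model, d_model)
hypernet = (attn @ (x @ per_head_value_net)).sum(0)

print(torch.allclose(standard, hypernet, rtol=1e-4, atol=1e-4))  # True: the two views coincide
```

The ablation described in the abstract would then amount to inserting a nonlinearity into that generated value pathway.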

Gated recurrent neural networks discover attention

no code implementations • 4 Sep 2023 • Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento

In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers.

In-Context Learning
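
The "simple in-context learning tasks" mentioned in the abstract are commonly instantiated as in-context linear regression. The sketch below (PyTorch, hypothetical helper name, not the paper's exact setup) builds one such batch, where each sequence carries (x, y) pairs from its own hidden linear teacher and the model must predict the query's y from the context:

```python
import torch

def make_in_context_regression_batch(batch: int = 32, n_pairs: int = 10, d: int = 4):
    """Each sequence holds (x, y) pairs from its own random linear teacher w,
    followed by a query x whose target y must be inferred from the context."""
    w = torch.randn(batch, d, 1)                     # one hidden teacher per sequence
    xs = torch.randn(batch, n_pairs + 1, d)          # last position is the query
    ys = xs @ w                                      # (batch, n_pairs + 1, 1)
    context = torch.cat([xs[:, :-1], ys[:, :-1]], dim=-1)              # visible (x, y) pairs
    query = torch.cat([xs[:, -1:], torch.zeros(batch, 1, 1)], dim=-1)  # y slot masked out
    inputs = torch.cat([context, query], dim=1)      # (batch, n_pairs + 1, d + 1)
    target = ys[:, -1, 0]                            # scalar to predict per sequence
    return inputs, target

inputs, target = make_in_context_regression_batch()
print(inputs.shape, target.shape)  # torch.Size([32, 11, 5]) torch.Size([32])
```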

Random initialisations performing above chance and how to find them

1 code implementation • 15 Sep 2022 • Frederik Benzing, Simon Schug, Robert Meier, Johannes von Oswald, Yassir Akram, Nicolas Zucchet, Laurence Aitchison, Angelika Steger

Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions.
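
As an illustration only (not the paper's released code), one simple proxy for "functionally very similar" is the rate at which two independently trained classifiers disagree on held-out inputs; the PyTorch sketch below uses untrained linear models as stand-ins for networks trained from different random initialisations:

```python
import torch

def disagreement_rate(model_a, model_b, inputs: torch.Tensor) -> float:
    """Fraction of inputs on which two classifiers predict different labels --
    one simple proxy for how functionally different two SGD solutions are."""
    with torch.no_grad():
        preds_a = model_a(inputs).argmax(dim=-1)
        preds_b = model_b(inputs).argmax(dim=-1)
    return (preds_a != preds_b).float().mean().item()

# Untrained linear classifiers stand in for two trained networks.
model_a, model_b = torch.nn.Linear(20, 10), torch.nn.Linear(20, 10)
print(disagreement_rate(model_a, model_b, torch.randn(256, 20)))
```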
