no code implementations • 29 Nov 2024 • Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut
The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities.
1 code implementation • 20 Nov 2024 • Allen Hao Huang, Imanol Schlag
Our work proposes a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding activation functions using integration.
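A minimal sketch of the gradient-first idea, not the paper's actual method: pick a desired gradient g(x), then integrate it numerically to obtain an activation f with f' = g. The function names and the chosen gradient here are purely illustrative.

```python
import numpy as np

def desired_gradient(x):
    # Hypothetical target gradient: smooth, bounded, and non-zero for x < 0.
    return 0.5 * (1.0 + np.tanh(x))

def activation_from_gradient(x, grid=np.linspace(-10.0, 10.0, 4001)):
    # Integrate the chosen gradient (trapezoidal rule) to recover an
    # activation f with f' = desired_gradient and f(grid[0]) = 0.
    g = desired_gradient(grid)
    cumulative = np.concatenate([[0.0], np.cumsum(0.5 * (g[1:] + g[:-1]) * np.diff(grid))])
    return np.interp(x, grid, cumulative)

x = np.array([-2.0, 0.0, 2.0])
print(desired_gradient(x))            # the gradient we chose
print(activation_from_gradient(x))    # the activation it implies
```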
1 code implementation • 29 May 2024 • Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
Outlier features are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models.
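A toy illustration (not taken from the paper) of why outlier features hurt quantisation: with symmetric per-tensor int8 quantisation, a single large activation dictates the step size and the remaining entries lose their precision.

```python
import numpy as np

def int8_quantize(x):
    # Symmetric per-tensor int8 quantisation: the scale is set by the
    # largest magnitude in the tensor.
    scale = np.max(np.abs(x)) / 127.0
    return np.round(x / scale).astype(np.int8), scale

activations = np.array([0.02, -0.01, 0.03, 0.015, 60.0])   # one outlier feature
q, scale = int8_quantize(activations)
dequantized = q.astype(np.float32) * scale
print(scale)              # ~0.47: step size dictated by the outlier
print(dequantized[:4])    # the small entries all collapse to 0.0
```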
1 code implementation • 11 Apr 2024 • Anton Schäfer, Shauli Ravfogel, Thomas Hofmann, Tiago Pimentel, Imanol Schlag
In controlled experiments on perfectly equivalent cloned languages, we observe that the existence of a predominant language during training boosts the performance of less frequent languages and leads to stronger alignment of model representations across languages.
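A rough sketch of how perfectly equivalent cloned languages could be constructed for such an experiment, assuming a clone is obtained by shifting every token ID into a disjoint copy of the vocabulary; this illustrates the setup, not the authors' code.

```python
def clone_tokens(token_ids, vocab_size, clone_index):
    # Map a sentence into the clone's disjoint ID range: the content is
    # identical, but the model sees entirely different token identities.
    return [t + clone_index * vocab_size for t in token_ids]

base_sentence = [5, 17, 42, 3]                                   # "dominant" language
cloned = clone_tokens(base_sentence, vocab_size=1000, clone_index=1)
print(cloned)                                                    # [1005, 1017, 1042, 1003]
```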
1 code implementation • 9 Apr 2024 • Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel
In this paper, we study the impact of near duplicate subwords on LM training efficiency.
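One simple, assumed notion of a "near duplicate" subword, for illustration only: vocabulary entries that differ solely by case or a leading whitespace marker.

```python
from collections import defaultdict

def group_near_duplicates(vocab):
    # Group subwords that differ only by case or a leading space marker (Ġ).
    groups = defaultdict(list)
    for token in vocab:
        groups[token.lstrip("Ġ ").lower()].append(token)
    return {key: toks for key, toks in groups.items() if len(toks) > 1}

vocab = ["dog", "Dog", "Ġdog", "ĠDog", "cat"]
print(group_near_duplicates(vocab))   # {'dog': ['dog', 'Dog', 'Ġdog', 'ĠDog']}
```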
no code implementations • 6 Nov 2023 • Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann
This leads to the notion of a 'compute-optimal' model, i.e., a model that allocates a given level of compute during training optimally to maximize performance.
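A toy compute-optimal allocation under two commonly used assumptions that are not taken from this paper: training FLOPs C ≈ 6·N·D, and a Chinchilla-style rule of thumb of roughly 20 training tokens per parameter.

```python
def compute_optimal(flops, tokens_per_param=20.0):
    # Solve C = 6 * N * D with D = tokens_per_param * N for the parameter
    # count N and token budget D.  Constants are illustrative only.
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n_params, n_tokens = compute_optimal(1e21)
print(f"params ~{n_params:.2e}, tokens ~{n_tokens:.2e}")   # ~2.9e9 params, ~5.8e10 tokens
```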
1 code implementation • 20 Sep 2023 • Aleksandar Stanić, Dylan Ashley, Oleg Serikov, Louis Kirsch, Francesco Faccio, Jürgen Schmidhuber, Thomas Hofmann, Imanol Schlag
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
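A minimal sketch of one way such a protocol can be operationalised (an assumption, not the paper's exact procedure): measure each model's throughput on the reference accelerator and convert a shared budget in accelerator hours into a per-model token budget.

```python
def token_budget(tokens_per_second, budget_hours):
    # Equal accelerator-hour budget -> per-model token budget.
    return int(tokens_per_second * budget_hours * 3600)

throughputs = {"small_model": 120_000, "large_model": 30_000}   # tokens/s, measured
for name, tps in throughputs.items():
    print(name, token_budget(tps, budget_hours=6))   # the larger model sees fewer tokens
```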
no code implementations • 26 May 2023 • Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, Jürgen Schmidhuber
What should be the social structure of an NLSOM (natural language-based society of mind)?
no code implementations • 9 May 2023 • Imanol Schlag, Sainbayar Sukhbaatar, Asli Celikyilmaz, Wen-tau Yih, Jason Weston, Jürgen Schmidhuber, Xian Li
In recent years, large pre-trained language models (LLMs) have demonstrated the ability to follow instructions and perform novel tasks from a few examples.
1 code implementation • 29 Jun 2022 • Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra
Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding.
Ranked #21 on Math Word Problem Solving on MATH
3 code implementations • 11 Mar 2022 • DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur
The proposed recurrent cell is merely a transformer layer: it uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens.
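A rough PyTorch sketch of a cell in this spirit; the gating, normalisation, and exact attention pattern of the actual architecture are omitted or guessed, so treat this purely as an illustration of "self-attention plus cross-attention over state vectors and tokens".

```python
import torch
import torch.nn as nn

class RecurrentTransformerCell(nn.Module):
    """Sketch of a recurrent cell built from attention (simplified, assumed)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.token_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.token_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.state_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens, state):
        h, _ = self.token_self_attn(tokens, tokens, tokens)     # tokens attend to tokens
        c, _ = self.token_cross_attn(h, state, state)           # tokens read the state
        s, _ = self.state_cross_attn(state, h, h)                # state reads the tokens
        return tokens + h + c, state + s                         # residual updates

cell = RecurrentTransformerCell()
tokens, state = torch.randn(1, 16, 64), torch.randn(1, 8, 64)
out, state = cell(tokens, state)       # applied once per block of tokens
print(out.shape, state.shape)          # torch.Size([1, 16, 64]) torch.Size([1, 8, 64])
```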
2 code implementations • 11 Feb 2022 • Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
The weight matrix (WM) of a neural network (NN) is its program.
1 code implementation • 31 Dec 2021 • Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
We share our experience with the recently released WILDS benchmark, a collection of ten datasets dedicated to developing models and training strategies which are robust to domain shifts.
no code implementations • NeurIPS Workshop AIPLANS 2021 • Imanol Schlag, Jürgen Schmidhuber
We augment classic algorithms with learned components to adapt them to domains currently dominated by deep learning models.
5 code implementations • NeurIPS 2021 • Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
Transformers with linearised attention ("linear Transformers") have demonstrated the practical scalability and effectiveness of outer product-based Fast Weight Programmers (FWPs) from the '90s.
9 code implementations • 22 Feb 2021 • Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber
We show the formal equivalence of linearised self-attention mechanisms and fast weight controllers from the early '90s, where a "slow" neural net learns by gradient descent to program the "fast weights" of another net through sequences of elementary programming instructions which are additive outer products of self-invented activation patterns (today called keys and values).
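A small numerical check of the stated equivalence, ignoring the attention normalisation and any feature map on the keys and queries:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
K, V, Q = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d))

# Unnormalised causal linear attention: y_t = (sum_{i<=t} v_i k_i^T) q_t.
linear_attn = np.stack([(V[:t + 1].T @ K[:t + 1]) @ Q[t] for t in range(T)])

# Fast weight programmer view: additively "program" a weight matrix with
# outer products v_i k_i^T, then query it with q_t.
W, fwp = np.zeros((d, d)), []
for t in range(T):
    W += np.outer(V[t], K[t])
    fwp.append(W @ Q[t])

print(np.allclose(linear_attn, np.stack(fwp)))   # True: the two views coincide
```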
1 code implementation • ICLR 2021 • Imanol Schlag, Tsendsuren Munkhdalai, Jürgen Schmidhuber
Humans can quickly associate stimuli to solve problems in novel contexts.
Ranked #1 on Question Answering on catbAbI LM-mode
3 code implementations • 15 Oct 2019 • Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, Jianfeng Gao
We incorporate Tensor-Product Representations within the Transformer in order to better support the explicit representation of relation structure.
Ranked #1 on Question Answering on Mathematics Dataset
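A toy example of classic tensor-product binding and unbinding (the general TPR mechanism, not this paper's specific layer): a structure is a sum of filler-role outer products, and a filler is recovered by contracting with its role vector.

```python
import numpy as np

rng = np.random.default_rng(1)
fillers = rng.normal(size=(2, 6))          # e.g. "subject" and "object" fillers
roles = np.eye(3)[:2]                      # orthonormal role vectors

# Bind: sum of outer products filler_i (x) role_i.
structure = sum(np.outer(f, r) for f, r in zip(fillers, roles))

# Unbind role 0 by contracting with its role vector.
recovered = structure @ roles[0]
print(np.allclose(recovered, fillers[0]))  # True
```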
1 code implementation • NeurIPS 2018 • Imanol Schlag, Jürgen Schmidhuber
We combine Recurrent Neural Networks with Tensor Product Representations to learn combinatorial representations of sequential data.
no code implementations • ICLR 2018 • Imanol Schlag, Jürgen Schmidhuber
We improve previous end-to-end differentiable neural networks (NNs) with fast weight memories.