Search Results for author: Kazuki Irie

Found 32 papers, 26 papers with code

MoEUT: Mixture-of-Experts Universal Transformers

1 code implementation • 25 May 2024 • Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning

The resulting UT model, for the first time, slightly outperforms standard Transformers on language modeling tasks such as BLiMP and PIQA, while using significantly less compute and memory.

Language Modelling

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

1 code implementation • 24 May 2024 • Lorenzo Tiberi, Francesca Mignacco, Kazuki Irie, Haim Sompolinsky

Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different 'attention paths', defined as information pathways through different attention heads across layers.

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

1 code implementation • 13 Dec 2023 • Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber

The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length.

Language Modelling

Automating Continual Learning

1 code implementation • 1 Dec 2023 • Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber

General-purpose learning systems should improve themselves in an open-ended fashion in ever-changing environments.

Continual Learning • Image Classification +2

Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions

1 code implementation • 24 Oct 2023 • Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber

Recent studies of the computational power of recurrent neural networks (RNNs) reveal a hierarchy of RNN architectures, given real-time and finite-precision assumptions.

Approximating Two-Layer Feedforward Networks for Efficient Transformers

2 code implementations • 16 Oct 2023 • Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

Unlike prior work that compares MoEs with dense baselines under the compute-equal condition, our evaluation condition is parameter-equal, which is crucial to properly evaluate LMs.

Exploring the Promise and Limits of Real-Time Recurrent Learning

1 code implementation • 30 May 2023 • Kazuki Irie, Anand Gopalakrishnan, Jürgen Schmidhuber

To scale to such challenging tasks, we focus on certain well-known neural architectures with element-wise recurrence, allowing for tractable RTRL without approximation.
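The tractability claim can be illustrated with a toy element-wise (diagonal) linear recurrence, an illustrative stand-in rather than one of the architectures from the paper. Because each state coordinate depends only on its own recurrent gain, the RTRL sensitivity dh_t/dλ stays a d-vector instead of a d×d×d tensor, so the exact gradient can be carried forward online:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 3, 8
lam = rng.uniform(0.1, 0.9, d)   # element-wise recurrent gains
xs = rng.standard_normal((T, d))

def run(lam):
    h = np.zeros(d)
    for x in xs:
        h = lam * h + x          # element-wise (diagonal) recurrence
    return 0.5 * np.sum(h ** 2)  # toy loss on the final state

# RTRL: carry the sensitivity s_t = dh_t/dlam forward in time.
# With element-wise recurrence it stays a d-vector, so exact RTRL is cheap.
h = np.zeros(d)
s = np.zeros(d)
for x in xs:
    s = h + lam * s   # s_t = h_{t-1} + lam * s_{t-1} (element-wise)
    h = lam * h + x
grad_rtrl = h * s     # chain rule through the loss 0.5 * ||h_T||^2

# Sanity check against central finite differences.
eps = 1e-6
grad_fd = np.array([
    (run(lam + eps * np.eye(d)[j]) - run(lam - eps * np.eye(d)[j])) / (2 * eps)
    for j in range(d)
])
assert np.allclose(grad_rtrl, grad_fd, atol=1e-5)
```

The online gradient matches finite differences exactly (up to numerical error), with no backward pass over the sequence.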

Accelerating Neural Self-Improvement via Bootstrapping

1 code implementation • 2 May 2023 • Kazuki Irie, Jürgen Schmidhuber

Few-shot learning with sequence-processing neural networks (NNs) has recently attracted a new wave of attention in the context of large language models.

Few-Shot Learning

Self-Organising Neural Discrete Representation Learning à la Kohonen

2 code implementations • 15 Feb 2023 • Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber

Unsupervised learning of discrete representations in neural networks (NNs) from continuous ones is essential for many modern applications.

Representation Learning

Learning to Control Rapidly Changing Synaptic Connections: An Alternative Type of Memory in Sequence Processing Artificial Neural Networks

no code implementations • 17 Nov 2022 • Kazuki Irie, Jürgen Schmidhuber

Short-term memory in standard, general-purpose, sequence-processing recurrent neural networks (RNNs) is stored as activations of nodes or "neurons."

CTL++: Evaluating Generalization on Never-Seen Compositional Patterns of Known Functions, and Compatibility of Neural Representations

1 code implementation • 12 Oct 2022 • Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

While the original CTL is used to test length generalization or productivity, CTL++ is designed to test systematicity of NNs, that is, their capability to generalize to unseen compositions of known functions.

Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules

1 code implementation • 7 Oct 2022 • Kazuki Irie, Jürgen Schmidhuber

Work on fast weight programmers has demonstrated the effectiveness of key/value outer product-based learning rules for sequentially generating a weight matrix (WM) of a neural net (NN) by another NN or itself.

Denoising • Image Generation +1

Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules

2 code implementations • 3 Jun 2022 • Kazuki Irie, Francesco Faccio, Jürgen Schmidhuber

Neural ordinary differential equations (ODEs) have attracted much attention as continuous-time counterparts of deep residual neural networks (NNs), and numerous extensions for recurrent NNs have been proposed.
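The link between residual networks and neural ODEs that this abstract builds on can be sketched in a few lines (a generic toy vector field, not the paper's continuous learning rules): a residual stack h_{k+1} = h_k + dt·f(h_k) is exactly the explicit-Euler discretisation of dh/dt = f(h), and refining the step count approaches the continuous-time flow.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = 0.3 * rng.standard_normal((d, d))

def f(h):
    # a simple smooth vector field standing in for a learned layer
    return np.tanh(A @ h)

def euler(h, steps, t_end=1.0):
    # h_{k+1} = h_k + dt * f(h_k): a residual stack of `steps` layers,
    # i.e. the explicit-Euler discretisation of dh/dt = f(h)
    dt = t_end / steps
    for _ in range(steps):
        h = h + dt * f(h)
    return h

h0 = rng.standard_normal(d)
coarse = euler(h0, 10)     # a 10-layer residual net
fine = euler(h0, 10000)    # close to the continuous-time ODE solution
err = np.linalg.norm(coarse - fine)
```

Increasing `steps` shrinks `err`, which is the sense in which a deep residual net has a neural ODE as its continuous-depth limit.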

Time Series • Time Series Analysis +1

Unsupervised Learning of Temporal Abstractions with Slot-based Transformers

1 code implementation • 25 Mar 2022 • Anand Gopalakrishnan, Kazuki Irie, Jürgen Schmidhuber, Sjoerd van Steenkiste

The discovery of reusable sub-routines simplifies decision-making and planning in complex reinforcement learning problems.

Decision Making

The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention

1 code implementation • 11 Feb 2022 • Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber

Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system which stores all training datapoints and the initial weights, and produces outputs using unnormalised dot attention over the entire training experience.
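This dual-form statement can be checked numerically in a few lines of NumPy. The sketch below uses random stand-in inputs and error signals, not the paper's experiments: accumulating rank-one gradient updates into the weight matrix (primal form) gives the same test-time output as unnormalised dot attention over the stored training inputs (dual form).

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n = 3, 2, 5
W0 = rng.standard_normal((d_out, d_in))  # initial weights
xs = rng.standard_normal((n, d_in))      # training inputs
# error signals produced by gradient descent: e_i = -lr * dLoss/dy_i
errs = rng.standard_normal((n, d_out))

# Primal form: fold the rank-one updates e_i x_i^T into the weight matrix.
W = W0 + sum(np.outer(errs[i], xs[i]) for i in range(n))

x_test = rng.standard_normal(d_in)
primal = W @ x_test

# Dual form: initial prediction plus unnormalised dot attention over the
# training inputs, with the error signals playing the role of values.
dual = W0 @ x_test + sum((xs[i] @ x_test) * errs[i] for i in range(n))

assert np.allclose(primal, dual)
```

The trained layer never needs to store the datapoints explicitly, yet its prediction is exactly an attention readout over the entire training experience.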

Continual Learning • Image Classification +1

Improving Baselines in the Wild

1 code implementation • 31 Dec 2021 • Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber

We share our experience with the recently released WILDS benchmark, a collection of ten datasets dedicated to developing models and training strategies which are robust to domain shifts.

The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization

1 code implementation • 14 Oct 2021 • Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

Despite progress across a broad range of applications, Transformers have limited success in systematic generalization.

ListOps • Systematic Generalization

Going Beyond Linear Transformers with Recurrent Fast Weight Programmers

5 code implementations • NeurIPS 2021 • Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber

Transformers with linearised attention ("linear Transformers") have demonstrated the practical scalability and effectiveness of outer product-based Fast Weight Programmers (FWPs) from the '90s.

Atari Games • ListOps

Linear Transformers Are Secretly Fast Weight Programmers

9 code implementations • 22 Feb 2021 • Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber

We show the formal equivalence of linearised self-attention mechanisms and fast weight controllers from the early '90s, where a "slow" neural net learns by gradient descent to program the "fast weights" of another net through sequences of elementary programming instructions which are additive outer products of self-invented activation patterns (today called keys and values).
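The claimed equivalence is easy to verify numerically. The sketch below uses random toy keys, values, and queries with plain unnormalised linear attention (no feature map), and shows that the two views produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6  # key/value dimension, sequence length
keys = rng.standard_normal((T, d))
values = rng.standard_normal((T, d))
queries = rng.standard_normal((T, d))

# View 1: linearised (unnormalised) causal self-attention.
# The output at step t attends over all key/value pairs up to t.
attn_out = np.stack([
    sum((queries[t] @ keys[i]) * values[i] for i in range(t + 1))
    for t in range(T)
])

# View 2: fast weight programmer. A "fast" weight matrix W is
# programmed by additive outer products v_t k_t^T, then queried.
W = np.zeros((d, d))
fwp_out = []
for t in range(T):
    W += np.outer(values[t], keys[t])  # elementary programming instruction
    fwp_out.append(W @ queries[t])
fwp_out = np.stack(fwp_out)

assert np.allclose(attn_out, fwp_out)  # the two views coincide
```

Since (Σ_i v_i k_iᵀ) q = Σ_i (k_i · q) v_i, programming the fast weight matrix and attending over the past are the same computation in different orders.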

Language Modelling • Machine Translation +2

The RWTH ASR System for TED-LIUM Release 2: Improving Hybrid HMM with SpecAugment

no code implementations • 2 Apr 2020 • Wei Zhou, Wilfried Michel, Kazuki Irie, Markus Kitza, Ralf Schlüter, Hermann Ney

We present a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus.

Data Augmentation

Language Modeling with Deep Transformers

no code implementations • 10 May 2019 • Kazuki Irie, Albert Zeyer, Ralf Schlüter, Hermann Ney

We explore deep autoregressive Transformer models in language modeling for speech recognition.

Decoder • Language Modelling +2

RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation

2 code implementations • 8 May 2019 • Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney

To the best of the authors' knowledge, the results obtained when training on the full LibriSpeech training set are the best published currently, both for the hybrid DNN/HMM and the attention-based systems.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +3

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

2 code implementations • 21 Feb 2019 • Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon

Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models.

Sequence-To-Sequence Speech Recognition

On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition

3 code implementations • 5 Feb 2019 • Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, Patrick Nguyen

We also investigate model complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from a strong word-piece based baseline with either the phoneme or the grapheme model.

Decoder • Language Modelling +2

Improved training of end-to-end attention models for speech recognition

14 code implementations • 8 May 2018 • Albert Zeyer, Kazuki Irie, Ralf Schlüter, Hermann Ney

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition.

Ranked #45 on Speech Recognition on LibriSpeech test-clean (using extra training data)

Language Modelling • Speech Recognition
