$\infty$-former: Infinite Memory Transformer

1 Sep 2021 · Pedro Henrique Martins, Zita Marinho, André F. T. Martins

Transformers are unable to model long-term memories effectively, since the amount of computation they need to perform grows with the context length. While variations of efficient transformers have been proposed, they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the $\infty$-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the $\infty$-former's attention complexity becomes independent of the context length, trading off memory length with precision. To control where precision matters most, the $\infty$-former maintains "sticky memories", which allow it to model arbitrarily long contexts while keeping the computation budget fixed. Experiments on a synthetic sorting task, language modeling, and document grounded dialogue generation demonstrate the $\infty$-former's ability to retain information from long sequences.
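The key idea in the abstract, attending over a fixed-size continuous representation of an unbounded memory, can be illustrated with a short sketch. The snippet below is a minimal NumPy illustration built on assumptions of my own (Gaussian RBF basis functions, a ridge-regression fit of the memory signal, and grid quadrature for the expectation); names such as `fit_memory_coefficients` and `continuous_attention` are hypothetical and do not come from the paper's released code.

```python
# Minimal sketch of continuous attention over an unbounded long-term memory,
# in the spirit of the Infinity-former abstract. Basis choice, the ridge fit,
# and all names here are illustrative assumptions, not the paper's code.
import numpy as np

def rbf_basis(t, centers, width=0.05):
    """Evaluate N Gaussian radial basis functions at positions t in [0, 1]."""
    # t: (T,) positions; centers: (N,) basis centers -> returns (T, N)
    return np.exp(-0.5 * ((t[:, None] - centers[None, :]) / width) ** 2)

def fit_memory_coefficients(X, centers, ridge=1e-3):
    """Compress L token embeddings X (L, d) into N coefficients B (N, d).

    The memory is treated as a continuous signal x(t) ~ B^T psi(t); B is fit
    by ridge regression of X onto the basis evaluated at the token positions.
    Downstream attention cost depends on N, not on the context length L.
    """
    L = X.shape[0]
    t = (np.arange(L) + 1) / L                      # token positions in (0, 1]
    Psi = rbf_basis(t, centers)                     # (L, N)
    A = Psi.T @ Psi + ridge * np.eye(len(centers))  # (N, N)
    return np.linalg.solve(A, Psi.T @ X)            # (N, d)

def continuous_attention(query, B, centers, W_mu, W_sigma, n_quad=200):
    """Attend over the continuous memory with a Gaussian density.

    The query is mapped to a mean and variance on [0, 1]; the context vector
    is the expectation of x(t) = B^T psi(t) under that Gaussian, approximated
    here with simple quadrature on a fixed grid.
    """
    mu = 1.0 / (1.0 + np.exp(-query @ W_mu))            # squash mean into (0, 1)
    sigma = np.logaddexp(0.0, query @ W_sigma) + 1e-3   # softplus std dev
    t = np.linspace(0.0, 1.0, n_quad)
    density = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    density /= density.sum()                            # normalize quadrature weights
    expected_psi = density @ rbf_basis(t, centers)      # (N,) = E[psi(t)]
    return expected_psi @ B                             # (d,) context vector

# Toy usage: 10k past token states compressed into 64 basis coefficients,
# then queried with a single attention query vector.
rng = np.random.default_rng(0)
d, L, N = 32, 10_000, 64
X = rng.normal(size=(L, d))                             # stand-in for old hidden states
centers = np.linspace(0.0, 1.0, N)
B = fit_memory_coefficients(X, centers)
W_mu, W_sigma = rng.normal(size=d), rng.normal(size=d)
context = continuous_attention(rng.normal(size=d), B, centers, W_mu, W_sigma)
print(context.shape)                                    # (32,)
```

Note that the attention step touches only the N basis coefficients, which is what makes its cost independent of how many tokens were folded into the memory.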


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Dialogue Generation | CMU-DoG | ∞-former (Sticky memories) | F1 | 9.01 | 1 |
| Dialogue Generation | CMU-DoG | ∞-former (Sticky memories) | ROUGE-1 | 15.37 | 1 |
| Dialogue Generation | CMU-DoG | ∞-former (Sticky memories) | ROUGE-L | 12.56 | 1 |
| Dialogue Generation | CMU-DoG | ∞-former (Sticky memories) | METEOR | 7.55 | 1 |
| Language Modelling | PG-19 | ∞-former (Sticky memories + initialized GPT-2 Small) | Perplexity | 32.48 | 1 |
| Language Modelling | WikiText-103 | ∞-former (initialized GPT-2 Small) | Test perplexity | 16.64 | 17 |
| Language Modelling | WikiText-103 | ∞-former (Sticky memories) | Test perplexity | 24.22 | 56 |
| Language Modelling | WikiText-103 | ∞-former (Sticky memories + initialized GPT-2 Small) | Test perplexity | 16.61 | 14 |
