Compressive Transformers for Long-Range Sequence Modelling

We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.

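The core mechanism described above lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration of a compressive memory update: rather than discarding old activations once the memory is full (as Transformer-XL does), the evicted memories are compressed by a rate c with a learned compression function (the paper considers pooling, 1D convolutions and related choices). The class name, parameter names, and the use of a strided 1D convolution here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CompressiveMemory(nn.Module):
    """Sketch of a per-layer compressive memory update (hypothetical names)."""

    def __init__(self, d_model=1024, mem_len=512, cmem_len=512, compression_rate=3):
        super().__init__()
        self.mem_len = mem_len          # length of the uncompressed memory
        self.cmem_len = cmem_len        # length of the compressive memory
        self.c = compression_rate       # compression rate
        # One possible compression function: a strided 1D convolution over time,
        # reducing c old activations to a single compressed memory slot.
        self.compress = nn.Conv1d(d_model, d_model,
                                  kernel_size=compression_rate,
                                  stride=compression_rate)

    def update(self, mem, cmem, hidden):
        """Append new hidden states; compress, rather than drop, evicted memories.

        mem, cmem, hidden: tensors of shape (batch, time, d_model).
        """
        mem = torch.cat([mem, hidden], dim=1)
        if mem.size(1) > self.mem_len:
            # Split off the oldest activations that no longer fit in memory.
            evicted, mem = mem[:, :-self.mem_len], mem[:, -self.mem_len:]
            if evicted.size(1) >= self.c:
                # Compress the evicted span by a factor of c and append it
                # to the compressive memory, keeping only the most recent slots.
                compressed = self.compress(evicted.transpose(1, 2)).transpose(1, 2)
                cmem = torch.cat([cmem, compressed], dim=1)[:, -self.cmem_len:]
        return mem, cmem
```

In the full model each layer maintains its own memory and compressive memory, and the compression function is trained with an auxiliary reconstruction loss; those details are omitted from this sketch.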

Results from the Paper


Task                 Dataset        Model                                    Metric                     Value   Global Rank
Language Modelling   enwik8         Compressive Transformer (24 layers)      Bit per Character (BPC)    0.97    #8
Language Modelling   enwik8         Compressive Transformer (24 layers)      Number of params           277M    #2
Language Modelling   Hutter Prize   Compressive Transformer                  Bit per Character (BPC)    0.97    #2
Language Modelling   WikiText-103   Compressive Transformer (18L, M=1024)    Validation perplexity      16.0    #5
Language Modelling   WikiText-103   Compressive Transformer (18L, M=1024)    Test perplexity            17.1    #20

Methods