Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

ACL 2019 · Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov

Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture, Transformer-XL, that enables learning dependencies beyond a fixed length without disrupting temporal coherence.
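The core mechanism behind "dependency beyond a fixed length" is segment-level recurrence: hidden states computed for the previous segment are cached (without gradient) and reused as extra context, so queries from the current segment attend over both the memory and the current segment. Below is a minimal NumPy sketch of that idea, not the paper's implementation: it uses a single attention head, absolute projections rather than the paper's relative positional encodings, and hypothetical names (`attend_with_memory`, `Wq`, `Wk`, `Wv`) chosen for illustration.

```python
import numpy as np

def attend_with_memory(h_mem, h_cur, Wq, Wk, Wv):
    """Single-head causal attention over [cached memory; current segment].

    h_mem: (mem_len, d) hidden states cached from the previous segment
           (treated as constants, i.e. no gradient would flow into them).
    h_cur: (cur_len, d) hidden states of the current segment.
    Returns (output, attention_weights).
    """
    # Keys/values range over memory + current segment; queries only over
    # the current segment -- this is the segment-level recurrence idea.
    ctx = np.concatenate([h_mem, h_cur], axis=0)          # (mem+cur, d)
    q = h_cur @ Wq                                        # (cur, d)
    k = ctx @ Wk                                          # (mem+cur, d)
    v = ctx @ Wv                                          # (mem+cur, d)
    scores = (q @ k.T) / np.sqrt(q.shape[-1])             # (cur, mem+cur)

    # Causal mask: current position i sees all memory positions and
    # current positions <= i, but never future current positions.
    mem_len, cur_len = h_mem.shape[0], h_cur.shape[0]
    future = np.triu(np.ones((cur_len, cur_len)), k=1).astype(bool)
    mask = np.concatenate(
        [np.zeros((cur_len, mem_len), dtype=bool), future], axis=1)
    scores = np.where(mask, -1e9, scores)

    # Row-wise softmax.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w
```

At inference time one would slide over segments, caching each segment's hidden states as the next call's `h_mem`, which is what lets effective context grow past a single segment length.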


Evaluation Results from the Paper


All results below are for the language modelling task.

| Dataset | Model | Metric | Value | Params | Global Rank |
|---|---|---|---|---|---|
| enwik8 | Transformer-XL - 12 layers | Bit per Character (BPC) | 1.06 | 41M | #9 |
| enwik8 | Transformer-XL - 18 layers | Bit per Character (BPC) | 1.03 | 88M | #8 |
| enwik8 | Transformer-XL - 24 layers | Bit per Character (BPC) | 0.99 | 277M | #5 |
| Hutter Prize | 12-layer Transformer-XL | Bit per Character (BPC) | 1.06 | 41M | #3 |
| Hutter Prize | 18-layer Transformer-XL | Bit per Character (BPC) | 1.03 | 88M | #2 |
| Hutter Prize | 24-layer Transformer-XL | Bit per Character (BPC) | 0.99 | 277M | #1 |
| One Billion Word | Transformer-XL Base | PPL | 23.5 | 0.46B | #3 |
| One Billion Word | Transformer-XL Large | PPL | 21.8 | 0.8B | #1 |
| Penn Treebank (Word Level) | Transformer-XL | Validation perplexity | 56.72 | 24M | #12 |
| Penn Treebank (Word Level) | Transformer-XL | Test perplexity | 54.55 | 24M | #15 |
| Text8 | Transformer-XL - 24 layers | Bit per Character (BPC) | 1.08 | 277M | #4 |
| WikiText-103 | Transformer-XL Standard | Validation perplexity | 23.1 | 151M | #9 |
| WikiText-103 | Transformer-XL Standard | Test perplexity | 24.0 | 151M | #12 |
| WikiText-103 | Transformer-XL Large | Validation perplexity | 18.2 | 257M | #6 |
| WikiText-103 | Transformer-XL Large | Test perplexity | 18.3 | 257M | #7 |
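When reading the table, note that bits per character (BPC) and character-level perplexity measure the same quantity on different scales: BPC is the average negative log2-likelihood per character, so character-level perplexity equals 2 raised to the BPC. A tiny helper (the function name is my own) makes the conversion explicit:

```python
def bpc_to_char_perplexity(bpc):
    """Character-level perplexity implied by a bits-per-character score.

    BPC is the mean negative log2-likelihood per character, so
    perplexity = 2 ** BPC. Note this is per-character perplexity and is
    not directly comparable to the word-level perplexities (PPL)
    reported for One Billion Word, Penn Treebank, or WikiText-103.
    """
    return 2.0 ** bpc
```

For example, the 24-layer model's 0.99 BPC on enwik8 corresponds to a per-character perplexity of just under 2.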