Longformer: The Long-Document Transformer

10 Apr 2020 · Iz Beltagy, Matthew E. Peters, Arman Cohan

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer...
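The linear scaling comes from replacing full self-attention with a sparse pattern: each token attends to a fixed-size local window plus a small set of task-specified global tokens. A minimal NumPy sketch of that pattern (not the authors' implementation; the window size, global-token choice, and toy dimensions are illustrative assumptions):

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=2, global_idx=()):
    """Attention where token i attends to tokens within `window` positions
    of i plus every index in `global_idx`; global tokens attend to all."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        if i in global_idx:
            idx = list(range(n))  # global tokens see the whole sequence
        else:
            idx = sorted(set(range(max(0, i - window), min(n, i + window + 1)))
                         | set(global_idx))
        scores = Q[i] @ K[idx].T / np.sqrt(d)   # one row of scores, length O(window)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                # softmax over the small neighbourhood
        out[i] = weights @ V[idx]
    return out

# Toy usage: 8 tokens, 4-dim vectors, token 0 treated as global (e.g. a [CLS] token).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
print(sliding_window_attention(Q, K, V, window=2, global_idx=(0,)).shape)  # (8, 4)
```

Per token the work is proportional to the window size plus the number of global tokens rather than the full sequence length, which is where the linear scaling comes from.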


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Language Modelling | enwik8 | Longformer (30 layers, h=512) | Bit per Character (BPC) | 0.99 | #7 |
| Language Modelling | enwik8 | Longformer (30 layers, h=512) | Number of params | 102M | #6 |
| Language Modelling | enwik8 | Longformer (12 layers, h=512) | Bit per Character (BPC) | 1.00 | #8 |
| Language Modelling | enwik8 | Longformer (12 layers, h=512) | Number of params | 41M | #17 |
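For reference, BPC is the model's average negative log-likelihood per character expressed in bits (lower is better). A small sketch of the conversion, assuming the mean per-character cross-entropy is reported in nats (the 0.686 value below is made up for illustration):

```python
import math

nll_nats_per_char = 0.686              # hypothetical mean cross-entropy per character, in nats
bpc = nll_nats_per_char / math.log(2)  # convert nats to bits
print(f"{bpc:.2f} BPC")                # ~0.99
```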

Methods used in the Paper