BP-Transformer: Modelling Long-Range Context via Binary Partitioning

11 Nov 2019  ·  Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang

The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text... In this paper, adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), we propose BP-Transformer (BPT for short). BPT yields $O(k\cdot n\log (n/k))$ connections, where $k$ is a hyperparameter that controls the density of attention. BPT strikes a good balance between computational complexity and model capacity. A series of experiments on text classification, machine translation and language modeling shows that BPT performs better on long text than previous self-attention models. Our code, hyperparameters and CUDA kernels for sparse attention are available in PyTorch.
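
To give a feel for the $O(k\cdot n\log (n/k))$ bound, the sketch below compares it with the $n^2$ connections of dense self-attention. It is only an order-of-magnitude illustration: the function names are hypothetical, the logarithm is assumed to be base 2 (natural for binary partitioning), and all constant factors are dropped, so the numbers are not the exact edge counts of the authors' implementation.

```python
import math


def bpt_connection_estimate(n: int, k: int) -> float:
    """Estimate k * n * log2(n / k) for sequence length n and density k.

    Assumption: base-2 logarithm, constant factors ignored.
    """
    if not 0 < k < n:
        raise ValueError("expected 0 < k < n")
    return k * n * math.log2(n / k)


def dense_connection_count(n: int) -> int:
    """Connections in full self-attention: every token attends to every token."""
    return n * n


if __name__ == "__main__":
    k = 4  # illustrative value, not a setting taken from the paper
    for n in (1024, 8192, 65536):
        sparse = bpt_connection_estimate(n, k)
        dense = dense_connection_count(n)
        print(f"n={n:6d}  sparse~{sparse:12.0f}  dense={dense:12d}  ratio~{dense / sparse:8.1f}")
```

Under these assumptions the gap widens as $n$ grows, since the ratio scales as $n / (k \log (n/k))$; raising $k$ trades part of that saving for denser attention.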

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Language Modelling | enwik8 | BP-Transformer (12 layers) | Bit per Character (BPC) | 1.02 | # 15 |
| | | | Number of params | 38M | # 29 |
| Sentiment Analysis | IMDb | BP-Transformer + GloVe | Accuracy | 92.12 | # 19 |
| Machine Translation | IWSLT2015 Chinese-English | BP-Transformer | BLEU | 19.84 | # 1 |
| Sentiment Analysis | SST-5 Fine-grained classification | BP-Transformer + GloVe | Accuracy | 52.71 | # 11 |
| Language Modelling | Text8 | BP-Transformer - 12 Layers | Bit per Character (BPC) | 1.11 | # 6 |