BP-Transformer: Modelling Long-Range Context via Binary Partitioning
The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text. In this paper, adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), we propose BP-Transformer (BPT for short). BPT yields $O(k\cdot n\log (n/k))$ connections, where $k$ is a hyperparameter that controls the density of attention. BPT strikes a good balance between computation complexity and model capacity. A series of experiments on text classification, machine translation and language modeling shows that BPT outperforms previous self-attention models on long text. Our code, hyperparameters and CUDA kernels for sparse attention are available in PyTorch.
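The abstract only sketches the construction, so the hedged Python sketch below illustrates one way to read it: build the binary-partition span hierarchy and apply a simplified fine-to-coarse attention rule (k spans per side at every scale), then compare the resulting edge count with dense self-attention. The selection rule and function names here are illustrative assumptions, not the authors' released code; the paper's actual graph prunes overlap between scales to reach the stated $O(k\cdot n\log (n/k))$ bound, and the released PyTorch/CUDA kernels are the authoritative reference.

```python
# Minimal sketch of the idea behind BPT, not the authors' implementation:
# recursively binary-partition the sequence into multi-scale spans, then let
# each token attend to fine spans nearby and coarse spans far away, so each
# token touches O(k * log n) nodes instead of O(n).
from bisect import bisect_right


def binary_partition_levels(n):
    """Split [0, n) into levels of spans, from the coarsest (whole sequence)
    down to the finest (single tokens), by repeated binary partitioning."""
    levels = [[(0, n)]]
    while any(hi - lo > 1 for lo, hi in levels[-1]):
        nxt = []
        for lo, hi in levels[-1]:
            if hi - lo > 1:
                mid = (lo + hi) // 2
                nxt += [(lo, mid), (mid, hi)]
            else:
                nxt.append((lo, hi))
        levels.append(nxt)
    return levels


def spans_attended_by(pos, levels, k):
    """Simplified fine-to-coarse rule (an assumption, not the paper's exact
    graph): at every scale, the token at `pos` attends to the k spans on each
    side of the span that contains it. Nearby context is thus seen at token
    resolution, distant context only through coarse span nodes."""
    attended = []
    for level in levels[1:]:                    # skip the root span
        starts = [s for s, _ in level]
        idx = bisect_right(starts, pos) - 1     # span containing `pos` at this scale
        attended += level[max(0, idx - k):idx]  # k spans to the left
        attended += level[idx + 1:idx + 1 + k]  # k spans to the right
    return list(dict.fromkeys(attended))        # drop duplicates, keep order


if __name__ == "__main__":
    n, k = 512, 4
    levels = binary_partition_levels(n)
    edges = sum(len(spans_attended_by(i, levels, k)) for i in range(n))
    # Sparse span graph vs. the n**2 edges of full self-attention.
    print(f"tokens={n}  sparse edges={edges}  dense edges={n * n}")
```

Running the script shows the sparse graph growing roughly as $n\log n$ in the number of edges, whereas dense attention grows quadratically, while fine resolution is kept near each token.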
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Sentiment Analysis | IMDb | BP-Transformer + GloVe | Accuracy | 92.12 | # 32
Machine Translation | IWSLT2015 Chinese-English | BP-Transformer | BLEU | 19.84 | # 1
Sentiment Analysis | SST-5 Fine-grained classification | BP-Transformer + GloVe | Accuracy | 52.71 | # 14
Language Modelling | Text8 | BP-Transformer - 12 Layers | Bit per Character (BPC) | 1.11 | # 8