Big Bird: Transformers for Longer Sequences

Transformer-based models, such as BERT, have been among the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having $O(1)$ global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
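To make the quadratic-versus-linear claim concrete, the following is a minimal sketch of a sparse attention mask. The abstract itself only names global tokens; the sliding-window and random components are taken from the full paper's description of the pattern. The function names, parameter names, and default values here are illustrative assumptions, and the sketch works token by token rather than on blocks, so it should not be read as the authors' implementation.

```python
import random


def sparse_attention_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Build a seq_len x seq_len mask; mask[i][j] is True if query i may attend to key j.

    Hypothetical helper: all parameter names and defaults are assumptions
    chosen for illustration. `window` is assumed to be odd.
    """
    rng = random.Random(seed)
    mask = [[False] * seq_len for _ in range(seq_len)]
    half = window // 2
    for i in range(seq_len):
        # Sliding window: each query attends to its local neighbourhood.
        for j in range(max(0, i - half), min(seq_len, i + half + 1)):
            mask[i][j] = True
        # Random keys: a fixed number per query, independent of seq_len.
        for j in rng.sample(range(seq_len), min(num_random, seq_len)):
            mask[i][j] = True
    # Global tokens (such as CLS): attend to every position and are
    # attended to by every position.
    for g in range(min(num_global, seq_len)):
        for j in range(seq_len):
            mask[g][j] = True
            mask[j][g] = True
    return mask


def count_attended(mask):
    """Number of (query, key) pairs that are allowed to interact."""
    return sum(sum(row) for row in mask)


# Compare the sparse pair count with the n * n pairs of full attention.
for n in (64, 128, 256):
    print(n, count_attended(sparse_attention_mask(n)), n * n)
```

With these defaults each non-global query attends to at most window + num_random + num_global keys, so the number of allowed pairs grows roughly linearly in the sequence length, while full attention grows as the square of it.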

NeurIPS 2020
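| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Text Summarization | Arxiv HEP-TH citation graph | BigBird-Pegasus | ROUGE-1 | 46.63 | #14 |
| Text Summarization | Arxiv HEP-TH citation graph | BigBird-Pegasus | ROUGE-2 | 19.02 | #13 |
| Text Summarization | Arxiv HEP-TH citation graph | BigBird-Pegasus | ROUGE-L | 41.77 | #11 |
| Text Classification | Arxiv HEP-TH citation graph | BigBird | Accuracy | 92.31 | #1 |
| Document Summarization | BBC XSum | BigBird-Pegasus | ROUGE-1 | 47.12 | #1 |
| Document Summarization | BBC XSum | BigBird-Pegasus | ROUGE-2 | 24.05 | #1 |
| Document Summarization | BBC XSum | BigBird-Pegasus | ROUGE-L | 38.8 | #1 |
| Text Summarization | BigPatent | BigBird-Pegasus | ROUGE-1 | 60.64 | #2 |
| Text Summarization | BigPatent | BigBird-Pegasus | ROUGE-2 | 42.46 | #2 |
| Text Summarization | BigPatent | BigBird-Pegasus | ROUGE-L | 50.01 | #2 |
| Document Summarization | CNN / Daily Mail | BigBird-Pegasus | ROUGE-1 | 43.84 | #10 |
| Document Summarization | CNN / Daily Mail | BigBird-Pegasus | ROUGE-2 | 21.11 | #7 |
| Document Summarization | CNN / Daily Mail | BigBird-Pegasus | ROUGE-L | 40.74 | #7 |
| Linguistic Acceptability | CoLA | BigBird | Accuracy | 58.5% | #32 |
| Chromatin-Profile Prediction | DeepSea | BigBird | TF | 96.1 | #1 |
| Chromatin-Profile Prediction | DeepSea | BigBird | HM | 88.7 | #1 |
| Chromatin-Profile Prediction | DeepSea | BigBird | DHS | 92.1 | #1 |
| Question Answering | HotpotQA | BigBird-etc | ANS-F1 | 0.755 | #14 |
| Question Answering | HotpotQA | BigBird-etc | SUP-F1 | 0.891 | #2 |
| Question Answering | HotpotQA | BigBird-etc | JOINT-F1 | 0.736 | #2 |
| Text Classification | Hyperpartisan News Detection | BigBird | Accuracy | 92.2 | #1 |
| Semantic Textual Similarity | MRPC | BigBird | F1 | 91.5% | #6 |
| Natural Language Inference | MultiNLI | BigBird | Matched | 87.5 | #20 |
| Text Classification | Patents | BigBird | Accuracy | 69.3 | #1 |
| Text Summarization | Pubmed | BigBird-Pegasus | ROUGE-1 | 46.32 | #14 |
| Text Summarization | Pubmed | BigBird-Pegasus | ROUGE-2 | 20.65 | #10 |
| Text Summarization | Pubmed | BigBird-Pegasus | ROUGE-L | 42.33 | #10 |
| Natural Language Inference | QNLI | BigBird | Accuracy | 92.2% | #26 |
| Question Answering | Quora Question Pairs | BigBird | Accuracy | 88.6% | #15 |
| Natural Language Inference | RTE | BigBird | Accuracy | 75.0% | #44 |
| Sentiment Analysis | SST-2 Binary classification | BigBird | Accuracy | 94.6 | #32 |
| Semantic Textual Similarity | STS Benchmark | BigBird | Spearman Correlation | 0.878 | #16 |
| Question Answering | TriviaQA | BigBird-etc | F1 | 80.9 | #2 |
| Question Answering | WikiHop | BigBird-etc | Test | 82.3 | #1 |
| Text Classification | Yelp-5 | BigBird | Accuracy | 72.16% | #3 |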

Methods