TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Language Modelling	enwik8	Cluster-Former (#C=512)	Bits per Character (BPC)	1.22	#32
Question Answering	Natural Questions (long)	Cluster-Former (#C=512)	F1	76.5	#2
Question Answering	QUASAR-T	Cluster-Former (#C=512)	EM	54	#1
Open-Domain Question Answering	SearchQA	Cluster-Former (#C=512)	EM	68.0	#1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cluster-former-clustering-based-sparse/question-answering-on-quasart-t)](https://paperswithcode.com/sota/question-answering-on-quasart-t?p=cluster-former-clustering-based-sparse)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cluster-former-clustering-based-sparse/open-domain-question-answering-on-searchqa)](https://paperswithcode.com/sota/open-domain-question-answering-on-searchqa?p=cluster-former-clustering-based-sparse)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cluster-former-clustering-based-sparse/question-answering-on-natural-questions-long)](https://paperswithcode.com/sota/question-answering-on-natural-questions-long?p=cluster-former-clustering-based-sparse)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cluster-former-clustering-based-sparse/language-modelling-on-enwiki8)](https://paperswithcode.com/sota/language-modelling-on-enwiki8?p=cluster-former-clustering-based-sparse)`

Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

13 Sep 2020 · Shuohang Wang, Luowei Zhou, Zhe Gan, Yen-Chun Chen, Yuwei Fang, Siqi Sun, Yu Cheng, Jingjing Liu

Transformer has become ubiquitous in the deep learning field. One of the key ingredients that destined its success is the self-attention mechanism, which allows fully-connected contextual encoding over input tokens. However, despite its effectiveness in modeling short sequences, self-attention suffers when handling inputs with extreme long-range dependencies, as its complexity grows quadratically with respect to the sequence length. Therefore, long sequences are often encoded by Transformer in chunks using a sliding window. In this paper, we propose Cluster-Former, a novel clustering-based sparse Transformer to perform attention across chunked sequences. The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer, which encode local sequence information and global context jointly and iteratively. This new design allows information integration beyond local windows, which is especially beneficial for question answering (QA) tasks that rely on long-range dependencies. Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
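
The two layer types named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (none is listed for this paper); it is a minimal pure-Python illustration under stated assumptions. The helper names `sliding_windows`, `assign_clusters`, and `attend` are hypothetical, the centroids are given by hand rather than learned, and attention uses the raw vectors as queries, keys, and values with no learned projections.

```python
import math

def sliding_windows(n_tokens, window, stride):
    """Return (start, end) spans that cover range(n_tokens) with overlapping windows."""
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans

def assign_clusters(vectors, centroids):
    """Group token indices by nearest centroid (squared Euclidean distance)."""
    groups = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
        groups[dists.index(min(dists))].append(i)
    return groups

def attend(vectors, indices):
    """Scaled dot-product self-attention restricted to the tokens in `indices`."""
    dim = len(vectors[0])
    out = {}
    for i in indices:
        scores = [
            sum(a * b for a, b in zip(vectors[i], vectors[j])) / math.sqrt(dim)
            for j in indices
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out[i] = [
            sum(w * vectors[j][d] for w, j in zip(weights, indices)) / total
            for d in range(dim)
        ]
    return out

# Local view: a 10-token sequence split into overlapping windows.
spans = sliding_windows(n_tokens=10, window=4, stride=2)
# spans == [(0, 4), (2, 6), (4, 8), (6, 10)]

# Global view: tokens 0 and 5 are far apart in position but similar in content,
# so they land in the same cluster and can attend to each other.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [0.0, 1.1], [0.2, 1.0], [0.9, 0.1]]
centroids = [[1.0, 0.0], [0.0, 1.0]]  # assumed fixed here for illustration
groups = assign_clusters(vectors, centroids)
# groups == {0: [0, 5], 1: [1, 2, 3, 4]}

# Attention is computed only within each cluster, not over the full sequence.
updated = {}
for members in groups.values():
    updated.update(attend(vectors, members))
```

In this sketch the cost of attention scales with the window or cluster size rather than with the full sequence length, which is the motivation the abstract gives for avoiding full self-attention on long inputs.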

Code

No code implementations yet.

Tasks

Clustering

Language Modelling

Open-Domain Question Answering

Question Answering

Datasets

Natural Questions

WikiText-2

WikiText-103

SearchQA

QUASAR-T

QUASAR

Results from the Paper

Ranked #1 on Open-Domain Question Answering on SearchQA

Methods

Adam • Attention Dropout • Cosine Annealing • Dense Connections • Dropout • GELU • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Sparse Transformer • Weight Decay
