H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences

ACL 2021 · Zhenhai Zhu, Radu Soricut

We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run-time and memory complexity. We perform extensive experiments to show that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure in the sequences typical of natural language and vision tasks. Our method outperforms alternative sub-quadratic proposals by over +6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer model parameters than the previous-best Transformer-based models.
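To make the idea concrete, below is a minimal single-level sketch of hierarchical attention in NumPy: queries attend exactly to keys within their own diagonal block and only to block-averaged (coarsened) keys/values elsewhere, which reduces cost from O(n^2) to roughly O(n x block). This is an illustrative assumption-level simplification, not the paper's multi-level H-Matrix construction; the function name and block size are hypothetical.

```python
import numpy as np

def block_local_plus_coarse_attention(q, k, v, block=16):
    """Sketch only: exact attention inside each diagonal block,
    coarse attention to block-averaged keys/values elsewhere.
    Not the paper's full multi-level H-Matrix algorithm."""
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of block"
    nb = n // block

    # Coarse-grained keys/values: one averaged token per block.
    k_coarse = k.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    v_coarse = v.reshape(nb, block, d).mean(axis=1)   # (nb, d)

    out = np.empty_like(q)
    for b in range(nb):
        sl = slice(b * block, (b + 1) * block)
        q_b = q[sl]                                    # (block, d)

        # Fine scores against the same block, coarse scores against all blocks.
        fine = q_b @ k[sl].T / np.sqrt(d)              # (block, block)
        coarse = q_b @ k_coarse.T / np.sqrt(d)         # (block, nb)
        coarse[:, b] = -np.inf                         # own block handled exactly

        scores = np.concatenate([fine, coarse], axis=1)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)

        # Combine fine values from the local block with coarse block averages.
        out[sl] = w[:, :block] @ v[sl] + w[:, block:] @ v_coarse
    return out

# Usage example with random inputs.
q = np.random.randn(128, 64)
k = np.random.randn(128, 64)
v = np.random.randn(128, 64)
y = block_local_plus_coarse_attention(q, k, v, block=16)  # (128, 64)
```

The paper's method applies this coarsening recursively across multiple levels, keeping progressively lower-resolution summaries for blocks farther from the diagonal, which is what yields the overall linear complexity.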


Results from the Paper


Ranked #1 on Language Modelling on One Billion Word (Validation perplexity metric)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Language Modelling | One Billion Word | H-Transformer-1D Nr=16 (Large) | Number of params | 144M | #23 |
| Language Modelling | One Billion Word | H-Transformer-1D Nr=16 (Large) | Validation perplexity | 20.25 | #1 |
| Language Modelling | One Billion Word | H-Transformer-1D Nr=16 (Base) | Number of params | 53M | #20 |
| Language Modelling | One Billion Word | H-Transformer-1D Nr=16 (Base) | Validation perplexity | 23.95 | #4 |

Methods