A Dot Product Attention Free Transformer
We introduce the Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers \citep{transformer} that eliminates the query-key dot product in self-attention. The core idea is to construct a decomposable attention map for each dimension of the query, key, and value. This compositionality enables an implementation in which the attention tensor never needs to be computed or stored explicitly. A DAFT layer has memory complexity that is linear in both the context size and the feature dimension, making it compatible with both large inputs and large models. We also introduce DAFT-conv, a model variant that exploits locality and spatial weight sharing while maintaining global connectivity. We conduct experiments on ImageNet-1K classification as well as CIFAR10 and Enwik8, two autoregressive modeling tasks. DAFT achieves competitive performance on all benchmarks while offering excellent efficiency.
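The abstract does not spell out the layer equations, but the description (sigmoid-free of dot products, decomposable per-dimension attention, linear memory in context size and feature dimension) is consistent with an AFT-full-style formulation, where each output position is a sigmoid-gated, key-weighted average of the values with a learned pairwise position bias. The sketch below is a minimal, hedged illustration of that idea in PyTorch; the class name `DAFTFullSketch`, the tensor shapes, and the bias parameterization `w` are assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class DAFTFullSketch(nn.Module):
    """Minimal sketch of a dot-product-free attention layer.

    Assumed formulation (not stated in the abstract):
        Y_t = sigmoid(Q_t) * sum_s exp(K_s + w_{t,s}) * V_s / sum_s exp(K_s + w_{t,s})
    Because exp(K_s + w_{t,s}) factors into exp(w_{t,s}) * exp(K_s), the sums
    reduce to matrix products, so no (T, T, d) attention tensor is ever stored:
    memory stays linear in both sequence length and feature dimension.
    """

    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learned pairwise position bias (max_len x max_len); hypothetical parameterization.
        self.w = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        _, T, _ = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        ew = torch.exp(self.w[:T, :T])                           # (T, T)
        # Subtracting the per-feature max stabilizes exp(k); it cancels in num/den.
        ek = torch.exp(k - k.max(dim=1, keepdim=True).values)    # (B, T, D)

        num = torch.einsum('ts,bsd->btd', ew, ek * v)            # (B, T, D)
        den = torch.einsum('ts,bsd->btd', ew, ek)                # (B, T, D)
        return torch.sigmoid(q) * num / den


# Usage sketch: a batch of 2 sequences, length 16, feature dim 64.
if __name__ == "__main__":
    layer = DAFTFullSketch(dim=64, max_len=128)
    y = layer(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```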
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Image Classification | ImageNet | DAFT-full | Top 1 Accuracy | 79.8% | #676
Image Classification | ImageNet | DAFT-full | Number of params | 22.6M | #570
Image Classification | ImageNet | DAFT-conv (384 heads, 200 epochs) | Top 1 Accuracy | 80.1% | #659
Image Classification | ImageNet | DAFT-conv (384 heads, 200 epochs) | Number of params | 23M | #574
Image Classification | ImageNet | DAFT-conv (16 heads) | Top 1 Accuracy | 80.2% | #655
Image Classification | ImageNet | DAFT-conv (16 heads) | Number of params | 20.3M | #541
Image Classification | ImageNet | DAFT-conv (384 heads, 300 epochs) | Top 1 Accuracy | 80.8% | #623
Image Classification | ImageNet | DAFT-conv (384 heads, 300 epochs) | Number of params | 23M | #574