TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	ConViT-B+	Top 1 Accuracy	82.5%	# 482
Image Classification	ImageNet	ConViT-B+	Number of params	152M	# 882
Image Classification	ImageNet	ConViT-B+	Hardware Burden	None	# 1
Image Classification	ImageNet	ConViT-B+	Operations per network pass	None	# 1
Image Classification	ImageNet	ConViT-B+	GFLOPs	30	# 390
Image Classification	ImageNet	ConViT-Ti+	Top 1 Accuracy	76.7%	# 832
Image Classification	ImageNet	ConViT-Ti+	Number of params	10M	# 475
Image Classification	ImageNet	ConViT-Ti+	GFLOPs	2	# 146
Image Classification	ImageNet	ConViT-Ti	Top 1 Accuracy	73.1%	# 917
Image Classification	ImageNet	ConViT-Ti	Number of params	6M	# 437
Image Classification	ImageNet	ConViT-Ti	GFLOPs	1	# 103
Image Classification	ImageNet	ConViT-B	Top 1 Accuracy	82.4%	# 491
Image Classification	ImageNet	ConViT-B	Number of params	86M	# 814
Image Classification	ImageNet	ConViT-B	GFLOPs	17	# 352
Image Classification	ImageNet	ConViT-S	Top 1 Accuracy	81.3%	# 595
Image Classification	ImageNet	ConViT-S	Number of params	27M	# 615
Image Classification	ImageNet	ConViT-S	GFLOPs	5.4	# 236
Image Classification	ImageNet	ConViT-S+	Top 1 Accuracy	82.2%	# 510
Image Classification	ImageNet	ConViT-S+	Number of params	48M	# 713
Image Classification	ImageNet	ConViT-S+	GFLOPs	10	# 299

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/convit-improving-vision-transformers-with/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=convit-improving-vision-transformers-with)`

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

19 Mar 2021 · Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, Levent Sagun

Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialise the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analysing how it is escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly at https://github.com/facebookresearch/convit.
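
The gating idea in the abstract can be illustrated with a small sketch. The abstract does not spell out the formula, so the code below assumes one plausible form: each head computes a content attention map and a positional attention map, and a sigmoid of a per-head gating parameter sets the blend between them. The function names, the 1-D patch layout, and the quadratic `local_positional_scores` initialisation are hypothetical simplifications for illustration, not the code from the released repository.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gpsa_attention(content_scores, positional_scores, gate):
    """Blend content and positional attention for a single head.

    content_scores[i][j]    : query-key score between patches i and j
    positional_scores[i][j] : score that depends only on the relative
                              position of patches i and j
    gate                    : scalar gating parameter (assumed one per head);
                              sigmoid(gate) is the weight given to position
    """
    g = sigmoid(gate)
    attention = []
    for c_row, p_row in zip(content_scores, positional_scores):
        c = softmax(c_row)
        p = softmax(p_row)
        attention.append([(1.0 - g) * ci + g * pi for ci, pi in zip(c, p)])
    return attention


def local_positional_scores(n, locality_strength):
    """Hypothetical 1-D 'convolution-like' initialisation: the score decays
    with the squared distance between patches, so attention starts local."""
    return [[-locality_strength * (i - j) ** 2 for j in range(n)]
            for i in range(n)]


n = 5
content = [[0.0] * n for _ in range(n)]  # uninformative content scores
position = local_positional_scores(n, locality_strength=2.0)

local_head = gpsa_attention(content, position, gate=4.0)   # mostly positional
free_head = gpsa_attention(content, position, gate=-4.0)   # mostly content
print([round(a, 3) for a in local_head[2]])
print([round(a, 3) for a in free_head[2]])
```

With a large positive gate the head attends almost only to nearby patches, as a convolution would; driving the gate negative lets the same head fall back on content-based attention. Each row remains a valid probability distribution because it is a convex combination of two softmax outputs.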

Code

Repository	Stars
facebookresearch/convit (official)	454
rwightman/pytorch-image-models	29,902
facebookresearch/vissl	3,230
mindspore-ecosystem/mindcv	222
SforAiDl/vformer	161

Tasks

Image Classification

Inductive Bias

Datasets

ImageNet

Results from the Paper

Ranked #482 on Image Classification on ImageNet

Methods

Attention Dropout • ConViT • DeiT • Dense Connections • Dropout • Feedforward Network • GPSA • Layer Normalization • Linear Layer • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax
