CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

ICCV 2021 · Chun-Fu Chen, Quanfu Fan, Rameswar Panda

The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity, and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross-attention, which uses a single token for each branch as a query to exchange information with the other branch. Our proposed cross-attention requires only linear time for both computational and memory complexity instead of quadratic time otherwise. Extensive experiments demonstrate that our approach performs better than or on par with several concurrent works on vision transformers, in addition to efficient CNN models. For example, on the ImageNet-1K dataset, with some architectural changes, our approach outperforms the recent DeiT by a large margin of 2% with a small to moderate increase in FLOPs and model parameters. Our source code and models are available at https://github.com/IBM/CrossViT.
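Below is a minimal PyTorch sketch of the cross-attention fusion idea described in the abstract: the CLS token of one branch is the only query attending over the tokens of the other branch, so attention cost grows linearly with the number of tokens rather than quadratically. This is an illustrative assumption-laden sketch, not the authors' reference implementation; module and variable names are made up, and it assumes both branches share the same embedding width (the actual model projects the CLS token to align branch dimensions).

```python
# Illustrative sketch of single-query cross-attention fusion (assumptions noted above).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """CLS token of branch A attends over [CLS of A; tokens of branch B]."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)        # query is built from the CLS token only
        self.kv = nn.Linear(dim, dim * 2)   # keys/values from all tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, cls_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # cls_a:    (B, 1, C) class token of the querying branch
        # tokens_b: (B, N, C) tokens of the other branch
        B, N, C = tokens_b.shape
        x = torch.cat([cls_a, tokens_b], dim=1)                      # (B, 1+N, C)

        q = self.q(cls_a).reshape(B, 1, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv(x).reshape(B, 1 + N, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                             # (B, heads, 1+N, head_dim) each

        # A single query row per head: O(N) attention instead of O(N^2).
        attn = (q @ k.transpose(-2, -1)) * self.scale                # (B, heads, 1, 1+N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, C)
        return self.proj(out)                                        # updated CLS token for branch A


if __name__ == "__main__":
    fuse = CrossAttentionFusion(dim=192)
    cls_small = torch.randn(2, 1, 192)        # CLS token of the small-patch branch (example shape)
    tokens_large = torch.randn(2, 197, 192)   # tokens of the other branch (example shape)
    print(fuse(cls_small, tokens_large).shape)  # torch.Size([2, 1, 192])
```

After fusion, the updated CLS token is passed back to its own branch, so information from the other scale propagates to all of that branch's patch tokens in the following transformer layers.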


Results from the Paper


Image Classification on ImageNet

Model          Top-1 Accuracy   Params   GFLOPs
CrossViT-18    82.5%            43.3M    9.0
CrossViT-18+   82.8%            44.3M    9.5
CrossViT-15+   82.3%            28.2M    6.1
CrossViT-15    81.5%            27.4M    5.8
