MaxViT: Multi-Axis Vision Transformer

4 Apr 2022 · Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element that effectively blends our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, built by repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in the earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
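
To make the two attention axes concrete, here is a minimal NumPy sketch of the underlying tensor partitions. This is not the authors' implementation; the function names (block_partition, grid_partition) and all shapes are our own illustration. Block attention runs within fixed local windows, while grid attention runs over a fixed grid whose members are spread uniformly across the whole image, giving sparse global mixing:

```python
import numpy as np

def block_partition(x, p):
    """Block (local) axis: split the feature map into non-overlapping
    p x p windows; self-attention would then run within each window."""
    b, h, w, c = x.shape                  # assumes h and w divisible by p
    x = x.reshape(b, h // p, p, w // p, p, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)     # (b, h//p, w//p, p, p, c)
    return x.reshape(-1, p * p, c)        # each row: one local window

def grid_partition(x, g):
    """Grid (global) axis: overlay a fixed g x g grid; tokens that share a
    grid coordinate sit (h//g, w//g) pixels apart, so attention within each
    group mixes information across the entire image (dilated attention)."""
    b, h, w, c = x.shape                  # assumes h and w divisible by g
    x = x.reshape(b, g, h // g, g, w // g, c)
    x = x.transpose(0, 2, 4, 1, 3, 5)     # (b, h//g, w//g, g, g, c)
    return x.reshape(-1, g * g, c)        # each row: one dilated group

# Both partitions produce fixed-length sequences (p*p or g*g tokens), so
# attention costs O(H*W * p^2) overall -- linear in the number of pixels,
# unlike the O((H*W)^2) cost of full self-attention.
x = np.random.randn(2, 16, 16, 32)        # (batch, H, W, channels)
print(block_partition(x, 4).shape)        # (32, 16, 32)
print(grid_partition(x, 4).shape)         # (32, 16, 32)
```

In the paper's full building block, each partition is paired with relative self-attention and the inverse reshape, and the pair is stacked after an MBConv layer; the sketch above only covers the index arithmetic that makes both axes linear-cost.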

Results from the Paper


Object Detection on COCO 2017 (global rank in parentheses)

Model       AP          AP50        AP75        APM         APM50       APM75
MaxViT-B    53.4 (#1)   72.9 (#1)   58.1 (#1)   45.7 (#1)   70.3 (#1)   50 (#1)
MaxViT-S    53.1 (#2)   72.5 (#2)   58.1 (#1)   45.4 (#2)   69.8 (#2)   49.5 (#2)
MaxViT-T    52.1 (#3)   71.9 (#3)   56.8 (#3)   44.6 (#3)   69.1 (#3)   48.4 (#3)

Image Classification on ImageNet

Model                      Top-1 Accuracy   Global Rank
MaxViT-T (224res)          83.62%           #376
MaxViT-S (224res)          84.45%           #298
MaxViT-B (224res)          84.95%           #263
MaxViT-T (384res)          85.24%           #237
MaxViT-T (512res)          85.72%           #199
MaxViT-S (512res)          86.19%           #169
MaxViT-B (384res)          86.34%           #152
MaxViT-L (384res)          86.4%            #143
MaxViT-L (512res)          86.7%            #126

MaxViT-B (384res, 21K)     88.24%           #64
MaxViT-L (384res, 21K)     88.32%           #61
MaxViT-B (512res, 21K)     88.38%           #57
MaxViT-L (512res, 21K)     88.46%           #54
MaxViT-XL (384res, 21K)    88.51%           #49
MaxViT-XL (512res, 21K)    88.7%            #40

MaxViT-B (384res, JFT)     88.69%           #41
MaxViT-B (512res, JFT)     88.82%           #37
MaxViT-L (384res, JFT)     89.12%           #32
MaxViT-L (512res, JFT)     89.41%           #29
MaxViT-XL (384res, JFT)    89.41%           #29
MaxViT-XL (512res, JFT)    89.53%           #27
