TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	Next-ViT-S	Top 1 Accuracy	82.5%	# 482
Image Classification	ImageNet	Next-ViT-S	Number of params	31.7M	# 652
Image Classification	ImageNet	Next-ViT-S	GFLOPs	5.8	# 239
Image Classification	ImageNet	Next-ViT-B	Top 1 Accuracy	83.2%	# 413
Image Classification	ImageNet	Next-ViT-B	Number of params	44.8M	# 705
Image Classification	ImageNet	Next-ViT-B	GFLOPs	8.3	# 274
Image Classification	ImageNet	Next-ViT-L @384	Top 1 Accuracy	84.7%	# 281
Image Classification	ImageNet	Next-ViT-L @384	Number of params	57.8M	# 762
Image Classification	ImageNet	Next-ViT-L @384	GFLOPs	32	# 396
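
The table above is in long format (one metric per row), which makes the three variants hard to compare at a glance. As a minimal sketch, the rows can be pivoted into one record per model. The values are copied from the table; the helper names `to_number` and `pivot` are illustrative and not part of any Next-ViT code.

```python
from collections import defaultdict

# (model, metric, value) triples copied from the results table above.
ROWS = [
    ("Next-ViT-S", "Top 1 Accuracy", "82.5%"),
    ("Next-ViT-S", "Number of params", "31.7M"),
    ("Next-ViT-S", "GFLOPs", "5.8"),
    ("Next-ViT-B", "Top 1 Accuracy", "83.2%"),
    ("Next-ViT-B", "Number of params", "44.8M"),
    ("Next-ViT-B", "GFLOPs", "8.3"),
    ("Next-ViT-L @384", "Top 1 Accuracy", "84.7%"),
    ("Next-ViT-L @384", "Number of params", "57.8M"),
    ("Next-ViT-L @384", "GFLOPs", "32"),
]


def to_number(text):
    """Strip a trailing '%' or 'M' unit suffix and return a float."""
    return float(text.rstrip("%M"))


def pivot(rows):
    """Group long-format rows into one dict of metrics per model."""
    table = defaultdict(dict)
    for model, metric, value in rows:
        table[model][metric] = to_number(value)
    return dict(table)


models = pivot(ROWS)
for name, m in models.items():
    print(
        f"{name}: {m['Top 1 Accuracy']:.1f}% top-1, "
        f"{m['Number of params']:.1f}M params, {m['GFLOPs']:g} GFLOPs"
    )
```

Note that the Next-ViT-L row is labelled `@384`, so its GFLOPs figure presumably reflects a larger input resolution and may not be directly comparable with the S and B rows.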

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/next-vit-next-generation-vision-transformer/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=next-vit-next-generation-vision-transformer)`
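
The badge Markdown above appears to be assembled from two slugs: one for the paper and one for the benchmark. The sketch below rebuilds it from those parts. The URL pattern is inferred from this single badge and is an assumption, not a documented API; `pwc_badge` is a hypothetical helper name.

```python
def pwc_badge(paper_slug, benchmark_slug):
    """Build badge Markdown following the pattern seen in the badge above."""
    # Image URL: a shields.io endpoint badge that points at the paper/benchmark pair.
    image = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    # Link target: the benchmark leaderboard, with the paper passed as `p`.
    target = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({image})]({target})"


badge = pwc_badge(
    "next-vit-next-generation-vision-transformer",
    "image-classification-on-imagenet",
)
print(badge)
```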

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

12 Jul 2022 · Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan

Due to their complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet their overall performance is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs in terms of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and the Next Transformer Block (NTB) are developed to capture local and global information, respectively, with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and by 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves performance comparable to CSWin while running inference 3.6x faster. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and by 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are publicly available at: https://github.com/bytedance/Next-ViT
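
The abstract quotes each gain together with its before and after scores, so the arithmetic can be checked directly. The sketch below recomputes each difference from the quoted endpoints and compares it with the stated gain. All numbers are copied from the abstract; the tuple layout and the `gain` helper are illustrative choices.

```python
# (runtime, task, metric, baseline, baseline score, Next-ViT score, stated gain)
CLAIMS = [
    ("TensorRT", "COCO detection", "mAP", "ResNet", 40.4, 45.9, 5.5),
    ("TensorRT", "ADE20K segmentation", "mIoU", "ResNet", 38.8, 46.5, 7.7),
    ("CoreML", "COCO detection", "mAP", "EfficientFormer", 42.6, 47.2, 4.6),
    ("CoreML", "ADE20K segmentation", "mIoU", "EfficientFormer", 45.1, 48.6, 3.5),
]


def gain(baseline_score, new_score):
    """Difference in points, rounded to the one decimal the abstract reports."""
    return round(new_score - baseline_score, 1)


mismatches = []
for runtime, task, metric, baseline, old, new, stated in CLAIMS:
    computed = gain(old, new)
    status = "ok" if computed == stated else "MISMATCH"
    print(f"{runtime:8s} {task:20s} vs {baseline}: +{computed} {metric} ({status})")
    if computed != stated:
        mismatches.append((runtime, task))
```

The mIoU gains are differences in percentage points (for example 46.5% minus 38.8%), not relative improvements.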

Code

bytedance/next-vit (official) · 519 stars

rwightman/pytorch-image-models · 29,774 stars

wilile26811249/Next-ViT

IMvision12/NextViT-tf

Tasks

Image Classification

Datasets

ImageNet

MS COCO

ADE20K

ImageNet-1K

Results from the Paper

Ranked #281 on Image Classification on ImageNet


Methods

1x1 Convolution • Absolute Position Encodings • Adam • Average Pooling • Batch Normalization • Bottleneck Residual Block • BPE • Convolution • Dense Connections • Dropout • Global Average Pooling • Kaiming Initialization • Label Smoothing • Layer Normalization • Linear Layer • Max Pooling • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Block • Residual Connection • ResNet • Scaled Dot-Product Attention • Softmax • SPEED • Transformer • Vision Transformer
