Scalable Vision Transformers with Hierarchical Pooling

The recently proposed Vision Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, current ViT models maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) that progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature map downsampling in Convolutional Neural Networks (CNNs). The reduced sequence length brings a notable benefit: we can increase model capacity by scaling the depth, width, resolution, and patch size without introducing extra computational complexity. Moreover, we empirically find that the average-pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of HVT, we conduct extensive experiments on the image classification task. With comparable FLOPs, HVT outperforms competitive baselines on the ImageNet and CIFAR-100 datasets. Code is available at https://github.com/MonashAI/HVT
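
The following is a minimal sketch, in PyTorch, of the two ideas described in the abstract: pooling the patch-token sequence between transformer stages (analogous to CNN feature-map downsampling) and classifying from the average-pooled tokens rather than a class token. It is not the authors' implementation (see the repository above for that); all module and parameter names here are hypothetical, and details such as how positional information is handled after pooling are simplified.

```python
import torch
import torch.nn as nn

class PooledViTStage(nn.Module):
    """A stack of transformer blocks followed by token-sequence downsampling."""
    def __init__(self, dim, depth, num_heads, pool_stride=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # 1D max pooling along the token axis shrinks the sequence length.
        self.pool = nn.MaxPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, x):               # x: (batch, num_tokens, dim)
        x = self.blocks(x)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # pool over tokens
        return x

class TinyHierarchicalViT(nn.Module):
    """Toy hierarchical ViT: no class token, head on average-pooled tokens."""
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 num_heads=3, num_classes=1000, stage_depths=(4, 4, 4)):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        num_patches = (image_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.stages = nn.ModuleList(
            [PooledViTStage(dim, d, num_heads) for d in stage_depths])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):          # images: (batch, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        x = x + self.pos_embed
        for stage in self.stages:       # sequence length shrinks each stage
            x = stage(x)
        x = self.norm(x).mean(dim=1)    # average-pooled tokens, no class token
        return self.head(x)

if __name__ == "__main__":
    model = TinyHierarchicalViT()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                 # torch.Size([2, 1000])
```

Because each stage halves the token count, later attention layers operate on shorter sequences, which is what lets depth, width, resolution, or patch size be scaled up at roughly constant FLOPs.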

Published at ICCV 2021.
Task | Dataset | Model | Metric | Value | Global Rank
Image Classification | ImageNet | HVT-S-1 | Top-1 Accuracy | 78.00% | #788
Image Classification | ImageNet | HVT-S-1 | Number of params | 21.74M | #552
Image Classification | ImageNet | HVT-S-1 | GFLOPs | 2.4 | #159
Image Classification | ImageNet | HVT-Ti-1 | Top-1 Accuracy | 69.64% | #952
Image Classification | ImageNet | HVT-Ti-1 | Number of params | 5.74M | #432
Image Classification | ImageNet | HVT-Ti-1 | GFLOPs | 0.64 | #76
Efficient ViTs | ImageNet-1K (with DeiT-S) | HVT-S-1 | Top-1 Accuracy | 78.3 | #38
Efficient ViTs | ImageNet-1K (with DeiT-S) | HVT-S-1 | GFLOPs | 2.7 | #17
Efficient ViTs | ImageNet-1K (with DeiT-T) | HVT-Ti-1 | Top-1 Accuracy | 69.6 | #22
Efficient ViTs | ImageNet-1K (with DeiT-T) | HVT-Ti-1 | GFLOPs | 0.6 | #1