Fast Vision Transformers with HiLo Attention

26 May 2022 · Zizheng Pan, Jianfei Cai, Bohan Zhuang

Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. However, their efficient designs are mostly guided by the indirect metric of computational complexity (FLOPs), which has a clear gap from direct metrics such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a standard multi-head self-attention layer ignores this distinction between frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and another group encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map. Benefiting from the efficient design for both groups, we show that HiLo is superior to the existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and CPUs. For example, HiLo is 1.4x faster than spatial reduction attention and 1.6x faster than local window attention on CPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/ziplab/LITv2.
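The head-splitting scheme described above maps naturally onto code. Below is a minimal PyTorch-style sketch of the HiLo idea, not the paper's implementation (that lives in the linked repository); the class name HiLoSketch, the alpha split ratio, the window_size argument and the simplified single projection per path are assumptions made for illustration only.

```python
# Minimal sketch of the HiLo attention idea, written against PyTorch.
# This is NOT the official implementation (see github.com/ziplab/LITv2);
# names and layout here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HiLoSketch(nn.Module):
    """Splits heads into a high-frequency (local window) group and a
    low-frequency (global, pooled-K/V) group. Assumes 0 < alpha < 1 and
    feature-map sides divisible by window_size."""

    def __init__(self, dim, num_heads=8, window_size=2, alpha=0.5):
        super().__init__()
        head_dim = dim // num_heads
        self.ws = window_size
        self.scale = head_dim ** -0.5
        self.l_heads = int(num_heads * alpha)    # low-frequency heads
        self.h_heads = num_heads - self.l_heads  # high-frequency heads
        self.l_dim = self.l_heads * head_dim
        self.h_dim = self.h_heads * head_dim
        self.h_qkv = nn.Linear(dim, self.h_dim * 3)
        self.h_proj = nn.Linear(self.h_dim, self.h_dim)
        self.l_q = nn.Linear(dim, self.l_dim)
        self.l_kv = nn.Linear(dim, self.l_dim * 2)
        self.l_proj = nn.Linear(self.l_dim, self.l_dim)

    def hifi(self, x):
        # High-frequency path: self-attention inside each non-overlapping window.
        B, H, W, C = x.shape
        hg, wg = H // self.ws, W // self.ws
        n_win, win = hg * wg, self.ws * self.ws
        x = x.reshape(B, hg, self.ws, wg, self.ws, C).transpose(2, 3)
        x = x.reshape(B, n_win, win, C)
        qkv = self.h_qkv(x).reshape(B, n_win, win, 3, self.h_heads, -1)
        q, k, v = qkv.permute(3, 0, 1, 4, 2, 5)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(2, 3).reshape(B, hg, wg, self.ws, self.ws, self.h_dim)
        out = out.transpose(2, 3).reshape(B, H, W, self.h_dim)
        return self.h_proj(out)

    def lofi(self, x):
        # Low-frequency path: every position queries keys/values that are
        # average-pooled once per window, so global attention stays cheap.
        B, H, W, C = x.shape
        q = self.l_q(x).reshape(B, H * W, self.l_heads, -1).permute(0, 2, 1, 3)
        pooled = F.avg_pool2d(x.permute(0, 3, 1, 2), self.ws).permute(0, 2, 3, 1)
        kv = self.l_kv(pooled).reshape(B, -1, 2, self.l_heads, self.l_dim // self.l_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, H, W, self.l_dim)
        return self.l_proj(out)

    def forward(self, x):
        # x: (B, H, W, C) feature map; outputs of both head groups are concatenated.
        return torch.cat([self.hifi(x), self.lofi(x)], dim=-1)


# Illustrative usage (shapes are hypothetical):
x = torch.randn(2, 14, 14, 96)
print(HiLoSketch(dim=96, num_heads=8, window_size=2, alpha=0.5)(x).shape)  # (2, 14, 14, 96)
```

In this sketch, the low-frequency group attends to only (H/s)·(W/s) average-pooled tokens per image rather than all H·W positions, which is what keeps its global attention cheap, while the high-frequency group only pays for attention within each small window.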


Results from the Paper


Image Classification on ImageNet:

LITv2-B|384: Top 1 Accuracy 84.7% (global rank #281), 87M params (#822), 39.7 GFLOPs (#412)
LITv2-B: Top 1 Accuracy 83.6% (global rank #378), 13.2 GFLOPs (#324)
LITv2-M: Top 1 Accuracy 83.3% (global rank #403), 49M params (#720), 7.5 GFLOPs (#255)
LITv2-S: Top 1 Accuracy 82% (global rank #530), 28M params (#629), 3.7 GFLOPs (#184)
