Rethinking Local Perception in Lightweight Vision Transformer

31 Mar 2023 · Qihang Fan, Huaibo Huang, Jiyang Guan, Ran He

Vision Transformers (ViTs) have been shown to be effective in various vision tasks. However, scaling them down to a mobile-friendly size leads to significant performance degradation, so developing lightweight vision transformers has become a crucial area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between the globally shared weights often used in vanilla convolutional operators and the token-specific, context-aware weights that appear in attention, and then proposes a simple and effective module to capture high-frequency local information. Specifically, CloFormer introduces AttnConv, a convolution operator in the style of attention. AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features. In CloFormer, the combination of AttnConv and vanilla attention, which uses pooling to reduce FLOPs, enables the model to perceive both high-frequency and low-frequency information. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the superiority of CloFormer. The code is available at https://github.com/qhfan/CloFormer.
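To make the idea concrete, below is a minimal PyTorch sketch of an attention-style convolution along the lines described above: a shared-weight depthwise convolution aggregates local information from V, while token-specific context-aware weights derived from the Q·K interaction modulate the aggregated features. The module name `AttnConvSketch`, the kernel size, and the Tanh-gated weight generator are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class AttnConvSketch(nn.Module):
    """Sketch of an attention-style convolution: shared weights for local
    aggregation plus context-aware, token-specific modulation weights."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # Globally shared weights: a depthwise conv aggregating local information in V.
        self.local_agg = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        # Context-aware weight generation from the Q*K interaction (assumed design).
        self.ctx = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.Conv2d(dim, dim, 1),
            nn.Tanh(),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        q, k, v = self.q(x), self.k(x), self.v(x)
        v_local = self.local_agg(v)        # shared-weight local aggregation
        ctx_weights = self.ctx(q * k)      # token-specific, context-aware weights
        return self.proj(ctx_weights * v_local)


# Usage example on a dummy feature map.
if __name__ == "__main__":
    feat = torch.randn(1, 64, 56, 56)
    out = AttnConvSketch(dim=64)(feat)
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```

In the full model, a branch like this would sit alongside a pooled vanilla-attention branch, so the high-frequency local path and the low-frequency global path complement each other.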


Results from the Paper


Image Classification on ImageNet

Model           Top-1 Accuracy   Params   GFLOPs
CloFormer-S     81.6%            12.3M    2.0
CloFormer-XXS   77.0%            4.2M     0.6
CloFormer-XS    79.8%            7.2M     1.1
