TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	Refiner-ViT-L	Top 1 Accuracy	86.03	# 174
Image Classification	ImageNet	Refiner-ViT-L	Number of params	81M	# 806
Image Classification	ImageNet	Refiner-ViT-L	Hardware Burden	None	# 1
Image Classification	ImageNet	Refiner-ViT-L	Operations per network pass	None	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/refiner-refining-self-attention-for-vision/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=refiner-refining-self-attention-for-vision)`

Refiner: Refining Self-attention for Vision Transformers

7 Jun 2021 · Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, Jiashi Feng

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Unlike existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion, which projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
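
The two steps named in the abstract, attention expansion and convolution over the attention maps, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the official repository for that). The function names, tensor shapes, zero padding, the use of one kernel per expanded head, and the final linear reduction back to the original number of heads are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along one axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def headwise_conv_same(maps, kernels):
    # maps: (H, N, N) attention maps; kernels: (H, k, k) with k odd.
    # Each head is filtered by its own kernel (cross-correlation, as in
    # deep learning libraries) with zero padding, so the output keeps
    # the (H, N, N) shape.
    n = maps.shape[-1]
    ksize = kernels.shape[-1]
    pad = ksize // 2
    padded = np.pad(maps, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(maps)
    for i in range(ksize):
        for j in range(ksize):
            out += kernels[:, i, j, None, None] * padded[:, i:i + n, j:j + n]
    return out

def refined_attention(x, w_q, w_k, w_v, w_expand, conv_kernels, w_reduce):
    # x: (N, D) tokens; w_q, w_k, w_v: (H, D, d) per-head projections.
    # w_expand: (H_exp, H), conv_kernels: (H_exp, k, k), w_reduce: (H, H_exp).
    q = np.einsum("nd,hde->hne", x, w_q)
    k = np.einsum("nd,hde->hne", x, w_k)
    v = np.einsum("nd,hde->hne", x, w_v)
    scores = np.einsum("hne,hme->hnm", q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)                          # (H, N, N)
    # Attention expansion: linearly mix the H maps into H_exp maps.
    expanded = np.einsum("eh,hnm->enm", w_expand, attn)      # (H_exp, N, N)
    # Convolution over each expanded map to augment local patterns.
    refined = headwise_conv_same(expanded, conv_kernels)     # (H_exp, N, N)
    # Assumed reduction step: project back to H maps before applying to v.
    reduced = np.einsum("he,enm->hnm", w_reduce, refined)    # (H, N, N)
    out = np.einsum("hnm,hme->hne", reduced, v)              # (H, N, d)
    return out.transpose(1, 0, 2).reshape(x.shape[0], -1), reduced

# Usage with random weights: 5 tokens, 2 heads expanded to 6.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(2, 8, 4)) for _ in range(3))
out, maps = refined_attention(
    x, w_q, w_k, w_v,
    w_expand=rng.normal(size=(6, 2)),
    conv_kernels=rng.normal(size=(6, 3, 3)),
    w_reduce=rng.normal(size=(2, 6)),
)
```

With an identity expansion, a centered delta kernel, and an identity reduction, the sketch reduces to ordinary multi-head self-attention, which makes it easy to see what the refinement adds on top.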

Code

zhoudaquan/Refiner_ViT official

106

Tasks

Image Classification

Datasets

ImageNet

GLUE

SST

MultiNLI

QNLI

CoLA

Results from the Paper

Ranked #174 on Image Classification on ImageNet

Methods

Linear Layer • Softmax
