Learned Queries for Efficient Local Attention

CVPR 2022 · Moab Arar, Ariel Shamir, Amit H. Bermano

Vision Transformers (ViT) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models locally employ self-attention on non-interleaving windows. This relaxation reduces the complexity to be linear in the input size; however, it limits cross-window interaction, hurting model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like convolutions. The key idea behind QnA is to introduce learned queries, which allow fast and efficient implementation. We verify the effectiveness of our layer by incorporating it into a hierarchical vision transformer model. We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models. Finally, our layer scales especially well with window size, requiring up to 10x less memory while being up to 5x faster than existing methods. The code is publicly available at https://github.com/moabarar/qna.
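To make the idea concrete, here is a minimal, single-head PyTorch sketch of a local attention layer driven by learned queries. The class name, shapes, and the use of `F.unfold` to gather overlapping windows are illustrative assumptions, not the authors' implementation; see the repository linked above for the official code.

```python
# Illustrative sketch of a QnA-style layer: queries are learned parameters shared
# across all windows, while keys and values come from the input feature map.
# Single-head, unoptimized; names and shapes are assumptions for clarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedQueryLocalAttention(nn.Module):
    def __init__(self, dim, window=3, stride=1, num_queries=1):
        super().__init__()
        self.window, self.stride = window, stride
        self.scale = dim ** -0.5
        # Learned queries: the input only supplies keys and values.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        k = self.to_k(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        v = self.to_v(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # Extract overlapping windows, like a convolution's receptive fields.
        pad = self.window // 2
        k_win = F.unfold(k, self.window, padding=pad, stride=self.stride)  # (B, C*w*w, L)
        v_win = F.unfold(v, self.window, padding=pad, stride=self.stride)
        L = k_win.shape[-1]
        k_win = k_win.view(B, C, self.window ** 2, L)
        v_win = v_win.view(B, C, self.window ** 2, L)
        # Attention of the learned queries over each window's keys.
        attn = torch.einsum('qc,bcnl->bqnl', self.queries, k_win) * self.scale
        attn = attn.softmax(dim=2)
        out = torch.einsum('bqnl,bcnl->bqcl', attn, v_win)   # (B, Q, C, L)
        out = out.mean(dim=1)                                # merge query outputs
        Ho = (H + 2 * pad - self.window) // self.stride + 1
        Wo = (W + 2 * pad - self.window) // self.stride + 1
        return out.view(B, C, Ho, Wo)
```

With stride greater than 1, the same layer acts as a downsampling block, which is how a local attention layer of this kind would slot into a hierarchical backbone; the official repository contains the actual multi-head, optimized version.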


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Image Classification | ImageNet | QnA-ViT-Base | Top 1 Accuracy | 83.7% | #365 |
| Image Classification | ImageNet | QnA-ViT-Base | Number of params | 56M | #748 |
| Image Classification | ImageNet | QnA-ViT-Base | GFLOPs | 9.7 | #294 |
| Image Classification | ImageNet | QnA-ViT-Small | Top 1 Accuracy | 83.2% | #413 |
| Image Classification | ImageNet | QnA-ViT-Small | Number of params | 25M | #587 |
| Image Classification | ImageNet | QnA-ViT-Small | GFLOPs | 4.4 | #208 |
| Image Classification | ImageNet | QnA-ViT-Tiny | Top 1 Accuracy | 81.7% | #563 |
| Image Classification | ImageNet | QnA-ViT-Tiny | Number of params | 16M | #519 |
| Image Classification | ImageNet | QnA-ViT-Tiny | GFLOPs | 2.5 | #161 |
