TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Instance Segmentation	COCO minival	BoTNet 200 (Mask R-CNN, single scale, 72 epochs)	mask AP	44.4	# 52
Object Detection	COCO minival	BoTNet 200 (Mask R-CNN, single scale, 72 epochs)	box AP	49.7	# 78
Object Detection	COCO minival	BoTNet 200 (Mask R-CNN, single scale, 72 epochs)	AP50	71.3	# 12
Object Detection	COCO minival	BoTNet 200 (Mask R-CNN, single scale, 72 epochs)	AP75	54.6	# 19
Instance Segmentation	COCO minival	BoTNet 152 (Mask R-CNN, single scale, 72 epochs)	mask AP	43.7	# 55
Object Detection	COCO minival	BoTNet 152 (Mask R-CNN, single scale, 72 epochs)	box AP	49.5	# 79
Object Detection	COCO minival	BoTNet 152 (Mask R-CNN, single scale, 72 epochs)	AP50	71	# 13
Object Detection	COCO minival	BoTNet 152 (Mask R-CNN, single scale, 72 epochs)	AP75	54.2	# 20
Instance Segmentation	COCO minival	BoTNet 50 (72 epochs)	mask AP	40.7	# 68
Object Detection	COCO minival	BoTNet 50 (72 epochs)	box AP	45.9	# 103
Image Classification	ImageNet	SENet-350	Top 1 Accuracy	83.8%	# 358
Image Classification	ImageNet	BoTNet T5	Top 1 Accuracy	83.5%	# 391
Image Classification	ImageNet	BoTNet T5	GFLOPs	19.3	# 364
Image Classification	ImageNet	BoTNet T4	Top 1 Accuracy	82.8%	# 453
Image Classification	ImageNet	BoTNet T4	Number of params	54.7M	# 741
Image Classification	ImageNet	BoTNet T4	GFLOPs	10.9	# 304
Image Classification	ImageNet	SENet-101	Top 1 Accuracy	81.4%	# 586
Image Classification	ImageNet	SENet-101	Number of params	49.2M	# 721
Image Classification	ImageNet	ResNet-101	Top 1 Accuracy	80%	# 664
Image Classification	ImageNet	ResNet-101	Number of params	44.4M	# 700
Image Classification	ImageNet	SENet-50	Top 1 Accuracy	79.4%	# 695
Image Classification	ImageNet	SENet-50	Number of params	28.02M	# 636
Image Classification	ImageNet	SENet-152	Top 1 Accuracy	82.2%	# 510
Image Classification	ImageNet	SENet-152	Number of params	66.6M	# 783
Image Classification	ImageNet	BoTNet T3	Top 1 Accuracy	81.7%	# 563
Image Classification	ImageNet	BoTNet T3	Number of params	33.5M	# 655
Image Classification	ImageNet	BoTNet T3	GFLOPs	7.3	# 253
Image Classification	ImageNet	BoTNet T7	Top 1 Accuracy	84.7%	# 281
Image Classification	ImageNet	BoTNet T7	Number of params	75.1M	# 798
Image Classification	ImageNet	ResNet-50	Top 1 Accuracy	78.8%	# 738
Image Classification	ImageNet	ResNet-50	Number of params	25.5M	# 596
Image Classification	ImageNet	BoTNet T7-320	Top 1 Accuracy	84.2%	# 313
Image Classification	ImageNet	BoTNet T6	Top 1 Accuracy	84%	# 336
Image Classification	ImageNet	BoTNet T6	Number of params	53.9M	# 736
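
Each model's metrics are spread over several rows of the table above, so comparing accuracy against parameter count means regrouping them. The snippet below is a minimal sketch of one way to do that in Python, using a handful of rows copied from the table. The names `TABLE`, `parse_value`, and `pivot` are illustrative and are not part of any Papers with Code tooling.

```python
from collections import defaultdict

# A few rows copied from the table above (tab-separated, same column order).
TABLE = "\n".join([
    "Image Classification\tImageNet\tBoTNet T3\tTop 1 Accuracy\t81.7%\t# 563",
    "Image Classification\tImageNet\tBoTNet T3\tNumber of params\t33.5M\t# 655",
    "Image Classification\tImageNet\tBoTNet T3\tGFLOPs\t7.3\t# 253",
    "Image Classification\tImageNet\tBoTNet T7\tTop 1 Accuracy\t84.7%\t# 281",
    "Image Classification\tImageNet\tBoTNet T7\tNumber of params\t75.1M\t# 798",
    "Image Classification\tImageNet\tResNet-50\tTop 1 Accuracy\t78.8%\t# 738",
    "Image Classification\tImageNet\tResNet-50\tNumber of params\t25.5M\t# 596",
    "Image Classification\tImageNet\tSENet-50\tTop 1 Accuracy\t79.4%\t# 695",
    "Image Classification\tImageNet\tSENet-50\tNumber of params\t28.02M\t# 636",
])


def parse_value(raw):
    """Turn '81.7%', '33.5M' or '7.3' into a float (unit suffix dropped)."""
    return float(raw.rstrip("%M"))


def pivot(table_text):
    """Group rows by model: {model: {metric name: (value, global rank)}}."""
    models = defaultdict(dict)
    for line in table_text.splitlines():
        task, dataset, model, metric, value, rank = line.split("\t")
        models[model][metric] = (parse_value(value), int(rank.lstrip("# ")))
    return dict(models)


by_model = pivot(TABLE)
for name, metrics in sorted(by_model.items()):
    acc = metrics["Top 1 Accuracy"][0]
    params = metrics["Number of params"][0]
    print(f"{name}: {acc:.1f}% top-1, {params:.1f}M params")
```

The sketch assumes every row has all six columns filled in, as in the table above. Rows with blank task, dataset, or model cells would need those values carried down from the preceding row first.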

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bottleneck-transformers-for-visual/instance-segmentation-on-coco-minival)](https://paperswithcode.com/sota/instance-segmentation-on-coco-minival?p=bottleneck-transformers-for-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bottleneck-transformers-for-visual/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=bottleneck-transformers-for-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bottleneck-transformers-for-visual/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=bottleneck-transformers-for-visual)`

Bottleneck Transformers for Visual Recognition

CVPR 2021 · Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani

We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework, surpassing the previous best published single model and single scale results of ResNeSt evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 1.64x faster in compute time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
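
To make the phrase "global self-attention" in the abstract concrete, the following is a minimal sketch of scaled dot-product attention applied across every position of a small feature map. It is not the authors' implementation. It assumes a single head, omits any position encoding, and uses plain Python lists so that it runs without extra libraries. The names `global_self_attention`, `w_q`, `w_k`, and `w_v` are hypothetical; the three weight matrices stand in for per-position linear projections.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def global_self_attention(feature_map, w_q, w_k, w_v):
    """Single-head self-attention over all positions of an [H][W][C] map.

    w_q, w_k, w_v are [C][D] projection matrices (illustrative names).
    Returns a nested list shaped [H][W][D].
    """
    height, width = len(feature_map), len(feature_map[0])
    # Flatten the grid so each of the H*W positions becomes one token.
    tokens = [pixel for row in feature_map for pixel in row]
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    # Every position is scored against every position (global, not a local window).
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    out = matmul(weights, v)
    # Restore the H x W layout.
    return [out[r * width:(r + 1) * width] for r in range(height)]


# Tiny 2 x 2 map with 2 channels and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [0.0, 0.0]]]
out = global_self_attention(fmap, identity, identity, identity)
```

Because each output vector is a softmax-weighted average of the value vectors at all positions, a 3x3 convolution's fixed local window is replaced by a receptive field covering the whole map. The cost is an attention matrix of size (H·W) × (H·W), which grows quadratically with the number of positions.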

Code

Repository	Stars
rwightman/pytorch-image-models	29,774
BR-IDL/PaddleViT	1,185
The-AI-Summer/self_attention	1,141
lucidrains/bottleneck-transformer-p…	667
leondgarse/keras_cv_attention_models	556

See all 13 implementations

Tasks

Image Classification

Instance Segmentation

Object Detection

Segmentation

Datasets

ImageNet

MS COCO

Results from the Paper

Ranked #52 on Instance Segmentation on COCO minival

Methods

1x1 Convolution • Average Pooling • Batch Normalization • Bottleneck Transformer • Bottleneck Transformer Block • Channel-wise Soft Attention • Convolution • Cosine Annealing • Dense Connections • Label Smoothing • Mask R-CNN • Max Pooling • Pointwise Convolution • RandAugment • Random Resized Crop • ReLU • Residual Connection • ResNeSt • RoIAlign • RPN • Scaled Dot-Product Attention • SGD with Momentum • Sigmoid Activation • SiLU • Softmax • Split Attention • Squeeze-and-Excitation Block • Weight Decay
