Focal Modulation Networks

22 Mar 2022 · Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao

We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges; (ii) gated aggregation to selectively gather contexts for each query token based on its content; and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show that FocalNets outperform state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) at similar computational cost on image classification, object detection, and segmentation. Specifically, FocalNets at tiny and base sizes achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K at resolution 224, they attain 86.5% and 87.3% top-1 accuracy when fine-tuned at resolutions 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with the 1× schedule outperforms its Swin counterpart by 2.1 points and already surpasses Swin trained with the 3× schedule (49.0 vs. 48.5). For semantic segmentation with UPerNet, FocalNet base at single scale outperforms Swin by 2.4 and beats Swin evaluated at multi-scale (50.5 vs. 49.7). Using large FocalNet and Mask2Former, we achieve 58.5 mIoU for ADE20K semantic segmentation and 57.9 PQ for COCO panoptic segmentation. Using huge FocalNet and DINO, we achieve 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing a new SoTA on top of much larger attention-based models like SwinV2-G and BEiT-3. Code and checkpoints are available at https://github.com/microsoft/FocalNet.
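The three components map naturally onto a few lines of code. Below is a minimal PyTorch sketch of the focal modulation mechanism as described in the abstract; it is not the authors' released implementation, and the class name, the `focal_levels`/`focal_window` parameters, and the use of 1×1 convolutions for the projections are illustrative assumptions (see the repository above for the official code).

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """Minimal sketch of focal modulation for a (B, C, H, W) feature map."""

    def __init__(self, dim, focal_levels=3, focal_window=3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection producing the query, the initial context, and
        # one gate map per focal level (plus one for global context).
        self.f = nn.Conv2d(dim, 2 * dim + (focal_levels + 1), kernel_size=1)
        # (i) hierarchical contextualization: stacked depth-wise convolutions
        # with growing kernel sizes to cover short- to long-range context.
        self.context_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=focal_window + 2 * k,
                          padding=(focal_window + 2 * k) // 2, groups=dim),
                nn.GELU(),
            )
            for k in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)     # context projection
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)  # output projection

    def forward(self, x):
        b, c, h, w = x.shape
        q, ctx, gates = torch.split(self.f(x), [c, c, self.focal_levels + 1], dim=1)
        # (ii) gated aggregation: sum per-level contexts, weighted by
        # query-dependent gates so each token selects its own mix of ranges.
        ctx_all = 0
        for level, layer in enumerate(self.context_layers):
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        # Global average-pooled context acts as the final (longest-range) level.
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        # (iii) element-wise modulation: inject aggregated context into the query.
        return self.proj(q * self.h(ctx_all))

# Example: modulate a (2, 96, 56, 56) feature map; output shape is unchanged.
x = torch.randn(2, 96, 56, 56)
out = FocalModulation(dim=96)(x)
```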

Results from the Paper


Ranked #8 on Object Detection on COCO minival (using extra training data)

Task                   Dataset        Model                                   Metric           Value  Global Rank
Semantic Segmentation  ADE20K         FocalNet-L (Mask2Former)                Validation mIoU  58.5   #13
Object Detection       COCO minival   FocalNet-H (DINO)                       box AP           64.2   #8
Panoptic Segmentation  COCO minival   FocalNet-L (Mask2Former, 200 queries)   PQ               57.9   #11
Panoptic Segmentation  COCO minival   FocalNet-L (Mask2Former, 200 queries)   AP               48.4   #9
Object Detection       COCO minival   FocalNet-T (LRF, Cascade Mask R-CNN)    box AP           51.5   #70
Object Detection       COCO minival   FocalNet-T (LRF, Cascade Mask R-CNN)    AP50             70.3   #20
Object Detection       COCO minival   FocalNet-T (LRF, Cascade Mask R-CNN)    AP75             56.0   #12
Object Detection       COCO minival   FocalNet-T (SRF, Cascade Mask R-CNN)    AP50             70.1   #21
Object Detection       COCO minival   FocalNet-T (SRF, Cascade Mask R-CNN)    AP75             55.8   #14
Object Detection       COCO test-dev  FocalNet-H (DINO)                       box mAP          64.4   #9
