Mask R-CNN

facebookresearch / detectron2

Last updated on Feb 19, 2021

Parameters 55 Million

FLOPs 937 Billion

File Size 210.10 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 2.04 days

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	138363239
Max Iter	270000
lr sched	3x
FLOPs Input No	100
Backbone Layers	101
train time (s/iter)	0.652
Training Memory (GB)	6.3
inference time (s/im)	0.145

Parameters 191 Million

inference time (s/im) 0.092

File Size 730.60 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 1.7 days

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	138363294
Max Iter	270000
lr sched	3x
Backbone Layers	101
train time (s/iter)	0.545
Training Memory (GB)	7.6
inference time (s/im)	0.092

Parameters 70 Million

FLOPs 527 Billion

File Size 265.90 MB

Training Data

Training Resources 8 NVIDIA V100 GPUs

Training Time 9 hours

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	144219035
Max Iter	90000
lr sched	1x
FLOPs Input No	100
Backbone Layers	101
train time (s/iter)	0.371
Training Memory (GB)	7.8
inference time (s/im)	0.114

Parameters 63 Million

FLOPs 290 Billion

File Size 242.29 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 1.06 days

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	138205316
Max Iter	270000
lr sched	3x
FLOPs Input No	100
Backbone Layers	101
train time (s/iter)	0.34
Training Memory (GB)	4.6
inference time (s/im)	0.056

Parameters 36 Million

FLOPs 890 Billion

File Size 137.42 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 15 hours

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	137259246
Max Iter	90000
lr sched	1x
FLOPs Input No	100
Backbone Layers	50
train time (s/iter)	0.584
Training Memory (GB)	5.2
inference time (s/im)	0.11

Parameters 36 Million

FLOPs 890 Billion

File Size 137.42 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 1.8 days

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	137849525
Max Iter	270000
lr sched	3x
FLOPs Input No	100
Backbone Layers	50
train time (s/iter)	0.575
Training Memory (GB)	5.2
inference time (s/im)	0.111

Parameters 33 Million

inference time (s/im) 0.081

File Size 127.00 MB

Training Data PASCAL VOC 2007

Training Resources 8 NVIDIA V100 GPUs

Training Time 3 hours

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	142202221
Max Iter	18000
Warmup Steps	100
Backbone Layers	50
train time (s/iter)	0.537
Training Memory (GB)	4.8
inference time (s/im)	0.081

Parameters 172 Million

inference time (s/im) 0.076

File Size 657.92 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 12 hours

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	137260150
Max Iter	90000
lr sched	1x
Backbone Layers	50
train time (s/iter)	0.471
Training Memory (GB)	6.5
inference time (s/im)	0.076

Parameters 172 Million

inference time (s/im) 0.076

File Size 657.92 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 1.47 days

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	137849551
Max Iter	270000
lr sched	3x
Backbone Layers	50
train time (s/iter)	0.47
Training Memory (GB)	6.5
inference time (s/im)	0.076

Parameters 44 Million

inference time (s/im) 0.043

File Size 169.60 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 7 hours

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	137260431
Max Iter	90000
lr sched	1x
Backbone Layers	50
train time (s/iter)	0.261
Training Memory (GB)	3.4
inference time (s/im)	0.043

Parameters 50 Million

FLOPs 460 Billion

File Size 193.21 MB

Training Data

Training Resources 8 NVIDIA V100 GPUs

Training Time 7 hours

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	144219072
Max Iter	90000
lr sched	1x
FLOPs Input No	100
Backbone Layers	50
train time (s/iter)	0.292
Training Memory (GB)	7.1
inference time (s/im)	0.107

Parameters 44 Million

inference time (s/im) 0.043

File Size 169.60 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 20 hours

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	137849600
Max Iter	270000
lr sched	3x
Backbone Layers	50
train time (s/iter)	0.261
Training Memory (GB)	3.4
inference time (s/im)	0.043

Parameters 44 Million

FLOPs 464 Billion

File Size 168.13 MB

Training Data Cityscapes

Training Resources 8 NVIDIA V100 GPUs

Training Time 2 hours

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID	142423278
LR	0.01
Max Iter	24000
FLOPs Input No	100
Backbone Layers	50
train time (s/iter)	0.24
Training Memory (GB)	4.4
inference time (s/im)	0.078

Parameters 114 Million

FLOPs 686 Billion

File Size 435.04 MB

Training Data

Training Resources 8 NVIDIA V100 GPUs

Training Time 18 hours

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNeXt
ID	144219108
Max Iter	90000
lr sched	1x
FLOPs Input No	100
Backbone Layers	101
train time (s/iter)	0.712
Training Memory (GB)	10.2
inference time (s/im)	0.151

Parameters 107 Million

inference time (s/im) 0.103

File Size 411.43 MB

Training Data MS COCO

Training Resources 8 NVIDIA V100 GPUs

Training Time 2.16 days

Architecture	Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNeXt
ID	139653917
Max Iter	270000
lr sched	3x
Backbone Layers	101
train time (s/iter)	0.69
Training Memory (GB)	7.2
inference time (s/im)	0.103
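
The Training Time values listed above appear consistent with Max Iter multiplied by train time (s/iter). For ID 139653917, for example, 270000 iterations at 0.69 s/iter is about 186,300 s, or roughly 2.16 days. The following is a minimal sketch of that arithmetic; the function name `training_seconds` is introduced here for illustration and is not part of Detectron2.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def training_seconds(max_iter, sec_per_iter):
    """Estimate wall-clock training time as iterations times seconds per iteration."""
    return max_iter * sec_per_iter


# ID 139653917 (3x schedule): Max Iter 270000, train time 0.69 s/iter.
days = training_seconds(270000, 0.69) / SECONDS_PER_DAY
print(f"{days:.2f} days")

# ID 144219035 (1x schedule): Max Iter 90000, train time 0.371 s/iter.
hours = training_seconds(90000, 0.371) / SECONDS_PER_HOUR
print(f"{hours:.1f} hours")
```

This is only a sanity check on the listed figures. It assumes the per-iteration time was measured on the same 8-GPU setup and ignores evaluation and checkpointing overhead.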
SHOW MORE
SHOW LESS

Summary

Mask R-CNN extends Faster R-CNN to solve instance segmentation tasks. It achieves this by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. In principle, Mask R-CNN is an intuitive extension of Faster R-CNN, but constructing the mask branch properly is critical for good results.

Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, Mask R-CNN utilises a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
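
To make the difference concrete, here is a small pure-Python sketch that reads a feature map at a fractional location in two ways: by snapping to the nearest integer cell (a simplified stand-in for the quantization in RoIPool) and by bilinear interpolation (the sampling idea behind RoIAlign). The function names and the toy feature map are assumptions made for illustration. This is not Detectron2's RoIAlign operator, which samples several points per output bin and aggregates them.

```python
import math


def quantized_sample(feature, y, x):
    """Snap (y, x) to the nearest integer cell, discarding the fractional offset."""
    h, w = len(feature), len(feature[0])
    yi = min(max(int(math.floor(y + 0.5)), 0), h - 1)
    xi = min(max(int(math.floor(x + 0.5)), 0), w - 1)
    return feature[yi][xi]


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at continuous (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp to the valid coordinate range of the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


# Toy 4x4 feature map whose value encodes its position: value = 10 * row + col.
feature = [[10 * r + c for c in range(4)] for r in range(4)]

print(quantized_sample(feature, 1.5, 2.25))  # value of the cell the location snaps to
print(bilinear_sample(feature, 1.5, 2.25))   # value at the exact fractional location
```

On this linear ramp the interpolated value tracks the fractional position exactly, while the quantized lookup returns the value of a neighbouring cell. That offset is the kind of misalignment the mask branch is sensitive to.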

How do I load this model?

There are several Mask R-CNN models available in Detectron2, with different backbones and learning schedules.

To load from the Detectron2 model zoo:

from detectron2 import model_zoo

# trained=True builds the model and loads the model zoo's trained weights for this config.
model = model_zoo.get("COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml", trained=True)

Replace the configuration path with the variant you want to use. You can find the paths in the model summaries at the top of this page.
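
As one way to organise the variants, the sketch below maps the backbone and schedule labels used in the Results tables to configuration paths. The table `MASK_RCNN_CONFIGS` and the helpers `config_path` and `load_mask_rcnn` are hypothetical names introduced here. The paths follow the naming pattern of the example above and are assumed to match the configs shipped with Detectron2, so check them against your installed version.

```python
# Hypothetical lookup table: (backbone, schedule) -> model zoo config path.
# The paths are assumptions based on the naming pattern of the example above.
_PREFIX = "COCO-InstanceSegmentation/mask_rcnn_"
MASK_RCNN_CONFIGS = {
    ("R50-C4", "1x"): _PREFIX + "R_50_C4_1x.yaml",
    ("R50-DC5", "1x"): _PREFIX + "R_50_DC5_1x.yaml",
    ("R50-FPN", "1x"): _PREFIX + "R_50_FPN_1x.yaml",
    ("R50-C4", "3x"): _PREFIX + "R_50_C4_3x.yaml",
    ("R50-DC5", "3x"): _PREFIX + "R_50_DC5_3x.yaml",
    ("R50-FPN", "3x"): _PREFIX + "R_50_FPN_3x.yaml",
    ("R101-C4", "3x"): _PREFIX + "R_101_C4_3x.yaml",
    ("R101-DC5", "3x"): _PREFIX + "R_101_DC5_3x.yaml",
    ("R101-FPN", "3x"): _PREFIX + "R_101_FPN_3x.yaml",
    ("X101-FPN", "3x"): _PREFIX + "X_101_32x8d_FPN_3x.yaml",
}


def config_path(backbone, schedule):
    """Return the config path for a variant, or raise ValueError if it is unknown."""
    try:
        return MASK_RCNN_CONFIGS[(backbone, schedule)]
    except KeyError:
        raise ValueError(f"unknown variant: {backbone}, {schedule}") from None


def load_mask_rcnn(backbone, schedule, trained=True):
    """Build the chosen variant through the model zoo (requires Detectron2)."""
    # Imported lazily so the lookup above works without Detectron2 installed.
    from detectron2 import model_zoo
    return model_zoo.get(config_path(backbone, schedule), trained=trained)


print(config_path("R50-FPN", "3x"))
```

With Detectron2 installed, `load_mask_rcnn("R101-C4", "3x")` would be equivalent to the `model_zoo.get` call shown above.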

How do I train this model?

You can follow the Getting Started guide on Colab to see how to train a model.

You can also read the official Detectron2 documentation.

Citation

@misc{wu2019detectron2,
  author =       {Yuxin Wu and Alexander Kirillov and Francisco Massa and
                  Wan-Yen Lo and Ross Girshick},
  title =        {Detectron2},
  howpublished = {\url{https://github.com/facebookresearch/detectron2}},
  year =         {2019}
}

Results

Object Detection on COCO minival

MODEL	BOX AP
Mask R-CNN (X101-FPN, 3x)	44.3
Mask R-CNN (R101-FPN, 3x)	42.9
Mask R-CNN (R101-C4, 3x)	42.6
Mask R-CNN (R101-DC5, 3x)	41.9
Mask R-CNN (R50-FPN, 3x)	41.0
Mask R-CNN (R50-DC5, 3x)	40.0
Mask R-CNN (R50-C4, 3x)	39.8
Mask R-CNN (R50-FPN, 1x)	38.6
Mask R-CNN (R50-DC5, 1x)	38.3
Mask R-CNN (R50-C4, 1x)	36.8

Instance Segmentation on COCO minival

MODEL	MASK AP
Mask R-CNN (X101-FPN, 3x)	39.5
Mask R-CNN (R101-FPN, 3x)	38.6
Mask R-CNN (R101-DC5, 3x)	37.3
Mask R-CNN (R50-FPN, 3x)	37.2
Mask R-CNN (R101-C4, 3x)	36.7
Mask R-CNN (R50-DC5, 3x)	35.9
Mask R-CNN (R50-FPN, 1x)	35.2
Mask R-CNN (R50-C4, 3x)	34.4
Mask R-CNN (R50-DC5, 1x)	34.2
Mask R-CNN (R50-C4, 1x)	32.2

Object Detection on PASCAL VOC 2007

MODEL	AP50	BOX AP
Mask R-CNN (R50-C4, VOC)	80.3	51.9

Instance Segmentation on Cityscapes test

MODEL	MASK AP
Mask R-CNN (R50-FPN, Cityscapes)	36.5