Mask R-CNN

Last updated on Feb 19, 2021

Mask R-CNN (R101-C4, 3x)

Parameters 55 Million
FLOPs 937 Billion
File Size 210.10 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 2.04 days

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 138363239
Max Iter 270000
lr sched 3x
FLOPs Input No 100
Backbone Layers 101
train time (s/iter) 0.652
Training Memory (GB) 6.3
inference time (s/im) 0.145
Mask R-CNN (R101-DC5, 3x)

Parameters 191 Million
File Size 730.60 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 1.7 days

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 138363294
Max Iter 270000
lr sched 3x
Backbone Layers 101
train time (s/iter) 0.545
Training Memory (GB) 7.6
inference time (s/im) 0.092
Mask R-CNN (R101-FPN, 1x, LVIS)

Parameters 70 Million
FLOPs 527 Billion
File Size 265.90 MB
Training Data LVIS
Training Resources 8 NVIDIA V100 GPUs
Training Time 9 hours

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 144219035
Max Iter 90000
lr sched 1x
FLOPs Input No 100
Backbone Layers 101
train time (s/iter) 0.371
Training Memory (GB) 7.8
inference time (s/im) 0.114
Mask R-CNN (R101-FPN, 3x)

Parameters 63 Million
FLOPs 290 Billion
File Size 242.29 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 1.06 days

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 138205316
Max Iter 270000
lr sched 3x
FLOPs Input No 100
Backbone Layers 101
train time (s/iter) 0.34
Training Memory (GB) 4.6
inference time (s/im) 0.056
Mask R-CNN (R50-C4, 1x)

Parameters 36 Million
FLOPs 890 Billion
File Size 137.42 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 15 hours

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 137259246
Max Iter 90000
lr sched 1x
FLOPs Input No 100
Backbone Layers 50
train time (s/iter) 0.584
Training Memory (GB) 5.2
inference time (s/im) 0.11
Mask R-CNN (R50-C4, 3x)

Parameters 36 Million
FLOPs 890 Billion
File Size 137.42 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 1.8 days

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 137849525
Max Iter 270000
lr sched 3x
FLOPs Input No 100
Backbone Layers 50
train time (s/iter) 0.575
Training Memory (GB) 5.2
inference time (s/im) 0.111
Mask R-CNN (R50-C4, VOC)

Parameters 33 Million
File Size 127.00 MB
Training Data PASCAL VOC 2007
Training Resources 8 NVIDIA V100 GPUs
Training Time 3 hours

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 142202221
Max Iter 18000
Warmup Steps 100
Backbone Layers 50
train time (s/iter) 0.537
Training Memory (GB) 4.8
inference time (s/im) 0.081
Mask R-CNN (R50-DC5, 1x)

Parameters 172 Million
File Size 657.92 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 12 hours

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 137260150
Max Iter 90000
lr sched 1x
Backbone Layers 50
train time (s/iter) 0.471
Training Memory (GB) 6.5
inference time (s/im) 0.076
Mask R-CNN (R50-DC5, 3x)

Parameters 172 Million
File Size 657.92 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 1.47 days

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 137849551
Max Iter 270000
lr sched 3x
Backbone Layers 50
train time (s/iter) 0.47
Training Memory (GB) 6.5
inference time (s/im) 0.076
Mask R-CNN (R50-FPN, 1x)

Parameters 44 Million
File Size 169.60 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 7 hours

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 137260431
Max Iter 90000
lr sched 1x
Backbone Layers 50
train time (s/iter) 0.261
Training Memory (GB) 3.4
inference time (s/im) 0.043
Mask R-CNN (R50-FPN, 1x, LVIS)

Parameters 50 Million
FLOPs 460 Billion
File Size 193.21 MB
Training Data LVIS
Training Resources 8 NVIDIA V100 GPUs
Training Time 7 hours

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 144219072
Max Iter 90000
lr sched 1x
FLOPs Input No 100
Backbone Layers 50
train time (s/iter) 0.292
Training Memory (GB) 7.1
inference time (s/im) 0.107
Mask R-CNN (R50-FPN, 3x)

Parameters 44 Million
File Size 169.60 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 20 hours

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 137849600
Max Iter 270000
lr sched 3x
Backbone Layers 50
train time (s/iter) 0.261
Training Memory (GB) 3.4
inference time (s/im) 0.043
Mask R-CNN (R50-FPN, Cityscapes)

Parameters 44 Million
FLOPs 464 Billion
File Size 168.13 MB
Training Data Cityscapes
Training Resources 8 NVIDIA V100 GPUs
Training Time 2 hours

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNet
ID 142423278
LR 0.01
Max Iter 24000
FLOPs Input No 100
Backbone Layers 50
train time (s/iter) 0.24
Training Memory (GB) 4.4
inference time (s/im) 0.078
Mask R-CNN (X101-FPN, 1x, LVIS)

Parameters 114 Million
FLOPs 686 Billion
File Size 435.04 MB
Training Data LVIS
Training Resources 8 NVIDIA V100 GPUs
Training Time 18 hours

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNeXt
ID 144219108
Max Iter 90000
lr sched 1x
FLOPs Input No 100
Backbone Layers 101
train time (s/iter) 0.712
Training Memory (GB) 10.2
inference time (s/im) 0.151
Mask R-CNN (X101-FPN, 3x)

Parameters 107 Million
File Size 411.43 MB
Training Data MS COCO
Training Resources 8 NVIDIA V100 GPUs
Training Time 2.16 days

Architecture Convolution, RoIAlign, Softmax, RPN, Dense Connections, ResNeXt
ID 139653917
Max Iter 270000
lr sched 3x
Backbone Layers 101
train time (s/iter) 0.69
Training Memory (GB) 7.2
inference time (s/im) 0.103
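
The Training Time values in the cards above appear to match Max Iter multiplied by train time (s/iter). The page does not state this relationship; it is an assumption inferred from the numbers. A minimal sketch of that arithmetic, with the function name chosen here for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_training_days(max_iter, seconds_per_iter):
    """Estimate wall-clock training time in days.

    Assumes training time is simply iterations times seconds per iteration,
    which ignores evaluation, checkpointing, and startup overhead.
    """
    return max_iter * seconds_per_iter / SECONDS_PER_DAY


# (Max Iter, train time in s/iter) copied from three cards on this page.
cards = {
    "R101-C4, 3x": (270000, 0.652),
    "R101-DC5, 3x": (270000, 0.545),
    "R50-FPN, 1x": (90000, 0.261),
}

for name, (max_iter, s_per_iter) in cards.items():
    days = estimated_training_days(max_iter, s_per_iter)
    print(f"{name}: {days:.2f} days ({days * 24:.1f} hours)")
```

For R101-C4 (3x) this gives roughly 2.04 days, in line with the card; the same estimate can be used to budget a custom schedule on comparable hardware.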

Summary

Mask R-CNN extends Faster R-CNN to solve instance segmentation tasks. It achieves this by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. In principle, Mask R-CNN is an intuitive extension of Faster R-CNN, but constructing the mask branch properly is critical for good results.

Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, Mask R-CNN utilises a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
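
To make the difference concrete, here is a small pure-Python sketch. It is a simplification, not the Detectron2 implementation: it reads a single point from a single-channel feature map, whereas real RoIAlign averages several bilinearly sampled points per output bin. The function names and the toy feature map are illustrative assumptions.

```python
import math


def bilinear_sample(feature, y, x):
    """Interpolate a 2D feature map (list of rows) at a fractional (y, x)."""
    height, width = len(feature), len(feature[0])
    # Clamp so the four neighbours always lie inside the map.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Blend along x on the two neighbouring rows, then blend along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature, y, x):
    """Snap (y, x) to the nearest cell first, as coarse quantization does."""
    height, width = len(feature), len(feature[0])
    yi = min(max(int(math.floor(y + 0.5)), 0), height - 1)
    xi = min(max(int(math.floor(x + 0.5)), 0), width - 1)
    return feature[yi][xi]


# Toy feature map whose value at (y, x) is 10*y + x, so the exact value at
# any fractional location is known in advance.
feature_map = [[10 * row + col for col in range(5)] for row in range(4)]

aligned = bilinear_sample(feature_map, 1.4, 2.6)     # close to 16.6
quantized = quantized_sample(feature_map, 1.4, 2.6)  # reads cell (1, 3) = 13
print(aligned, quantized)
```

The quantized lookup discards the fractional offset, while the interpolated lookup recovers the value at the requested location. For per-pixel mask prediction, that lost offset is the misalignment described above.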

How do I load this model?

There are several Mask R-CNN models available in Detectron2, with different backbones and learning schedules.

To load from the Detectron2 model zoo:

from detectron2 import model_zoo

# Build the model from its config; trained=True also loads the pretrained weights.
model = model_zoo.get("COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml", trained=True)

Replace the configuration path with the variant you want to use. You can find the paths in the model summaries at the top of this page.
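
As a hedged sketch, a small helper can build the path for the ResNet variants trained on COCO. It assumes the other configs follow the same naming pattern as the one path shown above; that pattern, the helper names, and the set of valid combinations are assumptions to check against the Detectron2 model zoo. ResNeXt, LVIS, Cityscapes, and VOC variants are not covered.

```python
def coco_mask_rcnn_config(backbone_layers, head, schedule):
    """Build a COCO instance-segmentation config path for a ResNet backbone.

    Hypothetical helper: it only formats a string following the pattern of
    "COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml" and does not
    check that the resulting config actually exists in the model zoo.
    """
    if backbone_layers not in (50, 101):
        raise ValueError("backbone_layers must be 50 or 101")
    if head not in ("C4", "DC5", "FPN"):
        raise ValueError("head must be 'C4', 'DC5', or 'FPN'")
    if schedule not in ("1x", "3x"):
        raise ValueError("schedule must be '1x' or '3x'")
    return (
        "COCO-InstanceSegmentation/"
        f"mask_rcnn_R_{backbone_layers}_{head}_{schedule}.yaml"
    )


def load_pretrained(config_path):
    """Load a pretrained model from the Detectron2 model zoo."""
    # Imported lazily so the path helper works without Detectron2 installed.
    from detectron2 import model_zoo

    return model_zoo.get(config_path, trained=True)


print(coco_mask_rcnn_config(50, "FPN", "3x"))
```

Not every backbone and schedule combination is listed on this page (the R101 cards, for example, are all 3x), so a path built this way may not correspond to a published model.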

How do I train this model?

You can follow the Getting Started guide on Colab to see how to train a model.

You can also read the official Detectron2 documentation.
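
For orientation, a fine-tuning run can be sketched roughly as follows. This is a minimal outline modelled on the usual Detectron2 workflow, not a substitute for the guides above. It assumes the training dataset has already been registered with Detectron2 under the name passed in, and it leaves the learning rate and batch size at the values from the chosen config. The function name and its arguments are illustrative.

```python
import os


def train_mask_rcnn(config_path, train_dataset, num_classes, max_iter,
                    output_dir="./output"):
    """Fine-tune a model-zoo Mask R-CNN on a dataset registered with Detectron2."""
    # Imported lazily so this function can be defined without Detectron2 installed.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultTrainer

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(config_path))
    # Start from the pretrained weights that belong to this config.
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_path)
    cfg.DATASETS.TRAIN = (train_dataset,)
    cfg.DATASETS.TEST = ()  # no evaluation during this sketch
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes
    cfg.SOLVER.MAX_ITER = max_iter
    cfg.OUTPUT_DIR = output_dir

    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()
    return trainer
```

A call might look like train_mask_rcnn("COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml", "my_dataset_train", num_classes=1, max_iter=300), where "my_dataset_train" is a placeholder for a dataset you have registered yourself.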

Citation

@misc{wu2019detectron2,
  author =       {Yuxin Wu and Alexander Kirillov and Francisco Massa and
                  Wan-Yen Lo and Ross Girshick},
  title =        {Detectron2},
  howpublished = {\url{https://github.com/facebookresearch/detectron2}},
  year =         {2019}
}

Results

Object Detection on COCO minival
MODEL BOX AP
Mask R-CNN (X101-FPN, 3x) 44.3
Mask R-CNN (R101-FPN, 3x) 42.9
Mask R-CNN (R101-C4, 3x) 42.6
Mask R-CNN (R101-DC5, 3x) 41.9
Mask R-CNN (R50-FPN, 3x) 41.0
Mask R-CNN (R50-DC5, 3x) 40.0
Mask R-CNN (R50-C4, 3x) 39.8
Mask R-CNN (R50-FPN, 1x) 38.6
Mask R-CNN (R50-DC5, 1x) 38.3
Mask R-CNN (R50-C4, 1x) 36.8
Instance Segmentation on COCO minival
MODEL MASK AP
Mask R-CNN (X101-FPN, 3x) 39.5
Mask R-CNN (R101-FPN, 3x) 38.6
Mask R-CNN (R101-DC5, 3x) 37.3
Mask R-CNN (R50-FPN, 3x) 37.2
Mask R-CNN (R101-C4, 3x) 36.7
Mask R-CNN (R50-DC5, 3x) 35.9
Mask R-CNN (R50-FPN, 1x) 35.2
Mask R-CNN (R50-C4, 3x) 34.4
Mask R-CNN (R50-DC5, 1x) 34.2
Mask R-CNN (R50-C4, 1x) 32.2
Object Detection on PASCAL VOC 2007
MODEL AP50 BOX AP
Mask R-CNN (R50-C4, VOC) 80.3 51.9
Instance Segmentation on Cityscapes test
MODEL MASK AP
Mask R-CNN (R50-FPN, Cityscapes) 36.5