InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, termed InternImage, which, like ViTs, can benefit from increased parameters and training data. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as its core operator, so that the model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also performs adaptive spatial aggregation conditioned on input and task information. As a result, InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns from massive data with large-scale parameters, as ViTs do. The effectiveness of the model is demonstrated on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H achieves a new record of 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
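To make the idea of "adaptive spatial aggregation" concrete, the following is a minimal pure-Python sketch of a generic 3x3 deformable convolution evaluated at a single output location. It is an illustration of the general technique (sampling the input at offset positions with bilinear interpolation and scaling each sample by a modulation scalar), not the InternImage operator itself, whose details the abstract does not give. The function names `bilinear_sample` and `deformable_conv_at` are hypothetical, and the offsets and modulation values, which a real network would predict from the input, are passed in by hand here.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x); zero outside the map."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                # Weight falls off linearly with distance to each neighbour.
                weight = (1.0 - abs(y - yy)) * (1.0 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def deformable_conv_at(feature, center, weights, offsets, modulation):
    """Illustrative 3x3 deformable convolution output at one location.

    weights:    9 kernel weights (row-major over the 3x3 grid)
    offsets:    9 (dy, dx) pairs added to the regular grid positions
                (assumed given here; normally predicted from the input)
    modulation: 9 scalars scaling each sampled value
    """
    cy, cx = center
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        oy, ox = offsets[k]
        sample = bilinear_sample(feature, cy + gy + oy, cx + gx + ox)
        out += weights[k] * modulation[k] * sample
    return out


# Usage: a 4x4 ramp feature map, f[y][x] = 4*y + x, with an averaging kernel.
feature = [[4 * r + c for c in range(4)] for r in range(4)]
avg = [1.0 / 9.0] * 9
ones = [1.0] * 9
regular = deformable_conv_at(feature, (1, 1), avg, [(0.0, 0.0)] * 9, ones)
shifted = deformable_conv_at(feature, (1, 1), avg, [(0.5, 0.5)] * 9, ones)
print(regular, shifted)
```

With all offsets set to zero the function reduces to an ordinary 3x3 convolution; non-zero offsets move the sampling points off the regular grid, which is what lets the receptive field adapt to the input.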

CVPR 2023

Results from the Paper


 Ranked #1 on Instance Segmentation on COCO test-dev (AP50 metric, using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | InternImage-H (M3I Pre-training) | Params (M) | 1310 | #3 |
| Semantic Segmentation | ADE20K | InternImage-L | Validation mIoU | 54.1 | #61 |
| | | | Params (M) | 256 | #15 |
| | | | GFLOPs | 2526 | #4 |
| Semantic Segmentation | ADE20K | InternImage-XL | Validation mIoU | 55.3 | #42 |
| | | | Params (M) | 368 | #13 |
| | | | GFLOPs | 3142 | #5 |
| Semantic Segmentation | ADE20K | InternImage-H | Validation mIoU | 62.9 | #2 |
| | | | Params (M) | 1310 | #3 |
| | | | GFLOPs | 4635 | #6 |
| Semantic Segmentation | ADE20K | InternImage-T | Validation mIoU | 48.1 | #142 |
| | | | Params (M) | 59 | #43 |
| | | | GFLOPs | 944 | #1 |
| Semantic Segmentation | ADE20K | InternImage-S | Validation mIoU | 50.9 | #100 |
| | | | Params (M) | 80 | #37 |
| | | | GFLOPs | 1017 | #2 |
| Semantic Segmentation | ADE20K | InternImage-B | Validation mIoU | 51.3 | #91 |
| | | | Params (M) | 128 | #22 |
| | | | GFLOPs | 1185 | #3 |
| 2D Object Detection | BDD100K val | InternImage-H | mAP | 38.8 | #1 |
| Semantic Segmentation | Cityscapes test | InternImage-H | Mean IoU (class) | 86.1% | #3 |
| Semantic Segmentation | Cityscapes val | InternImage-XL | mIoU | 86.4 | #5 |
| Semantic Segmentation | Cityscapes val | InternImage-H | mIoU | 87 | #3 |
| Instance Segmentation | COCO minival | InternImage-XL | mask AP | 48.8 | #30 |
| | | | Params (M) | 387 | #1 |
| | | | GFLOPs | 1782 | #5 |
| Object Detection | COCO minival | InternImage-XL | box AP | 64.2 | #8 |
| Object Detection | COCO minival | InternImage-H | box AP | 65.0 | #2 |
| Instance Segmentation | COCO minival | InternImage-L | mask AP | 48.5 | #33 |
| | | | Params (M) | 277 | #2 |
| | | | GFLOPs | 1399 | #4 |
| | | | box AP | 56.1 | #1 |
| Instance Segmentation | COCO minival | InternImage-S | mask AP | 44.5 | #50 |
| | | | Params (M) | 69 | #4 |
| | | | GFLOPs | 340 | #2 |
| | | | box AP | 49.7 | #2 |
| Instance Segmentation | COCO minival | InternImage-T | mask AP | 43.7 | #55 |
| | | | Params (M) | 49 | #5 |
| | | | GFLOPs | 270 | #1 |
| | | | box AP | 49.1 | #3 |
| Instance Segmentation | COCO minival | InternImage-H | mask AP | 55.4 | #1 |
| | | | AP50 | 80.1 | #1 |
| | | | AP75 | 61.5 | #1 |
| | | | APL | 74.4 | #1 |
| | | | APM | 58.4 | #1 |
| | | | APS | 37.9 | #1 |
| Instance Segmentation | COCO minival | InternImage-B | Params (M) | 115 | #3 |
| | | | GFLOPs | 501 | #3 |
| Object Detection | COCO-O | InternImage-L (Cascade Mask R-CNN) | Average mAP | 37.0 | #8 |
| | | | Effective Robustness | 11.72 | #8 |
| Object Detection | COCO test-dev | InternImage-H | box mAP | 65.4 | #2 |
| | | | Params (M) | 2180 | #2 |
| Object Detection | COCO test-dev | InternImage-XL | box mAP | 64.3 | #10 |
| | | | Params (M) | 602 | #4 |
| Object Detection | COCO test-dev | InternImage-H (M3I Pre-training) | Params (M) | 2180 | #2 |
| Instance Segmentation | COCO test-dev | InternImage-H | AP50 | 80.8 | #1 |
| | | | AP75 | 62.2 | #1 |
| | | | APS | 41.0 | #1 |
| | | | APM | 58.9 | #1 |
| | | | APL | 70.3 | #4 |
| Object Detection | CrowdHuman (full body) | InternImage-H | AP | 97.2 | #1 |
| Image Classification | ImageNet | InternImage-S | Top 1 Accuracy | 84.2% | #313 |
| | | | Number of params | 50M | #725 |
| | | | GFLOPs | 8 | #267 |
| Image Classification | ImageNet | InternImage-B | Top 1 Accuracy | 84.9% | #265 |
| | | | Number of params | 97M | #858 |
| | | | GFLOPs | 16 | #346 |
| Image Classification | ImageNet | InternImage-L | Top 1 Accuracy | 87.7% | #82 |
| | | | Number of params | 223M | #905 |
| | | | GFLOPs | 108 | #454 |
| Image Classification | ImageNet | InternImage-XL | Top 1 Accuracy | 88% | #69 |
| | | | Number of params | 335M | #921 |
| | | | GFLOPs | 163 | #463 |
| Image Classification | ImageNet | InternImage-DCNv3-G (M3I Pre-training) | Top 1 Accuracy | 90.1% | #16 |
| | | | Number of params | 3000M | #973 |
| Image Classification | ImageNet | InternImage-H | Top 1 Accuracy | 89.6% | #24 |
| | | | Number of params | 1080M | #957 |
| | | | GFLOPs | 1478 | #490 |
| Image Classification | ImageNet | InternImage-T | Top 1 Accuracy | 83.5% | #391 |
| | | | Number of params | 30M | #646 |
| | | | GFLOPs | 5 | #231 |
| Image Classification | iNaturalist 2018 | InternImage-H | Top-1 Accuracy | 92.6% | #2 |
| Object Detection | LVIS v1.0 minival | InternImage-H | box AP | 65.8 | #2 |
| Object Detection | LVIS v1.0 val | InternImage-H | box AP | 63.2 | #2 |
| Object Detection | OpenImages-v6 | InternImage-H | box AP | 74.1 | #2 |
| Semantic Segmentation | PASCAL Context | InternImage-H | mIoU | 70.3 | #2 |
| Object Detection | PASCAL VOC 2012 | InternImage-H | MAP | 97.2 | #1 |
| Image Classification | Places205 | InternImage-H | Top 1 Accuracy | 71.7% | #1 |
| Image Classification | Places365 | InternImage-H(CNN) | Top 1 Accuracy | 61.2% | #2 |

Methods