This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.

PDF Abstract CVPR 2022 PDF CVPR 2022 Abstract

Results from the Paper


Task Dataset Model Metric Name Metric Value Global Rank Uses Extra
Training Data
Benchmark
Object Detection COCO minival GLIP (Swin-L, multi-scale) box AP 60.8 # 11
Object Detection COCO test-dev GLIP (Swin-L, multi-scale) box AP 61.5 # 15
AP50 79.5 # 5
AP75 67.7 # 5
APS 45.3 # 5
APM 64.9 # 5
APL 75.0 # 5
Phrase Grounding Flickr30k Entities Test GLIP R@1 87.1 # 3
R@10 98.1 # 1
R@5 96.9 # 1
2D object detection RF100 GLIP Average mAP 0.112 # 1

Methods


No methods listed for this paper. Add relevant methods here