TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Object Detection	COCO minival	GLIP (Swin-L, multi-scale)	box AP	60.8	# 17
Object Detection	COCO-O	GLIP-T (Swin-T)	Average mAP	29.1	# 21
Object Detection	COCO-O	GLIP-T (Swin-T)	Effective Robustness	8.11	# 12
Object Detection	COCO-O	GLIP-L (Swin-L)	Average mAP	48.0	# 3
Object Detection	COCO-O	GLIP-L (Swin-L)	Effective Robustness	24.89	# 2
Object Detection	COCO test-dev	GLIP (Swin-L, multi-scale)	box mAP	61.5	# 21
Object Detection	COCO test-dev	GLIP (Swin-L, multi-scale)	AP50	79.5	# 4
Object Detection	COCO test-dev	GLIP (Swin-L, multi-scale)	AP75	67.7	# 4
Object Detection	COCO test-dev	GLIP (Swin-L, multi-scale)	APS	45.3	# 4
Object Detection	COCO test-dev	GLIP (Swin-L, multi-scale)	APM	64.9	# 4
Object Detection	COCO test-dev	GLIP (Swin-L, multi-scale)	APL	75.0	# 4
Described Object Detection	Description Detection Dataset	GLIP-T	Intra-scenario FULL mAP	19.1	# 4
Described Object Detection	Description Detection Dataset	GLIP-T	Intra-scenario PRES mAP	18.3	# 5
Described Object Detection	Description Detection Dataset	GLIP-T	Intra-scenario ABS mAP	21.5	# 3
Phrase Grounding	Flickr30k Entities Test	GLIP	R@1	87.1	# 3
Phrase Grounding	Flickr30k Entities Test	GLIP	R@10	98.1	# 1
Phrase Grounding	Flickr30k Entities Test	GLIP	R@5	96.9	# 1
Zero-Shot Object Detection	LVIS v1.0 minival	GLIP-L	AP	37.3	# 3
Zero-Shot Object Detection	LVIS v1.0 val	GLIP-L	AP	26.9	# 3
Few-Shot Object Detection	ODinW-13	GLIP-T	Average Score	50.7	# 2
Few-Shot Object Detection	ODinW-35	GLIP-T	Average Score	38.9	# 2
2D Object Detection	RF100	GLIP	Average mAP	0.112	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/2d-object-detection-on-rf100)](https://paperswithcode.com/sota/2d-object-detection-on-rf100?p=grounded-language-image-pre-training)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/few-shot-object-detection-on-odinw-13)](https://paperswithcode.com/sota/few-shot-object-detection-on-odinw-13?p=grounded-language-image-pre-training)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/few-shot-object-detection-on-odinw-35)](https://paperswithcode.com/sota/few-shot-object-detection-on-odinw-35?p=grounded-language-image-pre-training)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/object-detection-on-coco-o)](https://paperswithcode.com/sota/object-detection-on-coco-o?p=grounded-language-image-pre-training)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/phrase-grounding-on-flickr30k-entities-test)](https://paperswithcode.com/sota/phrase-grounding-on-flickr30k-entities-test?p=grounded-language-image-pre-training)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/zero-shot-object-detection-on-lvis-v1-0)](https://paperswithcode.com/sota/zero-shot-object-detection-on-lvis-v1-0?p=grounded-language-image-pre-training)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/zero-shot-object-detection-on-lvis-v1-0-val)](https://paperswithcode.com/sota/zero-shot-object-detection-on-lvis-v1-0-val?p=grounded-language-image-pre-training)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/described-object-detection-on-description)](https://paperswithcode.com/sota/described-object-detection-on-description?p=grounded-language-image-pre-training)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=grounded-language-image-pre-training)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounded-language-image-pre-training/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=grounded-language-image-pre-training)`
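
The badge entries above all follow one URL pattern, differing only in the benchmark slug. A minimal sketch of that pattern, assuming it holds for every benchmark; the function name `pwc_badge` and the constant `PAPER_SLUG` are illustrative and not part of any official tool:

```python
# Paper slug as it appears in the badge URLs listed above.
PAPER_SLUG = "grounded-language-image-pre-training"


def pwc_badge(task_slug: str, paper_slug: str = PAPER_SLUG) -> str:
    """Build the Markdown for one badge from a benchmark slug.

    The URL layout is inferred from the badge entries on this page;
    it is an assumption, not a documented API.
    """
    image = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{task_slug}"
    )
    link = f"https://paperswithcode.com/sota/{task_slug}?p={paper_slug}"
    return f"[![PWC]({image})]({link})"


# Reproduces the COCO badge entry listed above.
print(pwc_badge("object-detection-on-coco"))
```

Paste the returned string into a README to render the badge.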

Grounded Language-Image Pre-training

CVPR 2022 · Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data, improving both tasks and bootstrapping a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing the prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.
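
To make the unification of detection and phrase grounding concrete, the sketch below treats detection as grounding: class names are joined into a single text prompt, and each candidate region is scored against each phrase with a dot product, in place of a fixed classification layer. This is a simplified illustration under stated assumptions, not the released implementation. The feature vectors are hypothetical toy values, each class is given a single feature vector, and the actual model uses learned image and language encoders.

```python
def build_prompt(class_names):
    """Join class names into one text prompt, e.g. 'person. bicycle. car.'"""
    return " ".join(f"{name}." for name in class_names)


def alignment_scores(region_feats, phrase_feats):
    """Score every region against every phrase with a dot product.

    Returns a matrix with one row per region and one column per phrase.
    """
    return [
        [sum(r * p for r, p in zip(region, phrase)) for phrase in phrase_feats]
        for region in region_feats
    ]


classes = ["person", "bicycle", "car"]
prompt = build_prompt(classes)

# Hypothetical 2-D features (assumption): one vector per class phrase
# and one per candidate region. A real model would produce these with
# its text and image encoders.
phrase_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
region_feats = [[2.0, 0.1], [0.2, 3.0]]

scores = alignment_scores(region_feats, phrase_feats)

# Assign each region the phrase with the highest alignment score.
best = [classes[max(range(len(row)), key=row.__getitem__)] for row in scores]
print(prompt)
print(best)
```

Because the label set lives in the prompt rather than in the classifier weights, the same scoring step applies to a new set of class names by changing only the prompt, which is what makes zero-shot evaluation on a new dataset possible in this formulation.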

Code

microsoft/GLIP (official, 1,983 stars)

brown-palm/ObjectPrompt

Tasks

2D Object Detection

Described Object Detection

Few-Shot Object Detection

Object Detection

Zero-Shot Object Detection

Datasets

MS COCO

Visual Genome

LVIS

Objects365

Flickr30K Entities

COCO-O

Description Detection Dataset

RF100

Results from the Paper

Ranked #1 on 2D Object Detection on RF100
