TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Object Detection	COCO test-dev	GLIPv2 (CoSwin-H, multi-scale)	box mAP	62.4	# 18
Phrase Grounding	Flickr30k Entities Test	GLIPv2	R@1	87.7	# 1
Object Detection	LVIS v1.0 minival	GLIPv2	box AP	59.8	# 4
Referring Expression Segmentation	PhraseCut	GLIPv2	Mean IoU	61.3	# 1

GLIPv2: Unifying Localization and Vision-Language Understanding

12 Jun 2022 · Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel contrastive task at the region-word level, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
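To make the first two pre-training tasks in the abstract more concrete, here is a minimal sketch and not the released GLIP code. It assumes detection categories are joined into a text prompt, that each region is scored against each word by a dot product, and that a generic softmax (InfoNCE-style) loss stands in for the contrastive objective; the paper's actual loss and feature extractors may differ. All function names and the toy feature vectors are hypothetical.

```python
import math

def build_prompt(categories):
    """Join category names into one text prompt and record each name's character span."""
    spans, cursor = [], 0
    for name in categories:
        spans.append((cursor, cursor + len(name)))
        cursor += len(name) + 2  # skip the ". " separator that follows each name
    return ". ".join(categories) + ".", spans

def alignment_scores(region_feats, word_feats):
    """Score every region against every word with a plain dot product."""
    return [[sum(a * b for a, b in zip(r, w)) for w in word_feats]
            for r in region_feats]

def region_word_contrastive_loss(scores, positives, temperature=1.0):
    """Mean softmax cross-entropy: each region should pick out its positive word.

    Assumption: a generic InfoNCE-style loss used for illustration only.
    """
    total = 0.0
    for row, pos in zip(scores, positives):
        scaled = [s / temperature for s in row]
        m = max(scaled)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_denom - scaled[pos]
    return total / len(scores)

# Toy usage: three categories, three regions, 2-d hypothetical features.
prompt, spans = build_prompt(["person", "bicycle", "car"])
word_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
region_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
scores = alignment_scores(region_feats, word_feats)
loss = region_word_contrastive_loss(scores, positives=[0, 1, 2])
```

In this framing the detector's fixed class logits are replaced by region-word scores, so new categories can be queried just by changing the prompt text, which is the property behind the open-vocabulary results mentioned above.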

Code

microsoft/GLIP official

1,983 GitHub stars

Tasks

2D Object Detection

Contrastive Learning

Image Captioning

Instance Segmentation

Language Modelling

Masked Language Modeling

Object Detection

Open Vocabulary Object Detection

Phrase Grounding

Referring Expression Segmentation

Semantic Segmentation

Visual Question Answering (VQA)

Datasets

MS COCO

Visual Genome

LVIS

Flickr30K Entities

PhraseCut

Results from the Paper

Ranked #1 on Phrase Grounding on Flickr30k Entities Test (using extra training data)

Methods

Contrastive Learning
