GRiT: A Generative Region-to-text Transformer for Object Understanding

1 Dec 2022  ·  Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang

This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The core idea of GRiT is to formulate object understanding as <region, text> pairs, where the region localizes an object and the text describes it: in object detection the text is a class name, while in dense captioning it is a descriptive sentence. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects not only via simple nouns but also via rich descriptive sentences, including object attributes and actions. Experimentally, we apply GRiT to object detection and dense captioning. GRiT achieves 60.4 AP on COCO 2017 test-dev for object detection and 15.5 mAP on Visual Genome for dense captioning. Code is available at https://github.com/JialianW/GRiT
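To make the three-component architecture concrete, below is a minimal PyTorch sketch of the region-to-text pipeline the abstract describes: a visual encoder, a foreground object extractor, and an autoregressive text decoder whose output text can be a class name or a full caption. Every module name, layer choice, and shape here (GRiTSketch, region_head, the single-region simplification, vocab size, etc.) is a hypothetical placeholder of ours, not the actual GRiT implementation; see the linked repository for the real model.

```python
# Minimal sketch of a GRiT-style region-to-text pipeline.
# All names and shapes are illustrative placeholders, not the paper's code.
import torch
import torch.nn as nn

class GRiTSketch(nn.Module):
    def __init__(self, feat_dim=256, vocab_size=30522, max_len=40):
        super().__init__()
        # 1) Visual encoder: extracts image features
        #    (a patch embedding stands in for GRiT's ViT backbone).
        self.visual_encoder = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        # 2) Foreground object extractor: localizes objects; simplified here
        #    to one box per image regressed from pooled features.
        self.region_head = nn.Linear(feat_dim, 4)  # box as (x1, y1, x2, y2)
        # 3) Text decoder: generates an open-set description token by token
        #    (class name for detection, sentence for dense captioning).
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.text_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, feat_dim)
        self.lm_head = nn.Linear(feat_dim, vocab_size)
        self.max_len = max_len

    def forward(self, images, text_tokens):
        # images: (B, 3, H, W); text_tokens: (B, T) target description tokens
        feats = self.visual_encoder(images)          # (B, C, H/16, W/16)
        feats = feats.flatten(2).transpose(1, 2)     # (B, N, C) feature tokens
        boxes = self.region_head(feats.mean(dim=1))  # (B, 4) region per image
        # Decode text conditioned on visual features (here: all image tokens,
        # rather than per-region features as in the actual model).
        tgt = self.token_embed(text_tokens)          # (B, T, C)
        dec = self.text_decoder(tgt, feats)          # (B, T, C)
        logits = self.lm_head(dec)                   # (B, T, vocab_size)
        return boxes, logits

if __name__ == "__main__":
    model = GRiTSketch()
    images = torch.randn(2, 3, 224, 224)
    tokens = torch.randint(0, 30522, (2, 12))
    boxes, logits = model(images, tokens)  # (2, 4) and (2, 12, 30522)
```

The returned (boxes, logits) pair mirrors the <region, text> formulation: each predicted region is coupled with a generated token sequence describing it, which is what lets a single architecture cover both detection and dense captioning.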

Results from the Paper


Task               Dataset         Model                                Metric                 Value   Global Rank
Object Detection   COCO-O          GRiT (ViT-H)                         Average mAP            42.9    #4
Object Detection   COCO-O          GRiT (ViT-H)                         Effective Robustness   15.72   #5
Object Detection   COCO test-dev   GRiT (ViT-H, single-scale testing)   box mAP                60.4    #26
Dense Captioning   Visual Genome   GRiT (ViT-B)                         mAP                    15.5    #2
