TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Captioning	COCO Captions	Xmodal-Ctx	BLEU-4	41.4	# 12
Image Captioning	COCO Captions	Xmodal-Ctx	METEOR	30.4	# 14
Image Captioning	COCO Captions	Xmodal-Ctx	ROUGE-L	60.4	# 3
Image Captioning	COCO Captions	Xmodal-Ctx	CIDER	139.9	# 19
Image Captioning	COCO Captions	Xmodal-Ctx	SPICE	24.0	# 16
Image Captioning	COCO Captions	Xmodal-Ctx	BLEU-1	83.4	# 3
Image Captioning	COCO Captions	Xmodal-Ctx + OSCAR	BLEU-4	41.3	# 13
Image Captioning	COCO Captions	Xmodal-Ctx + OSCAR	CIDER	142.2	# 14
Image Captioning	COCO Captions	Xmodal-Ctx + OSCAR	SPICE	24.9	# 9
Image Captioning	COCO Captions	Xmodal-Ctx	BLEU-4	39.7	# 21
Image Captioning	COCO Captions	Xmodal-Ctx	METEOR	30.0	# 16
Image Captioning	COCO Captions	Xmodal-Ctx	ROUGE-L	59.5	# 5
Image Captioning	COCO Captions	Xmodal-Ctx	CIDER	135.9	# 21
Image Captioning	COCO Captions	Xmodal-Ctx	SPICE	23.7	# 17
Image Captioning	COCO Captions	Xmodal-Ctx	BLEU-1	81.5	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/beyond-a-pre-trained-object-detector-cross/image-captioning-on-coco-captions)](https://paperswithcode.com/sota/image-captioning-on-coco-captions?p=beyond-a-pre-trained-object-detector-cross)`

Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

CVPR 2022 · Chia-Wen Kuo, Zsolt Kira

Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.
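The retrieval step described in the abstract (using a multi-modal model to fetch contextual descriptions for an image) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings below are hand-made toy vectors standing in for what CLIP's image and text encoders would produce, and the names `cosine_similarity`, `retrieve_descriptions`, and the sample descriptions are hypothetical. It only shows the general idea of ranking candidate text descriptions by similarity to an image embedding and keeping the top k.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_descriptions(
    image_embedding: Sequence[float],
    text_embeddings: Sequence[Sequence[float]],
    descriptions: Sequence[str],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k descriptions whose embeddings are most similar to the image embedding."""
    scored = [
        (cosine_similarity(image_embedding, emb), desc)
        for emb, desc in zip(text_embeddings, descriptions)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(desc, score) for score, desc in scored[:k]]


# Toy data (assumption): in a real pipeline these vectors would come from a
# multi-modal encoder such as CLIP, and the descriptions from Visual Genome.
descriptions = ["dog chasing ball", "red car", "man riding horse"]
text_embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.0, 0.7]]
image_embedding = [1.0, 0.0, 0.0]

top = retrieve_descriptions(image_embedding, text_embeddings, descriptions, k=2)
for desc, score in top:
    print(f"{score:.3f}  {desc}")
```

In this sketch the retrieved descriptions would then be passed to the captioning model as an auxiliary input alongside the detector features; how that conditioning is done is specific to the paper and is not reproduced here.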

Code

GT-RIPL/Xmodal-Ctx official

Tasks

Image Captioning

Object

Datasets

ImageNet

MS COCO

Visual Genome

COCO Captions

JFT-300M

Results from the Paper

Ranked #12 on Image Captioning on COCO Captions
