TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Captioning	COCO Captions	CLIP Text Encoder (RL w/ CIDEr-reward)	BLEU-4	38.2	# 26
Image Captioning	COCO Captions	CLIP Text Encoder (RL w/ CIDEr-reward)	METEOR	28.7	# 21
Image Captioning	COCO Captions	CLIP Text Encoder (RL w/ CIDEr-reward)	ROUGE-L	58.5	# 10
Image Captioning	COCO Captions	CLIP Text Encoder (RL w/ CIDEr-reward)	CIDEr	124.9	# 28

Fine-grained Image Captioning with CLIP Reward

Findings (NAACL) 2022 · Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal

Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe only the most salient common objects, models trained with text similarity objectives tend to ignore the specific and detailed aspects of an image that distinguish it from others. Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function. We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotation. As a result, reference captions are not needed at all during reward computation. To comprehensively evaluate descriptive captions, we introduce FineCapEval, a new dataset for caption evaluation with fine-grained criteria: overall, background, object, and relations. In our experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model. We also show that our unsupervised grammar finetuning of the CLIP text encoder alleviates the degeneration problem of the naive CLIP reward. Lastly, we present a human analysis in which annotators strongly prefer the CLIP reward to the CIDEr and MLE objectives according to various criteria. Code and Data: https://github.com/j-min/CLIP-Caption-Reward
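The idea of using image-text similarity as a reward can be illustrated with a small sketch. This is not the authors' implementation: the function names `cosine_reward` and `reinforce_loss` are hypothetical, the embeddings are random stand-ins for CLIP image and text encoder outputs, and the use of a baseline caption to form an advantage is an assumption borrowed from common reinforcement-learning captioning setups rather than something the abstract states.

```python
import numpy as np


def cosine_reward(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between image and caption embeddings.

    Assumption: the reward is plain cosine similarity; any scaling or
    clipping applied in the actual method is not modeled here.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return np.sum(image_emb * text_emb, axis=-1)


def reinforce_loss(log_probs: np.ndarray,
                   sampled_reward: np.ndarray,
                   baseline_reward: np.ndarray) -> float:
    """REINFORCE with a baseline: mean of -(r_sampled - r_baseline) * log p."""
    advantage = sampled_reward - baseline_reward
    return float(np.mean(-advantage * log_probs))


# Toy stand-ins for encoder outputs (hypothetical, not real CLIP features).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
sampled_caption_emb = image_emb + 0.1 * rng.normal(size=(4, 8))
baseline_caption_emb = rng.normal(size=(4, 8))
log_probs = np.array([-5.0, -4.0, -6.0, -3.5])  # log p(sampled caption)

r_sampled = cosine_reward(image_emb, sampled_caption_emb)
r_baseline = cosine_reward(image_emb, baseline_caption_emb)
loss = reinforce_loss(log_probs, r_sampled, r_baseline)
print(r_sampled, r_baseline, loss)
```

Note that no reference caption appears anywhere in the computation: the reward depends only on the image and the generated caption, which is the property the abstract highlights.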

Code

j-min/clip-caption-reward official

Tasks

Caption Generation

Descriptive

Image Captioning

Image Retrieval

Retrieval

text annotation

text similarity

Datasets

MS COCO

COCO Captions

Results from the Paper

Ranked #26 on Image Captioning on COCO Captions

Methods

CLIP
