TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Captioning	COCO Captions	LEMON	BLEU-4	42.6	# 7
Image Captioning	COCO Captions	LEMON	METEOR	31.4	# 7
Image Captioning	COCO Captions	LEMON	CIDEr	145.5	# 7
Image Captioning	COCO Captions	LEMON	SPICE	25.5	# 6
Image Captioning	nocaps-val-in-domain	LEMON_large	CIDEr	116.9	# 4
Image Captioning	nocaps-val-in-domain	LEMON_large	SPICE	15.8	# 2
Image Captioning	nocaps-val-in-domain	LEMON_large	Pre-train (#images)	200M	# 10
Image Captioning	nocaps-val-in-domain	LEMON_base	CIDEr	107.7	# 8
Image Captioning	nocaps-val-in-domain	LEMON_base	SPICE	14.7	# 8
Image Captioning	nocaps-val-in-domain	LEMON_base	Pre-train (#images)	200M	# 10
Image Captioning	nocaps-val-near-domain	LEMON_large	CIDEr	113.3	# 4
Image Captioning	nocaps-val-near-domain	LEMON_large	SPICE	15.1	# 4
Image Captioning	nocaps-val-near-domain	LEMON_large	Pre-train (#images)	200M	# 10
Image Captioning	nocaps-val-out-domain	LEMON_large	CIDEr	111.3	# 7
Image Captioning	nocaps-val-out-domain	LEMON_large	SPICE	14.0	# 7
Image Captioning	nocaps-val-out-domain	LEMON_large	Pre-train (#images)	200M	# 10
Image Captioning	nocaps-val-overall	LEMON_large	CIDEr	113.4	# 4
Image Captioning	nocaps-val-overall	LEMON_large	SPICE	15.0	# 4
Image Captioning	nocaps-val-overall	LEMON_large	Pre-train (#images)	200M	# 10
Image Captioning	nocaps-XD entire	Microsoft Cognitive Services team	CIDEr	114.25	# 3
Image Captioning	nocaps-XD entire	Microsoft Cognitive Services team	B1	85.62	# 3
Image Captioning	nocaps-XD entire	Microsoft Cognitive Services team	B2	71.36	# 3
Image Captioning	nocaps-XD entire	Microsoft Cognitive Services team	B3	53.62	# 3
Image Captioning	nocaps-XD entire	Microsoft Cognitive Services team	B4	34.65	# 3
Image Captioning	nocaps-XD entire	Microsoft Cognitive Services team	ROUGE-L	61.2	# 3
Image Captioning	nocaps-XD entire	Microsoft Cognitive Services team	METEOR	31.27	# 3
Image Captioning	nocaps-XD entire	Microsoft Cognitive Services team	SPICE	14.85	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-up-vision-language-pre-training-for/image-captioning-on-nocaps-xd-entire)](https://paperswithcode.com/sota/image-captioning-on-nocaps-xd-entire?p=scaling-up-vision-language-pre-training-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-up-vision-language-pre-training-for/image-captioning-on-nocaps-val-in-domain)](https://paperswithcode.com/sota/image-captioning-on-nocaps-val-in-domain?p=scaling-up-vision-language-pre-training-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-up-vision-language-pre-training-for/image-captioning-on-nocaps-val-near-domain)](https://paperswithcode.com/sota/image-captioning-on-nocaps-val-near-domain?p=scaling-up-vision-language-pre-training-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-up-vision-language-pre-training-for/image-captioning-on-nocaps-val-overall)](https://paperswithcode.com/sota/image-captioning-on-nocaps-val-overall?p=scaling-up-vision-language-pre-training-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-up-vision-language-pre-training-for/image-captioning-on-coco-captions)](https://paperswithcode.com/sota/image-captioning-on-coco-captions?p=scaling-up-vision-language-pre-training-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scaling-up-vision-language-pre-training-for/image-captioning-on-nocaps-val-out-domain)](https://paperswithcode.com/sota/image-captioning-on-nocaps-val-out-domain?p=scaling-up-vision-language-pre-training-for)`

Scaling Up Vision-Language Pre-training for Image Captioning

CVPR 2022 · Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang

In recent years, we have witnessed a significant performance boost in the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor for this advance. However, most existing work only focuses on pre-training transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference model, which consists of an image feature extractor and a transformer model, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. In terms of data, we conduct experiments with up to 200 million image-text pairs, which are automatically collected from the web based on the alt attribute of the image (dubbed ALT200M). Extensive analysis helps to characterize the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state-of-the-art results on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.
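The abstract says the ALT200M pairs were collected from the web using the alt attribute of images, without giving details of the pipeline. As a rough illustration of that idea only, the sketch below pulls (image URL, alt text) pairs out of an HTML page with Python's standard library. The names `AltTextCollector` and `collect_pairs` and the `min_words` filter are assumptions made for this example; they are not the authors' code or their filtering rules.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags in an HTML page."""

    def __init__(self, min_words=3):
        super().__init__()
        # Hypothetical noise filter: drop very short alt texts such as "logo".
        self.min_words = min_words
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        # A valueless or missing alt attribute yields None, so fall back to "".
        alt = (attr_map.get("alt") or "").strip()
        if src and len(alt.split()) >= self.min_words:
            self.pairs.append((src, alt))


def collect_pairs(html, min_words=3):
    """Return the list of (src, alt) pairs found in one HTML document."""
    parser = AltTextCollector(min_words)
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Calling `collect_pairs(page_html)` on a downloaded page returns the pairs in document order. A real large-scale crawl would also need deduplication, language filtering, and image download, none of which is shown here.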

Code

No code implementations yet.

Tasks

Attribute

Image Captioning

Datasets

Visual Genome

Conceptual Captions

COCO Captions

NoCaps

Results from the Paper

Ranked #3 on Image Captioning on nocaps-XD entire (using extra training data)

Methods

No methods listed for this paper.
