TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Captioning	COCO Captions test	Unified VLP	BLEU-4	36.5	# 2
Image Captioning	COCO Captions test	Unified VLP	CIDEr	116.9	# 1
Image Captioning	COCO Captions test	Unified VLP	METEOR	28.4	# 2
Image Captioning	COCO Captions test	Unified VLP	SPICE	21.2	# 1
Image Captioning	Flickr30k Captions test	Unified VLP	BLEU-4	30.1	# 1
Image Captioning	Flickr30k Captions test	Unified VLP	CIDEr	67.4	# 1
Image Captioning	Flickr30k Captions test	Unified VLP	METEOR	23	# 1
Image Captioning	Flickr30k Captions test	Unified VLP	SPICE	17	# 1
Visual Question Answering (VQA)	VQA v2 test-std	Unified VLP	overall	70.7	# 27

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unified-vision-language-pre-training-for/image-captioning-on-flickr30k-captions-test)](https://paperswithcode.com/sota/image-captioning-on-flickr30k-captions-test?p=unified-vision-language-pre-training-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unified-vision-language-pre-training-for/image-captioning-on-coco-captions-test)](https://paperswithcode.com/sota/image-captioning-on-coco-captions-test?p=unified-vision-language-pre-training-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unified-vision-language-pre-training-for/visual-question-answering-on-vqa-v2-test-std)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-test-std?p=unified-vision-language-pre-training-for)`
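
The three badge snippets above appear to share one URL pattern: a paper slug and a benchmark slug combined into a shields.io endpoint URL and a leaderboard link. The sketch below reproduces that pattern as inferred from those three lines only. The function name `pwc_badge_markdown` and its parameter names are hypothetical and not part of any official API.

```python
def pwc_badge_markdown(paper_slug: str, benchmark_slug: str) -> str:
    """Build badge Markdown following the pattern seen in the snippets above.

    Assumption: the pattern is inferred from the three examples on this page;
    it is not taken from any official documentation.
    """
    # Image URL: shields.io endpoint wrapping the per-paper, per-benchmark badge.
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        "https://paperswithcode.com/badge/" + paper_slug + "/" + benchmark_slug
    )
    # Link target: the benchmark leaderboard, with the paper passed as `p`.
    link_url = (
        "https://paperswithcode.com/sota/" + benchmark_slug + "?p=" + paper_slug
    )
    return "[![PWC](" + badge_url + ")](" + link_url + ")"


if __name__ == "__main__":
    print(
        pwc_badge_markdown(
            "unified-vision-language-pre-training-for",
            "image-captioning-on-flickr30k-captions-test",
        )
    )
```

Called with the slugs shown in the usage block, this should reproduce the first badge line listed above.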

Unified Vision-Language Pre-Training for Image Captioning and VQA

24 Sep 2019 · Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by using specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
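
The abstract states that the two pre-training objectives differ only in the context each prediction may condition on, and that this is controlled through self-attention masks. A minimal sketch of what such masks could look like follows. It assumes a simplified input made of image-region positions followed by text-token positions, and it omits special tokens. The function name `build_attention_mask` and its parameters are hypothetical; this is an illustration, not the authors' implementation.

```python
def build_attention_mask(num_regions: int, num_tokens: int, mode: str):
    """Return an n x n matrix of 0/1 flags, n = num_regions + num_tokens.

    mask[i][j] == 1 means position i may attend to position j.
    Assumption: positions 0..num_regions-1 are image regions and the
    remaining positions are text tokens (special tokens are left out).
    """
    n = num_regions + num_tokens
    if mode == "bidirectional":
        # Every position sees every other position.
        return [[1] * n for _ in range(n)]
    if mode == "seq2seq":
        mask = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j < num_regions:
                    # Image regions are visible to all positions.
                    mask[i][j] = 1
                elif i >= num_regions and j <= i:
                    # A text token sees itself and earlier text tokens only.
                    mask[i][j] = 1
        return mask
    raise ValueError("mode must be 'bidirectional' or 'seq2seq'")


if __name__ == "__main__":
    for row in build_attention_mask(2, 3, "seq2seq"):
        print(row)
```

In the seq2seq case the image rows never attend to text columns, and the text block is lower-triangular, so a masked word can only be predicted from the image and the words to its left. In the bidirectional case the same network sees the full context, which is the only difference between the two masks in this sketch.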

Code

LuoweiZhou/VLP (official) · 403 stars

rmokady/clip_prefix_caption · 1,201 stars

WebQnA/WebQA_Baseline

Tasks

Image Captioning

Question Answering

Text Generation

Visual Question Answering

Visual Question Answering (VQA)

Datasets

Visual Question Answering

Visual Genome

Flickr30k

Visual Question Answering v2.0

Conceptual Captions

COCO Captions

Results from the Paper

Ranked #1 on Image Captioning on Flickr30k Captions test

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Unified VLP
