Unified Vision-Language Pre-Training for Image Captioning and VQA

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models.
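
The shared-transformer design described above amounts to switching the self-attention mask rather than switching models: a bidirectional mask serves understanding tasks, while a sequence-to-sequence mask supports left-to-right caption generation. The sketch below is illustrative only (not the authors' code); the function and argument names (build_attention_mask, num_visual, num_text) are placeholders.

```python
import torch

def build_attention_mask(num_visual, num_text, mode="bidirectional"):
    """Build a self-attention mask over a [visual tokens | text tokens] sequence.

    A single shared transformer can act as an encoder (understanding tasks)
    or as a decoder (generation tasks) depending on the mask:
      - "bidirectional": every token attends to every other token.
      - "seq2seq": visual tokens attend only among themselves, while each
        text token attends to all visual tokens and causally to earlier
        text tokens, which supports left-to-right caption generation.
    Returns an (L, L) boolean mask where True means "may attend".
    """
    total = num_visual + num_text
    if mode == "bidirectional":
        return torch.ones(total, total, dtype=torch.bool)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Visual region: full attention among visual tokens only.
    mask[:num_visual, :num_visual] = True
    # Text region: attend to all visual tokens ...
    mask[num_visual:, :num_visual] = True
    # ... and causally to the current and previous text tokens.
    causal = torch.tril(torch.ones(num_text, num_text)).bool()
    mask[num_visual:, num_visual:] = causal
    return mask

# Example: 3 image-region tokens followed by 4 caption tokens.
print(build_attention_mask(3, 4, mode="seq2seq").int())
```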

Task                       Dataset                    Model        Metric    Value  Global Rank
Image Captioning           COCO Captions (test)       Unified VLP  BLEU-4    36.5   #1
                                                                   CIDEr     116.9  #1
                                                                   METEOR    28.4   #1
                                                                   SPICE     21.2   #1
Image Captioning           Flickr30k Captions (test)  Unified VLP  BLEU-4    30.1   #1
                                                                   CIDEr     67.4   #1
                                                                   METEOR    23     #1
                                                                   SPICE     17     #1
Visual Question Answering  VQA v2 (test-std)          Unified VLP  Accuracy  70.7   #8
