Position-guided Text Prompt for Vision-Language Pre-training

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability that is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N\times N$ blocks and identifies the objects in each block using the object detector widely used in VLP. It then reformulates the visual grounding task as a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in a given block or regress the block of a given object, e.g. filling ``P'' or ``O'' in a PTP ``The block P has a O''. This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistent and significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT \cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP \cite{blip} baseline. Moreover, PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards its object detector at inference time while the latter cannot. Our code and pre-trained weights will be released at \url{https://github.com/sail-sg/ptp}.
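
The prompt construction described in the abstract can be illustrated with a small sketch. This is not the authors' released code: the function names (`block_index`, `make_prompts`, `mask_slot`), the choice of assigning an object to the block containing its bounding-box centre, the row-major block numbering starting at 0, the default $N = 3$, and the `[MASK]` token are all assumptions made for illustration. Detections are assumed to arrive as `(label, (x_min, y_min, x_max, y_max))` pairs in pixel coordinates from some external object detector.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]
Detection = Tuple[str, Box]


def block_index(box: Box, width: int, height: int, n: int = 3) -> int:
    """Row-major index of the n x n grid block containing the box centre.

    Assigning by box centre is an illustrative assumption, not a rule
    stated in the abstract.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    # Clamp so a centre lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / width * n), n - 1)
    row = min(int(cy / height * n), n - 1)
    return row * n + col


def make_prompts(detections: List[Detection], width: int, height: int,
                 n: int = 3) -> List[str]:
    """Fill the template 'The block P has a O' for each detected object."""
    return [
        f"The block {block_index(box, width, height, n)} has a {label}"
        for label, box in detections
    ]


def mask_slot(block: int, label: str, slot: str) -> Tuple[str, str]:
    """Build a fill-in-the-blank pair (masked prompt, target) for one slot.

    slot == "P": hide the block, the target is the block index.
    slot == "O": hide the object, the target is the object label.
    """
    if slot == "P":
        return f"The block [MASK] has a {label}", str(block)
    if slot == "O":
        return f"The block {block} has a [MASK]", label
    raise ValueError("slot must be 'P' or 'O'")


if __name__ == "__main__":
    # Hypothetical detections on a 600 x 600 image.
    dets = [
        ("dog", (0, 0, 100, 100)),
        ("kite", (500, 0, 600, 100)),
        ("car", (250, 250, 350, 350)),
    ]
    for prompt in make_prompts(dets, width=600, height=600):
        print(prompt)
    print(mask_slot(4, "car", "O"))
```

Because the prompts are plain text built once from detector outputs, a pipeline along these lines would only need the detector while preparing pre-training data, which is consistent with the abstract's point that the detector is discarded at inference time.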

CVPR 2023
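Task                             Dataset        Model           Metric              Value  Global Rank
-------------------------------  -------------  --------------  ------------------  -----  -----------
Zero-Shot Cross-Modal Retrieval  COCO 2014      PTP-BLIP        Image-to-text R@1   69.7   #5
Zero-Shot Cross-Modal Retrieval  COCO 2014      PTP-BLIP        Image-to-text R@5   90.0   #4
Zero-Shot Cross-Modal Retrieval  COCO 2014      PTP-BLIP        Image-to-text R@10  94.7   #4
Zero-Shot Cross-Modal Retrieval  COCO 2014      PTP-BLIP        Text-to-image R@1   49.5   #8
Zero-Shot Cross-Modal Retrieval  COCO 2014      PTP-BLIP        Text-to-image R@5   75.9   #6
Zero-Shot Cross-Modal Retrieval  COCO 2014      PTP-BLIP        Text-to-image R@10  84.2   #6
Cross-Modal Retrieval            COCO 2014      PTP-BLIP (14M)  Image-to-text R@1   81.5   #7
Cross-Modal Retrieval            COCO 2014      PTP-BLIP (14M)  Image-to-text R@10  97.9   #7
Cross-Modal Retrieval            COCO 2014      PTP-BLIP (14M)  Image-to-text R@5   95.9   #5
Cross-Modal Retrieval            COCO 2014      PTP-BLIP (14M)  Text-to-image R@1   64.9   #7
Cross-Modal Retrieval            COCO 2014      PTP-BLIP (14M)  Text-to-image R@10  92.2   #4
Cross-Modal Retrieval            COCO 2014      PTP-BLIP (14M)  Text-to-image R@5   87.4   #4
Image Captioning                 COCO Captions  PTP-BLIP (14M)  BLEU-4              40.1   #19
Image Captioning                 COCO Captions  PTP-BLIP (14M)  METEOR              30.4   #14
Image Captioning                 COCO Captions  PTP-BLIP (14M)  CIDEr               135.0  #22
Image Captioning                 COCO Captions  PTP-BLIP (14M)  SPICE               23.7   #17
Zero-Shot Cross-Modal Retrieval  Flickr30k      PTP-BLIP (14M)  Image-to-text R@1   87.1   #13
Zero-Shot Cross-Modal Retrieval  Flickr30k      PTP-BLIP (14M)  Image-to-text R@5   98.4   #14
Zero-Shot Cross-Modal Retrieval  Flickr30k      PTP-BLIP (14M)  Image-to-text R@10  99.3   #12
Zero-Shot Cross-Modal Retrieval  Flickr30k      PTP-BLIP (14M)  Text-to-image R@1   73.1   #13
Zero-Shot Cross-Modal Retrieval  Flickr30k      PTP-BLIP (14M)  Text-to-image R@5   91.0   #14
Zero-Shot Cross-Modal Retrieval  Flickr30k      PTP-BLIP (14M)  Text-to-image R@10  94.8   #13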

Methods