Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a new visually grounded pre-training task, prompting entity modeling (PEM), which aims to learn fine-grained region-entity alignment. To achieve this, we first introduce an entity prompter module, which is trained with VTC to produce the similarity between a video crop and text prompts instantiated with entity names. The PEM task then asks the model to predict the entity pseudo-labels (i.e., normalized similarity scores) for randomly selected video crops. The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Our code and pre-trained models are available at
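
The two objectives named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes VTC is a symmetric InfoNCE-style loss over a batch of paired video and text embeddings, and that PEM is a cross-entropy against softmax-normalized prompter similarities. The function names, the temperature values, and the choice to sum the two VTC directions are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def vtc_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss (assumed InfoNCE form).

    video_embs[i] and text_embs[i] are a matched pair; every other
    combination in the batch is treated as a negative.
    The temperature default is an illustrative assumption.
    """
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    n = len(v)
    # sim[i][j]: scaled cosine similarity between video i and text j.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]
    video_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_video = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return video_to_text + text_to_video


def entity_pseudo_labels(prompter_similarities, temperature=1.0):
    """Normalize crop-to-entity-prompt similarities into soft labels."""
    scaled = [s / temperature for s in prompter_similarities]
    return [math.exp(lp) for lp in log_softmax(scaled)]


def pem_loss(predicted_logits, pseudo_labels):
    """Cross-entropy between predicted entity logits and the soft pseudo-labels."""
    log_probs = log_softmax(predicted_logits)
    return -sum(q * lp for q, lp in zip(pseudo_labels, log_probs))


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print("VTC loss, aligned batch:", vtc_loss(videos, texts))
    labels = entity_pseudo_labels([2.0, 0.5, -1.0])
    print("PEM loss:", pem_loss([1.5, 0.2, -0.7], labels))
```

In this sketch the VTC loss is small when each video is most similar to its own caption, and the PEM loss is smallest when the predicted distribution matches the prompter's soft labels, which is the behavior the abstract describes at a high level.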

CVPR 2022

Results from the Paper

Task                             Dataset    Model  Metric Name                Metric Value  Global Rank
Zero-Shot Video Retrieval        DiDeMo     ALPRO  text-to-video R@1          23.8          # 14
                                                   text-to-video R@5          47.3          # 14
                                                   text-to-video R@10         57.9          # 15
                                                   text-to-video Median Rank  6             # 4
Video Retrieval                  DiDeMo     ALPRO  text-to-video R@1          35.9          # 28
                                                   text-to-video R@5          67.5          # 28
                                                   text-to-video R@10         78.8          # 27
                                                   text-to-video Median Rank  3             # 16
Zero-Shot Video Retrieval        MSR-VTT    ALPRO  text-to-video R@1          24.1          # 20
                                                   text-to-video R@5          44.7          # 19
                                                   text-to-video R@10         55.4          # 19
                                                   text-to-video Median Rank  8             # 7
Visual Question Answering (VQA)  MSRVTT-QA  ALPRO  Accuracy                   0.421         # 20
Visual Question Answering (VQA)  MSVD-QA    ALPRO  Accuracy                   0.459         # 26
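
The retrieval metrics reported above (R@K and median rank) are conventionally derived from a query-by-candidate similarity matrix. The following is a minimal sketch of that computation, not the evaluation code used for these results. It assumes that row i holds the scores of text query i against every video, that the correct video for query i sits at index i, and that ties are resolved in the query's favor; the function names are hypothetical.

```python
from statistics import median


def retrieval_ranks(sim):
    """Return the 1-based rank of the correct video for each text query.

    sim[i][j] is the similarity of text query i to video j.
    Assumes the ground-truth video for query i is at index i.
    Ties are counted optimistically (only strictly higher scores outrank).
    """
    ranks = []
    for i, row in enumerate(sim):
        correct_score = row[i]
        ranks.append(1 + sum(1 for score in row if score > correct_score))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video across all queries."""
    return median(ranks)


if __name__ == "__main__":
    # Toy 3x3 similarity matrix (illustrative values only).
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.7, 0.4],
    ]
    ranks = retrieval_ranks(sim)
    print("ranks:", ranks)
    print("R@1:", recall_at_k(ranks, 1))
    print("median rank:", median_rank(ranks))
```

Higher R@K is better, while a lower median rank is better, which is why the two kinds of rows in the table move in opposite directions as a model improves.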

