TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Retrieval	DiDeMo	ALPRO	text-to-video R@1	23.8	# 19
Zero-Shot Video Retrieval	DiDeMo	ALPRO	text-to-video R@5	47.3	# 21
Zero-Shot Video Retrieval	DiDeMo	ALPRO	text-to-video R@10	57.9	# 23
Zero-Shot Video Retrieval	DiDeMo	ALPRO	text-to-video Median Rank	6	# 6
Video Retrieval	DiDeMo	ALPRO	text-to-video R@1	35.9	# 34
Video Retrieval	DiDeMo	ALPRO	text-to-video R@5	67.5	# 32
Video Retrieval	DiDeMo	ALPRO	text-to-video R@10	78.8	# 31
Video Retrieval	DiDeMo	ALPRO	text-to-video Median Rank	3	# 17
Zero-Shot Video Retrieval	MSR-VTT	ALPRO	text-to-video R@1	24.1	# 25
Zero-Shot Video Retrieval	MSR-VTT	ALPRO	text-to-video R@5	44.7	# 25
Zero-Shot Video Retrieval	MSR-VTT	ALPRO	text-to-video R@10	55.4	# 25
Zero-Shot Video Retrieval	MSR-VTT	ALPRO	text-to-video Median Rank	8	# 8
Visual Question Answering (VQA)	MSRVTT-QA	ALPRO	Accuracy	0.421	# 20
Visual Question Answering (VQA)	MSVD-QA	ALPRO	Accuracy	0.459	# 27
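
The retrieval rows above report R@1, R@5, R@10 and Median Rank. As a minimal sketch of how such numbers are conventionally computed, the function below assumes a square text-to-video similarity matrix in which query i's correct video is video i. The function name `retrieval_metrics` and the matrix layout are illustrative assumptions, not taken from the ALPRO code base.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K (in percent) and median rank for text-to-video retrieval.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video: 1 + number of videos scored strictly higher.
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)

    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    return metrics


if __name__ == "__main__":
    # Toy 3x3 example: queries 0 and 2 retrieve their video first, query 1 does not.
    toy_sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.5],
        [0.1, 0.2, 0.7],
    ]
    print(retrieval_metrics(toy_sim, ks=(1, 2)))
```

Higher R@K is better, while a lower Median Rank is better, which is why the two kinds of rows in the table are ranked in opposite directions.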

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

CVPR 2022 · Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C. H. Hoi

Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a new visually-grounded pre-training task, prompting entity modeling (PEM), which aims to learn fine-grained region-entity alignment. To achieve this, we first introduce an entity prompter module, which is trained with VTC to produce the similarity between a video crop and text prompts instantiated with entity names. The PEM task then asks the model to predict the entity pseudo-labels (i.e., normalized similarity scores) for randomly selected video crops. The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Our code and pre-trained models are available at https://github.com/salesforce/ALPRO.
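
To make the two objectives in the abstract more concrete, here is a minimal pure-Python sketch under stated assumptions. It treats VTC as a symmetric contrastive loss over a batch of paired video and text embeddings, and treats the PEM pseudo-labels as a softmax over crop-to-prompt similarities. The function names, the fixed temperature of 0.07, and the use of cosine similarity are illustrative assumptions. The official implementation at https://github.com/salesforce/ALPRO may differ in all of these details.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(scores):
    # Numerically stable log-softmax.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def softmax(scores):
    return [math.exp(s) for s in log_softmax(scores)]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


def vtc_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th video and i-th text form the positive pair.

    The temperature value is an assumption made for this sketch.
    """
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    n = len(v)
    total = 0.0
    for i in range(n):
        one_hot = [1.0 if j == i else 0.0 for j in range(n)]
        video_to_text = [dot(v[i], t[j]) / temperature for j in range(n)]
        text_to_video = [dot(t[i], v[j]) / temperature for j in range(n)]
        total += cross_entropy(one_hot, video_to_text)
        total += cross_entropy(one_hot, text_to_video)
    return total / (2 * n)


def entity_pseudo_labels(crop_emb, prompt_embs, temperature=0.07):
    """Normalized similarity scores between one video crop and the entity prompts."""
    c = l2_normalize(crop_emb)
    sims = [dot(c, l2_normalize(p)) / temperature for p in prompt_embs]
    return softmax(sims)


def pem_loss(pseudo_labels, predicted_logits):
    """Soft-label cross-entropy between pseudo-labels and the model's entity prediction."""
    return cross_entropy(pseudo_labels, predicted_logits)
```

In this sketch the pseudo-labels are soft targets rather than hard classes, so the PEM loss is smallest when the predicted distribution matches the normalized similarity scores.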

Code

salesforce/alpro official

183

Tasks

Entity Alignment

Retrieval

Video Retrieval

Visual Question Answering (VQA)

Zero-Shot Video Retrieval

Datasets

MS COCO

MSR-VTT

MSVD

HowTo100M

DiDeMo

WebVid

MSRVTT-QA

MSVD-QA

Results from the Paper

Ranked #19 on Zero-Shot Video Retrieval on DiDeMo
