Bridging Video-text Retrieval with Multiple Choice Questions

Pre-training a model to learn transferable video-text representations for retrieval has attracted considerable attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to interact videos with texts, but is inefficient since each text-video pair must be fed through the model. In this work, we enable fine-grained video-text interactions while maintaining high retrieval efficiency via a novel pretext task, dubbed Multiple Choice Questions (MCQ), in which a parametric module, BridgeFormer, is trained to answer the "questions" constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. Through this question-answering formulation, the semantic associations between local video and text features can be properly established. BridgeFormer can be removed for downstream retrieval, yielding an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets under different experimental setups (i.e., zero-shot and fine-tuning), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly surpasses its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
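The MCQ pretext task can be pictured concretely. Below is a minimal PyTorch-style sketch of the idea, not the authors' released code: the `BridgeMCQ` wrapper is a hypothetical name, pre-extracted video and text token features are assumed as inputs, and standard cross-attention blocks stand in for BridgeFormer; the text feature of the erased noun/verb phrase serves as the contrastive "answer" target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgeMCQ(nn.Module):
    """Illustrative sketch of the MCQ pretext task (hypothetical, not the
    authors' code). A noun or verb phrase is erased from the caption to
    form a 'question'; the bridge module must recover ('answer') it by
    attending to video tokens."""

    def __init__(self, dim=256, num_heads=8, num_layers=3, temperature=0.05):
        super().__init__()
        # Cross-attention blocks standing in for BridgeFormer: question
        # tokens act as queries, video tokens as keys/values.
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.bridge = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.temperature = temperature

    def forward(self, video_tokens, question_tokens, answer_emb):
        # video_tokens:    (B, Nv, D) token features from the video encoder
        # question_tokens: (B, Nq, D) caption features with a noun/verb erased
        # answer_emb:      (B, D)     text feature of the erased phrase
        fused = self.bridge(tgt=question_tokens, memory=video_tokens)
        pred = F.normalize(fused.mean(dim=1), dim=-1)       # pooled 'answer'
        ans = F.normalize(answer_emb, dim=-1)
        # Contrastive objective: the predicted answer should match the
        # erased phrase of its own caption, not those of other samples.
        logits = pred @ ans.t() / self.temperature          # (B, B)
        target = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, target)

# Toy usage with random features in place of real encoder outputs.
mcq = BridgeMCQ()
loss = mcq(torch.randn(4, 32, 256), torch.randn(4, 12, 256), torch.randn(4, 256))
```

Because the bridge module is consumed only by this pretext loss, it can be discarded after pre-training, leaving two independent encoders whose embeddings can be pre-computed and compared with a single dot product at retrieval time.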

CVPR 2022

Results from the Paper


Task                       Dataset      Model                     Metric                     Value  Global Rank
Zero-Shot Video Retrieval  DiDeMo       Y. Ge et al.              text-to-video R@1           25.6  #18
Zero-Shot Video Retrieval  DiDeMo       Y. Ge et al.              text-to-video R@5           50.6  #17
Zero-Shot Video Retrieval  DiDeMo       Y. Ge et al.              text-to-video R@10          61.1  #18
Zero-Shot Video Retrieval  DiDeMo       Y. Ge et al.              text-to-video Median Rank    5.0  #4
Zero-Shot Video Retrieval  LSMDC        Y. Ge et al.              text-to-video R@1           12.2  #12
Zero-Shot Video Retrieval  LSMDC        Y. Ge et al.              text-to-video R@5           25.9  #12
Zero-Shot Video Retrieval  LSMDC        Y. Ge et al.              text-to-video R@10          32.2  #12
Zero-Shot Video Retrieval  LSMDC        Y. Ge et al.              text-to-video Median Rank   42.0  #3
Zero-Shot Video Retrieval  MSR-VTT      Y. Ge et al.              text-to-video R@1           26.0  #22
Zero-Shot Video Retrieval  MSR-VTT      Y. Ge et al.              text-to-video R@5           46.4  #24
Zero-Shot Video Retrieval  MSR-VTT      Y. Ge et al.              text-to-video R@10          56.4  #23
Zero-Shot Video Retrieval  MSR-VTT      Y. Ge et al.              text-to-video Median Rank    7.0  #5
Video Retrieval            MSR-VTT-1kA  BridgeFormer              text-to-video R@1           37.6  #40
Video Retrieval            MSR-VTT-1kA  BridgeFormer              text-to-video R@5           64.8  #39
Video Retrieval            MSR-VTT-1kA  BridgeFormer              text-to-video R@10          75.1  #42
Video Retrieval            MSR-VTT-1kA  BridgeFormer              text-to-video Median Rank    3    #24
Video Retrieval            MSR-VTT-1kA  BridgeFormer (Zero-shot)  text-to-video R@1           26.0  #51
Video Retrieval            MSR-VTT-1kA  BridgeFormer (Zero-shot)  text-to-video R@5           46.4  #52
Video Retrieval            MSR-VTT-1kA  BridgeFormer (Zero-shot)  text-to-video R@10          56.4  #55
Video Retrieval            MSR-VTT-1kA  BridgeFormer (Zero-shot)  text-to-video Median Rank    7    #35
Zero-Shot Video Retrieval  MSVD         Y. Ge et al.              text-to-video R@1           43.6  #8
Zero-Shot Video Retrieval  MSVD         Y. Ge et al.              text-to-video R@5           74.9  #8
Zero-Shot Video Retrieval  MSVD         Y. Ge et al.              text-to-video R@10          84.9  #7
Zero-Shot Video Retrieval  MSVD         Y. Ge et al.              text-to-video Median Rank    2.0  #3
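For reference, the metrics above follow the standard retrieval definitions. The sketch below is our own helper, not tied to this paper's code: it computes R@1/R@5/R@10 and Median Rank from a text-video similarity matrix, assuming each text query has exactly one matching video (as in MSR-VTT-1kA-style evaluation), and also shows why the dual-encoder design is efficient, since video embeddings can be pre-computed once and scored with a single matrix multiplication.

```python
import torch
import torch.nn.functional as F

def retrieval_metrics(sim):
    """R@1/5/10 and median rank from a (num_texts, num_videos) similarity
    matrix, assuming text i matches video i (one ground truth per query)."""
    # Rank of the ground-truth video for each text query (1-indexed).
    order = sim.argsort(dim=1, descending=True)
    gt = torch.arange(sim.size(0)).unsqueeze(1)
    ranks = (order == gt).nonzero()[:, 1] + 1
    return {
        "R@1": (ranks <= 1).float().mean().item() * 100,
        "R@5": (ranks <= 5).float().mean().item() * 100,
        "R@10": (ranks <= 10).float().mean().item() * 100,
        "MedR": ranks.float().median().item(),
    }

# Dual-encoder retrieval: video embeddings are pre-computed offline, so a
# new text query is scored against the whole gallery in one matmul.
text_emb = F.normalize(torch.randn(1000, 256), dim=-1)
video_emb = F.normalize(torch.randn(1000, 256), dim=-1)
print(retrieval_metrics(text_emb @ video_emb.t()))
```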

Methods


No methods listed for this paper.