TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	DiDeMo	MuLTI	text-to-video R@1	56.5	# 12
Video Retrieval	DiDeMo	MuLTI	text-to-video R@5	80.2	# 11
Video Retrieval	DiDeMo	MuLTI	text-to-video R@10	87.0	# 12
Video Retrieval	MSR-VTT-1kA	MuLTI	text-to-video R@1	54.7	# 6
Video Retrieval	MSR-VTT-1kA	MuLTI	text-to-video R@5	77.7	# 10
Video Retrieval	MSR-VTT-1kA	MuLTI	text-to-video R@10	86.0	# 10
Visual Question Answering (VQA)	MSRVTT-QA	MuLTI	Accuracy	47.8	# 4
Visual Question Answering (VQA)	MSVD-QA	MuLTI	Accuracy	54.7	# 15
TGIF-Frame	TGIF-QA	MuLTI	Accuracy	75.6	# 5
TGIF-Action	TGIF-QA	MuLTI	Accuracy	97.9	# 1
TGIF-Transition	TGIF-QA	MuLTI	Accuracy	99.1	# 1
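
The retrieval rows above report text-to-video R@K, the percentage of text queries whose matching video appears among the top K ranked results. The sketch below shows one common way to compute it from a query-by-video similarity matrix. It is an illustration only: the function name `recall_at_k`, the toy scores, and the assumption that query i is paired with video i are not taken from the paper.

```python
def recall_at_k(similarity, k):
    """Return text-to-video R@K as a percentage.

    similarity[i][j] is the score of text query i against video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the ground-truth video = number of videos scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy similarity matrix (made-up numbers): 3 text queries x 3 videos.
sim = [
    [0.9, 0.1, 0.3],  # ground truth ranked 1st
    [0.8, 0.5, 0.6],  # ground truth ranked 3rd
    [0.2, 0.7, 0.4],  # ground truth ranked 2nd
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(sim, k):.1f}")
```

On real benchmarks the similarity matrix would come from the model's text and video embeddings, and ties would need an explicit policy; this sketch counts only strictly higher scores against the ground truth.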

MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling

10 Mar 2023 · Jiaqi Xu, Bo Liu, Yunkuo Chen, Mengli Cheng, Xing Shi

Video-and-language understanding has a variety of industrial applications, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which incur high computational costs. In particular, they have difficulty handling the dense video frames and long text that are prevalent in industrial applications. This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and rapid adaptation to downstream tasks. Specifically, we design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules to sample long sequences and fuse multi-modal features, which reduces computational costs and addresses the performance degradation caused by previous samplers. As a result, MuLTI can handle longer sequences with limited computational costs. Then, to further enhance the model's performance and address the lack of pretraining tasks for video question answering, we propose a new pretraining task named Multiple Choice Modeling. This task bridges the gap between pretraining and downstream tasks and improves the model's ability to align video and text features. Benefiting from the efficient feature fusion module and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released.
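
The abstract describes condensing a long sequence through pooling with a residual mapping plus attention. Since no official implementation is listed here, the following is only a rough sketch of that general idea, not the authors' sampler: it pools a long feature sequence to a fixed length, lets the pooled vectors attend over the full sequence, and adds the pooled vectors back as a residual. It omits text guidance, learned projections, and multiple heads, and every function name (`adaptive_avg_pool`, `attend`, `sample_sequence`) is hypothetical.

```python
import math


def adaptive_avg_pool(seq, out_len):
    """Average-pool a list of feature vectors down to out_len vectors."""
    n, dim = len(seq), len(seq[0])
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -(-((i + 1) * n) // out_len)  # ceiling division
        window = seq[start:end]
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim) for kv in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)])
    return out


def sample_sequence(seq, out_len):
    """Condense seq to out_len vectors: pooled queries + attention, with a residual."""
    condensed = adaptive_avg_pool(seq, out_len)
    attended = attend(condensed, seq)
    # Residual connection: the pooled features are added to the attention output.
    return [[c + a for c, a in zip(cv, av)] for cv, av in zip(condensed, attended)]


# Toy input (made-up values): 4 frame features of dimension 2, condensed to 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
condensed = sample_sequence(frames, 2)
print(len(condensed), len(condensed[0]))
```

The point of the sketch is the cost argument: later fusion layers operate on `out_len` vectors rather than the full sequence, which is how a sampler of this kind can keep computation bounded as the input grows.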

Code

No code implementations yet.

Tasks

Multi-Label Classification

Multiple-choice

Question Answering

Retrieval

TGIF-Action

TGIF-Frame

TGIF-Transition

Video Question Answering

Video Retrieval

Visual Question Answering (VQA)

Datasets

MSR-VTT

DiDeMo

TGIF-QA

MSRVTT-QA

MSVD-QA

Results from the Paper

Ranked #1 on TGIF-Transition on TGIF-QA (using extra training data)

Methods

Adapter • ALIGN
