TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	MSR-VTT-1kA	All-in-one-B	text-to-video R@1	37.9	# 38
Video Retrieval	MSR-VTT-1kA	All-in-one-B	text-to-video R@5	68.1	# 36
Video Retrieval	MSR-VTT-1kA	All-in-one-B	text-to-video R@10	77.1	# 39
Visual Question Answering (VQA)	MSRVTT-QA	All-in-one-B	Accuracy	0.443	# 17
Visual Question Answering (VQA)	MSVD-QA	All-in-one-B	Accuracy	0.483	# 24
Video Question Answering	STAR Benchmark	All-in-one	Average Accuracy	47.5	# 8
TGIF-Frame	TGIF-QA	All-in-one-B	Accuracy	64.2	# 15
TGIF-Action	TGIF-QA	All-in-one-B	Accuracy	92.7	# 6
TGIF-Transition	TGIF-QA	All-in-one-B	Accuracy	94.3	# 6

All in One: Exploring Unified Video-Language Pre-training

CVPR 2023 · Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

Mainstream Video-Language Pre-training models (e.g., ActBERT, ClipBERT, VIOLET) consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by using heavier unimodal encoders or multimodal fusion Transformers, which increases the parameter count and lowers efficiency in downstream tasks. In this work, we introduce, for the first time, an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture. We argue that the unique temporal information of video data is a key barrier to designing a modality-agnostic Transformer. To overcome this challenge, we introduce a novel and effective token rolling operation that encodes temporal representations from video clips in a non-parametric manner. This careful design enables representation learning for both video-text multimodal inputs and unimodal inputs using a unified backbone model. After fine-tuning, our pre-trained all-in-one Transformer is transferred to various downstream video-text tasks, including text-video retrieval, video question answering, multiple choice, and visual commonsense reasoning. State-of-the-art performance with minimal model FLOPs on nine datasets demonstrates the superiority of our method over competitive counterparts. The code and pretrained model have been released at https://github.com/showlab/all-in-one.
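The abstract does not spell out how token rolling works, so the following is only a minimal sketch of one plausible reading: a fixed fraction of each frame's tokens is cyclically shifted to the neighbouring frame along the time axis, which mixes temporal information without adding any learnable parameters. The function name `roll_tokens`, the parameters `ratio` and `shift`, and the choice to roll the first token slots are all assumptions made for illustration and are not taken from the released code.

```python
from typing import List

Token = List[float]
Frame = List[Token]   # tokens of one frame
Clip = List[Frame]    # frames of one clip


def roll_tokens(clip: Clip, ratio: float = 0.25, shift: int = 1) -> Clip:
    """Cyclically shift the first `ratio` of each frame's tokens along time.

    Hypothetical illustration only: `ratio` and `shift` are assumed knobs.
    """
    num_frames = len(clip)
    if num_frames == 0:
        return []
    num_tokens = len(clip[0])
    k = int(num_tokens * ratio)  # number of token slots that get rolled
    rolled: Clip = []
    for t in range(num_frames):
        src = (t - shift) % num_frames  # frame that donates its first k tokens
        frame = [list(tok) for tok in clip[src][:k]]  # borrowed from neighbour
        frame += [list(tok) for tok in clip[t][k:]]   # kept from current frame
        rolled.append(frame)
    return rolled


# Toy clip: 3 frames, 4 tokens per frame, 1-dim tokens valued 10 * frame + slot.
clip = [[[float(10 * t + n)] for n in range(4)] for t in range(3)]
rolled = roll_tokens(clip, ratio=0.5, shift=1)
```

In this toy run, each output frame keeps its last two tokens and takes its first two from the previous frame (wrapping around at the start), so the output has the same shape as the input and no weights are involved. In a tensor library the same idea could be expressed with a roll along the frame axis on a slice of the tokens.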

Code

showlab/all-in-one official

272

Tasks

Language Modelling

Multiple-choice

Question Answering

Representation Learning

Retrieval

TGIF-Action

TGIF-Frame

TGIF-Transition

Video Question Answering

Video Retrieval

Visual Commonsense Reasoning

Visual Question Answering (VQA)

Datasets

MS COCO

Visual Genome

MSR-VTT

Conceptual Captions

HowTo100M

DiDeMo

WebVid

VCR

TGIF-QA

MSRVTT-QA

MSVD-QA

STAR Benchmark

Results from the Paper

Ranked #6 on TGIF-Transition on TGIF-QA (using extra training data)

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
