TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	DiDeMo	VIOLETv2	text-to-video R@1	47.9	# 27
Video Retrieval	DiDeMo	VIOLETv2	text-to-video R@5	76.5	# 23
Video Retrieval	DiDeMo	VIOLETv2	text-to-video R@10	84.1	# 24
Fill Mask	LSMDC	VIOLETv2	Accuracy	56.9	# 2
Video Retrieval	LSMDC	VIOLETv2	text-to-video R@1	24	# 21
Video Retrieval	LSMDC	VIOLETv2	text-to-video R@5	43.5	# 16
Video Retrieval	LSMDC	VIOLETv2	text-to-video R@10	54.1	# 14
Video Question Answering	LSMDC-MC	VIOLETv2	Accuracy	84.4	# 1
Video Captioning	MSR-VTT	VIOLETv2	CIDEr	58	# 17
Video Retrieval	MSR-VTT	VIOLETv2	text-to-video R@1	37.2	# 15
Video Retrieval	MSR-VTT	VIOLETv2	text-to-video R@5	64.8	# 13
Video Retrieval	MSR-VTT	VIOLETv2	text-to-video R@10	75.8	# 14
Video Question Answering	MSRVTT-MC	VIOLETv2	Accuracy	97.6	# 1
Video Question Answering	MSRVTT-QA	VIOLETv2	Accuracy	44.5	# 11
Video Captioning	MSVD	VIOLETv2	CIDEr	139.2	# 8
Visual Question Answering (VQA)	MSVD-QA	VIOLETv2	Accuracy	54.7	# 15
TGIF-Frame	TGIF-QA	VIOLETv2	Accuracy	72.8	# 8
TGIF-Action	TGIF-QA	VIOLETv2	Accuracy	94.8	# 5
TGIF-Transition	TGIF-QA	VIOLETv2	Accuracy	99	# 2
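
The text-to-video R@1, R@5, and R@10 rows above are recall-at-K scores: the percentage of text queries for which the correct video appears among the top K retrieved videos. A minimal sketch of that computation follows, in plain Python. It assumes a square similarity matrix in which the ground-truth video for query i is video i; that layout, the function name `recall_at_k`, and the toy scores are illustrative assumptions, not taken from the paper or its code.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: number of videos scored strictly higher
        # than the ground-truth video for this query.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (hypothetical scores, not from the paper).
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct video ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(toy_similarity, k):.1f}")
```

Because the top-K set only grows with K, R@1 can never exceed R@5, and R@5 can never exceed R@10, which matches the ordering in every retrieval row of the table.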

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/video-question-answering-on-lsmdc-mc)](https://paperswithcode.com/sota/video-question-answering-on-lsmdc-mc?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/video-question-answering-on-msrvtt-mc)](https://paperswithcode.com/sota/video-question-answering-on-msrvtt-mc?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/fill-mask-on-lsmdc)](https://paperswithcode.com/sota/fill-mask-on-lsmdc?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/tgif-transition-on-tgif-qa)](https://paperswithcode.com/sota/tgif-transition-on-tgif-qa?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/tgif-action-on-tgif-qa)](https://paperswithcode.com/sota/tgif-action-on-tgif-qa?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/video-captioning-on-msvd-1)](https://paperswithcode.com/sota/video-captioning-on-msvd-1?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/tgif-frame-on-tgif-qa)](https://paperswithcode.com/sota/tgif-frame-on-tgif-qa?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/video-question-answering-on-msrvtt-qa)](https://paperswithcode.com/sota/video-question-answering-on-msrvtt-qa?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/visual-question-answering-on-msvd-qa-1)](https://paperswithcode.com/sota/visual-question-answering-on-msvd-qa-1?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/video-captioning-on-msr-vtt-1)](https://paperswithcode.com/sota/video-captioning-on-msr-vtt-1?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/video-retrieval-on-lsmdc?p=an-empirical-study-of-end-to-end-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-empirical-study-of-end-to-end-video/video-retrieval-on-didemo)](https://paperswithcode.com/sota/video-retrieval-on-didemo?p=an-empirical-study-of-end-to-end-video)`

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

CVPR 2023 · Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu

Masked visual modeling (MVM) has recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit the downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model VIOLETv2. Empirically, we show VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
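
As a rough illustration of what a masked visual modeling objective looks like, the sketch below masks a random subset of patches and scores a prediction only on those masked patches. It is a generic toy in plain Python, not the authors' implementation: the pixel-value target is one of the eight targets the abstract lists, but the L1 loss, the 0.5 mask ratio, the helper names `sample_mask` and `masked_patch_l1_loss`, and the constant stand-in "model" are all assumptions made for illustration.

```python
import random


def sample_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans marking which patches are masked."""
    num_masked = round(mask_ratio * num_patches)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_patch_l1_loss(patches, predictions, mask):
    """Mean absolute error computed over masked patches only."""
    total, count = 0.0, 0
    for patch, pred, is_masked in zip(patches, predictions, mask):
        if not is_masked:
            continue  # visible patches contribute no loss
        for target, value in zip(patch, pred):
            total += abs(target - value)
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)
# Four toy "patches" of two pixel values each (hypothetical data).
patches = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]
mask = sample_mask(len(patches), 0.5, rng)
# Stand-in for a model: predicts 0.5 for every pixel.
predictions = [[0.5] * len(patch) for patch in patches]
print(masked_patch_l1_loss(patches, predictions, mask))
```

In an end-to-end setting such as the one the abstract describes, the same kind of loss would be differentiated with respect to the model parameters; swapping the pixel target for another target (for example depth maps or optical flow) changes only what `patches` contains.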

Code

tsujuifu/pytorch_empirical-mvm official

Tasks

Fill Mask

Optical Flow Estimation

Question Answering

Retrieval

Text to Video Retrieval

TGIF-Action

TGIF-Frame

TGIF-Transition

Video Captioning

Video Question Answering

Video Retrieval

Visual Question Answering (VQA)

Datasets

ImageNet

Kinetics

MSR-VTT

MSVD

DiDeMo

WebVid

LSMDC

TGIF-QA

MSRVTT-QA

MSVD-QA

MSRVTT-MC

Results from the Paper

Ranked #1 on Video Question Answering on LSMDC-MC

Methods

Absolute Position Encodings • Adam • BASE • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
