TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	DiDeMo	VLAB	text-to-video R@1	56.8	# 10
Video Retrieval	DiDeMo	VLAB	text-to-video R@5	81.6	# 9
Video Retrieval	DiDeMo	VLAB	text-to-video R@10	88.7	# 9
Video Retrieval	MSR-VTT	VLAB	text-to-video R@1	55.1	# 8
Video Retrieval	MSR-VTT	VLAB	text-to-video R@5	78.8	# 5
Video Retrieval	MSR-VTT	VLAB	text-to-video R@10	87.6	# 3
Video Captioning	MSR-VTT	VLAB	CIDEr	74.9	# 4
Video Captioning	MSR-VTT	VLAB	METEOR	33.4	# 3
Video Captioning	MSR-VTT	VLAB	ROUGE-L	68.3	# 2
Video Captioning	MSR-VTT	VLAB	BLEU-4	54.6	# 4
Visual Question Answering (VQA)	MSRVTT-QA	VLAB	Accuracy	49.6	# 1
Video Captioning	MSVD	VLAB	CIDEr	179.8	# 2
Video Captioning	MSVD	VLAB	BLEU-4	79.3	# 2
Video Captioning	MSVD	VLAB	METEOR	51.2	# 1
Video Captioning	MSVD	VLAB	ROUGE-L	87.9	# 1
Video Retrieval	MSVD	VLAB	text-to-video R@1	57.5	# 6
Video Retrieval	MSVD	VLAB	text-to-video R@5	83.6	# 4
Video Retrieval	MSVD	VLAB	text-to-video R@10	89.9	# 3
Visual Question Answering (VQA)	MSVD-QA	VLAB	Accuracy	61.0	# 1
TGIF-Frame	TGIF-QA	VLAB	Accuracy	79.0	# 3
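
The text-to-video R@1, R@5, and R@10 rows above report Recall@K. As a rough illustration of how such a number is typically computed, here is a minimal sketch. It assumes a square text-by-video similarity matrix in which the correct video for text `i` is video `i`; the function name `recall_at_k` and the toy matrix are hypothetical and are not taken from the VLAB paper or its evaluation code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    Assumption: similarity[i][j] scores text i against video j, and the
    ground-truth video for text i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 example (made-up scores, for illustration only).
sim = [
    [0.9, 0.1, 0.3],  # text 0: correct video ranked first
    [0.8, 0.7, 0.2],  # text 1: correct video ranked second
    [0.1, 0.6, 0.5],  # text 2: correct video ranked second
]
print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

Ties are resolved in favor of the correct video here; benchmark implementations may handle ties and multiple captions per video differently.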

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

22 May 2023 · Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, Jiashi Feng

Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video-text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on the MSRVTT, MSVD, and TGIF datasets, achieving accuracies of 49.6, 61.0, and 79.0, respectively. Code and models will be released.
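
For readers unfamiliar with the contrastive pre-training the abstract refers to, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired embeddings. It illustrates the general technique only and is not VLAB's training objective, which the abstract does not specify. The function names, the plain-list similarity matrix, and the temperature value of 0.07 are assumptions made for this example.

```python
import math
from typing import List, Sequence


def _diagonal_cross_entropy(logits: Sequence[Sequence[float]]) -> float:
    """Mean cross-entropy where the target class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(
    similarity: Sequence[Sequence[float]], temperature: float = 0.07
) -> float:
    """Average of the text-to-video and video-to-text cross-entropy losses.

    Assumption: similarity[i][j] is the similarity between text i and
    video j, and matched pairs lie on the diagonal.
    """
    logits: List[List[float]] = [[s / temperature for s in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


# Matched pairs scoring high give a lower loss than mismatched ones.
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(aligned) < symmetric_contrastive_loss(misaligned))
```

In practice the similarity matrix would come from normalized text and video embeddings computed by the model, and the temperature is often a learned parameter rather than a constant.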

Tasks

Question Answering

Retrieval

Text Retrieval

TGIF-Frame

Video Captioning

Video Question Answering

Video Retrieval

Video-Text Retrieval

Visual Question Answering (VQA)

Datasets

MSR-VTT

MSVD

DiDeMo

WebVid

CC12M

TGIF-QA

MSRVTT-QA

MSVD-QA

Results from the Paper

Ranked #1 on Visual Question Answering (VQA) on MSVD-QA (using extra training data)

Methods

Adapter • CLIP