TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Retrieval	ActivityNet	VideoCoCa	text-to-video R@1	34.5	# 8
Zero-Shot Video Retrieval	ActivityNet	VideoCoCa	video-to-text R@1	33.0	# 7
Zero-Shot Video Retrieval	ActivityNet	VideoCoCa	text-to-video R@10	76.6	# 8
Zero-Shot Video Retrieval	ActivityNet	VideoCoCa	text-to-video R@5	63.2	# 8
Zero-Shot Video Retrieval	ActivityNet	VideoCoCa	video-to-text R@5	61.6	# 7
Zero-Shot Video Retrieval	ActivityNet	VideoCoCa	video-to-text R@10	75.3	# 7
Video Captioning	ActivityNet Captions	VideoCoCa	ROUGE-L	35.0	# 3
Video Captioning	ActivityNet Captions	VideoCoCa	BLEU-4	14.7	# 1
Video Captioning	ActivityNet Captions	VideoCoCa	CIDEr	39.3	# 1
Video Question Answering	ActivityNet-QA	VideoCoCa	Accuracy	56.1	# 3
Zero-Shot Action Recognition	Charades	VideoCoCa	mAP	25.8	# 2
Zero-Shot Action Recognition	HMDB51	VideoCoCa	Top-1 Accuracy	58.7	# 6
Zero-Shot Action Recognition	HMDB51	VideoCoCa	Top-5 Accuracy	84.5	# 1
Video Question Answering	iVQA	VideoCoCa	Accuracy	39.0	# 3
Zero-Shot Action Recognition	Kinetics	VideoCoCa	Top-1 Accuracy	70.1	# 5
Zero-Shot Action Recognition	Kinetics	VideoCoCa	Top-5 Accuracy	88.9	# 4
Video Captioning	MSR-VTT	VideoCoCa	CIDEr	73.2	# 8
Video Captioning	MSR-VTT	VideoCoCa	ROUGE-L	68.0	# 4
Video Captioning	MSR-VTT	VideoCoCa	BLEU-4	53.8	# 6
Video Retrieval	MSR-VTT	VideoCoCa (zero-shot)	text-to-video R@1	34.3	# 17
Video Retrieval	MSR-VTT	VideoCoCa (zero-shot)	text-to-video R@5	57.8	# 20
Video Retrieval	MSR-VTT	VideoCoCa (zero-shot)	text-to-video R@10	67.0	# 21
Video Retrieval	MSR-VTT	VideoCoCa (zero-shot)	video-to-text R@1	64.7	# 1
Video Retrieval	MSR-VTT	VideoCoCa (zero-shot)	video-to-text R@5	85.2	# 2
Video Retrieval	MSR-VTT	VideoCoCa (zero-shot)	video-to-text R@10	91.4	# 2
Zero-Shot Video Retrieval	MSR-VTT-full	VideoCoCa	text-to-video R@1	34.3	# 3
Zero-Shot Video Retrieval	MSR-VTT-full	VideoCoCa	text-to-video R@5	57.8	# 3
Zero-Shot Video Retrieval	MSR-VTT-full	VideoCoCa	text-to-video R@10	67.0	# 3
Zero-Shot Video Retrieval	MSR-VTT-full	VideoCoCa	video-to-text R@1	64.7	# 1
Zero-Shot Video Retrieval	MSR-VTT-full	VideoCoCa	video-to-text R@5	85.2	# 1
Zero-Shot Video Retrieval	MSR-VTT-full	VideoCoCa	video-to-text R@10	91.4	# 1
Visual Question Answering (VQA)	MSRVTT-QA	VideoCoCa	Accuracy	0.463	# 10
Visual Question Answering (VQA)	MSVD-QA	VideoCoCa	Accuracy	0.569	# 8
Zero-Shot Action Recognition	UCF101	VideoCoCa	Top-1 Accuracy	86.6	# 4
Zero-Shot Action Recognition	UCF101	VideoCoCa	Top-5 Accuracy	98.4	# 1
Zero-Shot Video Retrieval	VATEX	VideoCoCa	text-to-video R@1	53.2	# 3
Zero-Shot Video Retrieval	VATEX	VideoCoCa	video-to-text R@1	73.6	# 3
Zero-Shot Video Retrieval	VATEX	VideoCoCa	text-to-video R@5	83.3	# 3
Zero-Shot Video Retrieval	VATEX	VideoCoCa	text-to-video R@10	90.1	# 3
Zero-Shot Video Retrieval	VATEX	VideoCoCa	video-to-text R@5	93.2	# 3
Zero-Shot Video Retrieval	VATEX	VideoCoCa	video-to-text R@10	97.2	# 3
Video Captioning	VATEX	VideoCoCa	BLEU-4	39.7	# 4
Video Captioning	VATEX	VideoCoCa	CIDEr	77.8	# 4
Video Captioning	VATEX	VideoCoCa	ROUGE-L	54.5	# 2
Zero-Shot Video Retrieval	YouCook2	VideoCoCa	text-to-video R@1	20.3	# 3
Zero-Shot Video Retrieval	YouCook2	VideoCoCa	text-to-video R@5	43.0	# 4
Zero-Shot Video Retrieval	YouCook2	VideoCoCa	text-to-video R@10	53.3	# 4
Video Retrieval	YouCook2	VideoCoCa (zero-shot)	text-to-video R@1	21.7	# 9
Video Retrieval	YouCook2	VideoCoCa (zero-shot)	text-to-video R@10	55.2	# 11
Video Retrieval	YouCook2	VideoCoCa (zero-shot)	text-to-video R@5	43.9	# 9
Video Captioning	YouCook2	VideoCoCa	BLEU-4	14.2	# 4
Video Captioning	YouCook2	VideoCoCa	ROUGE-L	37.7	# 7
Video Captioning	YouCook2	VideoCoCa	CIDEr	1.28	# 8

VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

9 Dec 2022 · Shen Yan, Tao Zhu, ZiRui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
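The idea of applying attentional pooling to flattened frame embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, random values in place of trained weights, and hypothetical names and sizes (`pool_video`, `T`, `N`, `D`, one contrastive query, three generative queries). It only shows that a query-based pooler does not depend on how many tokens it receives, so the same queries can be applied to one image or to the concatenated tokens of several frames.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_pool(tokens, queries):
    """Pool a token sequence with query vectors (single head, no projections).

    tokens:  list of n_tokens vectors, each of length d
    queries: list of n_queries vectors, each of length d
    returns: list of n_queries pooled vectors, each of length d
    """
    d = len(tokens[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between the query and every token.
        scores = [sum(qi * ti for qi, ti in zip(q, tok)) / math.sqrt(d)
                  for tok in tokens]
        weights = softmax(scores)
        # Weighted average of the tokens.
        pooled.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                       for j in range(d)])
    return pooled


def flatten_frames(frame_tokens):
    """Concatenate per-frame token lists: (T, N, d) -> (T * N, d)."""
    return [tok for frame in frame_tokens for tok in frame]


def pool_video(frame_tokens, queries):
    """Apply the image-level pooler to the flattened tokens of all frames."""
    return attentional_pool(flatten_frames(frame_tokens), queries)


random.seed(0)
T, N, D = 4, 6, 8  # hypothetical: frames, tokens per frame, embedding size


def rand_vec():
    return [random.gauss(0, 1) for _ in range(D)]


frames = [[rand_vec() for _ in range(N)] for _ in range(T)]
contrastive_query = [rand_vec()]                    # one query -> one video embedding
generative_queries = [rand_vec() for _ in range(3)] # several queries -> decoder inputs

video_embedding = pool_video(frames, contrastive_query)
caption_inputs = pool_video(frames, generative_queries)
print(len(video_embedding), len(caption_inputs), len(caption_inputs[0]))  # prints "1 3 8"
```

Because the pooled output has one vector per query regardless of `T`, nothing downstream of the pooler needs to change when the input switches from a single image to a video, which is the property the abstract relies on.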

Code

No code implementations yet.

Tasks

Question Answering

Retrieval

Text to Video Retrieval

Video Captioning

Video Classification

Video Question Answering

Video Retrieval

Video to Text Retrieval

Visual Question Answering (VQA)

Zero-Shot Action Recognition

Zero-Shot Video Retrieval

Datasets

UCF101

Kinetics

HMDB51

ActivityNet

MSR-VTT

Charades

MSVD

HowTo100M

ActivityNet Captions

YouCook2

VATEX

ActivityNet-QA

MSRVTT-QA

MSVD-QA

iVQA

VideoCC3M

Results from the Paper

Ranked #1 on Video Captioning on ActivityNet Captions (using extra training data)

Methods

No methods listed for this paper.
