VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

9 Dec 2022 · Shen Yan, Tao Zhu, ZiRui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
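
The idea of applying attentional pooling to flattened frame embeddings can be illustrated with a toy sketch. This is not the authors' implementation: it assumes single-head attention with no learned key/value projections, uses hand-picked query vectors in place of learned ones, and the names `flatten_frames` and `attentional_pool` are hypothetical. It only shows that a pooler written for one image's token sequence can consume the concatenated tokens of several frames without any change.

```python
import math

def flatten_frames(video):
    """Concatenate per-frame token lists [T][N][D] into one sequence [T*N][D]."""
    return [token for frame in video for token in frame]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentional_pool(queries, tokens):
    """Toy single-head cross-attention: each query attends over all tokens.

    Simplification (assumption): keys and values are the tokens themselves,
    with no learned projections.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for query in queries:
        scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens]
        weights = softmax(scores)
        pooled.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                       for d in range(dim)])
    return pooled

# Toy video: 2 frames, 3 tokens per frame, 4-dimensional embeddings.
video = [[[float(t + n + d) for d in range(4)] for n in range(3)] for t in range(2)]
tokens = flatten_frames(video)  # 6 tokens, frame boundaries are dropped

# Hypothetical stand-ins for learned queries: one for a contrastive-style
# pooler (a single video embedding), two for a generative-style pooler.
contrastive_query = [[0.1, 0.0, -0.1, 0.2]]
generative_queries = [[0.0] * 4, [0.5] * 4]

video_embedding = attentional_pool(contrastive_query, tokens)
caption_memory = attentional_pool(generative_queries, tokens)
```

Because the pooler only sees a sequence of tokens, the number of frames changes the sequence length but not the pooler's parameters, which is what makes reuse of an image-text model plausible in this setting.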


Results from the Paper


 Ranked #1 on Video Captioning on ActivityNet Captions (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Video Retrieval | ActivityNet | VideoCoCa | text-to-video R@1 | 34.5 | #8 |
| | | | video-to-text R@1 | 33.0 | #7 |
| | | | text-to-video R@10 | 76.6 | #8 |
| | | | text-to-video R@5 | 63.2 | #8 |
| | | | video-to-text R@5 | 61.6 | #7 |
| | | | video-to-text R@10 | 75.3 | #7 |
| Video Captioning | ActivityNet Captions | VideoCoCa | ROUGE-L | 35.0 | #3 |
| | | | BLEU-4 | 14.7 | #1 |
| | | | CIDEr | 39.3 | #1 |
| Video Question Answering | ActivityNet-QA | VideoCoCa | Accuracy | 56.1 | #3 |
| Zero-Shot Action Recognition | Charades | VideoCoCa | mAP | 25.8 | #2 |
| Zero-Shot Action Recognition | HMDB51 | VideoCoCa | Top-1 Accuracy | 58.7 | #6 |
| | | | Top-5 Accuracy | 84.5 | #1 |
| Video Question Answering | iVQA | VideoCoCa | Accuracy | 39.0 | #3 |
| Zero-Shot Action Recognition | Kinetics | VideoCoCa | Top-1 Accuracy | 70.1 | #5 |
| | | | Top-5 Accuracy | 88.9 | #4 |
| Video Captioning | MSR-VTT | VideoCoCa | CIDEr | 73.2 | #8 |
| | | | ROUGE-L | 68.0 | #4 |
| | | | BLEU-4 | 53.8 | #6 |
| Video Retrieval | MSR-VTT | VideoCoCa (zero-shot) | text-to-video R@1 | 34.3 | #17 |
| | | | text-to-video R@5 | 57.8 | #20 |
| | | | text-to-video R@10 | 67.0 | #21 |
| | | | video-to-text R@1 | 64.7 | #1 |
| | | | video-to-text R@5 | 85.2 | #2 |
| | | | video-to-text R@10 | 91.4 | #2 |
| Zero-Shot Video Retrieval | MSR-VTT-full | VideoCoCa | text-to-video R@1 | 34.3 | #3 |
| | | | text-to-video R@5 | 57.8 | #3 |
| | | | text-to-video R@10 | 67.0 | #3 |
| | | | video-to-text R@1 | 64.7 | #1 |
| | | | video-to-text R@5 | 85.2 | #1 |
| | | | video-to-text R@10 | 91.4 | #1 |
| Visual Question Answering (VQA) | MSRVTT-QA | VideoCoCa | Accuracy | 0.463 | #10 |
| Visual Question Answering (VQA) | MSVD-QA | VideoCoCa | Accuracy | 0.569 | #8 |
| Zero-Shot Action Recognition | UCF101 | VideoCoCa | Top-1 Accuracy | 86.6 | #4 |
| | | | Top-5 Accuracy | 98.4 | #1 |
| Zero-Shot Video Retrieval | VATEX | VideoCoCa | text-to-video R@1 | 53.2 | #3 |
| | | | video-to-text R@1 | 73.6 | #3 |
| | | | text-to-video R@5 | 83.3 | #3 |
| | | | text-to-video R@10 | 90.1 | #3 |
| | | | video-to-text R@5 | 93.2 | #3 |
| | | | video-to-text R@10 | 97.2 | #3 |
| Video Captioning | VATEX | VideoCoCa | BLEU-4 | 39.7 | #4 |
| | | | CIDEr | 77.8 | #4 |
| | | | ROUGE-L | 54.5 | #2 |
| Zero-Shot Video Retrieval | YouCook2 | VideoCoCa | text-to-video R@1 | 20.3 | #3 |
| | | | text-to-video R@5 | 43.0 | #4 |
| | | | text-to-video R@10 | 53.3 | #4 |
| Video Retrieval | YouCook2 | VideoCoCa (zero-shot) | text-to-video R@1 | 21.7 | #9 |
| | | | text-to-video R@10 | 55.2 | #11 |
| | | | text-to-video R@5 | 43.9 | #9 |
| Video Captioning | YouCook2 | VideoCoCa | BLEU-4 | 14.2 | #4 |
| | | | ROUGE-L | 37.7 | #7 |
| | | | CIDEr | 1.28 | #8 |
