OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can therefore perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language data to help video-language tasks). To this end, we propose decoupled joint pretraining of image-language and video-language, which effectively decomposes vision-language modeling into spatial and temporal dimensions and boosts performance on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss that leverages image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are exploited as fully as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results at a similar model size and data scale.
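To make the unified objective concrete, below is a minimal sketch of how a UniVLC-style loss could combine all four pretraining sources, assuming class labels are verbalized into text prompts so that image-text, video-text, image-label, and video-label samples all reduce to visual-text pairs. The function, its arguments, and the soft-target formulation are our own illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a UniVLC-style unified contrastive objective.
# Assumption: labels from classification data are verbalized into captions
# (e.g., "a video of {label}") so every sample becomes a visual-text pair.
import torch
import torch.nn.functional as F

def univlc_loss(visual_emb: torch.Tensor,
                text_emb: torch.Tensor,
                pair_ids: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a mixed batch of visual-text pairs.

    visual_emb: (B, D) outputs of a shared visual encoder; an image is
                treated as a single-frame video, so one encoder serves both.
    text_emb:   (B, D) embeddings of captions or verbalized labels.
    pair_ids:   (B,) integer ids; samples sharing an id (e.g., the same
                class label) count as positives for each other.
    """
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature  # (B, B) pairwise similarities

    # Soft targets: every entry with a matching pair id is a positive, so
    # labeled data contributes multiple positives per row, while web
    # caption pairs (unique ids) reduce to standard one-hot targets.
    pos_mask = (pair_ids.unsqueeze(0) == pair_ids.unsqueeze(1)).float()
    targets = pos_mask / pos_mask.sum(dim=1, keepdim=True)

    loss_v2t = F.cross_entropy(logits, targets)      # visual -> text
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> visual
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage: two captioned samples (unique ids 0 and 1) mixed with two
# classification samples that share label id 2.
if __name__ == "__main__":
    B, D = 4, 256
    visual, text = torch.randn(B, D), torch.randn(B, D)
    ids = torch.tensor([0, 1, 2, 2])
    print(univlc_loss(visual, text, ids))
```

The point of this formulation is that, once labels are verbalized, supervised classification data and noisy web caption pairs flow through the same symmetric contrastive objective, with the pair ids controlling which batch entries count as positives.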


Results from the Paper


Ranked #4 on Cross-Modal Retrieval on Flickr30k (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Cross-Modal Retrieval | COCO 2014 | OmniVL (14M) | Image-to-text R@1 | 82.1 | #5 |
| | | | Image-to-text R@5 | 95.9 | #5 |
| | | | Image-to-text R@10 | 98.1 | #6 |
| | | | Text-to-image R@1 | 64.8 | #8 |
| | | | Text-to-image R@5 | 86.1 | #7 |
| | | | Text-to-image R@10 | 91.6 | #6 |
| Video Retrieval | DiDeMo | OmniVL | Text-to-video R@1 | 52.4 | #16 |
| | | | Text-to-video R@5 | 79.5 | #9 |
| | | | Text-to-video R@10 | 85.4 | #14 |
| Zero-Shot Video Retrieval | DiDeMo | OmniVL | Text-to-video R@1 | 33.3 | #14 |
| | | | Text-to-video R@5 | 58.7 | #14 |
| | | | Text-to-video R@10 | 68.5 | #14 |
| Cross-Modal Retrieval | Flickr30k | OmniVL (14M) | Image-to-text R@1 | 97.3 | #4 |
| | | | Image-to-text R@5 | 99.9 | #7 |
| | | | Image-to-text R@10 | 100 | #1 |
| | | | Text-to-image R@1 | 87.9 | #6 |
| | | | Text-to-image R@5 | 97.8 | #6 |
| | | | Text-to-image R@10 | 99.1 | #6 |
| Action Classification | Kinetics-400 | OmniVL | Acc@1 | 79.1 | #109 |
| | | | Acc@5 | 94.5 | #65 |
| Video Retrieval | MSR-VTT | OmniVL | Text-to-video R@1 | 47.8 | #10 |
| | | | Text-to-video R@5 | 74.2 | #7 |
| | | | Text-to-video R@10 | 83.8 | #7 |
| Zero-Shot Video Retrieval | MSR-VTT | OmniVL | Text-to-video R@1 | 34.6 | #13 |
| | | | Text-to-video R@5 | 58.4 | #13 |
| | | | Text-to-video R@10 | 66.6 | #15 |
| Visual Question Answering (VQA) | MSRVTT-QA | OmniVL | Accuracy | 0.441 | #18 |
| Visual Question Answering (VQA) | MSVD-QA | OmniVL | Accuracy | 0.510 | #19 |
| Image Captioning | nocaps-val-in-domain | OmniVL | CIDEr | 104.6 | #9 |
| | | | SPICE | 15.0 | #6 |
| | | | Pretrain (#images) | 14M | #6 |
| Image Captioning | nocaps-val-near-domain | OmniVL | CIDEr | 108.3 | #8 |
| | | | SPICE | 14.9 | #5 |
| | | | Pretrain (#images) | 14M | #7 |
| Image Captioning | nocaps-val-out-domain | OmniVL | CIDEr | 106.3 | #8 |
| | | | SPICE | 14.2 | #5 |
| | | | Pretrain (#images) | 14M | #7 |
| Image Captioning | nocaps-val-overall | OmniVL | CIDEr | 107.5 | #8 |
| | | | SPICE | 14.7 | #6 |
| | | | Pretrain (#images) | 14M | #7 |
| Action Recognition | Something-Something V2 | OmniVL | Top-1 Accuracy | 62.5 | #100 |
| | | | Top-5 Accuracy | 86.2 | #83 |
| Video Captioning | YouCook2 | OmniVL | BLEU-3 | 12.87 | #5 |
| | | | BLEU-4 | 8.72 | #9 |
| | | | METEOR | 14.83 | #6 |
| | | | ROUGE-L | 36.09 | #8 |
| | | | CIDEr | 1.16 | #8 |

Methods


No methods listed for this paper.