VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Vision and text have been fully explored in contemporary video-text foundation models, while other modalities in videos, such as audio and subtitles, have not received sufficient attention. In this paper, we resort to establishing connections between the multi-modality video tracks, including Vision, Audio, and Subtitle, and Text, by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process the vision, audio, and subtitle modalities from video and better supports various tasks, including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of the proposed VAST-27M corpus and the VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.
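To make the VAST-27M caption-generation pipeline concrete, the following is a minimal Python sketch of the LLM fusion step described above: per-modality captions are generated for a clip and then merged with the subtitle via an instructional prompt. The function names (vision_captioner, audio_captioner, query_llm), the clip fields, and the prompt wording are hypothetical placeholders under assumed interfaces, not the authors' actual implementation.

```python
def build_fusion_prompt(vision_caption: str, audio_caption: str, subtitle: str) -> str:
    """Compose an instructional prompt asking an LLM to merge the three
    single-modality descriptions into one omni-modality caption."""
    return (
        "Describe the video in one natural sentence by combining the "
        "following information.\n"
        f"Visual description: {vision_caption}\n"
        f"Audio description: {audio_caption}\n"
        f"Speech transcript (subtitle): {subtitle}\n"
        "Omni-modality caption:"
    )


def fuse_captions(clip, vision_captioner, audio_captioner, query_llm) -> str:
    """Generate per-modality captions for one clip, then let an off-the-shelf
    LLM fuse them with the clip's subtitle into an omni-modality caption."""
    vision_caption = vision_captioner(clip.frames)    # trained vision captioner
    audio_caption = audio_captioner(clip.waveform)    # trained audio captioner
    prompt = build_fusion_prompt(vision_caption, audio_caption, clip.subtitle)
    return query_llm(prompt)                          # off-the-shelf LLM


if __name__ == "__main__":
    # Toy stand-ins wired together only to show the data flow; in practice
    # these would be the trained captioners and a real LLM.
    from types import SimpleNamespace

    clip = SimpleNamespace(frames=None, waveform=None,
                           subtitle="welcome back to my cooking channel")
    caption = fuse_captions(
        clip,
        vision_captioner=lambda frames: "a person chops vegetables in a kitchen",
        audio_captioner=lambda waveform: "a knife taps on a cutting board",
        query_llm=lambda prompt: prompt.splitlines()[-1],  # dummy LLM
    )
    print(caption)
```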


Results from the Paper


All results below are for the VAST model; the value in parentheses is its global rank on the corresponding benchmark leaderboard.

Video Retrieval, ActivityNet: text-to-video R@1 70.5 (#2), R@5 90.9 (#1), R@10 95.5 (#1)
Video Question Answering, ActivityNet-QA: Accuracy 50.4 (#5)
Audio Captioning, AudioCaps: CIDEr 0.781 (#3), BLEU-4 0.295 (#1), METEOR 0.247 (#4), ROUGE-L 0.509 (#1)
Text-to-Audio Retrieval, AudioCaps: R@1 52.0 (#2), R@5 76.8 (#2), R@10 82.9 (#5)
Text-to-Audio Retrieval, Clotho: R@1 26.9 (#2), R@5 53.2 (#1), R@10 66.1 (#1)
Audio Captioning, Clotho: CIDEr 0.519 (#1), BLEU-4 19 (#1), METEOR 19.3 (#1), ROUGE-L 40.8 (#1)
Cross-Modal Retrieval, COCO 2014: text-to-image R@1 68.0 (#2), R@5 87.7 (#2), R@10 92.8 (#1)
Image Captioning, COCO Captions: CIDEr 149.0 (#5), SPICE 27.0 (#1)
Video Retrieval, DiDeMo: text-to-video R@1 72.0 (#3), R@5 89.0 (#3), R@10 91.4 (#4)
Zero-Shot Video Retrieval, DiDeMo: text-to-video R@1 55.5 (#3), R@5 74.3 (#3), R@10 79.6 (#4)
Zero-Shot Cross-Modal Retrieval, Flickr30k: text-to-image R@1 90.4 (#2)
Cross-Modal Retrieval, Flickr30k: text-to-image R@1 91.0 (#3), R@5 98.5 (#4), R@10 99.5 (#2)
Zero-Shot Video Retrieval, MSR-VTT: text-to-video R@1 49.3 (#3), R@5 68.3 (#5), R@10 73.9 (#6)
Video Captioning, MSR-VTT: CIDEr 78.0 (#2), BLEU-4 56.7 (#2)
Video Retrieval, MSR-VTT: text-to-video R@1 63.9 (#1), R@5 84.3 (#1), R@10 89.6 (#1)
Video Question Answering, MSRVTT-QA: Accuracy 50.1 (#2)
Visual Question Answering (VQA), MSVD-QA: Accuracy 0.60 (#4)
Audio-Visual Question Answering, MUSIC-AVQA: Accuracy 80.7 (#1)
Video Question Answering (TGIF-Frame), TGIF-QA: Accuracy 79.1 (#2)
Video Captioning, TVC: BLEU-4 19.9 (#1), CIDEr 74.1 (#1)
Audio-Visual Captioning, VALOR-32K: CIDEr 62.2 (#1), BLEU-4 9.9 (#1)
Text-to-Audiovisual Retrieval, VALOR-32K: text-to-audiovisual R@1 80.0 (#2), R@5 93.7 (#2), R@10 96.6 (#2)
Video Retrieval, VATEX: text-to-video R@1 83.0 (#1), R@5 98.2 (#5), R@10 99.2 (#1)
Video Captioning, VATEX: BLEU-4 45.0 (#2), CIDEr 99.5 (#1)
Video Retrieval, YouCook2: text-to-video R@1 50.4 (#1), R@5 74.3 (#1), R@10 80.8 (#1)
Video Captioning, YouCook2: BLEU-4 18.2 (#1), CIDEr 1.99 (#1)
