VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Vision and text have been extensively explored in contemporary video-text foundational models, while other modalities in videos, such as audio and subtitles, have not received sufficient attention. In this paper, we establish connections between multi-modality video tracks (vision, audio, and subtitle) and text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of the proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.
NeurIPS 2023
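The caption-integration step described in the abstract (vision caption, audio caption, subtitle, and an instructional prompt passed to an LLM) can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: `ClipAnnotations`, `build_omni_prompt`, the field labels, and the instruction wording are hypothetical and are not taken from the paper or its released code. The sketch only assembles the prompt text; it does not call any LLM.

```python
from dataclasses import dataclass


@dataclass
class ClipAnnotations:
    """Per-clip text inputs named in the abstract (field names are hypothetical)."""
    vision_caption: str
    audio_caption: str
    subtitle: str


# Assumed instruction wording; the actual prompt used by the authors is not
# given in this section.
INSTRUCTION = (
    "Merge the following descriptions of one video clip into a single "
    "caption that covers what is seen, heard, and said."
)


def build_omni_prompt(clip: ClipAnnotations, instruction: str = INSTRUCTION) -> str:
    """Assemble the text that an LLM would receive for one clip."""
    parts = [
        instruction,
        f"Vision caption: {clip.vision_caption.strip()}",
        f"Audio caption: {clip.audio_caption.strip()}",
    ]
    # Assumption: a clip may have no subtitle, in which case the line is skipped.
    if clip.subtitle.strip():
        parts.append(f"Subtitle: {clip.subtitle.strip()}")
    parts.append("Omni-modality caption:")
    return "\n".join(parts)


clip = ClipAnnotations(
    vision_caption="A chef slices onions on a wooden board.",
    audio_caption="A knife taps rhythmically while a man speaks.",
    subtitle="Keep your fingers tucked in as you cut.",
)
prompt = build_omni_prompt(clip)
print(prompt)
```

The resulting string would then be sent to whichever LLM is used; keeping the prompt builder separate from the model call makes it easy to swap instructions or drop a modality.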
Results from the Paper
Ranked #1 on Image Captioning on COCO Captions (SPICE metric, using extra training data)
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Video Retrieval | ActivityNet | VAST | text-to-video R@1 | 70.5 | #2 |
Video Retrieval | ActivityNet | VAST | text-to-video R@5 | 90.9 | #1 |
Video Retrieval | ActivityNet | VAST | text-to-video R@10 | 95.5 | #2 |
Video Question Answering | ActivityNet-QA | VAST | Accuracy | 50.4 | #5 |
Audio captioning | AudioCaps | VAST | CIDEr | 0.781 | #9 |
Audio captioning | AudioCaps | VAST | BLEU-4 | 0.295 | #2 |
Audio captioning | AudioCaps | VAST | METEOR | 0.247 | #10 |
Audio captioning | AudioCaps | VAST | ROUGE-L | 0.509 | #3 |
Text to Audio Retrieval | AudioCaps | VAST | R@1 | 52.0 | #2 |
Text to Audio Retrieval | AudioCaps | VAST | R@5 | 76.8 | #2 |
Text to Audio Retrieval | AudioCaps | VAST | R@10 | 82.9 | #5 |
Text to Audio Retrieval | Clotho | VAST | R@1 | 26.9 | #3 |
Text to Audio Retrieval | Clotho | VAST | R@5 | 53.2 | #3 |
Text to Audio Retrieval | Clotho | VAST | R@10 | 66.1 | #3 |
Audio captioning | Clotho | VAST | CIDEr | 0.519 | #1 |
Audio captioning | Clotho | VAST | BLEU-4 | 19 | #1 |
Audio captioning | Clotho | VAST | METEOR | 19.3 | #1 |
Audio captioning | Clotho | VAST | ROUGE-L | 40.8 | #1 |
Cross-Modal Retrieval | COCO 2014 | VAST | Text-to-image R@1 | 68.0 | #1 |
Cross-Modal Retrieval | COCO 2014 | VAST | Text-to-image R@5 | 87.7 | #2 |
Cross-Modal Retrieval | COCO 2014 | VAST | Text-to-image R@10 | 92.8 | #1 |
Image Captioning | COCO Captions | VAST | CIDEr | 149.0 | #5 |
Image Captioning | COCO Captions | VAST | SPICE | 27.0 | #1 |
Video Retrieval | DiDeMo | VAST | text-to-video R@1 | 72.0 | #3 |
Video Retrieval | DiDeMo | VAST | text-to-video R@5 | 89.0 | #3 |
Video Retrieval | DiDeMo | VAST | text-to-video R@10 | 91.4 | #4 |
Zero-Shot Video Retrieval | DiDeMo | VAST | text-to-video R@1 | 55.5 | #3 |
Zero-Shot Video Retrieval | DiDeMo | VAST | text-to-video R@5 | 74.3 | #3 |
Zero-Shot Video Retrieval | DiDeMo | VAST | text-to-video R@10 | 79.6 | #5 |
Zero-Shot Cross-Modal Retrieval | Flickr30k | VAST | Text-to-image R@1 | 90.4 | #2 |
Cross-Modal Retrieval | Flickr30k | VAST | Text-to-image R@1 | 91.0 | #3 |
Cross-Modal Retrieval | Flickr30k | VAST | Text-to-image R@5 | 98.5 | #4 |
Cross-Modal Retrieval | Flickr30k | VAST | Text-to-image R@10 | 99.5 | #2 |
Zero-Shot Video Retrieval | MSR-VTT | VAST | text-to-video R@1 | 49.3 | #6 |
Zero-Shot Video Retrieval | MSR-VTT | VAST | text-to-video R@5 | 68.3 | #7 |
Zero-Shot Video Retrieval | MSR-VTT | VAST | text-to-video R@10 | 73.9 | #10 |
Video Captioning | MSR-VTT | VAST | CIDEr | 78.0 | #2 |
Video Captioning | MSR-VTT | VAST | BLEU-4 | 56.7 | #2 |
Video Retrieval | MSR-VTT | VAST | text-to-video R@1 | 63.9 | #2 |
Video Retrieval | MSR-VTT | VAST | text-to-video R@5 | 84.3 | #1 |
Video Retrieval | MSR-VTT | VAST | text-to-video R@10 | 89.6 | #1 |
Video Question Answering | MSRVTT-QA | VAST | Accuracy | 50.1 | #2 |
Visual Question Answering (VQA) | MSVD-QA | VAST | Accuracy | 0.60 | #4 |
Audio-visual Question Answering | MUSIC-AVQA | VAST | Acc | 80.7 | #1 |
TGIF-Frame | TGIF-QA | VAST | Accuracy | 79.1 | #2 |
Video Captioning | TVC | VAST | BLEU-4 | 19.9 | #1 |
Video Captioning | TVC | VAST | CIDEr | 74.1 | #1 |
Audio-Visual Captioning | VALOR-32K | VAST | CIDEr | 62.2 | #1 |
Audio-Visual Captioning | VALOR-32K | VAST | BLEU-4 | 9.9 | #1 |
text-to-audiovisual retrieval | VALOR-32K | VAST | text-to-audiovisual R@1 | 80.0 | #2 |
text-to-audiovisual retrieval | VALOR-32K | VAST | text-to-audiovisual R@5 | 93.7 | #2 |
text-to-audiovisual retrieval | VALOR-32K | VAST | text-to-audiovisual R@10 | 96.6 | #2 |
Video Retrieval | VATEX | VAST | text-to-video R@1 | 83.0 | #2 |
Video Retrieval | VATEX | VAST | text-to-video R@5 | 98.2 | #6 |
Video Retrieval | VATEX | VAST | text-to-video R@10 | 99.2 | #2 |
Video Captioning | VATEX | VAST | BLEU-4 | 45.0 | #2 |
Video Captioning | VATEX | VAST | CIDEr | 99.5 | #1 |
Video Retrieval | YouCook2 | VAST | text-to-video R@1 | 50.4 | #1 |
Video Retrieval | YouCook2 | VAST | text-to-video R@5 | 74.3 | #1 |
Video Retrieval | YouCook2 | VAST | text-to-video R@10 | 80.8 | #1 |
Video Captioning | YouCook2 | VAST | BLEU-4 | 18.2 | #1 |
Video Captioning | YouCook2 | VAST | CIDEr | 1.99 | #2 |
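
Many rows above report R@1, R@5, and R@10. As a reading aid, here is a minimal sketch of how Recall@K is commonly computed from a text-to-candidate similarity matrix. It assumes one correct candidate per text query, located at the same index as the query; benchmarks with several captions per video need a different pairing, and the evaluation protocol actually used for the numbers above is not described in this section. The function name `recall_at_k` and the toy scores are illustrative only.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between text query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank of the correct candidate: how many candidates
        # score strictly higher than it does.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up scores, not results from the paper).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
for k in (1, 2):
    print(f"R@{k} = {recall_at_k(sim, k):.1f}")
```

In this toy matrix, queries 0 and 2 rank their correct candidate first, while query 1 ranks it second, so R@1 covers two of three queries and R@2 covers all of them.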