Based on the proposed VAST-27M dataset, we train VAST, an omni-modality video-text foundation model that can perceive and process vision, audio, and subtitle modalities from video, and that supports a wide range of vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). A rough sketch of this kind of multi-modality fusion appears below.
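One way such an omni-modality model could be wired is sketched below. This is a hedged illustration, not the VAST implementation: the projection layers, feature dimensions, fusion transformer, and mean pooling are all assumptions chosen for brevity, standing in for the real vision, audio, and text backbones.

```python
# Hypothetical sketch: fuse vision, audio, and subtitle features into a single
# omni-modality video embedding and score it against a text query for retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniModalityEncoder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Stand-ins for the per-modality backbones (assumed feature dims).
        self.vision_proj = nn.Linear(768, dim)    # e.g. ViT patch features
        self.audio_proj = nn.Linear(128, dim)     # e.g. spectrogram-encoder features
        self.subtitle_proj = nn.Linear(512, dim)  # e.g. text features of ASR subtitles
        self.text_proj = nn.Linear(512, dim)      # caption / query features
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def encode_video(self, vision_feats, audio_feats, subtitle_feats):
        # Concatenate the three modality token sequences and fuse them jointly.
        tokens = torch.cat(
            [self.vision_proj(vision_feats),
             self.audio_proj(audio_feats),
             self.subtitle_proj(subtitle_feats)],
            dim=1,
        )
        fused = self.fusion(tokens)
        return F.normalize(fused.mean(dim=1), dim=-1)  # pooled omni-modality embedding

    def encode_text(self, text_feats):
        return F.normalize(self.text_proj(text_feats).mean(dim=1), dim=-1)

model = OmniModalityEncoder()
video_emb = model.encode_video(
    torch.randn(2, 32, 768), torch.randn(2, 16, 128), torch.randn(2, 20, 512)
)
text_emb = model.encode_text(torch.randn(2, 12, 512))
similarity = video_emb @ text_emb.t()  # video-text retrieval scores
print(similarity.shape)                # torch.Size([2, 2])
```

The same pooled video embedding could feed captioning or QA heads; only the retrieval scoring is shown here.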
We show that two-modality data paired only with language is sufficient to connect all modalities; a sketch of this idea follows.
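The sketch below illustrates the idea under stated assumptions (a symmetric InfoNCE loss and placeholder embeddings; neither is taken from the paper): each non-text modality is aligned to a shared text space using only (modality, text) pairs, after which the modalities can be compared to each other through that space without any direct cross-modality pairs.

```python
# Minimal sketch: language as the pivot that connects modalities.
import torch
import torch.nn.functional as F

def contrastive_loss(x, y, temperature: float = 0.07):
    # Symmetric InfoNCE over a batch of paired embeddings.
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature
    targets = torch.arange(x.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Placeholder embeddings from two independent language-paired corpora.
text_for_video, video = torch.randn(8, 512), torch.randn(8, 512)
text_for_audio, audio = torch.randn(8, 512), torch.randn(8, 512)

# Each term uses only two-modality data paired with language.
loss = contrastive_loss(video, text_for_video) + contrastive_loss(audio, text_for_audio)

# Once the encoders are trained this way, video and audio embeddings share the
# text space and can be compared directly, e.g. for audio-to-video retrieval:
scores = F.normalize(video, dim=-1) @ F.normalize(audio, dim=-1).t()
```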
Our method performs joint masking on image-text inputs and integrates both implicit and explicit targets for recovering the masked signals, as in the sketch below.
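A hedged sketch of the general recipe follows; the layer names, masking ratios, zero-out masking, and the momentum-teacher features are assumptions for illustration, not the paper's architecture. It shows image patches and text tokens masked jointly, with an implicit target (regressing teacher features at masked patch positions) and an explicit target (predicting the ids of masked text tokens).

```python
# Sketch of joint masked modeling with implicit + explicit recovery targets.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, vocab_size, patch_dim = 512, 30522, 768

multimodal_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=4
)
patch_proj = nn.Linear(patch_dim, dim)
token_emb = nn.Embedding(vocab_size, dim)
implicit_head = nn.Linear(dim, dim)         # regresses teacher features of masked patches
explicit_head = nn.Linear(dim, vocab_size)  # predicts ids of masked text tokens

patches = torch.randn(2, 16, patch_dim)          # image patch features
tokens = torch.randint(0, vocab_size, (2, 12))   # text token ids
teacher_feats = torch.randn(2, 16, dim)          # e.g. from a momentum encoder (assumption)

def random_mask(batch: int, length: int, num_masked: int) -> torch.Tensor:
    # Boolean mask with exactly `num_masked` True entries per row.
    ranks = torch.rand(batch, length).argsort(dim=1).argsort(dim=1)
    return ranks < num_masked

# Joint masking over both modalities of the fused sequence.
patch_mask = random_mask(2, 16, 8)
token_mask = random_mask(2, 12, 3)

x = torch.cat([patch_proj(patches), token_emb(tokens)], dim=1)
mask = torch.cat([patch_mask, token_mask], dim=1)
# Replace masked positions with zeros (a learned [MASK] embedding in practice).
x = torch.where(mask.unsqueeze(-1), torch.zeros_like(x), x)

hidden = multimodal_encoder(x)
patch_hidden, token_hidden = hidden[:, :16], hidden[:, 16:]

# Implicit target: match teacher features at masked patch positions.
implicit_loss = F.mse_loss(implicit_head(patch_hidden)[patch_mask], teacher_feats[patch_mask])
# Explicit target: predict the original ids of masked text tokens.
explicit_loss = F.cross_entropy(explicit_head(token_hidden)[token_mask], tokens[token_mask])
loss = implicit_loss + explicit_loss
```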
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation by jointly modeling visual, textual, and audio resources.