Audio-Visual Captioning
4 papers with code • 1 benchmark • 0 datasets
Most implemented papers
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Based on the proposed VAST-27M dataset, we train VAST, an omni-modality video-text foundation model that perceives and processes the vision, audio, and subtitle modalities of a video, and better supports vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA).
LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction.
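The blurb does not spell out LAVCap's exact formulation, so the sketch below is only a generic entropic (Sinkhorn) optimal-transport alignment loss between audio and visual token features, with hypothetical tensor shapes and hyperparameters (`eps`, `n_iters`), to illustrate what an OT-based alignment term can look like.

```python
import torch
import torch.nn.functional as F


def ot_alignment_loss(audio_feats, visual_feats, eps=0.05, n_iters=50):
    """Entropic OT alignment between audio and visual token features.

    audio_feats:  (B, Na, D) audio token embeddings   (hypothetical shapes)
    visual_feats: (B, Nv, D) visual token embeddings
    Returns the mean transport cost <T, C>, usable as an alignment loss.
    """
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(visual_feats, dim=-1)
    cost = 1.0 - torch.bmm(a, v.transpose(1, 2))            # (B, Na, Nv) cosine cost
    B, Na, Nv = cost.shape
    mu = torch.full((B, Na), 1.0 / Na, device=cost.device)  # uniform source marginal
    nu = torch.full((B, Nv), 1.0 / Nv, device=cost.device)  # uniform target marginal
    K = torch.exp(-cost / eps)                               # Gibbs kernel
    u = torch.ones_like(mu)
    s = torch.ones_like(nu)
    for _ in range(n_iters):                                 # Sinkhorn iterations
        s = nu / (torch.bmm(K.transpose(1, 2), u.unsqueeze(-1)).squeeze(-1) + 1e-8)
        u = mu / (torch.bmm(K, s.unsqueeze(-1)).squeeze(-1) + 1e-8)
    T = u.unsqueeze(-1) * K * s.unsqueeze(1)                 # transport plan (B, Na, Nv)
    return (T * cost).sum(dim=(1, 2)).mean()
```

In a setup like this, the OT term would typically be added to the captioning loss during training so the audio and visual token sets are encouraged to align before fusion.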
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner.
AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning
In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions.
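The blurb only names AVCap's core idea (audio-visual features used as text tokens); the minimal sketch below, with hypothetical module names and dimensions rather than the paper's actual architecture, shows one way to realize it: project audio and visual features into the text decoder's embedding space and prepend them as prefix tokens to the caption.

```python
import torch
import torch.nn as nn


class AVPrefixCaptioner(nn.Module):
    """Treats projected audio/visual features as prefix tokens for a text decoder.

    All module names, dimensions, and the plain causal decoder here are
    illustrative assumptions, not AVCap's published design.
    """

    def __init__(self, audio_dim=768, visual_dim=768, text_dim=512, vocab_size=30522):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, text_dim)    # audio feats -> token space
        self.visual_proj = nn.Linear(visual_dim, text_dim)  # visual feats -> token space
        self.token_emb = nn.Embedding(vocab_size, text_dim)
        layer = nn.TransformerEncoderLayer(text_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, audio_feats, visual_feats, caption_ids):
        # audio_feats: (B, Na, audio_dim), visual_feats: (B, Nv, visual_dim)
        # caption_ids: (B, L) caption token ids
        av_prefix = torch.cat(
            [self.audio_proj(audio_feats), self.visual_proj(visual_feats)], dim=1
        )                                                    # (B, Na+Nv, text_dim)
        text = self.token_emb(caption_ids)                   # (B, L, text_dim)
        seq = torch.cat([av_prefix, text], dim=1)            # AV prefix + text tokens
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        hidden = self.decoder(seq, mask=mask)                # causal decoding over the sequence
        # logits at the caption positions; shift against caption_ids for a next-token loss
        return self.lm_head(hidden[:, av_prefix.size(1):])
```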