Audio-Visual Captioning

4 papers with code • 1 benchmark • 0 datasets

Audio-visual captioning is the task of generating a natural-language description of a video by jointly exploiting its visual frames and its audio track.

Most implemented papers

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

txh-mercury/vast NeurIPS 2023

Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process the vision, audio, and subtitle modalities of video and better support vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA).
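
As a rough illustration of what omni-modality modeling can look like, the sketch below fuses pre-extracted vision, audio, and subtitle features into a single video embedding and contrasts it with caption embeddings. The module names, dimensions, and the symmetric InfoNCE objective are illustrative assumptions, not VAST's actual architecture.

```python
# Illustrative sketch only: fuse vision, audio, and subtitle features into one
# omni-modality video embedding and contrast it with caption embeddings.
# All module names, dimensions, and the loss choice are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniModalityFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # One learned projection per modality into a shared token space.
        self.vision_proj = nn.Linear(dim, dim)
        self.audio_proj = nn.Linear(dim, dim)
        self.subtitle_proj = nn.Linear(dim, dim)
        self.fuse = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, vision, audio, subtitle):
        # Each input: (B, T_modality, dim) pre-extracted features.
        tokens = torch.cat([self.vision_proj(vision),
                            self.audio_proj(audio),
                            self.subtitle_proj(subtitle)], dim=1)
        fused = self.fuse(tokens)                        # cross-modal self-attention
        return F.normalize(fused.mean(dim=1), dim=-1)    # pooled video embedding

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE between batch-aligned video and caption embeddings.
    logits = video_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```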

LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport

naver-intel-co-lab/gaudi-lavcap 16 Jan 2025

LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction.
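
To make the idea concrete, here is a minimal sketch of an entropic-regularized (Sinkhorn) optimal-transport alignment loss between audio and visual token features. The cosine-distance cost, the hyperparameters, and the function name are assumptions for illustration, not LAVCap's implementation.

```python
# Illustrative sketch only: a Sinkhorn-style OT loss that pulls audio and
# visual token features toward a shared space. Names and hyperparameters
# are assumptions, not LAVCap's actual code.
import math
import torch
import torch.nn.functional as F

def sinkhorn_alignment_loss(audio_feats, visual_feats, eps=0.05, n_iters=50):
    """audio_feats: (Na, d), visual_feats: (Nv, d) for one sample."""
    a_f = F.normalize(audio_feats, dim=-1)
    v_f = F.normalize(visual_feats, dim=-1)
    cost = 1.0 - a_f @ v_f.T                       # cosine distance, (Na, Nv)

    na, nv = cost.shape
    log_a = torch.full((na,), -math.log(na), device=cost.device)   # uniform marginals (log)
    log_b = torch.full((nv,), -math.log(nv), device=cost.device)

    f = torch.zeros(na, device=cost.device)        # dual potentials
    g = torch.zeros(nv, device=cost.device)
    for _ in range(n_iters):                       # log-domain Sinkhorn updates
        f = eps * (log_a - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_b - torch.logsumexp((f[:, None] - cost) / eps, dim=0))

    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps)   # soft matching plan
    return (plan * cost).sum()                     # expected transport cost
```

In a setup like this, the alignment term would typically be added to the captioning loss with a weighting coefficient, so that the fused audio-visual representation is both aligned and useful for generation.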

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

TXH-mercury/VALOR 17 Apr 2023

Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner.

AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

jongsuk1/avcap 10 Jul 2024

In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions.
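
The paper's title points to the core idea: projected audio-visual features are treated as if they were text tokens by the caption decoder. The sketch below shows one way such a prefix-token scheme could look; the module names, dimensions, and decoder choice are assumptions, not AVCap's actual architecture.

```python
# Illustrative sketch only: project audio and visual features into the token
# embedding space and prepend them as a prefix that the caption decoder
# attends to. Names, dimensions, and the backbone are assumptions.
import torch
import torch.nn as nn

class AudioVisualPrefixCaptioner(nn.Module):
    def __init__(self, audio_dim=768, visual_dim=1024, embed_dim=512, vocab_size=30522):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)     # audio features -> token space
        self.visual_proj = nn.Linear(visual_dim, embed_dim)   # visual features -> token space
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, audio_feats, visual_feats, caption_ids):
        # audio_feats: (B, Ta, audio_dim); visual_feats: (B, Tv, visual_dim);
        # caption_ids: (B, L) input token ids of the caption (teacher forcing).
        prefix = torch.cat([self.audio_proj(audio_feats),
                            self.visual_proj(visual_feats)], dim=1)   # (B, P, D)
        text = self.token_embed(caption_ids)                          # (B, L, D)
        x = torch.cat([prefix, text], dim=1)                          # prefix + text tokens

        # Attention mask: every position sees the full audio-visual prefix,
        # while caption positions attend causally among themselves.
        P, S = prefix.size(1), x.size(1)
        mask = torch.full((S, S), float("-inf"), device=x.device)
        mask[:, :P] = 0.0
        causal = torch.triu(torch.ones(S - P, S - P, device=x.device, dtype=torch.bool), 1)
        mask[P:, P:] = torch.zeros(S - P, S - P, device=x.device).masked_fill(causal, float("-inf"))

        h = self.backbone(x, mask=mask)
        return self.lm_head(h[:, P:])    # next-token logits for the caption positions
```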