Zero-shot Audio Captioning

4 papers with code • 2 benchmarks • 2 datasets

Zero-shot audio captioning aims to automatically generate descriptive textual captions for audio content without any task-specific training. Audio captioning is commonly concerned with ambient sounds or with sounds produced by a human performing an action.

Most implemented papers

Zero-shot audio captioning with audio-language model guidance and audio context keywords

explainableml/zeraucap 14 Nov 2023

In particular, our framework uses a pre-trained large language model (LLM) to generate the text, guided by a pre-trained audio-language model so that the resulting captions describe the audio content.
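
A minimal sketch of this style of guidance, assuming Hugging Face GPT-2 and CLAP checkpoints (`gpt2`, `laion/clap-htsat-unfused`): at each decoding step, the LLM's top-k candidate tokens are re-ranked by CLAP similarity between the candidate caption and the audio. The `audio_embed` input (a CLAP audio embedding, e.g. from `ClapModel.get_audio_features`), the prompt, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, ClapModel, ClapProcessor

lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
clap = ClapModel.from_pretrained("laion/clap-htsat-unfused").eval()
clap_proc = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

@torch.no_grad()
def guided_caption(audio_embed, prompt="Sound of", steps=15, k=20, alpha=5.0):
    # audio_embed: a precomputed CLAP audio embedding (assumption for this sketch).
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        logits = lm(ids).logits[0, -1]          # LM scores for the next token
        top = logits.topk(k)                    # keep the k most likely candidates
        # Decode each candidate continuation and score it against the audio with CLAP.
        cands = [lm_tok.decode(torch.cat([ids[0], t.view(1)])) for t in top.indices]
        txt = clap_proc(text=cands, return_tensors="pt", padding=True)
        sim = torch.nn.functional.cosine_similarity(
            clap.get_text_features(**txt), audio_embed, dim=-1
        )
        scores = top.values.log_softmax(-1) + alpha * sim   # LM prior + audio guidance
        best = top.indices[scores.argmax()].view(1, 1)
        ids = torch.cat([ids, best], dim=1)
    return lm_tok.decode(ids[0], skip_special_tokens=True)
```

The weighting `alpha` trades off LM fluency against audio faithfulness; greedy re-ranking is used here for brevity where beam variants are also possible.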

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

NVIDIA/audio-flamingo 2 Feb 2024

Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs.

An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

hugomalard/aneyeforanear 8 Oct 2024

In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner.
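
As a toy illustration of distribution matching (not the paper's actual objective, which may use a different alignment loss), the sketch below trains a linear projector that maps audio-backbone tokens into the image captioner's embedding space by minimizing an RBF-kernel maximum mean discrepancy (MMD) between the two token distributions. Dimensions and data are placeholder assumptions.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """RBF-kernel maximum mean discrepancy between two sets of token embeddings."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

audio_dim, text_dim = 768, 1024              # assumed backbone widths
proj = torch.nn.Linear(audio_dim, text_dim)  # maps audio tokens into the captioner's space
opt = torch.optim.Adam(proj.parameters(), lr=1e-4)

# Random stand-ins for real audio-backbone and image-captioner token embeddings.
audio_tokens = torch.randn(64, audio_dim)
image_tokens = torch.randn(64, text_dim)

for _ in range(100):
    loss = rbf_mmd(proj(audio_tokens), image_tokens)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once the projected audio tokens are distributionally close to the image tokens, the frozen image captioner can consume them in place of visual input.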

DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning

X-LANCE/SLAM-LLM 12 Oct 2024

By tailoring the text embedding support and the caption datastore to the target domain, DRCap acquires a robust ability to adapt to new domains in a training-free manner.
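
A minimal sketch of the retrieval step under these ideas, assuming a precomputed datastore of CLAP text embeddings: the query audio's CLAP embedding retrieves the nearest captions, which can then be handed to a decoder as in-context evidence. The datastore contents and embedding dimension are placeholders, not DRCap's implementation.

```python
import torch
import torch.nn.functional as F

# Placeholder target-domain captions; in practice these come from the new domain.
datastore_captions = [
    "a dog barks twice in the distance",
    "rain falls steadily on a tin roof",
    "a car engine starts and idles",
    "people applaud in a large hall",
]
# Precomputed CLAP text embeddings for the captions (random stand-ins here).
datastore_embeds = F.normalize(torch.randn(len(datastore_captions), 512), dim=-1)

def retrieve(audio_embed, k=2):
    """Return the k datastore captions most similar to the audio in CLAP space."""
    q = F.normalize(audio_embed, dim=-1)
    sims = datastore_embeds @ q              # cosine similarities, shape [N]
    return [datastore_captions[i] for i in sims.topk(k).indices.tolist()]

print(retrieve(torch.randn(512)))            # captions to feed the decoder prompt
```

Because retrieval happens purely in CLAP's shared embedding space, adapting to a new domain only requires re-embedding a new caption datastore, with no gradient updates.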