Zero-shot Audio Captioning

2 papers with code • 2 benchmarks • 2 datasets

Zero-shot audio captioning aims to automatically generate descriptive textual captions for audio content without any task-specific training. Audio captioning is commonly concerned with ambient sounds or sounds produced by a human performing an action.

Most implemented papers

Zero-shot audio captioning with audio-language model guidance and audio context keywords

explainableml/zeraucap 14 Nov 2023

In particular, our framework exploits a pre-trained large language model (LLM) for generating the text, which is guided by a pre-trained audio-language model to produce captions that describe the audio content.
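
The guided-generation idea can be approximated with off-the-shelf components: an LLM proposes candidate captions and a pre-trained audio-language model such as CLAP picks the candidate that best matches the audio. The sketch below is a simplified re-ranking variant, not the ZerAuCap implementation itself; the model checkpoints, the prompt, and the audio file name are illustrative assumptions.

```python
# Minimal sketch (assumptions: gpt2 as the LLM, laion/clap-htsat-unfused as
# the audio-language model, "example.wav" as a placeholder audio file).
import torch
import librosa
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          ClapModel, ClapProcessor)

# Pre-trained LLM that generates caption candidates.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

# Pre-trained audio-language model that scores audio-text similarity.
clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")
clap_processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# 1) Sample several candidate captions from the LLM.
prompt = "This is a sound of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = lm.generate(**inputs, do_sample=True, top_p=0.9,
                      max_new_tokens=15, num_return_sequences=8,
                      pad_token_id=tokenizer.eos_token_id)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# 2) Score each candidate against the audio with CLAP and keep the best one.
audio, sr = librosa.load("example.wav", sr=48_000)  # CLAP expects 48 kHz
audio_inputs = clap_processor(audios=audio, sampling_rate=sr,
                              return_tensors="pt")
text_inputs = clap_processor(text=candidates, return_tensors="pt",
                             padding=True)
with torch.no_grad():
    audio_emb = clap.get_audio_features(**audio_inputs)
    text_emb = clap.get_text_features(**text_inputs)
    sims = torch.nn.functional.cosine_similarity(audio_emb, text_emb)

print(candidates[int(sims.argmax())])
```

Token-level guidance, as described in the paper, would instead fold the audio-text similarity into the LLM's next-token scores at every decoding step; re-ranking whole candidates keeps the sketch short while illustrating the same zero-shot principle.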

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

no code yet • 2 Feb 2024

Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs.