Audio captioning

62 papers with code • 2 benchmarks • 5 datasets

Audio Captioning is the task of describing the content of an audio clip in natural language. The general approach pairs an audio encoder (e.g., PANNs, CAV-MAE) with a text decoder (e.g., a transformer) that generates the caption. Caption quality is usually judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to audio; pretrained language model based metrics such as Sentence-BERT similarity have therefore been explored.
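As an illustration of such metrics, here is a minimal sketch of a Sentence-BERT-based score using the sentence-transformers library; the model name and the candidate/reference captions are arbitrary examples, and published metrics built on sentence embeddings add components beyond raw cosine similarity.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

candidate = "a dog barks while cars pass by in the rain"
references = [
    "a dog is barking as traffic drives through rain",
    "rain falls on a street while a dog barks",
]

# Embed the candidate and the references, then score the candidate by its
# average cosine similarity to the reference captions.
cand_emb = model.encode(candidate, convert_to_tensor=True)
ref_embs = model.encode(references, convert_to_tensor=True)
score = util.cos_sim(cand_emb, ref_embs).mean().item()
print(f"Sentence-BERT similarity: {score:.3f}")
```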

Most implemented papers

Clotho: An Audio Captioning Dataset

labbeti/aac-datasets 21 Oct 2019

Audio captioning is the novel task of general audio content description using free text.
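The companion aac-datasets package (the repository listed above) wraps Clotho and related corpora; below is a hedged loading sketch following its documented usage, with the caveat that argument names and defaults may vary between versions.

```python
from aac_datasets import Clotho  # pip install aac-datasets

# Download and load the Clotho development split; "root", "subset", and
# "download" follow the project's README and may differ in other versions.
dataset = Clotho(root=".", subset="dev", download=True)

item = dataset[0]            # each item behaves like a dict
audio = item["audio"]        # waveform tensor for the clip
captions = item["captions"]  # Clotho provides five reference captions per clip
print(len(dataset), captions)
```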

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

xinhaomei/wavcaps 30 Mar 2023

To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.

CL4AC: A Contrastive Loss for Audio Captioning

liuxubo717/cl4ac 21 Jul 2021

Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip.
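To make the contrastive idea concrete, below is a generic InfoNCE-style audio-text loss in PyTorch; this common formulation is only a sketch, not necessarily the exact CL4AC objective, and the batch-of-paired-embeddings setup is an assumption.

```python
import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over (batch, dim) audio/text embeddings,
    where matching rows are the positive pairs."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy over audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random stand-in embeddings:
loss = audio_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```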

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

txh-mercury/vast NeurIPS 2023

Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

modelscope/FunCodec 7 Oct 2023

Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement compared with models that use continuous speech features.

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

qwenlm/qwen-audio 14 Nov 2023

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans.

Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models

kuan2jiu99/audio-hallucination 12 Jun 2024

Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks.

LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport

naver-intel-co-lab/gaudi-lavcap 16 Jan 2025

LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction.
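As a rough sketch of what an optimal-transport alignment can look like, the snippet below computes an entropic OT cost between audio and visual token features via Sinkhorn iterations; the cosine cost, uniform marginals, and hyperparameters are illustrative assumptions, not the exact LAVCap loss.

```python
import torch

def sinkhorn_alignment_cost(audio_feats: torch.Tensor,
                            visual_feats: torch.Tensor,
                            eps: float = 0.05,
                            n_iters: int = 50) -> torch.Tensor:
    """Entropic OT cost between (n, d) audio and (m, d) visual token
    features, both assumed L2-normalized."""
    cost = 1.0 - audio_feats @ visual_feats.t()      # (n, m) cosine distances
    n, m = cost.shape
    # Uniform marginals over the tokens of each modality.
    mu = torch.full((n,), 1.0 / n, device=cost.device)
    nu = torch.full((m,), 1.0 / m, device=cost.device)
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                         # Sinkhorn fixed-point updates
        u = mu / (K @ (nu / (K.t() @ u)))
    v = nu / (K.t() @ u)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)       # transport plan
    return (plan * cost).sum()                       # alignment cost to minimize
```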

M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP

nttcslab/m2d 28 Mar 2025

In the second stage, it learns CLAP features using the audio features learned from the LLM-based embeddings.

Audio Caption in a Car Setting with a Sentence-Level Loss

richermans/AudioCaption 31 May 2019

Captioning has attracted much attention in image and video understanding, while only a small amount of work has examined audio captioning.