Audio captioning

41 papers with code • 2 benchmarks • 4 datasets

Audio captioning is the task of describing the content of an audio clip in natural-language text. The general approach is to encode the audio with an audio encoder (for example PANN or CAV-MAE) and to generate the text with a decoder (for example a transformer). Caption quality is usually judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to the task. Attempts have been made to use metrics based on pretrained language models, such as Sentence-BERT.
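To illustrate why n-gram overlap metrics can be a poor fit, here is a minimal sketch of clipped n-gram precision, the core ingredient of BLEU. It is not full BLEU (no brevity penalty, no geometric mean over n-gram orders), and the function names and the three example captions are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate caption against one reference.

    Simplified: whitespace tokenization, lowercasing, a single reference.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical captions for the same clip (illustrative only).
reference = "a dog barks while cars pass by"
paraphrase = "a hound is yapping as traffic drives past"      # same meaning, different words
wrong_but_similar = "a dog barks while birds sing nearby"     # different scene, shared words

for caption in (paraphrase, wrong_but_similar):
    p1 = clipped_precision(caption, reference, 1)
    p2 = clipped_precision(caption, reference, 2)
    print(f"{caption!r}: unigram={p1:.3f} bigram={p2:.3f}")
```

In this toy case the faithful paraphrase gets a unigram precision of 0.125 and a bigram precision of 0, while the caption that describes a different scene scores higher on both because it reuses the reference wording. Embedding-based metrics such as those built on Sentence-BERT aim to reduce this sensitivity to surface form.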



Most implemented papers

Clotho: An Audio Captioning Dataset

labbeti/aac-datasets 21 Oct 2019

Audio captioning is the novel task of general audio content description using free text.

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

xinhaomei/wavcaps 30 Mar 2023

To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.

CL4AC: A Contrastive Loss for Audio Captioning

liuxubo717/cl4ac 21 Jul 2021

Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip.

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

qwenlm/qwen-audio 14 Nov 2023

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans.

Audio Caption in a Car Setting with a Sentence-Level Loss

richermans/AudioCaption 31 May 2019

Captioning has attracted much attention in image and video understanding while a small amount of work examines audio captioning.

Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

DK-Nguyen/audio-captioning-sub-sampling 6 Jul 2020

In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence.

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

emrcak/dcase-2020-baseline 9 Jul 2020

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio.

WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

haantran96/wavetransformer 21 Oct 2020

Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents.

MusCaps: Generating Captions for Music Audio

ilaria-manco/muscaps 24 Apr 2021

Content-based music information retrieval has seen rapid progress with the adoption of deep learning.


wsntxxn/AudioCaption DCASE Challenge 2021

This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge Task 6.