Audio captioning
62 papers with code • 2 benchmarks • 5 datasets
Audio Captioning is the task of describing audio using text. The general approach is to use an audio encoder (e.g., PANNs, CAV-MAE) to encode the audio and a decoder (e.g., a Transformer) to generate the text. To judge caption quality, machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr) are commonly used, but they are not well suited to audio; attempts have therefore been made to use metrics based on pretrained language models such as Sentence-BERT.
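The encoder-decoder recipe can be sketched in a few lines of PyTorch. The module below is purely illustrative: the small CNN stands in for a pretrained audio encoder such as PANNs or CAV-MAE, and the vocabulary size, embedding dimension, and tensor shapes are assumptions, not any paper's actual configuration.

```python
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    """Minimal encoder-decoder audio captioner (illustrative sketch only)."""

    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        # Toy audio encoder over log-mel spectrograms: (batch, 1, mel_bins, frames).
        # In practice this would be a pretrained model such as PANNs or CAV-MAE.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 32)),   # -> (batch, d_model, 1, 32)
        )
        self.token_emb = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, mel_bins, frames); tokens: (batch, seq_len)
        memory = self.encoder(mel).squeeze(2).transpose(1, 2)   # (batch, 32, d_model)
        tgt = self.token_emb(tokens)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)                              # caption logits

# Usage: one teacher-forced training step with cross-entropy on shifted tokens.
model = AudioCaptioner(vocab_size=5000)
mel = torch.randn(2, 1, 64, 400)            # fake log-mel batch
tokens = torch.randint(0, 5000, (2, 20))    # fake caption token ids
logits = model(mel, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 5000), tokens[:, 1:].reshape(-1)
)
```

At inference time the decoder would instead generate tokens autoregressively (greedy or beam search) conditioned on the encoded audio.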
Most implemented papers
Clotho: An Audio Captioning Dataset
Audio captioning is the novel task of general audio content description using free text.
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
CL4AC: A Contrastive Loss for Audio Captioning
Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip.
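As a rough illustration of the contrastive idea, a symmetric InfoNCE-style loss pulls matched audio-caption embeddings together and pushes mismatched pairs in the batch apart. This is a generic sketch under assumed embedding shapes and temperature, not the exact CL4AC formulation.

```python
import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim); row i of each is a matched pair.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0))          # matched pairs on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2

# Example with random embeddings standing in for encoder/decoder outputs.
loss = audio_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```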
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).
LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT
Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement over models using continuous speech features.
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans.
Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks.
LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction.
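One way to picture an optimal transport-based alignment between audio and visual token sets is the entropic-OT (Sinkhorn) cost below. This is a generic routine under assumed token shapes and hyperparameters, offered only as a sketch of the technique, not the LAVCap implementation.

```python
import torch
import torch.nn.functional as F

def sinkhorn_alignment_cost(audio_tokens: torch.Tensor,
                            visual_tokens: torch.Tensor,
                            eps: float = 0.1,
                            n_iters: int = 50) -> torch.Tensor:
    """Entropic OT cost between two token sets (illustrative sketch).

    audio_tokens: (n, d), visual_tokens: (m, d); returns a scalar transport
    cost that can serve as an alignment loss between the two modalities.
    """
    # Cost matrix: 1 - cosine similarity between every audio/visual token pair.
    a = F.normalize(audio_tokens, dim=-1)
    v = F.normalize(visual_tokens, dim=-1)
    cost = 1.0 - a @ v.t()                          # (n, m)

    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)                  # uniform marginals
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)                      # Gibbs kernel

    u = torch.ones(n) / n
    for _ in range(n_iters):                        # Sinkhorn iterations
        v_scale = nu / (K.t() @ u)
        u = mu / (K @ v_scale)
    transport_plan = u.unsqueeze(1) * K * v_scale.unsqueeze(0)
    return (transport_plan * cost).sum()

# Example: align 10 audio tokens with 16 visual tokens of dimension 256.
loss_ot = sinkhorn_alignment_cost(torch.randn(10, 256), torch.randn(16, 256))
```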
M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP
In the second stage, it learns CLAP features using the audio features learned from the LLM-based embeddings.
Audio Caption in a Car Setting with a Sentence-Level Loss
Captioning has attracted much attention in image and video understanding, while comparatively little work examines audio captioning.