TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-shot Text to Audio Retrieval	AudioCaps	WavCaps	R@10	75.8	# 1
Zero-shot Text to Audio Retrieval	AudioCaps	WavCaps	Audio-to-text R@1	25.6	# 2
Zero-shot Text to Audio Retrieval	Clotho	WavCaps	text-to-audio R@1	16.5	# 3
Zero-shot Text to Audio Retrieval	Clotho	WavCaps	text-to-audio R@10	50.9	# 2
Zero-Shot Environment Sound Classification	ESC-50	WavCaps	Accuracy	94.8	# 1
Zero-shot Audio Classification	VGG-Sound	WavCaps	Acc@1	29.6	# 2
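
The R@1 and R@10 figures above are recall-at-k scores: the percentage of queries whose correct match appears among the top k retrieved candidates. A minimal sketch of that computation follows. It assumes a one-to-one pairing in which candidate i is the only correct match for query i; benchmarks with several captions per clip need a set of correct indices per query instead. The function name `recall_at_k` and the toy similarity matrix are illustrative and do not come from the paper or its code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose correct candidate is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: candidate i is the single correct match for query i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct candidate:
        # the number of candidates that score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
toy = [
    [0.9, 0.1, 0.2],  # query 0: correct candidate ranked 1st
    [0.8, 0.5, 0.3],  # query 1: correct candidate ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct candidate ranked 3rd
]
r1 = recall_at_k(toy, 1)
r2 = recall_at_k(toy, 2)
r3 = recall_at_k(toy, 3)
```

For text-to-audio retrieval the rows would be caption queries and the columns audio clips; for audio-to-text retrieval the roles swap.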

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

30 Mar 2023 · Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. We hope that the proposed WavCaps dataset will facilitate research in audio-language multimodal learning and demonstrate the potential of using ChatGPT to enhance academic research. Our dataset and code are available at https://github.com/XinhaoMei/WavCaps.
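
The downstream evaluation mentioned in the abstract includes zero-shot classification (the ESC-50 and VGG-Sound rows). The abstract does not spell out the procedure, so the following is only a sketch of one common approach with audio-text embedding models: embed each class label as text, embed the clip, and pick the label with the highest cosine similarity. The helper names, the three-dimensional embeddings, and the labels are all hypothetical placeholders, not the authors' implementation.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(
    audio_embedding: Sequence[float],
    label_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Return the label whose text embedding is closest to the audio embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine(audio_embedding, label_embeddings[label]),
    )


# Hypothetical text embeddings for three class labels (placeholder values).
labels = {
    "dog bark": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
    "siren": [0.0, 0.0, 1.0],
}

# Hypothetical (audio embedding, true label) pairs.
clips = [
    ([0.9, 0.1, 0.0], "dog bark"),
    ([0.2, 0.7, 0.1], "rain"),
    ([0.6, 0.5, 0.1], "rain"),  # closest to "dog bark", so misclassified
]

predictions = [zero_shot_classify(embedding, labels) for embedding, _ in clips]
correct = sum(pred == truth for pred, (_, truth) in zip(predictions, clips))
accuracy = 100.0 * correct / len(clips)
```

In a real evaluation the embeddings would come from trained audio and text encoders, and no class-specific fine-tuning would be applied; that is what makes the setting zero-shot.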

Code

xinhaomei/wavcaps (official, 173 stars)

labbeti/aac-datasets

gzhu06/cacophony

Tasks

Audio captioning

Event Detection

Language Modelling

Large Language Model

Sound Event Detection

Zero-shot Audio Classification

Zero-Shot Environment Sound Classification

Zero-shot Text to Audio Retrieval

Datasets

Introduced in the Paper:

WavCaps

Used in the Paper:

AudioSet

Conceptual Captions

ESC-50

AudioCaps

VGG-Sound

Clotho

UrbanSound8K

CC12M

MACS

SoundDescs

Results from the Paper

Ranked #1 on Zero-Shot Environment Sound Classification on ESC-50 (using extra training data)
