Audio Retrieval with Natural Language Queries: A Benchmark Study

The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries. Code, audio features for all datasets used, and the SoundDescs dataset are publicly available at https://github.com/akoepke/audio-retrieval-benchmark.
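The retrieval task described above can be illustrated with a minimal sketch. It assumes that text and audio have already been mapped into a shared vector space, which is a common design for cross-modal retrieval but is not stated in the abstract. The function names `cosine` and `rank_candidates` and the toy vectors are hypothetical and are not taken from the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices ordered from best to worst match."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy, made-up embeddings (assumption: real ones would come from trained encoders).
text_query = [0.9, 0.1, 0.0]
audio_pool = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

# Text-audio retrieval: rank the audio pool for one text query.
ranking = rank_candidates(text_query, audio_pool)
```

Audio-text retrieval would use the same ranking with the roles swapped: an audio embedding as the query and a pool of text embeddings as candidates.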


Datasets


Introduced in the Paper:

SoundDescs

Used in the Paper:

AudioSet, AudioCaps, Clotho

Results from the Paper


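Task                     Dataset     Model                        Metric Name  Metric Value  Global Rank
Text to Audio Retrieval  AudioCaps   CE                           R@1          23.6±0.6      # 8
Text to Audio Retrieval  AudioCaps   CE                           R@10         71.4±0.5      # 7
Text to Audio Retrieval  AudioCaps   MMT                          R@1          36.1±3.3      # 5
Text to Audio Retrieval  AudioCaps   MMT                          R@10         84.5±2.0      # 2
Audio to Text Retrieval  AudioCaps   MoEE                         R@1          26.6±0.7      # 5
Audio to Text Retrieval  AudioCaps   MoEE                         R@10         73.5±1.1      # 5
Audio to Text Retrieval  AudioCaps   CE                           R@1          27.6±1.0      # 4
Audio to Text Retrieval  AudioCaps   CE                           R@10         74.7±0.8      # 4
Audio to Text Retrieval  AudioCaps   MMT                          R@1          39.6±0.2      # 3
Audio to Text Retrieval  AudioCaps   MMT                          R@10         86.7±1.8      # 3
Text to Audio Retrieval  AudioCaps   MoEE                         R@1          23.0±0.7      # 10
Text to Audio Retrieval  AudioCaps   MoEE                         R@10         71.0±1.2      # 8
Text to Audio Retrieval  Clotho      MMT                          R@1          6.5±0.6       # 8
Text to Audio Retrieval  Clotho      MMT                          R@10         32.8±2.1      # 7
Text to Audio Retrieval  Clotho      CE(pretraining: SoundDescs)  R@1          6.4±0.5       # 9
Text to Audio Retrieval  Clotho      CE(pretraining: SoundDescs)  R@10         32.5±1.7      # 8
Audio to Text Retrieval  Clotho      CE(pretraining: SoundDescs)  R@1          6.1±0.7       # 7
Audio to Text Retrieval  Clotho      CE(pretraining: SoundDescs)  R@10         31.4±1.8      # 7
Audio to Text Retrieval  Clotho      MMT                          R@1          6.3±0.5       # 6
Audio to Text Retrieval  Clotho      MMT                          R@10         33.3±2.2      # 5
Audio to Text Retrieval  SoundDescs  CE(pretrained: AudioCaps)    R@1          22.2±0.4      # 4
Audio to Text Retrieval  SoundDescs  CE(pretrained: AudioCaps)    R@10         63.3±0.3      # 4
Audio to Text Retrieval  SoundDescs  MoEE                         R@1          30.9±0.3      # 2
Audio to Text Retrieval  SoundDescs  MoEE                         R@10         70.1±0.3      # 2
Audio to Text Retrieval  SoundDescs  CE                           R@1          30.8±0.8      # 3
Audio to Text Retrieval  SoundDescs  CE                           R@10         69.5±0.1      # 3
Text to Audio Retrieval  SoundDescs  MoEE                         R@1          30.8±0.7      # 2
Text to Audio Retrieval  SoundDescs  MoEE                         R@10         70.9±0.5      # 2
Text to Audio Retrieval  SoundDescs  CE(pretrained: AudioCaps)    R@1          23.3±0.7      # 4
Text to Audio Retrieval  SoundDescs  CE(pretrained: AudioCaps)    R@10         63.9±0.5      # 4
Text to Audio Retrieval  SoundDescs  MMT                          R@1          30.7±0.4      # 3
Text to Audio Retrieval  SoundDescs  MMT                          R@10         72.7±0.8      # 1
Text to Audio Retrieval  SoundDescs  CE                           R@1          31.1±0.2      # 1
Text to Audio Retrieval  SoundDescs  CE                           R@10         70.8±0.5      # 3
Audio to Text Retrieval  SoundDescs  MMT                          R@1          31.4±0.8      # 1
Audio to Text Retrieval  SoundDescs  MMT                          R@10         73.4±0.5      # 1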
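
The R@1 and R@10 figures above are recall-at-k scores: the percentage of queries for which the correct item appears among the top k retrieved candidates. The sketch below shows one way such a score could be computed. It assumes a square similarity matrix in which query i is paired with exactly candidate i. That one-to-one pairing is a simplification, the function name `recall_at_k` is hypothetical, and the numbers are made up rather than taken from the paper.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose paired candidate is ranked in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct candidate: how many score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy, made-up scores for three queries against three candidates.
sim = [
    [0.9, 0.1, 0.3],  # correct candidate ranked first
    [0.2, 0.4, 0.8],  # correct candidate ranked second
    [0.5, 0.6, 0.7],  # correct candidate ranked first
]

r_at_1 = recall_at_k(sim, 1)  # two of three queries hit
r_at_2 = recall_at_k(sim, 2)  # all three queries hit
```

Datasets that pair several captions with one audio clip would need a convention for what counts as a hit, which this sketch does not cover.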
