Audio Retrieval with Natural Language Queries

We consider the task of retrieving audio using free-form natural language queries. To study this problem, which has received limited attention in the existing literature, we introduce challenging new benchmarks for text-based audio retrieval using text annotations sourced from the AudioCaps and Clotho datasets. We then employ these benchmarks to establish baselines for cross-modal audio retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into cross-modal text-based audio retrieval with free-form text queries.

| Task | Dataset | Model | R@1 | R@1 Global Rank | R@10 | R@10 Global Rank |
|---|---|---|---|---|---|---|
| Text to Audio/Video Retrieval | AudioCaps | VGGish | 18.0±0.2 | # 6 | 62.0±0.5 | # 6 |
| Text to Audio/Video Retrieval | AudioCaps | CE-Visual + VGGish | 23.9±0.7 | # 3 | 74.4±0.2 | # 3 |
| Audio/Video to Text Retrieval | AudioCaps | VGGish + VGGSound (CE-Audio) | 25.1±0.9 | # 4 | 73.2±1.6 | # 4 |
| Text to Audio/Video Retrieval | AudioCaps | VGGish + VGGSound (CE-Audio) | 23.1±0.8 | # 4 | 70.7±0.7 | # 4 |
| Audio/Video to Text Retrieval | AudioCaps | VGGSound | 24.6±0.9 | # 5 | 70.4±0.4 | # 5 |
| Text to Audio/Video Retrieval | AudioCaps | VGGSound | 20.5±0.6 | # 5 | 67.0±1.0 | # 5 |
| Audio/Video to Text Retrieval | AudioCaps | R2P1D + Inst (CE-Visual) | 12.1±0.4 | # 7 | 46.1±1.3 | # 7 |
| Text to Audio/Video Retrieval | AudioCaps | R2P1D + Inst (CE-Visual) | 10.1±0.2 | # 7 | 49.6±1.1 | # 7 |
| Audio/Video to Text Retrieval | AudioCaps | VGGish | 21.0±0.8 | # 6 | 62.7±1.6 | # 6 |
| Audio/Video to Text Retrieval | AudioCaps | CE-Visual + VGGSound | 34.0±1.5 | # 1 | 82.5±1.2 | # 2 |
| Text to Audio Retrieval | AudioCaps | CE | 23.1±0.8 | # 9 | 70.7±0.7 | # 9 |
| Text to Audio/Video Retrieval | AudioCaps | CE-Visual + CE-Audio | 28.1±0.6 | # 1 | 79.0±0.5 | # 1 |
| Audio to Text Retrieval | AudioCaps | CE | 25.1±0.9 | # 6 | 73.2±1.6 | # 6 |
| Audio to Text Retrieval | AudioCaps | MoEE | 25.1±0.8 | # 6 | 72.9±1.2 | # 7 |
| Text to Audio Retrieval | AudioCaps | MoEE | 22.5±0.3 | # 11 | 69.5±0.9 | # 10 |
| Audio/Video to Text Retrieval | AudioCaps | Scene + R2P1D | 11.0±0.6 | # 8 | 45.1±1.7 | # 8 |
| Text to Audio/Video Retrieval | AudioCaps | Scene + R2P1D | 8.8±0.1 | # 8 | 46.8±0.1 | # 9 |
| Audio/Video to Text Retrieval | AudioCaps | Scene + Inst | 10.6±0.6 | # 9 | 41.4±1.5 | # 10 |
| Text to Audio/Video Retrieval | AudioCaps | Scene + Inst | 8.7±0.5 | # 9 | 47.4±0.5 | # 8 |
| Audio/Video to Text Retrieval | AudioCaps | R2P1D | 10.3±0.4 | # 10 | 41.8±3.1 | # 9 |
| Text to Audio/Video Retrieval | AudioCaps | R2P1D | 8.2±0.5 | # 10 | 44.7±0.9 | # 11 |
| Audio/Video to Text Retrieval | AudioCaps | Inst | 9.8±0.9 | # 11 | 40.6±0.7 | # 11 |
| Text to Audio/Video Retrieval | AudioCaps | Inst | 7.7±0.2 | # 11 | 46.7±1.3 | # 10 |
| Audio/Video to Text Retrieval | AudioCaps | Scene | 6.5±0.8 | # 12 | 31.3±1.6 | # 12 |
| Text to Audio/Video Retrieval | AudioCaps | Scene | 6.1±0.4 | # 12 | 35.8±0.6 | # 12 |
| Audio/Video to Text Retrieval | AudioCaps | CE-Visual + CE-Audio | 33.7±1.6 | # 2 | 83.7±0.4 | # 1 |
| Text to Audio/Video Retrieval | AudioCaps | CE-Visual + VGGSound | 27.4±0.7 | # 2 | 78.2±0.3 | # 2 |
| Audio/Video to Text Retrieval | AudioCaps | CE-Visual + VGGish | 29.0±2.0 | # 3 | 77.2±1.9 | # 3 |
| Text to Audio Retrieval | Clotho | CE (pretraining:AudioCaps) | 9.6±0.3 | # 5 | 40.1±0.7 | # 4 |
| Text to Audio Retrieval | Clotho | MoEE (pretraining:AudioCaps) | 8.6±0.4 | # 6 | 39.3±0.7 | # 5 |
| Audio to Text Retrieval | Clotho | CE | 7.1±0.3 | # 5 | 34.6±0.5 | # 4 |
| Text to Audio Retrieval | Clotho | CE | 6.7±0.4 | # 7 | 33.2±0.3 | # 6 |
| Audio to Text Retrieval | Clotho | MoEE | 7.2±0.5 | # 4 | 33.2±1.1 | # 6 |
| Text to Audio Retrieval | Clotho | MoEE | 6.0±0.1 | # 10 | 32.3±0.3 | # 9 |
| Audio to Text Retrieval | Clotho | CE (pretraining:AudioCaps) | 10.7±0.6 | # 2 | 40.8±1.4 | # 2 |
| Audio to Text Retrieval | Clotho | MoEE (pretraining:AudioCaps) | 10.0±0.3 | # 3 | 40.1±1.3 | # 3 |
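
The R@1 and R@10 figures above are recall-at-K scores: the percentage of queries for which a correct item appears among the top K retrieved candidates. The following is a minimal sketch of how such a score can be computed from a query-by-candidate similarity matrix. It is not the authors' evaluation code. The function name `recall_at_k`, the toy matrix `sim`, and the assumption that query i has exactly one correct candidate (candidate i) are illustrative only, and the benchmarks' actual protocol may differ, for example when several captions describe one audio clip.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose matching item is ranked in the top k.

    Assumption (hypothetical setup): similarity[i][j] scores query i against
    candidate j, and the correct candidate for query i is candidate i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, scores in enumerate(similarity):
        target = scores[i]
        # Zero-based rank: how many candidates score strictly higher
        # than the correct one.
        rank = sum(1 for s in scores if s > target)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 text queries scored against 3 audio clips.
sim = [
    [0.9, 0.1, 0.3],  # correct clip ranked 1st
    [0.8, 0.5, 0.2],  # correct clip ranked 2nd
    [0.7, 0.6, 0.1],  # correct clip ranked 3rd
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
r3 = recall_at_k(sim, 3)
```

In this toy case one of three queries is answered correctly at K=1, two at K=2, and all three at K=3. Swapping the roles of rows and columns gives the audio-to-text direction under the same assumption.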
