TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Text to Audio Retrieval	AudioCaps	CE	R@1	23.6±0.6	# 8
Text to Audio Retrieval	AudioCaps	CE	R@10	71.4±0.5	# 7
Text to Audio Retrieval	AudioCaps	MMT	R@1	36.1±3.3	# 5
Text to Audio Retrieval	AudioCaps	MMT	R@10	84.5±2.0	# 2
Audio to Text Retrieval	AudioCaps	MoEE	R@1	26.6±0.7	# 5
Audio to Text Retrieval	AudioCaps	MoEE	R@10	73.5±1.1	# 5
Audio to Text Retrieval	AudioCaps	CE	R@1	27.6±1.0	# 4
Audio to Text Retrieval	AudioCaps	CE	R@10	74.7±0.8	# 4
Audio to Text Retrieval	AudioCaps	MMT	R@1	39.6±0.2	# 3
Audio to Text Retrieval	AudioCaps	MMT	R@10	86.7±1.8	# 3
Text to Audio Retrieval	AudioCaps	MoEE	R@1	23.0±0.7	# 10
Text to Audio Retrieval	AudioCaps	MoEE	R@10	71.0±1.2	# 8
Text to Audio Retrieval	Clotho	MMT	R@1	6.5±0.6	# 8
Text to Audio Retrieval	Clotho	MMT	R@10	32.8±2.1	# 7
Text to Audio Retrieval	Clotho	CE(pretraining: SoundDescs)	R@1	6.4±0.5	# 9
Text to Audio Retrieval	Clotho	CE(pretraining: SoundDescs)	R@10	32.5±1.7	# 8
Audio to Text Retrieval	Clotho	CE(pretraining: SoundDescs)	R@1	6.1±0.7	# 7
Audio to Text Retrieval	Clotho	CE(pretraining: SoundDescs)	R@10	31.4±1.8	# 7
Audio to Text Retrieval	Clotho	MMT	R@1	6.3±0.5	# 6
Audio to Text Retrieval	Clotho	MMT	R@10	33.3±2.2	# 5
Audio to Text Retrieval	SoundDescs	CE(pretrained: AudioCaps)	R@1	22.2±0.4	# 4
Audio to Text Retrieval	SoundDescs	CE(pretrained: AudioCaps)	R@10	63.3±0.3	# 4
Audio to Text Retrieval	SoundDescs	MoEE	R@1	30.9±0.3	# 2
Audio to Text Retrieval	SoundDescs	MoEE	R@10	70.1±0.3	# 2
Audio to Text Retrieval	SoundDescs	CE	R@1	30.8±0.8	# 3
Audio to Text Retrieval	SoundDescs	CE	R@10	69.5±0.1	# 3
Text to Audio Retrieval	SoundDescs	MoEE	R@1	30.8±0.7	# 2
Text to Audio Retrieval	SoundDescs	MoEE	R@10	70.9±0.5	# 2
Text to Audio Retrieval	SoundDescs	CE(pretrained: AudioCaps)	R@1	23.3±0.7	# 4
Text to Audio Retrieval	SoundDescs	CE(pretrained: AudioCaps)	R@10	63.9±0.5	# 4
Text to Audio Retrieval	SoundDescs	MMT	R@1	30.7±0.4	# 3
Text to Audio Retrieval	SoundDescs	MMT	R@10	72.7±0.8	# 1
Text to Audio Retrieval	SoundDescs	CE	R@1	31.1±0.2	# 1
Text to Audio Retrieval	SoundDescs	CE	R@10	70.8±0.5	# 3
Audio to Text Retrieval	SoundDescs	MMT	R@1	31.4±0.8	# 1
Audio to Text Retrieval	SoundDescs	MMT	R@10	73.4±0.5	# 1

Audio Retrieval with Natural Language Queries: A Benchmark Study

17 Dec 2021 · A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie

The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries. Code, audio features for all datasets used, and the SoundDescs dataset are publicly available at https://github.com/akoepke/audio-retrieval-benchmark.
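The results on this page are reported as R@1 and R@10 (recall at rank k). As a rough illustration of how such a metric is commonly computed for cross-modal retrieval, the sketch below scores every query against every candidate and checks whether the correct candidate lands in the top k. It is not taken from the authors' code: the function names `recall_at_k` and `summarize`, the toy similarity scores, and the one-to-one pairing of queries and candidates are all assumptions made for illustration (captioning datasets can have several captions per clip). The sketch also assumes that the ± in the table denotes a standard deviation over repeated runs, and uses the sample standard deviation for that.

```python
from statistics import mean, stdev


def recall_at_k(similarity, k):
    """Percentage of queries whose correct candidate is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption for this sketch: candidate i is the single correct match
    for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct candidate: how many candidates
        # score strictly higher than it does.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


def summarize(values):
    """Format results from repeated runs as mean±std (sample std)."""
    return f"{mean(values):.1f}±{stdev(values):.1f}"


# Hypothetical text-to-audio scores: 3 text queries (rows) x 3 clips (columns).
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct clip ranked first
    [0.2, 0.4, 0.8],  # query 1: correct clip ranked second
    [0.5, 0.6, 0.1],  # query 2: correct clip ranked third
]

print(f"R@1 = {recall_at_k(toy_scores, 1):.1f}")
print(f"R@2 = {recall_at_k(toy_scores, 2):.1f}")
print(summarize([30.0, 32.0, 34.0]))  # hypothetical scores from three runs
```

For the audio-to-text direction, the same function can be applied to the transposed matrix, so that rows index audio clips and columns index captions.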

Code

akoepke/audio-retrieval-benchmark official

Tasks

AudioCaps

Audio captioning

Audio to Text Retrieval

Natural Language Queries

Retrieval

Text Retrieval

Text to Audio Retrieval

Datasets

Introduced in the Paper:

SoundDescs

Used in the Paper:

AudioSet

AudioCaps

Clotho

Results from the Paper

Ranked #1 on Audio to Text Retrieval on SoundDescs

