TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Audio captioning	Clotho	Ensemble	CIDEr	0.400	# 5
Audio captioning	Clotho	Ensemble	SPIDEr	0.318	# 1
Audio captioning	Clotho	Ensemble	SPICE	0.137	# 1


THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAINING AND WORD SELECTION METHODS

DCASE Workshop 2021 · Weiqiang Yuan∗, Qichen Han∗, Dong Liu, Xiang Li, Zhen Yang

This technical report describes the system that participated in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning. We use an encoder-decoder modeling framework for audio understanding and caption generation. Our solution focuses on two problems in automated audio captioning: data insufficiency and word selection indeterminacy. Because few audio clips with gold-standard captions are available, we collect a large-scale weakly labeled dataset from the web using heuristic methods. We then pre-train the encoder-decoder models on this dataset and fine-tune them on the Clotho dataset. To address word selection indeterminacy, we use keywords extracted from the captions of similar audio clips, together with audio event tags produced by pre-trained models, to guide word generation in the decoding stage. We tested our submissions on the development-testing dataset. Our best submission achieved a SPIDEr score of 31.8, compared with 5.4 for the baseline system.
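The abstract does not say how keywords and event tags guide word generation. As a rough illustration only, the sketch below shows one common way to bias a decoder toward a keyword set: add a fixed bonus to the log-probability of each keyword at a decoding step, renormalize, and pick the top word. The function names, the `bonus` parameter, and the toy distribution are all hypothetical and are not taken from the report.

```python
import math


def bias_toward_keywords(log_probs, keywords, bonus=1.0):
    """Add `bonus` to the log-probability of each keyword, then renormalize.

    Assumption: additive log-probability biasing is a generic technique,
    not necessarily the method used by the report's authors.
    """
    boosted = {
        word: lp + (bonus if word in keywords else 0.0)
        for word, lp in log_probs.items()
    }
    # Renormalize so the biased scores form a valid distribution again.
    log_z = math.log(sum(math.exp(lp) for lp in boosted.values()))
    return {word: lp - log_z for word, lp in boosted.items()}


def pick_word(log_probs, keywords, bonus=1.0):
    """Greedy choice of the next word after keyword biasing."""
    biased = bias_toward_keywords(log_probs, keywords, bonus)
    return max(biased, key=biased.get)


# Toy next-word distribution from a hypothetical decoder step.
step = {
    "animal": math.log(0.40),
    "bird": math.log(0.30),
    "sound": math.log(0.30),
}

print(pick_word(step, keywords=set()))     # no guidance
print(pick_word(step, keywords={"bird"}))  # guided by a keyword from similar clips
```

In this toy case the unguided decoder prefers the generic word, while the keyword bonus shifts the choice to the more specific one. A real system would apply the same idea per step inside beam search and would need to tune how strongly keywords are favored.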
