TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Audio captioning	Clotho	Ensemble-RL	CIDEr	0.468	# 2
Audio captioning	Clotho	Ensemble-RL	SPIDEr	0.295	# 2
Audio captioning	Clotho	Ensemble-RL	SPICE	0.123	# 3
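
The three rows above are related: SPIDEr is defined as the arithmetic mean of CIDEr and SPICE, so the listed values can be cross-checked. The snippet below is a minimal sketch of that check; the function name `spider` is an illustrative assumption, not part of any evaluation toolkit, and the inputs are the rounded values from the table, so agreement is only expected to within rounding.

```python
def spider(cider: float, spice: float) -> float:
    """Return SPIDEr as the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# Rounded values taken from the table above (Ensemble-RL on Clotho).
cider_score = 0.468
spice_score = 0.123
reported_spider = 0.295

computed = spider(cider_score, spice_score)
# The inputs are rounded to three decimals, so compare with a tolerance.
consistent = abs(computed - reported_spider) < 1e-3
print(f"computed SPIDEr = {computed:.4f}, consistent with table: {consistent}")
```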


THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING

DCASE Challenge 2021 · Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu

This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 6. Our audio captioning system consists of a 10-layer convolutional neural network (CNN) encoder and a temporal attentional single-layer gated recurrent unit (GRU) decoder. In this challenge, there is no restriction on the usage of external data and pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross-entropy-based training, we further fine-tune the model with reinforcement learning to directly optimize the evaluation metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensemble.
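
The abstract does not say which reinforcement learning algorithm is used for fine-tuning. One common choice for optimizing a captioning metric directly is a REINFORCE-style policy gradient with a baseline (as in self-critical sequence training), and the sketch below illustrates that idea only, as an assumption rather than the authors' method. The function name, the argument names, and the example numbers are all hypothetical.

```python
def policy_gradient_loss(sampled_log_probs, sampled_reward, baseline_reward):
    """REINFORCE-with-baseline loss for a single sampled caption.

    sampled_log_probs: log-probabilities the decoder assigned to each
        token of the sampled caption (hypothetical values here).
    sampled_reward: metric score of the sampled caption.
    baseline_reward: metric score of a baseline caption, for example
        the greedy decoding of the same clip (an assumption).
    """
    advantage = sampled_reward - baseline_reward
    # Minimizing this raises the likelihood of captions that beat the
    # baseline and lowers it for captions that score worse.
    return -advantage * sum(sampled_log_probs)


# Hypothetical example: a three-token caption scoring above the baseline.
loss = policy_gradient_loss([-0.5, -1.0, -0.25], sampled_reward=0.30,
                            baseline_reward=0.25)
print(f"loss = {loss:.4f}")
```

In a real system the log-probabilities would come from the GRU decoder and the loss would be back-propagated through an automatic differentiation framework; this plain-Python version only shows how the metric enters the objective.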
