THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING
This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge task Task 6. Our audio captioning system consists of a 10-layer convolution neural network (CNN) encoder and a tempo- ral attentional single layer gated recurrent unit (GRU) decoder. In this challenge, there is no restriction on the usage of external data and pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross entropy based training, we further fine-tune the model with reinforcement learning to directly optimize the evalua- tion metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensemble1.
PDFCode
Datasets
Results from the Paper
Ranked #5 on
Audio captioning
on Clotho
(using extra training data)