THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING

DCASE Challenge 2021  ·  Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu

This report proposes an audio captioning system for Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge. Our audio captioning system consists of a 10-layer convolutional neural network (CNN) encoder and a single-layer gated recurrent unit (GRU) decoder with temporal attention. In this challenge, there is no restriction on the usage of external data and pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross-entropy training, we further fine-tune the model with reinforcement learning to directly optimize the evaluation metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensembling.
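The reinforcement learning step can be illustrated with a minimal policy-gradient sketch. The abstract only states that reinforcement learning is used to optimize the evaluation metric; the self-critical baseline (the reward of a greedily decoded caption) and the function name `policy_gradient_loss` are assumptions made here for illustration, not details confirmed by the report.

```python
import math


def policy_gradient_loss(token_log_probs, sampled_reward, baseline_reward):
    """REINFORCE-style loss for one sampled caption (illustrative sketch).

    token_log_probs: log-probabilities the decoder assigned to each sampled token.
    sampled_reward:  metric score (e.g. SPIDEr or CIDEr) of the sampled caption.
    baseline_reward: metric score of a baseline caption; assumed here to be the
                     greedy decode, as in self-critical training.
    """
    # A positive advantage means the sampled caption beat the baseline.
    advantage = sampled_reward - baseline_reward
    # Minimizing this loss raises the log-likelihood of captions with positive
    # advantage and lowers it for captions with negative advantage.
    return -advantage * sum(token_log_probs)


# Hypothetical two-token caption with made-up probabilities and rewards.
log_probs = [math.log(0.5), math.log(0.25)]
loss = policy_gradient_loss(log_probs, sampled_reward=0.30, baseline_reward=0.25)
```

In a real system the log-probabilities would come from the GRU decoder and the loss would be backpropagated through it; here plain floats stand in for tensors to keep the sketch self-contained.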

Results from the Paper


Ranked #2 on Audio captioning on Clotho (using extra training data)

Task              Dataset  Model        Metric  Value  Global Rank  Uses Extra Training Data
Audio captioning  Clotho   Ensemble-RL  CIDEr   0.468  #2           Yes
Audio captioning  Clotho   Ensemble-RL  SPIDEr  0.295  #2           Yes
Audio captioning  Clotho   Ensemble-RL  SPICE   0.123  #3           Yes
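The three metrics in the table are related: SPIDEr is defined as the arithmetic mean of CIDEr and SPICE. The short check below applies that definition to the reported values; the helper name `spider` is introduced here for illustration only.

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2.0


# Values taken from the Ensemble-RL row of the results table.
score = spider(cider=0.468, spice=0.123)
```

The result is about 0.2955, which agrees with the reported SPIDEr of 0.295 up to rounding of the published CIDEr and SPICE figures.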
