THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAINING AND WORD SELECTION METHODS

This technical report describes our system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning. We use an encoder-decoder modeling framework for audio understanding and caption generation. Our solution focuses on two problems in automated audio captioning: data insufficiency and word selection indeterminacy. Because only a limited number of audio clips with gold-standard captions are available, we collect a large-scale weakly labeled dataset from the Web using heuristic methods. We then pre-train the encoder-decoder models on this dataset and fine-tune them on the Clotho dataset. To address word selection indeterminacy, we use keywords extracted from the captions of similar audio clips, together with audio event tags produced by pre-trained models, to guide word generation at the decoding stage. We evaluated our submissions on the development-testing dataset. Our best submission achieved a SPIDEr score of 31.8, compared with 5.4 for the baseline system.
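
The abstract does not say how keywords guide word generation during decoding. As a minimal sketch of one possible approach, the snippet below adds a fixed bonus to the decoder score of every keyword before picking the next word. The function name `biased_next_word`, the `bonus` parameter, and the toy scores are all hypothetical and are not taken from the report.

```python
import math


def biased_next_word(logits, keywords, bonus=1.0):
    """Pick the next word after adding `bonus` to each keyword's score.

    `logits` maps candidate words to raw decoder scores (assumed input).
    `keywords` is the set of guiding words, e.g. from similar captions or tags.
    """
    # Additive bias: an assumption, one of several ways to guide decoding.
    adjusted = {w: s + (bonus if w in keywords else 0.0) for w, s in logits.items()}
    # Numerically stable softmax over the adjusted scores.
    top = max(adjusted.values())
    exps = {w: math.exp(s - top) for w, s in adjusted.items()}
    total = sum(exps.values())
    probs = {w: e / total for w, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy decoder scores for a single decoding step (illustrative values only).
logits = {"dog": 1.0, "barks": 0.2, "engine": 1.3, "rain": 0.1}
keywords = {"dog", "barks"}
word, probs = biased_next_word(logits, keywords, bonus=1.0)
```

With these toy scores the unguided choice would be "engine", while the keyword bonus shifts the choice to "dog". A larger `bonus` enforces the keywords more strongly at the cost of ignoring the decoder's own preferences.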


Results from the Paper


 Ranked #1 on Audio captioning on Clotho (using extra training data)

Task              Dataset  Model     Metric Name  Metric Value  Global Rank
Audio captioning  Clotho   Ensemble  CIDEr        0.400         # 5
Audio captioning  Clotho   Ensemble  SPIDEr       0.318         # 1
Audio captioning  Clotho   Ensemble  SPICE        0.137         # 1
