no code implementations • 7 Dec 2018 • Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu
To achieve this, we propose an audio-visual video captioning framework based on a multimodal convolutional neural network, and we introduce a modality-aware module that explores which modality is selected as each word of the sentence is generated.
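
The idea of modality selection during sentence generation can be illustrated with a small sketch. This is not the authors' module: it assumes a simple dot-product scoring between a decoder state and one context vector per modality, followed by a softmax, and the names `modality_gate`, `decoder_state`, `audio_context`, and `visual_context` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def modality_gate(decoder_state, audio_context, visual_context):
    """Hypothetical modality gate for one decoding step.

    Scores each modality by its dot product with the decoder state
    (an assumed scoring rule, not taken from the paper), converts the
    two scores into weights that sum to 1, and returns the weights
    together with the weighted sum of the two context vectors.
    """
    scores = [
        dot(decoder_state, audio_context),
        dot(decoder_state, visual_context),
    ]
    w_audio, w_visual = softmax(scores)
    fused = [
        w_audio * a + w_visual * v
        for a, v in zip(audio_context, visual_context)
    ]
    return {"audio": w_audio, "visual": w_visual}, fused


# Toy step: the decoder state aligns with the audio context,
# so the audio weight comes out larger than the visual weight.
weights, fused = modality_gate([1.0, 0.0], [2.0, 0.0], [0.0, 2.0])
print(weights, fused)
```

In a sketch like this, recording the weights at every decoding step would give one audio/visual pair per generated word, which is one way such a module could make the modality choice inspectable.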