Dense Video Captioning
21 papers with code • 3 benchmarks • 6 datasets
Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
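Since dense video captioning jointly localizes and describes events, a natural representation is a list of timestamped events per video, with temporal IoU used to match predictions against ground truth during evaluation. A minimal sketch (the `Event` class and `temporal_iou` helper are illustrative names, not from any specific benchmark toolkit):

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float   # event start time, in seconds
    end: float     # event end time, in seconds
    caption: str   # natural-language description of the event

def temporal_iou(a: Event, b: Event) -> float:
    """Overlap-over-union of two event intervals, commonly used to
    decide whether a predicted event matches a ground-truth event."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = max(a.end, b.end) - min(a.start, b.start)
    return inter / union if union > 0 else 0.0

# The piano example from the description: overlapping events in one video.
video = [
    Event(0.0, 12.5, "a man plays the piano"),
    Event(3.0, 9.0, "another man dances"),
    Event(10.0, 12.5, "a crowd claps"),
]
```

Note that, unlike single-sentence video captioning, the events may overlap in time, so a model must output a set of (start, end, caption) triples rather than one description per video.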
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.
We introduce the problem of procedure segmentation: segmenting a procedural video into category-independent procedure segments.
In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.
We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions.
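The idea of exploiting both past and future context can be illustrated by encoding the video once forward and once backward, then scoring each candidate proposal boundary from the concatenated summaries. This is only a minimal sketch of the general principle, not the paper's actual architecture; the running-mean "encoders" and the weight vector `w` are stand-ins for learned recurrent networks and a learned scoring head:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 4                                  # number of frames, feature dim
feats = rng.standard_normal((T, D))          # per-frame visual features

# Stand-in encoders: fwd[t] summarizes frames 0..t (past context),
# bwd[t] summarizes frames t..T-1 (future context).
fwd = np.cumsum(feats, axis=0) / np.arange(1, T + 1)[:, None]
bwd = (np.cumsum(feats[::-1], axis=0) / np.arange(1, T + 1)[:, None])[::-1]

w = rng.standard_normal(2 * D)               # stand-in scoring weights

def proposal_score(t: int) -> float:
    """Score a proposal boundary at frame t using both directions."""
    return float(np.concatenate([fwd[t], bwd[t]]) @ w)

scores = [proposal_score(t) for t in range(T)]
```

A unidirectional model would only see `fwd[t]` at frame `t`; fusing the backward summary lets the score at an early frame reflect events that have not happened yet.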
This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.