22 papers with code • 1 benchmark • 1 dataset
Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail.
Experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs.
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language.
The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) from images, labeling each with a short descriptive phrase.
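As a rough illustration of the task's output format, a dense captioning model maps one image to many localized region–phrase pairs rather than a single sentence. The `RegionCaption` type and the example predictions below are hypothetical, not from any specific model or dataset:

```python
from dataclasses import dataclass

@dataclass
class RegionCaption:
    """One localized visual concept and its descriptive phrase."""
    box: tuple    # (x, y, width, height) in pixel coordinates
    phrase: str   # short natural-language description of the region

# Hypothetical dense-captioning output for a single image: note that
# regions may overlap and describe objects, parts, or interactions.
predictions = [
    RegionCaption(box=(34, 50, 120, 80), phrase="a black cat"),
    RegionCaption(box=(40, 60, 30, 20), phrase="the cat's left ear"),
    RegionCaption(box=(0, 130, 200, 60), phrase="cat lying on a sofa"),
]

for p in predictions:
    print(p.box, "->", p.phrase)
```

This contrasts with standard image captioning, where the same image would yield only one sentence covering the whole scene.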
In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.
This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Prior work in this domain has shown that there is ample room for improvement in the generated image sequence in terms of visual quality, consistency and relevance.
3D dense captioning is a recently proposed task in which point clouds provide richer geometric information than their 2D image counterparts.