3D dense captioning
11 papers with code • 0 benchmarks • 1 dataset
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Beyond the coarse semantic class prediction and bounding box regression of traditional 3D object detection, 3D dense captioning aims to produce a finer, instance-level natural-language description of the visual appearance and spatial relations of each scene object of interest.
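To make the task output concrete, here is a minimal illustrative sketch of the interface a 3D dense captioning model is expected to satisfy: a point cloud goes in, and a set of boxes, each paired with a class label and a caption, comes out. All names (`DenseCaption`, `dense_caption_scene`) are hypothetical and not taken from any of the papers below.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class DenseCaption:
    """One instance-level prediction: a localized object plus its description."""
    box_center: np.ndarray  # (3,) xyz center of the 3D bounding box
    box_size: np.ndarray    # (3,) box extents along each axis
    class_label: str        # coarse semantic class, as in 3D object detection
    caption: str            # free-form description of appearance and spatial relations


def dense_caption_scene(points: np.ndarray) -> List[DenseCaption]:
    """Hypothetical end-to-end interface: (N, 3+C) point cloud in, captioned boxes out.

    A real model would run 3D detection and caption generation here; this stub
    only shows the expected output structure.
    """
    return [
        DenseCaption(
            box_center=np.zeros(3),
            box_size=np.ones(3),
            class_label="chair",
            caption="a brown wooden chair next to the round table",
        )
    ]
```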
Most implemented papers
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Thus, a more faithful caption can be generated using only point clouds at inference time.
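The excerpt above hints at the training recipe: a teacher branch consumes rich 2D image features during training, and the point-cloud student is pulled toward it, so no images are needed at test time. Below is a minimal sketch of that cross-modal distillation idea; the module shapes and the single MSE feature loss are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointStudent(nn.Module):
    """Toy point-cloud encoder standing in for the 3D branch."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:  # (B, N, 3)
        return self.mlp(points).max(dim=1).values  # (B, dim), symmetric pooling


def distillation_loss(student: PointStudent,
                      points: torch.Tensor,
                      teacher_feat: torch.Tensor) -> torch.Tensor:
    """Pull 3D-only student features toward 2D-informed teacher features.

    `teacher_feat` would come from a branch that also sees 2D images during
    training; at inference, only `student(points)` is ever called.
    """
    return F.mse_loss(student(points), teacher_feat.detach())


# Usage sketch with random stand-ins for real data:
student = PointStudent()
loss = distillation_loss(student, torch.randn(2, 1024, 3), torch.randn(2, 256))
loss.backward()
```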
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
3D dense captioning is a recently proposed task in which point clouds provide richer geometric information than their 2D counterparts.
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
Current approaches to 3D visual reasoning are task-specific and lack pre-training methods for learning generic representations that transfer across tasks.
End-to-End 3D Dense Captioning with Vote2Cap-DETR
Compared with prior art, our framework has several appealing advantages: 1) without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner.
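As a rough picture of what a learnable-query object decoder with set-prediction captioning means in code, here is a minimal DETR-style sketch. It simplifies deliberately: the real Vote2Cap-DETR derives vote queries from the scene and decodes captions autoregressively, whereas this illustration uses plain learned query embeddings and a one-shot caption head.

```python
import torch
import torch.nn as nn


class SetCaptioner(nn.Module):
    """DETR-style sketch: queries attend to encoded scene tokens; parallel
    heads emit a box and caption logits per query, as an unordered set."""

    def __init__(self, dim=256, num_queries=128, vocab=3000, max_len=32):
        super().__init__()
        self.max_len, self.vocab = max_len, vocab
        self.queries = nn.Embedding(num_queries, dim)  # stand-in for vote queries
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.box_head = nn.Linear(dim, 6)                    # center (3) + size (3)
        self.caption_head = nn.Linear(dim, max_len * vocab)  # one-shot, not autoregressive

    def forward(self, scene_tokens):  # (B, T, dim) encoded point features
        B = scene_tokens.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(q, scene_tokens)                    # (B, Q, dim)
        boxes = self.box_head(h)                             # (B, Q, 6)
        caps = self.caption_head(h).view(B, -1, self.max_len, self.vocab)
        return boxes, caps


# Usage sketch: each of the 128 queries yields one box plus caption logits.
model = SetCaptioner()
boxes, caps = model(torch.randn(2, 512, 256))
```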
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.
An Embodied Generalist Agent in 3D World
However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning and acting.
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Recent progress in Large Multimodal Models (LMMs) has opened up great possibilities for various applications in human-machine interaction. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains challenging, especially given the need to understand permutation-invariant point-cloud representations of the 3D scene.
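"Permutation-invariant" here means that reordering the points of a scene must not change the model's understanding of it. A PointNet-style symmetric aggregation is the standard way to obtain this property; the toy feature map below is purely illustrative.

```python
import torch


def scene_feature(points: torch.Tensor) -> torch.Tensor:
    """(N, 3) point cloud -> global feature that ignores point order.

    Max over the point dimension is a symmetric function, so any permutation
    of the rows yields the same output (PointNet-style aggregation).
    """
    return points.max(dim=0).values


pts = torch.randn(1024, 3)
shuffled = pts[torch.randperm(pts.size(0))]
assert torch.allclose(scene_feature(pts), scene_feature(shuffled))
```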
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes
However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as scene dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes.