3D dense captioning

11 papers with code • 0 benchmarks • 1 dataset

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Beyond the coarse semantic class prediction and bounding box regression of traditional 3D object detection, 3D dense captioning aims to produce a finer instance-level label: a natural language description of the visual appearance and spatial relations of each scene object of interest.
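As a rough illustration of the task output described above, the sketch below pairs each detected object with a box, a coarse class label, and a free-form caption. The names `Box3D`, `DenseCaption`, and `axis_aligned_iou` are hypothetical and are not taken from any of the listed papers or repositories. The axis-aligned box parameterization and the IoU helper are assumptions, included only to show how a predicted box might be compared with a reference box.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box (an assumed parameterization): center and full extent."""
    center: Vec3
    size: Vec3

    def volume(self) -> float:
        sx, sy, sz = self.size
        return sx * sy * sz


@dataclass(frozen=True)
class DenseCaption:
    """One instance-level prediction: where the object is, what it is, and a description."""
    box: Box3D
    label: str    # coarse semantic class, as in 3D object detection
    caption: str  # natural language description of appearance and spatial relations


def axis_aligned_iou(a: Box3D, b: Box3D) -> float:
    """Intersection-over-union of two axis-aligned boxes; 0.0 if they do not overlap."""
    inter = 1.0
    for c_a, s_a, c_b, s_b in zip(a.center, a.size, b.center, b.size):
        lo = max(c_a - s_a / 2, c_b - s_b / 2)
        hi = min(c_a + s_a / 2, c_b + s_b / 2)
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    union = a.volume() + b.volume() - inter
    return inter / union


# Toy scene output: one entry per object of interest (values are made up).
predictions = [
    DenseCaption(
        box=Box3D(center=(1.0, 2.0, 0.5), size=(0.6, 0.6, 1.0)),
        label="chair",
        caption="a brown wooden chair next to the desk",
    ),
    DenseCaption(
        box=Box3D(center=(1.8, 2.0, 0.4), size=(1.4, 0.7, 0.8)),
        label="desk",
        caption="a white desk against the wall, to the right of the chair",
    ),
]
```

A caller could, for example, keep only the predictions whose box overlaps a reference box above some chosen IoU threshold before comparing the caption text. That matching rule is a design choice of this sketch, not something specified on this page.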


Most implemented papers

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

curryyuan/x-trans2cap CVPR 2022

Thus, a more faithful caption can be generated using only point clouds during inference.

MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

SxJyJay/MORE 10 Mar 2022

3D dense captioning is a recently-proposed novel task, where point clouds contain more geometric information than the 2D counterpart.

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

heng-hw/spacap3d 22 Apr 2022

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.

Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training

leolyj/3d-vlp CVPR 2023

The current approaches for 3D visual reasoning are task-specific, and lack pre-training methods to learn generic representations that can transfer across various tasks.

End-to-End 3D Dense Captioning with Vote2Cap-DETR

ch3cook-fdu/vote2cap-detr CVPR 2023

Compared with prior art, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner.

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

ch3cook-fdu/vote2cap-detr 6 Sep 2023

Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.

An Embodied Generalist Agent in 3D World

embodied-generalist/embodied-generalist 18 Nov 2023

However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning and acting.

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

open3da/ll3da 30 Nov 2023

However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially considering the demand for understanding permutation-invariant point cloud 3D representations of the 3D scene.

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

open3da/ll3da CVPR 2024

Recent progress in Large Multimodal Models (LMM) has opened up great possibilities for various applications in the field of human-machine interactions.

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

jxbbb/tod3cap 28 Mar 2024

However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes.