3D dense captioning

9 papers with code • 0 benchmarks • 1 dataset

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Beyond the coarse semantic class prediction and bounding box regression of traditional 3D object detection, 3D dense captioning aims to produce a finer, instance-level natural-language description of the visual appearance and spatial relations of each scene object of interest.
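To make the task's input-output structure concrete, here is a minimal sketch of a captioner's interface (all names hypothetical; a real system pairs a 3D detector with a language decoder):

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class DenseCaption:
    """One detected object together with its natural-language description."""
    box_center: np.ndarray  # (3,) xyz center of the 3D bounding box
    box_size: np.ndarray    # (3,) box extents along x, y, z
    label: str              # coarse semantic class, e.g. "chair"
    caption: str            # instance-level description of appearance and
                            # spatial relations

def dense_caption(points: np.ndarray) -> List[DenseCaption]:
    """Hypothetical captioner: takes (N, 3 + C) points (xyz plus features
    such as color) and returns one box-caption pair per detected object."""
    # Trivial placeholder output; real methods first localize objects, then
    # describe each one conditioned on its context in the scene.
    return [DenseCaption(np.zeros(3), np.ones(3), "chair",
                         "a brown wooden chair next to the window")]
```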

Most implemented papers

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

curryyuan/x-trans2cap CVPR 2022

Thus, a more faithful caption can be generated using only point clouds during inference.
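A minimal sketch of the general teacher-student idea behind this kind of cross-modal transfer (modules and dimensions are illustrative assumptions, not the paper's exact architecture): a teacher sees 2D image features alongside 3D features at training time, the student sees 3D features only, and a feature-mimicking loss transfers the teacher's richer representation so that inference needs point clouds alone.

```python
import torch
import torch.nn as nn

class Student3D(nn.Module):
    """Student: consumes only per-object 3D features."""
    def __init__(self, d=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, d), nn.ReLU(), nn.Linear(d, d))
    def forward(self, feats_3d):             # (B, N, 32)
        return self.net(feats_3d)

class Teacher2D3D(nn.Module):
    """Teacher: fuses 3D features (32-d) with 2D image features (64-d)."""
    def __init__(self, d=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32 + 64, d), nn.ReLU(), nn.Linear(d, d))
    def forward(self, feats_3d, feats_2d):
        return self.net(torch.cat([feats_3d, feats_2d], dim=-1))

student, teacher = Student3D(), Teacher2D3D()
f3d, f2d = torch.randn(2, 8, 32), torch.randn(2, 8, 64)
# Train the student to mimic the (frozen) teacher; in practice this is
# combined with the usual captioning loss.
transfer_loss = nn.functional.mse_loss(student(f3d), teacher(f3d, f2d).detach())
transfer_loss.backward()
```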

MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

SxJyJay/MORE 10 Mar 2022

3D dense captioning is a recently proposed task in which point clouds carry more geometric information than their 2D counterparts.

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

heng-hw/spacap3d 22 Apr 2022

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.

Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training

leolyj/3d-vlp CVPR 2023

Current approaches to 3D visual reasoning are task-specific and lack pre-training methods for learning generic representations that transfer across tasks.

End-to-End 3D Dense Captioning with Vote2Cap-DETR

ch3cook-fdu/vote2cap-detr CVPR 2023

Compared with prior art, our framework has several appealing advantages: 1) without resorting to numerous hand-crafted components, our method is built on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner.
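A schematic of DETR-style set prediction for dense captioning (dimensions and heads are illustrative assumptions, not Vote2Cap-DETR's exact configuration): a fixed set of learnable queries cross-attends to encoded scene tokens, and each query jointly predicts one box and one caption, with no hand-crafted proposal or NMS stage.

```python
import torch
import torch.nn as nn

class SetPredictionCaptioner(nn.Module):
    def __init__(self, d=256, num_queries=256, vocab=3000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d))  # learnable object queries
        layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.box_head = nn.Linear(d, 6)      # box center (3) + size (3) per query
        self.cap_head = nn.Linear(d, vocab)  # stand-in for an autoregressive caption decoder

    def forward(self, scene_tokens):         # (B, N, d) encoded point-cloud tokens
        q = self.queries.unsqueeze(0).expand(scene_tokens.size(0), -1, -1)
        h = self.decoder(q, scene_tokens)    # each query attends to the whole scene
        return self.box_head(h), self.cap_head(h)

boxes, caps = SetPredictionCaptioner()(torch.randn(2, 1024, 256))
print(boxes.shape, caps.shape)  # (2, 256, 6) and (2, 256, 3000)
```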

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

ch3cook-fdu/vote2cap-detr 6 Sep 2023

Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.
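Following that stated motivation (this is an illustration of the decoupling idea, not the paper's exact design), one can give each object two query vectors, one specialized for localization and one for describing, instead of a single shared vector per object:

```python
import torch
import torch.nn as nn

d, num_objects = 256, 128
loc_queries = nn.Parameter(torch.randn(num_objects, d))  # drive box regression
cap_queries = nn.Parameter(torch.randn(num_objects, d))  # drive caption generation
# Query i in both sets refers to the same object, so box i and caption i stay paired
# while each query set specializes in its own level of scene understanding.
```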

An Embodied Generalist Agent in 3D World

embodied-generalist/embodied-generalist 18 Nov 2023

Leveraging massive knowledge and learning schemes from large language models (LLMs), recent machine learning models have shown notable success in building generalist agents capable of general-purpose task solving in diverse domains, including natural language processing, computer vision, and robotics.

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

open3da/ll3da 30 Nov 2023

However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains challenging, especially given the need to understand permutation-invariant point cloud representations of 3D scenes.
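Permutation invariance matters because a point cloud is an unordered set: shuffling the rows must not change the scene representation. A short demonstration of the standard PointNet-style remedy, a symmetric aggregation (here a max over points):

```python
import torch

points = torch.randn(1024, 6)                    # xyz + rgb, in arbitrary order
perm = torch.randperm(points.size(0))            # a random reordering of the points
feat = points.max(dim=0).values                  # symmetric pooling over the set
feat_shuffled = points[perm].max(dim=0).values
assert torch.equal(feat, feat_shuffled)          # identical feature either way
```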

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

jxbbb/tod3cap 28 Mar 2024

However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes.