Dense Captioning
23 papers with code • 1 benchmark • 1 dataset
Most implemented papers
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
GRiT: A Generative Region-to-text Transformer for Object Understanding
Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions.
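To make that three-stage decomposition concrete, here is a minimal PyTorch sketch of a GRiT-style pipeline. The module names, toy dimensions, top-k objectness filter, and single-region memory are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a GRiT-style pipeline:
# visual encoder -> foreground object extractor -> text decoder.
import torch
import torch.nn as nn

class GRiTSketch(nn.Module):
    def __init__(self, dim=256, vocab_size=30522, max_objects=100):
        super().__init__()
        # Visual encoder: any backbone producing a grid of image features
        # (a single patchify conv stands in for the real encoder here).
        self.visual_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Foreground object extractor: scores locations for objectness and
        # regresses a box per kept location (stand-in for the detector head).
        self.objectness_head = nn.Linear(dim, 1)
        self.box_head = nn.Linear(dim, 4)
        # Text decoder: transformer conditioned on per-object region
        # features, emitting open-set descriptions (causal mask omitted).
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)
        self.max_objects = max_objects

    def forward(self, images, caption_tokens):
        # (B, dim, H, W) -> (B, H*W, dim) grid of visual tokens
        feats = self.visual_encoder(images).flatten(2).transpose(1, 2)
        # Keep the top-k most object-like locations as foreground regions.
        scores = self.objectness_head(feats).squeeze(-1)            # (B, H*W)
        topk = scores.topk(self.max_objects, dim=1).indices         # (B, k)
        regions = torch.gather(
            feats, 1, topk.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        boxes = self.box_head(regions)                              # (B, k, 4)
        # Decode one caption per region: flatten objects into the batch.
        B, k, d = regions.shape
        memory = regions.reshape(B * k, 1, d)
        tgt = self.token_embed(caption_tokens)                      # (B*k, T, d)
        logits = self.lm_head(self.text_decoder(tgt, memory))
        return boxes, logits

model = GRiTSketch()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 30522, (2 * 100, 8))   # one sequence per region
boxes, logits = model(images, tokens)
```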
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
The current approaches for 3D visual reasoning are task-specific, and lack pre-training methods to learn generic representations that can transfer across various tasks.
End-to-End 3D Dense Captioning with Vote2Cap-DETR
Compared with prior art, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner.
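A compact sketch of the set-prediction idea, assuming queries are plain learnable embeddings and each query emits its caption in one shot; the actual model refines vote queries from the point cloud and decodes captions autoregressively.

```python
# Hedged sketch in the spirit of Vote2Cap-DETR: a fixed set of learnable
# queries attends to encoded scene features, and every query is decoded in
# parallel into a 3D box and a caption. All names/shapes are assumptions.
import torch
import torch.nn as nn

class SetPredictionCaptioner(nn.Module):
    def __init__(self, dim=256, num_queries=256, vocab_size=3000, max_len=12):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)   # learnable queries
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.object_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 6)               # 3D center + size
        # Toy caption head: predicts all tokens at once per query.
        self.caption_head = nn.Linear(dim, max_len * vocab_size)
        self.max_len, self.vocab_size = max_len, vocab_size

    def forward(self, scene_feats):
        # scene_feats: (B, N, dim) encoded point-cloud tokens
        B = scene_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        hs = self.object_decoder(q, scene_feats)        # (B, Q, dim)
        boxes = self.box_head(hs)                       # (B, Q, 6)
        captions = self.caption_head(hs).view(
            B, -1, self.max_len, self.vocab_size)       # (B, Q, T, V)
        return boxes, captions

boxes, captions = SetPredictionCaptioner()(torch.randn(2, 1024, 256))
```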
IIITD-20K: Dense captioning for Text-Image ReID
IIITD-20K comprises 20,000 unique identities captured in the wild and provides a rich dataset for text-to-image ReID.
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.
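A hedged sketch of that generate-then-filter flow: a pretrained multimodal model proposes QA pairs and dense captions for crawled images, and low-confidence outputs are discarded. The `generate_qa`/`generate_captions` methods, the `confidence` field, and the 0.5 threshold are hypothetical placeholders, not JADE's API.

```python
# Illustrative generate-and-filter loop for automatic dataset construction.
def build_dataset(image_text_pairs, model, threshold=0.5):
    dataset = []
    for image, _text in image_text_pairs:
        # Generate candidate QA pairs and dense captions for each image.
        candidates = model.generate_qa(image) + model.generate_captions(image)
        for sample in candidates:
            # Keep only samples the model itself scores as reliable.
            if sample.confidence >= threshold:
                dataset.append(sample)
    return dataset
```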
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence.
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.
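A hedged illustration of that decoupling: two learnable query sets, aligned one-to-one by index, are decoded against the same scene memory, so localization and description no longer compete for a shared set of queries. Shapes and module names are assumptions for illustration only.

```python
# Sketch of decoupled localization and caption queries.
import torch
import torch.nn as nn

class DecoupledQueries(nn.Module):
    def __init__(self, dim=256, num_queries=256):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        # Separate learnable embeddings per task, paired by index.
        self.loc_queries = nn.Embedding(num_queries, dim)
        self.cap_queries = nn.Embedding(num_queries, dim)
        self.box_head = nn.Linear(dim, 6)

    def forward(self, scene_feats):
        # scene_feats: (B, N, dim) encoded point-cloud tokens
        B = scene_feats.size(0)
        loc = self.loc_queries.weight.unsqueeze(0).expand(B, -1, -1)
        cap = self.cap_queries.weight.unsqueeze(0).expand(B, -1, -1)
        # The i-th caption query describes the object localized by the
        # i-th localization query; both attend to the same scene memory.
        loc_hs = self.decoder(loc, scene_feats)
        cap_hs = self.decoder(cap, scene_feats)
        return self.box_head(loc_hs), cap_hs   # boxes + caption features
```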
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially given the demand for understanding permutation-invariant point cloud representations of the 3D scene.
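Permutation invariance here means that reordering the points of a cloud must not change the scene representation fed to the model. A minimal PointNet-style sketch demonstrates the property via symmetric max pooling; this is an illustration of the concept, not LL3DA's actual encoder.

```python
# Per-point MLP followed by a symmetric (order-independent) max pool.
import torch
import torch.nn as nn

class PermInvariantEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, points):          # points: (B, N, 3)
        feats = self.point_mlp(points)  # per-point features, (B, N, dim)
        # Max over the point axis is symmetric, so any permutation of the
        # N points yields the same pooled scene feature.
        return feats.max(dim=1).values  # (B, dim)

enc = PermInvariantEncoder()
pts = torch.randn(2, 2048, 3)
perm = torch.randperm(2048)
# Shuffling the points leaves the encoding unchanged.
assert torch.allclose(enc(pts), enc(pts[:, perm]))
```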
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding.