Dense Captioning

23 papers with code • 1 benchmark • 1 dataset

Dense captioning localizes regions of interest in an image or 3D scene and generates a natural-language description for each detected region, combining object detection with caption generation.

Most implemented papers

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

heng-hw/spacap3d 22 Apr 2022

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.

GRiT: A Generative Region-to-text Transformer for Object Understanding

JialianW/GRiT 1 Dec 2022

Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions.
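A minimal PyTorch sketch of that three-part layout, with hypothetical module sizes and a toy patchify encoder standing in for GRiT's actual backbone (see JialianW/GRiT for the real code):

```python
import torch
import torch.nn as nn

class GRiTSketch(nn.Module):
    """Toy encoder -> object extractor -> text decoder pipeline."""
    def __init__(self, vocab=1000, dim=256, num_queries=10):
        super().__init__()
        # Visual encoder: patchify the image into a feature grid.
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Foreground object extractor: learned queries attend to the
        # grid and each regresses one box.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.box_head = nn.Linear(dim, 4)                    # (cx, cy, w, h)
        # Text decoder: generates an open-set description per region.
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(2).transpose(1, 2)   # (B, HW, D)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        regions, _ = self.attn(q, feats, feats)                   # (B, Q, D)
        boxes = self.box_head(regions).sigmoid()
        # Condition caption tokens on the region features.
        hidden = self.text_decoder(self.embed(captions), regions)
        return boxes, self.lm_head(hidden)                        # token logits

boxes, logits = GRiTSketch()(torch.randn(2, 3, 224, 224),
                             torch.randint(0, 1000, (2, 12)))
```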

Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training

leolyj/3d-vlp CVPR 2023

Current approaches to 3D visual reasoning are task-specific and lack pre-training methods for learning generic representations that transfer across tasks.

End-to-End 3D Dense Captioning with Vote2Cap-DETR

ch3cook-fdu/vote2cap-detr CVPR 2023

Compared with prior art, our framework has several appealing advantages: 1) without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner.
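The excerpt's key ingredient, the learnable vote query, can be sketched as seed features predicting a spatial offset ("vote") toward an object center and re-embedding the voted position as query content. In the sketch below, random sampling stands in for farthest point sampling and all names and sizes are hypothetical; the real model is in ch3cook-fdu/vote2cap-detr:

```python
import torch
import torch.nn as nn

class VoteQuery(nn.Module):
    def __init__(self, dim=256, num_queries=256):
        super().__init__()
        self.num_queries = num_queries
        self.offset_head = nn.Linear(dim, 3)   # predicted shift toward a center
        self.pos_embed = nn.Linear(3, dim)     # embed voted xyz as query content

    def forward(self, xyz, feats):
        # xyz: (B, N, 3) point positions, feats: (B, N, D) encoder features.
        idx = torch.randperm(xyz.size(1))[: self.num_queries]  # stand-in for FPS
        seed_xyz, seed_feat = xyz[:, idx], feats[:, idx]
        voted_xyz = seed_xyz + self.offset_head(seed_feat)     # the "vote"
        queries = seed_feat + self.pos_embed(voted_xyz)
        return voted_xyz, queries                              # feed to decoder

centers, queries = VoteQuery()(torch.rand(2, 1024, 3), torch.randn(2, 1024, 256))
```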

IIITD-20K: Dense captioning for Text-Image ReID

Visual-Conception-Group/Dense-Captioning-for-Text-Image-ReID 8 May 2023

IIITD-20K comprises 20,000 unique identities captured in the wild and provides a rich dataset for text-to-image ReID.

Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

johncaged/OPT_Questioner 19 May 2023

In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.
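A hedged sketch of such a generate-then-filter loop; `generate_qa` and `score` are hypothetical stand-ins for the pre-trained multimodal model, not functions from johncaged/OPT_Questioner:

```python
from typing import Callable, List, Tuple

def build_dataset(
    pairs: List[Tuple[str, str]],                 # (image_path, alt_text)
    generate_qa: Callable[[str], Tuple[str, str]],
    score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[dict]:
    kept = []
    for image, text in pairs:
        question, answer = generate_qa(image)     # model-generated QA pair
        # Keep only samples the scoring model judges consistent with the
        # image; this is the "filter" half of generate-and-filter.
        if score(image, f"{question} {answer}") >= threshold:
            kept.append({"image": image, "question": question,
                         "answer": answer, "caption": text})
    return kept
```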

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

3d-vista/3D-VisTA ICCV 2023

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence.

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

ch3cook-fdu/vote2cap-detr 6 Sep 2023

Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.
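One way to act on that argument is to give localization and captioning their own query sets and decoder branches. The sketch below is a hypothetical PyTorch rendering of the decoupling idea, not Vote2Cap-DETR++ itself:

```python
import torch
import torch.nn as nn

class DecoupledDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=256):
        super().__init__()
        # Separate query sets: one trained for geometry, one for language.
        self.loc_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cap_queries = nn.Parameter(torch.randn(num_queries, dim))
        # nn.TransformerDecoder clones the layer, so the two branches
        # below do not share weights.
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.loc_branch = nn.TransformerDecoder(layer, num_layers=2)
        self.cap_branch = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 6)        # 3D box: center + size

    def forward(self, scene_feats):              # (B, N, D) scene tokens
        b = scene_feats.size(0)
        lq = self.loc_queries.unsqueeze(0).expand(b, -1, -1)
        cq = self.cap_queries.unsqueeze(0).expand(b, -1, -1)
        boxes = self.box_head(self.loc_branch(lq, scene_feats)).sigmoid()
        cap_ctx = self.cap_branch(cq, scene_feats)  # conditions a caption head
        return boxes, cap_ctx

boxes, cap_ctx = DecoupledDecoder()(torch.randn(2, 1024, 256))
```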

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

open3da/ll3da 30 Nov 2023

However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially given the demand for understanding permutation-invariant point cloud representations of 3D scenes.
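Permutation invariance is typically obtained by pooling per-point features with a symmetric function, as in PointNet. A minimal sketch of that property (hypothetical sizes; not LL3DA's actual scene encoder):

```python
import torch
import torch.nn as nn

class PermInvariantEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                 nn.Linear(128, dim))

    def forward(self, xyz):                 # (B, N, 3) point coordinates
        per_point = self.mlp(xyz)           # (B, N, D) pointwise features
        # Max over points is symmetric, so shuffling the input order
        # leaves the scene embedding unchanged.
        return per_point.max(dim=1).values  # (B, D)

enc = PermInvariantEncoder()
pts = torch.rand(1, 1024, 3)
shuffled = pts[:, torch.randperm(1024)]
assert torch.allclose(enc(pts), enc(shuffled), atol=1e-6)
```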

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

renshuhuai-andy/timechat 4 Dec 2023

This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding.