Dense Captioning
23 papers with code • 1 benchmark • 1 dataset
Most implemented papers
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.
GRiT: A Generative Region-to-text Transformer for Object Understanding
Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions.
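To make that three-stage decomposition concrete, here is a minimal PyTorch sketch of a GRiT-style pipeline. The module names, toy dimensions, top-k objectness filter, and single-region memory are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a GRiT-style pipeline:
# visual encoder -> foreground object extractor -> text decoder.
import torch
import torch.nn as nn

class GRiTSketch(nn.Module):
    def __init__(self, dim=256, vocab_size=30522, max_objects=100):
        super().__init__()
        # Visual encoder: any backbone producing a grid of image features
        # (a single patchify conv stands in for the real encoder here).
        self.visual_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Foreground object extractor: scores locations for objectness and
        # regresses a box per kept location (stand-in for the detector head).
        self.objectness_head = nn.Linear(dim, 1)
        self.box_head = nn.Linear(dim, 4)
        # Text decoder: transformer conditioned on per-object region
        # features, emitting open-set descriptions (causal mask omitted).
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)
        self.max_objects = max_objects

    def forward(self, images, caption_tokens):
        # (B, dim, H, W) -> (B, H*W, dim) grid of visual tokens
        feats = self.visual_encoder(images).flatten(2).transpose(1, 2)
        # Keep the top-k most object-like locations as foreground regions.
        scores = self.objectness_head(feats).squeeze(-1)            # (B, H*W)
        topk = scores.topk(self.max_objects, dim=1).indices         # (B, k)
        regions = torch.gather(
            feats, 1, topk.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        boxes = self.box_head(regions)                              # (B, k, 4)
        # Decode one caption per region: flatten objects into the batch.
        B, k, d = regions.shape
        memory = regions.reshape(B * k, 1, d)
        tgt = self.token_embed(caption_tokens)                      # (B*k, T, d)
        logits = self.lm_head(self.text_decoder(tgt, memory))
        return boxes, logits

model = GRiTSketch()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 30522, (2 * 100, 8))   # one sequence per region
boxes, logits = model(images, tokens)
```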
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
The current approaches for 3D visual reasoning are task-specific, and lack pre-training methods to learn generic representations that can transfer across various tasks.
End-to-End 3D Dense Captioning with Vote2Cap-DETR
Compared with prior art, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner.
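A compact sketch of the set-prediction idea, assuming queries are plain learnable embeddings and each query emits its caption in one shot; the actual model refines vote queries from the point cloud and decodes captions autoregressively.

```python
# Hedged sketch in the spirit of Vote2Cap-DETR: a fixed set of learnable
# queries attends to encoded scene features, and every query is decoded in
# parallel into a 3D box and a caption. All names/shapes are assumptions.
import torch
import torch.nn as nn

class SetPredictionCaptioner(nn.Module):
    def __init__(self, dim=256, num_queries=256, vocab_size=3000, max_len=12):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)   # learnable queries
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.object_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 6)               # 3D center + size
        # Toy caption head: predicts all tokens at once per query.
        self.caption_head = nn.Linear(dim, max_len * vocab_size)
        self.max_len, self.vocab_size = max_len, vocab_size

    def forward(self, scene_feats):
        # scene_feats: (B, N, dim) encoded point-cloud tokens
        B = scene_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        hs = self.object_decoder(q, scene_feats)        # (B, Q, dim)
        boxes = self.box_head(hs)                       # (B, Q, 6)
        captions = self.caption_head(hs).view(
            B, -1, self.max_len, self.vocab_size)       # (B, Q, T, V)
        return boxes, captions

boxes, captions = SetPredictionCaptioner()(torch.randn(2, 1024, 256))
```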
IIITD-20K: Dense captioning for Text-Image ReID
IIITD-20K comprises 20,000 unique identities captured in the wild and provides a rich dataset for text-to-image ReID.
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.
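A hedged sketch of that generate-then-filter flow: a pretrained multimodal model proposes QA pairs and dense captions for crawled images, and low-confidence outputs are discarded. The `generate_qa`/`generate_captions` methods, the `confidence` field, and the 0.5 threshold are hypothetical placeholders, not JADE's API.

```python
# Illustrative generate-and-filter loop for automatic dataset construction.
def build_dataset(image_text_pairs, model, threshold=0.5):
    dataset = []
    for image, _text in image_text_pairs:
        # Generate candidate QA pairs and dense captions for each image.
        candidates = model.generate_qa(image) + model.generate_captions(image)
        for sample in candidates:
            # Keep only samples the model itself scores as reliable.
            if sample.confidence >= threshold:
                dataset.append(sample)
    return dataset
```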
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence.
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.
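A hedged illustration of that decoupling: two learnable query sets, aligned one-to-one by index, are decoded against the same scene memory, so localization and description no longer compete for a shared set of queries. Shapes and module names are assumptions for illustration only.

```python
# Sketch of decoupled localization and caption queries.
import torch
import torch.nn as nn

class DecoupledQueries(nn.Module):
    def __init__(self, dim=256, num_queries=256):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        # Separate learnable embeddings per task, paired by index.
        self.loc_queries = nn.Embedding(num_queries, dim)
        self.cap_queries = nn.Embedding(num_queries, dim)
        self.box_head = nn.Linear(dim, 6)

    def forward(self, scene_feats):
        # scene_feats: (B, N, dim) encoded point-cloud tokens
        B = scene_feats.size(0)
        loc = self.loc_queries.weight.unsqueeze(0).expand(B, -1, -1)
        cap = self.cap_queries.weight.unsqueeze(0).expand(B, -1, -1)
        # The i-th caption query describes the object localized by the
        # i-th localization query; both attend to the same scene memory.
        loc_hs = self.decoder(loc, scene_feats)
        cap_hs = self.decoder(cap, scene_feats)
        return self.box_head(loc_hs), cap_hs   # boxes + caption features
```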
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially given the demand for understanding permutation-invariant point cloud representations of the 3D scene.
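Permutation invariance here means that reordering the points of a cloud must not change the scene representation fed to the model. A minimal PointNet-style sketch demonstrates the property via symmetric max pooling; this is an illustration of the concept, not LL3DA's actual encoder.

```python
# Per-point MLP followed by a symmetric (order-independent) max pool.
import torch
import torch.nn as nn

class PermInvariantEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, points):          # points: (B, N, 3)
        feats = self.point_mlp(points)  # per-point features, (B, N, dim)
        # Max over the point axis is symmetric, so any permutation of the
        # N points yields the same pooled scene feature.
        return feats.max(dim=1).values  # (B, dim)

enc = PermInvariantEncoder()
pts = torch.randn(2, 2048, 3)
perm = torch.randperm(2048)
# Shuffling the points leaves the encoding unchanged.
assert torch.allclose(enc(pts), enc(pts[:, perm]))
```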
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding.