Dense Video Captioning

25 papers with code • 4 benchmarks • 7 datasets

Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
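The task definition above can be made concrete with a minimal sketch of a dense-captioning output: a list of temporally localized events, each a (start, end, caption) triple, where events may overlap. The `Event` class and the example predictions are illustrative assumptions, not from any specific paper or dataset.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One localized event in a dense video captioning output."""
    start: float   # event start time, in seconds
    end: float     # event end time, in seconds
    caption: str   # natural-language description of the event

def events_at(events, t):
    """Return all events active at time t; dense captioning allows overlaps."""
    return [e for e in events if e.start <= t <= e.end]

# Hypothetical predictions for the "man playing a piano" example above
predictions = [
    Event(0.0, 45.0, "a man is playing a piano"),
    Event(10.0, 30.0, "another man is dancing"),
    Event(28.0, 45.0, "a crowd is clapping"),
]
```

At `t = 29.0` all three hypothetical events overlap, which is exactly what distinguishes dense captioning from single-sentence video captioning.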

Most implemented papers

OmniVid: A Generative Framework for Universal Video Understanding

wangjk666/omnivid 26 Mar 2024

The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.

Streaming Dense Video Captioning

google-research/scenic 1 Apr 2024

An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.
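The streaming requirement stated above can be sketched as a generator that emits a caption after each chunk of frames rather than waiting for the whole video; this is an illustration of the property, not the paper's actual architecture, and `caption_fn` is a hypothetical stand-in for a captioning model.

```python
def stream_captions(frame_chunks, caption_fn):
    """Yield (chunk_index, caption) after each chunk, before the video ends.

    frame_chunks: an iterable of video chunks (any representation).
    caption_fn:   a hypothetical model that captions the frames seen so far.
    """
    seen = []  # running memory of all chunks processed so far
    for i, chunk in enumerate(frame_chunks):
        seen.append(chunk)
        # Emit an output immediately, without access to future chunks.
        yield i, caption_fn(seen)
```

A non-streaming model would be the degenerate case that yields only once, after the final chunk.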

Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis

ucf-sst-lab/aicity2024cvprw 12 Apr 2024

Our solution mainly focuses on the following points: 1) To solve dense video captioning, we leverage the dense video captioning with parallel decoding (PDVC) framework to model visual-language sequences and generate dense captions, chapter by chapter, for the video.

TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning

quangminhdinh/trafficvlm 14 Apr 2024

Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems.