Dense Video Captioning

16 papers with code • 2 benchmarks • 5 datasets

Most natural videos contain numerous events. For example, a video of a "man playing a piano" might also contain "another man dancing" or "a crowd clapping". The task of dense video captioning involves both detecting and describing the events in a video.
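Concretely, a dense video captioning system must output a set of temporally localized, individually captioned events. A minimal sketch of that output structure (names and values are illustrative, not from any specific model):

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One localized, captioned event in a video (times in seconds)."""
    start: float
    end: float
    caption: str

# Hypothetical output for the piano example above: the model must both
# localize each event in time and generate a description for it.
events = [
    Event(0.0, 45.0, "a man is playing the piano"),
    Event(12.0, 30.0, "another man starts dancing"),
    Event(38.0, 45.0, "the crowd claps"),
]

# Unlike single-sentence video captioning, events may overlap in time.
overlapping = [e for e in events if e is not events[0] and e.start < events[0].end]
```

Note that the second and third events overlap the first, which is what distinguishes this task from non-overlapping temporal segmentation.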

Most implemented papers

Multi-modal Dense Video Captioning

v-iashin/MDVC 17 Mar 2020

We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track.
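The idea of treating ASR output as a third, temporally aligned stream can be sketched as follows. This is an illustrative slicing helper under assumed feature shapes, not the MDVC implementation:

```python
import numpy as np

def gather_modalities(visual, audio, asr_tokens, start, end, fps=1.0):
    """Slice three temporally aligned streams for one event proposal.

    visual, audio: (T, D) feature arrays sampled at `fps` features/second.
    asr_tokens: list of (t_start, t_end, word) tuples from an ASR system.
    All names and shapes are hypothetical, chosen for illustration.
    """
    lo, hi = int(start * fps), int(end * fps)
    v = visual[lo:hi]
    a = audio[lo:hi]
    # Keep only speech that overlaps the proposal window (like subtitles).
    text = [w for (ts, te, w) in asr_tokens if ts < end and te > start]
    return v, a, text

rng = np.random.default_rng(0)
v, a, text = gather_modalities(
    rng.normal(size=(60, 128)),   # 60 s of visual features
    rng.normal(size=(60, 64)),    # 60 s of audio features
    [(2.0, 3.0, "hello"), (50.0, 51.0, "bye")],
    start=0.0, end=10.0,
)
```

Each proposal thus gets a visual clip, an audio clip, and the transcribed speech that falls inside its window, which a captioning decoder can then attend over jointly.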

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

v-iashin/BMT 17 May 2020

We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.
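The "any two modalities" claim rests on cross-modal attention being agnostic to what its two input sequences represent. A minimal NumPy sketch of scaled dot-product cross-attention (a generic building block, not the paper's exact bi-modal module):

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Scaled dot-product attention: one sequence attends to another.

    The function does not care whether the inputs are audio, visual,
    or any other modality, as long as feature dimensions match.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    # Row-wise softmax over the attended sequence.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

rng = np.random.default_rng(1)
visual = rng.normal(size=(20, 32))  # 20 visual feature steps
audio = rng.normal(size=(50, 32))   # 50 audio feature steps

# Each stream is enriched with context from the other, and the
# sequence lengths of the two modalities need not match.
visual_attended = cross_attention(visual, audio)
audio_attended = cross_attention(audio, visual)
```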

Towards Automatic Learning of Procedures from Web Instructional Videos

LuoweiZhou/ProcNets-YouCook2 28 Mar 2017

To answer this question, we introduce the problem of procedure segmentation: segmenting a video procedure into category-independent procedure segments.

Joint Event Detection and Description in Continuous Video Streams

VisionLearningGroup/JEDDi-Net 28 Feb 2018

In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.

Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

JaywongWang/DenseVideoCaptioning CVPR 2018

We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions.
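One simple way to combine past and future contexts is to score each candidate proposal twice, once from a forward pass over the video and once from a backward pass, and fuse the two confidences. The fusion below (a geometric mean) is an illustrative sketch, not necessarily the paper's exact scheme:

```python
import numpy as np

def fuse_bidirectional_scores(forward, backward):
    """Fuse proposal confidences from a forward pass (past context)
    and a backward pass (future context).

    Geometric-mean fusion is one simple choice: a proposal must look
    plausible from both temporal directions to keep a high score.
    """
    return np.sqrt(forward * backward)

fwd = np.array([0.9, 0.2, 0.6])   # hypothetical forward-pass confidences
bwd = np.array([0.8, 0.9, 0.5])   # hypothetical backward-pass confidences
fused = fuse_bidirectional_scores(fwd, bwd)
```

The second proposal illustrates the point of bidirectionality: it looks weak from the past alone (0.2) but strong given future context (0.9), and fusion moderates between the two views.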

End-to-End Dense Video Captioning with Masked Transformer

salesforce/densecap CVPR 2018

To address this problem, we propose an end-to-end transformer model for dense video captioning.

Streamlined Dense Video Captioning

ttengwang/ESGN CVPR 2019

Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning of individual events.

Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020

ttengwang/dense-video-captioning-pytorch 21 Jun 2020

This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.

Multimodal Pretraining for Dense Video Captioning

google-research-datasets/Video-Timeline-Tags-ViTT Asian Chapter of the Association for Computational Linguistics 2020

First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations.

iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

amanchadha/iPerceive 16 Nov 2020

Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and "where" (e.g., event localization), which in some cases fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention.