We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions.
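The core of a bidirectional proposal scheme can be sketched as fusing two confidence scores per time step: one from a forward pass over past context and one from a backward pass over future context. A minimal toy sketch (hypothetical function names and multiplicative fusion as one plausible choice, not the paper's exact formulation):

```python
# Toy sketch of bidirectional proposal scoring: combine per-step confidences
# from a forward pass (past context) and a backward pass (future context).
# Names and the multiplicative fusion rule are illustrative assumptions.

def fuse_proposals(forward_scores, backward_scores):
    """Fuse forward and backward proposal confidences per time step.

    forward_scores:  confidences in temporal order (t = 0 .. T-1)
    backward_scores: confidences produced in reverse temporal order
    """
    # Align the backward pass to forward time order, then fuse elementwise.
    backward_aligned = backward_scores[::-1]
    return [f * b for f, b in zip(forward_scores, backward_aligned)]

# A step scores highly only if both past and future context support it.
fused = fuse_proposals([0.9, 0.2], [0.1, 0.8])
```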
To address this problem, we propose an end-to-end transformer model for dense video captioning.
Ranked #3 on Video Captioning on YouCook2
We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.
We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track.
Ranked #4 on Dense Video Captioning on ActivityNet Captions
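Treating ASR output as a third, temporally aligned stream amounts to keeping per-modality features on a shared timeline. A minimal sketch (hypothetical structures, not the paper's code) of merging the three streams into one time-ordered sequence that downstream per-modality encoders can consume:

```python
# Toy sketch: video frames, audio features, and ASR tokens each carry
# timestamps; merging them yields a single temporally aligned event stream.
# Structure and names are illustrative assumptions.

def merge_streams(video, audio, asr):
    """Merge per-modality (timestamp, feature) lists into one
    time-ordered list of (timestamp, modality, feature) tuples."""
    events = [(t, "video", f) for t, f in video]
    events += [(t, "audio", f) for t, f in audio]
    events += [(t, "asr", f) for t, f in asr]
    return sorted(events, key=lambda e: e[0])

# Toy example: two video frames, one audio chunk, one ASR token.
merged = merge_streams(
    video=[(0.0, "frame0"), (1.0, "frame1")],
    audio=[(0.5, "mfcc0")],
    asr=[(0.7, "hello")],
)
```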
Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and "where" (e.g., event localization), which in some cases fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention.
Ranked #1 on Video Question Answering on TVQA
This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Ranked #1 on Dense Video Captioning on ActivityNet Captions
To answer this question, we introduce the problem of procedure segmentation: segmenting a video procedure into category-independent procedure segments.
In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context.
First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations.
Ranked #1 on Dense Video Captioning on YouCook2 (using extra training data)
This paper proposes a new evaluation framework, Story Oriented Dense video cAptioning evaluation framework (SODA), for measuring the performance of video story description systems.
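The central mechanism in a story-oriented evaluation of this kind is matching generated and reference captions one-to-one while preserving temporal order, then scoring the matching. A simplified sketch (hypothetical word-overlap similarity in place of the learned or METEOR-based similarity a real framework would use) via LCS-style dynamic programming:

```python
# Toy sketch of temporally ordered caption matching: dynamic programming
# finds the order-preserving one-to-one matching that maximizes total
# pairwise similarity, then an F-score is computed from that total.
# The Jaccard word-overlap similarity here is an illustrative assumption.

def word_overlap(a, b):
    """Jaccard similarity over word sets, in [0, 1]."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def ordered_match_score(generated, reference):
    n, m = len(generated), len(reference)
    # dp[i][j]: best total similarity matching the first i generated
    # captions against the first j reference captions, in order.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = word_overlap(generated[i - 1], reference[j - 1])
            dp[i][j] = max(dp[i - 1][j],          # skip generated caption
                           dp[i][j - 1],          # skip reference caption
                           dp[i - 1][j - 1] + sim)  # match the pair
    total = dp[n][m]
    precision = total / max(n, 1)
    recall = total / max(m, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```

Normalizing by both caption counts penalizes systems that over- or under-generate events, which per-caption metrics alone do not capture.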