Video captioning is the task of automatically generating captions for a video by understanding the actions and events in it, which also enables efficient text-based retrieval of the video.
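As a minimal illustration of how captions enable text-based retrieval, the sketch below indexes videos by already generated captions and ranks them by token overlap with a text query; the video IDs and captions are hypothetical placeholders, not from any dataset.

```python
# Minimal sketch: retrieving videos through their generated captions.
# Video IDs and captions below are hypothetical placeholders.

def tokenize(text):
    """Lowercase and split a caption or query into a set of word tokens."""
    return set(text.lower().split())

def retrieve(query, captions, top_k=2):
    """Rank videos by Jaccard overlap between query and caption tokens."""
    q = tokenize(query)
    scores = {}
    for video_id, caption in captions.items():
        c = tokenize(caption)
        scores[video_id] = len(q & c) / len(q | c) if q | c else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

captions = {
    "vid_001": "a man is slicing vegetables in a kitchen",
    "vid_002": "a dog is catching a frisbee in a park",
    "vid_003": "a woman is playing guitar on stage",
}
print(retrieve("dog playing in the park", captions))
```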
Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval (a schematic sketch of such an embedding follows below).
Ranked #2 on Video Retrieval on MSVD (using extra training data)
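A minimal sketch of such a joint embedding, with randomly initialized linear projections standing in for trained text and video encoders; all shapes and names are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for trained encoders: linear projections into a shared space.
# In a real system these would be learned text and video networks.
D_TEXT, D_VIDEO, D_JOINT = 300, 2048, 256
W_text = rng.normal(size=(D_TEXT, D_JOINT))
W_video = rng.normal(size=(D_VIDEO, D_JOINT))

def embed(features, W):
    """Project features into the joint space and L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# One text query against a gallery of five videos (placeholder features).
query = embed(rng.normal(size=(1, D_TEXT)), W_text)
gallery = embed(rng.normal(size=(5, D_VIDEO)), W_video)

# Cosine similarity in the joint space; higher means a better match.
scores = (query @ gallery.T).ravel()
ranking = np.argsort(-scores)
print("retrieval order:", ranking, "scores:", np.round(scores[ranking], 3))
```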
These two tasks are substantially more complex than predicting or retrieving a single sentence from an image.
We cleaned the MSR-VTT annotations by removing these problems and then tested several typical video captioning models on the cleaned dataset.
This paper presents a video caption generation network, the Semantic Grouping Network (SGN), which attempts (1) to group video frames according to discriminative word phrases from the partially decoded caption and then (2) to decode those semantically aligned groups to predict the next word.
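A rough sketch of the grouping idea, soft-assigning frame features to phrase embeddings and pooling each group; the dimensions and the soft-assignment form are assumptions for illustration, not SGN's exact mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Placeholder features: T frames and P phrases from the partially decoded
# caption, already projected to a shared dimension D (illustrative shapes).
T, P, D = 8, 3, 64
frames = rng.normal(size=(T, D))
phrases = rng.normal(size=(P, D))

# Soft-assign each frame to the phrases it is similar to, then pool the
# frames within each group into one semantically aligned group feature.
assign = softmax(frames @ phrases.T, axis=1)         # (T, P)
groups = assign.T @ frames / assign.sum(0)[:, None]  # (P, D)

# A decoder would attend over `groups` when predicting the next word.
print("group features:", groups.shape)
```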
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning.
In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and the "where" (e.g., event localization), which in some cases fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention.
Ranked #1 on Video Question Answering on TVQA
First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations (a hypothetical annotation record is sketched below).
Ranked #1 on Dense Video Captioning on YouCook2 (using extra training data)
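The record below is a hypothetical illustration of what a time-stamped timeline annotation for one instructional video might look like; the field names are invented and this is not the actual ViTT schema.

```python
# Hypothetical time-stamped timeline annotation for one instructional
# video (field names invented for illustration).
vitt_example = {
    "video_id": "example_video",
    "timeline": [
        {"timestamp_sec": 0.0,  "tag": "introduction"},
        {"timestamp_sec": 12.5, "tag": "gather ingredients"},
        {"timestamp_sec": 48.0, "tag": "mix the batter"},
        {"timestamp_sec": 95.0, "tag": "bake and serve"},
    ],
}
for entry in vitt_example["timeline"]:
    print(f'{entry["timestamp_sec"]:>6.1f}s  {entry["tag"]}')
```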
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics (see the sketch below).
Ranked #1 on Video Captioning on ActivityNet Captions
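A schematic sketch of matching at three granularities by mean-pooling the finer level into the coarser one; the features are random placeholders and the pooling scheme is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Finest level: frame and word features in a shared space (placeholders).
D = 32
frames = rng.normal(size=(12, D))   # 12 frames
words = rng.normal(size=(20, D))    # 20 words

# Coarser levels obtained by mean-pooling the finer level: frames -> clips
# -> video, and words -> sentences -> paragraph (splits are illustrative).
clips = frames.reshape(3, 4, D).mean(axis=1)      # 3 clips of 4 frames
sentences = words.reshape(4, 5, D).mean(axis=1)   # 4 sentences of 5 words
video = frames.mean(axis=0)
paragraph = words.mean(axis=0)

print("frame-word:     ", round(cos(frames[0], words[0]), 3))
print("clip-sentence:  ", round(cos(clips[0], sentences[0]), 3))
print("video-paragraph:", round(cos(video, paragraph), 3))
```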
We propose to use normalized cross-correlation (NCC) and the sum of absolute differences (SAD) to compute pairwise appearance similarity and to build an actor relationship graph, allowing a graph convolutional network to learn to classify group activities.
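For reference, both similarity measures can be computed directly from two aligned actor patches; the sketch below uses the standard zero-mean form of NCC, and the patch shapes are illustrative assumptions.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences; lower means more similar."""
    return float(np.abs(a - b).sum())

def ncc(a, b):
    """Zero-mean normalized cross-correlation in [-1, 1]; higher is more similar."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
patch_a = rng.random((16, 16))                    # one actor's appearance patch
patch_b = patch_a + 0.05 * rng.random((16, 16))   # a slightly perturbed copy

# Similar patches yield high NCC and low SAD. Edge weights of the actor
# relationship graph could be built from such pairwise scores.
print("NCC:", round(ncc(patch_a, patch_b), 3))
print("SAD:", round(sad(patch_a, patch_b), 3))
```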