153 papers with code • 11 benchmarks • 31 datasets
Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains, which also enables efficient retrieval of the video through text.
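Most systems frame this as conditional sequence generation: encode the frames, then decode a sentence. Below is a minimal PyTorch sketch of that encoder-decoder formulation; every module, dimension, and name here (`VideoCaptioner`, `feat_dim`, etc.) is an illustrative assumption rather than any particular paper's model.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Hypothetical encoder-decoder captioner, for exposition only."""
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # summarize frame features
        self.embed = nn.Embedding(vocab_size, hidden)              # word embeddings
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)    # generate caption tokens
        self.out = nn.Linear(hidden, vocab_size)                   # per-step vocabulary logits

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim) pre-extracted per-frame features (e.g., from a CNN)
        _, h = self.encoder(frame_feats)    # video summary becomes the decoder's initial state
        dec_out, _ = self.decoder(self.embed(captions), h)
        return self.out(dec_out)

# Toy usage: 2 videos of 8 frames each, captions of 5 tokens
model = VideoCaptioner()
logits = model(torch.randn(2, 8, 2048), torch.randint(0, 10000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 10000])
```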
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.
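As a concrete illustration of why a joint embedding makes text-to-video retrieval efficient: once videos are embedded and indexed offline, answering a query reduces to one projection plus a matrix multiply. The sketch below assumes hypothetical projection layers and feature dimensions.

```python
import torch
import torch.nn.functional as F

# Hypothetical projections into a shared embedding space (dimensions are illustrative).
video_proj = torch.nn.Linear(2048, 256)  # maps video features -> joint space
text_proj = torch.nn.Linear(768, 256)    # maps text features  -> joint space

video_feats = torch.randn(100, 2048)     # 100 videos, features precomputed offline
query_feats = torch.randn(1, 768)        # one text query

# Embed and L2-normalize so the dot product equals cosine similarity.
v = F.normalize(video_proj(video_feats), dim=-1)
q = F.normalize(text_proj(query_feats), dim=-1)

# Retrieval is a single matrix multiply over the precomputed video index.
scores = q @ v.t()                       # (1, 100) similarity scores
top5 = scores.topk(5, dim=-1).indices    # indices of the 5 best-matching videos
print(top5)
```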
Can performance on the task of action quality assessment (AQA) be improved by exploiting a description of the action and its quality?
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.
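A common instantiation of this idea is a symmetric InfoNCE loss over a batch of matched video-text pairs, as in CLIP-style training. The sketch below is a generic version; the temperature value and embedding sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched video-text pairs
    (a common formulation; the temperature value is an assumption)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature           # (B, B) pairwise similarities
    targets = torch.arange(len(logits))        # the i-th video matches the i-th text
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return (loss_v2t + loss_t2v) / 2

loss = clip_style_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```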
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.
In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
Unlike previous video captioning work, which mainly exploits cues from the video content to produce a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture that leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning.
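To make the two flows concrete, here is a hedged PyTorch sketch of the kind of training objective such an encoder-decoder-reconstructor implies: a forward captioning loss plus a backward loss that asks the decoder's hidden states to reconstruct a pooled video feature. The reconstructor design, pooling, and loss weighting are illustrative assumptions, not RecNet's actual details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical components and sizes; real RecNet details differ.
decoder_hidden, feat_dim, vocab = 512, 2048, 10000
reconstructor = nn.GRU(decoder_hidden, feat_dim, batch_first=True)  # sentence -> video flow

def dual_flow_loss(caption_logits, caption_targets, decoder_states, frame_feats, lam=0.2):
    # Forward flow: standard cross-entropy captioning loss (video -> sentence).
    cap_loss = F.cross_entropy(caption_logits.reshape(-1, vocab), caption_targets.reshape(-1))
    # Backward flow: reconstruct a global video feature from decoder hidden states.
    rec_out, _ = reconstructor(decoder_states)                  # (B, L, feat_dim)
    rec_global = rec_out.mean(dim=1)                            # pooled reconstruction
    rec_loss = F.mse_loss(rec_global, frame_feats.mean(dim=1))  # match pooled video feature
    return cap_loss + lam * rec_loss

loss = dual_flow_loss(torch.randn(2, 5, vocab), torch.randint(0, vocab, (2, 5)),
                      torch.randn(2, 5, decoder_hidden), torch.randn(2, 8, feat_dim))
```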
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.