Video Captioning
154 papers with code • 11 benchmarks • 31 datasets
Video Captioning is the task of automatically captioning a video by understanding the actions and events it contains, which enables efficient text-based retrieval of the video.
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
Libraries
Use these libraries to find Video Captioning models and implementations.
Most implemented papers
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
We present HERO, a novel framework for large-scale video+language omni-representation learning.
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
Delving Deeper into Convolutional Networks for Learning Video Representations
We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts extracted from all levels of a deep convolutional network trained on the large ImageNet dataset.
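The GRU-over-percepts idea above can be illustrated with a minimal sketch: a recurrent cell folds a sequence of per-frame CNN features into a single video state via its update gate, reset gate, and candidate state. This is a single-unit toy (real models use weight matrices over feature vectors), and all weight names here are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, W):
    # Single-unit GRU step for illustration; real models use
    # matrix-valued weights over vector states and features.
    z = sigmoid(W["zx"] * x + W["zh"] * h)          # update gate
    r = sigmoid(W["rx"] * x + W["rh"] * h)          # reset gate
    c = math.tanh(W["cx"] * x + W["ch"] * (r * h))  # candidate state
    return (1 - z) * h + z * c                      # interpolate old/new state

def encode_video(percepts, W):
    # Fold a sequence of per-frame features ("percepts") into one state.
    h = 0.0
    for x in percepts:
        h = gru_step(h, x, W)
    return h
```

Because the new state is an interpolation between the old state and a tanh-bounded candidate, the encoded state stays in (-1, 1) regardless of sequence length, which is one reason gated units train more stably than plain RNNs on long frame sequences.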
Video captioning with recurrent networks based on frame- and video-level features and visual content classification
In this paper, we describe the system for generating textual descriptions of short video clips using recurrent neural networks (RNNs), which we used while participating in the Large Scale Movie Description Challenge 2015 at ICCV 2015.
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context.
Learning to Generate Grounded Visual Captions without Localization Supervision
When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is, whether the model uses the correct image regions to output particular words or whether it is hallucinating based on priors in the dataset and/or the language model.
OmniNet: A unified architecture for multi-modal multi-task learning
We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.
A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling
Given the features of a video, recurrent neural networks can be used to automatically generate a caption for the video.
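The recurrent caption generation described above can be sketched as a greedy decoding loop: the decoder state is initialized from the video features, and at each step the highest-scoring word is emitted until an end-of-sentence token appears. The `toy_step` transition and its tiny vocabulary are hypothetical stand-ins for a trained RNN, included only to exercise the loop.

```python
def greedy_caption(video_feat, step_fn, max_len=10):
    # Greedy decoding: initialize the recurrent state from the video
    # feature, then repeatedly pick the highest-scoring word until
    # "<eos>" or max_len.
    h = video_feat
    words = []
    prev = "<bos>"
    for _ in range(max_len):
        h, scores = step_fn(h, prev)        # scores: word -> float
        prev = max(scores, key=scores.get)  # greedy choice
        if prev == "<eos>":
            break
        words.append(prev)
    return " ".join(words)

def toy_step(h, prev_word):
    # Hypothetical transition: hand-crafted scores that walk through a
    # fixed caption, standing in for a trained RNN step.
    order = ["a", "man", "plays", "guitar", "<eos>"]
    idx = min(int(h), len(order) - 1)
    scores = {w: 0.0 for w in order}
    scores[order[idx]] = 1.0
    return h + 1, scores
```

For example, `greedy_caption(0, toy_step)` returns `"a man plays guitar"`. Training tricks such as scheduled sampling (the subject of this paper) change what `prev` is fed back during training, mixing ground-truth and model-generated words, but leave this inference loop unchanged.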
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks.
Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene.