Video Captioning

154 papers with code • 11 benchmarks • 31 datasets

Video Captioning is the task of automatically generating a natural-language caption for a video by understanding the actions and events it contains, which also enables efficient text-based retrieval of the video.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020
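
At a high level, most systems on this page follow an encoder-decoder recipe: per-frame features from a pretrained visual backbone are aggregated into a video representation, and a language decoder generates the caption token by token. The sketch below illustrates that recipe; the module names, dimensions, and mean-pooling choice are illustrative assumptions, not the implementation of any specific paper listed here.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Illustrative encoder-decoder captioner: pooled frame features condition an LSTM decoder."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)      # project CNN frame features
        self.embed = nn.Embedding(vocab_size, embed_dim)    # caption token embeddings
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)        # per-step vocabulary logits

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, seq_len) token ids
        video_vec = self.encode(frame_feats.mean(dim=1))    # mean-pool over frames
        h0 = video_vec.unsqueeze(0)                         # initialize decoder state with the video
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(captions), (h0, c0))
        return self.out(dec_out)                            # (batch, seq_len, vocab_size)

# Toy usage: features for an 8-frame video and a 12-token caption prefix.
model = VideoCaptioner()
logits = model(torch.randn(2, 8, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```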

Most implemented papers

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

linjieli222/HERO EMNLP 2020

We present HERO, a novel framework for large-scale video+language omni-representation learning.

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

google-research/scenic CVPR 2023

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
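
Dense event captioning additionally requires temporal localization. One way to keep the output a single token sequence, roughly the formulation Vid2Seq builds on, is to quantize event start and end times into discrete "time tokens" interleaved with the caption text. The helper below is a minimal sketch of that idea; the bin count and token format are assumptions, not the paper's exact vocabulary.

```python
# Sketch of the "time token" idea for dense captioning: continuous start/end times
# are quantized into a small discrete vocabulary so timestamps and words can share
# one output sequence. The number of bins (100) is an illustrative assumption.
def to_time_token(seconds, video_duration, num_time_bins=100):
    bin_id = min(int(seconds / video_duration * num_time_bins), num_time_bins - 1)
    return f"<time_{bin_id}>"

events = [(4.2, 9.7, "a man pours coffee"), (10.1, 15.0, "he drinks from the cup")]
sequence = " ".join(
    f"{to_time_token(s, 60.0)} {to_time_token(e, 60.0)} {caption}" for s, e, caption in events
)
print(sequence)
# <time_7> <time_16> a man pours coffee <time_16> <time_25> he drinks from the cup
```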

Delving Deeper into Convolutional Networks for Learning Video Representations

yaoli/arctic-capgen-vid 19 Nov 2015

We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep convolutional network trained on the large ImageNet dataset.
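
As a rough illustration of this idea, the sketch below runs one GRU per CNN level over per-frame "percepts" and fuses the final hidden states into a video representation. The layer dimensions and the concatenation-based fusion are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PerceptGRUEncoder(nn.Module):
    """Toy encoder: one GRU per CNN level runs over per-frame 'percepts'; final states are fused."""
    def __init__(self, percept_dims=(512, 1024, 2048), hidden_dim=256):
        super().__init__()
        # One GRU per level of the convolutional network (dimensions are illustrative).
        self.grus = nn.ModuleList(
            [nn.GRU(d, hidden_dim, batch_first=True) for d in percept_dims]
        )

    def forward(self, percepts):
        # percepts: list of tensors, each (batch, num_frames, percept_dim) from one CNN level
        finals = []
        for gru, p in zip(self.grus, percepts):
            _, h_n = gru(p)                   # h_n: (1, batch, hidden_dim)
            finals.append(h_n.squeeze(0))
        return torch.cat(finals, dim=-1)      # fused video representation for a caption decoder

# Toy usage with random "percepts" from three levels of a frozen CNN over 16 frames.
enc = PerceptGRUEncoder()
video = [torch.randn(2, 16, d) for d in (512, 1024, 2048)]
print(enc(video).shape)  # torch.Size([2, 768])
```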

Video captioning with recurrent networks based on frame- and video-level features and visual content classification

aalto-cbir/neuraltalkTheano 9 Dec 2015

In this paper, we describe the system for generating textual descriptions of short video clips using recurrent neural networks (RNNs), which we used while participating in the Large Scale Movie Description Challenge 2015 at ICCV 2015.

VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

eric-xw/Video-guided-Machine-Translation ICCV 2019

We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context.

Learning to Generate Grounded Visual Captions without Localization Supervision

chihyaoma/cyclical-visual-captioning 1 Jun 2019

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is, whether the model uses the correct image regions to output particular words, or whether the model is hallucinating based on priors in the dataset and/or the language model.

OmniNet: A unified architecture for multi-modal multi-task learning

subho406/OmniNet 17 Jul 2019

We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.

A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling

WingsBrokenAngel/Semantics-AssistedVideoCaptioning 31 Aug 2019

Given the features of a video, recurrent neural networks can be used to automatically generate a caption for the video.
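
The scheduled-sampling part of such a training recipe can be sketched as follows: at each decoding step the model is fed either the ground-truth previous token or its own previous prediction, with the probability of using the ground truth typically annealed over training. The single-step decoder below is an assumption made only to keep the example runnable, not this paper's architecture.

```python
import random
import torch
import torch.nn as nn

class StepDecoder(nn.Module):
    """Minimal single-step caption decoder used only to make the sketch runnable."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, state):
        state = self.cell(self.embed(prev_token), state)
        return self.out(state), state

def decode_with_scheduled_sampling(decoder, video_state, captions, teacher_forcing_prob):
    # At each step, feed either the ground-truth previous token or the model's own prediction.
    state, prev_token, all_logits = video_state, captions[:, 0], []
    for t in range(1, captions.size(1)):
        logits, state = decoder(prev_token, state)
        all_logits.append(logits)
        use_truth = random.random() < teacher_forcing_prob
        prev_token = captions[:, t] if use_truth else logits.argmax(dim=-1)
    return torch.stack(all_logits, dim=1)     # (batch, seq_len - 1, vocab_size)

# Toy usage: teacher_forcing_prob is typically annealed from 1.0 toward a lower value over training,
# so the model gradually learns to condition on its own outputs.
dec = StepDecoder()
caps = torch.randint(0, 1000, (2, 10))
logits = decode_with_scheduled_sampling(dec, torch.zeros(2, 128), caps, teacher_forcing_prob=0.75)
print(logits.shape)  # torch.Size([2, 9, 1000])
```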

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

microsoft/UniVL 15 Feb 2020

However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks.

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

jacobswan1/Video2Commonsense EMNLP 2020

In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene.