Video Captioning

111 papers with code • 6 benchmarks • 24 datasets

Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events it depicts, which also enables efficient retrieval of the video through text.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Most implemented papers

Top-down Visual Saliency Guided by Captions

IgnacioHeredia/plant_classification CVPR 2017

Neural image/video captioning models can generate accurate descriptions, but their internal process of mapping regions to words is a black box and therefore difficult to explain.
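
One generic way to probe such a black box is gradient-based saliency: measure how strongly each spatial region influences the score of a generated word. The sketch below uses a toy stand-in captioner and an arbitrary word index; it illustrates the general technique only, not necessarily this paper's exact method.

```python
import torch
import torch.nn as nn

# Toy stand-in captioner: flattens a 7x7 grid of region features and scores
# a vocabulary of words (all shapes and the word index below are assumptions).
captioner = nn.Sequential(nn.Flatten(), nn.Linear(49 * 512, 10000))

feats = torch.randn(1, 49, 512, requires_grad=True)  # 7x7 grid of region features
logits = captioner(feats)
logits[0, 1234].backward()                  # score of one generated word (index assumed)
saliency = feats.grad.norm(dim=-1)          # (1, 49): per-region influence on that word
```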

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

m-bain/frozen-in-time ICCV 2021

Our objective in this work is video-text retrieval: in particular, a joint embedding that enables efficient text-to-video retrieval.
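
The core idea is a dual encoder that maps videos and texts into one shared embedding space, so retrieval reduces to a cosine-similarity lookup. Below is a minimal sketch of that idea, assuming pre-extracted features; the class name and dimensions are illustrative and not taken from the frozen-in-time codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Minimal joint-embedding model: projects video and text features into a
    shared space where cosine similarity ranks text-video matches."""
    def __init__(self, video_dim=768, text_dim=512, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        # L2-normalise so the dot product equals cosine similarity
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return t @ v.T  # (num_texts, num_videos) similarity matrix

model = DualEncoder()
videos = torch.randn(100, 768)            # pre-extracted video features (assumed)
queries = torch.randn(5, 512)             # pre-extracted text features (assumed)
sims = model(videos, queries)
top3 = sims.topk(3, dim=-1).indices       # top-3 retrieved videos per query
```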

What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment

ParitoshParmar/MTL-AQA CVPR 2019

Can performance on the task of action quality assessment (AQA) be improved by exploiting a description of the action and its quality?
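
The multitask idea can be sketched as a shared backbone with two heads, one regressing the quality score and one decoding a caption, trained with a joint loss. The layer sizes and names below are hypothetical, not taken from the MTL-AQA code.

```python
import torch
import torch.nn as nn

class MultitaskAQA(nn.Module):
    """Shared video backbone with two heads: a quality-score regressor and a
    simplified caption decoder, trained jointly (hypothetical layer sizes)."""
    def __init__(self, feat_dim=512, vocab_size=10000, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.score_head = nn.Linear(hidden, 1)              # AQA regression
        self.caption_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.word_logits = nn.Linear(hidden, vocab_size)

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim) pre-extracted features
        h = self.backbone(clip_feats)
        score = self.score_head(h.mean(dim=1))              # pooled video score
        # simplified: a real decoder would also consume word embeddings
        dec_out, _ = self.caption_rnn(h)
        words = self.word_logits(dec_out)                   # per-step word logits
        return score, words
```

A joint loss would then combine, for example, an MSE term on the score with cross-entropy on the caption words.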

Multi-modal Dense Video Captioning

v-iashin/MDVC 17 Mar 2020

We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track.
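
In code, the recipe amounts to encoding each modality separately and fusing the results before caption decoding. A minimal sketch, assuming pre-extracted video and audio features and tokenised ASR text; all dimensions and names are illustrative, not the MDVC implementation:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Encodes video frames, the audio track, and ASR subtitles separately,
    then concatenates them into one representation (dimensions illustrative)."""
    def __init__(self, video_dim=1024, audio_dim=128, vocab=10000, d=256):
        super().__init__()
        self.video_enc = nn.Linear(video_dim, d)
        self.audio_enc = nn.Linear(audio_dim, d)
        self.asr_embed = nn.EmbeddingBag(vocab, d)   # bag-of-words ASR encoder
        self.fuse = nn.Linear(3 * d, d)

    def forward(self, video, audio, asr_tokens):
        v = self.video_enc(video)                    # (batch, d)
        a = self.audio_enc(audio)                    # (batch, d)
        s = self.asr_embed(asr_tokens)               # (batch, d)
        return torch.tanh(self.fuse(torch.cat([v, a, s], dim=-1)))

fused = MultiModalFusion()(torch.randn(2, 1024), torch.randn(2, 128),
                           torch.randint(0, 10000, (2, 30)))
```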

Reconstruction Network for Video Captioning

hobincar/RecNet CVPR 2018

Unlike previous video captioning work, which mainly exploits cues from the video content to produce a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture that leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning.
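
A minimal sketch of the encoder-decoder-reconstructor idea: the forward flow produces caption words, while the reconstructor regenerates the input features from decoder states so a reconstruction loss can regularise training. Sizes are assumed, and the real RecNet decodes autoregressively with attention, which this sketch omits:

```python
import torch
import torch.nn as nn

class RecNetSketch(nn.Module):
    """Forward flow: video features -> caption words. Backward flow: decoder
    hidden states -> reconstructed video features (all sizes assumed)."""
    def __init__(self, feat_dim=512, vocab=10000, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.word_out = nn.Linear(hidden, vocab)
        self.reconstructor = nn.GRU(hidden, feat_dim, batch_first=True)

    def forward(self, frame_feats):
        enc, _ = self.encoder(frame_feats)        # video -> encoder states
        dec, _ = self.decoder(enc)                # states -> caption states
        words = self.word_out(dec)                # forward flow (captions)
        recon, _ = self.reconstructor(dec)        # backward flow (features)
        return words, recon

model = RecNetSketch()
feats = torch.randn(2, 20, 512)                   # 2 videos, 20 frames each
words, recon = model(feats)
recon_loss = nn.functional.mse_loss(recon, feats) # backward-flow loss term
```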

ECO: Efficient Convolutional Network for Online Video Understanding

mzolfaghari/ECO-efficient-video-understanding ECCV 2018

In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
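
ECO's key design is to run a lightweight 2D CNN on sparsely sampled frames and then a small 3D CNN over the stacked feature maps, so temporal reasoning happens once per video rather than per frame. A rough sketch with illustrative channel counts, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ECOSketch(nn.Module):
    """ECO-style design: a light 2D CNN per sampled frame, then a small 3D CNN
    over the stacked feature maps for long-term context (sizes illustrative)."""
    def __init__(self, num_classes=400):
        super().__init__()
        self.cnn2d = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cnn3d = nn.Sequential(
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W), sparsely sampled across the video
        b, t = frames.shape[:2]
        f = self.cnn2d(frames.flatten(0, 1))            # per-frame 2D features
        f = f.view(b, t, *f.shape[1:]).transpose(1, 2)  # (b, C, t, H', W')
        return self.fc(self.cnn3d(f).flatten(1))        # 3D conv mixes time

logits = ECOSketch()(torch.randn(2, 8, 3, 64, 64))      # 8 frames per video
```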

VideoBERT: A Joint Model for Video and Language Representation Learning

ammesatyajit/VideoBERT ICCV 2019

Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

linjieli222/HERO EMNLP 2020

We present HERO, a novel framework for large-scale video+language omni-representation learning.

Delving Deeper into Convolutional Networks for Learning Video Representations

yaoli/arctic-capgen-vid 19 Nov 2015

We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep convolutional network trained on the large ImageNet dataset.
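
Concretely, per-frame features taken from several CNN levels are concatenated and fed to a GRU, whose final state summarises the video. A minimal sketch, with the choice of levels and all dimensions assumed:

```python
import torch
import torch.nn as nn

class PerceptGRU(nn.Module):
    """GRU over per-frame 'percepts': features pooled from several layers of a
    pretrained 2D CNN and concatenated (layer choice is illustrative)."""
    def __init__(self, percept_dims=(256, 512, 1024), hidden=512):
        super().__init__()
        self.rnn = nn.GRU(sum(percept_dims), hidden, batch_first=True)

    def forward(self, percepts):
        # percepts: list of (batch, time, dim_i) tensors, one per CNN level
        x = torch.cat(percepts, dim=-1)
        out, h = self.rnn(x)
        return h[-1]                       # (batch, hidden) video representation

feats = [torch.randn(2, 16, d) for d in (256, 512, 1024)]
video_repr = PerceptGRU()(feats)           # (2, 512)
```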

Video captioning with recurrent networks based on frame- and video-level features and visual content classification

aalto-cbir/neuraltalkTheano 9 Dec 2015

In this paper, we describe the system for generating textual descriptions of short video clips using recurrent neural networks (RNNs), which we used while participating in the Large Scale Movie Description Challenge 2015 at ICCV 2015.
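
A common pattern for such systems is to initialise an RNN caption decoder from a fusion of frame-level and video-level features. The sketch below follows that pattern with assumed sizes; it is a PyTorch illustration of the idea, not the Theano-based neuraltalkTheano implementation:

```python
import torch
import torch.nn as nn

class FeatureInitCaptioner(nn.Module):
    """LSTM captioner whose initial hidden state is built from concatenated
    frame-level and video-level features (all sizes are illustrative)."""
    def __init__(self, frame_dim=2048, video_dim=512, vocab=10000, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(frame_dim + video_dim, hidden)
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, video_feat, tokens):
        # frame_feats: (batch, time, frame_dim); mean-pool over time
        fused = torch.cat([frame_feats.mean(dim=1), video_feat], dim=-1)
        h0 = torch.tanh(self.init_h(fused)).unsqueeze(0)   # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        dec, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(dec)                               # next-word logits

logits = FeatureInitCaptioner()(torch.randn(2, 30, 2048), torch.randn(2, 512),
                                torch.randint(0, 10000, (2, 12)))
```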