Video Captioning

83 papers with code • 6 benchmarks • 20 datasets

Video Captioning is the task of automatically generating a natural-language caption for a video by understanding the actions and events it contains, which in turn enables efficient text-based retrieval of the video.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020
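The standard pipeline behind this task can be sketched in a few lines: sample frames, encode each frame to a feature vector, pool the features over time, and decode a caption. The components below are toy stand-ins (a hand-written "encoder" and a hypothetical threshold "decoder"), meant only to show the shape of the pipeline, not any real model:

```python
# Minimal video-captioning pipeline sketch: per-frame encoding, temporal
# pooling, then caption decoding. All components are toy placeholders.

def encode_frame(frame):
    # Toy "CNN": mean and max of pixel intensities as a 2-d feature.
    return [sum(frame) / len(frame), max(frame)]

def pool_features(features):
    # Temporal mean-pooling over per-frame features.
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]

def decode_caption(video_feature, max_len=4):
    # Toy greedy "decoder": emits words based on the pooled feature.
    vocab = ["a", "person", "is", "moving"]
    return vocab[:max_len] if video_feature[0] > 0 else ["<empty>"]

def caption_video(frames):
    feats = [encode_frame(f) for f in frames]
    return " ".join(decode_caption(pool_features(feats)))

frames = [[0.1, 0.5, 0.9], [0.2, 0.6, 0.8]]  # 2 frames, 3 "pixels" each
print(caption_video(frames))                  # -> "a person is moving"
```

Real systems replace each stage with learned modules (a CNN or video transformer encoder and an autoregressive language decoder), but the data flow is the same.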

Greatest papers with code

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

yehli/xmodaler 18 Aug 2021

Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.

Cross-Modal Retrieval Image Captioning +4

NMT-Keras: a Very Flexible Toolkit with a Focus on Interactive NMT and Online Learning

lvapeab/nmt-keras 9 Jul 2018

We present NMT-Keras, a flexible toolkit for training deep learning models, which puts a particular emphasis on the development of advanced applications of neural machine translation systems, such as interactive-predictive translation protocols and long-term adaptation of the translation system via continuous learning.

General Classification Machine Translation +5

OmniNet: A unified architecture for multi-modal multi-task learning

subho406/OmniNet 17 Jul 2019

We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.

Image Captioning Language understanding +6

ECO: Efficient Convolutional Network for Online Video Understanding

mzolfaghari/ECO-efficient-video-understanding ECCV 2018

In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.

Ranked #45 on Action Recognition on Something-Something V1 (using extra training data)

Action Classification Action Recognition +4

Delving Deeper into Convolutional Networks for Learning Video Representations

yaoli/arctic-capgen-vid 19 Nov 2015

We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep convolutional network trained on the large ImageNet dataset.

Action Recognition Video Captioning
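The gated recurrence the abstract describes can be sketched as a plain-Python GRU cell run over a sequence of per-frame "percept" vectors. The weights here are arbitrary toy values, not the paper's trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x, h, p):
    # One GRU update: update gate z, reset gate r, candidate state h_tilde,
    # then interpolate between the old state h and the candidate.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    h_tilde = [math.tanh(a + b)
               for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [(1 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h, h_tilde)]

# Tiny 2-d example: fold a GRU over per-frame percept vectors (toy weights).
params = {k: [[0.5, -0.3], [0.1, 0.8]]
          for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
percepts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per frame
h = [0.0, 0.0]
for x in percepts:
    h = gru_step(x, h, params)
print(h)  # final hidden state summarizing the frame sequence
```

The final hidden state plays the role of the video representation that a caption decoder then conditions on.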

Oracle performance for visual captioning

yaoli/arctic-capgen-vid 14 Nov 2015

The task of associating images and videos with a natural language description has attracted a great amount of attention recently.

Image Captioning Language Modelling +1

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

gingsi/coot-videotext NeurIPS 2020

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics.

Cross-Modal Retrieval Representation Learning +1
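The granularity hierarchy this abstract refers to can be illustrated with a two-level pooling sketch, where per-frame features are aggregated into clip features and clip features into one video feature. Plain mean-pooling stands in for COOT's learned attention mechanisms, which is an assumption made purely for illustration:

```python
# Two-level (frame -> clip -> video) feature hierarchy, with mean-pooling
# as a toy stand-in for learned hierarchical aggregation.

def mean_pool(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def video_feature(clips):
    # clips: list of clips, each a list of per-frame feature vectors.
    clip_feats = [mean_pool(frames) for frames in clips]  # frame -> clip level
    return mean_pool(clip_feats)                          # clip -> video level

clips = [
    [[1.0, 0.0], [3.0, 0.0]],  # clip 1: two frames
    [[0.0, 2.0], [0.0, 4.0]],  # clip 2: two frames
]
print(video_feature(clips))  # -> [1.0, 1.5]
```

On the text side the same hierarchy pairs words with frames, sentences with clips, and paragraphs with videos, so each level can be matched against its visual counterpart.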

End-to-End Dense Video Captioning with Masked Transformer

salesforce/densecap CVPR 2018

To address this problem, we propose an end-to-end transformer model for dense video captioning.

Dense Video Captioning
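The "masked" part of a transformer caption decoder is causal self-attention: position i may only attend to positions up to i, so each word is predicted from earlier words only. A minimal sketch (queries, keys, and values all set to the input embeddings for brevity; not the paper's full architecture):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def masked_self_attention(X):
    # Causal self-attention over token embeddings X: row i attends only to
    # rows j <= i, the masking a transformer caption decoder relies on.
    n, d = len(X), len(X[0])
    scale = math.sqrt(d)
    out = []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(X[i], X[j])) / scale
                  for j in range(i + 1)]          # future positions masked out
        weights = softmax(scores)
        out.append([sum(w * X[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 token embeddings
Y = masked_self_attention(X)
print(Y[0])  # first position attends only to itself -> equals X[0]
```

Because the first position can only attend to itself, its output is exactly its own embedding, which is an easy sanity check on the mask.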

Learning to Generate Grounded Visual Captions without Localization Supervision

facebookresearch/ActivityNet-Entities 1 Jun 2019

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded: that is, whether the model uses the correct image regions to output particular words, or whether it is hallucinating based on priors in the dataset and/or the language model.

Image Captioning Language Modelling +1