Video Captioning

153 papers with code • 11 benchmarks • 31 datasets

Video Captioning is a task of automatic captioning a video by understanding the action and event in the video which can help in the retrieval of the video efficiently through text.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020


Use these libraries to find Video Captioning models and implementations

Most implemented papers

Top-down Visual Saliency Guided by Captions

IgnacioHeredia/plant_classification CVPR 2017

Neural image/video captioning models can generate accurate descriptions, but their internal process of mapping regions to words is a black box and therefore difficult to explain.

ECO: Efficient Convolutional Network for Online Video Understanding

mzolfaghari/ECO-efficient-video-understanding ECCV 2018

In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

m-bain/frozen-in-time ICCV 2021

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.

What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment

ParitoshParmar/MTL-AQA CVPR 2019

Can performance on the task of action quality assessment (AQA) be improved by exploiting a description of the action and its quality?

Multi-modal Dense Video Captioning

v-iashin/MDVC 17 Mar 2020

We apply automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside video frames and the corresponding audio track.

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

jpthu17/emcl 21 Nov 2022

Most video-and-language representation learning approaches employ contrastive learning, e. g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

whwu95/Cap4Video CVPR 2023

Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

alibaba/AliceMind 1 Feb 2023

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.

Reconstruction Network for Video Captioning

nasib-ullah/video-captioning-models-in-Pytorch CVPR 2018

Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning.

VideoBERT: A Joint Model for Video and Language Representation Learning

ammesatyajit/VideoBERT ICCV 2019

Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.