Video Captioning
164 papers with code • 11 benchmarks • 32 datasets
Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains, which enables efficient text-based retrieval of the video.
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
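Most systems in this area follow an encode-then-decode pattern: sample frames, encode them into visual features, model their temporal order, and decode a caption token by token. Below is a minimal PyTorch sketch of that pattern; the feature dimension, layer counts, and class names are illustrative assumptions, not any specific published model.

```python
# Minimal encode-then-decode video captioning sketch (illustrative only).
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        # Assumes precomputed 2048-d per-frame features (e.g., from a CNN).
        self.frame_proj = nn.Linear(2048, d_model)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        # frame_feats: (B, T, 2048); caption_tokens: (B, L) token ids.
        memory = self.temporal(self.frame_proj(frame_feats))
        tgt = self.embed(caption_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)  # (B, L, vocab_size) next-token logits
```

At inference time the decoder is run autoregressively (greedy or beam search) from a start token, attending to the encoded frames at every step.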
Latest papers
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation
This paper presents OmniDataComposer, an innovative approach to multimodal data fusion and unlimited data generation that aims to refine and simplify the interplay among diverse data modalities.
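As a concrete illustration of what a unified data structure for video can look like, the sketch below bundles per-modality text (frame descriptions, speech transcripts, on-screen text) into one record that can be serialized for a language model. The field names and serialization format are assumptions for illustration, not OmniDataComposer's actual schema.

```python
# Illustrative sketch of a unified per-video multimodal record.
# Field names are assumptions, not OmniDataComposer's real schema.
from dataclasses import dataclass, field

@dataclass
class MultimodalRecord:
    video_id: str
    frame_captions: list = field(default_factory=list)  # per-frame descriptions
    asr_transcript: str = ""                            # speech from the audio track
    ocr_text: list = field(default_factory=list)        # on-screen text

    def to_prompt(self) -> str:
        # Serialize all modalities into one text stream a language
        # model can consume for downstream generation.
        parts = [f"[FRAMES] {' '.join(self.frame_captions)}",
                 f"[SPEECH] {self.asr_transcript}",
                 f"[OCR] {' '.join(self.ocr_text)}"]
        return "\n".join(parts)
```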
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
SurgVLP constructs a new contrastive learning objective that aligns video clip embeddings with their multiple corresponding text embeddings by bringing them together in a joint latent space.
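The standard way to bring paired embeddings together in a joint latent space is a symmetric InfoNCE-style contrastive loss over a batch of clip-text pairs. The sketch below shows that generic pattern; where a clip has multiple paired texts, one simple option (assumed here) is to average their embeddings first. This is the generic objective, not necessarily SurgVLP's exact formulation.

```python
# Generic symmetric clip-text contrastive (InfoNCE-style) loss sketch.
import torch
import torch.nn.functional as F

def clip_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) embeddings of paired video clips and
    their texts; diagonal pairs in the batch are the positives."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```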
CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning
We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, model generalization and robustness, etc.
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
Since pretraining resources with Indonesian sentences are relatively limited, the applicability of those approaches to our dataset remains questionable.
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning
This paper proposes an effective model, LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) we utilize a pretrained LLM to generate high-quality, human-like captions.
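The general "video adapter" pattern the title points to is to compress frame features into a fixed number of soft prefix tokens in the LLM's embedding space, so a frozen LLM can condition on video. The sketch below shows one common realization (learnable queries with cross-attention); dimensions and names are assumptions, not LLMVA-GEBC's actual code.

```python
# Generic video-adapter sketch: pool frame features into soft prefix
# tokens for an LLM. Shapes and names are illustrative assumptions.
import torch
import torch.nn as nn

class VideoAdapter(nn.Module):
    def __init__(self, video_dim=768, llm_dim=4096, n_query=32):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim))
        # Learnable queries pool a variable-length frame sequence
        # into a fixed number of prefix tokens.
        self.queries = nn.Parameter(torch.randn(n_query, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8,
                                          batch_first=True)

    def forward(self, frame_feats):              # (B, T, video_dim)
        kv = self.proj(frame_feats)              # (B, T, llm_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        prefix, _ = self.attn(q, kv, kv)         # (B, n_query, llm_dim)
        return prefix  # prepended to the text embeddings fed to the LLM
```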
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations.
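The title's "concatenated sample" idea suggests stitching several image-text pairs into one pseudo video-paragraph pair, so that abundant image-text data gains an artificial temporal dimension. Below is a minimal sketch of that kind of on-the-fly concatenation; the ordering and joining conventions are assumptions rather than COSA's exact recipe.

```python
# On-the-fly concatenation of image-text pairs into a pseudo
# video-paragraph sample (illustrative sketch).
import random
import torch

def concat_samples(image_text_pairs, k=4):
    """image_text_pairs: list of (image tensor (C, H, W), caption str).
    Returns a (k, C, H, W) pseudo 'video' and a joined 'paragraph'."""
    picked = random.sample(image_text_pairs, k)
    frames = torch.stack([img for img, _ in picked])  # images as frames
    paragraph = " ".join(cap for _, cap in picked)    # captions in frame order
    return frames, paragraph
```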
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification.
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).
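A common way to perceive and process several modalities jointly is to project each modality's token sequence to a shared width and run a fusion transformer over the concatenation. The sketch below shows that generic pattern; the dimensions and layer counts are assumptions, and this is not VAST's actual architecture.

```python
# Generic omni-modality fusion sketch: project each modality to a shared
# width, concatenate along the sequence axis, and jointly encode.
import torch
import torch.nn as nn

class OmniFusion(nn.Module):
    def __init__(self, dims=None, d_model=768, n_layers=2):
        super().__init__()
        # Per-modality feature widths (assumed values for illustration).
        dims = dims or {"vision": 1024, "audio": 768, "subtitle": 768}
        self.proj = nn.ModuleDict(
            {m: nn.Linear(d, d_model) for m, d in dims.items()})
        self.fuser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=n_layers)

    def forward(self, feats):
        # feats: {modality_name: (B, T_m, dim_m)} token sequences.
        tokens = torch.cat(
            [self.proj[m](x) for m, x in feats.items()], dim=1)
        return self.fuser(tokens)  # (B, sum of T_m, d_model) fused tokens
```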
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
Movie101: A New Movie Understanding Benchmark
Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking.