Video Captioning
163 papers with code • 11 benchmarks • 32 datasets
Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains; such captions can also support efficient text-based retrieval of the video.
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
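Most video captioning systems follow an encoder-decoder pattern: sample a fixed number of frames, encode them into a video-level feature, then decode a caption token by token. The sketch below illustrates that pipeline shape with toy stand-ins (random features, a tiny vocabulary, and a similarity-based "decoder"); it is not the method of any paper listed here, and every function and name in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frame_indices(total_frames, num_samples=8):
    """Uniformly sample frame indices from a video (a common preprocessing step)."""
    return np.linspace(0, total_frames - 1, num_samples).astype(int)

def encode_frames(frame_features):
    """Stand-in visual encoder: mean-pool per-frame feature vectors
    into a single video-level feature."""
    return frame_features.mean(axis=0)

def greedy_decode(video_feature, word_embeddings, vocab, max_len=5):
    """Toy decoder: rank vocabulary words by similarity to the video
    feature (a real system would use a trained language decoder)."""
    scores = word_embeddings @ video_feature
    order = np.argsort(scores)[::-1]
    return [vocab[i] for i in order[:max_len]]

# Toy end-to-end run with fabricated features and vocabulary.
idx = sample_frame_indices(total_frames=120)        # 8 indices from 0..119
frame_features = rng.normal(size=(len(idx), 512))   # fake per-frame features
video_feature = encode_frames(frame_features)       # shape (512,)
vocab = ["a", "man", "is", "playing", "guitar", "<eos>"]
word_embeddings = rng.normal(size=(len(vocab), 512))
caption = greedy_decode(video_feature, word_embeddings, vocab)
```

Real systems replace each stand-in with learned components (e.g. a CNN or transformer frame encoder and an autoregressive text decoder), but the sample-encode-decode structure is the same.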
Most implemented papers
A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer
We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.
Enriching Video Captions With Contextual Text
Understanding video content and generating captions with context is an important and challenging task.
End-to-End Dense Video Captioning with Parallel Decoding
Dense video captioning aims to generate multiple associated captions with their temporal locations from the video.
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.
MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration
Altogether, MUGEN can help progress research in many tasks in multimodal understanding and generation.
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning
We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, model generalization and robustness, etc.
SoccerNet 2023 Challenges Results
More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos.
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
There has been significant attention to research on dense video captioning, which aims to automatically localize and caption all events within an untrimmed video.