Video Captioning
162 papers with code • 11 benchmarks • 32 datasets
Video Captioning is the task of automatically generating a natural-language caption for a video by understanding the actions and events it contains, which also enables efficient text-based retrieval of the video.
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
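Most captioning models in the papers below cannot consume every raw frame; a standard first step is to sample a small, evenly spaced set of frames before the visual encoder. A minimal sketch of that preprocessing (function name and midpoint strategy are illustrative, not from any specific paper on this page):

```python
# Hypothetical sketch of uniform frame sampling, a common input pipeline
# step for video captioning models. Names here are illustrative.

def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a video of
    `num_frames` frames."""
    if num_samples >= num_frames:
        # Short clip: just take every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the midpoint of each of the `num_samples` equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

print(uniform_frame_indices(100, 4))  # -> [12, 37, 62, 87]
```

The sampled frames are then encoded individually and the resulting features are fed to a language decoder that emits the caption.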
Libraries
Use these libraries to find Video Captioning models and implementations.
Latest papers
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
A human needs to capture the events in every shot and associate them together to understand the story behind the video.
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP).
Accurate and Fast Compressed Video Captioning
Addressing this, we study video captioning from a different perspective in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors, and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient at inference because less and less-redundant information is processed.
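The three compressed-domain signals the abstract names relate to each other in a simple way: a P-frame is reconstructed by shifting a reference I-frame along its motion vector and adding a small residual. A toy NumPy illustration of that decoder-side relation (not the paper's actual model; array sizes and names are invented for the example):

```python
import numpy as np

# Toy illustration of the compressed-domain signals mentioned above: a
# P-frame is reconstructed from a reference I-frame, a motion vector, and
# a residual, so these three streams together carry the video content
# without decoding every raw frame.

def reconstruct_p_frame(i_frame, motion, residual):
    """Shift the reference frame by the (dy, dx) motion vector, then add
    the residual, i.e. the per-pixel correction stored in the stream."""
    predicted = np.roll(i_frame, shift=motion, axis=(0, 1))
    return predicted + residual

# A 4x4 "I-frame" and a next frame that is a shifted, slightly changed copy.
i_frame = np.arange(16, dtype=np.int32).reshape(4, 4)
next_frame = np.roll(i_frame, shift=(1, 0), axis=(0, 1)) + 1

# The encoder stores only the motion vector and this small residual.
residual = next_frame - np.roll(i_frame, shift=(1, 0), axis=(0, 1))
recon = reconstruct_p_frame(i_frame, (1, 0), residual)
```

Because motion vectors and residuals are sparse and already separate appearance from motion, a captioning model can consume them directly instead of sampling decoded RGB frames.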
SoccerNet 2023 Challenges Results
More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets.
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks.
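The LoRA baseline mentioned above is itself a parameter-efficient tuning scheme: rather than updating a full weight matrix W, it trains a rank-r product B @ A so the effective weight is W + B @ A. A minimal NumPy sketch of that idea (shapes and names are illustrative, and this is the generic LoRA formulation, not VL-PET's granularity-control design):

```python
import numpy as np

# Minimal sketch of a LoRA-style low-rank update: W stays frozen while
# only the small factors A and B are trained.

d_out, d_in, r = 8, 8, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection (init 0)

def lora_forward(x):
    # With B initialized to zero, the adapted layer starts out identical
    # to the pretrained one; training A and B tunes d_out*r + r*d_in
    # parameters instead of d_out*d_in.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # identity at initialization
```

With r much smaller than the layer width, the trainable parameter count drops sharply, which is the property parameter-efficient tuning methods like those compared here exploit.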
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain.
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation
This paper presents OmniDataComposer, an innovative approach for multimodal data fusion and unlimited data generation that aims to refine and simplify the interplay among diverse data modalities.