Video Captioning

162 papers with code • 11 benchmarks • 32 datasets

Video Captioning is the task of automatically generating captions for a video by understanding the actions and events it contains, which also enables efficient text-based retrieval of the video.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

bytedance/Shot2Story 16 Dec 2023

A human needs to capture both the event in every shot and associate them together to understand the story behind it.

45 stars

RTQ: Rethinking Video-language Understanding Based on Image-text Model

SCZwangxiao/RTQ-MM2023 1 Dec 2023

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos.

10 stars

VTimeLLM: Empower LLM to Grasp Video Moments

huangb23/vtimellm 30 Nov 2023

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.

117 stars

Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

intellabs/multimodal_cognitive_ai 7 Oct 2023

We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP).
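
A concrete illustration of this zero-shot setup: a minimal sketch, assuming a frozen CLIP backbone, that classifies a video by mean-pooling per-frame image embeddings and matching the result against class-name prompts. The prompt template, backbone choice, and frame sampling are illustrative assumptions, not the paper's exact protocol.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_video_zero_shot(frames, class_names):
    """frames: list of PIL images sampled from the video."""
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        # Encode each sampled frame independently with the image tower.
        frame_feats = model.encode_image(images)
        frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
        # Temporal pooling: the mean over frames is the video embedding.
        video_feat = frame_feats.mean(dim=0, keepdim=True)
        video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
        # One hypothetical prompt per action class for the text tower.
        prompts = clip.tokenize(
            [f"a video of a person {c}" for c in class_names]).to(device)
        text_feats = model.encode_text(prompts)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # Highest cosine similarity wins; no video-specific training needed.
    sims = (video_feat @ text_feats.T).squeeze(0)
    return class_names[sims.argmax().item()]
```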

32 stars

Accurate and Fast Compressed Video Captioning

acherstyx/CoCap ICCV 2023

Addressing this, we study video captioning from a different perspective, in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning, without manual sampling, through a specialized model design; 2) The captioning model is more efficient at inference because it processes a smaller amount of less redundant information.
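
To make the compressed-domain pipeline concrete, here is a minimal sketch assuming I-frames arrive as RGB tensors, motion vectors as 2-channel (dx, dy) maps, and residuals as RGB-like maps; the convolutional stems, layer counts, and fuse-by-concatenation design are generic stand-ins, not the paper's CoCap model.

```python
import torch
import torch.nn as nn

class CompressedVideoCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        # I-frames carry appearance; a patch-embedding stem stands in
        # for a full image backbone.
        self.iframe_stem = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Motion vectors and residuals carry little redundant detail,
        # so lightweight stems suffice.
        self.mv_stem = nn.Conv2d(2, d_model, kernel_size=16, stride=16)
        self.res_stem = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def _tokens(self, stem, x):
        # (B, C, H, W) -> (B, H'*W', d_model) patch tokens.
        return stem(x).flatten(2).transpose(1, 2)

    def forward(self, iframe, motion, residual, caption_ids):
        # Fuse all three compressed-domain streams by token
        # concatenation, so the model sees the whole group of pictures
        # without fully decoding every frame.
        memory = self.encoder(torch.cat(
            [self._tokens(self.iframe_stem, iframe),
             self._tokens(self.mv_stem, motion),
             self._tokens(self.res_stem, residual)], dim=1))
        mask = nn.Transformer.generate_square_subsequent_mask(
            caption_ids.size(1)).to(caption_ids.device)
        out = self.decoder(self.tok_emb(caption_ids), memory, tgt_mask=mask)
        return self.lm_head(out)  # per-token vocabulary logits
```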

27 stars · 22 Sep 2023

SoccerNet 2023 Challenges Results

lRomul/ball-action-spotting 12 Sep 2023

More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.

70 stars

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

yangbang18/multicapclip 25 Aug 2023

To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets.
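
The trick such zero-shot captioners lean on is CLIP's shared embedding space: a decoder trained to reconstruct captions from their own CLIP text embeddings can be conditioned on CLIP image or video embeddings at test time, with no vision-caption pairs. The tiny GRU decoder below is a hypothetical stand-in for the paper's auto-encoded prompts, shown only to make the embedding swap concrete.

```python
import torch
import torch.nn as nn

class PrefixCaptionDecoder(nn.Module):
    """Toy decoder conditioned on a CLIP embedding via its initial
    hidden state (an illustrative design, not MultiCapCLIP itself)."""
    def __init__(self, vocab_size, clip_dim=512, hidden=512):
        super().__init__()
        self.proj = nn.Linear(clip_dim, hidden)
        self.emb = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, clip_feat, token_ids):
        h0 = self.proj(clip_feat).unsqueeze(0)      # (1, B, hidden)
        out, _ = self.gru(self.emb(token_ids), h0)  # teacher forcing
        return self.head(out)

# Training uses text only: decoder(clip_text_embed(caption), caption_ids).
# At inference the visual embedding is swapped in, relying on the
# alignment of CLIP's image and text spaces:
# decoder(clip_image_embed(video_frames), ...).
```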

33 stars

VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

henryhzy/vl-pet ICCV 2023

In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks.
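
For reference, the VL-Adapter-style bottleneck module that VL-PET is compared against can be sketched in a few lines; VL-PET's own granularity-control mechanism is not reproduced here, and all dimensions are illustrative.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual add: only these
    few parameters are trained while the backbone stays frozen."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)
        # Zero-init the up-projection so tuning starts exactly at the
        # frozen backbone's function (a common adapter trick).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```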

47 stars · 18 Aug 2023

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

bladewaltz1/promptswitch ICCV 2023

In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain.

24 stars · 15 Aug 2023

OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation

shajiayu1/OmniDataComposer 8 Aug 2023

This paper presents OmniDataComposer, an approach to multimodal data fusion and unlimited data generation that aims to refine and simplify the interplay among diverse data modalities.

16 stars