Video Captioning
162 papers with code • 11 benchmarks • 32 datasets
Video Captioning is the task of automatically generating a natural-language caption for a video by understanding the actions and events it contains, which also enables efficient text-based retrieval of the video.
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
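Most captioning models in the papers below cannot consume every raw frame; a standard first step is to sample a small, evenly spaced set of frames before the visual encoder. A minimal sketch of that preprocessing (function name and midpoint strategy are illustrative, not from any specific paper on this page):

```python
# Hypothetical sketch of uniform frame sampling, a common input pipeline
# step for video captioning models. Names here are illustrative.

def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a video of
    `num_frames` frames."""
    if num_samples >= num_frames:
        # Short clip: just take every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the midpoint of each of the `num_samples` equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

print(uniform_frame_indices(100, 4))  # -> [12, 37, 62, 87]
```

The sampled frames are then encoded individually and the resulting features are fed to a language decoder that emits the caption.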
Libraries
Use these libraries to find Video Captioning models and implementations.
Latest papers
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
A human needs to capture the events in every shot and associate them together to understand the story behind the video.
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP).
Accurate and Fast Compressed Video Captioning
Addressing this, we study video captioning from a different perspective in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors, and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient at inference because less and less-redundant information is processed.
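The three compressed-domain signals the abstract names relate to each other in a simple way: a P-frame is reconstructed by shifting a reference I-frame along its motion vector and adding a small residual. A toy NumPy illustration of that decoder-side relation (not the paper's actual model; array sizes and names are invented for the example):

```python
import numpy as np

# Toy illustration of the compressed-domain signals mentioned above: a
# P-frame is reconstructed from a reference I-frame, a motion vector, and
# a residual, so these three streams together carry the video content
# without decoding every raw frame.

def reconstruct_p_frame(i_frame, motion, residual):
    """Shift the reference frame by the (dy, dx) motion vector, then add
    the residual, i.e. the per-pixel correction stored in the stream."""
    predicted = np.roll(i_frame, shift=motion, axis=(0, 1))
    return predicted + residual

# A 4x4 "I-frame" and a next frame that is a shifted, slightly changed copy.
i_frame = np.arange(16, dtype=np.int32).reshape(4, 4)
next_frame = np.roll(i_frame, shift=(1, 0), axis=(0, 1)) + 1

# The encoder stores only the motion vector and this small residual.
residual = next_frame - np.roll(i_frame, shift=(1, 0), axis=(0, 1))
recon = reconstruct_p_frame(i_frame, (1, 0), residual)
```

Because motion vectors and residuals are sparse and already separate appearance from motion, a captioning model can consume them directly instead of sampling decoded RGB frames.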
SoccerNet 2023 Challenges Results
More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets.
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks.
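The LoRA baseline mentioned above is itself a parameter-efficient tuning scheme: rather than updating a full weight matrix W, it trains a rank-r product B @ A so the effective weight is W + B @ A. A minimal NumPy sketch of that idea (shapes and names are illustrative, and this is the generic LoRA formulation, not VL-PET's granularity-control design):

```python
import numpy as np

# Minimal sketch of a LoRA-style low-rank update: W stays frozen while
# only the small factors A and B are trained.

d_out, d_in, r = 8, 8, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection (init 0)

def lora_forward(x):
    # With B initialized to zero, the adapted layer starts out identical
    # to the pretrained one; training A and B tunes d_out*r + r*d_in
    # parameters instead of d_out*d_in.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # identity at initialization
```

With r much smaller than the layer width, the trainable parameter count drops sharply, which is the property parameter-efficient tuning methods like those compared here exploit.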
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain.
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation
This paper presents OmniDataComposer, an innovative approach for multimodal data fusion and unlimited data generation that aims to refine and simplify the interplay among diverse data modalities.