Video Captioning
163 papers with code • 11 benchmarks • 32 datasets
Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains; such captions can also support efficient text-based retrieval of the video.
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
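Most video captioning systems follow an encoder-decoder pattern: sample a fixed number of frames, encode them into a video-level feature, then decode a caption token by token. The sketch below illustrates that pipeline shape with toy stand-ins (random features, a tiny vocabulary, and a similarity-based "decoder"); it is not the method of any paper listed here, and every function and name in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frame_indices(total_frames, num_samples=8):
    """Uniformly sample frame indices from a video (a common preprocessing step)."""
    return np.linspace(0, total_frames - 1, num_samples).astype(int)

def encode_frames(frame_features):
    """Stand-in visual encoder: mean-pool per-frame feature vectors
    into a single video-level feature."""
    return frame_features.mean(axis=0)

def greedy_decode(video_feature, word_embeddings, vocab, max_len=5):
    """Toy decoder: rank vocabulary words by similarity to the video
    feature (a real system would use a trained language decoder)."""
    scores = word_embeddings @ video_feature
    order = np.argsort(scores)[::-1]
    return [vocab[i] for i in order[:max_len]]

# Toy end-to-end run with fabricated features and vocabulary.
idx = sample_frame_indices(total_frames=120)        # 8 indices from 0..119
frame_features = rng.normal(size=(len(idx), 512))   # fake per-frame features
video_feature = encode_frames(frame_features)       # shape (512,)
vocab = ["a", "man", "is", "playing", "guitar", "<eos>"]
word_embeddings = rng.normal(size=(len(vocab), 512))
caption = greedy_decode(video_feature, word_embeddings, vocab)
```

Real systems replace each stand-in with learned components (e.g. a CNN or transformer frame encoder and an autoregressive text decoder), but the sample-encode-decode structure is the same.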
Most implemented papers
A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer
We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.
Enriching Video Captions With Contextual Text
Understanding video content and generating captions with context is an important and challenging task.
End-to-End Dense Video Captioning with Parallel Decoding
Dense video captioning aims to generate multiple associated captions with their temporal locations from the video.
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.
MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration
Altogether, MUGEN can help progress research in many tasks in multimodal understanding and generation.
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning
We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, model generalization and robustness, etc.
SoccerNet 2023 Challenges Results
More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos.
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
There has been significant attention to research on dense video captioning, which aims to automatically localize and caption all events within an untrimmed video.