Video Captioning

163 papers with code • 11 benchmarks • 32 datasets

Video Captioning is the task of automatically generating a natural-language caption for a video by understanding the actions and events it contains; such captions also enable efficient text-based video retrieval.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020
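At its core, most captioning systems encode sampled frames into feature vectors, pool them into a video-level representation, and decode a token sequence from it. The following is a minimal toy sketch of that structure in pure Python; the vocabulary, embeddings, and dot-product scoring are hypothetical stand-ins for a trained decoder's logits, not any real model.

```python
from typing import Dict, List

# Hypothetical toy vocabulary and token embeddings; a real system would
# decode with a trained network (e.g. a transformer) over a large vocabulary.
EMBEDDINGS: Dict[str, List[float]] = {
    "<eos>": [0.0, 0.0],
    "a": [1.0, 0.0],
    "person": [0.9, 0.0],
    "plays": [0.8, 0.0],
    "guitar": [0.7, 0.0],
}

def pool_frames(frame_features: List[List[float]]) -> List[float]:
    """Mean-pool per-frame feature vectors into one video-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

def greedy_decode(video_vec: List[float], max_len: int = 6) -> List[str]:
    """Greedily emit the highest-scoring unused token until <eos>.

    The dot-product scores here stand in for a decoder's logits."""
    caption: List[str] = []
    for _ in range(max_len):
        scores = {
            tok: sum(v * e for v, e in zip(video_vec, emb))
            for tok, emb in EMBEDDINGS.items()
            if tok not in caption  # crude repetition penalty
        }
        best = max(scores, key=scores.get)
        if best == "<eos>":
            break
        caption.append(best)
    return caption

frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
print(" ".join(greedy_decode(pool_frames(frames))))  # -> a person plays guitar
```

Dense video captioning (several of the papers below) extends this single-caption setup by also localizing each event in time before captioning it.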

Most implemented papers

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

v-iashin/BMT 17 May 2020

We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.

Enriching Video Captions With Contextual Text

primle/LSMDC-Context 29 Jul 2020

Understanding video content and generating caption with context is an important and challenging task.

End-to-End Dense Video Captioning with Parallel Decoding

ttengwang/pdvc ICCV 2021

Dense video captioning aims to generate multiple associated captions with their temporal locations from the video.

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

yehli/xmodaler 18 Aug 2021

Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.

MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration

mugen-org/MUGEN_baseline 17 Apr 2022

Altogether, MUGEN can help progress research in many tasks in multimodal understanding and generation.

PaLI-X: On Scaling up a Multilingual Vision and Language Model

kyegomez/PALI 29 May 2023

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.

CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning

hcplab-sysu/causalvlr 30 Jun 2023

We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, model generalization and robustness, etc.

SoccerNet 2023 Challenges Results

lRomul/ball-action-spotting 12 Sep 2023

More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.

RTQ: Rethinking Video-language Understanding Based on Image-text Model

SCZwangxiao/RTQ-MM2023 1 Dec 2023

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos.

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

ailab-kyunghee/cm2_dvc 11 Apr 2024

There has been significant attention to the research on dense video captioning, which aims to automatically localize and caption all events within untrimmed video.