Video Captioning

164 papers with code • 11 benchmarks • 32 datasets

Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events in it, which in turn supports efficient text-based retrieval of the video.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020
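
As a quick illustration of the task, the sketch below captions a short video with an off-the-shelf checkpoint. It assumes the Hugging Face transformers library and the microsoft/git-base-vatex checkpoint (a GIT model fine-tuned for video captioning); the frame-sampling helper and the file name example.mp4 are illustrative and not taken from any paper listed here.

```python
import cv2
import numpy as np
from transformers import AutoProcessor, AutoModelForCausalLM

# Example checkpoint: GIT fine-tuned on VATEX for video captioning.
ckpt = "microsoft/git-base-vatex"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

def sample_frames(path, num_frames):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = max(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)), 1)
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# Video checkpoints expect a fixed number of sampled frames (assumed 6 if not exposed in the config).
num_frames = getattr(model.config, "num_image_with_embedding", None) or 6
frames = sample_frames("example.mp4", num_frames)  # "example.mp4" is a placeholder path

pixel_values = processor(images=frames, return_tensors="pt").pixel_values  # (num_frames, 3, H, W)
pixel_values = pixel_values.unsqueeze(0)                                   # (1, num_frames, 3, H, W)

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```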

OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation

shajiayu1/OmniDataComposer 8 Aug 2023

This paper presents OmniDataComposer, an innovative approach to multimodal data fusion and unlimited data generation that aims to refine and simplify the interplay among diverse data modalities.

Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

camma-public/surgvlp 27 Jul 2023

SurgVLP constructs a new contrastive learning objective that aligns each video clip embedding with its multiple corresponding text embeddings by bringing them together in a joint latent space.
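
To make the idea concrete, here is a generic sketch of such a clip-to-multiple-texts contrastive objective (a symmetric InfoNCE loss), written in PyTorch as an illustration rather than the authors' SurgVLP implementation; the function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_to_texts_contrastive_loss(video_emb, text_embs, temperature=0.07):
    """
    video_emb: (B, D)    one embedding per video clip
    text_embs: (B, K, D) K text embeddings per clip (e.g. captions from different sources)
    Symmetric InfoNCE: each clip should be closest to its own texts, and vice versa.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    batch_size, num_texts = text_embs.shape[0], text_embs.shape[1]
    targets = torch.arange(batch_size, device=video_emb.device)
    loss = 0.0
    for k in range(num_texts):
        logits = video_emb @ text_embs[:, k].T / temperature  # (B, B) similarity matrix
        loss = loss + F.cross_entropy(logits, targets)        # video -> text direction
        loss = loss + F.cross_entropy(logits.T, targets)      # text -> video direction
    return loss / (2 * num_texts)
```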

CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning

hcplab-sysu/causal-vlreasoning 30 Jun 2023

We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, model generalization and robustness, etc.

MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian

willyfh/msvd-indonesian 20 Jun 2023

Since pretraining resources with Indonesian sentences are relatively limited, the applicability of those approaches to our dataset remains questionable.

LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning

zjr2000/llmva-gebc 17 Jun 2023

This paper proposes an effective model, LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) we utilize a pretrained LLM to generate high-quality, human-like captions.
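
The "video adapter" idea, projecting frozen video features into the LLM's embedding space so they can be consumed as prefix tokens, can be sketched as below. This is a common generic design with illustrative module names and dimensions, not the exact LLMVA-GEBC architecture.

```python
import torch
import torch.nn as nn

class VideoPrefixAdapter(nn.Module):
    """Map frozen video features to LLM-sized embeddings used as a visual prefix."""

    def __init__(self, video_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, video_dim) features from a frozen video encoder
        # text_embeds: (B, L, llm_dim)   token embeddings of the caption prompt
        prefix = self.proj(frame_feats)                 # (B, T, llm_dim)
        return torch.cat([prefix, text_embeds], dim=1)  # fed to the (frozen) LLM as inputs_embeds
```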

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

txh-mercury/cosa 15 Jun 2023

Due to the limited scale and quality of video-text training corpora, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visual semantic representations while disregarding temporal semantic representations and correlations.

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

x-plug/youku-mplug 7 Jun 2023

In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks: cross-modal retrieval, video captioning, and video category classification.

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

txh-mercury/vast 29 May 2023 (NeurIPS 2023)

Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).

PaLI-X: On Scaling up a Multilingual Vision and Language Model

kyegomez/PALI 29 May 2023

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture.

Movie101: A New Movie Understanding Benchmark

yuezih/movie101 20 May 2023

Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking.
