Video Captioning

160 papers with code • 11 benchmarks • 32 datasets

Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains; the resulting captions can in turn support efficient text-based retrieval of the video.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020
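
The retrieval use case is easy to make concrete: once captions exist, videos can be searched with ordinary text retrieval. A minimal sketch, with made-up captions and scikit-learn's TF-IDF utilities:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical generated captions, keyed by video id.
captions = {
    "vid_001": "a man is slicing vegetables in a kitchen",
    "vid_002": "a dog catches a frisbee in a park",
    "vid_003": "two players compete in a table tennis match",
}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(list(captions.values()))

def search(query, top_k=2):
    """Rank videos by TF-IDF similarity between the query and their captions."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return sorted(zip(captions.keys(), scores), key=lambda kv: -kv[1])[:top_k]

print(search("dog playing outside"))  # vid_002 should rank first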

Latest papers with no code

The 8th AI City Challenge

no code yet • 15 Apr 2024

The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities.

Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis

no code yet • 12 Apr 2024

Our solution mainly focuses on the following points: 1) to solve dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense, chapter-level captions for each video.
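
PDVC's key idea is DETR-style parallel decoding: a fixed set of learned event queries cross-attends to the frame features, and each query yields one event segment plus its caption in a single pass. A minimal sketch of that structure (the real PDVC uses a deformable transformer, an event counter, and richer heads; all names here are illustrative):

```python
import torch
import torch.nn as nn

class ParallelDenseCaptioner(nn.Module):
    def __init__(self, d_model=256, num_queries=10, vocab_size=1000, max_len=20):
        super().__init__()
        self.event_queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.seg_head = nn.Linear(d_model, 2)          # (center, width) in [0, 1]
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTMCell(d_model, d_model)      # per-event caption decoder
        self.word_head = nn.Linear(d_model, vocab_size)
        self.max_len = max_len

    def forward(self, frame_feats):                    # frame_feats: (B, T, d_model)
        B = frame_feats.size(0)
        queries = self.event_queries.unsqueeze(0).expand(B, -1, -1)
        events = self.decoder(queries, frame_feats)    # (B, N, d): one slot per event
        segments = self.seg_head(events).sigmoid()     # (B, N, 2) normalized spans
        # Greedy captions for all N events, decoded in parallel.
        h = c = events.reshape(-1, events.size(-1))    # (B*N, d)
        token = torch.zeros(h.size(0), dtype=torch.long, device=h.device)  # <bos>=0
        words = []
        for _ in range(self.max_len):
            h, c = self.lstm(self.word_embed(token), (h, c))
            token = self.word_head(h).argmax(-1)
            words.append(token)
        captions = torch.stack(words, dim=1).view(B, -1, self.max_len)
        return segments, captions

model = ParallelDenseCaptioner()
segments, captions = model(torch.randn(2, 64, 256))  # 2 videos, 64 frame features
```

Because every event slot is decoded at once, no proposal ranking or sequential chaptering is needed; the query set itself enumerates the chapters.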

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

no code yet • 3 Apr 2024

We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that focuses on improving the quality of generated event captions and their associated pseudo event boundaries from unlabeled videos.
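
The snippet names the two ingredients but not their exact rules; one plausible reading of "online refinement" is an EMA-style update that drifts pseudo boundaries toward the model's current predictions as training progresses. A toy sketch under that assumption (not DIBS's actual criterion):

```python
def refine_boundaries(pseudo, predicted, m=0.9):
    """pseudo, predicted: lists of (start, end) in seconds; EMA-style update
    that nudges pseudo boundaries toward the current model's predictions."""
    return [
        (m * s0 + (1 - m) * s1, m * e0 + (1 - m) * e1)
        for (s0, e0), (s1, e1) in zip(pseudo, predicted)
    ]

refined = refine_boundaries([(0.0, 4.0), (5.0, 9.0)], [(0.5, 4.5), (5.2, 8.8)])
```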

Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation

no code yet • 8 Mar 2024

Text-to-video generation marks a significant frontier in the rapidly evolving domain of generative AI, integrating advancements in text-to-image synthesis, video captioning, and text-guided editing.

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

no code yet • 29 Feb 2024

Next, we finetune a retrieval model on a small subset in which the best caption of each video is manually selected, and then apply the model to the whole dataset to select the best caption as the annotation.
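
The selection step itself is simple once the retrieval model is trained: embed the video and each teacher-generated candidate caption, and keep the highest-scoring pair. A hedged sketch with placeholder embeddings (the encoders producing them stand in for the paper's finetuned retrieval model):

```python
import torch
import torch.nn.functional as F

def select_best_caption(video_emb, caption_embs):
    """video_emb: (d,); caption_embs: (num_candidates, d). Returns the index
    of the candidate the retrieval model scores highest for this video."""
    sims = F.cosine_similarity(video_emb.unsqueeze(0), caption_embs, dim=-1)
    return sims.argmax().item()

video_emb = torch.randn(512)        # placeholder video embedding
caption_embs = torch.randn(8, 512)  # embeddings of 8 teacher-generated captions
best_idx = select_best_caption(video_emb, caption_embs)
```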

MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning

no code yet • 27 Feb 2024

Further, to better constrain the knowledge characteristics of the old and new tasks at the feature level, we introduce Two-stage Knowledge Distillation (TsKD), which learns the new task well while balancing retention of the old task.
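
The snippet does not spell out TsKD's two stages, but the underlying distillation pattern is standard: a cross-entropy term for the new task balanced against a temperature-scaled KL term that keeps the new model close to the frozen old one. A generic sketch of that trade-off (TsKD's actual staging and feature-level terms differ):

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, targets, T=2.0, alpha=0.5):
    """Cross-entropy on the new task plus KL toward the frozen old model."""
    ce = F.cross_entropy(new_logits, targets)
    kd = F.kl_div(
        F.log_softmax(new_logits / T, dim=-1),
        F.softmax(old_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kd

# Example: 4 samples, 10 classes.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randint(0, 10, (4,)))
```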

Video ReCap: Recursive Captioning of Hour-Long Videos

no code yet • 20 Feb 2024

We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.
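
The recursion bottoms out at clip-level captions and works upward, with each level's outputs becoming the next level's inputs. A schematic sketch in which `summarize` stands in for the trained captioner at every level (the hierarchy, not the model, is the point):

```python
def summarize(texts):
    # Placeholder for a call to the trained captioning model.
    return " ".join(texts)[:200]

def recap(clip_captions, segment_size=10):
    # Level 1 -> 2: group clip-level captions into segments and describe each.
    segments = [
        summarize(clip_captions[i:i + segment_size])
        for i in range(0, len(clip_captions), segment_size)
    ]
    # Level 2 -> 3: a single summary for the whole hour-long video.
    return summarize(segments)

video_summary = recap([f"clip {i}: someone performs an action" for i in range(120)])
```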

Knowledge Guided Entity-aware Video Captioning and A Basketball Benchmark

no code yet • 25 Jan 2024

We develop a knowledge-guided entity-aware video captioning network (KEANet), an encoder-decoder model built on a candidate player list, for live basketball text broadcasts.
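
One common way to make a decoder "entity-aware" is a pointer-style gate that mixes a distribution over ordinary vocabulary words with one over the candidate player names; whether KEANet uses exactly this mechanism is an assumption here. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def entity_aware_step(hidden, vocab_head, gate_head, entity_embs):
    """hidden: (B, d) decoder state; entity_embs: (P, d) candidate players."""
    vocab_logits = vocab_head(hidden)            # (B, V) ordinary words
    entity_logits = hidden @ entity_embs.t()     # (B, P) player names
    gate = torch.sigmoid(gate_head(hidden))      # (B, 1) generate vs. copy
    return torch.cat(
        [gate * F.softmax(vocab_logits, dim=-1),
         (1 - gate) * F.softmax(entity_logits, dim=-1)], dim=-1)

# Example wiring (dimensions are arbitrary):
d, V, P = 256, 1000, 12
step_probs = entity_aware_step(
    torch.randn(4, d), torch.nn.Linear(d, V), torch.nn.Linear(d, 1),
    torch.randn(P, d))  # (4, V + P): one joint distribution per sample
```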

SnapCap: Efficient Snapshot Compressive Video Captioning

no code yet • 10 Jan 2024

To address these problems, we propose a novel video captioning (VC) pipeline, dubbed SnapCap, that generates captions directly from a compressed measurement captured by a snapshot compressive sensing camera.
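
The compressed measurement follows the standard snapshot compressive imaging forward model: several frames are modulated by binary masks and summed into a single 2D snapshot, which SnapCap then captions without reconstructing the video. A sketch of that forward model (not the paper's code):

```python
import torch

def sci_measurement(frames, masks):
    """frames, masks: (B, H, W); returns one (H, W) compressed snapshot."""
    return (frames * masks).sum(dim=0)

frames = torch.rand(8, 256, 256)                 # 8 consecutive video frames
masks = (torch.rand(8, 256, 256) > 0.5).float()  # random binary coding masks
snapshot = sci_measurement(frames, masks)        # single measurement fed to the captioner
```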

Retrieval-Augmented Egocentric Video Captioning

no code yet • 1 Jan 2024

In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos.
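
The retrieval component can be sketched as nearest-neighbor search in a shared egocentric/exocentric embedding space, with the retrieved instructional text handed to the captioner as extra context. All names below are illustrative, not EgoInstructor's API:

```python
import torch
import torch.nn.functional as F

def retrieve_context(ego_emb, exo_embs, exo_texts, k=2):
    """ego_emb: (d,); exo_embs: (N, d); returns the top-k instructional texts."""
    sims = F.cosine_similarity(ego_emb.unsqueeze(0), exo_embs, dim=-1)
    return [exo_texts[i] for i in sims.topk(k).indices.tolist()]

ego_emb = torch.randn(512)        # embedding of the egocentric clip
exo_embs = torch.randn(100, 512)  # bank of third-person instructional videos
exo_texts = [f"instructional text #{i}" for i in range(100)]
context = retrieve_context(ego_emb, exo_embs, exo_texts)
```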