Video Captioning

160 papers with code • 11 benchmarks • 32 datasets

Video captioning is the task of automatically generating a caption for a video by understanding the actions and events in it, which in turn supports efficient text-based retrieval of the video.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Benchmarks


These leaderboards track progress in video captioning.

Dataset	Best Model
MSR-VTT	mPLUG-2
MSVD	MaMMUT (ours)
YouCook2	VAST
VATEX	VALOR
ActivityNet Captions	VideoCoCa
Hindi MSR-VTT	SBD_Keyframe
TVC	VAST
MSVD-Indonesian	VNS-GRU (Cross-Lingual)
ChinaOpen-1k	GVT
Shot2Story20K	Ours
VidChapters-7M	Vid2Seq


Libraries

Use these libraries to find Video Captioning models and implementations

rakshithShetty/captionGAN

2 papers

Datasets

Subtasks

Audio-Visual Video Captioning

Video Boundary Captioning

Latest papers with no code


The 8th AI City Challenge

no code yet • 15 Apr 2024

The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities.


Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis

no code yet • 12 Apr 2024

Our solution mainly focuses on the following points: 1) To solve dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense, chapter-level captions for each video.


DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

no code yet • 3 Apr 2024

We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC), that elaborates on improving the quality of the generated event captions and their associated pseudo event boundaries from unlabeled videos.


Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation

no code yet • 8 Mar 2024

Text-to-video generation marks a significant frontier in the rapidly evolving domain of generative AI, integrating advancements in text-to-image synthesis, video captioning, and text-guided editing.


Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

no code yet • 29 Feb 2024

Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation.


MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning

no code yet • 27 Feb 2024

Further, in order to better constrain the knowledge characteristics of old and new tasks at the specific feature level, we have created the Two-stage Knowledge Distillation (TsKD), which is able to learn the new task well while weighing the old task.


Video ReCap: Recursive Captioning of Hour-Long Videos

no code yet • 20 Feb 2024

We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.


Knowledge Guided Entity-aware Video Captioning and A Basketball Benchmark

no code yet • 25 Jan 2024

We develop a knowledge guided entity-aware video captioning network (KEANet) based on a candidate player list in encoder-decoder form for basketball live text broadcast.


SnapCap: Efficient Snapshot Compressive Video Captioning

no code yet • 10 Jan 2024

To address these problems, in this paper, we propose a novel VC pipeline to generate captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera and we dub our model SnapCap.


Retrieval-Augmented Egocentric Video Captioning

no code yet • 1 Jan 2024

In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos.

