About

Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains, which in turn can enable efficient retrieval of the video through text.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020
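Most captioning systems, including the ones listed below, follow an encoder-decoder recipe: encode per-frame features with a pretrained visual backbone, pool them into a video representation, and decode a sentence word by word. The following is a minimal PyTorch sketch of that recipe under assumed layer names and sizes; it is an illustration, not the NITS-VC system.

```python
# Minimal encoder-decoder video captioner (illustrative; sizes are assumptions).
import torch
import torch.nn as nn

class SimpleVideoCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)        # frame features -> hidden
        self.embed = nn.Embedding(vocab_size, hidden_dim)  # word ids -> vectors
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)       # hidden -> word logits

    def forward(self, frame_feats, caption_ids):
        # frame_feats: (batch, n_frames, feat_dim), e.g. from a pretrained CNN
        video = self.proj(frame_feats).mean(dim=1)         # mean-pool over time
        h0 = video.unsqueeze(0)                            # init decoder hidden state
        c0 = torch.zeros_like(h0)
        dec, _ = self.decoder(self.embed(caption_ids), (h0, c0))
        return self.out(dec)                               # next-word logits

model = SimpleVideoCaptioner(vocab_size=10000)
feats = torch.randn(2, 16, 2048)          # 2 videos, 16 frames each
caps = torch.randint(0, 10000, (2, 12))   # tokenized reference captions
logits = model(feats, caps)               # (2, 12, 10000)
```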

Latest papers with code

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

1 Apr 2021 · m-bain/frozen-in-time

Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval.

Ranked #2 on Video Retrieval on MSVD (using extra training data)

CURRICULUM LEARNING VIDEO CAPTIONING VIDEO-TEXT RETRIEVAL
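Once videos and queries live in one embedding space, retrieval reduces to a nearest-neighbor lookup over precomputed video vectors, which is what makes the approach efficient. A hedged sketch in PyTorch, with assumed dimensions and no claim to match the Frozen in Time encoders:

```python
# Text-to-video retrieval in a joint embedding space (illustrative).
import torch
import torch.nn.functional as F

def retrieve(text_emb, video_embs, k=5):
    # text_emb: (d,), video_embs: (n_videos, d); both L2-normalized,
    # so the dot product equals cosine similarity.
    sims = video_embs @ text_emb
    return sims.topk(k).indices    # ids of the k most similar videos

text = F.normalize(torch.randn(256), dim=0)          # query embedding (assumed dim)
videos = F.normalize(torch.randn(1000, 256), dim=1)  # precomputed video embeddings
top5 = retrieve(text, videos)
```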


Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review

27 Mar 2021 · jssprz/video_captioning_datasets

These two video-to-text tasks are substantially more complex than predicting or retrieving a single sentence from an image.

VIDEO CAPTIONING


Annotation Cleaning for the MSR-Video to Text Dataset

12 Feb 2021 · WingsBrokenAngel/MSR-VTT-DataCleaning

We cleaned the MSR-VTT annotations by removing such problems and then tested several typical video captioning models on the cleaned dataset.

VIDEO CAPTIONING
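To make the cleaning idea concrete, here is a minimal Python sketch of the kind of pass such work performs (the paper's actual rules are more involved; see the linked repository): normalize whitespace, then drop duplicate and overly short captions.

```python
# Toy caption-cleaning pass (illustrative, not the paper's pipeline).
def clean_captions(captions, min_words=3):
    seen, cleaned = set(), []
    for cap in captions:
        cap = " ".join(cap.split())    # collapse stray whitespace
        key = cap.lower()
        if len(cap.split()) >= min_words and key not in seen:
            seen.add(key)
            cleaned.append(cap)
    return cleaned

print(clean_captions(["a man  is cooking", "A man is cooking", "hi"]))
# -> ['a man is cooking']
```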


Semantic Grouping Network for Video Captioning

1 Feb 2021 · hobincar/SGN

This paper considers a video caption generating network, referred to as the Semantic Grouping Network (SGN), that attempts (1) to group video frames with the discriminating word phrases of a partially decoded caption and then (2) to decode those semantically aligned groups when predicting the next word.

VIDEO CAPTIONING
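The grouping step can be pictured as attention from the phrases decoded so far to the video frames, yielding one aligned feature per phrase. The sketch below illustrates that idea under assumed tensor shapes; it is not the authors' SGN implementation.

```python
# Phrase-to-frame grouping via scaled dot-product attention (illustrative).
import torch
import torch.nn.functional as F

def group_frames(phrase_embs, frame_embs):
    # phrase_embs: (n_phrases, d), frame_embs: (n_frames, d)
    scores = phrase_embs @ frame_embs.T / phrase_embs.shape[1] ** 0.5
    weights = F.softmax(scores, dim=1)    # each phrase attends over frames
    return weights @ frame_embs           # one grouped frame feature per phrase

phrases = torch.randn(3, 512)             # embeddings of 3 decoded phrases
frames = torch.randn(16, 512)             # embeddings of 16 frames
grouped = group_frames(phrases, frames)   # (3, 512)
```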


TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

23 Nov 2020 · HumamAlwassel/TSP

Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning.

ACTION CLASSIFICATION DENSE VIDEO CAPTIONING TEMPORAL ACTION PROPOSAL GENERATION TEMPORAL LOCALIZATION


Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language

18 Nov 2020 · hassanhub/R3Transformer

In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.

DICTIONARY LEARNING VIDEO CAPTIONING


iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

16 Nov 2020 · amanchadha/iPerceive

Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and "where" (e.g., event localization), which in some cases fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention.

COMMON SENSE REASONING DENSE VIDEO CAPTIONING MACHINE TRANSLATION QUESTION ANSWERING VIDEO QUESTION ANSWERING


Multimodal Pretraining for Dense Video Captioning

10 Nov 2020 · google-research-datasets/Video-Timeline-Tags-ViTT

First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations.

Ranked #1 on Dense Video Captioning on YouCook2 (using extra training data)

DENSE VIDEO CAPTIONING
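A time-stamped annotation of this kind can be modeled as one record per tag. The schema below is hypothetical, intended only to show the shape of the data; the released ViTT format may differ.

```python
# Hypothetical record for a time-stamped caption tag (not the ViTT schema).
from dataclasses import dataclass

@dataclass
class TimelineTag:
    timestamp_sec: float    # where in the video the tag applies
    tag: str                # short free-text description

video_annotations = [
    TimelineTag(3.0, "crack eggs into a bowl"),
    TimelineTag(21.5, "whisk until smooth"),
]
```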


COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

NeurIPS 2020 · gingsi/coot-videotext

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics.

CROSS-MODAL RETRIEVAL REPRESENTATION LEARNING VIDEO-TEXT RETRIEVAL
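The granularity hierarchy named above can be illustrated with simple pooling on the visual side (frames into clips, clips into a video), mirrored by words, sentences, and paragraphs on the text side. This sketch only shows the aggregation structure, with assumed dimensions; COOT itself uses hierarchical transformers and alignment losses rather than mean pooling.

```python
# Frames -> clips -> video by pooling (structure only; dims are assumptions).
import torch

frame_feats = torch.randn(4, 16, 256)   # 4 clips, 16 frames each, dim 256
clip_feats = frame_feats.mean(dim=1)    # clip level: pool over frames
video_feat = clip_feats.mean(dim=0)     # video level: pool over clips
# A text encoder would produce word, sentence, and paragraph embeddings,
# and a training loss can align each visual level with its textual peer.
```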


Improved Actor Relation Graph based Group Activity Recognition

24 Oct 2020 · kuangzijian/Improved-Actor-Relation-Graph-based-Group-Activity-Recognition

We propose to use normalized cross-correlation (NCC) and the sum of absolute differences (SAD) to calculate pairwise appearance similarity, and to build an actor relation graph that allows a graph convolutional network to learn how to classify group activities.

GROUP ACTIVITY RECOGNITION OBJECT DETECTION VIDEO CAPTIONING VIDEO UNDERSTANDING
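Both measures named in the abstract are standard. A minimal NumPy sketch of NCC and SAD over two equally sized patches (illustrative; the paper applies them to actor appearance features):

```python
# Normalized cross-correlation (NCC) and sum of absolute differences (SAD).
import numpy as np

def ncc(a, b):
    a = (a - a.mean()) / (a.std() + 1e-8)   # zero-mean, unit-variance
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())            # in [-1, 1]; higher = more similar

def sad(a, b):
    return float(np.abs(a - b).sum())       # >= 0; lower = more similar

p1 = np.random.rand(32, 32)
p2 = np.random.rand(32, 32)
print(ncc(p1, p2), sad(p1, p2))
```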
