Video Grounding

42 papers with code • 2 benchmarks • 8 datasets

Video grounding is the task of linking natural language descriptions to specific video segments. Given a video and a description, such as a sentence or a caption, the model must identify the segment of the video that corresponds to the description. This can involve localizing the objects or actions mentioned in the description within the video, or associating a specific time interval with the description.
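
As a minimal illustration of the task's input/output contract, the sketch below (all names and numbers are hypothetical) scores a predicted time interval against a ground-truth annotation with temporal IoU, the metric most commonly used to evaluate this task:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# A grounding model maps (video, query) -> a time interval.
query = "the dog jumps into the pool"    # hypothetical query
predicted_segment = (12.4, 18.9)         # hypothetical model output, in seconds
ground_truth = (11.0, 19.5)              # hypothetical annotated segment
print(f"tIoU = {temporal_iou(predicted_segment, ground_truth):.2f}")  # 0.76
```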

Most implemented papers

Detecting Moments and Highlights in Videos via Natural Language Queries

jayleicn/moment_detr NeurIPS 2021

Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
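
A toy record in the style of this annotation scheme might look as follows; the field names are illustrative, not the dataset's official schema:

```python
# One hypothetical annotation entry mirroring the three components above.
annotation = {
    "query": "A woman unpacks a suitcase on a hotel bed",  # (1) free-form NL query
    "relevant_windows": [[24.0, 42.0], [66.0, 72.0]],      # (2) moments w.r.t. the query
    "saliency_scores": [3, 4, 4, 2],                       # (3) per-clip saliency, 1-5 scale
}
```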

Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos

sangminwoo/explore-and-match 25 Jan 2022

To our surprise, we found that the training schedule shows a divide-and-conquer-like pattern: time segments are first diversified regardless of the target, then coupled with each target, and finally fine-tuned to the target again.

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

tencentarc/umt CVPR 2022

Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable need in the current era of exploding video content.

TubeDETR: Spatio-Temporal Video Grounding with Transformers

antoyang/TubeDETR CVPR 2022

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
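
A spatio-temporal tube can be represented as one bounding box per frame over a contiguous frame span; below is a minimal sketch of such a data structure (an illustration, not the paper's actual code):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tube:
    """A spatio-temporal tube: one (x1, y1, x2, y2) box per frame in a span."""
    start_frame: int
    boxes: List[Tuple[float, float, float, float]]

    @property
    def end_frame(self) -> int:
        return self.start_frame + len(self.boxes) - 1
```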

Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding

SUTDCV/Animal-Kingdom CVPR 2022

More specifically, our dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes.

Video-Guided Curriculum Learning for Spoken Video Grounding

marmot-xy/spoken-video-grounding 1 Sep 2022

To recover discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy for the audio pre-training process, which makes use of vital visual perception to help understand the spoken language and suppress external noise.
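
The VGCL specifics are in the paper; as a generic illustration of the curriculum idea (all names are hypothetical), a schedule can start from the easiest samples and grow the training pool each epoch:

```python
def curriculum_schedule(samples, difficulty, epochs):
    """Yield an easy-to-hard training pool per epoch. This is a generic
    curriculum, not the paper's video-guided criterion; 'difficulty' could
    be, e.g., an estimate of audio noise per sample."""
    order = sorted(range(len(samples)), key=lambda i: difficulty[i])
    for epoch in range(epochs):
        frac = (epoch + 1) / epochs  # fraction of data unlocked this epoch
        yield [samples[i] for i in order[: max(1, int(frac * len(samples)))]]
```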

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

houzhijian/cone 22 Sep 2022

This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query.
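
As a rough sketch of the coarse-to-fine idea (not CONE's actual pipeline; the feature inputs and scoring are assumptions), one can cheaply pre-rank sliding windows over a long video and then scan only the survivors clip by clip:

```python
import numpy as np

def coarse_to_fine(clip_feats, query_feat, window=64, stride=32, top_k=3):
    """clip_feats: (n_clips, d) array; query_feat: (d,) array."""
    n = len(clip_feats)
    spans = [(s, min(s + window, n)) for s in range(0, n, stride)]
    # Coarse stage: score each window by its mean-pooled feature.
    coarse = [(float(clip_feats[s:e].mean(0) @ query_feat), s, e) for s, e in spans]
    survivors = sorted(coarse, reverse=True)[:top_k]
    # Fine stage: per-clip scores inside the surviving windows only.
    return max((float(clip_feats[i] @ query_feat), i)
               for _, s, e in survivors for i in range(s, e))
```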

Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding

ericashimomoto/parameter-efficient-tvg 26 Sep 2022

This paper explores the task of Temporal Video Grounding (TVG) where, given an untrimmed video and a natural language sentence query, the goal is to recognize and determine temporal boundaries of action instances in the video described by the query.
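
One widely used parameter-efficient technique in this space is the bottleneck adapter; the sketch below is a generic illustration, not necessarily the paper's exact module. Only the small residual MLP is trained while the pre-trained language model stays frozen:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a small trainable residual MLP inserted after a
    frozen transformer layer (hidden/bottleneck sizes are illustrative)."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x):
        # Residual connection preserves the frozen PLM's representation.
        return x + self.up(self.act(self.down(x)))
```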

Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

jy0205/stcat 27 Sep 2022

Spatio-Temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression.

Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step Localization

samsunglabs/graph2vid 10 Oct 2022

In this setup, we seek the optimal step ordering consistent with the procedure flow graph and a given video.
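
For a linear chain of steps (a simplification of the flow graph, which also licenses alternative valid orderings), the optimal monotone assignment of steps to clips can be found by dynamic programming. The sketch below assumes a precomputed step-vs-clip similarity matrix:

```python
import numpy as np

def best_monotone_alignment(sim):
    """sim[i, t]: similarity of step i to clip t. Returns the best total score
    of assigning each step to one clip with strictly increasing clip indices."""
    n_steps, n_clips = sim.shape
    dp = np.full((n_steps, n_clips), -np.inf)
    dp[0] = sim[0]
    for i in range(1, n_steps):
        # Best placement of step i-1 strictly before clip t.
        prefix = np.maximum.accumulate(dp[i - 1])[:-1]
        dp[i, 1:] = sim[i, 1:] + prefix
    return float(dp[-1].max())
```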