Video Grounding
42 papers with code • 2 benchmarks • 8 datasets
Video grounding is the task of linking natural language descriptions to specific video segments. The model is given a video and a natural language description, such as a sentence or a caption, and must identify the segment of the video that corresponds to it. This can involve localizing the objects or actions mentioned in the description within the video, or associating a specific time interval with the description.
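As a rough illustration of the "time interval" case, the sketch below represents a grounded segment as a start and end time in seconds and compares a predicted segment with a reference one by temporal intersection-over-union. The `Segment` class, the `temporal_iou` helper, the example query, and the timestamps are all hypothetical and chosen for illustration. Temporal IoU is not mentioned in the description above; it is used here only as one common way to measure how well two intervals overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A time interval in a video, in seconds (hypothetical representation)."""
    start: float
    end: float

    def __post_init__(self):
        # Reject intervals that run backwards in time.
        if self.end < self.start:
            raise ValueError("end must not precede start")

    @property
    def duration(self) -> float:
        return self.end - self.start


def temporal_iou(a: Segment, b: Segment) -> float:
    """Length of the overlap of two intervals divided by the length of their union."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = a.duration + b.duration - intersection
    return intersection / union if union > 0 else 0.0


# Illustrative values only: a query, a reference interval, and a model's guess.
query = "a person opens the fridge"
reference = Segment(start=10.0, end=20.0)
predicted = Segment(start=15.0, end=25.0)
print(query, temporal_iou(predicted, reference))
```

A score of 1.0 means the predicted interval matches the reference exactly, and 0.0 means the two do not overlap at all.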
Most implemented papers
Detecting Moments and Highlights in Videos via Natural Language Queries
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t.
Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos
To our surprise, we found that the training schedule shows a divide-and-conquer-like pattern: time segments are first diversified regardless of the target, then coupled with each target, and fine-tuned to the target again.
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection
Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable need in the current era of exploding video content.
TubeDETR: Spatio-Temporal Video Grounding with Transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding
More specifically, our dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes.
Video-Guided Curriculum Learning for Spoken Video Grounding
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) during the audio pre-training process, which can make use of the vital visual perceptions to help understand the spoken language and suppress the external noise.
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query.
Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding
This paper explores the task of Temporal Video Grounding (TVG) where, given an untrimmed video and a natural language sentence query, the goal is to recognize and determine temporal boundaries of action instances in the video described by the query.
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression.
Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step Localization
In this setup, we seek the optimal step ordering consistent with the procedure flow graph and a given video.