Spatio-Temporal Video Grounding
8 papers with code • 3 benchmarks • 3 datasets
Spatio-temporal video grounding is a computer vision and natural language processing (NLP) task that links textual descriptions to specific spatio-temporal regions or moments in a video. In other words, it aims to determine which parts of a video, both in space and in time, correspond to a given textual query. The task supports applications such as video summarization, content-based video retrieval, and video captioning.
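At a high level, a grounding model takes a video clip and a text query and returns a temporal segment together with a bounding box for each frame in that segment. The sketch below illustrates this interface; the model call, tensor shapes, and the SpatioTemporalTube structure are illustrative assumptions, not the API of any particular paper listed here.

```python
# Minimal sketch of a spatio-temporal video grounding interface.
# Model behavior, tensor shapes, and output format are assumptions for illustration.
from dataclasses import dataclass
from typing import List, Tuple

import torch


@dataclass
class SpatioTemporalTube:
    """Grounding result: a temporal segment plus one box per frame inside it."""
    start_frame: int
    end_frame: int                                   # inclusive
    boxes: List[Tuple[float, float, float, float]]   # (x1, y1, x2, y2) per frame


def ground_query(model: torch.nn.Module,
                 frames: torch.Tensor,   # (T, 3, H, W) video clip
                 query: str) -> SpatioTemporalTube:
    """Run a (hypothetical) grounding model on a video clip and a text query."""
    with torch.no_grad():
        start, end, boxes = model(frames, query)     # assumed model output
    return SpatioTemporalTube(start_frame=int(start),
                              end_frame=int(end),
                              boxes=[tuple(b.tolist()) for b in boxes])
```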
Most implemented papers
Context-Guided Spatio-Temporal Video Grounding
The key to CG-STVG lies in two specially designed modules: instance context generation (ICG), which discovers visual context information (both appearance and motion) for the target instance, and instance context refinement (ICR), which improves the context produced by ICG by eliminating irrelevant or even harmful information.
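As a rough illustration of this two-module design, the sketch below implements ICG as cross-attention from instance queries to video features and ICR as a learned gate over the generated context. The layer choices, names, and interfaces are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the ICG/ICR modules described above.
import torch
import torch.nn as nn


class InstanceContextGeneration(nn.Module):
    """ICG: gather appearance/motion context for the instance via cross-attention."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, instance_queries, video_feats):
        # instance_queries: (B, N, dim); video_feats: (B, T*H*W, dim)
        context, _ = self.attn(instance_queries, video_feats, video_feats)
        return context


class InstanceContextRefinement(nn.Module):
    """ICR: gate the generated context to suppress irrelevant information."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, context, instance_queries):
        g = self.gate(torch.cat([context, instance_queries], dim=-1))
        return g * context + (1 - g) * instance_queries
```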
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.
TubeDETR: Spatio-Temporal Video Grounding with Transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
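A spatio-temporal tube is typically represented as a set of per-frame bounding boxes spanning a temporal segment. The sketch below shows one way to compare a predicted tube against a reference tube with a vIoU-style score (per-frame IoU averaged over the temporal union); the data layout and function names are assumptions for illustration, not TubeDETR's evaluation code.

```python
# Sketch of a tube-level comparison, assuming each tube maps frame index -> box.
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Standard IoU between two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Sum of per-frame IoUs over the temporal intersection,
    normalized by the size of the temporal union of the two tubes."""
    t_union = set(pred) | set(gt)
    t_inter = set(pred) & set(gt)
    if not t_union:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in t_inter) / len(t_union)
```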
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression.
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only.
Guided Attention for Interpretable Motion Captioning
Diverse and extensive work has recently been conducted on text-conditioned human motion generation.
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data.