Spatio-Temporal Video Grounding
8 papers with code • 3 benchmarks • 3 datasets
Spatio-temporal video grounding is a computer vision and natural language processing (NLP) task that links textual descriptions to specific spatio-temporal regions or moments in a video. In other words, it aims to determine which parts of a video, both in space and in time, correspond to a given textual query. The task supports applications such as video summarization, content-based video retrieval, and video captioning.
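At a high level, a grounding model takes a video clip and a text query and returns a temporal segment together with a bounding box for each frame in that segment. The sketch below illustrates this interface; the model call, tensor shapes, and the SpatioTemporalTube structure are illustrative assumptions, not the API of any particular paper listed here.

```python
# Minimal sketch of a spatio-temporal video grounding interface.
# Model behavior, tensor shapes, and output format are assumptions for illustration.
from dataclasses import dataclass
from typing import List, Tuple

import torch


@dataclass
class SpatioTemporalTube:
    """Grounding result: a temporal segment plus one box per frame inside it."""
    start_frame: int
    end_frame: int                                   # inclusive
    boxes: List[Tuple[float, float, float, float]]   # (x1, y1, x2, y2) per frame


def ground_query(model: torch.nn.Module,
                 frames: torch.Tensor,   # (T, 3, H, W) video clip
                 query: str) -> SpatioTemporalTube:
    """Run a (hypothetical) grounding model on a video clip and a text query."""
    with torch.no_grad():
        start, end, boxes = model(frames, query)     # assumed model output
    return SpatioTemporalTube(start_frame=int(start),
                              end_frame=int(end),
                              boxes=[tuple(b.tolist()) for b in boxes])
```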
Most implemented papers
Context-Guided Spatio-Temporal Video Grounding
The key to CG-STVG lies in two specially designed modules: instance context generation (ICG), which discovers visual context information (both appearance and motion) for the target instance, and instance context refinement (ICR), which improves the context produced by ICG by eliminating irrelevant or even harmful information.
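As a rough illustration of this two-module design, the sketch below implements ICG as cross-attention from instance queries to video features and ICR as a learned gate over the generated context. The layer choices, names, and interfaces are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the ICG/ICR modules described above.
import torch
import torch.nn as nn


class InstanceContextGeneration(nn.Module):
    """ICG: gather appearance/motion context for the instance via cross-attention."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, instance_queries, video_feats):
        # instance_queries: (B, N, dim); video_feats: (B, T*H*W, dim)
        context, _ = self.attn(instance_queries, video_feats, video_feats)
        return context


class InstanceContextRefinement(nn.Module):
    """ICR: gate the generated context to suppress irrelevant information."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, context, instance_queries):
        g = self.gate(torch.cat([context, instance_queries], dim=-1))
        return g * context + (1 - g) * instance_queries
```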
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.
TubeDETR: Spatio-Temporal Video Grounding with Transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
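A spatio-temporal tube is typically represented as a set of per-frame bounding boxes spanning a temporal segment. The sketch below shows one way to compare a predicted tube against a reference tube with a vIoU-style score (per-frame IoU averaged over the temporal union); the data layout and function names are assumptions for illustration, not TubeDETR's evaluation code.

```python
# Sketch of a tube-level comparison, assuming each tube maps frame index -> box.
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Standard IoU between two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Sum of per-frame IoUs over the temporal intersection,
    normalized by the size of the temporal union of the two tubes."""
    t_union = set(pred) | set(gt)
    t_inter = set(pred) & set(gt)
    if not t_union:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in t_inter) / len(t_union)
```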
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression.
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only.
Guided Attention for Interpretable Motion Captioning
Diverse and extensive work has recently been conducted on text-conditioned human motion generation.
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data.