Localizing Moments in Long Video Via Multimodal Guidance

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.
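The guided grounding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `sliding_windows`, `guided_grounding`, `grounding_fn`, and `guidance_fn` are hypothetical, and combining the two scores by multiplication is an assumption made for illustration. The sketch splits a long video into short windows, asks a base grounding model for candidate moments in each window, and down-weights candidates from windows that the guidance model considers non-describable.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    start: float  # moment start time in seconds
    end: float    # moment end time in seconds
    score: float  # confidence that the moment matches the query


def sliding_windows(duration, window, stride):
    """Return (start, end) pairs that cover a video of the given duration."""
    windows = []
    start = 0.0
    while start < duration:
        windows.append((start, min(start + window, duration)))
        start += stride
    return windows


def guided_grounding(windows, grounding_fn, guidance_fn, top_k=5):
    """Rank moments from all windows, weighted by per-window guidance.

    grounding_fn(w_start, w_end) -> list of Proposal (base grounding model).
    guidance_fn(w_start, w_end)  -> float in [0, 1], where higher means the
    window is more likely to be describable.
    Assumption: the final score is the product of the two scores.
    """
    ranked = []
    for w_start, w_end in windows:
        guidance = guidance_fn(w_start, w_end)
        for p in grounding_fn(w_start, w_end):
            ranked.append(Proposal(p.start, p.end, p.score * guidance))
    ranked.sort(key=lambda p: p.score, reverse=True)
    return ranked[:top_k]


# Toy stand-ins for the two models (purely illustrative).
def toy_grounding(w_start, w_end):
    # Every window proposes itself with the same confidence.
    return [Proposal(w_start, w_end, 0.5)]


def toy_guidance(w_start, w_end):
    # Pretend only the window starting at 4 s is describable.
    return 1.0 if w_start == 4.0 else 0.1


windows = sliding_windows(duration=10.0, window=4.0, stride=2.0)
best = guided_grounding(windows, toy_grounding, toy_guidance, top_k=1)[0]
```

In this toy setup every window receives the same grounding score, so the ranking is decided entirely by the guidance score. In practice a Query-Agnostic guidance function would depend only on the window content, while a Query-Dependent one would also take the language query as input.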

ICCV 2023

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Natural Language Moment Retrieval | MAD | Zero-Shot CLIP + Guidance Model | R@1,IoU=0.1 | 9.3 | #1 |
| | | | R@5,IoU=0.1 | 18.96 | #1 |
| | | | R@10,IoU=0.1 | 24.30 | #1 |
| | | | R@50,IoU=0.1 | 39.79 | #2 |
| | | | R@100,IoU=0.1 | 47.35 | #4 |
| | | | R@1,IoU=0.3 | 4.65 | #1 |
| | | | R@5,IoU=0.3 | 13.06 | #1 |
| | | | R@10,IoU=0.3 | 17.73 | #2 |
| | | | R@50,IoU=0.3 | 32.23 | #3 |
| | | | R@100,IoU=0.3 | 39.58 | #3 |
| | | | R@1,IoU=0.5 | 2.16 | #2 |
| | | | R@5,IoU=0.5 | 7.4 | #2 |
| | | | R@10,IoU=0.5 | 11.09 | #2 |
| | | | R@50,IoU=0.5 | 23.21 | #3 |
| | | | R@100,IoU=0.5 | 29.68 | #3 |
| Natural Language Moment Retrieval | MAD | VLG-Net + Guidance Model | R@1,IoU=0.1 | 5.60 | #3 |
| | | | R@5,IoU=0.1 | 16.07 | #2 |
| | | | R@10,IoU=0.1 | 23.64 | #2 |
| | | | R@50,IoU=0.1 | 45.35 | #1 |
| | | | R@100,IoU=0.1 | 55.59 | #1 |
| | | | R@1,IoU=0.3 | 4.28 | #2 |
| | | | R@10,IoU=0.3 | 19.86 | #1 |
| | | | R@50,IoU=0.3 | 39.77 | #1 |
| | | | R@100,IoU=0.3 | 49.38 | #1 |
| | | | R@1,IoU=0.5 | 2.48 | #1 |
| | | | R@5,IoU=0.5 | 8.78 | #1 |
| | | | R@10,IoU=0.5 | 13.72 | #1 |
| | | | R@50,IoU=0.5 | 30.22 | #1 |
| | | | R@100,IoU=0.5 | 39.12 | #1 |
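For readers unfamiliar with the metric names, the sketch below shows how a score such as R@5,IoU=0.3 is commonly computed: a query counts as a hit if at least one of its top-K predicted moments overlaps the ground-truth moment with a temporal intersection-over-union of at least the threshold. The function names `temporal_iou` and `recall_at_k` are hypothetical, and reporting the result as a percentage (as well as using `>=` at the threshold) is an assumption about the benchmark's convention rather than something stated on this page.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truths, k, iou_threshold):
    """Percentage of queries with a top-k prediction at or above the IoU threshold.

    predictions:   one ranked list of (start, end) moments per query.
    ground_truths: one (start, end) moment per query.
    """
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


# Two toy queries: the first has a perfect match ranked second,
# the second has no overlapping prediction at all.
predictions = [[(0.0, 10.0), (5.0, 15.0)], [(20.0, 30.0)]]
ground_truths = [(5.0, 15.0), (0.0, 5.0)]

r1_iou03 = recall_at_k(predictions, ground_truths, k=1, iou_threshold=0.3)
r1_iou05 = recall_at_k(predictions, ground_truths, k=1, iou_threshold=0.5)
r2_iou05 = recall_at_k(predictions, ground_truths, k=2, iou_threshold=0.5)
```

Raising K can only keep recall the same or increase it, while raising the IoU threshold can only keep it the same or decrease it, which matches the trend across the rows of the table.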

Methods