Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

10 Sep 2021  ·  Zhenzhi Wang, LiMin Wang, Tao Wu, TianHao Li, Gangshan Wu ·

Temporal grounding aims to temporally localize a video moment in the video whose semantics are related to a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation with a focus on designing complicated heads and fusion strategies... Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Dual Matching Network (DMN), to directly model the relations between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs from a dual matching scheme and mining negative pairs across different videos. These new negative samples could enhance the joint representation learning of two modalities via cross-modal pair discrimination to maximize their mutual information. Experiments show that DMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on DMN, we present a winner solution for STVG challenge of the 3rd PIC workshop. This suggests that metric-learning is still a promising method for temporal grounding via capturing the essential cross-modal correlation in a joint embedding space. read more

PDF Abstract

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.


No methods listed for this paper. Add relevant methods here