Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Video moment retrieval and highlight detection (MR/HD) have recently drawn attention as the demand for video understanding has grown sharply. The key objective of MR/HD is to localize the moment and to estimate the clip-wise accordance level, i.e., the saliency score, with respect to a given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in the given query. For example, the relevance between the text query and the video content is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Because we observe that the given query plays an insignificant role in existing transformer architectures, our encoding module starts with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's ability to exploit the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between video-query pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pair. Our extensive studies verify the importance of building a query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
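The negative-pair idea described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the QD-DETR code: the cyclic-shift pairing, the function names `make_negative_pairs` and `negative_saliency_penalty`, and the `-log(1 - p)` penalty are all assumptions chosen for illustration. The sketch also assumes that every item in a batch comes from a different video, so a shifted query is irrelevant to the video it is paired with.

```python
import math


def make_negative_pairs(videos, queries):
    """Pair each video with the query of the next batch item (cyclic shift).

    Hypothetical helper: assumes videos[i] and queries[i] form a relevant
    pair and that queries from other batch items are irrelevant to videos[i].
    """
    if len(videos) != len(queries):
        raise ValueError("videos and queries must have the same length")
    if len(videos) < 2:
        raise ValueError("need at least two pairs to build negatives")
    shifted = queries[1:] + queries[:1]
    return list(zip(videos, shifted))


def negative_saliency_penalty(saliency_probs, eps=1e-8):
    """Mean of -log(1 - p) over predicted saliency probabilities in [0, 1].

    Hypothetical penalty: it is small when the scores predicted for
    negative pairs are close to 0 and grows as they approach 1.
    """
    total = sum(-math.log(1.0 - p + eps) for p in saliency_probs)
    return total / len(saliency_probs)


if __name__ == "__main__":
    pairs = make_negative_pairs(["v0", "v1", "v2"], ["q0", "q1", "q2"])
    print(pairs)  # [('v0', 'q1'), ('v1', 'q2'), ('v2', 'q0')]
    print(negative_saliency_penalty([0.1, 0.2]))
```

Shifting within the batch needs no extra data, which is why it is used here; other ways of building irrelevant pairs would fit the same description.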

Published in: CVPR 2023

Results from the Paper


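| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Moment Retrieval | Charades-STA | QD-DETR (Only Video) | R@1 IoU=0.5 | 57.31 | #9 |
| Moment Retrieval | Charades-STA | QD-DETR (Only Video) | R@1 IoU=0.7 | 32.55 | #11 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video w/ PT ASR Captions) | mAP | 40.0 | #13 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video w/ PT ASR Captions) | R@1 IoU=0.5 | 63.2 | #12 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video w/ PT ASR Captions) | R@1 IoU=0.7 | 45.2 | #13 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video w/ PT ASR Captions) | mAP@0.5 | 63.4 | #10 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video w/ PT ASR Captions) | mAP@0.75 | 40.4 | #11 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ audio) | mAP | 40.19 | #11 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ audio) | R@1 IoU=0.5 | 63.06 | #13 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ audio) | R@1 IoU=0.7 | 45.10 | #14 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ audio) | mAP@0.5 | 63.04 | #13 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ audio) | mAP@0.75 | 40.10 | #13 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video) | mAP | 39.86 | #14 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video) | R@1 IoU=0.5 | 62.40 | #15 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video) | R@1 IoU=0.7 | 44.98 | #16 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video) | mAP@0.5 | 62.52 | #14 |
| Moment Retrieval | QVHighlights | QD-DETR (only Video) | mAP@0.75 | 39.88 | #14 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ PT) | mAP | 40.62 | #10 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ PT) | R@1 IoU=0.5 | 64.1 | #8 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ PT) | R@1 IoU=0.7 | 46.1 | #12 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ PT) | mAP@0.5 | 64.3 | #8 |
| Moment Retrieval | QVHighlights | QD-DETR (w/ PT) | mAP@0.75 | 40.5 | #10 |
| Video Grounding | QVHighlights | QD-DETR | R@1,IoU=0.5 | 62.40 | #3 |
| Video Grounding | QVHighlights | QD-DETR | R@1,IoU=0.7 | 44.98 | #3 |
| Highlight Detection | QVHighlights | QD-DETR (w/ PT) | mAP | 38.52 | #7 |
| Highlight Detection | QVHighlights | QD-DETR (w/ PT) | Hit@1 | 62.27 | #6 |
| Highlight Detection | QVHighlights | QD-DETR | mAP | 39.04 | #5 |
| Highlight Detection | QVHighlights | QD-DETR | Hit@1 | 62.87 | #4 |
| Highlight Detection | QVHighlights | QD-DETR (only Video) | mAP | 38.94 | #6 |
| Highlight Detection | QVHighlights | QD-DETR (only Video) | Hit@1 | 62.40 | #5 |
| Highlight Detection | QVHighlights | QD-DETR (only Video w/ PT) | Hit@1 | 61.91 | #7 |
| Highlight Detection | TvSum | QD-DETR | mAP | 86.6 | #2 |
| Highlight Detection | TvSum | QD-DETR (only Video) | mAP | 85.0 | #4 |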
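
For readers unfamiliar with the "R@1 IoU=0.5" and "R@1 IoU=0.7" metrics listed above, the following Python sketch shows how such a number is commonly computed. It is a simplified illustration, not the official evaluation script of any benchmark: it assumes one ground-truth moment per query, moments given as `(start, end)` pairs in seconds, and the function names `temporal_iou` and `recall_at_1` are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_predictions, ground_truths, iou_threshold):
    """Percentage of queries whose top-ranked moment reaches the IoU threshold.

    Assumes exactly one ground-truth moment per query (a simplification).
    """
    if len(top1_predictions) != len(ground_truths):
        raise ValueError("need one prediction per ground-truth moment")
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (2.0, 8.0), (20.0, 30.0)]
    gts = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
    # The IoUs are 1.0, 0.6 and 0.0, so two of three predictions pass the
    # 0.5 threshold and one of three passes the 0.7 threshold.
    print(recall_at_1(preds, gts, 0.5))
    print(recall_at_1(preds, gts, 0.7))
```

The stricter 0.7 threshold only counts predictions whose boundaries are close to the annotation, which is why the IoU=0.7 values in the table are consistently lower than the IoU=0.5 values.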

Methods