HierVL: Learning Hierarchical Video-Language Embeddings

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
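The two-level objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the symmetric InfoNCE loss, the mean pooling of clip features into a video-level feature, the `video_weight` parameter, and all function names are assumptions made for illustration, since the abstract only states that alignment is encouraged at both the clip level and the video level.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(visual, text, temperature=0.07):
    # Symmetric contrastive loss: row i of `visual` is the positive
    # for row i of `text`; every other row in the batch is a negative.
    v = l2_normalize(visual)
    t = l2_normalize(text)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)

def hierarchical_loss(clip_feats, narration_feats, summary_feats, video_weight=1.0):
    # clip_feats:      (V, C, D) visual features for C clips in each of V videos
    # narration_feats: (V, C, D) text features of the per-clip descriptions
    # summary_feats:   (V, D)    text features of the per-video summaries
    num_videos, num_clips, dim = clip_feats.shape

    # Clip level: align each short clip with its own timestamped description.
    clip_loss = info_nce(
        clip_feats.reshape(num_videos * num_clips, dim),
        narration_feats.reshape(num_videos * num_clips, dim),
    )

    # Video level: pool clips into one long-video feature (mean pooling is
    # an assumption here) and align it with the summary text.
    video_feats = clip_feats.mean(axis=1)
    video_loss = info_nce(video_feats, summary_feats)

    return clip_loss + video_weight * video_loss
```

In this sketch the clip-level term covers the short-term "what is happening" alignment, while the video-level term covers the long-term "why it is happening" alignment; `video_weight` is a hypothetical knob for balancing the two.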

CVPR 2023

Results from the Paper


Task                           Dataset            Model               Metric Name   Metric Value  Global Rank
Action Recognition             Charades-Ego       HierVL (Zero-shot)  mAP           26            # 6
Action Recognition             Charades-Ego       HierVL              mAP           33.8          # 3
Long Term Action Anticipation  Ego4D              HierVL              ED@20 Action  92.75         # 3
Long Term Action Anticipation  Ego4D              HierVL              ED@20 Noun    73.49         # 4
Long Term Action Anticipation  Ego4D              HierVL              ED@20 Verb    72.39         # 4
Multi-Instance Retrieval       EPIC-KITCHENS-100  HierVL (Zero-shot)  mAP (Avg)     18.9          # 12
Multi-Instance Retrieval       EPIC-KITCHENS-100  HierVL (Zero-shot)  nDCG (Avg)    24.7          # 12
Multi-Instance Retrieval       EPIC-KITCHENS-100  HierVL              mAP (Avg)     46.7          # 6
Multi-Instance Retrieval       EPIC-KITCHENS-100  HierVL              nDCG (Avg)    61.1          # 4
