TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition	Charades-Ego	HierVL (Zero-shot)	mAP	26	# 6
Action Recognition	Charades-Ego	HierVL	mAP	33.8	# 3
Long Term Action Anticipation	Ego4D	HierVL	ED@20 Action	92.75	# 3
Long Term Action Anticipation	Ego4D	HierVL	ED@20 Noun	73.49	# 4
Long Term Action Anticipation	Ego4D	HierVL	ED@20 Verb	72.39	# 4
Multi-Instance Retrieval	EPIC-KITCHENS-100	HierVL (Zero-shot)	mAP (Avg)	18.9	# 12
Multi-Instance Retrieval	EPIC-KITCHENS-100	HierVL (Zero-shot)	nDCG (Avg)	24.7	# 12
Multi-Instance Retrieval	EPIC-KITCHENS-100	HierVL	mAP (Avg)	46.7	# 6
Multi-Instance Retrieval	EPIC-KITCHENS-100	HierVL	nDCG (Avg)	61.1	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hiervl-learning-hierarchical-video-language/action-recognition-on-charades-ego)](https://paperswithcode.com/sota/action-recognition-on-charades-ego?p=hiervl-learning-hierarchical-video-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hiervl-learning-hierarchical-video-language/long-term-action-anticipation-on-ego4d)](https://paperswithcode.com/sota/long-term-action-anticipation-on-ego4d?p=hiervl-learning-hierarchical-video-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hiervl-learning-hierarchical-video-language/multi-instance-retrieval-on-epic-kitchens-100)](https://paperswithcode.com/sota/multi-instance-retrieval-on-epic-kitchens-100?p=hiervl-learning-hierarchical-video-language)`

HierVL: Learning Hierarchical Video-Language Embeddings

CVPR 2023 · Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
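The two-level objective described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the symmetric InfoNCE form, the mean-pooling aggregator, the temperature of 0.07, the `video_weight` parameter, and all function names below are assumptions chosen for illustration only. The sketch aligns each clip with its narration (clip level) and each pooled video embedding with its summary text (video level), then sums the two losses.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(visual, text, temperature=0.07):
    # Symmetric InfoNCE (assumed loss form): row i of `visual` is the
    # positive for row i of `text`; all other rows act as negatives.
    v, t = l2_normalize(visual), l2_normalize(text)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    visual_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_visual = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (visual_to_text + text_to_visual)

def hierarchical_loss(clip_emb, narration_emb, summary_emb, video_weight=1.0):
    # clip_emb, narration_emb: (num_videos, clips_per_video, dim)
    # summary_emb:             (num_videos, dim)
    num_videos, clips_per_video, dim = clip_emb.shape
    # Clip level: align every clip with its own timestamped narration.
    clip_loss = info_nce(clip_emb.reshape(-1, dim), narration_emb.reshape(-1, dim))
    # Video level: pool clips into one vector per video (mean pooling is an
    # assumption here) and align it with the video's summary text.
    video_emb = clip_emb.mean(axis=1)
    video_loss = info_nce(video_emb, summary_emb)
    return clip_loss + video_weight * video_loss, clip_loss, video_loss

# Toy data: 4 videos, 3 clips each, 16-dim embeddings standing in for
# encoder outputs; text embeddings are noisy copies of the visual ones.
rng = np.random.default_rng(0)
clips = rng.normal(size=(4, 3, 16))
narrations = clips + 0.1 * rng.normal(size=clips.shape)
summaries = clips.mean(axis=1) + 0.1 * rng.normal(size=(4, 16))
total, clip_l, video_l = hierarchical_loss(clips, narrations, summaries)
print(f"clip loss {clip_l:.4f}, video loss {video_l:.4f}, total {total:.4f}")
```

In this toy setup the summaries are built from the pooled clips, so the video-level loss is small; pairing each video with another video's summary should raise it, which is the signal the video-level term is meant to provide.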

Code

No code implementations yet.

Tasks

Action Classification

Action Recognition

Long Term Action Anticipation

Long Term Anticipation

Multi-Instance Retrieval

Datasets

HowTo100M

EPIC-KITCHENS-100

Charades-Ego

Ego4D

Results from the Paper

Ranked #3 on Action Recognition on Charades-Ego

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Action Recognition	Charades-Ego	HierVL (Zero-shot)	mAP	26	# 6
Action Recognition	Charades-Ego	HierVL	mAP	33.8	# 3
Long Term Action Anticipation	Ego4D	HierVL	ED@20 Action	92.75	# 3
Long Term Action Anticipation	Ego4D	HierVL	ED@20 Noun	73.49	# 4
Long Term Action Anticipation	Ego4D	HierVL	ED@20 Verb	72.39	# 4
Multi-Instance Retrieval	EPIC-KITCHENS-100	HierVL (Zero-shot)	mAP (Avg)	18.9	# 12
Multi-Instance Retrieval	EPIC-KITCHENS-100	HierVL (Zero-shot)	nDCG (Avg)	24.7	# 12
Multi-Instance Retrieval	EPIC-KITCHENS-100	HierVL	mAP (Avg)	46.7	# 6
Multi-Instance Retrieval	EPIC-KITCHENS-100	HierVL	nDCG (Avg)	61.1	# 4

Methods

No methods listed for this paper.
