EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement from the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representation during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion in the backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2 by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/EgoVLPv2/.
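As a rough illustration of the fusion-in-the-backbone idea, the PyTorch sketch below inserts a gated cross-attention sub-layer into an otherwise standard transformer block, so the same weights can run either as a plain uni-modal encoder (fusion off, e.g. for dual-encoder retrieval) or as a fusion encoder (fusion on). This is a minimal sketch under our own assumptions, not the released EgoVLPv2 implementation: the class name `FusionBlock`, the `fuse` flag, the tanh-gated residual, and all hyper-parameters are illustrative.

```python
# Minimal sketch (not the authors' code) of cross-modal "fusion in the backbone":
# each block optionally attends to tokens from the other modality through a
# gated cross-attention sub-layer, so one set of weights serves both a
# dual-encoder mode (fuse=False) and a fusion-encoder mode (fuse=True).
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """Pre-norm self-attention + optional cross-modal attention + MLP."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Zero-initialised gate: the block starts out equivalent to a
        # uni-modal layer, and cross-modal information is blended in gradually.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x, other=None, fuse=False):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        if fuse and other is not None:
            q = self.norm2(x)
            x = x + torch.tanh(self.gate) * self.cross_attn(
                q, other, other, need_weights=False
            )[0]
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    video = torch.randn(2, 16, 768)  # (batch, video tokens, dim)
    text = torch.randn(2, 8, 768)    # (batch, text tokens, dim)
    block = FusionBlock()
    dual = block(video)                           # dual-encoder mode
    fused = block(video, other=text, fuse=True)   # fusion-in-the-backbone mode
    print(dual.shape, fused.shape)
```

Because fusion is just a switchable sub-layer inside the backbone rather than a separate stack of fusion-specific layers, downstream tasks can reuse the pre-trained cross-attention weights directly, which is the flexibility and cost saving the abstract refers to.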

ICCV 2023
Task | Dataset | Model | Metric | Value | Global Rank
Action Recognition | Charades-Ego | EgoVLPv2 | mAP | 34.1 | #2
Moment Queries | Ego4D | EgoVLPv2 | Avg mAP (0.1-0.5) | 12.23 | #4
Natural Language Queries | Ego4D | EgoVLPv2 | R@1 (IoU=0.3) | 12.95 | #3
Natural Language Queries | Ego4D | EgoVLPv2 | R@5 (IoU=0.3) | 23.80 | #2
Natural Language Queries | Ego4D | EgoVLPv2 | R@1 (IoU=0.5) | 7.91 | #4
Natural Language Queries | Ego4D | EgoVLPv2 | R@5 (IoU=0.5) | 16.11 | #2
Question Answering | EgoTaskQA | EgoVLPv2 | Direct | 46.26 | #1
Multi-Instance Retrieval | EPIC-KITCHENS-100 | EgoVLPv2 | mAP (Avg) | 47.3 | #4
Multi-Instance Retrieval | EPIC-KITCHENS-100 | EgoVLPv2 | nDCG (Avg) | 61.9 | #3
Multi-Instance Retrieval | EPIC-KITCHENS-100 | EgoVLPv2 (Zero-shot) | mAP (Avg) | 26.7 | #11
Multi-Instance Retrieval | EPIC-KITCHENS-100 | EgoVLPv2 (Zero-shot) | nDCG (Avg) | 29.1 | #11
Video Summarization | Query-Focused Video Summarization Dataset | EgoVLPv2 | F1 (avg) | 52.08 | #1
