Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance by pursuing semantic interaction upon pre-defined video-text pairs. Moving beyond this coarse-grained global interaction, however, requires tackling the more challenging problem of fine-grained cross-modal learning. In this paper, we model video and text as game players within multivariate cooperative game theory to handle the uncertainty of fine-grained semantic interaction, which exhibits diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value the possible correspondences between video frames and text words for sensitive and explainable cross-modal contrast. To make the cooperative game over multiple video frames and multiple text words tractable, the proposed method clusters the original video frames (text words) and computes the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video-question answering benchmarks justify the efficacy of our HBI with superior performance. More encouragingly, HBI can also serve as a visualization tool that promotes understanding of cross-modal interaction, which has a far-reaching impact on the community. The project page is available at https://jpthu17.github.io/HBI/.
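For intuition, the Banzhaf Interaction referenced in the abstract measures, for a pair of players (here, merged video and text tokens), the average change in coalition payoff when the two players join a coalition together versus separately. Below is a minimal brute-force sketch of the standard Banzhaf interaction index; the function name and the payoff callable `v` are illustrative assumptions, not the paper's implementation, which works on merged tokens precisely to avoid this exponential enumeration.

```python
from itertools import combinations

def banzhaf_interaction(v, n, i, j):
    """Banzhaf interaction index between players i and j.

    v: callable mapping a frozenset coalition -> real-valued payoff
       (e.g., a cross-modal similarity score over selected tokens).
    n: total number of players (e.g., merged video/text tokens).
    """
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    # Average the "cross-difference" over all coalitions S subset of N \ {i, j}:
    # v(S+{i,j}) - v(S+{i}) - v(S+{j}) + v(S) is positive when i and j cooperate.
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            S = frozenset(S)
            total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    return total / (2 ** (n - 2))
```

For a payoff that rewards only the pair acting together, the index is 1; for a purely additive payoff (no cooperation), it is 0, which matches the intuition that the index isolates the synergistic part of the payoff.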

Published at CVPR 2023.
| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Video Retrieval | ActivityNet | HBI | text-to-video R@1 | 42.2 | #21 |
| Video Retrieval | ActivityNet | HBI | text-to-video R@5 | 73.0 | #18 |
| Video Retrieval | ActivityNet | HBI | text-to-video R@10 | 84.6 | #16 |
| Video Retrieval | ActivityNet | HBI | text-to-video Median Rank | 2.0 | #5 |
| Video Retrieval | ActivityNet | HBI | text-to-video Mean Rank | 6.6 | #8 |
| Video Retrieval | ActivityNet | HBI | video-to-text R@1 | 42.4 | #12 |
| Video Retrieval | ActivityNet | HBI | video-to-text R@5 | 73.0 | #10 |
| Video Retrieval | ActivityNet | HBI | video-to-text R@10 | 86.0 | #7 |
| Video Retrieval | ActivityNet | HBI | video-to-text Median Rank | 2.0 | #2 |
| Video Retrieval | ActivityNet | HBI | video-to-text Mean Rank | 6.5 | #7 |
| Video Retrieval | DiDeMo | HBI | text-to-video R@1 | 46.9 | #28 |
| Video Retrieval | DiDeMo | HBI | text-to-video R@5 | 74.9 | #26 |
| Video Retrieval | DiDeMo | HBI | text-to-video R@10 | 82.7 | #25 |
| Video Retrieval | DiDeMo | HBI | text-to-video Median Rank | 2.0 | #9 |
| Video Retrieval | DiDeMo | HBI | text-to-video Mean Rank | 12.1 | #5 |
| Video Retrieval | DiDeMo | HBI | video-to-text R@1 | 46.2 | #12 |
| Video Retrieval | DiDeMo | HBI | video-to-text R@5 | 73.0 | #9 |
| Video Retrieval | DiDeMo | HBI | video-to-text R@10 | 82.7 | #9 |
| Video Retrieval | DiDeMo | HBI | video-to-text Median Rank | 2.0 | #5 |
| Video Retrieval | DiDeMo | HBI | video-to-text Mean Rank | 8.7 | #4 |
| Video Retrieval | MSR-VTT-1kA | HBI | text-to-video R@1 | 48.6 | #22 |
| Video Retrieval | MSR-VTT-1kA | HBI | text-to-video R@5 | 74.6 | #20 |
| Video Retrieval | MSR-VTT-1kA | HBI | text-to-video R@10 | 83.4 | #21 |
| Video Retrieval | MSR-VTT-1kA | HBI | text-to-video Median Rank | 2.0 | #10 |
| Video Retrieval | MSR-VTT-1kA | HBI | text-to-video Mean Rank | 12.0 | #7 |
| Video Retrieval | MSR-VTT-1kA | HBI | video-to-text R@1 | 46.8 | #17 |
| Video Retrieval | MSR-VTT-1kA | HBI | video-to-text R@5 | 74.3 | #12 |
| Video Retrieval | MSR-VTT-1kA | HBI | video-to-text R@10 | 84.3 | #11 |
| Video Retrieval | MSR-VTT-1kA | HBI | video-to-text Median Rank | 2.0 | #7 |
| Video Retrieval | MSR-VTT-1kA | HBI | video-to-text Mean Rank | 8.9 | #12 |
| Visual Question Answering (VQA) | MSRVTT-QA | HBI | Accuracy | 0.462 | #11 |
| Video Question Answering | MSRVTT-QA | HBI | Accuracy | 46.2 | #8 |
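The retrieval metrics reported above (R@K, Median Rank, Mean Rank) are standard and can be computed from a text-video similarity matrix. A minimal sketch, assuming ground-truth pairs lie on the diagonal of the matrix; the function name is illustrative, not from the paper's codebase:

```python
import numpy as np

def retrieval_metrics(sim):
    """R@1/5/10 (in %), median and mean rank from a similarity matrix.

    sim[i, j] = similarity of query i to candidate j; the ground-truth
    match for query i is assumed to be candidate i (the diagonal).
    """
    order = np.argsort(-sim, axis=1)            # candidates, best match first
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1  # 1-indexed rank of true pair
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MedR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }
```

Transposing `sim` gives the video-to-text direction, which is why the table reports both sets of numbers from one model.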
