TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Human-Object Interaction Detection	HICO-DET	RLIPv2	mAP (UC)	42.26	# 1
Zero-Shot Human-Object Interaction Detection	HICO-DET	RLIPv2	mAP (UO)	-	# 5
Zero-Shot Human-Object Interaction Detection	HICO-DET	RLIPv2	mAP (UA)	-	# 2
Human-Object Interaction Detection	HICO-DET	RLIPv2 (Swin-L)	mAP	45.09	# 1
Human-Object Interaction Detection	V-COCO	RLIPv2	AP(S1)	72.1	# 1
Human-Object Interaction Detection	V-COCO	RLIPv2	AP(S2)	74.1	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rlipv2-fast-scaling-of-relational-language/zero-shot-human-object-interaction-detection)](https://paperswithcode.com/sota/zero-shot-human-object-interaction-detection?p=rlipv2-fast-scaling-of-relational-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rlipv2-fast-scaling-of-relational-language/human-object-interaction-detection-on-hico)](https://paperswithcode.com/sota/human-object-interaction-detection-on-hico?p=rlipv2-fast-scaling-of-relational-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rlipv2-fast-scaling-of-relational-language/human-object-interaction-detection-on-v-coco)](https://paperswithcode.com/sota/human-object-interaction-detection-on-v-coco?p=rlipv2-fast-scaling-of-relational-language)`
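
The three badge snippets above share one URL pattern: a paper slug and a task slug are substituted into a shields.io endpoint URL and a leaderboard link. As a minimal sketch, a badge line for another task could be built as follows; the function and parameter names are hypothetical, and the pattern is inferred only from the three snippets listed here.

```python
def pwc_badge_markdown(paper_slug: str, task_slug: str) -> str:
    """Build badge Markdown following the pattern of the snippets above.

    Both arguments are URL slugs as they appear in the listed badge links;
    the helper itself is illustrative and not part of any official tooling.
    """
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{task_slug}"
    )
    link_url = f"https://paperswithcode.com/sota/{task_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({link_url})"


# Rebuilds the HICO-DET badge line from its two slugs.
print(pwc_badge_markdown(
    "rlipv2-fast-scaling-of-relational-language",
    "human-object-interaction-detection-on-hico",
))
```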

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

ICCV 2023 · Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data make scaling RLIPv1 challenging. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under full fine-tuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.
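
To make the phrase "gated cross-modal fusion" concrete, here is a minimal sketch of one common form of gating: vision tokens attend to language tokens, and the attended result is added back through a scalar tanh gate. This is not the authors' ALIF implementation. The single attention head, the absence of learned projections, the tanh gate, and all function names are assumptions chosen for illustration; the abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention, no learned projections (assumption)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def gated_fusion(vision_tokens, language_tokens, gate):
    """Return vision + tanh(gate) * attention(vision -> language).

    With gate = 0 the vision tokens pass through unchanged, so the fusion
    can be switched on gradually (an illustrative design choice).
    """
    attended = cross_attention(vision_tokens, language_tokens, language_tokens)
    g = math.tanh(gate)
    return [
        [v + g * a for v, a in zip(v_row, a_row)]
        for v_row, a_row in zip(vision_tokens, attended)
    ]


vision = [[1.0, 0.0], [0.0, 1.0]]   # two toy vision tokens
language = [[1.0, 1.0]]             # one toy language token
print(gated_fusion(vision, language, gate=0.0))  # gate closed: vision tokens unchanged
print(gated_fusion(vision, language, gate=1.0))  # gate open: language features mixed in
```

In this toy form the gate is a plain number passed in by the caller; in a trained model it would typically be a learned parameter.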

Code

jacobyuan7/rlipv2 (official)

jacobyuan7/rlip

jacobyuan7/ocn-hoi-benchmark

Tasks

Graph Generation

Human-Object Interaction Detection

Object Detection

Relation

Relational Reasoning

Scene Graph Generation

Zero-Shot Human-Object Interaction Detection

Datasets

MS COCO

COCO Captions

HICO-DET

V-COCO

Objects365

Results from the Paper

Ranked #1 on Zero-Shot Human-Object Interaction Detection on HICO-DET (using extra training data)

Methods

ALIGN
