A Region-Prompted Adapter Tuning for Visual Abductive Reasoning

18 Mar 2023  ·  Hao Zhang, Yeo Keat Ee, Basura Fernando

Visual Abductive Reasoning (VAR) is an emerging vision-language (VL) topic in which a model must retrieve or generate a likely textual hypothesis from a visual input (an image or part of it) via backward, commonsense-based reasoning. Unlike conventional VL retrieval or captioning tasks, where the entities mentioned in the text appear in the image, in abductive inference the relevant facts are not readily apparent in the input image. Moreover, these inferences are causally linked to specific regional visual cues and change as the cues change. Existing work highlights cues with a specific prompt (e.g., a colorful prompt) and then fully fine-tunes a VL foundation model (VLF) to shift its function from perception to deduction. However, the colorful prompt patchifies "regional hints" and "global context" uniformly at the same granularity and may lose fine-grained visual details crucial for VAR; meanwhile, full fine-tuning of the VLF on limited data easily overfits. To tackle this, we propose a simple yet effective Region-Prompted Adapter (RPA), a hybrid parameter-efficient fine-tuning method that leverages the strengths of detailed cues and efficient training for the VAR task. RPA consists of two novel modules: a Regional Prompt Generator (RPG) and Adapter+. The former encodes "regional visual hints" and "global contexts" into visual prompts separately, at fine and coarse granularity respectively. The latter extends vanilla adapters with a new Map Adapter, which modifies the attention map using a trainable low-dimensional query/key projection. Additionally, we propose a new Dual-Contrastive Loss that regresses the visual feature toward the features of the factual description and the plausible hypothesis. Experiments on the Sherlock benchmark demonstrate that RPA outperforms previous SOTAs, ranking 1st on the leaderboard (Comparison-to-Human Accuracy: RPA 31.74 vs. CPT-CLIP 29.58).
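To make the two mechanisms named above concrete, here is a minimal, illustrative sketch (not the authors' released code) assuming a CLIP-style ViT backbone with hidden size d: a Map Adapter that adds a trainable low-dimensional query/key bias to a frozen attention map, and a dual-contrastive loss that pulls the visual feature toward both the factual-description and hypothesis text embeddings. The module/function names (MapAdapter, dual_contrastive_loss), the rank r, and the additive-bias formulation are assumptions made for clarity.

```python
# Hypothetical sketch of the Map Adapter and Dual-Contrastive Loss described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MapAdapter(nn.Module):
    """Adds a low-rank bias to a frozen attention map:
    softmax(QK^T / sqrt(d) + q'k'^T / sqrt(r))."""

    def __init__(self, d_model: int, r: int = 16):
        super().__init__()
        self.q_down = nn.Linear(d_model, r, bias=False)  # trainable low-dim query projection
        self.k_down = nn.Linear(d_model, r, bias=False)  # trainable low-dim key projection
        self.scale = r ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model) hidden states entering a frozen attention block
        q, k = self.q_down(x), self.k_down(x)
        # (batch, tokens, tokens) bias to be added to the frozen attention logits
        return torch.einsum("bqr,bkr->bqk", q, k) * self.scale


def dual_contrastive_loss(img: torch.Tensor, fact: torch.Tensor,
                          hypo: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE against two text targets: factual description and hypothesis."""
    img, fact, hypo = (F.normalize(t, dim=-1) for t in (img, fact, hypo))
    labels = torch.arange(img.size(0), device=img.device)

    def nce(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        logits = a @ b.t() / tau
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    return nce(img, fact) + nce(img, hypo)
```

In this sketch the backbone weights stay frozen; only the low-dimensional projections (and any vanilla adapters) would be trained, which matches the parameter-efficient setting described above.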

Task: Visual Abductive Reasoning
Dataset: SHERLOCK
Model: Dual-Contrast RGPs (CLIP ViT-L14-336)

Metric Name    Metric Value    Global Rank
im->txt        10.58           # 1
P@1            38.78           # 1
GT-box AP      88.99           # 1
Human-Accor    31.39           # 1
txt->im        12.96           # 1
