TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	GQA test-dev	PNP-VQA	Accuracy	41.9	# 10
Visual Question Answering (VQA)	OK-VQA	PNP-VQA	Accuracy	35.9	# 30
Visual Question Answering (VQA)	VQA v2 test-dev	PNP-VQA	Accuracy	64.8	# 45
Visual Question Answering (VQA)	VQA v2 val	PNP-VQA	Accuracy	63.3	# 2

Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

17 Oct 2022 · Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, Steven C. H. Hoi

Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) for the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we propose to use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering. Surpassing end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on VQAv2. With 738M PLM parameters, PNP-VQA achieves an improvement of 9.1% on GQA over FewVLM with 740M PLM parameters. Code is released at https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa
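
The pipeline described in the abstract (score image regions against the question, caption the relevant regions, then hand the captions to a text-only question-answering model) can be sketched roughly as below. This is a minimal illustration, not the released implementation: every function and variable name is hypothetical and none is taken from the LAVIS codebase, the three pretrained models are replaced by toy string-based stand-ins so the example runs without downloads, and sampling regions in proportion to a relevance score is an assumption about how network interpretation could guide the captions.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interfaces for the three frozen components (assumptions, not LAVIS APIs).
Scorer = Callable[[Sequence[str], str], List[float]]   # (patches, question) -> relevance per patch
Captioner = Callable[[Sequence[str]], str]             # patches -> one caption
Reader = Callable[[str, str], str]                     # (question, context) -> answer


def question_guided_captions(
    patches: Sequence[str],
    question: str,
    score_patches: Scorer,
    caption_patches: Captioner,
    num_captions: int = 5,
    patches_per_caption: int = 2,
    seed: int = 0,
) -> List[str]:
    """Caption randomly sampled patches, favouring those relevant to the question."""
    rng = random.Random(seed)
    weights = list(score_patches(patches, question))
    if sum(weights) <= 0:
        # No patch looks relevant: fall back to uniform sampling.
        weights = [1.0] * len(patches)
    captions = []
    for _ in range(num_captions):
        subset = rng.choices(list(patches), weights=weights, k=patches_per_caption)
        captions.append(caption_patches(subset))
    # Drop duplicate captions while keeping their first-seen order.
    return list(dict.fromkeys(captions))


def answer_zero_shot(
    patches: Sequence[str],
    question: str,
    score_patches: Scorer,
    caption_patches: Captioner,
    read_text: Reader,
    **caption_kwargs,
) -> str:
    """Chain the components; no component is trained or fine-tuned here."""
    captions = question_guided_captions(
        patches, question, score_patches, caption_patches, **caption_kwargs
    )
    context = " ".join(captions)  # natural language is the only interface to the reader
    return read_text(question, context)


# --- Toy stand-ins so the sketch runs; real systems would plug in pretrained models. ---
TOY_PATCHES = ["green grass", "red ball", "blue sky"]  # each "patch" is just a text label
COLORS = {"red", "green", "blue"}


def toy_score(patches: Sequence[str], question: str) -> List[float]:
    """Relevance 1.0 if the patch label shares a word with the question, else 0.0."""
    q_words = set(question.lower().split())
    return [1.0 if q_words & set(p.split()) else 0.0 for p in patches]


def toy_caption(subset: Sequence[str]) -> str:
    """Describe the distinct patch labels in a fixed template."""
    return "a photo of " + " and ".join(sorted(set(subset)))


def toy_read(question: str, context: str) -> str:
    """Return the first colour word found in the context."""
    for word in context.split():
        if word in COLORS:
            return word
    return "unknown"


if __name__ == "__main__":
    print(answer_zero_shot(TOY_PATCHES, "what color is the ball",
                           toy_score, toy_caption, toy_read))
```

Because the components only exchange plain text, each stand-in can be swapped for a stronger model without changing the others, which is the modularity the abstract emphasises.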

Code

salesforce/lavis official

8,830

abril4416/kgen_vqa

Tasks

Image Captioning

Network Interpretation

Question Answering

Visual Question Answering

Visual Question Answering (VQA)

Datasets

MS COCO

GQA

Visual Question Answering v2.0

OK-VQA

Results from the Paper

Ranked #2 on Visual Question Answering (VQA) on VQA v2 val
