TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Captioning	Flickr30k Captions test	FewVLM	CIDEr	31.0	# 5
Image Captioning	Flickr30k Captions test	FewVLM	SPICE	10.0	# 4
Visual Question Answering (VQA)	GQA test-dev	FewVLM (zero-shot)	Accuracy	29.3	# 14
Visual Question Answering (VQA)	OK-VQA	FewVLM	Accuracy	16.5	# 33
Visual Question Answering (VQA)	VQA v2 val	FewVLM (zero-shot)	Accuracy	47.7	# 8

A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models

ACL 2022 · Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, Xiang Ren

Large pre-trained vision-language (VL) models can learn a new task from a handful of examples and generalize to a new task without fine-tuning. However, these VL models are hard to deploy in real-world applications because of their impractically large size and slow inference speed. To address this limitation, we study prompt-based low-resource learning of VL tasks with our proposed method, FewVLM, which is relatively small compared with recent few-shot learners. For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). Furthermore, we analyze the effect of diverse prompts on few-shot tasks. Experimental results on VQA show that FewVLM with prompt-based learning outperforms Frozen, which is 31x larger than FewVLM, by 18.2 percentage points and achieves results comparable to those of PICa, a model 246x larger. In our analysis, we observe that (1) prompts significantly affect zero-shot performance but only marginally affect few-shot performance, (2) models with noisy prompts learn as quickly as models with hand-crafted prompts given larger training data, and (3) MaskedLM helps VQA tasks while PrefixLM boosts captioning performance. Our code is publicly available at https://github.com/woojeongjin/FewVLM.
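The abstract names two pre-training objectives, PrefixLM and MaskedLM, without showing what their training pairs look like. The following is a minimal sketch of how such source/target text pairs are commonly built for a sequence-to-sequence model, not the authors' implementation. The function names, the `<mask_0>` sentinel token, and the fixed span arguments are all assumptions made for illustration; image features, which the real model also consumes, are left out entirely.

```python
import random


def prefix_lm_example(caption, rng):
    """PrefixLM-style pair: the model reads a prefix and generates the rest.

    Hypothetical helper; assumes the caption has at least two words.
    """
    words = caption.split()
    cut = rng.randint(1, len(words) - 1)  # keep both sides non-empty
    return " ".join(words[:cut]), " ".join(words[cut:])


def masked_lm_example(caption, span_start, span_len, sentinel="<mask_0>"):
    """MaskedLM-style pair: one span is replaced by a sentinel in the source,
    and the target is the sentinel followed by the hidden words.

    The sentinel name is an assumption, not taken from the paper.
    """
    words = caption.split()
    hidden = words[span_start:span_start + span_len]
    source = words[:span_start] + [sentinel] + words[span_start + span_len:]
    target = [sentinel] + hidden
    return " ".join(source), " ".join(target)


if __name__ == "__main__":
    caption = "a dog runs on the beach"
    print(prefix_lm_example(caption, random.Random(0)))
    print(masked_lm_example(caption, span_start=1, span_len=2))
```

In this sketch the PrefixLM target is a free-form continuation, which resembles caption generation, while the MaskedLM target fills a short gap, which resembles producing a short answer. That is one plausible reading of observation (3) in the abstract, not a claim made by the source.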

Code

woojeongjin/fewvlm official

Tasks

Image Captioning

Language Modelling

Masked Language Modeling

Visual Question Answering (VQA)

Datasets

MS COCO

mini-Imagenet

Visual Genome

Flickr30k

GQA

Visual Question Answering v2.0

OK-VQA

NoCaps

Results from the Paper

Ranked #4 on Image Captioning on Flickr30k Captions test (SPICE metric)

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
