TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Few-shot Age Estimation	MORPH Album2	CoOp	MAE	5.09	# 2
Few-shot Age Estimation	MORPH Album2	CoOp	MAE (2 shot)	4.50	# 2
Few-shot Age Estimation	MORPH Album2	CoOp	MAE (4 shot)	3.81	# 2
Few-shot Age Estimation	MORPH Album2	CoOp	MAE (8 shot)	3.57	# 2
Few-shot Age Estimation	MORPH Album2	CoOp	MAE (16 shot)	3.23	# 2
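
The metric in the table above, MAE (mean absolute error), is the average absolute difference between predicted and true ages, so lower values are better. A minimal sketch of the computation follows; the function name and the ages are made up for illustration and are not MORPH data.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - ground truth| over all samples."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical predictions and labels, in years (illustration only).
predicted_ages = [23.0, 41.5, 35.0, 58.0]
true_ages = [25, 40, 35, 62]

mae = mean_absolute_error(predicted_ages, true_ages)
print(mae)  # (2 + 1.5 + 0 + 4) / 4 = 1.875
```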

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/learning-to-prompt-for-vision-language-models/few-shot-age-estimation-on-morph-album2)](https://paperswithcode.com/sota/few-shot-age-estimation-on-morph-album2?p=learning-to-prompt-for-vision-language-models)`

Learning to Prompt for Vision-Language Models

2 Sep 2021 · Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that transfer across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space. This allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning words, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while all pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and gains significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
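
The core idea in the abstract, learnable context vectors in front of a class token while every pre-trained parameter stays fixed, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the 4-dimensional embeddings, the mean-pooling "text encoder", the two classes, and the finite-difference optimizer are stand-ins, not the authors' implementation, which trains the context with backpropagation through CLIP. Only the unified-context variant is shown.

```python
import math

DIM = 4  # toy embedding width (an assumption; real models use far more)

# Frozen stand-ins for pre-trained class-token embeddings (hypothetical values).
CLASS_EMB = {"cat": [1.0, 0.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0, 0.0]}
CLASS_NAMES = list(CLASS_EMB)

# Hypothetical few-shot data: (image feature, label), one shot per class.
FEW_SHOT = [([1.0, 0.2, 0.5, 0.0], "cat"), ([0.2, 1.0, 0.5, 0.0], "dog")]


def text_encoder(tokens):
    """Toy frozen text encoder: mean-pool the token vectors."""
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(DIM)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def class_probs(ctx, image_feat, temperature=0.1):
    """Unified context: the same ctx vectors are prepended to every class token."""
    logits = [
        cosine(text_encoder(ctx + [CLASS_EMB[name]]), image_feat) / temperature
        for name in CLASS_NAMES
    ]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss(ctx, data):
    """Mean cross-entropy of the true class over the few-shot set."""
    return -sum(
        math.log(class_probs(ctx, x)[CLASS_NAMES.index(y)]) for x, y in data
    ) / len(data)


def train_context(ctx, data, steps=50, lr=0.5, eps=1e-4):
    """Update only the context vectors; encoder and class tokens stay frozen.

    Gradients are estimated by finite differences, and a step is accepted
    only if it lowers the loss (otherwise the step size is halved).
    """
    ctx = [row[:] for row in ctx]  # work on a copy
    for _ in range(steps):
        base = loss(ctx, data)
        grads = []
        for i in range(len(ctx)):
            grad_row = []
            for j in range(DIM):
                ctx[i][j] += eps
                grad_row.append((loss(ctx, data) - base) / eps)
                ctx[i][j] -= eps
            grads.append(grad_row)
        candidate = [
            [v - lr * g for v, g in zip(row, grad_row)]
            for row, grad_row in zip(ctx, grads)
        ]
        if loss(candidate, data) < base:
            ctx = candidate
        else:
            lr *= 0.5
    return ctx


# A deliberately poor starting context, standing in for a badly worded prompt.
init_ctx = [[0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
learned_ctx = train_context(init_ctx, FEW_SHOT)
print(loss(init_ctx, FEW_SHOT), loss(learned_ctx, FEW_SHOT))
```

In this toy setup the only values that change during training are the entries of the context vectors; the class embeddings and the encoder are never modified, which mirrors the frozen-backbone design described above. A class-specific variant would keep one set of context vectors per class instead of sharing a single set.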

Code

Repository	Stars
kaiyangzhou/coop (official)	1,467
muzairkhattak/multimodal-prompt-lea…	511
azshue/TPT	117
muzairkhattak/protext	
vill-lab/2024-aaai-hpt	

See all 13 implementations

Tasks

Domain Generalization

Few-shot Age Estimation

Prompt Engineering

Representation Learning

Datasets

ImageNet

UCF101

Oxford 102 Flower

Stanford Cars

DTD

Food-101

Caltech-101

EuroSAT

FGVC-Aircraft

ImageNet-A

ImageNet-Sketch

MORPH

Results from the Paper

Ranked #2 on Few-shot Age Estimation on MORPH Album2

Methods

CLIP • CoOp
