Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the prompt is strongly transferable and can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% over CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 over ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.

Published at NeurIPS 2023.
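
Once the prompt is pre-trained, using it for zero-shot recognition amounts to injecting the learned context vectors into CLIP's text encoder in place of a hand-written template, in the style of CoOp. The sketch below illustrates this for image classification only; the checkpoint name `pomp_ctx.pt`, the context shape, and the class list are illustrative assumptions, not the official POMP interface (see the repository above for that).

```python
# Minimal sketch: plugging a pre-trained soft prompt into CLIP for zero-shot
# classification via CoOp-style prompt injection. File names and shapes below
# are assumptions for illustration.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Assumed: learned context vectors of shape (n_ctx, 512), where 512 is the
# text transformer width of ViT-B/16, saved after prompt pre-training.
ctx = torch.load("pomp_ctx.pt", map_location=device)          # (n_ctx, 512)
n_ctx = ctx.shape[0]

classnames = ["golden retriever", "sports car", "espresso"]   # any open vocabulary

@torch.no_grad()
def build_text_classifier(classnames, ctx):
    # Tokenize prompts with n_ctx placeholder tokens followed by the class name.
    prompts = [" ".join(["X"] * n_ctx) + " " + c + "." for c in classnames]
    tokens = clip.tokenize(prompts).to(device)                 # (C, 77)
    emb = model.token_embedding(tokens).type(model.dtype)      # (C, 77, 512)
    # Replace the placeholder embeddings (positions 1..n_ctx, after SOS)
    # with the pre-trained context vectors.
    emb[:, 1 : 1 + n_ctx, :] = ctx.type(model.dtype)
    # Run CLIP's text transformer manually, as in CoOp.
    x = emb + model.positional_embedding.type(model.dtype)
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x).type(model.dtype)
    # Take the feature at the EOT token (highest token id) and project it.
    feats = x[torch.arange(x.shape[0]), tokens.argmax(dim=-1)] @ model.text_projection
    return feats / feats.norm(dim=-1, keepdim=True)            # (C, 512)

text_features = build_text_classifier(classnames, ctx)

@torch.no_grad()
def classify(image):
    # image: a PIL.Image; returns class probabilities over `classnames`.
    img = preprocess(image).unsqueeze(0).to(device)
    img_features = model.encode_image(img)
    img_features = img_features / img_features.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * img_features @ text_features.t()
    return logits.softmax(dim=-1)
```

For segmentation or detection, the same prompt-conditioned text features would serve as per-class classifiers inside the task head; that part is omitted here.
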
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | POMP | hIoU | 39.1 | #1 |
| Prompt Engineering | ImageNet-21k | POMP | Accuracy | 25.3 | #1 |
| Prompt Engineering | ImageNet-A | POMP | Top-1 accuracy (%) | 51.6 | #1 |
| Prompt Engineering | ImageNet-R | POMP | Top-1 accuracy (%) | 77.9 | #1 |
| Prompt Engineering | ImageNet-S | POMP | Top-1 accuracy (%) | 49.8 | #1 |
| Open Vocabulary Object Detection | LVIS v1.0 | POMP | AP novel (LVIS base training) | 25.2 | #10 |
| Open Vocabulary Semantic Segmentation | PascalVOC-20 | POMP | mIoU | 89.4 | #9 |
| Open Vocabulary Semantic Segmentation | PascalVOC-20 | POMP | hIoU | 84.4 | #1 |
