Open Vocabulary Semantic Segmentation
28 papers with code • 9 benchmarks • 4 datasets
A side network is attached to a frozen CLIP model with two branches: one that predicts mask proposals, and one that predicts attention biases which are applied inside CLIP's attention layers so that CLIP can recognize the class of each mask.
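A minimal sketch of such a two-branch side network is given below. It is not any paper's actual implementation; the module names, query count, and feature dimension are illustrative assumptions, and the frozen CLIP visual encoder is assumed to expose per-patch tokens of dimension `clip_dim`.

```python
# Illustrative sketch of a two-branch side network over frozen CLIP features.
# All names and shapes are assumptions, not a published architecture.
import torch
import torch.nn as nn

class SideNetwork(nn.Module):
    def __init__(self, clip_dim=768, num_queries=100, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, clip_dim))
        layer = nn.TransformerDecoderLayer(clip_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_head = nn.Linear(clip_dim, clip_dim)  # branch 1: mask proposals
        self.bias_head = nn.Linear(clip_dim, clip_dim)  # branch 2: attention bias

    def forward(self, clip_tokens):
        # clip_tokens: (B, N, D) patch tokens from the frozen CLIP visual encoder.
        B = clip_tokens.shape[0]
        q = self.decoder(self.queries.expand(B, -1, -1), clip_tokens)
        # Branch 1: per-query mask logits over the N patch locations -> (B, Q, N).
        masks = torch.einsum("bqd,bnd->bqn", self.mask_head(q), clip_tokens)
        # Branch 2: per-query attention biases -> (B, Q, N), to be added to
        # CLIP's attention logits so recognition focuses on each mask region.
        biases = torch.einsum("bqd,bnd->bqn", self.bias_head(q), clip_tokens)
        return masks, biases
```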
However, transferring capabilities learned from image-level supervision to the pixel-level task of segmentation, while handling arbitrary unseen categories at inference time, makes this task challenging.
However, semantic segmentation and CLIP operate at different visual granularities: semantic segmentation makes per-pixel predictions, while CLIP is trained on whole images.
Contrastive Language-Image Pre-training (CLIP) is a powerful large vision-language model that has demonstrated significant benefits for downstream tasks, including many zero-shot learning and text-guided vision tasks.
In this paper, we tackle an emerging computer vision task, open-vocabulary universal image segmentation, which aims to perform semantic, instance, and panoptic segmentation (background semantic labeling plus foreground instance segmentation) for arbitrary categories given text-based descriptions at inference time.
To address this, we propose to fine-tune CLIP on a collection of masked image regions and their corresponding text descriptions.
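A rough sketch of one such fine-tuning step is shown below, assuming the `open_clip` library and a hypothetical batch of already-preprocessed region crops with their captions; the step function and its arguments are assumptions for illustration, not the paper's training code.

```python
# Sketch: contrastive fine-tuning of CLIP on (masked region, caption) pairs.
# region_images: (B, 3, 224, 224) preprocessed crops; region_texts: list[str].
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune_step(region_images, region_texts):
    """One gradient step on a batch of masked image regions and descriptions."""
    image_feats = F.normalize(model.encode_image(region_images), dim=-1)
    text_feats = F.normalize(model.encode_text(tokenizer(region_texts)), dim=-1)
    logits = image_feats @ text_feats.t() * model.logit_scale.exp()
    labels = torch.arange(len(region_texts))
    # Symmetric InfoNCE loss: match each region to its caption and vice versa.
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```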
When trained at sufficient scale, self-supervised learning has exhibited a notable ability to solve a wide range of vision and language understanding tasks.
The pre-trained model captures rich visual concepts by learning from large-scale image-text data.
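These learned concepts are what make open-vocabulary recognition possible: masked regions can be scored against an arbitrary, user-supplied vocabulary via CLIP text embeddings. The sketch below illustrates the idea under the same `open_clip` assumptions as above; the prompt template and class names are arbitrary examples.

```python
# Sketch: zero-shot classification of masked regions with CLIP text embeddings.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")

@torch.no_grad()
def classify_regions(region_images, class_names):
    """Assign each region crop the best-matching label from any vocabulary."""
    prompts = tokenizer([f"a photo of a {c}" for c in class_names])
    text_feats = F.normalize(model.encode_text(prompts), dim=-1)          # (C, D)
    image_feats = F.normalize(model.encode_image(region_images), dim=-1)  # (R, D)
    # Cosine similarity gives per-region class scores; argmax picks a label.
    return (image_feats @ text_feats.t()).argmax(dim=-1)
```

Because the vocabulary is just a list of strings embedded at inference time, categories never seen during segmentation training can still be recognized.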