TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Open Vocabulary Semantic Segmentation	ADE20K-150	SimSeg	mIoU	20.5	# 14
Open Vocabulary Semantic Segmentation	ADE20K-847	SimSeg	mIoU	7	# 13
Open Vocabulary Semantic Segmentation	Cityscapes	SimSeg	mIoU	34.5	# 2
Open Vocabulary Semantic Segmentation	COCO-Stuff-171	ZSSeg	hIoU	37.8	# 2
Open Vocabulary Semantic Segmentation	PASCAL Context-59	SimSeg	mIoU	47.7	# 12
Open Vocabulary Semantic Segmentation	PascalVOC-20	ZSSeg	hIoU	77.5	# 2
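
The table reports two metrics, mIoU and hIoU. The page does not define them, so the following is only a minimal sketch under a stated assumption: that hIoU is the harmonic mean of the mIoU over seen classes and the mIoU over unseen classes, as it is commonly defined in zero-shot segmentation. The function names, the class split, and the six-pixel toy labels are illustrative and are not taken from the paper; real evaluations also accumulate intersections and unions over a whole dataset rather than a single image.

```python
def iou(pred, target, cls):
    """Intersection over union for one class on flat label lists."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else None  # None: class absent everywhere


def mean_iou(pred, target, classes):
    """Average IoU over the classes that appear in pred or target."""
    values = [v for v in (iou(pred, target, c) for c in classes) if v is not None]
    return sum(values) / len(values) if values else 0.0


def harmonic_iou(miou_seen, miou_unseen):
    """Harmonic mean of seen-class and unseen-class mIoU (assumed hIoU definition)."""
    total = miou_seen + miou_unseen
    return 2 * miou_seen * miou_unseen / total if total else 0.0


# Toy data (hypothetical): six pixels, two seen classes and one unseen class.
target = ["road", "road", "car", "car", "sky", "sky"]
pred = ["road", "road", "car", "road", "sky", "car"]
seen_classes = ["road", "sky"]
unseen_classes = ["car"]

miou_seen = mean_iou(pred, target, seen_classes)
miou_unseen = mean_iou(pred, target, unseen_classes)
hiou = harmonic_iou(miou_seen, miou_unseen)
print(f"seen mIoU={miou_seen:.3f} unseen mIoU={miou_unseen:.3f} hIoU={hiou:.3f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot score well on hIoU by doing well on seen classes alone.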

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

29 Dec 2021 · Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, Xiang Bai

Recently, open-vocabulary image classification via vision-language pre-training has achieved remarkable results: the model can classify arbitrary categories without seeing additional annotated images of those categories. However, it is still unclear how to make open-vocabulary recognition work well on broader vision problems. This paper targets open-vocabulary semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP. Semantic segmentation and CLIP, however, operate at different visual granularities: semantic segmentation operates on pixels, while CLIP operates on whole images. To remedy this discrepancy in processing granularity, we reject the prevalent one-stage FCN-based framework and advocate a two-stage semantic segmentation framework, in which the first stage extracts generalizable mask proposals and the second stage uses an image-based CLIP model to perform open-vocabulary classification on the masked image crops generated in the first stage. Our experimental results show that this two-stage framework outperforms FCN when trained only on the COCO Stuff dataset and evaluated on other datasets without fine-tuning. Moreover, this simple framework surpasses the previous state of the art in zero-shot semantic segmentation by a large margin: +29.5 hIoU on the Pascal VOC 2012 dataset and +8.9 hIoU on the COCO Stuff dataset. Given its simplicity and strong performance, we hope this framework will serve as a baseline to facilitate future research. The code is publicly available at https://github.com/MendelXu/zsseg.baseline.
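
The two-stage idea in the abstract (class-agnostic mask proposals first, then image-level classification of each masked crop against class-name embeddings) can be illustrated with a small sketch. This is not the authors' implementation: the mask proposals are supplied by hand instead of coming from a learned proposal network, `masked_mean_color` is a toy stand-in for the CLIP image encoder, and the hand-written vectors in `class_embeddings` are a hypothetical stand-in for CLIP text embeddings. Only the control flow is meant to mirror the description.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def masked_mean_color(image, mask):
    """Toy image encoder (assumption): mean RGB over the masked pixels."""
    pixels = [image[r][c]
              for r in range(len(image))
              for c in range(len(image[0])) if mask[r][c]]
    return [sum(p[k] for p in pixels) / len(pixels) for k in range(3)]


def segment(image, proposals, class_embeddings):
    """Stage 2: give every mask proposal the class whose embedding matches best."""
    height, width = len(image), len(image[0])
    labels = [[None] * width for _ in range(height)]
    best = [[-2.0] * width for _ in range(height)]  # below any cosine value
    for mask in proposals:
        if not any(any(row) for row in mask):
            continue  # skip empty proposals
        feature = masked_mean_color(image, mask)
        scores = {name: cosine(feature, emb)
                  for name, emb in class_embeddings.items()}
        name = max(scores, key=scores.get)
        # Where proposals overlap, the higher-scoring proposal wins the pixel.
        for r in range(height):
            for c in range(width):
                if mask[r][c] and scores[name] > best[r][c]:
                    labels[r][c] = name
                    best[r][c] = scores[name]
    return labels


# Toy 2x4 image: top row is "sky"-colored, bottom row is "grass"-colored.
SKY, GRASS = (40, 90, 230), (20, 200, 30)
image = [[SKY] * 4, [GRASS] * 4]

# Stage 1 stand-in: two hand-written class-agnostic mask proposals.
proposals = [
    [[1, 1, 1, 1], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [1, 1, 1, 1]],
]

# Hypothetical "text embeddings"; any class name can be added at test time.
class_embeddings = {"sky": [0.1, 0.3, 1.0], "grass": [0.1, 1.0, 0.2]}

labels = segment(image, proposals, class_embeddings)
for row in labels:
    print(row)
```

The design point the sketch tries to show is that the classifier only ever sees whole masked regions, which matches the image-level granularity the abstract attributes to CLIP, and that the label set is just a dictionary of embeddings that can be changed without retraining the proposal stage.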

Code

mendelxu/zsseg.baseline official

156

openrobotlab/ov_parts

Tasks

Image Classification

Language Modelling

object-detection

Object Detection

Open Vocabulary Semantic Segmentation

Segmentation

Semantic Segmentation

Zero-Shot Image Classification

Zero-Shot Learning

Zero-Shot Semantic Segmentation

Datasets

Cityscapes

ADE20K

PASCAL Context

COCO-Stuff

PASCAL VOC

Results from the Paper

Ranked #2 on Open Vocabulary Semantic Segmentation on COCO-Stuff-171

Methods

CLIP • Convolution • FCN • Max Pooling
