TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Open Vocabulary Panoptic Segmentation	ADE20K	FC-CLIP	PQ	26.8	# 2
Open Vocabulary Semantic Segmentation	ADE20K-150	FC-CLIP	mIoU	34.1	# 5
Open Vocabulary Semantic Segmentation	ADE20K-847	FC-CLIP	mIoU	14.8	# 4
Open Vocabulary Semantic Segmentation	Cityscapes	FC-CLIP	mIoU	56.2	# 1
Open Vocabulary Semantic Segmentation	PASCAL Context-459	FC-CLIP	mIoU	18.2	# 4
Open Vocabulary Semantic Segmentation	PASCAL Context-59	FC-CLIP	mIoU	58.4	# 8
Open Vocabulary Semantic Segmentation	PascalVOC-20	FC-CLIP	mIoU	95.4	# 5
Open Vocabulary Semantic Segmentation	PascalVOC-20b	FC-CLIP	mIoU	81.8	# 3

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

NeurIPS 2023 · Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen

Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also yields a remarkably better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When trained on COCO panoptic data only and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes, respectively. Additionally, training and testing FC-CLIP is 7.5x and 6.6x faster, respectively, than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state of the art across various open-vocabulary semantic segmentation datasets. Code is available at https://github.com/bytedance/fc-clip
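
The abstract states that a frozen CLIP backbone keeps its open-vocabulary classification ability, but it does not say how backbone features become per-mask class predictions. One plausible way to picture this, offered here only as an assumption, is to average the backbone features inside each predicted mask and compare the pooled vector with text embeddings by cosine similarity. The sketch below illustrates that idea in plain Python; it is not the FC-CLIP implementation, and the function names, the toy feature map, and the two-dimensional "text embeddings" are all hypothetical stand-ins for what CLIP's image and text encoders would produce.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors of all pixels where mask is 1.

    features: H x W x C nested lists; mask: H x W nested lists of 0/1.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for row_feats, row_mask in zip(features, mask):
        for vec, selected in zip(row_feats, row_mask):
            if selected:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def classify_mask(features, mask, text_embeddings):
    """Return the best-matching category name and all similarity scores."""
    pooled = mask_pool(features, mask)
    scores = {name: cosine(pooled, emb) for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy 2x2 feature map with 2 channels (hypothetical values, not CLIP outputs).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
top_mask = [[1, 1], [0, 0]]
bottom_mask = [[0, 0], [1, 1]]

# Hypothetical text embeddings; the category list can change at test time
# without retraining, which is what makes the vocabulary "open".
text_embeddings = {"road": [1.0, 0.0], "sky": [0.0, 1.0]}

top_label, _ = classify_mask(features, top_mask, text_embeddings)
bottom_label, _ = classify_mask(features, bottom_mask, text_embeddings)
print(top_label, bottom_label)  # prints "road sky"
```

Because the feature map is computed once and reused for every mask, a setup like this avoids re-running the image encoder per mask, which matches the abstract's motivation for sharing a single backbone.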

Code

bytedance/fc-clip official

Tasks

Open Vocabulary Panoptic Segmentation

Open Vocabulary Semantic Segmentation

Semantic Segmentation

Datasets

MS COCO

Cityscapes

ADE20K

COCO-Stuff

PASCAL VOC

Results from the Paper

Ranked #1 on Open Vocabulary Semantic Segmentation on Cityscapes

Methods

CLIP
