TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero Shot Segmentation	ADE20K training-free zero-shot segmentation	CLIPSurgery	mIoU	12.9	# 3
Open Vocabulary Semantic Segmentation	Cityscapes	CLIP Surgery (CLIP without any fine-tuning)	mIoU	31.4	# 4
Open Vocabulary Semantic Segmentation	COCO-Stuff-171	CLIP Surgery (original CLIP without any fine-tuning)	mIoU	21.9	# 2
Open Vocabulary Semantic Segmentation	PASCAL Context-59	CLIP Surgery (original CLIP without any fine-tuning)	mIoU	29.3	# 18

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

12 Apr 2023 · Yi Li, Hualiang Wang, Yiqun Duan, Xiaomeng Li

Contrastive Language-Image Pre-training (CLIP) is a powerful multimodal large vision model that has demonstrated significant benefits for downstream tasks, including many zero-shot learning and text-guided vision tasks. However, we notice some severe problems regarding the model's explainability, which undermine its credibility and impede related tasks. Specifically, we find that CLIP prefers background regions over foreground ones according to the predicted similarity map, which contradicts human understanding. In addition, there are obvious noisy activations at irrelevant positions in the visualization results. To address these two issues, we conduct in-depth analyses and reveal the reasons with new findings and evidence. Based on these insights, we propose CLIP Surgery, a method that enables surgery-like modifications to the inference architecture and features, for better explainability and enhancement in multiple open-vocabulary tasks. The proposed method significantly improves the explainability of CLIP for both convolutional networks and vision transformers, surpassing existing methods by large margins. Our approach also demonstrates remarkable improvements in open-vocabulary segmentation and multi-label recognition tasks. For example, the mAP improvement on NUS-Wide multi-label recognition is 4.41% without any additional training, and CLIP Surgery surpasses the state-of-the-art method by 8.74% in mIoU on Cityscapes open-vocabulary semantic segmentation. Furthermore, our method benefits other tasks, including multimodal visualization and interactive segmentation, such as with the Segment Anything Model (SAM). The code is available at https://github.com/xmed-lab/CLIP_Surgery
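
The "predicted similarity map" mentioned in the abstract can be illustrated with a minimal sketch: the cosine similarity between each image patch feature and one text feature, arranged on the patch grid. This is only a generic illustration and not the CLIP Surgery method itself. The function names, the row-major patch order, and the toy feature values are assumptions made for this example; real features would come from a CLIP image and text encoder.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_map(patch_features, text_feature, grid_h, grid_w):
    """Cosine similarity of every patch feature with one text feature.

    patch_features: list of grid_h * grid_w feature vectors, assumed to be
    in row-major order. Returns a grid_h x grid_w nested list.
    """
    if len(patch_features) != grid_h * grid_w:
        raise ValueError("expected grid_h * grid_w patch features")
    text = l2_normalize(text_feature)
    sims = [
        sum(p * t for p, t in zip(l2_normalize(patch), text))
        for patch in patch_features
    ]
    # Reshape the flat list of similarities into rows of the patch grid.
    return [sims[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]

# Toy 2x2 grid of 3-dimensional patch features (made-up numbers).
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
           [1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
text = [2.0, 0.0, 0.0]
sim = similarity_map(patches, text, grid_h=2, grid_w=2)
for row in sim:
    print(["%.3f" % value for value in row])
```

In a map like this, high values mark patches whose features align with the text. The abstract's observation is that, for unmodified CLIP, such high values tend to fall on background regions rather than on the named object.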

Code

xmed-lab/clip_surgery (official), 289 stars

xmed-lab/clipn, 101 stars

Tasks

Interactive Segmentation

Open Vocabulary Semantic Segmentation

Segmentation

Semantic Segmentation

Zero-Shot Learning

Zero Shot Segmentation

Datasets

ImageNet

MS COCO

Cityscapes

NUS-WIDE

PASCAL Context

COCO-Stuff

ImageNet-S

Results from the Paper

Ranked #2 on Open Vocabulary Semantic Segmentation on COCO-Stuff-171 (mIoU metric)
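
For readers unfamiliar with the mIoU metric used in these rankings, the following is a minimal sketch of mean intersection-over-union on flattened label lists. It is not the benchmark's evaluation code: the function name `mean_iou`, the `ignore_index=255` convention, and the toy labels are assumptions for illustration, and real evaluations accumulate counts over a whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that occur in the prediction or the target.

    pred, target: equal-length sequences of integer class labels.
    Pixels whose target equals ignore_index (an assumed convention) are skipped.
    """
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1             # false negative for the true class
            if 0 <= p < num_classes:
                union[p] += 1         # false positive for the predicted class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")

# Toy example with three classes (made-up labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 3))  # per-class IoUs are 1/3, 2/3, 1/2, so the mean is 0.5
```

The function returns a fraction between 0 and 1; leaderboard values such as 21.9 are presumably the same quantity expressed as a percentage.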

Methods

CLIP
