Perceptual Grouping in Contrastive Vision-Language Models

ICCV 2023 · Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, Jonathon Shlens

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.

Code

kahnchana/clippy

Tasks

Object Localization

Representation Learning

Unsupervised Semantic Segmentation

Unsupervised Semantic Segmentation with Language-image Pre-training

Datasets

ImageNet

MS COCO

Cityscapes

ADE20K

PASCAL VOC 2007

CC12M

Results from the Paper

Ranked #1 on Unsupervised Semantic Segmentation with Language-image Pre-training on MS COCO

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Unsupervised Semantic Segmentation with Language-image Pre-training	ADE20K	CLIPpy ViT-B	Mean IoU (val)	13.5	# 4
Unsupervised Semantic Segmentation with Language-image Pre-training	Cityscapes val	CLIPpy ViT-B	mIoU	18.1	# 7
Unsupervised Semantic Segmentation with Language-image Pre-training	MS COCO	CLIPpy ViT-B	Mean IoU (val)	25.5	# 1
Unsupervised Semantic Segmentation with Language-image Pre-training	PASCAL VOC 2007	CLIPpy ViT-B	Mean IoU (val)	52.2	# 1
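
Every row above reports mean IoU (mIoU), presumably as a percentage. As a reminder of what that metric measures, here is a minimal pure-Python sketch that scores one predicted label map against one ground-truth map. The function name `mean_iou`, the `ignore_index` parameter, and the toy label maps are illustrative assumptions and are not taken from the paper or from the kahnchana/clippy repository. Whether each benchmark averages per image or accumulates counts over the whole validation set is not stated on this page, so the sketch should not be read as the official evaluation protocol.

```python
from typing import Optional, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int,
             ignore_index: Optional[int] = None) -> float:
    """Mean intersection-over-union across classes for flattened label maps."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")

    intersection = [0] * num_classes  # pixels where prediction and label agree
    pred_area = [0] * num_classes     # pixels predicted as each class
    target_area = [0] * num_classes   # pixels labeled as each class

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip unlabeled pixels (assumed convention)
        target_area[t] += 1
        pred_area[p] += 1
        if p == t:
            intersection[t] += 1

    ious = []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersection[c]
        if union > 0:  # classes absent from both maps are left out of the mean
            ious.append(intersection[c] / union)
    if not ious:
        raise ValueError("no valid pixels to score")
    return sum(ious) / len(ious)


# Toy 2x3 label maps, flattened row by row (hypothetical data).
toy_target = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_score = mean_iou(toy_pred, toy_target, num_classes=3)
print(f"toy mIoU: {100 * toy_score:.1f}")  # prints "toy mIoU: 50.0"
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. A class that is over-predicted (class 1 here) is penalized through the union term even when all of its labeled pixels are recovered.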

Methods

No methods listed for this paper.
