TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Unsupervised Semantic Segmentation with Language-image Pre-training	ADE20K	GroupViT (RedCaps)	Mean IoU (val)	9.2	#8
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Object	GroupViT (RedCaps)	mIoU	27.5	#5
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Stuff-171	GroupViT	mIoU	11.1	#7
Unsupervised Semantic Segmentation with Language-image Pre-training	PASCAL Context-59	GroupViT (RedCaps)	mIoU	23.4	#6
Unsupervised Semantic Segmentation with Language-image Pre-training	PascalVOC-20	GroupViT (RedCaps)	mIoU	79.7	#3
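
The metric in the table above is mean intersection-over-union (mIoU), reported as a percentage. As a rough illustration of how such a score is computed, here is a minimal pure-Python sketch. The function name `mean_iou`, the `ignore_index` value of 255, and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example, not details taken from the benchmark or the paper. Real evaluations typically accumulate the per-class counts over the whole validation set before dividing.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes present in the prediction or the ground truth.

    pred and target are 2D lists of integer class ids with the same shape.
    Pixels whose ground-truth label equals ignore_index are skipped.
    """
    inter = [0] * num_classes  # per-class intersection counts
    union = [0] * num_classes  # per-class union counts
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 1]]
    # Class 0: IoU 1/2, class 1: IoU 2/3, so the mean is 7/12.
    print(round(100 * mean_iou(pred, target, num_classes=2), 1))
```

Multiplying the result by 100 gives a value on the same scale as the table entries.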

GroupViT: Semantic Segmentation Emerges from Text Supervision

CVPR 2022 · Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang

Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context, and performs competitively with state-of-the-art transfer-learning methods that require greater levels of supervision. We open-source our code at https://github.com/NVlabs/GroupViT.
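
The abstract states that GroupViT is trained jointly with a text encoder via contrastive losses. To make that idea concrete, here is a minimal pure-Python sketch of a symmetric image-text contrastive (InfoNCE-style) loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed `temperature=0.07`, and the choice to sum the two directions are all hypothetical choices for this example, and the sketch covers only a basic pairwise image-text term rather than every loss the paper may use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss for a batch of matched pairs.

    image_emb[i] and text_emb[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    # Cosine similarities scaled by the temperature (assumed fixed here).
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return cross_entropy_diag(sim) + cross_entropy_diag(sim_t)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_loss(images, aligned) < contrastive_loss(images, swapped))
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that pulls image and text representations of the same sample together during training.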

Code

NVlabs/GroupViT (official, 698 stars)

huggingface/transformers (125,796 stars)

Tasks

Object Detection

Scene Understanding

Semantic Segmentation

Transfer Learning

Unsupervised Semantic Segmentation with Language-image Pre-training

Datasets

ImageNet

ADE20K

PASCAL Context

COCO-Stuff

PASCAL VOC

CC12M

Results from the Paper

Ranked #3 on Unsupervised Semantic Segmentation with Language-image Pre-training on PascalVOC-20

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
