Unsupervised Semantic Segmentation with Language-image Pre-training

12 papers with code • 12 benchmarks • 7 datasets

A semantic segmentation task that uses no human supervision other than a backbone initialised with features pre-trained on image-level labels.
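A minimal sketch of the common inference recipe shared by the methods listed below, assuming a CLIP-like language-image backbone: dense patch features are matched against text embeddings of candidate class names, and each patch takes the label with the highest cosine similarity. The encoders are stubbed with random tensors here; in practice they would come from the pre-trained model.

```python
import torch
import torch.nn.functional as F

num_classes, dim, h, w = 21, 512, 14, 14   # e.g. 21 classes over a 14x14 patch grid

# Placeholder for text embeddings of prompts such as "a photo of a {class}".
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)     # (C, D)

# Placeholder for per-patch features from the visual encoder.
patch_emb = F.normalize(torch.randn(h * w, dim), dim=-1)          # (HW, D)

# Cosine similarity between every patch and every class prompt.
logits = patch_emb @ text_emb.t()                                 # (HW, C)

# Per-patch label map, upsampled to pixel resolution for the final mask.
label_map = logits.argmax(dim=-1).reshape(1, 1, h, w).float()
pixel_mask = F.interpolate(label_map, size=(224, 224), mode="nearest").long()
print(pixel_mask.shape)  # torch.Size([1, 1, 224, 224])
```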

Most implemented papers

GroupViT: Semantic Segmentation Emerges from Text Supervision

NVlabs/GroupViT CVPR 2022

With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning.
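A usage sketch, assuming the Hugging Face transformers port of GroupViT (checkpoint nvidia/groupvit-gcc-yfcc); it shows only image-level zero-shot matching against free-form text prompts, the same text supervision that drives the emergent grouping, while the NVlabs/GroupViT repository provides the full zero-shot segmentation pipeline.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, GroupViTModel

model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")
processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image, return_tensors="pt", padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; softmax gives zero-shot label probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)
```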

ReCo: Retrieve and Co-segment for Zero-shot Transfer

NoelShin/reco 14 Jun 2022

Semantic segmentation has a broad range of applications, but its real-world impact has been significantly limited by the prohibitive annotation costs necessary to enable deployment.

Perceptual Grouping in Contrastive Vision-Language Models

kahnchana/clippy ICCV 2023

In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.

Extract Free Dense Labels from CLIP

chongzhou96/maskclip 2 Dec 2021

Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.

Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

kakaobrain/tcl CVPR 2023

Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task.
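For reference, a minimal sketch of the image-level image-text contrastive (InfoNCE) objective the excerpt refers to, in the CLIP style; it is not TCL's full text-grounded masking objective, and the embeddings are stubbed with random tensors.

```python
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 512, 0.07
img = F.normalize(torch.randn(batch, dim), dim=-1)   # image embeddings
txt = F.normalize(torch.randn(batch, dim), dim=-1)   # paired caption embeddings

logits = img @ txt.t() / temperature                 # (B, B) similarity matrix
targets = torch.arange(batch)                        # matching pairs lie on the diagonal

# Symmetric cross-entropy over image->text and text->image directions.
loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
print(loss.item())
```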

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

linyq2117/tagclip 20 Dec 2023

We dissect the preservation of patch-wise spatial information in CLIP and propose a local-to-global framework to obtain image tags.

TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

Qinying-Liu/TagAlign 21 Dec 2023

The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data.

TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

shjo-april/TTD 30 Mar 2024

We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias.

ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

mc-lan/proxyclip 9 Aug 2024

ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency and maintaining CLIP's exceptional zero-shot transfer capacity.
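An illustrative sketch of the proxy-attention idea, under the assumption that patch affinities from a vision foundation model (e.g. DINO) can simply reweight CLIP's dense features; it is not ProxyCLIP's exact formulation, and all features are stubbed with random tensors.

```python
import torch
import torch.nn.functional as F

n, d_vfm, d_clip = 196, 768, 512                         # 14x14 patches
vfm_feats = F.normalize(torch.randn(n, d_vfm), dim=-1)   # VFM patch features
clip_feats = torch.randn(n, d_clip)                      # CLIP dense features

# Proxy attention: VFM feature correspondence used as the attention map.
affinity = vfm_feats @ vfm_feats.t()                     # (N, N) cosine similarities
attn = F.softmax(affinity / 0.07, dim=-1)                # sharpen and row-normalize

# Refined features inherit the VFM's local consistency while staying in CLIP's
# embedding space for zero-shot matching against text.
refined = attn @ clip_feats                              # (N, D)
print(refined.shape)
```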

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

YuHengsss/Trident 14 Nov 2024

Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation.
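A schematic sketch of the pipeline described above, with every encoder output stubbed by random tensors and a random matrix standing in for the SAM-derived correlations; it is not the YuHengsss/Trident implementation, only an illustration of splicing sub-image features and aggregating them globally.

```python
import torch
import torch.nn.functional as F

grid, p, d = 2, 14, 512                    # 2x2 sub-images, 14x14 patches each

# Placeholder dense features (e.g. fused CLIP + DINO) for each sub-image.
sub_feats = torch.randn(grid * grid, p * p, d)

# Splice the per-sub-image features into one global patch sequence.
global_feats = sub_feats.reshape(grid * grid * p * p, d)           # (N, D)

# Placeholder correlation matrix over all patches, standing in for the
# affinities produced by SAM's encoder.
corr = F.softmax(torch.randn(global_feats.shape[0], global_feats.shape[0]), dim=-1)

# Global aggregation: every patch attends to all others, broadening the
# receptive field beyond its own sub-image.
aggregated = corr @ global_feats                                   # (N, D)
print(aggregated.shape)
```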