Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

CVPR 2023  ·  Junbum Cha, Jonghwan Mun, Byungseok Roh

We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images, by using only image-text pairs without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy, since they only consider image-text alignment during training, whereas segmentation requires region-text alignment during testing. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages a model to directly improve the quality of generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance with large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
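
The three steps named in the abstract (generate a mask for a text, pool image features inside that mask, contrast the pooled embedding against the text embedding) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the mask generator here is simply a sigmoid over scaled patch-text cosine similarity, the loss uses only the image-to-text direction, and the function names (`text_grounded_mask`, `masked_pool`, `tcl_style_loss`) and the temperature `tau` are hypothetical choices made for illustration.

```python
import math

def normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def text_grounded_mask(patches, text, tau=0.1):
    # Assumption: a soft mask from patch-text similarity stands in for
    # the learned mask generator described in the paper.
    return [1.0 / (1.0 + math.exp(-cosine(p, text) / tau)) for p in patches]

def masked_pool(patches, mask):
    # Mask-weighted average of patch features: the text-grounded image embedding.
    total = sum(mask) or 1.0
    dim = len(patches[0])
    return [sum(m * p[d] for m, p in zip(mask, patches)) / total
            for d in range(dim)]

def tcl_style_loss(batch_patches, batch_texts, tau=0.1):
    # InfoNCE over the batch (image-to-text direction only): image i should
    # score highest against its own text i after text-grounded pooling.
    losses = []
    for i, patches in enumerate(batch_patches):
        logits = []
        for text in batch_texts:
            mask = text_grounded_mask(patches, text, tau)
            grounded = masked_pool(patches, mask)
            logits.append(cosine(grounded, text) / tau)
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

# Toy batch: two images of three 2-D patch features each, two text embeddings.
images = [
    [[1.0, 0.0], [0.9, 0.1], [0.1, 0.2]],
    [[0.0, 1.0], [0.1, 0.9], [0.2, 0.1]],
]
texts = [[1.0, 0.0], [0.0, 1.0]]
matched = tcl_style_loss(images, texts)
swapped = tcl_style_loss(images, texts[::-1])
```

In this toy batch the loss for correctly paired texts is lower than the loss for swapped texts, which is the behavior a region-text contrastive objective is meant to reward. Because the pooled embedding depends on the mask, the gradient of such a loss would flow into the mask generator in a differentiable implementation, which is the mechanism the abstract credits for improving mask quality.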

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Unsupervised Semantic Segmentation with Language-image Pre-training | ADE20K | TCL | Mean IoU (val) | 17.1 | # 2 |
| Semantic Segmentation | CC3M-TagMask | TCL | mIoU | 60.4 | # 2 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | Cityscapes val | TCL | mIoU | 24.0 | # 5 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Object | TCL | mIoU | 31.6 | # 4 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Stuff-171 | TCL | mIoU | 22.4 | # 3 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | TCL | mIoU | 33.9 | # 16 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL Context-59 | TCL | mIoU | 33.9 | # 3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL VOC | TCL | mIoU | 55.0 | # 3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PascalVOC-20 | TCL | mIoU | 83.2 | # 2 |
| Open Vocabulary Semantic Segmentation | PascalVOC-20 | TCL | mIoU | 83.2 | # 12 |
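
For readers unfamiliar with the metric reported above, mean IoU (mIoU) averages the per-class intersection-over-union between predicted and ground-truth label maps. The sketch below is a minimal illustration on flattened label lists, not the evaluation code behind these numbers; the function name `mean_iou` and the `ignore_index=255` default are assumptions (255 is a common convention for unlabeled pixels), and it assumes all non-ignored ground-truth labels lie in `range(num_classes)`.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    # pred and target are equal-length flat lists of integer class labels.
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: excluded from the metric
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    # Average only over classes that appear in the prediction or ground truth.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0

# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Benchmarks usually accumulate the intersection and union counts over the whole validation set before dividing, and report the result as a percentage, so a table entry of 33.9 would correspond to a value of 0.339 on this scale.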

Methods