UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
Vision-language foundation models, exemplified by Contrastive Language-Image Pre-training (CLIP), have attracted increasing attention for jointly understanding vision and language. However, existing approaches primarily train models to match global image representations with textual descriptions, overlooking the critical alignment between local regions and their corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularity, encompassing image-level, region-level, and pixel-level captions and tags. Accordingly, we develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously equips the model with versatile perception abilities across different levels of detail. With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We believe that UMG-CLIP represents a valuable advancement in vision-language foundation models. The code is available at https://github.com/lygsbw/UMG-CLIP.
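To make the multi-granularity alignment idea concrete, the sketch below shows a standard symmetric CLIP-style contrastive loss applied at image, region, and pixel granularity and summed with per-level weights. This is a minimal illustration under our own assumptions, not the paper's actual implementation; all function and variable names (`contrastive_loss`, `multi_granularity_loss`, the `*_embeds` tensors, the loss weights) are hypothetical.

```python
# Minimal sketch of multi-granularity image-text contrastive alignment.
# All names here are illustrative assumptions, not taken from the UMG-CLIP codebase.
import torch
import torch.nn.functional as F


def contrastive_loss(vision_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; row i of each (N, D) tensor is a matched pair."""
    v = F.normalize(vision_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)       # vision -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)   # text -> vision direction
    return 0.5 * (loss_v2t + loss_t2v)


def multi_granularity_loss(image_embeds, image_text_embeds,
                           region_embeds, region_text_embeds,
                           pixel_embeds, pixel_text_embeds,
                           weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of contrastive losses at image, region, and pixel granularity.

    Each *_embeds tensor is (N, D): global image features, pooled region
    features, or pooled pixel/mask features, paired with the text embeddings
    of their captions or tags at the same granularity.
    """
    w_img, w_reg, w_pix = weights
    return (w_img * contrastive_loss(image_embeds, image_text_embeds)
            + w_reg * contrastive_loss(region_embeds, region_text_embeds)
            + w_pix * contrastive_loss(pixel_embeds, pixel_text_embeds))
```

The point of the sketch is only that the same alignment objective can be reused at each granularity, with the finer-grained terms supervising local features against region- and pixel-level captions and tags.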
Results from the Paper
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Open Vocabulary Panoptic Segmentation | ADE20K | UMG-CLIP-L/14 | PQ | 29.1 | # 3 |
| Open Vocabulary Panoptic Segmentation | ADE20K | UMG-CLIP-E/14 | PQ | 31.6 | # 1 |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | UMG-CLIP-E/14 | mIoU | 38.2 | # 1 |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | UMG-CLIP-L/14 | mIoU | 36.1 | # 6 |
| Open Vocabulary Semantic Segmentation | ADE20K-847 | UMG-CLIP-E/14 | mIoU | 17.3 | # 1 |
| Open Vocabulary Semantic Segmentation | ADE20K-847 | UMG-CLIP-L/14 | mIoU | 15.4 | # 5 |
| Panoptic Segmentation | COCO minival | UMG-CLIP-E/14 | PQ | 59.5 | # 3 |
| Panoptic Segmentation | COCO minival | UMG-CLIP-E/14 | AP | 50.7 | # 4 |
| Panoptic Segmentation | COCO minival | UMG-CLIP-E/14 | mIoU | 69.7 | # 1 |
| Panoptic Segmentation | COCO minival | UMG-CLIP-L/14 | PQ | 58.9 | # 7 |
| Panoptic Segmentation | COCO minival | UMG-CLIP-L/14 | AP | 49.7 | # 5 |
| Panoptic Segmentation | COCO minival | UMG-CLIP-L/14 | mIoU | 68.9 | # 2 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-459 | UMG-CLIP-L/14 | mIoU | 23.2 | # 5 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-459 | UMG-CLIP-E/14 | mIoU | 25.2 | # 2 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | UMG-CLIP-L/14 | mIoU | 61.0 | # 6 |
| Open Vocabulary Semantic Segmentation | PascalVOC-20 | UMG-CLIP-L/14 | mIoU | 97.9 | # 1 |
| Open Vocabulary Semantic Segmentation | PascalVOC-20b | UMG-CLIP-E/14 | mIoU | 85.4 | # 1 |