A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models

In recent years, the rapid evolution of computer vision has seen the emergence of various foundation models, each tailored to specific data types and tasks. In this study, we explore the adaptation of these models for few-shot semantic segmentation. Specifically, we conduct a comprehensive comparative analysis of four prominent foundation models (DINO V2, Segment Anything, CLIP, and Masked AutoEncoders) alongside a straightforward ResNet50 pre-trained on the COCO dataset. We also include five adaptation methods, ranging from linear probing to fine-tuning. Our findings show that DINO V2 outperforms the other models by a large margin across various datasets and adaptation methods. In contrast, the adaptation methods yield little difference in the results, suggesting that simple linear probing can compete with more advanced, computationally intensive alternatives.
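
To illustrate the simplest adaptation method mentioned above, the sketch below trains only a per-patch linear head on top of a frozen DINO V2 backbone and upsamples the patch-level predictions to pixel resolution. This is a minimal sketch, not the paper's code: the `torch.hub` entry point, the `x_norm_patchtokens` output key, and the chosen backbone size follow the public DINOv2 release and are assumptions here.

```python
# Minimal linear-probing sketch: a frozen DINO V2 backbone with a trainable
# per-patch linear classifier for semantic segmentation.
# The hub entry point and feature key are assumptions based on the public
# DINOv2 release, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearProbeSegmenter(nn.Module):
    def __init__(self, num_classes: int, backbone_name: str = "dinov2_vits14"):
        super().__init__()
        # Frozen foundation-model backbone: no gradient updates.
        self.backbone = torch.hub.load("facebookresearch/dinov2", backbone_name)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.patch_size = 14
        embed_dim = self.backbone.embed_dim  # 384 for ViT-S/14
        # The only trainable part: a linear classifier applied to each patch token.
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, _, h, w = images.shape
        with torch.no_grad():
            # Patch tokens of shape (B, N, C), where N = (h/14) * (w/14).
            feats = self.backbone.forward_features(images)["x_norm_patchtokens"]
        gh, gw = h // self.patch_size, w // self.patch_size
        feats = feats.permute(0, 2, 1).reshape(b, -1, gh, gw)
        logits = self.head(feats)
        # Upsample patch-level logits back to pixel resolution.
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)


if __name__ == "__main__":
    # Usage: image sides must be multiples of the 14-pixel patch size.
    model = LinearProbeSegmenter(num_classes=21)
    out = model(torch.randn(1, 3, 448, 448))  # -> (1, 21, 448, 448)
```

Fine-tuning variants would instead unfreeze some or all backbone parameters; the abstract's finding is that this extra cost buys little over the frozen-backbone probe above.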
