Let Go of Your Labels with Unsupervised Transfer

Foundation vision-language models have enabled remarkable zero-shot transferability of pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still requires human guidance to define the visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal-margin classifiers in the representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision or task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, although fully unsupervised, TURTLE outperforms zero-shot transfer baselines on a wide range of datasets. In particular, TURTLE matches the average performance of CLIP zero-shot across the 26 datasets while employing the same representation space, spanning a wide range of architectures and model sizes. By guiding the search for the underlying labeling with the representation spaces of two foundation models, TURTLE surpasses zero-shot transfer and unsupervised prompt-tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.
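The guiding principle can be made concrete with a short sketch. The code below is a hypothetical, minimal illustration of the idea, not the authors' released TURTLE implementation: it assumes pre-extracted features from two frozen foundation models (`feats_a`, `feats_b`, e.g. CLIP and DINOv2 embeddings), parameterizes a soft labeling of the dataset directly, and alternates between fitting linear classifiers to the current labeling in each representation space and updating the labeling so that those classifiers reproduce it well, with an entropy term that prevents all samples from collapsing into one cluster. The function name and all hyperparameters are illustrative placeholders.

```python
# Hypothetical sketch of unsupervised transfer via maximum-margin labeling
# search -- NOT the authors' released TURTLE code. Requires PyTorch >= 1.10
# (soft probability targets in F.cross_entropy).
import torch
import torch.nn.functional as F

def search_labeling(feats_a, feats_b, n_classes, steps=200, inner=10, lr=1e-2):
    """feats_a, feats_b: (n, d_a) and (n, d_b) frozen features of one dataset."""
    n = feats_a.shape[0]
    # The object we search over: a soft labeling of the whole dataset.
    logits_y = torch.zeros(n, n_classes, requires_grad=True)
    opt_y = torch.optim.Adam([logits_y], lr=lr)
    for _ in range(steps):
        y = F.softmax(logits_y, dim=1)
        loss = 0.0
        for feats in (feats_a, feats_b):
            # Inner problem: fit a linear classifier to the current labeling
            # in this representation space (labeling held fixed via detach).
            w = torch.zeros(feats.shape[1], n_classes, requires_grad=True)
            opt_w = torch.optim.Adam([w], lr=1e-2)
            for _ in range(inner):
                ce = F.cross_entropy(feats @ w, y.detach())
                opt_w.zero_grad(); ce.backward(); opt_w.step()
            # Outer objective: a labeling is good if a linear classifier in
            # each space can reproduce it with low loss (a margin surrogate).
            loss = loss + F.cross_entropy(feats @ w.detach(), y)
        # Adding sum(p * log p) of the average assignment maximizes its
        # entropy, keeping clusters balanced (no single-cluster collapse).
        mean_y = y.mean(dim=0)
        loss = loss + (mean_y * mean_y.clamp_min(1e-8).log()).sum()
        opt_y.zero_grad(); loss.backward(); opt_y.step()
    return logits_y.detach().argmax(dim=1)  # hard cluster assignments
```

The paper formulates this search as a bilevel optimization over frozen representations; the sketch only conveys why a labeling that is linearly separable with large margin in two independent representation spaces is a usable fully unsupervised signal.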


Results from the Paper


| Task | Dataset | Model | Metric | Value (%) | Global Rank |
|------|---------|-------|--------|-----------|-------------|
| Image Clustering | Birdsnap | TURTLE (CLIP + DINOv2) | Accuracy | 68.1 | # 1 |
| Image Clustering | Caltech-101 | TURTLE (CLIP + DINOv2) | Accuracy | 89.8 | # 1 |
| Image Clustering | CIFAR-10 | TURTLE (CLIP + DINOv2) | Accuracy | 99.5 | # 1 |
| Image Clustering | CIFAR-10 | TURTLE (CLIP + DINOv2) | NMI | 98.5 | # 1 |
| Image Clustering | CIFAR-10 | TURTLE (CLIP + DINOv2) | ARI | 98.9 | # 1 |
| Unsupervised Image Classification | CIFAR-10 | TURTLE (CLIP + DINOv2) | Accuracy | 99.5 | # 1 |
| Image Clustering | CIFAR-100 | TURTLE (CLIP + DINOv2) | Accuracy | 89.9 | # 1 |
| Image Clustering | CIFAR-100 | TURTLE (CLIP + DINOv2) | NMI | 91.5 | # 1 |
| Image Clustering | CIFAR-100 | TURTLE (CLIP + DINOv2) | ARI | 83.2 | # 1 |
| Image Clustering | CLEVR Counts | TURTLE (CLIP + DINOv2) | Accuracy | 24.0 | # 1 |
| Image Clustering | Country211 | TURTLE (CLIP + DINOv2) | Accuracy | 11.1 | # 1 |
| Image Clustering | DTD | TURTLE (CLIP + DINOv2) | Accuracy | 57.3 | # 1 |
| Image Clustering | EuroSAT | TURTLE (CLIP + DINOv2) | Accuracy | 96.6 | # 1 |
| Image Clustering | FER2013 | TURTLE (CLIP + DINOv2) | Accuracy | 36.2 | # 1 |
| Image Clustering | FGVC Aircraft | TURTLE (CLIP + DINOv2) | Accuracy | 36.5 | # 1 |
| Image Clustering | Flowers-102 | TURTLE (CLIP + DINOv2) | Accuracy | 99.6 | # 1 |
| Image Clustering | Food-101 | TURTLE (CLIP + DINOv2) | Accuracy | 92.2 | # 1 |
| Image Clustering | GTSRB | TURTLE (CLIP + DINOv2) | Accuracy | 48.4 | # 1 |
| Image Clustering | Hateful Memes | TURTLE (CLIP + DINOv2) | Accuracy | 54.2 | # 1 |
| Unsupervised Image Classification | ImageNet | TURTLE (CLIP + DINOv2) | Accuracy | 72.9 | # 1 |
| Unsupervised Image Classification | ImageNet | TURTLE (CLIP + DINOv2) | ARI | 62.5 | # 1 |
| Image Clustering | ImageNet | TURTLE (CLIP + DINOv2) | NMI | 88.2 | # 1 |
| Image Clustering | ImageNet | TURTLE (CLIP + DINOv2) | Accuracy | 72.9 | # 1 |
| Image Clustering | ImageNet | TURTLE (CLIP + DINOv2) | ARI | 62.5 | # 1 |
| Image Clustering | Kinetics-700 | TURTLE (CLIP + DINOv2) | Accuracy | 43.0 | # 1 |
| Image Clustering | KITTI | TURTLE (CLIP + DINOv2) | Accuracy | 39.4 | # 1 |
| Unsupervised Image Classification | MNIST | TURTLE (CLIP + DINOv2) | Accuracy | 97.8 | # 3 |
| Image Clustering | MNIST | TURTLE (CLIP + DINOv2) | Accuracy | 97.8 | # 1 |
| Image Clustering | Oxford-IIIT Pets | TURTLE (CLIP + DINOv2) | Accuracy | 92.3 | # 1 |
| Image Clustering | PCam | TURTLE (CLIP + DINOv2) | Accuracy | 52.0 | # 1 |
| Image Clustering | Rendered SST2 | TURTLE (CLIP + DINOv2) | Accuracy | 51.6 | # 1 |
| Image Clustering | RESISC45 | TURTLE (CLIP + DINOv2) | Accuracy | 89.6 | # 1 |
| Image Clustering | Stanford Cars | TURTLE (CLIP + DINOv2) | Accuracy | 64.6 | # 1 |
| Unsupervised Image Classification | STL-10 | TURTLE (CLIP + DINOv2) | Accuracy | 99.7 | # 1 |
| Image Clustering | STL-10 | TURTLE (CLIP + DINOv2) | Accuracy | 99.7 | # 1 |
| Image Clustering | STL-10 | TURTLE (CLIP + DINOv2) | NMI | 99.3 | # 1 |
| Image Clustering | STL-10 | TURTLE (CLIP + DINOv2) | ARI | 99.4 | # 1 |
| Image Clustering | SUN397 | TURTLE (CLIP + DINOv2) | Accuracy | 67.9 | # 1 |
| Image Clustering | UCF101 | TURTLE (CLIP + DINOv2) | Accuracy | 82.3 | # 1 |
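The metrics in the table are standard clustering measures: accuracy is computed after matching predicted cluster indices to ground-truth classes (conventionally via the Hungarian algorithm), while NMI and ARI compare the two partitions directly. Below is a minimal sketch of this evaluation using SciPy and scikit-learn; the paper's exact evaluation scripts may differ.

```python
# Standard clustering metrics: accuracy via Hungarian matching, plus NMI/ARI.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one matchings of cluster ids to class ids."""
    k = max(y_true.max(), y_pred.max()) + 1
    # cost[t, p] counts samples of class t assigned to cluster p.
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(cost, maximize=True)  # Hungarian matching
    return cost[rows, cols].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])   # same partition, permuted cluster ids
print(clustering_accuracy(y_true, y_pred))           # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
print(adjusted_rand_score(y_true, y_pred))           # 1.0
```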
