MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations

15 Feb 2024 · Benedikt Alkin, Lukas Miklautz, Sepp Hochreiter, Johannes Brandstetter

We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning boost for pre-trained MIM models. The motivation behind MIM-Refiner is rooted in the insight that optimal representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner leverages multiple contrastive heads that are connected to diverse intermediate layers. In each head, a modified nearest neighbor objective helps to construct respective semantic clusters. The refinement process is short but effective. Within a few epochs, we refine the features of MIM models from subpar to state-of-the-art, off-the-shelf features. Refining a ViT-H, pre-trained with data2vec 2.0 on ImageNet-1K, achieves new state-of-the-art results in linear probing (84.7%) and low-shot classification among models that are pre-trained on ImageNet-1K. In ImageNet-1K 1-shot classification, MIM-Refiner sets a new state-of-the-art of 64.2%, outperforming larger models that were trained on up to 2000x more data such as DINOv2-g, OpenCLIP-G and MAWS-6.5B. Project page: https://ml-jku.github.io/MIM-Refiner
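
The idea of attaching contrastive heads with a nearest-neighbor objective to several intermediate layers can be illustrated with a small sketch. This is not the authors' implementation: it uses a plain NNCLR-style nearest-neighbor swap rather than the modified objective the abstract mentions, and the function names, the queue of past embeddings, the temperature of 0.2, and the averaging over heads are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nn_contrastive_loss(anchors, positives, queue, temperature=0.2):
    """NNCLR-style InfoNCE loss for one head (illustrative, not the paper's variant).

    anchors[i] and positives[i] are embeddings of two views of sample i.
    Each anchor is replaced by its nearest neighbor in `queue`, and that
    neighbor is contrasted against all positives in the batch.
    """
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    queue = [l2_normalize(q) for q in queue]

    total = 0.0
    for i, anchor in enumerate(anchors):
        # Nearest-neighbor swap: pick the most similar queue entry.
        neighbor = max(queue, key=lambda q: dot(anchor, q))
        logits = [dot(neighbor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive as the target.
        total += log_denominator - logits[i]
    return total / len(anchors)


def multi_head_loss(views_per_layer, queue_per_layer, temperature=0.2):
    """Average the per-head losses over heads attached to different layers.

    views_per_layer: list of (anchors, positives) pairs, one per head.
    queue_per_layer: list of queues, one per head.
    Averaging (rather than summing or weighting) is an assumption.
    """
    losses = [
        nn_contrastive_loss(anchors, positives, queue, temperature)
        for (anchors, positives), queue in zip(views_per_layer, queue_per_layer)
    ]
    return sum(losses) / len(losses)
```

In this sketch each head keeps its own queue, so the neighbor lookup happens in the embedding space of the layer that head is attached to. The loss is small when the nearest neighbor of one view lines up with the other view of the same image, and large otherwise.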

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Clustering | ImageNet | MIM-Refiner (D2V2-ViT-H/14) | NMI | 87.2 | #1 |
| | | | Accuracy | 67.3 | #1 |
| | | | ARI | 42.2 | #4 |
| Image Clustering | ImageNet | MIM-Refiner (MAE-ViT-H/14) | NMI | 85.3 | #2 |
| | | | Accuracy | 64.6 | #2 |
| | | | ARI | 45.5 | #3 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (MAE-ViT-L/16) | Top 1 Accuracy | 82.8% | #9 |
| | | | Number of Params | 307M | #16 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (D2V2-ViT-L/16) | Top 1 Accuracy | 83.5% | #8 |
| | | | Number of Params | 307M | #16 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (MAE-ViT-H/14) | Top 1 Accuracy | 83.7% | #7 |
| | | | Number of Params | 632M | #6 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (MAE-ViT-2B/14) | Top 1 Accuracy | 84.5% | #5 |
| | | | Number of Params | 1890M | #2 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (D2V2-ViT-H/14) | Top 1 Accuracy | 84.7% | #4 |
| | | | Number of Params | 632M | #6 |

Methods