LiT: Zero-Shot Transfer with Locked-image text Tuning

This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 85.2% zero-shot transfer accuracy on the ImageNet test set, and 82.5% on the challenging out-of-distribution ObjectNet test set.
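
To make the "locked image tower" idea concrete, below is a minimal PyTorch-style sketch of one LiT training step under stated assumptions: the encoder interfaces, batch shapes, and fixed temperature are illustrative choices, not the authors' exact implementation (which, like CLIP, uses a learned temperature and very large batches). The image tower is frozen and only the text tower receives gradients.

```python
import torch
import torch.nn.functional as F

def lit_training_step(image_encoder, text_encoder, images, texts,
                      temperature=0.01):
    """One Locked-image Tuning (LiT) step (sketch, not the authors' code).

    The pre-trained image tower is locked: it runs under no_grad and its
    weights never update. Only the text tower is trained.
    """
    with torch.no_grad():                          # locked image tower
        img_emb = image_encoder(images)            # (B, D)
    txt_emb = text_encoder(texts)                  # (B, D), trainable

    # L2-normalize so dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric CLIP-style contrastive (InfoNCE) loss: each image should
    # match its own caption, and each caption its own image.
    loss_i = F.cross_entropy(logits, labels)       # image -> text
    loss_t = F.cross_entropy(logits.t(), labels)   # text -> image
    return (loss_i + loss_t) / 2
```

Because gradients never flow into the image tower, image embeddings for the training set can even be precomputed once and reused, which is part of what makes LiT cheap relative to training both towers from scratch.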

CVPR 2022
| Task | Dataset | Model | Metric Name | Metric Value (%) | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Transfer Image Classification | ImageNet | LiT-tuning | Accuracy (Private) | 84.5 | #7 |
| Zero-Shot Transfer Image Classification | ImageNet | LiT-tuning | Accuracy (Public) | 75.7 | #2 |
| Zero-Shot Transfer Image Classification | ImageNet-A | LiT-tuning | Accuracy (Private) | 79.4 | #9 |
| Zero-Shot Transfer Image Classification | ImageNet-A | LiT-tuning | Accuracy (Public) | 37.8 | #1 |
| Zero-Shot Transfer Image Classification | ImageNet-R | LiT-tuning | Accuracy | 93.9 | #8 |
| Zero-Shot Transfer Image Classification | ImageNet ReaL | LiT-tuning | Accuracy (Private) | 88.0 | #1 |
| Zero-Shot Transfer Image Classification | ImageNet ReaL | LiT-tuning | Accuracy (Public) | 82.2 | #1 |
| Zero-Shot Transfer Image Classification | ImageNet V2 | LiT-tuning | Accuracy (Private) | 78.7 | #6 |
| Zero-Shot Transfer Image Classification | ImageNet V2 | LiT-tuning | Accuracy (Public) | 66.6 | #1 |
| Image Classification | ObjectNet | LiT | Top-1 Accuracy | 82.5 | #2 |
| Zero-Shot Transfer Image Classification | ObjectNet | LiT-tuning | Accuracy (Private) | 81.1 | #5 |
| Zero-Shot Transfer Image Classification | ObjectNet | LiT-tuning | Accuracy (Public) | 54.5 | #1 |

Here "(Private)" and "(Public)" denote the leaderboard splits for models pre-trained on private versus publicly available data.
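
The zero-shot rows above are measured without any training on the target dataset's labels: class names are embedded with the tuned text tower, and each image is assigned to the class whose text embedding is most similar. The sketch below illustrates this under stated assumptions; the prompt template and the `tokenizer` interface are hypothetical choices, not the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer,
                       images, class_names):
    """Zero-shot transfer: build a classifier from class-name embeddings.

    The prompt template and tokenizer call are illustrative assumptions.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    txt_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)  # (C, D)
    img_emb = F.normalize(image_encoder(images), dim=-1)             # (B, D)
    # Cosine similarity against every class embedding; argmax = prediction.
    return (img_emb @ txt_emb.t()).argmax(dim=-1)                    # (B,)
```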
