1 code implementation • 18 Oct 2022 • Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun
Existing vision-text contrastive learning methods such as CLIP aim to match paired image and caption embeddings while pushing unpaired ones apart, which improves representation transferability and supports zero-shot prediction.
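The CLIP-style objective described above can be sketched as a symmetric cross-entropy over an image-text similarity matrix, where matched pairs sit on the diagonal. The following is a minimal NumPy illustration, not the authors' implementation; the function name, shapes, and temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired image/text embeddings (sketch)."""
    # L2-normalize so the dot product becomes cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N); entry (i, j) scores image i vs. caption j
    n = len(img)

    def xent(l):
        # cross-entropy with the diagonal (the true pairing) as the target
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average the image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Minimizing this loss pulls each image toward its own caption while pushing it away from the other captions in the batch, which is what yields transferable, zero-shot-capable representations.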