Zero-Shot Transfer Image Classification
7 papers with code • 15 benchmarks • 6 datasets
This paper presents contrastive-tuning, a simple method that employs contrastive training to align image and text models while still taking advantage of their pre-training.
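The core of such contrastive alignment is a symmetric InfoNCE-style objective: matched image-text pairs in a batch are positives, and all other pairings are negatives. A minimal numpy sketch, assuming pre-computed embedding matrices (function name, shapes, and the temperature value are illustrative, not from the paper):

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: the i-th image and i-th text are a
    positive pair; every other pairing in the batch is a negative."""
    # L2-normalize so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B) similarity matrix
    diag = np.arange(len(logits))           # matched pairs sit on the diagonal

    def xent_diag(l):
        # cross-entropy of each row against its diagonal entry
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

With identical one-hot embeddings the diagonal dominates and the loss is near zero; with mismatched pairs it grows, which is the gradient signal that pulls matched image and text embeddings together.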
In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset.
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.
Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical to solving real-world computer vision applications.
We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the outputs of the multimodal decoder, which predicts text tokens autoregressively.
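The combined objective described above is a weighted sum of the two terms. A hedged numpy sketch of such a loss, assuming pre-computed embeddings and decoder logits (the function name, argument shapes, weights, and temperature are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def _log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def combined_loss(img_emb, txt_emb, decoder_logits, caption_tokens,
                  w_con=1.0, w_cap=1.0, temperature=0.07):
    """Contrastive term on unimodal embeddings plus a captioning
    (next-token cross-entropy) term on the multimodal decoder logits.
    Shapes: img_emb/txt_emb (B, D); decoder_logits (B, T, V);
    caption_tokens (B, T) integer targets."""
    # Contrastive term: matched image-text pairs are the diagonal positives.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T / temperature
    diag = np.arange(len(sims))
    con = -0.5 * (_log_softmax(sims)[diag, diag].mean()
                  + _log_softmax(sims.T)[diag, diag].mean())
    # Captioning term: cross-entropy of decoder logits vs. target tokens.
    logp = _log_softmax(decoder_logits)
    cap = -np.take_along_axis(logp, caption_tokens[..., None], axis=-1).mean()
    return w_con * con + w_cap * cap
```

A decoder that puts high logit mass on the correct next token drives the captioning term toward zero, while the contrastive term behaves as in CLIP-style training; the two weights trade off alignment against generation quality.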