Zero-Shot Transfer Image Classification

16 papers with code • 16 benchmarks • 8 datasets

This task has no description! Would you like to contribute one?

Libraries

Use these libraries to find Zero-Shot Transfer Image Classification models and implementations

Most implemented papers

Learning Transferable Visual Models From Natural Language Supervision

openai/CLIP 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

kakaobrain/coyo-dataset 11 Feb 2021

In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset.

LiT: Zero-Shot Transfer with Locked-image text Tuning

google-research/vision_transformer CVPR 2022

This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training.

EVA-CLIP: Improved Training Techniques for CLIP at Scale

baaivision/eva 27 Mar 2023

Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs.

Your Diffusion Model is Secretly a Zero-Shot Classifier

diffusion-classifier/diffusion-classifier ICCV 2023

Our generative approach to classification, which we call Diffusion Classifier, attains strong results on a variety of benchmarks and outperforms alternative methods of extracting knowledge from diffusion models.

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

baaivision/eva 6 Feb 2024

Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models.

Florence: A New Foundation Model for Computer Vision

microsoft/unicl 22 Nov 2021

Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.

PaLI: A Jointly-Scaled Multilingual Language-Image Model

google-research/big_vision 14 Sep 2022

PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages.

AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

flagai-open/flagai 12 Nov 2022

In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.