Zero-Shot Image Classification
62 papers with code • 3 benchmarks • 6 datasets
Zero-shot image classification is a technique in computer vision where a model can classify images into categories that were not present during training. This is achieved by leveraging semantic information about the categories, such as textual descriptions or relationships between classes.
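As a concrete starting point, here is a minimal sketch of this setup using a pretrained CLIP checkpoint through the Hugging Face `transformers` zero-shot image classification pipeline; the checkpoint name, image path, and candidate labels are illustrative assumptions rather than fixed choices.

```python
# Minimal sketch: zero-shot image classification with a pretrained CLIP model
# via the Hugging Face `transformers` pipeline. Checkpoint, image path, and
# labels below are example choices.
from transformers import pipeline
from PIL import Image

classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

image = Image.open("example.jpg")  # any RGB image

# None of these labels need to appear in the model's training annotations;
# they are matched against the image through a shared image-text embedding space.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a platypus"]

predictions = classifier(image, candidate_labels=labels)
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```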
Most implemented papers
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks.
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset.
LiT: Zero-Shot Transfer with Locked-image text Tuning
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training.
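As a rough illustration of the locked-image idea (not the paper's exact recipe), the sketch below freezes a pretrained image encoder and trains only the text encoder and a temperature parameter with a symmetric contrastive loss; the encoder objects, optimizer, and batch are assumed placeholders.

```python
# Sketch of locked-image contrastive tuning: the pretrained image tower is
# frozen ("locked"), and only the text tower plus a learnable log-temperature
# (logit_scale) are updated. Encoders, optimizer, and data are placeholders.
import torch
import torch.nn.functional as F

def contrastive_tuning_step(image_encoder, text_encoder, optimizer,
                            images, texts, logit_scale):
    image_encoder.eval()                          # locked tower: no updates
    with torch.no_grad():
        img = F.normalize(image_encoder(images), dim=-1)
    txt = F.normalize(text_encoder(texts), dim=-1)

    logits = logit_scale.exp() * img @ txt.t()    # pairwise image-text similarities
    targets = torch.arange(len(images), device=logits.device)
    loss = (F.cross_entropy(logits, targets) +    # image -> text direction
            F.cross_entropy(logits.t(), targets)) / 2  # text -> image direction

    optimizer.zero_grad()
    loss.backward()                               # gradients flow only to the text tower
    optimizer.step()                              # optimizer holds text params + logit_scale
    return loss.item()
```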
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP.
Reproducible scaling laws for contrastive language-image learning
To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.
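As an assumed usage sketch (the model name and pretraining tag are example choices, and available tags can be listed with `open_clip.list_pretrained()`), an OpenCLIP checkpoint trained on LAION data can be loaded and applied to zero-shot classification roughly like this:

```python
# Sketch: load an OpenCLIP model pretrained on LAION and score an image
# against a few text prompts. Model name, tag, image path, and prompts
# are illustrative.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a dog", "a photo of a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability over the candidate prompts
```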
What does a platypus look like? Generating customized prompts for zero-shot image classification
Unlike traditional classification models, open-vocabulary models classify among any arbitrary set of categories specified with natural language during inference.
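A hedged sketch of that idea: the classifier head is built purely from natural-language descriptions by embedding several prompts per class with a CLIP-style text encoder and averaging them. The prompt lists below are hand-written stand-ins for the customized, LLM-generated descriptions the paper studies, and `encode_text` is assumed to map a list of strings to a batch of embeddings.

```python
# Sketch: build zero-shot classifier weights from arbitrary class descriptions
# by averaging normalized text embeddings per class (prompt ensembling).
import torch
import torch.nn.functional as F

def build_zero_shot_weights(encode_text, prompts_per_class):
    """prompts_per_class: dict mapping class name -> list of description strings."""
    weights = []
    for class_name, prompts in prompts_per_class.items():
        emb = F.normalize(encode_text(prompts), dim=-1)   # (num_prompts, dim)
        mean = F.normalize(emb.mean(dim=0), dim=-1)       # ensemble into one vector
        weights.append(mean)
    return torch.stack(weights)                           # (num_classes, dim)

# Hand-written placeholder descriptions; any set of categories works at inference time.
prompts = {
    "platypus": ["a photo of a platypus", "a duck-billed mammal swimming in a river"],
    "echidna":  ["a photo of an echidna", "a small spiny anteater on the ground"],
}
# classifier_weights = build_zero_shot_weights(encode_text, prompts)
# logits = image_features @ classifier_weights.T
```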
A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model
However, semantic segmentation and the CLIP model operate at different visual granularities: semantic segmentation makes predictions at the pixel level, while CLIP operates on whole images.
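One common way to bridge that granularity gap, sketched below under stated assumptions (a class-agnostic mask-proposal function and a CLIP-style image encoder are taken as given), is to mask out each proposed region and classify it against text embeddings; this illustrates the general two-stage recipe, not the paper's exact pipeline.

```python
# Sketch of a two-stage open-vocabulary segmentation recipe: class-agnostic
# mask proposals are classified against text embeddings with a CLIP-style
# image encoder. `propose_masks`, `clip_encode_image`, and `text_weights`
# are assumed to exist.
import torch
import torch.nn.functional as F

def open_vocab_segment(image, propose_masks, clip_encode_image, text_weights):
    """Assign one open-vocabulary label index to every proposed mask."""
    labels = []
    for mask in propose_masks(image):                 # boolean HxW proposals
        region = image * mask.unsqueeze(0)            # masked copy (CxHxW); a real
                                                      # pipeline would also crop/resize
        feat = F.normalize(clip_encode_image(region.unsqueeze(0)), dim=-1)
        scores = feat @ text_weights.T                # similarity to each class prompt
        labels.append((mask, scores.argmax(dim=-1).item()))
    return labels                                     # list of (mask, class index)
```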
DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
Specifically, we (1) developed a cross-modal semantic grounding network to investigate the model's capability to disentangle semantic attributes from images; (2) applied an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics under attribute co-occurrence and imbalance; and (3) proposed a multi-task learning policy for considering multi-model objectives.
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
However, the compositional reasoning abilities of existing VLMs remain subpar.