Zero-Shot Image Classification

62 papers with code • 3 benchmarks • 6 datasets

Zero-shot image classification is a technique in computer vision where a model can classify images into categories that were not present during training. This is achieved by leveraging semantic information about the categories, such as textual descriptions or relationships between classes.
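
For a concrete picture of how this works in practice, here is a minimal sketch of CLIP-style zero-shot classification using the Hugging Face transformers API. The checkpoint name, image path, and label prompts are illustrative choices, not fixed parts of the task.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate categories are specified as natural-language prompts at inference
# time; none of them need to have appeared during training.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a platypus"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, softmaxed over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```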

Most implemented papers

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

computer-vision-in-the-wild/cvinw_readings 19 Apr 2022

In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

facebookresearch/metaclip 11 Feb 2021

In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used to build the Conceptual Captions dataset.
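
The core training signal behind models of this kind (ALIGN, CLIP) is a symmetric image-text contrastive (InfoNCE) loss over a batch of matched pairs. A minimal PyTorch sketch, with the temperature value as an assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each image should match its own caption,
    # and each caption its own image.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```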

LiT: Zero-Shot Transfer with Locked-image text Tuning

google-research/vision_transformer CVPR 2022

This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training.
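
In LiT-style locked-image tuning, the pre-trained image tower is frozen and only the text tower is trained with the contrastive objective. A sketch assuming hypothetical `image_encoder`, `text_encoder`, and `loader` objects, and reusing `contrastive_loss` from the sketch above; the learning rate is an assumption.

```python
import torch
from torch.optim import AdamW

def lit_tune(image_encoder, text_encoder, loader, steps=1000):
    # "Lock" the image tower: freeze its weights and keep it in eval mode.
    for p in image_encoder.parameters():
        p.requires_grad = False
    image_encoder.eval()

    opt = AdamW(text_encoder.parameters(), lr=1e-4)
    for step, (images, texts) in enumerate(loader):
        if step >= steps:
            break
        with torch.no_grad():
            img_emb = image_encoder(images)        # frozen image features
        txt_emb = text_encoder(texts)              # only the text tower learns
        loss = contrastive_loss(img_emb, txt_emb)  # loss from the sketch above
        opt.zero_grad()
        loss.backward()
        opt.step()
```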

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

tensorflow/tpu ICLR 2022

On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP.
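
ViLD's distillation component pulls the detector's region embeddings toward CLIP image embeddings of the cropped proposals, so that novel categories can later be scored against text embeddings. A rough sketch of that loss; the paper describes an L1 distance, and the shapes and names here are assumptions.

```python
import torch.nn.functional as F

def vild_image_loss(region_emb, clip_crop_emb):
    # Pull each detector region embedding toward the CLIP image embedding
    # of the corresponding cropped proposal. Both tensors assumed (N, D).
    region_emb = F.normalize(region_emb, dim=-1)
    clip_crop_emb = F.normalize(clip_crop_emb, dim=-1)
    return F.l1_loss(region_emb, clip_crop_emb)
```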

Reproducible scaling laws for contrastive language-image learning

laion-ai/scaling-laws-openclip CVPR 2023

To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.
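
The checkpoints studied in the paper are released through OpenCLIP. A short zero-shot usage sketch following the open_clip README; the model and pretrained tags are examples of the LAION-trained checkpoints, and the image path is a placeholder.

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities, softmaxed over the candidate texts.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```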

What does a platypus look like? Generating customized prompts for zero-shot image classification

sarahpratt/cupl ICCV 2023

Unlike traditional classification models, open-vocabulary models classify among any arbitrary set of categories specified with natural language during inference.
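
CuPL's key idea is to embed several LLM-generated descriptions per class and average them into a single classifier weight. A sketch with a hypothetical `text_encoder` callable (mapping a list of strings to a (N, D) tensor) and hand-written stand-ins for the generated prompts:

```python
import torch.nn.functional as F

def class_embedding(text_encoder, prompts):
    # Encode every generated description of one class and average the
    # normalized embeddings into a single classifier weight for that class.
    emb = F.normalize(text_encoder(prompts), dim=-1)   # (num_prompts, D)
    return F.normalize(emb.mean(dim=0), dim=-1)        # (D,)

# Hand-written stand-ins for prompts an LLM might generate for "platypus":
platypus_prompts = [
    "a photo of a platypus, a duck-billed, semi-aquatic mammal",
    "a small brown furry animal with a flat bill and webbed feet",
]
```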

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

mendelxu/zsseg.baseline 29 Dec 2021

However, semantic segmentation and the CLIP model operate at different visual granularities: semantic segmentation makes predictions at the pixel level, while CLIP operates on whole images.
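
A common way to bridge that granularity gap, as in this two-stage baseline, is to first generate class-agnostic mask proposals and then classify each masked crop with CLIP. A sketch of the classification stage, with `image_encoder`, `text_emb`, and `crops` as hypothetical inputs:

```python
import torch.nn.functional as F

def classify_masks(image_encoder, text_emb, crops):
    # Stage two of the two-stage baseline: score each class-agnostic mask
    # proposal (already cropped and masked out of the image) against the
    # text embeddings of the candidate class names.
    crop_emb = F.normalize(image_encoder(crops), dim=-1)   # (M, D)
    text_emb = F.normalize(text_emb, dim=-1)               # (C, D)
    return (crop_emb @ text_emb.t()).softmax(dim=-1)       # (M, C) scores
```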

DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning

zjukg/DUET 4 Jul 2022

Specifically, we (1) developed a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images; (2) applied an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics under attribute co-occurrence and imbalance; (3) proposed a multi-task learning policy for considering multiple model objectives.
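
As a rough illustration of point (2), here is a generic supervised-contrastive loss keyed on a single attribute label, where samples sharing the attribute act as positives for each other. This is only an illustration of attribute-level contrast, not DUET's exact objective; the temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def attribute_contrastive_loss(emb, attr_labels, temperature=0.1):
    # emb: (B, D) image embeddings; attr_labels: (B,) integer attribute ids.
    emb = F.normalize(emb, dim=-1)
    logits = emb @ emb.t() / temperature
    eye = torch.eye(emb.size(0), dtype=torch.bool, device=emb.device)
    logits = logits.masked_fill(eye, -1e9)  # exclude self-similarity
    # Positives: other samples with the same attribute label.
    pos = (attr_labels.unsqueeze(0) == attr_labels.unsqueeze(1)) & ~eye
    log_prob = F.log_softmax(logits, dim=1)
    pos_count = pos.sum(dim=1).clamp(min=1)
    return -(log_prob * pos).sum(dim=1).div(pos_count).mean()
```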

AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

flagai-open/flagai 12 Nov 2022

In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.
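
AltCLIP's first stage is teacher learning: a multilingual student text encoder is trained so that its embedding of a translated sentence matches the frozen CLIP text encoder's embedding of the parallel English sentence. A sketch with hypothetical encoder callables mapping batches of strings to (B, D) embeddings:

```python
import torch
import torch.nn.functional as F

def teacher_distill_loss(student_encoder, teacher_encoder, en_texts, xx_texts):
    # The frozen teacher (CLIP's text encoder) embeds the English sentences;
    # the multilingual student embeds the parallel translations and is
    # regressed onto the teacher's outputs.
    with torch.no_grad():
        teacher_emb = teacher_encoder(en_texts)
    student_emb = student_encoder(xx_texts)
    return F.mse_loss(student_emb, teacher_emb)
```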