Zero-shot Image Retrieval
16 papers with code • 5 benchmarks • 6 datasets
Most implemented papers
FLAVA: A Foundational Language And Vision Alignment Model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with that of LLMs.
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations.
Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime
For few-shot image classification, we train SSL-ViTs without any supervision on external data and use this trained embedder to adapt quickly to novel classes with a limited number of labels.
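A minimal sketch of the adaptation step the excerpt describes: classify novel-class queries by nearest class prototype computed from a handful of labelled support embeddings produced by a frozen, self-supervised embedder. The function name and tensor shapes here are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def prototype_classify(support_emb, support_labels, query_emb, num_classes):
    """Nearest-prototype classification on frozen embeddings.

    support_emb: (n_support, dim) embeddings of the few labelled examples.
    support_labels: (n_support,) integer class labels.
    query_emb: (n_query, dim) embeddings of unlabelled queries.
    Returns predicted class indices for the queries.
    """
    support_emb = F.normalize(support_emb, dim=-1)
    query_emb = F.normalize(query_emb, dim=-1)
    # One prototype per class: the mean of its support embeddings.
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])
    prototypes = F.normalize(prototypes, dim=-1)
    sims = query_emb @ prototypes.t()   # cosine similarity to each prototype
    return sims.argmax(dim=-1)
```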
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods.
CCMB: A Large-scale Chinese Cross-modal Benchmark
In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains Zero, currently the largest public pre-training dataset, and five human-annotated fine-tuning datasets for downstream tasks.
FETA: Towards Specializing Foundation Models for Expert Task Applications
However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g., retrieving technical illustrations from car manuals with language queries), whose data is either unseen or belongs to a long-tail part of the distribution of the huge datasets used for FM pre-training.
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality.
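For reference, a minimal sketch of the single-view image-text contrastive objective the excerpt refers to, in the style of CLIP's symmetric InfoNCE loss; the temperature value and tensor names are illustrative, not taken from ERNIE-ViL 2.0.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matched pairs sit on the diagonal of the similarity matrix.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image
    return (loss_i2t + loss_t2i) / 2
```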
General Image Descriptors for Open World Image Retrieval using ViT CLIP
The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc.
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining.
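As a usage sketch of how such contrastively pretrained models perform zero-shot image retrieval: embed a text query and a set of candidate images with the pretrained encoders, then rank images by cosine similarity. This assumes the Hugging Face `transformers` CLIP interface; the checkpoint name and image paths are placeholders, and Chinese CLIP checkpoints expose a similar API.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "a dog playing in the snow"
# Hypothetical candidate image paths.
images = [Image.open(p) for p in ["img1.jpg", "img2.jpg", "img3.jpg"]]

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_inputs = processor(images=images, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity between the query and each candidate image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.t()).squeeze(-1)
ranking = scores.argsort(descending=True)   # best-matching images first
print(ranking.tolist())
```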