Zero-shot Image Retrieval

16 papers with code • 5 benchmarks • 6 datasets

Zero-shot image retrieval measures how well a pretrained vision-language model can retrieve the relevant images for a given text (or image) query from a gallery without any fine-tuning on the target retrieval dataset. Performance is typically reported as Recall@K on image-text retrieval benchmarks.
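
As a rough illustration of the task, the sketch below runs zero-shot text-to-image retrieval with an off-the-shelf CLIP checkpoint (openai/clip-vit-base-patch32 via Hugging Face transformers); the checkpoint choice, image paths, and query are placeholders, and any of the dual-encoder models listed below could be substituted.

```python
# Minimal sketch of zero-shot text-to-image retrieval with a pretrained CLIP
# checkpoint. No retrieval-specific fine-tuning is performed: the frozen
# encoders embed the query and the gallery, and images are ranked by cosine
# similarity to the query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Gallery of candidate images and a free-form text query (paths are placeholders).
gallery = [Image.open(p) for p in ["img0.jpg", "img1.jpg", "img2.jpg"]]
query = "a dog catching a frisbee on the beach"

with torch.no_grad():
    image_inputs = processor(images=gallery, return_tensors="pt")
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Normalize, score every gallery image against the query, and rank.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)   # shape: (len(gallery),)
ranking = scores.argsort(descending=True)      # gallery indices, best match first
print([(int(i), float(scores[i])) for i in ranking])
```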

Most implemented papers

FLAVA: A Foundational Language And Vision Alignment Model

facebookresearch/multimodal CVPR 2022

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

opengvlab/internvl 21 Dec 2023

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

deepmind/multimodal_transformers 31 Jan 2021

Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations.

Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime

AutoVision-cloud/SSL-ViT-lowlabel-highdata 22 Jan 2022

For few-shot image classification, we train SSL-ViTs without any supervision on external data, and use this trained embedder to adapt quickly to novel classes with a limited number of labels.
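
A minimal sketch of the general idea of adapting a frozen, self-supervised embedder to novel classes with few labels is shown below, using nearest-prototype classification; the prototype classifier and the `embedder` callable are illustrative assumptions, not necessarily the adaptation procedure used in the paper.

```python
# Sketch: few-shot adaptation on top of a frozen embedder via class prototypes.
# The embedder stands in for a trained SSL-ViT; here it is a random linear map.
import torch
import torch.nn.functional as F

def build_prototypes(embedder, support_images, support_labels, num_classes):
    """Average the frozen embeddings of the few labelled examples per class."""
    with torch.no_grad():
        emb = F.normalize(embedder(support_images), dim=-1)        # (N, D)
    return torch.stack([emb[support_labels == c].mean(dim=0)
                        for c in range(num_classes)])               # (C, D)

def classify(embedder, query_images, prototypes):
    """Assign each query to the class whose prototype is most similar."""
    with torch.no_grad():
        emb = F.normalize(embedder(query_images), dim=-1)
    return (emb @ F.normalize(prototypes, dim=-1).T).argmax(dim=-1)

# Toy usage: 5 novel classes, 2 labelled examples each, flattened 32x32 images.
embedder = torch.nn.Linear(3 * 32 * 32, 128)
support = torch.randn(10, 3 * 32 * 32)
labels = torch.arange(10) % 5
protos = build_prototypes(embedder, support, labels, num_classes=5)
print(classify(embedder, torch.randn(4, 3 * 32 * 32), protos))
```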

Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

0jason000/wukong 14 Feb 2022

Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods.

CCMB: A Large-scale Chinese Cross-modal Benchmark

yuxie11/R2D2 8 May 2022

In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains Zero, currently the largest public pre-training dataset, and five human-annotated fine-tuning datasets for downstream tasks.

FETA: Towards Specializing Foundation Models for Expert Task Applications

alfassy/FETA 8 Sep 2022

However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g., retrieving technical illustrations from car manuals via language queries), whose data is either unseen or belongs to the long tail of the data distribution of the huge datasets used for FM pre-training.

ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

PaddlePaddle/ERNIE 30 Sep 2022

They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality.
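
For reference, the sketch below shows the standard single-view image-text contrastive (InfoNCE) objective that this sentence refers to; ERNIE-ViL 2.0 extends it with multiple views per modality. The embedding tensors are assumed to come from any dual encoder.

```python
# Sketch of the symmetric single-view image-text contrastive loss (CLIP-style).
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))          # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)     # image-to-text direction
    loss_t2i = F.cross_entropy(logits.T, targets)   # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random embeddings standing in for encoder outputs.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```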

General Image Descriptors for Open World Image Retrieval using ViT CLIP

ivanaer/g-universal-clip 20 Oct 2022

The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc.

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

ofa-sys/chinese-clip 2 Nov 2022

The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining.