Zero-shot Image Retrieval
16 papers with code • 5 benchmarks • 6 datasets
Most implemented papers
FLAVA: A Foundational Language And Vision Alignment Model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with that of LLMs.
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations.
Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime
For few-shot image classification, we train SSL-ViTs without any supervision on external data and use this trained embedder to adapt quickly to novel classes with a limited number of labels.
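A minimal sketch of the adaptation step the excerpt describes: classify novel-class queries by nearest class prototype computed from a handful of labelled support embeddings produced by a frozen, self-supervised embedder. The function name and tensor shapes here are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def prototype_classify(support_emb, support_labels, query_emb, num_classes):
    """Nearest-prototype classification on frozen embeddings.

    support_emb: (n_support, dim) embeddings of the few labelled examples.
    support_labels: (n_support,) integer class labels.
    query_emb: (n_query, dim) embeddings of unlabelled queries.
    Returns predicted class indices for the queries.
    """
    support_emb = F.normalize(support_emb, dim=-1)
    query_emb = F.normalize(query_emb, dim=-1)
    # One prototype per class: the mean of its support embeddings.
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])
    prototypes = F.normalize(prototypes, dim=-1)
    sims = query_emb @ prototypes.t()   # cosine similarity to each prototype
    return sims.argmax(dim=-1)
```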
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods.
CCMB: A Large-scale Chinese Cross-modal Benchmark
In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains Zero, currently the largest public pre-training dataset, and five human-annotated fine-tuning datasets for downstream tasks.
FETA: Towards Specializing Foundation Models for Expert Task Applications
However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g., retrieving technical illustrations from car manuals with language queries), whose data is either unseen or belongs to a long-tail part of the distribution of the huge datasets used for FM pre-training.
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality.
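For reference, a minimal sketch of the single-view image-text contrastive objective the excerpt refers to, in the style of CLIP's symmetric InfoNCE loss; the temperature value and tensor names are illustrative, not taken from ERNIE-ViL 2.0.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matched pairs sit on the diagonal of the similarity matrix.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image
    return (loss_i2t + loss_t2i) / 2
```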
General Image Descriptors for Open World Image Retrieval using ViT CLIP
The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc.
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining.
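As a usage sketch of how such contrastively pretrained models perform zero-shot image retrieval: embed a text query and a set of candidate images with the pretrained encoders, then rank images by cosine similarity. This assumes the Hugging Face `transformers` CLIP interface; the checkpoint name and image paths are placeholders, and Chinese CLIP checkpoints expose a similar API.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "a dog playing in the snow"
# Hypothetical candidate image paths.
images = [Image.open(p) for p in ["img1.jpg", "img2.jpg", "img3.jpg"]]

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_inputs = processor(images=images, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity between the query and each candidate image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.t()).squeeze(-1)
ranking = scores.argsort(descending=True)   # best-matching images first
print(ranking.tolist())
```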