Zero-shot Image Retrieval
11 papers with code • 4 benchmarks • 4 datasets
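Zero-shot image retrieval is most commonly tackled with a CLIP-style dual encoder: the text query and the candidate images are embedded into a shared space and the gallery is ranked by cosine similarity to the query, with no task-specific fine-tuning. Below is a minimal sketch of that setup using the Hugging Face transformers CLIP API; the checkpoint name, image paths, and query string are illustrative placeholders and are not taken from any paper listed here.

```python
# Minimal zero-shot image retrieval sketch with a CLIP-style dual encoder.
# Checkpoint, image files, and query are placeholders for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["cat.jpg", "dog.jpg", "car.jpg"]   # candidate gallery (placeholder files)
images = [Image.open(p) for p in image_paths]
query = "a photo of a dog playing in a park"      # free-form text query

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**img_inputs)
    txt_emb = model.get_text_features(**txt_inputs)

# L2-normalize both sides and rank gallery images by cosine similarity to the query
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)

for idx in scores.argsort(descending=True).tolist():
    print(f"{image_paths[idx]}: {scores[idx].item():.3f}")
```

The same recipe applies to the multilingual and Chinese variants below (AltCLIP, Chinese CLIP, R2D2): only the encoder checkpoint changes, while retrieval remains a nearest-neighbor search over the shared embedding space.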
Most implemented papers
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations.
FLAVA: A Foundational Language And Vision Alignment Model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.
Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime
For few-shot image classification, we train SSL-ViTs without any supervision on external data and use this trained embedder to adapt quickly to novel classes with a limited number of labels.
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods.
Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework
Along with the ZERO benchmark, we also develop a VLP framework with a pre-Ranking + Ranking mechanism, boosted with target-guided Distillation and feature-guided Distillation (R2D2) for large-scale cross-modal learning.
FETA: Towards Specializing Foundation Models for Expert Task Applications
However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g., retrieval of technical illustrations from car manuals via language queries), whose data is either unseen or belongs to the long tail of the data distribution of the huge datasets used for FM pre-training.
General Image Descriptors for Open World Image Retrieval using ViT CLIP
The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc.
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
The tremendous success of CLIP (Radford et al., 2021) has spurred research on, and application of, contrastive learning for vision-language pretraining.
FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing
Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval.