Zero-shot Image Retrieval

Most implemented papers

AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

flagai-open/flagai 12 Nov 2022

In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model.

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

deepmind/multimodal_transformers 31 Jan 2021

Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations.

FLAVA: A Foundational Language And Vision Alignment Model

apsdehal/flava-tutorials CVPR 2022

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime

AutoVision-cloud/SSL-ViT-lowlabel-highdata 22 Jan 2022

For few-shot image classification, we train SSL-ViTs without any supervision on external data, and use this trained embedder to adapt quickly to novel classes with a limited number of labels.

Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

0jason000/wukong 14 Feb 2022

Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods.

Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

yuxie11/R2D2 8 May 2022

Along with the ZERO benchmark, we also develop a VLP framework with pre-Ranking + Ranking mechanism, boosted with target-guided Distillation and feature-guided Distillation (R2D2) for large-scale cross-modal learning.

FETA: Towards Specializing Foundation Models for Expert Task Applications

alfassy/FETA 8 Sep 2022

However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g., retrieval of technical illustrations from car manuals using language queries), data for which is either unseen or belongs to a long-tail part of the data distribution of the huge datasets used for FM pre-training.

General Image Descriptors for Open World Image Retrieval using ViT CLIP

ivanaer/g-universal-clip 20 Oct 2022

The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc.

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

ofa-sys/chinese-clip 2 Nov 2022

The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining.

FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

zhuang-li/factual 27 May 2023

Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval.