Image-text Retrieval

127 papers with code • 0 benchmarks • 5 datasets

Image-text retrieval matches images and text in a shared embedding space: given a text query, retrieve the most relevant images (text-to-image retrieval), or given an image, retrieve the most relevant captions (image-to-text retrieval).
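The common dual-encoder setup can be sketched in a few lines: embed both modalities, normalize, and rank by cosine similarity. The embeddings below are random placeholders standing in for the outputs of a trained vision-language model; this is a minimal illustration of the retrieval step, not any specific model's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: in practice these come from trained image and
# text encoders; here they are random vectors for illustration only.
image_embs = rng.normal(size=(5, 8))   # 5 candidate images, 8-dim features
text_emb = rng.normal(size=(8,))       # 1 text query

# L2-normalize so the dot product equals cosine similarity.
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb)

# Text-to-image retrieval: rank images by similarity to the query.
scores = image_embs @ text_emb
ranking = np.argsort(-scores)          # indices of images, best match first
print(ranking)
```

Image-to-text retrieval is the same computation with the roles of the two modalities swapped.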


Most implemented papers

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

salesforce/lavis 28 Jan 2022

Performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
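The distinction can be illustrated with a toy example: under conditional masking, tokens are corrupted in only one modality per training example, while the other modality stays fully observed. The token sequences and the masking rate below are hypothetical, chosen only to show the pattern.

```python
import random

random.seed(0)

# Hypothetical tokens for one image-text pair.
word_tokens = ["a", "dog", "catches", "a", "frisbee"]
region_tokens = ["reg0", "reg1", "reg2", "reg3"]  # image region features


def conditional_mask(tokens, p=0.3):
    # Mask roughly a fraction p of the tokens in ONE modality,
    # as opposed to joint random masking across both modalities at once.
    return [("[MASK]" if random.random() < p else t) for t in tokens]


# Masked language modeling: corrupt words, keep every region visible,
# so word reconstruction is conditioned on the full image.
mlm_words = conditional_mask(word_tokens)
mlm_regions = region_tokens
```

Masked region modeling mirrors this: regions are masked while the full sentence is observed.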

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

salesforce/lavis NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

facebookresearch/metaclip 11 Feb 2021

In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset.

FlexiViT: One Model for All Patch Sizes

google-research/big_vision CVPR 2023

Vision Transformers convert images to sequences by slicing them into patches.
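This patchification step can be sketched with plain array reshapes. The image size (32x32x3) and patch size (8) below are arbitrary choices for illustration; FlexiViT's point is that changing the patch size changes the resulting sequence length.

```python
import numpy as np

# Toy 32x32 RGB "image"; a patch size of 8 yields a 4x4 grid,
# i.e. a sequence of 16 patch tokens (before any linear projection).
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patch = 8

h, w, c = image.shape
patches = (image
           .reshape(h // patch, patch, w // patch, patch, c)  # split rows/cols
           .transpose(0, 2, 1, 3, 4)                          # group by grid cell
           .reshape(-1, patch * patch * c))                   # flatten each patch

print(patches.shape)  # (16, 192): 16 tokens, each an 8x8x3 patch flattened
```

With `patch = 16` the same image would yield a sequence of only 4 tokens, which is exactly the trade-off a single FlexiViT model is trained to handle.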

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

google-research-datasets/wit 2 Mar 2021

First, WIT is the largest multimodal dataset, with 3x more image-text examples than any other (at the time of writing).

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

alibaba/AliceMind 24 May 2022

Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks.

The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

vacancy/NSCL-PyTorch-Release ICLR 2019

To bridge the learning of two modules, we use a neuro-symbolic reasoning module that executes these programs on the latent scene representation.

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

zhegan27/VILLA NeurIPS 2020

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.

GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-Efficient Medical Image Recognition

marshuang80/gloria ICCV 2021

In recent years, the growing number of medical imaging studies is placing an ever-increasing burden on radiologists.