Image-text Retrieval
127 papers with code • 0 benchmarks • 5 datasets
Benchmarks
These leaderboards are used to track progress in Image-text Retrieval; no benchmarks are currently listed for this task.
Libraries
Use these libraries to find Image-text Retrieval models and implementations.
Most implemented papers
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
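As a rough illustration of conditional masking (the shapes, mask rate, and mask token id below are assumptions for the sketch, not UNITER's actual configuration), one modality is masked while the other stays fully observed:

```python
import torch

def conditional_mask(text_ids, image_feats, mask_text=True, p=0.15, mask_id=103):
    """Mask one modality while leaving the other fully observed."""
    text_ids = text_ids.clone()
    image_feats = image_feats.clone()
    if mask_text:
        # Masked language modeling conditioned on the full image.
        mask = torch.rand(text_ids.shape) < p
        text_ids[mask] = mask_id
    else:
        # Masked region modeling conditioned on the full text.
        mask = torch.rand(image_feats.shape[:2]) < p
        image_feats[mask] = 0.0
    return text_ids, image_feats

# Example: a batch of 2 captions (8 tokens) and 2 images (36 region features each).
ids, feats = conditional_mask(torch.randint(1000, (2, 8)),
                              torch.randn(2, 36, 2048), mask_text=True)
```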
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
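The title's "align before fuse" idea can be sketched as an in-batch image-text contrastive loss applied to unimodal embeddings before any cross-modal fusion; the dimensions and temperature here are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over the in-batch similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))          # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Align unimodal embeddings first; a fusion encoder would follow.
loss = itc_loss(torch.randn(4, 256), torch.randn(4, 256))
```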
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset.
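At retrieval time, a dual-encoder model of this kind reduces to ranking gallery items by cosine similarity against each query embedding; the embedding size below is a placeholder, not the paper's:

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_emb, k=5):
    """Rank gallery items by cosine similarity to each query; return top-k."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                      # (num_queries, gallery_size)
    return sims.topk(k, dim=-1)           # top-k scores and indices per query

# Example: 2 image queries against 100 candidate caption embeddings.
scores, idx = retrieve(torch.randn(2, 640), torch.randn(100, 640))
```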
FlexiViT: One Model for All Patch Sizes
Vision Transformers convert images to sequences by slicing them into patches.
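The slicing itself is a simple reshape; a minimal sketch with the patch size as a free parameter (the knob FlexiViT varies), omitting the learned patch-embedding projection:

```python
import torch

def patchify(images, patch_size):
    """(B, C, H, W) -> (B, num_patches, C * patch_size**2)."""
    b, c, h, w = images.shape
    p = patch_size
    assert h % p == 0 and w % p == 0
    x = images.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)       # (B, H/p, W/p, C, p, p)
    return x.reshape(b, (h // p) * (w // p), c * p * p)

img = torch.randn(1, 3, 224, 224)
tokens_16 = patchify(img, 16)   # (1, 196, 768):  longer sequence, smaller patches
tokens_32 = patchify(img, 32)   # (1, 49, 3072):  shorter sequence, larger patches
```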
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
First, WIT is the largest multimodal dataset by number of image-text examples, exceeding the next largest by 3x (at the time of writing).
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks.
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
To bridge the learning of two modules, we use a neuro-symbolic reasoning module that executes these programs on the latent scene representation.
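A toy sketch of executing a symbolic program over a scene representation; the scene format and operator names are invented for illustration and are not the paper's DSL:

```python
# Structured scene: one dict of attributes per detected object.
scene = [{"shape": "cube", "color": "red"},
         {"shape": "sphere", "color": "red"},
         {"shape": "cube", "color": "blue"}]

# Hypothetical operators: each maps the current result to the next one.
OPS = {
    "filter_color": lambda objs, c: [o for o in objs if o["color"] == c],
    "filter_shape": lambda objs, s: [o for o in objs if o["shape"] == s],
    "count": lambda objs: len(objs),
}

def execute(program, scene):
    """Run a linear program of (op, *args) steps over the scene."""
    out = scene
    for op, *args in program:
        out = OPS[op](out, *args)
    return out

# "How many red cubes are there?" -> 1
answer = execute([("filter_color", "red"), ("filter_shape", "cube"), ("count",)], scene)
```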
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
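A single-step, FGSM-style sketch of adversarial training in embedding space (VILLA itself uses a multi-step "free" variant; the model, loss, and step size here are placeholders):

```python
import torch

def adversarial_step(model, embeddings, labels, loss_fn, eps=1e-2):
    """Perturb input embeddings along the loss gradient, then train on them."""
    emb = embeddings.detach().requires_grad_(True)
    loss = loss_fn(model(emb), labels)
    grad, = torch.autograd.grad(loss, emb)
    # Normalized perturbation in the direction that increases the loss.
    delta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    return loss_fn(model(emb + delta), labels)

model = torch.nn.Linear(128, 2)
emb, labels = torch.randn(4, 128), torch.randint(2, (4,))
adv_loss = adversarial_step(model, emb, labels, torch.nn.functional.cross_entropy)
adv_loss.backward()  # gradients on the adversarially perturbed batch
```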
GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-Efficient Medical Image Recognition
In recent years, the growing number of medical imaging studies is placing an ever-increasing burden on radiologists.