Text Retrieval
237 papers with code • 5 benchmarks • 14 datasets
Most implemented papers
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset.
Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing and query workflows.
FlexiViT: One Model for All Patch Sizes
Vision Transformers convert images to sequences by slicing them into patches.
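The patch-slicing step can be sketched as a simple reshape; this is an illustrative helper (the name `patchify` is ours, not FlexiViT's), showing how varying `patch_size` changes the resulting sequence length, which is the degree of freedom FlexiViT exploits.

```python
import numpy as np

def patchify(image, patch_size):
    """Slice an image of shape (H, W, C) into a sequence of flattened patches.

    Sketch of how a Vision Transformer forms its input tokens; FlexiViT
    trains one model that works across many values of patch_size.
    """
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image dims must be divisible by patch size"
    # Reshape into a (rows, p, cols, p, C) grid, group the two patch axes,
    # then flatten each patch into one token vector.
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)

img = np.zeros((224, 224, 3))
print(patchify(img, 16).shape)  # (196, 768): 14x14 patches of dim 16*16*3
print(patchify(img, 32).shape)  # (49, 3072): fewer, larger patches
```

Smaller patches give longer sequences (more compute, finer detail); larger patches give shorter ones, and FlexiViT lets a single model trade these off at inference time.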
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
We thus propose VIDAL-10M, a dataset pairing Video, Infrared, Depth, and Audio with their corresponding Language.
Single Shot Scene Text Retrieval
In this way, the text-based image retrieval task can be cast as a simple nearest-neighbor search of the query text representation over the CNN outputs for the entire image database.
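The retrieval-as-nearest-neighbor idea common to these papers can be sketched in a few lines: embed the query text and all database images into a shared vector space, then rank by cosine similarity. The embeddings below are random stand-ins, not outputs of any model from the papers above.

```python
import numpy as np

def retrieve(query_vec, image_vecs, k=3):
    """Return indices of the k database vectors most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    db = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = db @ q  # cosine similarity of the query against every database item
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 128))        # hypothetical image embeddings
query = database[42] + 0.01 * rng.normal(128)  # query nearly identical to item 42
print(retrieve(query, database, k=3)[0])       # top hit is index 42
```

At scale, the exhaustive `argsort` is replaced by an approximate nearest-neighbor index, but the interface stays the same: one query vector in, a ranked list of database items out.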
Image Chat: Engaging Grounded Conversations
To test such models, we collect a dataset of grounded human-human conversations, where speakers are asked to play roles given a provided emotional mood or style, as the use of such traits is also a key factor in engagingness (Guo et al., 2019).
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
First, WIT is the largest multimodal dataset by the number of image-text examples by 3x (at the time of writing).
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks.