Referring Expression
156 papers with code • 1 benchmark • 3 datasets
Referring expression comprehension localizes the object instance described by a natural-language expression, typically by placing a bounding box around that instance in the given image.
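As a rough illustration of the task interface, the sketch below scores predicted boxes against ground-truth boxes with the metric most commonly reported for referring expression comprehension: a prediction counts as correct when its IoU with the ground truth is at least 0.5. The box format and helper names are illustrative.

```python
# Minimal sketch of the REC interface and its standard IoU@0.5 accuracy metric.
# Boxes are assumed to be (x1, y1, x2, y2) in pixel coordinates.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of expressions whose predicted box reaches IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / max(len(ground_truths), 1)
```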
Libraries
Use these libraries to find Referring Expression models and implementations
Most implemented papers
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion.
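Grounding DINO is also integrated into Hugging Face transformers as a zero-shot object detector, which makes it easy to try on referring expressions. The sketch below assumes that integration and the IDEA-Research/grounding-dino-tiny checkpoint; argument names such as box_threshold and text_threshold may differ slightly across library versions.

```python
# Sketch: referring expression comprehension with Grounding DINO through the
# Hugging Face transformers integration (assumed model id and thresholds).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("example.jpg")
text = "the brown dog lying on the sofa."  # lowercase query ending with a period

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into boxes in pixel coordinates for the referred instance.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # (height, width)
)
print(results[0]["boxes"], results[0]["scores"])
```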
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
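The sketch below illustrates the conditional-masking idea in schematic form: in each pre-training step only one modality is corrupted while the other stays fully observed. The masking probability and helper names are illustrative, not UNITER's exact recipe.

```python
# Hedged sketch of conditional masking: mask text tokens OR region features,
# never both in the same sample, so each masked modality is predicted while
# conditioning on a fully observed counterpart.
import random

MASK_PROB = 0.15  # assumed masking rate, for illustration only

def conditional_mask(text_tokens, region_features, mask_text):
    """Mask one modality; leave the other fully observed."""
    if mask_text:
        masked_text = [
            "[MASK]" if random.random() < MASK_PROB else tok for tok in text_tokens
        ]
        return masked_text, region_features      # regions untouched
    masked_regions = [
        None if random.random() < MASK_PROB else feat for feat in region_features
    ]
    return text_tokens, masked_regions           # text untouched
```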
Image Segmentation Using Text and Image Prompts
After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query.
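This is the CLIPSeg system, and a checkpoint is available through Hugging Face transformers. The sketch below assumes the CIDAS/clipseg-rd64-refined checkpoint and simply thresholds the predicted logits to obtain the binary map described above.

```python
# Sketch: text-prompted binary segmentation with CLIPSeg via Hugging Face
# transformers (assumed checkpoint name and threshold).
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("example.jpg")
prompts = ["the dog on the left", "a red cup"]

inputs = processor(
    text=prompts, images=[image] * len(prompts), return_tensors="pt", padding=True
)
with torch.no_grad():
    outputs = model(**inputs)

# One low-resolution logit map per prompt; threshold to get binary masks.
masks = torch.sigmoid(outputs.logits) > 0.5
```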
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Modeling Context in Referring Expressions
Humans refer to objects in their environments all the time, especially in dialogue with other people.
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
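One ingredient of such unified sequence-to-sequence models is that box coordinates are expressed as discrete location tokens, so referring expression comprehension is handled by the same decoder that generates text. The sketch below shows that quantization step in a generic form; the bin count and token naming are illustrative rather than OFA's exact configuration.

```python
# Hedged sketch of the "everything as a token sequence" idea: box coordinates
# are quantized into a small vocabulary of location tokens, turning grounding
# into ordinary sequence generation. Bin count and token format are assumed.

NUM_BINS = 1000  # assumed number of coordinate bins

def box_to_tokens(box, image_w, image_h, num_bins=NUM_BINS):
    """Map an (x1, y1, x2, y2) pixel box to discrete location tokens."""
    x1, y1, x2, y2 = box
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return [f"<bin_{min(int(c * num_bins), num_bins - 1)}>" for c in coords]

def tokens_to_box(tokens, image_w, image_h, num_bins=NUM_BINS):
    """Invert the quantization back to approximate pixel coordinates."""
    bins = [int(t[len("<bin_"):-1]) for t in tokens]
    scale = [image_w, image_h, image_w, image_h]
    return [(b + 0.5) / num_bins * s for b, s in zip(bins, scale)]

# e.g. box_to_tokens((40, 60, 200, 180), 640, 480)
#   -> ['<bin_62>', '<bin_125>', '<bin_312>', '<bin_375>']
```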
Towards Visual Grounding: A Survey
Finally, we outline the challenges confronting visual grounding and propose valuable directions for future research, which may serve as inspiration for subsequent researchers.
CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions
Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
SeqTR: A Simple yet Universal Network for Visual Grounding
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES).
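The unifying idea is to cast every grounding output as a sequence of quantized point coordinates: two corner points for a REC box, and a longer polygon of contour vertices for a RES mask. The sketch below builds such targets under assumed bin counts and helper names; it loosely follows the SeqTR formulation rather than reproducing its exact details.

```python
# Hedged sketch of a shared point-sequence target that covers both REC and RES.

def points_to_sequence(points, image_w, image_h, num_bins=1000):
    """Quantize (x, y) points into a flat list of integer coordinate tokens."""
    seq = []
    for x, y in points:
        seq.append(min(int(x / image_w * num_bins), num_bins - 1))
        seq.append(min(int(y / image_h * num_bins), num_bins - 1))
    return seq

# REC target: the two corners of the referred box.
box_target = points_to_sequence([(40, 60), (200, 180)], 640, 480)

# RES target: sampled vertices of the referred object's contour.
polygon_target = points_to_sequence(
    [(50, 70), (190, 75), (195, 170), (60, 175)], 640, 480
)
```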