Referring Expression
79 papers with code • 0 benchmarks • 2 datasets
Referring expression comprehension localizes the object instance described by a natural-language expression, placing a bounding box around it in the given image.
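For concreteness, a minimal sketch of how comprehension is typically scored: a prediction counts as correct when its box overlaps the ground-truth box with IoU of at least 0.5. The (x1, y1, x2, y2) box format is an assumption; some datasets store boxes as (x, y, w, h).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def accuracy_at_iou(preds, gts, thresh=0.5):
    """Fraction of predicted boxes matching ground truth at IoU >= thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```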
Benchmarks
These leaderboards are used to track progress in Referring Expression
Libraries
Use these libraries to find Referring Expression models and implementations

Most implemented papers
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
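A minimal sketch of the conditional-masking idea, with illustrative tensor shapes and a hypothetical mask_id (this is not UNITER's code): each pre-training step corrupts only one modality while the other stays fully observed.

```python
import torch

def conditional_mask(text_ids, region_feats, mask_text, p=0.15, mask_id=103):
    """Mask one modality conditioned on full observation of the other.
    text_ids: (B, L) token ids; region_feats: (B, N, D) region features."""
    text_ids = text_ids.clone()
    region_feats = region_feats.clone()
    if mask_text:
        # Masked language modeling: corrupt text tokens, keep all regions.
        mask = torch.rand_like(text_ids, dtype=torch.float) < p
        text_ids[mask] = mask_id
    else:
        # Masked region modeling: zero out region features, keep all text.
        mask = torch.rand(region_feats.shape[:2]) < p
        region_feats[mask] = 0.0
    return text_ids, region_feats, mask
```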
Modeling Context in Referring Expressions
Humans refer to objects in their environments all the time, especially in dialogue with other people.
CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions
Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.
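A hedged sketch of the reinforcer component as a REINFORCE-style policy-gradient term, under the assumption that the reward is a scalar score per sampled expression (e.g., how well the listener comprehends it); this is illustrative, not the paper's exact formulation.

```python
import torch

def reinforce_loss(log_probs, rewards, baseline=None):
    """log_probs: (B,) summed token log-probabilities of sampled expressions.
    rewards: (B,) scalar scores for how discriminative each expression is.
    Subtracting a baseline reduces gradient variance."""
    if baseline is None:
        baseline = rewards.mean()
    advantage = rewards - baseline
    # Ascending expected reward == descending this loss.
    return -(advantage.detach() * log_probs).mean()
```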
Generating Easy-to-Understand Referring Expressions for Target Identifications
Moreover, we regard that sentences that are easily understood are those that are comprehended correctly and quickly by humans.
A Fast and Accurate One-Stage Approach to Visual Grounding
We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired by the following insight.
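A minimal sketch of a typical one-stage grounding head, not necessarily this paper's exact architecture: the sentence embedding is broadcast over the detector's spatial grid, fused with the visual features by concatenation, and a confidence score plus box offsets are predicted at every location (all dimensions are assumptions).

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuse a sentence embedding into a detector feature map and predict
    one confidence score and four box offsets per spatial location."""
    def __init__(self, vis_dim=512, txt_dim=512):
        super().__init__()
        self.head = nn.Conv2d(vis_dim + txt_dim, 5, kernel_size=1)

    def forward(self, vis_feat, txt_emb):
        b, _, h, w = vis_feat.shape
        txt = txt_emb[:, :, None, None].expand(b, -1, h, w)  # broadcast text
        fused = torch.cat([vis_feat, txt], dim=1)
        return self.head(fused)  # (B, 5, H, W): argmax score gives the box
```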
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
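A rough sketch of adversarial training in embedding space (a generic PGD-style step under assumed shapes, not VILLA's exact recipe): find a small perturbation of the input embeddings that increases the loss, then train on the perturbed input.

```python
import torch

def adversarial_step(model, embeds, labels, loss_fn, eps=1e-3):
    """One adversarial training step: perturb embeddings along the loss
    gradient, then return the loss on the perturbed embeddings."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    loss = loss_fn(model(embeds + delta), labels)
    grad, = torch.autograd.grad(loss, delta)
    delta = eps * grad / (grad.norm() + 1e-9)  # ascend the loss surface
    return loss_fn(model(embeds + delta.detach()), labels)
```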
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Image Segmentation Using Text and Image Prompts
After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query.
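This model (CLIPSeg) is available through Hugging Face Transformers as CIDAS/clipseg-rd64-refined; a short usage sketch follows, where the image path and the 0.5 threshold are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("example.jpg")  # placeholder path
prompts = ["the dog on the left", "a red car"]
inputs = processor(text=prompts, images=[image] * len(prompts),
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # one low-resolution map per prompt
masks = torch.sigmoid(logits) > 0.5       # threshold to binary segmentation maps
```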