Referring Expression Segmentation
80 papers with code • 22 benchmarks • 11 datasets
The task is to label the pixels of an image or video that belong to the object instance referred to by a linguistic expression. The referring expression (RE) must unambiguously identify a single object in a discourse or scene (the referent).
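The output of the task can be pictured as a binary mask with the same height and width as the image, where 1 marks pixels that belong to the referent. The following is a minimal sketch on toy data, not a model. The helper names `mask_for_referent` and `mask_iou`, the 4x4 instance map, and the lookup from expression to instance id are all hypothetical. A real system would predict the mask from the image and the expression rather than look it up. Intersection-over-union (IoU) is included as one commonly used way to compare a predicted mask with a reference mask; this page itself does not specify a metric.

```python
from typing import Dict, List

Mask = List[List[int]]  # rows of 0/1 values, same height and width as the image


def mask_for_referent(instance_map: List[List[int]], referent_id: int) -> Mask:
    """Return a binary mask: 1 where the pixel belongs to the referent, else 0."""
    return [[1 if value == referent_id else 0 for value in row] for row in instance_map]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match (a convention assumed here).
    return intersection / union if union else 1.0


# Toy 4x4 scene (hypothetical): 0 = background, 1 and 2 = two object instances.
instance_map = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
]

# Hypothetical stand-in for language understanding: each expression picks
# out exactly one instance, as an unambiguous referring expression should.
expression_to_instance: Dict[str, int] = {
    "the object at the top": 1,
    "the object at the bottom left": 2,
}

target = mask_for_referent(instance_map, expression_to_instance["the object at the top"])

# A hypothetical prediction that covers the referent plus one extra pixel.
prediction = [
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

score = mask_iou(prediction, target)
```

Here the prediction overlaps all four referent pixels and adds one false positive, so the intersection is 4 pixels and the union is 5.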
Most implemented papers
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Image Segmentation Using Text and Image Prompts
After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query.
Segmentation from Natural Language Expressions
To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information.
CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions
Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process.
SeqTR: A Simple yet Universal Network for Visual Grounding
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES).
Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
In addition, we address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation
Given a linguistic phrase and a video, the task of video object segmentation with referring expressions (language-guided VOS) is to generate binary masks for the object to which the phrase refers.
SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation
Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation.
End-to-End Referring Video Object Segmentation with Multimodal Transformers
Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it.
Unleashing Text-to-Image Diffusion Models for Visual Perception
In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.