Referring Expression Segmentation
66 papers with code • 25 benchmarks • 11 datasets
The task aims at labeling the pixels of an image or video that represent an object instance referred to by a linguistic expression. In particular, the referring expression (RE) must unambiguously identify an individual object in a discourse or scene (the referent).
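In essence, the model maps an expression plus an image to a binary pixel mask. As a toy illustration only (not any particular paper's method), suppose we already have a per-pixel image-text relevance map for one expression; thresholding it yields the referent's mask. The similarity values and the threshold below are made up for the example:

```python
import numpy as np

def segment_by_expression(similarity, threshold=0.5):
    """Toy referring-expression segmentation step: given a per-pixel
    image-text similarity map (H x W, values in [0, 1]) for one
    referring expression, label the referent's pixels with 1."""
    return (similarity >= threshold).astype(np.uint8)

# Hypothetical 4x4 similarity map for an expression such as
# "the red mug on the left": high scores in the top-left region.
sim = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.9, 0.2, 0.1],
    [0.1, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.0],
])
mask = segment_by_expression(sim)  # 1s over the referent's pixels
```

Real systems replace the precomputed map with a learned vision-language model that produces the relevance scores end to end.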
Latest papers with no code
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding boxes as sequences of location tokens.
RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner
This pioneering work lays the groundwork for future research in semi-supervised learning for referring expression segmentation.
Generalizable Entity Grounding via Assistance of Large Language Model
In this work, we propose a novel approach to densely ground visual entities from a long caption.
Mask Grounding for Referring Image Segmentation
To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects.
GSVA: Generalized Segmentation via Multimodal Large Language Models
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to expressions that refer to multiple objects, or to empty targets absent from the image.
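The GRES setting therefore needs two outputs: a mask that may span several instances, and a decision that the referent may not exist at all. A minimal sketch of such post-processing, assuming a per-pixel relevance map as input (the thresholds and score values are illustrative, not from the paper):

```python
import numpy as np

def gres_decode(scores, mask_threshold=0.5, presence_threshold=0.6):
    """Toy GRES-style decoding: `scores` is a per-pixel relevance map
    for one expression. If no pixel is confident enough, declare the
    target absent and return an empty mask; otherwise keep all
    sufficiently relevant pixels, which may cover several objects."""
    if scores.max() < presence_threshold:
        return np.zeros_like(scores, dtype=np.uint8), False  # no-target case
    return (scores >= mask_threshold).astype(np.uint8), True

# "Two apples on the table": relevance peaks over two separate regions,
# so the decoded mask covers multiple objects.
scores = np.array([
    [0.9, 0.1, 0.1, 0.8],
    [0.7, 0.1, 0.1, 0.9],
])
mask, present = gres_decode(scores)

# "The dog", but no dog appears: uniformly low relevance yields an
# empty mask and an absent-target flag.
no_mask, absent_present = gres_decode(np.full((2, 4), 0.2))
```

Classic RES corresponds to the special case where exactly one object is always present.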
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
During the instruction fine-tuning stage, we introduce semantic-aware visual feature extraction, a crucial method that enables the model to extract informative features from concrete visual objects.
CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation
To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr, a new CLIP-driven referring expression segmentation network.
EAVL: Explicitly Align Vision and Language for Referring Image Segmentation
In previous approaches, fused vision-language features are fed directly into a decoder and passed through a convolution with a fixed kernel to obtain the result, following a pattern similar to traditional image segmentation.
WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation
Bottom-up methods are mainly perturbed by Inferior Positive (IP) errors due to the lack of prior object information.
LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims to segment the target instance referred to by a given text expression in a video clip.