Referring Expression Segmentation

66 papers with code • 25 benchmarks • 11 datasets

The task aims at labeling the pixels of an image or video that represent an object instance referred to by a linguistic expression. In particular, the referring expression (RE) must unambiguously identify an individual object (the referent) within a discourse or scene.
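The task interface can be summarized as: given an image and an expression, produce a binary mask over the image's pixels. The sketch below is purely illustrative — `segment_referent` is a hypothetical stand-in, and a real model would fuse visual and linguistic features rather than return a fixed region.

```python
import numpy as np

def segment_referent(image: np.ndarray, expression: str) -> np.ndarray:
    """Toy stand-in for a referring expression segmentation model.

    Input: an H x W x 3 image and a natural-language expression.
    Output: a boolean H x W mask marking the referent's pixels.
    Here we simply pretend the referent occupies the central
    quarter of the image.
    """
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = True
    return mask

image = np.zeros((8, 8, 3), dtype=np.uint8)
mask = segment_referent(image, "the dog on the left")
print(mask.shape, int(mask.sum()))  # (8, 8) 16
```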

Latest papers with no code

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

no code yet • 26 Feb 2024

Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding boxes as sequences of location tokens.
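The "bounding boxes as sequences of location tokens" scheme mentioned above typically quantizes normalized box coordinates into a fixed vocabulary of discrete bins. A minimal sketch of one such quantization (the exact bin count and coordinate order vary between models; this follows a common Pix2Seq-style convention and is an assumption, not GROUNDHOG's specific scheme):

```python
def box_to_location_tokens(box, img_w, img_h, num_bins=1000):
    """Quantize a pixel-space box (x1, y1, x2, y2) into four discrete
    location-token indices in [0, num_bins), by normalizing each
    coordinate to [0, 1] and binning it."""
    x1, y1, x2, y2 = box
    normalized = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [min(int(v * num_bins), num_bins - 1) for v in normalized]

tokens = box_to_location_tokens((32, 16, 96, 80), img_w=128, img_h=128)
print(tokens)  # [250, 125, 750, 625]
```

A causal language model can then emit these indices as ordinary vocabulary tokens, which is why box-based grounding fits naturally into next-token prediction — but, as the abstract notes, such tokens only localize objects coarsely, not at the pixel level that holistic segmentation requires.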

RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner

no code yet • 8 Feb 2024

This pioneering work lays the groundwork for future research in semi-supervised learning for referring expression segmentation.

Generalizable Entity Grounding via Assistance of Large Language Model

no code yet • 4 Feb 2024

In this work, we propose a novel approach to densely ground visual entities from a long caption.

Mask Grounding for Referring Image Segmentation

no code yet • 19 Dec 2023

To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects.

GSVA: Generalized Segmentation via Multimodal Large Language Models

no code yet • 15 Dec 2023

Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to expressions that refer to multiple objects, or to empty-target expressions whose referent is absent from the image.
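In output terms, GRES changes the contract of classic RES: a prediction may be the union of several object masks, or an all-empty mask when the expression has no referent. A minimal sketch of that output convention (illustrative only, not GSVA's method):

```python
import numpy as np

def combine_referent_masks(masks, shape):
    """Union per-object masks for a multi-target expression.

    An empty list models a 'no-target' expression: the result is an
    all-False mask of the requested spatial shape.
    """
    out = np.zeros(shape, dtype=bool)
    for m in masks:
        out |= m
    return out

a = np.zeros((4, 4), dtype=bool); a[0, 0] = True
b = np.zeros((4, 4), dtype=bool); b[3, 3] = True
print(int(combine_referent_masks([a, b], (4, 4)).sum()))  # 2  (two objects)
print(int(combine_referent_masks([], (4, 4)).sum()))      # 0  (empty target)
```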

Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects

no code yet • 8 Dec 2023

During the instruction fine-tuning stage, we introduce semantic-aware visual feature extraction, a crucial method that enables the model to extract informative features from concrete visual objects.

CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation

no code yet • 17 Sep 2023

To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr, a new CLIP-driven referring expression segmentation network.

EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

no code yet • 18 Aug 2023

In previous approaches, fused vision-language features are directly fed into a decoder and passed through a convolution with a fixed kernel to obtain the result, following a pattern similar to traditional image segmentation.

WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation

no code yet • 19 Jun 2023

Bottom-up methods are mainly perturbed by Inferior Positive (IP) errors due to the lack of prior object information.

LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation

no code yet • 14 Jun 2023

Referring video object segmentation (RVOS) aims to segment the target instance referred to by a given text expression in a video clip.