Referring Expression Segmentation

53 papers with code • 18 benchmarks • 11 datasets

The task is to label the pixels of an image or video that belong to the object instance referred to by a linguistic expression. The referring expression (RE) must unambiguously identify a single object in a discourse or scene (the referent).
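To make the input and output of the task concrete, here is a minimal sketch in Python with NumPy. It is not any listed paper's method: the toy scene, the `expression_to_instance` lookup, and the `segment_referent` function are hypothetical stand-ins for a learned model, used only to show that the result is a per-pixel mask covering exactly one instance.

```python
import numpy as np

# Toy 4x6 "scene": an instance-id map with two objects (ids 1 and 2); 0 is background.
instance_map = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Hypothetical lookup standing in for a model that resolves an expression to a referent.
expression_to_instance = {
    "the object on the left": 1,
    "the object on the right": 2,
}

def segment_referent(instance_map, expression):
    """Return a boolean mask (same height and width as the scene) marking the referent's pixels."""
    target_id = expression_to_instance[expression]
    return instance_map == target_id

mask = segment_referent(instance_map, "the object on the right")
print(mask.astype(int))
```

Each expression in the lookup maps to exactly one instance id, which mirrors the requirement that the expression pick out a single referent.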

Most implemented papers

Image Segmentation Using Text and Image Prompts

timojl/clipseg CVPR 2022

After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query.

Segmentation from Natural Language Expressions

ssharpe42/NLQAC_ObjSeg 20 Mar 2016

To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information.

CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions

ruotianluo/iep-ref CVPR 2019

Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process.

RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation

miriambellver/refvos 1 Oct 2020

The task of video object segmentation with referring expressions (language-guided VOS) is to, given a linguistic phrase and a video, generate binary masks for the object to which the phrase refers.
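The video setting described above can be sketched as producing one binary mask per frame. The snippet below is an illustration under stated assumptions, not the RefVOS implementation: `segment_video` and `toy_segment_frame` are hypothetical names, frames are processed independently, and the placeholder segmenter thresholds brightness and ignores the phrase.

```python
import numpy as np

def segment_video(frames, phrase, segment_frame):
    """Apply a per-frame segmenter to every frame and stack the masks into shape (T, H, W)."""
    return np.stack([segment_frame(frame, phrase) for frame in frames])

def toy_segment_frame(frame, phrase):
    """Placeholder segmenter (hypothetical): marks pixels brighter than 0.5, ignoring the phrase."""
    return frame > 0.5

# Three 4x4 grayscale frames, each containing a bright 2x2 square.
frames = np.zeros((3, 4, 4))
frames[:, 1:3, 1:3] = 1.0

masks = segment_video(frames, "the bright square", toy_segment_frame)
print(masks.shape)
```

A real model would condition each mask on the phrase; the placeholder only fixes the shape of the interface, a phrase and T frames in, T binary masks out.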

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

ashkamath/mdetr 26 Apr 2021

We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.

SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation

imatge-upc/synthref 8 Jun 2021

Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation.

End-to-End Referring Video Object Segmentation with Multimodal Transformers

mttr2021/MTTR CVPR 2022

Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it.

GRES: Generalized Referring Expression Segmentation

henghuiding/ReLA CVPR 2023

Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one target object.

GLaMM: Pixel Grounding Large Multimodal Model

syscv/sam-hq 6 Nov 2023

In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.

MAttNet: Modular Attention Network for Referring Expression Comprehension

lichengunc/MAttNet CVPR 2018

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.