Referring Expression Segmentation
26 papers with code • 20 benchmarks • 11 datasets
The task aims at labelling the pixels of an image or video that represent an object instance referred to by a linguistic expression. In particular, the referring expression (RE) must allow the identification of an individual object in a discourse or scene (the referent): REs unambiguously identify the target instance.
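Since the task is pixel labelling, predictions are binary masks and are typically scored by Intersection-over-Union against the ground-truth mask. A minimal sketch of that metric on toy masks (the mask shapes here are illustrative, not from any benchmark):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary segmentation masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: conventionally score 1.0
        return 1.0
    inter = np.logical_and(pred, gt).sum()
    return inter / union

# Toy 4x4 example: prediction and ground truth overlap in 2 of 6 pixels.
pred = np.zeros((4, 4), dtype=bool)
gt = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True  # 4 predicted pixels
gt[2:4, 1:3] = True    # 4 ground-truth pixels, 2 shared
print(mask_iou(pred, gt))  # → 0.333...
```

Benchmarks usually report this either averaged over all expressions (mIoU) or pooled over all pixels (overall IoU).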
However, there is evidence that current benchmark datasets suffer from bias, and that state-of-the-art models cannot easily be evaluated on their intermediate reasoning process.
In video object segmentation with referring expressions (language-guided VOS), the goal is, given a linguistic phrase and a video, to generate a binary mask in each frame for the object to which the phrase refers.
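In the video setting, a common way to aggregate the per-frame masks is the mean per-frame IoU (the region-similarity measure often called J in VOS benchmarks). A minimal sketch, using toy two-frame "videos" rather than real data:

```python
import numpy as np

def video_region_similarity(pred_masks, gt_masks) -> float:
    """Mean per-frame IoU over a video (the J region measure)."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        inter = np.logical_and(pred, gt).sum()
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

# Two-frame toy video: perfect mask on frame 0, 1/3 overlap on frame 1.
f0 = np.ones((2, 2), dtype=bool)
f1_pred = np.array([[True, True], [False, False]])
f1_gt = np.array([[True, False], [True, False]])
print(video_region_similarity([f0, f1_pred], [f0, f1_gt]))  # → 0.666...
```

Benchmarks additionally report a contour-accuracy measure (F) and often the average of the two; only the region term is sketched here.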
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation.
Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it.
Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image.
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
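Unlike segmentation, comprehension outputs a bounding box rather than a mask, and a prediction is usually counted correct when its box IoU with the ground truth passes a threshold (commonly 0.5). A minimal sketch of the box extraction and overlap computation, on made-up coordinates:

```python
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Tight (x_min, y_min, x_max, y_max) box around a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def box_iou(a, b) -> float:
    """IoU between two (x_min, y_min, x_max, y_max) boxes, inclusive pixels."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0) + 1)
    ih = max(0, min(ay1, by1) - max(ay0, by0) + 1)
    inter = iw * ih
    area_a = (ax1 - ax0 + 1) * (ay1 - ay0 + 1)
    area_b = (bx1 - bx0 + 1) * (by1 - by0 + 1)
    return inter / (area_a + area_b - inter)

# Toy 5x5 mask whose tight box is (2, 1, 4, 3); compare to a slightly
# taller hypothetical ground-truth box.
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 2:5] = True
pred_box = mask_to_box(mask)
print(pred_box)                         # → (2, 1, 4, 3)
print(box_iou(pred_box, (2, 1, 4, 4)))  # → 0.75
```

This also makes the relation between the two tasks concrete: a segmentation mask can always be reduced to a comprehension-style box, but not vice versa.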