14 papers with code • 14 benchmarks • 10 datasets
The task aims at labelling the pixels of an image or video that represent an object instance referred to by a linguistic expression. In particular, the referring expression (RE) must allow the identification of an individual object in a discourse or scene (the referent). REs unambiguously identify the target instance.
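As a rough illustration of the task's output, the sketch below represents a prediction as a binary mask (1 for pixels labelled as the referent, 0 otherwise) and scores it against a ground-truth mask with intersection over union. The `mask_iou` helper, the example expression, and the tiny 3x4 masks are all hypothetical, and treating two empty masks as a perfect match is a convention assumed here rather than something stated by the task definition.

```python
from typing import List

# Binary mask: 1 = pixel belongs to the referent, 0 = background.
Mask = List[List[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Hypothetical masks for an expression such as "the cup on the left".
ground_truth = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
prediction = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(mask_iou(prediction, ground_truth))
```

Here three pixels are shared and five are covered by either mask, so the score reflects both the missed pixel and the spurious one.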
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Ranked #1 on Visual Question Answering on CLEVR-Humans
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Ranked #4 on Referring Expression Segmentation on RefCOCO+ testA
In addition, we address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
We consider the problem of segmenting image regions given a natural language phrase, and study it on a novel dataset of 77,262 images and 345,486 phrase-region pairs.
Ranked #2 on Referring Expression Segmentation on PhraseCut
In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models.
Ranked #2 on Referring Expression Segmentation on A2D Sentences
In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels with the guidance of textual information.
Ranked #2 on Referring Expression Segmentation on RefCOCO testA (Overall IoU metric)
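For readers unfamiliar with the "Overall IoU" metric named in these rankings, the sketch below assumes the commonly used definition: intersections and unions are summed over all test samples before dividing, in contrast to a mean of per-sample IoU values. The function names and the two toy samples are hypothetical, and individual benchmarks may differ in details such as how empty masks are handled.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = referent pixel, 0 = background


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Pixel counts of the intersection and union of two same-sized masks."""
    inter = sum(1 for pr, tr in zip(pred, target) for p, t in zip(pr, tr) if p and t)
    union = sum(1 for pr, tr in zip(pred, target) for p, t in zip(pr, tr) if p or t)
    return inter, union


def overall_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Cumulative intersection divided by cumulative union over all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def mean_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Average of per-sample IoU values (empty-vs-empty pairs scored as 1.0 here)."""
    scores = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical samples: a large object segmented perfectly, a small one missed.
samples = [
    ([[1, 1, 1, 1]], [[1, 1, 1, 1]]),
    ([[0, 1, 0, 0]], [[1, 0, 0, 0]]),
]

print(overall_iou(samples), mean_iou(samples))
```

Because the totals are pooled, the cumulative form weights large objects more heavily than small ones, which is why the two numbers differ on this toy input.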
To this end, we propose an end-to-end trainable comprehension network that consists of the language and visual encoders to extract feature representations from both domains.
Ranked #5 on Referring Expression Segmentation on RefCOCO val (Overall IoU metric)
We address the problem of image segmentation from natural language descriptions.
This module controls the information flow of features at different levels.
Ranked #5 on Referring Expression Segmentation on RefCOCO testB (Overall IoU metric)
The task of video object segmentation with referring expressions (language-guided VOS) is to generate, given a linguistic phrase and a video, binary masks for the object to which the phrase refers.