See-Through-Text Grouping for Referring Image Segmentation

Motivated by conventional grouping techniques for image segmentation, we develop their DNN counterpart to tackle the referring variant. The proposed method is driven by a convolutional-recurrent neural network (ConvRNN) that iteratively carries out top-down processing of bottom-up segmentation cues. Given a natural-language referring expression, our method learns to predict its relevance to each pixel and derives a See-through-Text Embedding Pixelwise (STEP) heatmap, which reveals pixel-level segmentation cues via the learned visual-textual co-embedding. The ConvRNN performs top-down refinement by converting the STEP heatmap into an improved one, where the improvement is driven by training the network with a classification loss against the ground truth. With the refined heatmap, we update the textual representation of the referring expression by re-evaluating its attention distribution, and then compute a new STEP heatmap as the next input to the ConvRNN. Boosted by such collaborative learning, the framework progressively and simultaneously yields the desired referring segmentation and a reasonable attention distribution over the referring sentence. Our method is general and does not rely on, for example, object detection outputs from other DNN models, while achieving state-of-the-art performance on all four datasets in our experiments.
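
To make the iterative loop concrete, below is a minimal PyTorch-style sketch of how the STEP heatmap, the ConvRNN refinement, and the attention update could interleave. It is an illustration based on the abstract only: the module names, dimensions, the simplified single-gate ConvRNN cell, and the heatmap-pooled attention update are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STEPSketch(nn.Module):
    """Illustrative sketch of the iterative STEP / ConvRNN loop (not the authors' code).

    Inputs (assumed to come from any visual backbone and language encoder):
      V: visual feature map, shape (B, C, H, W)
      W: per-word features,  shape (B, T, D)
    """
    def __init__(self, vis_dim=512, txt_dim=512, emb_dim=256, steps=3):
        super().__init__()
        self.steps = steps
        self.vis_proj = nn.Conv2d(vis_dim, emb_dim, kernel_size=1)      # pixels -> joint embedding space
        self.txt_proj = nn.Linear(txt_dim, emb_dim)                      # sentence -> joint embedding space
        self.attn = nn.Linear(txt_dim + 1, 1)                            # re-weights words given heatmap context
        self.cell = nn.Conv2d(emb_dim + 1, emb_dim, 3, padding=1)        # stand-in for the ConvRNN cell
        self.refine = nn.Conv2d(emb_dim, 1, kernel_size=1)               # refined heatmap head

    def forward(self, V, W):
        B, T, _ = W.shape
        vis = F.normalize(self.vis_proj(V), dim=1)                       # (B, E, H, W)
        attn = W.new_full((B, T), 1.0 / T)                               # start from uniform word attention
        hidden = torch.zeros_like(vis)                                   # ConvRNN hidden state
        heatmaps = []
        for _ in range(self.steps):
            # 1) sentence embedding under current attention, then STEP heatmap by cosine similarity
            sent = torch.einsum('bt,btd->bd', attn, W)                   # (B, D)
            sent_e = F.normalize(self.txt_proj(sent), dim=1)             # (B, E)
            step = torch.einsum('bchw,bc->bhw', vis, sent_e).unsqueeze(1)  # (B, 1, H, W)

            # 2) top-down refinement of the STEP heatmap with the (simplified) ConvRNN cell
            hidden = torch.tanh(self.cell(torch.cat([hidden, step], dim=1)))
            refined = self.refine(hidden)                                # (B, 1, H, W)
            heatmaps.append(refined)

            # 3) re-evaluate the attention distribution from a pooled summary of the refined heatmap
            context = refined.mean(dim=(2, 3))                           # (B, 1)
            scores = self.attn(torch.cat([W, context[:, None, :].expand(B, T, 1)], dim=-1)).squeeze(-1)
            attn = scores.softmax(dim=1)                                 # updated word attention
        return heatmaps
```

Under this reading, training would supervise every refined heatmap against the ground-truth mask with a per-pixel classification loss (e.g., binary cross-entropy), so that each ConvRNN step is encouraged to improve on the previous one.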

Task: Referring Expression Segmentation

Dataset          Model           Metric        Value   Global Rank
RefCOCO val      STEP (1-fold)   Overall IoU   56.58   #24
RefCOCO testA    STEP (1-fold)   Overall IoU   58.70   #23
RefCOCO testB    STEP (1-fold)   Overall IoU   55.39   #16
RefCOCO+ val     STEP (5-fold)   Overall IoU   48.18   #19
RefCOCO+ testA   STEP (5-fold)   Overall IoU   52.33   #18
RefCOCO+ testB   STEP (5-fold)   Overall IoU   40.41   #17