SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation

Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation. However, collecting large datasets for these tasks is expensive in annotation time, which creates a bottleneck. To address this, we propose SynthRef, a novel method for generating synthetic referring expressions for target objects in an image (or video frame), and we release the first large-scale dataset with synthetic referring expressions for video object segmentation. Our experiments demonstrate that training with our synthetic referring expressions improves a model's ability to generalize across different datasets, without any additional annotation cost. Moreover, our formulation can be applied to any object detection or segmentation dataset.

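Task                               Dataset            Model                          Metric Name     Metric Value  Global Rank
Referring Expression Segmentation  DAVIS 2017 (val)   RefVOS                         J&F 1st frame   45.1          # 5
Referring Expression Segmentation  DAVIS 2017 (val)   RefVOS + SynthRef-YouTube-VIS  J&F 1st frame   45.3          # 4
Referring Expression Segmentation  DAVIS 2017 (val)   RefVOS + SynthRef-YouTube-VIS  J&F Full video  44.8          # 3
Referring Expression Segmentation  Refer-YouTube-VOS  RefVOS-Human REs               Precision@0.5   38.6          # 2
Referring Expression Segmentation  Refer-YouTube-VOS  RefVOS-Human REs               Precision@0.9   6.9           # 1
Referring Expression Segmentation  Refer-YouTube-VOS  RefVOS-Human REs               Mean IoU        39.5          # 1
Referring Expression Segmentation  Refer-YouTube-VOS  RefVOS-Synthetic REs           Precision@0.5   32.3          # 1
Referring Expression Segmentation  Refer-YouTube-VOS  RefVOS-Synthetic REs           Precision@0.9   1.8           # 2
Referring Expression Segmentation  Refer-YouTube-VOS  RefVOS-Synthetic REs           Mean IoU        35.0          # 2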
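
The Precision@K and Mean IoU entries above are overlap-based segmentation metrics. The listing does not state how they were computed, so the following is only a minimal sketch under assumptions: masks are binary and given as nested lists, Precision@K is taken as the fraction of samples whose IoU is strictly above K, and Mean IoU is the average of per-sample IoU values. The function names (`iou`, `precision_at`, `mean_iou`) and the toy masks are hypothetical, and the scores are returned as fractions rather than on the 0-100 scale the listed values appear to use.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def precision_at(ious, threshold):
    """Fraction of samples whose IoU is strictly above the threshold (assumed)."""
    return sum(1 for value in ious if value > threshold) / len(ious)


def mean_iou(ious):
    """Average of per-sample IoU values (assumed definition of Mean IoU)."""
    return sum(ious) / len(ious)


# Toy data: three (prediction, ground truth) mask pairs, 2x2 pixels each.
pairs = [
    ([[1, 1], [0, 0]], [[1, 1], [0, 0]]),  # identical masks
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # prediction covers one extra pixel
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),  # no overlap
]
ious = [iou(pred, target) for pred, target in pairs]

print("Precision@0.5:", precision_at(ious, 0.5))
print("Precision@0.9:", precision_at(ious, 0.9))
print("Mean IoU:", mean_iou(ious))
```

Other conventions exist, for example a non-strict threshold or an overall IoU computed from intersections and unions accumulated over the whole dataset, so numbers from this sketch are not directly comparable to the listed values.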
