RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation

The task of video object segmentation with referring expressions (language-guided VOS) is to generate binary masks for the object that a given linguistic phrase refers to in a video. Our work argues that existing benchmarks for this task consist mainly of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial referring expressions (REs), with the non-trivial REs annotated with seven RE semantic categories. We leverage this data to analyze the results of RefVOS, a novel neural network that obtains competitive results for language-guided image segmentation and state-of-the-art results for language-guided VOS. Our study indicates that the major challenges for the task are related to understanding motion and static actions.

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Segmentation | A2Dre test | RefVOS | Overall IoU | 47.5 | #1 |
| Referring Expression Segmentation | A2Dre test | RefVOS | Mean IoU | 33.2 | #1 |
| Referring Expression Segmentation | A2D Sentences | RefVOS | Precision@0.5 | 0.495 | #17 |
| Referring Expression Segmentation | A2D Sentences | RefVOS | Precision@0.9 | 0.064 | #12 |
| Referring Expression Segmentation | A2D Sentences | RefVOS | IoU overall | 0.599 | #16 |
| Referring Expression Segmentation | A2D Sentences | RefVOS | IoU mean | 0.599 | #6 |
| Referring Expression Segmentation | DAVIS 2017 (val) | RefVOS | J&F 1st frame | 44.5 | #6 |
| Referring Expression Segmentation | DAVIS 2017 (val) | RefVOS | J&F Full video | 45.1 | #2 |
| Referring Expression Segmentation | RefCOCO testA | RefVOS with BERT Pre-train | Overall IoU | 63.19 | #7 |
| Referring Expression Segmentation | RefCOCO testA | RefVOS with Bi-LSTM | Overall IoU | 52.90 | #12 |
| Referring Expression Segmentation | RefCOCO+ testA | RefVOS with BERT + MLM loss | Overall IoU | 49.73 | #10 |
| Referring Expression Segmentation | RefCOCO testB | RefVOS with BERT Pre-train | Overall IoU | 54.17 | #9 |
| Referring Expression Segmentation | RefCOCO+ testB | RefVOS with BERT + MLM loss | Overall IoU | 36.17 | #11 |
| Referring Expression Segmentation | RefCOCO val | RefVOS with BERT Pre-train | Overall IoU | 58.65 | #9 |
| Referring Expression Segmentation | RefCOCO val | RefVOS with BERT + MLM loss | Overall IoU | 59.45 | #7 |
| Referring Expression Segmentation | RefCOCO+ val | RefVOS with BERT + MLM loss | Overall IoU | 44.71 | #10 |
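
The metric names in the results above (Overall IoU, Mean IoU, Precision@K) are not defined on this page. As a rough illustration only, the sketch below follows the definitions commonly used for referring expression segmentation: overall IoU pools intersection and union pixel counts over all samples, mean IoU averages the per-sample IoU, and Precision@K is the fraction of samples whose IoU exceeds K. The function names, the toy masks, the strict `>` comparison, and the handling of empty masks are assumptions made for this example, not details taken from the RefVOS paper or its evaluation code.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks
    given as equally sized lists of rows of 0/1 values."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def segmentation_metrics(preds, gts, thresholds=(0.5, 0.9)):
    """Compute overall IoU, mean IoU and Precision@K over a set of samples."""
    counts = [iou_counts(p, g) for p, g in zip(preds, gts)]
    # Assumption: a sample whose prediction and ground truth are both empty scores 1.0.
    per_sample = [i / u if u else 1.0 for i, u in counts]
    total_inter = sum(i for i, _ in counts)
    total_union = sum(u for _, u in counts)
    return {
        # Pooled over all pixels, so large objects weigh more.
        "overall_iou": total_inter / total_union if total_union else 1.0,
        # Every sample weighs the same, regardless of object size.
        "mean_iou": sum(per_sample) / len(per_sample),
        # Assumption: strict ">"; some evaluation scripts may use ">=".
        "precision": {
            t: sum(v > t for v in per_sample) / len(per_sample)
            for t in thresholds
        },
    }


# Toy data: two 2x2 predictions and their ground-truth masks.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 1], [1, 1]], [[1, 0], [0, 0]]]
metrics = segmentation_metrics(preds, gts)
print(metrics)
```

On the toy data the first sample has IoU 2/4 and the second 1/1, so the pooled overall IoU (3/5) is lower than the mean IoU (0.75). This shows why the two numbers can differ noticeably on the same benchmark.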

Methods