URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark

ECCV 2020 · Seonguk Seo, Joon-Young Lee, Bohyung Han

We propose a unified referring video object segmentation network (URVOS). URVOS takes a video and a referring expression as inputs and estimates the masks of the object referred to by the language expression across all video frames. Our algorithm addresses this challenging problem by performing language-based object segmentation and mask propagation jointly in a single deep neural network that combines two attention models. In addition, we construct the first large-scale referring video object segmentation dataset, called Refer-Youtube-VOS. We evaluate our model on two benchmark datasets, including ours, and demonstrate the effectiveness of the proposed approach. The dataset is released at https://github.com/skynbe/Refer-Youtube-VOS.
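The abstract does not describe the two attention models in detail. As a rough illustration of how a referring expression can condition per-pixel video features, the sketch below applies generic scaled dot-product cross-attention from pixels to words. It is not the authors' architecture: the function names, the feature shapes, and the residual fusion are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, words):
    """Hypothetical pixel-to-word attention (illustrative only).

    visual: (P, C) array, one C-dim feature per pixel of a frame.
    words:  (T, C) array, one C-dim feature per word of the expression.
    Returns (P, C) language-aware features and (P, T) attention weights.
    """
    d = visual.shape[-1]
    scores = visual @ words.T / np.sqrt(d)   # (P, T) pixel-word affinities
    weights = softmax(scores, axis=-1)       # each pixel's weights sum to 1
    attended = weights @ words               # (P, C) word summary per pixel
    return visual + attended, weights        # residual fusion (an assumption)

rng = np.random.default_rng(0)
pixels = rng.normal(size=(6, 4))   # toy frame: 6 pixels, 4 channels
tokens = rng.normal(size=(3, 4))   # toy expression: 3 words
fused, attn = cross_modal_attention(pixels, tokens)
print(fused.shape, attn.shape)
```

A mask decoder would then consume the fused features. Propagating masks between frames would need a second mechanism, which this sketch does not cover.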


Results from the Paper


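Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Referring Expression Segmentation | DAVIS 2017 (val) | URVOS + Refer-Youtube-VOS + ft. DAVIS | J&F 1st frame | 51.63 | # 6
Referring Expression Segmentation | DAVIS 2017 (val) | URVOS + Refer-Youtube-VOS | J&F 1st frame | 46.85 | # 8
Referring Expression Segmentation | DAVIS 2017 (val) | URVOS | J&F 1st frame | 44.1 | # 12
Referring Video Object Segmentation | MeViS | URVOS | J&F | 27.8 | # 6
Referring Video Object Segmentation | MeViS | URVOS | J | 25.7 | # 6
Referring Video Object Segmentation | MeViS | URVOS | F | 29.9 | # 6
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | URVOS | J&F | 48.9 | # 24
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | URVOS | J | 47.0 | # 23
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | URVOS | F | 50.8 | # 21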
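For readers unfamiliar with the metric names above: J is region similarity (the intersection-over-union of predicted and ground-truth masks), F is a contour accuracy score, and J&F is their mean. The sketch below computes J for a pair of binary masks and averages J with F. The function names are made up for this example, and scoring two empty masks as 1.0 is a convention chosen here rather than something stated on this page. F is taken as a given number because computing it requires boundary matching, which is omitted.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treated as a perfect match (assumed convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt = np.array([[1, 0, 0],
               [0, 1, 1]])
print(region_similarity(pred, gt))      # 2 shared pixels / 4 in the union = 0.5

# Sanity check against the reported URVOS scores.
print(round(j_and_f(25.7, 29.9), 1))    # MeViS: 27.8
print(round(j_and_f(47.0, 50.8), 1))    # Refer-YouTube-VOS: 48.9
```

Averaging the reported J and F values reproduces the listed J&F scores on both MeViS and Refer-YouTube-VOS.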
