URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark
We propose a unified referring video object segmentation network (URVOS). URVOS takes a video and a referring expression as inputs, and estimates the object masks referred to by the given language expression across all video frames. Our algorithm addresses this challenging problem by performing language-based object segmentation and mask propagation jointly with a single deep neural network, using a proper combination of two attention models. In addition, we construct the first large-scale referring video object segmentation dataset, called Refer-Youtube-VOS. We evaluate our model on two benchmark datasets, including ours, and demonstrate the effectiveness of the proposed approach. The dataset is released at \url{https://github.com/skynbe/Refer-Youtube-VOS}.
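The abstract describes combining two attention models: one that grounds the referring expression in the frame features (cross-modal attention) and one that attends to features of previously segmented frames to propagate masks (memory attention). The following is a minimal sketch of that combination using plain dot-product attention in NumPy; the module names, shapes, and the residual formulation are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, lang):
    """Language grounding: each spatial location attends over the words.

    visual: (HW, C) flattened frame features
    lang:   (L, C)  word embeddings of the referring expression
    """
    attn = softmax(visual @ lang.T, axis=-1)   # (HW, L) word weights per location
    return visual + attn @ lang                # language-conditioned features

def memory_attention(query, memory):
    """Mask propagation: the current frame attends over past-frame features.

    query:  (HW, C) current-frame features
    memory: (M, C)  features from previously segmented frames
    """
    attn = softmax(query @ memory.T, axis=-1)  # (HW, M) memory weights
    return query + attn @ memory               # temporally-propagated features

def urvos_step(visual, lang, memory):
    """One hypothetical per-frame step: ground the expression, then propagate."""
    x = cross_modal_attention(visual, lang)
    return memory_attention(x, memory)
```

In a real network these attentions would operate on learned projections (queries, keys, values) of CNN features, and the output would feed a decoder that predicts the mask; the sketch only shows how the two attention stages compose within a single forward pass.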
Datasets

Introduced in the paper: Refer-YouTube-VOS

Used in the paper: DAVIS, DAVIS 2017, JHMDB, Referring Expressions for DAVIS 2016 & 2017, A2D, A2D Sentences, MeViS

Results from the Paper
Ranked #6 on Referring Expression Segmentation on DAVIS 2017 (val) (using extra training data)
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data |
|---|---|---|---|---|---|---|
| Referring Expression Segmentation | DAVIS 2017 (val) | URVOS + Refer-Youtube-VOS + ft. DAVIS | J&F 1st frame | 51.63 | #6 | Yes |
| Referring Expression Segmentation | DAVIS 2017 (val) | URVOS + Refer-Youtube-VOS | J&F 1st frame | 46.85 | #8 | |
| Referring Expression Segmentation | DAVIS 2017 (val) | URVOS | J&F 1st frame | 44.1 | #12 | |
| Referring Video Object Segmentation | MeViS | URVOS | J&F | 27.8 | #6 | |
| Referring Video Object Segmentation | MeViS | URVOS | J | 25.7 | #6 | |
| Referring Video Object Segmentation | MeViS | URVOS | F | 29.9 | #6 | |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | URVOS | J&F | 48.9 | #24 | |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | URVOS | J | 47.0 | #23 | |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | URVOS | F | 50.8 | #21 | |