Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion

CVPR 2021 · Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang

We present the Modular interactive VOS (MiVOS) framework, which decouples interaction-to-mask and mask propagation, allowing for higher generalizability and better performance. Trained separately, the interaction module converts user interactions to an object mask, which is then temporally propagated by our propagation module using a novel top-$k$ filtering strategy in reading the space-time memory. To effectively take the user's intent into account, a novel difference-aware module is proposed to learn how to properly fuse the masks before and after each interaction, which are aligned with the target frames by employing the space-time memory. We evaluate our method both qualitatively and quantitatively with different forms of user interactions (e.g., scribbles, clicks) on DAVIS to show that our method outperforms current state-of-the-art algorithms while requiring fewer frame interactions, with the additional advantage of generalizing to different types of user interactions. We contribute a large-scale synthetic VOS dataset with pixel-accurate segmentation of 4.8M frames to accompany our source code to facilitate future research.
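The top-$k$ filtering mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that a memory read is a softmax-weighted sum of memory values, and that filtering means keeping only the k largest affinities per query position before the softmax. The function name `topk_memory_read` and the array shapes are hypothetical, chosen only for illustration.

```python
import numpy as np

def topk_memory_read(affinity, values, k):
    """Illustrative top-k filtered memory read (hypothetical helper).

    affinity: (n_mem, n_query) similarity between memory and query positions.
    values:   (channels, n_mem) features stored at each memory position.
    Returns:  (channels, n_query) features read out for each query position.
    """
    n_mem, _ = affinity.shape
    k = min(k, n_mem)
    # Row indices of the k largest affinities in each query column.
    top_idx = np.argpartition(affinity, n_mem - k, axis=0)[n_mem - k:, :]
    top_val = np.take_along_axis(affinity, top_idx, axis=0)
    # Softmax restricted to the k kept entries (shifted for numerical stability).
    top_val = top_val - top_val.max(axis=0, keepdims=True)
    w = np.exp(top_val)
    w /= w.sum(axis=0, keepdims=True)
    # Scatter the weights back; every filtered-out memory position gets weight 0.
    weights = np.zeros(affinity.shape, dtype=float)
    np.put_along_axis(weights, top_idx, w, axis=0)
    return values @ weights

# Toy case: 3 memory positions, 2 query positions, one-hot memory values,
# so the readout equals the attention weights themselves.
affinity = np.array([[1.0, 0.0],
                     [2.0, 5.0],
                     [3.0, 1.0]])
readout = topk_memory_read(affinity, np.eye(3), k=2)
```

In this toy case the lowest-affinity memory position contributes nothing to either query, which is the intended effect of the filter: weak matches in the memory cannot add noise to the propagated mask.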


Results from the Paper


 Ranked #1 on Interactive Video Object Segmentation on DAVIS 2017 (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | MiVOS | Jaccard (Mean) | 89.7 | #22 |
| | | | Jaccard (Recall) | 97.5 | #2 |
| | | | Jaccard (Decay) | 6.6 | #22 |
| | | | F-measure (Mean) | 92.4 | #23 |
| | | | F-measure (Recall) | 96.4 | #2 |
| | | | F-measure (Decay) | 5.1 | #26 |
| | | | J&F | 91.0 | #22 |
| | | | Speed (FPS) | 16.9 | #26 |
| Interactive Video Object Segmentation | DAVIS 2017 | MiVOS | AUC-J | 0.849 | #1 |
| | | | J@60s | 0.854 | #1 |
| | | | AUC-J&F | 0.879 | #1 |
| | | | J&F@60s | 0.885 | #1 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | MiVOS | J&F | 76.5 | #29 |
| | | | Jaccard (Mean) | 72.7 | #29 |
| | | | Jaccard (Recall) | 81.2 | #2 |
| | | | Jaccard (Decay) | 14.9 | #2 |
| | | | F-measure (Mean) | 80.2 | #29 |
| | | | F-measure (Recall) | 87.6 | #2 |
| | | | F-measure (Decay) | 14.5 | #2 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | MiVOS | Jaccard (Mean) | 81.7 | #25 |
| | | | Jaccard (Recall) | 90.9 | #3 |
| | | | Jaccard (Decay) | 7.0 | #2 |
| | | | F-measure (Mean) | 87.4 | #27 |
| | | | F-measure (Recall) | 93.1 | #2 |
| | | | F-measure (Decay) | 8.2 | #1 |
| | | | J&F | 84.5 | #26 |
| | | | Speed (FPS) | 11.2 | #28 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | MiVOS | F-Measure (Seen) | 84.7 | #39 |
| | | | F-Measure (Unseen) | 85.5 | #31 |
| | | | Overall | 82.0 | #33 |
| | | | Jaccard (Seen) | 80.6 | #38 |
| | | | Jaccard (Unseen) | 77.3 | #29 |
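As a reading aid for the metrics above, the sketch below shows how a Jaccard score is computed for one pair of binary masks and how J&F is obtained as the average of the Jaccard mean and the F-measure mean. The helper names `jaccard` and `j_and_f` are hypothetical, and scoring two empty masks as 1.0 is an assumed convention rather than something stated on this page. The boundary F-measure itself is not implemented here.

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j_mean, f_mean):
    """J&F is the plain average of the Jaccard mean and the F-measure mean."""
    return (j_mean + f_mean) / 2.0

# Toy masks: the prediction covers two pixels, the ground truth covers one of them.
pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [0, 0]]
toy_j = jaccard(pred, gt)          # intersection 1, union 2

# DAVIS 2016 row of the results: J mean 89.7 and F mean 92.4.
davis16_jf = j_and_f(89.7, 92.4)
```

Averaging the rounded DAVIS 2016 values gives about 91.05, which agrees with the listed J&F of 91.0 up to rounding of the inputs.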

Methods


MiVOS • VOS