Tracking Anything with Decoupled Video Segmentation

Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
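The fusion step described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the paper's algorithm: it assumes that segments propagated from earlier frames are matched to the current frame's image-level segments by mask IoU, that matched segments keep their track ID, and that unmatched image-level segments start new tracks. The names `iou` and `fuse` and the 0.5 threshold are hypothetical choices made for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def fuse(propagated, detected, thresh=0.5):
    """Merge propagated tracks with image-level segments (illustrative only).

    propagated: dict mapping track ID -> boolean mask from temporal propagation
    detected:   list of boolean masks from the image-level model
    Returns a dict mapping track ID -> boolean mask for the current frame.
    """
    fused, used = {}, set()
    next_id = max(propagated, default=0) + 1
    for det in detected:
        best_id, best_score = None, thresh
        for tid, mask in propagated.items():
            if tid in used:
                continue
            score = iou(mask, det)
            if score > best_score:
                best_id, best_score = tid, score
        if best_id is None:
            # No sufficiently overlapping track: start a new one.
            best_id = next_id
            next_id += 1
        else:
            used.add(best_id)
        fused[best_id] = det
    # Tracks with no matching detection keep their propagated mask.
    for tid, mask in propagated.items():
        if tid not in used:
            fused[tid] = mask
    return fused

# Toy 4x4 frame: one existing track, two image-level segments.
track = np.zeros((4, 4), dtype=bool)
track[:2, :] = True
det_a = np.zeros((4, 4), dtype=bool)
det_a[:2, :3] = True          # overlaps the track (IoU = 6/8)
det_b = np.zeros((4, 4), dtype=bool)
det_b[2:, :] = True           # no overlap, becomes a new track
result = fuse({1: track}, [det_a, det_b])
print(sorted(result))
```

In this toy setup the overlapping segment inherits track ID 1 and the non-overlapping one receives a fresh ID. A real system would also need to handle conflicting matches and track termination, which this sketch ignores.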

Venue: ICCV 2023

Results from the Paper


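| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Open-World Video Segmentation | BURST-val | DEVA (Mask2Former) | OWTA (all) | 69.9 | #1 |
| Open-World Video Segmentation | BURST-val | DEVA (Mask2Former) | OWTA (com) | 75.2 | #1 |
| Open-World Video Segmentation | BURST-val | DEVA (Mask2Former) | OWTA (unc) | 41.5 | #2 |
| Open-World Video Segmentation | BURST-val | DEVA (EntitySeg) | OWTA (all) | 69.5 | #2 |
| Open-World Video Segmentation | BURST-val | DEVA (EntitySeg) | OWTA (com) | 73.3 | #2 |
| Open-World Video Segmentation | BURST-val | DEVA (EntitySeg) | OWTA (unc) | 50.5 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2016 val | DEVA (DIS) | G | 88.9 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2016 val | DEVA (DIS) | J | 87.6 | #3 |
| Unsupervised Video Object Segmentation | DAVIS 2016 val | DEVA (DIS) | F | 90.2 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA (EntitySeg) | J&F | 62.1 | #1 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA | J&F | 83.2 | #7 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA | Jaccard (Mean) | 79.6 | #8 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA | F-measure (Mean) | 86.8 | #7 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA | FPS | 25.3 | #10 |
| Referring Expression Segmentation | DAVIS 2017 (val) | DEVA (ReferFormer) | J&F 1st frame | 66.3 | #3 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | DEVA | Jaccard (Mean) | 84.2 | #9 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | DEVA | F-measure (Mean) | 91.0 | #7 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | DEVA | J&F | 87.6 | #10 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | DEVA | Speed (FPS) | 25.3 | #13 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | DEVA (EntitySeg) | J&F | 73.4 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | DEVA (EntitySeg) | Jaccard (Mean) | 70.4 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | DEVA (EntitySeg) | F-measure (Mean) | 76.4 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (no OVIS) | J&F | 60.0 | #10 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (no OVIS) | J | 55.8 | #10 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (no OVIS) | F | 64.3 | #10 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (no OVIS) | FPS | 25.3 | #7 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (with OVIS) | J&F | 66.5 | #7 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (with OVIS) | J | 62.3 | #7 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (with OVIS) | F | 70.8 | #7 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (with OVIS) | FPS | 25.3 | #7 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | DEVA (ReferFormer) | J&F | 66.0 | #11 |
| Video Panoptic Segmentation | VIPSeg | DEVA (Mask2Former - SwinB) | VPQ | 55.0 | #6 |
| Video Panoptic Segmentation | VIPSeg | DEVA (Mask2Former - SwinB) | STQ | 52.2 | #6 |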
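
In the DAVIS-style rows above, J&F (listed as G for DAVIS 2016) is the arithmetic mean of region similarity J (Jaccard) and contour accuracy F. The short check below applies that definition to the reported values; the helper name `j_and_f` is a hypothetical one chosen here, and small differences from the reported score are expected because the published J and F are already rounded to one decimal.

```python
def j_and_f(j, f):
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2

# (J, F, reported J&F) copied from the results above.
rows = {
    "DAVIS 2016 val, DEVA (DIS)": (87.6, 90.2, 88.9),
    "DAVIS 2017 test-dev, DEVA": (79.6, 86.8, 83.2),
    "DAVIS 2017 val, DEVA": (84.2, 91.0, 87.6),
    "DAVIS 2017 val, DEVA (EntitySeg)": (70.4, 76.4, 73.4),
    "MOSE, DEVA (no OVIS)": (55.8, 64.3, 60.0),
    "MOSE, DEVA (with OVIS)": (62.3, 70.8, 66.5),
}

# Largest gap between the recomputed mean and the reported score.
max_gap = max(abs(j_and_f(j, f) - reported) for j, f, reported in rows.values())

for name, (j, f, reported) in rows.items():
    print(f"{name}: computed {j_and_f(j, f):.2f}, reported {reported}")
```

Every recomputed mean agrees with the reported score to within the rounding of the inputs.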
