Tracking Anything with Decoupled Video Segmentation

Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
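
The decoupled idea in the abstract can be illustrated with a toy sketch: a per-frame image segmenter proposes masks, a propagation step carries existing tracks into the current frame, and a fusion step reconciles the two. This is not the DEVA implementation. The names `iou`, `fuse`, and `track`, the greedy IoU matching, and the 0.5 threshold are all assumptions chosen for illustration. The actual method uses a learned propagation model and bi-directional (semi-)online fusion of hypotheses from several frames, which this sketch does not reproduce.

```python
from typing import Any, Callable, Dict, FrozenSet, List, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]          # a mask is modelled as a set of pixel coordinates
Tracks = Dict[int, Mask]         # track ID -> current mask


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse(propagated: Tracks, detections: List[Mask], next_id: int,
         thresh: float = 0.5) -> Tuple[Tracks, int]:
    """Hypothetical fusion rule: greedily match each image-level detection
    to the best-overlapping propagated track; unmatched detections start
    new tracks, unmatched tracks keep their propagated mask."""
    fused = dict(propagated)
    taken = set()
    for det in detections:
        best_id, best = None, thresh
        for tid, mask in propagated.items():
            score = iou(det, mask)
            if tid not in taken and score >= best:
                best_id, best = tid, score
        if best_id is None:
            fused[next_id] = det      # new object enters the scene
            next_id += 1
        else:
            fused[best_id] = det      # refresh the track with the detection
            taken.add(best_id)
    return fused, next_id


def track(frames: List[Any],
          segment_image: Callable[[Any], List[Mask]],
          propagate: Callable[[Tracks, Any], Tracks]) -> List[Tracks]:
    """Run the two decoupled modules over a video, frame by frame."""
    tracks: Tracks = {}
    next_id = 0
    results = []
    for frame in frames:
        carried = propagate(tracks, frame) if tracks else {}
        tracks, next_id = fuse(carried, segment_image(frame), next_id)
        results.append(dict(tracks))
    return results
```

Because `segment_image` and `propagate` are passed in as plain callables, swapping the image-level model for a different task leaves the tracking loop unchanged, which is the point of the decoupling described above.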

ICCV 2023
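| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Open-World Video Segmentation | BURST-val | DEVA (Mask2Former) | OWTA (all) | 69.9 | #1 |
| Open-World Video Segmentation | BURST-val | DEVA (Mask2Former) | OWTA (com) | 75.2 | #1 |
| Open-World Video Segmentation | BURST-val | DEVA (Mask2Former) | OWTA (unc) | 41.5 | #2 |
| Open-World Video Segmentation | BURST-val | DEVA (EntitySeg) | OWTA (all) | 69.5 | #2 |
| Open-World Video Segmentation | BURST-val | DEVA (EntitySeg) | OWTA (com) | 73.3 | #2 |
| Open-World Video Segmentation | BURST-val | DEVA (EntitySeg) | OWTA (unc) | 50.5 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2016 val | DEVA (DIS) | G | 88.9 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2016 val | DEVA (DIS) | J | 87.6 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2016 val | DEVA (DIS) | F | 90.2 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA (EntitySeg) | J&F | 62.1 | #1 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA | J&F | 83.2 | #4 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA | Jaccard (Mean) | 79.6 | #5 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA | F-measure (Mean) | 86.8 | #4 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA | FPS | 25.3 | #9 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | DEVA | Jaccard (Mean) | 84.2 | #6 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | DEVA | F-measure (Mean) | 91.0 | #5 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | DEVA | J&F | 87.6 | #7 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | DEVA | Speed (FPS) | 25.3 | #13 |
| Referring Expression Segmentation | DAVIS 2017 (val) | DEVA (ReferFormer) | J&F 1st frame | 66.3 | #2 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | DEVA (EntitySeg) | J&F | 73.4 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | DEVA (EntitySeg) | Jaccard (Mean) | 70.4 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | DEVA (EntitySeg) | F-measure (Mean) | 76.4 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (with OVIS) | J&F | 66.5 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (with OVIS) | J | 62.3 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (with OVIS) | F | 70.8 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (with OVIS) | FPS | 25.3 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (no OVIS) | J&F | 60.0 | #2 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (no OVIS) | J | 55.8 | #2 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (no OVIS) | F | 64.3 | #2 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (no OVIS) | FPS | 25.3 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | DEVA (ReferFormer) | J&F | 66.0 | #4 |
| Video Panoptic Segmentation | VIPSeg | DEVA (Mask2Former - SwinB) | VPQ | 55.0 | #2 |
| Video Panoptic Segmentation | VIPSeg | DEVA (Mask2Former - SwinB) | STQ | 52.2 | #3 |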
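
In the DAVIS-style results above, J&F (and G on DAVIS 2016) is by convention the mean of region similarity J and contour accuracy F. A quick check, sketched below, confirms that the listed values agree with that convention to within rounding. The variable name `rows` and the 0.1 tolerance are assumptions; the tolerance is needed because the listed J and F values are themselves rounded to one decimal.

```python
# (benchmark, J, F, reported J&F) copied from the results listed above
rows = [
    ("DAVIS 2016 val, unsupervised", 87.6, 90.2, 88.9),
    ("DAVIS 2017 test-dev, semi-supervised", 79.6, 86.8, 83.2),
    ("DAVIS 2017 val, semi-supervised", 84.2, 91.0, 87.6),
    ("DAVIS 2017 val, unsupervised", 70.4, 76.4, 73.4),
    ("MOSE, with OVIS", 62.3, 70.8, 66.5),
    ("MOSE, no OVIS", 55.8, 64.3, 60.0),
]

# Gap between the mean of J and F and the reported combined score
deviations = {name: abs((j + f) / 2 - jf) for name, j, f, jf in rows}

for name, dev in deviations.items():
    status = "consistent" if dev <= 0.1 else "check this row"
    print(f"{name}: deviation {dev:.2f} ({status})")
```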
