Tracking Anything with Decoupled Video Segmentation
Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
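The abstract describes a (semi-)online loop: a task-specific image-level model proposes masks on selected frames, while a class-agnostic temporal propagation model carries the consolidated segmentation forward and periodically fuses it with new image-level proposals. Below is a minimal, hypothetical Python sketch of that loop under assumed interfaces; `image_segmenter`, `propagator`, and the `merge` step are illustrative placeholders rather than the released DEVA API, and the bi-directional in-clip consensus is abstracted into the periodic merge.

```python
import numpy as np
from typing import Callable, List

def run_decoupled_segmentation(
    frames: List[np.ndarray],
    image_segmenter: Callable[[np.ndarray], np.ndarray],  # task-specific, image-level proposals
    propagator,                                            # task-agnostic temporal propagation model (assumed interface)
    merge_every: int = 5,                                  # how often to inject new image-level proposals
) -> List[np.ndarray]:
    """Hypothetical decoupled loop: propagate past segmentation, periodically fuse with image-level proposals."""
    outputs: List[np.ndarray] = []
    for t, frame in enumerate(frames):
        if t == 0:
            # Initialize the segmentation from image-level proposals on the first frame.
            seg = image_segmenter(frame)
            propagator.add_to_memory(frame, seg)
        else:
            # Carry the existing object masks forward in time (class/task-agnostic).
            seg = propagator.propagate(frame)
            if t % merge_every == 0:
                # Fuse the propagated segmentation with fresh image-level proposals:
                # keep consistent object IDs, add newly detected objects, discard spurious ones.
                proposal = image_segmenter(frame)
                seg = propagator.merge(seg, proposal)
                propagator.add_to_memory(frame, seg)
        outputs.append(seg)
    return outputs
```

Because the propagator is trained once and reused, only the image-level segmenter needs to change when moving to a new target task.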
ICCV 2023
Results from the Paper
Ranked #1 on Unsupervised Video Object Segmentation on DAVIS 2016 val (using extra training data)
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Open-World Video Segmentation | BURST-val | DEVA (Mask2Former) | OWTA (all) | 69.9 | #1 |
| | | | OWTA (com) | 75.2 | #1 |
| | | | OWTA (unc) | 41.5 | #2 |
| Open-World Video Segmentation | BURST-val | DEVA (EntitySeg) | OWTA (all) | 69.5 | #2 |
| | | | OWTA (com) | 73.3 | #2 |
| | | | OWTA (unc) | 50.5 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2016 val | DEVA (DIS) | G | 88.9 | #1 |
| | | | J | 87.6 | #3 |
| | | | F | 90.2 | #1 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA (EntitySeg) | J&F | 62.1 | #1 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | DEVA | J&F | 83.2 | #7 |
| | | | Jaccard (Mean) | 79.6 | #8 |
| | | | F-measure (Mean) | 86.8 | #7 |
| | | | FPS | 25.3 | #10 |
| Referring Expression Segmentation | DAVIS 2017 (val) | DEVA (ReferFormer) | J&F 1st frame | 66.3 | #3 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | DEVA | Jaccard (Mean) | 84.2 | #9 |
| | | | F-measure (Mean) | 91.0 | #7 |
| | | | J&F | 87.6 | #10 |
| | | | Speed (FPS) | 25.3 | #13 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | DEVA (EntitySeg) | J&F | 73.4 | #1 |
| | | | Jaccard (Mean) | 70.4 | #1 |
| | | | F-measure (Mean) | 76.4 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (no OVIS) | J&F | 60.0 | #10 |
| | | | J | 55.8 | #10 |
| | | | F | 64.3 | #10 |
| | | | FPS | 25.3 | #7 |
| Semi-Supervised Video Object Segmentation | MOSE | DEVA (with OVIS) | J&F | 66.5 | #7 |
| | | | J | 62.3 | #7 |
| | | | F | 70.8 | #7 |
| | | | FPS | 25.3 | #7 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | DEVA (ReferFormer) | J&F | 66.0 | #11 |
| Video Panoptic Segmentation | VIPSeg | DEVA (Mask2Former - SwinB) | VPQ | 55.0 | #6 |
| | | | STQ | 52.2 | #6 |