TarViS: A Unified Approach for Target-based Video Segmentation

The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS
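The abstract's central mechanism, representing every segmentation target as an abstract query that is decoded into a pixel-precise mask, can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation (see the linked repository for that): the class name `TargetQuerySegmenter`, the layer counts, and all tensor dimensions are assumptions made for this example.

```python
# Minimal sketch of a query-based "target" segmenter, assuming a PyTorch setup.
# Illustrative only; names and hyperparameters are not from the paper.
import torch
import torch.nn as nn

class TargetQuerySegmenter(nn.Module):
    """Shared decoder that turns abstract target queries into video masks."""

    def __init__(self, embed_dim=256, num_decoder_layers=3, num_heads=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_decoder_layers)

    def forward(self, queries, pixel_features):
        # queries: (num_targets, D). How they are produced is the task-specific
        # part: e.g. learned class/instance queries for VIS/VPS, or queries
        # encoded from first-frame masks (VOS) or point exemplars (PET).
        # pixel_features: (T, H, W, D) video features from a shared backbone.
        T, H, W, D = pixel_features.shape
        memory = pixel_features.reshape(1, T * H * W, D)
        refined = self.decoder(queries.unsqueeze(0), memory)  # (1, N, D)
        # Pixel-precise mask logits via a dot product between each refined
        # query and every spatio-temporal pixel feature.
        logits = torch.einsum("nd,thwd->nthw", refined.squeeze(0), pixel_features)
        return logits  # (num_targets, T, H, W)

# "Hot-swapping" tasks: the same trained segmenter, different query sources.
model = TargetQuerySegmenter()
video_feats = torch.randn(4, 32, 32, 256)     # 4 frames of backbone features
vis_queries = torch.randn(10, 256)            # e.g. learned instance queries
vos_queries = torch.randn(2, 256)             # e.g. encoded from given masks
print(model(vis_queries, video_feats).shape)  # torch.Size([10, 4, 32, 32])
print(model(vos_queries, video_feats).shape)  # torch.Size([2, 4, 32, 32])
```

The point of the design is visible in the last four lines: the decoder is identical across tasks, and switching from VIS-style to VOS-style inference only changes where the queries come from, which is what makes joint training and inference-time task hot-swapping possible.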

Published at CVPR 2023.

Results from the Paper


Ranked #2 on Video Panoptic Segmentation on KITTI-STEP (using extra training data)

Global leaderboard rank for each metric is shown in parentheses.

Video Panoptic Segmentation: Cityscapes-VPS

| Model | VPQ | VPQ (thing) | VPQ (stuff) |
|---|---|---|---|
| TarViS (Swin-L) | 58.9 (#4) | 43.7 (#5) | 69.9 (#3) |
| TarViS (ResNet-50) | 53.3 (#8) | 35.9 (#7) | 66.0 (#6) |
| TarViS (Swin-T) | 58.0 (#5) | 42.9 (#6) | 69.0 (#4) |

Semi-Supervised Video Object Segmentation: DAVIS 2017 (val)

| Model | Jaccard (Mean) | F-measure (Mean) | J&F |
|---|---|---|---|
| TarViS | 81.7 (#25) | 88.5 (#19) | 85.3 (#20) |

Video Panoptic Segmentation: KITTI-STEP

| Model | STQ | AQ | SQ |
|---|---|---|---|
| TarViS (ResNet-50) | 69.6 (#5) | 70.3 (#4) | 68.8 (#5) |
| TarViS (Swin-L) | 73.0 (#2) | 72.0 (#2) | 72.0 (#3) |
| TarViS (Swin-T) | 70.6 (#4) | 71.2 (#3) | 69.9 (#4) |

Video Instance Segmentation: OVIS validation

| Model | mask AP | AP50 | AP75 | AR1 | AR10 |
|---|---|---|---|---|---|
| TarViS (Swin-T) | 34.0 (#24) | 55.0 (#24) | 34.4 (#24) | 16.1 (#17) | 40.9 (#16) |
| TarViS (ResNet-50) | 31.1 (#27) | 52.5 (#25) | 30.4 (#26) | 15.9 (#18) | 39.9 (#18) |
| TarViS (Swin-L) | 43.2 (#12) | 67.8 (#13) | 44.6 (#12) | 18.0 (#12) | 50.4 (#6) |

Video Panoptic Segmentation: VIPSeg

| Model | VPQ | STQ |
|---|---|---|
| TarViS (ResNet-50) | 33.5 (#11) | 43.1 (#8) |
| TarViS (Swin-T) | 35.8 (#10) | 45.3 (#7) |
| TarViS (Swin-L) | 48.0 (#8) | 52.9 (#4) |

Video Instance Segmentation: YouTube-VIS 2021

| Model | mask AP | AP50 | AP75 | AR1 | AR10 |
|---|---|---|---|---|---|
| TarViS (Swin-L) | 60.2 (#5) | 81.4 (#6) | 67.6 (#5) | 47.6 (#9) | 64.8 (#5) |
| TarViS (Swin-T) | 50.9 (#18) | 71.6 (#18) | 56.6 (#18) | 42.2 (#18) | 57.2 (#17) |
| TarViS (ResNet-50) | 48.3 (#20) | 69.6 (#19) | 53.2 (#19) | 40.5 (#21) | 55.9 (#20) |