Associating Objects with Transformers for Video Object Segmentation

NeurIPS 2021 · Zongxin Yang, Yunchao Wei, Yi Yang

This paper investigates how to achieve better and more efficient embedding learning for semi-supervised video object segmentation in challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object, so in multi-object scenarios they must match and segment each target separately, which multiplies the computing cost by the number of objects. To solve this problem, we propose an Associating Objects with Transformers (AOT) approach that matches and decodes multiple objects uniformly. In detail, AOT employs an identification mechanism to associate multiple targets in the same high-dimensional embedding space. We can therefore process the matching and segmentation decoding of multiple objects simultaneously, as efficiently as processing a single object. To model multi-object association sufficiently, a Long Short-Term Transformer is designed to construct hierarchical matching and propagation. We conduct extensive experiments on both multi-object and single-object benchmarks to examine AOT variant networks of different complexities. In particular, our R50-AOT-L outperforms all state-of-the-art competitors on three popular benchmarks, i.e., YouTube-VOS (84.1% J&F), DAVIS 2017 (84.9%), and DAVIS 2016 (91.1%), while running more than $3\times$ faster in multi-object scenarios. Meanwhile, our AOT-T maintains real-time multi-object speed on the above benchmarks. Based on AOT, we ranked 1st in the 3rd Large-scale VOS Challenge.
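
The idea of placing several targets in one embedding space can be illustrated with a small sketch. This is not the authors' implementation: the bank of identity vectors, the function names, the slot assignment, and the zero vector for background are all assumptions made for illustration. Each object is given one vector from a shared bank, and every pixel of a multi-object label map is replaced by the vector of the object it belongs to, so a single embedding map carries all targets at once.

```python
import random


def build_identity_bank(num_slots, dim, seed=0):
    """Hypothetical stand-in for a learned bank: one vector per identity slot."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]


def identification_embedding(label_map, bank, slot_of_object):
    """Turn an H x W map of object ids (0 = background) into an H x W x dim
    embedding by looking up each pixel's assigned identity vector.

    Assumption: background pixels get a zero vector.
    """
    dim = len(bank[0])
    embedded = []
    for row in label_map:
        embedded.append([
            list(bank[slot_of_object[obj]]) if obj != 0 else [0.0] * dim
            for obj in row
        ])
    return embedded


# Two objects (ids 1 and 2) share one embedding map.
bank = build_identity_bank(num_slots=4, dim=3)
label_map = [[0, 1, 1],
             [2, 2, 0]]
slot_of_object = {1: 3, 2: 0}  # hypothetical assignment of objects to slots
emb = identification_embedding(label_map, bank, slot_of_object)
```

Because the output has the same shape whether the label map holds one object or several, a downstream matching and decoding network would only need to run once per frame, which is the efficiency argument made in the abstract.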


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | SwinB-AOT-L | Jaccard (Mean) | 90.7 | #8 |
| | | | F-measure (Mean) | 93.3 | #14 |
| | | | J&F | 92.0 | #11 |
| | | | Speed (FPS) | 12.1 | #28 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | AOT-L | Jaccard (Mean) | 88.7 | #31 |
| | | | F-measure (Mean) | 91.1 | #29 |
| | | | J&F | 89.9 | #31 |
| | | | Speed (FPS) | 29.6 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | AOT-T | Jaccard (Mean) | 86.1 | #47 |
| | | | F-measure (Mean) | 87.4 | #46 |
| | | | J&F | 86.8 | #44 |
| | | | Speed (FPS) | 51.4 | #6 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | R50-AOT-L | Jaccard (Mean) | 90.1 | #21 |
| | | | F-measure (Mean) | 92.1 | #24 |
| | | | J&F | 91.1 | #21 |
| | | | Speed (FPS) | 18.0 | #24 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | AOT-L | Jaccard (Mean) | 89.6 | #24 |
| | | | F-measure (Mean) | 91.1 | #29 |
| | | | J&F | 90.4 | #29 |
| | | | Speed (FPS) | 18.7 | #23 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | AOT-S | Jaccard (Mean) | 88.6 | #35 |
| | | | F-measure (Mean) | 90.2 | #35 |
| | | | J&F | 89.4 | #33 |
| | | | Speed (FPS) | 40.0 | #9 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | SwinB-AOT-L | J&F | 81.2 | #13 |
| | | | Jaccard (Mean) | 77.3 | #16 |
| | | | F-measure (Mean) | 85.1 | #13 |
| | | | FPS | 12.1 | #18 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | R50-AOT-L | J&F | 79.6 | #21 |
| | | | Jaccard (Mean) | 75.9 | #21 |
| | | | F-measure (Mean) | 83.3 | #21 |
| | | | FPS | 18.0 | #13 |
| Video Object Segmentation | DAVIS 2017 (test-dev) | AOT | Jaccard | 75.9 | #2 |
| | | | F-measure | 83.3 | #2 |
| | | | Mean Jaccard & F-Measure | 79.6 | #2 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | AOT-L | J&F | 78.3 | #24 |
| | | | Jaccard (Mean) | 74.3 | #26 |
| | | | F-measure (Mean) | 82.3 | #24 |
| | | | FPS | 18.7 | #12 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | AOT-B | J&F | 75.5 | #31 |
| | | | Jaccard (Mean) | 71.6 | #33 |
| | | | F-measure (Mean) | 79.3 | #31 |
| | | | FPS | 29.6 | #7 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | AOT-S | J&F | 73.9 | #35 |
| | | | Jaccard (Mean) | 70.3 | #35 |
| | | | F-measure (Mean) | 77.5 | #35 |
| | | | FPS | 40.0 | #5 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | AOT-T | J&F | 72.0 | #38 |
| | | | Jaccard (Mean) | 68.3 | #38 |
| | | | F-measure (Mean) | 75.7 | #38 |
| | | | FPS | 51.4 | #2 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | AOT-L | Jaccard (Mean) | 81.1 | #30 |
| | | | F-measure (Mean) | 86.4 | #30 |
| | | | J&F | 83.8 | #29 |
| | | | Speed (FPS) | 18.7 | #21 |
| | | | Params (M) | 8.3 | #6 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | SwinB-AOT-L | Jaccard (Mean) | 82.4 | #20 |
| | | | F-measure (Mean) | 88.4 | #21 |
| | | | J&F | 85.4 | #19 |
| | | | Speed (FPS) | 12.1 | #26 |
| | | | Params (M) | 65.4 | #18 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | AOT-T | Jaccard (Mean) | 77.4 | #42 |
| | | | F-measure (Mean) | 82.3 | #43 |
| | | | J&F | 79.9 | #44 |
| | | | Speed (FPS) | 51.4 | #4 |
| | | | Params (M) | 5.7 | #1 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | AOT-S | Jaccard (Mean) | 78.7 | #39 |
| | | | F-measure (Mean) | 83.9 | #40 |
| | | | J&F | 81.3 | #40 |
| | | | Speed (FPS) | 40.0 | #8 |
| | | | Params (M) | 7.0 | #2 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | AOT-B | Jaccard (Mean) | 79.7 | #35 |
| | | | F-measure (Mean) | 85.2 | #36 |
| | | | J&F | 82.5 | #35 |
| | | | Speed (FPS) | 29.6 | #10 |
| | | | Params (M) | 8.3 | #6 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | R50-AOT-L | Jaccard (Mean) | 82.3 | #21 |
| | | | F-measure (Mean) | 87.5 | #25 |
| | | | J&F | 84.9 | #24 |
| | | | Speed (FPS) | 18.0 | #22 |
| | | | Params (M) | 14.9 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | AOT-S | FPS | 40.0 | #3 |
| | | | D17 val (G) | 79.2 | #3 |
| | | | D17 val (J) | 76.4 | #3 |
| | | | D17 val (F) | 82.0 | #3 |
| Semi-Supervised Video Object Segmentation | MOSE | AOT | J&F | 57.2 | #13 |
| | | | J | 53.1 | #13 |
| | | | F | 61.3 | #13 |
| Semi-Supervised Video Object Segmentation | VOT2020 | AOT-S | EAO | 0.512 | #14 |
| | | | EAO (real-time) | 0.499 | #10 |
| Semi-Supervised Video Object Segmentation | VOT2020 | AOT-L | EAO | 0.574 | #8 |
| | | | EAO (real-time) | 0.560 | #2 |
| Semi-Supervised Video Object Segmentation | VOT2020 | R50-AOT-L | EAO | 0.569 | #10 |
| | | | EAO (real-time) | 0.540 | #7 |
| Semi-Supervised Video Object Segmentation | VOT2020 | AOT-B | EAO | 0.541 | #12 |
| | | | EAO (real-time) | 0.533 | #8 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | R50-AOT-L (all frames) | F-Measure (Seen) | 89.5 | #11 |
| | | | F-Measure (Unseen) | 88.2 | #9 |
| | | | Overall | 85.5 | #10 |
| | | | Speed (FPS) | 6.4 | #22 |
| | | | Jaccard (Seen) | 84.5 | #11 |
| | | | Jaccard (Unseen) | 79.6 | #9 |
| | | | Params (M) | 14.9 | #14 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | AOT-S (all frames) | F-Measure (Seen) | 87.0 | #30 |
| | | | F-Measure (Unseen) | 85.7 | #29 |
| | | | Overall | 83.0 | #29 |
| | | | Speed (FPS) | 27.1 | #7 |
| | | | Jaccard (Seen) | 82.2 | #29 |
| | | | Jaccard (Unseen) | 77.3 | #29 |
| | | | Params (M) | 7.9 | #4 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | AOT-L (all frames) | F-Measure (Seen) | 88.8 | #16 |
| | | | F-Measure (Unseen) | 87.1 | #17 |
| | | | Overall | 84.5 | #16 |
| | | | Speed (FPS) | 6.5 | #21 |
| | | | Jaccard (Seen) | 83.7 | #16 |
| | | | Jaccard (Unseen) | 78.4 | #19 |
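
As a reading aid for the J&F rows, the short check below recomputes J&F from the Jaccard and F-measure means listed for DAVIS 2016. It assumes J&F is the plain average of the two means, which is the usual DAVIS convention. The function name and the 0.1 tolerance (to absorb rounding in the reported figures) are choices made for this sketch.

```python
def j_and_f(jaccard_mean, f_measure_mean):
    """Average of region similarity (J) and contour accuracy (F)."""
    return (jaccard_mean + f_measure_mean) / 2


# (model, Jaccard mean, F-measure mean, reported J&F), copied from the DAVIS 2016 rows.
davis_2016_rows = [
    ("SwinB-AOT-L", 90.7, 93.3, 92.0),
    ("R50-AOT-L", 90.1, 92.1, 91.1),
    ("AOT-S", 88.6, 90.2, 89.4),
    ("AOT-T", 86.1, 87.4, 86.8),
]

# Largest gap between the recomputed and the reported J&F.
max_gap = max(abs(j_and_f(j, f) - reported) for _, j, f, reported in davis_2016_rows)

for model, j, f, reported in davis_2016_rows:
    print(f"{model}: recomputed {j_and_f(j, f):.2f}, reported {reported}")
```

The recomputed values agree with the reported ones to within rounding of the one-decimal figures.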