Decoupling Features in Hierarchical Propagation for Video Object Segmentation

18 Oct 2022 · Zongxin Yang, Yi Yang

This paper focuses on developing a more effective method of hierarchical propagation for semi-supervised Video Object Segmentation (VOS). Based on vision transformers, the recently developed Associating Objects with Transformers (AOT) approach introduces hierarchical propagation into VOS and has shown promising results. Hierarchical propagation gradually propagates information from past frames to the current frame and transfers the current frame feature from object-agnostic to object-specific. However, the increase in object-specific information inevitably leads to the loss of object-agnostic visual information in deep propagation layers. To solve this problem and further facilitate the learning of visual embeddings, this paper proposes a Decoupling Features in Hierarchical Propagation (DeAOT) approach. First, DeAOT decouples the hierarchical propagation of object-agnostic and object-specific embeddings by handling them in two independent branches. Second, to compensate for the additional computation from dual-branch propagation, we propose an efficient module for constructing hierarchical propagation, the Gated Propagation Module, which is carefully designed with single-head attention. Extensive experiments show that DeAOT significantly outperforms AOT in both accuracy and efficiency. On YouTube-VOS, DeAOT achieves 86.0% at 22.4 fps and 82.0% at 53.4 fps. Without test-time augmentations, we achieve new state-of-the-art performance on four benchmarks, i.e., YouTube-VOS (86.2%), DAVIS 2017 (86.2%), DAVIS 2016 (92.9%), and VOT 2020 (0.622). Project page: https://github.com/z-x-yang/AOT.
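The two ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy, pure-Python example of a single-head attention step in which the attention map is computed once from the visual (object-agnostic) embeddings and then applied in both the visual branch and the ID (object-specific) branch, with each output passed through an element-wise gate. Reusing the attention map across branches, the sigmoid gate, all function names, and the toy dimensions are assumptions made for illustration; the abstract only states that the two kinds of embeddings are propagated in separate branches and that the Gated Propagation Module uses single-head attention.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Single-head scaled dot-product attention weights, one row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def apply_attention(weights, values):
    """Weighted sum of the value vectors for each query."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]


def gate(outputs, gate_inputs):
    """Element-wise sigmoid gate (an assumed gating function)."""
    return [
        [o / (1.0 + math.exp(-g)) for o, g in zip(out, gin)]
        for out, gin in zip(outputs, gate_inputs)
    ]


def dual_branch_step(cur_visual, mem_visual, cur_id, mem_id):
    """One hypothetical propagation step with two decoupled branches."""
    # The attention map is computed once, from visual embeddings only.
    weights = attention_map(cur_visual, mem_visual)
    # Visual branch: propagates object-agnostic features.
    visual_out = gate(apply_attention(weights, mem_visual), cur_visual)
    # ID branch: propagates object-specific features with the same weights.
    id_out = gate(apply_attention(weights, mem_id), cur_id)
    return visual_out, id_out, weights


# Toy data: 2 current-frame locations, 3 memory locations, 2-dim embeddings.
cur_visual = [[1.0, 0.0], [0.0, 1.0]]
mem_visual = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
cur_id = [[0.0, 0.0], [0.0, 0.0]]              # no identity information yet
mem_id = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # object 1, object 2, background

visual_out, id_out, weights = dual_branch_step(cur_visual, mem_visual, cur_id, mem_id)
print(id_out)
```

In this toy setup the first current-frame location is most similar to the first memory location, so its propagated ID vector leans toward object 1, while the visual branch keeps its own object-agnostic output instead of being overwritten by identity information.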

Results (task: Semi-Supervised Video Object Segmentation). Each cell gives the metric value followed by its global rank in parentheses.

DAVIS 2016

| Model | Jaccard (Mean) | F-measure (Mean) | J&F | Speed (FPS) |
|---|---|---|---|---|
| DeAOT-T | 87.8 (#37) | 89.9 (#37) | 88.9 (#38) | 63.5 (#4) |
| DeAOT-S | 87.6 (#38) | 90.9 (#32) | 89.3 (#36) | 49.2 (#7) |
| DeAOT-B | 89.4 (#29) | 92.5 (#22) | 91.0 (#22) | 40.9 (#8) |
| DeAOT-L | 90.3 (#19) | 93.7 (#10) | 92.0 (#11) | 28.5 (#18) |
| R50-DeAOT-L | 90.5 (#13) | 94.0 (#8) | 92.3 (#9) | 27.0 (#19) |
| SwinB-DeAOT-L | 91.1 (#6) | 94.7 (#1) | 92.9 (#5) | 15.4 (#27) |

DAVIS 2017 (test-dev), Semi-Supervised Video Object Segmentation; cells are metric value (global rank)

| Model | J&F | Jaccard (Mean) | F-measure (Mean) | FPS |
|---|---|---|---|---|
| DeAOT-T | 73.7 (#36) | 70.0 (#36) | 77.3 (#36) | 63.5 (#1) |
| DeAOT-S | 75.4 (#32) | 71.9 (#31) | 79.0 (#32) | 49.2 (#3) |
| DeAOT-B | 76.2 (#30) | 72.5 (#30) | 79.9 (#30) | 40.9 (#4) |
| DeAOT-L | 77.9 (#27) | 74.1 (#27) | 81.7 (#25) | 28.5 (#8) |
| R50-DeAOT-L | 80.7 (#17) | 76.9 (#17) | 84.5 (#15) | 27.0 (#9) |
| SwinB-DeAOT-L | 82.8 (#9) | 78.9 (#10) | 86.7 (#8) | 15.4 (#17) |

DAVIS 2017 (val), Semi-Supervised Video Object Segmentation; cells are metric value (global rank)

| Model | Jaccard (Mean) | F-measure (Mean) | J&F | Speed (FPS) | Params (M) |
|---|---|---|---|---|---|
| DeAOT-T | 77.7 (#41) | 83.3 (#42) | 80.5 (#42) | 63.5 (#3) | 7.2 (#3) |
| DeAOT-S | 77.8 (#40) | 83.8 (#41) | 80.8 (#41) | 49.2 (#5) | 10.2 (#8) |
| DeAOT-B | 79.2 (#36) | 85.1 (#37) | 82.2 (#37) | 40.9 (#7) | 13.2 (#10) |
| DeAOT-L | 81.0 (#31) | 87.1 (#28) | 84.1 (#28) | 28.5 (#11) | 13.2 (#10) |
| R50-DeAOT-L | 82.2 (#22) | 88.2 (#22) | 85.2 (#23) | 27.0 (#12) | 19.8 (#16) |
| SwinB-DeAOT-L | 83.1 (#13) | 89.2 (#15) | 86.2 (#14) | 15.4 (#25) | 70.3 (#22) |

MOSE, Semi-Supervised Video Object Segmentation; cells are metric value (global rank)

| Model | J&F | J | F |
|---|---|---|---|
| DeAOT | 59.4 (#11) | 55.1 (#11) | 63.8 (#11) |

VOT2020, Semi-Supervised Video Object Segmentation; cells are metric value (global rank)

| Model | EAO | EAO (real-time) |
|---|---|---|
| DeAOT-T | 0.472 (#16) | 0.463 (#12) |
| DeAOT-S | 0.593 (#4) | 0.559 (#3) |
| DeAOT-B | 0.571 (#9) | 0.542 (#6) |
| DeAOT-L | 0.591 (#5) | 0.554 (#5) |
| R50-DeAOT-L | 0.613 (#2) | 0.571 (#1) |
| SwinB-DeAOT-L | 0.622 (#1) | 0.559 (#3) |

YouTube-VOS 2018, Semi-Supervised Video Object Segmentation; cells are metric value (global rank), and "-" marks a value not listed

| Model | F-Measure (Seen) | F-Measure (Unseen) | Overall | Jaccard (Seen) | Jaccard (Unseen) | Speed (FPS) | Params (M) |
|---|---|---|---|---|---|---|---|
| DeAOT-T | 86.3 (#34) | 84.2 (#34) | 82.0 (#33) | 81.6 (#34) | 75.8 (#34) | 53.4 (#2) | 7.2 (#3) |
| DeAOT-S | 88.3 (#22) | 86.6 (#22) | 84.0 (#24) | 83.3 (#21) | 77.9 (#25) | 38.7 (#3) | 10.2 (#10) |
| DeAOT-B | 88.9 (#15) | 87.0 (#18) | 84.6 (#15) | 83.9 (#14) | 78.5 (#18) | 30.4 (#5) | 13.2 (#12) |
| DeAOT-L | 89.4 (#12) | - | 84.8 (#14) | 84.2 (#13) | 78.6 (#17) | 24.7 (#6) | - |
| R50-DeAOT-L | 89.9 (#8) | 88.7 (#6) | 86.0 (#7) | 84.9 (#9) | 80.4 (#5) | 22.4 (#7) | 19.8 (#18) |
| SwinB-DeAOT-L | 90.6 (#3) | 88.4 (#8) | 86.2 (#5) | 85.6 (#2) | 80.0 (#8) | 11.9 (#11) | 70.3 (#23) |

YouTube-VOS 2019, Semi-Supervised Video Object Segmentation; cells are metric value (global rank)

| Model | Overall | Jaccard (Seen) | Jaccard (Unseen) | F-Measure (Seen) | F-Measure (Unseen) | FPS |
|---|---|---|---|---|---|---|
| DeAOT-T | 82.0 (#19) | 81.2 (#20) | 76.4 (#21) | 85.6 (#20) | 84.7 (#20) | 53.4 (#1) |
| DeAOT-S | 83.8 (#18) | 82.8 (#16) | 78.1 (#19) | 87.5 (#16) | 86.8 (#18) | 38.7 (#2) |
| DeAOT-B | 84.6 (#13) | 83.5 (#13) | 79.1 (#14) | 88.3 (#12) | 87.5 (#13) | 30.4 (#3) |
| DeAOT-L | 84.7 (#12) | 83.8 (#10) | 79.0 (#16) | 88.8 (#10) | 87.2 (#14) | 24.7 (#5) |
| R50-DeAOT-L | 85.9 (#7) | 84.6 (#8) | 80.8 (#7) | 89.4 (#6) | 88.9 (#6) | 22.4 (#6) |
| SwinB-DeAOT-L | 86.1 (#6) | 85.3 (#5) | 80.4 (#9) | 90.2 (#3) | 88.6 (#9) | 11.9 (#7) |
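
The aggregate columns in the results above are plain averages of their components: on DAVIS, J&F is the mean of the Jaccard and F-measure means, and on YouTube-VOS, Overall is the mean of the four seen/unseen Jaccard and F-measure scores. The short check below recomputes two of the listed SwinB-DeAOT-L figures. The helper name `mean_score` is hypothetical, and because the listed components are already rounded to one decimal, the recomputed value can differ from the listed aggregate by a small rounding margin.

```python
def mean_score(*components: float) -> float:
    """Unweighted mean of the component scores."""
    return sum(components) / len(components)


# DAVIS 2016, SwinB-DeAOT-L: Jaccard (Mean) 91.1, F-measure (Mean) 94.7.
davis_jf = mean_score(91.1, 94.7)

# YouTube-VOS 2018, SwinB-DeAOT-L:
# Jaccard (Seen) 85.6, Jaccard (Unseen) 80.0,
# F-Measure (Seen) 90.6, F-Measure (Unseen) 88.4.
ytvos_overall = mean_score(85.6, 80.0, 90.6, 88.4)

# Compare against the listed aggregates (J&F 92.9, Overall 86.2).
print(f"DAVIS 2016 J&F: {davis_jf:.2f} (listed 92.9)")
print(f"YouTube-VOS 2018 Overall: {ytvos_overall:.2f} (listed 86.2)")
```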

Methods