XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model

14 Jul 2022 · Ho Kei Cheng, Alexander G. Schwing

We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically uses only one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact and thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (that do not work on long videos) on short-video datasets. Code is available at https://hkchengrex.github.io/XMem
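
The consolidation idea in the abstract (promote actively used working-memory elements into a compact long-term store, then free the working memory) can be illustrated with a toy sketch. This is not the XMem implementation: the class name `ToyMemory`, the parameters `working_capacity` and `keep_top_k`, and the plain usage counter are all assumptions made for illustration, and the actual potentiation step in the paper operates on feature tensors rather than string keys.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """Toy two-tier memory store (hypothetical, NOT the XMem code)."""
    working_capacity: int = 4   # assumed limit on working-memory entries
    keep_top_k: int = 2         # assumed number of entries promoted per consolidation
    working: dict = field(default_factory=dict)    # key -> usage count
    long_term: list = field(default_factory=list)  # promoted keys (compact store)

    def add(self, key: str) -> None:
        """Insert a new key, consolidating first if working memory is full."""
        if len(self.working) >= self.working_capacity:
            self.consolidate()
        self.working[key] = 0

    def read(self, key: str) -> None:
        """Record that a working-memory entry was used by a query."""
        if key in self.working:
            self.working[key] += 1

    def consolidate(self) -> None:
        """Promote the most-used entries to long-term memory and clear the rest."""
        ranked = sorted(self.working, key=self.working.get, reverse=True)
        self.long_term.extend(ranked[: self.keep_top_k])
        self.working.clear()


mem = ToyMemory()
for name in ["f0", "f1", "f2", "f3"]:
    mem.add(name)
mem.read("f1")
mem.read("f1")
mem.read("f3")
mem.add("f4")  # working memory is full, so this triggers consolidation
print(mem.long_term)  # ['f1', 'f3']
print(mem.working)    # {'f4': 0}
```

Because only `keep_top_k` entries survive each consolidation, the long-term store grows far more slowly than the number of frames seen, which is the property the abstract describes as avoiding memory explosion.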

Results from the Paper


 Ranked #1 on Video Object Segmentation on YouTube-VOS 2019 (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (BL30K) | Jaccard (Mean) | 90.7 | #8 |
| | | | F-measure (Mean) | 93.2 | #15 |
| | | | J&F | 92.0 | #11 |
| | | | Speed (FPS) | 29.6 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem | Jaccard (Mean) | 90.4 | #17 |
| | | | F-measure (Mean) | 92.7 | #19 |
| | | | J&F | 91.5 | #19 |
| | | | Speed (FPS) | 29.6 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (DAVIS only) | Jaccard (Mean) | 86.7 | #42 |
| | | | F-measure (Mean) | 88.9 | #41 |
| | | | J&F | 87.8 | #42 |
| | | | Speed (FPS) | 29.6 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (BL30K, MS) | Jaccard (Mean) | 92.2 | #2 |
| | | | F-measure (Mean) | 94.4 | #3 |
| | | | J&F | 93.3 | #2 |
| Video Object Segmentation | DAVIS 2016 | XMem (BL30K, MS) | Jaccard (Mean) | 92.2 | #2 |
| | | | F-Score | 94.4 | #2 |
| | | | J&F | 93.3 | #2 |
| Video Object Segmentation | DAVIS 2016 | XMem | Jaccard (Mean) | 90.4 | #5 |
| | | | F-Score | 92.7 | #6 |
| | | | J&F | 91.5 | #5 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (DAVIS+YouTubeVOS only) | Jaccard (Mean) | 89.6 | #24 |
| | | | F-measure (Mean) | 91.9 | #26 |
| | | | J&F | 90.8 | #24 |
| | | | Speed (FPS) | 29.6 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (MS) | Jaccard (Mean) | 92.0 | #3 |
| | | | F-measure (Mean) | 93.5 | #12 |
| | | | J&F | 92.7 | #6 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (BL30K, 600p) | J&F | 82.5 | #11 |
| | | | Jaccard (Mean) | 79.1 | #9 |
| | | | F-measure (Mean) | 85.8 | #11 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem | J&F | 81.0 | #15 |
| | | | Jaccard (Mean) | 77.4 | #15 |
| | | | F-measure (Mean) | 84.5 | #15 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (DAVIS and YouTubeVOS only) | J&F | 79.8 | #20 |
| | | | Jaccard (Mean) | 76.3 | #18 |
| | | | F-measure (Mean) | 83.4 | #20 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (BL30K, MS) | J&F | 83.7 | #6 |
| | | | Jaccard (Mean) | 80.5 | #6 |
| | | | F-measure (Mean) | 87.0 | #6 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (MS) | J&F | 83.1 | #8 |
| | | | Jaccard (Mean) | 79.7 | #7 |
| | | | F-measure (Mean) | 86.4 | #10 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (BL30K) | J&F | 81.2 | #13 |
| | | | Jaccard (Mean) | 77.6 | #13 |
| | | | F-measure (Mean) | 84.7 | #14 |
| Video Object Segmentation | DAVIS-2017 (test-dev) | XMem | Mean Jaccard & F-Measure | 81.0 | #2 |
| | | | Jaccard | 77.4 | #2 |
| | | | F-measure | 84.5 | #2 |
| Video Object Segmentation | DAVIS-2017 (test-dev) | XMem (BL30K, MS) | Mean Jaccard & F-Measure | 83.7 | #1 |
| | | | Jaccard | 80.5 | #1 |
| | | | F-measure | 87.0 | #1 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (MS) | Jaccard (Mean) | 85.4 | #6 |
| | | | F-measure (Mean) | 91.0 | #7 |
| | | | J&F | 88.2 | #5 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (BL30K, MS) | Jaccard (Mean) | 86.3 | #3 |
| | | | F-measure (Mean) | 92.6 | #3 |
| | | | J&F | 89.5 | #3 |
| Video Object Segmentation | DAVIS 2017 (val) | XMem | Mean Jaccard & F-Measure | 86.2 | #2 |
| | | | Jaccard | 82.9 | #2 |
| | | | F-measure | 89.5 | #2 |
| Video Object Segmentation | DAVIS 2017 (val) | XMem (BL30K, MS) | Mean Jaccard & F-Measure | 89.5 | #1 |
| | | | Jaccard | 86.3 | #1 |
| | | | F-measure | 92.6 | #1 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem | Jaccard (Mean) | 82.9 | #15 |
| | | | F-measure (Mean) | 89.5 | #11 |
| | | | J&F | 86.2 | #14 |
| | | | Speed (FPS) | 22.6 | #15 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (DAVIS only) | Jaccard (Mean) | 74.1 | #47 |
| | | | F-measure (Mean) | 79.3 | #50 |
| | | | J&F | 76.7 | #51 |
| | | | Speed (FPS) | 22.6 | #15 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (BL30K) | Jaccard (Mean) | 84.0 | #11 |
| | | | F-measure (Mean) | 91.4 | #5 |
| | | | J&F | 87.7 | #9 |
| | | | Speed (FPS) | 22.6 | #15 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (DAVIS and YouTubeVOS only) | Jaccard (Mean) | 81.4 | #27 |
| | | | F-measure (Mean) | 87.6 | #24 |
| | | | J&F | 84.5 | #26 |
| | | | Speed (FPS) | 22.6 | #15 |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | XMem | FPS | 29.6 | #7 |
| Semi-Supervised Video Object Segmentation | Long Video Dataset | XMem | J&F | 89.8±0.2 | #2 |
| | | | J | 88.0±0.2 | #2 |
| | | | F | 91.6±0.2 | #2 |
| Semi-Supervised Video Object Segmentation | Long Video Dataset (3X) | XMem | J&F | 90.0±0.4 | #1 |
| | | | J | 88.2±0.3 | #1 |
| | | | F | 91.8±0.4 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | XMem | J&F | 57.6 | #12 |
| | | | J | 53.3 | #12 |
| | | | F | 62.0 | #12 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (BL30K, MS) | F-Measure (Seen) | 90.3 | #4 |
| | | | F-Measure (Unseen) | 90.2 | #1 |
| | | | Overall | 86.9 | #2 |
| | | | Jaccard (Seen) | 85.6 | #2 |
| | | | Jaccard (Unseen) | 81.7 | #2 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem | F-Measure (Seen) | 89.3 | #13 |
| | | | F-Measure (Unseen) | 88.7 | #6 |
| | | | Overall | 85.7 | #9 |
| | | | Speed (FPS) | 22.6 | #11 |
| | | | Jaccard (Seen) | 84.6 | #10 |
| | | | Jaccard (Unseen) | 80.2 | #7 |
| Video Object Segmentation | YouTube-VOS 2018 | XMem (BL30K, MS) | Jaccard (Seen) | 85.6 | #1 |
| | | | Jaccard (Unseen) | 81.7 | #1 |
| | | | F-Measure (Seen) | 90.3 | #1 |
| | | | F-Measure (Unseen) | 90.2 | #1 |
| | | | Mean Jaccard & F-Measure | 86.9 | #1 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (MS) | F-Measure (Seen) | 89.9 | #8 |
| | | | F-Measure (Unseen) | 89.9 | #3 |
| | | | Overall | 86.7 | #3 |
| | | | Jaccard (Seen) | 85.3 | #5 |
| | | | Jaccard (Unseen) | 81.7 | #2 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (YouTubeVOS only) | F-Measure (Seen) | 88.5 | #18 |
| | | | F-Measure (Unseen) | 87.2 | #15 |
| | | | Overall | 84.4 | #19 |
| | | | Speed (FPS) | 22.6 | #11 |
| | | | Jaccard (Seen) | 83.7 | #16 |
| | | | Jaccard (Unseen) | 78.2 | #22 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (BL30K) | F-Measure (Seen) | 89.8 | #10 |
| | | | F-Measure (Unseen) | 89.2 | #4 |
| | | | Overall | 86.1 | #6 |
| | | | Speed (FPS) | 22.6 | #11 |
| | | | Jaccard (Seen) | 85.1 | #6 |
| | | | Jaccard (Unseen) | 80.3 | #6 |
| Video Object Segmentation | YouTube-VOS 2019 | XMem (BL30K, MS) | Mean Jaccard & F-Measure | 86.8 | #1 |
| | | | Jaccard (Seen) | 85.5 | #1 |
| | | | Jaccard (Unseen) | 81.8 | #1 |
| | | | F-Measure (Seen) | 89.8 | #1 |
| | | | F-Measure (Unseen) | 89.9 | #1 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem (MS) | Overall | 86.4 | #4 |
| | | | Jaccard (Seen) | 84.9 | #6 |
| | | | Jaccard (Unseen) | 81.8 | #4 |
| | | | F-Measure (Seen) | 89.2 | #7 |
| | | | F-Measure (Unseen) | 89.8 | #3 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem (BL30K) | Overall | 85.8 | #8 |
| | | | Jaccard (Seen) | 84.8 | #7 |
| | | | Jaccard (Unseen) | 80.3 | #10 |
| | | | F-Measure (Seen) | 89.2 | #7 |
| | | | F-Measure (Unseen) | 88.8 | #7 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem | Overall | 84.3 | #14 |
| | | | Jaccard (Seen) | 83.6 | #12 |
| | | | Jaccard (Unseen) | 78.5 | #18 |
| | | | F-Measure (Seen) | 88.0 | #13 |
| | | | F-Measure (Unseen) | 87.1 | #15 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem (BL30K, MS) | Overall | 86.8 | #2 |
| | | | Jaccard (Seen) | 85.5 | #2 |
| | | | Jaccard (Unseen) | 81.8 | #4 |
| | | | F-Measure (Seen) | 89.8 | #5 |
| | | | F-Measure (Unseen) | 89.9 | #2 |
| Video Object Segmentation | YouTube-VOS 2019 | XMem | Mean Jaccard & F-Measure | 85.5 | #2 |
| | | | Jaccard (Seen) | 84.3 | #3 |
| | | | Jaccard (Unseen) | 80.3 | #2 |
| | | | F-Measure (Seen) | 88.6 | #3 |
| | | | F-Measure (Unseen) | 88.6 | #2 |
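
For readers cross-checking the listing above: on DAVIS-style benchmarks J&F is the arithmetic mean of the Jaccard (J) and F-measure (F) scores, and the YouTube-VOS "Overall" score averages the four seen/unseen J and F values. The sketch below checks a few rows under that convention. The helper names and the tolerance of 0.1 are assumptions for illustration; a small tolerance is needed because the reported aggregates are presumably computed from unrounded scores.

```python
def mean_score(*parts: float) -> float:
    """Arithmetic mean of component scores (e.g. J and F)."""
    return sum(parts) / len(parts)


def consistent(reported: float, *parts: float, tol: float = 0.1) -> bool:
    """True if the reported aggregate matches the mean of its parts within tol."""
    return abs(mean_score(*parts) - reported) <= tol


# Values copied from the results listing above.
print(consistent(86.2, 82.9, 89.5))              # DAVIS 2017 (val), XMem
print(consistent(89.8, 88.0, 91.6))              # Long Video Dataset, XMem
print(consistent(86.9, 85.6, 81.7, 90.3, 90.2))  # YouTube-VOS 2018, XMem (BL30K, MS)
```

Each call prints `True` for the rows shown.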

Methods