XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model

14 Jul 2022 · Ho Kei Cheng, Alexander G. Schwing

We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically uses only one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact and thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (that do not work on long videos) on short-video datasets. Code is available at https://hkchengrex.github.io/XMem
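
The interplay of the three stores can be illustrated with a small toy sketch. This is not the authors' implementation: the class `ToyMemory`, its parameters (`working_cap`, `promote_k`, `long_term_cap`), the dot-product softmax read, and the "promote the most-used entries" rule are all hypothetical simplifications chosen only to show how consolidating actively used working-memory entries keeps the total memory size bounded.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    frame: int          # index of the frame this entry came from
    feature: list       # stand-in for a feature vector
    usage: float = 0.0  # accumulated attention weight from reads


class ToyMemory:
    """Toy sensory / working / long-term stores (illustrative assumption only)."""

    def __init__(self, working_cap=4, promote_k=2, long_term_cap=6):
        self.sensory = None   # overwritten on every frame
        self.working = []     # recent entries, kept in full
        self.long_term = []   # compact store of consolidated entries
        self.working_cap = working_cap
        self.promote_k = promote_k
        self.long_term_cap = long_term_cap

    def read(self, query):
        """Softmax-weighted read over both stores; records usage per entry."""
        entries = self.long_term + self.working
        if not entries:
            return None
        scores = [sum(q * f for q, f in zip(query, e.feature)) for e in entries]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [x / total for x in exps]
        for entry, w in zip(entries, weights):
            entry.usage += w
        return [sum(w * e.feature[i] for e, w in zip(entries, weights))
                for i in range(len(query))]

    def write(self, frame, feature):
        """Update sensory memory, append to working memory, consolidate if full."""
        self.sensory = feature
        self.working.append(Entry(frame, feature))
        if len(self.working) > self.working_cap:
            self._consolidate()

    def _consolidate(self):
        """Move the most-used working entries to long-term memory, drop the rest."""
        *candidates, newest = self.working
        candidates.sort(key=lambda e: e.usage, reverse=True)
        self.long_term.extend(candidates[:self.promote_k])
        # Keep long-term memory compact by evicting the least-used entries.
        self.long_term.sort(key=lambda e: e.usage, reverse=True)
        del self.long_term[self.long_term_cap:]
        self.working = [newest]


memory = ToyMemory()
for t in range(20):
    feat = [math.cos(0.3 * t), math.sin(0.3 * t)]
    memory.read(feat)      # query memory with the current frame
    memory.write(t, feat)  # then store the current frame
stored = len(memory.working) + len(memory.long_term)
```

However many frames are written, `stored` never exceeds `working_cap + long_term_cap` in this sketch, which is the bounded-memory behaviour the abstract attributes to consolidation.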


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2016 | XMem | Jaccard (Mean) | 90.4 | #5 |
| | | | F-Score | 92.7 | #6 |
| | | | J&F | 91.5 | #5 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem | Jaccard (Mean) | 90.4 | #17 |
| | | | F-measure (Mean) | 92.7 | #19 |
| | | | J&F | 91.5 | #19 |
| | | | Speed (FPS) | 29.6 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (BL30K) | Jaccard (Mean) | 90.7 | #8 |
| | | | F-measure (Mean) | 93.2 | #15 |
| | | | J&F | 92.0 | #11 |
| | | | Speed (FPS) | 29.6 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (DAVIS only) | Jaccard (Mean) | 86.7 | #42 |
| | | | F-measure (Mean) | 88.9 | #41 |
| | | | J&F | 87.8 | #42 |
| | | | Speed (FPS) | 29.6 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (DAVIS+YouTubeVOS only) | Jaccard (Mean) | 89.6 | #24 |
| | | | F-measure (Mean) | 91.9 | #26 |
| | | | J&F | 90.8 | #24 |
| | | | Speed (FPS) | 29.6 | #13 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (MS) | Jaccard (Mean) | 92.0 | #3 |
| | | | F-measure (Mean) | 93.5 | #12 |
| | | | J&F | 92.7 | #6 |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (BL30K, MS) | Jaccard (Mean) | 92.2 | #2 |
| | | | F-measure (Mean) | 94.4 | #3 |
| | | | J&F | 93.3 | #2 |
| Video Object Segmentation | DAVIS 2016 | XMem (BL30K, MS) | Jaccard (Mean) | 92.2 | #2 |
| | | | F-Score | 94.4 | #2 |
| | | | J&F | 93.3 | #2 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem | J&F | 81.0 | #12 |
| | | | Jaccard (Mean) | 77.4 | #12 |
| | | | F-measure (Mean) | 84.5 | #12 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (DAVIS and YouTubeVOS only) | J&F | 79.8 | #17 |
| | | | Jaccard (Mean) | 76.3 | #15 |
| | | | F-measure (Mean) | 83.4 | #17 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (BL30K, MS) | J&F | 83.7 | #3 |
| | | | Jaccard (Mean) | 80.5 | #3 |
| | | | F-measure (Mean) | 87.0 | #3 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (MS) | J&F | 83.1 | #5 |
| | | | Jaccard (Mean) | 79.7 | #4 |
| | | | F-measure (Mean) | 86.4 | #7 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (BL30K, 600p) | J&F | 82.5 | #8 |
| | | | Jaccard (Mean) | 79.1 | #6 |
| | | | F-measure (Mean) | 85.8 | #8 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (BL30K) | J&F | 81.2 | #10 |
| | | | Jaccard (Mean) | 77.6 | #10 |
| | | | F-measure (Mean) | 84.7 | #11 |
| Video Object Segmentation | DAVIS-2017 (test-dev) | XMem | Mean Jaccard & F-Measure | 81.0 | #2 |
| | | | Jaccard | 77.4 | #2 |
| | | | F-measure | 84.5 | #2 |
| Video Object Segmentation | DAVIS-2017 (test-dev) | XMem (BL30K, MS) | Mean Jaccard & F-Measure | 83.7 | #1 |
| | | | Jaccard | 80.5 | #1 |
| | | | F-measure | 87.0 | #1 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (DAVIS and YouTubeVOS only) | Jaccard (Mean) | 81.4 | #24 |
| | | | F-measure (Mean) | 87.6 | #21 |
| | | | J&F | 84.5 | #23 |
| | | | Speed (FPS) | 22.6 | #15 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (BL30K, MS) | Jaccard (Mean) | 86.3 | #2 |
| | | | F-measure (Mean) | 92.6 | #2 |
| | | | J&F | 89.5 | #2 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (DAVIS only) | Jaccard (Mean) | 74.1 | #44 |
| | | | F-measure (Mean) | 79.3 | #47 |
| | | | J&F | 76.7 | #48 |
| | | | Speed (FPS) | 22.6 | #15 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (MS) | Jaccard (Mean) | 85.4 | #4 |
| | | | F-measure (Mean) | 91.0 | #5 |
| | | | J&F | 88.2 | #4 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem | Jaccard (Mean) | 82.9 | #12 |
| | | | F-measure (Mean) | 89.5 | #8 |
| | | | J&F | 86.2 | #11 |
| | | | Speed (FPS) | 22.6 | #15 |
| Video Object Segmentation | DAVIS 2017 (val) | XMem (BL30K, MS) | Mean Jaccard & F-Measure | 89.5 | #1 |
| | | | Jaccard | 86.3 | #1 |
| | | | F-measure | 92.6 | #1 |
| Video Object Segmentation | DAVIS 2017 (val) | XMem | Mean Jaccard & F-Measure | 86.2 | #2 |
| | | | Jaccard | 82.9 | #2 |
| | | | F-measure | 89.5 | #2 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (BL30K) | Jaccard (Mean) | 84.0 | #8 |
| | | | F-measure (Mean) | 91.4 | #4 |
| | | | J&F | 87.7 | #6 |
| | | | Speed (FPS) | 22.6 | #15 |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | XMem | FPS | 29.6 | #7 |
| Semi-Supervised Video Object Segmentation | Long Video Dataset | XMem | J&F | 89.8±0.2 | #2 |
| | | | J | 88.0±0.2 | #2 |
| | | | F | 91.6±0.2 | #2 |
| Semi-Supervised Video Object Segmentation | Long Video Dataset (3X) | XMem | J&F | 90.0±0.4 | #1 |
| | | | J | 88.2±0.3 | #1 |
| | | | F | 91.8±0.4 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | XMem | J&F | 57.6 | #4 |
| | | | J | 53.3 | #4 |
| | | | F | 62.0 | #4 |
| Video Object Segmentation | YouTube-VOS 2018 | XMem (BL30K, MS) | Jaccard (Seen) | 85.6 | #2 |
| | | | Jaccard (Unseen) | 81.7 | #1 |
| | | | F-Measure (Seen) | 90.3 | #2 |
| | | | F-Measure (Unseen) | 90.2 | #1 |
| | | | Mean Jaccard & F-Measure | 86.9 | #1 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (BL30K, MS) | F-Measure (Seen) | 90.3 | #3 |
| | | | F-Measure (Unseen) | 90.2 | #1 |
| | | | Overall | 86.9 | #1 |
| | | | Jaccard (Seen) | 85.6 | #1 |
| | | | Jaccard (Unseen) | 81.7 | #1 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (MS) | F-Measure (Seen) | 89.9 | #7 |
| | | | F-Measure (Unseen) | 89.9 | #2 |
| | | | Overall | 86.7 | #2 |
| | | | Jaccard (Seen) | 85.3 | #4 |
| | | | Jaccard (Unseen) | 81.7 | #1 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (YouTubeVOS only) | F-Measure (Seen) | 88.5 | #17 |
| | | | F-Measure (Unseen) | 87.2 | #14 |
| | | | Overall | 84.4 | #18 |
| | | | Speed (FPS) | 22.6 | #11 |
| | | | Jaccard (Seen) | 83.7 | #15 |
| | | | Jaccard (Unseen) | 78.2 | #21 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (BL30K) | F-Measure (Seen) | 89.8 | #9 |
| | | | F-Measure (Unseen) | 89.2 | #3 |
| | | | Overall | 86.1 | #5 |
| | | | Speed (FPS) | 22.6 | #11 |
| | | | Jaccard (Seen) | 85.1 | #5 |
| | | | Jaccard (Unseen) | 80.3 | #5 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem | F-Measure (Seen) | 89.3 | #12 |
| | | | F-Measure (Unseen) | 88.7 | #5 |
| | | | Overall | 85.7 | #8 |
| | | | Speed (FPS) | 22.6 | #11 |
| | | | Jaccard (Seen) | 84.6 | #9 |
| | | | Jaccard (Unseen) | 80.2 | #6 |
| Video Object Segmentation | YouTube-VOS 2018 | XMem | Jaccard (Seen) | 84.6 | #5 |
| | | | Jaccard (Unseen) | 80.2 | #4 |
| | | | F-Measure (Seen) | 89.3 | #5 |
| | | | F-Measure (Unseen) | 88.7 | #4 |
| | | | Mean Jaccard & F-Measure | 85.7 | #4 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem (BL30K) | Overall | 85.8 | #7 |
| | | | Jaccard (Seen) | 84.8 | #6 |
| | | | Jaccard (Unseen) | 80.3 | #9 |
| | | | F-Measure (Seen) | 89.2 | #6 |
| | | | F-Measure (Unseen) | 88.8 | #6 |
| Video Object Segmentation | YouTube-VOS 2019 | XMem (BL30K, MS) | Mean Jaccard & F-Measure | 86.8 | #1 |
| | | | Jaccard (Seen) | 85.5 | #1 |
| | | | Jaccard (Unseen) | 81.8 | #1 |
| | | | F-Measure (Seen) | 89.8 | #1 |
| | | | F-Measure (Unseen) | 89.9 | #1 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem (MS) | Overall | 86.4 | #3 |
| | | | Jaccard (Seen) | 84.9 | #5 |
| | | | Jaccard (Unseen) | 81.8 | #3 |
| | | | F-Measure (Seen) | 89.2 | #6 |
| | | | F-Measure (Unseen) | 89.8 | #2 |
| Video Object Segmentation | YouTube-VOS 2019 | XMem | Mean Jaccard & F-Measure | 85.5 | #2 |
| | | | Jaccard (Seen) | 84.3 | #3 |
| | | | Jaccard (Unseen) | 80.3 | #2 |
| | | | F-Measure (Seen) | 88.6 | #3 |
| | | | F-Measure (Unseen) | 88.6 | #2 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem (BL30K, MS) | Overall | 86.8 | #1 |
| | | | Jaccard (Seen) | 85.5 | #1 |
| | | | Jaccard (Unseen) | 81.8 | #3 |
| | | | F-Measure (Seen) | 89.8 | #4 |
| | | | F-Measure (Unseen) | 89.9 | #1 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem | Overall | 84.3 | #13 |
| | | | Jaccard (Seen) | 83.6 | #11 |
| | | | Jaccard (Unseen) | 78.5 | #17 |
| | | | F-Measure (Seen) | 88.0 | #12 |
| | | | F-Measure (Unseen) | 87.1 | #14 |
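
For readers unfamiliar with the metrics listed above, the following minimal sketch shows how they relate. The Jaccard score J is the intersection-over-union of a predicted and a ground-truth mask, J&F is the mean of J and the contour F-measure, and the YouTube-VOS overall score is the mean of the four seen/unseen scores. The helper names `jaccard` and `mean_score` are hypothetical, the masks are made-up toy data, and returning 1.0 for two empty masks is an assumed convention. The F-measure itself (a boundary precision/recall score) is not implemented here.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pairs = [(p, g) for pred_row, gt_row in zip(pred, gt)
             for p, g in zip(pred_row, gt_row)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_score(*parts):
    """Plain average, as used for J&F and for the YouTube-VOS overall score."""
    return sum(parts) / len(parts)


# Toy 2x3 masks (made-up data): 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
toy_j = jaccard(pred, gt)

# DAVIS 2017 (val), XMem: J = 82.9, F = 89.5 (listed J&F: 86.2).
davis_jf = mean_score(82.9, 89.5)

# YouTube-VOS 2018, XMem: J seen, J unseen, F seen, F unseen (listed overall: 85.7).
ytvos_overall = mean_score(84.6, 80.2, 89.3, 88.7)
```

Because the listed scores are rounded to one decimal, recomputing an average from the rounded components can differ from the listed value by about 0.1 for some rows.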

Methods