XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically only uses one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (that do not work on long videos) on short-video datasets. Code is available at https://hkchengrex.github.io/XMem
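The interplay of the three memory stores described in the abstract can be illustrated with a toy sketch. This is not XMem's implementation: the class and parameter names (`ToyMemory`, `working_capacity`, `keep_top`), the scalar "features", and the nearest-value read are all hypothetical simplifications. It only shows the general idea of promoting frequently read working-memory entries into a compact long-term store so that total memory stays bounded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    frame: int        # index of the frame this feature came from
    feature: float    # scalar stand-in for a feature vector (assumption)
    usage: int = 0    # number of times this entry was read


@dataclass
class ToyMemory:
    working_capacity: int = 4   # max working entries before consolidation (assumed)
    keep_top: int = 2           # entries promoted per consolidation (assumed)
    sensory: float = 0.0        # overwritten on every frame
    working: List[Entry] = field(default_factory=list)
    long_term: List[Entry] = field(default_factory=list)

    def read(self, query: float) -> Optional[Entry]:
        """Return the stored entry closest to the query and count the access."""
        candidates = self.working + self.long_term
        if not candidates:
            return None
        best = min(candidates, key=lambda e: abs(e.feature - query))
        best.usage += 1
        return best

    def step(self, frame: int, feature: float) -> None:
        """Process one frame: refresh sensory memory, store into working memory."""
        self.sensory = feature
        self.working.append(Entry(frame, feature))
        if len(self.working) > self.working_capacity:
            self.consolidate()

    def consolidate(self) -> None:
        """Promote the most-used working entries to long-term; drop the rest."""
        ranked = sorted(self.working, key=lambda e: e.usage, reverse=True)
        self.long_term.extend(ranked[: self.keep_top])
        self.working.clear()
```

In this sketch, memory growth is capped because each consolidation keeps at most `keep_top` entries, whatever the video length; the actual selection and compression used by XMem are described in the paper.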
Results from the Paper
Ranked #1 on Video Object Segmentation on YouTube-VOS 2019 (using extra training data)
Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data | Benchmark |
---|---|---|---|---|---|---|---|
Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (BL30K) | Jaccard (Mean) | 90.7 | # 8 | ||
F-measure (Mean) | 93.2 | # 15 | |||||
J&F | 92.0 | # 11 | |||||
Speed (FPS) | 29.6 | # 13 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem | Jaccard (Mean) | 90.4 | # 17 | ||
F-measure (Mean) | 92.7 | # 19 | |||||
J&F | 91.5 | # 19 | |||||
Speed (FPS) | 29.6 | # 13 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (DAVIS only) | Jaccard (Mean) | 86.7 | # 42 | ||
F-measure (Mean) | 88.9 | # 41 | |||||
J&F | 87.8 | # 42 | |||||
Speed (FPS) | 29.6 | # 13 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (BL30K, MS) | Jaccard (Mean) | 92.2 | # 2 | ||
F-measure (Mean) | 94.4 | # 3 | |||||
J&F | 93.3 | # 2 | |||||
Video Object Segmentation | DAVIS 2016 | XMem (BL30K, MS) | Jaccard (Mean) | 92.2 | # 2 | ||
F-Score | 94.4 | # 2 | |||||
J&F | 93.3 | # 2 | |||||
Video Object Segmentation | DAVIS 2016 | XMem | Jaccard (Mean) | 90.4 | # 5 | ||
F-Score | 92.7 | # 6 | |||||
J&F | 91.5 | # 5 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (DAVIS+YouTubeVOS only) | Jaccard (Mean) | 89.6 | # 24 | ||
F-measure (Mean) | 91.9 | # 26 | |||||
J&F | 90.8 | # 24 | |||||
Speed (FPS) | 29.6 | # 13 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2016 | XMem (MS) | Jaccard (Mean) | 92.0 | # 3 | ||
F-measure (Mean) | 93.5 | # 12 | |||||
J&F | 92.7 | # 6 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (BL30K, 600p) | J&F | 82.5 | # 11 | ||
Jaccard (Mean) | 79.1 | # 9 | |||||
F-measure (Mean) | 85.8 | # 11 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem | J&F | 81.0 | # 15 | ||
Jaccard (Mean) | 77.4 | # 15 | |||||
F-measure (Mean) | 84.5 | # 15 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (DAVIS and YouTubeVOS only) | J&F | 79.8 | # 20 | ||
Jaccard (Mean) | 76.3 | # 18 | |||||
F-measure (Mean) | 83.4 | # 20 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (BL30K, MS) | J&F | 83.7 | # 6 | ||
Jaccard (Mean) | 80.5 | # 6 | |||||
F-measure (Mean) | 87.0 | # 6 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (MS) | J&F | 83.1 | # 8 | ||
Jaccard (Mean) | 79.7 | # 7 | |||||
F-measure (Mean) | 86.4 | # 10 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | XMem (BL30K) | J&F | 81.2 | # 13 | ||
Jaccard (Mean) | 77.6 | # 13 | |||||
F-measure (Mean) | 84.7 | # 14 | |||||
Video Object Segmentation | DAVIS-2017 (test-dev) | XMem | Mean Jaccard & F-Measure | 81.0 | # 2 | ||
Jaccard | 77.4 | # 2 | |||||
F-measure | 84.5 | # 2 | |||||
Video Object Segmentation | DAVIS-2017 (test-dev) | XMem (BL30K, MS) | Mean Jaccard & F-Measure | 83.7 | # 1 | ||
Jaccard | 80.5 | # 1 | |||||
F-measure | 87.0 | # 1 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (MS) | Jaccard (Mean) | 85.4 | # 6 | ||
F-measure (Mean) | 91.0 | # 7 | |||||
J&F | 88.2 | # 5 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (BL30K, MS) | Jaccard (Mean) | 86.3 | # 3 | ||
F-measure (Mean) | 92.6 | # 3 | |||||
J&F | 89.5 | # 3 | |||||
Video Object Segmentation | DAVIS 2017 (val) | XMem | Mean Jaccard & F-Measure | 86.2 | # 2 | ||
Jaccard | 82.9 | # 2 | |||||
F-measure | 89.5 | # 2 | |||||
Video Object Segmentation | DAVIS 2017 (val) | XMem (BL30K, MS) | Mean Jaccard & F-Measure | 89.5 | # 1 | ||
Jaccard | 86.3 | # 1 | |||||
F-measure | 92.6 | # 1 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem | Jaccard (Mean) | 82.9 | # 15 | ||
F-measure (Mean) | 89.5 | # 11 | |||||
J&F | 86.2 | # 14 | |||||
Speed (FPS) | 22.6 | # 15 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (DAVIS only) | Jaccard (Mean) | 74.1 | # 47 | ||
F-measure (Mean) | 79.3 | # 50 | |||||
J&F | 76.7 | # 51 | |||||
Speed (FPS) | 22.6 | # 15 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (BL30K) | Jaccard (Mean) | 84.0 | # 11 | ||
F-measure (Mean) | 91.4 | # 5 | |||||
J&F | 87.7 | # 9 | |||||
Speed (FPS) | 22.6 | # 15 | |||||
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | XMem (DAVIS and YouTubeVOS only) | Jaccard (Mean) | 81.4 | # 27 | ||
F-measure (Mean) | 87.6 | # 24 | |||||
J&F | 84.5 | # 26 | |||||
Speed (FPS) | 22.6 | # 15 | |||||
Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | XMem | FPS | 29.6 | # 7 | ||
Semi-Supervised Video Object Segmentation | Long Video Dataset | XMem | J&F | 89.8±0.2 | # 2 | ||
J | 88.0±0.2 | # 2 | |||||
F | 91.6±0.2 | # 2 | |||||
Semi-Supervised Video Object Segmentation | Long Video Dataset (3X) | XMem | J&F | 90.0±0.4 | # 1 | ||
J | 88.2±0.3 | # 1 | |||||
F | 91.8±0.4 | # 1 | |||||
Semi-Supervised Video Object Segmentation | MOSE | XMem | J&F | 57.6 | # 12 | ||
J | 53.3 | # 12 | |||||
F | 62.0 | # 12 | |||||
Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (BL30K, MS) | F-Measure (Seen) | 90.3 | # 4 | ||
F-Measure (Unseen) | 90.2 | # 1 | |||||
Overall | 86.9 | # 2 | |||||
Jaccard (Seen) | 85.6 | # 2 | |||||
Jaccard (Unseen) | 81.7 | # 2 | |||||
Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem | F-Measure (Seen) | 89.3 | # 13 | ||
F-Measure (Unseen) | 88.7 | # 6 | |||||
Overall | 85.7 | # 9 | |||||
Speed (FPS) | 22.6 | # 11 | |||||
Jaccard (Seen) | 84.6 | # 10 | |||||
Jaccard (Unseen) | 80.2 | # 7 | |||||
Video Object Segmentation | YouTube-VOS 2018 | XMem (BL30K, MS) | Jaccard (Seen) | 85.6 | # 1 | ||
Jaccard (Unseen) | 81.7 | # 1 | |||||
F-Measure (Seen) | 90.3 | # 1 | |||||
F-Measure (Unseen) | 90.2 | # 1 | |||||
Mean Jaccard & F-Measure | 86.9 | # 1 | |||||
Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (MS) | F-Measure (Seen) | 89.9 | # 8 | ||
F-Measure (Unseen) | 89.9 | # 3 | |||||
Overall | 86.7 | # 3 | |||||
Jaccard (Seen) | 85.3 | # 5 | |||||
Jaccard (Unseen) | 81.7 | # 2 | |||||
Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (YouTubeVOS only) | F-Measure (Seen) | 88.5 | # 18 | ||
F-Measure (Unseen) | 87.2 | # 15 | |||||
Overall | 84.4 | # 19 | |||||
Speed (FPS) | 22.6 | # 11 | |||||
Jaccard (Seen) | 83.7 | # 16 | |||||
Jaccard (Unseen) | 78.2 | # 22 | |||||
Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | XMem (BL30K) | F-Measure (Seen) | 89.8 | # 10 | ||
F-Measure (Unseen) | 89.2 | # 4 | |||||
Overall | 86.1 | # 6 | |||||
Speed (FPS) | 22.6 | # 11 | |||||
Jaccard (Seen) | 85.1 | # 6 | |||||
Jaccard (Unseen) | 80.3 | # 6 | |||||
Video Object Segmentation | YouTube-VOS 2019 | XMem (BL30K, MS) | Mean Jaccard & F-Measure | 86.8 | # 1 | ||
Jaccard (Seen) | 85.5 | # 1 | |||||
Jaccard (Unseen) | 81.8 | # 1 | |||||
F-Measure (Seen) | 89.8 | # 1 | |||||
F-Measure (Unseen) | 89.9 | # 1 | |||||
Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem (MS) | Overall | 86.4 | # 4 | ||
Jaccard (Seen) | 84.9 | # 6 | |||||
Jaccard (Unseen) | 81.8 | # 4 | |||||
F-Measure (Seen) | 89.2 | # 7 | |||||
F-Measure (Unseen) | 89.8 | # 3 | |||||
Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem (BL30K) | Overall | 85.8 | # 8 | ||
Jaccard (Seen) | 84.8 | # 7 | |||||
Jaccard (Unseen) | 80.3 | # 10 | |||||
F-Measure (Seen) | 89.2 | # 7 | |||||
F-Measure (Unseen) | 88.8 | # 7 | |||||
Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem | Overall | 84.3 | # 14 | ||
Jaccard (Seen) | 83.6 | # 12 | |||||
Jaccard (Unseen) | 78.5 | # 18 | |||||
F-Measure (Seen) | 88.0 | # 13 | |||||
F-Measure (Unseen) | 87.1 | # 15 | |||||
Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | XMem (BL30K, MS) | Overall | 86.8 | # 2 | ||
Jaccard (Seen) | 85.5 | # 2 | |||||
Jaccard (Unseen) | 81.8 | # 4 | |||||
F-Measure (Seen) | 89.8 | # 5 | |||||
F-Measure (Unseen) | 89.9 | # 2 | |||||
Video Object Segmentation | YouTube-VOS 2019 | XMem | Mean Jaccard & F-Measure | 85.5 | # 2 | ||
Jaccard (Seen) | 84.3 | # 3 | |||||
Jaccard (Unseen) | 80.3 | # 2 | |||||
F-Measure (Seen) | 88.6 | # 3 | |||||
F-Measure (Unseen) | 88.6 | # 2 |
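For readers unfamiliar with the metrics in the table, the following minimal sketch shows how the Jaccard index (region similarity, J) is computed for a pair of binary masks and how the aggregate scores relate to their components. The function names are illustrative, not taken from any benchmark toolkit, and returning 1.0 when both masks are empty is an assumed convention. The boundary F-measure is more involved and is not reproduced here. Reported aggregates are presumably computed from unrounded values, so they can differ from the mean of the rounded entries by 0.1.

```python
def jaccard(pred, gt):
    """Intersection over union of two binary masks given as nested 0/1 lists."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_score(*values):
    """Arithmetic mean, used for J&F and for the YouTube-VOS overall score."""
    return sum(values) / len(values)


# DAVIS 2017 (val), XMem: J = 82.9, F = 89.5, reported J&F = 86.2
davis_jf = mean_score(82.9, 89.5)

# YouTube-VOS 2018, XMem: J seen/unseen and F seen/unseen, reported Overall = 85.7
ytvos_overall = mean_score(84.6, 80.2, 89.3, 88.7)
```

Both example rows from the table reproduce their reported aggregate to one decimal place.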