Putting the Object Back into Video Object Segmentation
We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Via those, it interacts with the bottom-up pixel features iteratively with a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: https://hkchengrex.github.io/Cutie
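To make the foreground-background masked attention mentioned in the abstract more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes (the abstract does not say this) that the first half of the object queries may attend only to pixels currently marked as foreground and the second half only to background pixels. The function name `fg_bg_masked_attention`, the tensor shapes, and the fallback for queries with no allowed pixel are all illustrative choices.

```python
import numpy as np

def fg_bg_masked_attention(queries, pixels, fg_mask):
    """Cross-attention from object queries to pixel features with a fg/bg mask.

    queries: (N, C) object queries (hypothetical layout: first half foreground,
             second half background).
    pixels:  (P, C) flattened pixel features.
    fg_mask: (P,) boolean, True where a pixel is currently marked as foreground.
    Returns the attended features (N, C) and the attention weights (N, P).
    """
    n, c = queries.shape
    half = n // 2

    # allowed[i, j] is True when query i may attend to pixel j.
    allowed = np.empty((n, pixels.shape[0]), dtype=bool)
    allowed[:half] = fg_mask       # foreground queries see foreground pixels only
    allowed[half:] = ~fg_mask      # background queries see background pixels only

    raw = queries @ pixels.T / np.sqrt(c)      # scaled dot-product logits
    logits = np.where(allowed, raw, -np.inf)   # block disallowed pixels

    # Assumption: a query with no allowed pixel falls back to unmasked attention,
    # which avoids NaNs in the softmax below.
    empty = ~allowed.any(axis=1)
    logits[empty] = raw[empty]

    # Numerically stable softmax over pixels.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ pixels, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))    # 4 object queries
x = rng.normal(size=(6, 8))    # 6 pixels
fg = np.array([True, True, False, False, False, True])
out, w = fg_bg_masked_attention(q, x, fg)
```

In this sketch, the first two queries place zero weight on background pixels and the last two place zero weight on foreground pixels, so each half summarizes only one side of the mask.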
CVPR 2024
Results from the Paper
Ranked #1 on Video Object Segmentation on MOSE (using extra training data)
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semi-Supervised Video Object Segmentation | BURST-test | Cutie (base, with mose, 600 pixels) | HOTA (all) | 62.6 | #2 |
| | | | HOTA (common) | 63.8 | #2 |
| | | | HOTA (uncommon) | 62.3 | #2 |
| Semi-Supervised Video Object Segmentation | BURST-test | Cutie (base, MEGA, 600 pixels) | HOTA (all) | 66.0 | #1 |
| | | | HOTA (common) | 66.5 | #1 |
| | | | HOTA (uncommon) | 65.9 | #1 |
| Semi-Supervised Video Object Segmentation | BURST-val | Cutie (base, with mose, 600 pixels) | HOTA (all) | 58.4 | #2 |
| | | | HOTA (common) | 61.8 | #2 |
| | | | HOTA (uncommon) | 57.5 | #2 |
| Semi-Supervised Video Object Segmentation | BURST-val | Cutie (base, MEGA, 600 pixels) | HOTA (all) | 61.2 | #1 |
| | | | HOTA (common) | 65.0 | #1 |
| | | | HOTA (uncommon) | 60.3 | #1 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Cutie+ (base, MEGA) | J&F | 88.1 | #1 |
| | | | Jaccard (Mean) | 84.7 | #1 |
| | | | F-measure (Mean) | 91.4 | #1 |
| | | | FPS | 17.9 | #14 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Cutie+ (base) | J&F | 85.9 | #3 |
| | | | Jaccard (Mean) | 82.6 | #2 |
| | | | F-measure (Mean) | 89.2 | #3 |
| | | | FPS | 17.9 | #14 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Cutie (base, MEGA) | J&F | 86.1 | #2 |
| | | | Jaccard (Mean) | 82.4 | #3 |
| | | | F-measure (Mean) | 89.9 | #2 |
| | | | FPS | 36.4 | #6 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Cutie+ (base) | Jaccard (Mean) | 87.5 | #1 |
| | | | F-measure (Mean) | 93.4 | #1 |
| | | | J&F | 90.5 | #1 |
| | | | Params(M) | 17.9 | #15 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Cutie+ (base, MEGA) | Jaccard (Mean) | 85.5 | #5 |
| | | | F-measure (Mean) | 90.8 | #9 |
| | | | J&F | 88.1 | #7 |
| | | | Speed (FPS) | 17.9 | #23 |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Cutie (base) | Jaccard (Mean) | 84.6 | #7 |
| | | | F-measure (Mean) | 91.1 | #6 |
| | | | J&F | 87.9 | #8 |
| | | | Params(M) | 36.4 | #17 |
| Semi-Supervised Video Object Segmentation | MOSE | Cutie (small) | J&F | 62.2 | #9 |
| | | | J | 58.2 | #9 |
| | | | F | 66.2 | #9 |
| | | | FPS | 45.5 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | Cutie+ (base, MEGA) | J&F | 71.7 | #1 |
| | | | J | 67.6 | #1 |
| | | | F | 75.8 | #1 |
| | | | FPS | 17.9 | #10 |
| Semi-Supervised Video Object Segmentation | MOSE | Cutie (small, MEGA) | J&F | 68.6 | #4 |
| | | | J | 64.3 | #4 |
| | | | F | 72.9 | #4 |
| | | | FPS | 45.5 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | Cutie (base, MEGA) | J&F | 69.9 | #3 |
| | | | J | 65.8 | #3 |
| | | | F | 74.1 | #3 |
| | | | FPS | 36.4 | #4 |
| Semi-Supervised Video Object Segmentation | MOSE | Cutie+ (small, MEGA) | J&F | 70.3 | #2 |
| | | | J | 66.0 | #2 |
| | | | F | 74.5 | #2 |
| | | | FPS | 20.6 | #9 |
| Semi-Supervised Video Object Segmentation | MOSE | Cutie (base) | J&F | 64.0 | #8 |
| | | | J | 60.0 | #8 |
| | | | F | 67.9 | #8 |
| | | | FPS | 36.4 | #4 |
| Video Object Segmentation | MOSE | Cutie | J&F | 68.3 | #1 |
| Semi-Supervised Video Object Segmentation | MOSE | Cutie (base, with mose) | J&F | 68.3 | #5 |
| | | | J | 64.2 | #5 |
| | | | F | 72.3 | #5 |
| | | | FPS | 36.4 | #4 |
| Semi-Supervised Video Object Segmentation | MOSE | Cutie (small, with mose) | J&F | 67.4 | #6 |
| | | | J | 63.1 | #6 |
| | | | F | 71.7 | #6 |
| | | | FPS | 45.5 | #1 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Cutie+ (base, MEGA) | F-Measure (Seen) | 91.0 | #1 |
| | | | F-Measure (Unseen) | 90.1 | #2 |
| | | | Overall | 87.5 | #1 |
| | | | Jaccard (Seen) | 86.6 | #1 |
| | | | Jaccard (Unseen) | 82.2 | #1 |
| | | | Speed (FPS) | 17.9 | #9 |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | Cutie+ (base, MEGA) | Overall | 87.5 | #1 |
| | | | Jaccard (Seen) | 86.3 | #1 |
| | | | Jaccard (Unseen) | 82.7 | #3 |
| | | | F-Measure (Seen) | 90.6 | #1 |
| | | | F-Measure (Unseen) | 90.5 | #1 |
| | | | J&F | 17.9 | #3 |
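
For readers unfamiliar with the metrics in the results above: under the usual DAVIS-style convention, J is the Jaccard index (intersection over union) between predicted and ground-truth masks, F is a boundary accuracy score, and J&F is their mean. The sketch below illustrates J and the averaging step only. The helper names `jaccard` and `j_and_f` are hypothetical, the empty-mask convention is an assumption, and the boundary F-measure is not implemented because it needs contour matching.

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j_mean, f_mean):
    """J&F: the arithmetic mean of the J and F scores."""
    return (j_mean + f_mean) / 2.0


# Toy masks: one overlapping pixel out of two in the union.
toy_j = jaccard([[1, 1], [0, 0]], [[1, 0], [0, 0]])

# Sanity check against a reported row: MOSE, Cutie (small), J = 58.2, F = 66.2.
mose_small = j_and_f(58.2, 66.2)   # about 62.2, matching the listed J&F
```

The same averaging gives about 71.7 for the MOSE Cutie+ (base, MEGA) row (J = 67.6, F = 75.8). Small differences on other rows are expected, since the listed J and F are already rounded.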