Many recent successful methods for video object segmentation (VOS) are overly complicated, rely heavily on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use.
Ranked #1 on Semi-Supervised Video Object Segmentation on YouTube
Compared with the non-local block, the proposed recurrent criss-cross attention module uses 11x less GPU memory.
Ranked #24 on Semantic Segmentation on Cityscapes test
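The memory claim for criss-cross attention can be made concrete with a rough count of attention-map entries. The sketch below assumes that a non-local block stores one affinity per pair of positions, while a single criss-cross pass stores affinities only for positions in the same row or column. The function names and the 97 x 97 feature-map size are hypothetical, and the count covers attention maps only, so it is not expected to reproduce the reported 11x figure, which reflects total GPU memory.

```python
def affinity_entries_non_local(h: int, w: int) -> int:
    """Entries in a dense attention map: every position attends to every position."""
    n = h * w
    return n * n


def affinity_entries_criss_cross(h: int, w: int) -> int:
    """Entries for one criss-cross pass: each position attends only to its
    own row and column, i.e. h + w - 1 positions (itself counted once)."""
    return h * w * (h + w - 1)


if __name__ == "__main__":
    # Hypothetical feature-map size, chosen only for illustration.
    h, w = 97, 97
    dense = affinity_entries_non_local(h, w)
    sparse = affinity_entries_criss_cross(h, w)
    print(f"non-local entries:   {dense}")
    print(f"criss-cross entries: {sparse}")
    print(f"ratio (attention maps only): {dense / sparse:.1f}")
```

Because the module is recurrent, the criss-cross count applies per pass; activations outside the attention map are not modeled here.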
This effectively limits the performance and generalization capabilities of existing video segmentation methods.
This allows us to achieve a rich internal representation of the target in the current frame, significantly increasing the segmentation accuracy of our approach.
Yet it is non-trivial to transfer state-of-the-art image recognition networks to videos, as per-frame evaluation is too slow and unaffordable.
Ranked #8 on Video Semantic Segmentation on Cityscapes val