Many recent successful methods for video object segmentation (VOS) are overly complicated, rely heavily on fine-tuning on the first frame, and/or are slow, and are therefore of limited practical use.
Fourth, to shed light on the potential of self-supervised learning for the task of video correspondence flow, we probe the upper bound by training on additional data, i.e. more diverse videos, further demonstrating significant improvements on video segmentation.
Specifically, to integrate the insights of matching-based and propagation-based methods, we employ an encoder-decoder framework that learns pixel-level similarity and segmentation in an end-to-end manner.
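The matching component of such a framework can be illustrated with a minimal sketch: a first-frame mask is propagated to a target frame through a pixel-level affinity matrix between per-pixel features. The function name, shapes, and temperature value below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def propagate_labels(ref_feats, tgt_feats, ref_mask, temperature=0.07):
    """Propagate a reference-frame mask to a target frame.

    ref_feats, tgt_feats: (H*W, C) L2-normalized per-pixel features.
    ref_mask: (H*W, K) one-hot object labels for the reference frame.
    Returns: (H*W, K) soft labels for the target frame.
    """
    # Pixel-level similarity (affinity) between every target pixel
    # and every reference pixel.
    affinity = tgt_feats @ ref_feats.T / temperature  # (HW_tgt, HW_ref)
    # Softmax over reference pixels so each row is a distribution.
    affinity -= affinity.max(axis=1, keepdims=True)
    weights = np.exp(affinity)
    weights /= weights.sum(axis=1, keepdims=True)
    # Target labels are affinity-weighted copies of reference labels.
    return weights @ ref_mask
```

In an end-to-end system, the features would come from a learned encoder and the soft labels would be refined by a decoder; here the propagation step alone is shown.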
This paper conducts a systematic study on the role of visual attention in Unsupervised Video Object Segmentation (UVOS) tasks.
We validate our method on four benchmark datasets that cover single- and multiple-object segmentation.
Video object segmentation aims to segment a specific object throughout a video sequence, given only an annotated first frame.