Deep Transport Network for Unsupervised Video Object Segmentation

Popular unsupervised video object segmentation methods fuse the RGB frame and optical flow via a two-stream network. However, they cannot handle the distracting noise in either input modality, which may severely degrade model performance. We propose to establish correspondence between the input modalities, while suppressing distracting signals, via optimal structural matching. Given a video frame, we extract dense local features from the RGB image and the optical flow, and treat them as two complex structured representations. The Wasserstein distance is then employed to compute the globally optimal flows that transport the features of one modality to the other, where the magnitude of each flow measures the degree of alignment between two local features. To plug structural matching into a two-stream network for end-to-end training, we factorize the input cost matrix into small spatial blocks and design a differentiable long-short Sinkhorn module consisting of a long-distance Sinkhorn layer and a short-distance Sinkhorn layer (the core transport step is sketched below). We integrate the module into a dedicated two-stream network and dub our model TransportNet. Our experiments show that aligning motion and appearance yields state-of-the-art results on popular video object segmentation datasets.
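To make the matching step concrete, the sketch below computes an entropic-regularized optimal transport plan between dense flow and RGB features with log-domain Sinkhorn iterations, then uses the plan to transport RGB features onto the flow stream. This is a minimal PyTorch sketch under stated assumptions, not the authors' released code: the function names (`sinkhorn`, `transport_features`), the cosine cost, and the hyper-parameters `eps` and `n_iters` are illustrative choices, and the block factorization of the long-short Sinkhorn module is omitted.

```python
# Minimal sketch (not the authors' code) of differentiable Sinkhorn-based
# feature transport between the flow and RGB streams. All names and
# hyper-parameters here are illustrative assumptions.
import math
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.05, n_iters=50):
    """Log-domain Sinkhorn iterations for entropic-regularized optimal transport.

    cost: (B, N, M) pairwise cost between N flow features and M RGB features.
    Returns a (B, N, M) transport plan with uniform marginals; large entries
    mark well-aligned feature pairs, small entries suppress noisy matches.
    """
    B, N, M = cost.shape
    log_mu = cost.new_full((B, N), -math.log(N))   # uniform source marginal (log)
    log_nu = cost.new_full((B, M), -math.log(M))   # uniform target marginal (log)
    log_K = -cost / eps                            # Gibbs kernel in log space
    u = torch.zeros_like(log_mu)
    v = torch.zeros_like(log_nu)
    for _ in range(n_iters):                       # alternating dual updates
        u = log_mu - torch.logsumexp(log_K + v.unsqueeze(1), dim=2)
        v = log_nu - torch.logsumexp(log_K + u.unsqueeze(2), dim=1)
    return torch.exp(log_K + u.unsqueeze(2) + v.unsqueeze(1))

def transport_features(rgb_feat, flow_feat, eps=0.05):
    """Transport RGB features onto flow-feature positions via the OT plan.

    rgb_feat, flow_feat: (B, C, H, W) dense local features from the two streams.
    """
    B, C, H, W = rgb_feat.shape
    rgb = F.normalize(rgb_feat.flatten(2).transpose(1, 2), dim=-1)    # (B, HW, C)
    flow = F.normalize(flow_feat.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    cost = 1.0 - torch.bmm(flow, rgb.transpose(1, 2))                 # cosine cost, (B, HW, HW)
    plan = sinkhorn(cost, eps=eps)                                    # (B, HW, HW)
    aligned = (H * W) * torch.bmm(plan, rgb)   # barycentric projection; rows of plan sum to 1/HW
    return aligned.transpose(1, 2).reshape(B, C, H, W)
```

Note that running Sinkhorn on the full HW x HW cost matrix is expensive, which is presumably what motivates the paper's factorization into small spatial blocks; the sketch above leaves out that long-short decomposition and shows only the core matching step.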

Task                                      Dataset           Model          Metric   Value   Global Rank
Unsupervised Video Object Segmentation   DAVIS 2016 val    TransportNet   G        84.8    #11
Unsupervised Video Object Segmentation   DAVIS 2016 val    TransportNet   J        84.5    #10
Unsupervised Video Object Segmentation   DAVIS 2016 val    TransportNet   F        85.0    #10
Unsupervised Video Object Segmentation   FBMS test         TransportNet   J        78.7    #4
