RGB-D Salient Object Detection
51 papers with code • 8 benchmarks • 5 datasets
RGB-D salient object detection (SOD) aims to distinguish the most visually distinctive objects or regions in a scene from given RGB and depth data. It has a wide range of applications, including video/image segmentation, object recognition, visual tracking, foreground map evaluation, image retrieval, content-aware image editing, information discovery, photo synthesis, and weakly supervised semantic segmentation. Here, depth information plays an important complementary role in finding salient objects. Online benchmark: http://dpfan.net/d3netbenchmark.
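As a rough illustration of the task setup, the sketch below (PyTorch; all module names, layer choices, and sizes are illustrative assumptions, not any benchmarked model) encodes the RGB image and the aligned depth map separately, fuses the features, and predicts a single-channel saliency map.

```python
import torch
import torch.nn as nn

class SimpleRGBDSaliency(nn.Module):
    """Minimal RGB-D SOD sketch: one encoder per modality, features fused
    by concatenation, then decoded to a one-channel saliency map."""
    def __init__(self, feat=32):
        super().__init__()
        self.rgb_enc = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(inplace=True))
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, feat, 3, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, 1, 1))

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1)
        return torch.sigmoid(self.decoder(fused))

rgb = torch.rand(1, 3, 224, 224)     # RGB image
depth = torch.rand(1, 1, 224, 224)   # aligned depth map
saliency = SimpleRGBDSaliency()(rgb, depth)   # (1, 1, 224, 224) saliency map
```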
Our framework includes two main models: 1) a generator model, which maps the input image and a latent variable to a stochastic saliency prediction, and 2) an inference model, which gradually updates the latent variable by sampling it from the true or approximate posterior distribution.
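A minimal sketch of such a latent-variable setup, assuming a simple convolutional generator and an inference network that outputs a Gaussian approximate posterior (all names and layer choices are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps an input image plus a latent code to a stochastic saliency map."""
    def __init__(self, latent_dim=8, feat=32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3 + latent_dim, feat, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(feat, 1, 1))

    def forward(self, image, z):
        # broadcast the latent vector spatially and concatenate with the image
        z_map = z[:, :, None, None].expand(-1, -1, *image.shape[2:])
        return torch.sigmoid(self.net(torch.cat([image, z_map], dim=1)))

class InferenceNet(nn.Module):
    """Approximate posterior q(z | image, saliency): predicts mean and log-variance."""
    def __init__(self, latent_dim=8, feat=32):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(4, feat, 3, stride=2, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu = nn.Linear(feat, latent_dim)
        self.logvar = nn.Linear(feat, latent_dim)

    def forward(self, image, saliency):
        h = self.body(torch.cat([image, saliency], dim=1))
        return self.mu(h), self.logvar(h)

# A reparameterised sample from the approximate posterior drives the generator.
image, gt = torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64)
mu, logvar = InferenceNet()(image, gt)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
pred = Generator()(image, z)
```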
The wide availability of depth sensors provides valuable complementary information for salient object detection (SOD) in RGB-D images.
The use of RGB-D information for salient object detection has been extensively explored in recent years.
In particular, we first propose to regroup the multi-level features into teacher and student features using a bifurcated backbone strategy (BBS).
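A toy sketch of this regrouping idea, assuming all backbone levels have already been projected to a common channel width; `bifurcate_features` and `aggregate` are hypothetical helpers, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def bifurcate_features(multi_level_feats, split=2):
    """Shallow levels form the 'student' group, deep levels the 'teacher' group."""
    student = multi_level_feats[:split]   # e.g. conv1-conv2 features
    teacher = multi_level_feats[split:]   # e.g. conv3-conv5 features
    return student, teacher

def aggregate(feats):
    """Resize each level in a group to the group's largest resolution and sum."""
    target = feats[0].shape[2:]
    resized = [F.interpolate(f, size=target, mode='bilinear', align_corners=False)
               for f in feats]
    return torch.stack(resized, dim=0).sum(dim=0)

# toy multi-level features, all projected to 32 channels beforehand (assumption)
feats = [torch.rand(1, 32, s, s) for s in (88, 44, 22, 11, 11)]
student, teacher = bifurcate_features(feats)
student_feat, teacher_feat = aggregate(student), aggregate(teacher)
```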
In this paper, we propose a novel Cross-Modal Weighting (CMW) strategy to encourage comprehensive interactions between RGB and depth channels for RGB-D SOD.
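The gist of cross-modal weighting can be sketched as each modality producing a gating map that modulates the other; the module below is a minimal illustration under that assumption, not the paper's exact CMW design:

```python
import torch
import torch.nn as nn

class CrossModalWeighting(nn.Module):
    """Each modality predicts a spatial weight map that gates the other
    modality's features; the two weighted streams are then combined."""
    def __init__(self, channels):
        super().__init__()
        self.rgb_gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, rgb_feat, depth_feat):
        rgb_weighted = rgb_feat * self.depth_gate(depth_feat)    # depth -> RGB
        depth_weighted = depth_feat * self.rgb_gate(rgb_feat)    # RGB -> depth
        return rgb_weighted + depth_weighted

fused = CrossModalWeighting(32)(torch.rand(1, 32, 56, 56), torch.rand(1, 32, 56, 56))
```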
The explicitly extracted edge information is combined with the saliency features to place more emphasis on salient regions and object boundaries.
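One way such an edge branch can be wired in, shown as a hedged sketch (module and head names are assumptions, not the paper's components):

```python
import torch
import torch.nn as nn

class EdgeGuidedHead(nn.Module):
    """An edge branch predicts a boundary map; the saliency head is conditioned
    on it by concatenation so boundaries receive extra emphasis."""
    def __init__(self, channels=32):
        super().__init__()
        self.edge_head = nn.Conv2d(channels, 1, 1)       # predicts an edge map
        self.sal_head = nn.Conv2d(channels + 1, 1, 1)    # saliency conditioned on edges

    def forward(self, feat):
        edge = torch.sigmoid(self.edge_head(feat))
        sal = torch.sigmoid(self.sal_head(torch.cat([feat, edge], dim=1)))
        return sal, edge

sal, edge = EdgeGuidedHead()(torch.rand(1, 32, 56, 56))
```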
Inspired by the observation that RGB and depth modalities actually present certain commonality in distinguishing salient objects, a novel joint learning and densely cooperative fusion (JL-DCF) architecture is designed to learn from both RGB and depth inputs through a shared network backbone, known as the Siamese architecture.
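The shared-backbone (Siamese) idea can be sketched by replicating the one-channel depth map to three channels and batching it with the RGB image through a single backbone; the toy backbone and the additive fusion below are assumptions, not JL-DCF's actual components:

```python
import torch
import torch.nn as nn

# One backbone, shared by construction: both modalities pass through the same weights.
backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))

rgb = torch.rand(1, 3, 224, 224)
depth = torch.rand(1, 1, 224, 224).repeat(1, 3, 1, 1)   # 1-channel depth -> 3 channels

# Batch the two modalities so a single forward pass shares all weights.
feats = backbone(torch.cat([rgb, depth], dim=0))
rgb_feat, depth_feat = feats[0:1], feats[1:2]
fused = rgb_feat + depth_feat   # stand-in for the paper's densely cooperative fusion
```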
In this paper, we answer this question from two perspectives: (1) We argue that if the complementary part can be modelled more explicitly, the cross-modal complement is likely to be better captured.