This paper proposes Multimodal RNNs, a new method for RGB-D scene semantic segmentation. The model is trained to classify image pixels given two input sources: RGB color channels and depth maps.
It simultaneously trains two recurrent neural networks (RNNs) that are cross-connected through information transfer layers, which are learned to adaptively extract relevant cross-modality features. Each RNN learns its representation from its own previous hidden states and from patterns transferred from the other RNN's previous hidden states; thus, both model-specific and cross-modality features are retained.
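To make the cross-connected update concrete, the following is a minimal sketch of one recurrent step per modality, assuming a plain tanh cell and a single linear transfer layer per direction; all parameter names (W, U, T, b) are illustrative and not taken from the paper.

```python
import numpy as np

def step(x, h_own, h_other, W, U, T, b):
    # One recurrent step: the modality's own recurrence plus a learned
    # transfer term from the other modality's previous hidden state.
    return np.tanh(W @ x + U @ h_own + T @ h_other + b)

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
# Hypothetical parameters for the RGB and depth RNNs and their transfer layers.
W_rgb, U_rgb, T_rgb = [rng.standard_normal(s) * 0.1
                       for s in [(d_h, d_in), (d_h, d_h), (d_h, d_h)]]
W_d, U_d, T_d = [rng.standard_normal(s) * 0.1
                 for s in [(d_h, d_in), (d_h, d_h), (d_h, d_h)]]
b = np.zeros(d_h)

h_rgb = np.zeros(d_h)
h_d = np.zeros(d_h)
for t in range(5):  # toy sequence of paired RGB/depth feature vectors
    x_rgb = rng.standard_normal(d_in)
    x_d = rng.standard_normal(d_in)
    # Both states are updated in lockstep from the *previous* hidden states,
    # so each modality keeps its own recurrence while receiving the other's.
    h_rgb, h_d = (step(x_rgb, h_rgb, h_d, W_rgb, U_rgb, T_rgb, b),
                  step(x_d, h_d, h_rgb, W_d, U_d, T_d, b))
```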
We exploit the structure of quad-directional 2D-RNNs to model both short- and long-range contextual information in the 2D input image.
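As a rough illustration of the quad-directional idea, the sketch below sweeps a feature map from each of the four corners so that every pixel's hidden state aggregates context from the whole image; the plain tanh cell and the parameter names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def sweep_2d(x, Wx, Wh, Wv, b, flip_h=False, flip_v=False):
    # One directional 2D-RNN sweep over an H x W x C feature map. Each
    # hidden state depends on the previously visited row/column neighbors
    # for the given scan direction.
    H, W, d = x.shape[0], x.shape[1], Wh.shape[0]
    h = np.zeros((H, W, d))
    di = 1 if flip_v else -1  # offset of the previously visited row
    dj = 1 if flip_h else -1  # offset of the previously visited column
    rows = range(H - 1, -1, -1) if flip_v else range(H)
    cols = range(W - 1, -1, -1) if flip_h else range(W)
    for i in rows:
        for j in cols:
            h_v = h[i + di, j] if 0 <= i + di < H else np.zeros(d)
            h_h = h[i, j + dj] if 0 <= j + dj < W else np.zeros(d)
            h[i, j] = np.tanh(Wx @ x[i, j] + Wh @ h_h + Wv @ h_v + b)
    return h

def quad_directional(x, Wx, Wh, Wv, b):
    # Concatenate the four corner-to-corner sweeps per pixel.
    sweeps = [sweep_2d(x, Wx, Wh, Wv, b, flip_h=fh, flip_v=fv)
              for fv in (False, True) for fh in (False, True)]
    return np.concatenate(sweeps, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 4))      # toy 6x6 feature map
Wx = rng.standard_normal((8, 4)) * 0.1  # hypothetical shared parameters
Wh = rng.standard_normal((8, 8)) * 0.1
Wv = rng.standard_normal((8, 8)) * 0.1
ctx = quad_directional(x, Wx, Wh, Wv, np.zeros(8))  # shape (6, 6, 32)
```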
We carefully design a set of baselines to systematically examine the proposed model structure. We evaluate our Multimodal RNNs on popular RGB-D benchmarks and show that the method significantly outperforms previous approaches and achieves results competitive with other state-of-the-art work.