The core idea of our work is to leverage the Expectation-Maximization (EM) framework to design, in a well-founded manner, the loss function and training procedure of our motion segmentation neural network, so that it requires neither ground truth nor manual annotation.
We bypass the need for a tailored loss function on the regression parameters by attaching to our model a differentiable, hard-wired decoder that implements the polynomial operation at hand.
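To make this concrete, below is a minimal PyTorch sketch of such an EM-inspired loss with a hard-wired polynomial decoder. It assumes the network outputs soft segment masks for an input optical flow field and that each segment's motion follows a quadratic polynomial model (6 basis terms per flow component); the helper names (`polynomial_basis`, `em_loss`), the closed-form weighted least-squares fit, and the L1 residual are our illustrative choices, not the paper's exact implementation.

```python
import torch

def polynomial_basis(h, w):
    # Quadratic polynomial basis [1, x, y, x^2, xy, y^2] at each pixel
    # of a normalized grid (hypothetical helper).
    y, x = torch.meshgrid(torch.linspace(-1, 1, h),
                          torch.linspace(-1, 1, w), indexing="ij")
    return torch.stack([torch.ones_like(x), x, y, x * x, x * y, y * y], dim=-1)

def em_loss(flow, masks):
    """EM-inspired loss sketch. `masks` (b, k, h, w) are the network's soft
    segment probabilities; `flow` (b, 2, h, w) is the input optical flow.
    For each segment we fit the polynomial motion parameters in closed form
    (M-step, mask-weighted least squares), reconstruct the flow through the
    hard-wired polynomial decoder, and penalize the mask-weighted residual."""
    b, _, h, w = flow.shape
    basis = polynomial_basis(h, w).reshape(-1, 6).to(flow)   # (n, 6), n = h*w
    f = flow.permute(0, 2, 3, 1).reshape(b, -1, 2)           # (b, n, 2)
    loss = 0.0
    for k in range(masks.shape[1]):
        wgt = masks[:, k].reshape(b, -1, 1)                  # (b, n, 1)
        bw = basis.unsqueeze(0) * wgt                        # (b, n, 6)
        gram = bw.transpose(1, 2) @ basis                    # (b, 6, 6)
        rhs = bw.transpose(1, 2) @ f                         # (b, 6, 2)
        # Small ridge term keeps the system well conditioned.
        theta = torch.linalg.solve(gram + 1e-4 * torch.eye(6).to(flow), rhs)
        recon = basis @ theta                                # (b, n, 2): decoded flow
        loss = loss + (wgt * (f - recon).abs()).sum(dim=(1, 2)).mean()
    return loss / (h * w)
```

Because the fit is differentiable in the masks, gradients flow back to the segmentation network alone; no ground-truth segmentation ever enters the loss.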
First, inspired by selective search for object proposals, we introduce an unsupervised approach to generate action proposals, which we call Tubelets, from spatiotemporal super-voxels.
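The following sketch illustrates the flavor of such a grouping, in the spirit of selective search: super-voxels are greedily merged by descriptor similarity, and each merged region yields a sequence of per-frame bounding boxes. The greedy pairwise merging, the additive descriptors, and the helper names are our assumptions for illustration, not the exact Tubelets algorithm.

```python
import numpy as np

def tubelets_from_supervoxels(labels, features, similarity, n_merges):
    """`labels` is a (t, h, w) super-voxel label volume; `features` maps a
    label to a descriptor (e.g., a color/motion histogram); `similarity`
    compares two descriptors. Each step merges the most similar region pair
    and records the fused region's per-frame boxes as a new tubelet.
    n_merges must stay below the number of initial super-voxels."""
    regions = {l: features[l] for l in np.unique(labels)}
    proposals = []
    for _ in range(n_merges):
        a, b = max(((a, b) for a in regions for b in regions if a < b),
                   key=lambda p: similarity(regions[p[0]], regions[p[1]]))
        labels = np.where(labels == b, a, labels)    # fuse region b into a
        regions[a] = regions[a] + regions.pop(b)     # e.g., histogram addition
        proposals.append(per_frame_boxes(labels, a))
    return proposals

def per_frame_boxes(labels, region_id):
    # One bounding box per frame enclosing the region: the tubelet.
    boxes = []
    for frame in labels:
        ys, xs = np.where(frame == region_id)
        boxes.append(None if len(xs) == 0
                     else (xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes
```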
With this in mind, we propose a novel approach to occlusion detection, in which the visibility of a point in the next frame is formulated in terms of visual reconstruction.
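A minimal sketch of this reconstruction principle, under our own simplifying assumptions (grayscale float images in [0, 1], bilinear warping, a fixed threshold): a pixel of frame t is declared occluded when warping frame t+1 back along the forward flow fails to reconstruct its appearance. The threshold test stands in for whatever criterion the actual formulation uses.

```python
import numpy as np

def occlusion_map(img_t, img_t1, flow, thresh=0.1):
    """img_t, img_t1: (h, w) grayscale floats; flow: (h, w, 2) forward flow.
    Returns a boolean map, True where the point of frame t cannot be
    visually reconstructed from frame t+1 (i.e., is presumed occluded)."""
    h, w = img_t.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    xt, yt = xs + flow[..., 0], ys + flow[..., 1]   # landing position in t+1
    xc, yc = np.clip(xt, 0, w - 2), np.clip(yt, 0, h - 2)
    x0, y0 = xc.astype(int), yc.astype(int)
    ax, ay = xc - x0, yc - y0                       # bilinear weights
    recon = ((1 - ax) * (1 - ay) * img_t1[y0, x0]
             + ax * (1 - ay) * img_t1[y0, x0 + 1]
             + (1 - ax) * ay * img_t1[y0 + 1, x0]
             + ax * ay * img_t1[y0 + 1, x0 + 1])
    return np.abs(recon - img_t) > thresh           # True: reconstruction fails
```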
The idea is to supply local motion candidates at every pixel in a first step, and then to combine them to determine the global optical flow field in a second step.
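As a rough sketch of this two-step scheme, suppose each pixel comes with m local motion candidates and their matching costs; the combination step below is a simple iterated local selection trading off the data cost against agreement with the current flow of the 4-neighborhood, our simplification of the global combination (wrap-around at image borders is ignored for brevity).

```python
import numpy as np

def combine_candidates(data_cost, candidates, n_iters=5, lam=0.5):
    """candidates: (h, w, m, 2) local motion candidates per pixel;
    data_cost: (h, w, m) matching costs. Returns an (h, w, 2) flow field
    chosen per pixel to balance data cost and neighborhood smoothness."""
    h, w, m, _ = candidates.shape
    rows, cols = np.arange(h)[:, None], np.arange(w)
    flow = candidates[rows, cols, data_cost.argmin(-1)]   # init: best local
    for _ in range(n_iters):
        # Average flow of the 4-neighborhood (simple smoothness reference).
        nb = sum(np.roll(flow, s, axis=a)
                 for a, s in ((0, 1), (0, -1), (1, 1), (1, -1))) / 4
        smooth = np.linalg.norm(candidates - nb[:, :, None, :], axis=-1)
        flow = candidates[rows, cols, (data_cost + lam * smooth).argmin(-1)]
    return flow
```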
Our approach significantly outperforms the state of the art on both datasets, while restricting the search for actions to a fraction of the possible bounding-box sequences.
Several recent works on action recognition have attested to the importance of explicitly integrating motion characteristics into the video description.