We propose a novel framework for object-centric video prediction, i.e., extracting the compositional structure of a video sequence and modeling object dynamics and interactions from visual observations in order to predict future object states, from which subsequent video frames can then be generated.
In our experiments, we demonstrate that MSPred accurately predicts future video frames as well as high-level representations (e.g., keypoints or semantics) on bin-picking and action-recognition datasets, while consistently outperforming popular approaches to future frame prediction.
The ability to decompose scenes into their object components is a desired property for autonomous agents, allowing them to reason and act in their surroundings.
Recent advances in deep learning have led to significant improvements in single image super-resolution (SR) research.
(2) To further improve the already strong results, we created a small dataset (ClassArch) consisting of ancient Greek vase paintings from the 6th-5th century BCE with person and pose annotations.
This method is inspired by the observation that, in the scattering-transform domain, the subspaces spanned by the eigenvectors corresponding to the few largest eigenvalues of the data matrices of individual classes are nearly shared across different classes.
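The observation above can be checked numerically: take the eigenvectors of each class's data (covariance) matrix for the largest eigenvalues and measure the principal angles between the resulting subspaces. A minimal sketch follows, assuming synthetic features as a stand-in for actual scattering-transform coefficients (the generator `class_features` and the dimensions are illustrative assumptions, not part of the original method):

```python
import numpy as np

rng = np.random.default_rng(0)

def class_features(shared, n_samples=200, dim=64, noise=0.1):
    # Hypothetical stand-in for scattering-transform features of one class:
    # samples concentrated near a common low-dimensional subspace.
    coeffs = rng.normal(size=(n_samples, shared.shape[1]))
    return coeffs @ shared.T + noise * rng.normal(size=(n_samples, dim))

def top_eigenvectors(X, k):
    # Eigenvectors of the class covariance matrix for the k largest eigenvalues.
    cov = X.T @ X / len(X)
    _, vecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
    return vecs[:, -k:]

def principal_angles(U, V):
    # Cosines of principal angles between two subspaces with orthonormal
    # bases U, V are the singular values of U^T V.
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, -1.0, 1.0)))

dim, k = 64, 4
shared = np.linalg.qr(rng.normal(size=(dim, k)))[0]   # common subspace
U_a = top_eigenvectors(class_features(shared), k)      # "class A"
U_b = top_eigenvectors(class_features(shared), k)      # "class B"
angles = principal_angles(U_a, U_b)
print(angles)  # small angles: the leading subspaces are nearly shared
```

If the subspaces were unrelated, the principal angles between random 4-dimensional subspaces of a 64-dimensional space would be close to 90 degrees; small angles indicate the near-shared structure the method exploits.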