Videos depict how complex dynamical systems change over time in the form of discrete image sequences.
Given two object images, how can we explain their differences in terms of the underlying object properties?
To sidestep the main technical difficulty of the multi-object-multi-view scenario -- maintaining object correspondences across views -- MulMON iteratively updates the latent object representations of a scene over multiple views.
We train DyMON on multi-view dynamic-scene data and show that it learns -- without supervision -- to factorize the entangled effects of observer motion and scene object dynamics from a sequence of observations, and that it constructs spatial representations of scene objects suitable for rendering at arbitrary times (querying across time) and from arbitrary viewpoints (querying across space).
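The idea of factorized querying can be illustrated with a toy sketch. The code below is not DyMON's actual model or API; all names (`object_latents`, `render`, the linear dynamics, the rotation-based "renderer") are hypothetical stand-ins. It only demonstrates the structural point: when per-object state depends on time and the rendering function depends on viewpoint, the two factors can be queried independently at any (time, viewpoint) pair.

```python
import numpy as np

# Toy illustration (hypothetical, not DyMON's model): each object's latent
# state evolves with time independently of the viewpoint, so an output
# "frame" is render(object_latents(t), viewpoint). Disentangling the two
# factors is what enables querying across time and across space.

def object_latents(t):
    # Hypothetical dynamics: two objects moving with constant velocities.
    positions = np.array([[0.0, 0.0], [1.0, 1.0]])
    velocities = np.array([[0.1, 0.0], [0.0, -0.1]])
    return positions + t * velocities

def render(latents, viewpoint_angle):
    # Hypothetical "renderer": rotate object positions into the camera frame.
    c, s = np.cos(viewpoint_angle), np.sin(viewpoint_angle)
    rot = np.array([[c, -s], [s, c]])
    return latents @ rot.T

# Query across time at a fixed viewpoint...
frame_t5 = render(object_latents(5.0), viewpoint_angle=0.0)
# ...and across space (viewpoints) at a fixed time.
frame_v1 = render(object_latents(0.0), viewpoint_angle=np.pi / 2)
```

Because `object_latents` never sees the viewpoint and `render` never sees the time, any combination of the two queries is well defined, which is the property the paper refers to as querying across time and across space.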
Learning object-centric scene representations is crucial for scene structural understanding.