CDNet: A cascaded decoupling architecture for video prediction

29 Sep 2021  ·  Chuanqi Zang, Mingtao Pei ·

Video prediction is an essential task in computer vision, supporting many downstream tasks by modeling and predicting future motion dynamics and appearance. In deterministic video prediction, current methods mainly employ variants of stacked Recurrent Neural Networks (RNNs) to capture spatiotemporal coherence, overlooking the conflict between modeling long-term motion dynamics and generating legible appearance. In this work, we propose a Cascaded Decoupling Network (CDNet) that addresses video prediction through two modules: a motion LSTM that captures the motion trend and its variation along a temporal highway without considering appearance details, and a refine LSTM that iteratively recovers detailed appearance from the predicted motion dynamics and historical appearance. This cascaded structure provides a preliminary solution to the above conflict. We verify our model on two challenging real-world video prediction datasets and achieve state-of-the-art performance.
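The paper itself releases no code, but the cascade described above can be sketched in miniature: a motion module propagates dynamics along a temporal highway without seeing appearance detail, and a refine module then recovers the frame from the predicted motion state plus the last observed appearance. The sketch below is a minimal, hypothetical illustration using simple tanh cells as stand-ins for the motion and refine LSTMs; all class and parameter names (`SimpleCell`, `CDNetSketch`, `hid_dim`, etc.) are assumptions, not the authors' implementation.

```python
import numpy as np

class SimpleCell:
    """Stand-in for an LSTM cell: one linear map plus tanh (hypothetical)."""
    def __init__(self, in_dim, hid_dim, rng):
        self.W = rng.standard_normal((hid_dim, in_dim + hid_dim)) * 0.1

    def step(self, x, h):
        # Combine input and previous hidden state, apply the recurrent map.
        return np.tanh(self.W @ np.concatenate([x, h]))

class CDNetSketch:
    """Hypothetical sketch of the cascaded decoupling idea:
    motion cell models dynamics on a temporal highway;
    refine cell recovers appearance from motion state + last frame."""
    def __init__(self, frame_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.motion = SimpleCell(frame_dim, hid_dim, rng)
        self.refine = SimpleCell(hid_dim + frame_dim, hid_dim, rng)
        self.readout = rng.standard_normal((frame_dim, hid_dim)) * 0.1
        self.hid_dim = hid_dim

    def predict(self, frames, horizon):
        h_m = np.zeros(self.hid_dim)  # motion state (temporal highway)
        h_r = np.zeros(self.hid_dim)  # refinement state
        # Warm up both states on the observed frames.
        for f in frames:
            h_m = self.motion.step(f, h_m)
            h_r = self.refine.step(np.concatenate([h_m, f]), h_r)
        last = frames[-1]
        preds = []
        for _ in range(horizon):
            # 1) Motion cell advances dynamics, blind to appearance detail.
            h_m = self.motion.step(last, h_m)
            # 2) Refine cell recovers appearance from motion + history.
            h_r = self.refine.step(np.concatenate([h_m, last]), h_r)
            last = self.readout @ h_r
            preds.append(last)
        return np.stack(preds)

# Usage: predict 3 future "frames" (flattened to 16-d vectors here).
rng = np.random.default_rng(1)
frames = rng.standard_normal((5, 16))
model = CDNetSketch(frame_dim=16, hid_dim=8)
preds = model.predict(frames, horizon=3)
print(preds.shape)  # (3, 16)
```

The point of the two-stage loop is the decoupling: `h_m` is updated before `h_r` ever sees the new step, so motion is committed first and appearance is refined around it, mirroring the cascade the abstract describes.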
