142 papers with code • 14 benchmarks • 17 datasets
Video Prediction is the task of predicting future frames given past video frames.
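At the tensor level, the task can be sketched as mapping a stack of observed frames to a stack of future frames. The following is a minimal illustration, not any paper's method: a trivial "copy the last frame" baseline that only pins down the expected input and output shapes (the `(T, H, W, C)` layout is an assumption for illustration).

```python
import numpy as np

def copy_last_frame_baseline(past_frames: np.ndarray, horizon: int) -> np.ndarray:
    """Predict future frames by repeating the most recent observed frame.

    past_frames: array of shape (T, H, W, C) holding T observed frames.
    horizon: number of future frames to predict.
    Returns an array of shape (horizon, H, W, C).
    """
    last = past_frames[-1]
    return np.stack([last] * horizon, axis=0)

# Example: 4 observed 8x8 RGB frames, predict 2 future frames.
past = np.random.rand(4, 8, 8, 3)
pred = copy_last_frame_baseline(past, horizon=2)
assert pred.shape == (2, 8, 8, 3)
```

Despite its simplicity, this baseline is a common sanity check: a learned model should at least outperform it on motion-heavy video.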
This allows the static scene to remain fixed and the motion of the ego-vehicle to be represented on the grid in the same way as that of other agents.
The task of video prediction and generation is notoriously difficult, and research in this area has largely been limited to short-term predictions.
The existing state-of-the-art method for audio-visual conditioned video prediction uses latent codes of the audio-visual frames from a multimodal stochastic network, together with a frame encoder, to predict the next visual frame.
Most existing approaches to video prediction build their models on a Single-In-Single-Out (SISO) architecture, which takes the current frame as input and predicts the next frame recursively.
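The SISO rollout pattern can be sketched as a simple loop that feeds each predicted frame back in as the next input. The `toy_model` below is a made-up stand-in (a shifted average, not a real predictor) used only to show the recursion; error accumulation over such rollouts is one reason long-horizon prediction is hard.

```python
import numpy as np

def recursive_rollout(model, first_frame: np.ndarray, steps: int) -> list:
    """Single-In-Single-Out rollout: one frame in, one frame out,
    with each prediction fed back as the next input."""
    frames = []
    frame = first_frame
    for _ in range(steps):
        frame = model(frame)
        frames.append(frame)
    return frames

# Toy stand-in "model": blend the frame with a vertically shifted copy.
toy_model = lambda f: 0.5 * (f + np.roll(f, 1, axis=0))

preds = recursive_rollout(toy_model, np.ones((8, 8)), steps=3)
assert len(preds) == 3 and preds[0].shape == (8, 8)
```

Because each step consumes the previous step's output, small prediction errors compound, which motivates the multi-frame-input alternatives discussed in that line of work.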
Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requires the model to generate a video from partial frames guided by an instruction.
In this paper, we introduce 3D-CSL, a compact pipeline for Near-Duplicate Video Retrieval (NDVR), and explore a novel self-supervised learning strategy for video similarity learning.
While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge.
We propose a unified model for multiple conditional video synthesis tasks, including video prediction and video frame interpolation.
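The two tasks differ only in what they condition on: prediction conditions on past frames, while interpolation conditions on both boundary frames. As a hedged illustration of the interpolation side (not the proposed model, which is learned), here is the naive linear-blend baseline between two boundary frames:

```python
import numpy as np

def linear_interpolate(frame_a: np.ndarray, frame_b: np.ndarray, n_mid: int) -> list:
    """Naive frame interpolation: blend the two boundary frames at
    evenly spaced time steps strictly between 0 and 1."""
    ts = np.linspace(0.0, 1.0, n_mid + 2)[1:-1]  # interior time steps only
    return [(1.0 - t) * frame_a + t * frame_b for t in ts]

# Two intermediate frames between an all-zero and an all-one frame.
mids = linear_interpolate(np.zeros((4, 4)), np.ones((4, 4)), n_mid=2)
assert len(mids) == 2
```

Linear blending produces ghosting whenever objects move, which is precisely the failure mode that learned conditional synthesis models aim to fix.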