REGIS: Refining Generated Videos via Iterative Stylistic Redesigning

Research Square Preprint 2023 · Jason Abohwo

In recent years, generative models have made impressive advances toward realistic output; in particular, models working in the text and audio modalities have reached a level of quality at which generated text or audio cannot be easily distinguished from real text or audio. Despite these revolutionary advancements, the synthesis of realistic and temporally consistent videos is still in its infancy. In this paper, we introduce a novel approach to the creation of realistic videos that focuses on improving a generated video in the latter steps of the video generation process. Specifically, we propose a framework for the iterative refinement of generated videos through repeated passes through a neural network trained to model the spatio-temporal dependencies found in real videos. Through our experiments, we demonstrate that our proposed approach significantly improves upon the generations of text-to-video models and achieves state-of-the-art results on the UCF-101 benchmark, removing the spatio-temporal artifacts and noise that make synthetic videos distinguishable from real videos. In addition, we discuss ways in which this framework might be augmented to achieve better performance.
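
Below is a minimal sketch of the iterative-refinement loop the abstract describes: a generated clip is passed repeatedly through a network that models spatio-temporal dependencies, with each pass applying a predicted correction. The class name, residual formulation, layer sizes, and number of passes here are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch only: a stand-in refiner and the repeated-pass loop described in the
# abstract. All names, shapes, and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class RefinementNet(nn.Module):
    """Placeholder refiner: predicts a residual correction for a video clip
    shaped (batch, channels, frames, height, width)."""

    def __init__(self, channels: int = 3):
        super().__init__()
        # A small 3D-conv stack stands in for the spatio-temporal model.
        self.net = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, channels, kernel_size=3, padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return self.net(video)


@torch.no_grad()
def refine_video(video: torch.Tensor, refiner: nn.Module, num_passes: int = 4) -> torch.Tensor:
    """Repeatedly pass a generated clip through the refiner, applying its
    predicted residual at each step to suppress spatio-temporal artifacts."""
    for _ in range(num_passes):
        video = video + refiner(video)
    return video


if __name__ == "__main__":
    clip = torch.rand(1, 3, 16, 128, 128)  # one 16-frame 128x128 RGB clip
    refined = refine_video(clip, RefinementNet())
    print(refined.shape)  # torch.Size([1, 3, 16, 128, 128])
```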

Datasets

UCF-101

Results from the Paper


Task                     | Dataset | Model                                              | Metric Name | Metric Value | Global Rank
Text-to-Video Generation | UCF-101 | REGIS-Fuse (Finetuning, 128x128)                   | FVD16       | 141          | #1
Video Generation         | UCF-101 | REGIS-Fuse (Finetuning, 128x128, text-conditional) | FVD16       | 141          | #5

Methods


No methods listed for this paper.