TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models often produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real-world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. As a result, we show that the pretrained T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and be visually consistent (e.g., w.r.t entity and background). Our TALC-finetuned model outperforms the baseline methods on multi-scene video-text data by 15.5 points on aggregated score, averaging visual consistency and text adherence using human evaluation. The project website is https://talc-mst2v.github.io/.
PDF Abstract