Text-to-Video Generation
49 papers with code • 6 benchmarks • 9 datasets
This task refers to generating a video conditioned on a given sentence or sequence of words.
Most implemented papers
Latte: Latent Diffusion Transformer for Video Generation
We propose a novel Latent Diffusion Transformer, namely Latte, for video generation.
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model.
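The finetuning idea above can be sketched in a few lines. This is an illustrative assumption, not VideoCrafter2's actual code: after joint training, the temporal modules are frozen and only the spatial modules are finetuned on high-quality images, represented here as a map from parameter name to a trainable flag (the `spatial.`/`temporal.` name prefixes are hypothetical).

```python
def finetune_spatial_only(param_names):
    """Mark spatial parameters trainable and temporal parameters frozen,
    so image finetuning improves quality without touching motion modules."""
    return {name: name.startswith("spatial.") for name in param_names}

# Hypothetical parameter names for a video diffusion model.
params = ["spatial.conv", "spatial.attn", "temporal.conv", "temporal.attn"]
trainable = finetune_spatial_only(params)
# trainable == {"spatial.conv": True, "spatial.attn": True,
#               "temporal.conv": False, "temporal.attn": False}
```

Because the temporal modules never see the image-only finetuning data, their motion behavior is preserved while the spatial pathway absorbs the higher-quality appearance distribution.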
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions.
Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures
This paper introduces a novel approach for generating videos called Synchronized Deep Recurrent Attentive Writer (Sync-DRAW).
GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation.
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively.
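The unification idea can be sketched as follows. This is a minimal assumption-laden illustration, not NÜWA's implementation: text, images, and videos are treated as 1D, 2D, and 3D token grids and flattened into a single token sequence that one transformer can consume (the token ids are placeholders).

```python
def flatten_tokens(grid):
    """Recursively flatten a nested token grid (1D, 2D, or 3D) into a list."""
    if not isinstance(grid, list):
        return [grid]
    out = []
    for item in grid:
        out.extend(flatten_tokens(item))
    return out

text = [1, 2, 3]                      # 1D: sequence of word tokens
image = [[4, 5], [6, 7]]              # 2D: H x W grid of patch tokens
video = [[[8, 9], [10, 11]],          # 3D: T x H x W grid of patch tokens
         [[12, 13], [14, 15]]]

# One unified sequence covering all three modalities.
sequence = flatten_tokens(text) + flatten_tokens(image) + flatten_tokens(video)
print(len(sequence))  # 3 + 4 + 8 = 15
```

Representing all modalities as token grids of different rank is what lets a single encoder-decoder cover language, image, and video scenarios.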
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation.
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Diffusion models have shown remarkable results recently but require significant computational resources.
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction.
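The task setup can be sketched as below. This is an assumed input format, not the paper's API: the model receives the observed partial frames plus a text instruction, and must generate the missing frames; choosing which frames are observed yields different completion settings (the frame labels and helper name are hypothetical).

```python
def make_tvc_input(frames, observed, instruction):
    """Keep frames at observed indices; replace the rest with None
    placeholders the model must generate, conditioned on `instruction`."""
    partial = [f if i in observed else None for i, f in enumerate(frames)]
    return {"frames": partial, "instruction": instruction}

frames = ["f0", "f1", "f2", "f3"]
prediction = make_tvc_input(frames, {0}, "a dog runs left")     # first frame given
rewind     = make_tvc_input(frames, {3}, "a dog runs left")     # last frame given
infilling  = make_tvc_input(frames, {0, 3}, "a dog runs left")  # both ends given
```

Framing prediction, rewind, and infilling as the same fill-in-the-missing-frames problem is what lets one model unify them.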
MAGVIT: Masked Generative Video Transformer
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
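The single-model, many-tasks idea rests on masked token modeling and can be sketched as follows. This is an illustrative assumption, not MAGVIT's code: video frames are tokenized, some tokens are replaced with a mask id, and the model predicts the originals; different mask patterns express different synthesis tasks (the token ids and `MASK` sentinel are placeholders).

```python
import random

MASK = -1  # assumed sentinel id for a masked token

def apply_mask(tokens, mask_positions):
    """Replace tokens at the given positions with the MASK id; the model's
    job is to predict the originals from the visible context."""
    return [MASK if i in mask_positions else t for i, t in enumerate(tokens)]

# A flattened video of 2 frames x 4 tokens each (hypothetical ids).
tokens = [10, 11, 12, 13, 20, 21, 22, 23]

# Different mask patterns express different synthesis tasks with one model:
frame_prediction = apply_mask(tokens, set(range(4, 8)))  # mask the future frame
inpainting = apply_mask(tokens, {1, 2, 5, 6})            # mask a spatial region
random_mask = apply_mask(tokens, set(random.sample(range(8), 4)))
```

Because only the mask pattern changes between tasks, one trained model can serve frame prediction, inpainting, and unconditional generation alike.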