Text-to-Video Generation
19 papers with code • 3 benchmarks • 4 datasets
Text-to-video generation is the task of synthesizing a video conditioned on a natural-language description, such as a sentence or sequence of words.
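The task interface can be sketched as a function from a prompt to a clip. The sketch below is purely illustrative (the `generate_video` stub and its noise output are hypothetical, not any paper's method); it only shows the conventional (frames, height, width, channels) output layout that real T2V models produce.

```python
import numpy as np

def generate_video(prompt: str, num_frames: int = 16,
                   height: int = 64, width: int = 64) -> np.ndarray:
    """Toy stand-in for a text-to-video generator (illustration only).

    A real model would condition on the prompt; here we just derive an RNG
    seed from it and emit noise frames in the standard clip layout.
    """
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    # Clip tensor: (num_frames, height, width, 3) uint8 RGB frames.
    frames = rng.integers(0, 256, size=(num_frames, height, width, 3),
                          dtype=np.uint8)
    return frames

clip = generate_video("a dog surfing a wave", num_frames=8)
print(clip.shape)  # (8, 64, 64, 3)
```

Evaluating such a generator is harder than producing this tensor: as several of the papers below note, many different clips can be valid for one prompt.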
Most implemented papers
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator.
MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration
Altogether, MUGEN can help progress research in many tasks in multimodal understanding and generation.
Make-A-Video: Text-to-Video Generation without Text-Video Data
We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures
This paper introduces a novel approach for generating videos called Synchronized Deep Recurrent Attentive Writer (Sync-DRAW).
GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
Generating videos from text is challenging due to the high computational cost of training and the open-ended space of plausible outputs, which complicates evaluation.
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively.
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation.
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Diffusion models have shown remarkable results recently but require significant computational resources.
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
We introduce a novel task, text-guided video completion (TVC), which requires the model to generate a video from partial frames guided by an instruction.
MAGVIT: Masked Generative Video Transformer
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.