Text-to-Video Generation
49 papers with code • 6 benchmarks • 9 datasets
This task refers to generating a video conditioned on a given sentence or sequence of words.
Most implemented papers
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distribution.
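The forward process described above has a convenient closed form: after t noising steps, x_t is a known Gaussian perturbation of the original data point x_0. A minimal sketch of that forward process is below; the linear beta schedule and all variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch of a DPM's forward diffusion process.
# The schedule values below are common defaults, assumed for this example.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule beta_1..beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))     # stand-in "data point"
xt, eps = q_sample(x0, T - 1, rng)   # near t = T, x_t is close to pure noise
```

The reverse (denoising) process that generates new samples is learned by training a network to predict the added noise `eps` from `xt` and `t`, which is the part the video-generation papers above build on.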
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
CelebV-Text: A Large-Scale Facial Text-Video Dataset
This paper presents CelebV-Text, a large-scale, diverse, and high-quality dataset of facial text-video pairs, to facilitate research on facial text-to-video generation tasks.
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
Generating text-editable and pose-controllable character videos is in high demand for creating various digital humans.
Generative Disco: Text-to-Video Generation for Music Visualization
Visuals can enhance our experience of music, owing to the way they can amplify the emotions and messages conveyed within it.
Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models
The proliferation of video content demands efficient and flexible neural network based approaches for generating new video content.
Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
To fully unlock model capabilities for high-quality video generation and promote the development of the field, we curate a large-scale and open-source video dataset called HD-VG-130M.
ControlVideo: Training-free Controllable Text-to-Video Generation
Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterpart still lags behind due to the excessive training cost of temporal modeling.
DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
In the paradigm of AI-generated content (AIGC), there has been increasing attention to transferring knowledge from pre-trained text-to-image (T2I) models to text-to-video (T2V) generation.