Text-to-Video Generation
49 papers with code • 6 benchmarks • 9 datasets
This task refers to generating a video conditioned on a given sentence or sequence of words.
Most implemented papers
Latte: Latent Diffusion Transformer for Video Generation
We propose a novel Latent Diffusion Transformer, namely Latte, for video generation.
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model.
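The finetuning idea above can be sketched in a few lines. This is an illustrative assumption, not VideoCrafter2's actual code: after joint training, the temporal modules are frozen and only the spatial modules are finetuned on high-quality images, represented here as a map from parameter name to a trainable flag (the `spatial.`/`temporal.` name prefixes are hypothetical).

```python
def finetune_spatial_only(param_names):
    """Mark spatial parameters trainable and temporal parameters frozen,
    so image finetuning improves quality without touching motion modules."""
    return {name: name.startswith("spatial.") for name in param_names}

# Hypothetical parameter names for a video diffusion model.
params = ["spatial.conv", "spatial.attn", "temporal.conv", "temporal.attn"]
trainable = finetune_spatial_only(params)
# trainable == {"spatial.conv": True, "spatial.attn": True,
#               "temporal.conv": False, "temporal.attn": False}
```

Because the temporal modules never see the image-only finetuning data, their motion behavior is preserved while the spatial pathway absorbs the higher-quality appearance distribution.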
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions.
Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures
This paper introduces a novel approach for generating videos called Synchronized Deep Recurrent Attentive Writer (Sync-DRAW).
GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation.
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively.
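The unification idea can be sketched as follows. This is a minimal assumption-laden illustration, not NÜWA's implementation: text, images, and videos are treated as 1D, 2D, and 3D token grids and flattened into a single token sequence that one transformer can consume (the token ids are placeholders).

```python
def flatten_tokens(grid):
    """Recursively flatten a nested token grid (1D, 2D, or 3D) into a list."""
    if not isinstance(grid, list):
        return [grid]
    out = []
    for item in grid:
        out.extend(flatten_tokens(item))
    return out

text = [1, 2, 3]                      # 1D: sequence of word tokens
image = [[4, 5], [6, 7]]              # 2D: H x W grid of patch tokens
video = [[[8, 9], [10, 11]],          # 3D: T x H x W grid of patch tokens
         [[12, 13], [14, 15]]]

# One unified sequence covering all three modalities.
sequence = flatten_tokens(text) + flatten_tokens(image) + flatten_tokens(video)
print(len(sequence))  # 3 + 4 + 8 = 15
```

Representing all modalities as token grids of different rank is what lets a single encoder-decoder cover language, image, and video scenarios.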
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation.
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Diffusion models have shown remarkable results recently but require significant computational resources.
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction.
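The task setup can be sketched as below. This is an assumed input format, not the paper's API: the model receives the observed partial frames plus a text instruction, and must generate the missing frames; choosing which frames are observed yields different completion settings (the frame labels and helper name are hypothetical).

```python
def make_tvc_input(frames, observed, instruction):
    """Keep frames at observed indices; replace the rest with None
    placeholders the model must generate, conditioned on `instruction`."""
    partial = [f if i in observed else None for i, f in enumerate(frames)]
    return {"frames": partial, "instruction": instruction}

frames = ["f0", "f1", "f2", "f3"]
prediction = make_tvc_input(frames, {0}, "a dog runs left")     # first frame given
rewind     = make_tvc_input(frames, {3}, "a dog runs left")     # last frame given
infilling  = make_tvc_input(frames, {0, 3}, "a dog runs left")  # both ends given
```

Framing prediction, rewind, and infilling as the same fill-in-the-missing-frames problem is what lets one model unify them.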
MAGVIT: Masked Generative Video Transformer
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
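The single-model, many-tasks idea rests on masked token modeling and can be sketched as follows. This is an illustrative assumption, not MAGVIT's code: video frames are tokenized, some tokens are replaced with a mask id, and the model predicts the originals; different mask patterns express different synthesis tasks (the token ids and `MASK` sentinel are placeholders).

```python
import random

MASK = -1  # assumed sentinel id for a masked token

def apply_mask(tokens, mask_positions):
    """Replace tokens at the given positions with the MASK id; the model's
    job is to predict the originals from the visible context."""
    return [MASK if i in mask_positions else t for i, t in enumerate(tokens)]

# A flattened video of 2 frames x 4 tokens each (hypothetical ids).
tokens = [10, 11, 12, 13, 20, 21, 22, 23]

# Different mask patterns express different synthesis tasks with one model:
frame_prediction = apply_mask(tokens, set(range(4, 8)))  # mask the future frame
inpainting = apply_mask(tokens, {1, 2, 5, 6})            # mask a spatial region
random_mask = apply_mask(tokens, set(random.sample(range(8), 4)))
```

Because only the mask pattern changes between tasks, one trained model can serve frame prediction, inpainting, and unconditional generation alike.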