We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
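
To make the setup the abstract describes more concrete, here is a minimal PyTorch sketch of a decoder-only transformer trained with a next-token objective over one shared token vocabulary. Everything in it is an illustrative assumption rather than VideoPoet's actual implementation: the module, the dimensions, and the random ids standing in for a multimodal token sequence are all hypothetical.

```python
# Minimal sketch of decoder-only, next-token training over a shared
# multimodal vocabulary. All names and sizes are illustrative assumptions,
# not VideoPoet's actual architecture or tokenizers.
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    """A small causal (decoder-only) transformer over a shared vocabulary."""

    def __init__(self, vocab_size=1024, d_model=128, n_heads=4,
                 n_layers=2, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        seq_len = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(seq_len, device=ids.device))
        # Causal mask: each position may only attend to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        return self.head(self.blocks(x, mask=mask.to(ids.device)))

# Hypothetical multimodal sequence: in practice, text, image, video, and
# audio tokens from separate tokenizers would occupy disjoint ranges of one
# shared vocabulary; here random ids stand in for such a sequence.
model = TinyDecoderLM()
seq = torch.randint(0, 1024, (2, 64))  # (batch, tokens)
logits = model(seq)
# Autoregressive objective: predict token t+1 from tokens up to t.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1)
)
loss.backward()
```

The single shared vocabulary is what lets one autoregressive model handle several modalities uniformly: once every input is a stream of discrete tokens, pretraining on a mixture of generative objectives reduces to next-token prediction over differently composed sequences.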

Results from the Paper


Task                      Dataset   Model                          Metric           Value    Global Rank
Text-to-Video Generation  MSR-VTT   VideoPoet                      CLIPSIM          0.3123   #2
Text-to-Video Generation  MSR-VTT   VideoPoet                      FVD              213      #3
Video Generation          UCF-101   VideoPoet (text-conditional)   Inception Score  38.44    #17
Video Generation          UCF-101   VideoPoet (text-conditional)   FVD16            355      #18
Text-to-Video Generation  UCF-101   VideoPoet                      FVD16            355      #7
