CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

29 May 2022 · Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
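The abstract does not spell out how clips at different frame rates are drawn. As a rough illustration only, the sketch below assumes that a fixed-length clip is built by subsampling a source video at a chosen target frame rate. The function name `sample_clip_indices` and all of its parameters are hypothetical and are not taken from the CogVideo code.

```python
from typing import List


def sample_clip_indices(total_frames: int, source_fps: float,
                        target_fps: float, clip_len: int,
                        start: int = 0) -> List[int]:
    """Pick `clip_len` frame indices so the clip plays at `target_fps`.

    Hypothetical helper: assumes the source video has a constant frame
    rate and that the target rate does not exceed the source rate.
    """
    if target_fps <= 0 or target_fps > source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    stride = source_fps / target_fps  # source frames per sampled frame
    indices = [start + int(i * stride) for i in range(clip_len)]
    if indices[-1] >= total_frames:
        raise ValueError("video too short for this clip at this frame rate")
    return indices


# The same 5-frame budget covers more time at a lower frame rate.
print(sample_clip_indices(100, 32, 8, 5))   # [0, 4, 8, 12, 16]
print(sample_clip_indices(100, 32, 32, 5))  # [0, 1, 2, 3, 4]
```

With a fixed clip length, a lower target frame rate spans a longer stretch of the video, which is one plausible reason to vary the frame rate across training samples.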

Code

thudm/cogvideo official

Tasks

Text-to-Video Generation

Video Generation

Datasets

UCF101

Kinetics-600

Results from the Paper

Ranked #12 on Video Generation on UCF-101

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Video Generation	UCF-101	CogVideo (128x128, class-conditional)	Inception Score	51.11	#13
Video Generation	UCF-101	CogVideo (128x128, class-conditional)	FVD16	305	#12
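
For readers unfamiliar with the first metric, the Inception Score is commonly defined as exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's predicted class distribution for a generated sample and p(y) is the average of those distributions. The sketch below computes that quantity from a list of probability vectors. Which classifier and evaluation protocol produced the 51.11 above is not stated on this page, so the inputs here are toy values and the function name `inception_score` is illustrative.

```python
import math
from typing import Sequence


def inception_score(probs: Sequence[Sequence[float]]) -> float:
    """exp of the mean KL divergence between each p(y|x) and the marginal p(y).

    `probs` holds one class-probability vector per generated sample;
    each row is assumed to be non-negative and to sum to 1.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    total_kl = 0.0
    for row in probs:
        for p, m in zip(row, marginal):
            if p > 0.0:  # terms with p == 0 contribute nothing
                total_kl += p * math.log(p / m)
    return math.exp(total_kl / n)


# Uninformative predictions give the minimum score of 1.
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
# Confident predictions spread evenly over 2 classes approach the maximum, 2.
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
```

Under this definition the score is bounded above by the number of classes, so on the 101-class UCF-101 dataset it cannot exceed 101. Higher is better for the Inception Score, while lower is better for FVD.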

Methods

ALIGN
