T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

In this work, we investigate a simple and well-known conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and the Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable consistency between text and generated motion (R-Precision), but with a substantially better FID of 0.116, compared with 0.630 for MotionDiffuse. Additionally, we conduct analyses on HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE remains a competitive approach for human motion generation.
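The abstract does not spell out how the corruption strategy works. The sketch below assumes one plausible reading that matches the τ settings listed in the results: during training, each ground-truth code index fed to the GPT is replaced by a random index with probability τ, where τ is either fixed or drawn from U[0, 1]. The function name `corrupt_tokens`, the codebook size of 512, and the sequence length are hypothetical choices for illustration, not details taken from this page.

```python
import random


def corrupt_tokens(tokens, tau, codebook_size, rng):
    """Replace each code index with a random one with probability tau.

    Assumed reading of the corruption strategy; not the authors' code.
    """
    corrupted = []
    for tok in tokens:
        if rng.random() < tau:
            # Swap the ground-truth index for a uniformly random code.
            corrupted.append(rng.randrange(codebook_size))
        else:
            corrupted.append(tok)
    return corrupted


rng = random.Random(0)
codebook_size = 512  # hypothetical codebook size
tokens = [rng.randrange(codebook_size) for _ in range(16)]

# tau = 0: no corruption, the input sequence is left untouched.
clean = corrupt_tokens(tokens, 0.0, codebook_size, rng)

# tau = 0.5: roughly half of the indices are replaced.
fixed = corrupt_tokens(tokens, 0.5, codebook_size, rng)

# tau ~ U[0, 1]: a fresh corruption rate is drawn for each sequence.
sampled = corrupt_tokens(tokens, rng.uniform(0.0, 1.0), codebook_size, rng)
```

Under this reading, the model sees imperfect prefixes during training, which resembles inference, where it conditions on its own possibly wrong predictions.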


Results from the Paper


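| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Motion Synthesis | HumanML3D | T2M-GPT (τ = 0.5) | FID | 0.116 | #11 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ = 0.5) | Diversity | 9.761 | #4 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ = 0.5) | Multimodality | 1.856 | #12 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ = 0.5) | R Precision Top3 | 0.775 | #15 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ ∈ U[0, 1]) | FID | 0.141 | #14 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ ∈ U[0, 1]) | Diversity | 9.722 | #6 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ ∈ U[0, 1]) | Multimodality | 1.831 | #13 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ ∈ U[0, 1]) | R Precision Top3 | 0.775 | #15 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ = 0) | FID | 0.140 | #13 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ = 0) | Diversity | 9.844 | #2 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ = 0) | Multimodality | 3.285 | #1 |
| Motion Synthesis | HumanML3D | T2M-GPT (τ = 0) | R Precision Top3 | 0.685 | #21 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ ∈ U[0, 1]) | FID | 0.514 | #12 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ ∈ U[0, 1]) | R Precision Top3 | 0.745 | #9 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ ∈ U[0, 1]) | Diversity | 10.921 | #8 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ ∈ U[0, 1]) | Multimodality | 1.570 | #12 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ = 0) | FID | 0.737 | #15 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ = 0) | R Precision Top3 | 0.716 | #16 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ = 0) | Diversity | 11.198 | #1 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ = 0) | Multimodality | 2.309 | #4 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ = 0.5) | FID | 0.717 | #14 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ = 0.5) | R Precision Top3 | 0.737 | #13 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ = 0.5) | Diversity | 10.862 | #11 |
| Motion Synthesis | KIT Motion-Language | T2M-GPT (τ = 0.5) | Multimodality | 1.912 | #9 |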

Methods