We present a framework that learns a representation of 3D human dynamics from video via a simple but effective temporal encoding of image features.
In this work, we present perhaps the first approach for predicting a future 3D mesh model sequence of a person from past video input.
3D HUMAN DYNAMICS, 3D HUMAN POSE ESTIMATION, FUTURE PREDICTION, LANGUAGE MODELLING