We present STEP, a novel classifier network that recognizes perceived human emotion from gaits, built on a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture. Given an RGB video of an individual walking, our formulation implicitly exploits gait features to classify the perceived emotion of the human into one of four categories: happy, sad, angry, or neutral. We train STEP on annotated real-world gait videos, augmented with annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN-based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP. We also release a novel dataset (E-Gait), which consists of 4,227 human gaits annotated with perceived emotions, along with thousands of synthetic gaits. In practice, STEP learns the affective features and achieves a classification accuracy of 88% on E-Gait, which is 14-30% more accurate than prior methods.
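To make the architecture described above concrete, below is a minimal PyTorch sketch of an ST-GCN-style gait emotion classifier: a spatial graph convolution over skeletal joints followed by a temporal convolution over frames, pooled into a 4-way emotion prediction. This is an illustrative assumption-based sketch, not the authors' STEP implementation; the joint count, channel sizes, and class names are hypothetical and the learnable adjacency stands in for the skeleton graph used in the paper.

```python
import torch
import torch.nn as nn

class SpatialTemporalGCNBlock(nn.Module):
    """One ST-GCN-style block: graph convolution over joints, then a temporal conv over frames."""
    def __init__(self, in_channels, out_channels, num_joints, kernel_t=9):
        super().__init__()
        # Learnable adjacency over the skeleton graph (identity init as a placeholder)
        self.A = nn.Parameter(torch.eye(num_joints))
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(kernel_t, 1),
                                  padding=(kernel_t // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        x = torch.einsum('nctv,vw->nctw', x, self.A)  # propagate features over the joint graph
        x = self.relu(self.spatial(x))
        x = self.relu(self.temporal(x))
        return x

class GaitEmotionClassifier(nn.Module):
    """Hypothetical STEP-like classifier: stacked ST-GCN blocks, global pooling,
    and a linear head over {happy, sad, angry, neutral}."""
    def __init__(self, num_joints=16, in_channels=3, num_classes=4):
        super().__init__()
        self.blocks = nn.Sequential(
            SpatialTemporalGCNBlock(in_channels, 32, num_joints),
            SpatialTemporalGCNBlock(32, 64, num_joints),
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        # x: (batch, 3, frames, joints) -- 3D joint coordinates per frame
        x = self.blocks(x)
        x = x.mean(dim=[2, 3])   # global average pool over time and joints
        return self.fc(x)        # emotion logits

# Usage: a batch of 8 gait sequences, 75 frames, 16 joints with 3D coordinates
model = GaitEmotionClassifier()
logits = model(torch.randn(8, 3, 75, 16))
print(logits.shape)  # torch.Size([8, 4])
```

In the paper, training data of this form is augmented with synthetic gaits from STEP-Gen, whose CVAE and push-pull regularization loss are not sketched here.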
Source: STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits
| Task | Papers | Share |
|---|---|---|
| Automatic Speech Recognition (ASR) | 3 | 10.34% |
| Speech Recognition | 3 | 10.34% |
| Emotion Recognition | 2 | 6.90% |
| Language Modelling | 2 | 6.90% |
| Sentence | 2 | 6.90% |
| Audio-Visual Speech Recognition | 1 | 3.45% |
| Visual Speech Recognition | 1 | 3.45% |
| Denoising | 1 | 3.45% |
| Mutual Information Estimation | 1 | 3.45% |