Learning spatiotemporal representations for human fall detection in surveillance video

Journal of Visual Communication and Image Representation 2019 · Yongqiang Kong, Jianhui Huang, Shanshan Huang, Zhengang Wei, Shengke Wang

In this paper, we propose a computer-vision-based framework that detects falls in surveillance video. First, we employ background subtraction and rank pooling to model spatial and temporal representations in videos, respectively. We then introduce a novel three-stream convolutional neural network as an event classifier. Silhouettes and their motion history images serve as input to the first two streams, while dynamic images, whose temporal duration equals that of the motion history images, are used as input to the third stream. Finally, we apply voting to the event classification results to perform multi-camera fall detection. The main novelty of our method compared with conventional ones is that high-quality spatiotemporal representations at different levels are learned to take full advantage of appearance and motion information. Extensive experiments were conducted on two widely used fall datasets, and the results demonstrate the effectiveness of the proposed method.
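
The three representation and fusion steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the frame-differencing rule used for the motion history image, the simplified linear weights `2t - T - 1` used to approximate rank pooling, and the choice of strict majority voting (with ties resolved as "no fall") are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def motion_history_image(silhouettes, tau=None):
    """Motion history image from binary silhouettes of shape (T, H, W).

    Assumption: motion is detected by differencing consecutive silhouettes;
    moving pixels are reset to tau, all others decay by 1 per frame.
    """
    frames = np.asarray(silhouettes).astype(bool)
    num_frames = frames.shape[0]
    tau = num_frames if tau is None else tau
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, num_frames):
        moving = frames[t] != frames[t - 1]
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / tau  # scale to [0, 1]


def dynamic_image(frames):
    """Dynamic image as a weighted sum of frames of shape (T, ...).

    Assumption: simplified approximate-rank-pooling weights 2t - T - 1
    for t = 1..T, so later frames receive larger weights.
    """
    frames = np.asarray(frames, dtype=np.float32)
    num_frames = frames.shape[0]
    weights = 2.0 * np.arange(1, num_frames + 1) - num_frames - 1.0
    return np.tensordot(weights, frames, axes=1)


def majority_vote(camera_predictions):
    """Fuse per-camera binary decisions (1 = fall, 0 = no fall).

    Assumption: strict majority; a tie is reported as no fall.
    """
    votes = np.asarray(camera_predictions, dtype=int)
    return int(votes.sum() * 2 > votes.size)
```

In this sketch, the motion history image and the dynamic image would be computed over the same window of frames, matching the abstract's statement that both cover the same temporal duration, and `majority_vote` would be applied to the classifier's per-camera outputs.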
