We first introduce the vanilla video transformer and show that the transformer module can perform spatio-temporal modeling from raw pixels, but at the cost of heavy memory usage.
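The memory cost mentioned above can be made concrete with a rough back-of-the-envelope sketch. It assumes joint self-attention over all space-time patches; the 224x224 frame size, 16x16 patch size, and clip lengths are illustrative assumptions, not values taken from the paper, and `attention_cost` is a hypothetical helper.

```python
def attention_cost(frames: int, height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (token count, attention-matrix entries per head) for joint
    spatio-temporal self-attention over non-overlapping square patches.

    All sizes are illustrative assumptions, not settings from the paper.
    """
    tokens = frames * (height // patch) * (width // patch)
    # Self-attention compares every token with every other token,
    # so the score matrix has tokens * tokens entries.
    return tokens, tokens * tokens


for clip_length in (8, 16):
    tokens, entries = attention_cost(clip_length, 224, 224, 16)
    megabytes = entries * 4 / 1e6  # float32 scores, one head, one layer
    print(f"{clip_length} frames: {tokens} tokens, {entries} entries, ~{megabytes:.1f} MB")
```

Under these assumptions, doubling the clip length doubles the token count but quadruples the size of each attention matrix, which is one way to see why attention over raw pixels becomes memory-hungry.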
Ranked #11 on Action Classification on Charades
no code implementations • 2 Apr 2021 • Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Shuai Bing, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G. M. Snoek, Joseph Tighe
We propose TubeR: a simple solution for spatio-temporal video action detection.
Multi-label activity recognition is designed for recognizing multiple activities that are performed simultaneously or sequentially in each video.
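As a generic illustration of the multi-label setting (not the method of the paper), the sketch below scores each activity independently with a sigmoid and keeps every activity above a threshold, so several labels can be returned for one video. The activity names, logit values, 0.5 threshold, and the `predict_multilabel` helper are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_multilabel(logits: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Keep every activity whose independent sigmoid score reaches the threshold.

    Unlike a softmax followed by argmax, which picks exactly one class,
    this can return zero, one, or several activities per video.
    """
    return [name for name, z in logits.items() if sigmoid(z) >= threshold]


# Hypothetical per-activity logits for a single video.
video_logits = {"holding a cup": 2.1, "walking": 0.4, "opening a door": -1.7}
print(predict_multilabel(video_logits))
```

With these made-up logits, the first two activities pass the threshold and are reported together, while the third is dropped.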
For the Olympic swimming dataset, our system achieved an accuracy of 88%, an F1-score of 0.58, a completeness estimation error of 6.3% and a remaining-time estimation error of 2.9 minutes.
Our system is the first to address concurrent activity recognition with multisensory data using a single model, which is scalable, simple to train and easy to deploy.