|Trend|Dataset|Best Method|Paper title|Paper|Code|Compare|
|---|---|---|---|---|---|---|
We then apply graph convolutional networks (GCNs) over the graph to model the relations among different proposals and to learn powerful representations for action classification and localization.
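The core operation behind this idea can be sketched as a single graph-convolution layer applied to proposal features. The adjacency matrix, feature sizes, and layer below are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    D = A_hat.sum(axis=1)                      # node degrees (>= 1 after self-loops)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(D))     # symmetric normalization factor
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # normalized adjacency
    return np.maximum(A_norm @ H @ W, 0.0)     # ReLU activation

# 5 hypothetical proposals with 16-d features, linked if they overlap in time
rng = np.random.default_rng(0)
A = (rng.random((5, 5)) > 0.5).astype(float)
A = np.maximum(A, A.T)                         # make the graph undirected
H = rng.standard_normal((5, 16))               # proposal features
W = rng.standard_normal((16, 8))               # learnable layer weights
H_out = gcn_layer(A, H, W)
print(H_out.shape)  # (5, 8)
```

Stacking a few such layers lets each proposal's representation absorb context from related proposals before classification and boundary regression.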
We benchmark contemporary action recognition models (TSN, TRN, and TSM) on the recently introduced EPIC-Kitchens dataset and release pretrained models on GitHub (https://github.com/epic-kitchens/action-models) for others to build upon.
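All three of these models rely on TSN-style sparse temporal sampling: the video is split into equal segments and one snippet is drawn from each, so a fixed number of frames covers the whole clip. A minimal sketch (frame counts are hypothetical):

```python
import numpy as np

def sparse_sample(num_frames, num_segments, rng=None):
    """TSN-style sampling: split the video into equal temporal segments
    and pick one frame index at random from each segment."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(edges[:-1], edges[1:])]

# a 300-frame clip sampled into 8 snippets, one per segment
idx = sparse_sample(300, 8, np.random.default_rng(0))
print(idx)  # 8 increasing frame indices spanning the whole clip
```

At test time, implementations typically replace the random draw with each segment's center frame for determinism.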
HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene.
SOTA for Multi-Task Learning on HVU
It is natural to ask: 1) whether group convolution can alleviate the high computational cost of video classification networks; 2) which factors matter most in 3D group convolutional networks; and 3) what constitute good computation/accuracy trade-offs with 3D group convolutional networks.
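The computational saving from grouping can be checked with simple arithmetic: a grouped convolution only connects channels within each group, dividing the weight count by the number of groups. The channel and kernel sizes below are arbitrary examples:

```python
def conv3d_params(c_in, c_out, k, groups=1):
    """Weight count of a 3D convolution with a cubic k*k*k kernel
    and channels split into independent groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k ** 3

dense = conv3d_params(64, 64, 3)                # ordinary 3D convolution
grouped = conv3d_params(64, 64, 3, groups=8)    # 8-way group convolution
depthwise = conv3d_params(64, 64, 3, groups=64) # channel-wise (depthwise) limit
print(dense, grouped, depthwise)  # 110592 13824 1728
```

The depthwise extreme is the 3D analogue of the channel-separated designs used in efficient 2D networks; the interesting question is where between the two extremes the accuracy/cost trade-off is best.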
Deep learning approaches have been established as the main methodology for video classification and recognition.
FastRNN addresses these limitations by adding a residual connection with only two extra scalar parameters, without explicitly constraining the range of the singular values.
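The update can be written as h_t = α·tanh(W x_t + U h_{t-1} + b) + β·h_{t-1}, where α and β are the two extra scalars. A minimal numpy sketch, with hypothetical dimensions and initializations:

```python
import numpy as np

def fastrnn_step(x, h, W, U, b, alpha, beta):
    """FastRNN update: the usual RNN cell plus a scaled residual
    connection; alpha and beta are the only two extra parameters."""
    return alpha * np.tanh(W @ x + U @ h + b) + beta * h

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W = rng.standard_normal((d_hid, d_in)) * 0.1   # input weights
U = rng.standard_normal((d_hid, d_hid)) * 0.1  # recurrent weights
b = np.zeros(d_hid)
h = np.zeros(d_hid)
for t in range(10):                            # run over a short random sequence
    h = fastrnn_step(rng.standard_normal(d_in), h, W, U, b, alpha=0.1, beta=0.9)
print(h.shape)  # (8,)
```

With a small α and β close to 1, the hidden state changes slowly from step to step, which stabilizes training on long sequences without clipping or explicit spectral constraints.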
To understand the world, we humans constantly need to relate the present to the past, and put events in context.
#3 best model for Action Classification on Charades (using extra training data)
We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos.
SOTA for Action Classification on HMDB51 (using extra training data)