However, it is quite expensive to annotate every frame in a large corpus of videos to construct a comprehensive supervised training dataset.
We first introduce the vanilla video transformer and show that the transformer module can perform spatio-temporal modeling from raw pixels, but at the cost of heavy memory usage.
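A minimal sketch, assuming a standard ViT-style setup (the `VanillaVideoTransformer` class, patch size, and layer counts below are illustrative, not the paper's code), of why joint spatio-temporal attention over raw-pixel patches is memory-hungry:

```python
# Vanilla video transformer sketch: raw pixels are cut into spatio-temporal
# patches and fed through full self-attention over ALL tokens at once.
import torch
import torch.nn as nn

class VanillaVideoTransformer(nn.Module):
    def __init__(self, frames=8, size=224, patch=16, dim=768, heads=12, layers=4):
        super().__init__()
        # Linear patch embedding over (1 frame, patch, patch) pixel cubes.
        self.embed = nn.Conv3d(3, dim, kernel_size=(1, patch, patch),
                               stride=(1, patch, patch))
        self.tokens = frames * (size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, self.tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, video):                        # video: (B, 3, T, H, W)
        x = self.embed(video)                        # (B, dim, T, H/p, W/p)
        x = x.flatten(2).transpose(1, 2) + self.pos  # (B, tokens, dim)
        return self.encoder(x)                       # full joint attention

model = VanillaVideoTransformer()
out = model(torch.randn(1, 3, 8, 224, 224))
# Attention cost is quadratic in token count: 8 frames * 14*14 patches = 1568
# tokens, so each attention map alone holds 1568^2 ~= 2.5M entries per head.
print(out.shape, model.tokens)
```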
no code implementations • 2 Apr 2021 • Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Bing Shuai, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G. M. Snoek, Joseph Tighe
We propose TubeR: a simple solution for spatio-temporal video action detection.
In this work, we focus on improving the inference efficiency of current action recognition backbones on trimmed videos, and illustrate that one action model can also cover the informative region by dropping non-informative features.
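As a rough illustration of the general idea (not the paper's actual mechanism), the sketch below keeps only the tokens with the largest activation norms before the expensive layers; the `drop_uninformative` helper, the norm-based saliency score, and all shapes are assumptions for this example:

```python
# Drop non-informative features: keep the top-k tokens by L2 norm so the
# expensive downstream computation sees fewer tokens at inference time.
import torch

def drop_uninformative(tokens, keep_ratio=0.5):
    """tokens: (B, N, D). Keeps the keep_ratio fraction with the largest norm."""
    k = max(1, int(tokens.shape[1] * keep_ratio))
    scores = tokens.norm(dim=-1)         # (B, N) per-token saliency proxy
    idx = scores.topk(k, dim=1).indices  # indices of the retained tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, 1, idx)

x = torch.randn(2, 196, 768)
print(drop_uninformative(x).shape)  # (2, 98, 768): half the tokens survive
```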
In action recognition research, a primary focus has been how to construct and train networks to model the spatio-temporal volume of an input video.
Video action recognition is one of the representative tasks for video understanding.
Triplet loss requires complicated and fussy batch construction and converges slowly.
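For illustration, the standard batch-hard variant below shows why batches are fussy to construct: every class in the batch must contribute several samples so that valid anchor/positive/negative triplets can be mined online (the `batch_hard_triplet_loss` helper and all shapes are assumptions for this sketch):

```python
# Batch-hard triplet loss: for each anchor, mine the hardest positive
# (farthest same-class sample) and hardest negative (closest other-class
# sample) within the batch, then apply the margin hinge.
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """embeddings: (N, D) features; labels: (N,) integer class ids."""
    dist = torch.cdist(embeddings, embeddings)         # (N, N) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (N, N) positive mask
    eye = torch.eye(len(labels), dtype=torch.bool)
    # Hardest positive: farthest sample with the same label (excluding self).
    pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    # Hardest negative: closest sample with a different label.
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.relu(pos - neg + margin).mean()

emb = torch.randn(8, 128)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])  # P classes x K samples per batch
print(batch_hard_triplet_loss(emb, labels))
```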
We present compositional nearest neighbors (CompNN), a simple approach to visually interpreting distributed representations learned by a convolutional neural network (CNN) for pixel-level tasks (e.g., image synthesis and segmentation).
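A hedged sketch of the nearest-neighbor retrieval at the heart of this idea (not the authors' implementation; the `compositional_nn` helper, the feature dimensions, and the patch size are assumptions): each query location is explained by copying over the training patch whose CNN feature is closest.

```python
# Compositional nearest neighbors, schematically: compose an interpretation
# of the network's output by retrieving, per location, the nearest training
# patch in CNN feature space.
import torch

def compositional_nn(query_feats, train_feats, train_patches):
    """query_feats: (Q, D); train_feats: (T, D); train_patches: (T, p, p, 3).
    Returns, for each query location, the nearest training patch."""
    idx = torch.cdist(query_feats, train_feats).argmin(dim=1)  # (Q,) nearest ids
    return train_patches[idx]  # output composed by copy-pasting training patches

q = torch.randn(4, 64)              # CNN features at 4 query locations
t = torch.randn(100, 64)            # CNN features at 100 training locations
patches = torch.rand(100, 8, 8, 3)  # the RGB patches those features describe
print(compositional_nn(q, t, patches).shape)  # (4, 8, 8, 3)
```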
Although many 3D human activity benchmarks have been proposed, most existing action datasets focus on action recognition for pre-segmented videos.