Accurate depth estimation from images is a fundamental task in many applications, including scene understanding and reconstruction.
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution
Just as a natural image can be decomposed into low- and high-frequency components, the output feature maps of a convolution layer can be seen as a mixture of information at different frequencies.
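Octave Convolution exploits this by storing a fraction of the channels (the low-frequency group) at half spatial resolution. A back-of-envelope cost calculation sketches the saving; the function name is ours, and `alpha` denotes the low-frequency channel ratio as in the paper:

```python
def octave_feature_cost(h, w, c, alpha):
    """Number of stored feature-map elements when a fraction `alpha`
    of the c channels is kept at half spatial resolution
    (the storage idea behind Octave Convolution; a sketch)."""
    high = h * w * int((1 - alpha) * c)          # full-resolution channels
    low = (h // 2) * (w // 2) * int(alpha * c)   # half-resolution channels
    return high + low
```

With `alpha = 0.5`, half the channels cost only a quarter of the spatial memory, so the feature maps shrink to 62.5% of the plain-convolution footprint.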
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.
Another contribution is our study of a series of good practices for learning ConvNets on video data with the help of temporal segment networks.
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.
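One concrete form studied in this line of work is the (2+1)D factorization, which replaces a full t×d×d 3D convolution with a 1×d×d spatial convolution followed by a t×1×1 temporal one. A sketch of the parameter accounting, assuming the intermediate-width matching rule from the R(2+1)D paper (function names are ours):

```python
def conv3d_params(c_in, c_out, t, d):
    # Parameters of a full t x d x d 3D convolution (bias ignored).
    return c_in * c_out * t * d * d

def r2plus1d_params(c_in, c_out, t, d, m):
    # (2+1)D block: 1 x d x d spatial conv to m channels,
    # then t x 1 x 1 temporal conv from m to c_out channels.
    return c_in * m * d * d + m * c_out * t

def matched_m(c_in, c_out, t, d):
    # Intermediate width chosen so the factorized block has roughly
    # the same parameter count as the full 3D convolution.
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)
```

For a 3×3×3 convolution with 64 input and output channels, the matched width is m = 144, and the factorized block reproduces the full convolution's parameter count while adding an extra nonlinearity between the spatial and temporal steps.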
Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained on low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images.
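The first technique replaces the usual dot-product similarity with a cosine similarity divided by a temperature, which bounds attention logits and stabilizes training at scale. A minimal NumPy sketch of the idea; in the paper the temperature `tau` is a learned per-head parameter, whereas the fixed value here is only illustrative:

```python
import numpy as np

def cosine_attention(q, k, v, tau=0.1, bias=None):
    """Scaled cosine attention (sketch): similarity is
    cos(q, k) / tau plus an optional position bias, rather
    than the usual q . k / sqrt(d)."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    sim = qn @ kn.T / tau                      # bounded logits: |cos| <= 1
    if bias is not None:
        sim = sim + bias                       # e.g. relative position bias
    w = np.exp(sim - sim.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)      # softmax over keys
    return w @ v
```

Because cosine similarity is bounded in [-1, 1], the pre-softmax logits cannot blow up even when feature magnitudes grow in deep layers, which is the stability argument behind the design.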