93 papers with code • 0 benchmarks • 7 datasets
These leaderboards are used to track progress in Video Recognition.
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution
As with natural images, the output feature maps of a convolution layer can be seen as a mixture of information at different frequencies.
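The frequency split behind Octave Convolution can be illustrated with a minimal sketch (hypothetical code, not the paper's implementation, and `octave_split` is a made-up helper name): a fraction `alpha` of the channels is stored at half spatial resolution, which is where the reduction in spatial redundancy and compute comes from.

```python
import numpy as np

def octave_split(x, alpha=0.5):
    """Split a feature map (C, H, W) into high- and low-frequency groups.

    Hypothetical sketch of the Octave Convolution idea: a fraction
    `alpha` of the channels is kept at half spatial resolution, so the
    low-frequency group costs a quarter of the spatial compute.
    """
    c, h, w = x.shape
    c_low = int(alpha * c)
    x_high = x[c_low:]  # full-resolution, high-frequency channels
    # 2x2 average pooling halves the resolution of the low-frequency channels
    x_low = x[:c_low].reshape(c_low, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return x_high, x_low

x = np.random.rand(8, 32, 32)
hi, lo = octave_split(x, alpha=0.25)
print(hi.shape, lo.shape)  # (6, 32, 32) (2, 16, 16)
```

In the actual method, separate convolutions operate on each group with cross-frequency exchange; this sketch only shows the channel/resolution split.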
The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.
Therefore, in the present paper, we conduct an exploratory study to improve spatiotemporal 3D CNNs as follows: (i) recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy.
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise.
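The "temporally deep" idea can be sketched in a few lines (hypothetical code, not the paper's LRCN implementation): per-frame convolutional features, stubbed here as random vectors, are passed through a simple recurrent cell so the final representation depends on the whole frame sequence.

```python
import numpy as np

def rnn_over_frames(frame_feats, W_h, W_x):
    """Fold a sequence of per-frame feature vectors into one state.

    Minimal vanilla-RNN sketch: the hidden state is updated once per
    frame, giving a sequence-level representation for classification.
    """
    h = np.zeros(W_h.shape[0])
    for x in frame_feats:           # one feature vector per video frame
        h = np.tanh(W_h @ h + W_x @ x)
    return h

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 64))       # 16 frames of 64-d CNN features
W_h = rng.normal(size=(32, 32)) * 0.1    # recurrent weights (untrained)
W_x = rng.normal(size=(32, 64)) * 0.1    # input weights (untrained)
print(rnn_over_frames(frames, W_h, W_x).shape)  # (32,)
```

In practice the recurrent cell would be an LSTM and the frame features would come from a trained CNN; the sketch only shows the recurrence over time.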
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes: space, time, width, and depth.
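The expansion idea can be sketched as follows (hypothetical code; the axis names, `expand`, `cost`, and the compute proxy are illustrative assumptions, not X3D's actual procedure): each candidate axis of a tiny base configuration is scaled up in turn, and the method greedily keeps the axis with the best accuracy/compute trade-off, with accuracy evaluation omitted here.

```python
def expand(config, axis, factor=2.0):
    """Return a copy of the config with one axis scaled by `factor`."""
    new = dict(config)
    new[axis] = config[axis] * factor
    return new

def cost(config):
    """Rough compute proxy: frames * resolution^2 * width^2 * depth."""
    return (config["frames"] * config["resolution"] ** 2
            * config["width"] ** 2 * config["depth"])

# Tiny base network (made-up numbers for illustration)
base = {"frames": 1, "resolution": 112, "width": 24, "depth": 10}
for axis in base:
    ratio = cost(expand(base, axis)) / cost(base)
    print(axis, round(ratio, 1))  # relative compute of expanding each axis
```

Note that doubling resolution or width quadruples the compute proxy while doubling frames or depth only doubles it, which is why the trade-off must be measured per axis rather than assumed.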
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters.