We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference.
Ranked #1 on Action Classification on Moments in Time
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.
Ranked #8 on Action Classification on Kinetics-400
Therefore, in the present paper, we conduct an exploratory study to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy.
Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies.
Ranked #34 on Action Classification on Kinetics-400
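The frequency view of feature maps mentioned in the entry above can be illustrated with a small sketch. It assumes one simple decomposition that is not taken from the paper: the low-frequency part is a block average of the map (average pooling followed by nearest-neighbour upsampling), and the high-frequency part is the residual. The function name `split_frequencies` and the block size `k` are hypothetical.

```python
def split_frequencies(fmap, k=2):
    """Split a 2D feature map (list of rows) into low- and high-frequency parts.

    Assumption for illustration only: "low frequency" means the k x k block
    average broadcast back to full resolution, and "high frequency" is
    whatever remains after subtracting it.
    """
    h, w = len(fmap), len(fmap[0])
    if h % k or w % k:
        raise ValueError("height and width must be divisible by k")

    low = [[0.0] * w for _ in range(h)]
    for i in range(0, h, k):
        for j in range(0, w, k):
            # Average over one k x k block (average pooling).
            block_mean = sum(
                fmap[i + di][j + dj] for di in range(k) for dj in range(k)
            ) / (k * k)
            # Write the mean back to every position in the block
            # (nearest-neighbour upsampling).
            for di in range(k):
                for dj in range(k):
                    low[i + di][j + dj] = block_mean

    # The residual carries the fine spatial detail.
    high = [[fmap[i][j] - low[i][j] for j in range(w)] for i in range(h)]
    return low, high
```

By construction, adding the two parts element-wise recovers the original map, so the split loses no information; the low-frequency part could in principle be stored at reduced resolution.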
Then a joint-training strategy is proposed to deal with the domain gaps between multiple data sources and formats in webly-supervised learning.
Ranked #1 on Action Classification on Kinetics-400 (using extra training data)
The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computation cost.
Ranked #4 on Action Recognition on Something-Something V2 (using extra training data)
In this work, we argue that aggregating features at the full-sequence level will lead to more discriminative and robust features for video object detection.
Ranked #3 on Video Object Detection on ImageNet VID
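As a rough illustration of sequence-level feature aggregation as described in the entry above, the sketch below refines one feature vector by taking a similarity-weighted average over features drawn from every frame of the video. This is an assumed, simplified scheme (dot-product similarity with a softmax), not the authors' implementation; the function name `aggregate` and its arguments are hypothetical.

```python
import math


def aggregate(query, feats):
    """Aggregate `feats` (one vector per frame or proposal, taken from the
    whole sequence) into a single vector, weighted by similarity to `query`.

    Assumption for illustration only: similarity is a plain dot product and
    the weights are a softmax over those similarities.
    """
    # Similarity between the query and every feature in the sequence.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in feats]

    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the sequence features, dimension by dimension.
    dim = len(query)
    aggregated = [
        sum(w * feat[d] for w, feat in zip(weights, feats)) for d in range(dim)
    ]
    return aggregated, weights
```

Features that resemble the query receive larger weights, so a blurred or partially occluded instance can borrow evidence from cleaner views of the same object elsewhere in the sequence.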
The accuracy of detection suffers from degraded object appearances in videos, e.g., motion blur, video defocus, rare poses, etc.
Ranked #7 on Video Object Detection on ImageNet VID