55 papers with code • 0 benchmarks • 4 datasets
We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference.
Ranked #1 on Action Recognition on EPIC-KITCHENS-100
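To make "operate on streaming video" concrete, here is a minimal sketch of one common way to run a temporal filter online: a causal convolution that carries a small buffer of past frames between chunks. The function name `causal_conv`, the scalar per-frame features, and the filter weights are assumptions for illustration, not details taken from the MoViNets abstract above.

```python
def causal_conv(frames, weights, buffer=None):
    """Causal temporal filter over a chunk of per-frame scalar features.

    `buffer` holds the last len(weights) - 1 frames from earlier chunks
    (zeros at the start of the stream), so a chunk can be processed
    without revisiting the whole clip.
    """
    k = len(weights)
    if buffer is None:
        buffer = [0.0] * (k - 1)
    padded = list(buffer) + list(frames)
    # Output t depends only on frame t and the k - 1 frames before it.
    outputs = [
        sum(w * padded[t + i] for i, w in enumerate(weights))
        for t in range(len(frames))
    ]
    # Keep the most recent k - 1 frames for the next chunk.
    new_buffer = padded[len(padded) - (k - 1):] if k > 1 else []
    return outputs, new_buffer


def run_streaming(chunks, weights):
    """Process chunks one at a time, carrying the buffer forward."""
    buffer, outputs = None, []
    for chunk in chunks:
        out, buffer = causal_conv(chunk, weights, buffer)
        outputs.extend(out)
    return outputs
```

Because the filter is causal and the buffer carries exactly the frames it still needs, chunk-by-chunk processing gives the same outputs as processing the whole clip at once, while memory stays bounded by the chunk and buffer sizes.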
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters.
Ranked #2 on Action Recognition on AVA v2.2
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.
Ranked #24 on Action Classification on Kinetics-400
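One way to read "progressively expand along multiple network axes" is as a greedy search: at each step, try expanding each axis on its own and keep the single expansion that scores best. The sketch below assumes that reading. The `toy_score` function is a hypothetical stand-in with diminishing returns per axis; a real search would train and evaluate each candidate network instead.

```python
def expand_progressively(base, factors, score, steps):
    """Greedy stepwise expansion: one axis is expanded per step."""
    config = dict(base)
    history = []
    for _ in range(steps):
        candidates = []
        for axis, factor in factors.items():
            trial = dict(config)
            trial[axis] = trial[axis] * factor  # expand this axis only
            candidates.append((score(trial), axis, trial))
        # Keep the single-axis expansion with the highest score.
        _, best_axis, best_trial = max(candidates, key=lambda c: c[0])
        config = best_trial
        history.append(best_axis)
    return config, history


def toy_score(config):
    """Hypothetical score: each axis helps less the larger it already is."""
    return -sum(1.0 / value for value in config.values())


base = {"space": 1, "time": 1, "width": 1, "depth": 1}
factors = {"space": 2, "time": 2, "width": 2, "depth": 2}
final_config, order = expand_progressively(base, factors, toy_score, steps=4)
```

With this toy score, the axis that is currently smallest always offers the largest gain, so the search spreads its expansions across axes rather than growing only one.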
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception.
Therefore, in the present paper, we conduct an exploratory study to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy.
Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies.
Ranked #69 on Action Classification on Kinetics-400
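As a rough illustration of a feature map being "a mixture of information at different frequencies", the sketch below splits a small 2D map into a smooth low-frequency part (2x2 average pooling followed by nearest-neighbor upsampling) and a high-frequency residual. This decomposition and the name `split_frequencies` are assumptions chosen for clarity, not the paper's own operator.

```python
def split_frequencies(feature_map):
    """Split a 2D map (even height and width) into low and high parts."""
    h, w = len(feature_map), len(feature_map[0])
    low = [[0.0] * w for _ in range(h)]
    for r in range(0, h, 2):
        for c in range(0, w, 2):
            # Average each 2x2 block: the slowly varying component.
            mean = (
                feature_map[r][c] + feature_map[r][c + 1]
                + feature_map[r + 1][c] + feature_map[r + 1][c + 1]
            ) / 4.0
            # Nearest-neighbor upsampling back to full resolution.
            for dr in (0, 1):
                for dc in (0, 1):
                    low[r + dr][c + dc] = mean
    # What the smooth part cannot explain is the fine detail.
    high = [
        [feature_map[r][c] - low[r][c] for c in range(w)]
        for r in range(h)
    ]
    return low, high
```

Adding the two parts back together recovers the original map, and the low part can be stored at half resolution since each 2x2 block holds a single value.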
Then a joint-training strategy is proposed to deal with the domain gaps between multiple data sources and formats in webly-supervised learning.
Ranked #2 on Action Recognition on UCF101 (using extra training data)
In this work, we argue that aggregating features at the full-sequence level will lead to more discriminative and robust features for video object detection.
Ranked #3 on Video Object Detection on ImageNet VID
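A minimal sketch of what aggregating at the full-sequence level could look like: each frame's feature is replaced by a similarity-weighted average over the features of every frame in the video, not only its temporal neighbors. The cosine-similarity weighting with a softmax and the name `aggregate_over_sequence` are assumptions for illustration, not the method from the paper above.

```python
import math


def aggregate_over_sequence(features):
    """Replace each frame feature by a weighted mean over all frames."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        norm = math.sqrt(dot(v, v))
        return [x / norm for x in v] if norm > 0 else list(v)

    unit = [normalize(f) for f in features]
    dim = len(features[0])
    aggregated = []
    for query in unit:
        # Cosine similarity between this frame and every frame.
        sims = [dot(query, key) for key in unit]
        # Numerically stable softmax over the whole sequence.
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        aggregated.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return aggregated
```

Frames that look alike reinforce each other regardless of how far apart they are in time, which is the intuition behind using the whole sequence instead of a local window.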