We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference.
Ranked #1 on Action Classification on Moments in Time
ACTION CLASSIFICATION ACTION RECOGNITION NEURAL ARCHITECTURE SEARCH VIDEO RECOGNITION
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.
Ranked #8 on Action Classification on Kinetics-400
ACTION CLASSIFICATION FEATURE SELECTION IMAGE CLASSIFICATION VIDEO CLASSIFICATION VIDEO RECOGNITION
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception.
ACTION CLASSIFICATION VIDEO RECOGNITION
We present SlowFast networks for video recognition.
Ranked #1 on Action Recognition In Videos on AVA v2.1
ACTION CLASSIFICATION ACTION DETECTION ACTION RECOGNITION ACTION RECOGNITION IN VIDEOS VIDEO RECOGNITION
Therefore, in the present paper, we conduct an exploratory study to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy.
Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies.
Ranked #34 on Action Classification on Kinetics-400
ACTION CLASSIFICATION IMAGE CLASSIFICATION VIDEO RECOGNITION
Then a joint-training strategy is proposed to deal with the domain gaps between multiple data sources and formats in webly-supervised learning.
Ranked #1 on Action Classification on Kinetics-400 (using extra training data)
The explosive growth in video streaming gives rise to challenges in performing video understanding with high accuracy and low computational cost.
Ranked #4 on Action Recognition on Something-Something V2 (using extra training data)
ACTION CLASSIFICATION ACTION RECOGNITION VIDEO OBJECT DETECTION VIDEO RECOGNITION VIDEO UNDERSTANDING
In this work, we argue that aggregating features at the full-sequence level leads to more discriminative and robust features for video object detection.
Ranked #3 on Video Object Detection on ImageNet VID
The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, and rare poses.
Ranked #7 on Video Object Detection on ImageNet VID