Video Recognition
157 papers with code • 0 benchmarks • 10 datasets
Video Recognition is the process of obtaining, processing, and analysing data received from a visual source, specifically video.
Benchmarks
These leaderboards are used to track progress in Video Recognition.
Libraries
Use these libraries to find Video Recognition models and implementations
Datasets
Most implemented papers
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution
The output feature maps of a convolution layer can be seen as a mixture of information at different spatial frequencies.
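The key operation behind octave convolution is storing a fraction of the channels at half spatial resolution. A minimal NumPy sketch of that split (the function name, `alpha` default, and tensor layout are illustrative, not the paper's API):

```python
import numpy as np

def octave_split(x, alpha=0.25):
    """Split a feature map into high- and low-frequency channel groups,
    as in octave convolution: a fraction `alpha` of the channels is kept
    at half spatial resolution (2x2 average pooling), cutting memory and
    FLOPs for the low-frequency group.

    x: array of shape (C, H, W), with H and W even.
    Returns (high, low): high is (C - C_low, H, W), low is (C_low, H/2, W/2).
    """
    c, h, w = x.shape
    c_low = int(alpha * c)
    high = x[c_low:]
    # 2x2 average pool the low-frequency channels to one octave lower.
    low = x[:c_low].reshape(c_low, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return high, low
```

In the full layer, separate convolutions then operate within and between the two frequency groups, with up/downsampling on the cross-frequency paths.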
SlowFast Networks for Video Recognition
We present SlowFast networks for video recognition.
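The SlowFast idea starts at the sampling stage: one pathway sees frames at a low rate with high channel capacity, the other at a high rate with a lightweight backbone. A minimal NumPy sketch of the two-rate sampling (function name and the `alpha`/`tau` defaults are illustrative):

```python
import numpy as np

def slowfast_sample(frames, alpha=8, tau=16):
    """Sample a decoded video for the two SlowFast pathways.

    frames: array of shape (T, H, W, C).
    Slow pathway: one frame every `tau` frames (low temporal rate; in the
    real network this pathway gets most of the channel capacity).
    Fast pathway: `alpha` times higher frame rate (stride tau // alpha),
    processed with far fewer channels to keep it lightweight.
    """
    slow = frames[::tau]
    fast = frames[::tau // alpha]
    return slow, fast
```

In the full model, lateral connections fuse the fast pathway's features into the slow pathway at several stages.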
Video Swin Transformer
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.
TSM: Temporal Shift Module for Efficient Video Understanding
The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.
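TSM's central trick is simple enough to show directly: shift a small fraction of the channels along the time axis so ordinary 2D convolutions can mix information across frames at zero extra FLOPs. A minimal NumPy sketch (tensor layout and the `fold_div` default are illustrative):

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Temporal Shift Module: shift 1/fold_div of the channels one step
    forward in time and another 1/fold_div one step backward, leaving the
    rest in place. Vacated positions are zero-filled.

    x: array of shape (T, C, H, W) for one video clip.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift toward earlier frames
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift toward later frames
    out[:, 2 * fold:] = x[:, 2 * fold:]              # unshifted channels
    return out
```

Inserted into a residual branch, this lets a plain 2D backbone reason over time without 3D convolution cost.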
Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?
Therefore, in the present paper, we conduct an exploratory study to improve spatiotemporal 3D CNNs as follows: (i) recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy.
Micro-Batch Training with Batch-Channel Normalization and Weight Standardization
Batch Normalization (BN) has become an out-of-box technique to improve deep network training.
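Weight Standardization, the companion technique to Batch-Channel Normalization in this paper, normalizes each convolutional filter over its fan-in before it is applied, which helps when BN's batch statistics are unreliable at micro-batch sizes. A minimal NumPy sketch (function name and `eps` default are illustrative):

```python
import numpy as np

def weight_standardize(w, eps=1e-5):
    """Standardize each output filter of a conv weight to zero mean and
    unit variance over its fan-in (C_in * kH * kW).

    w: array of shape (C_out, C_in, kH, kW).
    """
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    var = w.var(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / np.sqrt(var + eps)
```

Because it reparameterizes the weights rather than the activations, it adds no dependence on batch size at inference time.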
X3D: Expanding Architectures for Efficient Video Recognition
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.
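X3D's expansion procedure grows one network axis at a time from a tiny base model. A toy sketch of a single expansion step, assuming a simple config dict; the axis names, base values, and factors below are illustrative, not the paper's exact schedule:

```python
def expand(config, axis, gamma):
    """One X3D-style expansion step: multiply a single network axis
    (frames, spatial resolution, width, depth, ...) by its expansion
    factor, keeping every other axis fixed. In the paper, the axis whose
    expansion gives the best accuracy/complexity trade-off is kept at
    each step; here we only show the mechanics of a step."""
    cfg = dict(config)
    cfg[axis] = round(cfg[axis] * gamma)
    return cfg

# Hypothetical tiny 2D base model to be expanded (illustrative values).
base = {"frames": 1, "resolution": 112, "width": 24, "depth": 11}
```

Repeating such steps, each time committing the most cost-effective axis, yields the X3D family at increasing compute budgets.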
Long-term Recurrent Convolutional Networks for Visual Recognition and Description
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise.
Multiscale Vision Transformers
We evaluate this multiscale architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters.
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection.