The proxy task is to estimate the position and size of an image patch in a sequence of video frames, given only the target bounding box in the first frame.
We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks.
Ranked #1 on Action Classification on Moments in Time (using extra training data)
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters.
Ranked #1 on Action Classification on Kinetics-600 (Vid acc@1 metric)
This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples.
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Ranked #1 on Action Classification on Kinetics-600 (using extra training data)
Methods that reach state-of-the-art (SotA) accuracy usually make use of 3D convolution layers to abstract the temporal information from video frames.
Ranked #1 on Action Classification on Kinetics-400 (Flops x views metric)
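The 3D convolutions mentioned above slide a kernel over time as well as space, so each output mixes information from several consecutive frames. A minimal single-channel sketch in NumPy (the function name and "valid"-padding choice are illustrative, not from the paper; the cross-correlation form used in deep learning is shown):

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive single-channel 3D convolution with 'valid' padding.
    video: (T, H, W) grayscale clip, kernel: (kt, kh, kw).
    Because the kernel spans kt frames, each output value
    aggregates temporal as well as spatial context."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # elementwise product over a (kt, kh, kw) space-time window
                out[t, i, j] = (video[t:t + kt, i:i + kh, j:j + kw] * kernel).sum()
    return out
```

Real networks apply many such kernels per layer over multi-channel inputs; the loop above only illustrates the space-time windowing.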
We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference.
Ranked #1 on Action Classification on Charades
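Online inference on streaming video, as MoViNets target, hinges on causal temporal operations that carry a small buffer of past frames instead of reprocessing the whole clip. A minimal NumPy sketch of that stream-buffer idea, with all names hypothetical and a single scalar-output kernel for brevity:

```python
import numpy as np

def causal_conv_full(x, w):
    """Causal temporal convolution over a whole clip at once.
    x: (T, C) per-frame features, w: (K, C) temporal kernel.
    Frame t only sees frames t-K+1..t (zeros before the clip)."""
    K, C = w.shape
    padded = np.concatenate([np.zeros((K - 1, C)), x], axis=0)
    return np.array([(padded[t:t + K] * w).sum() for t in range(len(x))])

class StreamingCausalConv:
    """Same computation, one frame at a time: a buffer holds the
    last K-1 frames, so arbitrarily long video is processed online
    with constant memory."""
    def __init__(self, w):
        self.w = w
        self.buf = np.zeros((w.shape[0] - 1, w.shape[1]))

    def step(self, frame):
        # slide the K-frame window forward by one frame
        window = np.concatenate([self.buf, frame[None]], axis=0)
        self.buf = window[1:]
        return (window * self.w).sum()
```

Feeding a clip frame by frame through `StreamingCausalConv.step` reproduces `causal_conv_full` exactly, which is what makes buffered causal layers a drop-in path to online inference.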
Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x-2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet.
Ranked #22 on Image Classification on CIFAR-100
We demonstrate that our proposed motion representation model works both within a single specific domain (intra-domain action classification) and across different unseen domains (cross-domain action classification).
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
Ranked #1 on Action Recognition on Diving-48
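In the convolution-free approach above, every frame patch becomes a token and tokens attend to each other across both space and time. A single-head NumPy sketch of the joint space-time variant (function and weight names are illustrative, not the paper's API):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_space_time_attention(x, Wq, Wk, Wv):
    """Single-head self-attention where every patch token attends to
    all patches in all frames.
    x: (T, N, D) -- T frames, N patches per frame, D-dim embeddings.
    Wq, Wk, Wv: (D, D) projection matrices."""
    T, N, D = x.shape
    tokens = x.reshape(T * N, D)               # flatten space and time
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))       # (T*N, T*N) weights
    return (attn @ v).reshape(T, N, -1)
```

The joint form costs attention over T*N tokens at once; the paper's best-performing "divided" variant instead applies temporal attention and spatial attention as separate, cheaper steps.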