|TREND||DATASET||BEST METHOD||PAPER TITLE||PAPER||CODE||COMPARE|
Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow.
SOTA for Action Recognition In Videos on ActivityNet (using extra training data)
Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
#5 best model for Action Classification on Moments in Time (Top 5 Accuracy metric)
It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks.
We demonstrate that using both RNNs (using LSTMs) and Temporal-ConvNets on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve the overall performance.
In particular, we evaluate our method on the large-scale multi-modal Youtube-8M v2 dataset and outperform all other methods in the Youtube 8M Large-Scale Video Understanding challenge.
Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space; (b) they are applicable to all kinds of networks like 2D and 3D CNNs for video classification; and (c) they model feature interactions in a more expressive way and without loss of information.
One of the challenges in modeling cognitive events from electroencephalogram (EEG) data is finding representations that are invariant to inter- and intra-subject differences, as well as to inherent noise associated with such data.
In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating $3\times3\times3$ convolutions with $1\times3\times3$ convolutional filters on spatial domain (equivalent to 2D CNN) plus $3\times1\times1$ convolutions to construct temporal connections on adjacent feature maps in time.
#6 best model for Action Recognition In Videos on Sports-1M