Video Classification

28 papers with code · Computer Vision

Group Normalization

GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks.
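The batch-size independence described above can be illustrated with a minimal pure-Python sketch. It assumes each sample is stored as a list of channels, each channel a flat list of spatial values. The function name `group_norm` and the toy inputs are hypothetical, and the learned per-channel scale and shift are omitted for brevity.

```python
import math

def group_norm(batch, num_groups, eps=1e-5):
    """Normalize every sample over groups of channels.

    batch[n][c] is a flat list of spatial values for sample n, channel c.
    Statistics are computed per sample and per group, never across the batch.
    """
    out = []
    for sample in batch:
        num_channels = len(sample)
        if num_channels % num_groups != 0:
            raise ValueError("num_channels must be divisible by num_groups")
        per_group = num_channels // num_groups
        normed = []
        for g in range(num_groups):
            channels = sample[g * per_group:(g + 1) * per_group]
            values = [v for ch in channels for v in ch]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            scale = 1.0 / math.sqrt(var + eps)
            normed.extend([[(v - mean) * scale for v in ch] for ch in channels])
        out.append(normed)
    return out

# Two toy samples with 4 channels and 2 spatial values per channel (made-up data).
sample_a = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
sample_b = [[5.0, 5.0], [7.0, 9.0], [0.0, 1.0], [2.0, 3.0]]

alone = group_norm([sample_a], num_groups=2)[0]
in_batch = group_norm([sample_a, sample_b], num_groups=2)[0]
print(alone == in_batch)  # True: the result for sample_a ignores the rest of the batch
```

Because the mean and variance are taken within a single sample, the output for `sample_a` is identical whether it is normalized alone or alongside other samples, which is the property a batch-statistics approach such as BN does not have.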

Non-local Neural Networks

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies.
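As a rough illustration of what a non-local operation computes, the sketch below replaces each position's feature vector with a weighted average over all positions, using a softmax over dot-product similarities as the pairwise function. This is one possible instantiation under simplifying assumptions: the function name `non_local` is hypothetical, and the learned embeddings and residual connection a full block would normally include are left out.

```python
import math

def non_local(features):
    """Apply a simplified non-local operation.

    features is a list of positions, each a feature vector (list of floats).
    Output i is sum_j softmax_j(x_i . x_j) * x_j, so every position can
    draw on every other position regardless of how far apart they are.
    """
    out = []
    for xi in features:
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in features]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(xi)
        out.append([
            sum(w * xj[d] for w, xj in zip(weights, features)) / total
            for d in range(dim)
        ])
    return out

# Positions 0 and 2 are not adjacent but carry the same feature (made-up data).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
response = non_local(frames)
print(response[0] == response[2])  # True: distance between positions plays no role
```

In contrast to a convolution or a recurrent step, nothing in the loop restricts `j` to a neighborhood of `i`, which is how the operation captures long-range dependencies in a single step.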

YouTube-8M: A Large-Scale Video Classification Benchmark

In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow.

Temporal Segment Networks for Action Recognition in Videos

8 May 2017 · yjxiong/temporal-segment-networks

We present a general and flexible video-level framework for learning action models in videos. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

30 Mar 2017 · chihyaoma/Activity-Recognition-with-CNN-and-RNN

Building upon our experimental results, we then propose and investigate two different networks to further integrate spatiotemporal information: 1) temporal segment RNN and 2) Inception-style Temporal-ConvNet. We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets operating on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve the overall performance.

Learnable pooling with Context Gating for video classification

Current methods for video analysis often extract frame-level features using pre-trained convolutional neural networks (CNNs). In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset and outperform all other methods in the YouTube-8M Large-Scale Video Understanding challenge.

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating $3\times3\times3$ convolutions with $1\times3\times3$ convolutional filters on spatial domain (equivalent to 2D CNN) plus $3\times1\times1$ convolutions to construct temporal connections on adjacent feature maps in time. We further examine the generalization performance of video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performances over several state-of-the-art techniques.
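The saving from this decomposition can be checked with simple arithmetic. The sketch below counts convolution weights for a full $3\times3\times3$ kernel versus a $1\times3\times3$ spatial kernel followed by a $3\times1\times1$ temporal kernel. The channel width of 64 and the helper name `conv3d_weights` are assumptions chosen for illustration, and biases are ignored.

```python
def conv3d_weights(in_channels, out_channels, k_t, k_h, k_w):
    """Number of weights in a 3D convolution, excluding biases."""
    return in_channels * out_channels * k_t * k_h * k_w

channels = 64  # assumed width, for illustration only

full_3d = conv3d_weights(channels, channels, 3, 3, 3)   # 3x3x3 kernel
spatial = conv3d_weights(channels, channels, 1, 3, 3)   # 1x3x3 kernel (2D-like)
temporal = conv3d_weights(channels, channels, 3, 1, 1)  # 3x1x1 kernel (time only)
factorized = spatial + temporal

print(full_3d)     # 110592
print(factorized)  # 49152
print(round(factorized / full_3d, 3))  # 0.444
```

Under these assumptions the factorized pair uses 12 weights per channel pair instead of 27. It also lets the spatial part reuse the shape of an ordinary 2D convolution, consistent with the abstract's note that the $1\times3\times3$ filters are equivalent to a 2D CNN.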

Deep Temporal Linear Encoding Networks

CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. Advantages of temporal linear encodings (TLEs) are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space; (b) they are applicable to all kinds of networks like 2D and 3D CNNs for video classification; and (c) they model feature interactions in a more expressive way and without loss of information.

Learning Representations from EEG with Deep Recurrent-Convolutional Neural Networks

19 Nov 2015 · pbashivan/EEGLearn

One of the challenges in modeling cognitive events from electroencephalogram (EEG) data is finding representations that are invariant to inter- and intra-subject differences, as well as to inherent noise associated with such data. Herein, we propose a novel approach for learning such representations from multi-channel EEG time-series, and demonstrate its advantages in the context of a mental load classification task.

Appearance-and-Relation Networks for Video Classification

Spatiotemporal feature learning in videos is a fundamental problem in computer vision. Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling.