ResNet 3D

pytorch / vision

Last updated on Feb 12, 2021

Parameters 33 Million

FLOPs 2 Billion

File Size 127.36 MB

Training Data Kinetics-400

Training Resources 64x NVIDIA V100 GPUs

Training Time 24 hours

Training Techniques	Weight Decay, SGD with Momentum
Architecture	1x1 Convolution, Bottleneck Residual Block, Batch Normalization, 3D Convolution, Global Average Pooling, Residual Block, Residual Connection, ReLU, Max Pooling, Softmax
ID	r3d_18
LR	0.01
Epochs	45
LR Gamma	0.1
Momentum	0.9
Batch Size	16
Weight Decay	0.0001
LR Warmup Epochs	10

Parameters 32 Million

FLOPs 41 Billion

File Size 120.32 MB

Training Data Kinetics-400

Training Resources 64x NVIDIA V100 GPUs

Training Time 24 hours

Training Techniques	Weight Decay, SGD with Momentum
Architecture	1x1 Convolution, Bottleneck Residual Block, Batch Normalization, 3D Convolution, Global Average Pooling, Residual Block, Residual Connection, ReLU, Max Pooling, Softmax
ID	r2plus1d_18
LR	0.01
Epochs	45
LR Gamma	0.1
Momentum	0.9
Batch Size	16
Weight Decay	0.0001
LR Warmup Epochs	10

Parameters 12 Million

FLOPs 2 Billion

File Size 44.67 MB

Training Data Kinetics-400

Training Resources 64x NVIDIA V100 GPUs

Training Time 24 hours

Training Techniques	Weight Decay, SGD with Momentum
Architecture	1x1 Convolution, Bottleneck Residual Block, Batch Normalization, 3D Convolution, Global Average Pooling, Residual Block, Residual Connection, ReLU, Max Pooling, Softmax
ID	mc3_18
LR	0.01
Epochs	45
LR Gamma	0.1
Momentum	0.9
Batch Size	16
Weight Decay	0.0001
LR Warmup Epochs	10

Summary

ResNet 3D is a family of video models built on 3D convolutions. Alongside the plain 3D ResNet (r3d_18), this collection includes two main variants.

The first variant, mixed convolution (MC), uses 3D convolutions only in the early layers of the network and 2D convolutions in the top layers. The rationale is that motion modeling is a low/mid-level operation that 3D convolutions in the early layers can capture, while spatial reasoning over these mid-level motion features (implemented by 2D convolutions in the top layers) leads to accurate action recognition. The authors report that MC ResNets yield roughly a 3-4% gain in clip-level accuracy over 2D ResNets of comparable capacity and match the performance of 3D ResNets, which have 3 times as many parameters.
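The parameter saving behind the MC design can be illustrated with a rough per-layer count. The sketch below compares the weights of a single bias-free 3x3x3 convolution with those of a 1x3x3 convolution (a 2D convolution applied frame by frame) at the same channel width. The channel width of 256 and the helper name `conv_weights` are assumptions chosen for illustration; they are not taken from the torchvision source.

```python
def conv_weights(in_ch, out_ch, kernel):
    """Count the weights of a bias-free convolution with the given kernel shape."""
    n = in_ch * out_ch
    for k in kernel:
        n *= k
    return n


# Hypothetical layer width, used only to make the comparison concrete.
channels = 256

full_3d = conv_weights(channels, channels, (3, 3, 3))     # time x height x width
spatial_2d = conv_weights(channels, channels, (1, 3, 3))  # no temporal extent

ratio = full_3d / spatial_2d
print(full_3d, spatial_2d, ratio)
```

Each layer that swaps a 3x3x3 kernel for a 1x3x3 kernel holds a third of the weights. The saving for a whole network depends on how many layers keep their 3D kernels.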

The second variant is the “(2+1)D” convolutional block, which explicitly factorizes a 3D convolution into two separate, successive operations: a 2D spatial convolution followed by a 1D temporal convolution. This has two advantages. First, an additional nonlinear rectification sits between the two operations, which effectively doubles the number of nonlinearities compared to a network using full 3D convolutions with the same number of parameters, making the model capable of representing more complex functions. Second, the decomposition makes optimization easier, yielding in practice both a lower training loss and a lower testing loss.
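The "same number of parameters" claim rests on how the intermediate channel count between the spatial and temporal convolutions is chosen. The sketch below follows the formula given in the cited paper, M = floor(t·d²·N_in·N_out / (d²·N_in + t·N_out)), for a kernel of temporal size t and spatial size d. The function names are hypothetical, biases and batch normalization are ignored, and the channel counts in the usage lines are illustrative only.

```python
def r2plus1d_mid_channels(in_ch, out_ch, t=3, d=3):
    """Intermediate width M that keeps a (2+1)D block near the 3D weight budget."""
    return (t * d * d * in_ch * out_ch) // (d * d * in_ch + t * out_ch)


def conv3d_weights(in_ch, out_ch, t=3, d=3):
    """Weights of a full t x d x d convolution (no bias)."""
    return in_ch * out_ch * t * d * d


def r2plus1d_weights(in_ch, out_ch, t=3, d=3):
    """Weights of a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    mid = r2plus1d_mid_channels(in_ch, out_ch, t, d)
    spatial = in_ch * mid * d * d   # 2D spatial convolution
    temporal = mid * out_ch * t     # 1D temporal convolution
    return spatial + temporal


for in_ch, out_ch in [(64, 64), (64, 128)]:
    print(in_ch, out_ch,
          r2plus1d_mid_channels(in_ch, out_ch),
          conv3d_weights(in_ch, out_ch),
          r2plus1d_weights(in_ch, out_ch))
```

Because M is rounded down, the factorized block never exceeds the 3D weight count, and it matches it exactly when the division has no remainder.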

How do I load this model?

To load a pretrained model:

import torchvision.models as models
r3d_18 = models.video.r3d_18(pretrained=True)

Replace the model name with the variant you want to use, e.g. mc3_18 or r2plus1d_18. You can find the IDs in the model summaries at the top of this page.

To evaluate the model, use the video classification recipes from the library.

How do I train this model?

You can follow the torchvision recipe on GitHub to train a new model from scratch.
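The hyperparameters in the model summaries (LR 0.01, LR Gamma 0.1, LR Warmup Epochs 10, Epochs 45) suggest a linear warmup followed by step decay. The sketch below shows one way such a schedule could be computed per epoch. The decay milestones (20, 30, 40), the warmup starting factor, and the per-epoch granularity are assumptions that this page does not state, so check them against the recipe before relying on them.

```python
def lr_at_epoch(epoch, base_lr=0.01, gamma=0.1, warmup_epochs=10,
                milestones=(20, 30, 40), warmup_start_factor=0.001):
    """Learning rate for a given epoch: linear warmup, then step decay.

    milestones and warmup_start_factor are hypothetical values, not read
    from this page.
    """
    # Multiply by gamma once for every milestone already reached.
    decay = gamma ** sum(1 for m in milestones if epoch >= m)
    if epoch < warmup_epochs:
        # Ramp linearly from warmup_start_factor up to 1.
        alpha = epoch / warmup_epochs
        warmup = warmup_start_factor * (1 - alpha) + alpha
    else:
        warmup = 1.0
    return base_lr * warmup * decay


schedule = [lr_at_epoch(e) for e in range(45)]
for e in (0, 5, 10, 20, 30, 40):
    print(e, schedule[e])
```

In this sketch the rate reaches the base value of 0.01 at epoch 10 and is then reduced tenfold at each assumed milestone.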

Citation

@article{DBLP:journals/corr/abs-1711-11248,
  author    = {Du Tran and
               Heng Wang and
               Lorenzo Torresani and
               Jamie Ray and
               Yann LeCun and
               Manohar Paluri},
  title     = {A Closer Look at Spatiotemporal Convolutions for Action Recognition},
  journal   = {CoRR},
  volume    = {abs/1711.11248},
  year      = {2017},
  url       = {http://arxiv.org/abs/1711.11248},
  archivePrefix = {arXiv},
  eprint    = {1711.11248},
  timestamp = {Mon, 13 Aug 2018 16:46:39 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1711-11248.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Results

Action Classification on Kinetics-400

BENCHMARK	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Kinetics-400	ResNet (2+1)D	Clip acc@1	57.5	#1
Kinetics-400	ResNet (2+1)D	Clip acc@5	78.81	#1
Kinetics-400	ResNet MC 18	Clip acc@1	53.9	#2
Kinetics-400	ResNet MC 18	Clip acc@5	76.29	#2
Kinetics-400	ResNet 3D 18	Clip acc@1	52.75	#3
Kinetics-400	ResNet 3D 18	Clip acc@5	75.45	#3