| Training Techniques | Weight Decay, SGD with Momentum |
| --- | --- |
| Architecture | 1x1 Convolution, Bottleneck Residual Block, Batch Normalization, 3D Convolution, Global Average Pooling, Residual Block, Residual Connection, ReLU, Max Pooling, Softmax |
| ID | r3d_18 |
| Training Techniques | Weight Decay, SGD with Momentum |
| --- | --- |
| Architecture | 1x1 Convolution, Bottleneck Residual Block, Batch Normalization, 3D Convolution, Global Average Pooling, Residual Block, Residual Connection, ReLU, Max Pooling, Softmax |
| ID | r2plus1d_18 |
| Training Techniques | Weight Decay, SGD with Momentum |
| --- | --- |
| Architecture | 1x1 Convolution, Bottleneck Residual Block, Batch Normalization, 3D Convolution, Global Average Pooling, Residual Block, Residual Connection, ReLU, Max Pooling, Softmax |
| ID | mc3_18 |
ResNet 3D is a family of ResNet-based models for video that employ 3D convolutions. This model collection consists of two main variants.
The first formulation, named mixed convolution (MC), employs 3D convolutions only in the early layers of the network, with 2D convolutions in the top layers. The rationale behind this design is that motion modeling is a low/mid-level operation that can be implemented via 3D convolutions in the early layers of a network, and spatial reasoning over these mid-level motion features (implemented by 2D convolutions in the top layers) leads to accurate action recognition. The authors show that MC ResNets yield roughly a 3-4% gain in clip-level accuracy over 2D ResNets of comparable capacity, and that they match the performance of 3D ResNets, which have three times as many parameters.
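To make the idea concrete, here is a minimal sketch of a mixed-convolution stem in PyTorch. The channel counts, kernel sizes, and depth are illustrative assumptions, not the actual mc3_18 configuration:

```python
import torch
import torch.nn as nn

# Illustrative mixed-convolution (MC) stack: a full 3D convolution early
# (to model motion), then a "2D" convolution later, expressed as a 3D conv
# with temporal extent 1 (to do spatial reasoning over motion features).
mc_stem = nn.Sequential(
    # Early layer: full 3D convolution over (time, height, width)
    nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    # Later layer: spatial-only convolution (kernel size 1 on the time axis)
    nn.Conv3d(64, 128, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
    nn.BatchNorm3d(128),
    nn.ReLU(inplace=True),
)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, H, W)
print(mc_stem(clip).shape)              # torch.Size([1, 128, 16, 56, 56])
```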
The second spatiotemporal variant is a “(2+1)D” convolutional block, which explicitly factorizes 3D convolution into two separate and successive operations: a 2D spatial convolution and a 1D temporal convolution. The first advantage is an additional nonlinear rectification between these two operations. This effectively doubles the number of nonlinearities compared to a network using full 3D convolutions with the same number of parameters, rendering the model capable of representing more complex functions. The second potential benefit is that the decomposition facilitates optimization, yielding in practice both a lower training loss and a lower testing loss.
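The factorization can be sketched as below. Per the paper's parameter-matching rule, the intermediate width M is chosen so the factorized block has the same parameter count as the full 3×3×3 convolution it replaces; for a 64-to-64-channel block this gives M = ⌊3·3²·64·64 / (3²·64 + 3·64)⌋ = 144. The module is a simplified sketch, not torchvision's internal implementation:

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Sketch of a (2+1)D block: 2D spatial conv, ReLU, then 1D temporal conv.

    mid_channels is the intermediate width M; the paper picks it so the
    factorized block matches the parameter count of the full 3D convolution.
    """
    def __init__(self, in_channels, out_channels, mid_channels):
        super().__init__()
        # 2D spatial convolution (1 x 3 x 3)
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # The extra nonlinearity gained by splitting the 3D convolution
        self.relu = nn.ReLU(inplace=True)
        # 1D temporal convolution (3 x 1 x 1)
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        return self.temporal(self.relu(self.spatial(x)))

block = Conv2Plus1D(64, 64, mid_channels=144)
features = torch.randn(1, 64, 16, 56, 56)   # (batch, channels, frames, H, W)
print(block(features).shape)                # torch.Size([1, 64, 16, 56, 56])
```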
To load a pretrained model:
```python
import torchvision.models as models

r3d_18 = models.video.r3d_18(pretrained=True)
```
Replace the model name with the variant you want to use, e.g. `r3d_18`. You can find the IDs in the model summaries at the top of this page.
To evaluate the model, use the video classification reference scripts from the library.
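For a quick sanity check outside the full reference scripts, a minimal forward pass might look like the sketch below. The clip shape (16 frames at 112×112) follows the Kinetics pretraining setup; the random tensor is a stand-in for a real, normalized video clip:

```python
import torch
import torchvision.models as models

model = models.video.r3d_18(pretrained=True)
model.eval()

# Dummy clip: (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 16, 112, 112)

with torch.no_grad():
    logits = model(clip)            # (1, 400) Kinetics-400 class scores
    probs = logits.softmax(dim=1)

top5 = probs.topk(5)
print(top5.indices, top5.values)    # most likely Kinetics-400 class indices
```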
To train a new model from scratch, you can follow the torchvision video classification recipe on GitHub.
```bibtex
@article{DBLP:journals/corr/abs-1711-11248,
  author        = {Du Tran and
                   Heng Wang and
                   Lorenzo Torresani and
                   Jamie Ray and
                   Yann LeCun and
                   Manohar Paluri},
  title         = {A Closer Look at Spatiotemporal Convolutions for Action Recognition},
  journal       = {CoRR},
  volume        = {abs/1711.11248},
  year          = {2017},
  url           = {http://arxiv.org/abs/1711.11248},
  archivePrefix = {arXiv},
  eprint        = {1711.11248},
  timestamp     = {Mon, 13 Aug 2018 16:46:39 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1711-11248.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```
| BENCHMARK | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
| --- | --- | --- | --- | --- |
| Kinetics-400 | ResNet (2+1)D | Clip acc@1 | 57.5 | #1 |
| Kinetics-400 | ResNet (2+1)D | Clip acc@5 | 78.81 | #1 |
| Kinetics-400 | ResNet MC 18 | Clip acc@1 | 53.9 | #2 |
| Kinetics-400 | ResNet MC 18 | Clip acc@5 | 76.29 | #2 |
| Kinetics-400 | ResNet 3D 18 | Clip acc@1 | 52.75 | #3 |
| Kinetics-400 | ResNet 3D 18 | Clip acc@5 | 75.45 | #3 |