ResNet 3D

Last updated on Feb 12, 2021

ResNet 3D 18

Parameters 33 Million
FLOPs 2 Billion
File Size 127.36 MB
Training Data Kinetics-400
Training Resources 64x NVIDIA V100 GPUs
Training Time 24 hours

Training Techniques Weight Decay, SGD with Momentum
Architecture 1x1 Convolution, Bottleneck Residual Block, Batch Normalization, 3D Convolution, Global Average Pooling, Residual Block, Residual Connection, ReLU, Max Pooling, Softmax
ID r3d_18
LR 0.01
Epochs 45
LR Gamma 0.1
Momentum 0.9
Batch Size 16
Weight Decay 0.0001
LR Warmup Epochs 10

ResNet (2+1)D

Parameters 32 Million
FLOPs 41 Billion
File Size 120.32 MB
Training Data Kinetics-400
Training Resources 64x NVIDIA V100 GPUs
Training Time 24 hours

Training Techniques Weight Decay, SGD with Momentum
Architecture 1x1 Convolution, Bottleneck Residual Block, Batch Normalization, 3D Convolution, Global Average Pooling, Residual Block, Residual Connection, ReLU, Max Pooling, Softmax
ID r2plus1d_18
LR 0.01
Epochs 45
LR Gamma 0.1
Momentum 0.9
Batch Size 16
Weight Decay 0.0001
LR Warmup Epochs 10

ResNet MC 18

Parameters 12 Million
FLOPs 2 Billion
File Size 44.67 MB
Training Data Kinetics-400
Training Resources 64x NVIDIA V100 GPUs
Training Time 24 hours

Training Techniques Weight Decay, SGD with Momentum
Architecture 1x1 Convolution, Bottleneck Residual Block, Batch Normalization, 3D Convolution, Global Average Pooling, Residual Block, Residual Connection, ReLU, Max Pooling, Softmax
ID mc3_18
LR 0.01
Epochs 45
LR Gamma 0.1
Momentum 0.9
Batch Size 16
Weight Decay 0.0001
LR Warmup Epochs 10


ResNet 3D is a family of video models built on 3D convolutions. Besides the plain 3D ResNet (r3d_18), this collection includes two main variants.

The first variant, mixed convolution (MC), uses 3D convolutions only in the early layers of the network and 2D convolutions in the top layers. The rationale is that motion modeling is a low/mid-level operation that 3D convolutions in the early layers can handle, while spatial reasoning over these mid-level motion features (implemented by 2D convolutions in the top layers) leads to accurate action recognition. The authors report that MC ResNets yield roughly a 3-4% gain in clip-level accuracy over 2D ResNets of comparable capacity and match the performance of 3D ResNets, which have 3 times as many parameters.
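The per-layer saving behind that parameter gap can be illustrated by counting convolution weights. This is a minimal sketch, not code from the paper or from torchvision: the helper `conv_weights`, the 3x3x3 kernel, and the 64-channel layer are illustrative assumptions, and bias terms are ignored. A 2D convolution applied frame by frame is written here as a 3D kernel with temporal extent 1.

```python
from math import prod


def conv_weights(c_in, c_out, kernel):
    """Number of weights in a convolution without bias: c_in * c_out * kernel volume."""
    return c_in * c_out * prod(kernel)


# Hypothetical layer sizes, chosen for illustration only.
c_in, c_out = 64, 64

full_3d = conv_weights(c_in, c_out, (3, 3, 3))   # time x height x width
frame_2d = conv_weights(c_in, c_out, (1, 3, 3))  # 2D convolution applied per frame

ratio = full_3d / frame_2d
print(full_3d, frame_2d, ratio)
```

Under these assumptions, every 3x3x3 layer that is swapped for a 1x3x3 layer carries one third of the weights. Since the top layers have the most channels, replacing them accounts for most of the savings.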

The second variant is the “(2+1)D” convolutional block, which explicitly factorizes a 3D convolution into two separate, successive operations: a 2D spatial convolution followed by a 1D temporal convolution. The first advantage is an additional nonlinear rectification between the two operations. This effectively doubles the number of nonlinearities compared to a network using full 3D convolutions with the same number of parameters, so the model can represent more complex functions. The second potential benefit is that the decomposition eases optimization, yielding in practice both a lower training loss and a lower testing loss.
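The "same number of parameters" condition depends on how many intermediate channels sit between the spatial and temporal convolutions. The sketch below assumes the rule described in the cited paper, M = floor(t·d²·N_in·N_out / (d²·N_in + t·N_out)), for a temporal extent t and spatial extent d. The function names and channel sizes are illustrative assumptions, and bias terms are ignored.

```python
def midplanes(c_in, c_out, t=3, d=3):
    """Intermediate channels that keep a (2+1)D block near the 3D parameter budget."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def params_3d(c_in, c_out, t=3, d=3):
    """Weights in a full t x d x d 3D convolution."""
    return t * d * d * c_in * c_out


def params_2plus1d(c_in, c_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    m = midplanes(c_in, c_out, t, d)
    spatial = c_in * m * d * d
    temporal = m * c_out * t
    return spatial + temporal


# Hypothetical channel configurations, for illustration only.
for c_in, c_out in [(64, 64), (64, 128)]:
    print(c_in, c_out, midplanes(c_in, c_out),
          params_3d(c_in, c_out), params_2plus1d(c_in, c_out))
```

Because of the floor, the factorized block never exceeds the 3D budget. When the division is exact, as in the 64-to-64 case, the two counts are equal.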

How do I load this model?

To load a pretrained model:

import torchvision.models as models
# pretrained=True loads the Kinetics-400 weights
r3d_18 = models.video.r3d_18(pretrained=True)

Replace the model name with the ID of the variant you want to use, for example mc3_18 or r2plus1d_18. The IDs are listed in the model summaries at the top of this page.

To evaluate the model, use the video classification recipes from the library.

How do I train this model?

To train a new model from scratch, follow the torchvision video classification recipe on GitHub.
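The hyperparameters in the model summaries (LR 0.01, LR Gamma 0.1, LR Warmup Epochs 10, Epochs 45) suggest a warmup phase followed by step decay. The sketch below shows one way such a schedule could look. It is not the recipe's actual scheduler: the linear warmup shape, the per-epoch granularity, and the decay milestones `(20, 30, 40)` are assumptions, since this page does not state them.

```python
def lr_at_epoch(epoch, base_lr=0.01, warmup_epochs=10, gamma=0.1,
                milestones=(20, 30, 40)):
    """Learning rate for a 0-indexed epoch: linear warmup, then step decay.

    `milestones` is a hypothetical choice; the page lists only gamma and
    the number of warmup epochs.
    """
    if epoch < warmup_epochs:
        # Ramp linearly up to base_lr over the warmup epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Multiply by gamma once for every milestone already reached.
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays


schedule = [lr_at_epoch(e) for e in range(45)]
print(schedule[0], schedule[9], schedule[20], schedule[44])
```

With these assumed milestones, the rate reaches 0.01 at the end of warmup and drops by a factor of 10 three times before training ends.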


Citation

@article{DBLP:journals/corr/abs-1711-11248,
  author    = {Du Tran and
               Heng Wang and
               Lorenzo Torresani and
               Jamie Ray and
               Yann LeCun and
               Manohar Paluri},
  title     = {A Closer Look at Spatiotemporal Convolutions for Action Recognition},
  journal   = {CoRR},
  volume    = {abs/1711.11248},
  year      = {2017},
  url       = {},
  archivePrefix = {arXiv},
  eprint    = {1711.11248},
  timestamp = {Mon, 13 Aug 2018 16:46:39 +0200},
  biburl    = {},
  bibsource = {dblp computer science bibliography}
}


Action Classification on Kinetics-400

Task: Action Classification. Dataset: Kinetics-400.

Model           Clip acc@1   Clip acc@5   Rank
ResNet (2+1)D   57.5         78.81        #1
ResNet MC 18    53.9         76.29        #2
ResNet 3D 18    52.75        75.45        #3