Non-local Neural Networks

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both the Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code is available at https://github.com/facebookresearch/video-nonlocal-net.
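The "weighted sum of the features at all positions" described in the abstract can be illustrated with a small NumPy sketch. This is only one possible instance under stated assumptions, not the released implementation: the function name `non_local_response` is hypothetical, the pairwise weights are assumed to be a softmax over dot products between positions, and the learned embeddings and residual connection that a full block would likely include are omitted.

```python
import numpy as np


def non_local_response(x):
    """Sketch of a non-local operation (hypothetical helper).

    x: array of shape (N, C), one C-dimensional feature per position.
    Returns an array of shape (N, C) where each row is a weighted sum
    of the features at all N positions.
    """
    # Pairwise similarity between every pair of positions.
    # Assumption: plain dot product; the abstract does not fix this choice.
    scores = x @ x.T                                   # shape (N, N)

    # Turn each row of similarities into weights that sum to 1.
    # Assumption: softmax normalisation, computed in a numerically stable way.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    # Response at position i = sum over all positions j of weights[i, j] * x[j].
    return weights @ x


rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 4))    # e.g. 6 positions with 4 channels each
out = non_local_response(feats)
print(out.shape)                       # (6, 4)
```

Because every output row depends on all input rows, a single application already relates distant positions, whereas a convolution would need many stacked layers to do so. For an image or video feature map, the spatial (and temporal) axes would first be flattened into the N dimension.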

CVPR 2018

Results from the Paper


Ranked #7 on Action Classification on Toyota Smarthome dataset (using extra training data)

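| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Keypoint Detection | COCO | Mask R-CNN + NL blocks (4 in head, 1 in backbone) | Validation AP | 66.5 | #13 |
| Object Detection | COCO minival | Mask R-CNN (ResNeXt-152 + 1 NL) | box AP | 45.0 | #70 |
| Object Detection | COCO minival | Mask R-CNN (ResNeXt-152 + 1 NL) | AP50 | 67.8 | #18 |
| Object Detection | COCO minival | Mask R-CNN (ResNeXt-152 + 1 NL) | AP75 | 48.9 | #26 |
| Instance Segmentation | COCO minival | Mask R-CNN (ResNeXt-152, +1 NL) | mask AP | 40.3 | #46 |
| Object Detection | COCO minival | Mask R-CNN (ResNet-50 + 1 NL) | box AP | 39.0 | #139 |
| Object Detection | COCO minival | Mask R-CNN (ResNet-50 + 1 NL) | AP50 | 61.1 | #59 |
| Object Detection | COCO minival | Mask R-CNN (ResNet-50 + 1 NL) | AP75 | 41.9 | #72 |
| Object Detection | COCO minival | Mask R-CNN (ResNet-101 + 1 NL) | box AP | 40.8 | #120 |
| Object Detection | COCO minival | Mask R-CNN (ResNet-101 + 1 NL) | AP50 | 63.1 | #42 |
| Object Detection | COCO minival | Mask R-CNN (ResNet-101 + 1 NL) | AP75 | 44.5 | #57 |
| Instance Segmentation | COCO minival | Mask R-CNN (ResNet-50, +1 NL) | mask AP | 35.5 | #63 |
| Instance Segmentation | COCO minival | Mask R-CNN (ResNet-101, +1 NL) | mask AP | 37.1 | #59 |
| Action Classification | Toyota Smarthome dataset | I3D + Non Local | CS | 53.6 | #7 |
| Action Classification | Toyota Smarthome dataset | I3D + Non Local | CV1 | 34.3 | #6 |
| Action Classification | Toyota Smarthome dataset | I3D + Non Local | CV2 | 43.9 | #7 |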

Results from Other Papers


| Task | Dataset | Model | Metric Name | Metric Value | Rank |
|---|---|---|---|---|---|
| Action Classification | Kinetics-400 | I3D + NL | Vid acc@1 | 77.7 | #73 |
| Action Classification | Kinetics-400 | I3D + NL | Vid acc@5 | 93.3 | #60 |
| Action Recognition | Something-Something V1 | NL I3D | Top 1 Accuracy | 44.4 | #52 |

Methods