Temporally Distributed Networks for Fast Video Semantic Segmentation

We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we distribute these sub-networks over sequential frames. Therefore, at each time step, we only need to perform a lightweight computation to extract a group of sub-features with a single sub-network. The full features used for segmentation are then recomposed by applying a novel attention propagation module that compensates for geometric deformation between frames. A grouped knowledge distillation loss is also introduced to further improve the representation power at both the full-feature and sub-feature levels. Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
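The core idea above can be sketched in a few lines: assign one of m shallow sub-networks to each frame in round-robin order, keep a sliding window of the last m sub-feature maps, and recompose the full feature with an attention step that aligns past sub-features to the current frame. This is a minimal NumPy illustration, not the paper's implementation; the class and function names are invented, the sub-networks are stand-in callables, and the plain dot-product attention is only a stand-in for the paper's attention propagation module.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TemporallyDistributedExtractor:
    """Sketch of TDNet-style feature extraction (names hypothetical):
    m shallow sub-networks, one applied per frame, with the full feature
    recomposed from the last m sub-feature groups via dot-product attention."""

    def __init__(self, subnets):
        self.subnets = subnets   # list of callables: frame (C_in,H,W) -> features (C,H,W)
        self.buffer = []         # sliding window holding the last m sub-feature maps

    def step(self, frame, t):
        m = len(self.subnets)
        # Lightweight per-frame computation: only one sub-network runs per time step.
        feat = self.subnets[t % m](frame)
        self.buffer.append(feat)
        if len(self.buffer) > m:
            self.buffer.pop(0)
        # Recomposition: warp each buffered sub-feature toward the current frame
        # by spatial attention against the newest feature map, then aggregate.
        c = feat.shape[0]
        q = self.buffer[-1].reshape(c, -1)                       # (C, HW) query
        aligned = []
        for f in self.buffer:
            k = f.reshape(c, -1)                                 # (C, HW) key/value
            w = softmax(q.T @ k / np.sqrt(c), axis=-1)           # (HW, HW) weights
            aligned.append((k @ w.T).reshape(f.shape))           # attended sub-feature
        return np.mean(aligned, axis=0)                          # recomposed full feature

# Usage: two toy "sub-networks" (1x1 convolutions as channel mixes).
rng = np.random.default_rng(0)
subnets = [(lambda x, W=rng.standard_normal((4, 3)):
            np.einsum('chw,dc->dhw', x, W)) for _ in range(2)]
extractor = TemporallyDistributedExtractor(subnets)
out = None
for t in range(3):                       # stream three frames
    out = extractor.step(rng.standard_normal((3, 2, 2)), t)
print(out.shape)                         # (4, 2, 2): C channels, H, W
```

In the real model, each sub-network is a pathway of a strong deep network (e.g. PSPNet), so the per-frame cost is roughly 1/m of running the full model, while the recomposed feature approximates the full model's representation.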

PDF Abstract (CVPR 2020)
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Real-Time Semantic Segmentation | CamVid | TD2-PSP50 | mIoU | 76.0 | #10 |
| Real-Time Semantic Segmentation | CamVid | TD2-PSP50 | Time (ms) | 90 | #15 |
| Real-Time Semantic Segmentation | CamVid | TD2-PSP50 | Frame rate (fps) | 11 (Titan X) | #18 |
| Real-Time Semantic Segmentation | CamVid | TD4-PSP18 | mIoU | 72.6 | #16 |
| Real-Time Semantic Segmentation | CamVid | TD4-PSP18 | Time (ms) | 40 | #13 |
| Real-Time Semantic Segmentation | CamVid | TD4-PSP18 | Frame rate (fps) | 25 (Titan X) | #18 |
| Real-Time Semantic Segmentation | Cityscapes test | TD4-BISE18 | mIoU | 74.9% | #14 |
| Real-Time Semantic Segmentation | Cityscapes test | TD4-BISE18 | Time (ms) | 21 | #13 |
| Real-Time Semantic Segmentation | Cityscapes test | TD4-BISE18 | Frame rate (fps) | 47.6 (Titan X) | #15 |
| Video Semantic Segmentation | Cityscapes val | TDNet-50 [9] | mIoU | 79.9 | #2 |
| Semantic Segmentation | NYU Depth v2 | TD2-PSP50 | Mean IoU | 43.5 | #87 |
| Real-Time Semantic Segmentation | NYU Depth v2 | TD4-PSP18 | mIoU | 37.4 | #9 |
| Real-Time Semantic Segmentation | NYU Depth v2 | TD4-PSP18 | Speed (ms/frame) | 19 | #3 |
| Real-Time Semantic Segmentation | NYU Depth v2 | TD2-PSP50 | mIoU | 43.5 | #4 |
| Real-Time Semantic Segmentation | NYU Depth v2 | TD2-PSP50 | Speed (ms/frame) | 35 | #7 |
| Semantic Segmentation | NYU Depth v2 | TD4-PSP18 | Mean IoU | 37.4 | #99 |
| Semantic Segmentation | UrbanLF | TDNet (ResNet-50) | mIoU (Real) | 76.48 | #8 |
| Semantic Segmentation | UrbanLF | TDNet (ResNet-50) | mIoU (Syn) | 74.71 | #13 |
