Revisiting Locally Supervised Training of Deep Neural Networks

ICLR 2021  ·  Yulin Wang, Zanlin Ni, Shiji Song, Le Yang, Gao Huang

Due to the need to store intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from a high GPU memory footprint. This paper aims to address this problem by revisiting locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with the E2E loss tends to collapse task-relevant information at early layers, and hence hurts the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information. As the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. In fact, we show that the proposed method boils down to minimizing the combination of a reconstruction loss and a normal cross-entropy/contrastive term. Extensive experiments on five datasets (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes) validate that our method achieves competitive performance with less than 40% of the memory footprint of E2E training, or enables asynchronous training of local modules for potential training acceleration.
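To make the surrogate objective concrete, below is a minimal PyTorch-style sketch of one update step for a single gradient-isolated module, combining a reconstruction loss with a cross-entropy term as the abstract describes. The module definitions (`local_module`, `decoder`, `aux_head`), the toy shapes, and the weighting `lambda_recon` are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical components for illustration: `local_module` is one
# gradient-isolated stage, `decoder` reconstructs the input from its
# output (information preservation), and `aux_head` gives class logits.
local_module = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(32, 3, 3, padding=1)
aux_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

optimizer = torch.optim.SGD(
    list(local_module.parameters())
    + list(decoder.parameters())
    + list(aux_head.parameters()),
    lr=0.1,
)

def local_step(x, y, lambda_recon=1.0):
    """One local update: reconstruction + cross-entropy surrogate loss."""
    h = local_module(x)
    # Reconstruction term: encourage h to retain information about x.
    recon_loss = F.mse_loss(decoder(h), x)
    # Supervised term: discard task-irrelevant information.
    ce_loss = F.cross_entropy(aux_head(h), y)
    loss = lambda_recon * recon_loss + ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Detach before passing to the next module, so no gradient flows
    # back across module boundaries and activations can be freed early.
    return h.detach()

# Toy usage: the detached output feeds the next local module.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
h = local_step(x, y)
```

Because each module receives only detached inputs, its activations need not be kept alive for the full backward pass, which is the source of the memory savings reported in the abstract.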
