Speeding up Deep Learning Training by Sharing Weights and Then Unsharing
It has been widely observed that increasing the size of deep learning models often leads to significant performance improvements on a variety of natural language processing and computer vision tasks. At the same time, however, computational costs and training time increase dramatically as models grow larger. In this paper, we propose a simple approach to speed up the training of deep networks that contain repeated structures, such as the transformer module. In our method, we first train such a network with the weights shared across all the repeated layers. Once an unsharing condition is triggered, we stop weight sharing and continue training until convergence. Empirical results show that our method reduces the training time of BERT by 50%. We also present a preliminary theoretical analysis that motivates our approach.
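The share-then-unshare idea can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation; the class and method names (`SharedStack`, `unshare`, `step`) are illustrative, and each "layer" is reduced to a single scalar weight to keep the aliasing behavior visible.

```python
import copy

class SharedStack:
    """Toy model of a deep net with repeated layers.

    During the shared phase, all layers alias ONE parameter object,
    so a gradient step on any layer updates every layer at once.
    After unshare(), each layer gets its own copy and trains
    independently until convergence.
    """

    def __init__(self, num_layers, init_weight=0.0):
        shared = {"w": init_weight}          # the single shared parameter
        self.layers = [shared] * num_layers  # every entry is the same dict
        self.shared = True

    def unshare(self):
        # Triggered once the unsharing condition fires: replace the
        # aliases with independent deep copies, then keep training.
        self.layers = [copy.deepcopy(layer) for layer in self.layers]
        self.shared = False

    def step(self, layer_idx, delta):
        # Stand-in for one gradient update on one layer's weight.
        self.layers[layer_idx]["w"] += delta


stack = SharedStack(num_layers=3)
stack.step(0, 1.0)  # shared phase: updating layer 0 ...
print([layer["w"] for layer in stack.layers])  # ... moves all layers: [1.0, 1.0, 1.0]

stack.unshare()
stack.step(0, 1.0)  # after unsharing, only layer 0 changes
print([layer["w"] for layer in stack.layers])  # [2.0, 1.0, 1.0]
```

In a real transformer the shared phase would also reuse one layer's parameter tensors across all blocks (halving memory traffic for weights), and `unshare()` would copy them into per-block parameters before resuming optimization.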