Group-based Interleaved Pipeline Parallelism for Large-scale DNN Training

The recent trend of using large-scale deep neural networks (DNNs) to boost performance has propelled the development of parallel pipelining techniques for efficient DNN training, resulting in several prominent pipeline systems such as GPipe, PipeDream, and PipeDream-2BW. However, the current leading pipeline, PipeDream-2BW, still suffers from two major drawbacks, namely excessive memory redundancy and delayed weight updates across all stages. In this work, we propose a novel pipeline named WPipe, which achieves better memory efficiency and fresher weight updates. WPipe uses a novel pipelining scheme that divides the model partitions into two groups: in the first group, it moves the forward pass of the next period ahead of the backward pass of the current period; in the second group, it retains the original order; and it updates the two groups alternately. Compared to PipeDream-2BW, this eliminates half of the delayed gradients and half of the memory redundancy. Experiments training large BERT language models show that, compared to PipeDream-2BW, WPipe achieves a $1.4\times$ speedup and reduces the memory footprint by 36% while reaching similar final model accuracy.
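
To make the alternating-update idea concrete, below is a minimal, single-device sketch in PyTorch. It only illustrates splitting the model partitions into two groups that apply their weight updates in alternate periods; it does not model the pipeline interleaving across stages, the micro-batch schedule, or the authors' implementation, and the names (`PartitionGroup`, `train_period`) are illustrative assumptions.

```python
# Minimal single-device sketch (an assumption, not the authors' code) of the
# "two groups updated alternately" idea from the abstract. It ignores the
# pipeline interleaving across stages and the micro-batch schedule; the
# names PartitionGroup and train_period are hypothetical.
import torch
import torch.nn as nn


class PartitionGroup(nn.Module):
    """A group of consecutive model partitions with its own optimizer."""

    def __init__(self, partitions, lr=0.01):
        super().__init__()
        self.partitions = nn.Sequential(*partitions)
        self.optimizer = torch.optim.SGD(self.partitions.parameters(), lr=lr)

    def forward(self, x):
        return self.partitions(x)

    def apply_update(self):
        # Apply the gradients accumulated since this group's last update.
        self.optimizer.step()
        self.optimizer.zero_grad()


def train_period(group0, group1, micro_batches, loss_fn, period):
    """Run one period of micro-batches, then update only one group.

    Gradients keep accumulating in the other group, so each group applies
    its update every other period (the alternating-update aspect); the real
    schedule additionally interleaves forward and backward passes across
    pipeline stages, which is not modeled here.
    """
    for x, y in micro_batches:
        loss = loss_fn(group1(group0(x)), y)
        loss.backward()
    (group0 if period % 2 == 0 else group1).apply_update()


if __name__ == "__main__":
    # Toy model split into four partitions, two per group.
    parts = [nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16), nn.Linear(16, 4)]
    g0, g1 = PartitionGroup(parts[:2]), PartitionGroup(parts[2:])
    data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(2)]
    for t in range(4):
        train_period(g0, g1, data, nn.CrossEntropyLoss(), period=t)
```

In the paper's pipelined setting, the first group additionally runs the next period's forward passes before the current period's backward passes, which is what the abstract credits for halving the delayed gradients and the memory redundancy relative to PipeDream-2BW.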
