Hierarchical Cost Analysis for Distributed DL

Deep learning (DL) has developed rapidly over the past decade, and DNN models have become larger and more complex. The increasing size of datasets and models requires efficient distributed training approaches. Different parallelism strategies yield different performance depending on the structure of the DNN. To obtain better performance and overcome memory restrictions, hybrid parallelism (HP), which applies different basic parallelism strategies to different parts of a DNN, has also been explored. However, different parallelisms introduce mixed extra costs that are difficult to distinguish and evaluate. It is therefore crucial to provide an approach that can clearly evaluate the costs caused by parallelisms and systematically find efficient hybrid strategies; current approaches consider only one or two kinds of parallelism.

In this work, we first present the training process of DNNs and explain the related notions. Three basic parallelism strategies (DP: data parallelism, OP: operator parallelism, PP: pipeline parallelism) are introduced and their relative merits are compared. Computation and communication are naturally distinguished when only hybrids of DP and PP are considered: DP determines the synchronous communication but has no effect on forward/backward propagation (FPG/BPG), while PP introduces a pipeline bubble and a small amount of communication but does not affect synchronization. Fundamental cost analysis can therefore be applied directly. However, once OP is taken into consideration, FPG/BPG become a mixed process of communication and computation, and the previous cost analysis is no longer suitable. Existing approaches have not built a concrete execution model for distributed DNN training; they simply evaluate the total cost of computation and communication. Without distinguishing the different kinds of communication, a proper HP can only be obtained through traversal or tuning, and further optimization opportunities are lost.

To formalize the behavior of HP in distributed DL and quantitatively evaluate its cost, we are studying Bridging DL, which is composed of a double-level execution model associated with a symbolic cost model. The double-level execution model is used to explore the details of HPs: the training process of the whole DNN model is abstracted as a super-step, while the training of a single operator is abstracted as a micro-step. With these two steps, the mixed communication and computation caused by the three basic parallelism strategies are properly separated and the training process is clearly described. Based on the double-level execution model, the cost model of distributed DL helps to choose efficient HP strategies; details of the cost model can be found in the poster and extended abstract. To conclude, Bridging DL enables systematic HP search for distributed DNN training and opens opportunities for further analysis and optimization. Besides, Bridging DL also helps guide DNN frameworks in generating code for HP. A preliminary model (for the micro-step) has been implemented on MindSpore; the remaining parts will be implemented and evaluated.
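To make the separation of costs concrete, below is a minimal, hypothetical sketch of a two-level symbolic cost model in Python. All names (`Cluster`, `Operator`, `micro_step_cost`, `super_step_cost`) and the alpha-beta style formulas are illustrative assumptions for exposition only; they are not the actual Bridging DL cost model, whose details are given in the poster and extended abstract.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    devices: int      # number of devices in the parallel group
    flops: float      # per-device compute throughput (FLOP/s)
    bandwidth: float  # inter-device bandwidth (bytes/s)

@dataclass
class Operator:
    fwd_flops: float    # forward-propagation FLOPs for one mini-batch
    params: float       # parameter size in bytes
    activations: float  # output activation size in bytes

def micro_step_cost(op: Operator, c: Cluster, strategy: str) -> float:
    """Cost of training a single operator (micro-step) under one basic strategy."""
    compute = 3 * op.fwd_flops / (c.devices * c.flops)  # fwd + bwd, split over devices
    if strategy == "DP":
        # FPG/BPG are untouched; the extra cost is the synchronous gradient all-reduce
        sync = 2 * op.params / c.bandwidth
        return compute + sync
    if strategy == "OP":
        # the operator itself is sharded, so activation redistribution is
        # interleaved with FPG/BPG (a mixed process of computation and communication)
        redistribute = 2 * op.activations / c.bandwidth
        return compute + redistribute
    raise ValueError(f"unknown strategy: {strategy}")

def super_step_cost(ops, strategies, c: Cluster, pp_stages=1, micro_batches=1) -> float:
    """Cost of one training iteration over the whole DNN (super-step)."""
    per_iteration = sum(micro_step_cost(op, c, s) for op, s in zip(ops, strategies))
    # PP adds a bubble that grows with the number of stages and shrinks with the
    # number of micro-batches, but leaves the synchronization cost alone
    bubble = (pp_stages - 1) / micro_batches
    return per_iteration * (1 + bubble)

# Example: compare a DP-only plan against a DP/OP hybrid for a two-operator toy model
if __name__ == "__main__":
    cluster = Cluster(devices=8, flops=1e14, bandwidth=1e11)
    net = [Operator(2e12, 4e9, 1e8), Operator(1e12, 1e8, 4e9)]
    print(super_step_cost(net, ["DP", "DP"], cluster))
    print(super_step_cost(net, ["DP", "OP"], cluster))
```

Even in this toy form, the micro-step keeps OP's in-propagation communication separate from DP's synchronization, and the super-step layers the PP bubble on top, which mirrors how the double-level execution model disentangles the mixed costs of an HP strategy.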
