
Wavelet Distributed Training

Wavelet is an asynchronous data-parallel training approach that interleaves waves of training tasks on the same group of GPUs, so that tasks belonging to one wave can use the on-device memory of tasks in the other wave during their memory valley period, thereby boosting training throughput. As shown in the figure, Wavelet divides the data-parallel training tasks into two waves, a tick-wave and a tock-wave. The launch offset is achieved by delaying the start of the tock-wave tasks by half of a full forward-backward training cycle. The tock-wave tasks can therefore directly use the GPU memory valley of the tick-wave tasks (e.g. 0.4s-0.6s in Figure 2(a)), since backward propagation of the tick-wave tasks is compute-heavy but leaves their memory largely unused. Symmetrically, the tick-wave tasks exploit the memory valley of the tock-wave tasks in the same way.
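The scheduling idea can be sketched with plain threads and sleeps standing in for real forward/backward GPU kernels. This is a minimal illustration, not Wavelet's actual implementation: the `CYCLE` length, wave names, and step counts are assumptions chosen only to make the half-cycle offset visible in the output.

```python
import threading
import time

CYCLE = 0.4  # assumed length of one forward-backward cycle, in seconds (illustrative)

def train_task(name: str, offset: float, steps: int = 3) -> None:
    """Simulate one wave of data-parallel training, delayed by `offset` seconds.

    The forward half of each cycle is memory-heavy (activations accumulate);
    the backward half is compute-heavy but frees activation memory, creating
    the 'memory valley' that the other wave's forward pass can occupy.
    """
    time.sleep(offset)
    for step in range(steps):
        time.sleep(CYCLE / 2)  # forward pass: GPU memory use peaks
        print(f"{time.monotonic():7.2f}s  {name} step {step}: forward done")
        time.sleep(CYCLE / 2)  # backward pass: memory valley for the other wave
        print(f"{time.monotonic():7.2f}s  {name} step {step}: backward done")

# The tick-wave starts immediately; the tock-wave is delayed by half a cycle,
# so its forward pass lands inside the tick-wave's memory valley and vice versa.
waves = [
    threading.Thread(target=train_task, args=("tick-wave", 0.0)),
    threading.Thread(target=train_task, args=("tock-wave", CYCLE / 2)),
]
for t in waves:
    t.start()
for t in waves:
    t.join()
```

Running the sketch prints the two waves in strict alternation: each tock-wave forward pass completes while the corresponding tick-wave backward pass is in flight, which is the overlap the half-cycle offset is designed to produce.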
