In this paper, we focus on approaches to parallelizing stochastic gradient
descent (SGD) wherein data is farmed out to a set of workers, the results of
which, after a number of updates, are then combined at a central master node. Although such synchronized SGD approaches parallelize well in idealized
computing environments, they often fail to realize their promised computational
acceleration in practical settings.
One cause is slow workers, termed
stragglers, which can cause the fusion step at the master node to stall,
greatly slowing convergence. In many straggler mitigation approaches, work
completed by these nodes, while only partial, is discarded completely. In this
paper, we propose an approach to parallelizing synchronous SGD that exploits
the work completed by all workers. The central idea is to fix the computation
time of each worker and then to combine distinct contributions of all workers. We provide a convergence analysis and optimize the combination function. Our
numerical results demonstrate an improvement of several factors of magnitude in
comparison to existing methods.
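
The central idea can be illustrated with a minimal sketch, which is not the implementation evaluated in this paper. It makes several assumptions: the fixed computation time is modeled as a per-worker step count (faster workers complete more steps), the objective is a least-squares loss, and the master weights each worker's result in proportion to the number of steps that worker completed. That weighting is a simple stand-in for the optimized combination function, and the names `worker_sgd`, `combine`, and `fixed_time_round` are hypothetical.

```python
import numpy as np


def worker_sgd(x0, A, b, num_steps, lr, rng):
    """Run `num_steps` single-sample SGD steps on 0.5 * (a_i . x - b_i)^2.

    `num_steps` stands in for the work a worker finishes within the fixed
    time budget (an assumption of this sketch).
    """
    x = np.array(x0, dtype=float)  # private copy; the master's iterate is untouched
    for _ in range(num_steps):
        i = rng.integers(len(b))            # sample one data point from the shard
        grad = (A[i] @ x - b[i]) * A[i]     # gradient of the per-sample loss
        x -= lr * grad
    return x


def combine(results, steps):
    """Weighted average of worker iterates.

    Weights are proportional to the steps each worker completed (an assumed
    rule, not the optimized combination function).
    """
    steps = np.asarray(steps, dtype=float)
    if steps.sum() <= 0:
        raise ValueError("at least one worker must complete a step")
    weights = steps / steps.sum()
    return weights @ np.stack(results)


def fixed_time_round(x, shards, steps_per_worker, lr, rng):
    """One synchronous round: every worker starts from x, then the master fuses."""
    results = [
        worker_sgd(x, A, b, q, lr, rng)
        for (A, b), q in zip(shards, steps_per_worker)
    ]
    return combine(results, steps_per_worker)
```

In this sketch a straggler that completes only a few steps still contributes to the fused iterate, with a correspondingly small weight, rather than stalling the master or having its partial work discarded.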