Data Parallel Methods


This section contains a compilation of distributed data parallel methods for deep learning. In data parallelism, every node holds an identical copy of the model parameters but receives a different mini-batch of data; each node runs the forward and backward passes locally and sends its gradient back to the main node. Once all gradients have arrived, the main node averages them (weighted by batch size when the shards differ) and uses the result to update the model parameters, which are then shared with all nodes so they stay in sync.

Image credit: Jordi Torres.
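
To make the averaging step concrete, here is a minimal single-process sketch in PyTorch that simulates the scheme described above: the nodes are a plain loop, and the toy linear model, shard sizes, and learning rate are illustrative assumptions, not part of any specific method listed here.

```python
import torch

# Simulated data parallelism in one process: every "node" shares the same
# parameters, sees a different data shard, computes its gradient locally,
# and the shard-size-weighted average of those gradients updates the model.

torch.manual_seed(0)

model = torch.nn.Linear(10, 1)   # shared parameters (hypothetical toy model)
loss_fn = torch.nn.MSELoss()
lr = 0.1

# One full batch, split unevenly across 3 simulated nodes.
x, y = torch.randn(12, 10), torch.randn(12, 1)
shards = list(zip(torch.split(x, [5, 4, 3]), torch.split(y, [5, 4, 3])))

grads, sizes = [], []
for xi, yi in shards:
    model.zero_grad()
    loss_fn(model(xi), yi).backward()   # local forward/backward on this shard
    grads.append([p.grad.clone() for p in model.parameters()])
    sizes.append(len(xi))

# Weighted average of the per-node gradients (weights = shard sizes),
# followed by a single SGD step on the shared parameters.
total = sum(sizes)
with torch.no_grad():
    for j, p in enumerate(model.parameters()):
        avg = sum(w * g[j] for w, g in zip(sizes, grads)) / total
        p -= lr * avg
```

In a real multi-machine setup the loop is replaced by collective communication (e.g. an all-reduce over the gradients), but the arithmetic is the same as in this sketch.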

Subcategories