Distributed Methods


This section compiles distributed methods for scaling deep learning to very large models. There are several strategies for scaling training across multiple devices, including:

  • Data Parallel: each node holds an identical copy of the model parameters but receives a different mini-batch of data. Every node runs forward propagation on its mini-batch, computes the gradient locally, and sends it back to the main node. Once all gradients have arrived, the main node computes their weighted average and uses it to update the model parameters (see the first sketch after this list).

  • Model Parallel: the layers of the model are partitioned across nodes. During forward propagation, computation starts on the node holding the first layers, then moves to the next node, and so on. Once forward propagation is done, gradients are computed for the last node and its parameters are updated; backpropagation then proceeds to the penultimate node, whose parameters are updated, and so on back to the first (see the second sketch after this list).

  • Additional strategies include Hybrid Parallel, Auto Parallel, and Distributed Communication.
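
As a concrete illustration, here is a minimal single-process sketch of data parallelism, assuming PyTorch; the replica count, names, and toy dimensions are assumptions for illustration, not any particular library's API. Two model replicas stand in for two nodes: each processes a different mini-batch, and the averaged gradients update the main copy of the parameters.

```python
# Minimal single-process sketch of data parallelism (assumed PyTorch).
# Two replicas of the same model each process a different mini-batch;
# their gradients are averaged and applied to the main parameters.
import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                            # the "main node" parameters
replicas = [copy.deepcopy(model) for _ in range(2)]  # one copy per "node"

batches = [torch.randn(4, 10) for _ in range(2)]     # a different mini-batch per node
targets = [torch.randn(4, 1) for _ in range(2)]

loss_fn = nn.MSELoss()
for replica, x, y in zip(replicas, batches, targets):
    loss = loss_fn(replica(x), y)
    loss.backward()                                  # local gradient on each "node"

# Average the gradients across replicas (equal batch sizes here, so a
# plain mean) and apply one SGD step on the main model.
with torch.no_grad():
    for p_main, *p_reps in zip(model.parameters(),
                               *[r.parameters() for r in replicas]):
        grad = torch.stack([p.grad for p in p_reps]).mean(dim=0)
        p_main -= 0.01 * grad
```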
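
Likewise, a minimal sketch of model parallelism, again assuming PyTorch, with two sequential stages standing in for two nodes (the names `stage1`/`stage2` and the toy dimensions are illustrative assumptions). The forward pass hands activations from the first stage to the second; backpropagation reaches the last stage first and then flows back, matching the description above.

```python
# Minimal sketch of model parallelism (assumed PyTorch): two stages
# stand in for two nodes, each holding a different slice of the layers.
import torch
import torch.nn as nn

stage1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU())  # "node 1": first layers
stage2 = nn.Linear(32, 1)                             # "node 2": last layers

opt1 = torch.optim.SGD(stage1.parameters(), lr=0.01)
opt2 = torch.optim.SGD(stage2.parameters(), lr=0.01)

x, y = torch.randn(4, 10), torch.randn(4, 1)

hidden = stage1(x)          # forward on node 1; activations shipped to node 2
out = stage2(hidden)        # forward on node 2
loss = nn.functional.mse_loss(out, y)

loss.backward()             # gradients reach stage2 first, then flow back to stage1
opt2.step()                 # update the last node's parameters...
opt1.step()                 # ...then the earlier node's, mirroring the description above
```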

Image credit: Jordi Torres.

Subcategories

(Table of subcategories: Method · Year · Papers.)