Mesh-TensorFlow: Deep Learning for Supercomputers

NeurIPS 2018
Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman

Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes...
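Mesh-TensorFlow, the approach the paper proposes, addresses these problems by letting the user name every tensor dimension and choose which dimensions to split across a multi-dimensional mesh of processors; splitting only the batch dimension recovers ordinary data-parallelism. The sketch below is an informal illustration of that named-dimension style, assuming the open-source mesh_tensorflow package (`import mesh_tensorflow as mtf`); the layer sizes, mesh shape, and layout rules are invented for the example rather than taken from the paper.

```python
# Informal sketch of the Mesh-TensorFlow named-dimension style, assuming the
# open-source `mesh_tensorflow` package; sizes and layouts are illustrative
# choices, not values from the paper.
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Every tensor dimension is named and sized explicitly.
batch_dim = mtf.Dimension("batch", 512)
io_dim = mtf.Dimension("io", 1024)
hidden_dim = mtf.Dimension("hidden", 4096)

# A two-layer feed-forward block written over named dimensions.
# (The input is declared as a variable only to keep the sketch
# self-contained; a real model would import it from a tf.Tensor.)
x = mtf.get_variable(mesh, "x", mtf.Shape([batch_dim, io_dim]))
w1 = mtf.get_variable(mesh, "w1", mtf.Shape([io_dim, hidden_dim]))
w2 = mtf.get_variable(mesh, "w2", mtf.Shape([hidden_dim, io_dim]))
h = mtf.relu(mtf.einsum([x, w1], output_shape=mtf.Shape([batch_dim, hidden_dim])))
y = mtf.einsum([h, w2], output_shape=mtf.Shape([batch_dim, io_dim]))

# The distribution strategy is a separate "layout": a mapping from tensor
# dimensions to dimensions of a logical processor mesh.  Splitting only
# "batch" is data-parallelism; also splitting "hidden" shards the weights
# (model-parallelism).  The object used to apply the layout depends on the
# backend (e.g. a placement or TPU/SIMD mesh implementation).
mesh_shape = [("rows", 2), ("cols", 4)]                 # 2x4 mesh, 8 processors
layout_rules = [("batch", "rows"), ("hidden", "cols")]
```

Because the computation is written once over named dimensions, changing the layout rules changes the distribution strategy without touching the model code.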

TASK                DATASET           MODEL            METRIC NAME       METRIC VALUE  GLOBAL RANK
Language Modelling  One Billion Word  Mesh Tensorflow  PPL               24.0          #5
Language Modelling  One Billion Word  Mesh Tensorflow  Number of params  4.9B          #1
