no code implementations • 1 Nov 2024 • Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, Francisco Massa
Distributed training of large models consumes enormous computation resources and requires substantial engineering efforts to compose various training techniques.
1 code implementation • 9 Oct 2024 • Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos
By stacking training optimizations, we demonstrate accelerations of 65.08% with 1D parallelism at the 128-GPU scale (Llama 3.1 8B), an additional 12.59% with 2D parallelism at the 256-GPU scale (Llama 3.1 70B), and an additional 30% with 3D parallelism at the 512-GPU scale (Llama 3.1 405B) on NVIDIA H100 GPUs over optimized baselines.
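The stacked parallelism described above composes data, tensor, and pipeline parallelism over a single device mesh. Below is a minimal sketch (not the authors' implementation) of how 2D parallelism can be composed with PyTorch's DeviceMesh, tensor-parallel APIs, and FSDP; the toy model, mesh shape, and function names `ToyMLP` and `build_2d_parallel_model` are illustrative assumptions.

```python
# Minimal sketch: composing 2D parallelism (FSDP data parallelism +
# tensor parallelism) over a PyTorch DeviceMesh. Not TorchTitan itself.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyMLP(nn.Module):
    """Illustrative two-layer MLP standing in for a transformer block."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.up = nn.Linear(dim, 4 * dim)
        self.down = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))


def build_2d_parallel_model():
    # 2D mesh: outer "dp" dimension for data parallelism, inner "tp"
    # dimension for tensor parallelism. Assumes the script was launched
    # with torchrun on 16 ranks (4 x 4); adjust to your cluster.
    mesh = init_device_mesh("cuda", (4, 4), mesh_dim_names=("dp", "tp"))

    model = ToyMLP().cuda()

    # Shard the two linear layers across the "tp" mesh dimension:
    # column-wise for the up projection, row-wise for the down projection.
    model = parallelize_module(
        model,
        mesh["tp"],
        {"up": ColwiseParallel(), "down": RowwiseParallel()},
    )

    # Wrap the tensor-parallel model with FSDP over the "dp" mesh
    # dimension to add data parallelism on top.
    model = FSDP(model, device_mesh=mesh["dp"], use_orig_params=True)
    return model
```

A 3D setup would extend the mesh with a pipeline dimension and split the model into stages scheduled across it; this sketch stops at 2D for brevity.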