Search Results for author: Wanchao Liang

Found 2 papers, 1 paper with code

SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile

no code implementations • 1 Nov 2024 • Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, Francisco Massa

Distributed training of large models consumes enormous computation resources and requires substantial engineering efforts to compose various training techniques.
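As a rough illustration of the composition the title refers to (not the paper's implementation), the sketch below wraps a toy model in PyTorch's FSDP and then applies torch.compile. The model, sizes, and optimizer are hypothetical, and it assumes the process group has already been initialized (e.g. via torchrun) on GPU machines.

```python
# Minimal sketch, assuming torch.distributed is already initialized and CUDA is available.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Hypothetical toy model standing in for a transformer block.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

model = FSDP(model)           # shard parameters and gradients across data-parallel ranks
model = torch.compile(model)  # trace and optimize the sharded module with torch.compile

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).square().mean()
loss.backward()
optimizer.step()
```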

TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training

1 code implementation • 9 Oct 2024 • Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos

By stacking training optimizations, we demonstrate accelerations of 65.08% with 1D parallelism at the 128-GPU scale (Llama 3.1 8B), an additional 12.59% with 2D parallelism at the 256-GPU scale (Llama 3.1 70B), and an additional 30% with 3D parallelism at the 512-GPU scale (Llama 3.1 405B) on NVIDIA H100 GPUs over optimized baselines.
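The multi-dimensional parallelism mentioned above is typically expressed in PyTorch with a DeviceMesh. The sketch below shows the general 2D (data-parallel by tensor-parallel) layout idea under assumed GPU counts matching the 256-GPU example; it is not TorchTitan's actual configuration, and the 32x8 split is a hypothetical choice.

```python
# Sketch of a 2D device mesh, assuming torch.distributed is initialized with 256 ranks.
from torch.distributed.device_mesh import init_device_mesh

# 32 data-parallel groups x 8-way tensor parallelism = 256 GPUs (assumed split).
mesh_2d = init_device_mesh("cuda", (32, 8), mesh_dim_names=("dp", "tp"))

dp_mesh = mesh_2d["dp"]  # submesh used for data-parallel sharding (e.g. FSDP)
tp_mesh = mesh_2d["tp"]  # submesh used for tensor-parallel layers
```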
