1 code implementation • 6 Feb 2023 • Yuliang Liu, Shenggui Li, Jiarui Fang, Yanjun Shao, Boyuan Yao, Yang You
To address these challenges, we introduce a system that can jointly optimize distributed execution and gradient checkpointing plans.
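As background for the gradient-checkpointing part of that plan, here is a minimal, hedged sketch of activation checkpointing in plain PyTorch; the model, sizes, and depth are illustrative assumptions and this is not the paper's joint-optimization system.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class CheckpointedNet(nn.Module):
    def __init__(self, dim=256, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            # Activations inside each block are dropped in the forward pass
            # and recomputed during backward, cutting peak activation memory
            # at the cost of extra compute.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

x = torch.randn(4, 128, 256, requires_grad=True)
CheckpointedNet()(x).sum().backward()
```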
2 code implementations • 10 Dec 2022 • Haichen Huang, Jiarui Fang, Hongxin Liu, Shenggui Li, Yang You
To reduce GPU memory usage, memory partitioning and memory offloading have been proposed.
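The core idea behind offloading can be sketched in a few lines; the layer size and just-in-time policy below are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
layer = nn.Linear(4096, 4096)          # weights start out in CPU memory
x = torch.randn(8, 4096, device=device)

layer.to(device)                       # upload just-in-time for this layer's compute
y = layer(x)
layer.to("cpu")                        # evict immediately to free GPU memory for other layers
```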
no code implementations • 6 Sep 2022 • Jiangsu Du, Ziming Liu, Jiarui Fang, Shenggui Li, Yongbin Li, Yutong Lu, Yang You
Although the AI community has expanded the model scale to the trillion-parameter level, the practical deployment of 10-100 billion parameter models remains uncertain due to latency, throughput, and memory constraints.
1 code implementation • 8 Aug 2022 • Jiarui Fang, Geng Zhang, Jiatong Han, Shenggui Li, Zhengda Bian, Yongbin Li, Jin Liu, Yang You
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
1 code implementation • 24 Feb 2022 • Jie Zhu, Shenggui Li, Yang You
In this paper, we propose Sky Computing, a load-balanced model parallelism framework that adaptively allocates model weights to devices.
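A purely illustrative sketch of load-balanced weight allocation follows; the function name, the greedy proportional heuristic, and the numbers are assumptions, not Sky Computing's actual algorithm.

```python
# Split a stack of layers across devices in proportion to a per-device
# throughput estimate, so faster devices receive more of the weights.
def allocate_layers(layer_params, device_speed):
    """layer_params: parameter count per layer; device_speed: relative speed per device."""
    total_speed = sum(device_speed)
    total_params = sum(layer_params)
    targets = [total_params * s / total_speed for s in device_speed]

    plan, dev, assigned = [[] for _ in device_speed], 0, 0.0
    for i, p in enumerate(layer_params):
        # Move on to the next device once its proportional share is filled.
        if assigned >= targets[dev] and dev < len(device_speed) - 1:
            dev, assigned = dev + 1, 0.0
        plan[dev].append(i)
        assigned += p
    return plan

# Device 1 is ~3x faster, so it receives ~3x the parameters.
print(allocate_layers([10, 10, 10, 10, 40, 40], device_speed=[1.0, 3.0]))
# -> [[0, 1, 2], [3, 4, 5]]
```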
1 code implementation • 28 Oct 2021 • Shenggui Li, Jiarui Fang, Zhengda Bian, Hongxin Liu, Yuliang Liu, Haichen Huang, Boxiang Wang, Yang You
The success of Transformer models has pushed the deep learning model scale to billions of parameters.
1 code implementation • 12 Aug 2021 • Jiarui Fang, Zilin Zhu, Shenggui Li, Hui Su, Yang Yu, Jie zhou, Yang You
PatrickStar uses the CPU-GPU heterogeneous memory space to store the model data.
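A toy sketch of chunk-based placement across CPU and GPU memory is shown below; the chunk size and movement policy are assumptions, and PatrickStar's real memory manager is far more involved.

```python
import torch

class Chunk:
    """Parameters are grouped into fixed-size chunks that can live on either side."""
    def __init__(self, numel, device="cpu"):
        self.data = torch.empty(numel, device=device)

    def move(self, device):
        # Relocate the whole chunk between host and device memory.
        self.data = self.data.to(device, non_blocking=True)

chunks = [Chunk(1 << 20) for _ in range(8)]   # model data held as chunks in CPU memory
if torch.cuda.is_available():
    chunks[0].move("cuda")                    # fetch only the chunk currently in use
    chunks[0].move("cpu")                     # release GPU memory afterwards
```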
no code implementations • 8 Aug 2021 • Zhengda Bian, Shenggui Li, Wei Wang, Yang You
ONES automatically manages the elasticity of each job based on the training batch size, so as to maximize GPU utilization and improve scheduling efficiency.
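For intuition only, a tiny sketch of batch-size-driven elasticity follows; the heuristic and numbers are assumptions and do not reflect the ONES scheduling algorithm.

```python
# Scale the number of GPUs granted to a job with its current training batch
# size, capped by the cluster's free capacity.
def elastic_gpus(batch_size, samples_per_gpu, free_gpus):
    wanted = max(1, batch_size // samples_per_gpu)
    return min(wanted, free_gpus)

print(elastic_gpus(batch_size=512, samples_per_gpu=64, free_gpus=6))  # -> 6
```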
no code implementations • 26 May 2021 • Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You
That is, with sparse attention, our sequence parallelism enables us to train transformers with infinitely long sequences.
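A minimal, hedged sketch of the core idea of sequence parallelism: shard a long sequence along its length so each worker holds only a sub-sequence. Shapes and world size are illustrative assumptions, and the communication step is only described in the comments.

```python
import torch

world_size = 4
seq = torch.randn(1, 32768, 1024)             # (batch, seq_len, hidden)
shards = torch.chunk(seq, world_size, dim=1)  # each rank keeps one sub-sequence

# Each rank would run attention on its local shard, exchanging keys/values
# across ranks (or restricting them with sparse attention) instead of ever
# materializing the full sequence on a single device.
local = shards[0]
print(local.shape)                            # torch.Size([1, 8192, 1024])
```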
1 code implementation • 12 Apr 2021 • Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You
However, model parallelism must be employed to host large models that would otherwise not fit into the memory of a single device.
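As a hedged illustration of the simplest form of model parallelism (per-layer device placement), not the 2D tensor-parallel scheme the paper itself develops:

```python
import torch
import torch.nn as nn

class TwoDeviceMLP(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        d0 = "cuda:0" if torch.cuda.device_count() > 0 else "cpu"
        d1 = "cuda:1" if torch.cuda.device_count() > 1 else d0
        self.d0, self.d1 = d0, d1
        self.fc1 = nn.Linear(dim, dim).to(d0)  # first half of the weights on device 0
        self.fc2 = nn.Linear(dim, dim).to(d1)  # second half on device 1

    def forward(self, x):
        x = torch.relu(self.fc1(x.to(self.d0)))
        return self.fc2(x.to(self.d1))         # activations cross the device boundary

y = TwoDeviceMLP()(torch.randn(8, 1024))
```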