Search Results for author: Zherui Liu

Found 4 papers, 1 paper with code

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

1 code implementation • 23 Feb 2024 • Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu

Training LLMs at this scale brings unprecedented challenges to training efficiency and stability.

Language Modelling, Large Language Model



Aryl: An Elastic Cluster Scheduler for Deep Learning

no code implementations • 16 Feb 2022 • Jiamin Li, Hong Xu, Yibo Zhu, Zherui Liu, Chuanxiong Guo, Cong Wang

We introduce Aryl, a new cluster scheduler to address these problems.

Management, Multiple-choice +1


Prediction of GPU Failures Under Deep Learning Workloads

no code implementations • 27 Jan 2022 • Heting Liu, Zhichao Li, Cheng Tan, Rongqiu Yang, Guohong Cao, Zherui Liu, Chuanxiong Guo

To improve the precision and stability of predictions, we propose several techniques, including parallel and cascade model-ensemble mechanisms and a sliding training method.


Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem

no code implementations • 18 Sep 2021 • Cheng Tan, Zhichao Li, Jian Zhang, Yu Cao, Sikai Qi, Zherui Liu, Yibo Zhu, Chuanxiong Guo

With MIG, A100 can be the most cost-efficient GPU ever for serving Deep Neural Networks (DNNs).

Scheduling

