Search Results for author: Zuquan Song

Found 4 papers, 1 papers with code

Minder: Faulty Machine Detection for Large-scale Distributed Model Training

no code implementations4 Nov 2024 Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, Fuliang Li, Shuguang Wang, Haibin Lin, Jianxi Ye, Minlan Yu

To address the drawbacks of the time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks.

Fault Detection

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

no code implementations29 Jul 2024 Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, Chuan Wu

In production, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales.

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

1 code implementation11 Jun 2024 Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

Overall, it can achieve up to 1. 24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1. 66x and 1. 30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.

Cannot find the paper you are looking for? You can Submit a new open access paper.