Search Results for author: Chengji Yao

Found 1 paper, 1 paper with code

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

1 code implementation • 11 Jun 2024 • Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.
