Search Results for author: Qiuli Mao

Found 3 papers, 0 papers with code

FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation

no code implementations • 28 Apr 2025 • Ke Hong, Xiuhong Li, Minxu Liu, Qiuli Mao, Tianqi Wu, Zixiao Huang, Lufang Chen, Zhong Wang, Yichong Zhang, Zhenhua Zhu, Guohao Dai, Yu Wang

We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives.

semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

no code implementations • 28 Apr 2025 • Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang

In this paper, we identify that the advantage of the disaggregated system lies in the disaggregated computation, i.e., partitioning the computational resource to enable the asynchronous computation of two phases.

Large Language Model, Scheduling

FlashDecoding++: Faster Large Language Model Inference on GPUs

no code implementations • 2 Nov 2023 • Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

A single and static dataflow may lead to a 50.25% performance loss for GEMMs of different shapes in LLM inference.

Language Modelling
