1 code implementation • 12 Mar 2025 • Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai
This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution.
no code implementations • 10 Dec 2024 • Yao Fu, Yinsicheng Jiang, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Kai Zou, Edoardo Ponti, Luo Mai
Its key innovation is a sparsity-aware CAP (cost-accuracy-performance) analysis model, the first to integrate cost, accuracy, and performance metrics into a single diagram while estimating the impact of sparsity on system performance.
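To make the idea of such a diagram concrete, here is a minimal sketch of a CAP-style comparison that keeps only configurations not dominated on all three axes. The configuration names, numbers, and the simple Pareto rule are illustrative assumptions, not the paper's actual model.

```python
def pareto_front(systems):
    """Toy cost-accuracy-performance comparison: drop any configuration
    that another configuration beats on every axis. Illustrative only."""
    def dominates(a, b):
        # a dominates b if it is no worse everywhere (lower cost, higher
        # accuracy, higher throughput) and strictly better somewhere.
        no_worse = (a["cost"] <= b["cost"] and a["acc"] >= b["acc"]
                    and a["tput"] >= b["tput"])
        better = (a["cost"] < b["cost"] or a["acc"] > b["acc"]
                  or a["tput"] > b["tput"])
        return no_worse and better
    return [s for s in systems
            if not any(dominates(o, s) for o in systems if o is not s)]

# Made-up numbers for three hypothetical serving configurations.
configs = [
    {"name": "dense-70B", "cost": 9.0, "acc": 0.78, "tput": 300},
    {"name": "moe-8x7B",  "cost": 4.0, "acc": 0.77, "tput": 900},
    {"name": "moe-cheap", "cost": 4.5, "acc": 0.70, "tput": 800},
]
print([s["name"] for s in pareto_front(configs)])  # moe-cheap is dominated
```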
2 code implementations • 25 Jan 2024 • Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina
This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity.
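For intuition about what serving an MoE model under limited GPU memory involves, the sketch below shows the general offloading pattern: expert weights live in host memory and a small working set is cached on the accelerator. The LRU policy and all names here are our illustrative assumptions, not necessarily MoE-Infinity's actual design.

```python
import collections
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

class ExpertCache:
    """Keep every expert's weights in host memory and cache a small
    working set on the device, loading experts on demand and evicting
    the least-recently-used one when the cache is full."""

    def __init__(self, experts_cpu, capacity):
        self.experts_cpu = experts_cpu          # expert_id -> CPU weight tensor
        self.capacity = capacity                # max experts resident on device
        self.cache = collections.OrderedDict()  # expert_id -> device tensor (LRU)

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # refresh LRU position
            return self.cache[expert_id]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)      # evict least-recently-used expert
        weights = self.experts_cpu[expert_id].to(DEVICE, non_blocking=True)
        self.cache[expert_id] = weights
        return weights

# Toy usage: 8 experts, at most 2 resident on the device at once.
experts = {i: torch.randn(1024, 1024) for i in range(8)}
cache = ExpertCache(experts, capacity=2)
w = cache.get(3)  # loads expert 3; a later get(3) hits the cache
```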
1 code implementation • 25 Jan 2024 • Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs).
no code implementations • 1 Apr 2019 • Leyang Xue, Peng Zhang, An Zeng
Notably, an optimal parameter n* of ARL exists in long-term recommendation, indicating a trade-off between preserving item diversity and matching users' preferences to maximize long-term recommendation accuracy.
no code implementations • 29 Mar 2019 • Peng Zhang, Leyang Xue, An Zeng
The results show that higher recommendation accuracy with diffusion-based algorithms can still be achieved by optimizing how resource is allocated on a dense network.
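Since this entry builds on diffusion-based recommendation, a minimal sketch of the classic mass-diffusion (ProbS) scoring rule on a user-item bipartite network may help; the toy data and function name are ours, and the paper's optimized resource-allocation scheme differs from this textbook baseline.

```python
import numpy as np

def mass_diffusion_scores(A, user):
    """Classic mass-diffusion (ProbS) recommendation on a bipartite
    network. A is a (num_users x num_items) 0/1 adjacency matrix;
    returns a score for every item for the given user."""
    item_deg = A.sum(axis=0)  # k(item): how many users collected each item
    user_deg = A.sum(axis=1)  # k(user): how many items each user collected
    # Step 1: each collected item splits one unit of resource equally
    # among its users. Step 2: each user redistributes the received
    # resource equally across their collected items.
    # W[i, j] is the resource flowing from item j to item i.
    with np.errstate(divide="ignore", invalid="ignore"):
        W = (A / user_deg[:, None]).T @ (A / item_deg[None, :])
        W = np.nan_to_num(W)
    scores = W @ A[user]            # initial resource sits on the user's items
    scores[A[user] > 0] = -np.inf   # don't re-recommend collected items
    return scores

# Toy example: 3 users, 4 items.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])
print(mass_diffusion_scores(A, user=0))  # rank uncollected items for user 0
```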