Search Results for author: Leyang Xue

Found 6 papers, 3 papers with code

MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching

1 code implementation • 12 Mar 2025 • Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai

This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution.

MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

no code implementations • 10 Dec 2024 • Yao Fu, Yinsicheng Jiang, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Kai Zou, Edoardo Ponti, Luo Mai

Its key innovation is a sparsity-aware CAP analysis model, the first to integrate cost, performance, and accuracy metrics into a single diagram while estimating the impact of sparsity on system performance.

Benchmarking

MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache

2 code implementations • 25 Jan 2024 • Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina

This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity.

model

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

1 code implementation • 25 Jan 2024 • Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs).

Scheduling

Enhancing the long-term performance of recommender system

no code implementations • 1 Apr 2019 • Leyang Xue, Peng Zhang, An Zeng

Notably, an optimal value n* of the ARL parameter exists in long-term recommendation, indicating a trade-off between preserving item diversity and matching user preference when maximizing long-term recommendation accuracy.

Diversity, Recommendation Systems

Predictability of diffusion-based recommender systems

no code implementations • 29 Mar 2019 • Peng Zhang, Leyang Xue, An Zeng

The results show that higher recommendation accuracy with diffusion algorithms can still be achieved by optimizing how resources are allocated on a density network.

Recommendation Systems
