Search Results for author: Yuxin Lai

Found 1 paper, 0 papers with code

KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management

no code implementations • 24 Dec 2024 • Rongxin Cheng, Yifan Peng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen

The stateful nature of large language model (LLM) serving can easily throttle precious GPU memory under load bursts or long-generation requests such as chain-of-thought reasoning, causing latency spikes due to queuing of incoming requests.
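To see why long generations strain memory, the back-of-the-envelope sketch below estimates KV-cache growth during serving. The model dimensions, request lengths, and burst size are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope estimate of KV-cache growth in LLM serving.
# All concrete numbers below are illustrative assumptions.

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (keys + values, fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical 13B-class model: 40 layers, 40 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes_per_token(num_layers=40, num_kv_heads=40, head_dim=128)

# A long chain-of-thought request: 512-token prompt + 4096 generated tokens.
tokens_per_request = 512 + 4096
per_request_gib = per_token * tokens_per_request / 2**30

# A burst of 64 concurrent requests must share the GPU with the model's
# parameters, so new arrivals end up queuing once memory is exhausted.
burst_gib = per_request_gib * 64
print(f"KV cache per request: {per_request_gib:.2f} GiB")
print(f"KV cache for a burst of 64 requests: {burst_gib:.1f} GiB")
```

Under these assumptions a single long request holds roughly 3.5 GiB of KV cache, and a burst of 64 such requests needs over 200 GiB, far beyond a single GPU once parameters are also resident, which is the pressure the paper's parameter-centric memory management targets.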

Language Modeling • Language Modelling • +2
