no code implementations • 24 Dec 2024 • Rongxin Cheng, Yifan Peng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen
The stateful nature of large language model (LLM) serving can easily throttle precious GPU memory under load bursts or long-generation requests like chain-of-thought reasoning, causing latency spikes as incoming requests queue.
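To make the memory pressure concrete, here is a minimal sketch (not from the paper) of how per-request KV-cache memory grows linearly with generated length in a decoder-only LLM; the model shape parameters are illustrative assumptions, not figures from this work.

```python
# Illustrative sketch: per-request KV-cache memory for a decoder-only LLM.
# All model-shape defaults below are assumptions chosen for illustration.

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one request: K and V (factor of 2)
    per layer, per head, per token, at the given element width."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

# A 4k-token chain-of-thought generation pins ~4x the memory of a
# 1k-token one for the request's full lifetime, so a burst of long
# requests quickly exhausts GPU memory and queues new arrivals.
for tokens in (1024, 4096):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens} generated tokens -> {gib:.2f} GiB of KV cache")
```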
no code implementations • 6 May 2024 • Rongxin Cheng, Yifan Peng, Xingda Wei, Hongrui Xie, Rong Chen, Sijie Shen, Haibo Chen
In this paper, we are the first to characterize the trade-off between performance and index size in existing SSD-based graph and cluster indexes: to improve throughput by 5.7$\times$ and 1.7$\times$, these indexes must pay storage amplifications of 5.8$\times$ and 7.7$\times$ with respect to the dataset size, respectively.
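A small hedged sketch of the storage-amplification metric the abstract quotes (index footprint divided by raw dataset size); the throughput and amplification figures are those from the abstract, while the dataset size and variable names are hypothetical.

```python
# Sketch of the storage-amplification metric: index bytes / dataset bytes.
# Speedup and amplification values are quoted from the abstract;
# the 100 GiB dataset size is a hypothetical example.

def storage_amplification(index_bytes: float, dataset_bytes: float) -> float:
    """Index storage footprint relative to the raw dataset."""
    return index_bytes / dataset_bytes

dataset_gib = 100.0  # hypothetical dataset size
for name, speedup, amp in [("graph", 5.7, 5.8), ("cluster", 1.7, 7.7)]:
    index_gib = amp * dataset_gib  # index size implied by the amplification
    print(f"{name} index: {speedup}x throughput costs "
          f"{storage_amplification(index_gib, dataset_gib):.1f}x storage "
          f"({index_gib:.0f} GiB of index for a {dataset_gib:.0f} GiB dataset)")
```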