Search Results for author: Xingda Wei

Found 2 papers, 0 papers with code

KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management

no code implementations24 Dec 2024 Rongxin Cheng, Yifan Peng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen

The stateful nature of large language model (LLM) serving can easily throttle precious GPU memory under load bursts or long-generation requests like chain-of-thought reasoning, causing latency spikes due to the queuing of incoming requests.

Language Modeling, Language Modelling +2

Characterizing the Dilemma of Performance and Index Size in Billion-Scale Vector Search and Breaking It with Second-Tier Memory

no code implementations6 May 2024 Rongxin Cheng, Yifan Peng, Xingda Wei, Hongrui Xie, Rong Chen, Sijie Shen, Haibo Chen

In this paper, we are the first to characterize the trade-off between performance and index size in existing SSD-based graph and cluster indexes: to improve throughput by 5.7$\times$ and 1.7$\times$, these indexes pay a 5.8$\times$ and 7.7$\times$ storage amplification with respect to the dataset size, respectively.
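The storage amplification figures quoted in the abstract can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only; the dataset size is a hypothetical figure, not one taken from the paper.

```python
# Back-of-envelope sketch of the performance/index-size trade-off described
# in the abstract: a graph index pays ~5.8x storage amplification (for ~5.7x
# throughput), a cluster index pays ~7.7x (for ~1.7x throughput).
def index_size_gib(dataset_gib: float, amplification: float) -> float:
    """Storage footprint of an index, given its amplification factor
    relative to the raw dataset size."""
    return dataset_gib * amplification

dataset_gib = 100.0  # hypothetical billion-scale dataset size in GiB
graph_index = index_size_gib(dataset_gib, 5.8)    # ~580 GiB on SSD
cluster_index = index_size_gib(dataset_gib, 7.7)  # ~770 GiB on SSD
```

This is the dilemma the paper characterizes: the faster index variants require several times more storage than the dataset itself, which second-tier memory is proposed to break.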

RAG
