Search Results for author: Mengdi Wu

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests.

Besides, an efficient deployment flow for the mobile CPU is developed, achieving up to 7.46$\times$ inference acceleration on an octa-core ARM CPU.

