no code implementations • 28 Feb 2025 • Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Stephanie Wang, Arvind Krishnamurthy, Rohan Kadekodi, Luis Ceze, Baris Kasikci
To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements.
no code implementations • 17 Feb 2025 • Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci
Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding.
no code implementations • 25 Nov 2024 • Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica
Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications.
1 code implementation • 16 Jun 2024 • Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han
By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy.
1 code implementation • 10 Feb 2024 • Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci
Large Language Models (LLMs) with the Mixture-of-Experts (MoE) architectures have shown promising performance on various tasks.
1 code implementation • 29 Oct 2023 • Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci
To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss.
no code implementations • 1 Jan 2016 • Jerry Luo, Kayla Shapiro, Hao-Jun Michael Shi, Qi Yang, Kan Zhu
Motivated by non-negative matrix factorization, we reformulate our problem into a Frobenius norm minimization problem, which is solved by the Alternating Direction Method of Multipliers (ADMM) and develop an algorithm, FroMax.