Search Results for author: Kan Zhu

Found 7 papers, 3 papers with code

Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs

no code implementations • 17 Feb 2025 • Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci

Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding.

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

no code implementations • 25 Nov 2024 • Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications.

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

1 code implementation • 16 Jun 2024 • Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy.
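The idea described above can be sketched in a few lines: score each KV-cache page by an upper bound on its attention logits for the current query, then attend only over the Top-K pages. The snippet below is a minimal illustration of this kind of query-aware page selection, assuming paged keys with per-page min/max statistics; the function and tensor names are hypothetical and do not reflect Quest's actual interface.

```python
import torch

def select_topk_pages(query, key_pages, k):
    """Pick the k most 'critical' KV-cache pages for one decoding query.

    Illustrative sketch only: each page is scored by an upper bound on the
    dot product between the query and any key stored in that page, computed
    from per-page elementwise min/max of the keys.

    query:     (head_dim,)                          current query vector
    key_pages: (num_pages, page_size, head_dim)     paged key cache
    """
    page_max = key_pages.max(dim=1).values   # (num_pages, head_dim)
    page_min = key_pages.min(dim=1).values   # (num_pages, head_dim)
    # Upper bound of q . k over keys in a page: take the max or min channel
    # value depending on the sign of the corresponding query element.
    upper = torch.where(query >= 0, query * page_max, query * page_min)
    scores = upper.sum(dim=-1)               # (num_pages,)
    return scores.topk(k).indices            # indices of the critical pages
```

In a real decoder, the selected page indices would index into the paged KV cache, and standard attention would then run over only those keys and values.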

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

1 code implementation • 10 Feb 2024 • Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci

Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures have shown promising performance on various tasks.

Mixture-of-Experts

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

1 code implementation • 29 Oct 2023 • Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss.

Quantization • Sentiment Analysis
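For context on what "low-bit" quantization looks like in practice, the sketch below shows a generic symmetric per-group INT4 weight quantizer and its dequantizer. This is only an illustration of the general technique, not Atom's actual scheme (which also quantizes activations and relies on fused low-bit kernels); the function names are hypothetical.

```python
import torch

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric per-group 4-bit quantization of a weight tensor.

    Generic illustration: each group of `group_size` consecutive values
    shares one scale; values are rounded into the INT4 range [-8, 7].
    Assumes w.numel() is divisible by group_size.
    """
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)                       # (n_groups, group_size)
    scale = groups.abs().max(dim=1, keepdim=True).values / 7.0
    scale = scale.clamp(min=1e-6)                            # avoid division by zero
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q.reshape(orig_shape), scale

def dequantize_int4_groupwise(q, scale, group_size=128):
    """Recover an FP16 approximation of the original weights."""
    groups = q.reshape(-1, group_size).to(torch.float16) * scale.to(torch.float16)
    return groups.reshape(q.shape)
```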

Practical Algorithms for Learning Near-Isometric Linear Embeddings

no code implementations • 1 Jan 2016 • Jerry Luo, Kayla Shapiro, Hao-Jun Michael Shi, Qi Yang, Kan Zhu

Motivated by non-negative matrix factorization, we reformulate our problem as a Frobenius norm minimization problem, which we solve with the Alternating Direction Method of Multipliers (ADMM), yielding an algorithm we call FroMax.

Dimensionality Reduction
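For orientation, near-isometric linear embedding problems are typically posed over the normalized secants of the data. The block below is a hedged sketch of the kind of Frobenius-norm program ADMM could be applied to, with tolerance δ; it may differ in detail from the paper's exact FroMax objective.

```latex
% Normalized secants: v_{ij} = (x_i - x_j) / \|x_i - x_j\|_2.
% Near-isometry asks the embedding \Psi to satisfy, for tolerance \delta,
%   | \|\Psi v_{ij}\|_2^2 - 1 | \le \delta   for all i \ne j.
% Substituting P = \Psi^\top \Psi \succeq 0 gives a matrix program that
% ADMM can split into objective and constraint-projection steps:
\begin{equation*}
  \min_{P \succeq 0} \; \|P\|_F
  \quad \text{s.t.} \quad
  \bigl| v_{ij}^{\top} P \, v_{ij} - 1 \bigr| \le \delta
  \quad \text{for all } i \ne j .
\end{equation*}
```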
