Search Results for author: Jiaming Tang

Found 4 papers, 3 papers with code

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

1 code implementation · 14 Oct 2024 · Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han

Based on this insight, we introduce DuoAttention, a framework that applies a full KV cache only to retrieval heads while using a lightweight, constant-length KV cache for streaming heads. This reduces the LLM's decoding and pre-filling memory and latency without compromising its long-context abilities.
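The two cache policies can be illustrated with a toy sketch (not the authors' implementation; the `sink`/`window` sizes and the pruning function are assumptions for illustration): retrieval heads keep the full KV cache, while streaming heads keep only a few initial "attention sink" tokens plus a recent window.

```python
import numpy as np

def prune_kv_cache(kv, is_retrieval_head, sink=4, window=8):
    """Toy sketch of DuoAttention's head-specific caching.
    Retrieval heads keep the full KV cache; streaming heads keep
    only the first `sink` tokens plus the most recent `window` tokens,
    so their cache stays constant-length as the context grows."""
    if is_retrieval_head or kv.shape[0] <= sink + window:
        return kv
    return np.concatenate([kv[:sink], kv[-window:]], axis=0)

# A 100-token cache for one head, head dimension 16.
kv = np.random.randn(100, 16)
print(prune_kv_cache(kv, is_retrieval_head=True).shape)   # (100, 16): full cache
print(prune_kv_cache(kv, is_retrieval_head=False).shape)  # (12, 16): sink + window
```

The memory saving follows directly: streaming heads cost O(sink + window) per head regardless of context length, while only the retrieval heads pay the O(context) cost.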

Tasks: Quantization · Retrieval

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

1 code implementation · 16 Jun 2024 · Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy.
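A minimal sketch of query-aware page selection in the spirit of Quest (a simplification, not the paper's code; the per-page min/max scoring shown here is an assumed reduction of its criticality estimate): each KV page is scored by an upper bound on the query-key dot product derived from elementwise min/max key statistics, and only the Top-K pages are loaded for attention.

```python
import numpy as np

def select_topk_pages(query, keys, page_size=16, k=2):
    """Toy sketch: score each KV-cache page with an upper bound on q.k
    (per-dimension max of q*kmin and q*kmax, summed), then return the
    indices of the Top-K critical pages to load for self-attention."""
    n_pages = keys.shape[0] // page_size
    pages = keys[: n_pages * page_size].reshape(n_pages, page_size, -1)
    # Per-page elementwise min/max of the key vectors (page metadata).
    kmin, kmax = pages.min(axis=1), pages.max(axis=1)
    # Upper bound on q.k for any key in the page.
    scores = np.maximum(query * kmin, query * kmax).sum(axis=-1)
    return np.argsort(scores)[::-1][:k]

# 4 pages of 4 tokens each, head dimension 2; page 2 is clearly critical.
keys = np.full((16, 2), 0.1)
keys[8:12] = 10.0
print(select_topk_pages(np.ones(2), keys, page_size=4, k=1))  # [2]
```

Because only K pages of keys and values are fetched from memory, the attention cost scales with K rather than with the full context length.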

DCRMTA: Unbiased Causal Representation for Multi-touch Attribution

no code implementations · 16 Jan 2024 · Jiaming Tang

Multi-touch attribution (MTA) currently plays a pivotal role in fairly estimating the contribution of each advertising touchpoint towards conversion behavior, deeply influencing budget allocation and advertising recommendation.

Task: Counterfactual
