1 code implementation • 23 Feb 2024 • Lu Ye, Ze Tao, Yong Huang, Yang Li
In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that detects matching prompt prefixes across multiple requests and shares their key/value tensors in memory at runtime, improving the memory utilization of the KV cache.
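To make the idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of how a chunked prefix tree can let requests with a common prompt prefix reference the same KV-cache chunks instead of storing duplicates. The chunk size, node layout, and the `KVChunk` placeholder are assumptions made for this example.

```python
# Sketch of prefix-aware KV-cache sharing via a chunked prefix tree.
# Assumption: KVChunk stands in for the real per-chunk key/value tensors.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # tokens per chunk; real systems use larger chunks


@dataclass
class KVChunk:
    """Placeholder for the key/value tensors of one chunk of tokens."""
    tokens: Tuple[int, ...]
    ref_count: int = 0  # number of live requests sharing this chunk


@dataclass
class TrieNode:
    chunk: KVChunk = None
    children: Dict[Tuple[int, ...], "TrieNode"] = field(default_factory=dict)


class ChunkedPrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prompt: List[int]) -> List[KVChunk]:
        """Walk the trie chunk by chunk; reuse KV chunks for matching
        prefixes and allocate new ones only where the prompt diverges."""
        node, chunks = self.root, []
        for i in range(0, len(prompt) - len(prompt) % CHUNK_SIZE, CHUNK_SIZE):
            key = tuple(prompt[i:i + CHUNK_SIZE])
            if key not in node.children:
                node.children[key] = TrieNode(chunk=KVChunk(tokens=key))
            node = node.children[key]
            node.chunk.ref_count += 1
            chunks.append(node.chunk)
        return chunks  # KV chunks this request will read during attention


if __name__ == "__main__":
    cache = ChunkedPrefixCache()
    # Two requests that share the same 8-token system-prompt prefix.
    a = cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
    b = cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 20, 21, 22, 23])
    # The first two chunks are the exact same objects, i.e. stored once.
    print(a[0] is b[0], a[1] is b[1], a[2] is b[2])  # True True False
```

In this sketch, the shared prefix chunks are stored once and reference-counted, so memory for common system prompts is not duplicated across requests; the attention kernel would then read a request's KV data as the list of chunks returned by `insert`.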