Search Results for author: Hiva Mohammadzadeh

Found 2 papers, 1 paper with code

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

1 code implementation • 31 Jan 2024 • Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

LLMs are seeing growing use for applications such as document analysis and summarization that require large context windows; with these large context windows, KV cache activations surface as the dominant contributor to memory consumption during inference. (A minimal quantization sketch follows this entry.)

Quantization
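
To make the abstract's claim concrete: a common pattern in KV cache quantization, which KVQuant builds on, is to quantize keys with per-channel scales (key outliers cluster in fixed channels) and values with per-token scales. The NumPy sketch below illustrates only that basic pattern under invented names; it is not the KVQuant implementation, which additionally uses techniques such as pre-RoPE key quantization and non-uniform quantization described in the paper.

    import numpy as np

    def quantize_kv(keys, values, n_bits=4):
        """Illustrative low-bit KV cache quantization (hypothetical helper,
        not the KVQuant code).

        keys, values: float arrays of shape (seq_len, head_dim).
        Keys get one scale per channel (column); values get one scale
        per token (row).
        """
        qmax = 2 ** (n_bits - 1) - 1  # symmetric signed range, e.g. [-7, 7] for 4-bit

        # Per-channel scales for keys: reduce over the sequence axis.
        k_scale = np.abs(keys).max(axis=0, keepdims=True) / qmax + 1e-8
        k_q = np.clip(np.round(keys / k_scale), -qmax, qmax).astype(np.int8)

        # Per-token scales for values: reduce over the channel axis.
        v_scale = np.abs(values).max(axis=1, keepdims=True) / qmax + 1e-8
        v_q = np.clip(np.round(values / v_scale), -qmax, qmax).astype(np.int8)

        return (k_q, k_scale), (v_q, v_scale)

    def dequantize(q, scale):
        # Recover approximate float activations for attention computation.
        return q.astype(np.float32) * scale

    # Usage: quantize a 1024-token cache and check reconstruction error.
    rng = np.random.default_rng(0)
    K = rng.standard_normal((1024, 128)).astype(np.float32)
    V = rng.standard_normal((1024, 128)).astype(np.float32)
    (k_q, k_s), (v_q, v_s) = quantize_kv(K, V)
    print(np.abs(dequantize(k_q, k_s) - K).mean())

Storing int8 codes (here standing in for 4-bit values; real 4-bit packing would halve memory again) plus a handful of float scales is what lets the quantized cache replace the fp16 activations that otherwise dominate memory at long context lengths.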

SPEED: Speculative Pipelined Execution for Efficient Decoding

no code implementations • 18 Oct 2023 • Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

For Transformer decoders that employ parameter sharing, the memory operations for tokens executing in parallel can be amortized, which accelerates generative LLM inference. (A toy sketch of this amortization follows below.)
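
The intuition is that when the same weights are reused at every pipeline stage, loading them from memory once can serve a whole batch of speculative future tokens alongside the current one. The toy below sketches only that amortization argument under invented names; it is not the SPEED system, and it omits the speculation, verification, and rollback machinery a real implementation needs.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64
    # One shared decoder block: the same W is reused at every depth
    # (parameter sharing), so its memory traffic is paid once per batch.
    W = rng.standard_normal((d, d)) / np.sqrt(d)

    def shared_block(x):
        # Toy stand-in for a decoder layer with shared parameters.
        return np.tanh(x @ W)

    def decode_pipelined(tokens, depth=4):
        """Run the current token plus speculative future tokens through
        the shared block together; one weight load serves all rows."""
        h = tokens
        for _ in range(depth):
            h = shared_block(h)
        return h

    # Usage: a current token plus 3 speculative tokens in one batch.
    batch = rng.standard_normal((4, d))
    print(decode_pipelined(batch).shape)  # (4, 64)

Because the speculative tokens ride along with the current token through the same shared weights, the per-token cost of the memory-bound weight loads drops roughly in proportion to the batch size, which is the efficiency the abstract refers to.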
