Search Results for author: Woosuk Kwon

Found 4 papers, 4 papers with code

Efficient Memory Management for Large Language Model Serving with PagedAttention

4 code implementations · 12 Sep 2023 · Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage.

Language Modelling, Large Language Model

A Fast Post-Training Pruning Framework for Transformers

2 code implementations · 29 Mar 2022 · Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami

To address this, we propose a fast post-training pruning framework for Transformers that does not require any retraining.

Learned Token Pruning for Transformers

1 code implementation · 2 Jul 2021 · Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

We extensively test the performance of LTP on GLUE tasks and show that our method outperforms the prior state-of-the-art token pruning methods by up to ~2.5% higher accuracy with the same amount of FLOPs.

Sentence

Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

1 code implementation · NeurIPS 2020 · Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, Byung-Gon Chun

Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of computation assigned to GPUs.

Scheduling
