no code implementations • 28 Feb 2024 • June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee
Key-Value (KV) caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models (LLMs).
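Not taken from the paper — a minimal NumPy sketch of the KV-caching idea the abstract refers to: during autoregressive decoding, keys and values of past tokens are cached so that each step only projects the newest token. The toy decoding loop and all variable names are illustrative assumptions.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Toy decoding loop: the KV cache grows by one row per generated token,
# so each step projects only the newest token instead of the whole prefix.
d, steps = 8, 5
rng = np.random.default_rng(0)
Wq, Wk, Wv = rng.standard_normal((3, d, d))
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
x = rng.standard_normal(d)                     # embedding of the current token
for _ in range(steps):
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    x = attend(x @ Wq, K_cache, V_cache)       # stand-in for the next token embedding
```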
1 code implementation • 27 Feb 2024 • Sunghyeon Woo, Baeseong Park, Byeongwook Kim, Minjung Jo, Sejung Kwon, Dongsuk Jeon, Dongsoo Lee
In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs while maintaining accuracy.
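A hedged PyTorch sketch of the dropping-backward-propagation idea described above: the forward output is kept, but a randomly chosen residual branch is computed outside the autograd graph, so no gradients are computed for it. This illustrates the concept only; it is not the authors' DropBP implementation, and the block structure and drop probability are assumptions.

```python
import torch
import torch.nn as nn

class DropBackpropBlock(nn.Module):
    """Residual block whose branch is excluded from backward with probability p_drop."""
    def __init__(self, dim, p_drop=0.5):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.p_drop = p_drop

    def forward(self, x):
        if self.training and torch.rand(()) < self.p_drop:
            with torch.no_grad():
                branch = self.ff(x)    # no graph is built: backward skips this branch
            return x + branch          # gradients still flow through the skip path
        return x + self.ff(x)          # normal path: full forward and backward

# The final linear layer always participates in backward, so the loss stays differentiable.
model = nn.Sequential(*[DropBackpropBlock(16) for _ in range(4)], nn.Linear(16, 16))
loss = model(torch.randn(2, 16)).pow(2).mean()
loss.backward()                        # cheaper backward whenever branches were dropped
```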
1 code implementation • 27 Sep 2023 • Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee
Weight-only quantization can be a promising approach, but sub-4-bit quantization remains a challenge due to large-magnitude activation outliers.
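Not from the paper — a minimal sketch of plain round-to-nearest, per-output-channel weight-only quantization, the baseline setting in which the activation outliers mentioned above make sub-4-bit accuracy hard to retain. The bit-width and scaling choices are assumptions.

```python
import numpy as np

def quantize_weights_per_channel(W, bits=4):
    """Round-to-nearest weight-only quantization with one scale per output channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax        # per-row scale
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
q, s = quantize_weights_per_channel(W)
x = rng.standard_normal(16)                                    # activations stay in float
print(np.abs(W @ x - (q * s) @ x).max())                       # quantization error on the output
```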
no code implementations • 1 Jun 2023 • Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, Dongsoo Lee
Since PTQ schemes that reconstruct each layer or block output have proven effective at improving quantized model performance, recent works have developed algorithms that devise and learn new weight-rounding schemes so as to better reconstruct these outputs.
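Not this paper's algorithm — a simplified, AdaRound-style sketch of the layer-output reconstruction objective the abstract refers to: a relaxed rounding variable is optimized so that the quantized layer best reproduces the full-precision layer's output on calibration data. The single 4-bit scale, optimizer settings, and tensor sizes are assumptions.

```python
import torch

torch.manual_seed(0)
W = torch.randn(32, 64)                        # full-precision layer weight
X = torch.randn(256, 64)                       # calibration inputs
scale = W.abs().max() / 7                      # one 4-bit scale for the whole tensor (toy)
W_floor = torch.floor(W / scale)

v = torch.zeros_like(W, requires_grad=True)    # relaxed up/down rounding variable
opt = torch.optim.Adam([v], lr=1e-2)
for _ in range(200):
    W_q = (W_floor + torch.sigmoid(v)) * scale
    loss = ((X @ W.T - X @ W_q.T) ** 2).mean() # layer-output reconstruction error
    opt.zero_grad(); loss.backward(); opt.step()

W_q = (W_floor + (torch.sigmoid(v) > 0.5).float()) * scale   # commit to hard rounding
```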
no code implementations • 8 Oct 2022 • Se Jung Kwon, Jeonghoon Kim, Jeongin Bae, Kang Min Yoo, Jin-Hwa Kim, Baeseong Park, Byeongwook Kim, Jung-Woo Ha, Nako Sung, Dongsoo Lee
To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, which consists of post-training quantization of the pre-trained language model followed by fine-tuning only some parts of the quantized parameters for a target task.
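A hedged sketch of the AlphaTuning recipe as summarized above, using 1-bit binary codes for brevity (an assumption; the method itself is built on multi-bit binary-coding quantization): after post-training quantization the codes are frozen, and only the per-row scale factors (and bias) remain trainable for the downstream task.

```python
import torch
import torch.nn as nn

class AlphaTunedLinear(nn.Module):
    """Linear layer with frozen binary codes and trainable scale factors (sketch)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        W = linear.weight.data
        self.register_buffer("B", torch.sign(W))            # frozen 1-bit codes
        self.alpha = nn.Parameter(W.abs().mean(dim=1))      # trainable per-row scales
        self.bias = nn.Parameter(linear.bias.data.clone())

    def forward(self, x):
        return x @ (self.alpha.unsqueeze(1) * self.B).T + self.bias

layer = AlphaTunedLinear(nn.Linear(16, 8))
out = layer(torch.randn(4, 16))
# Only alpha and bias are updated during fine-tuning; the codes B stay fixed.
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```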
no code implementations • 22 Sep 2022 • Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim
DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.
no code implementations • 20 Jun 2022 • Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee
By reducing the latency of individual GPUs and the overall inference process for large-scale language models, LUT-GEMM provides significant performance improvements in inference.
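Not the paper's kernel — a NumPy sketch of the lookup-table idea behind LUT-GEMM for a single {-1, +1} bit-plane: for each group of activations, all signed partial sums are precomputed once, so every weight row only indexes the table instead of performing multiplications. The group size and the single bit-plane are simplifying assumptions.

```python
import numpy as np

def lut_matvec_binary(B, x, mu=8):
    """Compute B @ x for a {-1, +1} matrix B with table lookups instead of multiplies."""
    n = x.shape[0]
    out = np.zeros(B.shape[0])
    for g in range(n // mu):
        xs = x[g * mu:(g + 1) * mu]
        bits = (np.arange(2 ** mu)[:, None] >> np.arange(mu)) & 1
        table = (1 - 2 * bits) @ xs                          # all 2**mu signed partial sums
        idx = ((B[:, g * mu:(g + 1) * mu] < 0) * (1 << np.arange(mu))).sum(axis=1)
        out += table[idx]                                    # one lookup per row and group
    return out

rng = np.random.default_rng(0)
B = rng.choice([-1.0, 1.0], size=(4, 16))
x = rng.standard_normal(16)
print(np.allclose(lut_matvec_binary(B, x), B @ x))           # matches the dense product
```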
1 code implementation • 27 May 2022 • Dongjun Kim, Byeonghu Na, Se Jung Kwon, Dongsoo Lee, Wanmo Kang, Il-Chul Moon
Although diverse variants of diffusion models exist, only a few works have investigated extending the linear diffusion into a nonlinear diffusion process.
Ranked #4 on Image Generation on CelebA 64x64
no code implementations • 29 Sep 2021 • Dongjun Kim, Byeonghu Na, Se Jung Kwon, Dongsoo Lee, Wanmo Kang, Il-Chul Moon
Specifically, PDM utilizes a flow to non-linearly transform a data variable into a latent variable, and then applies the diffusion process with a linear diffusing mechanism to the transformed latent distribution.
no code implementations • 5 May 2021 • Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Baeseong Park, Yongkweon Jeon
While model compression is increasingly important because of large neural network size, compression-aware training is challenging as it requires sophisticated model modifications and longer training time. In this paper, we introduce regularization frequency (i.e., how often compression is performed during training) as a new regularization technique for a practical and efficient compression-aware training method.
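Not the paper's training recipe — a small PyTorch sketch of what "regularization frequency" could look like in practice: every `freq` optimizer steps, the weights are projected onto the target compressed format (here a simple 4-bit round-to-nearest), so the projection interval acts as the regularization knob. The task, model, and quantizer are illustrative assumptions.

```python
import torch
import torch.nn as nn

def project_to_4bit(model):
    """Round every parameter to a 4-bit grid in place (toy compression format)."""
    with torch.no_grad():
        for p in model.parameters():
            scale = p.abs().max() / 7 + 1e-12
            p.copy_(torch.clamp(torch.round(p / scale), -8, 7) * scale)

model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
freq = 10                                         # compress once every 10 steps
for step in range(100):
    x = torch.randn(32, 16)
    loss = (model(x) - x[:, :4]).pow(2).mean()    # toy regression target
    opt.zero_grad(); loss.backward(); opt.step()
    if (step + 1) % freq == 0:
        project_to_4bit(model)                    # how often this runs is the knob
```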
no code implementations • ICLR 2022 • Baeseong Park, Se Jung Kwon, Daehwan Oh, Byeongwook Kim, Dongsoo Lee
Then, in an effort to push the compression ratio toward the theoretical maximum (given by entropy), we propose a sequential fixed-to-fixed encoding scheme.
no code implementations • 5 May 2021 • Byeongwook Kim, Dongsoo Lee, Yeonju Ro, Yongkweon Jeon, Se Jung Kwon, Baeseong Park, Daehwan Oh
When the number of quantization bits is relatively low, however, non-convex optimization is unavoidable to improve model accuracy.
no code implementations • 1 Jan 2021 • Se Jung Kwon, Dongsoo Lee, Yongkweon Jeon, Byeongwook Kim, Bae Seong Park, Yeonju Ro
As a practical model compression technique, parameter quantization is effective especially for language models associated with a large memory footprint.
no code implementations • 14 Dec 2020 • Chanwoo Kim, Dhananjaya Gowda, Dongsoo Lee, Jiyeon Kim, Ankur Kumar, Sungsoo Kim, Abhinav Garg, Changwoo Han
Conventional speech recognition systems comprise a large number of discrete components such as an acoustic model, a language model, a pronunciation model, a text-normalizer, an inverse-text normalizer, a decoder based on a Weighted Finite State Transducer (WFST), and so on.
Automatic Speech Recognition (ASR) +3
no code implementations • Findings of the Association for Computational Linguistics 2020 • Insoo Chung, Byeongwook Kim, Yoonjung Choi, Se Jung Kwon, Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee
Our analysis shows that for a given number of quantization bits, each block of Transformer contributes to translation quality and inference computations in different manners.
no code implementations • NeurIPS 2020 • Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Yongkweon Jeon, Baeseong Park, Jeongin Yun
Quantization based on the binary codes is gaining attention because each quantized bit can be directly utilized for computations without dequantization using look-up tables.
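Not from the paper — a short sketch of greedy binary-coding quantization, the family the abstract refers to: a weight vector is approximated as a sum of scaled {-1, +1} bit planes, and each bit plane can then be consumed directly by LUT-style kernels. The number of bits and the greedy residual fit are assumptions.

```python
import numpy as np

def binary_code_quantize(w, num_bits=3):
    """Greedily fit w ~ sum_i alpha_i * b_i with each b_i in {-1, +1}."""
    residual = w.copy()
    alphas, codes = [], []
    for _ in range(num_bits):
        b = np.where(residual >= 0, 1.0, -1.0)
        alpha = np.abs(residual).mean()          # least-squares scale for this bit plane
        alphas.append(alpha); codes.append(b)
        residual = residual - alpha * b
    return np.array(alphas), np.stack(codes)

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
alphas, codes = binary_code_quantize(w)
w_hat = (alphas[:, None] * codes).sum(axis=0)
print(np.abs(w - w_hat).mean())                  # residual quantization error
```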
no code implementations • 20 May 2020 • Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee
The success of quantization in practice, hence, relies on an efficient computation engine design, especially for matrix multiplication, which is the core computation in most DNNs.
no code implementations • 25 Sep 2019 • Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Parichay Kapoor, Gu-Yeon Wei
In this paper, we propose a new network pruning technique that generates a low-rank binary index matrix to compress index data significantly.
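A rough illustration, not the paper's construction: the pruning index matrix is represented as the Boolean product of two small binary factors, so only the factors need to be stored. The sizes, densities, and random factors below are assumptions chosen only to show the storage saving.

```python
import numpy as np

np.random.seed(0)
n, rank = 64, 4
A = (np.random.rand(n, rank) < 0.3).astype(np.uint8)   # small binary factor
B = (np.random.rand(rank, n) < 0.3).astype(np.uint8)   # small binary factor
M = (A @ B) > 0                                         # n x n binary index (pruning mask)
W = np.random.randn(n, n)
W_pruned = W * M                                        # apply the mask to the weights
print("mask density:", M.mean())
print("index bits:", 2 * n * rank, "vs full mask:", n * n)
```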
no code implementations • 25 Sep 2019 • Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Yongkweon Jeon, Baeseong Park, Jeongin Yun, Gu-Yeon Wei
Using various models, we show that simple weight updates to comply with compression formats, together with a long NR period, are enough to achieve a high compression ratio and high model accuracy.
no code implementations • CVPR 2020 • Se Jung Kwon, Dongsoo Lee, Byeongwook Kim, Parichay Kapoor, Baeseong Park, Gu-Yeon Wei
Model compression techniques, such as pruning and quantization, are becoming increasingly important for reducing memory footprint and the amount of computation.
no code implementations • 24 May 2019 • Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Gu-Yeon Wei
Low-rank approximation is an effective model compression technique that reduces not only parameter storage requirements but also the amount of computation.
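A minimal sketch of low-rank approximation as a compression step: truncated SVD replaces an m x n weight matrix with an m x r and an r x n factor, so both storage and multiply-adds shrink when r is small. The matrix sizes and rank are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 32
A, B = U[:, :r] * S[:r], Vt[:r, :]                 # W ~ A @ B, rank-r factors
x = rng.standard_normal(512)
y_full, y_low = W @ x, A @ (B @ x)                 # two skinny matvecs instead of one large one
print(np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))   # relative approximation error
```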
no code implementations • 14 May 2019 • Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Parichay Kapoor, Gu-Yeon Wei
Pruning is an efficient model compression technique to remove redundancy in the connectivity of deep neural networks (DNNs).
no code implementations • ICLR 2019 • Daehyun Ahn, Dongsoo Lee, Taesu Kim, Jae-Joon Kim
In this paper, we propose a new sparse matrix format in order to enable a highly parallel decoding process of the entire sparse matrix.
no code implementations • 30 Oct 2018 • Dongsoo Lee, Parichay Kapoor, Byeongwook Kim
Model compression has been introduced to reduce the required hardware resources while maintaining the model accuracy.
no code implementations • 27 Sep 2018 • Parichay Kapoor, Dongsoo Lee, Byeongwook Kim, Saehyung Lee
We present a non-intrusive quantization technique based on re-training the full precision model, followed by directly optimizing the corresponding binary model.
no code implementations • 29 May 2018 • Dongsoo Lee, Byeongwook Kim
We show that iterative retraining generates new sets of weights which can be quantized with decreasing quantization loss at each iteration.
no code implementations • ICLR 2018 • Dongsoo Lee, Daehyun Ahn, Taesu Kim, Pierce I. Chuang, Jae-Joon Kim
Hence, pruning is usually restricted to inference with a batch size of one, for which an efficient parallel matrix-vector multiplication method exists.
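Not from the paper — a small SciPy sketch of the batch-size-one setting mentioned above: a magnitude-pruned weight matrix stored in CSR and applied with sparse matrix-vector multiplication, the case in which pruning speedups are usually realized. The pruning threshold and matrix sizes are arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256))
W[np.abs(W) < 1.0] = 0.0                 # toy magnitude pruning
W_sparse = csr_matrix(W)                 # compressed sparse row storage
x = rng.standard_normal(256)             # a single input vector (batch size 1)
print(np.allclose(W_sparse @ x, W @ x))  # sparse matvec matches the dense result
```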