Search Results for author: Dongsoo Lee

Found 27 papers, 3 papers with code

No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization

no code implementations 28 Feb 2024 June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee

Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models (LLMs).
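
The excerpt alone does not specify the algorithm, so the following is only a rough numpy sketch of the general idea of importance-aware mixed precision: tokens with the highest (here, hypothetical) attention-derived importance scores keep their full-precision key/value vectors, while the rest are quantized to a low bit-width. The names, the keep ratio, and the 4-bit min-max quantizer are illustrative assumptions, not the paper's method.

import numpy as np

def quantize_minmax(v, bits=4):
    # Simple per-vector asymmetric min-max quantizer (illustrative only).
    lo, hi = v.min(), v.max()
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0:
        return v.copy()
    return np.round((v - lo) / scale) * scale + lo

def compress_kv(kv, importance, keep_ratio=0.25):
    # Keep the most "important" tokens in full precision, quantize the rest.
    n_keep = max(1, int(len(kv) * keep_ratio))
    keep = set(np.argsort(importance)[-n_keep:])
    return np.stack([kv[t] if t in keep else quantize_minmax(kv[t])
                     for t in range(len(kv))])

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 64))      # 128 cached tokens, head dim 64
importance = rng.random(128)               # stand-in for attention-based scores
compressed = compress_kv(keys, importance)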

Quantization

DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation

1 code implementation 27 Feb 2024 Sunghyeon Woo, Baeseong Park, Byeongwook Kim, Minjung Jo, Sejung Kwon, Dongsuk Jeon, Dongsoo Lee

In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs while maintaining accuracy.
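
Reading the title literally, DropBP skips the backward pass (but not the forward pass) for a subset of layers. Below is only a hedged PyTorch sketch of that reading for a residual stack; the drop probability, the use of detach(), and the per-step random choice are illustrative assumptions on our part rather than the paper's exact algorithm.

import torch
import torch.nn as nn

class DropBPBlock(nn.Module):
    def __init__(self, dim, p_drop=0.5):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.p_drop = p_drop          # probability of skipping this block's backward pass

    def forward(self, x):
        y = self.ff(x)
        if self.training and torch.rand(()).item() < self.p_drop:
            # Forward output is unchanged, but no gradient flows into this block;
            # the residual path still carries gradients to earlier layers.
            y = y.detach()
        return x + y

blocks = nn.Sequential(*[DropBPBlock(64) for _ in range(4)])
x = torch.randn(8, 64, requires_grad=True)
loss = blocks(x).pow(2).mean()
loss.backward()                       # backward cost is reduced on the dropped blocks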

Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models

1 code implementation 27 Sep 2023 Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

Weight-only quantization can be a promising approach, but sub-4-bit quantization remains a challenge due to large-magnitude activation outliers.
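
As a point of reference for the channel-dimension discussion, the snippet below contrasts per-output-channel and per-input-channel min-max INT4 quantization of a weight matrix in numpy; which dimension better isolates outliers is exactly the paper's question, and everything here (granularity, bit-width, synthetic data) is an illustrative assumption rather than the paper's procedure.

import numpy as np

def quantize_per_axis(W, axis, bits=4):
    # Asymmetric min-max quantization with one (min, max) range per slice along `axis`.
    lo = W.min(axis=axis, keepdims=True)
    hi = W.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / (2 ** bits - 1)
    return np.round((W - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
W[:, 3] *= 20.0                               # an "outlier" input channel

# Per-output-channel: every row's range is stretched by its outlier entry.
err_out_ch = np.abs(W - quantize_per_axis(W, axis=1)).mean()
# Per-input-channel: the outlier column gets its own scale and is isolated.
err_in_ch = np.abs(W - quantize_per_axis(W, axis=0)).mean()
print(err_out_ch, err_in_ch)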

Language Modelling Quantization

FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization

no code implementations 1 Jun 2023 Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, Dongsoo Lee

As PTQ schemes based on reconstructing each layer or block output have proven effective at enhancing quantized model performance, recent works have developed algorithms to devise and learn a new weight-rounding scheme that better reconstructs each layer or block output.
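
The title and excerpt suggest that the rounding grid is controlled by a learnable element-wise division rather than a fixed step size. A rough numpy sketch of that reading follows; the reconstruction form s * round(W / (s * S)) and the names s and S are our assumptions, and in the paper such division factors would be learned by reconstructing layer or block outputs rather than set by hand.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))

s = np.abs(W).max() / 7.0            # a shared step size (e.g., for an INT4-like grid)
S = np.ones_like(W)                  # learnable element-wise divisor, initialised to 1

def flexround_like(W, s, S):
    # Dividing element-wise before rounding lets each weight shift to a neighbouring
    # grid point when that better reconstructs the layer output.
    return s * np.round(W / (s * S))

W_hat = flexround_like(W, s, S)
print(np.abs(W - W_hat).mean())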

Image Classification Natural Language Understanding +2

AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

no code implementations 8 Oct 2022 Se Jung Kwon, Jeonghoon Kim, Jeongin Bae, Kang Min Yoo, Jin-Hwa Kim, Baeseong Park, Byeongwook Kim, Jung-Woo Ha, Nako Sung, Dongsoo Lee

To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, which consists of post-training quantization of the pre-trained language model and fine-tuning of only some parts of the quantized parameters for a target task.
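
One concrete way to read "fine-tuning only some parts of quantized parameters" is binary-coding quantization in which the binary codes stay frozen and only the scaling factors (the "alphas") are trained per task. The PyTorch sketch below follows that reading; it is an illustrative assumption based on the title and excerpt, not a reproduction of the paper's exact scheme, and the random codes here stand in for codes that would really be fitted to the pre-trained weights.

import torch
import torch.nn as nn

class BCQLinear(nn.Module):
    # Weight ~= sum_i alpha_i * B_i, with frozen binary codes B_i and trainable alphas.
    def __init__(self, in_f, out_f, n_bits=2):
        super().__init__()
        w = torch.randn(out_f, in_f)              # stand-in for a pre-trained weight
        # Frozen binary codes; a real setup would fit these to w, not draw them randomly.
        self.register_buffer("codes", torch.sign(torch.randn(n_bits, out_f, in_f)))
        self.alpha = nn.Parameter(torch.full((n_bits, out_f, 1),
                                             (w.abs().mean() / n_bits).item()))

    def forward(self, x):
        w = (self.alpha * self.codes).sum(dim=0)  # reconstruct the quantized weight
        return x @ w.t()

layer = BCQLinear(64, 64)
print([n for n, p in layer.named_parameters() if p.requires_grad])   # only ['alpha']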

Language Modelling Model Compression +1

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

no code implementations 22 Sep 2022 Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim

DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.

Language Modelling Text Generation

LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

no code implementations 20 Jun 2022 Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee

By reducing the latency of individual GPUs and the overall inference process for large-scale language models, LUT-GEMM provides significant performance improvements in inference.
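
To make the lookup-table idea concrete, here is a toy numpy sketch for a single binary weight plane: the dot products of every possible ±1 pattern with each activation sub-vector are precomputed once, and each output element then becomes a sum of table lookups instead of multiply-accumulates. The group size, shapes, and single-plane setup are illustrative assumptions; the actual LUT-GEMM kernel targets GPUs and multi-bit binary-coding quantization.

import numpy as np

g = 8                                            # activations per lookup group
x = np.random.randn(256)                         # activation vector
B = np.random.choice([-1, 1], size=(512, 256))   # one binary weight plane
alpha = 0.05                                     # its scaling factor

# Precompute the dot product of every {-1,+1} pattern of length g with each
# activation sub-vector; these tables are shared by all output rows.
patterns = np.array([[1 if (p >> i) & 1 else -1 for i in range(g)]
                     for p in range(2 ** g)])                    # (2^g, g)
luts = [patterns @ x[j:j + g] for j in range(0, x.size, g)]

y = np.zeros(B.shape[0])
for r in range(B.shape[0]):
    for t, j in enumerate(range(0, x.size, g)):
        bits = (B[r, j:j + g] > 0).astype(int)
        idx = int((bits << np.arange(g)).sum())                  # encode the ±1 pattern
        y[r] += luts[t][idx]
y *= alpha

assert np.allclose(y, alpha * (B @ x))           # matches the direct product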

Quantization Self-Supervised Learning

Maximum Likelihood Training of Implicit Nonlinear Diffusion Models

1 code implementation 27 May 2022 Dongjun Kim, Byeonghu Na, Se Jung Kwon, Dongsoo Lee, Wanmo Kang, Il-Chul Moon

While diverse variations of diffusion models exist, extending the linear diffusion into a nonlinear diffusion process has been investigated by very few works.

Image Generation

Maximum Likelihood Training of Parametrized Diffusion Model

no code implementations 29 Sep 2021 Dongjun Kim, Byeonghu Na, Se Jung Kwon, Dongsoo Lee, Wanmo Kang, Il-Chul Moon

Specifically, PDM utilizes a flow to non-linearly transform a data variable into a latent variable, and then applies the diffusion process with a linear diffusing mechanism to the transformed latent distribution.
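
Read schematically (our notation, inferred only from the sentence above rather than taken from the paper), the construction amounts to

z = f_\phi(x), \qquad \mathrm{d}z_t = -\tfrac{1}{2}\,\beta_t\, z_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t,

so the nonlinearity is carried entirely by the invertible flow f_\phi, while the diffusion acting on the latent z stays linear (writing the latent SDE in VP form is our assumption).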

Image Generation

Modulating Regularization Frequency for Efficient Compression-Aware Model Training

no code implementations 5 May 2021 Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Baeseong Park, Yongkweon Jeon

While model compression is increasingly important because of large neural network size, compression-aware training is challenging as it needs sophisticated model modifications and longer training time. In this paper, we introduce regularization frequency (i.e., how often compression is performed during training) as a new regularization technique for a practical and efficient compression-aware training method.
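
As a toy illustration of "how often compression is performed during training" as a knob, the numpy loop below applies a compression step (magnitude pruning, chosen only for illustration) every compress_every optimizer steps of a synthetic quadratic problem; all names, numbers, and the choice of compressor are ours, not the paper's.

import numpy as np

def magnitude_prune(w, sparsity=0.5):
    # Illustrative compressor: zero out the smallest-magnitude entries.
    thresh = np.sort(np.abs(w))[int(len(w) * sparsity)]
    return np.where(np.abs(w) >= thresh, w, 0.0)

rng = np.random.default_rng(0)
w_true = rng.standard_normal(100)
w = np.zeros(100)
compress_every = 25                    # the "regularization frequency" knob
for step in range(1, 501):
    grad = w - w_true                  # gradient of the toy loss 0.5 * ||w - w_true||^2
    w -= 0.1 * grad
    if step % compress_every == 0:
        w = magnitude_prune(w)         # occasional projection onto the compressed form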

Model Compression

Encoding Weights of Irregular Sparsity for Fixed-to-Fixed Model Compression

no code implementations ICLR 2022 Baeseong Park, Se Jung Kwon, Daehwan Oh, Byeongwook Kim, Dongsoo Lee

Then, as an effort to push the compression ratio to the theoretical maximum (by entropy), we propose a sequential fixed-to-fixed encoding scheme.

Model Compression

Q-Rater: Non-Convex Optimization for Post-Training Uniform Quantization

no code implementations 5 May 2021 Byeongwook Kim, Dongsoo Lee, Yeonju Ro, Yongkweon Jeon, Se Jung Kwon, Baeseong Park, Daehwan Oh

When the number of quantization bits is relatively low, however, non-convex optimization is unavoidable to improve model accuracy.

Quantization

Post-Training Weighted Quantization of Neural Networks for Language Models

no code implementations 1 Jan 2021 Se Jung Kwon, Dongsoo Lee, Yongkweon Jeon, Byeongwook Kim, Bae Seong Park, Yeonju Ro

As a practical model compression technique, parameter quantization is especially effective for language models associated with a large memory footprint.

Model Compression Quantization

A review of on-device fully neural end-to-end automatic speech recognition algorithms

no code implementations 14 Dec 2020 Chanwoo Kim, Dhananjaya Gowda, Dongsoo Lee, Jiyeon Kim, Ankur Kumar, Sungsoo Kim, Abhinav Garg, Changwoo Han

Conventional speech recognition systems comprise a large number of discrete components such as an acoustic model, a language model, a pronunciation model, a text-normalizer, an inverse-text normalizer, a decoder based on a Weighted Finite State Transducer (WFST), and so on.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

FleXOR: Trainable Fractional Quantization

no code implementations NeurIPS 2020 Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Yongkweon Jeon, Baeseong Park, Jeongin Yun

Quantization based on the binary codes is gaining attention because each quantized bit can be directly utilized for computations without dequantization using look-up tables.

Quantization

BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs

no code implementations 20 May 2020 Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee

The success of quantization in practice hence relies on an efficient computation engine design, especially for matrix multiplication, which is the basic computation kernel in most DNNs.

Quantization

Network Pruning for Low-Rank Binary Index

no code implementations 25 Sep 2019 Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Parichay Kapoor, Gu-Yeon Wei

In this paper, we propose a new network pruning technique that generates a low-rank binary index matrix to compress index data significantly.
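
The storage arithmetic behind a low-rank binary index can be shown in a few lines: if the full pruning mask is reconstructed from two small binary factors, only r*(m+n) bits need to be kept instead of m*n. The factorization below is random rather than learned, so it only illustrates the bookkeeping, not the paper's training procedure.

import numpy as np

m, n, r = 512, 512, 16
rng = np.random.default_rng(0)
B1 = rng.integers(0, 2, size=(m, r))         # small binary factor (stored)
B2 = rng.integers(0, 2, size=(r, n))         # small binary factor (stored)
mask = (B1 @ B2 >= 1).astype(np.uint8)       # full binary index, reconstructed on the fly

print("dense index bits   :", m * n)         # 262,144 bits
print("low-rank index bits:", r * (m + n))   # 16,384 bits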

Model Compression Network Pruning +1

Decoupling Weight Regularization from Batch Size for Model Compression

no code implementations 25 Sep 2019 Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Yongkweon Jeon, Baeseong Park, Jeongin Yun, Gu-Yeon Wei

Using various models, we show that simple weight updates to comply with compression formats, together with a long NR period, are enough to achieve a high compression ratio and high model accuracy.

Model Compression

Structured Compression by Weight Encryption for Unstructured Pruning and Quantization

no code implementations CVPR 2020 Se Jung Kwon, Dongsoo Lee, Byeongwook Kim, Parichay Kapoor, Baeseong Park, Gu-Yeon Wei

Model compression techniques, such as pruning and quantization, are becoming increasingly important to reduce the memory footprints and the amount of computations.

Model Compression Quantization

Learning Low-Rank Approximation for CNNs

no code implementations 24 May 2019 Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Gu-Yeon Wei

Low-rank approximation is an effective model compression technique that not only reduces parameter storage requirements but also reduces computations.

Model Compression

Network Pruning for Low-Rank Binary Indexing

no code implementations 14 May 2019 Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Parichay Kapoor, Gu-Yeon Wei

Pruning is an efficient model compression technique to remove redundancy in the connectivity of deep neural networks (DNNs).

Model Compression Network Pruning

DeepTwist: Learning Model Compression via Occasional Weight Distortion

no code implementations 30 Oct 2018 Dongsoo Lee, Parichay Kapoor, Byeongwook Kim

Model compression has been introduced to reduce the required hardware resources while maintaining the model accuracy.

Model Compression Quantization

Computation-Efficient Quantization Method for Deep Neural Networks

no code implementations 27 Sep 2018 Parichay Kapoor, Dongsoo Lee, Byeongwook Kim, Saehyung Lee

We present a non-intrusive quantization technique based on re-training the full precision model, followed by directly optimizing the corresponding binary model.

Quantization

Retraining-Based Iterative Weight Quantization for Deep Neural Networks

no code implementations 29 May 2018 Dongsoo Lee, Byeongwook Kim

We show that iterative retraining generates new sets of weights which can be quantized with decreasing quantization loss at each iteration.

Model Compression Quantization

Viterbi-based Pruning for Sparse Matrix with Fixed and High Index Compression Ratio

no code implementations ICLR 2018 Dongsoo Lee, Daehyun Ahn, Taesu Kim, Pierce I. Chuang, Jae-Joon Kim

Hence, pruning is usually restricted to inference with a batch size of one, for which an efficient parallel matrix-vector multiplication method exists.
