no code implementations • 13 Feb 2025 • Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang
To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm.
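As a rough illustration of hierarchical token pruning in general (a toy sketch, not the InfiniteHiP implementation; the function name, block size, and max-similarity scoring heuristic are illustrative assumptions), the idea is to score coarse blocks of context tokens against the current query and keep only the top-scoring blocks at each round:

```python
# Toy sketch of hierarchical token pruning: group context tokens into blocks,
# score each block against the current query, and keep only the best blocks
# for the next, finer-grained round. Not the actual InfiniteHiP algorithm.
import torch

def hierarchical_prune(keys: torch.Tensor, query: torch.Tensor,
                       block_size: int = 64, keep_ratio: float = 0.25,
                       levels: int = 2) -> torch.Tensor:
    """Return indices of context tokens that survive `levels` pruning rounds.

    keys:  (T, d) key vectors for the context
    query: (d,)   query vector for the current decoding step
    """
    idx = torch.arange(keys.shape[0])
    for _ in range(levels):
        if idx.numel() <= block_size:
            break
        # Split surviving tokens into blocks and score each block by the
        # maximum query-key similarity of its members (a cheap relevance proxy).
        blocks = idx.split(block_size)
        scores = torch.stack([(keys[b] @ query).max() for b in blocks])
        n_keep = max(1, int(len(blocks) * keep_ratio))
        top = scores.topk(n_keep).indices
        idx = torch.cat([blocks[i] for i in top.tolist()])
    return idx
```

Pruning blocks rather than individual tokens is what keeps the selection cost far below a full attention pass over the entire context.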
1 code implementation • 24 Jun 2024 • Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang
The transformer's context window is vital for tasks such as few-shot learning and conditional generation as it preserves previous tokens for active memory.
no code implementations • 14 Jun 2024 • Heejun Lee, Geon Park, Youngwan Lee, Jaduk Suh, Jina Kim, Wonyoung Jeong, Bumsik Kim, Hyemin Lee, Myeongjae Jeon, Sung Ju Hwang
In addition to improving the time complexity of the attention mechanism, we further optimize GPU memory usage by implementing KV cache offloading, which stores only $O(\log T)$ tokens on the GPU while maintaining similar decoding throughput.
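A minimal sketch of what KV cache offloading can look like, assuming a simple CPU-resident cache with a small GPU-resident working set; the class and method names are hypothetical and this is not the authors' released code:

```python
# Illustrative KV cache offloading: the full cache lives in host memory and
# only a small "hot" set of entries is moved to the GPU for each attention
# step. Hypothetical sketch, not the paper's implementation.
import torch

class OffloadedKVCache:
    def __init__(self, hot_budget: int, device: str = "cuda"):
        self.hot_budget = hot_budget      # e.g. an O(log T)-sized token budget on the GPU
        self.device = device
        self.cpu_k, self.cpu_v = [], []   # full cache on the host (pinned memory in practice)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # New entries are stored on the CPU; nothing is kept on the GPU by default.
        self.cpu_k.append(k.cpu())
        self.cpu_v.append(v.cpu())

    def hot_view(self, token_indices: torch.Tensor):
        # Copy only the selected (most relevant) entries to the GPU for attention.
        sel = token_indices[: self.hot_budget]
        k = torch.stack([self.cpu_k[i] for i in sel.tolist()]).to(self.device)
        v = torch.stack([self.cpu_v[i] for i in sel.tolist()]).to(self.device)
        return k, v
```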
1 code implementation • 3 Oct 2023 • Heejun Lee, Jina Kim, Jeffrey Willette, Sung Ju Hwang
SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then creates a sparse attention matrix with a top-k selection to perform a sparse attention operation.
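A hedged sketch of this two-stage scheme, under simplifying assumptions: the score estimate is formed densely here purely for clarity, whereas SEA's point is to avoid that quadratic cost, and the feature map and function name are illustrative rather than taken from the released code.

```python
# Two-stage sketch: (1) a kernel feature map gives a cheap estimate of the
# attention scores, (2) a top-k mask built from that estimate selects which
# entries receive exact sparse attention. Illustrative only.
import torch

def sea_style_attention(q, k, v, top_k: int = 32):
    """q, k, v: (T, d). Returns a (T, d) attention output."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0     # simple positive feature map
    # Stage 1: linear-attention-style estimate of the attention scores
    # (formed densely here for readability).
    est = phi(q) @ phi(k).T                               # (T, T) estimated scores
    # Stage 2: keep only the top-k entries per query, fill in the exact
    # scaled dot-product scores at those positions, and renormalize.
    topi = est.topk(top_k, dim=-1).indices
    exact = (q @ k.T / q.shape[-1] ** 0.5).gather(-1, topi)
    mask = torch.full_like(est, float("-inf"))
    mask.scatter_(-1, topi, exact)
    probs = torch.softmax(mask, dim=-1)                   # sparse attention weights
    return probs @ v
```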
1 code implementation • COLING 2022 • Jean Lee, Taejun Lim, Heejun Lee, Bogeun Jo, Yangsok Kim, HeeGeun Yoon, Soyeon Caren Han
Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited.
no code implementations • 1 Mar 2022 • Javier Hidalgo, Heejun Lee, Jungyoon Lee, Myung Hwan Seo
We derive a risk lower bound in estimating the threshold parameter without knowing whether the threshold regression model is continuous or not.
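For readers unfamiliar with the setting, a standard two-regime threshold regression can be written as below; this is the textbook formulation, not necessarily the paper's exact specification.

```latex
% Standard two-regime threshold regression with threshold parameter \gamma
% (textbook notation, stated here only to fix terminology).
y_i = x_i^{\top}\beta_1\,\mathbf{1}\{q_i \le \gamma\}
    + x_i^{\top}\beta_2\,\mathbf{1}\{q_i > \gamma\} + \varepsilon_i
```

The model is called continuous when the two regression functions coincide at the threshold and discontinuous otherwise; the result above concerns estimating $\gamma$ without knowing which of these two cases holds.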